P3 features
===========
While the pipeline is generic enough to support arbitrary numbers of arbitrary
features, here we describe the actual features used. This section can be used
as a starting point for creating custom sets of features for other types of
data.

Each set of features requires a different set of pre-processing steps. To
accommodate this, and to isolate the customization into discrete,
easily-editable components, features are handled by separate workflows. In
some cases, the workflows depend on each other. Each workflow is expected to
create the output files defined in the ``config.yaml`` file.

Below, each sub-workflow is shown separately as a directed acyclic graph (DAG)
of the tasks performed, followed by the first few lines of its output file (or
files). To illustrate interdependencies among workflows, sub-workflows have
been color coded.

.. _all:

``Snakefile``

.. image:: images/all_dag.png
    :target: images/all_dag.png

.. _normcounts:

``norm_counts.snakefile``
-------------------------
This workflow starts with `htseq-count` files, one for each sample. They have
the following format:

.. literalinclude:: ../../example_data/raw/rnaseq_expression/LineA_1_counts.tsv
    :lines: 1-5

The workflow is the following:

.. image:: images/normed_counts_dag.png

The output file of ``norm_counts.snakefile`` is a table of gene-level
quantile-normalized counts per million (CPM), with one row per gene and one
column per sample:

.. literalinclude:: ../../example_data/cleaned/rnaseq_expression/counts_matrix_normalized.csv
    :lines: 1-5

.. _zscores:

``zscores.snakefile``
---------------------
This workflow takes the output of :ref:`normcounts` and converts counts into
zscores. In the absence of matched controls, these zscores are then used as
a proxy for direction and magnitude of differential expression.

.. image:: images/zscores_dag.png

Output looks like this:

.. literalinclude:: ../../example_data/cleaned/rnaseq_expression/zscores.csv
    :lines: 1-5

.. _variants:

``variants.snakefile``
----------------------

.. todo::

    Need more methods here about how the original data files were processed.

The starting point for the exome variant data is a set of tab-delimited files,
one for each sample, generated by running SnpEff on VCF files, performing
filtering steps to exclude likely germline mutations, and excluding the VCF
header:

.. literalinclude:: ../../example_data/raw/exome_variants/LineA_1_exome_variants.txt
    :lines: 1-5

SnpEff reports effects on a per-transcript basis. For integration with other
features that consider genes (i.e., the pathway features), after collecting
samples into a `transcripts x sample` matrix, we aggregate into a `gene
x sample` matrix by summing variants across all transcripts of a gene.

.. image:: images/exome_variants_dag.png

The `gene x sample` matrix output looks like this, where values indicate the
total number of variants across a gene in a sample. Specifically, since the
variants in these files have been pre-filtered to only contain those with high
impact, the value for each gene represents the total number of rows in the VCF
file annotated with transcript IDs (`EFF[*].TRID` column) belonging to that
gene.

.. literalinclude:: ../../example_data/cleaned/exome_variants/exome_variants_by_gene.tab
    :lines: 1-5

.. _cnv:

``cnv.snakefile``
-----------------
The copy number variation (CNV) data starts as one file in SEG format for each
sample. For example:

.. literalinclude:: ../../example_data/raw/cnv/LineA_1_cnv.seg
    :lines: 1-5

.. todo::

    Need methods on how the SEG files were created

.. image:: images/cnv_dag.png

Since each sample may have a different set of CNVs, the total number of unique
CNVs across all samples must be determined. The `multiinter` program from the
`BEDTools` suite is used to identify a uniform set of segments that can be used
across all samples. This uniform set of segments is then intersected with the
actual segments on a per-sample basis to obtain per-sample CNV values for each
segment. Files across samples are then aggregated into a single "cluster
matrix" file. This diagram shows how cluster scores are calculated for
a hypothetical set of 3 samples:

.. image:: images/cluster-scores-diagram.png

A separate set of scores is calculated at the gene level. A score for each gene
can be calculated in several ways. The following diagram shows two ways: the
largest magnitude CNV that overlaps the gene ("max"), or the score of the
longest segment that overlaps the gene ("longest"):

.. image:: images/gene-scores-diagram.png

The final cluster scores output file looks like the following:

.. literalinclude:: ../../example_data/cleaned/cnv/cluster_scores.tab
    :lines: 1-5

And the "max" gene scores:

.. literalinclude:: ../../example_data/cleaned/cnv/cnv_gene_max_scores.tab
    :lines: 1-5

And the "longest" gene scores:

.. literalinclude:: ../../example_data/cleaned/cnv/cnv_gene_longest_overlap_scores.tab
    :lines: 1-5

Pathways
--------
Several annotation databases are used. These databases have annotations for
each gene in a `gene x annotation` file. However, the features need to be in
an `annotation x samples` file for use with regression methods. Therefore these
pathway workflows use different strategies to calculate a pathway score for
each sample, based on some property or properties of the genes for that pathway
in that sample. See :ref:`pathway_scores` for more details.

In the pathway workflows below, there are several "flavors" of scores, each of
which is derived from workflows described above. For example, there are scores
derived from zscores, variants, and CNV data.

.. _cpdb:

Consensus pathway database
~~~~~~~~~~~~~~~~~~~~~~~~~~
Scores for each pathway are calculated based on the output of :ref:`variants`
and :ref:`zscores`.

.. image:: images/cpdb_dag.png

Variants output file:

.. literalinclude:: ../../example_data/cleaned/consensus_pathway/cpdb_variants.csv
    :lines: 1-5

Zscores output file:

.. literalinclude:: ../../example_data/cleaned/consensus_pathway/cpdb_zscores.csv
    :lines: 1-5

.. _go:

Gene ontology
~~~~~~~~~~~~~
Scores for each GO term are calculated based on the output of :ref:`variants`
and :ref:`zscores`.

.. image:: images/go_dag.png

Variants output file:

.. literalinclude:: ../../example_data/cleaned/go/go_variants.csv
    :lines: 1-5

Zscores output file:

.. literalinclude:: ../../example_data/cleaned/go/go_zscores.csv
    :lines: 1-5

.. _msigdb:

MSIG database
~~~~~~~~~~~~~
Scores for each pathway are calculated based on the output of :ref:`variants`
and :ref:`zscores`.

.. image:: images/msigdb_dag.png

Variants output file:

.. literalinclude:: ../../example_data/cleaned/msigdb/msigdb_variants.csv
    :lines: 1-5

Zscores output file:

.. literalinclude:: ../../example_data/cleaned/msigdb/msigdb_zscores.csv
    :lines: 1-5
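
For readers building their own pathway features, the conversion from
a `gene x sample` table to an `annotation x sample` table described under
Pathways can be sketched roughly as follows. This is only an illustration, not
the pipeline's implementation: the function name ``pathway_scores``, the toy
gene and pathway names, the values, and the use of the mean as the aggregation
are all assumptions. The strategies actually used are described in
:ref:`pathway_scores`.

```python
from statistics import mean


def pathway_scores(gene_values, pathway_genes):
    """Collapse a gene x sample table into a pathway x sample table.

    gene_values:   {gene: {sample: value}}, e.g. per-gene zscores
    pathway_genes: {pathway: [gene, ...]}, from an annotation database

    Hypothetical helper. The mean is an assumed aggregation, chosen only
    to keep the example small.
    """
    # Collect every sample name that appears in the gene-level table.
    samples = sorted({s for row in gene_values.values() for s in row})
    scores = {}
    for pathway, genes in pathway_genes.items():
        # Only genes that were actually measured contribute to the score.
        present = [g for g in genes if g in gene_values]
        if not present:
            continue  # no measured genes, so no score for this pathway
        scores[pathway] = {
            s: mean(gene_values[g][s] for g in present) for s in samples
        }
    return scores


# Toy data, invented for illustration (not taken from example_data).
zscores = {
    "geneA": {"LineA_1": 1.0, "LineB_1": -1.0},
    "geneB": {"LineA_1": 3.0, "LineB_1": 0.0},
    "geneC": {"LineA_1": -2.0, "LineB_1": 2.0},
}
pathways = {
    "pathway1": ["geneA", "geneB"],
    "pathway2": ["geneC", "geneX"],  # geneX has no measurements
    "pathway3": ["geneY"],           # no measured genes at all
}

scores = pathway_scores(zscores, pathways)
for name, row in scores.items():
    print(name, row)
```

The same shape of function would apply to the variants or CNV gene-level
tables by swapping the input table and, where appropriate, the aggregation
(for example a sum of variant counts instead of a mean).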