Configuration

Most configuration for P3 is performed by editing config.yaml, a text file using YAML syntax.

See the Example config.yaml section below for a complete example showing how the parts fit together. First, though, we describe each part of the config file and what it does.

Config sections

prefix

Example:

prefix: example_data

This entry sets the directory prefix that can be used in any output filename (output filenames are described below). For example, in the Example config.yaml, the output filenames for the features start with the {prefix} placeholder.

It is convenient to set this to a directory containing a subset of the data (but with the same filenames as real data) while developing the pipeline, and then set it to the full data set once everything is working.
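
For example, with prefix: example_data, a feature output template expands as follows (conceptually the same substitution Python's str.format performs):

'{prefix}/filtered/cnv/cluster_scores.tab'.format(prefix='example_data')
# -> 'example_data/filtered/cnv/cluster_scores.tab'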

features_to_use

Example:

features_to_use: [cnv, exome_variants]

This entry selects the feature sets to use for a particular run. The items in the list must correspond to feature sets defined in the features section. Note that if one feature set depends on another, they must both be included in this list.
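
For example, if a hypothetical msigdb feature set were built from the exome_variants outputs, a run using it would need features_to_use: [exome_variants, msigdb].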

samples

Example:

samples: celllines.txt

This entry specifies a filename containing sample IDs to include in the analysis. It is a text file with one sample name per line.
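
For example, celllines.txt might contain (these sample IDs are invented):

CL001
CL002
CL003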

The final set of features will be subsetted to only include data from the sample IDs specified in this file.

This file is parsed into a list of sample IDs within the Snakefile workflow such that each child workflow can access the list of samples. This is especially convenient when sample IDs are encoded in filenames and you want to grab all files for all samples.
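
The pipeline's actual parsing code may differ, but a minimal sketch of the idea inside a Snakefile looks like this (the per-sample FASTQ path in the comment is a made-up example):

configfile: 'config.yaml'

# Read one sample ID per line, skipping blank lines.
with open(config['samples']) as f:
    samples = [line.strip() for line in f if line.strip()]

# Child workflows can then collect per-sample files, e.g.:
# expand('{prefix}/raw/{sample}.fastq', prefix=config['prefix'], sample=samples)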

Rscript

Example:

Rscript: /usr/bin/Rscript

Sets the path to Rscript. Within Snakemake rules, an R script is then called using the {Rscript} placeholder, which will be filled in with the path defined here:

rule example_r:
    input: '{prefix}/input_data.txt'
    output: '{prefix}/output_data.txt'
    shell:
        "{Rscript} myscript.R {input} {output}"

features

Example:

features:
    cnv:
        snakefile: features/cnv.snakefile
        output:
            clusters: "{prefix}/filtered/cnv/cluster_scores.tab"
            max_gene: "{prefix}/filtered/cnv/cnv_gene_max_scores.tab"
            longest_gene: "{prefix}/filtered/cnv/cnv_gene_longest_overlap_scores.tab"

    exome_variants:
        snakefile: features/variants.snakefile
        output:
            by_gene: "{prefix}/filtered/exome_variants/exome_variants_by_gene.tab"

This is where new feature sets are defined. In this example, there are two feature sets with the labels cnv and exome_variants. Under each label are two fields: snakefile and output.

snakefile: This field specifies the path, relative to the config.yaml file, to the Snakemake workflow that creates these features. This snakefile can do whatever it needs to do in order to create the output file(s).

output: This field is a dictionary with at least one output file. The combination of feature label and output label must be unique (e.g., (exome_variants, by_gene) is unique). These output files can use the {prefix} placeholder, which will be filled in with the prefix field. These files are expected to be created by the snakefile specified for this feature set. In this example, features/cnv.snakefile is expected to create the three output files listed, and features/variants.snakefile is expected to create one output file.

Note

Most of the effort in adding a new feature set is in writing the actual snakefile that does the work. See Pipeline design for more on this.
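
As a concrete illustration, a feature snakefile's only obligation is to create the output files declared for it in config.yaml. A hypothetical fragment of features/cnv.snakefile might look like this (the input path and R script name are invented; only the output path comes from the example above):

rule cnv_cluster_scores:
    input: '{prefix}/raw/cnv_calls.tab'
    output: '{prefix}/filtered/cnv/cluster_scores.tab'
    shell:
        "{Rscript} features/cnv_cluster_scores.R {input} {output}"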

run_info

Example:

run_info:
    run_1:
        feature_filter: "filterfuncs.run1"
        sample_list: "celllines.txt"
        response_list: "SIDs.txt"
        response_column: "DATA0"
        response_template: "{prefix}/processed/drug_response/{sample}_drugResponse.tab"
        SL_library_file: "tools/default_SL_library.R"

run_info is a dictionary that defines multiple runs. It is intended as the entry point for configuring and tweaking filtering and learning parameters. Each entry (here, there is a single run labeled run_1) defines a unique set of feature filtering, response filtering, and learning parameters. Each run has the following keys:

feature_filter: This is a dotted-notation specification of the Python filter function to use. See Filtering features and responses for more details. In this example, we need a function called run1 in a Python module called filterfuncs.py.

sample_list: A file containing one sample ID per line. You can control which samples make it into the training by modifying this file.

response_list: A file containing one response ID (i.e., drug SID) per line. It can be used to determine which drugs make it into the training.

response_template: There may be many options for which response data values to use, but only one can be used for training. This template defines which file to use. It should start with {prefix} and contain the {sample} placeholder. Note that to figure out the sample, the underlying code splits the basename of the filename on _drug and takes the first part, so the response pre-processing should output files of the form {sample}_drugResponse.tab (see the sketch below).

response_column: Each response file specified by response_template may have several variables. Specify the name of the column to use here.

SL_library_file: An R script that defines a library for SuperLearner. The file is sourced immediately before SuperLearner() is called, and upon being sourced it must define a variable SL.library, which is passed to the SuperLearner() function. The simplest case is a character vector of algorithms to use. In the example above, the file tools/default_SL_library.R contains the single line SL.library <- c("SL.randomForest", "SL.glmnet", "SL.mean"). However, it is also possible to write custom wrappers in this script for SuperLearner to use. See the SuperLearner docs for details.
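
The sample-ID convention used by response_template can be illustrated with a couple of lines of Python (the filename below is made up):

import os

# The underlying code splits the response file's basename on "_drug"
# and takes the first part as the sample ID.
fname = '/data/example/processed/drug_response/CL001_drugResponse.tab'
sample = os.path.basename(fname).split('_drug')[0]
# sample == 'CL001'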

Example config.yaml

This is the config file used to run the example data.

# Top-level data dir
prefix: /data/P3Pool/renamed


# The value of "programs" should point to a YAML-formatted file that specifies
# preludes and paths to each program used throughout the pipeline.
#
# This mechanism allows us to support different systems where executables may
# not be on the default path. Furthermore, if a GNU Module needs to be loaded,
# it is done in the prelude. For example, on the cluster imagine we have to
# first load the module for bedtools before calling it, but once the module is
# loaded it is available on the path. In this case, the programs config file
# would have an entry like this:
#
# bedtools:
#   prelude: "module load bedtools"
#   path: "bedtools"
#
# If it's already on the default path, `prelude` can be empty. You can be as
# specific as you need to in the `path` entry.
#
# In any Snakemake rule using bedtools, use the following placeholders to fill
# in the configured prelude and path as follows:
#
# {programs.bedtools.prelude}
# {programs.bedtools.path} intersect -a a.bed -b b.bed
programs: programs.yaml

# Each set of features:
#   - has a unique name that can be used in the "features_to_use" list
#
#   - has one or more output files, which must be of the general format:
#       - rows = features
#       - columns = samples
#
#   - has a snakefile responsible for creating the output file.
#       - paths in these snakefiles should be relative to the runall.snakefile
#         since they are included verbatim.
#       - interdependencies between feature snakefiles can be handled by using
#         the snakemake directive "include:"
features:
    cnv:
        snakefile: features/cnv.snakefile
        output:
            clusters: "{prefix}/cleaned/cnv/cluster_scores.tab"
            max_gene: "{prefix}/cleaned/cnv/cnv_gene_max_scores.tab"
            longest_gene: "{prefix}/cleaned/cnv/cnv_gene_longest_overlap_scores.tab"

    exome_variants:
        snakefile: features/variants.snakefile
        output:
            by_gene: "{prefix}/cleaned/exome_variants/exome_variants_by_gene.tab"

    msigdb:
        snakefile: features/msigdb.snakefile
        output:
            zscores: "{prefix}/cleaned/msigdb/msigdb_zscores.csv"
            variants: "{prefix}/cleaned/msigdb/msigdb_variants.csv"
    go:
        snakefile: features/gene_ontology.snakefile
        output:
            zscores: "{prefix}/cleaned/go/go_zscores.csv"
            variants: "{prefix}/cleaned/go/go_variants.csv"

    cpdb:
        snakefile: features/cpdb.snakefile
        output:
            zscores: "{prefix}/cleaned/consensus_pathway/cpdb_zscores.csv"
            variants: "{prefix}/cleaned/consensus_pathway/cpdb_variants.csv"

    normed_counts:
        snakefile: features/normed_counts.snakefile
        output:
            normed_counts: "{prefix}/cleaned/rnaseq_expression/counts_matrix_normalized.csv"

    zscores:
        snakefile: features/zscores.snakefile
        output:
            zscores: "{prefix}/cleaned/rnaseq_expression/zscores.csv"
            #zscore_estimates: "{prefix}/cleaned/rnaseq_expression/zscore_estimates.csv"

run_info:
    # Each run defines a unique combination of feature filtering, response data,
    # and learning parameters.

    run_1:

        # The `run1` function, found in the Python module `filterfuncs.py`,
        # will be called on each feature set in this run.
        feature_filter: "filterfuncs.run1"

        # One model will be trained for each response listed in the `response_list`
        # file.
        response_list: "P3_SIDs.txt"

        # Specify the samples to use. This file can be used to globally filter
        # out a particular sample for this run.
        sample_list: "P3_celllines.txt"

        # The "process_response" rule creates several output files for each
        # sample. The `response_template` specifies which file to use, and must
        # include a {sample} placeholder in the filename. `response_column`
        # specifies which column in that file to use for the response data.
        response_template: "{prefix}/processed/drug_response/{sample}_drugDrc.tab"
        response_column: "iLAC50"

        # R script defining the "SL.library" to use for SuperLearner.
        SL_library_file: "tools/default_SL_library.R"