Filtering features and responses
Depending on the learning algorithm chosen, it can be important to filter features. Even with random forests or penalized regression, which are less sensitive to uninformative inputs, training is more efficient after removing uninformative features that have the same value across all samples (i.e., features with zero variance).
It is expected that some experimentation will be needed to decide on the optimal feature set. The config.yaml file therefore provides a mechanism for specifying different runs. One run consists of a unique combination of the following:
- filtering features
- filtering responses
- a set of samples
- a set of responses
- learning parameters
For example, one run might use a basic zero variance filter across all features, while a second run might try using a more stringent filter for variant data. A third run could tweak which samples make it into the model.
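For instance, those three runs might be sketched in config.yaml like this (the filter function names and the sample list filename are hypothetical; only the run_info, feature_filter, and sample_list keys come from the config structure described elsewhere in this section):

run_info:
  run_1:
    feature_filter: filterfuncs.zero_variance       # hypothetical function
  run_2:
    feature_filter: filterfuncs.stringent_variants  # hypothetical function
  run_3:
    feature_filter: filterfuncs.zero_variance       # hypothetical function
    sample_list: samples_subset.txt                 # hypothetical file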
In practice, feature filtering is performed by writing a custom Python function. That function must accept three arguments: the input file of cleaned features, the feature label, and the output label. It must return a pandas.DataFrame object. To illustrate, let's assume we have GO term features with z-score and variant outputs, as in the following example config.yaml:
prefix: "/data/"
features:
  go:
    snakefile: features/gene_ontology.snakefile
    output:
      zscores: "{prefix}/cleaned/go/go_zscores.csv"
      variants: "{prefix}/cleaned/go/go_variants.csv"
run_info:
  run_1:
    feature_filter: filterfuncs.run1
    response_filter: filterfuncs.response1
Since we defined feature_filter to be filterfuncs.run1, we need to create a function called run1 in the file filterfuncs.py:
import pandas

def run1(infile, features_label, output_label):
    # read the input file
    d = pandas.read_table(infile, index_col=0)

    # The example config above only has one set of features, "go", so we
    # don't really have to check `features_label`...but this shows how it
    # would be done with more complex setups.
    if features_label == 'go' and output_label == 'zscores':
        # remove features with zero variance across samples
        nonzero_var = d.var(axis=1) > 0
        d = d[nonzero_var]

    elif features_label == 'go' and output_label == 'variants':
        # only keep features where >10% (but <90%) of samples have
        # variant data
        nfrac = 0.1
        n = d.shape[1]
        n_nonzero = (d != 0).sum(axis=1).astype(float)
        too_low = (n_nonzero / n) <= nfrac
        too_high = (n_nonzero / n) >= (1 - nfrac)
        d = d[~(too_low | too_high)]

    # regardless of how we filtered, also get rid of rows with NA.
    return d.dropna()
Over the course of the workflow, this function will be called once for each output file defined in each feature set. In the above config there is one feature set, go, which has two expected output files, /data/cleaned/go/go_zscores.csv and /data/cleaned/go/go_variants.csv. So this function will be called twice during the filtering stage of the pipeline.
The pipeline will save the resulting files in a run-specific directory, with the path built from the feature and output labels. So the pipeline will run the following:
run1("/data/cleaned/go/go_zscores.csv", "go", "zscores")
# output saved to /data/runs/run_1/filtered/go/zscores_filtered.tab
run1("/data/cleaned/go/go_variants.csv", "go", "variants")
# output saved to /data/runs/run_1/filtered/go/variants_filtered.tab
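The config above also references a response_filter, filterfuncs.response1. Assuming response filters follow the same pattern as feature filters, taking a cleaned input file and returning a filtered pandas.DataFrame (an assumption; the exact signature may differ), a minimal sketch might look like:

import pandas

def response1(infile):
    # ASSUMPTION: mirrors the feature-filter pattern; check the pipeline
    # for the actual signature expected by response_filter.
    d = pandas.read_table(infile, index_col=0)
    # e.g. drop responses (drugs) with no measurements in any sample
    return d.dropna(how='all')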
Filtering samples and responses
Filter which samples should be included in the model by editing the file referred to in the sample_list config value in the run_info section.
Filter which responses (i.e. drugs) should be included in the model by editing the file referred to in the response_list value in the run_info section.
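For example, run_1 from the config above might point at these files like so (the filenames are placeholders):

run_info:
  run_1:
    feature_filter: filterfuncs.run1
    response_filter: filterfuncs.response1
    sample_list: sample_list.txt       # placeholder filename
    response_list: response_list.txt   # placeholder filename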
In contrast, features are filtered using the custom functions referred to in the feature_filter config value.
The reason for this is that the pipeline will ultimately create one model for each drug. Due to the way Snakemake works, we need to know in advance which drugs will be used so that we can tell Snakemake which files should be created.
In practice, deciding which drugs to include may involve some data analysis outside of the pipeline. For example, a drug that has no effect on any sample is uninteresting; some pre-processing of the data can identify such drugs so they can be removed from the response_list.
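A minimal sketch of that kind of pre-processing, assuming the cleaned response matrix has drugs as rows and samples as columns and that "no effect" means an all-zero row (both assumptions; adapt to your data):

import pandas

# ASSUMPTIONS: drugs are rows, samples are columns, and a drug with an
# all-zero row has no effect in any sample. The filenames are placeholders.
responses = pandas.read_table("responses.tab", index_col=0)
keep = responses.index[(responses != 0).any(axis=1)]

# write one drug per line to the file referenced by response_list
with open("response_list.txt", "w") as fh:
    fh.write("\n".join(keep) + "\n")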
In contrast, no output file is created for each individual feature, so we don't need to tell Snakemake those filenames in advance. This allows feature filtering to be performed from within the pipeline. In addition, there are generally far more features than responses, so a hypothetical feature_list mechanism would be awkward and difficult to maintain.