.. _filtering:

Filtering features and responses
================================

Depending on the learning algorithm chosen, it can be important to filter
features. Even when using random forests or penalized regression, which are
not as sensitive to the input features, training will be more efficient after
removing uninformative features that have the same value across all samples
(i.e., features with zero variance).

Some experimentation is usually needed to decide on the optimal feature set.
Therefore, the ``config.yaml`` file provides a mechanism for specifying
different **runs**. One run consists of a unique set of the following
operations:

- filtering features
- filtering responses
- a set of samples
- a set of responses
- learning parameters

For example, one run might use a basic zero-variance filter across all
features, while a second run might try a more stringent filter for variant
data. A third run could tweak which samples make it into the model.

In practice, feature filtering is performed by writing a custom Python
function. That function must accept 3 arguments: the input file of cleaned
features, the feature label, and the output label. It must return a
pandas.DataFrame object.

To illustrate, let's assume we have features for GO terms and variants in the
following example ``config.yaml``::

    prefix: "/data"
    features:
      go:
        snakefile: features/gene_ontology.snakefile
        output:
          zscores: "{prefix}/cleaned/go/go_zscores.csv"
          variants: "{prefix}/cleaned/go/go_variants.csv"
    run_info:
      run_1:
        feature_filter: filterfuncs.run1
        response_filter: filterfuncs.response1

Since we defined the `feature_filter` to be `filterfuncs.run1`, we need to
create a filter function in the file ``filterfuncs.py`` called `run1`:

.. code-block:: python

    import pandas

    def run1(infile, features_label, output_label):
        # read the input file
        d = pandas.read_table(infile, index_col=0)

        # The example config above only has one set of features, "go", so we
        # don't really have to check `features_label`...but this shows how it
        # would be done with more complex setups.

        # only keep features with nonzero variance across samples
        if features_label == 'go' and output_label == 'zscores':
            nonzero_var = d.var(axis=1) > 0
            d = d[nonzero_var]

        # only keep features where >10% of samples (but not nearly all of
        # them) have variant data
        elif features_label == 'go' and output_label == 'variants':
            nfrac = 0.1
            n = d.shape[1]
            n_nonzero = (d != 0).sum(axis=1).astype(float)
            too_low = (n_nonzero / n) <= nfrac
            too_high = (n_nonzero / n) >= (1 - nfrac)
            d = d[~(too_low | too_high)]

        # regardless of how we filtered, also get rid of rows with NA.
        return d.dropna()

Over the course of the workflow, this function will be called once for each
output file defined in each feature set. In the above config there is one
feature set, ``go``, which has two expected output files,
`/data/cleaned/go/go_zscores.csv` and `/data/cleaned/go/go_variants.csv`. So
this function will be called twice during the filtering stage of the
pipeline. The pipeline saves the resulting files in a run-specific directory,
named after the feature and output label. So the pipeline will run the
following:

.. code-block:: python

    run1("/data/cleaned/go/go_zscores.csv", "go", "zscores")
    # output saved to /data/runs/run_1/filtered/go/zscores_filtered.tab

    run1("/data/cleaned/go/go_variants.csv", "go", "variants")
    # output saved to /data/runs/run_1/filtered/go/variants_filtered.tab

Filtering samples and responses
-------------------------------

Filter which samples should be included in the model by editing the file
referred to in the `sample_list` config value in the `run_info` section.
Filter which responses (i.e., drugs) should be included in the model by
editing the file referred to in the `response_list` value in the `run_info`
section (an example configuration appears at the end of this section). In
contrast, features are filtered using the custom functions referred to in the
`feature_filter` config value.

The reason for this difference is that the pipeline will ultimately create
one model for each drug. Due to the way Snakemake works, this means that we
need to know in advance which drugs will be used, so that we can tell
Snakemake which files should be created.
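To make that concrete, here is a minimal sketch of the pattern, not the
pipeline's actual Snakefile: the paths and rule name are hypothetical, and
the point is only that the drug list must be read up front so that the
per-drug target files can be declared to Snakemake:

.. code-block:: python

    # Hypothetical Snakefile fragment; paths and rule names are
    # illustrative only.

    # Read the drugs up front so that every target file is known to
    # Snakemake before the workflow starts.
    with open("runs/run_1/response_list.txt") as f:
        drugs = [line.strip() for line in f if line.strip()]

    rule all:
        input:
            # one model file per drug
            expand("runs/run_1/models/{drug}.model", drug=drugs)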
In practice, deciding which drugs to include may involve some data analysis
outside of the pipeline. For example, a drug may have no effect on any
samples, in which case it is uninteresting; some pre-processing of the data
would be required to figure this out, and the corresponding drug can then be
removed from the `response_list` (a sketch of such pre-processing appears
below).

In contrast, we do not have output files created for each feature, so we
don't need to tell Snakemake about filenames. This allows us to perform the
feature filtering from within the pipeline. In addition, there are generally
far more features than responses, so a hypothetical `feature_list` mechanism
would be awkward and difficult to maintain.
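For reference, a `run_info` entry that wires up all of these pieces might
look like the following sketch. The `sample_list` and `response_list` keys
are the ones described above; the file paths are hypothetical, and the list
files are assumed to contain one sample or drug identifier per line::

    run_info:
      run_1:
        feature_filter: filterfuncs.run1
        response_filter: filterfuncs.response1
        sample_list: "{prefix}/runs/run_1/sample_list.txt"
        response_list: "{prefix}/runs/run_1/response_list.txt"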
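Finally, to illustrate the kind of out-of-pipeline pre-processing mentioned
above, the following sketch drops drugs that show no effect in any sample and
writes the survivors to a `response_list` file. The input filename, its
orientation (drugs as rows, samples as columns), and the zero-variance
criterion are all assumptions for the sake of the example:

.. code-block:: python

    import pandas

    # hypothetical response matrix: drugs as rows, samples as columns
    responses = pandas.read_table("responses.tab", index_col=0)

    # a drug with zero variance across samples has no effect on any of them
    has_effect = responses.var(axis=1) > 0

    # write the remaining drug names, one per line
    with open("response_list.txt", "w") as f:
        f.write("\n".join(responses.index[has_effect]) + "\n")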