.. _pathway_scores:

Obtaining pathway scores
========================
Annotations are often downloaded as a table with format `gene X annotation`.
However, for training, we need a table of features that is
`annotation X sample`. So for each sample and annotation, we need to calculate
a score, in that sample, across all genes that carry that annotation.

Take GO terms for example. Imagine genes 1, 2, 3, and 4 are annotated with GO
term "GO:001". To get a score for "GO:001" in cell line "A", we need some sort
of cell-line-specific data, such as RNA-seq. So imagine we also have
RNA-seq data where we've calculated zscores for each gene in each cell line.
To give a concrete example, let's say the zscores in cell line "A" for genes 1,
2, 3, and 4 are -3, -1, 0, and 5 respectively.


Possible scores for "GO:001" in cell line "A" might be:

    - sum of all zscores = 1
    - sum of downregulated = -4
    - sum of upregulated = 5
    - mean zscore = 0.25
    - fraction of all "GO:001"-annotated genes that are upregulated = 1/4
    - fraction of all "GO:001"-annotated genes that are downregulated = 2/4
    - fraction of all "GO:001"-annotated genes that are changed = 3/4
    - mean upregulated zscore = 5
    - mean downregulated zscore = -2
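
The arithmetic above can be checked with a few lines of pandas. This is only a
sketch of the worked example: the gene labels and the keys of the ``scores``
dictionary are made up for illustration and are not names used by the pipeline.

```python
import pandas as pd

# zscores in cell line "A" for the four genes annotated with "GO:001"
# (gene labels are hypothetical)
z = pd.Series([-3, -1, 0, 5], index=['gene1', 'gene2', 'gene3', 'gene4'])

up = z[z > 0]      # upregulated genes only
down = z[z < 0]    # downregulated genes only

scores = {
    'sum': z.sum(),
    'sum_down': down.sum(),
    'sum_up': up.sum(),
    'mean': z.mean(),
    'frac_up': len(up) / len(z),
    'frac_down': len(down) / len(z),
    'frac_changed': (z != 0).sum() / len(z),
    'mean_up': up.mean(),
    'mean_down': down.mean(),
}
```

Each entry of ``scores`` corresponds to one bullet in the list above, in the
same order.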


The Python package pandas provides fast pivot tables, as long as we restrict
the aggregation to NumPy and ``pandas.Series`` functions.

In the example below, ``file1`` is a `gene X annotation` tab-delimited file
where one column, "GO", contains the GO accession for each gene. ``file2`` is
a `gene X sample` CSV file where values are zscores.

First, we load the files and join them on their index column. So now we have
columns for all samples as well as additional columns for annotations (one of
which is the GO accession column)::

    import pandas as pd

    # gene X annotation table (tab-delimited) and gene X sample zscores (CSV)
    x1 = pd.read_csv(file1, sep='\t', index_col=0)
    x2 = pd.read_csv(file2, index_col=0)

    # Left join on the shared index column
    x = x1.join(x2)

Then we use ``pandas.pivot_table`` to do all the work for us. The important
arguments are which column to group by (``index='GO'``) and how to aggregate
the genes in each group (``aggfunc``)::

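    # Sample columns hold the zscores; the remaining columns are annotations.
    samples = list(x2.columns)

    # Keep only positive zscores. Mask just the sample columns so that the
    # "GO" column is left intact; masked values become NaN.
    up = x.copy()
    up[samples] = x[samples].where(x[samples] > 0)

    # sum of all zscores
    y0 = pd.pivot_table(x, index='GO', values=samples, aggfunc='sum')

    # sum of all upregulated
    y1 = pd.pivot_table(up, index='GO', values=samples, aggfunc='sum')

    # mean of all upregulated
    y2 = pd.pivot_table(up, index='GO', values=samples, aggfunc='mean')

    # fraction upregulated
    # First get the count of upregulated. This calls the .count() method,
    # which ignores the NaNs left by the mask.
    y3 = pd.pivot_table(up, index='GO', values=samples, aggfunc='count')

    # Then get how many genes there are in total for each GO term.
    y4 = pd.pivot_table(x, index='GO', values=samples, aggfunc=len)

    # Get the fraction
    y5 = y3 / y4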
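
The downregulated and "changed" scores from the earlier list follow the same
pattern. Below is a minimal sketch on a hypothetical toy table. The gene names,
values, ``samples`` list and ``masked`` helper are assumptions for illustration
and are not part of the pipeline. It masks only the zscore columns so that the
"GO" column stays usable as the pivot index.

```python
import pandas as pd

# Hypothetical toy table shaped like the joined table: one "GO" annotation
# column plus one zscore column per sample.
x = pd.DataFrame(
    {'GO': ['GO:001', 'GO:001', 'GO:001', 'GO:001', 'GO:002'],
     'A': [-3.0, -1.0, 0.0, 5.0, 2.0],
     'B': [1.0, 2.0, -1.0, 0.0, -4.0]},
    index=['gene1', 'gene2', 'gene3', 'gene4', 'gene5'])
samples = ['A', 'B']


def masked(df, keep):
    """Copy df, setting zscores to NaN wherever `keep` is False."""
    out = df.copy()
    out[samples] = df[samples].where(keep)
    return out


down = masked(x, x[samples] < 0)      # downregulated zscores only
changed = masked(x, x[samples] != 0)  # any non-zero zscore

# Total number of genes per GO term (len counts NaN entries too)
total = pd.pivot_table(x, index='GO', values=samples, aggfunc=len)

down_sum = pd.pivot_table(down, index='GO', values=samples, aggfunc='sum')
down_mean = pd.pivot_table(down, index='GO', values=samples, aggfunc='mean')
down_frac = pd.pivot_table(
    down, index='GO', values=samples, aggfunc='count') / total
changed_frac = pd.pivot_table(
    changed, index='GO', values=samples, aggfunc='count') / total
```

For "GO:001" in sample "A" this reproduces the worked values from the list:
a downregulated sum of -4, a downregulated mean of -2, a downregulated fraction
of 2/4 and a changed fraction of 3/4.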


At the end of the machine learning, we will want to inspect the resulting
models for variable importance. In the above example, all of the resulting
`y*` dataframes share the same kind of index labels (e.g., the sum-of-zscores
is indexed by GO term, and so is the sum-of-all-upregulated, and so on). So if
we include more than one of them in the final set of features, we won't be
able to tell which score a given row came from. The solution is to append a
unique tag to each index::

    def index_converter(df, label):
        # "GO:001" with label "_sum" becomes "GO_001_sum"
        return pd.Series(df.index).apply(lambda x: x.replace(':', '_') + label)

    # Convert each dataframe's own index, since the indexes can differ
    y0.index = index_converter(y0, '_sum')
    y1.index = index_converter(y1, '_upsum')
    y2.index = index_converter(y2, '_upavg')
    y5.index = index_converter(y5, '_upfrac')
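
Once the indexes are tagged, the score tables can be stacked into a single
`annotation X sample` feature table. The sketch below shows one way to do that
with ``pandas.concat``. The toy values and the ``tag_index`` helper are
hypothetical, and this is not necessarily how the pipeline combines its
features.

```python
import pandas as pd

# Hypothetical score tables, both indexed by GO term, one column per sample.
y0 = pd.DataFrame({'A': [1.0, 2.0]}, index=['GO:001', 'GO:002'])
y1 = pd.DataFrame({'A': [5.0, 2.0]}, index=['GO:001', 'GO:002'])


def tag_index(df, label):
    """Return a copy of df whose index labels are made unique with `label`."""
    out = df.copy()
    out.index = [i.replace(':', '_') + label for i in df.index]
    return out


# Stack the tagged tables row-wise: one row per (GO term, score type).
features = pd.concat([tag_index(y0, '_sum'), tag_index(y1, '_upsum')])
```

Because every row label now carries its score type, a feature flagged as
important (say ``GO_001_upsum``) can be traced back to both the GO term and the
aggregation that produced it.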

While documented in long form here, this is actually implemented in the
`tools/pipeline_helpers.py` module in `pathway_scores_from_zscores()`. There
are analogous functions in that module to handle variants and CNV data.