Obtaining pathway scoresΒΆ
With respect to features, we have many cases where we download annotations as a table with format gene X annotation. However, for training, we need a table of features that is annotation X sample. So for each sample and annotation, we need to calculate a score across all genes for that annotation in that sample.
Take GO terms for example. Imagine genes 1, 2, 3, and 4 are annotated with GO term “GO:001”. To get a score for “GO:001” in cell line “A”, we need some sort of cell-line-specific data . . . like, say, RNA-seq. So imagine we also have RNA-seq data where we’ve calculated zscores for each gene in each cell line. To give a concrete example, let’s say the zscores in cell line “A” for genes 1, 2, 3, and 4 are -3, -1, 0, and 5 respectively.
Possible scores for “GO:001” in cell line “A” might be:
- sum of all zscores = 1
- sum of downregulated = -4
- sum of upregulated = 5
- mean zscore = 0.25
- fraction of all “GO:001“-annotated genes that are upregulated = 1/4
- fraction of all “GO:001“-annotated genes that are downregulated = 2/4
- fraction of all “GO:001“-annotated genes that are changed = 3/4
- mean up-regulated zscore = 5
- mean downregulated zscore = -2
The Python package “pandas” has some ridiculously fast pivot tables as long as we restrict ourselves to NumPy and pandas.Series functions.
In the example below, file1
is a gene X annotation tab-delimited file
where one column, “GO”, contains the GO accession for each gene. file2
is
a gene x sample CSV file where values are zscores.
First, we load the files and join them on their index column. So now we have columns for all samples as well as additional columns for annotations (one of which is the GO accession column):
import pandas as pd
x1 = pd.read_table(file1, delimiter='\t', index_col=0)
x2 = pd.read_table(file2, delimiter=',', index_col=0)
x = x1.join(x2)
Then we use pandas.pivot_table
to do all the work for us. The important
things are which column to aggregate a score for (“GO”) and how to do the
aggregation (“aggfunc”):
# sum of all zscores
y0 = pd.pivot_table(x, index='GO', aggfunc=np.sum)
# sum of all upregulated
y1 = pd.pivot_table(x[x>0], index='GO', aggfunc=np.sum)
# mean of all upregulated
y2 = pd.pivot_table(x[x>0], index='GO', aggfunc=np.mean)
# fraction upregulated
# first get the count of upregulated. This calls the .count() method on
# Series objects.
y3 = pd.pivot_table(x[x>0], index='GO', aggfunc='count')
# Then get how many there are total for each GO term.
y4 = pd.pivot_table(x[x>0], index='GO', aggfunc=len)
# Get the fraction
y5 = y3 / y4
Note that at the end of the machine learning, we will want to be able to inspect the resulting models for variable importance. In the above example, all of the resulting y* dataframes have similar indexes (e.g., the sum-of-zscores is indexed by GO term, and so is the sum-of-all-upregulated, and so on). So if we include more than one of them in the final set of features, we won’t know which is which. The solution to this is to append a unique tag to the indexes:
def index_converter(df, label):
return pd.Series(df.index).apply(lambda x: x.replace(':', '_') + label)
y0.index = index_converter(y0, '_sum')
y1.index = index_converter(y0, '_upsum')
y2.index = index_converter(y0, '_upavg')
y5.index = index_converter(y0, '_upfrac')
While documented in long form here, this is actually implemented in the tools/pipeline_helpers.py module in pathway_scores_from_zscores(). There are analagous functions in that module to handle variants and CNV data.