Introduction
gffutils
is a Python package for working with GFF and GTF files in a hierarchical manner. It
allows operations which would be complicated or time-consuming using
a text-file-only approach.
Below is a short demonstration of gffutils
. For the full documentation,
see the rest of the table of contents to the left.
Example file
Consider the first gene on chromosome 2L in Drosophila melanogaster, CG11023. Here is a graphical representation of this gene’s three alternative transcripts, from FlyBase :
gffutils
ships with the GFF annotation of this gene, downloaded from
release 5.54 of FlyBase.
We’ll be using it to describe some features of gffutils
, because it
contains just enough complexity to be representative of a real-world,
genome-wide GFF file.
Let’s look at what the file contains:
>>> import gffutils
>>> fn = gffutils.example_filename('intro_docs_example.gff')
>>> print(open(fn).read())
2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0016929,GO:0016926,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB_protein:ACZ94128,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,GB_protein:AGB92323,UniProt/TrEMBL:M9PAY1,OrthoDB7_Drosophila:EOG796K1P,OrthoDB7_Diptera:EOG7X1604,EntrezGene:33155,UniProt/TrEMBL:E1JHP8,UniProt/TrEMBL:Q6KEV3,OrthoDB7_Insecta:EOG7Q8QM7,OrthoDB7_Arthropoda:EOG7R5K68,OrthoDB7_Metazoa:EOG7D59MP,InterologFinder:33155,BIOGRID:59420,FlyAtlas:CG11023-RA,GenomeRNAi:33155;gbunit=AE014134;derived_computed_cyto=21A5-21A5
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=REFSEQ:NM_001169365,FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300690;Name=CG11023-RC;Parent=FBgn0031208;Dbxref=REFSEQ:NM_175941,FlyBase_Annotation_IDs:CG11023-RC;score_text=Strongly Supported;score=15
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0330654;Name=CG11023-RD;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RD,REFSEQ:NM_001272857;score_text=Strongly Supported;score=11
2L FlyBase exon 7529 8116 . + . Name=CG11023:1;Parent=FBtr0300689,FBtr0300690,FBtr0330654;parent_type=mRNA
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase CDS 7680 8116 . + 0 Name=CG11023-cds;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase CDS 7680 8116 . + 0 Name=CG11023-cds;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase intron 8117 8228 . + . Name=CG11023-in;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase intron 8117 8192 . + . Name=CG11023-in;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase intron 8117 8192 . + . Name=CG11023-in;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8193 9484 . + . Name=CG11023:3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase CDS 8193 8610 . + 1 Name=CG11023-cds;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase CDS 8193 8589 . + 1 Name=CG11023-cds;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8193 8589 . + . Name=CG11023:2;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8229 9484 . + . Name=CG11023:4;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase CDS 8229 8610 . + 1 Name=CG11023-cds;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase intron 8590 8667 . + . Name=CG11023-in;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 9277 9484 . + . Name=CG11023-u3;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8668 9484 . + . Name=CG11023:5;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase CDS 8668 9276 . + 0 Name=CG11023-cds;Parent=FBtr0300690;parent_type=mRNA
While this is a (relatively) simple gene, the annotation looks quite complex. In no particular order, some comments about this annotation:
The gene has lots of attributes (last column), which would be nice to be able to access from Python.
The gene and mRNA features have “ID” attributes, while UTRs, exons, CDSs, and introns only have “Name” attributes, e.g.:
gene: ID=FBgn0031208;Name=CG11023... mRNA: ID=FBtr0300689;Name=CG11023-RB... exon: Name=CG11023:1...
The “Name” attributes of UTRs, exons, CDSs, and introns contain the “Name” attribute of the parent gene, followed by some extra info. For exons, it’s a colon and a number; for introns its “-in”, and so on. But the “Parent” mRNA is indicated with the mRNA’s “ID” attribute, e.g.,:
exon: Name=CG11023:3;Parent=FBtr0300689;
For routine tasks like counting reads in exons (say, with HTSeq), it would be nice to be able to add a “gene_id” attribute to exons. That is, change:
exon: Name=CG11023:3;Parent=FBtr0300689;
to:
exon: Name=CG11023:3;Parent=FBtr0300689;gene_id=FBgn0031208;
Then, we could use
htseq-count
like this:htseq-count --type=exon --idattr=gene_id reads.sam fixed_annotation.gff
UTRs, exons, CDSs, and introns have the same name for different features. For example, these two introns have different coordinates and have different mRNA parents, yet have the same “Name” attribute:
2L FlyBase intron 8117 8228 . + . Name=CG11023-in;Parent=FBtr0330654;parent_type=mRNA 2L FlyBase intron 8117 8192 . + . Name=CG11023-in;Parent=FBtr0300689;parent_type=mRNA
It would be nice to have unique IDs for these features so we could differentiate them for, say, counting the unique introns of a gene or generating a file of splice junctions.
Create the database
The first step to working with gffutils
is to import the file into
a local sqlite3 file-based database. The gffutils.create_db()
function
is used for this, and it takes many optional arguments for configuring how to
interpret your GFF or GTF file.
>>> db = gffutils.create_db(fn, dbfn='test.db', force=True, keep_order=True,
... merge_strategy='merge', sort_attribute_values=True)
Here, force=True
will overwrite any existing database, keep_order=True
will maintain the order of the attributes (ID, Name, Parent, and so on), and
sort_attribute_values=True
will ensure the values of each attribute are
always sorted. These latter two settings are necessary here for this
documentation (which undergoes automated testing) to have predictable output,
but in general you can gain a speedup by using the default keep_order=False
and sort_attributes_order=False
.
If your input file follows the GFF or GTF file format specifications, this is
all you need to create a database. But real-world files don’t always
completely adhere to the format specifications. So gffutils
provides lots
of configuration options to tailor the pre-processing to your particular file
– see Importing data into a database for more details, and Examples for more
practical examples.
Use the database
Once the database has been created, downstream code can connect to it simply by:
>>> import gffutils
>>> db = gffutils.FeatureDB('test.db', keep_order=True)
Features are accessed by name:
>>> gene = db['FBgn0031208']
>>> gene
<Feature gene (2L:7529-9484[+]) at 0x...>
Inspect fields from the GFF line:
>>> gene.start
7529
>>> gene.end
9484
Attributes are stored in the attributes
dictionary. Values are always in
a list, even if there’s only one item:
>>> gene.attributes['Name']
['CG11023']
Attributes can also be accessed from the Feature
object itself, to
save some typing:
>>> gene['Name']
['CG11023']
Printing a Feature
reproduces the original GFF line as faithfully as
possible. See the Dialects section for more details on how this is
handled and configured.
>>> print(gene)
2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0016929,GO:0016926,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB_protein:ACZ94128,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,GB_protein:AGB92323,UniProt/TrEMBL:M9PAY1,OrthoDB7_Drosophila:EOG796K1P,OrthoDB7_Diptera:EOG7X1604,EntrezGene:33155,UniProt/TrEMBL:E1JHP8,UniProt/TrEMBL:Q6KEV3,OrthoDB7_Insecta:EOG7Q8QM7,OrthoDB7_Arthropoda:EOG7R5K68,OrthoDB7_Metazoa:EOG7D59MP,InterologFinder:33155,BIOGRID:59420,FlyAtlas:CG11023-RA,GenomeRNAi:33155;gbunit=AE014134;derived_computed_cyto=21A5-21A5
Get all the mRNAs for a gene:
>>> for i in db.children(gene, featuretype='mRNA', order_by='start'):
... print(i)
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Dbxref=REFSEQ:NM_001169365,FlyBase_Annotation_IDs:CG11023-RB;Parent=FBgn0031208;score_text=Strongly Supported;score=11
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300690;Name=CG11023-RC;Dbxref=REFSEQ:NM_175941,FlyBase_Annotation_IDs:CG11023-RC;Parent=FBgn0031208;score_text=Strongly Supported;score=15
2L FlyBase mRNA 7529 9484 . + . ID=FBtr0330654;Name=CG11023-RD;Dbxref=FlyBase_Annotation_IDs:CG11023-RD,REFSEQ:NM_001272857;Parent=FBgn0031208;score_text=Strongly Supported;score=11
Get all the exons for a gene:
>>> for i in db.children(gene, featuretype='exon', order_by='start'):
... print(i)
2L FlyBase exon 7529 8116 . + . Name=CG11023:1;Parent=FBtr0300689,FBtr0300690,FBtr0330654;parent_type=mRNA
2L FlyBase exon 8193 9484 . + . Name=CG11023:3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase exon 8193 8589 . + . Name=CG11023:2;Parent=FBtr0300690;parent_type=mRNA
2L FlyBase exon 8229 9484 . + . Name=CG11023:4;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase exon 8668 9484 . + . Name=CG11023:5;Parent=FBtr0300690;parent_type=mRNA
Does this gene have any constitutive exons (exons found in every alternative
transcript)? This is an example of something that would be difficult to do
using only standard text file processing tools, and demonstrates the usefulness
of the gffutils
:
>>> mRNA_count = len(list(db.children(gene, featuretype='mRNA')))
>>> constitutive_exons = []
>>> for exon in db.features_of_type('exon', order_by='start'):
... parents = db.parents(exon, featuretype='mRNA')
... if len(list(parents)) == mRNA_count:
... constitutive_exons.append(exon)
>>> print(constitutive_exons)
[<Feature exon (2L:7529-8116[+]) at 0x...>]
Retrieve entries by genomic coordinates, which uses the UCSC binning strategy:
>>> list(db.region(region=('2L', 9277, 10000), completely_within=True))
[<Feature three_prime_UTR (2L:9277-9484[+]) at 0x...>]
Retrieve UTRs that overlap the gene (note that a feature can be provided as the coordinates):
>>> for UTR in db.region(gene, featuretype=['three_prime_UTR', 'five_prime_UTR']):
... print(UTR)
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 9277 9484 . + . Name=CG11023-u3;Parent=FBtr0300690;parent_type=mRNA
Now let’s illustrate a problem in the GFF file as it is currently formatted by trying to get the consitutive 5’UTRs in the same way we got the constitutive exons above:
>>> constitutive_5UTRs = []
>>> for utr in db.features_of_type('five_prime_UTR', order_by='start'):
... parents = db.parents(utr, featuretype='mRNA')
... if len(list(parents)) == mRNA_count:
... constitutive_5UTRs.append(utr)
>>> print(constitutive_5UTRs)
[]
Yikes! What happened? This is because by default, each line in the GFF file is treated as a unique feature. From this perspective, the first 5’UTR has two parents, and the second 5’UTR has one parent – there are zero 5’UTRs with all three transcripts as parents.
What we’d really like to do is to merge the features that have identical IDs and coordinates. Specifically, we’d like to have these two features:
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0330654;parent_type=mRNA
become a single feature:
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0300689,FBtr0300690,FBtr0330654;parent_type=mRNA
Similarly, we’d like these three 3’UTRs:
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0300689;parent_type=mRNA
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 9277 9484 . + . Name=CG11023-u3;Parent=FBtr0300690;parent_type=mRNA
to become two (merging only the first two because they have identical coordinates):
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0300689,FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 9277 9484 . + . Name=CG11023-u3;Parent=FBtr0300690;parent_type=mRNA
In order to do this, we need to change how features are loaded into the
database by using merge_strategy=True
and id_spec=['ID', 'Name']
. You can
read more about exactly what these arguments do, and other ways of tweaking the
import, at Database IDs. But for now, let’s see what happens:
>>> db2 = gffutils.create_db(fn, dbfn='test.db', force=True, keep_order=True,
... sort_attribute_values=True,
... merge_strategy='merge',
... id_spec=['ID', 'Name'])
>>> for utr in db2.features_of_type('five_prime_UTR', order_by='start'):
... parents = db2.parents(utr, featuretype='mRNA')
... if len(list(parents)) == mRNA_count:
... constitutive_5UTRs.append(utr)
>>> print(constitutive_5UTRs)
[<Feature five_prime_UTR (2L:7529-7679[+]) at 0x...>]
Let’s look at all the 5’UTRs in the database to see how the merge strategy affected them:
>>> for i in db2.features_of_type('five_prime_UTR'):
...
... print(i)
2L FlyBase five_prime_UTR 7529 7679 . + . Name=CG11023-u5;Parent=FBtr0300689,FBtr0300690,FBtr0330654;parent_type=mRNA
Aha! A single 5’UTR. What about 3’UTRs?
>>> for i in db2.features_of_type('three_prime_UTR'):
... print(i)
2L FlyBase three_prime_UTR 8611 9484 . + . Name=CG11023-u3;Parent=FBtr0300689,FBtr0330654;parent_type=mRNA
2L FlyBase three_prime_UTR 9277 9484 . + . Name=CG11023-u3;Parent=FBtr0300690;parent_type=mRNA
>>> for i in db2.features_of_type('intron'):
... print(i)
2L FlyBase intron 8117 8228 . + . Name=CG11023-in;Parent=FBtr0330654;parent_type=mRNA
2L FlyBase intron 8117 8192 . + . Name=CG11023-in;Parent=FBtr0300689,FBtr0300690;parent_type=mRNA
2L FlyBase intron 8590 8667 . + . Name=CG11023-in;Parent=FBtr0300690;parent_type=mRNA