Examples
Often, GTF/GFF files have non-standard formatting. While gffutils
is
set up to handle standard formatting by default, it also allows great
flexibility for working with non-standard files. These examples collect
various real-world, difficult-to-work-with files and show how to make them work
with gffutils
. If you have a troublesome file, the best thing to do is
to look at the first few lines for these examples and see if you can find one
with matching formatting.
These example files ship with gffutils
. For interactive use, their
absolute path can be obtained with the gffutils.example_filename()
function. Each example shows the contents of the file, along with text
explaining the troublesome parts and which settings to use in order to
successfully import into a database and retrieve features.
Note
Be mindful of the difference between the ID attribute of a feature and the primary key of that feature in the database.
Feature.id
will always return the primary key. Feature["ID"]
will
return the ID attribute of the feature, if it has one. In many (but not
all) cases, Feature.id == Feature["ID"]
Note
Also be mindful of the difference between an attribute of a Python object vs the 9th field of a GFF line, which is called the “attributes” field according to the spec. Confusing? Yes!
mouse_extra_comma.gff3
This file has gene, mRNA, protein, UTR, CDS, and exon annotations. However, not all of them have a useful ID in their attributes.
File contents
chr17 RefSeq gene 6797760 6818159 . + . ID=NC_000083.5:LOC100040603;Name=NC_000083.5:LOC100040603
chr17 RefSeq mRNA 6797760 6818159 . + . ID=XM_001475631.1;Parent=NC_000083.5:LOC100040603
chr17 RefSeq protein 6806527 6812289 . + . ID=;Parent=XM_001475631.1
chr17 RefSeq five_prime_UTR 6797760 6797769 . + . Parent=XM_001475631.1
chr17 RefSeq five_prime_UTR 6806513 6806526 . + . Parent=XM_001475631.1
chr17 RefSeq CDS 6806527 6806553 . + 0 Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17 RefSeq CDS 6808204 6808245 . + 0 Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17 RefSeq CDS 6811330 6811453 . + 0 Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17 RefSeq CDS 6811792 6811869 . + 2 Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17 RefSeq CDS 6812219 6812289 . + 2 Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17 RefSeq three_prime_UTR 6812290 6818159 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6797760 6797769 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6806513 6806553 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6808204 6808245 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6811330 6811453 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6811792 6811869 . + . Parent=XM_001475631.1
chr17 RefSeq exon 6812219 6818159 . + . Parent=XM_001475631.1
Import
The defaults for gffutils
will load this example into a database.
However, the CDSs do not have ID attributes, and the default for GFF files is
id_spec="ID"
. As described at id_spec, they will be accessed by the
names "CDS_1"
, "CDS_2"
, and so on.
Note that using id_spec=["ID", "Name"]
won’t work, because the CDSs share the
same Name attribute, CDS:NC_000083.5:LOC100040603
. We could use
id_spec=["ID", "Name"]
in combination with merge_strategy="create_unique"
,
which would give ids like CDS:NC_000083.5:LOC100040603_1
(with an underscore
an integer at the end for each consecutive occurence of the same ID). But
we’ll go with the simpler default strategy for this file:
>>> import gffutils
>>> fn = gffutils.example_filename('mouse_extra_comma.gff3')
>>> db = gffutils.create_db(fn, ':memory:')
Example access
Since the mRNA on line 2 has an ID, we can simply access it by ID:
>>> db['XM_001475631.1']
<Feature mRNA (chr17:6797760-6818159[+]) at 0x...>
It’s not very convenient to access the CDSs by their newly created ids
"CDS_1"
, "CDS_2"
, etc. (Which one was “1”? Is that from the same gene as
CDS “5”?) But we can access those CDSs as children of the mRNA:
>>> [f.id for f in db.children('XM_001475631.1', featuretype='CDS')]
['CDS_1', 'CDS_2', 'CDS_3', 'CDS_4', 'CDS_5']
Or as “grandchildren” of the gene:
>>> [f.id for f in db.children('NC_000083.5:LOC100040603', featuretype='CDS', level=2)]
['CDS_1', 'CDS_2', 'CDS_3', 'CDS_4', 'CDS_5']
Note that the database id values are all unique for the CDSs, but their
"Name"
attributes are still all the same, as expected:
>>> list(set([f['Name'][0] for f in db.children('XM_001475631.1', featuretype='CDS')]))
['CDS:NC_000083.5:LOC100040603']
c_elegans_WS199_ann_gff.txt
This is a minimal file, with only one feature and that one feature has no genomic coordinates. It illustrates how things behave when there is very little information for a feature.
File contents
# modified GFF file to remove location coordinates and test annotations
I Expr_profile experimental_result_region . . . + . expr_profile=B0019.1
Import
Import is straightforward:
>>> fn = gffutils.example_filename('c_elegans_WS199_ann_gff.txt')
>>> db = gffutils.create_db(fn, ':memory:')
Access
Confirm there’s only a single feature imported:
>>> list(db.all_features())
[<Feature experimental_result_region (I:.-.[+]) at 0x...]
With no obvious ID fields in the attributes, we can access the single feature by name using the feature type:
>>> db['experimental_result_region_1']
<Feature experimental_result_region (I:.-.[+]) at 0x...>
This shows that the internal representation of null values is None:
>>> f = db['experimental_result_region_1']
>>> assert f.start is None
>>> assert f.stop is None
But the string representation shows “.
” placeholders:
>>> print(f)
I Expr_profile experimental_result_region . . . + . expr_profile=B0019.1
c_elegans_WS199_shortened_gff.txt
This file contains many different kinds of features; much like mouse_extra_comma.gff3 above, some features have IDs, some do not, and there are many that share the same feature ID. This example shows a common way of subsetting features (here, getting all the non-historical CDSs for a gene).
File contents
I Orfeome PCR_product 12759747 12764936 . - . amplified=1;pcr_product=mv_B0019.1
I SAGE_tag_unambiguously_mapped SAGE_tag 12763533 12763553 . - . count=1;gene=amx-2;sequence=SAGE:ggcagagtcttttggca;transcript=B0019.1
I SAGE_tag_unambiguously_mapped SAGE_tag 12761492 12761512 . - . count=5;gene=amx-2;sequence=SAGE:aacggagccgtacacgc;transcript=B0019.1
I SAGE_tag_most_three_prime SAGE_tag 12761499 12761512 . - . count=9;gene=amx-2;sequence=SAGE:aacggagccg;transcript=B0019.1
X SAGE_tag SAGE_tag 6819353 6819366 . + . count=9;gene=amx-2;sequence=SAGE:aacggagccg;transcript=B0019.1
I Expr_profile experimental_result_region 12762449 12764118 . + . expr_profile=B0019.1
I Coding_transcript CDS 12759745 12759828 . - 0 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12759949 12760013 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12760227 12760319 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12760365 12760494 . - 0 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12760834 12760904 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12761172 12761516 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12761799 12761953 . - 1 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12762127 12762268 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12762648 12762806 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12763112 12763249 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12763448 12763655 . - 0 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12763729 12763882 . - 1 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12763979 12764102 . - 2 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12764291 12764471 . - 0 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I Coding_transcript CDS 12764812 12764937 . - 0 ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I history CDS 12759745 12759828 . - 0 ID=CDS:B0019.1:wp173
I history CDS 12759949 12760013 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12760227 12760319 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12760365 12760494 . - 0 ID=CDS:B0019.1:wp173
I history CDS 12760834 12760904 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12761172 12761516 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12761577 12761626 . - 1 ID=CDS:B0019.1:wp173
I history CDS 12761795 12761953 . - 1 ID=CDS:B0019.1:wp173
I history CDS 12762127 12762268 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12762648 12762806 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12763112 12763249 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12763448 12763655 . - 0 ID=CDS:B0019.1:wp173
I history CDS 12763729 12763882 . - 1 ID=CDS:B0019.1:wp173
I history CDS 12763979 12764102 . - 2 ID=CDS:B0019.1:wp173
I history CDS 12764291 12764471 . - 0 ID=CDS:B0019.1:wp173
I history CDS 12764812 12764937 . - 0 ID=CDS:B0019.1:wp173
I history CDS 12759745 12759828 . - 0 ID=CDS:B0019.1:wp90
I history CDS 12759949 12760013 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12760227 12760319 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12761172 12761516 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12761577 12761626 . - 1 ID=CDS:B0019.1:wp90
I history CDS 12761795 12761953 . - 1 ID=CDS:B0019.1:wp90
I history CDS 12762127 12762268 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12762648 12762806 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12763112 12763249 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12763469 12763655 . - 0 ID=CDS:B0019.1:wp90
I history CDS 12763729 12763882 . - 1 ID=CDS:B0019.1:wp90
I history CDS 12763979 12764102 . - 2 ID=CDS:B0019.1:wp90
I history CDS 12764291 12764471 . - 0 ID=CDS:B0019.1:wp90
I history CDS 12764812 12764937 . - 0 ID=CDS:B0019.1:wp90
I mass_spec_genome translated_nucleotide_match 12761920 12761953 . - . ID=Target:381130;Target=Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK 10 21 +;Note=MSP:FADFSPLDVSDVNFATDDLAK;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=3
I mass_spec_genome translated_nucleotide_match 12762127 12762155 . - . ID=Target:381130;Target=Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK 1 10 +;Note=MSP:FADFSPLDVSDVNFATDDLAK;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=3
I mass_spec_genome translated_nucleotide_match 12763506 12763559 . - . ID=Target:381133;Target=Mass_spec_peptide:MSP:FGHGQSLLAQGGMNEVVR 1 18 +;Note=MSP:FGHGQSLLAQGGMNEVVR;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=1
I mass_spec_genome translated_nucleotide_match 12764361 12764411 . - . ID=Target:381144;Target=Mass_spec_peptide:MSP:NIQQNRPGLSVLVLEAR 1 17 +;Note=MSP:NIQQNRPGLSVLVLEAR;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=2
I Coding_transcript mRNA 12759582 12764949 . - . ID=Transcript:B0019.1;Note=amx-2;Parent=Gene:WBGene00000138;cds=B0019.1;prediction_status=Partially_confirmed;wormpep=CE:CE40797
I Allele SNP 12764272 12764272 . + . interpolated_map_position=14.003;rflp=No;variation=snp_B0019[1]
I Oligo_set reagent 12759745 12761589 . - . oligo_set=Aff_B0019.1
I Coding_transcript exon 12759745 12759828 . - 0 Parent=Transcript:B0019.1
I Coding_transcript exon 12759949 12760013 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12760227 12760319 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12760365 12760494 . - 0 Parent=Transcript:B0019.1
I Coding_transcript exon 12760834 12760904 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12761172 12761516 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12761799 12761953 . - 1 Parent=Transcript:B0019.1
I Coding_transcript exon 12762127 12762268 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12762648 12762806 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12763112 12763249 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12763448 12763655 . - 0 Parent=Transcript:B0019.1
I Coding_transcript exon 12763729 12763882 . - 1 Parent=Transcript:B0019.1
I Coding_transcript exon 12763979 12764102 . - 2 Parent=Transcript:B0019.1
I Coding_transcript exon 12764291 12764471 . - 0 Parent=Transcript:B0019.1
I Coding_transcript exon 12764812 12764937 . - 0 Parent=Transcript:B0019.1
I Coding_transcript five_prime_UTR 12764938 12764949 . - . Parent=Transcript:B0019.1
I Coding_transcript three_prime_UTR 12759582 12759744 . - . Parent=Transcript:B0019.1
I Coding_transcript intron 12760495 12760833 . - . Parent=Transcript:B0019.1;confirmed_est=EC027594
I Coding_transcript intron 12760905 12761171 . - . Parent=Transcript:B0019.1;confirmed_est=EC027594
I Coding_transcript intron 12761517 12761798 . - . Parent=Transcript:B0019.1;confirmed_est=EC027594
I Coding_transcript intron 12759829 12759948 . - . Parent=Transcript:B0019.1;confirmed_est=EC034652
I Coding_transcript intron 12760014 12760226 . - . Parent=Transcript:B0019.1;confirmed_est=EC034652
I Coding_transcript intron 12760320 12760364 . - . Parent=Transcript:B0019.1;confirmed_est=yk1054h04.3
I Coding_transcript intron 12763883 12763978 . - . Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I Coding_transcript intron 12764103 12764290 . - . Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I Coding_transcript intron 12764472 12764811 . - . Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I Coding_transcript intron 12762807 12763111 . - . Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I Coding_transcript intron 12763250 12763447 . - . Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I Coding_transcript intron 12763656 12763728 . - . Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I Coding_transcript intron 12761954 12762126 . - . Parent=Transcript:B0019.1;confirmed_est=yk262g9.5
I Coding_transcript intron 12762269 12762647 . - . Parent=Transcript:B0019.1;confirmed_est=yk262g9.5
I Promoterome PCR_product 12764938 12766937 . + . pcr_product=p_B0019.1_93
I GenePair_STS PCR_product 12762449 12764118 . + . pcr_product=sjj_B0019.1
I Coding_transcript gene 12759582 12764949 . - . ID=Gene:WBGene00000138
III Orfeome PCR_product 13780230 13780850 . + . amplified=1;pcr_product=mv_3R5.1.v6
IV Orfeome PCR_product 17486939 17488952 . - . amplified=1;pcr_product=mv_4R79.1
IV Orfeome PCR_product 17480353 17483284 . - . amplified=1;pcr_product=mv_4R79.2
X Orfeome PCR_product 17714881 17718531 . + . amplified=1;pcr_product=mv_6R55.1
X Orfeome PCR_product 17712787 17714742 . + . amplified=1;pcr_product=mv_6R55.2
II Orfeome PCR_product 6995874 7010146 . + . amplified=1;pcr_product=mv_AAA03517
III Orfeome PCR_product 5625097 5631795 . + . amplified=1;pcr_product=mv_AAA03544
X GenePair_STS PCR_product 9962853 9963737 . + . pcr_product=cenix:102-c3
II GenePair_STS PCR_product 5507236 5508135 . + . pcr_product=cenix:102-c4
V GenePair_STS PCR_product 10117842 10118735 . + . pcr_product=cenix:102-c5
IV GenePair_STS PCR_product 3566130 3567025 . + . pcr_product=cenix:102-c6
X GenePair_STS PCR_product 6117180 6117930 . + . pcr_product=cenix:102-c7
IV GenePair_STS PCR_product 7189492 7190369 . + . pcr_product=cenix:102-c9
II GenePair_STS PCR_product 14462527 14463202 . + . pcr_product=cenix:102-d1
X Promoterome PCR_product 2258069 2259336 . + . pcr_product=p_AH9.2_93
IV Promoterome PCR_product 12157449 12159448 . + . pcr_product=p_B0001.6_93
I Promoterome PCR_product 12764938 12766937 . + . pcr_product=p_B0019.1_93
V Promoterome PCR_product 10320122 10320689 . + . pcr_product=p_B0024.12_93
I Coding_transcript CDS 4581214 4581237 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4581664 4582026 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4582412 4582718 . - 1 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583190 4583374 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583426 4583509 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583560 4583805 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript mRNA 4580734 4583815 . - . ID=Transcript:D1007.5b.1;Parent=Gene:WBGene00017003;cds=D1007.5b;prediction_status=Confirmed;wormpep=WP:CE33577
I Coding_transcript mRNA 4581214 4583811 . - . ID=Transcript:D1007.5b.2;Parent=Gene:WBGene00017003;cds=D1007.5b;prediction_status=Confirmed;wormpep=WP:CE33577
I Coding_transcript exon 4581214 4581237 . - 0 Parent=Transcript:D1007.5b.1
I Coding_transcript exon 4581664 4582026 . - 0 Parent=Transcript:D1007.5b.1
I Coding_transcript exon 4582412 4582718 . - 1 Parent=Transcript:D1007.5b.1
I Coding_transcript exon 4583190 4583374 . - 0 Parent=Transcript:D1007.5b.1
I Coding_transcript exon 4583426 4583509 . - 0 Parent=Transcript:D1007.5b.1
I Coding_transcript exon 4583560 4583805 . - 0 Parent=Transcript:D1007.5b.1
I Coding_transcript five_prime_UTR 4583806 4583815 . - . Parent=Transcript:D1007.5b.1
I Coding_transcript three_prime_UTR 4580734 4581213 . - . Parent=Transcript:D1007.5b.1
I Coding_transcript intron 4582027 4582411 . - . Parent=Transcript:D1007.5b.1;confirmed_est=EB994038
I Coding_transcript intron 4583375 4583425 . - . Parent=Transcript:D1007.5b.1;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4583510 4583559 . - . Parent=Transcript:D1007.5b.1;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4582719 4583189 . - . Parent=Transcript:D1007.5b.1;confirmed_est=yk1055g06.5,OSTF085G5_1
I Coding_transcript intron 4581238 4581663 . - . Parent=Transcript:D1007.5b.1;confirmed_est=yk1057e08.3
I Coding_transcript exon 4581214 4581237 . - 0 Parent=Transcript:D1007.5b.2
I Coding_transcript exon 4581664 4582026 . - 0 Parent=Transcript:D1007.5b.2
I Coding_transcript exon 4582412 4582718 . - 1 Parent=Transcript:D1007.5b.2
I Coding_transcript exon 4583190 4583374 . - 0 Parent=Transcript:D1007.5b.2
I Coding_transcript exon 4583426 4583509 . - 0 Parent=Transcript:D1007.5b.2
I Coding_transcript exon 4583560 4583805 . - 0 Parent=Transcript:D1007.5b.2
I Coding_transcript five_prime_UTR 4583806 4583811 . - . Parent=Transcript:D1007.5b.2
I Coding_transcript intron 4582027 4582411 . - . Parent=Transcript:D1007.5b.2;confirmed_est=EB994038
I Coding_transcript intron 4583375 4583425 . - . Parent=Transcript:D1007.5b.2;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4583510 4583559 . - . Parent=Transcript:D1007.5b.2;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4582719 4583189 . - . Parent=Transcript:D1007.5b.2;confirmed_est=yk1055g06.5,OSTF085G5_1
I Coding_transcript intron 4581238 4581663 . - . Parent=Transcript:D1007.5b.2;confirmed_est=yk1057e08.3
I Coding_transcript gene 4580693 4583815 . - . ID=Gene:WBGene00017003
I SAGE_tag_unambiguously_mapped SAGE_tag 4581093 4581113 . - . count=10;gene=D1007.5;sequence=SAGE:tttgcgaattacttgct;transcript=D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4580748 4580768 . - . count=112;gene=D1007.5;sequence=SAGE:ttttccattaattttga;transcript=D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4582415 4582428 . - . count=1;gene=D1007.5;sequence=SAGE:cattttcgtg;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4580914 4580927 . - . count=1;gene=D1007.5;sequence=SAGE:taaatttcaa;transcript=D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4581193 4581206 . - . count=1;gene=D1007.5;sequence=SAGE:tgctcgttcg;transcript=D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4583465 4583478 . - . count=1;gene=D1007.5;sequence=SAGE:tgttggcctt;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4583458 4583478 . - . count=1;gene=D1007.5;sequence=SAGE:tgttggccttttacttg;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4582533 4582553 . - . count=2;gene=D1007.5;sequence=SAGE:tgcagtgatagtccagc;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4581100 4581113 . - . count=2;gene=D1007.5;sequence=SAGE:tttgcgaatt;transcript=D1007.5b.1,D1007.5a
I SAGE_tag_unambiguously_mapped SAGE_tag 4580755 4580768 . - . count=43;gene=D1007.5;sequence=SAGE:ttttccatta;transcript=D1007.5b.1,D1007.5a
I Coding_transcript CDS 4580993 4581241 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4581664 4582026 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4582412 4582718 . - 1 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583190 4583374 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583426 4583509 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583560 4583805 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I mass_spec_genome translated_nucleotide_match 4580996 4581052 . - . ID=Target:277116;Target=Mass_spec_peptide:MSP:IYEPSQEDLLLMHQLQQER 1 19 +;Note=MSP:IYEPSQEDLLLMHQLQQER;cds_matches=D1007.5a;protein_matches=WP:CE29034;times_observed=1
I mass_spec_genome translated_nucleotide_match 4581838 4581882 . - . ID=Target:277138;Target=Mass_spec_peptide:MSP:AAIHLGSWHQIEGPR 1 15 +;Note=MSP:AAIHLGSWHQIEGPR;cds_matches=D1007.5b D1007.5a;protein_matches=WP:CE33577 WP:CE29034;times_observed=1
I mass_spec_genome translated_nucleotide_match 4583581 4583601 . - . ID=Target:277176;Target=Mass_spec_peptide:MSP:TLWWLPK 1 7 +;Note=MSP:TLWWLPK;cds_matches=D1007.5b D1007.5a;protein_matches=WP:CE33577 WP:CE29034;times_observed=1
I Coding_transcript mRNA 4580693 4583811 . - . ID=Transcript:D1007.5a;Parent=Gene:WBGene00017003;cds=D1007.5a;prediction_status=Confirmed;wormpep=CE:CE29034
I Coding_transcript exon 4580993 4581241 . - 0 Parent=Transcript:D1007.5a
I Coding_transcript exon 4581664 4582026 . - 0 Parent=Transcript:D1007.5a
I Coding_transcript exon 4582412 4582718 . - 1 Parent=Transcript:D1007.5a
I Coding_transcript exon 4583190 4583374 . - 0 Parent=Transcript:D1007.5a
I Coding_transcript exon 4583426 4583509 . - 0 Parent=Transcript:D1007.5a
I Coding_transcript exon 4583560 4583805 . - 0 Parent=Transcript:D1007.5a
I Coding_transcript five_prime_UTR 4583806 4583811 . - . Parent=Transcript:D1007.5a
I Coding_transcript three_prime_UTR 4580693 4580992 . - . Parent=Transcript:D1007.5a
I Coding_transcript intron 4582027 4582411 . - . Parent=Transcript:D1007.5a;confirmed_est=EB994038
I Coding_transcript intron 4581242 4581663 . - . Parent=Transcript:D1007.5a;confirmed_est=EB994038,OSTR085G5_1
I Coding_transcript intron 4583375 4583425 . - . Parent=Transcript:D1007.5a;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4583510 4583559 . - . Parent=Transcript:D1007.5a;confirmed_est=EC038345,OSTF085G5_1
I Coding_transcript intron 4582719 4583189 . - . Parent=Transcript:D1007.5a;confirmed_est=yk1055g06.5,OSTF085G5_1
Import
If merge_strategy="merge"
, is used here, exceptions will be raised because
all those CDSs cannot be logically merged: they have different
coordinates. So merge_strategy="create_unique"
is used instead. This kwarg
will add an underscore and an integer to each duplicate value used as a primary
key:
>>> fn = gffutils.example_filename('c_elegans_WS199_shortened_gff.txt')
>>> db = gffutils.create_db(fn, ":memory:", merge_strategy='create_unique', keep_order=True)
Access
Getting a gene:
>>> g = db['Gene:WBGene00017003']
>>> g
<Feature gene (I:4580693-4583815[-]) at 0x...>
Get all “non-historical” CDSs for this gene. This illustrates that:
you can use a
Feature
object instead of a string for the database primary keyprinting a
Feature
reconstructs the line, andthe string representation of a line does not include a
\n
.
>>> for i in db.children(g, featuretype='CDS', order_by='start'):
... if i.source != 'history':
... print(i)
I Coding_transcript CDS 4580993 4581241 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4581214 4581237 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4581664 4582026 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4581664 4582026 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4582412 4582718 . - 1 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4582412 4582718 . - 1 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583190 4583374 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583190 4583374 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583426 4583509 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583426 4583509 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I Coding_transcript CDS 4583560 4583805 . - 0 ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I Coding_transcript CDS 4583560 4583805 . - 0 ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
ensembl_gtf.txt
In this GTF file, gene_id and transcript_id are identical, creating problems for unique IDs in the database.
File contents
I snoRNA exon 3747 3909 . - . gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; gene_name "Y74C9A.6"; transcript_name "NR_001477.2";
I protein_coding exon 12764812 12764949 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12764812 12764937 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding start_codon 12764935 12764937 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding exon 12764291 12764471 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "2"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12764291 12764471 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "2"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12763979 12764102 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "3"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12763979 12764102 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "3"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12763729 12763882 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "4"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12763729 12763882 . - 1 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "4"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12763448 12763655 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "5"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12763448 12763655 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "5"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12763112 12763249 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "6"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12763112 12763249 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "6"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12762648 12762806 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "7"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12762648 12762806 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "7"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12762127 12762268 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "8"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12762127 12762268 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "8"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12761799 12761953 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "9"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12761799 12761953 . - 1 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "9"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12761172 12761516 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "10"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12761172 12761516 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "10"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12760834 12760904 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "11"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12760834 12760904 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "11"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12760365 12760494 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "12"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12760365 12760494 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "12"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12760227 12760319 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "13"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12760227 12760319 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "13"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12759949 12760013 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "14"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12759949 12760013 . - 2 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "14"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding exon 12759579 12759828 . - . gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1";
I protein_coding CDS 12759748 12759828 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I protein_coding stop_codon 12759745 12759747 . - 0 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1";
Import
As part of the import process, gffutils
creates new “derived” features
for genes and transcripts (see GTF files). These are not explicitly defined in a GTF file, but
can be inferred. Even though gene and transcript features do not exist in the
file, they will still be created. This is why id_spec={'gene': 'gene_id'}
works: the primary key of “gene” features – which don’t exist in the GTF file
but are inferred by gffutils
– will be the “gene_id” attribute.
Note that in this example, gene_id and transcript_id are identical. This causes all sorts of problems. To get around this we can write a transform function (see transform for details) that applies an arbitrary transformation to a dictionary returned by the parser. This will modify the data upon import into the database:
>>> def transform_func(x):
... # adds some text to the end of transcript IDs
... if 'transcript_id' in x.attributes:
... x.attributes['transcript_id'][0] += '_transcript'
... return x
Now we can supply this tranform function to create_db()
:
>>> fn = gffutils.example_filename('ensembl_gtf.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'gene': 'gene_id', 'transcript': "transcript_id"},
... merge_strategy="create_unique",
... transform=transform_func,
... keep_order=True)
Access
We can access the derived “gene” feature by its ID (recall it doesn’t actually exist in the GTF file):
>>> g = db["B0019.1"]
>>> g
<Feature gene (I:12759579-12764949[-]) at 0x...>
>>> print(g)
I gffutils_derived gene 12759579 12764949 . - . gene_id "B0019.1";
And the derived “transcript” by its ID – which now has “_transcript” on the end because of that transform function:
>>> t = db["B0019.1_transcript"]
>>> t
<Feature transcript (I:12759579-12764949[-]) at 0x...>
>>> print(t)
I gffutils_derived transcript 12759579 12764949 . - . gene_id "B0019.1"; transcript_id "B0019.1_transcript";
F3-unique-3.v2.gff
This file does not contain hierarchical information, and has many directives.
Since much of the benefit of gffutils
comes from being able to navigate
the hierarchical relationships, it might be overkill to use gffutils
on
this file. An exeception would be if you want to be able to look up the read
names by their ID. Here, we use yet a differnt way of overriding the way
gffutils
decides on a primary key for the database.
File contents
##solid-gff-version 0.2
##gff-version 2
##source-version MaToGff.java v1.5
##date 2008-05-28
##time 13:11:03
##Type solid_read
##color-code AA=0,AC=1,AG=2,AT=3,CA=1,CC=0,CG=3,CT=2,GA=2,GC=3,GG=0,GT=1,TA=3,TC=2,TG=1,TT=0
##primer-base F3=T
##max-num-mismatches 3
##max-read-length 20
##line-order fragment
##history filter_fasta.pl --noduplicates --output=/data/results/DAEMON/DAEMON_MATE_PAIRS_2_20070326/S1/results.01/primary.20071218094706805 --name=DAEMON_MATE_PAIRS_2_20070326_S1 --tag=F3 --minlength=20 --prefix=T /data/results/DAEMON/DAEMON_MATE_PAIRS_2_20070326/S1/jobs/postPrimerSetPrimary.117/rawseq
##history map /data/results/RegressionDriver/CaseManager/results/r12/integration/case0002/reads1/test_S1_F3.csfasta /data/results/RegressionDriver/CaseManager/knownData/validatedReference/matchingPipeline/ecoli_k12_MG1655.fasta T=30 L=19 C=1 E=.Tmpfile1211939575SVhDtd F=0 B=1 D=1 u=1 r=0 n=1 Z=1000 P="0000000111111111111" M=0 U=0.000000 H=0 > .Tmpfile1211939575SVhDtd.out.1
##history MaToGff.java --sort --qvs=test_S1_F3_QV.qual.txt --convert=unique --clear=3 --tempdir=../tmp test_S1_F3.csfasta.ma.20.3
##hdr seqname source feature start end score strand frame [attributes] [comments]
3_336_815_F3 solid read 55409 55428 10.4 + . g=A3233312322232122211;i=1;p=1.000;q=23,12,18,17,10,24,19,14,27,9,23,9,16,20,11,7,8,4,4,14;u=0,0,0,1
3_142_1011_F3 solid read 91290 91309 5.0 - . g=T0330222333132222222;i=1;p=1.000;q=4,4,14,4,4,4,4,21,4,4,4,4,25,4,4,4,5,21,4,4;u=0,0,0,1
3_341_424_F3 solid read 102717 102736 10.6 - . g=T2203031313223113212;i=1;p=1.000;q=9,27,25,16,18,9,27,26,23,13,14,25,27,5,24,5,26,26,4,5;u=0,0,1
3_6_37_F3 solid read 181053 181072 9.4 + . g=C3220221332111020310;i=1;p=1.000;q=9,5,13,9,10,22,6,12,21,7,13,4,21,16,23,6,20,20,13,6;u=0,0,0,1
3_34_202_F3 solid read 284207 284226 6.9 + . g=G0301333332232122333;i=1;p=1.000;q=6,15,21,8,12,4,4,5,12,8,4,12,4,7,10,6,8,16,4,6;u=0,1
3_277_712_F3 solid read 304136 304155 11.8 - . g=A2033101122223322133;i=1;p=1.000;q=26,11,14,27,4,17,4,26,26,23,17,25,26,27,21,23,5,20,26,23;u=0,1
3_394_71_F3 solid read 308736 308755 10.8 + . g=T3203322323203312331;i=1;p=1.000;q=9,24,19,15,20,18,20,10,13,13,11,21,12,7,4,11,20,24,4,25;u=0,1
3_285_1497_F3 solid read 404055 404074 8.4 - . g=T1221231003202232221;i=1;p=1.000;q=8,10,6,25,16,14,23,27,8,14,21,19,5,4,4,6,22,12,4,6;u=0,0,0,1
3_228_178_F3 solid read 453227 453246 9.5 - . g=G1130333332331110323;i=1;p=1.000;q=4,19,25,18,18,5,19,6,8,24,4,26,21,11,15,4,26,13,13,15;u=0,0,0,1
3_406_794_F3 solid read 504835 504854 8.3 - . g=T3033331301320201111;i=1;p=1.000;q=27,4,13,4,21,11,7,11,5,26,10,8,9,4,6,18,9,26,17,6;u=0,0,0,1
3_303_251_F3 solid read 561501 561520 5.3 + . g=C0011111112222112221;i=1;p=1.000;q=9,8,4,4,10,4,4,4,6,14,4,4,4,4,16,4,4,4,4,23;u=0,0,1
3_152_112_F3 solid read 624012 624031 7.7 - . g=G0301122312213122221;i=1;p=1.000;q=22,14,7,13,18,5,11,4,15,6,6,11,4,8,15,5,10,4,6,24;u=0,0,0,1
3_112_1154_F3 solid read 630582 630601 11.3 - . g=T1333312011131131011;i=1;p=1.000;q=27,27,4,5,17,24,20,19,7,4,25,17,18,15,22,23,17,25,16,26;u=0,0,1
3_196_392_F3 solid read 661664 661683 19.7 - . g=T3321013301122133323;i=1;p=1.000;q=27,25,13,26,21,25,23,27,27,27,27,11,16,27,27,19,26,27,26,27;u=1
3_192_1248_F3 solid read 672037 672056 4.5 - . g=A0333232333121222222;i=1;p=1.000;q=4,7,4,4,4,4,4,4,6,4,4,4,4,4,7,7,4,4,6,4;u=0,0,0,1
3_63_479_F3 solid read 742582 742601 7.9 - . g=A0133333333233232332;i=1;p=1.000;q=4,9,6,11,20,12,11,9,13,20,18,4,4,14,9,15,4,6,21,4;u=0,0,0,1
3_30_710_F3 solid read 816069 816088 9.2 - . g=T3311001223313333313;i=1;p=1.000;q=22,27,18,25,25,7,26,25,14,23,6,25,5,11,7,4,15,7,4,6;u=0,0,0,1
3_284_77_F3 solid read 864876 864895 7.4 + . g=T2003133033233112331;i=1;p=1.000;q=13,19,4,11,22,24,6,16,4,6,13,4,12,18,4,6,7,11,4,5;u=0,0,0,1
3_411_1040_F3 solid read 876023 876042 10.9 - . g=T2121301233200033221;i=1;p=1.000;q=9,9,5,12,11,8,4,16,27,27,18,21,24,9,18,24,21,9,23,17;u=0,0,0,1
3_188_171_F3 solid read 884683 884702 5.8 - . g=A1322330132213322231;i=1;p=1.000;q=4,8,4,5,7,6,5,4,11,6,6,11,4,8,4,8,4,6,4,15;u=0,0,0,1
3_63_787_F3 solid read 1022149 1022168 7.5 + . g=C3131132013020123031;i=1;p=1.000;q=12,13,26,14,9,9,13,14,4,7,8,5,11,4,17,4,4,6,4,21;u=0,1
3_391_2015_F3 solid read 1074989 1075008 18.5 - . g=A2323101222321232322;i=1;p=1.000;q=27,25,18,20,27,27,24,23,27,23,27,25,19,26,12,26,9,21,27,21;u=1
3_8_425_F3 solid read 1119124 1119143 6.7 - . g=T0321201132230303323;i=1;p=1.000;q=6,5,8,6,4,4,23,9,12,10,15,4,13,13,8,4,4,5,5,12;u=0,0,1
3_53_745_F3 solid read 1130179 1130198 7.6 - . g=C0213313233333113321;i=1;p=1.000;q=27,6,9,22,18,9,8,15,6,8,14,5,8,6,16,4,5,4,4,14;u=0,0,0,1
3_123_576_F3 solid read 1219122 1219141 8.7 + . g=A3333133323333323323;i=1;p=1.000;q=18,22,5,11,16,16,8,14,8,5,19,8,9,10,7,11,6,11,9,4;u=0,0,1
3_81_12_F3 solid read 1236732 1236751 8.6 + . g=G2210332302233112321;i=1;p=1.000;q=7,16,17,9,7,9,9,16,9,4,10,21,17,8,4,6,9,16,6,12;u=0,0,0,1
3_96_1862_F3 solid read 1264409 1264428 6.9 - . g=G0301032323231222021;i=1;p=1.000;q=26,23,11,20,15,8,6,4,6,6,9,7,6,4,8,6,4,5,6,5;u=0,0,0,1
3_40_136_F3 solid read 1266177 1266196 7.4 - . g=T2332222332203312221;i=1;p=1.000;q=9,23,6,19,13,9,4,8,17,9,4,4,13,9,8,5,4,6,10,8;u=0,0,1
3_124_1781_F3 solid read 1385416 1385435 10.3 + . g=A1322302333332222132;i=1;p=1.000;q=13,17,8,6,5,9,24,4,7,9,18,27,18,16,16,23,18,18,11,23;u=0,0,1
3_134_1165_F3 solid read 1393169 1393188 9.0 - . g=T3301123202321131311;i=1;p=1.000;q=4,27,18,7,27,4,27,26,4,20,4,27,26,9,27,4,27,14,10,27;u=1
3_224_587_F3 solid read 1490044 1490063 6.1 + . g=G2032313231111233321;i=1;p=1.000;q=4,4,6,6,13,24,4,4,5,15,6,7,9,14,4,4,4,25,5,5;u=0,0,0,1
3_25_747_F3 solid read 1513598 1513617 9.5 + . g=T1223213101133121231;i=1;p=1.000;q=26,27,8,27,27,27,26,27,26,19,8,14,4,17,11,5,7,4,7,6;u=0,0,1
3_143_14_F3 solid read 1528236 1528255 9.7 + . g=T3233113323230202011;i=1;p=1.000;q=13,23,17,19,23,16,24,25,14,15,9,6,4,11,4,9,12,4,16,10;u=0,0,0,1
3_164_1025_F3 solid read 1570107 1570126 7.9 - . g=T3220332323303320231;i=1;p=1.000;q=7,10,20,8,4,24,4,4,21,6,26,22,9,6,11,9,6,4,17,14;u=0,0,0,1
3_137_552_F3 solid read 1630276 1630295 9.1 - . g=G3030333223233102131;i=1;p=1.000;q=6,28,9,4,6,26,27,6,10,9,27,21,6,16,9,25,6,7,23,12;u=0,0,0,1
3_125_1810_F3 solid read 1634104 1634123 10.5 + . g=G1232220322032311332;i=1;p=1.000;q=27,8,26,26,10,6,26,12,27,27,26,4,27,27,23,8,8,4,27,12;u=0,0,0,1
3_314_1310_F3 solid read 1639981 1640000 9.2 + . g=A2221332230322203033;i=1;p=1.000;q=19,12,6,27,11,27,6,11,5,6,9,13,27,27,8,18,5,22,4,27;u=0,0,0,1
3_384_591_F3 solid read 1654341 1654360 6.8 + . g=A3323221133121102313;i=1;p=1.000;q=19,8,7,7,15,4,20,7,4,6,14,7,19,6,8,4,5,9,4,4;u=0,0,0,1
3_145_739_F3 solid read 1791040 1791059 11.9 - . g=A0221223333323131212;i=1;p=1.000;q=20,27,23,13,27,14,27,28,27,25,12,24,8,16,8,4,8,21,9,11;u=0,0,0,1
3_326_2020_F3 solid read 1830564 1830583 9.3 + . g=A3321322331103233322;i=1;p=1.000;q=14,4,25,16,10,12,16,5,14,10,25,5,25,5,9,18,13,26,4,26;u=0,0,0,1
3_233_1265_F3 solid read 1857564 1857583 8.9 + . g=T3112113020130223311;i=1;p=1.000;q=7,27,25,26,27,14,26,27,27,27,4,6,5,10,17,4,5,7,6,12;u=0,0,1
3_235_100_F3 solid read 1912460 1912479 9.6 - . g=G2233020000132311231;i=1;p=1.000;q=23,24,25,16,17,6,21,25,9,4,6,11,8,19,6,6,19,14,13,6;u=0,0,0,1
3_111_107_F3 solid read 1944496 1944515 7.6 - . g=C3023223333211322231;i=1;p=1.000;q=15,5,6,14,5,13,4,12,11,4,9,9,11,12,4,11,11,13,6,6;u=0,0,0,1
3_457_1514_F3 solid read 1956598 1956617 9.9 - . g=T0013331013332110221;i=1;p=1.000;q=18,24,10,24,23,25,22,11,20,10,15,11,4,5,27,4,9,13,5,27;u=0,1
3_183_74_F3 solid read 1992040 1992059 9.8 + . g=C3332233131131222322;i=1;p=1.000;q=27,27,25,23,25,8,11,11,7,11,4,12,14,10,15,7,14,4,9,12;u=0,0,1
3_357_1303_F3 solid read 2037917 2037936 10.9 - . g=T3331331323320311331;i=1;p=1.000;q=7,27,5,19,26,8,27,12,14,27,8,27,23,9,19,4,26,20,9,27;u=0,0,0,1
3_153_186_F3 solid read 2083441 2083460 6.7 + . g=T3112233331133323322;i=1;p=1.000;q=7,14,19,7,12,6,11,4,11,8,4,6,6,4,11,4,6,4,4,18;u=0,1
3_65_1741_F3 solid read 2107441 2107460 8.4 + . g=T3333332330233132123;i=1;p=1.000;q=4,4,6,25,9,4,26,16,21,9,18,15,27,27,4,21,9,7,9,6;u=0,0,0,1
3_98_323_F3 solid read 2118821 2118840 7.5 + . g=A3222212322131112031;i=1;p=1.000;q=13,14,8,10,8,14,4,13,10,7,15,4,6,4,4,12,6,11,6,8;u=0,0,1
3_48_258_F3 solid read 2153882 2153901 9.4 - . g=G0330113313201122321;i=1;p=1.000;q=22,15,20,4,16,17,14,24,4,5,4,22,19,8,10,9,13,22,8,15;u=0,0,0,1
3_140_1125_F3 solid read 2182909 2182928 7.9 + . g=T3231331302232001131;i=1;p=1.000;q=10,4,12,6,4,12,13,6,18,5,8,11,4,26,6,25,5,18,11,12;u=0,0,0,1
3_359_118_F3 solid read 2188393 2188412 8.4 + . g=A0301311133331131322;i=1;p=1.000;q=11,5,7,13,20,6,6,25,8,18,9,15,27,9,6,7,15,17,4,4;u=0,0,0,1
3_203_483_F3 solid read 2272874 2272893 9.1 - . g=C3031223110333133311;i=1;p=1.000;q=23,21,25,27,10,5,22,15,17,18,5,18,17,5,19,4,4,13,4,22;u=0,0,0,1
3_66_301_F3 solid read 2286038 2286057 6.6 - . g=C1113113330132222311;i=1;p=1.000;q=10,4,6,4,8,13,9,4,10,9,4,6,13,9,5,6,11,6,4,9;u=0,0,0,1
3_78_130_F3 solid read 2291021 2291040 7.6 + . g=G3233131332212222321;i=1;p=1.000;q=13,16,6,12,17,11,10,4,12,8,13,4,8,6,4,4,12,10,4,11;u=0,0,0,1
3_141_110_F3 solid read 2291354 2291373 9.3 + . g=T1312203322212123321;i=1;p=1.000;q=9,21,24,11,16,4,23,27,16,16,8,22,6,10,16,4,9,4,7,25;u=0,0,1
3_51_1383_F3 solid read 2374918 2374937 8.8 + . g=T3311203033322222231;i=1;p=1.000;q=24,26,6,27,27,23,27,4,21,27,4,27,6,9,24,4,23,4,4,27;u=0,0,1
3_231_366_F3 solid read 2392091 2392110 10.0 - . g=T2022333223101331322;i=1;p=1.000;q=18,12,9,9,13,8,7,22,7,7,4,26,12,17,9,20,24,8,18,14;u=0,0,0,1
3_214_1802_F3 solid read 2394604 2394623 8.8 - . g=T1232111001220211133;i=1;p=1.000;q=17,18,14,6,19,4,21,4,6,12,11,4,26,20,9,18,7,16,5,18;u=0,0,0,1
3_67_1434_F3 solid read 2454508 2454527 15.2 - . g=T3121311232222231203;i=1;p=1.000;q=9,27,27,18,16,14,25,27,26,21,19,27,27,27,15,5,24,27,24,24;u=0,0,1
3_124_1647_F3 solid read 2493617 2493636 7.5 + . g=A0211320203220231332;i=1;p=1.000;q=9,12,12,9,6,14,12,7,4,4,12,9,4,9,16,4,4,9,9,16;u=0,0,0,1
3_39_328_F3 solid read 2500759 2500778 7.8 + . g=T1332333033231132333;i=1;p=1.000;q=24,27,26,26,25,21,7,8,4,5,20,4,11,6,8,4,6,4,11,7;u=0,0,1
3_378_322_F3 solid read 2541624 2541643 8.9 + . g=T2333331001023011220;i=1;p=1.000;q=14,6,13,25,27,4,24,22,14,19,9,23,15,6,8,4,22,4,4,20;u=0,0,0,1
3_216_848_F3 solid read 2550573 2550592 11.5 - . g=G2320322020031220322;i=1;p=1.000;q=21,24,8,21,20,25,18,6,24,14,21,9,7,18,8,18,7,9,19,12;u=0,0,0,1
3_221_516_F3 solid read 2607559 2607578 11.1 - . g=T2132333313222333332;i=1;p=1.000;q=9,19,27,26,24,26,26,25,25,26,21,4,6,10,21,6,20,13,5,24;u=0,0,0,1
3_56_45_F3 solid read 2662103 2662122 5.5 + . g=G3021122332232122321;i=1;p=1.000;q=4,4,4,6,4,6,4,5,18,9,4,16,10,4,4,4,12,4,6,6;u=0,0,0,1
3_127_210_F3 solid read 2798906 2798925 10.2 + . g=G2331321333232203222;i=1;p=1.000;q=11,25,9,4,23,16,26,14,7,22,9,25,9,8,21,8,15,17,4,26;u=0,0,1
3_417_422_F3 solid read 2812322 2812341 8.8 - . g=T3321222333313333132;i=1;p=1.000;q=9,26,7,19,7,13,23,4,25,4,6,19,4,16,15,15,23,4,19,13;u=0,0,0,1
3_42_1403_F3 solid read 2830264 2830283 9.6 - . g=T3212330132120221212;i=1;p=1.000;q=7,4,25,18,6,17,12,12,17,14,8,26,13,15,10,4,21,5,12,22;u=0,1
3_457_42_F3 solid read 2874245 2874264 7.6 - . g=G0301123332223122221;i=1;p=1.000;q=18,10,14,9,19,4,10,8,11,10,6,8,5,8,11,4,13,6,4,6;u=0,0,1
3_361_728_F3 solid read 2893879 2893898 14.6 + . g=C3213223312310132221;i=1;p=1.000;q=14,18,7,7,17,19,23,24,17,26,12,15,21,23,21,19,17,20,22,24;u=0,0,0,1
3_77_718_F3 solid read 2913092 2913111 9.4 + . g=T3021331333313131231;i=1;p=1.000;q=15,26,7,24,20,18,5,6,17,18,6,11,4,13,19,15,7,4,22,25;u=0,0,0,1
3_116_154_F3 solid read 2917672 2917691 9.8 - . g=A0323231223233132311;i=1;p=1.000;q=20,9,19,18,10,18,8,16,25,6,18,6,12,24,6,7,5,15,7,17;u=0,0,0,1
3_239_1415_F3 solid read 2923256 2923275 19.2 + . g=T3233113121300032200;i=1;p=1.000;q=25,27,27,26,27,24,27,27,25,27,22,27,21,26,22,19,26,9,14,21;u=1
3_142_1468_F3 solid read 2930117 2930136 10.5 - . g=A3233323333303103330;i=1;p=1.000;q=9,20,6,26,16,18,8,13,20,25,25,18,6,12,11,18,4,16,16,6;u=0,0,1
3_394_295_F3 solid read 2930118 2930137 8.1 - . g=T3023333333333311331;i=1;p=1.000;q=4,14,6,12,7,22,10,4,13,24,18,12,12,4,6,9,9,9,14,4;u=0,0,0,1
3_222_1773_F3 solid read 2934040 2934059 11.6 + . g=T1303031311123232302;i=1;p=1.000;q=11,10,24,15,28,6,19,5,13,27,8,26,8,22,25,27,26,27,8,13;u=0,0,0,1
3_276_1344_F3 solid read 2969950 2969969 13.2 - . g=G3211212131233322233;i=1;p=1.000;q=27,27,12,16,11,23,27,8,23,12,27,22,20,12,15,25,8,27,16,6;u=0,1
3_155_1814_F3 solid read 3107393 3107412 13.6 + . g=A2332222213113120221;i=1;p=1.000;q=27,26,20,25,26,27,12,27,26,18,26,4,27,10,23,26,6,23,26,26;u=0,0,0,1
3_373_2014_F3 solid read 3143956 3143975 12.0 - . g=T3013322223222221211;i=1;p=1.000;q=16,8,17,21,10,10,18,18,18,13,4,23,16,24,8,19,14,15,23,11;u=0,1
3_81_1637_F3 solid read 3413619 3413638 9.1 + . g=G2313032322122302111;i=1;p=1.000;q=9,4,7,19,27,6,11,5,12,15,20,27,8,27,6,16,6,27,21,6;u=0,0,1
3_291_969_F3 solid read 3438323 3438342 17.4 + . g=T0021120212032121313;i=1;p=1.000;q=24,27,6,27,27,27,27,13,27,27,25,27,26,27,27,20,23,26,27,20;u=1
3_179_1617_F3 solid read 3475164 3475183 8.0 + . g=A2100132222332123123;i=1;p=1.000;q=21,25,11,22,4,19,7,21,20,4,5,24,25,16,4,4,11,19,4,4;u=0,0,0,1
3_446_861_F3 solid read 3476173 3476192 11.6 - . g=G1213302212022132321;i=1;p=1.000;q=27,27,27,27,26,25,12,27,24,18,24,6,27,26,20,9,6,6,4,23;u=0,0,1
3_397_317_F3 solid read 3545152 3545171 11.1 + . g=T3110031332233111131;i=1;p=1.000;q=22,27,9,9,26,5,22,20,9,10,16,22,24,6,23,25,22,4,17,18;u=0,0,0,1
3_323_713_F3 solid read 3575287 3575306 16.2 - . g=A0322222200213223302;i=1;p=1.000;q=27,25,21,27,26,26,24,26,27,18,27,26,26,27,22,22,6,26,25,8;u=0,1
3_294_1906_F3 solid read 3727542 3727561 8.4 - . g=A3030310223202311021;i=1;p=1.000;q=14,7,5,4,7,18,4,6,13,6,12,12,10,11,15,14,16,7,9,12;u=0,0,0,1
3_443_223_F3 solid read 3730805 3730824 17.1 - . g=T1113320033330133111;i=1;p=1.000;q=28,27,18,27,27,27,20,26,27,14,25,16,19,19,8,23,16,21,16,15;u=0,0,1
3_94_809_F3 solid read 3841898 3841917 21.8 - . g=A2032223110001131310;i=1;p=1.000;q=27,27,27,27,26,27,25,24,27,27,27,25,27,27,27,12,23,16,27,27;u=0,0,0,1
3_245_387_F3 solid read 3878549 3878568 24.4 - . g=A0222211220333132122;i=1;p=1.000;q=27,27,26,27,26,27,27,25,27,25,26,27,18,21,26,25,26,23,24,24;u=1
3_190_1089_F3 solid read 3900038 3900057 13.7 - . g=T1111110323122301202;i=1;p=1.000;q=27,11,27,11,8,9,27,9,9,26,25,27,11,27,23,14,24,20,22,26;u=0,0,1
3_442_1501_F3 solid read 3912610 3912629 8.5 + . g=A0012333103302132301;i=1;p=1.000;q=11,11,15,19,15,6,12,10,4,11,21,5,9,16,7,14,4,4,8,19;u=0,0,1
3_342_678_F3 solid read 4044575 4044594 4.0 + . g=A3333112332213322323;i=1;p=1.000;q=4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4;u=0,0,0,1
3_56_1294_F3 solid read 4058789 4058808 12.7 + . g=G3323331232322213322;i=1;p=1.000;q=26,17,18,27,23,8,8,24,27,27,9,27,25,14,26,4,27,9,24,23;u=0,0,0,1
3_69_1575_F3 solid read 4070467 4070486 9.9 + . g=A2222011012222112121;i=1;p=1.000;q=16,25,14,9,9,9,21,9,4,24,6,21,13,6,27,10,19,8,6,27;u=0,0,0,1
3_198_476_F3 solid read 4080622 4080641 8.9 + . g=C2010231122212011133;i=1;p=1.000;q=16,8,8,16,12,17,4,16,12,15,10,4,9,6,4,25,9,9,23,11;u=0,1
3_24_715_F3 solid read 4136503 4136522 4.0 - . g=G1313332132232313233;i=1;p=1.000;q=4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4;u=0,0,0,1
3_151_283_F3 solid read 4148264 4148283 9.7 + . g=T3230210232022111220;i=1;p=1.000;q=9,14,6,25,25,19,6,4,16,11,12,20,10,13,26,19,6,4,19,14;u=0,0,1
3_164_774_F3 solid read 4156157 4156176 9.6 + . g=G2311112210110223313;i=1;p=1.000;q=8,24,19,7,6,16,12,9,4,8,26,14,26,24,7,18,6,16,14,7;u=0,0,0,1
3_275_1212_F3 solid read 4171385 4171404 8.3 + . g=G0223122231333302232;i=1;p=1.000;q=13,8,5,4,10,7,12,25,4,25,6,15,6,27,6,11,12,7,14,10;u=0,0,0,1
3_148_289_F3 solid read 4177672 4177691 8.0 - . g=T1203101332223323323;i=1;p=1.000;q=9,21,11,6,5,7,25,24,26,24,8,9,7,12,7,4,11,9,4,4;u=0,0,0,1
3_437_1000_F3 solid read 4179623 4179642 12.3 + . g=A0112222212231131001;i=1;p=1.000;q=26,27,26,27,4,27,17,6,22,13,27,24,6,27,21,27,22,15,24,9;u=0,0,1
3_318_2011_F3 solid read 4218181 4218200 12.9 - . g=T2133330223033303323;i=1;p=1.000;q=25,27,27,5,5,16,27,16,27,15,18,25,26,11,27,19,16,24,9,15;u=0,0,0,1
3_14_11_F3 solid read 4222697 4222716 7.8 - . g=T2323310222232322122;i=1;p=1.000;q=6,23,16,25,25,9,7,4,12,4,14,6,10,7,6,9,18,4,10,4;u=0,0,0,1
3_402_391_F3 solid read 4274545 4274564 6.2 - . g=C3303323321111111111;i=1;p=1.000;q=10,19,15,15,7,8,13,4,7,4,5,16,4,4,5,4,9,4,4,4;u=0,0,0,1
3_293_504_F3 solid read 4339235 4339254 9.5 + . g=C2133223303331120213;i=1;p=1.000;q=6,4,5,26,13,7,17,6,24,10,27,24,5,9,21,9,23,24,20,14;u=0,0,0,1
3_360_914_F3 solid read 4407004 4407023 10.7 + . g=T3012102130232022001;i=1;p=1.000;q=23,24,19,17,24,6,26,17,25,15,7,24,14,11,26,9,22,4,8,5;u=0,0,0,1
3_118_1532_F3 solid read 4431702 4431721 10.2 + . g=C3233220201223200322;i=1;p=1.000;q=20,9,17,22,17,23,13,4,9,5,16,11,10,6,17,7,9,22,27,27;u=0,0,1
3_358_133_F3 solid read 4460191 4460210 9.1 + . g=T0221223112322112233;i=1;p=1.000;q=6,23,12,22,7,6,7,4,13,5,9,23,12,9,24,8,14,7,20,26;u=0,0,0,1
3_397_195_F3 solid read 4499390 4499409 6.9 - . g=T3302332313332212121;i=1;p=1.000;q=23,14,15,5,9,8,6,4,4,13,4,16,13,16,4,7,4,12,4,5;u=0,0,0,1
3_158_642_F3 solid read 4533144 4533163 7.1 - . g=A1332103332323233212;i=1;p=1.000;q=8,20,9,22,8,14,4,16,17,4,8,13,7,8,4,12,5,4,4,4;u=0,0,0,1
3_300_1439_F3 solid read 4580452 4580471 12.3 - . g=A0331111211302100201;i=1;p=1.000;q=5,17,21,14,4,16,11,27,21,9,17,17,27,23,12,21,16,27,25,25;u=0,0,0,1
# Elapsed time 0.846 secs
Import
By default, the id_spec="some_string"
parameter tells gffutils
to
create a primary key based on the value of the “some_string” attribute. For
this example, however, we would like to make the first field the primary key.
Luckily, gffutils
will accept the name of a GFF field, surrounded by
“:”. The names of the fields are in gffutils.constants._gffkeys
:
>>> gffutils.constants._gffkeys
['seqid', 'source', 'featuretype', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
We want to use the first field, seqid
, as the primary key for the
database, so we use id_spec=":seqid:"
.
>>> fn = gffutils.example_filename('F3-unique-3.v2.gff')
>>> db = gffutils.create_db(fn, ':memory:', id_spec=[':seqid:'])
>>> db.directives[0]
'solid-gff-version 0.2'
Now we can access the features by their sequence name:
>>> db['3_8_425_F3']
<Feature read (3_8_425_F3:1119124-1119143[-]) at 0x...>
glimmer_nokeyval.gff3
This is a problematic file in which the attributes are not all in format
“key=val”. This example demonstrates how gffutils
handles cases like
this: by implicitly treating them as keys and setting the value to an empty
list.
Like the the ensembl_gtf.txt file, gene and transcript names are not different. Based on IDs alone, it’s not clear whether the CDS’s parent is the gene or the mRNA.
However, the way to fix this is different from ensembl_gtf.txt
. Recall in
that case, we constructed a function to modify transcript IDs, and when
gffutils
inferred the new transcripts from the component exons, it used
that new ID.
Here, we need to do several things in the transform function: if the feature
type contains “RNA”, then set the ID to ID + "_transript"
(this will handle
things like ncRNA, tRNA, etc and is more general than just “mRNA”). Next, we
need to edit the Parent attribute of CDS, exon, etc. How this is handled will
depend completely on the file you are using.
File contents
##gff-version 3
##sequence-region scaffold4215_3 1 6526
scaffold4215_3 glimmer gene 3 62 . - . ID=GL0000006;Name=GL0000006;Lack 3'-end;
scaffold4215_3 glimmer mRNA 3 62 . - . ID=GL0000006;Name=GL0000006;Parent=GL0000006;Lack 3'-end;
scaffold4215_3 glimmer CDS 3 62 2.84 - 0 Parent=GL0000006;Lack 3'-end;
scaffold4215_3 glimmer gene 124 1983 . - . ID=GL0000007;Name=GL0000007;Complete;
Import
>>> def transform(f):
... if f.featuretype == 'gene':
... return f
... elif "RNA" in f.featuretype:
... f.attributes['ID'][0] += '_transcript'
... else:
... # assume that anything else is a child of a transcript, so we need
... # to edit the "Parent" attribute
... if 'Parent' in f.attributes:
... f.attributes['Parent'] \
... = [i + '_transcript' for i in f.attributes['Parent']]
... return f
...
>>> fn = gffutils.example_filename('glimmer_nokeyval.gff3')
>>> db = gffutils.create_db(fn, ":memory:", id_spec="ID", transform=transform)
Access
This shows that keys with missing values are assigned an empty list. It’s up to the calling code to decide how to handle this (say, by checking for the presence of the key “Complete”.
>>> for k, v in sorted(db['GL0000007'].attributes.items()):
... print(k, '=', v)
Complete = []
ID = ['GL0000007']
Name = ['GL0000007']
This illustrates that the CDS finds the proper transcript parent.
>>> for f in db.parents(db['CDS_1'], level=1):
... print(f.id)
GL0000006_transcript
hybrid1.gff3
This file contains a FASTA sequence at the end. Currently, gffutils
assumes that there are no further annotations and stops parsing the file at
this point (specifically, when it sees the string “##FASTA” at the beginning of
a line).
Another issue with this file is that the “Note” attributes field contains url-encoded characters. These are kept verbatim.
File contents
##gff-version 3
##sequence-region foo 1 100
##feature-ontology bar
##attribute-ontology baz
##source-ontology boo
##sequence-region chr17 62467934 62469545
chr17 UCSC mRNA 62467934 62469545 . - . ID=A00469;Dbxref=AFFX-U133:205840_x_at,Locuslink:2688,Genbank-mRNA:A00469,Swissprot:P01241,PFAM:PF00103,AFFX-U95:1332_f_at,Swissprot:SOMA_HUMAN;Note=growth%20hormone%201;Alias=GH1
chr17 UCSC CDS 62468039 62468236 . - 1 Parent=A00469
chr17 UCSC CDS 62468490 62468654 . - 2 Parent=A00469
chr17 UCSC CDS 62468747 62468866 . - 1 Parent=A00469
chr17 UCSC CDS 62469076 62469236 . - 1 Parent=A00469
chr17 UCSC CDS 62469497 62469506 . - 0 Parent=A00469
###
##FASTA
>chr17
GATTACA
GATTACA
Import
Straightforward:
>>> fn = gffutils.example_filename('hybrid1.gff3')
>>> db = gffutils.create_db(fn, ":memory:")
Access
Print out the attributes:
>>> for k, v in sorted(db['A00469'].attributes.items()):
... print(k, '=', v)
Alias = ['GH1']
Dbxref = ['AFFX-U133:205840_x_at', 'Locuslink:2688', 'Genbank-mRNA:A00469', 'Swissprot:P01241', 'PFAM:PF00103', 'AFFX-U95:1332_f_at', 'Swissprot:SOMA_HUMAN']
ID = ['A00469']
Note = ['growth hormone 1']
jgi_gff2.txt
This file is GFF2 format – the attribute format is like GTF, but without the
required “gene_id” and “transcript_id” fields that are used to infer genes and
transcripts. To get around this, we can supply the gtf_transcript_key
kwarg,
which will override the default "transcript_id"
(see GTF files for more
background on this).
There is information in the file on which exon goes to which transcript, but no such information about which CDS goes to which transcript. There is, however, information about the parent protein.
File contents
chr_1 JGI exon 37061 37174 . - . name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1 JGI CDS 37061 37174 . - 0 name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 3
chr_1 JGI exon 37315 37620 . - . name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1 JGI CDS 37315 37620 . - 0 name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 2
chr_1 JGI exon 37752 38216 . - . name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1 JGI CDS 37752 38216 . - 0 name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 1
Import
Inspecting the file, we see that the “name” attribute is repeated on each line. We can use that as the gene ID for inferring the gene extents, like “gene_id” in a normal GTF file. Similarly, the “transcriptId” attribute here can be used like “transcript_id” in a normal GTF file for inferring transcript extents.
We also need to make a choice about how we’re going to use this database. Do we want to have the CDS features be considered children of transcripts? In that case, we’ll need the transform function, which creates a “transcriptId” attribute out of an existing “proteinId” attribute:
>>> def transform(d):
... try:
... d['transcriptId'] = d['proteinId']
... except KeyError:
... pass
... return d
>>> fn = gffutils.example_filename('jgi_gff2.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'transcript': 'transcriptId', 'gene': 'name'},
... gtf_transcript_key='transcriptId', gtf_gene_key='name',
... transform=transform)
Access
Since we used that transform function, the exons and CDSs are all children of the “873” transcript:
>>> sorted(list(db.children('873', level=1)), key=lambda x: (x.featuretype, x.start))
[<Feature CDS (chr_1:37061-37174[-]) at 0x...>, <Feature CDS (chr_1:37315-37620[-]) at 0x...>, <Feature CDS (chr_1:37752-38216[-]) at 0x...>, <Feature exon (chr_1:37061-37174[-]) at 0x...>, <Feature exon (chr_1:37315-37620[-]) at 0x...>, <Feature exon (chr_1:37752-38216[-]) at 0x...>]
Here we can see that all children of the gene are accounted for:
>>> for f in db.children("fgenesh1_pg.C_chr_1000007", order_by='featuretype'):
... print('{0.featuretype:>10}: {0.id}'.format(f))
CDS: CDS_1
CDS: CDS_2
CDS: CDS_3
exon: exon_1
exon: exon_2
exon: exon_3
transcript: 873
ncbi_gff3.txt
This file is problematic because all genes have the same ID attribute. We can get around this by using the “db_xref” field as the key for genes.
File contents
##gff-version 3
##source-version NCBI C++ formatter 0.2
##date 2009-04-25
##Type DNA NC_008596.1
NC_008596.1 RefSeq gene 12272 13301 . + . locus_tag=MSMEG_0013;note=ferric%20enterobactin%20transport%20system%20permease%20protein%20FepG%3B%20this%20gene%20contains%20a%20frame%20shift%20which%20is%20not%20the%20result%20of%20sequencing%20error%3B%20identified%20by%20match%20to%20protein%20family%20HMM%20PF01032;pseudo=;db_xref=GeneID:4537201
NC_008596.1 RefSeq gene 1137579 1138550 . + . ID=NC_008596.1:speB;locus_tag=MSMEG_1072;db_xref=GeneID:4535378
NC_008596.1 RefSeq CDS 1137579 1138547 . + 0 ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1 RefSeq start_codon 1137579 1137581 . + 0 ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1 RefSeq stop_codon 1138548 1138550 . + 0 ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1 RefSeq gene 3597069 3598112 . + . ID=NC_008596.1:speB;locus_tag=MSMEG_3535;db_xref=GeneID:4533678
NC_008596.1 RefSeq CDS 3597069 3598109 . + 0 ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1 RefSeq start_codon 3597069 3597071 . + 0 ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1 RefSeq stop_codon 3598110 3598112 . + 0 ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1 RefSeq gene 4460713 4461672 . - . ID=NC_008596.1:speB;locus_tag=MSMEG_4374;db_xref=GeneID:4535424
NC_008596.1 RefSeq CDS 4460716 4461672 . - 0 ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1 RefSeq start_codon 4461670 4461672 . - 0 ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1 RefSeq stop_codon 4460713 4460715 . - 0 ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1 RefSeq gene 4539385 4540344 . + . ID=NC_008596.1:speB;locus_tag=MSMEG_4459;db_xref=GeneID:4537057
NC_008596.1 RefSeq CDS 4539385 4540341 . + 0 ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1
NC_008596.1 RefSeq start_codon 4539385 4539387 . + 0 ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1
NC_008596.1 RefSeq stop_codon 4540342 4540344 . + 0 ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1
Import
For “gene” featuretypes, we want to use the “db_xref” field as the primary key; all other featuretypes will be incremented versions of the featuretype (“CDS_1”, “CDS_2”, etc).
>>> fn = gffutils.example_filename('ncbi_gff3.txt')
>>> db = gffutils.create_db(fn, ":memory:", id_spec={'gene': 'db_xref'})
Access
Genes can now be accessed by the db_xref:
>>> db['GeneID:4537201']
<Feature gene (NC_008596.1:12272-13301[+]) at 0x...>
wormbase_gff2_alt.txt
This file has no gene_id or transcript_id fields; it appears to use “CDS” as
a gene-level kind of object. So we can use a transform function to add
“gene_id” and “transcript_id” fields to all non-CDS features so that the file
conforms to a GTF standard and gene extents can be inferred. Furthermore, by
default gffutils
uses 'exon'
as the default feature type to merge into
genes. Here, we need to specify gtf_subfeature='coding_exon'
. Last, note
that the file itself has inconsistent trailing semicolons. The dialect
detection decides that there should be no trailing semicolon since that’s what
the majority of the features use.
File contents
Remanei_genome Genomic_canonical region 1 7816 . + . Sequence "Contig1020";
Contig102 WU_MERGED CDS 1629 3377 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED coding_exon 2927 3377 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED coding_exon 2474 2875 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED coding_exon 1928 2430 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED coding_exon 1629 1883 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED intron 2876 2926 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED intron 2431 2473 . - . CDS "cr01.sctg102.wum.2.1"
Contig102 WU_MERGED intron 1884 1927 . - . CDS "cr01.sctg102.wum.2.1"
Import
The transform function to manipulate the attribute dictionary: add “gene_id” and “transcript_id” attributes to intron and coding_exon features:
>>> def transform(f):
... if f.featuretype in ['coding_exon', 'intron']:
... _id = f.attributes['CDS'][0]
... f.attributes['gene_id'] = [_id]
... f.attributes['transcript_id'] = [_id + '_transcript']
... return f
>>> fn = gffutils.example_filename('wormbase_gff2_alt.txt')
>>> db = gffutils.create_db(fn, ":memory:", id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}, transform=transform, gtf_subfeature='coding_exon')
Access
Note that the inferred genes have a source of “gffutils_derived”:
>>> print(db["cr01.sctg102.wum.2.1"])
Contig102 gffutils_derived gene 1629 3377 . - . gene_id "cr01.sctg102.wum.2.1"
Get a report of the childrent of the gene:
>>> for f in db.children("cr01.sctg102.wum.2.1", order_by='featuretype'):
... print('{0.featuretype:>12}: {0.id}'.format(f))
coding_exon: coding_exon_1
coding_exon: coding_exon_2
coding_exon: coding_exon_3
coding_exon: coding_exon_4
intron: intron_1
intron: intron_2
intron: intron_3
transcript: cr01.sctg102.wum.2.1_transcript
wormbase_gff2.txt
This file is problematic because of inconsistent formatting: the SAGE_tag
features have a different attributes format than the rest of the features. So
we need to do some work with the dialect used for parsing attributes into
dictionaries.
File contents
I Genomic_canonical region 1 2679 . + . Sequence "cTel33B" ; Note "Clone cTel33B; Genbank AC199162" ; Note "Clone cTel33B; Genbank AC199162"
I Coding_transcript Transcript 12759582 12764949 . - . Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; CDS "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138"
I Coding_transcript intron 12759829 12759948 . - . Transcript "B0019.1" ; Confirmed_EST EC034652
I Coding_transcript intron 12760014 12760226 . - . Transcript "B0019.1" ; Confirmed_EST EC034652
I Coding_transcript intron 12760320 12760364 . - . Transcript "B0019.1" ; Confirmed_EST yk1054h04.3
I Coding_transcript intron 12760495 12760833 . - . Transcript "B0019.1" ; Confirmed_EST EC027594
I Coding_transcript intron 12760905 12761171 . - . Transcript "B0019.1" ; Confirmed_EST EC027594
I Coding_transcript intron 12761517 12761798 . - . Transcript "B0019.1" ; Confirmed_EST EC027594
I Coding_transcript intron 12761954 12762126 . - . Transcript "B0019.1" ; Confirmed_EST yk262g9.5
I Coding_transcript intron 12762269 12762647 . - . Transcript "B0019.1" ; Confirmed_EST yk262g9.5
I Coding_transcript intron 12762807 12763111 . - . Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I Coding_transcript intron 12763250 12763447 . - . Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I Coding_transcript intron 12763656 12763728 . - . Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I Coding_transcript intron 12763883 12763978 . - . Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I Coding_transcript intron 12764103 12764290 . - . Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I Coding_transcript intron 12764472 12764811 . - . Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I Coding_transcript exon 12759582 12759828 . - . Transcript "B0019.1"
I Coding_transcript exon 12759949 12760013 . - . Transcript "B0019.1"
I Coding_transcript exon 12760227 12760319 . - . Transcript "B0019.1"
I Coding_transcript exon 12760365 12760494 . - . Transcript "B0019.1"
I Coding_transcript exon 12760834 12760904 . - . Transcript "B0019.1"
I Coding_transcript exon 12761172 12761516 . - . Transcript "B0019.1"
I Coding_transcript exon 12761799 12761953 . - . Transcript "B0019.1"
I Coding_transcript exon 12762127 12762268 . - . Transcript "B0019.1"
I Coding_transcript exon 12762648 12762806 . - . Transcript "B0019.1"
I Coding_transcript exon 12763112 12763249 . - . Transcript "B0019.1"
I Coding_transcript exon 12763448 12763655 . - . Transcript "B0019.1"
I Coding_transcript exon 12763729 12763882 . - . Transcript "B0019.1"
I Coding_transcript exon 12763979 12764102 . - . Transcript "B0019.1"
I Coding_transcript exon 12764291 12764471 . - . Transcript "B0019.1"
I Coding_transcript exon 12764812 12764949 . - . Transcript "B0019.1"
I SAGE_tag_unambiguously_mapped SAGE_tag 12761492 12761512 . - . Sequence SAGE:aacggagccgtacacgc;count 5;Gene amx-2;Transcript B0019.1
I SAGE_tag_most_three_prime SAGE_tag 12761499 12761512 . - . Sequence SAGE:aacggagccg;count 9;Gene amx-2;Transcript B0019.1
I mass_spec_genome translated_nucleotide_match 12761920 12761953 . - . Target "Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK" 10 21 ; Note "MSP:FADFSPLDVSDVNFATDDLAK" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "3"
I mass_spec_genome translated_nucleotide_match 12762127 12762155 . - . Target "Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK" 1 10 ; Note "MSP:FADFSPLDVSDVNFATDDLAK" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "3"
I mass_spec_genome translated_nucleotide_match 12763506 12763559 . - . Target "Mass_spec_peptide:MSP:FGHGQSLLAQGGMNEVVR" 1 18 ; Note "MSP:FGHGQSLLAQGGMNEVVR" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "1"
I SAGE_tag_unambiguously_mapped SAGE_tag 12763533 12763553 . - . Sequence SAGE:ggcagagtcttttggca;count 1;Gene amx-2;Transcript B0019.1
I mass_spec_genome translated_nucleotide_match 12764361 12764411 . - . Target "Mass_spec_peptide:MSP:NIQQNRPGLSVLVLEAR" 1 17 ; Note "MSP:NIQQNRPGLSVLVLEAR" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "2"
I GenePair_STS PCR_product 12762449 12764118 . + . PCR_product "sjj_B0019.1"
I Expr_profile experimental_result_region 12762449 12764118 . + . Expr_profile "B0019.1"
I Allele SNP 12764272 12764272 . + . Variation "snp_B0019[1]" ; Interpolated_map_position "14.003" ; ; RFLP "No"
I Promoterome PCR_product 12764938 12766937 . + . PCR_product "p_B0019.1_93"
I Oligo_set reagent 12759745 12761589 . - . Oligo_set "Aff_B0019.1"
I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1
I Coding_transcript three_prime_UTR 12759582 12759744 . - . Transcript "B0019.1"
I Coding_transcript coding_exon 12759745 12759828 . - 0 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12759949 12760013 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12760227 12760319 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12760365 12760494 . - 0 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12760834 12760904 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12761172 12761516 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12761799 12761953 . - 1 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12762127 12762268 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12762648 12762806 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12763112 12763249 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12763448 12763655 . - 0 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12763729 12763882 . - 1 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12763979 12764102 . - 2 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript coding_exon 12764291 12764471 . - 0 Transcript "B0019.1" ; CDS "B0019.1"
I Coding_transcript five_prime_UTR 12764938 12764949 . - . Transcript "B0019.1"
I Coding_transcript coding_exon 12764812 12764937 . - 0 Transcript "B0019.1" ; CDS "B0019.1"
X SAGE_tag SAGE_tag 6819353 6819366 . + . Sequence SAGE:aacggagccg;count 9;Gene amx-2;Transcript B0019.1
X gene processed_transcript 944828 948883 . - . Gene "WBGene00004893"
Import
We will need force_dialect_check=True
in this case if we want to be able to have
the parents of the SAGE_tag
features be recognized, because the attributes
for SAGE_tag
features are formatted differently. That is, they don’t have
quoted values, and the separating semicolon does not have a space on either
side as is the case for the other features.
The default of False
means that only the first few lines are checked to
determine the dialect to be used for all lines. But that won’t work here
because of the inconsistent formatting. So we have to take the more
time-consuming approach of checking every dialect in order to figure out how to
parse them into a dictionary.
Transform function to create appropriate “Parent” attributes:
>>> def transform(f):
... if f.featuretype == 'Transcript':
... f.attributes['Parent'] = f.attributes['Gene']
... else:
... try:
... f.attributes['Parent'] = f.attributes['Transcript']
... except KeyError:
... pass
... return f
>>> fn = gffutils.example_filename('wormbase_gff2.txt')
>>> db = gffutils.create_db(fn,
... ":memory:",
... transform=transform,
... id_spec={'Transcript': "Transcript"},
... force_gff=True,
... force_dialect_check=True,
... keep_order=True)
The dialect for the database will be None:
>>> assert db.dialect is None, db.dialect
For cases like this, we should probably construct our own dialect to force all
attributes to have the same format. To help with this, we can use the
helpers.infer_dialect()
function by providing attributes:
>>> from gffutils import helpers
>>> dialect = helpers.infer_dialect(
... 'Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; CDS "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138"',
... )
>>> print(dialect)
{'leading semicolon': False, 'trailing semicolon': False, 'quoted GFF2 values': True, 'field separator': ' ; ', 'keyval separator': ' ', 'multival separator': ',', 'fmt': 'gtf', 'repeated keys': True, 'order': ['Transcript', 'WormPep', 'Note', 'Prediction_status', 'Gene', 'CDS', 'WormPep', 'Note', 'Prediction_status', 'Gene']}
>>> db.dialect = dialect
Access
>>> t = db["B0019.1"]
Since we’ve set the dialect for the database, any features returned from the database should follow that dialect:
>>> print(t)
I Coding_transcript Transcript 12759582 12764949 . - . Transcript "B0019.1" ; WormPep "WP:CE40797" ; WormPep "WP:CE40797" ; Note "amx-2" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; Gene "WBGene00000138" ; CDS "B0019.1" ; Parent "WBGene00000138" ; Parent "WBGene00000138"
>>> len(list(db.children(t, featuretype='exon')))
15
>>> db['SAGE_tag_1'].attributes['Transcript']
['B0019.1']
>>> len(list(db.children(t, featuretype='SAGE_tag')))
4
gencode-v19.gtf
This example file contains the first gene from the Gencode v19 annotations. It’s in GTF format, but in contrast to an on-spec GTF file, which only has exon, CDS, UTRs, and start/stop codon features, this file already has genes and transcripts on separate lines. This means we can avoid the gene and transcript inference, which saves time.
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 ENSEMBL transcript 11872 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 11872 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 1; exon_id "ENSE00002234632.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 12613 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 2; exon_id "ENSE00003608237.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 13225 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 3; exon_id "ENSE00002306041.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL transcript 11874 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 11874 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002269724.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 12595 12721 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00002270865.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 13403 13655 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002216795.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 ENSEMBL exon 13661 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 4; exon_id "ENSE00002303382.1"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12613 12697 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12975 13052 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13221 13374 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13453 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
Import
We can use disable_infer_genes=True
and disable_infer_transcripts=True
to
disable the gene and transcript inference. This will dramatically speed up
import, but will result in identical results:
>>> fn = gffutils.example_filename('gencode-v19.gtf')
>>> db = gffutils.create_db(fn,
... ":memory:",
... keep_order=True,
... disable_infer_genes=True, disable_infer_transcripts=True)
Access
Access is the same as other files:
>>> db['ENSG00000223972.4']
<Feature gene (chr1:11869-14412[+]) at 0x...>
Ensure that inferring gene extent results in the same database – it just takes longer to import:
>>> db2 = gffutils.create_db(fn,
... ":memory:",
... keep_order=True,
... disable_infer_genes=True, disable_infer_transcripts=True)
>>> for i, j in zip(db.all_features(), db2.all_features()):
... assert i == j