Examples

Often, GTF/GFF files have non-standard formatting. While gffutils is set up to handle standard formatting by default, it also allows great flexibility for working with non-standard files. These examples collect various real-world, difficult-to-work-with files and show how to make them work with gffutils. If you have a troublesome file, the best thing to do is to look at the first few lines for these examples and see if you can find one with matching formatting.

These example files ship with gffutils. For interactive use, their absolute path can be obtained with the gffutils.example_filename() function. Each example shows the contents of the file, along with text explaining the troublesome parts and which settings to use in order to successfully import into a database and retrieve features.

Note

Be mindful of the difference between the ID attribute of a feature and the primary key of that feature in the database.

Feature.id will always return the primary key. Feature["ID"] will return the ID attribute of the feature, if it has one. In many (but not all) cases, Feature.id == Feature["ID"]

Note

Also be mindful of the difference between an attribute of a Python object vs the 9th field of a GFF line, which is called the “attributes” field according to the spec. Confusing? Yes!

mouse_extra_comma.gff3

This file has gene, mRNA, protein, UTR, CDS, and exon annotations. However, not all of them have a useful ID in their attributes.

File contents

chr17	RefSeq	gene	6797760	6818159	.	+	.	ID=NC_000083.5:LOC100040603;Name=NC_000083.5:LOC100040603
chr17	RefSeq	mRNA	6797760	6818159	.	+	.	ID=XM_001475631.1;Parent=NC_000083.5:LOC100040603
chr17	RefSeq	protein	6806527	6812289	.	+	.	ID=;Parent=XM_001475631.1
chr17	RefSeq	five_prime_UTR	6797760	6797769	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	five_prime_UTR	6806513	6806526	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	CDS	6806527	6806553	.	+	0	Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17	RefSeq	CDS	6808204	6808245	.	+	0	Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17	RefSeq	CDS	6811330	6811453	.	+	0	Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17	RefSeq	CDS	6811792	6811869	.	+	2	Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17	RefSeq	CDS	6812219	6812289	.	+	2	Name=CDS:NC_000083.5:LOC100040603;Parent=XM_001475631.1,
chr17	RefSeq	three_prime_UTR	6812290	6818159	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6797760	6797769	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6806513	6806553	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6808204	6808245	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6811330	6811453	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6811792	6811869	.	+	.	Parent=XM_001475631.1
chr17	RefSeq	exon	6812219	6818159	.	+	.	Parent=XM_001475631.1

Import

The defaults for gffutils will load this example into a database. However, the CDSs do not have ID attributes, and the default for GFF files is id_spec="ID". As described at id_spec, they will be accessed by the names "CDS_1", "CDS_2", and so on.

Note that using id_spec=["ID", "Name"] won’t work, because the CDSs share the same Name attribute, CDS:NC_000083.5:LOC100040603. We could use id_spec=["ID", "Name"] in combination with merge_strategy="create_unique", which would give ids like CDS:NC_000083.5:LOC100040603_1 (with an underscore an integer at the end for each consecutive occurence of the same ID). But we’ll go with the simpler default strategy for this file:

>>> import gffutils
>>> fn = gffutils.example_filename('mouse_extra_comma.gff3')
>>> db = gffutils.create_db(fn, ':memory:')

Example access

Since the mRNA on line 2 has an ID, we can simply access it by ID:

>>> db['XM_001475631.1']
<Feature mRNA (chr17:6797760-6818159[+]) at 0x...>

It’s not very convenient to access the CDSs by their newly created ids "CDS_1", "CDS_2", etc. (Which one was “1”? Is that from the same gene as CDS “5”?) But we can access those CDSs as children of the mRNA:

>>> [f.id for f in db.children('XM_001475631.1', featuretype='CDS')]
['CDS_1', 'CDS_2', 'CDS_3', 'CDS_4', 'CDS_5']

Or as “grandchildren” of the gene:

>>> [f.id for f in db.children('NC_000083.5:LOC100040603', featuretype='CDS', level=2)]
['CDS_1', 'CDS_2', 'CDS_3', 'CDS_4', 'CDS_5']

Note that the database id values are all unique for the CDSs, but their "Name" attributes are still all the same, as expected:

>>> list(set([f['Name'][0] for f in db.children('XM_001475631.1', featuretype='CDS')]))
['CDS:NC_000083.5:LOC100040603']

c_elegans_WS199_ann_gff.txt

This is a minimal file, with only one feature and that one feature has no genomic coordinates. It illustrates how things behave when there is very little information for a feature.

File contents

# modified GFF file to remove location coordinates and test annotations
I	Expr_profile	experimental_result_region	.	.	.	+	.	expr_profile=B0019.1

Import

Import is straightforward:

>>> fn = gffutils.example_filename('c_elegans_WS199_ann_gff.txt')
>>> db = gffutils.create_db(fn, ':memory:')

Access

Confirm there’s only a single feature imported:

>>> list(db.all_features())
[<Feature experimental_result_region (I:.-.[+]) at 0x...]

With no obvious ID fields in the attributes, we can access the single feature by name using the feature type:

>>> db['experimental_result_region_1']
<Feature experimental_result_region (I:.-.[+]) at 0x...>

This shows that the internal representation of null values is None:

>>> f = db['experimental_result_region_1']
>>> assert f.start is None
>>> assert f.stop is None

But the string representation shows “.” placeholders:

>>> print(f) 
I       Expr_profile    experimental_result_region      .       .       .       +       .       expr_profile=B0019.1

c_elegans_WS199_shortened_gff.txt

This file contains many different kinds of features; much like mouse_extra_comma.gff3 above, some features have IDs, some do not, and there are many that share the same feature ID. This example shows a common way of subsetting features (here, getting all the non-historical CDSs for a gene).

File contents

I	Orfeome	PCR_product	12759747	12764936	.	-	.	amplified=1;pcr_product=mv_B0019.1
I	SAGE_tag_unambiguously_mapped	SAGE_tag	12763533	12763553	.	-	.	count=1;gene=amx-2;sequence=SAGE:ggcagagtcttttggca;transcript=B0019.1
I	SAGE_tag_unambiguously_mapped	SAGE_tag	12761492	12761512	.	-	.	count=5;gene=amx-2;sequence=SAGE:aacggagccgtacacgc;transcript=B0019.1
I	SAGE_tag_most_three_prime	SAGE_tag	12761499	12761512	.	-	.	count=9;gene=amx-2;sequence=SAGE:aacggagccg;transcript=B0019.1
X	SAGE_tag	SAGE_tag	6819353	6819366	.	+	.	count=9;gene=amx-2;sequence=SAGE:aacggagccg;transcript=B0019.1
I	Expr_profile	experimental_result_region	12762449	12764118	.	+	.	expr_profile=B0019.1
I	Coding_transcript	CDS	12759745	12759828	.	-	0	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12759949	12760013	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12760227	12760319	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12760365	12760494	.	-	0	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12760834	12760904	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12761172	12761516	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12761799	12761953	.	-	1	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12762127	12762268	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12762648	12762806	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12763112	12763249	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12763448	12763655	.	-	0	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12763729	12763882	.	-	1	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12763979	12764102	.	-	2	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12764291	12764471	.	-	0	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	Coding_transcript	CDS	12764812	12764937	.	-	0	ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2;status=Partially_confirmed;wormpep=CE:CE40797
I	history	CDS	12759745	12759828	.	-	0	ID=CDS:B0019.1:wp173
I	history	CDS	12759949	12760013	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12760227	12760319	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12760365	12760494	.	-	0	ID=CDS:B0019.1:wp173
I	history	CDS	12760834	12760904	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12761172	12761516	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12761577	12761626	.	-	1	ID=CDS:B0019.1:wp173
I	history	CDS	12761795	12761953	.	-	1	ID=CDS:B0019.1:wp173
I	history	CDS	12762127	12762268	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12762648	12762806	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12763112	12763249	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12763448	12763655	.	-	0	ID=CDS:B0019.1:wp173
I	history	CDS	12763729	12763882	.	-	1	ID=CDS:B0019.1:wp173
I	history	CDS	12763979	12764102	.	-	2	ID=CDS:B0019.1:wp173
I	history	CDS	12764291	12764471	.	-	0	ID=CDS:B0019.1:wp173
I	history	CDS	12764812	12764937	.	-	0	ID=CDS:B0019.1:wp173
I	history	CDS	12759745	12759828	.	-	0	ID=CDS:B0019.1:wp90
I	history	CDS	12759949	12760013	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12760227	12760319	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12761172	12761516	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12761577	12761626	.	-	1	ID=CDS:B0019.1:wp90
I	history	CDS	12761795	12761953	.	-	1	ID=CDS:B0019.1:wp90
I	history	CDS	12762127	12762268	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12762648	12762806	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12763112	12763249	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12763469	12763655	.	-	0	ID=CDS:B0019.1:wp90
I	history	CDS	12763729	12763882	.	-	1	ID=CDS:B0019.1:wp90
I	history	CDS	12763979	12764102	.	-	2	ID=CDS:B0019.1:wp90
I	history	CDS	12764291	12764471	.	-	0	ID=CDS:B0019.1:wp90
I	history	CDS	12764812	12764937	.	-	0	ID=CDS:B0019.1:wp90
I	mass_spec_genome	translated_nucleotide_match	12761920	12761953	.	-	.	ID=Target:381130;Target=Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK 10 21 +;Note=MSP:FADFSPLDVSDVNFATDDLAK;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=3
I	mass_spec_genome	translated_nucleotide_match	12762127	12762155	.	-	.	ID=Target:381130;Target=Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK 1 10 +;Note=MSP:FADFSPLDVSDVNFATDDLAK;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=3
I	mass_spec_genome	translated_nucleotide_match	12763506	12763559	.	-	.	ID=Target:381133;Target=Mass_spec_peptide:MSP:FGHGQSLLAQGGMNEVVR 1 18 +;Note=MSP:FGHGQSLLAQGGMNEVVR;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=1
I	mass_spec_genome	translated_nucleotide_match	12764361	12764411	.	-	.	ID=Target:381144;Target=Mass_spec_peptide:MSP:NIQQNRPGLSVLVLEAR 1 17 +;Note=MSP:NIQQNRPGLSVLVLEAR;cds_matches=B0019.1;protein_matches=WP:CE40797;times_observed=2
I	Coding_transcript	mRNA	12759582	12764949	.	-	.	ID=Transcript:B0019.1;Note=amx-2;Parent=Gene:WBGene00000138;cds=B0019.1;prediction_status=Partially_confirmed;wormpep=CE:CE40797
I	Allele	SNP	12764272	12764272	.	+	.	interpolated_map_position=14.003;rflp=No;variation=snp_B0019[1]
I	Oligo_set	reagent	12759745	12761589	.	-	.	oligo_set=Aff_B0019.1
I	Coding_transcript	exon	12759745	12759828	.	-	0	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12759949	12760013	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12760227	12760319	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12760365	12760494	.	-	0	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12760834	12760904	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12761172	12761516	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12761799	12761953	.	-	1	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12762127	12762268	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12762648	12762806	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12763112	12763249	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12763448	12763655	.	-	0	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12763729	12763882	.	-	1	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12763979	12764102	.	-	2	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12764291	12764471	.	-	0	Parent=Transcript:B0019.1
I	Coding_transcript	exon	12764812	12764937	.	-	0	Parent=Transcript:B0019.1
I	Coding_transcript	five_prime_UTR	12764938	12764949	.	-	.	Parent=Transcript:B0019.1
I	Coding_transcript	three_prime_UTR	12759582	12759744	.	-	.	Parent=Transcript:B0019.1
I	Coding_transcript	intron	12760495	12760833	.	-	.	Parent=Transcript:B0019.1;confirmed_est=EC027594
I	Coding_transcript	intron	12760905	12761171	.	-	.	Parent=Transcript:B0019.1;confirmed_est=EC027594
I	Coding_transcript	intron	12761517	12761798	.	-	.	Parent=Transcript:B0019.1;confirmed_est=EC027594
I	Coding_transcript	intron	12759829	12759948	.	-	.	Parent=Transcript:B0019.1;confirmed_est=EC034652
I	Coding_transcript	intron	12760014	12760226	.	-	.	Parent=Transcript:B0019.1;confirmed_est=EC034652
I	Coding_transcript	intron	12760320	12760364	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1054h04.3
I	Coding_transcript	intron	12763883	12763978	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I	Coding_transcript	intron	12764103	12764290	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I	Coding_transcript	intron	12764472	12764811	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1054h04.5,OSTF088D9_1
I	Coding_transcript	intron	12762807	12763111	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I	Coding_transcript	intron	12763250	12763447	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I	Coding_transcript	intron	12763656	12763728	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk1056c07.5
I	Coding_transcript	intron	12761954	12762126	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk262g9.5
I	Coding_transcript	intron	12762269	12762647	.	-	.	Parent=Transcript:B0019.1;confirmed_est=yk262g9.5
I	Promoterome	PCR_product	12764938	12766937	.	+	.	pcr_product=p_B0019.1_93
I	GenePair_STS	PCR_product	12762449	12764118	.	+	.	pcr_product=sjj_B0019.1
I	Coding_transcript	gene	12759582	12764949	.	-	.	ID=Gene:WBGene00000138
III	Orfeome	PCR_product	13780230	13780850	.	+	.	amplified=1;pcr_product=mv_3R5.1.v6
IV	Orfeome	PCR_product	17486939	17488952	.	-	.	amplified=1;pcr_product=mv_4R79.1
IV	Orfeome	PCR_product	17480353	17483284	.	-	.	amplified=1;pcr_product=mv_4R79.2
X	Orfeome	PCR_product	17714881	17718531	.	+	.	amplified=1;pcr_product=mv_6R55.1
X	Orfeome	PCR_product	17712787	17714742	.	+	.	amplified=1;pcr_product=mv_6R55.2
II	Orfeome	PCR_product	6995874	7010146	.	+	.	amplified=1;pcr_product=mv_AAA03517
III	Orfeome	PCR_product	5625097	5631795	.	+	.	amplified=1;pcr_product=mv_AAA03544
X	GenePair_STS	PCR_product	9962853	9963737	.	+	.	pcr_product=cenix:102-c3
II	GenePair_STS	PCR_product	5507236	5508135	.	+	.	pcr_product=cenix:102-c4
V	GenePair_STS	PCR_product	10117842	10118735	.	+	.	pcr_product=cenix:102-c5
IV	GenePair_STS	PCR_product	3566130	3567025	.	+	.	pcr_product=cenix:102-c6
X	GenePair_STS	PCR_product	6117180	6117930	.	+	.	pcr_product=cenix:102-c7
IV	GenePair_STS	PCR_product	7189492	7190369	.	+	.	pcr_product=cenix:102-c9
II	GenePair_STS	PCR_product	14462527	14463202	.	+	.	pcr_product=cenix:102-d1
X	Promoterome	PCR_product	2258069	2259336	.	+	.	pcr_product=p_AH9.2_93
IV	Promoterome	PCR_product	12157449	12159448	.	+	.	pcr_product=p_B0001.6_93
I	Promoterome	PCR_product	12764938	12766937	.	+	.	pcr_product=p_B0019.1_93
V	Promoterome	PCR_product	10320122	10320689	.	+	.	pcr_product=p_B0024.12_93
I	Coding_transcript	CDS	4581214	4581237	.	-	0	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	CDS	4581664	4582026	.	-	0	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	CDS	4582412	4582718	.	-	1	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	CDS	4583190	4583374	.	-	0	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	CDS	4583426	4583509	.	-	0	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	CDS	4583560	4583805	.	-	0	ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	mRNA	4580734	4583815	.	-	.	ID=Transcript:D1007.5b.1;Parent=Gene:WBGene00017003;cds=D1007.5b;prediction_status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	mRNA	4581214	4583811	.	-	.	ID=Transcript:D1007.5b.2;Parent=Gene:WBGene00017003;cds=D1007.5b;prediction_status=Confirmed;wormpep=WP:CE33577
I	Coding_transcript	exon	4581214	4581237	.	-	0	Parent=Transcript:D1007.5b.1
I	Coding_transcript	exon	4581664	4582026	.	-	0	Parent=Transcript:D1007.5b.1
I	Coding_transcript	exon	4582412	4582718	.	-	1	Parent=Transcript:D1007.5b.1
I	Coding_transcript	exon	4583190	4583374	.	-	0	Parent=Transcript:D1007.5b.1
I	Coding_transcript	exon	4583426	4583509	.	-	0	Parent=Transcript:D1007.5b.1
I	Coding_transcript	exon	4583560	4583805	.	-	0	Parent=Transcript:D1007.5b.1
I	Coding_transcript	five_prime_UTR	4583806	4583815	.	-	.	Parent=Transcript:D1007.5b.1
I	Coding_transcript	three_prime_UTR	4580734	4581213	.	-	.	Parent=Transcript:D1007.5b.1
I	Coding_transcript	intron	4582027	4582411	.	-	.	Parent=Transcript:D1007.5b.1;confirmed_est=EB994038
I	Coding_transcript	intron	4583375	4583425	.	-	.	Parent=Transcript:D1007.5b.1;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4583510	4583559	.	-	.	Parent=Transcript:D1007.5b.1;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4582719	4583189	.	-	.	Parent=Transcript:D1007.5b.1;confirmed_est=yk1055g06.5,OSTF085G5_1
I	Coding_transcript	intron	4581238	4581663	.	-	.	Parent=Transcript:D1007.5b.1;confirmed_est=yk1057e08.3
I	Coding_transcript	exon	4581214	4581237	.	-	0	Parent=Transcript:D1007.5b.2
I	Coding_transcript	exon	4581664	4582026	.	-	0	Parent=Transcript:D1007.5b.2
I	Coding_transcript	exon	4582412	4582718	.	-	1	Parent=Transcript:D1007.5b.2
I	Coding_transcript	exon	4583190	4583374	.	-	0	Parent=Transcript:D1007.5b.2
I	Coding_transcript	exon	4583426	4583509	.	-	0	Parent=Transcript:D1007.5b.2
I	Coding_transcript	exon	4583560	4583805	.	-	0	Parent=Transcript:D1007.5b.2
I	Coding_transcript	five_prime_UTR	4583806	4583811	.	-	.	Parent=Transcript:D1007.5b.2
I	Coding_transcript	intron	4582027	4582411	.	-	.	Parent=Transcript:D1007.5b.2;confirmed_est=EB994038
I	Coding_transcript	intron	4583375	4583425	.	-	.	Parent=Transcript:D1007.5b.2;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4583510	4583559	.	-	.	Parent=Transcript:D1007.5b.2;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4582719	4583189	.	-	.	Parent=Transcript:D1007.5b.2;confirmed_est=yk1055g06.5,OSTF085G5_1
I	Coding_transcript	intron	4581238	4581663	.	-	.	Parent=Transcript:D1007.5b.2;confirmed_est=yk1057e08.3
I	Coding_transcript	gene	4580693	4583815	.	-	.	ID=Gene:WBGene00017003
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4581093	4581113	.	-	.	count=10;gene=D1007.5;sequence=SAGE:tttgcgaattacttgct;transcript=D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4580748	4580768	.	-	.	count=112;gene=D1007.5;sequence=SAGE:ttttccattaattttga;transcript=D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4582415	4582428	.	-	.	count=1;gene=D1007.5;sequence=SAGE:cattttcgtg;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4580914	4580927	.	-	.	count=1;gene=D1007.5;sequence=SAGE:taaatttcaa;transcript=D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4581193	4581206	.	-	.	count=1;gene=D1007.5;sequence=SAGE:tgctcgttcg;transcript=D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4583465	4583478	.	-	.	count=1;gene=D1007.5;sequence=SAGE:tgttggcctt;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4583458	4583478	.	-	.	count=1;gene=D1007.5;sequence=SAGE:tgttggccttttacttg;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4582533	4582553	.	-	.	count=2;gene=D1007.5;sequence=SAGE:tgcagtgatagtccagc;transcript=D1007.5b.2,D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4581100	4581113	.	-	.	count=2;gene=D1007.5;sequence=SAGE:tttgcgaatt;transcript=D1007.5b.1,D1007.5a
I	SAGE_tag_unambiguously_mapped	SAGE_tag	4580755	4580768	.	-	.	count=43;gene=D1007.5;sequence=SAGE:ttttccatta;transcript=D1007.5b.1,D1007.5a
I	Coding_transcript	CDS	4580993	4581241	.	-	0	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	CDS	4581664	4582026	.	-	0	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	CDS	4582412	4582718	.	-	1	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	CDS	4583190	4583374	.	-	0	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	CDS	4583426	4583509	.	-	0	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	CDS	4583560	4583805	.	-	0	ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I	mass_spec_genome	translated_nucleotide_match	4580996	4581052	.	-	.	ID=Target:277116;Target=Mass_spec_peptide:MSP:IYEPSQEDLLLMHQLQQER 1 19 +;Note=MSP:IYEPSQEDLLLMHQLQQER;cds_matches=D1007.5a;protein_matches=WP:CE29034;times_observed=1
I	mass_spec_genome	translated_nucleotide_match	4581838	4581882	.	-	.	ID=Target:277138;Target=Mass_spec_peptide:MSP:AAIHLGSWHQIEGPR 1 15 +;Note=MSP:AAIHLGSWHQIEGPR;cds_matches=D1007.5b D1007.5a;protein_matches=WP:CE33577 WP:CE29034;times_observed=1
I	mass_spec_genome	translated_nucleotide_match	4583581	4583601	.	-	.	ID=Target:277176;Target=Mass_spec_peptide:MSP:TLWWLPK 1 7 +;Note=MSP:TLWWLPK;cds_matches=D1007.5b D1007.5a;protein_matches=WP:CE33577 WP:CE29034;times_observed=1
I	Coding_transcript	mRNA	4580693	4583811	.	-	.	ID=Transcript:D1007.5a;Parent=Gene:WBGene00017003;cds=D1007.5a;prediction_status=Confirmed;wormpep=CE:CE29034
I	Coding_transcript	exon	4580993	4581241	.	-	0	Parent=Transcript:D1007.5a
I	Coding_transcript	exon	4581664	4582026	.	-	0	Parent=Transcript:D1007.5a
I	Coding_transcript	exon	4582412	4582718	.	-	1	Parent=Transcript:D1007.5a
I	Coding_transcript	exon	4583190	4583374	.	-	0	Parent=Transcript:D1007.5a
I	Coding_transcript	exon	4583426	4583509	.	-	0	Parent=Transcript:D1007.5a
I	Coding_transcript	exon	4583560	4583805	.	-	0	Parent=Transcript:D1007.5a
I	Coding_transcript	five_prime_UTR	4583806	4583811	.	-	.	Parent=Transcript:D1007.5a
I	Coding_transcript	three_prime_UTR	4580693	4580992	.	-	.	Parent=Transcript:D1007.5a
I	Coding_transcript	intron	4582027	4582411	.	-	.	Parent=Transcript:D1007.5a;confirmed_est=EB994038
I	Coding_transcript	intron	4581242	4581663	.	-	.	Parent=Transcript:D1007.5a;confirmed_est=EB994038,OSTR085G5_1
I	Coding_transcript	intron	4583375	4583425	.	-	.	Parent=Transcript:D1007.5a;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4583510	4583559	.	-	.	Parent=Transcript:D1007.5a;confirmed_est=EC038345,OSTF085G5_1
I	Coding_transcript	intron	4582719	4583189	.	-	.	Parent=Transcript:D1007.5a;confirmed_est=yk1055g06.5,OSTF085G5_1

Import

If merge_strategy="merge", is used here, exceptions will be raised because all those CDSs cannot be logically merged: they have different coordinates. So merge_strategy="create_unique" is used instead. This kwarg will add an underscore and an integer to each duplicate value used as a primary key:

>>> fn = gffutils.example_filename('c_elegans_WS199_shortened_gff.txt')
>>> db = gffutils.create_db(fn, ":memory:", merge_strategy='create_unique', keep_order=True)

Access

Getting a gene:

>>> g = db['Gene:WBGene00017003']
>>> g
<Feature gene (I:4580693-4583815[-]) at 0x...>

Get all “non-historical” CDSs for this gene. This illustrates that:

you can use a Feature object instead of a string for the database primary key
printing a Feature reconstructs the line, and
the string representation of a line does not include a \n.

>>> for i in db.children(g, featuretype='CDS', order_by='start'):
...     if i.source != 'history':
...         print(i)  
I       Coding_transcript       CDS     4580993 4581241 .       -       0       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4581214 4581237 .       -       0       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I       Coding_transcript       CDS     4581664 4582026 .       -       0       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4581664 4582026 .       -       0       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I       Coding_transcript       CDS     4582412 4582718 .       -       1       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4582412 4582718 .       -       1       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I       Coding_transcript       CDS     4583190 4583374 .       -       0       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4583190 4583374 .       -       0       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I       Coding_transcript       CDS     4583426 4583509 .       -       0       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4583426 4583509 .       -       0       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577
I       Coding_transcript       CDS     4583560 4583805 .       -       0       ID=CDS:D1007.5a;Parent=Transcript:D1007.5a;status=Confirmed;wormpep=CE:CE29034
I       Coding_transcript       CDS     4583560 4583805 .       -       0       ID=CDS:D1007.5b;Parent=Transcript:D1007.5b.2,Transcript:D1007.5b.1;status=Confirmed;wormpep=WP:CE33577

ensembl_gtf.txt

In this GTF file, gene_id and transcript_id are identical, creating problems for unique IDs in the database.

File contents

I	snoRNA	exon	3747	3909	.	-	.	 gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; gene_name "Y74C9A.6"; transcript_name "NR_001477.2";
I	protein_coding	exon	12764812	12764949	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12764812	12764937	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	start_codon	12764935	12764937	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "1"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	exon	12764291	12764471	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "2"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12764291	12764471	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "2"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12763979	12764102	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "3"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12763979	12764102	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "3"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12763729	12763882	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "4"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12763729	12763882	.	-	1	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "4"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12763448	12763655	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "5"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12763448	12763655	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "5"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12763112	12763249	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "6"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12763112	12763249	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "6"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12762648	12762806	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "7"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12762648	12762806	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "7"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12762127	12762268	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "8"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12762127	12762268	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "8"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12761799	12761953	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "9"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12761799	12761953	.	-	1	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "9"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12761172	12761516	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "10"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12761172	12761516	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "10"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12760834	12760904	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "11"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12760834	12760904	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "11"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12760365	12760494	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "12"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12760365	12760494	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "12"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12760227	12760319	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "13"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12760227	12760319	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "13"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12759949	12760013	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "14"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12759949	12760013	.	-	2	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "14"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	exon	12759579	12759828	.	-	.	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1";
I	protein_coding	CDS	12759748	12759828	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1"; protein_id "B0019.1";
I	protein_coding	stop_codon	12759745	12759747	.	-	0	 gene_id "B0019.1"; transcript_id "B0019.1"; exon_number "15"; gene_name "amx-2"; transcript_name "B0019.1";

Import

As part of the import process, gffutils creates new “derived” features for genes and transcripts (see GTF files). These are not explicitly defined in a GTF file, but can be inferred. Even though gene and transcript features do not exist in the file, they will still be created. This is why id_spec={'gene': 'gene_id'} works: the primary key of “gene” features – which don’t exist in the GTF file but are inferred by gffutils – will be the “gene_id” attribute.

Note that in this example, gene_id and transcript_id are identical. This causes all sorts of problems. To get around this we can write a transform function (see transform for details) that applies an arbitrary transformation to a dictionary returned by the parser. This will modify the data upon import into the database:

>>> def transform_func(x):
...     # adds some text to the end of transcript IDs
...     if 'transcript_id' in x.attributes:
...         x.attributes['transcript_id'][0] += '_transcript'
...     return x

Now we can supply this tranform function to create_db():

>>> fn = gffutils.example_filename('ensembl_gtf.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'gene': 'gene_id', 'transcript': "transcript_id"},
... merge_strategy="create_unique",
... transform=transform_func,
... keep_order=True)

Access

We can access the derived “gene” feature by its ID (recall it doesn’t actually exist in the GTF file):

>>> g = db["B0019.1"]
>>> g
<Feature gene (I:12759579-12764949[-]) at 0x...>

>>> print(g)  
I       gffutils_derived        gene    12759579        12764949        .       -       .       gene_id "B0019.1";

And the derived “transcript” by its ID – which now has “_transcript” on the end because of that transform function:

>>> t = db["B0019.1_transcript"]
>>> t
<Feature transcript (I:12759579-12764949[-]) at 0x...>

>>> print(t)  
I       gffutils_derived        transcript      12759579        12764949        .       -       .       gene_id "B0019.1"; transcript_id "B0019.1_transcript";

F3-unique-3.v2.gff

This file does not contain hierarchical information, and has many directives. Since much of the benefit of gffutils comes from being able to navigate the hierarchical relationships, it might be overkill to use gffutils on this file. An exeception would be if you want to be able to look up the read names by their ID. Here, we use yet a differnt way of overriding the way gffutils decides on a primary key for the database.

File contents

##solid-gff-version 0.2
##gff-version 2
##source-version MaToGff.java v1.5
##date 2008-05-28
##time 13:11:03
##Type solid_read
##color-code AA=0,AC=1,AG=2,AT=3,CA=1,CC=0,CG=3,CT=2,GA=2,GC=3,GG=0,GT=1,TA=3,TC=2,TG=1,TT=0
##primer-base F3=T
##max-num-mismatches 3
##max-read-length 20
##line-order fragment
##history filter_fasta.pl --noduplicates --output=/data/results/DAEMON/DAEMON_MATE_PAIRS_2_20070326/S1/results.01/primary.20071218094706805 --name=DAEMON_MATE_PAIRS_2_20070326_S1 --tag=F3 --minlength=20 --prefix=T /data/results/DAEMON/DAEMON_MATE_PAIRS_2_20070326/S1/jobs/postPrimerSetPrimary.117/rawseq
##history map /data/results/RegressionDriver/CaseManager/results/r12/integration/case0002/reads1/test_S1_F3.csfasta /data/results/RegressionDriver/CaseManager/knownData/validatedReference/matchingPipeline/ecoli_k12_MG1655.fasta T=30 L=19 C=1 E=.Tmpfile1211939575SVhDtd F=0 B=1 D=1 u=1 r=0 n=1 Z=1000 P="0000000111111111111" M=0 U=0.000000 H=0  > .Tmpfile1211939575SVhDtd.out.1
##history MaToGff.java --sort --qvs=test_S1_F3_QV.qual.txt --convert=unique --clear=3 --tempdir=../tmp test_S1_F3.csfasta.ma.20.3
##hdr seqname	source	feature	start	end	score	strand	frame	[attributes]	[comments]
3_336_815_F3	solid	read	55409	55428	10.4	+	.	g=A3233312322232122211;i=1;p=1.000;q=23,12,18,17,10,24,19,14,27,9,23,9,16,20,11,7,8,4,4,14;u=0,0,0,1
3_142_1011_F3	solid	read	91290	91309	5.0	-	.	g=T0330222333132222222;i=1;p=1.000;q=4,4,14,4,4,4,4,21,4,4,4,4,25,4,4,4,5,21,4,4;u=0,0,0,1
3_341_424_F3	solid	read	102717	102736	10.6	-	.	g=T2203031313223113212;i=1;p=1.000;q=9,27,25,16,18,9,27,26,23,13,14,25,27,5,24,5,26,26,4,5;u=0,0,1
3_6_37_F3	solid	read	181053	181072	9.4	+	.	g=C3220221332111020310;i=1;p=1.000;q=9,5,13,9,10,22,6,12,21,7,13,4,21,16,23,6,20,20,13,6;u=0,0,0,1
3_34_202_F3	solid	read	284207	284226	6.9	+	.	g=G0301333332232122333;i=1;p=1.000;q=6,15,21,8,12,4,4,5,12,8,4,12,4,7,10,6,8,16,4,6;u=0,1
3_277_712_F3	solid	read	304136	304155	11.8	-	.	g=A2033101122223322133;i=1;p=1.000;q=26,11,14,27,4,17,4,26,26,23,17,25,26,27,21,23,5,20,26,23;u=0,1
3_394_71_F3	solid	read	308736	308755	10.8	+	.	g=T3203322323203312331;i=1;p=1.000;q=9,24,19,15,20,18,20,10,13,13,11,21,12,7,4,11,20,24,4,25;u=0,1
3_285_1497_F3	solid	read	404055	404074	8.4	-	.	g=T1221231003202232221;i=1;p=1.000;q=8,10,6,25,16,14,23,27,8,14,21,19,5,4,4,6,22,12,4,6;u=0,0,0,1
3_228_178_F3	solid	read	453227	453246	9.5	-	.	g=G1130333332331110323;i=1;p=1.000;q=4,19,25,18,18,5,19,6,8,24,4,26,21,11,15,4,26,13,13,15;u=0,0,0,1
3_406_794_F3	solid	read	504835	504854	8.3	-	.	g=T3033331301320201111;i=1;p=1.000;q=27,4,13,4,21,11,7,11,5,26,10,8,9,4,6,18,9,26,17,6;u=0,0,0,1
3_303_251_F3	solid	read	561501	561520	5.3	+	.	g=C0011111112222112221;i=1;p=1.000;q=9,8,4,4,10,4,4,4,6,14,4,4,4,4,16,4,4,4,4,23;u=0,0,1
3_152_112_F3	solid	read	624012	624031	7.7	-	.	g=G0301122312213122221;i=1;p=1.000;q=22,14,7,13,18,5,11,4,15,6,6,11,4,8,15,5,10,4,6,24;u=0,0,0,1
3_112_1154_F3	solid	read	630582	630601	11.3	-	.	g=T1333312011131131011;i=1;p=1.000;q=27,27,4,5,17,24,20,19,7,4,25,17,18,15,22,23,17,25,16,26;u=0,0,1
3_196_392_F3	solid	read	661664	661683	19.7	-	.	g=T3321013301122133323;i=1;p=1.000;q=27,25,13,26,21,25,23,27,27,27,27,11,16,27,27,19,26,27,26,27;u=1
3_192_1248_F3	solid	read	672037	672056	4.5	-	.	g=A0333232333121222222;i=1;p=1.000;q=4,7,4,4,4,4,4,4,6,4,4,4,4,4,7,7,4,4,6,4;u=0,0,0,1
3_63_479_F3	solid	read	742582	742601	7.9	-	.	g=A0133333333233232332;i=1;p=1.000;q=4,9,6,11,20,12,11,9,13,20,18,4,4,14,9,15,4,6,21,4;u=0,0,0,1
3_30_710_F3	solid	read	816069	816088	9.2	-	.	g=T3311001223313333313;i=1;p=1.000;q=22,27,18,25,25,7,26,25,14,23,6,25,5,11,7,4,15,7,4,6;u=0,0,0,1
3_284_77_F3	solid	read	864876	864895	7.4	+	.	g=T2003133033233112331;i=1;p=1.000;q=13,19,4,11,22,24,6,16,4,6,13,4,12,18,4,6,7,11,4,5;u=0,0,0,1
3_411_1040_F3	solid	read	876023	876042	10.9	-	.	g=T2121301233200033221;i=1;p=1.000;q=9,9,5,12,11,8,4,16,27,27,18,21,24,9,18,24,21,9,23,17;u=0,0,0,1
3_188_171_F3	solid	read	884683	884702	5.8	-	.	g=A1322330132213322231;i=1;p=1.000;q=4,8,4,5,7,6,5,4,11,6,6,11,4,8,4,8,4,6,4,15;u=0,0,0,1
3_63_787_F3	solid	read	1022149	1022168	7.5	+	.	g=C3131132013020123031;i=1;p=1.000;q=12,13,26,14,9,9,13,14,4,7,8,5,11,4,17,4,4,6,4,21;u=0,1
3_391_2015_F3	solid	read	1074989	1075008	18.5	-	.	g=A2323101222321232322;i=1;p=1.000;q=27,25,18,20,27,27,24,23,27,23,27,25,19,26,12,26,9,21,27,21;u=1
3_8_425_F3	solid	read	1119124	1119143	6.7	-	.	g=T0321201132230303323;i=1;p=1.000;q=6,5,8,6,4,4,23,9,12,10,15,4,13,13,8,4,4,5,5,12;u=0,0,1
3_53_745_F3	solid	read	1130179	1130198	7.6	-	.	g=C0213313233333113321;i=1;p=1.000;q=27,6,9,22,18,9,8,15,6,8,14,5,8,6,16,4,5,4,4,14;u=0,0,0,1
3_123_576_F3	solid	read	1219122	1219141	8.7	+	.	g=A3333133323333323323;i=1;p=1.000;q=18,22,5,11,16,16,8,14,8,5,19,8,9,10,7,11,6,11,9,4;u=0,0,1
3_81_12_F3	solid	read	1236732	1236751	8.6	+	.	g=G2210332302233112321;i=1;p=1.000;q=7,16,17,9,7,9,9,16,9,4,10,21,17,8,4,6,9,16,6,12;u=0,0,0,1
3_96_1862_F3	solid	read	1264409	1264428	6.9	-	.	g=G0301032323231222021;i=1;p=1.000;q=26,23,11,20,15,8,6,4,6,6,9,7,6,4,8,6,4,5,6,5;u=0,0,0,1
3_40_136_F3	solid	read	1266177	1266196	7.4	-	.	g=T2332222332203312221;i=1;p=1.000;q=9,23,6,19,13,9,4,8,17,9,4,4,13,9,8,5,4,6,10,8;u=0,0,1
3_124_1781_F3	solid	read	1385416	1385435	10.3	+	.	g=A1322302333332222132;i=1;p=1.000;q=13,17,8,6,5,9,24,4,7,9,18,27,18,16,16,23,18,18,11,23;u=0,0,1
3_134_1165_F3	solid	read	1393169	1393188	9.0	-	.	g=T3301123202321131311;i=1;p=1.000;q=4,27,18,7,27,4,27,26,4,20,4,27,26,9,27,4,27,14,10,27;u=1
3_224_587_F3	solid	read	1490044	1490063	6.1	+	.	g=G2032313231111233321;i=1;p=1.000;q=4,4,6,6,13,24,4,4,5,15,6,7,9,14,4,4,4,25,5,5;u=0,0,0,1
3_25_747_F3	solid	read	1513598	1513617	9.5	+	.	g=T1223213101133121231;i=1;p=1.000;q=26,27,8,27,27,27,26,27,26,19,8,14,4,17,11,5,7,4,7,6;u=0,0,1
3_143_14_F3	solid	read	1528236	1528255	9.7	+	.	g=T3233113323230202011;i=1;p=1.000;q=13,23,17,19,23,16,24,25,14,15,9,6,4,11,4,9,12,4,16,10;u=0,0,0,1
3_164_1025_F3	solid	read	1570107	1570126	7.9	-	.	g=T3220332323303320231;i=1;p=1.000;q=7,10,20,8,4,24,4,4,21,6,26,22,9,6,11,9,6,4,17,14;u=0,0,0,1
3_137_552_F3	solid	read	1630276	1630295	9.1	-	.	g=G3030333223233102131;i=1;p=1.000;q=6,28,9,4,6,26,27,6,10,9,27,21,6,16,9,25,6,7,23,12;u=0,0,0,1
3_125_1810_F3	solid	read	1634104	1634123	10.5	+	.	g=G1232220322032311332;i=1;p=1.000;q=27,8,26,26,10,6,26,12,27,27,26,4,27,27,23,8,8,4,27,12;u=0,0,0,1
3_314_1310_F3	solid	read	1639981	1640000	9.2	+	.	g=A2221332230322203033;i=1;p=1.000;q=19,12,6,27,11,27,6,11,5,6,9,13,27,27,8,18,5,22,4,27;u=0,0,0,1
3_384_591_F3	solid	read	1654341	1654360	6.8	+	.	g=A3323221133121102313;i=1;p=1.000;q=19,8,7,7,15,4,20,7,4,6,14,7,19,6,8,4,5,9,4,4;u=0,0,0,1
3_145_739_F3	solid	read	1791040	1791059	11.9	-	.	g=A0221223333323131212;i=1;p=1.000;q=20,27,23,13,27,14,27,28,27,25,12,24,8,16,8,4,8,21,9,11;u=0,0,0,1
3_326_2020_F3	solid	read	1830564	1830583	9.3	+	.	g=A3321322331103233322;i=1;p=1.000;q=14,4,25,16,10,12,16,5,14,10,25,5,25,5,9,18,13,26,4,26;u=0,0,0,1
3_233_1265_F3	solid	read	1857564	1857583	8.9	+	.	g=T3112113020130223311;i=1;p=1.000;q=7,27,25,26,27,14,26,27,27,27,4,6,5,10,17,4,5,7,6,12;u=0,0,1
3_235_100_F3	solid	read	1912460	1912479	9.6	-	.	g=G2233020000132311231;i=1;p=1.000;q=23,24,25,16,17,6,21,25,9,4,6,11,8,19,6,6,19,14,13,6;u=0,0,0,1
3_111_107_F3	solid	read	1944496	1944515	7.6	-	.	g=C3023223333211322231;i=1;p=1.000;q=15,5,6,14,5,13,4,12,11,4,9,9,11,12,4,11,11,13,6,6;u=0,0,0,1
3_457_1514_F3	solid	read	1956598	1956617	9.9	-	.	g=T0013331013332110221;i=1;p=1.000;q=18,24,10,24,23,25,22,11,20,10,15,11,4,5,27,4,9,13,5,27;u=0,1
3_183_74_F3	solid	read	1992040	1992059	9.8	+	.	g=C3332233131131222322;i=1;p=1.000;q=27,27,25,23,25,8,11,11,7,11,4,12,14,10,15,7,14,4,9,12;u=0,0,1
3_357_1303_F3	solid	read	2037917	2037936	10.9	-	.	g=T3331331323320311331;i=1;p=1.000;q=7,27,5,19,26,8,27,12,14,27,8,27,23,9,19,4,26,20,9,27;u=0,0,0,1
3_153_186_F3	solid	read	2083441	2083460	6.7	+	.	g=T3112233331133323322;i=1;p=1.000;q=7,14,19,7,12,6,11,4,11,8,4,6,6,4,11,4,6,4,4,18;u=0,1
3_65_1741_F3	solid	read	2107441	2107460	8.4	+	.	g=T3333332330233132123;i=1;p=1.000;q=4,4,6,25,9,4,26,16,21,9,18,15,27,27,4,21,9,7,9,6;u=0,0,0,1
3_98_323_F3	solid	read	2118821	2118840	7.5	+	.	g=A3222212322131112031;i=1;p=1.000;q=13,14,8,10,8,14,4,13,10,7,15,4,6,4,4,12,6,11,6,8;u=0,0,1
3_48_258_F3	solid	read	2153882	2153901	9.4	-	.	g=G0330113313201122321;i=1;p=1.000;q=22,15,20,4,16,17,14,24,4,5,4,22,19,8,10,9,13,22,8,15;u=0,0,0,1
3_140_1125_F3	solid	read	2182909	2182928	7.9	+	.	g=T3231331302232001131;i=1;p=1.000;q=10,4,12,6,4,12,13,6,18,5,8,11,4,26,6,25,5,18,11,12;u=0,0,0,1
3_359_118_F3	solid	read	2188393	2188412	8.4	+	.	g=A0301311133331131322;i=1;p=1.000;q=11,5,7,13,20,6,6,25,8,18,9,15,27,9,6,7,15,17,4,4;u=0,0,0,1
3_203_483_F3	solid	read	2272874	2272893	9.1	-	.	g=C3031223110333133311;i=1;p=1.000;q=23,21,25,27,10,5,22,15,17,18,5,18,17,5,19,4,4,13,4,22;u=0,0,0,1
3_66_301_F3	solid	read	2286038	2286057	6.6	-	.	g=C1113113330132222311;i=1;p=1.000;q=10,4,6,4,8,13,9,4,10,9,4,6,13,9,5,6,11,6,4,9;u=0,0,0,1
3_78_130_F3	solid	read	2291021	2291040	7.6	+	.	g=G3233131332212222321;i=1;p=1.000;q=13,16,6,12,17,11,10,4,12,8,13,4,8,6,4,4,12,10,4,11;u=0,0,0,1
3_141_110_F3	solid	read	2291354	2291373	9.3	+	.	g=T1312203322212123321;i=1;p=1.000;q=9,21,24,11,16,4,23,27,16,16,8,22,6,10,16,4,9,4,7,25;u=0,0,1
3_51_1383_F3	solid	read	2374918	2374937	8.8	+	.	g=T3311203033322222231;i=1;p=1.000;q=24,26,6,27,27,23,27,4,21,27,4,27,6,9,24,4,23,4,4,27;u=0,0,1
3_231_366_F3	solid	read	2392091	2392110	10.0	-	.	g=T2022333223101331322;i=1;p=1.000;q=18,12,9,9,13,8,7,22,7,7,4,26,12,17,9,20,24,8,18,14;u=0,0,0,1
3_214_1802_F3	solid	read	2394604	2394623	8.8	-	.	g=T1232111001220211133;i=1;p=1.000;q=17,18,14,6,19,4,21,4,6,12,11,4,26,20,9,18,7,16,5,18;u=0,0,0,1
3_67_1434_F3	solid	read	2454508	2454527	15.2	-	.	g=T3121311232222231203;i=1;p=1.000;q=9,27,27,18,16,14,25,27,26,21,19,27,27,27,15,5,24,27,24,24;u=0,0,1
3_124_1647_F3	solid	read	2493617	2493636	7.5	+	.	g=A0211320203220231332;i=1;p=1.000;q=9,12,12,9,6,14,12,7,4,4,12,9,4,9,16,4,4,9,9,16;u=0,0,0,1
3_39_328_F3	solid	read	2500759	2500778	7.8	+	.	g=T1332333033231132333;i=1;p=1.000;q=24,27,26,26,25,21,7,8,4,5,20,4,11,6,8,4,6,4,11,7;u=0,0,1
3_378_322_F3	solid	read	2541624	2541643	8.9	+	.	g=T2333331001023011220;i=1;p=1.000;q=14,6,13,25,27,4,24,22,14,19,9,23,15,6,8,4,22,4,4,20;u=0,0,0,1
3_216_848_F3	solid	read	2550573	2550592	11.5	-	.	g=G2320322020031220322;i=1;p=1.000;q=21,24,8,21,20,25,18,6,24,14,21,9,7,18,8,18,7,9,19,12;u=0,0,0,1
3_221_516_F3	solid	read	2607559	2607578	11.1	-	.	g=T2132333313222333332;i=1;p=1.000;q=9,19,27,26,24,26,26,25,25,26,21,4,6,10,21,6,20,13,5,24;u=0,0,0,1
3_56_45_F3	solid	read	2662103	2662122	5.5	+	.	g=G3021122332232122321;i=1;p=1.000;q=4,4,4,6,4,6,4,5,18,9,4,16,10,4,4,4,12,4,6,6;u=0,0,0,1
3_127_210_F3	solid	read	2798906	2798925	10.2	+	.	g=G2331321333232203222;i=1;p=1.000;q=11,25,9,4,23,16,26,14,7,22,9,25,9,8,21,8,15,17,4,26;u=0,0,1
3_417_422_F3	solid	read	2812322	2812341	8.8	-	.	g=T3321222333313333132;i=1;p=1.000;q=9,26,7,19,7,13,23,4,25,4,6,19,4,16,15,15,23,4,19,13;u=0,0,0,1
3_42_1403_F3	solid	read	2830264	2830283	9.6	-	.	g=T3212330132120221212;i=1;p=1.000;q=7,4,25,18,6,17,12,12,17,14,8,26,13,15,10,4,21,5,12,22;u=0,1
3_457_42_F3	solid	read	2874245	2874264	7.6	-	.	g=G0301123332223122221;i=1;p=1.000;q=18,10,14,9,19,4,10,8,11,10,6,8,5,8,11,4,13,6,4,6;u=0,0,1
3_361_728_F3	solid	read	2893879	2893898	14.6	+	.	g=C3213223312310132221;i=1;p=1.000;q=14,18,7,7,17,19,23,24,17,26,12,15,21,23,21,19,17,20,22,24;u=0,0,0,1
3_77_718_F3	solid	read	2913092	2913111	9.4	+	.	g=T3021331333313131231;i=1;p=1.000;q=15,26,7,24,20,18,5,6,17,18,6,11,4,13,19,15,7,4,22,25;u=0,0,0,1
3_116_154_F3	solid	read	2917672	2917691	9.8	-	.	g=A0323231223233132311;i=1;p=1.000;q=20,9,19,18,10,18,8,16,25,6,18,6,12,24,6,7,5,15,7,17;u=0,0,0,1
3_239_1415_F3	solid	read	2923256	2923275	19.2	+	.	g=T3233113121300032200;i=1;p=1.000;q=25,27,27,26,27,24,27,27,25,27,22,27,21,26,22,19,26,9,14,21;u=1
3_142_1468_F3	solid	read	2930117	2930136	10.5	-	.	g=A3233323333303103330;i=1;p=1.000;q=9,20,6,26,16,18,8,13,20,25,25,18,6,12,11,18,4,16,16,6;u=0,0,1
3_394_295_F3	solid	read	2930118	2930137	8.1	-	.	g=T3023333333333311331;i=1;p=1.000;q=4,14,6,12,7,22,10,4,13,24,18,12,12,4,6,9,9,9,14,4;u=0,0,0,1
3_222_1773_F3	solid	read	2934040	2934059	11.6	+	.	g=T1303031311123232302;i=1;p=1.000;q=11,10,24,15,28,6,19,5,13,27,8,26,8,22,25,27,26,27,8,13;u=0,0,0,1
3_276_1344_F3	solid	read	2969950	2969969	13.2	-	.	g=G3211212131233322233;i=1;p=1.000;q=27,27,12,16,11,23,27,8,23,12,27,22,20,12,15,25,8,27,16,6;u=0,1
3_155_1814_F3	solid	read	3107393	3107412	13.6	+	.	g=A2332222213113120221;i=1;p=1.000;q=27,26,20,25,26,27,12,27,26,18,26,4,27,10,23,26,6,23,26,26;u=0,0,0,1
3_373_2014_F3	solid	read	3143956	3143975	12.0	-	.	g=T3013322223222221211;i=1;p=1.000;q=16,8,17,21,10,10,18,18,18,13,4,23,16,24,8,19,14,15,23,11;u=0,1
3_81_1637_F3	solid	read	3413619	3413638	9.1	+	.	g=G2313032322122302111;i=1;p=1.000;q=9,4,7,19,27,6,11,5,12,15,20,27,8,27,6,16,6,27,21,6;u=0,0,1
3_291_969_F3	solid	read	3438323	3438342	17.4	+	.	g=T0021120212032121313;i=1;p=1.000;q=24,27,6,27,27,27,27,13,27,27,25,27,26,27,27,20,23,26,27,20;u=1
3_179_1617_F3	solid	read	3475164	3475183	8.0	+	.	g=A2100132222332123123;i=1;p=1.000;q=21,25,11,22,4,19,7,21,20,4,5,24,25,16,4,4,11,19,4,4;u=0,0,0,1
3_446_861_F3	solid	read	3476173	3476192	11.6	-	.	g=G1213302212022132321;i=1;p=1.000;q=27,27,27,27,26,25,12,27,24,18,24,6,27,26,20,9,6,6,4,23;u=0,0,1
3_397_317_F3	solid	read	3545152	3545171	11.1	+	.	g=T3110031332233111131;i=1;p=1.000;q=22,27,9,9,26,5,22,20,9,10,16,22,24,6,23,25,22,4,17,18;u=0,0,0,1
3_323_713_F3	solid	read	3575287	3575306	16.2	-	.	g=A0322222200213223302;i=1;p=1.000;q=27,25,21,27,26,26,24,26,27,18,27,26,26,27,22,22,6,26,25,8;u=0,1
3_294_1906_F3	solid	read	3727542	3727561	8.4	-	.	g=A3030310223202311021;i=1;p=1.000;q=14,7,5,4,7,18,4,6,13,6,12,12,10,11,15,14,16,7,9,12;u=0,0,0,1
3_443_223_F3	solid	read	3730805	3730824	17.1	-	.	g=T1113320033330133111;i=1;p=1.000;q=28,27,18,27,27,27,20,26,27,14,25,16,19,19,8,23,16,21,16,15;u=0,0,1
3_94_809_F3	solid	read	3841898	3841917	21.8	-	.	g=A2032223110001131310;i=1;p=1.000;q=27,27,27,27,26,27,25,24,27,27,27,25,27,27,27,12,23,16,27,27;u=0,0,0,1
3_245_387_F3	solid	read	3878549	3878568	24.4	-	.	g=A0222211220333132122;i=1;p=1.000;q=27,27,26,27,26,27,27,25,27,25,26,27,18,21,26,25,26,23,24,24;u=1
3_190_1089_F3	solid	read	3900038	3900057	13.7	-	.	g=T1111110323122301202;i=1;p=1.000;q=27,11,27,11,8,9,27,9,9,26,25,27,11,27,23,14,24,20,22,26;u=0,0,1
3_442_1501_F3	solid	read	3912610	3912629	8.5	+	.	g=A0012333103302132301;i=1;p=1.000;q=11,11,15,19,15,6,12,10,4,11,21,5,9,16,7,14,4,4,8,19;u=0,0,1
3_342_678_F3	solid	read	4044575	4044594	4.0	+	.	g=A3333112332213322323;i=1;p=1.000;q=4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4;u=0,0,0,1
3_56_1294_F3	solid	read	4058789	4058808	12.7	+	.	g=G3323331232322213322;i=1;p=1.000;q=26,17,18,27,23,8,8,24,27,27,9,27,25,14,26,4,27,9,24,23;u=0,0,0,1
3_69_1575_F3	solid	read	4070467	4070486	9.9	+	.	g=A2222011012222112121;i=1;p=1.000;q=16,25,14,9,9,9,21,9,4,24,6,21,13,6,27,10,19,8,6,27;u=0,0,0,1
3_198_476_F3	solid	read	4080622	4080641	8.9	+	.	g=C2010231122212011133;i=1;p=1.000;q=16,8,8,16,12,17,4,16,12,15,10,4,9,6,4,25,9,9,23,11;u=0,1
3_24_715_F3	solid	read	4136503	4136522	4.0	-	.	g=G1313332132232313233;i=1;p=1.000;q=4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4;u=0,0,0,1
3_151_283_F3	solid	read	4148264	4148283	9.7	+	.	g=T3230210232022111220;i=1;p=1.000;q=9,14,6,25,25,19,6,4,16,11,12,20,10,13,26,19,6,4,19,14;u=0,0,1
3_164_774_F3	solid	read	4156157	4156176	9.6	+	.	g=G2311112210110223313;i=1;p=1.000;q=8,24,19,7,6,16,12,9,4,8,26,14,26,24,7,18,6,16,14,7;u=0,0,0,1
3_275_1212_F3	solid	read	4171385	4171404	8.3	+	.	g=G0223122231333302232;i=1;p=1.000;q=13,8,5,4,10,7,12,25,4,25,6,15,6,27,6,11,12,7,14,10;u=0,0,0,1
3_148_289_F3	solid	read	4177672	4177691	8.0	-	.	g=T1203101332223323323;i=1;p=1.000;q=9,21,11,6,5,7,25,24,26,24,8,9,7,12,7,4,11,9,4,4;u=0,0,0,1
3_437_1000_F3	solid	read	4179623	4179642	12.3	+	.	g=A0112222212231131001;i=1;p=1.000;q=26,27,26,27,4,27,17,6,22,13,27,24,6,27,21,27,22,15,24,9;u=0,0,1
3_318_2011_F3	solid	read	4218181	4218200	12.9	-	.	g=T2133330223033303323;i=1;p=1.000;q=25,27,27,5,5,16,27,16,27,15,18,25,26,11,27,19,16,24,9,15;u=0,0,0,1
3_14_11_F3	solid	read	4222697	4222716	7.8	-	.	g=T2323310222232322122;i=1;p=1.000;q=6,23,16,25,25,9,7,4,12,4,14,6,10,7,6,9,18,4,10,4;u=0,0,0,1
3_402_391_F3	solid	read	4274545	4274564	6.2	-	.	g=C3303323321111111111;i=1;p=1.000;q=10,19,15,15,7,8,13,4,7,4,5,16,4,4,5,4,9,4,4,4;u=0,0,0,1
3_293_504_F3	solid	read	4339235	4339254	9.5	+	.	g=C2133223303331120213;i=1;p=1.000;q=6,4,5,26,13,7,17,6,24,10,27,24,5,9,21,9,23,24,20,14;u=0,0,0,1
3_360_914_F3	solid	read	4407004	4407023	10.7	+	.	g=T3012102130232022001;i=1;p=1.000;q=23,24,19,17,24,6,26,17,25,15,7,24,14,11,26,9,22,4,8,5;u=0,0,0,1
3_118_1532_F3	solid	read	4431702	4431721	10.2	+	.	g=C3233220201223200322;i=1;p=1.000;q=20,9,17,22,17,23,13,4,9,5,16,11,10,6,17,7,9,22,27,27;u=0,0,1
3_358_133_F3	solid	read	4460191	4460210	9.1	+	.	g=T0221223112322112233;i=1;p=1.000;q=6,23,12,22,7,6,7,4,13,5,9,23,12,9,24,8,14,7,20,26;u=0,0,0,1
3_397_195_F3	solid	read	4499390	4499409	6.9	-	.	g=T3302332313332212121;i=1;p=1.000;q=23,14,15,5,9,8,6,4,4,13,4,16,13,16,4,7,4,12,4,5;u=0,0,0,1
3_158_642_F3	solid	read	4533144	4533163	7.1	-	.	g=A1332103332323233212;i=1;p=1.000;q=8,20,9,22,8,14,4,16,17,4,8,13,7,8,4,12,5,4,4,4;u=0,0,0,1
3_300_1439_F3	solid	read	4580452	4580471	12.3	-	.	g=A0331111211302100201;i=1;p=1.000;q=5,17,21,14,4,16,11,27,21,9,17,17,27,23,12,21,16,27,25,25;u=0,0,0,1
# Elapsed time 0.846 secs

Import

By default, the id_spec="some_string" parameter tells gffutils to create a primary key based on the value of the “some_string” attribute. For this example, however, we would like to make the first field the primary key. Luckily, gffutils will accept the name of a GFF field, surrounded by “:”. The names of the fields are in gffutils.constants._gffkeys:

>>> gffutils.constants._gffkeys
['seqid', 'source', 'featuretype', 'start', 'end', 'score', 'strand', 'frame', 'attributes']

We want to use the first field, seqid, as the primary key for the database, so we use id_spec=":seqid:".

>>> fn = gffutils.example_filename('F3-unique-3.v2.gff')
>>> db = gffutils.create_db(fn, ':memory:', id_spec=[':seqid:'])

>>> db.directives[0]
'solid-gff-version 0.2'

Now we can access the features by their sequence name:

>>> db['3_8_425_F3']
<Feature read (3_8_425_F3:1119124-1119143[-]) at 0x...>

glimmer_nokeyval.gff3

This is a problematic file in which the attributes are not all in format “key=val”. This example demonstrates how gffutils handles cases like this: by implicitly treating them as keys and setting the value to an empty list.

Like the the ensembl_gtf.txt file, gene and transcript names are not different. Based on IDs alone, it’s not clear whether the CDS’s parent is the gene or the mRNA.

However, the way to fix this is different from ensembl_gtf.txt. Recall in that case, we constructed a function to modify transcript IDs, and when gffutils inferred the new transcripts from the component exons, it used that new ID.

Here, we need to do several things in the transform function: if the feature type contains “RNA”, then set the ID to ID + "_transript" (this will handle things like ncRNA, tRNA, etc and is more general than just “mRNA”). Next, we need to edit the Parent attribute of CDS, exon, etc. How this is handled will depend completely on the file you are using.

File contents

##gff-version 3
##sequence-region scaffold4215_3 1 6526
scaffold4215_3	glimmer	gene	3	62	.	-	.	ID=GL0000006;Name=GL0000006;Lack 3'-end;
scaffold4215_3	glimmer	mRNA	3	62	.	-	.	ID=GL0000006;Name=GL0000006;Parent=GL0000006;Lack 3'-end;
scaffold4215_3	glimmer	CDS	3	62	2.84	-	0	Parent=GL0000006;Lack 3'-end;
scaffold4215_3	glimmer	gene	124	1983	.	-	.	ID=GL0000007;Name=GL0000007;Complete;

Import

>>> def transform(f):
...     if f.featuretype == 'gene':
...         return f
...     elif "RNA" in f.featuretype:
...         f.attributes['ID'][0] += '_transcript'
...     else:
...         # assume that anything else is a child of a transcript, so we need
...         # to edit the "Parent" attribute
...         if 'Parent' in f.attributes:
...             f.attributes['Parent'] \
...                = [i + '_transcript' for i in f.attributes['Parent']]
...     return f
...

>>> fn = gffutils.example_filename('glimmer_nokeyval.gff3')
>>> db = gffutils.create_db(fn, ":memory:", id_spec="ID", transform=transform)

Access

This shows that keys with missing values are assigned an empty list. It’s up to the calling code to decide how to handle this (say, by checking for the presence of the key “Complete”.

>>> for k, v in sorted(db['GL0000007'].attributes.items()):
...     print(k, '=', v)
Complete = []
ID = ['GL0000007']
Name = ['GL0000007']

This illustrates that the CDS finds the proper transcript parent.

>>> for f in db.parents(db['CDS_1'], level=1):
...     print(f.id)
GL0000006_transcript

hybrid1.gff3

This file contains a FASTA sequence at the end. Currently, gffutils assumes that there are no further annotations and stops parsing the file at this point (specifically, when it sees the string “##FASTA” at the beginning of a line).

Another issue with this file is that the “Note” attributes field contains url-encoded characters. These are kept verbatim.

File contents

##gff-version 3
##sequence-region foo 1 100
##feature-ontology bar
##attribute-ontology baz
##source-ontology boo
##sequence-region chr17 62467934 62469545 
chr17	UCSC	mRNA	62467934	62469545	.	-	.	ID=A00469;Dbxref=AFFX-U133:205840_x_at,Locuslink:2688,Genbank-mRNA:A00469,Swissprot:P01241,PFAM:PF00103,AFFX-U95:1332_f_at,Swissprot:SOMA_HUMAN;Note=growth%20hormone%201;Alias=GH1
chr17	UCSC	CDS	62468039	62468236	.	-	1	Parent=A00469
chr17	UCSC	CDS	62468490	62468654	.	-	2	Parent=A00469
chr17	UCSC	CDS	62468747	62468866	.	-	1	Parent=A00469
chr17	UCSC	CDS	62469076	62469236	.	-	1	Parent=A00469
chr17	UCSC	CDS	62469497	62469506	.	-	0	Parent=A00469
###
##FASTA
>chr17
GATTACA
GATTACA

Import

Straightforward:

>>> fn = gffutils.example_filename('hybrid1.gff3')
>>> db = gffutils.create_db(fn, ":memory:")

Access

Print out the attributes:

>>> for k, v in sorted(db['A00469'].attributes.items()):
...     print(k, '=', v)
Alias = ['GH1']
Dbxref = ['AFFX-U133:205840_x_at', 'Locuslink:2688', 'Genbank-mRNA:A00469', 'Swissprot:P01241', 'PFAM:PF00103', 'AFFX-U95:1332_f_at', 'Swissprot:SOMA_HUMAN']
ID = ['A00469']
Note = ['growth hormone 1']

jgi_gff2.txt

This file is GFF2 format – the attribute format is like GTF, but without the required “gene_id” and “transcript_id” fields that are used to infer genes and transcripts. To get around this, we can supply the gtf_transcript_key kwarg, which will override the default "transcript_id" (see GTF files for more background on this).

There is information in the file on which exon goes to which transcript, but no such information about which CDS goes to which transcript. There is, however, information about the parent protein.

File contents

chr_1	JGI	exon	37061	37174	.	-	.	name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1	JGI	CDS	37061	37174	.	-	0	name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 3
chr_1	JGI	exon	37315	37620	.	-	.	name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1	JGI	CDS	37315	37620	.	-	0	name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 2
chr_1	JGI	exon	37752	38216	.	-	.	name "fgenesh1_pg.C_chr_1000007"; transcriptId 873
chr_1	JGI	CDS	37752	38216	.	-	0	name "fgenesh1_pg.C_chr_1000007"; proteinId 873; exonNumber 1

Import

Inspecting the file, we see that the “name” attribute is repeated on each line. We can use that as the gene ID for inferring the gene extents, like “gene_id” in a normal GTF file. Similarly, the “transcriptId” attribute here can be used like “transcript_id” in a normal GTF file for inferring transcript extents.

We also need to make a choice about how we’re going to use this database. Do we want to have the CDS features be considered children of transcripts? In that case, we’ll need the transform function, which creates a “transcriptId” attribute out of an existing “proteinId” attribute:

>>> def transform(d):
...     try:
...         d['transcriptId'] = d['proteinId']
...     except KeyError:
...         pass
...     return d

>>> fn = gffutils.example_filename('jgi_gff2.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'transcript': 'transcriptId', 'gene': 'name'},
... gtf_transcript_key='transcriptId', gtf_gene_key='name',
... transform=transform)

Access

Since we used that transform function, the exons and CDSs are all children of the “873” transcript:

>>> sorted(list(db.children('873', level=1)), key=lambda x: (x.featuretype, x.start))
[<Feature CDS (chr_1:37061-37174[-]) at 0x...>, <Feature CDS (chr_1:37315-37620[-]) at 0x...>, <Feature CDS (chr_1:37752-38216[-]) at 0x...>, <Feature exon (chr_1:37061-37174[-]) at 0x...>, <Feature exon (chr_1:37315-37620[-]) at 0x...>, <Feature exon (chr_1:37752-38216[-]) at 0x...>]

Here we can see that all children of the gene are accounted for:

>>> for f in db.children("fgenesh1_pg.C_chr_1000007", order_by='featuretype'):
...     print('{0.featuretype:>10}: {0.id}'.format(f))
       CDS: CDS_1
       CDS: CDS_2
       CDS: CDS_3
      exon: exon_1
      exon: exon_2
      exon: exon_3
transcript: 873

ncbi_gff3.txt

This file is problematic because all genes have the same ID attribute. We can get around this by using the “db_xref” field as the key for genes.

File contents

##gff-version 3
##source-version NCBI C++ formatter 0.2
##date 2009-04-25
##Type DNA NC_008596.1
NC_008596.1	RefSeq	gene	12272	13301	.	+	.	locus_tag=MSMEG_0013;note=ferric%20enterobactin%20transport%20system%20permease%20protein%20FepG%3B%20this%20gene%20contains%20a%20frame%20shift%20which%20is%20not%20the%20result%20of%20sequencing%20error%3B%20identified%20by%20match%20to%20protein%20family%20HMM%20PF01032;pseudo=;db_xref=GeneID:4537201
NC_008596.1	RefSeq	gene	1137579	1138550	.	+	.	ID=NC_008596.1:speB;locus_tag=MSMEG_1072;db_xref=GeneID:4535378
NC_008596.1	RefSeq	CDS	1137579	1138547	.	+	0	ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1	RefSeq	start_codon	1137579	1137581	.	+	0	ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1	RefSeq	stop_codon	1138548	1138550	.	+	0	ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;exon_number=1
NC_008596.1	RefSeq	gene	3597069	3598112	.	+	.	ID=NC_008596.1:speB;locus_tag=MSMEG_3535;db_xref=GeneID:4533678
NC_008596.1	RefSeq	CDS	3597069	3598109	.	+	0	ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1	RefSeq	start_codon	3597069	3597071	.	+	0	ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1	RefSeq	stop_codon	3598110	3598112	.	+	0	ID=NC_008596.1:speB:unknown_transcript_2;Parent=NC_008596.1:speB;locus_tag=MSMEG_3535;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_887838.1;db_xref=GI:118470943;db_xref=GeneID:4533678;exon_number=1
NC_008596.1	RefSeq	gene	4460713	4461672	.	-	.	ID=NC_008596.1:speB;locus_tag=MSMEG_4374;db_xref=GeneID:4535424
NC_008596.1	RefSeq	CDS	4460716	4461672	.	-	0	ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1	RefSeq	start_codon	4461670	4461672	.	-	0	ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1	RefSeq	stop_codon	4460713	4460715	.	-	0	ID=NC_008596.1:speB:unknown_transcript_3;Parent=NC_008596.1:speB;locus_tag=MSMEG_4374;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888649.1;db_xref=GI:118469662;db_xref=GeneID:4535424;exon_number=1
NC_008596.1	RefSeq	gene	4539385	4540344	.	+	.	ID=NC_008596.1:speB;locus_tag=MSMEG_4459;db_xref=GeneID:4537057
NC_008596.1	RefSeq	CDS	4539385	4540341	.	+	0	ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1
NC_008596.1	RefSeq	start_codon	4539385	4539387	.	+	0	ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1
NC_008596.1	RefSeq	stop_codon	4540342	4540344	.	+	0	ID=NC_008596.1:speB:unknown_transcript_4;Parent=NC_008596.1:speB;locus_tag=MSMEG_4459;EC_number=3.5.3.11;note=identified%20by%20match%20to%20protein%20family%20HMM%20PF00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01230;transl_table=11;product=agmatinase;protein_id=YP_888732.1;db_xref=GI:118472833;db_xref=GeneID:4537057;exon_number=1

Import

For “gene” featuretypes, we want to use the “db_xref” field as the primary key; all other featuretypes will be incremented versions of the featuretype (“CDS_1”, “CDS_2”, etc).

>>> fn = gffutils.example_filename('ncbi_gff3.txt')
>>> db = gffutils.create_db(fn, ":memory:", id_spec={'gene': 'db_xref'})

Access

Genes can now be accessed by the db_xref:

>>> db['GeneID:4537201']
<Feature gene (NC_008596.1:12272-13301[+]) at 0x...>

wormbase_gff2_alt.txt

This file has no gene_id or transcript_id fields; it appears to use “CDS” as a gene-level kind of object. So we can use a transform function to add “gene_id” and “transcript_id” fields to all non-CDS features so that the file conforms to a GTF standard and gene extents can be inferred. Furthermore, by default gffutils uses 'exon' as the default feature type to merge into genes. Here, we need to specify gtf_subfeature='coding_exon'. Last, note that the file itself has inconsistent trailing semicolons. The dialect detection decides that there should be no trailing semicolon since that’s what the majority of the features use.

File contents

Remanei_genome	Genomic_canonical	region	1	7816	.	+	.	Sequence "Contig1020";
Contig102	WU_MERGED	CDS	1629	3377	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	coding_exon	2927	3377	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	coding_exon	2474	2875	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	coding_exon	1928	2430	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	coding_exon	1629	1883	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	intron	2876	2926	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	intron	2431	2473	.	-	.	CDS "cr01.sctg102.wum.2.1"
Contig102	WU_MERGED	intron	1884	1927	.	-	.	CDS "cr01.sctg102.wum.2.1"

Import

The transform function to manipulate the attribute dictionary: add “gene_id” and “transcript_id” attributes to intron and coding_exon features:

>>> def transform(f):
...     if f.featuretype in ['coding_exon', 'intron']:
...         _id = f.attributes['CDS'][0]
...         f.attributes['gene_id'] = [_id]
...         f.attributes['transcript_id'] = [_id + '_transcript']
...     return f

>>> fn = gffutils.example_filename('wormbase_gff2_alt.txt')
>>> db = gffutils.create_db(fn, ":memory:", id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}, transform=transform, gtf_subfeature='coding_exon')

Access

Note that the inferred genes have a source of “gffutils_derived”:

>>> print(db["cr01.sctg102.wum.2.1"])  
Contig102       gffutils_derived        gene    1629    3377    .       -       .       gene_id "cr01.sctg102.wum.2.1"

Get a report of the childrent of the gene:

>>> for f in db.children("cr01.sctg102.wum.2.1", order_by='featuretype'):
...     print('{0.featuretype:>12}: {0.id}'.format(f))
 coding_exon: coding_exon_1
 coding_exon: coding_exon_2
 coding_exon: coding_exon_3
 coding_exon: coding_exon_4
      intron: intron_1
      intron: intron_2
      intron: intron_3
  transcript: cr01.sctg102.wum.2.1_transcript

wormbase_gff2.txt

This file is problematic because of inconsistent formatting: the SAGE_tag features have a different attributes format than the rest of the features. So we need to do some work with the dialect used for parsing attributes into dictionaries.

File contents

I	Genomic_canonical	region	1	2679	.	+	.	Sequence "cTel33B" ; Note "Clone cTel33B; Genbank AC199162" ; Note "Clone cTel33B; Genbank AC199162"
I	Coding_transcript	Transcript	12759582	12764949	.	-	.	Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; CDS "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138"
I	Coding_transcript	intron	12759829	12759948	.	-	.	Transcript "B0019.1" ; Confirmed_EST EC034652
I	Coding_transcript	intron	12760014	12760226	.	-	.	Transcript "B0019.1" ; Confirmed_EST EC034652
I	Coding_transcript	intron	12760320	12760364	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1054h04.3
I	Coding_transcript	intron	12760495	12760833	.	-	.	Transcript "B0019.1" ; Confirmed_EST EC027594
I	Coding_transcript	intron	12760905	12761171	.	-	.	Transcript "B0019.1" ; Confirmed_EST EC027594
I	Coding_transcript	intron	12761517	12761798	.	-	.	Transcript "B0019.1" ; Confirmed_EST EC027594
I	Coding_transcript	intron	12761954	12762126	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk262g9.5
I	Coding_transcript	intron	12762269	12762647	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk262g9.5
I	Coding_transcript	intron	12762807	12763111	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I	Coding_transcript	intron	12763250	12763447	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I	Coding_transcript	intron	12763656	12763728	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1056c07.5
I	Coding_transcript	intron	12763883	12763978	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I	Coding_transcript	intron	12764103	12764290	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I	Coding_transcript	intron	12764472	12764811	.	-	.	Transcript "B0019.1" ; Confirmed_EST yk1054h04.5 ; Confirmed_EST OSTF088D9_1
I	Coding_transcript	exon	12759582	12759828	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12759949	12760013	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12760227	12760319	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12760365	12760494	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12760834	12760904	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12761172	12761516	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12761799	12761953	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12762127	12762268	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12762648	12762806	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12763112	12763249	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12763448	12763655	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12763729	12763882	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12763979	12764102	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12764291	12764471	.	-	.	Transcript "B0019.1"
I	Coding_transcript	exon	12764812	12764949	.	-	.	Transcript "B0019.1"
I	SAGE_tag_unambiguously_mapped	SAGE_tag	12761492	12761512	.	-	.	Sequence SAGE:aacggagccgtacacgc;count 5;Gene amx-2;Transcript B0019.1
I	SAGE_tag_most_three_prime	SAGE_tag	12761499	12761512	.	-	.	Sequence SAGE:aacggagccg;count 9;Gene amx-2;Transcript B0019.1
I	mass_spec_genome	translated_nucleotide_match	12761920	12761953	.	-	.	Target "Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK" 10 21 ; Note "MSP:FADFSPLDVSDVNFATDDLAK" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "3"
I	mass_spec_genome	translated_nucleotide_match	12762127	12762155	.	-	.	Target "Mass_spec_peptide:MSP:FADFSPLDVSDVNFATDDLAK" 1 10 ; Note "MSP:FADFSPLDVSDVNFATDDLAK" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "3"
I	mass_spec_genome	translated_nucleotide_match	12763506	12763559	.	-	.	Target "Mass_spec_peptide:MSP:FGHGQSLLAQGGMNEVVR" 1 18 ; Note "MSP:FGHGQSLLAQGGMNEVVR" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "1"
I	SAGE_tag_unambiguously_mapped	SAGE_tag	12763533	12763553	.	-	.	Sequence SAGE:ggcagagtcttttggca;count 1;Gene amx-2;Transcript B0019.1
I	mass_spec_genome	translated_nucleotide_match	12764361	12764411	.	-	.	Target "Mass_spec_peptide:MSP:NIQQNRPGLSVLVLEAR" 1 17 ; Note "MSP:NIQQNRPGLSVLVLEAR" ; Protein_matches "WP:CE40797" ; CDS_matches "B0019.1" ; Times_observed "2"
I	GenePair_STS	PCR_product	12762449	12764118	.	+	.	PCR_product "sjj_B0019.1"
I	Expr_profile	experimental_result_region	12762449	12764118	.	+	.	Expr_profile "B0019.1"
I	Allele	SNP	12764272	12764272	.	+	.	Variation "snp_B0019[1]" ; Interpolated_map_position "14.003" ; ; RFLP "No"
I	Promoterome	PCR_product	12764938	12766937	.	+	.	PCR_product "p_B0019.1_93"
I	Oligo_set	reagent	12759745	12761589	.	-	.	Oligo_set "Aff_B0019.1"
I	Orfeome	PCR_product	12759747	12764936	.	-	.	PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1
I	Coding_transcript	three_prime_UTR	12759582	12759744	.	-	.	Transcript "B0019.1"
I	Coding_transcript	coding_exon	12759745	12759828	.	-	0	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12759949	12760013	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12760227	12760319	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12760365	12760494	.	-	0	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12760834	12760904	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12761172	12761516	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12761799	12761953	.	-	1	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12762127	12762268	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12762648	12762806	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12763112	12763249	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12763448	12763655	.	-	0	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12763729	12763882	.	-	1	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12763979	12764102	.	-	2	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	coding_exon	12764291	12764471	.	-	0	Transcript "B0019.1" ; CDS "B0019.1"
I	Coding_transcript	five_prime_UTR	12764938	12764949	.	-	.	Transcript "B0019.1"
I	Coding_transcript	coding_exon	12764812	12764937	.	-	0	Transcript "B0019.1" ; CDS "B0019.1"
X	SAGE_tag	SAGE_tag	6819353	6819366	.	+	.	Sequence SAGE:aacggagccg;count 9;Gene amx-2;Transcript B0019.1
X 	gene 	processed_transcript 	944828 	948883 	.	-	.    	Gene "WBGene00004893"

Import

We will need force_dialect_check=True in this case if we want to be able to have the parents of the SAGE_tag features be recognized, because the attributes for SAGE_tag features are formatted differently. That is, they don’t have quoted values, and the separating semicolon does not have a space on either side as is the case for the other features.

The default of False means that only the first few lines are checked to determine the dialect to be used for all lines. But that won’t work here because of the inconsistent formatting. So we have to take the more time-consuming approach of checking every dialect in order to figure out how to parse them into a dictionary.

Transform function to create appropriate “Parent” attributes:

>>> def transform(f):
...     if f.featuretype == 'Transcript':
...         f.attributes['Parent'] = f.attributes['Gene']
...     else:
...         try:
...             f.attributes['Parent'] = f.attributes['Transcript']
...         except KeyError:
...             pass
...     return f

>>> fn = gffutils.example_filename('wormbase_gff2.txt')

>>> db = gffutils.create_db(fn,
... ":memory:",
... transform=transform,
... id_spec={'Transcript': "Transcript"},
... force_gff=True,
... force_dialect_check=True,
... keep_order=True)

The dialect for the database will be None:

>>> assert db.dialect is None, db.dialect

For cases like this, we should probably construct our own dialect to force all attributes to have the same format. To help with this, we can use the helpers.infer_dialect() function by providing attributes:

>>> from gffutils import helpers
>>> dialect = helpers.infer_dialect(
... 'Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; CDS "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138"',
... )
>>> print(dialect)
{'leading semicolon': False, 'trailing semicolon': False, 'quoted GFF2 values': True, 'field separator': ' ; ', 'keyval separator': ' ', 'multival separator': ',', 'fmt': 'gtf', 'repeated keys': True, 'order': ['Transcript', 'WormPep', 'Note', 'Prediction_status', 'Gene', 'CDS', 'WormPep', 'Note', 'Prediction_status', 'Gene']}

>>> db.dialect = dialect

Access

>>> t = db["B0019.1"]

Since we’ve set the dialect for the database, any features returned from the database should follow that dialect:

>>> print(t)  
I       Coding_transcript       Transcript      12759582        12764949        .       -       .       Transcript "B0019.1" ; WormPep "WP:CE40797" ; WormPep "WP:CE40797" ; Note "amx-2" ; Note "amx-2" ; Prediction_status "Partially_confirmed" ; Prediction_status "Partially_confirmed" ; Gene "WBGene00000138" ; Gene "WBGene00000138" ; CDS "B0019.1" ; Parent "WBGene00000138" ; Parent "WBGene00000138"

>>> len(list(db.children(t, featuretype='exon')))
15

>>> db['SAGE_tag_1'].attributes['Transcript']
['B0019.1']

>>> len(list(db.children(t, featuretype='SAGE_tag')))
4

gencode-v19.gtf

This example file contains the first gene from the Gencode v19 annotations. It’s in GTF format, but in contrast to an on-spec GTF file, which only has exon, CDS, UTRs, and start/stop codon features, this file already has genes and transcripts on separate lines. This means we can avoid the gene and transcript inference, which saves time.

##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1	HAVANA	gene	11869	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  exon_id "ENSE00002234944.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  exon_id "ENSE00003582793.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  exon_id "ENSE00002312635.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	ENSEMBL	transcript	11872	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	11872	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 1;  exon_id "ENSE00002234632.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 2;  exon_id "ENSE00003608237.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	13225	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 3;  exon_id "ENSE00002306041.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	transcript	11874	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	11874	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 1;  exon_id "ENSE00002269724.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	12595	12721	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 2;  exon_id "ENSE00002270865.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	13403	13655	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 3;  exon_id "ENSE00002216795.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	13661	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000518655.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-202"; exon_number 4;  exon_id "ENSE00002303382.1";  level 3; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	12010	13670	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	12010	12057	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1;  exon_id "ENSE00001948541.1";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	12179	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2;  exon_id "ENSE00001671638.2";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	12613	12697	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 3;  exon_id "ENSE00001758273.2";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	12975	13052	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 4;  exon_id "ENSE00001799933.2";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	13221	13374	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 5;  exon_id "ENSE00001746346.2";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	HAVANA	exon	13453	13670	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 6;  exon_id "ENSE00001863096.1";  level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

Import

We can use disable_infer_genes=True and disable_infer_transcripts=True to disable the gene and transcript inference. This will dramatically speed up import, but will result in identical results:

>>> fn = gffutils.example_filename('gencode-v19.gtf')
>>> db = gffutils.create_db(fn,
... ":memory:",
... keep_order=True,
... disable_infer_genes=True, disable_infer_transcripts=True)

Access

Access is the same as other files:

>>> db['ENSG00000223972.4']
<Feature gene (chr1:11869-14412[+]) at 0x...>

Ensure that inferring gene extent results in the same database – it just takes longer to import:

>>> db2 = gffutils.create_db(fn,
... ":memory:",
... keep_order=True,
... disable_infer_genes=True, disable_infer_transcripts=True)

>>> for i, j in zip(db.all_features(), db2.all_features()):
...     assert i == j