GTF files
From the GTF definition:
The following feature types are required: ‘CDS’, ‘start_codon’, ‘stop_codon’. The features ‘5UTR’, ‘3UTR’, ‘inter’, ‘inter_CNS’, ‘intron_CNS’ and ‘exon’ are optional. All other features will be ignored.
So genes and transcripts are not explicitly defined. The extent of a transcript
is, after all, implied by all exons that share a single transcript_id, and the
extent of a gene is implied by all exons that share a single gene_id. However,
these extents can be tedious to calculate by hand.
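To make the "by hand" calculation concrete, here is a minimal sketch in plain Python. It is not how gffutils implements the inference; it only assumes tab-separated GTF lines whose attributes look like key "value";. The helper names parse_attributes and infer_extents and the sample records are hypothetical.

```python
import re

# Hypothetical sample: one gene with two transcripts, exon lines only.
GTF_LINES = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t350\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr1\tsrc\texon\t700\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
]

# Matches GTF-style attributes such as: gene_id "g1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the ninth GTF column into a {key: value} dict."""
    return dict(ATTR_RE.findall(field))


def infer_extents(lines):
    """Return (transcript_extents, gene_extents), each {id: (start, end)}."""
    transcripts, genes = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[2] != "exon":
            continue  # only exons contribute to the extents
        start, end = int(fields[3]), int(fields[4])
        attrs = parse_attributes(fields[8])
        for table, key in ((transcripts, "transcript_id"), (genes, "gene_id")):
            name = attrs[key]
            lo, hi = table.get(name, (start, end))
            # Widen the stored span so it covers this exon as well.
            table[name] = (min(lo, start), max(hi, end))
    return transcripts, genes


transcripts, genes = infer_extents(GTF_LINES)
print(transcripts)  # {'t1': (100, 400), 't2': (350, 900)}
print(genes)        # {'g1': (100, 900)}
```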
gffutils infers the gene and transcript extents when the file is imported into
a database, and adds new “derived” features for each gene and transcript. That
way, a gene can be easily accessed by its ID, just like for GFF files.
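As a rough illustration of that import step, the sketch below builds an in-memory database from a small GTF string. The sample records, the constant GTF_TEXT and the helper build_db are hypothetical; the only gffutils call is gffutils.create_db() with its dbfn and from_string arguments. The import sits inside the helper so that the snippet can be loaded even where gffutils is not installed.

```python
# Hypothetical sample: one gene (g1) with two transcripts, exon lines only.
GTF_TEXT = "\n".join(
    "\t".join(fields)
    for fields in [
        ("chr1", "src", "exon", "100", "200", ".", "+", ".",
         'gene_id "g1"; transcript_id "t1";'),
        ("chr1", "src", "exon", "300", "400", ".", "+", ".",
         'gene_id "g1"; transcript_id "t1";'),
        ("chr1", "src", "exon", "350", "500", ".", "+", ".",
         'gene_id "g1"; transcript_id "t2";'),
        ("chr1", "src", "exon", "700", "900", ".", "+", ".",
         'gene_id "g1"; transcript_id "t2";'),
    ]
)


def build_db(gtf_text):
    """Import GTF text into an in-memory gffutils database."""
    import gffutils  # third-party; imported lazily on purpose

    # dbfn=":memory:" keeps the SQLite database off disk;
    # from_string=True tells create_db that the data is text, not a filename.
    return gffutils.create_db(gtf_text, dbfn=":memory:", from_string=True)
```

With gffutils installed, db = build_db(GTF_TEXT) should allow lookups such as db["g1"] and db["t1"], returning the derived gene and transcript features even though the input contains only exon lines.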
However, not all files meet the official GTF specification, in which each
feature has a transcript_id attribute to indicate its parent feature and a
gene_id attribute to indicate its “grandparent” feature. To accommodate this,
gffutils provides some extra options for the gffutils.create_db() function:
transcript_key and gene_key
These kwargs name the attributes from which the parent and grandparent
features, respectively, are extracted. By default,
transcript_key="transcript_id" and gene_key="gene_id". If your data file uses
different attribute names, change them accordingly.
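To illustrate what these two keys select, here is a small sketch in plain Python rather than gffutils code. The helper parent_ids and the nonstandard attribute names Gene and Transcript are hypothetical, chosen only to show a file that does not follow the default naming.

```python
import re

# Matches GTF-style attributes such as: gene_id "g1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parent_ids(attr_field, transcript_key="transcript_id", gene_key="gene_id"):
    """Return (parent, grandparent) IDs; None where a key is absent."""
    attrs = dict(ATTR_RE.findall(attr_field))
    return attrs.get(transcript_key), attrs.get(gene_key)


standard = 'gene_id "g1"; transcript_id "t1";'
nonstandard = 'Gene "g1"; Transcript "t1";'  # hypothetical naming

print(parent_ids(standard))     # ('t1', 'g1')
print(parent_ids(nonstandard))  # (None, None): the default keys find nothing
print(parent_ids(nonstandard, transcript_key="Transcript", gene_key="Gene"))
# ('t1', 'g1')
```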
subfeature
Genes and transcripts are inferred from their component “exon” features. If
your data file does not conform to the GTF standard in this respect, you can
use the subfeature kwarg to change which feature type is used. By default,
subfeature="exon", but see the example wormbase_gff2_alt.txt for an instance
where this needs to be changed.
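The effect of choosing a different subfeature can be sketched in plain Python as follows. This is not gffutils code, and the feature type "coding_exon" is a made-up stand-in; it is not a claim about what wormbase_gff2_alt.txt actually contains. Records are simplified to (featuretype, start, end, transcript_id) tuples.

```python
def transcript_extents(records, subfeature="exon"):
    """Infer {transcript_id: (start, end)} from records of one feature type."""
    extents = {}
    for featuretype, start, end, transcript_id in records:
        if featuretype != subfeature:
            continue  # only the chosen subfeature defines the extent
        lo, hi = extents.get(transcript_id, (start, end))
        extents[transcript_id] = (min(lo, start), max(hi, end))
    return extents


# Hypothetical file that has no "exon" records at all.
RECORDS = [
    ("coding_exon", 100, 200, "t1"),
    ("intron", 201, 299, "t1"),
    ("coding_exon", 300, 400, "t1"),
]

print(transcript_extents(RECORDS))                            # {}
print(transcript_extents(RECORDS, subfeature="coding_exon"))  # {'t1': (100, 400)}
```

With the default, nothing can be inferred because no record has the type "exon"; naming the type the file really uses recovers the transcript extent.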