The following is the schema used for
gffutils. Feel free to skip this if
you’re not familiar with SQL. An explanation of each table can be found below.
>>> print(gffutils.constants.SCHEMA)
CREATE TABLE features (
    id text,
    seqid text,
    source text,
    featuretype text,
    start int,
    end int,
    score text,
    strand text,
    frame text,
    attributes text,
    extra text,
    bin int,
    primary key (id)
    );
CREATE TABLE relations (
    parent text,
    child text,
    level int,
    primary key (parent, child, level)
    );
CREATE TABLE meta (
    dialect text,
    version text
    );
CREATE TABLE directives (
    directive text
    );
CREATE TABLE autoincrements (
    base text,
    n int,
    primary key (base)
    );
CREATE TABLE duplicates (
    idspecid text,
    newid text,
    primary key (newid)
    );
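Because the schema is plain SQL, it can be explored with Python's standard sqlite3 module. The sketch below copies two of the table definitions into an in-memory database (rather than importing gffutils) and lists the resulting tables and columns. The variable names are illustrative only and are not part of gffutils.

```python
import sqlite3

# Two table definitions copied from the schema; the remaining tables are
# omitted to keep the sketch short.
SCHEMA_EXCERPT = """
CREATE TABLE features (
    id text, seqid text, source text, featuretype text,
    start int, end int, score text, strand text, frame text,
    attributes text, extra text, bin int,
    primary key (id)
);
CREATE TABLE relations (
    parent text, child text, level int,
    primary key (parent, child, level)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_EXCERPT)

# sqlite_master lists every table that the script created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# PRAGMA table_info returns one row per column; index 1 is the column name.
feature_columns = [row[1] for row in conn.execute("PRAGMA table_info(features)")]

print(tables)
print(feature_columns)
```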
The features table stores the primary information from each line in the
original file (and any additional information added by the user).
- id:
Primary key for features. The content of this field is determined by the user and the file format at creation time.
See Database IDs for more information about how the contents of this field are determined.
- seqid, source, featuretype, start, end, score, strand, frame:
These fields correspond exactly to the fields in the GFF/GTF lines (featuretype holds the feature type from the third column).
- attributes:
A JSON-serialized dictionary of attributes. Note that the string representation of attributes is not stored; rather, it is reconstructed as needed using the dialect.
See Dialects for what dialects are and how they are constructed.
- extra:
A JSON-serialized list of non-standard extra fields. These are sometimes added by analysis tools (e.g., BEDTools). For standard GFF/GTF, this field will be empty.
- bin:
The genomic bin, according to the UCSC binning strategy.
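To make the attributes, extra, and bin fields concrete, here is a minimal sketch of how one GFF3 line could be turned into a row for the features table. It is not gffutils's own code. The helper names (`ucsc_bin`, `feature_row`), the simplified attribute parsing, the choice to store each attribute value as a list, and the use of the standard UCSC bin offsets with 0-based half-open coordinates are all assumptions made for illustration.

```python
import json

# Standard UCSC binning constants (assumed here; gffutils may differ in detail).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # the smallest bins span 2**17 bases (128 kb)
BIN_NEXT_SHIFT = 3    # each coarser level is 8 times larger


def ucsc_bin(start, end):
    """Return the smallest bin containing the 0-based half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range %d-%d is out of bounds for this scheme" % (start, end))


def feature_row(line, primary_key):
    """Build a tuple matching the columns of the features table."""
    fields = line.rstrip("\n").split("\t")
    seqid, source, featuretype, start, end, score, strand, frame, attr_str = fields[:9]
    # Simplified GFF3 attribute parsing: "key=value" pairs separated by ";".
    attributes = {}
    for pair in attr_str.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value.split(",")
    extra = fields[9:]  # anything past the ninth column is non-standard
    start, end = int(start), int(end)
    return (
        primary_key, seqid, source, featuretype, start, end, score, strand,
        frame, json.dumps(attributes), json.dumps(extra),
        ucsc_bin(start - 1, end),  # GFF is 1-based and closed
    )


row = feature_row("chr1\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=GENE_A;Name=a", "GENE_A")
print(row)
```

Storing the bin alongside each feature lets a region query first narrow the candidates to a handful of bins before comparing start and end coordinates.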
The relations table stores the hierarchical information. It's essentially
a simple directed acyclic graph, which seems to work well for GFF/GTF files with
[relatively] simple graph structure.
- parent:
Foreign key to features.id (a gene, for example).
- child:
Foreign key to features.id (an mRNA or an exon, for example).
- level:
In graph terms, the number of edges between child and parent. In biological terms, if parent=gene and child=mRNA, then level=1. If parent=gene and child=exon, then level=2.
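The level column can be derived from direct parent/child links alone. The following sketch is an illustration, not gffutils's implementation; `relations_with_levels` and the example IDs are made up. It walks up from each child to every ancestor, recording the number of edges crossed, and then uses sqlite3 to fetch the second-level children of a gene.

```python
import sqlite3


def relations_with_levels(direct_parents):
    """Expand {child: [direct parents]} into (parent, child, level) rows.

    Assumes the links form an acyclic graph; a cycle would loop forever.
    """
    rows = set()
    for child in direct_parents:
        # Walk upward from the child; level counts the edges crossed so far.
        frontier = [(parent, 1) for parent in direct_parents[child]]
        while frontier:
            ancestor, level = frontier.pop()
            rows.add((ancestor, child, level))
            for grandparent in direct_parents.get(ancestor, []):
                frontier.append((grandparent, level + 1))
    return sorted(rows)


# Hypothetical annotation: one gene, one mRNA, two exons.
direct_parents = {
    "mRNA1": ["gene1"],
    "exon1": ["mRNA1"],
    "exon2": ["mRNA1"],
}
rows = relations_with_levels(direct_parents)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relations (parent text, child text, level int, "
    "primary key (parent, child, level))")
conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", rows)

# Exons sit two edges below the gene, so they are its level-2 children.
exons_of_gene = [r[0] for r in conn.execute(
    "SELECT child FROM relations WHERE parent = ? AND level = 2 ORDER BY child",
    ("gene1",))]
print(rows)
print(exons_of_gene)
```

Storing every ancestor/descendant pair up front means a query such as "all exons of this gene" needs a single lookup rather than a recursive traversal.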
The meta table stores extra information about the database in general.
The directives table acts as a simple list of directives (lines starting with ##) in
the original GFF file.
- directive:
String directive, without the leading ##.
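As a rough illustration of what ends up in this table, the sketch below collects directive lines from a few made-up GFF lines, assuming directives are the lines that begin with `##` (as in GFF3). The function name `extract_directives` is hypothetical, and gffutils's parser may treat edge cases differently.

```python
def extract_directives(lines):
    """Return directive strings with the leading '##' removed."""
    directives = []
    for line in lines:
        # A single '#' marks an ordinary comment, so require two.
        if line.startswith("##"):
            directives.append(line[2:].rstrip("\n"))
    return directives


# Made-up input: two directives, one comment, one feature line.
gff_lines = [
    "##gff-version 3\n",
    "##sequence-region chr1 1 5000\n",
    "# an ordinary comment\n",
    "chr1\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=GENE_A\n",
]
directives = extract_directives(gff_lines)
print(directives)
```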
When items have conflicting primary keys based on the user-provided criteria,
gffutils can autoincrement in order to get a unique, yet
reasonably meaningful, primary key. For example, if the user specified that
the “ID” attributes field for a GFF3 file should be used for primary keys, but
two lines have the same
ID="GENE_A" field, then the second line's ID will be autoincremented to keep it unique.
After database creation, the autoincrements table stores the autoincrementing information so that when features are added later, autoincrementing can start at the correct integer (rather than 0).
- base:
By default the feature type (exon, etc.), but it can also be the value of any GFF field or attribute (e.g., the seqid, or “GENE_1” in the case of multiple features with ID="GENE_1").
- n:
Current extent of autoincrementing; add 1 to this when autoincrementing next time.
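The interplay of base and n can be sketched as follows. This is an assumption-laden illustration rather than gffutils's actual code: `assign_unique_id` is a hypothetical helper, the `counters` dict stands in for the rows of the autoincrements table, and the `<base>_<n>` format of the generated key is chosen only for the example.

```python
def assign_unique_id(candidate, taken, counters):
    """Return candidate if unused, else candidate plus an incremented suffix.

    taken: set of primary keys already in use.
    counters: dict mapping base -> n, mirroring the autoincrements table.
    The "<base>_<n>" format is an assumption made for this sketch.
    """
    new_id = candidate
    while new_id in taken:
        # Add 1 to the stored extent and try the resulting key.
        counters[candidate] = counters.get(candidate, 0) + 1
        new_id = "%s_%d" % (candidate, counters[candidate])
    taken.add(new_id)
    return new_id


taken, counters = set(), {}
first = assign_unique_id("GENE_A", taken, counters)
second = assign_unique_id("GENE_A", taken, counters)
third = assign_unique_id("GENE_A", taken, counters)
print(first, second, third, counters)
```

Because the counters persist, a later call for the same base continues from the stored n instead of restarting and re-testing keys that are already taken.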