Database schema

Schema

The following is the schema used for gffutils. Feel free to skip this if you’re not familiar with SQL. An explanation of each table can be found below.

>>> import gffutils
>>> print(gffutils.constants.SCHEMA)


CREATE TABLE features (
    id text,
    seqid text,
    source text,
    featuretype text,
    start int,
    end int,
    score text,
    strand text,
    frame text,
    attributes text,
    extra text,
    bin int,
    primary key (id)
    );

CREATE TABLE relations (
    parent text,
    child text,
    level int,
    primary key (parent, child, level)
    );

CREATE TABLE meta (
    dialect text,
    version text
    );

CREATE TABLE directives (
    directive text
    );

CREATE TABLE autoincrements (
    base text,
    n int,
    primary key (base)
    );

CREATE TABLE duplicates (
    idspecid text,
    newid text,
    primary key (newid)
    );


features table

The features table stores the primary information from each line in the original file (and any additional information added by the user).

id:

Primary key for features. The content of this field is determined by the user and the file format at creation time.

See also

Database IDs has more information about how the contents of this field are determined.

seqid, source, featuretype, start, end, score, strand, frame:

These fields correspond exactly to the fields in the GFF/GTF lines.

attributes:

A JSON-serialized dictionary of attributes. Note that the string representation of attributes is not stored; rather, it is reconstructed as needed using the dialect.

See also

See Dialects for what dialects are and how they are constructed.

extra:

A JSON-serialized list of non-standard extra fields. These are sometimes added by analysis tools (e.g., BEDTools). For standard GFF/GTF, this field will be empty.

bin:

The genomic bin, according to the UCSC binning strategy.
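As a rough illustration of how one row of this table might be filled in, the sketch below builds an in-memory features table with the standard library's sqlite3 module, serializes the attributes with json, and computes the bin with the commonly published UCSC binning constants. The helper name `ucsc_bin`, the sample feature, and the conversion from 1-based GFF coordinates to a 0-based half-open range are assumptions made for this example; gffutils' own binning code may use different conventions.

```python
import json
import sqlite3

# Commonly published UCSC binning constants: the finest bins span
# 128 kb (2**17 bases) and each coarser level is 8 times (2**3) larger.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17
BIN_NEXT_SHIFT = 3


def ucsc_bin(start, end):
    """Smallest bin containing the 0-based, half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range %d-%d is out of bounds" % (start, end))


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE features (
        id text, seqid text, source text, featuretype text,
        start int, end int, score text, strand text, frame text,
        attributes text, extra text, bin int,
        primary key (id)
    )
    """
)

# Hypothetical feature; attribute values are kept as lists in this sketch.
attributes = {"ID": ["gene1"], "Name": ["GENE_A"]}
gff_start, gff_end = 1000, 5000  # 1-based, inclusive, as in a GFF line

conn.execute(
    "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "gene1", "chr1", ".", "gene", gff_start, gff_end, ".", "+", ".",
        json.dumps(attributes),             # attributes: JSON dictionary
        json.dumps([]),                     # extra: empty for standard GFF
        ucsc_bin(gff_start - 1, gff_end),   # assumed coordinate conversion
    ),
)

stored_attributes, stored_bin = conn.execute(
    "SELECT attributes, bin FROM features WHERE id = ?", ("gene1",)
).fetchone()
restored = json.loads(stored_attributes)
```

Reading the row back and passing the attributes column through `json.loads` recovers the original dictionary, which is why no attribute string needs to be stored.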

relations table

The relations table stores the hierarchical information as a simple directed acyclic graph, which works well for GFF/GTF files with relatively simple graph structure.

parent:

Foreign key to features.id – a gene, for example.

child:

Foreign key to features.id – an mRNA or an exon, for example.

level:

In graph terms, the number of edges between child and parent. In biological terms, if parent=gene and child=mRNA, then level=1. If parent=gene and child=exon, then level=2.
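The level column can be illustrated with a small in-memory table. This is a minimal sketch using sqlite3 directly, not the gffutils API; the feature IDs and the `children` helper are hypothetical. It assumes that every ancestor/descendant pair gets its own row, so a gene is linked both to its mRNA (level 1) and to that mRNA's exons (level 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE relations (
        parent text,
        child text,
        level int,
        primary key (parent, child, level)
    )
    """
)

# Hypothetical hierarchy: gene1 -> mRNA1 -> exon1, exon2
conn.executemany(
    "INSERT INTO relations VALUES (?, ?, ?)",
    [
        ("gene1", "mRNA1", 1),
        ("mRNA1", "exon1", 1),
        ("mRNA1", "exon2", 1),
        ("gene1", "exon1", 2),  # two edges away from the gene
        ("gene1", "exon2", 2),
    ],
)


def children(conn, parent, level=None):
    """Return child IDs of `parent`, optionally restricted to one level."""
    query = "SELECT child FROM relations WHERE parent = ?"
    params = [parent]
    if level is not None:
        query += " AND level = ?"
        params.append(level)
    query += " ORDER BY child"
    return [row[0] for row in conn.execute(query, params)]


direct = children(conn, "gene1", level=1)       # only the mRNA
grandchildren = children(conn, "gene1", level=2)  # only the exons
everything = children(conn, "gene1")            # all descendants
```

Storing the level explicitly means a query for all exons of a gene needs no recursive traversal, only a filter on `level`.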

meta table

This table stores extra information about the database in general.

dialect:

A JSON-serialized version of the dialect empirically determined when parsing the original file.

See also

Dialects

version:

The gffutils version used to create the database.

directives table

A table that acts as a simple list of directives (lines starting with ##) in the original GFF file.

directive:

String directive, without the leading ##.
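A minimal sketch of what ends up in this table, assuming directives are simply the lines that begin with ## with that prefix removed. The function name `extract_directives` and the sample lines are hypothetical, and this is not gffutils' actual parser.

```python
def extract_directives(lines):
    """Collect lines starting with '##', without the leading '##'."""
    directives = []
    for line in lines:
        if line.startswith("##"):
            directives.append(line[2:].rstrip("\n"))
    return directives


# Hypothetical start of a GFF3 file
gff_lines = [
    "##gff-version 3\n",
    "##sequence-region chr1 1 248956422\n",
    "chr1\t.\tgene\t1000\t5000\t.\t+\t.\tID=gene1\n",
]

directives = extract_directives(gff_lines)
```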

autoincrements table

When items have conflicting primary keys based on the user-provided criteria, gffutils can autoincrement in order to get a unique – yet reasonably meaningful – primary key. For example, if the user specified that the “ID” attributes field for a GFF3 file should be used for primary keys, but two lines have the same ID="GENE_A" field, then the second line’s ID will be autoincremented to ID="GENE_A_1".

After database creation, this table stores the autoincrementing information so that when features are added later, autoincrementing can start at the correct integer (rather than 0).

See also

Database IDs

base:

By default the feature type (gene, exon, etc.), but can also be the value of any GFF field or attribute (e.g., the seqid, or “GENE_1” in the case of multiple features with ID="GENE_1").

n:

Current extent of autoincrementing – add 1 to this when autoincrementing next time.
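The bookkeeping described above can be sketched with a plain dictionary standing in for the autoincrements table, mapping each base to its current n. The function `assign_id` and its arguments are hypothetical and only mirror the GENE_A example; gffutils' real ID handling has more options and may behave differently in edge cases (for instance, this sketch does not check whether the generated ID itself collides with an existing one).

```python
def assign_id(candidate, seen, autoincrements):
    """Return a unique primary key derived from `candidate`.

    seen           -- set of primary keys already in use
    autoincrements -- dict of base -> n, like the autoincrements table
    """
    if candidate not in seen:
        seen.add(candidate)
        return candidate
    # Conflict: add 1 to the stored counter and append it to the base.
    n = autoincrements.get(candidate, 0) + 1
    autoincrements[candidate] = n
    new_id = "%s_%d" % (candidate, n)
    seen.add(new_id)
    return new_id


seen = set()
autoincrements = {}
ids = [assign_id("GENE_A", seen, autoincrements) for _ in range(3)]
```

Because the counter is kept per base, a feature added in a later session can continue from the stored n instead of reusing a suffix that is already taken.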