Database IDs

A primary key is a unique identifier used in a database. When importing a GFF or GTF file with gffutils into a database, each unique feature in the file should have its own primary key.

Primary keys are important because they are used to retrieve information from the database using dictionary syntax. For example, the example file in the Introduction has a line like this:

chr2L   FlyBase gene    7529    9484    .       +       .       ID=FBgn0031208;Name=CG11023;

By default, the primary key for GFF3 features (like this one) is the “ID” field of the attributes. So the unique key used in the database for this feature is FBgn0031208. This means we can access the gene from the database like this:

gene = db['FBgn0031208']

Now, imagine we wanted to get the 5’ UTRs for the gene. Looking at the example file in the Introduction, each 5’ UTR does have its own ID. So we could access one of them like this:

utr = db['five_prime_UTR_FBgn0031208:1_737']

This is quite awkward to type. Plus, using this method we would have to type all the unique IDs for each of the UTRs we wanted!

In a sense, we only need one good “hook” into the database, a meaningful ID, and then we can access the other features through their parent and child relationships. That is, we could get all the 5’ UTRs for the gene without knowing their individual IDs like this:

utrs = db.children('FBgn0031208', featuretype='five_prime_UTR')

This works because 1) we have a unique ID for the gene, 2) we have unique IDs for each 5’ UTR, and 3) the relationships in the database are constructed using these unique IDs.
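
The idea of needing only one "hook" can be sketched with plain dictionaries. This is a toy model, not gffutils code: the two tables below and the ID mRNA_example are hypothetical stand-ins, and only the gene and 5’ UTR IDs come from the example above.

```python
# Toy model: every feature has a unique ID (hypothetical table).
feature_types = {
    "FBgn0031208": "gene",
    "mRNA_example": "mRNA",  # hypothetical ID, for illustration only
    "five_prime_UTR_FBgn0031208:1_737": "five_prime_UTR",
}

# Parent -> children relationships, expressed purely in terms of those IDs.
relations = {
    "FBgn0031208": ["mRNA_example"],
    "mRNA_example": ["five_prime_UTR_FBgn0031208:1_737"],
}

def children(key, featuretype=None):
    """Yield the IDs of all descendants of `key`, optionally filtered by type."""
    stack = list(relations.get(key, []))
    while stack:
        child = stack.pop()
        if featuretype is None or feature_types[child] == featuretype:
            yield child
        stack.extend(relations.get(child, []))

print(list(children("FBgn0031208", featuretype="five_prime_UTR")))
# ['five_prime_UTR_FBgn0031208:1_737']
```

Only the gene ID had to be typed; the UTR ID was found by walking the relationships.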

If your input GFF or GTF file is formatted in the canonical way, the default settings should work fine. The rest of this section details strategies for instructing gffutils to use the most meaningful primary key for your particular input file.

id_spec

The id_spec (ID specification) kwarg determines how to extract information from each line in order to construct a primary key for the feature. It can have several different forms – None, string, list, dictionary, or callable.

None:

The primary key for each feature will be an auto-incremented version of the feature type (e.g., “gene_1”, “gene_2”, etc).

string:

Use the attribute value.

For example, id_spec="ID". The primary key for each feature will be the value of the “ID” attribute. If this is not found, then an auto-incremented version of the feature type is used.

list:

Use the first available attribute value from the list.

For example, id_spec=["ID", "Name"]: the primary key for each feature will be the value of the “ID” attribute. If no “ID” attribute is found, then use the value of the “Name” field. If this is not found, then an auto-incremented version of the feature type is used.

dict:

Use different strategies according to the featuretype.

For example, id_spec={"gene": "Name", "mRNA": ["ID", "transcript_id"]}:

  • For “gene” features, the primary key will be the value of the “Name” attribute. If this is not found, then use an auto-incremented version of “gene”.

  • For “mRNA” features, if the “ID” attribute exists, then its value will be used as the primary key. If not, then use the “transcript_id” value. If this is not found, then use an auto-incremented version of “mRNA”.

  • For any other feature, the primary key will be an auto-incremented version of the feature type.

special string:

Use the value of a GFF field (one of the first 8 columns) rather than an attribute value. The field name must be surrounded by colons (:). The available field names are listed in gffutils.constants._gffkeys[:-1].

For example, id_spec=":seqid:": use the “seqid” field as the primary key.

function:

Apply a custom function (or other callable object), and use its return value as the primary key.

The function must accept a single gffutils.Feature object. It can return one of the following:

  • None, in which case the behavior is the same as id_spec=None.

  • A special string starting with autoincrement:X, which will auto-increment based on the value of X. That is, if a function returns autoincrement:chr21, then the primary key of the first feature will be chr21_1, the second will be chr21_2, and so on.

  • A string to be used as the primary key.
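
The three kinds of return value can be illustrated with a small sketch. It is a simplified model under stated assumptions, not gffutils internals: SimpleNamespace objects stand in for gffutils.Feature, the helper to_key is hypothetical and only mimics the rules listed above, and the ID "geneA" is made up.

```python
from types import SimpleNamespace

def my_id_spec(feature):
    """Example callable: return a key, an autoincrement request, or None."""
    if feature.featuretype == "gene" and "ID" in feature.attributes:
        return feature.attributes["ID"][0]       # plain string: used as the key
    if feature.featuretype == "exon":
        return "autoincrement:" + feature.seqid  # keys like chr21_1, chr21_2
    return None                                  # same as id_spec=None

def to_key(result, featuretype, counters):
    """Hypothetical helper modelling how each return value becomes a key."""
    if result is None:
        base = featuretype                       # auto-increment on the type
    elif result.startswith("autoincrement:"):
        base = result.split(":", 1)[1]           # auto-increment on X
    else:
        return result
    counters[base] = counters.get(base, 0) + 1
    return f"{base}_{counters[base]}"

# Stand-in features; attribute values are lists, one entry per value.
features = [
    SimpleNamespace(seqid="chr21", featuretype="gene", attributes={"ID": ["geneA"]}),
    SimpleNamespace(seqid="chr21", featuretype="exon", attributes={}),
    SimpleNamespace(seqid="chr21", featuretype="exon", attributes={}),
    SimpleNamespace(seqid="chr21", featuretype="mRNA", attributes={}),
]
counters = {}
keys = [to_key(my_id_spec(f), f.featuretype, counters) for f in features]
print(keys)  # ['geneA', 'chr21_1', 'chr21_2', 'mRNA_1']
```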

The default for GFF3 files is id_spec="ID". If a feature has an “ID” attribute, it will be used for the primary key. If not, then an auto-incremented key, based on the featuretype, will be used.

The default for GTF files is id_spec={"gene": "gene_id", "transcript": "transcript_id"}. Even though “gene” and “transcript” features do not exist in the original file, gffutils infers the gene and transcript boundaries (as described in GTF files), and will use this id_spec for those inferred regions.
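
To make the non-callable forms concrete, here is a minimal sketch of the lookup rules described in this section. It is a simplified model, not the gffutils implementation: features are plain dictionaries, attribute values are plain strings, the list of field names is written out by hand, and the function name resolve_key and the IDs g1 and t1 are hypothetical.

```python
from collections import Counter

# The first eight GFF columns, written out here for illustration.
GFF_FIELDS = ("seqid", "source", "featuretype", "start",
              "end", "score", "strand", "frame")

def resolve_key(feature, id_spec, counters):
    """Model the None, string, list, dict and special-string forms."""
    ftype = feature["featuretype"]
    if isinstance(id_spec, dict):
        id_spec = id_spec.get(ftype)           # None if the type is not listed
    if isinstance(id_spec, str):
        id_spec = [id_spec]
    for name in id_spec or []:
        if name.startswith(":") and name.endswith(":") and name[1:-1] in GFF_FIELDS:
            return feature[name[1:-1]]         # special string, e.g. ":seqid:"
        if name in feature["attributes"]:
            return feature["attributes"][name]  # first available attribute
    counters[ftype] += 1                        # fallback: gene_1, gene_2, ...
    return f"{ftype}_{counters[ftype]}"

GFF3_DEFAULT = "ID"
GTF_DEFAULT = {"gene": "gene_id", "transcript": "transcript_id"}

gene = {"seqid": "chr2L", "featuretype": "gene",
        "attributes": {"ID": "FBgn0031208", "Name": "CG11023"}}
exon = {"seqid": "chr2L", "featuretype": "exon",
        "attributes": {"gene_id": "g1", "transcript_id": "t1"}}

counters = Counter()
print(resolve_key(gene, GFF3_DEFAULT, counters))       # FBgn0031208
print(resolve_key(gene, ["Alias", "Name"], counters))  # CG11023
print(resolve_key(gene, ":seqid:", counters))          # chr2L
print(resolve_key(gene, None, counters))               # gene_1
print(resolve_key(exon, GTF_DEFAULT, counters))        # exon_1
```

The last call shows why exons in a GTF file get auto-incremented keys under the default: the dictionary only lists “gene” and “transcript”.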

transform

The transform kwarg is a function that accepts a single gffutils.Feature object and returns a (possibly modified) gffutils.Feature object. It is used to modify items on the fly as they are being imported into the database. It is generally used for files that don’t fit the standard GFF3 or GTF specs.

One example use case is that FlyBase GFF3 files do not have a leading “chr” in the seqid GFF field. If we wanted to add this to each feature as it is imported into the database, then we could use the following function:

def add_chr(feature):
    # Prepend "chr" to the seqid field before the feature is stored.
    feature.seqid = "chr" + feature.seqid
    return feature
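
As a sketch of how such a function behaves, the snippet below applies a slightly more defensive variant to stand-in objects. The assumptions are that SimpleNamespace stands in for gffutils.Feature and that the name add_chr_once is made up for this illustration; in real use the function would be passed via the transform kwarg.

```python
from types import SimpleNamespace

def add_chr_once(feature):
    """Prefix the seqid with "chr" unless it is already there."""
    if not feature.seqid.startswith("chr"):
        feature.seqid = "chr" + feature.seqid
    return feature  # a transform returns the (possibly modified) feature

# Stand-ins for features read from a file; not real gffutils.Feature objects.
incoming = [
    SimpleNamespace(seqid="2L", featuretype="gene"),
    SimpleNamespace(seqid="chrX", featuretype="gene"),
]
imported = [add_chr_once(f) for f in incoming]
print([f.seqid for f in imported])  # ['chr2L', 'chrX']
```

Checking for an existing prefix keeps the function safe to use on files that mix both naming styles.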

merge_strategy

This parameter specifies the behavior when two items have an identical primary key.

For example, consider the following attribute strings for two consecutive lines. Assume that id_spec="ID", in which case these two lines have the same primary key:

ID="exon_1"; Parent="transcript_1";
ID="exon_1"; Parent="transcript_2";

With merge_strategy="merge", there will be a single entry in the database for "exon_1", but the attributes will be merged and only unique values will be retained. The resulting feature will end up looking like this:

ID="exon_1"; Parent="transcript_1,transcript_2";  # db key: "exon_1"

With merge_strategy="create_unique", the second entry will have a unique, auto-incremented primary key assigned to it, and both lines will be in the database, accessible by two different keys:

ID="exon_1"; Parent="transcript_1";  # database key: "exon_1"
ID="exon_1"; Parent="transcript_2";  # database key: "exon_1_1"

Using merge_strategy="error", a gffutils.DuplicateIDError exception will be raised. This means you will have to edit the file yourself to fix the duplicated IDs.

Using merge_strategy="warning", a warning will be printed to the logger, and the feature will be skipped.
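
The four strategies can be compared side by side with a toy model. This is a simplified sketch under stated assumptions, not gffutils code: the database is a plain dictionary, attribute values are lists, the insert function is hypothetical, and DuplicateIDError is a local stand-in for the real exception class.

```python
class DuplicateIDError(Exception):
    """Local stand-in for the exception raised by gffutils."""

def insert(db, key, attributes, merge_strategy):
    """Store `attributes` under `key`; return the key used, or None if skipped."""
    if key not in db:
        db[key] = {name: list(values) for name, values in attributes.items()}
        return key
    if merge_strategy == "merge":
        for name, values in attributes.items():
            existing = db[key].setdefault(name, [])
            for value in values:
                if value not in existing:   # keep only unique values
                    existing.append(value)
        return key
    if merge_strategy == "create_unique":
        n = 1
        while f"{key}_{n}" in db:           # find the next free suffix
            n += 1
        db[f"{key}_{n}"] = {name: list(values) for name, values in attributes.items()}
        return f"{key}_{n}"
    if merge_strategy == "warning":
        print(f"warning: duplicate key {key!r}, feature skipped")
        return None
    if merge_strategy == "error":
        raise DuplicateIDError(key)
    raise ValueError(f"unknown merge_strategy: {merge_strategy!r}")

first = {"ID": ["exon_1"], "Parent": ["transcript_1"]}
second = {"ID": ["exon_1"], "Parent": ["transcript_2"]}

merged_db = {}
insert(merged_db, "exon_1", first, "merge")
insert(merged_db, "exon_1", second, "merge")
print(merged_db)
# {'exon_1': {'ID': ['exon_1'], 'Parent': ['transcript_1', 'transcript_2']}}

unique_db = {}
insert(unique_db, "exon_1", first, "create_unique")
print(insert(unique_db, "exon_1", second, "create_unique"))  # exon_1_1
```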