Database IDs

A primary key is a unique identifier used in a database. When importing a GFF or GTF file with gffutils into a database, each unique feature in the file should have its own primary key.

Primary keys are important because they are used to retrieve features from the database using dictionary syntax. For example, the example file in the Introduction has a line like this:

chr2L   FlyBase gene    7529    9484    .       +       .       ID=FBgn0031208;Name=CG11023;

By default, the primary key for GFF3 features (like this one) is the “ID” field of the attributes. So the unique key used in the database for this feature is FBgn0031208. This means we can access the gene from the database like this:

gene = db['FBgn0031208']
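To make the default rule concrete, here is a minimal standard-library sketch of how an “ID” attribute can be pulled out of a GFF3 line and used as a dictionary key. It is an illustration only, not how gffutils is implemented; the names `parse_attributes`, `primary_key`, and `toy_db` are hypothetical and do not exist in gffutils.

```python
# Illustration only: gffutils does this work internally and stores features in
# a database. All function and variable names here are hypothetical.

# The example line from above, with explicit tab separators.
line = "chr2L\tFlyBase\tgene\t7529\t9484\t.\t+\t.\tID=FBgn0031208;Name=CG11023;"


def parse_attributes(field):
    """Split a GFF3 attributes field ("key=value;key=value;") into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if pair:  # skip the empty string left by a trailing semicolon
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


def primary_key(gff3_line, id_key="ID"):
    """Return the value of `id_key` from the ninth (attributes) column."""
    columns = gff3_line.rstrip("\n").split("\t")
    return parse_attributes(columns[8]).get(id_key)


# A plain dict standing in for the database: primary key -> feature line.
toy_db = {primary_key(line): line}
gene = toy_db["FBgn0031208"]
```

The lookup on the last line mirrors the dictionary syntax shown above: whatever value ends up as the primary key is what you type to get the feature back.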

Now, imagine we wanted to get the 5’ UTRs for the gene. Looking at the example file in the Introduction, each 5’ UTR does have its own ID. So we could access one like this:

utr = db['five_prime_UTR_FBgn0031208:1_737']

This is quite awkward to type. Plus, using this method we would have to type all the unique IDs for each of the UTRs we wanted!

In a sense, we only need one good “hook” into the database, a meaningful ID, and then we can reach the other features through their parent and child relationships. That is, we could get all the 5’ UTRs for the gene without knowing their individual IDs like this:

utrs = db.children('FBgn0031208', featuretype='five_prime_UTR')

This works because 1) we have a unique ID for the gene, 2) we have unique IDs for each 5’ UTR, and 3) the relationships in the database are constructed using these unique IDs.
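The three conditions above can be seen in a toy model. The sketch below assumes a plain dictionary in place of the real database and a hypothetical intermediate transcript ID (`transcript_1`); the `children` function here is a stand-in written for this example, not the gffutils method of the same name.

```python
# Toy model only: a dict keyed by unique ID, with each feature pointing at its
# parent's ID. "transcript_1" is a hypothetical ID invented for this example.
features = {
    "FBgn0031208": {"featuretype": "gene", "parent": None},
    "transcript_1": {"featuretype": "mRNA", "parent": "FBgn0031208"},
    "five_prime_UTR_FBgn0031208:1_737": {
        "featuretype": "five_prime_UTR",
        "parent": "transcript_1",
    },
}


def children(features, parent_id, featuretype=None):
    """Yield IDs of all descendants of `parent_id`, optionally filtered by type."""
    for feature_id, feature in features.items():
        if feature["parent"] == parent_id:
            if featuretype is None or feature["featuretype"] == featuretype:
                yield feature_id
            # Recurse so that grandchildren (gene -> mRNA -> UTR) are found too.
            yield from children(features, feature_id, featuretype)


utrs = list(children(features, "FBgn0031208", featuretype="five_prime_UTR"))
```

Because every relationship is recorded in terms of IDs, a duplicated or missing ID would make the parent lookup ambiguous or impossible, which is why unique primary keys matter.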

If your input GFF or GTF file is formatted in the canonical way, the default settings should work fine. The rest of this section details strategies for instructing gffutils to use the most meaningful primary key for your particular input file.

Database IDs

`id_spec`

`transform`

`merge_strategy`
