Database IDs
A primary key is a unique identifier used in a database. When importing a GFF or GTF file with gffutils into a database, each unique feature in the file should have its own primary key.
Primary keys are important because they are used to retrieve information from the database using dictionary syntax. For example, the example file in the Introduction has a line like this:
chr2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;
By default, the primary key for GFF3 features (like this one) is the “ID” field of the attributes. So the unique key used in the database for this feature is FBgn0031208. This means we can access the gene from the database like this:
gene = db['FBgn0031208']
Now, imagine we wanted to get the 5’ UTRs for the gene. Looking at the example file in the Introduction, each 5’ UTR does have its own ID. So we could access one of them like this:
utr = db['five_prime_UTR_FBgn0031208:1_737']
This is quite awkward to type. Plus, using this method we would have to type all the unique IDs for each of the UTRs we wanted!
In a sense, we only need one good “hook” into the database by meaningful IDs, and then we can access the other features based on parents and children. That is, we could get all the 5’UTRs for the gene without knowing their individual IDs like this:
utrs = db.children('FBgn0031208', featuretype='five_prime_UTR')
This works because 1) we have a unique ID for the gene, 2) we have unique IDs for each 5’UTR, and 3) the relationships in the database are constructed using these unique IDs.
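To see why one well-chosen key is enough, here is a minimal pure-Python sketch of relationships keyed on unique IDs. It is a toy model and not how gffutils stores data: the `features` dict, the `children` helper, and the `mRNA_1` and `utr_2` keys are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Toy features keyed by primary key. The gene ID and the first UTR ID come
# from the example above; "mRNA_1" and "utr_2" are made up for this sketch.
features = {
    "FBgn0031208": {"featuretype": "gene", "parent": None},
    "mRNA_1": {"featuretype": "mRNA", "parent": "FBgn0031208"},
    "five_prime_UTR_FBgn0031208:1_737": {
        "featuretype": "five_prime_UTR",
        "parent": "mRNA_1",
    },
    "utr_2": {"featuretype": "five_prime_UTR", "parent": "mRNA_1"},
}

# Index the parent -> child relationships by primary key.
child_ids = defaultdict(list)
for key, feature in features.items():
    if feature["parent"] is not None:
        child_ids[feature["parent"]].append(key)


def children(key, featuretype=None):
    """Yield the keys of all descendants of `key`, optionally filtered by type."""
    stack = list(child_ids[key])
    while stack:
        current = stack.pop()
        if featuretype is None or features[current]["featuretype"] == featuretype:
            yield current
        stack.extend(child_ids[current])


# Only the gene's key is needed to reach every 5' UTR below it.
utr_keys = sorted(children("FBgn0031208", featuretype="five_prime_UTR"))
```

Because every relationship is recorded against a primary key, a duplicated or missing key would make the lookup ambiguous, which is why the choice of key matters.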
If your input GFF or GTF file is formatted in the canonical way, the default settings should work fine. The rest of this section details strategies for instructing gffutils to use the most meaningful primary key for your particular input file.
id_spec
The id_spec (ID specification) kwarg determines how to extract information from each line in order to construct a primary key for the feature. It can take one of several forms: None, string, list, dictionary, or callable.
- None:
  The primary key for each feature will be an auto-incremented version of the feature type (e.g., “gene_1”, “gene_2”, etc.).
- string:
  Use the attribute value.
  For example, id_spec="ID": the primary key for each feature will be the value of the “ID” attribute. If this is not found, then an auto-incremented version of the feature type is used.
- list:
  Use the first available attribute value from the list.
  For example, id_spec=["ID", "Name"]: the primary key for each feature will be the value of the “ID” attribute. If no “ID” attribute is found, then use the value of the “Name” attribute. If this is not found, then an auto-incremented version of the feature type is used.
- dict:
  Use different strategies according to the featuretype.
  For example, id_spec={"gene": "Name", "mRNA": ["ID", "transcript_id"]}:
  - For “gene” features, the primary key will be the value of the “Name” attribute. If this is not found, then use an auto-incremented version of “gene”.
  - For “mRNA” features, if the “ID” attribute exists, then its value will be the primary key. If not, then use the “transcript_id” value. If this is not found, then use an auto-incremented version of “mRNA”.
  - For any other feature, the primary key will be an auto-incremented version of the feature type.
- special string:
  Use a GFF field value (from the first 8 columns) rather than an attributes value. The field name must be surrounded by colons (":"). The options to use can be found in the list gffutils.constants._gffkeys[:-1].
  For example, id_spec=":seqid:": use the “seqid” field as the primary key.
- function:
  Apply a custom function (or other callable object), and use its return value as the primary key.
  The function must accept a single gffutils.Feature object. It can return one of the following:
  - None, in which case the behavior is the same as id_spec=None.
  - A special string starting with autoincrement:X, which will auto-increment based on the value of X. That is, if a function returns autoincrement:chr21, then the primary key of the first feature will be chr21_1, the second will be chr21_2, and so on.
  - A string to be used as the primary key.
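The rules above can be summarized as a small pure-Python model. This is an illustrative sketch of the documented behavior and not the gffutils implementation: `resolve_key`, `GFF_FIELDS`, and the plain-dict feature are hypothetical, the field names other than “seqid” are assumed to follow the usual GFF column names, and attribute values are simplified to single strings.

```python
from collections import Counter

# Assumed names for the first 8 GFF columns (only "seqid" is confirmed above).
GFF_FIELDS = ("seqid", "source", "featuretype", "start", "end",
              "score", "strand", "frame")


def resolve_key(feature, id_spec, counters):
    """Return a primary key for `feature` (a plain dict) under `id_spec`.

    `counters` is a Counter shared across calls, used for auto-incrementing.
    """
    def autoincrement(prefix):
        counters[prefix] += 1
        return f"{prefix}_{counters[prefix]}"

    spec = id_spec
    if isinstance(spec, dict):
        # dict: pick the strategy for this featuretype (None if not listed).
        spec = spec.get(feature["featuretype"])
    if callable(spec):
        result = spec(feature)
        if result is None:
            return autoincrement(feature["featuretype"])
        if result.startswith("autoincrement:"):
            return autoincrement(result.split(":", 1)[1])
        return result
    if isinstance(spec, str):
        spec = [spec]
    for name in spec or []:
        # special string such as ":seqid:" refers to a GFF field
        if name.startswith(":") and name.endswith(":") and name[1:-1] in GFF_FIELDS:
            return str(feature[name[1:-1]])
        if name in feature["attributes"]:
            return feature["attributes"][name]
    # None, or nothing matched: fall back to the feature type.
    return autoincrement(feature["featuretype"])


gene = {
    "seqid": "chr2L", "source": "FlyBase", "featuretype": "gene",
    "start": 7529, "end": 9484, "score": ".", "strand": "+", "frame": ".",
    "attributes": {"ID": "FBgn0031208", "Name": "CG11023"},
}
counters = Counter()
by_id = resolve_key(gene, "ID", counters)
by_field = resolve_key(gene, ":seqid:", counters)
fallback = resolve_key(gene, {"mRNA": "ID"}, counters)
```

Here `by_id` is taken from the “ID” attribute, `by_field` from the seqid column, and `fallback` is auto-incremented because the dict has no entry for “gene”.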
The default for GFF3 files is id_spec="ID". If a feature has an “ID” attribute, it will be used for the primary key. If not, then an auto-incremented key, based on the featuretype, will be used.
The default for GTF files is id_spec={"gene": "gene_id", "transcript": "transcript_id"}. Even though “gene” and “transcript” features do not exist in the original file, gffutils infers the gene and transcript boundaries (as described in GTF files), and will use this id_spec for those inferred regions.
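As a rough picture of what “inferred regions” means, the sketch below computes the extent of each transcript and gene from exon records and keys the results by transcript_id and gene_id. It is a simplified model under stated assumptions, not the inference gffutils actually performs: the `infer_extents` helper and the exon tuples are hypothetical.

```python
# Hypothetical exon records: (start, end, gene_id, transcript_id)
exons = [
    (100, 200, "g1", "t1"),
    (300, 400, "g1", "t1"),
    (150, 500, "g1", "t2"),
]


def infer_extents(records, key_index):
    """Map each ID found at `key_index` to the (min start, max end) of its exons."""
    extents = {}
    for record in records:
        start, end = record[0], record[1]
        key = record[key_index]
        old_start, old_end = extents.get(key, (start, end))
        extents[key] = (min(old_start, start), max(old_end, end))
    return extents


# The dict keys play the role of the primary keys for the inferred features.
transcripts = infer_extents(exons, key_index=3)
genes = infer_extents(exons, key_index=2)
```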
transform
The transform kwarg is a function that accepts a single gffutils.Feature object and returns a (possibly modified) gffutils.Feature object. It is used to modify items on the fly as they are being imported into the database. It is generally used for files that don’t fit the standard GFF3 or GTF specs.
One example use case is that FlyBase GFF3 files do not have a leading “chr” for the seqid GFF field. If we wanted to add this to each feature as it is imported into the database, then we could use the following function:
def add_chr(feature):
    # seqid is a field of the Feature object, so set it as an attribute
    feature.seqid = "chr" + feature.seqid
    return feature
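A transform is applied to every feature, so it helps if it is safe to run on input that is already in the desired form. The sketch below shows a variant that adds “chr” only when it is missing, together with a toy importer that applies it. `ToyFeature` and `import_features` are hypothetical stand-ins used to keep the example self-contained; they are not part of gffutils.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeature:
    """Hypothetical stand-in for gffutils.Feature with only the fields used here."""
    seqid: str
    featuretype: str
    attributes: dict = field(default_factory=dict)


def add_chr_once(feature):
    """Prefix the seqid with "chr" unless it is already there."""
    if not feature.seqid.startswith("chr"):
        feature.seqid = "chr" + feature.seqid
    return feature


def import_features(features, transform=None):
    """Toy importer: apply `transform` to each feature as it streams in."""
    for feature in features:
        yield transform(feature) if transform else feature


imported = list(
    import_features(
        [ToyFeature("2L", "gene"), ToyFeature("chrX", "gene")],
        transform=add_chr_once,
    )
)
seqids = [f.seqid for f in imported]
```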
merge_strategy
This parameter specifies the behavior when two items have an identical primary key.
For example, consider the following attribute strings for two consecutive lines. Assume that id_spec="ID", in which case these two lines have the same primary key:
ID="exon_1"; Parent="transcript_1";
ID="exon_1"; Parent="transcript_2";
With merge_strategy="merge", there will be a single entry in the database for "exon_1", but the attributes will be merged and only unique values will be retained. The new, edited feature will end up looking like this:
ID="exon_1"; Parent="transcript_1,transcript_2"; # db key: "exon_1"
With merge_strategy="create_unique", the second entry will have a unique, auto-incremented primary key assigned to it, and both lines will be in the database, accessible by two different keys:
ID="exon_1"; Parent="transcript_1"; # database key: "exon_1"
ID="exon_1"; Parent="transcript_2"; # database key: "exon_1_1"
With merge_strategy="error", a gffutils.DuplicateIDError exception will be raised. This means you will have to edit the file yourself to fix the duplicated IDs.
With merge_strategy="warning", a warning will be printed to the logger, and the feature will be skipped.
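The four strategies can be compared side by side with a small pure-Python model that uses a plain dict as the “database”. This is a sketch of the behavior described above and not the gffutils implementation: `add_feature`, `DuplicateKeyError`, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("merge_sketch")  # hypothetical logger name


class DuplicateKeyError(Exception):
    """Hypothetical stand-in for the duplicate-ID exception."""


def add_feature(db, key, attributes, merge_strategy="error"):
    """Insert `attributes` (name -> list of values) into `db`; return the key used."""
    if key not in db:
        db[key] = {name: list(values) for name, values in attributes.items()}
        return key
    if merge_strategy == "merge":
        # Keep one entry and retain only unique attribute values.
        for name, values in attributes.items():
            merged = db[key].setdefault(name, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        return key
    if merge_strategy == "create_unique":
        # Find the first unused auto-incremented key: key_1, key_2, ...
        n = 1
        while f"{key}_{n}" in db:
            n += 1
        new_key = f"{key}_{n}"
        db[new_key] = {name: list(values) for name, values in attributes.items()}
        return new_key
    if merge_strategy == "warning":
        logger.warning("duplicate key %r; skipping feature", key)
        return None
    if merge_strategy == "error":
        raise DuplicateKeyError(key)
    raise ValueError(f"unknown merge_strategy: {merge_strategy!r}")


first = {"ID": ["exon_1"], "Parent": ["transcript_1"]}
second = {"ID": ["exon_1"], "Parent": ["transcript_2"]}

merged_db = {}
add_feature(merged_db, "exon_1", first)
add_feature(merged_db, "exon_1", second, merge_strategy="merge")

unique_db = {}
add_feature(unique_db, "exon_1", first)
new_key = add_feature(unique_db, "exon_1", second, merge_strategy="create_unique")
```

After these calls, `merged_db` holds one "exon_1" entry with both parents, while `unique_db` holds two entries and `new_key` is "exon_1_1", matching the examples above.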