Developer’s docs

This section serves as an entry point for learning about the internals of gffutils. The GitHub repository can be found at https://github.com/daler/gffutils.

Package modules

  • create.py for creating new databases

  • parser.py for parsing GFF/GTF files and determining the dialect

  • feature.py contains the class definition for individual Feature objects

  • bins.py implements the UCSC genomic binning strategy

  • constants.py stores things like SELECT queries, GFF field names, database schema, and the default dialect

  • interface.py provides the FeatureDB class for interfacing with an existing db

General workflow

How data is read in

Three kinds of input data can be provided: a filename, an iterator of Feature objects, or a string containing GFF/GTF lines. This flexibility comes at the cost of code complexity.

Any of these three kinds of input are provided to DataIterator, which figures out the right iterator class it should delegate out to (_FileIterator, _FeatureIterator, or _UrlIterator). The iterator class must define a _custom_iter() method, which is responsible for taking input in whatever form (filename, Feature objects, string of lines, respectively) and always yielding Feature objects. It must also define a peek() method that returns the first n features. Importantly, both peek() and _custom_iter() must be written such that the act of “peeking” into the first items does not consume the underlying iterator.

These iterator classes are subclasses of _BaseIterator.
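As a rough sketch of that contract (simplified for illustration and not the actual gffutils source; SketchFileIterator and its internals are invented here, though feature_from_line() is the real line parser from feature.py), a file-backed iterator can satisfy the "peeking must not consume" requirement by opening a fresh handle on every call to _custom_iter():

import itertools
from gffutils.feature import feature_from_line

class SketchFileIterator(object):
    def __init__(self, filename):
        self.filename = filename

    def _custom_iter(self):
        # Opened fresh on every call, so nothing here depends on shared state.
        with open(self.filename) as fh:
            for line in fh:
                if line.startswith('#') or not line.strip():
                    continue
                yield feature_from_line(line)

    def peek(self, n):
        # Reads from a throwaway _custom_iter(), so later iteration is untouched.
        return list(itertools.islice(self._custom_iter(), n))

    def __iter__(self):
        for feature in self._custom_iter():
            yield feature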

Upon initialization, a _BaseIterator figures out what the dialect is. It does so by using peek() to inspect the first checklines Feature objects. Once the dialect has been detected, it is stored as the dialect attribute. The __iter__() method then iterates over _custom_iter(), attaching this dialect to every Feature it yields. This means that, no matter what form the data takes (filename, iterable of features, or a string with GFF lines), the following will print the lines correctly:

>>> for feature in DataIterator(data):
...     print(feature)

To summarize the path of data from a file: _FileIterator._custom_iter() handles reading from the file, yielding features generated from lines. When the _FileIterator is instantiated (see _BaseIterator.__init__), _FileIterator._peek() is run to get the dialect; this uses _FileIterator._custom_iter() to check the first lines. Then, whenever the _FileIterator is iterated over, _FileIterator._custom_iter() re-opens the file and _BaseIterator.__iter__() sets each feature's dialect, applies any transform if needed, and yields the feature. The features then go to the _DBCreator classes, which call self._populate_from_lines(self.iterator), where self.iterator is whatever _BaseIterator subclass was delegated to.

A dialect can be optionally provided, which will disable the automatic dialect inference. This makes it straightforward to sanitize input, or convert to a new dialect. For example, to convert from GTF to GFF dialects:

>>> for feature in DataIterator(GTF_data, dialect=GFF_dialect):
...     print(feature)

If dialect is not None, then that dialect will be used; otherwise, it will be auto-detected.

Import

While each line in a GFF or GTF file is syntactically similar (same number of fields, roughly the same attributes-string formatting), in the context of the file as a whole the two formats can be semantically very different.

For example, GTF files do not have “gene” features defined. The genomic coordinates of a gene must be inferred from the various “exon” features comprising a particular gene. For a GTF file, it’s easy to figure out which gene an exon belongs to by the “gene_id” attribute.
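To make this concrete, here is a toy sketch (the GTF lines, IDs, and variable names are invented for illustration) of inferring a gene's extent by grouping exon features on their gene_id attribute:

from gffutils.feature import feature_from_line

gtf_lines = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]

gene_extent = {}
for line in gtf_lines:
    f = feature_from_line(line)
    gene_id = f.attributes['gene_id'][0]
    start, end = gene_extent.get(gene_id, (f.start, f.end))
    gene_extent[gene_id] = (min(start, f.start), max(end, f.end))

print(gene_extent)  # {'g1': (100, 400)}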

In contrast, GFF files typically have a “Parent” attribute. For an exon, the parent is the transcript; figuring out which gene an exon belongs to requires looking at the parent of that transcript. But … the transcript may be defined many lines away in the GFF file, making it difficult to work with using a line-by-line parsing approach.

The point of gffutils is to make access to the underlying data uniform across both formats and to allow inter-conversion for use by downstream tools. It does this by creating a relational database of features and parent-child relationships. That is, both GTF and GFF files are modeled as parent-child relationships between features. This abstraction is what allows interconversion and hierarchical navigation.
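Conceptually (the real schema lives in constants.py; the rows below are only an illustrative sketch), the hierarchy from either format boils down to parent-child rows like:

relations = [
    ("gene1", "transcript1", 1),   # a transcript is a direct child of its gene
    ("gene1", "exon1", 2),         # an exon is two levels below the gene...
    ("transcript1", "exon1", 1),   # ...and a direct child of its transcript
]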

Since the formats are so different, they require different methods of creation. The create._DBCreator class abstracts out common creation tasks. The create._GFFDBCreator and create._GTFDBCreator classes take care of the format-specific routines.

_DBCreator takes care of:
  • setting up the parser

  • logic for autoincrementing and handling primary keys

  • initializing the database

  • finalizing the db after format-specific tasks are complete – things like writing version info, dialect, autoincrement info, etc.

_GFFDBCreator and _GTFDBCreator subclass _DBCreator and override the _populate_from_lines() and _update_relations() methods. Details are best left to the source code itself and the comments in those methods; it gets tricky.
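The division of labor looks roughly like the following structural sketch (class and method names here are simplified stand-ins, not the real code in create.py):

class _DBCreatorSketch(object):
    """Shared machinery: parser setup, primary keys, db init and finalization."""
    def __init__(self, iterator):
        self.iterator = iterator

    def create(self):
        self._init_tables()                       # shared: set up the schema
        self._populate_from_lines(self.iterator)  # format-specific
        self._update_relations()                  # format-specific
        self._finalize()                          # shared: version, dialect, autoincrement info

    def _init_tables(self):
        pass

    def _finalize(self):
        pass

    def _populate_from_lines(self, lines):
        raise NotImplementedError

    def _update_relations(self):
        raise NotImplementedError


class _GFFDBCreatorSketch(_DBCreatorSketch):
    def _populate_from_lines(self, lines):
        pass  # insert features; "Parent" attributes give relations directly

    def _update_relations(self):
        pass  # propagate transitive parent-child relations


class _GTFDBCreatorSketch(_DBCreatorSketch):
    def _populate_from_lines(self, lines):
        pass  # insert features, tracking gene_id/transcript_id for each exon

    def _update_relations(self):
        pass  # infer gene and transcript extents from their exons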

The create.create_db() function delegates out to the appropriate class, and all the docs for the kwargs are in this function.
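For example, a minimal call looks something like this (the paths are placeholders):

>>> import gffutils
>>> db = gffutils.create_db('annotation.gtf', dbfn='annotation.db', force=True)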

A lot of work has gone into making the import very flexible. The Database IDs, GTF files and Examples sections discuss the flexibility.

Access

Since the db creation imported the data into a uniform format, access requires only a single class, interface.FeatureDB. Most methods on this class simply perform queries on the database and return iterators of feature.Feature instances.

The Feature instances yielded from these iterators inherit the database’s dialect so that they print correctly.
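For example (the filename and feature IDs below are placeholders):

>>> import gffutils
>>> db = gffutils.FeatureDB('annotation.db')
>>> gene = db['gene1']                  # primary-key lookup via __getitem__
>>> for exon in db.children(gene, featuretype='exon'):
...     print(exon)
>>> for gene in db.features_of_type('gene'):
...     print(gene)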

Little things

Some notes that don’t fit elsewhere:

  • the database stores an autoincrement table, keeping track of the last ID used for each featuretype. This means you can update a database with more features, and if any of them are missing IDs (for, say, exons) the primary key numbering will pick up where it left off (so the next exon would get an ID like “exon_199”).

  • I really wanted to maintain round-trip invariance: importing into a db and then getting the features back out should not change them at all. That’s where the dialect comes into play – it specifies the format of the attributes string, which is the trickiest thing to get right.

  • at first, I was keeping track of the order of attributes in an OrderedDict. Benchmarking with 1M+ line files showed that this was slow. So now the attributes are stored as plain ol’ dicts, and information about the order of attributes is stored only once: in the db’s dialect. While features with different orders of attributes (on one line “gene_id” comes first; on another line “Name” comes first) will not be round-trip invariant, round-trip invariance should at least hold for most cases. I figured it was a good compromise.

  • upon getting features back from a db, the dialect is “injected” into each feature. Each Feature’s dialect can still be changed, though, for on-the-fly dialect conversion (see the sketch after this list).

  • many methods on FeatureDB share optional constraints for the underlying query – genomic region, strand, featuretype, order_by, etc. These are factored out into helpers.make_query(), which handles this type of query. I decided on this sort of minimal ORM rather than accept the overhead of something like sqlalchemy.
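As mentioned in the note on dialect injection above, here is a sketch of on-the-fly dialect conversion when pulling a feature back out of a db (the filename and feature ID are placeholders; constants.dialect is the default dialect mentioned earlier):

>>> import gffutils
>>> from gffutils import constants
>>> db = gffutils.FeatureDB('annotation.db')
>>> exon = db['exon1']
>>> exon.dialect = constants.dialect    # swap in a different dialect before printing
>>> print(exon)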