Developer's docs
================

This section serves as an entry point for learning about the internals of
:mod:`gffutils`. The github repository can be found
`here <https://github.com/daler/gffutils>`_.

Package modules
---------------

* `create.py` for creating new databases
* `parser.py` for parsing GFF/GTF files and determining the dialect
* `feature.py` contains the class definition for individual Feature objects
* `bins.py` implements the UCSC genomic binning strategy
* `constants.py` stores things like SELECT queries, GFF field names, database
  schema, and the default dialect
* `interface.py` provides the :class:`FeatureDB` class for interfacing with
  an existing db

General workflow
----------------

How data is read in
~~~~~~~~~~~~~~~~~~~

Three kinds of input data can be provided: a filename, an iterator of
:class:`Feature` objects, or a string containing GFF/GTF lines. This
flexibility comes at the cost of code complexity.

Whichever kind of input is used, it is provided to :class:`DataIterator`,
which figures out the right iterator class to delegate to
(:class:`_FileIterator`, :class:`_FeatureIterator`, or
:class:`_UrlIterator`). The iterator class must define a :meth:`_custom_iter`
method, which is responsible for taking input in whatever form (filename,
Feature objects, string of lines, respectively) and always yielding
:class:`Feature` objects. It must also define a :meth:`peek` method that
returns the first ``n`` features. Importantly, both :meth:`peek` and
:meth:`_custom_iter` must be written such that the act of "peeking" at the
first items does not consume the underlying iterator (a sketch of this
pattern appears at the end of this section).

These iterator classes are subclasses of :class:`_BaseIterator`. Upon
initialization, a :class:`_BaseIterator` figures out what the dialect is. It
does so by consuming `checklines` :class:`Feature` objects using
:meth:`peek`. At that point, the :class:`_BaseIterator` knows which dialect
we're working with and has its `dialect` attribute set. Its :meth:`__iter__`
method iterates over :meth:`_custom_iter` and adds this dialect to every
:class:`Feature` it yields. This means that, no matter what format `data` is
in (filename, iterable of features, or a string with GFF lines), the
following will print the lines correctly::

    >>> for feature in DataIterator(data):
    ...     print(feature)

To summarize the path of data from a file:
:meth:`_FileIterator._custom_iter` configures how to read from the file,
yielding features generated from lines. When instantiating the
:class:`_FileIterator` (see :meth:`_BaseIterator.__init__`),
:meth:`_FileIterator.peek` is run to get the dialect; this uses
:meth:`_FileIterator._custom_iter` to check the first lines. Then, whenever
the :class:`_FileIterator` is iterated over,
:meth:`_FileIterator._custom_iter` re-opens the file and
:meth:`_BaseIterator.__iter__` sets the dialect of each feature, applies any
transform if needed, and yields the feature. This then goes to the DBCreator
classes, which call ``self._populate_from_lines(self.iterator)``, where
``self.iterator`` is whatever ``_BaseIterator`` subclass was delegated to.

A dialect can optionally be provided, which disables the automatic dialect
inference. This makes it straightforward to sanitize input, or to convert to
a new dialect. For example, to convert from the GTF to the GFF dialect::

    >>> for feature in DataIterator(GTF_data, dialect=GFF_dialect):
    ...     print(feature)

If `dialect` is not None, then that dialect will be used; otherwise, it will
be auto-detected.
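The non-consuming peek is the subtle part of this contract. Here is a minimal
sketch of one way to satisfy it for a one-shot stream of features, where
(unlike a file) the source cannot simply be re-opened. This is illustrative
only -- ``_ToyFeatureIterator`` and its ``_infer_dialect`` method are made-up
names, not the actual :mod:`gffutils` implementation::

    import itertools

    class _ToyFeatureIterator(object):
        """Hypothetical iterator showing the peek/_custom_iter contract."""

        def __init__(self, features, checklines=10):
            self._stream = iter(features)
            # Peek before anything else so that dialect inference sees the
            # first `checklines` items without losing them.
            self.dialect = self._infer_dialect(self.peek(checklines))

        def peek(self, n):
            # Pull the first n items off the stream, then chain them back
            # onto the front so later iteration still sees them.
            first = list(itertools.islice(self._stream, n))
            self._stream = itertools.chain(first, self._stream)
            return first

        def _custom_iter(self):
            # A real file-based iterator would re-open the file here; this
            # toy version just drains the (restored) stream.
            for feature in self._stream:
                yield feature

        def _infer_dialect(self, features):
            # Placeholder; the real inference lives in the parser module.
            return {}

        def __iter__(self):
            for feature in self._custom_iter():
                # gffutils attaches the inferred dialect to each yielded
                # Feature at this point so it prints in the right format.
                yield feature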
Import
~~~~~~

While the format of each line in GFF and GTF files is *syntactically* similar
(same number of fields, roughly the same formatting of the attributes
string), in the context of the file as a whole they can be very
*semantically* different.

For example, GTF files do not have "gene" features defined. The genomic
coordinates of a gene must be inferred from the various "exon" features
comprising a particular gene. For a GTF file, it's easy to figure out which
gene an exon belongs to via the "gene_id" attribute. In contrast, GFF files
typically have a "Parent" attribute. For an exon, the parent is the
transcript; figuring out which gene an exon belongs to requires looking at
the parent of that transcript. But the transcript may be defined many lines
away in the GFF file, making it difficult to work with using a line-by-line
parsing approach.

The point of :mod:`gffutils` is to make access to the underlying data uniform
across both formats and to allow inter-conversion for use by downstream
tools. It does this by creating a relational database of features and
parent-child relationships. That is, both GTF and GFF files are modeled as
parent-child relationships between features. This abstraction is what allows
interconversion and the hierarchical navigation.

Since the formats are so different, they require different methods of
creation. The :class:`create._DBCreator` class abstracts out common creation
tasks, while the :class:`create._GFFDBCreator` and
:class:`create._GTFDBCreator` classes take care of the format-specific
routines.

:class:`_DBCreator` takes care of:

* setting up the parser
* logic for autoincrementing and handling primary keys
* initializing the database
* finalizing the db after format-specific tasks are complete -- things like
  writing version info, dialect, autoincrement info, etc.

:class:`_GFFDBCreator` and :class:`_GTFDBCreator` subclass :class:`_DBCreator`
and override the :meth:`_populate_from_lines` and :meth:`_update_relations`
methods. Details are best left to the source code itself and the comments in
those methods; it gets tricky.

The :func:`create.create_db` function delegates out to the appropriate class,
and all the docs for its kwargs are in this function.

A lot of work has gone into making the import very flexible. The
:ref:`database-ids`, :ref:`gtf` and :ref:`examples` sections discuss this
flexibility.

Access
~~~~~~

Since the db creation imported the data into a uniform format, access
requires only a single class, :class:`interface.FeatureDB`. Most methods on
this class simply perform queries on the database and return iterators of
:class:`feature.Feature` instances. The :class:`Feature` instances yielded
from these iterators inherit the database's dialect so that they print
correctly.
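To see how creation and access fit together in practice, typical usage looks
something like this (the filenames are placeholders; :func:`create_db`,
:class:`FeatureDB`, :meth:`features_of_type`, and :meth:`children` are the
public entry points)::

    >>> import gffutils

    >>> # create_db delegates to the appropriate _DBCreator subclass based
    >>> # on the detected dialect; for GTF input, gene features are inferred
    >>> # during import.
    >>> db = gffutils.create_db('annotation.gtf', dbfn='annotation.db')

    >>> # FeatureDB wraps the resulting database; navigation uses the
    >>> # parent-child relations built at creation time.
    >>> db = gffutils.FeatureDB('annotation.db')
    >>> for gene in db.features_of_type('gene'):
    ...     for exon in db.children(gene, featuretype='exon'):
    ...         print(exon)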
Little things
~~~~~~~~~~~~~

Some notes that don't fit elsewhere:

* the database stores an autoincrement table, keeping track of the last ID
  used for each featuretype. This means you can update a database with some
  more features, and if there are missing IDs (for, say, exons) the primary
  key numbering will pick up where it left off (so the next exon would have
  an ID of "exon_199" or something).
* I really wanted to maintain round-trip invariance: importing into a db and
  then getting the features back out should not change them at all. That's
  where the dialect comes into play -- it specifies the format of the
  attributes string, which is the trickiest thing to get right.
* at first, I was keeping track of the order of attributes in an OrderedDict.
  Benchmarking with 1M+ line files showed that this was slow. So now the
  attributes are stored as plain ol' dicts, and information about the order
  of attributes is stored only once: in the db's dialect. Features with
  different orders of attributes (on one line "gene_id" comes first; on
  another line "Name" comes first) will not be round-trip invariant, but
  invariance should at least hold for most cases. I figured it was a good
  compromise.
* upon getting features back from a db, the dialect is "injected" into each
  feature. Each Feature's dialect can still be changed, though, for
  on-the-fly dialect conversion.
* many methods on FeatureDB share optional constraints for the underlying
  query -- genomic region, strand, featuretype, order_by, etc. These are
  factored out into :func:`helpers.make_query`, which handles this type of
  query (a simplified sketch follows this list). I decided on this sort of
  minimal ORM rather than accept the overhead of something like sqlalchemy.
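The idea behind that minimal ORM can be sketched roughly as follows. This is
a simplified illustration of the pattern, not the actual
:func:`helpers.make_query` signature -- the real function handles more, such
as genomic-region constraints via the binning strategy::

    def make_query(featuretype=None, strand=None, order_by=None):
        # Each optional constraint contributes a WHERE clause plus a
        # placeholder argument; None means "no constraint".
        clauses, args = [], []
        if featuretype is not None:
            clauses.append('featuretype = ?')
            args.append(featuretype)
        if strand is not None:
            clauses.append('strand = ?')
            args.append(strand)
        query = 'SELECT * FROM features'
        if clauses:
            query += ' WHERE ' + ' AND '.join(clauses)
        if order_by is not None:
            # Assumes order_by was already validated against known columns.
            query += ' ORDER BY ' + order_by
        return query, args

    # For example, make_query(featuretype='exon', strand='+') returns
    # ('SELECT * FROM features WHERE featuretype = ? AND strand = ?',
    #  ['exon', '+'])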