API

gffutils

gffutils.DataIterator(data, checklines=10, transform=None, force_dialect_check=False, from_string=False, **kwargs)[source]

Iterate over features, no matter how they are provided.

Parameters:
  • data (str, iterable of Feature objs, FeatureDB) – data can be a string (filename, URL, or contents of a file, if from_string=True), any arbitrary iterable of features, or a FeatureDB (in which case its all_features() method will be called).

  • checklines (int) – Number of lines to check in order to infer a dialect.

  • transform (None or callable) – If not None, transform should accept a Feature object as its only argument and return either a (possibly modified) Feature object or a value that evaluates to False. If the return value is False, the feature will be skipped.

  • force_dialect_check (bool) – If True, check the dialect of every feature. Thorough, but can be slow.

  • from_string (bool) – If True, data should be interpreted as the contents of a file rather than the filename itself.

  • dialect (None or dict) – Provide the dialect, which will override auto-detected dialects. If provided, you should probably also use force_dialect_check=False and checklines=0 but this is not enforced.
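
The transform callable described above can be sketched in plain Python. The filtering rule (keeping only exons) and the SimpleNamespace stand-in object are illustrative assumptions, not part of gffutils:

```python
from types import SimpleNamespace

def keep_exons_only(feature):
    # Return the feature unchanged if it is an exon; returning a falsy
    # value tells DataIterator to skip the feature entirely.
    if feature.featuretype == "exon":
        return feature
    return False

# It would be passed to the iterator roughly as:
#   gffutils.DataIterator("annotation.gff3", transform=keep_exons_only)

# Quick check with stand-in objects (not real gffutils.Feature instances):
exon = SimpleNamespace(featuretype="exon")
gene = SimpleNamespace(featuretype="gene")
```

A transform can also modify features in place (e.g., rewriting attributes) before returning them; only a falsy return value causes a skip.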

exception gffutils.DuplicateIDError[source]
exception gffutils.FeatureNotFoundError(feature_id)[source]

Error to be raised when an ID is not in the database.

gffutils.create_db(data, dbfn, id_spec=None, force=False, verbose=False, checklines=10, merge_strategy='error', transform=None, gtf_transcript_key='transcript_id', gtf_gene_key='gene_id', gtf_subfeature='exon', force_gff=False, force_dialect_check=False, from_string=False, keep_order=False, text_factory=<class 'str'>, force_merge_fields=None, pragmas={'journal_mode': 'MEMORY', 'main.cache_size': 10000, 'main.page_size': 4096, 'synchronous': 'NORMAL'}, sort_attribute_values=False, dialect=None, _keep_tempfiles=False, infer_gene_extent=True, disable_infer_genes=False, disable_infer_transcripts=False, **kwargs)[source]

Create a database from a GFF or GTF file.

For more details on when and how to use the kwargs below, see the examples in the online documentation (Examples).

Parameters:
  • data (string or iterable) –

    If a string (and from_string is False), then data is the path to the original GFF or GTF file.

    If a string and from_string is True, then assume data is the actual data to use.

    Otherwise, it’s an iterable of Feature objects.

  • dbfn (string) – Path to the database that will be created. Can be the special string “:memory:” to create an in-memory database.

  • id_spec (string, list, dict, callable, or None) –

    This parameter guides what will be used as the primary key for the database, which in turn determines how you will access individual features by name from the database.

    If an id spec is not otherwise specified for a featuretype (keep reading below for how to do this), or the provided id spec is not available for a particular feature (say, exons do not have “ID” attributes even though id_spec="ID" was provided) then the default behavior is to autoincrement an ID for that featuretype. For example, if there is no id spec defined for an exon, then the ids for exons will take the form exon1, exon2, exon3, and so on. This ensures that each feature has a unique primary key in the database without requiring lots of configuration. However, if you want to be able to retrieve features based on their primary key, then it is worth the effort to provide an accurate id spec.

    If id_spec=None, then use the default behavior. The default behavior depends on the detected format (or forced format, e.g., if force_gff=True). For GFF files, the default is id_spec="ID". For GTF files, the default is id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}.

    If id_spec is a string, then look for this key in the attributes. If it exists, then use its value as the primary key, otherwise autoincrement based on the feature type. For many GFF3 files, “ID” usually works well.

    If id_spec is a list or tuple of keys, then check for each one in order, using the first one found. For GFF3, this might be modified to ["ID", "Name"], which would use the ID if it exists, otherwise the Name, otherwise autoincrement based on the feature type.

    If id_spec is a dictionary, then it is a mapping of feature types to what should be used as the ID. For example, for GTF files, {'gene': 'gene_id', 'transcript': 'transcript_id'} may be useful. The values of this dictionary can also be a list, e.g., {'gene': ['gene_id', 'geneID']}.

    If id_spec is a callable object, then it accepts a dictionary from the iterator and returns one of the following:

    • None (in which case the feature type will be auto-incremented)

    • string (which will be used as the primary key)

    • special string starting with “autoincrement:X”, where “X” is a string that will be used for auto-incrementing. For example, if “autoincrement:chr10”, then the first feature will be “chr10_1”, the second “chr10_2”, and so on.

  • force (bool) – If False (default), then raise an exception if dbfn already exists. Use force=True to overwrite any existing databases.

  • verbose (bool) –

    Report percent complete and other feedback on how the db creation is progressing.

    In order to report percent complete, the entire file needs to be read once to see how many items there are; for large files you may want to use verbose=False to avoid this.

  • checklines (int) – Number of lines to check in order to infer the dialect.

  • merge_strategy (str) –

    One of {merge, create_unique, error, warning, replace}.

    This parameter specifies the behavior when two items have an identical primary key.

    Using merge_strategy="merge", there will be a single entry in the database, but the attributes of all features with the same primary key will be merged. WARNING: this can be quite slow when used incorrectly.

    Using merge_strategy="create_unique", the first entry will use the original primary key, but the second entry will be assigned a unique, autoincremented primary key.

    Using merge_strategy="error", a gffutils.DuplicateIDError exception will be raised. This means you will have to edit the file yourself to fix the duplicated IDs.

    Using merge_strategy="warning", a warning will be printed to the logger, and the duplicate feature will be skipped.

    Using merge_strategy="replace" will replace the entire existing feature with the new feature.

  • transform (callable) – If not None, transform should accept a Feature object as its only argument and return either a (possibly modified) Feature object or a value that evaluates to False. If the return value is False, the feature will be skipped.

  • gtf_transcript_key (string) – Which attribute to use as the transcript ID for GTF files. Default is transcript_id, according to the GTF spec.

  • gtf_gene_key (string) – Which attribute to use as the gene ID for GTF files. Default is gene_id, according to the GTF spec.

  • gtf_subfeature (string) – Feature type to use as a “gene component” when inferring gene and transcript extents for GTF files. Default is exon according to the GTF spec.

  • force_gff (bool) – If True, do not do automatic format detection – only use GFF.

  • force_dialect_check (bool) – If True, the dialect will be checked for every feature (instead of just checklines features). This can be slow, but may be necessary for inconsistently-formatted input files.

  • from_string (bool) – If True, then treat data as actual data (rather than the path to a file).

  • keep_order (bool) –

    If True, all features returned from this instance will have the order of their attributes maintained. This can be turned on or off database-wide by setting the keep_order attribute or with this kwarg, or on a feature-by-feature basis by setting the keep_order attribute of an individual feature.

    Note that a single order of attributes will be used for all features. Specifically, the order will be determined by the order of attribute keys in the first checklines of the input data. See helpers._choose_dialect for more information on this.

    Default is False, since this includes a sorting step that can get time-consuming for many features.

  • infer_gene_extent (bool) – DEPRECATED in version 0.8.4. See disable_infer_transcripts and disable_infer_genes for more granular control.

  • disable_infer_transcripts (bool) –

    Only used for GTF files. By default – and according to the GTF spec – we assume that there are no transcript or gene features in the file. gffutils then infers the extent of each transcript based on its constituent exons and infers the extent of each gene based on its constituent transcripts.

    This default behavior is problematic if the input file already contains transcript or gene features (like recent GENCODE GTF files for human), since 1) the work to infer extents is unnecessary, and 2) trying to insert an inferred feature back into the database triggers gffutils’ feature-merging routines, which can get time consuming.

    The solution is to use disable_infer_transcripts=True if your GTF already has transcripts in it, and/or disable_infer_genes=True if it already has genes in it. This can result in a dramatic (100x) speedup.

    Prior to version 0.8.4, setting infer_gene_extent=False would disable both transcript and gene inference simultaneously. As of version 0.8.4, these arguments allow more granular control.

  • disable_infer_genes (bool) –

    Only used for GTF files. By default – and according to the GTF spec – we assume that there are no transcript or gene features in the file. gffutils then infers the extent of each transcript based on its constituent exons and infers the extent of each gene based on its constituent transcripts.

    This default behavior is problematic if the input file already contains transcript or gene features (like recent GENCODE GTF files for human), since 1) the work to infer extents is unnecessary, and 2) trying to insert an inferred feature back into the database triggers gffutils’ feature-merging routines, which can get time consuming.

    The solution is to use disable_infer_transcripts=True if your GTF already has transcripts in it, and/or disable_infer_genes=True if it already has genes in it. This can result in a dramatic (100x) speedup.

    Prior to version 0.8.4, setting infer_gene_extent=False would disable both transcript and gene inference simultaneously. As of version 0.8.4, these arguments allow more granular control.

  • force_merge_fields (list) – If merge_strategy="merge", then features will only be merged if their non-attribute values are identical (same chrom, source, start, stop, score, strand, phase). Using force_merge_fields, you can override this behavior to allow merges even when fields are different. This list can contain one or more of ['seqid', 'source', 'featuretype', 'score', 'strand', 'frame']. The resulting merged fields will be strings of comma-separated values. Note that 'start' and 'end' are not available, since these fields need to be integers.

  • text_factory (callable) – Text factory to use for the sqlite3 database.

  • pragmas (dict) – Dictionary of pragmas used when creating the sqlite3 database. See http://www.sqlite.org/pragma.html for a list of available pragmas. The defaults are stored in constants.default_pragmas, which can be used as a template for supplying a custom dictionary.

  • sort_attribute_values (bool) – All features returned from the database will have their attribute values sorted. Typically this is only useful for testing, since this can get time-consuming for large numbers of features.

  • _keep_tempfiles (bool or string) – False by default to clean up intermediate tempfiles created during GTF import. If True, then keep these tempfiles for testing or debugging. If a string, then keep the tempfile for testing, but also use the string as the suffix of the tempfile. This can be useful for testing in parallel environments.

Return type:

New FeatureDB object.
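
The callable form of id_spec described above can be sketched in plain Python. The dictionary keys used here ('attributes', 'seqid') mirror the parsed-feature fields, but the exact dictionary passed to the callable is an assumption for illustration:

```python
def my_id_spec(d):
    # Prefer the GFF3 "ID" attribute if present. Attribute values are
    # assumed to be lists of strings, so take the first one as the key.
    attrs = d.get("attributes", {})
    if attrs.get("ID"):
        return attrs["ID"][0]
    # Otherwise, ask gffutils to autoincrement per chromosome, so
    # features end up named like "chr10_1", "chr10_2", and so on:
    return "autoincrement:" + d["seqid"]

# It would be passed as:
#   gffutils.create_db("annotation.gff3", "annotation.db", id_spec=my_id_spec)
```

Returning None from the callable falls back to autoincrementing by feature type, as described above.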

gffutils.example_filename(fn)[source]

Return the full path of a data file that ships with gffutils.

Create a database

create_db

Create a database from a GFF or GTF file.

Interact with a database

First, connect to an existing database:

FeatureDB

Then, use the methods of FeatureDB to interact:

FeatureDB.children

Return children of feature id.

FeatureDB.parents

Return parents of feature id.

FeatureDB.schema

Returns the database schema as a string.

FeatureDB.features_of_type

Returns an iterator of gffutils.Feature objects.

FeatureDB.count_features_of_type

Simple count of features.

FeatureDB.all_features

Iterate through the entire database.

FeatureDB.execute

Execute arbitrary queries on the db.

FeatureDB.featuretypes

Iterate over feature types found in the database.

FeatureDB.region

Return features within specified genomic coordinates.

FeatureDB.iter_by_parent_childs

For each parent of type featuretype, yield a list L of that parent and all of its children ([parent] + list(children)).
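
As a sketch of how these methods fit together: the two-line GFF3 content below is invented for illustration, and the commented calls use only methods listed above.

```python
# Two hypothetical GFF3 features with a Parent relationship:
gff = (
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
    "chr1\t.\texon\t1\t50\t.\t+\t.\tID=exon1;Parent=gene1\n"
)

# With gffutils installed, an in-memory database could be built from this
# string and queried along these lines:
#   import gffutils
#   db = gffutils.create_db(gff, ":memory:", from_string=True)
#   for child in db.children("gene1", featuretype="exon"):
#       print(child.id)
#   db.count_features_of_type("exon")
```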

Modify a FeatureDB:

FeatureDB.update

Update the on-disk database with features in data.

FeatureDB.delete

Delete features from database.

FeatureDB.add_relation

Manually add relations to the database.

FeatureDB.set_pragmas

Set pragmas for the current database connection.

Operate on features:

FeatureDB.interfeatures

Construct new features representing the space between features.

FeatureDB.children_bp

Total bp of all children of a featuretype.

FeatureDB.merge

Merge features matching criteria together.

FeatureDB.create_introns

Create introns from existing annotations.

FeatureDB.bed12

Converts feature into a BED12 format.

Feature objects

Most FeatureDB methods return Feature objects:

Feature

You can extract the sequence for a feature:

Feature.sequence

Retrieves the sequence of this feature as a string.

Creating a Feature object:

feature_from_line

Given a line from a GFF file, return a Feature object.
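
A minimal sketch of its use; the GFF3 line below is invented for illustration:

```python
# A single hypothetical GFF3 line (nine tab-separated fields):
line = "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1;Name=myGene"

# With gffutils installed:
#   from gffutils import feature_from_line
#   f = feature_from_line(line)
#   f.id, f.start, f.end   # primary fields parsed from the line
```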

Integration with other tools

biopython_integration.to_seqfeature

Converts a gffutils.Feature object to a Bio.SeqFeature object.

biopython_integration.from_seqfeature

Converts a Bio.SeqFeature object to a gffutils.Feature object.

pybedtools_integration.tsses

Create 1-bp transcription start sites for all transcripts in the database and return as a sorted pybedtools.BedTool object pointing to a temporary file.

pybedtools_integration.to_bedtool

Convert any iterator into a pybedtools.BedTool object.

Utilities

helpers.asinterval

Converts a gffutils.Feature to a pybedtools.Interval.

helpers.merge_attributes

Merges two attribute dictionaries into a single dictionary.
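
The behavior of helpers.merge_attributes can be sketched in plain Python. The union-and-deduplicate rule shown here is an approximation for illustration; the exact ordering of merged values is an assumption:

```python
def merge_attrs(a, b):
    # Combine two attribute dicts (each mapping key -> list of values),
    # keeping every key and de-duplicating shared values.
    merged = {}
    for key in set(a) | set(b):
        merged[key] = sorted(set(a.get(key, [])) | set(b.get(key, [])))
    return merged
```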

helpers.sanitize_gff_db

Sanitize given GFF db.

helpers.annotate_gff_db

Annotate a GFF file by cross-referencing it with another GFF file, e.g. one containing gene models.

helpers.infer_dialect

Infer the dialect based on the attributes.

helpers.example_filename

Return the full path of a data file that ships with gffutils.

inspect.inspect

Inspect a GFF or GTF data source.