gffutils.create.create_db

gffutils.create.create_db(data, dbfn, id_spec=None, force=False, verbose=False, checklines=10, merge_strategy='error', transform=None, gtf_transcript_key='transcript_id', gtf_gene_key='gene_id', gtf_subfeature='exon', force_gff=False, force_dialect_check=False, from_string=False, keep_order=False, text_factory=<class 'str'>, force_merge_fields=None, pragmas={'journal_mode': 'MEMORY', 'main.cache_size': 10000, 'main.page_size': 4096, 'synchronous': 'NORMAL'}, sort_attribute_values=False, dialect=None, _keep_tempfiles=False, infer_gene_extent=True, disable_infer_genes=False, disable_infer_transcripts=False, **kwargs)

Create a database from a GFF or GTF file.

For more details on when and how to use the kwargs below, see the examples in the online documentation (Examples).

Parameters:
  • data (string or iterable) –

    If a string (and from_string is False), then data is the path to the original GFF or GTF file.

    If a string and from_string is True, then assume data is the actual data to use.

    Otherwise, it’s an iterable of Feature objects.

  • dbfn (string) – Path to the database that will be created. Can be the special string “:memory:” to create an in-memory database.
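
The data, dbfn, and from_string options described above can be combined to build a throwaway database without touching the filesystem. The following is a minimal sketch, not an official example: the three-line GFF3 text and the helper name build_in_memory_db are made up for illustration, and gffutils is imported inside the helper only so that the snippet loads even where the package is not installed.

```python
# Tab-separated GFF3 records (nine columns each); the content is invented
# purely for illustration.
GFF3_LINES = [
    ["chr1", "example", "gene", "1", "100", ".", "+", ".", "ID=gene1"],
    ["chr1", "example", "mRNA", "1", "100", ".", "+", ".", "ID=mrna1;Parent=gene1"],
    ["chr1", "example", "exon", "1", "50", ".", "+", ".", "Parent=mrna1"],
]
GFF3_TEXT = "\n".join("\t".join(fields) for fields in GFF3_LINES) + "\n"


def build_in_memory_db(text=GFF3_TEXT):
    """Create an in-memory database from GFF text (hypothetical helper)."""
    import gffutils  # third-party package documented on this page

    # from_string=True tells create_db that `text` is the data itself,
    # not a path; ":memory:" keeps the SQLite database in RAM.
    return gffutils.create_db(text, ":memory:", from_string=True)
```

Calling build_in_memory_db() should return a FeatureDB in which the gene can be looked up by its primary key, for example db["gene1"]. The exon has no ID attribute, so it receives an autoincremented key as described under id_spec.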

  • id_spec (string, list, dict, callable, or None) –

    This parameter guides what will be used as the primary key for the database, which in turn determines how you will access individual features by name from the database.

    If an id spec is not otherwise specified for a featuretype (keep reading below for how to do this), or the provided id spec is not available for a particular feature (say, exons do not have “ID” attributes even though id_spec="ID" was provided) then the default behavior is to autoincrement an ID for that featuretype. For example, if there is no id spec defined for an exon, then the ids for exons will take the form exon1, exon2, exon3, and so on. This ensures that each feature has a unique primary key in the database without requiring lots of configuration. However, if you want to be able to retrieve features based on their primary key, then it is worth the effort to provide an accurate id spec.

    If id_spec=None, then use the default behavior. The default behavior depends on the detected format (or forced format, e.g., if force_gff=True). For GFF files, the default is id_spec="ID". For GTF files, the default is id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}.

    If id_spec is a string, then look for this key in the attributes. If it exists, then use its value as the primary key, otherwise autoincrement based on the feature type. For many GFF3 files, “ID” usually works well.

    If id_spec is a list or tuple of keys, then check for each one in order, using the first one found. For GFF3, this might be modified to ["ID", "Name"], which would use the ID if it exists, otherwise the Name, otherwise autoincrement based on the feature type.

    If id_spec is a dictionary, then it is a mapping of feature types to what should be used as the ID. For example, for GTF files, {'gene': 'gene_id', 'transcript': 'transcript_id'} may be useful. The values of this dictionary can also be a list, e.g., {'gene': ['gene_id', 'geneID']}.

    If id_spec is a callable object, then it accepts a dictionary from the iterator and returns one of the following:

    • None (in which case the feature type will be auto-incremented)

    • string (which will be used as the primary key)

    • a special string of the form “autoincrement:X”, where “X” is a string that will be used for auto-incrementing. For example, if the callable returns “autoincrement:chr10”, then the first feature will be “chr10_1”, the second “chr10_2”, and so on.
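
The forms of id_spec described above can be written out as follows. This is a sketch under assumptions: the constant names and the function id_spec_callable are hypothetical, the callable assumes the object it receives supports key lookup of attributes (as a dictionary does) and that attribute values are lists of strings, and the "unnamed" prefix is an arbitrary choice.

```python
# String, list, and dictionary forms, using the values from the text above.
ID_SPEC_STRING = "ID"
ID_SPEC_LIST = ["ID", "Name"]
ID_SPEC_DICT = {"gene": "gene_id", "transcript": "transcript_id"}


def id_spec_callable(feature):
    """Use the first Name value as the primary key when one is present.

    Otherwise return an "autoincrement:X" string so that such features
    share one counter with the prefix "unnamed".
    """
    try:
        names = feature["Name"]
    except KeyError:
        names = []
    if names:
        return names[0]
    return "autoincrement:unnamed"
```

Any one of these objects would be passed as id_spec=... to create_db. Returning None from the callable instead of the "autoincrement:unnamed" string would fall back to autoincrementing by feature type.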

  • force (bool) – If False (default), then raise an exception if dbfn already exists. Use force=True to overwrite any existing databases.

  • verbose (bool) –

    Report percent complete and other feedback on how the db creation is progressing.

    In order to report percent complete, the entire file needs to be read once to see how many items there are; for large files you may want to use verbose=False to avoid this.

  • checklines (int) – Number of lines to check when detecting the dialect.

  • merge_strategy (str) –

    One of {merge, create_unique, error, warning, replace}.

    This parameter specifies the behavior when two items have an identical primary key.

    Using merge_strategy="merge", there will be a single entry in the database, but the attributes of all features with the same primary key will be merged. WARNING: this can be quite slow when incorrectly used.

    Using merge_strategy="create_unique", the first entry will use the original primary key, but the second entry will have a unique, autoincremented primary key assigned to it.

    Using merge_strategy="error", a gffutils.DuplicateID exception will be raised. This means you will have to edit the file yourself to fix the duplicated IDs.

    Using merge_strategy="warning", a warning will be printed to the logger, and the duplicate feature will be skipped.

    Using merge_strategy="replace" will replace the entire existing feature with the new feature.
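
To make the five strategies concrete, here is a toy model that applies them to a plain dictionary keyed by primary key. It is not gffutils code and makes no claim about how the library implements merging: the function add_feature, the local DuplicateID class, and the "_1", "_2" suffix used for create_unique are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("merge_strategy_sketch")


class DuplicateID(Exception):
    """Local stand-in for gffutils.DuplicateID in this toy model."""


def add_feature(store, key, attributes, merge_strategy="error"):
    """Insert attributes under key, resolving collisions per merge_strategy.

    Returns the key the feature was stored under, or None if it was skipped.
    """
    copy = {name: list(values) for name, values in attributes.items()}
    if key not in store:
        store[key] = copy
        return key
    if merge_strategy == "error":
        raise DuplicateID(key)
    if merge_strategy == "warning":
        logger.warning("duplicate primary key %r; skipping feature", key)
        return None
    if merge_strategy == "replace":
        store[key] = copy
        return key
    if merge_strategy == "create_unique":
        n = 1
        while f"{key}_{n}" in store:  # suffix scheme is an assumption
            n += 1
        store[f"{key}_{n}"] = copy
        return f"{key}_{n}"
    if merge_strategy == "merge":
        merged = store[key]
        for name, values in copy.items():
            existing = merged.setdefault(name, [])
            existing.extend(v for v in values if v not in existing)
        return key
    raise ValueError(f"unknown merge_strategy: {merge_strategy!r}")
```

The model only covers attributes. It ignores the other feature fields, which is where force_merge_fields comes in.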

  • transform (callable) – If not None, transform should accept a Feature object as its only argument and return either a (possibly modified) Feature object or a value that evaluates to False. If the return value is False, the feature will be skipped.
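
A transform callable might look like the sketch below. The function name, the set of skipped feature types, and the "chr" prefixing rule are illustrative assumptions; the only behavior taken from the description above is that the callable receives a feature and returns either a feature or a false value. A SimpleNamespace stands in for a Feature object so that the snippet runs without gffutils.

```python
from types import SimpleNamespace

# Feature types to drop entirely (an arbitrary choice for this example).
SKIPPED_FEATURETYPES = {"chromosome", "region"}


def clean_feature(feature):
    """Skip whole-sequence features and normalize sequence names."""
    if feature.featuretype in SKIPPED_FEATURETYPES:
        return False  # a false value tells create_db to skip the feature
    if not feature.seqid.startswith("chr"):
        feature.seqid = "chr" + feature.seqid
    return feature


# Stand-in objects with the two attributes the function relies on.
exon = SimpleNamespace(featuretype="exon", seqid="2L")
region = SimpleNamespace(featuretype="region", seqid="2L")
kept = clean_feature(exon)
dropped = clean_feature(region)
```

The function would be passed as transform=clean_feature. It assumes the features expose featuretype and seqid attributes.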

  • gtf_transcript_key (string) – Which attribute to use as the transcript ID for GTF files. Default is transcript_id according to the GTF spec.

  • gtf_gene_key (string) – Which attribute to use as the gene ID for GTF files. Default is gene_id according to the GTF spec.

  • gtf_subfeature (string) – Feature type to use as a “gene component” when inferring gene and transcript extents for GTF files. Default is exon according to the GTF spec.

  • force_gff (bool) – If True, do not do automatic format detection – only use GFF.

  • force_dialect_check (bool) – If True, the dialect will be checked for every feature (instead of just checklines features). This can be slow, but may be necessary for inconsistently-formatted input files.

  • from_string (bool) – If True, then treat data as actual data (rather than the path to a file).

  • keep_order (bool) –

    If True, all features returned from this instance will have the order of their attributes maintained. This can be turned on or off database-wide by setting the keep_order attribute or with this kwarg, or on a feature-by-feature basis by setting the keep_order attribute of an individual feature.

    Note that a single order of attributes will be used for all features. Specifically, the order will be determined by the order of attribute keys in the first checklines of the input data. See helpers._choose_dialect for more information on this.

    Default is False, since this includes a sorting step that can get time-consuming for many features.

  • infer_gene_extent (bool) – DEPRECATED in version 0.8.4. See disable_infer_transcripts and disable_infer_genes for more granular control.

  • disable_infer_transcripts (bool) –

    Only used for GTF files. By default – and according to the GTF spec – we assume that there are no transcript or gene features in the file. gffutils then infers the extent of each transcript based on its constituent exons and infers the extent of each gene based on its constituent transcripts.

    This default behavior is problematic if the input file already contains transcript or gene features (like recent GENCODE GTF files for human), since 1) the work to infer extents is unnecessary, and 2) trying to insert an inferred feature back into the database triggers gffutils’ feature-merging routines, which can get time consuming.

    The solution is to use disable_infer_transcripts=True if your GTF already has transcripts in it, and/or disable_infer_genes=True if it already has genes in it. This can result in dramatic (100x) speedup.

    Prior to version 0.8.4, setting infer_gene_extent=False would disable both transcript and gene inference simultaneously. As of version 0.8.4, these arguments allow more granular control.

  • disable_infer_genes (bool) –

    Only used for GTF files. By default – and according to the GTF spec – we assume that there are no transcript or gene features in the file. gffutils then infers the extent of each transcript based on its constituent exons and infers the extent of each gene based on its constituent transcripts.

    This default behavior is problematic if the input file already contains transcript or gene features (like recent GENCODE GTF files for human), since 1) the work to infer extents is unnecessary, and 2) trying to insert an inferred feature back into the database triggers gffutils’ feature-merging routines, which can get time consuming.

    The solution is to use disable_infer_transcripts=True if your GTF already has transcripts in it, and/or disable_infer_genes=True if it already has genes in it. This can result in dramatic (100x) speedup.

    Prior to version 0.8.4, setting infer_gene_extent=False would disable both transcript and gene inference simultaneously. As of version 0.8.4, these arguments allow more granular control.
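
One way to choose these two flags is to look at which feature types the GTF already contains. The helper below is a hypothetical sketch and not part of gffutils; it reads the third tab-separated column of each record and sets each flag to True only when the corresponding feature type is present.

```python
def inference_flags(gtf_lines):
    """Return create_db keyword arguments based on feature types in a GTF.

    gtf_lines is any iterable of text lines, such as an open file.
    """
    featuretypes = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            featuretypes.add(fields[2])
    return {
        "disable_infer_genes": "gene" in featuretypes,
        "disable_infer_transcripts": "transcript" in featuretypes,
    }
```

The result can be unpacked into the call, for example create_db(path, dbfn, **inference_flags(open(path))). Scanning the whole file costs one extra pass over the input, so for very large files it may be simpler to set the flags by hand.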

  • force_merge_fields (list) – If merge_strategy="merge", then features will only be merged if their non-attribute values are identical (same chrom, source, start, stop, score, strand, phase). Using force_merge_fields, you can override this behavior to allow merges even when fields are different. This list can contain one or more of ['seqid', 'source', 'featuretype', 'score', 'strand', 'frame']. The resulting merged fields will be strings of comma-separated values. Note that 'start' and 'end' are not available, since these fields need to be integers.
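
Because only some field names are accepted, it can help to check a candidate list before starting a long import. The helper below is a hypothetical convenience, not a gffutils function; its allowed names are copied from the list above.

```python
# Field names that the description above says may be force-merged.
ALLOWED_MERGE_FIELDS = ("seqid", "source", "featuretype", "score", "strand", "frame")


def check_force_merge_fields(fields):
    """Return fields as a list, or raise ValueError on unsupported names."""
    unsupported = [name for name in fields if name not in ALLOWED_MERGE_FIELDS]
    if unsupported:
        raise ValueError(
            "cannot force-merge fields: " + ", ".join(unsupported)
        )
    return list(fields)
```

The checked list would then be passed as force_merge_fields=... together with merge_strategy="merge".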

  • text_factory (callable) – Text factory to use for the sqlite3 database.

  • pragmas (dict) – Dictionary of pragmas used when creating the sqlite3 database. See http://www.sqlite.org/pragma.html for a list of available pragmas. The defaults are stored in constants.default_pragmas, which can be used as a template for supplying a custom dictionary.
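
A custom pragma dictionary can be derived from the defaults rather than written from scratch. In this sketch the defaults are copied by hand from the signature at the top of this page so that the snippet runs on its own; in real use constants.default_pragmas would serve as the template. The helper name and the overridden values are arbitrary examples.

```python
# Defaults as shown in the create_db signature above.
DEFAULT_PRAGMAS = {
    "journal_mode": "MEMORY",
    "main.cache_size": 10000,
    "main.page_size": 4096,
    "synchronous": "NORMAL",
}


def pragmas_with(overrides):
    """Return a copy of the defaults with the given pragmas replaced."""
    pragmas = dict(DEFAULT_PRAGMAS)  # copy so the template is not mutated
    pragmas.update(overrides)
    return pragmas


# Example: a larger page cache and no fsync waits (illustrative values).
custom_pragmas = pragmas_with({"main.cache_size": 50000, "synchronous": "OFF"})
```

The resulting dictionary would be passed as pragmas=custom_pragmas. A plain dict argument is used instead of keyword arguments because keys such as "main.cache_size" are not valid Python identifiers.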

  • sort_attribute_values (bool) – All features returned from the database will have their attribute values sorted. Typically this is only useful for testing, since this can get time-consuming for large numbers of features.

  • _keep_tempfiles (bool or string) – False by default to clean up intermediate tempfiles created during GTF import. If True, then keep these tempfiles for testing or debugging. If string, then keep the tempfiles for testing, but also use the string as the suffix for the tempfile. This can be useful for testing in parallel environments.

Return type:

New FeatureDB object.