.. _database-ids:

Database IDs
============
A primary key is a unique identifier used in a database.  When
importing a GFF or GTF file with :mod:`gffutils` into a database, each unique
feature in the file should have its own primary key.

Primary keys are important because they are used to retrieve information from
the database using dictionary syntax.  For example, in the :ref:`introduction`,
the example file has a line like this::

    chr2L   FlyBase gene    7529    9484    .       +       .       ID=FBgn0031208;Name=CG11023;

By default, the primary key for GFF3 features (like this one) is the "ID" field
of the attributes.  So the unique key used in the database for this feature is
`FBgn0031208`.  This means we can access the gene from the database like this::

    gene = db['FBgn0031208']

Now, imagine we wanted to get the 5' UTRs for the gene.  Looking at the
example file in the :ref:`introduction`, each 5' UTR does have its own ID,
so we could access one like this::

    utr = db['five_prime_UTR_FBgn0031208:1_737']

This is quite awkward to type.  Plus, using this method we would have to type
all the unique IDs for each of the UTRs we wanted!

In a sense, we only need one good "hook" into the database by meaningful IDs,
and then we can access the other features based on parents and children.  That
is, we could get *all* the 5'UTRs for the gene without knowing their individual
IDs like this::

    utrs = db.children('FBgn0031208', featuretype='five_prime_UTR')

This works because 1) we have a unique ID for the gene, 2) we have unique IDs
for each 5'UTR, and 3) the relationships in the database are constructed using
these unique IDs.
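
The three conditions above can be illustrated with a toy model.  This is only a
sketch of the idea and not how :mod:`gffutils` stores data internally; the
`features` and `relations` structures, the `children` helper, and the IDs
`transcript_A` and `utr_B` are hypothetical stand-ins.

```python
# Toy model: features keyed by primary key, plus (parent, child) relations.
# "transcript_A" and "utr_B" are made-up IDs used only for illustration.
features = {
    "FBgn0031208": {"featuretype": "gene"},
    "transcript_A": {"featuretype": "mRNA"},
    "five_prime_UTR_FBgn0031208:1_737": {"featuretype": "five_prime_UTR"},
    "utr_B": {"featuretype": "five_prime_UTR"},
}

relations = [
    ("FBgn0031208", "transcript_A"),
    ("transcript_A", "five_prime_UTR_FBgn0031208:1_737"),
    ("transcript_A", "utr_B"),
]


def children(parent_id, featuretype=None):
    """Return the sorted primary keys of all descendants of parent_id."""
    found = []
    stack = [parent_id]
    while stack:
        current = stack.pop()
        for parent, child in relations:
            if parent == current:
                stack.append(child)
                if featuretype is None or features[child]["featuretype"] == featuretype:
                    found.append(child)
    return sorted(found)


# Only the gene's key is needed to reach every 5' UTR beneath it.
utr_ids = children("FBgn0031208", featuretype="five_prime_UTR")
```

Because every relation refers to primary keys, a single known key (the gene's)
is enough to reach all of its descendants.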

If your input GFF or GTF file is formatted in the canonical way, the default
settings should work fine.  The rest of this section details strategies for
instructing :mod:`gffutils` to use the most meaningful primary key for your
particular input file.


.. _id_spec:

`id_spec`
---------


.. seealso::

    Examples that show the use of `id_spec`:

    * :ref:`F3-unique-3.v2.gff` (uses `id_spec=":seqid:"`)
    * :ref:`jgi_gff2.txt`
    * :ref:`ncbi_gff3.txt`
    * :ref:`wormbase_gff2.txt`

The `id_spec` (ID specification) kwarg determines how to extract information
from each line in order to construct a primary key for the feature.  It can
have several different forms -- None, string, list, dictionary, or callable.

:None:
    The primary key for each feature will be an auto-incremented version of the
    feature type (e.g., "gene_1", "gene_2", etc).

:string:
    Use the attribute value.

    For example, `id_spec="ID"`. The primary key for each feature will be the
    value of the "ID" attribute.  If this is not found, then an
    auto-incremented version of the feature type is used.

:list:

    Use the first available attribute value from the list.

    For example, `id_spec=["ID", "Name"]`: the primary key for each feature
    will be the value of the "ID" attribute.  If no "ID" attribute is found,
    then use the value of the "Name" field.  If this is not found, then an
    auto-incremented version of the feature type is used.

:dict:

    Use different strategies according to the featuretype.

    For example, `id_spec={"gene": "Name", "mRNA": ["ID", "transcript_id"]}`:

    * For "gene" features, the primary key will be the value of the "Name" attribute.
      If this is not found, then use an auto-incremented version of "gene".
    * For "mRNA" features, the primary key will be the value of the "ID"
      attribute if it exists.  If not, then use the "transcript_id" value.
      If this is not found either, then use an auto-incremented version of
      "mRNA".
    * For any other feature, the primary key will be an auto-incremented
      version of the feature type.

:special string:

    Use a GFF field value (from the first 8 columns) rather than an attributes
    value.  Must be surrounded by `:`.  The options to use can be found in the
    list :obj:`gffutils.constants._gffkeys` [:-1].

    For example, `id_spec=":seqid:"`:  use the "seqid" field as the primary
    key.


:function:

    Apply a custom function (or other callable object), and use its return
    value as the primary key.

    The function must accept a single :class:`gffutils.Feature` object.  It can
    return one of the following:

    * None, in which case the behavior is the same as `id_spec=None`.
    * A special string of the form `autoincrement:X`, which will
      auto-increment based on the value of `X`.  That is, if a function returns
      `autoincrement:chr21`, then the primary key of the first feature will be
      `chr21_1`, the second will be `chr21_2`, and so on.
    * A string to be used as the primary key.
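
As a rough sketch of the three return values listed above, a callable might
look like the following.  The function name `key_for` is hypothetical, and the
`SimpleNamespace` objects are stand-ins for :class:`gffutils.Feature`; the
sketch assumes a feature exposes `featuretype`, `seqid`, and a dict-like
`attributes` whose values are lists.

```python
from types import SimpleNamespace


def key_for(feature):
    """Hypothetical id_spec callable covering the three documented returns."""
    if "ID" in feature.attributes:
        # Plain string: used directly as the primary key.
        # Assumption: attribute values are lists, so take the first item.
        return feature.attributes["ID"][0]
    if feature.featuretype == "exon":
        # Special string: requests keys like "chr21_1", "chr21_2", ...
        return "autoincrement:" + feature.seqid
    # None: fall back to the same behavior as id_spec=None.
    return None


# Stand-in features for illustration only (not real gffutils objects).
gene = SimpleNamespace(featuretype="gene", seqid="chr21", attributes={"ID": ["g1"]})
exon = SimpleNamespace(featuretype="exon", seqid="chr21", attributes={})
cds = SimpleNamespace(featuretype="CDS", seqid="chr21", attributes={})

results = [key_for(gene), key_for(exon), key_for(cds)]
```

Such a function would then be passed as `id_spec=key_for` when creating the
database.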


The default for GFF3 files is `id_spec="ID"`.  If a feature has an "ID"
attribute, it will be used for the primary key.  If not, then an
auto-incremented key, based on the featuretype, will be used.

The default for GTF files is `id_spec={"gene": "gene_id", "transcript":
"transcript_id"}`.  Even though "gene" and "transcript" features do not exist
in the original file, :mod:`gffutils` infers the gene and transcript boundaries
(as described in :ref:`gtf`), and will use this `id_spec` for those inferred
regions.
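
The rules described in this section can be summarized in a toy resolver.  This
is a sketch written from the descriptions above, not the code :mod:`gffutils`
actually runs; `resolve_key`, `autoincrement`, and `GFF_FIELDS` are
hypothetical names, and the features are simple stand-in objects whose
attribute values are assumed to be lists.

```python
from collections import Counter
from types import SimpleNamespace

# Assumed names of the first eight GFF columns as exposed on a feature.
GFF_FIELDS = ("seqid", "source", "featuretype", "start", "end",
              "score", "strand", "frame")


def autoincrement(prefix, counters):
    """Return prefix_1, prefix_2, ... using a shared counter."""
    counters[prefix] += 1
    return "%s_%d" % (prefix, counters[prefix])


def resolve_key(feature, id_spec, counters):
    """Toy version of the documented id_spec rules (not gffutils code)."""
    if callable(id_spec):
        result = id_spec(feature)
        if result is None:
            return autoincrement(feature.featuretype, counters)
        if result.startswith("autoincrement:"):
            return autoincrement(result.split(":", 1)[1], counters)
        return result
    if isinstance(id_spec, dict):
        # Feature types missing from the dict behave like id_spec=None.
        id_spec = id_spec.get(feature.featuretype)
    if isinstance(id_spec, str):
        id_spec = [id_spec]
    for spec in id_spec or []:
        if len(spec) > 2 and spec.startswith(":") and spec.endswith(":"):
            field = spec[1:-1]
            if field in GFF_FIELDS:
                return str(getattr(feature, field))
        elif spec in feature.attributes:
            # Attribute values are assumed to be lists; take the first.
            return feature.attributes[spec][0]
    return autoincrement(feature.featuretype, counters)


counters = Counter()
gene = SimpleNamespace(seqid="chr2L", featuretype="gene",
                       attributes={"ID": ["FBgn0031208"], "Name": ["CG11023"]})
exon = SimpleNamespace(seqid="chr2L", featuretype="exon", attributes={})

keys = [
    resolve_key(gene, "ID", counters),              # string
    resolve_key(exon, "ID", counters),              # string, attribute missing
    resolve_key(exon, ["ID", "Name"], counters),    # list, nothing found
    resolve_key(gene, {"gene": "Name"}, counters),  # dict
    resolve_key(gene, ":seqid:", counters),         # special string
]
```

The sketch ignores what happens when two features resolve to the same key;
that case is governed by `merge_strategy`.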


.. _transform:

`transform`
-----------

.. seealso::

    Examples that show the use of `transform`:

    * :ref:`ensembl_gtf.txt`
    * :ref:`glimmer_nokeyval.gff3`
    * :ref:`wormbase_gff2_alt.txt`
    * :ref:`wormbase_gff2.txt`

The `transform` kwarg is a function that accepts a single
:class:`gffutils.Feature` object and returns a (possibly modified)
:class:`gffutils.Feature` object.  It is used to modify items on the fly as
they are being imported into the database.  It is generally used for files that
don't fit the standard GFF3 or GTF specs.

One example use case is that FlyBase GFF3 files do not have a leading "chr"
in the seqid GFF field.  If we wanted to add this to each feature as it is
imported into the database, then we could use the following function::

    def add_chr(feature):
        # Prepend "chr" to the seqid (the first GFF column) of each feature.
        feature.seqid = "chr" + feature.seqid
        return feature
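
A slightly more defensive variant is sketched below; it leaves the seqid alone
when the prefix is already present.  The name `add_chr_once` and the file names
in the comment are hypothetical, and the `SimpleNamespace` objects are
stand-ins for :class:`gffutils.Feature` so the function can be tried without a
database.

```python
from types import SimpleNamespace


def add_chr_once(feature):
    """Prefix seqid with "chr" unless it is already present."""
    if not feature.seqid.startswith("chr"):
        feature.seqid = "chr" + feature.seqid
    return feature


# Stand-in features for illustration only (not real gffutils objects).
fixed = add_chr_once(SimpleNamespace(seqid="2L"))
unchanged = add_chr_once(SimpleNamespace(seqid="chr2L"))

# With gffutils installed, the function would be supplied at import time,
# for example (file names are placeholders):
#   db = gffutils.create_db("input.gff", "output.db", transform=add_chr_once)
```

Returning the feature is required; the returned object is what gets imported.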


`merge_strategy`
----------------

.. seealso::

    Examples that show the use of `merge_strategy`:

    * :ref:`c_elegans_WS199_shortened_gff.txt`
    * :ref:`mouse_extra_comma.gff3`

This parameter specifies the behavior when two items have an identical
primary key.

For example, consider the following attribute strings for two
consecutive lines.  Assume that `id_spec="ID"`, in which case these two
lines have the same primary key::

    ID="exon_1"; Parent="transcript_1";
    ID="exon_1"; Parent="transcript_2";


With `merge_strategy="merge"`, there will be a single entry in
the database for `"exon_1"`, but the attributes will be merged and only
unique values will be retained.  The new, edited feature will end up
looking like this::

    ID="exon_1"; Parent="transcript_1,transcript_2";  # database key: "exon_1"

With `merge_strategy="create_unique"`, the second entry will have
a unique, auto-incremented primary key assigned to it, and both lines
will be in the database, accessible by two different keys::

    ID="exon_1"; Parent="transcript_1";  # database key: "exon_1"
    ID="exon_1"; Parent="transcript_2";  # database key: "exon_1_1"


Using `merge_strategy="error"`, a :class:`gffutils.DuplicateIDError`
exception will be raised.  This means you will have to edit the file
yourself to fix the duplicated IDs.

Using `merge_strategy="warning"`, a warning will be sent to the
logger, and the duplicate feature will be skipped.
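
The four strategies described above can be compared side by side with a toy
model.  This is a sketch of the documented behavior only, not the
:mod:`gffutils` implementation; `add_feature`, `build`, and the local
`DuplicateIDError` class are hypothetical stand-ins, and the "database" is a
plain dictionary.

```python
import logging


class DuplicateIDError(Exception):
    """Stand-in for the exception raised on duplicate primary keys."""


def add_feature(db, key, attributes, merge_strategy, counters):
    """Insert attributes under key, resolving duplicates per merge_strategy."""
    if key not in db:
        db[key] = {name: list(values) for name, values in attributes.items()}
        return key
    if merge_strategy == "merge":
        # Keep one entry; append only values that are not already present.
        for name, values in attributes.items():
            merged = db[key].setdefault(name, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        return key
    if merge_strategy == "create_unique":
        # Keep both entries; the newcomer gets key_1, key_2, ...
        counters[key] = counters.get(key, 0) + 1
        new_key = "%s_%d" % (key, counters[key])
        db[new_key] = {name: list(values) for name, values in attributes.items()}
        return new_key
    if merge_strategy == "error":
        raise DuplicateIDError(key)
    if merge_strategy == "warning":
        logging.getLogger(__name__).warning("Duplicate ID %s; skipping", key)
        return None
    raise ValueError("unknown merge_strategy: %r" % (merge_strategy,))


lines = [
    ("exon_1", {"ID": ["exon_1"], "Parent": ["transcript_1"]}),
    ("exon_1", {"ID": ["exon_1"], "Parent": ["transcript_2"]}),
]


def build(strategy):
    """Load both example lines into a fresh toy database."""
    db, counters = {}, {}
    for key, attrs in lines:
        add_feature(db, key, attrs, strategy, counters)
    return db


merged = build("merge")
unique = build("create_unique")
skipped = build("warning")
```

Here `merged` holds one entry with both parents, `unique` holds two entries,
and `skipped` holds only the first line; `build("error")` raises the exception.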