Intervals

An Interval object is how pybedtools represents a line in a BED, GFF, GTF, or VCF file in a uniform fashion. This section will describe some useful features of Interval objects.

First, let’s get a BedTool to work with:

>>> a = pybedtools.example_bedtool('a.bed')

We can access the Intervals of a several different ways. Probably the most convenient way is by indexing a BedTool object:

>>> feature = a[0]

BedTool objects support slices, too:

>>> features = a[1:3]

Common Interval attributes

Printing a feature converts it into the original line from the file:

>>> print(feature)
chr1        1       100     feature1        0       +

The string representation of an Interval object is simply a valid line, including the newline, for the format from which that Interval was created (accessible via Interval.file_type).

All features, no matter what the file type, have chrom, start, stop, name, score, and strand attributes. Note that start and stop are integers, while everything else (including score) is a string.

pybedtools supports both Python 2 and 3. When using Python 3, all strings are the str type. When using Python 2, all strings are unicode.

Note

This documentation undergoes testing with Python 2 and Python 3. These versions handle strings differently. For example, under Python 2:

>>> feature.chrom
u'chr1'

But under Python 3:

>>> feature.chrom
'chr1'

Since all strings returned by Interval objects are unicode, we solve this by making a helper function show_value that converts unicode to native string – but only under Python 2.

>>> import sys
>>> def show_value(s):
...     """
...     Convert unicode to str under Python 2;
...     all other values pass through unchanged
...     """
...     if sys.version_info.major == 2:
...         if isinstance(s, unicode):
...             return str(s)
...     return s
>>> show_value(feature.chrom)
'chr1'

>>> show_value(feature.start)
1

>>> show_value(feature.stop)
100

>>> show_value(feature.name)
'feature1'

>>> show_value(feature.score)
'0'

>>> show_value(feature.strand)
'+'

Let’s make another feature that only has chrom, start, and stop to see how pybedtools deals with missing attributes:

>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]

>>> print(feature2)
chrX        500     1000


>>> show_value(feature2.chrom)
'chrX'

>>> show_value(feature2.start)
500

>>> show_value(feature2.stop)
1000

>>> show_value(feature2.name)
'.'

>>> show_value(feature2.score)
'.'

>>> show_value(feature2.strand)
'.'

This illustrates that default values are the string “.”.

Indexing into Interval objects

Interval objects can also be indexed by position into the original line (like a list) or indexed by name of attribute (like a dictionary).

>>> print(feature)
chr1        1       100     feature1        0       +


>>> show_value(feature[0])
'chr1'

>>> show_value(feature['chrom'])
'chr1'

>>> show_value(feature[1])
'1'

>>> show_value(feature['start'])
1

Fields

Interval objects have a Interval.fields attribute that contains the original line split into a list of strings. When an integer index is used on the Interval (for example, feature[3]), it is the fields attribute that is actually being indexed into.

>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> [show_value(i) for i in f.fields]
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10

BED is 0-based, others are 1-based

One troublesome part about working with multiple formats is that BED files have a different coordinate system than GFF/GTF/VCF/ files.

  • BED files are 0-based (the first base of the chromosome is considered position 0) and the feature does not include the stop position.

  • GFF, GTF, and VCF files are 1-based (the first base of the chromosome is considered position 1) and the feature includes the stop position.

pybedtools follows the following conventions:

  • The value in Interval.start will always contain the 0-based start position, even if it came from a GFF or other 1-based feature.

  • Getting the len() of an Interval will always return Interval.stop - Interval.start, so no matter what format the original file was in, the length will be correct. This greatly simplifies underlying code, and it means you can treat all Intervals identically.

  • The contents of Interval.fields will always be strings, which in turn always represent the original line in the file.

    • This means that for a GFF feature, Interval.fields[3] or Interval[3], which is 1-based according to the file format, will always be one bp larger than Interval.start, which always contains the 0-based start position. Their data types are different; Interval[3] will be a string and Interval.start will be a long.

    • Printing an Interval object created from a GFF file will show the tab-delimited fields in GFF coords while printing an Interval object created from a BED file will show fields in BED coords.

Worked example

To illustrate and confirm this functionality, let’s create a GFF feature and a BED feature from scratch and compare them.

First, let’s create a GFF Interval from scratch:

>>> gff = ["chr1",
...        "fake",
...        "mRNA",
...        "51",   # <- start is 1 greater than start for the BED feature below
...        "300",
...        ".",
...        "+",
...        ".",
...        "ID=mRNA1;Parent=gene1;"]
>>> gff = pybedtools.create_interval_from_list(gff)

Then let’s create a corresponding BED Interval that represents the same genomic coordinates of of the GFF feature, but since BED format is zero-based we need to subtract 1 from the start:

>>> bed = ["chr1",
...        "50",
...        "300",
...        "mRNA1",
...        ".",
...        "+"]
>>> bed = pybedtools.create_interval_from_list(bed)

Let’s confirm these new features were recognized as the right file type – the format is auto-detected based on the position of chrom/start/stop coords in the provided field list:

>>> show_value(gff.file_type)
'gff'

>>> show_value(bed.file_type)
'bed'

Printing the Intervals shows that the strings are in the appropriate coordinates:

>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> print(gff)
chr1        fake    mRNA    51      300     .       +       .       ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1        50      300     mRNA1   .       +

Since start attributes are always zero-based, the GFF and BED start values should be identical:

>>> bed.start == gff.start == 50
True

For the BED feature, the second string field (representing the start position) and the start attribute should both be 50 (though one is an integer and the other is a string) …

>>> show_value(bed.start)
50
>>> show_value(bed[1])
'50'

… but for the GFF feature, they differ – the start attribute is zero-based while the string representation (the fourth field of a GFF file) remains in one-based GFF coords:

>>> show_value(gff.start)
50
>>> show_value(gff[3])
'51'

As long as we use the integer start attributes, we can treat the Interval objects identically, without having to check for their format every time:

>>> len(bed) == len(gff) == 250
True

GFF features have access to attributes

GFF and GTF files have lots of useful information in their attributes field (the last field in each line). These attributes can be accessed with the Interval.attrs attribute, which acts like a dictionary. For speed, the attributes are lazy – they are only parsed when you ask for them. BED files, which do not have an attributes field, will return an empty dictionary.

>>> # original feature
>>> print(gff)
chr1        fake    mRNA    51      300     .       +       .       ID=mRNA1;Parent=gene1;

>>> # original attributes
>>> sorted(gff.attrs.items())
[('ID', 'mRNA1'), ('Parent', 'gene1')]

>>> # add some new attributes
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'

>>> # Changes in attributes are propagated to the printable feature

>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> assert gff.attrs.sort_keys
>>> print(gff)
chr1        fake    mRNA    51      300     .       +       .       Awesomeness=99;ID=transcript1;Parent=gene1;

Understanding Interval objects is important for using the powerful filtering and mapping facilities of BedTool objects, as described in the next section.