Intervals
An Interval object is how pybedtools represents a line in a BED, GFF, GTF, or VCF file in a uniform fashion. This section will describe some useful features of Interval objects.
First, let’s get a BedTool to work with:
>>> a = pybedtools.example_bedtool('a.bed')
There are several ways to access the Intervals in the BedTool a.
Probably the most convenient is to index the BedTool object:
>>> feature = a[0]
BedTool objects support slices, too:
>>> features = a[1:3]
Common Interval attributes
Printing a feature converts it into the original line from the file:
>>> print(feature)
chr1 1 100 feature1 0 +
The string representation of an Interval object is simply a valid line, including the newline, for the format from which that Interval was created (accessible via Interval.file_type).
All features, no matter what the file type, have chrom, start, stop, name, score, and strand attributes. Note that start and stop are integers, while everything else (including score) is a string.
pybedtools supports both Python 2 and 3. When using Python 3, all strings are the str type. When using Python 2, all strings are unicode.
Note
This documentation is tested under both Python 2 and Python 3, which handle strings differently. For example, under Python 2:
>>> feature.chrom
u'chr1'
But under Python 3:
>>> feature.chrom
'chr1'
Since all strings returned by Interval objects are unicode under Python 2, we work around this difference with a helper function, show_value, that converts unicode to the native string type, but only under Python 2.
>>> import sys
>>> def show_value(s):
...     """
...     Convert unicode to str under Python 2;
...     all other values pass through unchanged
...     """
...     if sys.version_info.major == 2:
...         if isinstance(s, unicode):
...             return str(s)
...     return s
>>> show_value(feature.chrom)
'chr1'
>>> show_value(feature.start)
1
>>> show_value(feature.stop)
100
>>> show_value(feature.name)
'feature1'
>>> show_value(feature.score)
'0'
>>> show_value(feature.strand)
'+'
Let’s make another feature that only has chrom, start, and stop to see how pybedtools deals with missing attributes:
>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]
>>> print(feature2)
chrX 500 1000
>>> show_value(feature2.chrom)
'chrX'
>>> show_value(feature2.start)
500
>>> show_value(feature2.stop)
1000
>>> show_value(feature2.name)
'.'
>>> show_value(feature2.score)
'.'
>>> show_value(feature2.strand)
'.'
This illustrates that the default value for a missing attribute is the string “.”.
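Because score is a string and a missing value comes back as ".", numeric work needs an explicit conversion. The following is a minimal sketch, assuming a hypothetical helper named score_or_none (it is not part of pybedtools) that receives the string found in feature.score:

```python
def score_or_none(score):
    """Convert a score string to a float, or None when the field is missing.

    `score_or_none` is a hypothetical helper for illustration; it is not
    part of pybedtools. It expects the string found in `feature.score`.
    """
    if score == ".":  # "." is the default for a missing field
        return None
    return float(score)


print(score_or_none("0"))  # 0.0
print(score_or_none("."))  # None
```

Returning None rather than 0 keeps "no score" distinguishable from a real score of zero.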
Indexing into Interval objects
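Interval objects can also be indexed by position into the original line (like a list) or indexed by name of attribute (like a dictionary). Indexing by position returns the raw string from the line, while indexing by name returns the same value as the attribute, so start and stop come back as integers.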
>>> print(feature)
chr1 1 100 feature1 0 +
>>> show_value(feature[0])
'chr1'
>>> show_value(feature['chrom'])
'chr1'
>>> show_value(feature[1])
'1'
>>> show_value(feature['start'])
1
Fields
Interval objects have an Interval.fields attribute that contains the original line split into a list of strings. When an integer index is used on the Interval (for example, feature[3]), it is the fields attribute that is actually being indexed into.
>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> [show_value(i) for i in f.fields]
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10
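To make the two indexing styles concrete, here is a toy model of the behavior described above. It is only a sketch: ToyInterval and its _BED_COLUMNS table are hypothetical names, the class handles BED-like lines only, and it is not how pybedtools implements Interval.

```python
class ToyInterval:
    """Toy model of the two indexing styles; NOT pybedtools' implementation."""

    # Column positions of the named attributes in a BED line (assumed layout).
    _BED_COLUMNS = {"chrom": 0, "start": 1, "stop": 2,
                    "name": 3, "score": 4, "strand": 5}

    def __init__(self, line):
        # Whitespace split for simplicity; the result is a list of strings.
        self.fields = line.split()

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer index: look directly into the list of string fields.
            return self.fields[key]
        # Name index: look up the column, falling back to "." if absent.
        column = self._BED_COLUMNS[key]
        value = self.fields[column] if column < len(self.fields) else "."
        # start and stop are converted to integers; the rest stay strings.
        return int(value) if key in ("start", "stop") else value


toy = ToyInterval("chr1 1 100 feature1 0 +")
print(repr(toy[1]))         # '1'
print(repr(toy["start"]))   # 1
print(repr(toy["strand"]))  # '+'
```

The point of the sketch is the asymmetry: positional access never converts types, while named access does.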
BED is 0-based, others are 1-based
One troublesome part of working with multiple formats is that BED files use a different coordinate system than GFF, GTF, and VCF files.
BED files are 0-based (the first base of the chromosome is considered position 0) and the feature does not include the stop position.
GFF, GTF, and VCF files are 1-based (the first base of the chromosome is considered position 1) and the feature includes the stop position.
pybedtools follows these conventions:
The value in Interval.start will always contain the 0-based start position, even if it came from a GFF or other 1-based feature.
Getting the len() of an Interval will always return Interval.stop - Interval.start, so no matter what format the original file was in, the length will be correct. This greatly simplifies underlying code, and it means you can treat all Intervals identically.
The contents of Interval.fields will always be strings, which in turn always represent the original line in the file. This means that for a GFF feature, Interval.fields[3] or Interval[3], which is 1-based according to the file format, will always be one bp larger than Interval.start, which always contains the 0-based start position. Their data types are different; Interval[3] will be a string and Interval.start will be an integer.
Printing an Interval object created from a GFF file will show the tab-delimited fields in GFF coords, while printing an Interval object created from a BED file will show fields in BED coords.
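The arithmetic behind these conventions is small enough to sketch directly. The function below, gff_to_zero_based, is a hypothetical helper written for illustration and is not part of pybedtools; it only shows why the start shifts by one while the stop does not.

```python
def gff_to_zero_based(gff_start, gff_end):
    """Convert 1-based, inclusive GFF coordinates to 0-based, half-open ones.

    Hypothetical helper for illustration; not part of pybedtools.
    """
    # Only the start shifts: the inclusive 1-based end is numerically
    # equal to the exclusive 0-based stop.
    return gff_start - 1, gff_end


start, stop = gff_to_zero_based(51, 300)
print(start, stop)   # 50 300
print(stop - start)  # 250
```

With half-open coordinates the length is simply stop - start, with no off-by-one correction.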
Worked example
To illustrate and confirm this functionality, let’s create a GFF feature and a BED feature from scratch and compare them.
First, let’s create a GFF Interval from scratch:
>>> gff = ["chr1",
... "fake",
... "mRNA",
... "51", # <- start is 1 greater than start for the BED feature below
... "300",
... ".",
... "+",
... ".",
... "ID=mRNA1;Parent=gene1;"]
>>> gff = pybedtools.create_interval_from_list(gff)
Then let’s create a corresponding BED Interval that represents the same genomic coordinates as the GFF feature. Since BED format is zero-based, we need to subtract 1 from the start:
>>> bed = ["chr1",
... "50",
... "300",
... "mRNA1",
... ".",
... "+"]
>>> bed = pybedtools.create_interval_from_list(bed)
Let’s confirm these new features were recognized as the right file type – the format is auto-detected based on the position of chrom/start/stop coords in the provided field list:
>>> show_value(gff.file_type)
'gff'
>>> show_value(bed.file_type)
'bed'
Printing the Intervals shows that the strings are in the appropriate coordinates:
>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1 50 300 mRNA1 . +
Since start attributes are always zero-based, the GFF and BED start values should be identical:
>>> bed.start == gff.start == 50
True
For the BED feature, the second string field (representing the start position) and the start attribute should both be 50 (though the field is a string and the attribute is an integer) …
>>> show_value(bed.start)
50
>>> show_value(bed[1])
'50'
… but for the GFF feature, they differ – the start attribute is zero-based while the string representation (the fourth field of a GFF file) remains in one-based GFF coords:
>>> show_value(gff.start)
50
>>> show_value(gff[3])
'51'
As long as we use the integer start attributes, we can treat the Interval objects identically, without having to check for their format every time:
>>> len(bed) == len(gff) == 250
True
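One practical consequence is that coordinate logic can be written once for every format. The sketch below uses a namedtuple called Feature as a stand-in for real Interval objects so that it runs on its own; Feature and overlaps are hypothetical names, and the function relies only on the chrom, start, and stop attributes described above.

```python
from collections import namedtuple

# Stand-in for pybedtools Intervals so the sketch runs on its own;
# only the chrom, start, and stop attributes are used.
Feature = namedtuple("Feature", ["chrom", "start", "stop"])


def overlaps(a, b):
    """Return True if two features share at least one base.

    Assumes 0-based, half-open coordinates, i.e. the values found in
    the start and stop attributes regardless of the source format.
    """
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


from_bed = Feature("chr1", 50, 300)
from_gff = Feature("chr1", 299, 400)  # the GFF line would read 300..400
adjacent = Feature("chr1", 300, 400)  # starts exactly where from_bed stops

print(overlaps(from_bed, from_gff))  # True
print(overlaps(from_bed, adjacent))  # False
```

Because the stop is exclusive, a feature that starts exactly at another feature's stop does not overlap it.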
GFF features have access to attributes
GFF and GTF files have lots of useful information in their attributes field (the last field in each line). These attributes can be accessed with the Interval.attrs attribute, which acts like a dictionary. For speed, the attributes are lazy: they are only parsed when you ask for them. Features from BED files, which do not have an attributes field, return an empty dictionary.
>>> # original feature
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> # original attributes
>>> sorted(gff.attrs.items())
[('ID', 'mRNA1'), ('Parent', 'gene1')]
>>> # add some new attributes
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'
>>> # Changes in attributes are propagated to the printable feature
>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> assert gff.attrs.sort_keys
>>> print(gff)
chr1 fake mRNA 51 300 . + . Awesomeness=99;ID=transcript1;Parent=gene1;
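For intuition about what the attributes field contains, the round trip above can be imitated in plain Python. This is a simplified sketch, not pybedtools' parser: parse_gff_attributes and format_gff_attributes are hypothetical helpers that handle only GFF3-style key=value; pairs and ignore GTF-style syntax and escaped characters.

```python
def parse_gff_attributes(field):
    """Parse a GFF3-style 'key=value;key=value;' string into a dict.

    Hypothetical helper for illustration; it ignores GTF-style
    'key "value";' syntax and URL-escaped characters.
    """
    attrs = {}
    for pair in field.split(";"):
        if not pair:  # skip the empty piece after a trailing ';'
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs


def format_gff_attributes(attrs):
    """Serialize a dict back to 'key=value;' pairs, sorted by key."""
    return "".join("{0}={1};".format(k, attrs[k]) for k in sorted(attrs))


attrs = parse_gff_attributes("ID=mRNA1;Parent=gene1;")
attrs["Awesomeness"] = "99"
attrs["ID"] = "transcript1"
print(format_gff_attributes(attrs))
# Awesomeness=99;ID=transcript1;Parent=gene1;
```

Sorting the keys on output mirrors the sort_keys setting used in the example above and makes the result deterministic.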
Understanding Interval objects is important for using the powerful filtering and mapping facilities of BedTool objects, as described in the next section.