An Interval object is how pybedtools represents a line in a BED, GFF, GTF, or VCF file in a uniform fashion. This section describes some useful features of Interval objects.
First, let’s get a BedTool to work with:
>>> a = pybedtools.example_bedtool('a.bed')
We can access the Interval objects in a (the BedTool created above) in several different ways. Probably the most convenient way is by indexing a BedTool object:
>>> feature = a[0]
BedTool objects support slices, too:
>>> features = a[1:3]
Printing a feature converts it into the original line from the file:
>>> print(feature)
chr1 1 100 feature1 0 +
The string representation of an Interval object is simply a valid line, including the newline, for the format from which that Interval was created (accessible via Interval.file_type).
All features, no matter what the file type, have chrom, start, stop, name, score, and strand attributes. Note that start and stop are integers, while everything else (including score) is a string.
pybedtools supports both Python 2 and 3. When using Python 3, all
strings are the
str type. When using Python 2, all strings are unicode.
This documentation undergoes testing with Python 2 and Python 3. These versions handle strings differently. For example, under Python 2:
>>> feature.chrom
u'chr1'
But under Python 3:
>>> feature.chrom
'chr1'
Since all strings returned by Interval objects are unicode under Python 2, the examples here use a helper function, show_value, that converts unicode to the native string type, but only under Python 2. This keeps the displayed output identical under both versions.
>>> import sys
>>> def show_value(s):
...     """
...     Convert unicode to str under Python 2;
...     all other values pass through unchanged
...     """
...     if sys.version_info.major == 2:
...         if isinstance(s, unicode):
...             return str(s)
...     return s
>>> show_value(feature.chrom)
'chr1'
>>> show_value(feature.start)
1
>>> show_value(feature.stop)
100
>>> show_value(feature.name)
'feature1'
>>> show_value(feature.score)
'0'
>>> show_value(feature.strand)
'+'
Let’s make another feature that only has chrom, start, and stop to see how
pybedtools deals with missing attributes:
>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]
>>> print(feature2)
chrX 500 1000
>>> show_value(feature2.chrom)
'chrX'
>>> show_value(feature2.start)
500
>>> show_value(feature2.stop)
1000
>>> show_value(feature2.name)
'.'
>>> show_value(feature2.score)
'.'
>>> show_value(feature2.strand)
'.'
This illustrates that the default value for a missing name, score, or strand is the string ".".
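The typing and default rules shown so far can be sketched in plain Python. This is only an illustration and does not use pybedtools: parse_bed_line is a hypothetical helper, and it assumes whitespace-delimited BED fields with at least chrom, start, and stop present.

```python
def parse_bed_line(line):
    """Split a BED line into named attributes (illustrative helper only)."""
    fields = line.split()
    # Pad the optional columns so that missing ones fall back to ".".
    name, score, strand = (fields[3:6] + ["."] * 3)[:3]
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # start and stop become integers
        "stop": int(fields[2]),
        "name": name,
        "score": score,  # stays a string, even when it looks numeric
        "strand": strand,
    }


full = parse_bed_line("chr1 1 100 feature1 0 +")
minimal = parse_bed_line("chrX 500 1000")
print(full["start"], repr(full["score"]))
print(minimal["name"], minimal["score"], minimal["strand"])
```

Keeping score as a string mirrors the behaviour described above; converting it to a number is left to the caller.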
Interval objects can also be indexed by position into the original
line (like a list) or indexed by name of attribute (like a dictionary).
>>> print(feature)
chr1 1 100 feature1 0 +
>>> show_value(feature[0])
'chr1'
>>> show_value(feature['chrom'])
'chr1'
>>> show_value(feature[1])
'1'
>>> show_value(feature['start'])
1
Interval objects have an Interval.fields attribute that contains the original line split into a list of strings. When an integer index is used on the Interval (for example, feature[0]), it is the fields attribute that is actually being indexed into.
>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> [show_value(i) for i in f.fields]
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10
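The difference between positional and named indexing can be mimicked with a small toy class. This is a sketch for illustration, not the pybedtools implementation: ToyInterval is a hypothetical name, and it only handles whitespace-delimited BED lines.

```python
class ToyInterval:
    """Toy stand-in for illustration; not the pybedtools Interval class."""

    def __init__(self, line):
        self.fields = line.split()  # always a list of strings

    @property
    def chrom(self):
        return self.fields[0]

    @property
    def start(self):
        return int(self.fields[1])  # BED start is already 0-based

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.fields[key]  # positional: raw string from the line
        return getattr(self, key)  # by name: typed attribute


toy = ToyInterval("chr1 1 100 feature1 0 +")
print(repr(toy[1]), repr(toy["start"]))  # prints: '1' 1
```

Integer keys go straight to the list of strings, so they always return text; string keys go through the typed attributes.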
BED is 0-based, others are 1-based
One troublesome part of working with multiple formats is that BED files use a different coordinate system than GFF/GTF/VCF files.
BED files are 0-based (the first base of the chromosome is considered position 0) and the feature does not include the stop position.
GFF, GTF, and VCF files are 1-based (the first base of the chromosome is considered position 1) and the feature includes the stop position.
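One way to see the two conventions side by side is to pull the same bases out of a sequence with each. The sketch below is plain Python and independent of pybedtools; the sequence and both helper functions are made up for illustration.

```python
chromosome = "ACGTACGTAC"  # made-up 10-base sequence


def bases_from_bed(seq, start, stop):
    # 0-based with the stop excluded: identical to a Python slice.
    return seq[start:stop]


def bases_from_gff(seq, start, stop):
    # 1-based with the stop included: shift only the start down by one.
    return seq[start - 1:stop]


# The first four bases of the sequence, expressed in each convention.
print(bases_from_bed(chromosome, 0, 4))
print(bases_from_gff(chromosome, 1, 4))
```

Note that only the start differs between the two descriptions of the same region; the stop value is numerically the same.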
pybedtools uses the following conventions:
The value in Interval.start will always contain the 0-based start position, even if it came from a GFF or other 1-based feature.
Getting the len() of an Interval will always return Interval.stop - Interval.start, so no matter what format the original file was in, the length will be correct. This greatly simplifies underlying code, and it means you can treat all Intervals identically.
The contents of Interval.fields will always be strings, which in turn always represent the original line in the file.
This means that for a GFF feature, Interval[3], which is 1-based according to the file format, will always be one bp larger than Interval.start, which always contains the 0-based start position. Their data types are different; Interval[3] will be a string and Interval.start will be an integer.
Printing an Interval object created from a GFF file will show the tab-delimited fields in GFF coords, while printing an Interval object created from a BED file will show fields in BED coords.
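These conventions can be summarised in a short plain-Python sketch. It is an approximation of the behaviour described above, not pybedtools code: the two functions are hypothetical, they handle only BED and GFF field layouts, and the format name is passed in rather than auto-detected.

```python
def zero_based_start(fields, file_type):
    """Return a 0-based integer start from the raw string fields."""
    if file_type == "gff":
        return int(fields[3]) - 1  # GFF start: fourth field, 1-based
    return int(fields[1])  # BED start: second field, already 0-based


def feature_length(fields, file_type):
    """Return stop - start, using the 0-based start for either format."""
    stop = int(fields[4]) if file_type == "gff" else int(fields[2])
    return stop - zero_based_start(fields, file_type)


gff_fields = ["chr1", "fake", "mRNA", "51", "300", ".", "+", ".", "ID=mRNA1;"]
bed_fields = ["chr1", "50", "300", "mRNA1", ".", "+"]

print(zero_based_start(gff_fields, "gff"), zero_based_start(bed_fields, "bed"))
print(feature_length(gff_fields, "gff"), feature_length(bed_fields, "bed"))
print(gff_fields[3])  # the raw field is untouched and still reads "51"
```

The raw fields are never modified; only the derived integer values are normalised.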
To illustrate and confirm this functionality, let’s create a GFF feature and a BED feature from scratch and compare them.
First, let’s create a GFF
Interval from scratch:
>>> gff = ["chr1",
...        "fake",
...        "mRNA",
...        "51",  # <- start is 1 greater than start for the BED feature below
...        "300",
...        ".",
...        "+",
...        ".",
...        "ID=mRNA1;Parent=gene1;"]
>>> gff = pybedtools.create_interval_from_list(gff)
Then let’s create a corresponding BED Interval that represents the same genomic coordinates as the GFF feature. Since BED format is zero-based, we need to subtract 1 from the start:
>>> bed = ["chr1",
...        "50",
...        "300",
...        "mRNA1",
...        ".",
...        "+"]
>>> bed = pybedtools.create_interval_from_list(bed)
Let’s confirm these new features were recognized as the right file type – the format is auto-detected based on the position of chrom/start/stop coords in the provided field list:
>>> show_value(gff.file_type)
'gff'
>>> show_value(bed.file_type)
'bed'
Printing the Intervals shows that the strings are in the appropriate coordinates:
>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1 50 300 mRNA1 . +
Since start attributes are always zero-based, the GFF and BED start values should be identical:
>>> bed.start == gff.start == 50
True
For the BED feature, the second string field (representing the start position) and the start attribute should both be 50 (though one is an integer and the other is a string) …
>>> show_value(bed.start)
50
>>> show_value(bed[1])
'50'
… but for the GFF feature, they differ: the start attribute is zero-based, while the string representation (the fourth field of a GFF file) remains in one-based GFF coords:
>>> show_value(gff.start)
50
>>> show_value(gff[3])
'51'
As long as we use the integer start attributes, we can treat the Interval objects identically, without having to check their format:
>>> len(bed) == len(gff) == 250
True
GFF features have access to attributes
GFF and GTF files have lots of useful information in their attributes field (the last field in each line). These attributes can be accessed with the Interval.attrs attribute, which acts like a dictionary. For speed, the attributes are lazy: they are only parsed when you ask for them. BED files, which do not have an attributes field, will return an empty dictionary.
>>> # original feature
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> # original attributes
>>> sorted(gff.attrs.items())
[('ID', 'mRNA1'), ('Parent', 'gene1')]
>>> # add some new attributes
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'
>>> # Changes in attributes are propagated to the printable feature
>>> # for testing, we make sure keys are sorted. Not needed in practice.
>>> gff.attrs.sort_keys = True
>>> assert gff.attrs.sort_keys
>>> print(gff)
chr1 fake mRNA 51 300 . + . Awesomeness=99;ID=transcript1;Parent=gene1;
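To make the idea of lazily parsed attributes more concrete, here is a minimal plain-Python sketch. It is not how pybedtools implements Interval.attrs: LazyAttrs is a hypothetical class, and it assumes GFF3-style key=value; pairs only (GTF-style quoted values are not handled).

```python
class LazyAttrs:
    """Toy lazily parsed attributes; not the pybedtools implementation."""

    def __init__(self, raw):
        self.raw = raw
        self._parsed = None  # nothing is parsed until first access

    def _load(self):
        if self._parsed is None:
            items = [item for item in self.raw.split(";") if item]
            self._parsed = dict(item.split("=", 1) for item in items)
        return self._parsed

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        self._load()[key] = value

    def __str__(self):
        # Sorted keys give a reproducible order, similar in spirit to
        # the sort_keys setting used in the example above.
        pairs = sorted(self._load().items())
        return "".join("{}={};".format(k, v) for k, v in pairs)


attrs = LazyAttrs("ID=mRNA1;Parent=gene1;")
attrs["Awesomeness"] = "99"
attrs["ID"] = "transcript1"
print(attrs)  # prints: Awesomeness=99;ID=transcript1;Parent=gene1;
```

Deferring the split until the first lookup means features whose attributes are never inspected pay no parsing cost, which is the motivation given above for laziness.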
Understanding Interval objects is important for using the powerful filtering and mapping facilities of BedTool objects, as described in the next section.