FAQs¶
Note
More detailed answers to these questions can often be found on the Issues page.
“Does pybedtools have a simple reader/writer for BED files?”¶
While pybedtools
designed to work with BEDTools, the reading/writing/parsing
function can be easily used for other things.
Simply iterating over a BedTool
object will parse each line into
a Interval
object. You can then manipulate this or access the fields
as needed.
For example:
x = pybedtools.example_bedtool('a.bed')
for interval in x:
# do something with interval
However, if you’re planning on writing the results out to file, it may be more
useful to write a transformation function along with the BedTool.each()
method. This allows you to read, transform, and write all in one command:
def my_func(f):
"""
adds 10 bp to the stop
"""
f.stop += 1
return f
pybedtools.example_bedtool('a.bed')\
.each(my_func)\
.saveas('out.bed')
Another useful idiom is creating a generator function. For example, here we change the name field to reflect the value of a counter. We create a BedTool from the iterator and then save it:
def gen():
counter = 0
for i in pybedtools.example_bedtool('a.bed'):
i.name = str(counter)
counter += 1
yield i
pybedtools.BedTool(gen()).saveas('counted.bed')
See Saving the results for more on saving the results.
“Can I create a BedTool object from an existing list?”¶
Sure, the BedTool
constructor will figure it out:
items = [
('chr1', 100, 200),
('chr1', 500, 600),
]
x = pybedtools.BedTool(items)
“I’m getting an empty BedTool”¶
Check to make sure you’re not consuming a BedTool generator. Note that
BedTool.filter()
and BedTool.each()
will return a generator BedTool
object. Keep in mind that checking the length of a generator BedTool will
completely consume it.
It’s probably best to save intermediate versions to file using
BedTool.saveas()
. If you don’t provide a filename, it’ll save to an
automatically cleaned up tempfile:
my_bedtool\
.filter(my_filter_func)\
.saveas()\
.intersect(y)\
.filter(lambda x: len(x) > 1000)\
.saveas('filtered-intersected-large.bed')
“I’m getting a MalformedBedLineError”¶
This error can be raised by BEDTools itself. Typical reasons are that start > end, or the fields are not tab-delimited.
You can try the pybedtools.remove_invalid()
function to clean up your
file, or manually edit the offending lines.
“I get a segfault when iterating over a BedTool object”¶
Issue #88 which
addresses this issue – in summary, Cython’s handling of iterators works
unexpectedly. It’s best to call the next()
method explicitly when doing
complex manipulations on an iterating BedTool
.
“Can I add extra information to FASTA headers when using BedTool.sequence()?”¶
Since BEDTools adds the feature name to the FASTA header, you can manipulate the feature name on the fly with a custom modifier function:
def fields2name(f):
"replace GFF featuretype field with the attributes field"
f[2] = f[-1]
return f
import pybedtools
g = pybedtools.BedTool("my.gff").each(fields2name).sequence(fi='my.fasta')
print open(g.seqfn).readline()
“Too many files open” error¶
Sometimes you may get the error:
* Too many files open -- please submit a bug report so that this can be fixed
This error occurs because you have hit your operating system’s limit on the
number of open files. This usually happens when creating many BedTool
objects, often within a for-loop.
In general, try to create as few BedTool
objects as you can. Every time you
create a BedTool
object, you create a new open file. There is usually
a BEDTools program that already does what you want, and will do it faster.
For example, say we want to:
start with all annotations
only consider exons
write a file containing just exons
count reads in multiple BAM files for each exon
Here is a first draft. Note that the for-loop creates a BedTool
object each iteration, and the result
is yet another BedTool
. This
will version will raise the “Too many files open” error.
# This version will be slow and, with many exons, will raise the "Too many
# files open" error
import pybedtools
all_features = pybedtools.BedTool('annotations.gff')
fout = open('exons.gff', 'w')
for feature in all_features:
if feature[2] != 'exon':
continue
fout.write(str(feature))
bt = pybedtools.BedTool([feature])
result = bt.multi_bam_coverage(bams=['reads1.bam', 'reads2.bam'])
# ...do something with result
fout.close()
In contrast, it would be better to construct an “exon-only” BedTool
at
the beginning. The BedTool.filter()
method is a good way to do this.
Then, there is only one call to BedTool.multi_bam_coverage()
.
In this version there are only 3 BedTool
objects: the
one that opens annotations.gff
, the one that uses exons.gff
after it is
saved, and result
. (Note that the one created from the filter operation is
a “streaming” BedTool, so there is no open file that will contribute to the
total).
# This is the recommended way.
import pybedtools
exons = pybedtools.BedTool('annotations.gff')\
.filter(lambda x: x[2] == 'exon')\
.saveas('exons.gff')
result = exons.multi_bam_coverage(bams=['reads1.bam', 'reads2.bam'])
# ...do something with result