Saving BedTool results

In general, there are three different ways of saving results from BedTool operations:

Use the BedTool.saveas() method

The BedTool.saveas() method makes a copy of the results, so beware that for large files, this can be time and/or memory-consuming. However, when working with a streaming or iterating BedTool, this is a great way to render the results to disk in the middle of a pipeline.

A good example of this is saving the results from a BedTool.each() call:

>>> from pybedtools.featurefuncs import TSS
>>> a = pybedtools.example_bedtool('a.bed')
>>> result = a.each(TSS, upstream=1000, downstream=0)\
...     .saveas('upstream_regions.bed')

Use the BedTool.moveto() method

The BedTool.moveto() method does a move operation of the results. This is best used when the results have been written to disk already (perhaps to a tempfile) but you’d like to give the file a more reasonable/memorable name.

>>> a = pybedtools.example_bedtool('a.bed')
>>> b = pybedtools.example_bedtool('b.bed')
>>> c = a.intersect(b).moveto('intersection_of_a_and_b.bed')

Use the output keyword argument

If you know ahead of time that you want to save the output to a particular file, use the output keyword argument to any wrapped BedTool method that returns another BedTool object. This will override the default behavior of creating a tempfile.

>>> a = pybedtools.example_bedtool('a.bed')
>>> b = pybedtools.example_bedtool('b.bed')
>>> c = a.intersect(b, output='intersection_of_a_and_b.bed')

Working with non-interval output files

BEDTools commands offer lots of flexibility. This means it is possible to return results that are not supported interval files like BED/GFF/GTF/BAM/SAM/VCF.

Consider the following example, which uses BedTool.groupby() to get a 2-column file containing the number of intervals in each featuretype:

>>> a = pybedtools.example_bedtool('gdc.gff')
>>> b = pybedtools.example_bedtool('gdc.bed')
>>> c = a.intersect(b, c=True)
>>> d = c.groupby(g=[3], c=10, o=['sum'])

We can read the file created by d looks like this:

(note: the latest version of BEDTools, v2.26.0, causes this to fail. This will be fixed in the next BEDTools release (see https://github.com/arq5x/bedtools2/issues/453, https://github.com/arq5x/bedtools2/issues/450, https://github.com/arq5x/bedtools2/issues/435, https://github.com/arq5x/bedtools2/issues/436 for details).

>>> # bedtools v2.26.0
>>> print(open(d.fn).read())
UTR 0
CDS 2
intron      4
CDS 0
UTR 1
exon        3
mRNA        7
CDS 2
exon        2
tRNA        2
gene        7

Trying to iterate over d ([i for i in d]) or save it (d.saveas()) raises exceptions. This is because:

  • saveas() is expected to return a BedTool object that can be used with other BEDTools tools. We can’t create a BedTool object out of an unsupported file format like this

  • iterating over a BedTool object is expected to yield Interval objects, but these lines can’t be converted into the supported formats

To save the output to a filename of your choosing, provide the output argument instead of saveas(), like this:

>>> # only works with bedtools != v2.26.0
>>> # d = c.groupby(g=[3], c=10, o=['sum'], output='counts.txt')

To iterate over the lines of the file, you can use standard Python tools, e.g.:

>>> # only works with bedtools != v2.26.0
>>> # for line in open(d.fn):
>>> #     featuretype, count = line.strip().split()