Filtering
The BedTool.filter() method lets you pass in a function that accepts an Interval as its first argument and returns True or False. This allows you to perform “grep”-like operations on BedTool objects. For example, here’s how to get a new BedTool containing features from a that are more than 100 bp long:
>>> a = pybedtools.example_bedtool('a.bed')
>>> b = a.filter(lambda x: len(x) > 100)
>>> print(b)
chr1 150 500 feature3 0 -
The filter() method will pass its *args and **kwargs to the function provided. So here is a more generic case, where the function is defined once and different arguments are passed in for filtering on different lengths:
>>> def len_filter(feature, L):
...     "Returns True if feature is longer than L"
...     return len(feature) > L
Now we can pass different lengths without defining a new function for each length of interest, like this:
>>> a = pybedtools.example_bedtool('a.bed')
>>> print(a.filter(len_filter, L=10))
chr1 1 100 feature1 0 +
chr1 100 200 feature2 0 +
chr1 150 500 feature3 0 -
chr1 900 950 feature4 0 +
>>> print(a.filter(len_filter, L=99))
chr1 100 200 feature2 0 +
chr1 150 500 feature3 0 -
>>> print(a.filter(len_filter, L=200))
chr1 150 500 feature3 0 -
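The forwarding of *args and **kwargs can be pictured with a small pure-Python sketch. It does not use pybedtools: Feature and filter_features are hypothetical stand-ins, with coordinates copied from the a.bed output above, and the real method may be implemented differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for an Interval; not the pybedtools class."""
    chrom: str
    start: int
    stop: int
    name: str

    def __len__(self):
        return self.stop - self.start


def filter_features(features, func, *args, **kwargs):
    """Yield each feature for which func(feature, *args, **kwargs) is true."""
    for feature in features:
        # Extra positional and keyword arguments are handed to func unchanged.
        if func(feature, *args, **kwargs):
            yield feature


def len_filter(feature, L):
    "Returns True if feature is longer than L"
    return len(feature) > L


features = [
    Feature("chr1", 1, 100, "feature1"),
    Feature("chr1", 100, 200, "feature2"),
    Feature("chr1", 150, 500, "feature3"),
    Feature("chr1", 900, 950, "feature4"),
]

# The threshold can be supplied by keyword or by position.
names_kw = [f.name for f in filter_features(features, len_filter, L=99)]
names_pos = [f.name for f in filter_features(features, len_filter, 99)]
print(names_kw)  # ['feature2', 'feature3']
```

Because the extra arguments travel with the call, a single len_filter definition serves every threshold.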
See Using BedTool objects as iterators/generators for more advanced and space-efficient usage of filter() using iterators.
Note that we could have used the built-in Python function filter(), but that would have returned an iterator from which we would then have to construct a new pybedtools.BedTool. The BedTool.filter() method returns a ready-to-use BedTool object, which allows embedding BedTool.filter() calls in a chain of commands, e.g.:
>>> a.intersect(b).filter(lambda x: len(x) < 100).merge()
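The difference between the built-in filter() and a method that returns its own container type can be illustrated without pybedtools. In this minimal sketch, Intervals, its shift() method, and longer_than are all hypothetical names invented for illustration; only the chaining pattern is the point.

```python
class Intervals:
    """Hypothetical chainable container; not pybedtools.BedTool."""

    def __init__(self, spans):
        self.spans = list(spans)  # list of (start, stop) tuples

    def filter(self, func, *args, **kwargs):
        # Return a new Intervals so further methods can be chained.
        return Intervals(s for s in self.spans if func(s, *args, **kwargs))

    def shift(self, offset):
        # A second method, present only to demonstrate chaining.
        return Intervals((start + offset, stop + offset)
                         for start, stop in self.spans)


def longer_than(span, min_len):
    start, stop = span
    return stop - start > min_len


spans = Intervals([(1, 100), (100, 200), (150, 500), (900, 950)])

# Built-in filter() returns a plain iterator with no container methods,
# so it would have to be wrapped again before chaining.
lazy = filter(lambda s: longer_than(s, 99), spans.spans)
has_shift = hasattr(lazy, "shift")  # False

# The method form returns the container type, so calls chain directly.
chained = spans.filter(longer_than, 99).shift(10).spans
print(chained)  # [(110, 210), (160, 510)]
```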
Fast filtering functions in Cython
The featurefuncs module contains some ready-made functions written in Cython that will be faster than pure Python equivalents. For example, there are greater_than() and less_than() functions, which are about 70% faster. In IPython:
>>> from pybedtools.featurefuncs import greater_than
>>> len(a)
310456
>>> def L(x, width=100):
...     return len(x) > width
>>> # The %timeit command is from IPython, and won't work
>>> # in a regular Python script:
>>> %timeit a.filter(greater_than, 100)
1 loops, best of 3: 1.74 s per loop
>>> %timeit a.filter(L, 100)
1 loops, best of 3: 2.96 s per loop
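Outside IPython, the standard-library timeit module can produce a comparable "best of 3" measurement. The sketch below times a pure-Python length filter over made-up (start, stop) tuples. It does not use pybedtools or featurefuncs, and spans, longer_than, and run are hypothetical names, so the number it prints says nothing about the Cython functions themselves.

```python
import timeit

# Made-up intervals standing in for a large feature set.
spans = [(i, i + (i % 300)) for i in range(100_000)]


def longer_than(span, width=100):
    start, stop = span
    return stop - start > width


def run():
    # Consume the whole filter so that the full pass is timed.
    return sum(1 for s in spans if longer_than(s, 100))


# Three repeats of one call each, mirroring "1 loops, best of 3".
timings = timeit.repeat(run, number=1, repeat=3)
best = min(timings)
print(f"1 loop, best of 3: {best:.3f} s per loop")
```

Taking the minimum of the repeats, rather than the mean, reduces the influence of unrelated system load on the reported time.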