Working with BAM files

Some BEDTools programs, like intersecteBed, support BAM files as input. From the command line, you would need to specify the -abam argument to do so. However, pybedtools auto-detects BAM files and passes the abam argument automatically for you. That means if you create a BedTool out of a BAM file, like this:

x = pybedtools.example_bedtool('gdc.bam')

you can intersect it with a BED file without doing anything special:

b = pybedtools.example_bedtool('gdc.gff')
y = x.intersect(b)

The output of this operation follows the semantics of BEDTools. That is, for programs like intersectBed, if abam is used then the output will be BAM format as well. But if the -bed argument is passed, then the output will be BED format. Similarly, in pybedtools, if a BAM file is used to create the BedTool then the results will also be in BAM format. If the bed=True kwarg is passed, then the results be in BED format.

As an example, let’s intersect a BAM file of reads with annotations using files that ship with pybedtools. First, we create the BedTool objects:

>>> a = pybedtools.example_bedtool('x.bam')
>>> b = pybedtools.example_bedtool('dm3-chr2L-5M.gff.gz')

The first call below will return BAM results, and the second will return BED results.

>>> bam_results = a.intersect(b)
>>> str(bam_results.file_type) == 'bam'
True

>>> bed_results = a.intersect(b, bed=True)
>>> str(bed_results.file_type) == 'bed'
True

We can iterate over BAM files to get Interval objects just like iterating over BED or GFF files. Indexing works, too:

>>> for i in bam_results[:2]:
...     print(i)
HWUSI-NAME:2:69:512:1017#0  16      chr2L   9330    3       36M     *       0       0       TACAAATCTTACGTAAACACTCCAAGCATGAATTCG    Y`V_a_TM[\_V`abb`^^Q]QZaaaaa_aaaaaaa    NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19096815

HWUSI-NAME:2:91:1201:1113#0 16      chr2L   10213   255     36M     *       0       0       TGTAGAATGCAAAAATTACATTTGTGAGTATCATCA    UV[aY`]\VZ`baaaZa`_aab_`_`a`ab``b`aa    NM:i:0  NH:i:1


>>> bam_results[0]
Interval(chr2L:9329-9365)

>>> bam_results[:10]
<itertools.islice object at ...>

>>> cigar_string = i[5]

There are several things to watch out for here.

First, note that pybedtools uses the convention that BAM features in plain text format are considered SAM features, so these SAM features are one-based and include the stop coordinate as illustrated below. (Note that there is some additional complexity here due to supporting Python 2 and 3 simultaneously in this tested documentation)

>>> bam_results[0].start
9329

>>> isinstance(bam_results[0][3], str)
True

>>> print(bam_results[0][3])
9330

Second, the stop coordinate is defined as the start coord plus the length of the sequence; eventually a more sophisticated, CIGAR-aware approach may be used. Similarly, the length is defined to be stop - start – again, not CIGAR-aware at the moment. For more sophisticated low-level manipulation of BAM features, you might want to consider using HTSeq.

Third, while we can iterate over a BAM file and manipulate the features as shown above, calling BEDTools programs on a BAM-based generator is not well-supported.

Specifically:

>>> a = pybedtools.example_bed('gdc.bam')
>>> b = pybedtools.example_bed('b.bed')

>>> # works, gets BAM results
>>> results = a.intersect(b)

>>> # make a generator of features in `a`
>>> a2 = pybedtools.BedTool(i for i in a)

>>> # this does NOT work
>>> a2.intersect(b)

When we specified the bed=True kwarg above, the intersected BAM results are converted to BED format. We can use those like a normal BED file. Note that since we are viewing BED output, the start and stops are 0-based:

>>> d = a.intersect(b, bed=True)
>>> d.head(3)
chr2L       9329    9365    HWUSI-NAME:2:69:512:1017#0      3       -       9329    9365    0,0,0   1       36,     0,
chr2L       9329    9365    HWUSI-NAME:2:69:512:1017#0      3       -       9329    9365    0,0,0   1       36,     0,
chr2L       9329    9365    HWUSI-NAME:2:69:512:1017#0      3       -       9329    9365    0,0,0   1       36,     0,

Consistent with BEDTools programs, BAM files are not supported as the second input argument. In other words, intersectBed does not have both -abam and -bbam arguments, so pybedtools will not not allow this either.

However, pybedtools does allow streaming BAM files to be the input of methods that allow BAM input as the first input. In this [trivial] example, we can stream the first intersection to save disk space, and then send that streaming BAM to the next BedTool.intersect() call. Since it’s not streamed, the second intersection will be saved as a temp BAM file on disk:

>>> a.intersect(b, stream=True).intersect(b)