Notes on BAM file semantics

These are some implementation notes about how BAM files are handled by mod:pybedtools for those interested in the implementation.

The initial creation of a BedTool that points to a file will trigger a check on the first 15 bytes of a file to see if it’s a BAM file. If so, then the BedTool’s _isbam attribute is set to True. If the BedTool is a stream, then the check will not be made, and it is up to the creator (whether it’s the user on the command line or a method or function) to set the BAM-streaming BedTool’s ._isbam attribute to True. This is handled automatically for wrapped BEDTools programs (described below).

Some BEDTools programs natively handle BAM files. The @_wraps decorator that is used to wrap each method has a bam kwarg that specifies what input argument the wrapped tool will accept as BAM (for example, the wrapper for intersectBed has the kwarg bam="abam").

If self._isbam == True, then self.fn is passed to the bam input arg instead of the default implicit input arg (so intersectBed, self.fn is passed as abam instead of -a).

Trying to call a method that does not have a bam kwarg registered will result in a ValueError, along with a message that says to use BedTool.bam_to_bed() first. For example, subtractBed currently doesn’t accept BAM files as input, so this doesn’t work:

>>> a = pybedtools.example_bedtool('gdc.bam')
>>> b = pybedtools.example_bedtool('gdc.gff')

>>> # doesn't work:
>>> c = a.subtract(b)

However, converting to a to BED format first (and setting stream=True to save on disk I/O) works fine:

>>> # works:
>>> c = a.bam_to_bed(stream=True).subtract(b)

Iterating over a file-based BedTool that points to a BAM will call samtools view and yields lines which sent to IntervalIterator, which splits the lines and passes them to create_interval_from_list which in turn decides on the fly whether it’s gff, bed, or sam.

However, we can’t easily check the first 15 bytes of a streaming BedTool, because that would consume those bytes. The @_wraps decorator needs to know some information about which arguments to a wrapped program result in BAM output and which result in non-BAM output.

Given a = BedTool('x.bam'):

  • c = a.intersect(b) creates BAM output, so it returns a new BedTool with c._isbam = True.
  • a.intersect(b, bed=True) returns BED output. @_wraps needs to know, if the input was BAM, which kwarg[s] disable BAM output. For example, if -bed is passed to intersectBed, the output will NOT be BAM. This is implemented with the nonbam kwarg for _wraps(). In this case, the resulting BED file is treated like any other BED file.
  • c = a.intersect(b, stream=True) returns streaming BAM output. In this case, iterating over c will send the BAM stream to stdin of a samtools call