Notes on BAM file semantics¶
These are some implementation notes about how BAM files are handled by
mod:pybedtools for those interested in the implementation.
The initial creation of a BedTool that points to a file will
trigger a check on the first 15 bytes of a file to see if it’s a BAM file.
If so, then the BedTool’s _isbam attribute is set to True. If the
BedTool is a stream, then the check will not be made, and it is up
to the creator (whether it’s the user on the command line or a method or
function) to set the BAM-streaming BedTool’s ._isbam attribute to True.
This is handled automatically for wrapped BEDTools programs (described
below).
Some BEDTools programs natively handle BAM files. The @_wraps decorator
that is used to wrap each method has a bam kwarg that specifies what
input argument the wrapped tool will accept as BAM (for example, the
wrapper for intersectBed has the kwarg bam="abam").
If self._isbam == True, then self.fn is passed to the bam input arg
instead of the default implicit input arg (so intersectBed, self.fn is
passed as abam instead of -a).
Trying to call a method that does not have a bam kwarg registered will
result in a ValueError, along with a message that says to use
BedTool.bam_to_bed() first. For example, subtractBed currently
doesn’t accept BAM files as input, so this doesn’t work:
>>> a = pybedtools.example_bedtool('gdc.bam')
>>> b = pybedtools.example_bedtool('gdc.gff')
>>> # doesn't work:
>>> c = a.subtract(b)
However, converting to a to BED format first (and setting stream=True
to save on disk I/O) works fine:
>>> # works:
>>> c = a.bam_to_bed(stream=True).subtract(b)
Iterating over a file-based BedTool that points to a BAM will call
samtools view and yields lines which sent to IntervalIterator, which
splits the lines and passes them to create_interval_from_list which in
turn decides on the fly whether it’s gff, bed, or sam.
However, we can’t easily check the first 15 bytes of a streaming BedTool,
because that would consume those bytes. The @_wraps decorator needs to
know some information about which arguments to a wrapped program result in
BAM output and which result in non-BAM output.
Given a = BedTool('x.bam'):
c = a.intersect(b)creates BAM output, so it returns a new BedTool withc._isbam = True.a.intersect(b, bed=True)returns BED output.@_wrapsneeds to know, if the input was BAM, which kwarg[s] disable BAM output. For example, if-bedis passed tointersectBed, the output will NOT be BAM. This is implemented with thenonbamkwarg for_wraps(). In this case, the resulting BED file is treated like any other BED file.c = a.intersect(b, stream=True)returns streaming BAM output. In this case, iterating overcwill send the BAM stream to stdin of a samtools call