Three brief examples

Here are three examples that show typical usage of pybedtools. More information can be found in the docstrings of pybedtools methods and in the Tutorial Contents.

You can also check out Shell script comparison for a simple example of how pybedtools can improve readability of your code with no loss of speed compared to bash scripting.

Note

Please take the time to read and understand the conventions pybedtools uses to handle files with different coordinate systems (e.g., 0-based BED files vs. 1-based GFF files), which are described here.

In summary:

  • Integer values representing start/stop are always in 0-based coordinates, regardless of file format. This means that all Interval objects can be treated identically, and greatly simplifies underlying code.
  • String values representing start/stop will use coordinates appropriate for the format (1-based for GFF; 0-based for BED).
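
For example, here is a minimal sketch of the difference, assuming a small GFF file named my.gff (a hypothetical filename used only for illustration):

>>> import pybedtools
>>> # 'my.gff' is a hypothetical GFF file used only to illustrate the conventions
>>> feature = next(iter(pybedtools.BedTool('my.gff')))
>>> feature.start   # integer coordinate: always 0-based, even for GFF
>>> str(feature)    # string form: keeps the file's native coordinates (1-based for GFF)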

Example 1: Save a BED file of intersections, with track line

This example saves a new BED file of intersections between your files mydata/snps.bed and mydata/exons.bed, adding a track line to the output:

>>> import pybedtools
>>> a = pybedtools.BedTool('mydata/snps.bed')
>>> a.intersect('mydata/exons.bed').saveas('snps-in-exons.bed', trackline="track name='SNPs in exons' color=128,0,0")
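
Since saveas() returns a new BedTool pointing at the saved file, you can keep working with the result; for example (a sketch reusing the same hypothetical mydata/ paths):

>>> snps_in_exons = a.intersect('mydata/exons.bed').saveas(
...     'snps-in-exons.bed', trackline="track name='SNPs in exons' color=128,0,0")
>>> snps_in_exons.count()  # number of intersections written to the new file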

Example 2: Intersections for a 3-way Venn diagram

This example gets the values for a 3-way Venn diagram of overlaps and demonstrates operator overloading of BedTool objects. It assumes that you have the files a.bed, b.bed, and c.bed in your current working directory. If you’d like to use the example files that come with pybedtools, replace strings like 'a.bed' with pybedtools.example_filename('a.bed'), which returns the absolute path to the example data file:

>>> import pybedtools

>>> # set up 3 different bedtools
>>> a = pybedtools.BedTool('a.bed')
>>> b = pybedtools.BedTool('b.bed')
>>> c = pybedtools.BedTool('c.bed')

>>> (a-b-c).count()  # unique to a
>>> (a+b-c).count()  # in a and b, not c
>>> (a+b+c).count()  # common to all
>>> # ... and so on, for all the combinations.
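
If the operator shorthand is unfamiliar: for BedTool objects, a + b behaves like intersecting with u=True (features of a that overlap b), and a - b like intersecting with v=True (features of a that do not overlap b). Roughly equivalent explicit forms of the counts above:

>>> # Explicit intersect calls corresponding to the operator versions above
>>> a.intersect(b, v=True).intersect(c, v=True).count()  # roughly a - b - c
>>> a.intersect(b, u=True).intersect(c, v=True).count()  # roughly a + b - c
>>> a.intersect(b, u=True).intersect(c, u=True).count()  # roughly a + b + c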

For more, see the pybedtools.scripts.venn_mpl and pybedtools.scripts.venn_gchart scripts, which wrap this functionality in command-line tools that create Venn diagrams using either matplotlib or the Google Charts API, respectively. Also see the pybedtools.contrib.venn_maker module for a flexible interface to the VennDiagram R package.

Example 3: Count reads in introns and exons, in parallel

This example shows how to count the number of reads in introns and exons, in parallel. It is somewhat more involved, but illustrates several additional features of pybedtools, such as filtering a GFF file with a custom function, saving intermediate results to temp files with saveas(), streaming intersections against a BAM file, and passing filenames (rather than BedTool objects) to multiprocessing.Pool:

#!/usr/bin/env python

"""
Example from pybedtools documentation: find reads in introns and exons using
multiple CPUs.

Prints a tab-separated file containing class (exon, intron, both) and number of
reads in each class.
"""
from __future__ import print_function
import pybedtools
import argparse
import os
import sys
import multiprocessing

if __name__ == "__main__":

    ap = argparse.ArgumentParser(prog=os.path.basename(sys.argv[0]),
                                 description=__doc__)
    ap.add_argument('--gff', required=True,
                    help='GFF or GTF file containing annotations')
    ap.add_argument('--bam', required=True,
                    help='BAM file containing reads to be counted')
    ap.add_argument('--stranded', action='store_true',
                    help='Use strand-specific merging and overlap. '
                         'Default is to ignore strand')
    ap.add_argument('--processes', default=1, type=int,
                    help='Number of processes to use in parallel.')
    ap.add_argument('-v', '--verbose', action='store_true',
                    help='Verbose (goes to stderr)')
    args = ap.parse_args()

    gff = args.gff
    bam = args.bam
    stranded = args.stranded

    if args.processes > 3:
        print(
            "Only need 3 processes (one each for exon, intron, both), so "
            "resetting processes from {0} to 3".format(args.processes),
            file=sys.stderr
        )
        args.processes = 3


    def featuretype_filter(feature, featuretype):
        """
        Only passes features with the specified *featuretype*
        """
        return feature[2] == featuretype


    def subset_featuretypes(featuretype):
        """
        Returns the filename containing only `featuretype` features.
        """
        return g.filter(featuretype_filter, featuretype).saveas().fn


    def count_reads_in_features(features):
        """
        Callback function to count reads in features
        """
        return (
            pybedtools.BedTool(bam)
            .intersect(
                features,
                s=stranded,
                bed=True,
                stream=True,
            )
        ).count()

    # Some GFF files have invalid entries -- like chromosomes with negative coords
    # or features of length = 0.  This line removes them (the `remove_invalid`
    # method) and saves the result in a tempfile
    g = pybedtools.BedTool(gff).remove_invalid().saveas()

    # Set up pool of workers
    pool = multiprocessing.Pool(processes=args.processes)

    # Get separate files for introns and exons in parallel
    featuretypes = ['intron', 'exon']
    introns, exons = pool.map(subset_featuretypes, featuretypes)

    # Since `subset_featuretypes` returns filenames, we convert to BedTool objects
    # to do intersections below.
    introns = pybedtools.BedTool(introns)
    exons = pybedtools.BedTool(exons)

    # Identify unique and shared regions using bedtools commands subtract, merge,
    # and intersect.
    exon_only = exons.subtract(introns).merge()
    intron_only = introns.subtract(exons).merge()
    intron_and_exon = (
        exons
        .intersect(introns)
        .merge()
    )

    # Do intersections with BAM file in parallel. Note that we're passing filenames
    # to multiprocessing.Pool rather than BedTool objects.
    features = (exon_only.fn, intron_only.fn, intron_and_exon.fn)

    # Run count_reads_in_features in parallel over features
    results = pool.map(count_reads_in_features, features)

    labels = ('exon_only',
              'intron_only',
              'intron_and_exon')

    for label, reads in zip(labels, results):
        print('{0}\t{1}'.format(label, reads))
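
As a usage sketch, if the script above were saved as count_reads.py (a hypothetical filename), it could be run as python count_reads.py --gff annotation.gff --bam reads.bam --processes 3, with annotation.gff and reads.bam standing in for your own annotation and alignment files; it prints one label and read count per line to stdout.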

For more on using pybedtools, continue on to the Tutorial Contents...