Shell script comparison

The following two scripts illustrate the same analysis. The first script uses pybedtools, and the second uses bash scripting. The filenames in these scripts are written so they can be run without modification from the pybedtools/scripts source directory.

Both scripts print the genes that are <5000 bp from intergenic SNPs. These scripts show how the same analysis can be performed with pybedtools in a much clearer and reusable fashion without losing any speed. Furthermore, note that the bash script requires knowledge in three languages (Perl, bash, and awk) to accomplish the same thing as the Python script.

The bash script contains comparative notes as well as timing comparisons between the two scripts.

pybedtools version (py_ms_example.py)

#!/usr/bin/python
"""
Example from the manuscript; see sh_ms_example.sh for the shell script \
equivalent.

Prints the names of genes that are <5000 bp away from intergenic SNPs.
"""
from os import path
from pybedtools import BedTool


def main():
    """
    Runs Python example from the manuscript
    """
    bedtools_dir = path.split(__file__)[0]
    snps = BedTool(path.join(bedtools_dir, '../test/data/snps.bed.gz'))
    genes = BedTool(path.join(bedtools_dir, '../test/data/hg19.gff'))

    intergenic_snps = (snps - genes)

    nearby = genes.closest(intergenic_snps, d=True, stream=True)

    for gene in nearby:
        if int(gene[-1]) < 5000:
            print(gene.name)

if __name__ == "__main__":
    main()

bash version (sh_ms_example.sh)

#!/bin/bash

# Shell script to do the same analysis as py_ms_example.py -- namely,
# print the names of genes that are <5000 bp away from intergenic SNPs
#
# See below for timing comparisons.

snps=../test/data/snps.bed.gz
genes=../test/data/hg19.gff
intergenic_snps=/tmp/intergenic_snps

# see note #1 below
snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
gene_fields=9
distance_field=$(($gene_fields + $snp_fields + 1))

intersectBed -a $snps -b $genes -v > $intergenic_snps

# see note #2 below
closestBed -a $genes -b $intergenic_snps -d \
| awk '($'$distance_field' < 5000){print $9;}' \
| perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'

# see note #3 below
rm $intergenic_snps







# -----corresponding pybedtools script (see py_ms_example.py in this same dir)----
#
#  from pybedtools import BedTool
#  snps = BedTool('../test/data/snps.bed.gz')
#  genes = BedTool('../test/data/hg19.gff')
#
#  intergenic_snps = (snps - genes)
#  nearby = genes.closest(intergenic_snps,
#                         d=True,
#                         stream=True)
#  for gene in nearby:
#      if int(gene[-1]) < 5000:
#         print gene.name

#------------------------------------------------------------------------------
# Note 1:
#   pybedtools allows the user to index into the fields of a feature using
#   Python indexing.  The "gene[-1]" in the python example above gets last item
#   in the line, no matter what the line sizes of the input files.
#
#   In bash, we need to explicitly check how many fields the original files
#   have so we can know the total field count after the the closestBed
#   operation.  This will change depending on the input BED file, since BED
#   files can have 3-6, 9, or 12 columns.  GFF is defined to have 9 fields.
#   Pre-calculating these numbers saves a little time by not having to get the
#   number of fields for each line in the awk script
#------------------------------------------------------------------------------
# Note 2:
#   First, we use the distance field calculated previously.  The closestBed
#   operation was driven by a GFF file, so that file's fields will come first.
#   GFF files are defined to have 9 fields, and the 9th contains the
#   attributes . . . so we can rely on the 9th field being GFF attributes.
#
#   pybedtools stores GFF attributes as a dictionary, and can pull out the name
#   of a gene with the `name` attribute, as in the Python line "print
#   gene.name".  There is no standard for GFF formats; the gene name could be
#   listed as "ID", "Name", or "gene_id" attributes. Here, the Perl regular
#   expression performs the same function
#------------------------------------------------------------------------------
# Note 3:
#   pybedtools automatically deletes temporary files from disk. The Python line
#   "intergenic_snps = (snps - genes)" saves a temporary file for the length of
#   the script, and that file is deleted at exit.  Here, we need to explicitly
#   remove the tempfile we created.
#------------------------------------------------------------------------------

# =============================================================================
# Timing
# =============================================================================
# The running times of the Python version and the bash script version are
# equivalent:
#
# time ./sh_ms_example.sh > sh.out
# (best of 3)
# --------------------------------
# real	0m42.059s
# user	0m41.970s
# sys	0m0.190s

# time ./py_ms_example.py > py.out
# (best of 3)
# --------------------------------
# real	0m42.000s
# user	0m42.050s
# sys	0m0.230s


# Confirm identical results
# -------------------------
# diff sh.out py.out | wc -l
# 0