metaseq.integration.signal_comparison.compare

metaseq.integration.signal_comparison.compare(signal1, signal2, features, outfn, comparefunc=<ufunc 'subtract'>, batchsize=5000, array_kwargs=None, verbose=False)[source]

Compares two genomic signal objects and outputs results as a bedGraph file. Can be used for entire genome-wide comparisons due to its parallel nature.

Typical usage would be to create genome-wide windows of equal size to provide as features:

windowsize = 10000
features = pybedtools.BedTool().window_maker(
   genome='hg19', w=windowsize)

You will usually want to choose bins for the array based on the final resolution you would like. Say you would like 10-bp bins in the final bedGraph; using the example above you would use array_kwargs={‘bins’: windowsize/10}. Or, for single-bp resolution (beware: file will be large), use {‘bins’: windowsize}.

Here’s how it works. This function:

  • Takes batchsize features at a time from features
  • Constructs normalized (RPMMR) arrays in parallel for each input genomic signal object for those batchsize features
  • Applies comparefunc (np.subtract by default) to the arrays to get a “compared” (e.g., difference matrix by default) for the batchsize features.
  • For each row in this matrix, it outputs each nonzero column as a bedGraph format line in outfn

comparefunc is a function with the signature:

def f(x, y):
    return z

where x and y will be arrays for signal1 and signal2 (normalized to RPMMR) and z is a new array. By default this is np.subtract, but another common comparefunc might be a log2-fold-change function:

def lfc(x, y):
    return np.log2(x / y)
Parameters:
  • signal1 – A genomic_signal object
  • signal2 – Another genomic_signal object
  • features – An iterable of pybedtools.Interval objects. A list will be created for every batchsize features, so you need enough memory for this.
  • comparefunc – Function to use to compare arrays (default is np.subtract)
  • outfn – String filename to write bedGraph file
  • batchsize – Number of features (each with length windowsize bp) to process at a time
  • array_kwargs – Kwargs passed directly to genomic_signal.array. Needs processes and chunksize if you want parallel processing
  • verbose – Be noisy