metaseq.array_helpers._local_coverage¶
-
metaseq.array_helpers.
_local_coverage
(reader, features, read_strand=None, fragment_size=None, shift_width=0, bins=None, use_score=False, accumulate=True, preserve_total=False, method=None, processes=None, stranded=True, verbose=False)[source]¶ Returns a binned vector of coverage.
Computes a 1D vector of coverage at the coordinates for each feature in features, extending each read by fragmentsize bp.
Some arguments cannot be used for bigWig files due to the structure of these files. The parameters docstring below indicates whether or not an argument can be used with bigWig files.
Depending on the arguments provided, this method can return a vector containing values from a single feature or from concatenated features.
An example of the flexibility afforded by the latter case:
features can be a 3-tuple of pybedtools.Intervals representing (TSS + 1kb upstream, gene, TTS + 1kb downstream) and bins can be [100, 1000, 100]. This will return a vector of length 1200 containing the three genomic intervals binned into 100, 1000, and 100 bins respectively. Note that is up to the caller to construct the right axes labels in the final plot!Parameters: features : str, interval-like object, or list
Can be a single interval or an iterable yielding intervals.
Interval-like objects must have chrom, start, and stop attributes, and optionally a strand attribute. One exception to this that if features is a single string, it can be of the form “chrom:start-stop” or “chrom:start-stop[strand]”.
If features is a single interval, then return a 1-D array for that interval.
If features is an iterable of intervals, then return a 1-D array that is a concatenation of signal for these intervals.
Available for bigWig.
bins : None, int, list
If bins is None, then each value in the returned array will correspond to one bp in the genome.
If features is a single Interval, then bins is an integer or None.
If features is an iterable of Intervals, bins is an iterable of integers of the same length as features.
Available for bigWig.
fragment_size : None or int
If not None, then each item from the genomic signal (e.g., reads from a BAM file) will be extended fragment_size bp in the 3’ direction. Higher fragment sizes will result in smoother signal. Not available for bigWig.
shift_width : int
Each item from the genomic signal (e.g., reads from a BAM file) will be shifted shift_width bp in the 3’ direction. This can be useful for reconstructing a ChIP-seq profile, using the shift width determined from the peak-caller (e.g., modeled d in MACS). Not available for bigWig.
read_strand : None or str
If read_strand is one of “+” or “-”, then only items from the genomic signal (e.g., reads from a BAM file) on that strand will be considered and reads on the opposite strand ignored. Useful for plotting genomic signal for stranded libraries. Not available for bigWig.
stranded : bool
If True, then the profile will be reversed for features whose strand attribute is “-”.
use_score : bool
If True, then each bin will contain the sum of the score attribute of genomic features in that bin instead of the number of genomic features falling within each bin. Not available for bigWig.
accumulate : bool
If False, then only record that there was something there, rather than acumulating reads. This is useful for making matrices with called peaks. Available for bigWig.
preserve_total : bool
If True, re-scales the returned value so that each binned row’s total is equal to the sum of the original, un-binned data. The units of the returned array will be in “total per bin”. This is useful for, e.g., counting reads in features. If preserve_total is False, then the returned array will have units of “density”; this is more generally useful and is the default behavior. Available for bigWig, but not when using method=”ucsc_summarize”.
method : str; one of [ “summarize” | “get_as_array” | “ucsc_summarize” ]
Only used for bigWig. The method specifies how data are extracted from the bigWig file. “summarize” is the default. It’s quite fast, but may yield slightly different results when compared to running this same function on the BAM file from which the bigWig was created.
“summarize” uses bx-python. The values returned will not be exactly the same as the values returned when local_coverage is called on a BAM, BED, or bigBed file, but they will be close. This method is quite fast, and is the default when bins is not None.
“get_as_array” uses bx-python, but does a separate binning step. This can be slower than the other two methods, but the results are exactly the same as those from a BAM, BED, or bigBed file. This method is always used if bins=None.
“ucsc_summarize” is an alternative version of “summarize”. It uses the UCSC program bigWigSummary, which must already installed and on your path.
processes : int or None
The feature can be split across multiple processes.
Returns: 1-d NumPy array
Notes
If a feature has a “-” strand attribute, then the resulting profile will be relative to a minus-strand feature. That is, the resulting profile will be reversed.
Returns arrays x and y. x is in genomic coordinates, and y is the coverage at each of those coordinates after extending fragments.
The total number of reads is guaranteed to be the same no matter how it’s binned.
(with ideas from http://www-huber.embl.de/users/anders/HTSeq/doc/tss.html)