.. include:: includeme.rst
.. _genomes:
Specifying genomes
==================
This section illustrates the use of genome files for use with BEDTools
programs that need to know chromosome limits to prevent out-of-range
coordinates.
Using BEDTools programs like `slopBed` or `shuffleBed` from the command
line requires "genome" or "chromsizes" files. :mod:`pybedtools` comes with
common genome assemblies already set up as a dictionary with chromosomes as
keys and zero-based (start, stop) tuples as values:
.. doctest::
>>> from pybedtools import genome_registry
>>> genome_registry.dm3['chr2L']
(0, 23011544)
The rules for specifying a genome for methods that require a genome are as
follows (use whatever is most convenient):
* Use `g` to specify either a filename or a dictionary
* Use `genome` to specify either an assembly name or a dictionary
Below are examples of each.
As a file
---------
This is the typical way of using BEDTools programs, by specifying an existing genome
file with `g`:
.. doctest::
:hide:
>>> fn = pybedtools.chromsizes_to_file(pybedtools.chromsizes('hg19'), fn='hg19.genome')
.. doctest::
>>> a = pybedtools.example_bedtool('a.bed')
>>> b = a.slop(b=100, g='hg19.genome')
.. doctest::
:hide:
>>> import os
>>> os.unlink('hg19.genome')
As a string
-----------
This is probably the most convenient way of specifying a genome. If the
genome exists in the genome registry it will be used directly. Alternatively,
if you have `genomepy`_
installed, you can use the genomepy genome name, such as `hg38`. In this case,
the genome file will be located automatically. Finally, if the genome is not
in the registry or managed by genomepy, it will automatically be downloaded
from UCSC. You must use the `genome` kwarg for this; if you use `g` a string
will be interpreted as a filename:
.. doctest::
>>> c = a.slop(b=100, genome='hg19')
As a dictionary
---------------
This is a good way of providing custom coordinates; either `g` or `genome`
will accept a dictionary:
.. doctest::
>>> d = a.slop(b=100, g={'chr1':(1, 10000)})
>>> e = a.slop(b=100, genome={'chr1':(1,100000)})
Make sure that all these different methods return the same results
.. doctest::
>>> b == c == d == e
True
Converting to a file
--------------------
Since BEDTools programs operate on files, the fastest choice will be to
use an existing file. While the time to convert a dictionary to a file is
extremely small, over 1000's of files (e.g., for Monte Carlo simulations),
the time may add up. The function :func:`pybedtools.chromsizes_to_file`
will create a file from a dictionary or string:
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> # with no filename specified, a tempfile will be created
>>> pybedtools.chromsizes_to_file(pybedtools.chromsizes('dm3'), 'dm3.genome')
'dm3.genome'
>>> print(open('dm3.genome').read())
chr2L 23011544
chr2R 21146708
chr3L 24543557
chr3R 27905053
chr4 1351857
chrX 22422827
chr2LHet 368872
chr2RHet 3288761
chr3LHet 2555491
chr3RHet 2517507
chrM 19517
chrU 10049037
chrUextra 29004656
chrXHet 204112
chrYHet 347038