gffutils.pybedtools_integration.tsses
- gffutils.pybedtools_integration.tsses(db, merge_overlapping=False, attrs=None, attrs_sep=':', merge_kwargs=None, as_bed6=False, bedtools_227_or_later=True)[source]
Create 1-bp transcription start sites for all transcripts in the database and return as a sorted pybedtools.BedTool object pointing to a temporary file.
To save the file to a known location, use the
.moveto()
method on the resultingpybedtools.BedTool
object.To extend regions upstream/downstream, see the
.slop()
method on the resultingpybedtools.BedTool object
.Requires pybedtools.
- Parameters:
db (gffutils.FeatureDB) – The database to use
as_bed6 (bool) –
If True, output file is in BED6 format; otherwise it remains in the GFF/GTF format and dialect of the file used to create the database.
Note that the merge options below necessarily force
as_bed6=True
.merge_overlapping (bool) – If True, output will be in BED format. Overlapping TSSes will be merged into a single feature, and their names will be collapsed using
merge_sep
and placed in the new name field.merge_kwargs (dict) –
If
merge_overlapping=True
, these keyword arguments are passed to pybedtools.BedTool.merge(), which are in turn sent tobedtools merge
. The merge operates on a BED6 file which will have had the name field constructed as specified by other arguments here. See the available options for your installed version of BEDTools; the defaults used here aremerge_kwargs=dict(o='distinct', c=4, s=True)
.Any provided
merge_kwargs
are used to update the default. It is recommended to not overridec=4
ands=True
, otherwise the post-merge fixing may not work correctly. Good candidates for tweaking ared
(merge distance),o
(operation),delim
(delimiter to use for collapse operations).attrs (str or list) –
Only has an effect when
as_bed6=True
ormerge_overlapping=True
.Determines what goes in the name field of an output BED file. By default, “gene_id” for GTF databases and “ID” for GFF. If a list of attributes is supplied, e.g. [“gene_id”, “transcript_id”], then these will be joined by
attr_join_sep
and then placed in the name field.attrs_sep (str) – If
as_bed6=True
ormerge_overlapping=True
, then use this character to separate attributes in the name field of the output BED. If also usingmerge_overlapping=True
, you’ll probably want this to be different thanmerge_sep
in order to parse things out later.bedtools_227_or_later (bool) –
In version 2.27, BEDTools changed the output for merge. By default, this function expects BEDTools version 2.27 or later, but set this to False to assume the older behavior.
For testing purposes, the environment variable GFFUTILS_USES_BEDTOOLS_227_OR_LATER is set to either “true” or “false” and is used to override this argument.
Examples
>>> import gffutils >>> db = gffutils.create_db( ... gffutils.example_filename('FBgn0031208.gtf'), ... ":memory:", ... keep_order=True, ... verbose=False)
Default settings – no merging, and report a separate TSS on each line even if they overlap (as in the first two):
>>> print(tsses(db)) chr2L gffutils_derived transcript_TSS 7529 7529 . + . gene_id "FBgn0031208"; transcript_id "FBtr0300689"; chr2L gffutils_derived transcript_TSS 7529 7529 . + . gene_id "FBgn0031208"; transcript_id "FBtr0300690"; chr2L gffutils_derived transcript_TSS 11000 11000 . - . gene_id "Fk_gene_1"; transcript_id "transcript_Fk_gene_1"; chr2L gffutils_derived transcript_TSS 12500 12500 . - . gene_id "Fk_gene_2"; transcript_id "transcript_Fk_gene_2";
Default merging, showing the first two TSSes merged and reported as a single unique TSS for the gene. Note the conversion to BED:
>>> x = tsses(db, merge_overlapping=True) >>> print(x) chr2L 7528 7529 FBgn0031208 . + chr2L 10999 11000 Fk_gene_1 . - chr2L 12499 12500 Fk_gene_2 . -
Report both gene ID and transcript ID in the name. In some cases this can be easier to parse than the original GTF or GFF file. With no merging specified, we must add
as_bed6=True
to see the names in BED format.>>> x = tsses(db, attrs=['gene_id', 'transcript_id'], as_bed6=True) >>> print(x) chr2L 7528 7529 FBgn0031208:FBtr0300689 . + chr2L 7528 7529 FBgn0031208:FBtr0300690 . + chr2L 10999 11000 Fk_gene_1:transcript_Fk_gene_1 . - chr2L 12499 12500 Fk_gene_2:transcript_Fk_gene_2 . -
Use a 3kb merge distance so the last 2 features are merged together:
>>> x = tsses(db, merge_overlapping=True, merge_kwargs=dict(d=3000)) >>> print(x) chr2L 7528 7529 FBgn0031208 . + chr2L 10999 12500 Fk_gene_1,Fk_gene_2 . -
The set of unique TSSes for each gene, +1kb upstream and 500bp downstream:
>>> x = tsses(db, merge_overlapping=True) >>> x = x.slop(l=1000, r=500, s=True, genome='dm3') >>> print(x) chr2L 6528 8029 FBgn0031208 . + chr2L 10499 12000 Fk_gene_1 . - chr2L 11999 13500 Fk_gene_2 . -