hubward

A common question when reading an interesting paper is, “How does this relate to my favorite gene locus?” To answer it, we usually need to download the supplemental data, figure out what format it’s in, convert it to some other useful format, and visualize it alongside our own data.

hubward helps manage this process, lowering the effort required to track, manage, organize, and upload many studies for cross-comparison in the UCSC Genome Browser.

Data are organized into track hubs on the UCSC Genome Browser, and the name hubward indicates the direction in which data are moved into these track hubs. It can also refer to a direction in other complex systems.

The separate repository, hubward-studies, contains examples of prepared track hubs. Subsets of these can be combined into user-defined hubs, or can serve as examples for preparing other studies.

Documentation can be found at https://daler.github.io/hubward.

Overview

This poster gives an overview of hubward, and provides a worked example.

Further details can be found in the full documentation.

hubward uses the following concepts:

track: Data that can be represented as a single track in the UCSC Genome Browser. Examples include a file of called peaks; read pileup from a single RNA-seq sample; CNV scores for one sample; or anything that can be converted into bigBed, bigWig, BAM, or VCF format.
study: A collection of tracks, typically all from the same published article.
group: A collection of studies, typically related in some way.
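
The nesting of these three concepts can be pictured with a small data model. This is only an illustrative sketch: the class and attribute names below are hypothetical and are not taken from hubward's own code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Track:
    """One browser track (hypothetical model, not hubward's API)."""
    label: str
    processed: str  # path to a bigBed, bigWig, BAM, or VCF file


@dataclass
class Study:
    """A collection of tracks, typically from one published article."""
    label: str
    tracks: List[Track] = field(default_factory=list)


@dataclass
class Group:
    """A collection of related studies."""
    name: str
    studies: List[Study] = field(default_factory=list)

    def all_tracks(self):
        """Return every track from every study in the group."""
        return [track for study in self.studies for track in study.tracks]
```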

Studies

The minimal definition of a hubward “study” is a directory with a metadata.yaml file. In practice, the directory contains raw data and conversion scripts. A study generally corresponds to data from a single published paper, but this is not required. The metadata.yaml file describes and configures one or many tracks grouped together. These are uploaded to a track hub as a single composite track.

The metadata.yaml file consists of several sections. The study section stores bibliographic information. It is converted to HTML documentation and added to the study’s configuration page in the UCSC Genome Browser.

study:
  reference: 'Ho, J. W. K. et al. Nature 512, 449-452 (2014).'
  PMID: 25164756
  description: 'ENCODE predicted enhancers'
  label: encode-enhancers
  processing: "Downloaded data were converted to bigBed format"

The tracks section is a list, with one item for each track to be included in the hub. Here is one such item in the tracks list:

tracks:
  -
    label: "enhancers [K562]"
    description: "K562 enhancers"
    genome: hg19
    original: "raw-data/p300_enhancers_K562.txt"
    processed: "processed-data/p300_enhancers_K562.bigbed"
    script: "src/process.py"
    source:
      fn: "comparative_enhancer_calls.tar.gz"
      url: "http://compbio.med.harvard.edu/modencode/webpage/enh_calls_final/comparative_enhancer_calls.tar.gz"
    trackinfo:
       tracktype: "bigBed 9"
       visibility: "dense"
       itemRgb: "on"
       color: "255,0,0"
    type: bigbed

The config file format and fields are described in detail later in the documentation. In summary, this block defines the source data, an output file to create, and a conversion script to create a bigBed file with features colored red, for enhancers in K562 cells from the ENCODE project. A logical extension of this would be to include additional tracks for other cell lines in this data set.
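
As a rough illustration of what the conversion step for such a track might involve, the sketch below turns a tab-delimited text file into BED9 lines whose itemRgb column is red. It rests on assumptions not stated above: that the first three columns of the original file are chrom, start, and end, and that features can be given generated names. The function names to_bed9 and convert are hypothetical. Turning the resulting BED file into bigBed (for example with UCSC's bedToBigBed tool) is deliberately left out.

```python
RED = "255,0,0"  # itemRgb value for red, written as "R,G,B"


def to_bed9(lines, rgb=RED):
    """Yield BED9 lines from tab-delimited lines of chrom, start, end.

    Assumption: the first three columns are chrom, start, end. Blank
    lines and lines starting with '#' are skipped.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        # BED9 columns: chrom, start, end, name, score, strand,
        # thickStart, thickEnd, itemRgb. The name uses the input
        # line index, so skipped lines leave gaps in the numbering.
        fields = [chrom, start, end, "enhancer_%d" % i, "0", ".",
                  start, end, rgb]
        yield "\t".join(fields)


def convert(source, target):
    """Read `source` and write its features to `target` as BED9."""
    with open(source) as fin, open(target, "w") as fout:
        for bed_line in to_bed9(fin):
            fout.write(bed_line + "\n")
```

A processing script built on this would call convert with the original path and an intermediate BED path, then run the bigBed conversion on the result.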

To process the data for a study, use:

hubward process <directory>

where directory contains the metadata.yaml file. For each defined track, this will:

  • Ensure the original data exist. If not, the source url is downloaded to the source fn and extracted.
  • Ensure the processed file exists and is up to date. If it is older than original or older than script, the script is re-run.
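
The up-to-date check described above resembles the timestamp comparison used by make. A minimal sketch of that idea follows, assuming file modification times are the criterion; the function name needs_processing is hypothetical and this is not hubward's actual implementation.

```python
import os


def needs_processing(original, processed, script):
    """Return True if `processed` is missing or older than either input."""
    if not os.path.exists(processed):
        return True
    processed_mtime = os.path.getmtime(processed)
    # Re-run if the raw data or the conversion script changed after
    # the processed file was last written.
    return any(
        os.path.getmtime(dependency) > processed_mtime
        for dependency in (original, script)
    )
```

Comparing against the script as well as the data means that editing the conversion logic triggers a rebuild even when the raw data are unchanged.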

Groups

Multiple studies can be grouped together using a higher-level config file, here called group.yaml. Each study can have multiple tracks; each group can have multiple studies.

For example, if the above metadata.yaml file is in the directory encode/hg19/encode-enhancers, then that directory can be included in the studies list so that the K562 enhancers track will be uploaded:

group: encode
genome: hg19
name: "encodetracks"
short_label: "Supp. ENCODE"
long_label: "Supplemental ENCODE tracks"
hub_url: "http://localhost/encode/hg19/compiled.hub.txt"

server:
  hub_remote: "/root/encode/hg19/compiled.hub.txt"
  host: "localhost"
  user: "www"
  email: "www@localhost"

studies:
  - encode/hg19/encode-enhancers

To process all studies in a group, run:

hubward process <group YAML file>

This processes all configured studies to ensure their output is up-to-date.

To create the track hub files and upload to a remote server, run:

hubward upload <group YAML file>

After it runs, it will show the URL that can be used to load the hub into the Genome Browser.

Workflow

To visualize a new dataset, the workflow is the following:

  1. Write a metadata.yaml file and the corresponding scripts to perform conversion.
  2. Write a group config file that groups together individual studies.
  3. Run hubward process <group config>. This parses the group config file and, for each defined study, parses its metadata.yaml file, downloads data if needed, and runs conversion scripts if necessary.
  4. Run hubward upload <group config>. This builds the track hub config files using the trackhub Python package, and uploads to the server configured in the group config.
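
Steps 3 and 4 can be chained from a small driver script. The sketch below only wraps the two commands named in those steps; the function names are hypothetical, and it assumes the hubward executable is on the PATH.

```python
import subprocess


def workflow_commands(group_config):
    """Return the hubward commands for steps 3 and 4, in order."""
    return [
        ["hubward", "process", group_config],
        ["hubward", "upload", group_config],
    ]


def run_workflow(group_config):
    """Run each command in turn, stopping at the first failure."""
    for command in workflow_commands(group_config):
        # check=True raises CalledProcessError on a non-zero exit
        # status, so upload is not attempted after a failed process.
        subprocess.run(command, check=True)
```

Calling run_workflow with the path to a group config file would then process all studies and upload the hub in one go.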

Going further

Use hubward skeleton to create a template study including directories, and a metadata-builder.py script to aid in programmatic generation of metadata.yaml.
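
To give a flavor of programmatic metadata generation, the sketch below builds one tracks item per cell line by following the file-naming pattern of the K562 example. It is not the metadata-builder.py that hubward generates: the function names are hypothetical, the cell line "H1" is a placeholder rather than a claim about what the data set contains, and only some fields are filled in. Output is written as JSON, which most YAML parsers also accept, to keep the sketch free of third-party imports.

```python
import json


def build_tracks(cell_lines, shared):
    """Return a list of track dicts, one per cell line.

    `shared` holds fields that are identical for every track.
    """
    tracks = []
    for cell_line in cell_lines:
        track = dict(shared)  # copy so each track gets its own dict
        track.update({
            "label": "enhancers [%s]" % cell_line,
            "description": "%s enhancers" % cell_line,
            "original": "raw-data/p300_enhancers_%s.txt" % cell_line,
            "processed": "processed-data/p300_enhancers_%s.bigbed" % cell_line,
        })
        tracks.append(track)
    return tracks


def render(tracks):
    """Serialize the tracks section as indented JSON text."""
    return json.dumps({"tracks": tracks}, indent=2)


if __name__ == "__main__":
    # "H1" is a placeholder cell line used only for illustration.
    print(render(build_tracks(["K562", "H1"],
                              {"genome": "hg19", "type": "bigbed"})))
```

Remaining per-track fields such as script, source, and trackinfo could be supplied through the shared dictionary in the same way.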

hubward includes many helper functions which can be imported into the processing script.
