# Introduction¶

Transcription factor-associated Combinatorial Epigenetic Patterns (T-cep) is a collection of tools for training a univariate first-order hidden Markov model (HMM) over multiple epigenetic data sets to find combinatorial epigenetic regulation patterns. We have used T-cep successfully to find transcription factor (TF)-associated states in several cell lines. If you wish to focus on TF-associated states (downstream of T-cep), we recommend that only the TF of interest be used, although T-cep supports multiple TFs.

Model

T-cep trains a univariate first-order hidden Markov model. The number of states is user-selected. The number of outputs is 2 ^ the number of data sets. Each output represents a unique combination of epigenetic marks, calculated by assigning a bit to every mark. For example, output 0 corresponds to no marks in a bin, and output (2^datasets - 1) corresponds to all marks in a bin.

HMM training is performed using a scaling variant of the Baum-Welch algorithm . The initial HMM parameters may be either randomly generated or user-supplied. By default, a minimum probability of 10-6 for all transition, emission and start probabilities is enforced after every iteration. Every chromosome in every cell line is considered to be a separate observation sequence. The expectation-step may be performed in parallel among chromosomes, provided sufficient memory and cores are present.

T-cep_1.0.tar.gz

http://compbio.uthscsa.edu/T-cep/T-cep_1.0.tar.gz

# Installation¶

Dependencies

T-cep is written in C++, and should compile with GCC and run on any x86 64-bit Linux system running at least Kernel 2.6, You may also need to install boost (CentOS/Redhat: yum install boost boost-devel. Ubuntu: apt-get install libboost-all-dev).

However a computer with several gigabytes of RAM is strongly recommended to avoid an out-of-memory error. T-cep requires Boost and OpenMP to be available. By default, T-cep will attempt to use all cores of a computer if possible (up to a limit depending on the genome and the number of cell lines used, after which additional cores will provide no benefit), however T-cep respects the environment variable OMP_NUM_THREADS. T-cep does not benefit from any GPU installed or CUDA, and is CPU-bound instead of IO-bound on most systems.

Easy-Installtion

To build the T-cep release versions, go to T-cep/build and just run '\$ make'. Then you can Using the software in T-cep/build/bin/

# Usage¶

T-cep consists of three main tools: T-cep-precompile, T-cep-learn and T-cep-results. T-cep-precompile is used to determine the epigenetic marks present in each bin in a cell line. T-cep-learn trains an HMM over data in one or more cell lines. T-cep-results outputs BED files containing the states at each bin in a cell line, as well as the mark combinations present (outputs).

T-cep-precompile <options> binsize config_file -g hg19.chrom.sizes output_file
Options:
--help                    Print help message
-s [ --noshift ]          Do not shift reads (default is to shift by half the estimated fragment width


Within the config file (see File Formats), a P value or read threshold must be defined for each input file. By default, T-cep-precompile shifts all reads in their 5’ to 3’ direction by half of their estimated average fragment length. The estimated average fragment length is calculated separately for each input file using the same algorithm as BELT, defaulting to 150 if there are less than 10,000 reads in a file or the algorithm is unsuccessful. This can be disabled with the -s or the --noshift argument. T-cep-precompile does not filter duplicate reads, and this should be done before running T-cep if desired or required by your sequencing or alignment protocols.

T-cep-learn <options> -h HMM.rawhmm numStates numOutputs filename1 filename2...
Options:
-b: calculate and report BIC after every iteration
-z: zero probabilities when they drop below the min probability threshold
Conflicts with -F
-F: do not keep probabilities from dropping below the min probability threshold
Conflicts with -z
-H [filename]: supply initial HMM parameters. Accepts a .rawhmm file.
If used, the number of states and number of outputs specified on the command line will be overridden by the HMM file.
-h [filename]: output the final HMM in .rawhmm format.
-m [double]: minimum transition, emission or start probability (default: 1.0e-6)
-i [int]: maximum number of iterations (default: 300)
Use -i -1 to train continuously (usually used with -d)
-w [int]: bin width (default: 1000)
-d [double]: stop training if |decrease of BIC| is below this value
Implies -b.
-g [filename]: use a custom genome file (default genome is hg18). This file must have been specified to the precompiler as well.
-p: print transition, emission, and start probabilities of the HMM after each iteration.



Notes: -b: By default, the BIC score of the HMM is only calculated at the beginning and end of training. This is because the PrH values for log-likelihood calculation must be calculated in log-space to avoid underflow, which is much slower than the scaling Baum-Welch variant used for HMM training. -b forces calculation of the BIC score at every iteration, which considerably slows training. For instance, the BIC scores can be graphed to show visual evidence of HMM training convergence. -H: By default, T-cep-learn randomly generates the initial HMM before training. A custom starting HMM can be specified if desired (for instance, if you wish to further train an existing model). -d: By default, T-cep-learn trains an HMM for a set number of iterations (given by -i). T-cep-learn also supports training until ΔBIC is less than a set threshold. In this case, -d <threshold> -i -1 should usually be specified, unless you wish to set a hard limit on the number of iterations (300 by default). -z, -F, -m: ADVANCED FEATURE. By default, T-cep prohibits any model probability from dropping below 10-6 to avoid numerical underflow, and to ensure that the HMM remains fully connected.

T-cep-results hg19.chrom.sizes binWidth states.bed outputs.bed data.precomp HMM.rawhmm


Notes: This program will output a BED file containing the state assignments in a cell line in the file given by states.bed, and a BED file containing the observations at the file given by outputs.bed.

# Example Workflow¶

The following commands have been tested on a server with 64 CPUs/256G Memory, CentOS release 6.3.

## Step1: Precompile¶

In this example, we will be generating a HMM to model the combinatorial epigenetic patterns of several histone marks in two cell lines. Although multiple HMMs would normally be generated in a study (this is why precompilation is decoupled from HMM training), we will be learning a single HMM to obtain an overview of these two cell lines.

T-cep-precompile 500 MCF7_precompile_config.txt -g hg19.chrom.sizes MCF7_hg19_w500.precomp | tee T-cep-precompile_MCF7.out


MCF7_precompile_config.txt contains:

MCF7_TCF7L2.bed 0.0001
MCF7_H3K4me1.bed 0.0001
MCF7_H3K4me3.bed 0.0001
MCF7_H3K9me3.bed 0.0001
MCF7_H3K27me3.bed 0.0001
MCF7_H3K36me3.bed 0.0001
MCF7_H3K27ac.bed 0.0001


The first step is to use T-cep-precompile to generate the precompiled data sets for T-cep-learn. We chose a bin size of 500, although multiple bin sizes would ordinarily be tested in one study (with separate runs of T-cep-precompile and T-cep-learn). In HeLa_precompile_config.txt, we specify BED files containing aligned reads for each ChIP-seq experiment, as well as a maximum P value of 1e-4 for each of our seven HeLa data sets. Finally, we specify the output filename (of the precompiled data set). The output is piped into tee in case it is required for future use or documentation (for instance, it outputs the number of reads and the thresholds to call marks).

T-cep-precompile 500 HeLa_precompile_config.txt -g hg19.chrom.sizes HeLa_hg19_w500.precomp | tee T-cep-precompile_HeLa.out


We repeat this for HeLa. IMPORTANT: The data sets must be given in the same order for each cell line.

## Step2: HMM Training¶

T-cep-learn -b -h HMM_MCF7_HeLa_w500.rawhmm -w 500 -g hg19.chrom.sizes 15 128  MCF7_hg19_w500.precomp HeLa_hg19_w500.precomp > T-cep-learn.out


T-cep-learn trains the HMM. Since -H was not specified, the starting HMM will be randomly generated. We train the model for the default 300 iterations (-i to change this). Since we are performing an overview of these two cell lines, we specify -b to calculate the BIC score of the HMM at each iteration (so that convergence within 300 iterations can be verified). IMPORTANT: -b will significantly slow down HMM training (and is used here primarily for the sake of example). Specifying -h causes the final HMM to be output in an easily parseable format usable by T-cep-results. The bin size is specified with -w 500, and the genome file with -g. We are training a 15-state model with 128 outputs (since 7 marks are included, the number of outputs is 27 = 128, and we are not using getOutputMap.pl in this example). The final arguments are the precompiled data files.

## Step3: Output Combinatorial Mark Emission¶

T-cep-results hg19.chrom.sizes 500 MCF7_states.bed MCF7_outputs.bed MCF7_hg19_w500.precomp HMM_MCF7_HeLa_w500.rawhmm | tee T-cep-results_MCF7.out


T-cep-results calls the states for each bin within a cell line, and outputs results for further processing. HeLa_states_w500.bed contains the states for every bin in the MCF7 genome, and MCF7_outputs.bed contains the outputs. The genome file, bin size (500) and HMM file are passed to T-cep-results as well.

T-cep-results hg19.chrom.sizes 500 HeLa_states.bed HeLa_outputs.bed HeLa_hg19_w500.precomp HMM_MCF7_HeLa_w500.rawhmm | tee T-cep-results_HeLa.out


We repeat this for HeLa.

File Formats

Genome file: List of <chrID> <chromosome length>, one on each line, separated by a space.

Precompiler config file: List of <filename> <P value or read threshold>

Input files must be in BED format. For the second column, a decimal is assumed to be a P value, and an integer is assumed to be a read threshold.

HMM file (.rawhmm extension):
<# states> <# outputs>
<a1,1> <a1,2>...
<a2,1> <a2,2>...
...
<b1,1> <b1,2>...
<b2,1> <b2,2>...
...
<s1> <s2>...

Where ai,j is the transition probability from state i to state j, bi,j is the emission probability of state i and observation j and si is the start probability of state i. All entries on a line are space-separated.

# Supplemental Scripts¶

Several Perl and MATLAB scripts are provided with T-cep, for generating figures and working with large numbers of data sets. These are summarized below (refer to their documentation for more information).

function [out] = byObservationToByMark(probs, numStates, numOutputs)


This is a MATLAB script that calculates the output probabilities of each mark independent of other marks of the states in an HMM.

function[] = prepareMarkOutputProbGraph(probs, numStates, numMarks, labels)


This is a MATLAB script that generates a figure of the independent emission probabilities of a model (see figure 2B of Bonneville et al).

function[] = prepareTransProbGraph(probs, numStates)


This is a MATLAB script that generates a figure of the transition probabilities of an HMM (see figure 2A of Bonneville et al).

getStatesOutputs.pl HMM.rawhmm trans.txt emit.txt


This is a Perl script that extracts the transmission and emission matrices from a .rawhmm file (the HMM format used by T-cep). HMM.rawhmm is the input HMM, and the files trans.txt and emit.txt will be created containing the transition and emission matrices respectively. They will be output as spaced-separated tables.

getOutputMap.pl map.txt file1.precomp file2.precomp ...


With many data sets in each cell line, the number of mark combinations that must be considered appears prohibitive (since it is 2n). However, there can at most be only as many mark combinations in a collection of cell lines as there are bins. getOutputMap.pl, along with manyOutputMapper.pl and manyOutputModelConvert.pl, allow T-cep-learn to only consider mark combinations that are present at least once in the input data. This is accomplished by mapping each mark combination to a separate ID number before training, and back to the original mark combination ID after training. The ID mapping will be written to map.txt. One or more precompiled data sets (from T-cep-precompiler) can be passed to getOutputMap.pl. This script will also output the number of observations. This is necessary for T-cep-learn, manyOutputMapper.pl and manyOutputModelConvert.pl. manyOutputMapper.pl should be run next.

manyOutputMapper.pl file.precomp map.txt output.precomp


This script converts the mark IDs in a precompiled data file to new IDs as specified in map.txt (designed for the output of getOutputMap.pl). file.precomp is the original data file, and output.precomp is the new data file. This should be run on each precompiled data file after getOutputMap.pl is used to generate the mapping file. After getOutputMap.pl and manyOutputMapper.pl, T-cep-learn should be run, with numOutputs set to the number of observations as reported by getOutputMap.pl and -h specified to output a .rawhmm file. Next, manyOutputModelBackConvert.pl should be run.

manyOutputModelBackConvert.pl HMM.rawhmm numCombinations map.txt output.rawhmm


This script converts the probabilities in a .rawhmm file from a separate ID mapping scheme to the original mark combination IDs. HMM.rawhmm is the original HMM file (from T-cep-learn), numCombinations is the number of total mark combinations (2 ^ number of marks), map.txt is the ID mapping file (from getOutputMap.pl), and the new HMM file will be written to output.rawhmm. Finally, T-cep-results should be run as usual.

function[] = prepareTransProbGraph(probs, numStates)


This is a MATLAB script that generates a figure of the transition probabilities of an HMM.