|
|
|
GEM
Genome wide Event finding and Motif discovery
Citation:
High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. Yuchun
Guo, Shaun Mahony & David K Gifford, (2012) PLoS Computational Biology 8(8): e1002638.
Deciphering the language of transcription factors, article on the GEM paper.
GEM is a software tool to study protein-DNA
interaction using ChIP-Seq/ChIP-exo data. GEM links binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence, resolves ChIP data into explanatory motifs and binding events at unsurpassed spatial resolution .
GEM has following features:
- Exceptionally high spatial resolution on binding event prediction (aka peak calling)
- Highly accurate de novo motif discovery
- Resolve closely spaced (less than 500bp) homotypic events that appear as a single cluster of reads
- Enable analysis of spatial binding constraints of multiple transcription factors, for detecting TF dimer binding sites, enhanceosomes, etc.
See an example of GEM output for ES cell Sox2 binding, including binding events, K-mer Set motifs (KSMs), PWM motifs, and motif spatial distribution plots.
Download GEM (version 1.3) and test data
GEM is a superset of GPS. GEM can be activated by giving a genome sequence (--genome) and using any one of the following command line options:
--k: the length of the k-mers
--k_min and --k_max: the range for the length of k-mers
--seed: the seed k-mer to jump start k-mer set motif discovery. The length of the seed k-mer will be used to set k.
If these three options are not used, GEM will just run GPS and stop.
Java 1.6 is required to execute the JAR. For analysis with mamalian genome, GEM requires about 5-15G memory. It can be specified at the command line with the option -Xmx (i.e. java
-jar -Xmx10G gem.jar allocates 10GB of memory).
As GPS,
read distribution file is required for GEM. The
user can use the default read distribution file provided with the
software as starting point (ChIP-Seq default read distribution file ChIP-exo default read distribution file). After one round of prediction, GEM will
re-estimate the read distribution using the predicted events.
The
read distribution file specifies
the empirical spatial distribution of reads for a given binding event.
The file contains tab-delimited position/density pairs. The first field
is the position relative to the binding position (i.e. binding event is
at position 0). The second field contains the corresponding read
density at that position. For example,
-344
1.42285E-4
-343 1.42275E-4
-342 1.42265E-4
... ...
Alternatively,
it can be estimated directly from the ChIP-Seq data. Given a set of
events, we count all the reads at each position (the 5' end of the
reads) relative to the corresponding event positions. The initial set
of events for estimating the empirical spatial distribution can be
defined by using known motifs or by finding the center of the forward
and reverse read profiles (if available). GEM has a tool to calculate
the read distribution from a user provided file (coords.txt) containing
the coordinates:
java
-Xmx1G -cp gem.jar
edu.mit.csail.cgs.deepseq.analysis.GPS_ReadDistribution --g "mm8.info"
--coords "coords.txt" --chipseq "SRX000540_mES_CTCF.bed" --f "BED"
--name "CTCF" --range 250 --smooth 5 --mrc 4
After GPS/GEM makes the prediction, it will re-estimate the read
distribution using the predicted events.
If
the data are too noisy or too few events are used for re-estimation,
the new read distribution may not be accurate. The users are encouraged
to examine the read distribution using the plot of read distributions
(X_All_Read_Distributions.png) output by GEM.
GEM
takes an alignment file of ChIP-Seq reads and a genome sequence as input and reports a list
of predicted binding events and the explanatory binding motifs.
ChIP-Seq alignment file
ChIP-Seq alignment file formats that are supported:
Genome sequence
In order to run the motif discovery of GEM algorithm, a genome sequence (UCSC download) is needed. The path to directory containing the genome sequence files (by chromosome, *.fa or *.fasta files, with the prefix "chr") can be specified using option --genome (for example, --genome your_path/mm8/). Note that the chromosome name should match those in the "--g" genome_info file, as well as those in your read alignment file.
For example, the fasta file for chromosome 2 is chr2.fa:
>chr2
TAATTGTAATAGTATATACTTGTATGTACTTAAAATAttttatcatagtt
ATCTGGATTTTTGATGGCTATCATGACCTCTGAATGACTAGGGAATCTTG
... ...
Output files
GEM outputs both the binding event files and the motif files.
Because
of the read distribution re-estimation, GEM outputs event
prediction and read distribution files for multiple rounds. (See more details)
- GEM event text file
- HTML file summarizing the GEM results (see examle)
GEM output folder containing more detailed result files (Round 1 and 2 are GPS and GEM results, respectively (See more details))
- K-mer set motifs
- PFM file of PWM motifs
- Spatial distribution between primary motif and all the secondary motifs in the 61bp around the GEM events (this is based on PWM motif match, not based on the GEM binding sites). You will have the files such as Name_Spatial_dist_0_1.txt (png) showing the spatial distribution of secondary motifs (#1) with respect to the primary motif (#0). If you click on the PNG image on the HTML page, you also get the txt file with the values.
Optionally,
GEM can be set to output BED files (using option --outBED)
for loading the GEM results to Genome Browser as custom tracks for
visualization. Note that in BED file, the coordinates are offset to
left/right 5bp for better visualization.
This data
can be used to test GEM. It comes from a Ng lab publication (PMID:
18555785) and consists of Bowtie alignments of mouse ES cell
CTCF ChIP-seq and GFP control reads.
Once
everything is unpacked into the same directory as the gem.jar
file, use the following command:
java -Xmx10G -jar gem.jar --d Read_Distribution_default.txt --g mm8.info --genome your_path/mm8 --s 2000000000 --expt SRX000540_mES_CTCF.bed --ctrl SRX000543_mES_GFP.bed --f BED --out mouseCTCF --k_min 6 --k_max 13
The
command line parameters are in the format of --flag/name
pairs. Some parameters are required:
| Required parameters |
Detailed information |
--d [path] |
The path to the read distribution model file |
--exptX [path] |
The path to the aligned reads file for experiment
(IP). X is condition name. In multi-condition alignment mode, X are
used to specify different conditions. |
Some
parameters are optional:
| Optional parameters |
Detailed information |
--ctrlX [path] |
The path to the aligned reads file for control. X
should match the condition name in --expt. |
--f [BED|SAM|BOWTIE|ELAND|NOVO] |
Read file format: BED/SAM/BOWTIE/ELAND/NOVO. The
SAM option allows SAM or BAM file input. (default = BED) |
--k [n] |
the width of the k-mers |
--k_min [n] |
minimum value of k |
--k_max [n] |
maximum value of k |
--seed [k-mer] |
the seed k-mer to jump start k-mer class discovery. The width of the seed k-mer will be used to set k |
--genome [path] |
the path to the genome sequence directory, which contains fasta files by chromosomes |
--k_seqs [n] |
the number of top ranking events to get sequences for motif discovery (default=5000) |
--k_win [n] |
the sequence window size around the binding event for motif discovery (default=61bp) |
--nd [n] |
noise distribution model, 0 for no noise model, 1 for uniform noise model (default=1) |
--g [path] |
The path to a genome information file. The file
contains tab-delimited chromosome name/length pairs. If it is not
supplied, GEM will use the maximum value of read coordinate as the
chomosome length. |
--s [n] |
The size of uniquely mappable genome. It depends on the genome and the read length. A good estimate is genome size * 0.8. If it is not supplied, it will be estimated using the genome information. |
--fold [value] |
Fold cutoff to filter predicted events (default=3) |
--icr [value] |
IP/Control Ratio. Default ratio will be estimated from the data using non-specific binding regions. It is important to set this value explicitly for synthetic dataset. |
--out [name] |
Output file base name |
--q [value] |
significance level for q-value, specified as
-log10(q-value). For example, to enforce a q-value threshold of 0.001,
set this value to 3. (default=2, i.e. q-value=0.01) |
--a [value] |
minimum alpha value for sparse prior (default is estimated by mean whole genome read coverage) |
--af [value] |
a constant to scale alpha value with read count
(default=3). A smaller af value will give a larger alpha value. |
--sd [value] |
Shape deviation cutoff to filter predicted events
(default=-0.40). |
--smooth [n] |
The width (bp) to smooth the read distribution.
If it is set to -1, there will be no smoothing (default=30). |
--subs [region strings] |
Subset of genome regions to be analyzed.The
string can be in the format of "chr:start-end" or "chr", or both. For
example, "1:003-1004 2 X". |
--subf [region file] |
File that contains subset of genome regions to be
analyzed. Each line contains a region in "chr:start-end" format. |
--ex [region file] |
File that contains subset of genome regions to be
excluded. Each line contains a region in "chr:start-end" format. |
--t [n] |
Number of threads to run GEM in paralell. It is suggested to be equal to or less than the physical CPU number on the computer. (default: physical CPU number) |
--top [n] |
Number of top ranked GEM events to be used for re-estimating the read distribution. Note that GEM only re-estimate
when there are more than 500 significant events called. |
--w2 [n] |
Size of sliding window to estimate lambda
parameter for Possion distribution when there is no control data
(default=5,000, must be larger than 1000). |
--w3 [n] |
Size of sliding window to esitmate lambda
parameter for Possion distribution when there is no control data
(default=10,000, must be larger than w2). |
Optional
flags:
| Optional flags |
Detailed information |
--k_neg_dinu_shuffle |
Use di-nucleotide shuffled sequences as negative sequences for motif finding |
--bf |
Depreciated
Base filtering is done by Poission filtering by default. |
--fa |
GEM will use a fixed user-specified alpha value
for all the regions |
--help |
Print help information and exit |
--multi |
Depreciated
To run GEM in multi-condition mode, you only need to specify data for
conditions X and Y using --exptX and --exptY, etc. |
--nf |
Do not filter predicted events (default is
filtering by shape and fold) |
--nrf |
Do not filter duplicate reads (default is to apply filtering considering its neighboring base positions)
|
--outNP |
Output narrowPeak files (default is
NO narrowPeak file output) |
--outBED |
Output BED files for Genome Browser (default is
NO BED file output) |
--sl |
Sort GEM output by location (default is sorted by
P-value) |
Q&A
What if GEM finds wrong motif?
Clearly, GEM's binding call accuracy is depending on finding the correct primary motif. Some times a co-factor motif may be more statistically significant in the data, and it is subsequently used to direct the binding calls. There are several alternative options to try:
- If you know the consensus motif of the TF, use
--seed option to set a starting k-mer for the motif discovery process.
- You may want to try some different
k values, or different sequence window size --k_win.
- Try to use a different negative sequence set, for example
--k_neg_dinu_shuffle option will use di-nucleotide shuffled sequences as negative set, instead of the default set taking from 300bp away from binding sites. This maybe useful for TF that binds in CG-rich promoter regions (e.g. SP1).
Which round of result should I use?
Because of the read distribution re-estimation, GEM may output event prediction and read distribution files for multiple rounds. The round numbers are coded in the file name. For example,
- xxx_0_Read_distribution.txt: The input read distribution.
- xxx_0_GEM_events.txt: GPS Events predicted using xxx_0_Read_distribution.
- xxx_1_Read_distribution.txt: The read distribution estimated from xxx_0_GEM_events.
- xxx_1_GEM_events.txt: GPS Events predicted using xxx_1_Read_distribution.
- xxx_2_GEM_events.txt: GEM Events predicted using xxx_2_Read_distribution.
Due
to the large variability of datasets, the refined empirical spatial
distribution may become too noisy because too few events are used to
estimate the distribution. Running GEM further may make the empirical
distribution even worse. Therefore, GEM save the output files from each
round, and allow the user to check the spatial distributions to decide
which one is the best. To facilitate this process, GEM outputs an image
to plot all the read distribution curves
(xxx_All_Read_Distributions.png). Ideally, the read distribution of
later rounds should be smooth and similar to that of round 0. If so,
user can use the event predictions from these rounds. If that is not
the case, we would suggest to use the round 0 prediction results.
Multi-condition v.s. Multi-replicates
GEM can analyze binding data from multiple conditions (time points)
simultaneously. The user need to give them different names, for
example, -–exptCond1 CTCF_cond1.bed -–exptCond2
CTCF_cond2.bed.
For multiple replicates of same condition, you can specify multiple
replicates as separate files, for example,
-–exptCond1 CTCF_cond1_rep1.bed -–exptCond1
CTCF_cond1_rep2.bed (note that they need to have the same
name). GEM will combine the replicates as one large dataset for
analysis.
Read filtering and event filtering
PCR amplification artifacts typically manifest as the observation of
many reads mapping to the exact same base positions. These artifacts
are quite variable and dataset-specific. Therefore, a generic approach
to exclude those regions might result in the loss of true events.
GEM implements an event filtering method by comparing the read
distribution of the predicted event to the expected event read
distribution. A shape deviation score (IPvsEMP field) is computed using
Kullback–Leibler divergence (see method section 2.6 of GPS paper). A
higher score means the event is more divergent from the expected read
distribution, hence more likely to be artifact or noise. A cutoff score
can be specified by user to filter out spurious events using option (--sd).
GEM also excludes events with less than 3 fold enrichment (IP/Control). GEM
reports the filtered events, hence allows the user to verify and adjust
cutoff threshold for a particular dataset. The shape deviation filter
is on by default, but can be turned off using option (--nf).
In addition, GEM also applies a Poisson filter for abnormal high read count at a base position. For each base, we obtain an average read count by estimating a Gaussian Kernel density (with std=20bp) on the read counts of nearby base positions (excluding the base of interest). The estimated value is used to set Lambda parameter of Poisson distribution. The actual read count value is then set to the value corresponding to p-value=0.001 if it is larger.
Contact
Yuchun Guo (yguo at mit dot edu) or Shaun Mahony (mahony at mit
dot edu) with any problems, comments, or suggestions.
Sign
up for GPS
mailing list to receive emails related to GEM/GPS updates,
release, etc.
|