Computational Genomics

Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory

 
home
pub
people
courses
resources
Contact
wiki






























GPS logo
Contents

GPS

Citation:
Discovering homotypic binding events at high spatial resolution. Yuchun Guo, Georgios Papachristoudis, Robert C Altshuler, Georg K Gerber, Tommi S Jaakkola, David K Gifford & Shaun Mahony, Bioinformatics. 2010 Dec 15;26(24):3028-34. Epub 2010 Oct 21. PMID: 20966006.

GPS has been superceded by our new method, Genome wide Event finding and Motif discovery (GEM). Further improvement for GPS will be incorporated into GEM. Please contact us if you need to download previously released GPS software.

The Genome Positioning System (GPS) is a software tool to study protein-DNA interaction using ChIP-Seq data. GPS builds a probabilistic mixture model to predict the most likely positions of binding events at single-base resolution.

GPS has 3 main features:

  1. Achieve high spatial resolution on predicting binding events
  2. Resolve closely spaced (less than 500bp) events that appear as a single cluster of reads
  3. Process multiple dataset simultaneously to align common events across distinct experiments

Download

Please download our new method, Genome wide Event findin\ g and Motif discovery (GEM). GPS is now a part of GEM.

System requirements

Java 1.6 is required to execute the JAR. For a ChIP-Seq experiment with 4 million IP reads and 8 million control reads, GPS requires about 750M memory, and runs for about 20 minutes on an single CPU AMD 64bit 2.3GHz computer (~6 minutes, multi-threads, on a 8-CPU computer). For some machines, the default maximum memory heap size for the Java virtual machine may not be large enough. It can be specified at the command line with the option -Xmx (i.e. java -jar -Xmx2G gps.jar allocates 2GB of memory).

Read distributions

A read distribution file is required for GPS. The user can use the default read distribution file provided with GPS software as starting point. After one round of prediction, GPS will re-estimate the read distribution using the predicted events.

The read distribution file specifies the empirical spatial distribution of reads for a given binding event. The file contains tab-delimited position/density pairs. The first field is the position relative to the binding position (i.e. binding event is at position 0). The second field contains the corresponding read density at that position. For example,

-344    1.42285E-4
-343    1.42275E-4
-342    1.42265E-4
... ...

Alternatively, it can be estimated directly from the ChIP-Seq data. Given a set of events, we count all the reads at each position (the 5' end of the reads) relative to the corresponding event positions. The initial set of events for estimating the empirical spatial distribution can be defined by using known motifs or by finding the center of the forward and reverse read profiles (if available). GPS has a tool to calculate the read distribution from a user provided file (coords.txt) containing the coordinates:

java -Xmx1G -cp gps.jar edu.mit.csail.cgs.deepseq.analysis.GPS_ReadDistribution --g "mm8.info" --coords "coords.txt" --chipseq "SRX000540_mES_CTCF.bed" --f "BED" --name "CTCF" --range 250 --smooth 5 --mrc 4

After GPS makes the prediction, it will re-estimate the read distribution using the predicted events. A command line option --r [n] can be used to set the maximum times for re-estimating the read distribution until it converges.

If the data are too noisy or too few events are used for re-estimation, the new read distribution may not be accurate. The users are encouraged to examine the read distribution using the plot of read distributions (X_All_Read_Distributions.png) output by GPS.

Input and output

GPS takes an alignment file of ChIP-Seq reads as input and reports a list of predicted binding events.

ChIP-Seq alignment file formats that are supported:

GPS outputs a tab-delimited file (xxx_n_GPS_significant.txt ) with following fields:

  • Location: The genome coordinate
  • IP binding strength: the number of IP reads associated with the event
  • Control binding strength: the number of control reads in corresponding region
  • Fold enrichment (IP/Control)
  • -log10(Q-value)
  • -log10(P-value)
  • IPvsEMP: KL divergence from the empirical read distribution (log10(KL))
  • IPvsCTR: KL divergence from the read density in the corresponding control region (log10(KL))
GPS also outputs the list of insignificant events (those do not pass the statistical test), and the filtered events (those would pass the statistical test using the read count, but have a low IP/Ctrl ratio, or the distribution of reads are quite different from the empirical distribution).

Because of the read distribution re-estimation, GPS may output event prediction and read distribution files for multiple rounds. (See more details)

Optionally, GPS can be set to output BED files (using option --outBED) for loading the GPS results to Genome Browser as custom tracks for visualization. Note that in BED file, the coordinates are offset to left/right 5bp for better visualization.

Examples:

This data can be used to test GPS. It comes from a Ng lab publication (PMID: 18555785) and consists of Bowtie alignments of mouse ES cell CTCF ChIP-seq and GFP control reads.

Once everything is unpacked into the same directory as the gps.jar file, use the following command:
java -Xmx1G -jar gps.jar --g mm8.info --d Read_Distribution_default.txt --s 2000000000 --expt SRX000540_mES_CTCF.bed --ctrl SRX000543_mES_GFP.bed --f BED --out mouseCTCF

An example of GPS run in multi-condition alignment mode is (Please note: the multi-condition mode may take longer time to run)
java -Xmx5G -jar gps.jar --d Read_Distribution_default.txt --s 240000000 --exptCTCF-GM12878 GM12878_chr1_ip.bed --exptCTCF-HUVEC HUVEC_chr1_ip.bed --ctrlCTCF-GM12878 GM12878_chr1_ctrl.bed --ctrlCTCF-HUVEC HUVEC_chr1_ctrl.bed --f BED --g hg18_chr1.info --out HumanCTCF
Note the matching --expt and --ctrl options that encode the name of different conditions.

Command-line options:

The command line parameters are in the format of --flag/name pairs. Some parameters are required:

Required parameters Detailed information
--d [path] The path to the read distribution model file
--exptX [path] The path to the aligned reads file for experiment (IP). X is condition name. In multi-condition alignment mode, X are used to specify different conditions.
--ctrlX [path] The path to the aligned reads file for control. X should match the condition name in --expt.

Some parameters are optional:

Optional parameters Detailed information
--a [value] minimum alpha value for sparse prior (default=6)
--af [value] a constant to scale alpha value with read count (default=3). A smaller af value will give a larger alpha value.
--f [BED|SAM|BOWTIE|ELAND|NOVO] Read file format: BED/SAM/BOWTIE/ELAND/NOVO. The SAM option allows SAM or BAM file input. (default = BED)
--fold [value] Fold cutoff to filter predicted events (default=3)
--g [path] The path to a genome information file. The file contains tab-delimited chromosome name/length pairs. If it is not supplied, GPS will use the maximum value of read coordinate as the chomosome length.
--s [n] The size of uniquely mappable genome. It depends on the genome and the read length. A good estimate is genome size * 0.8. If it is not supplied, it will be estimated using the genome information.
--icr [value] IP/Control Ratio. Default ratio will be estimated from the data using non-specific binding regions. It is important to set this value explicitly for synthetic dataset.
--out [name] Output file base name
--q [value] significance level for q-value, specified as -log10(q-value). For example, to enforce a q-value threshold of 0.001, set this value to 3. (default=2, i.e. q-value=0.01)
--r [n] max number of rounds to re-estimate read distribution model (default=3)
--sd [value] Shape deviation cutoff to filter predicted events (default=-0.40).
--smooth [value] The width (bp) to smooth the read distribution. If it is set to -1, there will be no smoothing (default=30).
--subs [region strings] Subset of genome regions to be analyzed.The string can be in the format of "chr:start-end" or "chr", or both. For example, "1:003-1004 2 X".
--subf [region file] File that contains subset of genome regions to be analyzed. Each line contains a region in "chr:start-end" format.
--ex [region file] File that contains subset of genome regions to be excluded. Each line contains a region in "chr:start-end" format.
--t [value] Number of threads to run GPS in paralell. It is suggested to be equal to or less than the physical CPU number on the computer. (default: physical CPU number)
--top [value] Number of top ranked GPS events to be used for re-estimating the read distribution. Note that GPS only re-estimate when there are more than 500 significant events called.
--w2 [value] Size of sliding window to estimate lambda parameter for Possion distribution when there is no control data (default=5,000, must be larger than 1000).
--w3 [value] Size of sliding window to esitmate lambda parameter for Possion distribution when there is no control data (default=10,000, must be larger than w2).

Optional flags:

Optional flags Detailed information
--bf Depreciated
Base filtering is done by Poission filtering by default.
--fa GPS will use a fixed user-specified alpha value for all the regions
--help Print help information and exit
--multi Depreciated
To run GPS in multi-condition mode, you only need to specify data for conditions X and Y using --exptX and --exptY, etc.
--nf Do not filter predicted events (default is filtering by shape and fold)
--nrf Do not filter duplicate reads (default is to apply filtering considering its neighboring base positions)
--outBED Output BED files for Genome Browser (default is NO BED file output)
--sl Sort GPS output by location (default is sorted by P-value)

Q &A

Which round of result should I use?
Because of the read distribution re-estimation, GPS may output event prediction and read distribution files for multiple rounds.The round numbers are coded in the file name. For example,
  • xxx_0_Read_distribution.txt: The input read distribution.
  • xxx_0_GPS_significant.txt: Events predicted using xxx_0_Read_distribution.
  • xxx_1_Read_distribution.txt: The read distribution estimated from xxx_0_GPS_significant.
  • xxx_1_GPS_significant.txt: Events predicted using xxx_1_Read_distribution.
Due to the large variability of datasets, the refined empirical spatial distribution may become too noisy because too few events are used to estimate the distribution. Running GPS further may make the empirical distribution even worse. Therefore, GPS save the output files from each round, and allow the user to check the spatial distributions to decide which one is the best. To facilitate this process, GPS outputs an image to plot all the read distribution curves (xxx_All_Read_Distributions.png). Ideally, the read distribution of later rounds should be smooth and similar to that of round 0. If so, user can use the event predictions from these rounds. If that is not the case, we would suggest to use the round 0 prediction results.

Multi-condition v.s. Multi-replicates
GPS can analyze binding data from multiple conditions (time points) simultaneously. The user need to give them different names, for example, -–exptCond1 CTCF_cond1.bed -–exptCond2 CTCF_cond2.bed.

For multiple replicates of same condition, you can specify multiple replicates as separate files, for example, -–exptCond1 CTCF_cond1_rep1.bed -–exptCond1 CTCF_cond1_rep2.bed (note that they need to have the same name). GPS will combine the replicates as one large dataset for analysis.

Read filtering and event filtering
PCR amplification artifacts typically manifest as the observation of many reads mapping to the exact same base positions. These artifacts are quite variable and dataset-specific. Therefore, a generic approach to exclude those regions might result in the loss of true events.

GPS implements an event filtering method by comparing the read distribution of the predicted event to the expected event read distribution. A shape deviation score (IPvsEMP field) is computed using Kullback–Leibler divergence (see method section 2.6 of GPS paper). A higher score means the event is more divergent from the expected read distribution, hence more likely to be artifact or noise. A cutoff score can be specified by user to filter out spurious events using option (--sd). GPS also excludes events with less than 3 fold enrichment (IP/Control). GPS reports the filtered events, hence allows the user to verify and adjust cutoff threshold for a particular dataset. The shape deviation filter is on by default, but can be turned off using option (--nf).

In addition, GPS also applies a Poisson filter for abnormal high read count at a base position. For each base, we obtain an average read count by estimating a Gaussian Kernel density (with std=20bp) on the read counts of nearby base positions (excluding the base of interest). The estimated value is used to set Lambda parameter of Poisson distribution. The actual read count value is then set to the value corresponding to p-value=0.001 if it is larger.

Contact

Contact Yuchun Guo (yguo at mit dot edu) or Shaun Mahony (mahony at mit dot edu) with any problems, comments, or suggestions.

Sign up for GPS mailing list to receive emails related to GPS updates, release, etc.