The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of non-coding genetic variations.
Guo Y, Tian K, Zeng H, Guo X, Gifford DK.
Genome Res. 2018 Apr 13. pii: gr.226852.117. doi: 10.1101/gr.226852.117
We describe DNase-capture, an assay that increases the analytical resolution of DNase-seq by focusing its sequencing phase on selected genomic regions. We introduce a new method to compensate for capture bias called BaseNormal that allows for accurate recovery of transcription factor protection profiles from DNase-capture data. We show that these normalized data allow for nuanced detection of transcription factor binding heterogeneity with as few as dozens of sites.
Kang D, Sherwood R, Barkal A, Hashimoto T, Engstrom L, Gifford D.
PLoS One. 2017 Dec 28;12(12):e0187046. doi: 10.1371/journal.pone.0187046. eCollection 2017.
Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.
Daneshjou R et al.
Hum Mutat. 2017 Sep;38(9):1182-1192. doi: 10.1002/humu.23280. Epub 2017 Jul 7.
We present a model that, after learning on observations of (sequence, outcome) pairs, can be efficiently used to revise a new sequence in order to improve its associated outcome. Our framework requires neither example improvements, nor additional evaluation of outcomes for proposed revisions. To avoid combinatorial-search over sequence elements, we specify a generative model with continuous latent factors, which is learned via joint approximate inference using a recurrent variational autoencoder (VAE) and an outcome-predicting neural network module. Under this model, gradient methods can be used to efficiently optimize the continuous latent factors with respect to inferred outcomes. By appropriately constraining this optimization and using the VAE decoder to generate a revised sequence, we ensure the revision is fundamentally similar to the original sequence, is associated with better outcomes, and looks natural. These desiderata are proven to hold with high probability under our approach, which is empirically demonstrated for revising natural language sentences.
Mueller J, Gifford DK, Jaakkola T.
International Conference on Machine Learning, pp. 2536-2544. 2017.
We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a trend in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data.
Mueller J, Jaakkola T, Gifford DK.
Journal of the American Statistical Association 2017
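As a rough illustration of what "linear under the Wasserstein metric" can mean for univariate distributions, the sketch below computes the 2-Wasserstein distance between equal-sized 1D samples by comparing sorted values. It is not the paper's alternating projections algorithm; the function name `wasserstein2_1d` and the three toy samples are assumptions made only for this example.

```python
def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-sized 1D empirical samples.

    For univariate empirical distributions with the same number of points,
    the optimal coupling pairs the sorted values, so the distance is the
    root-mean-square difference between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(xs), sorted(ys)
    return (sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)) ** 0.5


# Hypothetical expression samples at three ordered covariate levels.
# Each level shifts the previous one by the same amount.
level0 = [0.0, 1.0, 2.0]
level1 = [1.0, 2.0, 3.0]
level2 = [2.0, 3.0, 4.0]

d01 = wasserstein2_1d(level0, level1)
d12 = wasserstein2_1d(level1, level2)
d02 = wasserstein2_1d(level0, level2)
print(d01, d12, d02)
```

In this toy case the distances add up along the ordering (d01 + d12 equals d02), which is the behavior one would expect of distributions moving along a straight line in this metric.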
We characterize how genomic variants that alter chromatin accessibility influence regulatory factor binding with a new method called DeltaBind that predicts condition specific factor binding more accurately than other methods based on DNase-seq data. Using DeltaBind and DNase-seq experiments we predicted the differential binding of 18 factors in K562 and GM12878 cells with an average precision of 28% at 10% recall, with the prediction of individual factors ranging from 5% to 65% precision. We further found that genome variants that alter chromatin accessibility are not necessarily predictive of altering proximal factor binding. Taken together these findings suggest that DNase-seq or ATAC-seq Quantitative Trait Loci (dsQTLs), while important, must be considered in a broader context to establish causality for phenotypic changes.
Chen R and Gifford DK.
PLoS One. 2017 Jul 13;12(7):e0179411. doi: 10.1371/journal.pone.0179411. eCollection 2017.
DNA methylation plays a crucial role in the establishment of tissue-specific gene expression and the regulation of key biological processes. However, our present inability to predict the effect of genome sequence variation on DNA methylation precludes a comprehensive assessment of the consequences of non-coding variation. We introduce CpGenie, a sequence-based framework that learns a regulatory code of DNA methylation using a deep convolutional neural network and uses this network to predict the impact of sequence variation on proximal CpG site DNA methylation. CpGenie produces allele-specific DNA methylation prediction with single-nucleotide sensitivity that enables accurate prediction of methylation quantitative trait loci (meQTL). We demonstrate that CpGenie prioritizes validated GWAS SNPs, and contributes to the prediction of functional noncoding variants, including expression quantitative trait loci (eQTL) and disease-associated mutations. CpGenie is publicly available to assist in identifying and interpreting regulatory non-coding variants.
Zeng H and Gifford DK.
Nucleic Acids Res. 2017 Jun 20;45(11):e99. doi: 10.1093/nar/gkx177.
We present a novel ensemble-based computational framework, EnsembleExpr, that achieved the best performance in the Fourth Critical Assessment of Genome Interpretation expression quantitative trait locus "(eQTL)-causal SNPs" challenge for identifying eQTLs and prioritizing their gene expression effects. eQTLs are genome sequence variants that result in gene expression changes and are thus prime suspects in the search for contributions to the causality of complex traits. When EnsembleExpr is trained on data from massively parallel reporter assays, it accurately predicts reporter expression levels from unseen regulatory sequences and identifies sequence variants that exhibit significant changes in reporter expression. Compared with other state-of-the-art methods, EnsembleExpr achieved competitive performance when applied on eQTL datasets determined by other protocols. We envision EnsembleExpr to be a resource to help interpret noncoding regulatory variants and prioritize disease-associated mutations for downstream validation.
Zeng H, Edwards MD, Guo Y, Gifford DK.
Hum Mutat. 2017 Feb 21, doi: 10.1002/humu.23198.
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role “coded” in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong NA, Li Y, Eraslan G, Amin TB, Goke J, Mueller NS, Kellis M, Kundaje A, Beer MA, Keles S, Gifford DK, Yosef N.
Hum Mutat. 2017 Feb 21. doi: 10.1002/humu.23197.
BACKGROUND: The combinatorial binding of trans-acting factors (TFs) to the DNA is critical to the spatial and temporal specificity of gene regulation. For certain regulatory regions, more than one regulatory module (set of TFs that bind together) are combined to achieve context-specific gene regulation. However, previous approaches are limited to either pairwise TF co-association analysis or assuming that only one module is used in each regulatory region. RESULTS: We present a new computational approach that models the modular organization of TF combinatorial binding. Our method learns compact and coherent regulatory modules from in vivo binding data using a topic model. We found that the binding of 115 TFs in K562 cells can be organized into 49 interpretable modules. Furthermore, we found that tens of thousands of regulatory regions use multiple modules, a structure that cannot be observed with previous hard clustering based methods. The modules discovered recapitulate many published protein-protein physical interactions, have consistent functional annotations of chromatin states, and uncover context specific co-binding such as gene proximal binding of NFY + FOS + SP and distal binding of NFY + FOS + USF. For certain TFs, the co-binding partners of direct binding (motif present) differ from those of indirect binding (motif absent); the distinct set of co-binding partners can predict whether the TF binds directly or indirectly with up to 95% accuracy. Joint analysis across two cell types reveals both cell-type-specific and shared regulatory modules. CONCLUSIONS: Our results provide comprehensive cell-type-specific combinatorial binding maps and suggest a modular organization of combinatorial binding.
Guo Y, Gifford DK.
BMC Genomics. 2017 Jan 6;18(1):45. doi: 10.1186/s12864-016-3434-3.
Generic spinal motor neuron identity is established by cooperative binding of programming transcription factors (TFs), Isl1 and Lhx3, to motor-neuron-specific enhancers. How expression of effector genes is maintained following downregulation of programming TFs in maturing neurons remains unknown. High-resolution exonuclease (ChIP-exo) mapping revealed that the majority of enhancers established by programming TFs are rapidly deactivated following Lhx3 downregulation in stem-cell-derived hypaxial motor neurons. Isl1 is released from nascent motor neuron enhancers and recruited to new enhancers bound by clusters of Onecut1 in maturing neurons. Synthetic enhancer reporter assays revealed that Isl1 operates as an integrator factor, translating the density of Lhx3 or Onecut1 binding sites into transient enhancer activity. Importantly, independent Isl1/Lhx3- and Isl1/Onecut1-bound enhancers contribute to sustained expression of motor neuron effector genes, demonstrating that outwardly stable expression of terminal effector genes in postmitotic neurons is controlled by a dynamic relay of stage-specific enhancers.
Rhee HS, Closser M, Guo Y, Bashkirova EV, Tan CG, Gifford DK, Wichterle H.
Neuron. 2016 Dec 21;92(6):1252-1265. doi: 10.1016/j.neuron.2016.11.037.
Spliced messages constitute one-fourth of expressed mRNAs in the yeast Saccharomyces cerevisiae, and most mRNAs in metazoans. Splicing requires 5′ splice site (5′SS), branch point (BP), and 3′ splice site (3′SS) elements, but the role of the BP in splicing control is poorly understood because BP identification remains difficult. We developed a high-throughput method, Branch-seq, to map BPs and 5′SSs of isolated RNA lariats. Applied to S. cerevisiae, Branch-seq detected 76% of expressed, annotated BPs and identified a comparable number of novel BPs. We performed RNA-seq to confirm associated 3′SS locations, identifying some 200 novel splice junctions, including an AT-AC intron. We show that several yeast introns use two or even three different BPs, with effects on 3′SS choice, protein coding potential, or RNA stability, and identify novel introns whose splicing changes during meiosis or in response to stress. Together, these findings show unanticipated complexity of splicing in yeast.
Gould GM, Paggi JM, Guo Y, Phizicky DV, Zinshteyn B, Wang ET, Gilbert WV, Gifford DK, Burge CB.
RNA. 2016 Oct;22(10):1522-34. doi: 10.1261/rna.057216.116.
Enhancers and promoters commonly occur in accessible chromatin characterized by depleted nucleosome contact; however, it is unclear how chromatin accessibility is governed. We show that log-additive cis-acting DNA sequence features can predict chromatin accessibility at high spatial resolution. We develop a new type of high-dimensional machine learning model, the Synergistic Chromatin Model (SCM), which when trained with DNase-seq data for a cell type is capable of predicting expected read counts of genome-wide chromatin accessibility at every base from DNA sequence alone, with the highest accuracy at hypersensitive sites shared across cell types. We confirm that an SCM accurately predicts chromatin accessibility for thousands of synthetic DNA sequences using a novel CRISPR-based method of highly efficient site-specific DNA library integration. SCMs are directly interpretable and reveal that a logic based on local, non-specific synergistic effects, largely among pioneer TFs, is sufficient to predict a large fraction of cellular chromatin accessibility in a wide variety of cell types.
Hashimoto TB, Sherwood RI, Kang DD, Barkal AA, Zeng H, Emons BJM, Srinivasan S, Rajagopal N, Jaakkola T, and Gifford DK.
Genome Research, doi: 10.1101/gr.199778.115
We estimate stochastic processes that govern the dynamics of evolving populations such as cell differentiation. The problem is challenging since longitudinal trajectory measurements of individuals in a population are rarely available due to experimental cost and/or privacy. We show that cross-sectional samples from an evolving population suffice for recovery within a class of processes even if samples are available only at a few distinct time points. We provide a stratified analysis of recoverability conditions, and establish that reversibility is sufficient for recoverability. For estimation, we derive a natural loss and regularization, and parameterize the processes as diffusive recurrent neural networks. We demonstrate the approach in the context of uncovering complex cellular dynamics known as the ‘epigenetic landscape’ from existing biological assays.
Hashimoto TB, Gifford DK, Jaakkola TS.
Proceedings of the 33rd International Conference on Machine Learning (ICML)
We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology.
Zeng H, Edwards MD, Liu G, Gifford DK.
Bioinformatics. 2016 Jun 15;32(12):i121-i127. doi: 10.1093/bioinformatics/btw255.
HLA-G, a nonclassical HLA molecule uniquely expressed in the placenta, is a central component of fetus-induced immune tolerance during pregnancy. The tissue-specific expression of HLA-G, however, remains poorly understood. Here, systematic interrogation of the HLA-G locus using massively parallel reporter assay (MPRA) uncovered a previously unidentified cis-regulatory element 12 kb upstream of HLA-G with enhancer activity, Enhancer L. Strikingly, clustered regularly-interspaced short palindromic repeats (CRISPR)/Cas9-mediated deletion of this enhancer resulted in ablation of HLA-G expression in JEG3 cells and in primary human trophoblasts isolated from placenta. RNA-seq analysis demonstrated that Enhancer L specifically controls HLA-G expression. Moreover, DNase-seq and chromatin conformation capture (3C) defined Enhancer L as a cell type-specific enhancer that loops into the HLA-G promoter. Interestingly, MPRA-based saturation mutagenesis of Enhancer L identified motifs for transcription factors of the CEBP and GATA families essential for placentation. These factors associate with Enhancer L and regulate HLA-G expression. Our findings identify long-range chromatin looping mediated by core trophoblast transcription factors as the mechanism controlling tissue-specific HLA-G expression at the maternal-fetal interface. More broadly, these results establish the combination of MPRA and CRISPR/Cas9 deletion as a powerful strategy to investigate human immune gene regulation.
Ferreira LM, Meissner TB, Mikkelsen TS, Mallard W, O'Donnell CW, Tilburgs T, Gomes HA, Camahort R, Sherwood RI, Gifford DK, Rinn JL, Cowan CA, Strominger JL.
Proc Natl Acad Sci U.S.A. 2016 May 10;113(19):5364-9. doi: 10.1073/pnas.1602886113.
Using a nuclease-dead Cas9 mutant, we show that Cas9 reproducibly induces chromatin accessibility at previously inaccessible genomic loci. Cas9 chromatin opening is sufficient to enable adjacent binding and transcriptional activation by the settler transcription factor retinoic acid receptor at previously unbound motifs. Thus, we demonstrate a new use for Cas9 in increasing surrounding chromatin accessibility to alter local transcription factor binding.
Barkal AA, Srinivasan S, Hashimoto T, Gifford DK, Sherwood RI.
PLoS One. 2016 Mar 31;11(3):e0152683.
Quantifying the effects of cis-regulatory DNA on gene expression is a major challenge. Here, we present the multiplexed editing regulatory assay (MERA), a high-throughput CRISPR-Cas9–based approach that analyzes the functional impact of the regulatory genome in its native context. MERA tiles thousands of mutations across ~40 kb of cis-regulatory genomic space and uses knock-in green fluorescent protein (GFP) reporters to read out gene activity. Using this approach, we obtain quantitative information on the contribution of cis-regulatory regions to gene expression. We identify proximal and distal regulatory elements necessary for expression of four embryonic stem cell–specific genes. We show a consistent contribution of neighboring gene promoters to gene expression and identify unmarked regulatory elements (UREs) that control gene expression but do not have typical enhancer epigenetic or chromatin features. We compare thousands of functional and nonfunctional genotypes at a genomic location and identify the base pair–resolution functional motifs of regulatory elements.
Rajagopal, Nisha, Sharanya Srinivasan, Kameron Kooshesh, Yuchun Guo, Matthew D. Edwards, Budhaditya Banerjee, Tahin Syed, Bart JM Emons, David K. Gifford, and Richard I. Sherwood.
Nature Biotechnology 34, 167–174 (2016)
The majority of disease-associated variants identified in genome-wide association studies reside in noncoding regions of the genome with regulatory roles. Thus being able to interpret the functional consequence of a variant is essential for identifying causal variants in the analysis of genome-wide association studies. We present GERV (generative evaluation of regulatory variants), a novel computational method for predicting regulatory variants that affect transcription factor binding. GERV learns a k-mer-based generative model of transcription factor binding from ChIP-seq and DNase-seq data, and scores variants by computing the change of predicted ChIP-seq reads between the reference and alternate allele. The k-mers learned by GERV capture more sequence determinants of transcription factor binding than a motif-based approach alone, including both a transcription factor's canonical motif and associated co-factor motifs. We show that GERV outperforms existing methods in predicting single-nucleotide polymorphisms associated with allele-specific binding. GERV correctly predicts a validated causal variant among linked single-nucleotide polymorphisms and prioritizes the variants previously reported to modulate the binding of FOXA1 in breast cancer cell lines. Thus, GERV provides a powerful approach for functionally annotating and prioritizing causal variants for experimental follow-up analysis.
H. Zeng, T. Hashimoto, D.D. Kang, D.K. Gifford.
Bioinformatics 32 (4): 490-496 (2016)
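The idea of scoring a variant by the change in a k-mer-based prediction between the reference and alternate allele can be sketched as follows. This is a deliberately simplified stand-in, not GERV's generative model: it sums hypothetical per-k-mer weights over the k-mers that cover the variant position, and the helper names, the `toy_weights` table, and the example sequence are all assumptions made for illustration.

```python
def kmers_overlapping(seq, pos, k):
    """Return every k-mer of seq that covers the 0-based index pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]


def variant_score(ref_seq, pos, alt_base, weights, k):
    """Alternate-minus-reference change in summed k-mer weights at pos.

    weights maps a k-mer to a hypothetical contribution to predicted
    read counts; k-mers missing from the table contribute zero.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_total = sum(weights.get(km, 0.0) for km in kmers_overlapping(ref_seq, pos, k))
    alt_total = sum(weights.get(km, 0.0) for km in kmers_overlapping(alt_seq, pos, k))
    return alt_total - ref_total


# Hypothetical weights for a handful of 4-mers (not learned from data).
toy_weights = {"TGTT": 2.0, "GTTT": 1.5, "TGCT": 0.25}

# T>C substitution at index 3 of a made-up reference sequence.
score = variant_score("ATGTTTAC", 3, "C", toy_weights, 4)
print(score)
```

A negative score here means the alternate allele removes k-mers that the toy table associates with binding; a real model would learn the weights from ChIP-seq and DNase-seq data rather than take them as given.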
RNA Polymerase II ChIA-PET data has revealed enhancers that are active in a profiled cell type and the genes that the enhancers regulate through chromatin interactions. The most commonly used computational method for analyzing ChIA-PET data, the ChIA-PET Tool, discovers interaction anchors at a spatial resolution that is insufficient to accurately identify individual enhancers. We introduce Germ, a computational method that estimates the likelihood that any two narrowly defined genomic locations are jointly occupied by RNA Polymerase II. Germ takes a blind deconvolution approach to simultaneously estimate the likelihood of RNA Polymerase II occupation as well as a model of the arrangement of read alignments relative to locations occupied by RNA Polymerase II. Both types of information are utilized to estimate the likelihood that RNA Polymerase II jointly occupies any two genomic locations. We apply Germ to RNA Polymerase II ChIA-PET data from embryonic stem cells to identify the genomic locations that are jointly occupied along with transcription start sites. We show that these genomic locations align more closely with features of active enhancers measured by ChIP-Seq than the locations identified using the ChIA-PET Tool. We also apply Germ to RNA Polymerase II ChIA-PET data from motor neuron progenitors. Based on the Germ results, we observe that a combination of cell type specific and cell type independent regulatory interactions are utilized by cells to regulate gene expression.
Christopher Reeder, Michael Closser, Huay Mei Poh, Kuljeet Sandhu, Hynek Wichterle, David Gifford.
PLOS ONE. doi:10.1371/journal.pone.0122420 May 13, 2015
Direct lineage conversion is a promising approach to generate therapeutically important cell types for disease modeling and tissue repair. However, the survival and function of lineage-reprogrammed cells in vivo over the long term has not been examined. Here, using an improved method for in vivo conversion of adult mouse pancreatic acinar cells toward beta cells, we show that induced beta cells persist for up to 13 months (the length of the experiment), form pancreatic islet-like structures and support normoglycemia in diabetic mice. Detailed molecular analyses of induced beta cells over 7 months reveal that global DNA methylation changes occur within 10 d, whereas the transcriptional network evolves over 2 months to resemble that of endogenous beta cells and remains stable thereafter. Progressive gain of beta-cell function occurs over 7 months, as measured by glucose-regulated insulin release and suppression of hyperglycemia. These studies demonstrate that lineage-reprogrammed cells persist for >1 year and undergo epigenetic, transcriptional, anatomical and functional development toward a beta-cell phenotype.
Weida Li, Claudia Cavelti-Weder, Yinying Zhang, Kendell Clement, Scott Donovan, Gabriel Gonzalez, Jiang Zhu, Marianne Stemann, Ke Xu, Tatsu Hashimoto, Takatsugu Yamada, Mio Nakanishi, Yuemei Zhang, Samuel Zeng, David Gifford, Alexander Meissner, Gordon Weir, and Qiao Zhou.
Nat Biotechnol. 2014 Dec;32(12):1223-30. doi: 10.1038/nbt.3082. Epub 2014 Nov 17
The neocortex contains an unparalleled diversity of neuronal subtypes, each defined by distinct traits that are developmentally acquired under the control of subtype-specific and pan-neuronal genes. The regulatory logic that orchestrates the expression of these unique combinations of genes is unknown for any class of cortical neuron. Here, we report that Fezf2 is a selector gene able to regulate the expression of gene sets that collectively define mouse corticospinal motor neurons.
S. Lodato, B.J. Molyneaux, E. Zuccaro, L.A. Goff, H.H. Chen, W. Yuan, A. Meleski, E. Takahashi, S. Mahony, J.L. Rinn, D.K. Gifford, P. Arlotta.
Nat Neurosci. 2014 Aug;17(8):1046-54. doi: 10.1038/nn.3757. Epub 2014 Jul 6
The measurement of any nonchromosomal genetic contribution to the heritability of a trait is often confounded by the inability to control both the chromosomal and nonchromosomal information in a population. We have designed a unique system in yeast where we can control both sources of information so that the phenotype of a single chromosomal polymorphism can be measured in the presence of different cytoplasmic elements. With this system, we have shown that both the source of the mitochondrial genome and the presence or absence of a dsRNA virus influence the phenotype of chromosomal variants that affect the growth of yeast. Moreover, by considering this nonchromosomal information that is passed from parent to offspring and by allowing chromosomal and nonchromosomal information to exhibit non-additive interactions, we are able to account for much of the heritability of growth traits. Taken together, our results highlight the importance of including all sources of heritable information in genetic studies and suggest a possible avenue of attack for finding additional missing heritability.
Matthew D. Edwards, Anna Symbor-Nagrabska, Lindsey Dollard, David K. Gifford, Gerald R. Fink.
Proc. Natl. Acad. Sci U.S.A., 2014, May 13. pii: 201407126
Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS’s multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulator signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2.
S. Mahony, M. D. Edwards, E. O. Mazzoni, R. I. Sherwood, A. Kakumanu, C. A. Morrison, H. Wichterle, D. K. Gifford.
PLoS Comput Biol. 2014 Mar 27;10(3):e1003501. doi: 10.1371/journal.pcbi.1003501. eCollection 2014 Mar
We show that existing RNA-seq, DNase-seq, and ChIP-seq data exhibit overdispersed per-base read count distributions that are not matched to existing computational method assumptions. To compensate for this overdispersion we introduce a nonparametric and universal method for processing per-base sequencing read count data called FIXSEQ. We demonstrate that FIXSEQ substantially improves the performance of existing RNA-seq, DNase-seq, and ChIP-seq analysis tools when compared with existing alternatives.
T.B. Hashimoto, M. D. Edwards, D. K. Gifford
PLoS Comput Biol. 2014 Mar 6;10(3):e1003494. doi: 10.1371/journal.pcbi.1003494. eCollection 2014
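Overdispersion in per-base read counts can be checked with a simple diagnostic before any correction is applied. The sketch below computes the variance-to-mean ratio (index of dispersion), which is 1 in expectation for Poisson-distributed counts and larger when counts are overdispersed. It is only a diagnostic and does not reproduce FIXSEQ; the function name and the two count vectors are made up for illustration.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of a list of per-base read counts.

    Values well above 1 suggest more variability than a Poisson model
    with the same mean would produce.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return pvariance(counts) / m


# Hypothetical per-base counts with the same mean (2 reads per base).
even_counts = [1, 2, 3, 2, 2, 2, 3, 1, 2, 2]      # spread out evenly
bursty_counts = [0, 0, 0, 0, 10, 0, 0, 0, 0, 10]  # reads piled on two bases

print(dispersion_index(even_counts))
print(dispersion_index(bursty_counts))
```

The bursty vector yields a ratio far above 1 even though both vectors have the same mean, which is the kind of mismatch with count-model assumptions that the abstract describes.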
Transcriptional profiling is a key technique in the study of cell biology that is limited by the availability of reagents to uniquely identify specific cell types and isolate high quality RNA from them. We report a Method for Analyzing RNA following Intracellular Sorting (MARIS) that generates high quality RNA for transcriptome profiling following cellular fixation, intracellular immunofluorescent staining and FACS. MARIS can therefore be used to isolate high quality RNA from many otherwise inaccessible cell types simply based on immunofluorescent tagging of unique intracellular proteins. As proof of principle, we isolate RNA from sorted human embryonic stem cell-derived insulin-expressing cells as well as adult human β cells. MARIS is a basic molecular biology technique that could be used across several biological disciplines.
S. Hrvatin, F. Deng, C. W. O'Donnell, D. K. Gifford, D. A. Melton.
PLoS One. 2014 Mar 3;9(3):e89459. doi: 10.1371/journal.pone.0089459. eCollection 2014
Human pluripotent stem cells (hPSCs) have the potential to generate any human cell type, and one widely recognized goal is to make pancreatic β cells. To this end, comparisons between differentiated cell types produced in vitro and their in vivo counterparts are essential to validate hPSC-derived cells. Genome-wide transcriptional analysis of sorted insulin-expressing (INS+) cells derived from three independent hPSC lines, human fetal pancreata, and adult human islets points to two major conclusions: (i) Different hPSC lines produce highly similar INS+ cells and (ii) hPSC-derived INS+ (hPSC-INS+) cells more closely resemble human fetal β cells than adult β cells. This study provides a direct comparison of transcriptional programs between pure hPSC-INS+ cells and true β cells and provides a catalog of genes whose manipulation may convert hPSC-INS+ cells into functional β cells.
S. Hrvatin, C. W. O'Donnell, F. Deng, J. R. Millman, F. W. Pagliuca, P. DiIorio, A. Rezania, D. K. Gifford, D. A. Melton.
Proc Natl Acad Sci U.S.A., 2014 Feb 25;111(8):3038-43. doi: 10.1073/pnas.1400709111. Epub 2014 Feb 10
We describe protein interaction quantitation (PIQ), a computational method for modeling the magnitude and shape of genome-wide DNase I hypersensitivity profiles to identify transcription factor (TF) binding sites. Through the use of machine-learning techniques, PIQ identified binding sites for >700 TFs from one DNase I hypersensitivity analysis followed by sequencing (DNase-seq) experiment with accuracy comparable to that of chromatin immunoprecipitation followed by sequencing (ChIP-seq). We applied PIQ to analyze DNase-seq data from mouse embryonic stem cells differentiating into prepancreatic and intestinal endoderm. We identified 120 and experimentally validated eight ‘pioneer’ TF families that dynamically open chromatin. Four pioneer TF families only opened chromatin in one direction from their motifs. Furthermore, we identified ‘settler’ TFs whose genomic binding is principally governed by proximity to open chromatin. Our results support a model of hierarchical TF binding in which directional and nondirectional pioneer activity shapes the chromatin landscape for population by settler TFs.
R. I. Sherwood, T. Hashimoto, C. W. O'Donnell, D. Lewis, A. A. Barkal, J. P. van Hoff, V. Karun, T. Jaakkola, D. K. Gifford.
Nat Biotechnol. 2014 Feb;32(2):171-8. doi: 10.1038/nbt.2798. Epub 2014 Jan 19
Deletion of caudal/cdx genes alters hox gene expression and causes defects in posterior tissues and hematopoiesis. Yet, the defects in hox gene expression only partially explain these phenotypes. To gain deeper insight into Cdx4 function, we performed chromatin immunoprecipitation sequencing (ChIP-seq) combined with gene-expression profiling in zebrafish, and identified the transcription factor spalt-like 4 (sall4) as a Cdx4 target. ChIP-seq revealed that Sall4 bound to its own gene locus and the cdx4 locus. Expression profiling showed that Cdx4 and Sall4 coregulate genes that initiate hematopoiesis, such as hox, scl, and lmo2. Combined cdx4/sall4 gene knock-down impaired erythropoiesis, and overexpression of the Cdx4 and Sall4 target genes scl and lmo2 together rescued the erythroid program. These findings suggest that auto- and cross-regulation of Cdx4 and Sall4 establish a stable molecular circuit in the mesoderm that facilitates the activation of the blood-specific program as development proceeds.
E.J. Paik, S. Mahony, R. M. White, E.N. Price, A. Dibiase, B. Dorjsuren, C. Mosimann, A. J. Davidson, D. Gifford, L. I. Zon.
Stem Cell Reports. 2013 Nov 7;1(5):425-436
Saltatory remodeling of Hox chromatin in response to rostrocaudal patterning signals
E. O. Mazzoni, S. Mahony, M. Peljto, T. Patel, S. R. Thornton, S. McCuine, C. Reeder, L. A. Boyer, R. A. Young, D. K. Gifford, H. Wichterle
Nat Neurosci. 2013 Sep;16(9):1191-8. doi: 10.1038/nn.3490. Epub 2013 Aug 18
Neuroscience. Mapping neuronal diversity one cell at a time
H. Wichterle, D. Gifford, E. Mazzoni
Science. 2013 Aug 16;341(6147):726-7. doi: 10.1126/science.1235884.
Synergistic binding of transcription factors to cell-specific enhancers programs motor neuron identity
E.O. Mazzoni, S. Mahony, M. Closser, C. A. Morrison, S. Nedelec, D. J. Williams, D. An, D. K. Gifford, H. Wichterle
Nat Neurosci. 2013 Sep;16(9):1219-27. doi: 10.1038/nn.3467. Epub 2013 Jul 21
High Resolution Modeling of Chromatin Interactions
C. Reeder, D. Gifford
Research in Computational Molecular Biology, 186-198, 2013
A multi-parametric flow cytometric assay to analyze DNA-protein interactions
M. Arbab, S. Mahony, H. Cho, J. M. Chick, P. A. Rolfe, J. P. van Hoff, V. W. Morris, S. P. Gygi, R. L. Maas, D. K. Gifford, R. I. Sherwood
Nucleic Acids Res. 2013 Jan. 41(2)
Global gene deletion analysis exploring yeast filamentous growth
O. Ryan, R. S. Shapiro, C. F. Kurat, D. Mayhew, A. Baryshnikova, B. Chin, Z. Y. Lin, M. J. Cox, F. Vizeacoumar, D. Cheung, S. Bahr, K. Tsui, F. Tebbji, A. Sellam, F. Istel, T. Schwarzmuller, T. B. Reynolds, K. Kuchler, D. K. Gifford, M. Whiteway, G. Giaever, C. Nislow, M. Costanzo, A. C. Gingras, R. D. Mitra, B. Andrews, G. R. Fink, L. E. Cowen, C. Boone
Science, 2012 Sep 14;337(6100):1353-6.
High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints.
Y. Guo, S. Mahony, D. K. Gifford.
PLoS Comput Biol. 2012 Aug;8(8):e1002638.
Ruler arrays reveal haploid genomic structural variation
P. A. Rolfe, D. A. Bernstein, P. Grisafi, D. K. Gifford
PLoS One, 2012;7(8):e43210
Lineage-based identification of cellular states and expression programs
T. Hashimoto, T. Jaakkola, R. Sherwood, E.O. Mazzoni, H. Wichterle, D. Gifford
Bioinformatics, 2012 Jun 15;28(12):i250-7
High-resolution genetic mapping with pooled sequencing
M. D. Edwards, D. K. Gifford
BMC Bioinformatics, 2012 Apr 19;13 Suppl 6:S8
Embryonic stem cell-based mapping of developmental transcriptional programs
E. O. Mazzoni, S. Mahony, M. Iacovino, C. A. Morrison, G. Mountoufaris, M. Closser, W. A. Whyte, R. A. Young, M. Kyba, D. K. Gifford, H. Wichterle
Nat Methods, 2011 Nov 13;8(12):1056-8. doi: 10.1038/nmeth.1775
ReadDB provides efficient storage for mapped short reads
P. A. Rolfe, D. K. Gifford
BMC Bioinformatics, 2011 Jul 7;12:278.
Discovering regulatory overlapping RNA transcripts
T. Danford, R. Dowell, S. Agarwala, P. Grisafi, G. Fink, D. Gifford
J Comput Biol., 2011 Mar;18(3):295-303
Ligand-dependent dynamics of retinoic acid receptor binding during early neurogenesis
S. Mahony, E. O. Mazzoni, S. McCuine, R. A. Young, H. Wichterle, D. K. Gifford.
Genome Biol., 2011;12(1):R2. Epub 2011 Jan 13
Rapid haplotype inference for nuclear families
A. L. Williams, D. E. Housman, M. C. Rinard, D. K. Gifford
Genome Biol., 2010;11(10):R108. Epub 2010 Oct 29
Discovering homotypic binding events at high spatial resolution
Y. Guo, G. Papachristoudis, R. C. Altshuler, G. K. Gerber, T. S. Jaakkola, D. K. Gifford, S. Mahony
Bioinformatics, 2010 Dec 15;26(24):3028-34. Epub 2010 Oct 21
Global control of motor neuron topography mediated by the repressive actions of a single hox gene
H. Jung, J. Lacombe, E.O. Mazzoni, K. F. Liem Jr, J. Grinstein, S. Mahony, D. Mukhopadhyay, D. K. Gifford, R. A. Young, K. V. Anderson, H. Wichterle, J. S. Dasen
Neuron, 2010 Sep 9;67(5):781-96
Control of transcription by cell size
C. Y. Wu, P. A. Rolfe, D. K. Gifford
PLoS Biol., 2010 Nov 2;8(11):e1000523
Genotype to Phenotype: A Complex Problem
R. D. Dowell, O. Ryan, A. Jansen, D. Cheung, S. Agarwala, T. Danford, D. A. Bernstein, P. A. Rolfe, L. E. Heisler, B. Chin, C. Nislow, G. Giaever, P. C. Phillips, G. R. Fink, D. K. Gifford, and C. Boone
Science, 23 April 2010, p. 469
Feed-forward Regulation of a Cell Fate Determinant by an RNA-binding Protein Generates Asymmetry in Yeast
J. Wolff, R. D. Dowell, S. Mahony, M. Rabani, D. K. Gifford, and G. R. Fink.
2009 and Earlier
Toggle involving cis-interfering noncoding RNAs controls variegated gene expression in yeast
S. L. Bumgarner, R. D. Dowell, P. Grisafi, D. K. Gifford, and G. R. Fink
PNAS 106(43), October, 2009, pp. 18321-18326
Analysis of the mouse embryonic stem cell regulatory networks obtained by ChIP-chip and ChIP-PET
D. Mathur, T. W. Danford, L. A. Boyer, R. A. Young, D. K. Gifford, and R. Jaenisch
Genome Biol., 2008;9(8)
Tissue-specific transcriptional regulation has diverged significantly between human and mouse
D. T. Odom, R. D. Dowell, E. S. Jacobsen, W. Gordon, T. W. Danford, K. D. MacIsaac, P. A. Rolfe, C. M. Conboy, D. K. Gifford, and E. Fraenkel
Nature Genetics, 39:6, 730-732, June, 2007
Automated Discovery of Functional Generality of Human Gene Expression Programs
G. K. Gerber, R. D. Dowell, T. S. Jaakkola, and D. K. Gifford
PLOS Computational Biology, 3:8, August 2007
Semi-supervised analysis of gene expression profiles for lineage-specific development in the Caenorhabditis elegans embryo
Y. Qi, P. E. Missiuro, A. Kapoor, C. P. Hunter, T. S. Jaakkola, D. K. Gifford, and H. Ge
Bioinformatics, 22(14), 15 July 2006, pp. 417-423
Control of developmental regulators by Polycomb in human embryonic stem cells
T. I. Lee, R. G. Jenner, L. A. Boyer, M. G. Guenther, S. S. Levine, R. M. Kumar, B. Chevalier, S. E. Johnstone, M. F. Cole, K. Isono, H. Koseki, T. Fuchikami, K. Abe, H. L. Murray, J. P. Zucker, B. Yuan, G. W. Bell, E. Herbolsheimer, N. M. Hannett, K. Sun, D. T. Odom, A. P. Otte, T. L. Volkert, D. P. Bartel, D. A. Melton, D. K. Gifford, R. Jaenisch, and R. A. Young
Cell, 125(2), April, 2006, pp 301-313.
An improved map of conserved regulatory sites for Saccharomyces cerevisiae
K. D. MacIsaac, T. Wang, D. B. Gordon, D. K. Gifford, G. D. Stormo, and E. Fraenkel
BMC Bioinformatics, March, 2006
A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data
K. D. MacIsaac, D. B. Gordon, L. Nekludova, D. T. Odom, J. Schreiber, D. K. Gifford, R. A. Young, and E. Fraenkel
Bioinformatics, Feb., 2006.
Polycomb complexes repress developmental regulators in murine embryonic stem cells
L. A. Boyer, K. Plath, J. Zeitlinger, T. Brambrink, L. A. Medeiros, T. I. Lee, S. S. Levine, M. Wernig, A. Tajonar, M. K. Ray, G. W. Bell, A. P. Otte, M. Vidal, D. K. Gifford, R. A. Young, and R. Jaenisch.
Nature, 441:349-353, May 2006
Coordinated binding of NF-κB family members in the response of human cells to lipopolysaccharide
J. Schreiber, R. G. Jenner, H. L. Murray, G. K. Gerber, D. K. Gifford and R. A. Young
Proceedings of the National Academy of Sciences (PNAS), 103(10):5899-5904, 2006
High-resolution computational models of genome binding events
A. Qi, P.A. Rolfe, K. MacIsaac, G.K. Gerber, D. Pokholok, J. Zeitlinger, T. Danford, R.D. Dowell, E. Fraenkel, T.S. Jaakkola, R.A. Young, and D.K. Gifford
Nature Biotechnology, 24, 963-970 (2006)
Core Transcriptional Regulatory Circuitry in Human Hepatocytes.
D.T. Odom, R.D. Dowell, E.S. Jacobsen, L. Nekludova, P.A. Rolfe, T.W. Danford, D.K. Gifford, E. Fraenkel, G.I. Bell, and R.A. Young.
Nature/EMBO Molecular Systems Biology, msb4100059, 2 May 2006.
Genome-wide map of nucleosome acetylation and methylation in yeast
D. K. Pokholok, C. T. Harbison, S. Levine, M. Cole, N. M. Hannett, T. I. Lee, G. W. Bell, K. Walker, P. A. Rolfe, E. Herbolsheimer, J. Zeitlinger, F. Lewitter, D. K. Gifford, and R. A. Young
Cell, 122(4), August, 2005. pp. 517-527.
Core Transcriptional Regulatory Circuitry in Human Embryonic Stem Cells
L. A. Boyer, T. I. Lee, M. F. Cole, S. E. Johnstone, S. S. Levine, J. P. Zucker, M. G. Guenther, R. M. Kumar, H. L. Murray, R. G. Jenner, D. K. Gifford, D. A. Melton, R. Jaenisch, and R. A. Young
Cell, Vol. 122, 1-20, September, 2005
Global position and recruitment of HATs and HDACs in the yeast genome
F. Robert, D. K. Pokholok, N. M. Hannett, N. J. Rinaldi, M. Chandy, A. Rolfe, J. L. Workman, D. K. Gifford, and R. A. Young
Mol. Cell, 16(2), October, 2004, pp. 199-209
Deconvolving cell cycle expression data with complementary information
Z. Bar-Joseph, S. Farkash, D. K. Gifford, I. Simon, R. Rosenfeld
Bioinformatics, Vol. 20 Suppl. 1, 2004. pp i23-i30
Transcriptional regulatory code of a eukaryotic genome
C. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. MacIsaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young
Nature, 431:99-104, September, 2004
Control of Pancreas and Liver Gene Expression by HNF Transcription Factors
Odom, D. T., Zizlsperger, N., Gordon, D. B., Bell, G. W., Rinaldi, N. J., Murray, H. L., Volkert, T. L., Schreiber, J., Rolfe, P. A., Gifford, D. K., Fraenkel, E., Bell, G. I., Young, R. A.
Science, 303:1378-1381, February, 2004
Comparing the Continuous Representation of Time Series Gene Expression Profiles to Identify Differentially Expressed Genes
Z. Bar-Joseph, G. K. Gerber, I. Simon, D. K. Gifford, and T. S. Jaakkola
Proceedings of the National Academy of Sciences, 2003 Sept. 2; 100(18):10146-10151
Negative Information for Motif Discovery
Takusagawa, K. T., Gifford, D. K.
Pacific Symposium on Biocomputing, 9:360-371, 2004
Computational discovery of gene modules and regulatory networks
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B., Fraenkel, E., Jaakkola, T. S., Young, R. A., Gifford, D. K.
Nature Biotechnology, 21, pp. 1337-1342 November, 2003
Continuous Representations of Time Series Gene Expression Data
Z. Bar-Joseph, G. Gerber, D. Gifford, T. Jaakkola and I. Simon
J Comput Biol., 2003;10(3-4):341-56
K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data
Ziv Bar-Joseph, Erik D. Demaine, David K. Gifford, Angèle M. Hamel, Tommi S. Jaakkola and Nathan Srebro
Bioinformatics, Vol. 19, No. 9, 2003
Transcriptional Regulatory Networks In Saccharomyces cerevisiae
T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, D. K. Gifford and R. A. Young
Science, 298:799-804 (2002)
K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data
Ziv Bar-Joseph, Erik D. Demaine, David K. Gifford, Angèle M. Hamel, Tommi S. Jaakkola and Nathan Srebro
Proceedings of the 2nd Workshop on Algorithms in Bioinformatics (WABI 2002), Rome, Italy, September 17-21
Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models.
Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola, and Richard A. Young
Pacific Symposium on Biocomputing, 2002, Kauai, January 2002
Bayesian Methods for Elucidating Genetic Regulatory Networks
Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., Young, R. A.
IEEE Intelligent Systems in Biology, Vol. 17, No. 2, March, 2002, pp. 37-43
Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle
Simon, I., Barnett, J., Hannett, N., Harbison, C. T., Rinaldi, N. J., Volkert, T. L., Wyrick, J. J., Zeitlinger, J., Gifford, D. K., Jaakkola, T. S., Young, R. A.
Cell, 106, Sept., 2001, pp. 697-708
A new approach to analyzing gene expression time series data.
Z. Bar-Joseph, G. Gerber, D. Gifford, T. Jaakkola and I. Simon
Proceedings of The Sixth Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2002, pp 39-48
Blazing pathways through genetic mountains
Gifford, D. K.
Science, 2001 Sep 14;293(5537):2049-51
Fast optimal leaf ordering for hierarchical clustering.
Z. Bar-Joseph, D. Gifford, and T. Jaakkola
Bioinformatics (Proceedings of ISMB 2001), 17(S1), 2001, pp. 22-29
Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks
Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola, and Richard A. Young
Pacific Symposium on Biocomputing, 2001, Hawaii, January 2001
Maximum Likelihood Estimation of Optimal Scaling Factors for Expression Array Normalization
Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola, and Richard A. Young
SPIE BiOS 2001, San Jose, California, January 2001
Experimental Efficiency of Programmed Mutagenesis
Julia Khodor and David K. Gifford
New Generation Computing 20:3, pp 307-315, 2002
Programmed Mutagenesis is Universal
Julia Khodor and David K. Gifford
Theory of Computing Systems, 35(5), pp. 483-499, 2002.
Simulating Biological Reactions: A Modular Approach
Alexander J. Hartemink, Tarjei S. Mikkelsen, and David K. Gifford
5th Annual DIMACS Workshop on DNA-Based Computers, Boston, Massachusetts, June 1999
Automated Constraint-Based Nucleotide Sequence Selection for DNA Computation
Alexander J. Hartemink, David K. Gifford, and Julia Khodor
4th Annual DIMACS Workshop on DNA-Based Computers, Philadelphia, Pennsylvania, June 1998
Design & Implementation of Computational Systems Based on Programmed Mutagenesis
Khodor, J. and Gifford, D. K.
DIMACS Workshop on Nucleic Acid Selection and Computing, Princeton University, March, 1998
Thermodynamic Simulation of Deoxyoligonucleotide Hybridization for DNA Computation
Alexander J. Hartemink and David K. Gifford
3rd Annual DIMACS Workshop on DNA-Based Computers, Philadelphia, Pennsylvania, June 1997
The Efficiency of Sequence-Specific Separation of DNA Mixtures for Biological Computing
Julia Khodor and David K. Gifford
3rd Annual DIMACS Workshop on DNA-Based Computers, Philadelphia, Pennsylvania, June 1997