A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction.

The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of non-coding genetic variations.

Guo Y, Tian K, Zeng H, Guo X, Gifford DK.
Genome Res. 2018 Apr 13. pii: gr.226852.117. doi: 10.1101/gr.226852.117

DNase-capture reveals differential transcription factor binding modalities.

We describe DNase-capture, an assay that increases the analytical resolution of DNase-seq by focusing its sequencing phase on selected genomic regions. We introduce a new method to compensate for capture bias called BaseNormal that allows for accurate recovery of transcription factor protection profiles from DNase-capture data. We show that these normalized data allow for nuanced detection of transcription factor binding heterogeneity with as few as dozens of sites.

Kang D, Sherwood R, Barkal A, Hashimoto T, Engstrom L, Gifford D.
PLoS One. 2017 Dec 28;12(12):e0187046. doi: 10.1371/journal.pone.0187046. eCollection 2017.

Working toward precision medicine: Predicting phenotypes from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges.

Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.

Daneshjou R et al.
Hum Mutat. 2017 Sep;38(9):1182-1192. doi: 10.1002/humu.23280. Epub 2017 Jul 7.

Sequence to Better Sequence: Continuous Revision of Combinatorial Structures

We present a model that, after learning on observations of (sequence, outcome) pairs, can be efficiently used to revise a new sequence in order to improve its associated outcome. Our framework requires neither example improvements, nor additional evaluation of outcomes for proposed revisions. To avoid combinatorial-search over sequence elements, we specify a generative model with continuous latent factors, which is learned via joint approximate inference using a recurrent variational autoencoder (VAE) and an outcome-predicting neural network module. Under this model, gradient methods can be used to efficiently optimize the continuous latent factors with respect to inferred outcomes. By appropriately constraining this optimization and using the VAE decoder to generate a revised sequence, we ensure the revision is fundamentally similar to the original sequence, is associated with better outcomes, and looks natural. These desiderata are proven to hold with high probability under our approach, which is empirically demonstrated for revising natural language sentences.

Mueller J, Gifford DK, Jaakkola T.
International Conference on Machine Learning, pp. 2536-2544. 2017.

Modeling Persistent Trends in Distributions

We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a trend in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data.

Mueller J, Jaakkola T, Gifford DK.
Journal of the American Statistical Association 2017

Differential chromatin profiles partially determine transcription factor binding

We characterize how genomic variants that alter chromatin accessibility influence regulatory factor binding with a new method called DeltaBind that predicts condition specific factor binding more accurately than other methods based on DNase-seq data. Using DeltaBind and DNase-seq experiments we predicted the differential binding of 18 factors in K562 and GM12878 cells with an average precision of 28% at 10% recall, with the prediction of individual factors ranging from 5% to 65% precision. We further found that genome variants that alter chromatin accessibility are not necessarily predictive of altering proximal factor binding. Taken together these findings suggest that DNase-seq or ATAC-seq Quantitative Trait Loci (dsQTLs), while important, must be considered in a broader context to establish causality for phenotypic changes.

Chen R and Gifford DK.
PLoS One. 2017 Jul 13;12(7):e0179411. doi: 10.1371/journal.pone.0179411. eCollection 2017.

Predicting the impact of non-coding variants on DNA methylation.

DNA methylation plays a crucial role in the establishment of tissue-specific gene expression and the regulation of key biological processes. However, our present inability to predict the effect of genome sequence variation on DNA methylation precludes a comprehensive assessment of the consequences of non-coding variation. We introduce CpGenie, a sequence-based framework that learns a regulatory code of DNA methylation using a deep convolutional neural network and uses this network to predict the impact of sequence variation on proximal CpG site DNA methylation. CpGenie produces allele-specific DNA methylation prediction with single-nucleotide sensitivity that enables accurate prediction of methylation quantitative trait loci (meQTL). We demonstrate that CpGenie prioritizes validated GWAS SNPs, and contributes to the prediction of functional noncoding variants, including expression quantitative trait loci (eQTL) and disease-associated mutations. CpGenie is publicly available to assist in identifying and interpreting regulatory non-coding variants.

Zeng H and Gifford DK.
Nucleic Acids Res. 2017 Jun 20;45(11):e99. doi: 10.1093/nar/gkx177.

Accurate eQTL prioritization with an ensemble-based framework

We present a novel ensemble-based computational framework, EnsembleExpr, that achieved the best performance in the Fourth Critical Assessment of Genome Interpretation expression quantitative trait locus "(eQTL)-causal SNPs" challenge for identifying eQTLs and prioritizing their gene expression effects. eQTLs are genome sequence variants that result in gene expression changes and are thus prime suspects in the search for contributions to the causality of complex traits. When EnsembleExpr is trained on data from massively parallel reporter assays, it accurately predicts reporter expression levels from unseen regulatory sequences and identifies sequence variants that exhibit significant changes in reporter expression. Compared with other state-of-the-art methods, EnsembleExpr achieved competitive performance when applied on eQTL datasets determined by other protocols. We envision EnsembleExpr to be a resource to help interpret noncoding regulatory variants and prioritize disease-associated mutations for downstream validation.

Zeng H, Edwards MD, Guo Y, Gifford DK.
Hum Mutat. 2017 Feb 21, doi: 10.1002/humu.23198.

Predicting gene expression in massively parallel reporter assays: A comparative study.

In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role “coded” in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.

Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong NA, Li Y, Eraslan G, Amin TB, Goke J, Mueller NS, Kellis M, Kundaje A, Beer MA, Keles S, Gifford DK, Yosef N.
Hum Mutat. 2017 Feb 21. doi: 10.1002/humu.23197.

Modular combinatorial binding among human trans-acting factors reveals direct and indirect factor binding.

BACKGROUND: The combinatorial binding of trans-acting factors (TFs) to the DNA is critical to the spatial and temporal specificity of gene regulation. For certain regulatory regions, more than one regulatory module (set of TFs that bind together) are combined to achieve context-specific gene regulation. However, previous approaches are limited to either pairwise TF co-association analysis or assuming that only one module is used in each regulatory region. RESULTS: We present a new computational approach that models the modular organization of TF combinatorial binding. Our method learns compact and coherent regulatory modules from in vivo binding data using a topic model. We found that the binding of 115 TFs in K562 cells can be organized into 49 interpretable modules. Furthermore, we found that tens of thousands of regulatory regions use multiple modules, a structure that cannot be observed with previous hard clustering based methods. The modules discovered recapitulate many published protein-protein physical interactions, have consistent functional annotations of chromatin states, and uncover context specific co-binding such as gene proximal binding of NFY + FOS + SP and distal binding of NFY + FOS + USF. For certain TFs, the co-binding partners of direct binding (motif present) differ from those of indirect binding (motif absent); the distinct set of co-binding partners can predict whether the TF binds directly or indirectly with up to 95% accuracy. Joint analysis across two cell types reveals both cell-type-specific and shared regulatory modules. CONCLUSIONS: Our results provide comprehensive cell-type-specific combinatorial binding maps and suggest a modular organization of combinatorial binding.

Guo Y, Gifford DK.
BMC Genomics. 2017 Jan 6;18(1):45. doi: 10.1186/s12864-016-3434-3.

Expression of Terminal Effector Genes in Mammalian Neurons Is Maintained by a Dynamic Relay of Transient Enhancers.

Generic spinal motor neuron identity is established by cooperative binding of programming transcription factors (TFs), Isl1 and Lhx3, to motor-neuron-specific enhancers. How expression of effector genes is maintained following downregulation of programming TFs in maturing neurons remains unknown. High-resolution exonuclease (ChIP-exo) mapping revealed that the majority of enhancers established by programming TFs are rapidly deactivated following Lhx3 downregulation in stem-cell-derived hypaxial motor neurons. Isl1 is released from nascent motor neuron enhancers and recruited to new enhancers bound by clusters of Onecut1 in maturing neurons. Synthetic enhancer reporter assays revealed that Isl1 operates as an integrator factor, translating the density of Lhx3 or Onecut1 binding sites into transient enhancer activity. Importantly, independent Isl1/Lhx3- and Isl1/Onecut1-bound enhancers contribute to sustained expression of motor neuron effector genes, demonstrating that outwardly stable expression of terminal effector genes in postmitotic neurons is controlled by a dynamic relay of stage-specific enhancers.

Rhee HS, Closser M, Guo Y, Bashkirova EV, Tan CG, Gifford DK, Wichterle H.
Neuron. 2016 Dec 21;92(6):1252-1265. doi: 10.1016/j.neuron.2016.11.037.

Identification of new branch points and unconventional introns in Saccharomyces cerevisiae.

Spliced messages constitute one-fourth of expressed mRNAs in the yeast Saccharomyces cerevisiae, and most mRNAs in metazoans. Splicing requires 5′ splice site (5′SS), branch point (BP), and 3′ splice site (3′SS) elements, but the role of the BP in splicing control is poorly understood because BP identification remains difficult. We developed a high-throughput method, Branch-seq, to map BPs and 5′SSs of isolated RNA lariats. Applied to S. cerevisiae, Branch-seq detected 76% of expressed, annotated BPs and identified a comparable number of novel BPs. We performed RNA-seq to confirm associated 3′SS locations, identifying some 200 novel splice junctions, including an AT-AC intron. We show that several yeast introns use two or even three different BPs, with effects on 3′SS choice, protein coding potential, or RNA stability, and identify novel introns whose splicing changes during meiosis or in response to stress. Together, these findings show unanticipated complexity of splicing in yeast.

Gould GM, Paggi JM, Guo Y, Phizicky DV, Zinshteyn B, Wang ET, Gilbert WV, Gifford DK, Burge CB.
RNA. 2016 Oct;22(10):1522-34. doi: 10.1261/rna.057216.116.

A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility

Enhancers and promoters commonly occur in accessible chromatin characterized by depleted nucleosome contact; however, it is unclear how chromatin accessibility is governed. We show that log-additive cis-acting DNA sequence features can predict chromatin accessibility at high spatial resolution. We develop a new type of high-dimensional machine learning model, the Synergistic Chromatin Model (SCM), which when trained with DNase-seq data for a cell type is capable of predicting expected read counts of genome-wide chromatin accessibility at every base from DNA sequence alone, with the highest accuracy at hypersensitive sites shared across cell types. We confirm that an SCM accurately predicts chromatin accessibility for thousands of synthetic DNA sequences using a novel CRISPR-based method of highly efficient site-specific DNA library integration. SCMs are directly interpretable and reveal that a logic based on local, non-specific synergistic effects, largely among pioneer TFs, is sufficient to predict a large fraction of cellular chromatin accessibility in a wide variety of cell types.

Hashimoto TB, Sherwood RI, Kang DD, Barkal AA, Zeng H, Emons BJM, Srinivasan S, Rajagopal N, Jaakkola T, and Gifford DK.
Genome Research, doi: 10.1101/gr.199778.115

Learning Population-Level Diffusions with Generative Recurrent Networks

We estimate stochastic processes that govern the dynamics of evolving populations such as cell differentiation. The problem is challenging since longitudinal trajectory measurements of individuals in a population are rarely available due to experimental cost and/or privacy. We show that cross-sectional samples from an evolving population suffice for recovery within a class of processes even if samples are available only at a few distinct time points. We provide a stratified analysis of recoverability conditions, and establish that reversibility is sufficient for recoverability. For estimation, we derive a natural loss and regularization, and parameterize the processes as diffusive recurrent neural networks. We demonstrate the approach in the context of uncovering complex cellular dynamics known as the ‘epigenetic landscape’ from existing biological assays.

Hashimoto TB, Gifford DK, Jaakkola TS.
Proceedings of the 33rd International Conference on Machine Learning (ICML)

Convolutional neural network architectures for predicting DNA-protein binding.

We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology.

Zeng H, Edwards MD, Liu G, Gifford DK.
Bioinformatics. 2016 Jun 15;32(12):i121-i127. doi: 10.1093/bioinformatics/btw255.

A distant trophoblast-specific enhancer controls HLA-G expression at the maternal-fetal interface.

HLA-G, a nonclassical HLA molecule uniquely expressed in the placenta, is a central component of fetus-induced immune tolerance during pregnancy. The tissue-specific expression of HLA-G, however, remains poorly understood. Here, systematic interrogation of the HLA-G locus using massively parallel reporter assay (MPRA) uncovered a previously unidentified cis-regulatory element 12 kb upstream of HLA-G with enhancer activity, Enhancer L. Strikingly, clustered regularly-interspaced short palindromic repeats (CRISPR)/Cas9-mediated deletion of this enhancer resulted in ablation of HLA-G expression in JEG3 cells and in primary human trophoblasts isolated from placenta. RNA-seq analysis demonstrated that Enhancer L specifically controls HLA-G expression. Moreover, DNase-seq and chromatin conformation capture (3C) defined Enhancer L as a cell type-specific enhancer that loops into the HLA-G promoter. Interestingly, MPRA-based saturation mutagenesis of Enhancer L identified motifs for transcription factors of the CEBP and GATA families essential for placentation. These factors associate with Enhancer L and regulate HLA-G expression. Our findings identify long-range chromatin looping mediated by core trophoblast transcription factors as the mechanism controlling tissue-specific HLA-G expression at the maternal-fetal interface. More broadly, these results establish the combination of MPRA and CRISPR/Cas9 deletion as a powerful strategy to investigate human immune gene regulation.

Ferreira LM, Meissner TB, Mikkelsen TS, Mallard W, O'Donnell CW, Tilburgs T, Gomes HA, Camahort R, Sherwood RI, Gifford DK, Rinn JL, Cowan CA, Strominger JL.
Proc Natl Acad Sci U.S.A. 2016 May 10;113(19):5364-9. doi: 10.1073/pnas.1602886113.

Cas9 Functionally Opens Chromatin

Using a nuclease-dead Cas9 mutant, we show that Cas9 reproducibly induces chromatin accessibility at previously inaccessible genomic loci. Cas9 chromatin opening is sufficient to enable adjacent binding and transcriptional activation by the settler transcription factor retinoic acid receptor at previously unbound motifs. Thus, we demonstrate a new use for Cas9 in increasing surrounding chromatin accessibility to alter local transcription factor binding.

Barkal AA, Srinivasan S, Hashimoto T, Gifford DK, Sherwood RI.
PLoS One. 2016 Mar 31;11(3):e0152683.

High-throughput mapping of regulatory DNA

Quantifying the effects of cis-regulatory DNA on gene expression is a major challenge. Here, we present the multiplexed editing regulatory assay (MERA), a high-throughput CRISPR-Cas9–based approach that analyzes the functional impact of the regulatory genome in its native context. MERA tiles thousands of mutations across ~40 kb of cis-regulatory genomic space and uses knock-in green fluorescent protein (GFP) reporters to read out gene activity. Using this approach, we obtain quantitative information on the contribution of cis-regulatory regions to gene expression. We identify proximal and distal regulatory elements necessary for expression of four embryonic stem cell–specific genes. We show a consistent contribution of neighboring gene promoters to gene expression and identify unmarked regulatory elements (UREs) that control gene expression but do not have typical enhancer epigenetic or chromatin features. We compare thousands of functional and nonfunctional genotypes at a genomic location and identify the base pair–resolution functional motifs of regulatory elements.

Rajagopal, Nisha, Sharanya Srinivasan, Kameron Kooshesh, Yuchun Guo, Matthew D. Edwards, Budhaditya Banerjee, Tahin Syed, Bart JM Emons, David K. Gifford, and Richard I. Sherwood.
Nature Biotechnology 34, 167–174 (2016)

GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding

The majority of disease-associated variants identified in genome-wide association studies reside in noncoding regions of the genome with regulatory roles. Thus being able to interpret the functional consequence of a variant is essential for identifying causal variants in the analysis of genome-wide association studies. We present GERV (generative evaluation of regulatory variants), a novel computational method for predicting regulatory variants that affect transcription factor binding. GERV learns a k-mer-based generative model of transcription factor binding from ChIP-seq and DNase-seq data, and scores variants by computing the change of predicted ChIP-seq reads between the reference and alternate allele. The k-mers learned by GERV capture more sequence determinants of transcription factor binding than a motif-based approach alone, including both a transcription factor's canonical motif and associated co-factor motifs. We show that GERV outperforms existing methods in predicting single-nucleotide polymorphisms associated with allele-specific binding. GERV correctly predicts a validated causal variant among linked single-nucleotide polymorphisms and prioritizes the variants previously reported to modulate the binding of FOXA1 in breast cancer cell lines. Thus, GERV provides a powerful approach for functionally annotating and prioritizing causal variants for experimental follow-up analysis.

H. Zeng, T. Hashimoto, D.D. Kang, D.K. Gifford.
Bioinformatics 32 (4): 490-496 (2016)

High resolution mapping of enhancer-promoter interactions

RNA Polymerase II ChIA-PET data has revealed enhancers that are active in a profiled cell type and the genes that the enhancers regulate through chromatin interactions. The most commonly used computational method for analyzing ChIA-PET data, the ChIA-PET Tool, discovers interaction anchors at a spatial resolution that is insufficient to accurately identify individual enhancers. We introduce Germ, a computational method that estimates the likelihood that any two narrowly defined genomic locations are jointly occupied by RNA Polymerase II. Germ takes a blind deconvolution approach to simultaneously estimate the likelihood of RNA Polymerase II occupation as well as a model of the arrangement of read alignments relative to locations occupied by RNA Polymerase II. Both types of information are utilized to estimate the likelihood that RNA Polymerase II jointly occupies any two genomic locations. We apply Germ to RNA Polymerase II ChIA-PET data from embryonic stem cells to identify the genomic locations that are jointly occupied along with transcription start sites. We show that these genomic locations align more closely with features of active enhancers measured by ChIP-Seq than the locations identified using the ChIA-PET Tool. We also apply Germ to RNA Polymerase II ChIA-PET data from motor neuron progenitors. Based on the Germ results, we observe that a combination of cell type specific and cell type independent regulatory interactions are utilized by cells to regulate gene expression.

Christopher Reeder, Michael Closser, Huay Mei Poh, Kuljeet Sandhu, Hynek Wichterle, David Gifford.
PLOS ONE. doi:10.1371/journal.pone.0122420 May 13, 2015

Long-term persistence and development of induced pancreatic beta cells generated by lineage conversion of acinar cells

Direct lineage conversion is a promising approach to generate therapeutically important cell types for disease modeling and tissue repair. However, the survival and function of lineage-reprogrammed cells in vivo over the long term has not been examined. Here, using an improved method for in vivo conversion of adult mouse pancreatic acinar cells toward beta cells, we show that induced beta cells persist for up to 13 months (the length of the experiment), form pancreatic islet-like structures and support normoglycemia in diabetic mice. Detailed molecular analyses of induced beta cells over 7 months reveal that global DNA methylation changes occur within 10 d, whereas the transcriptional network evolves over 2 months to resemble that of endogenous beta cells and remains stable thereafter. Progressive gain of beta-cell function occurs over 7 months, as measured by glucose-regulated insulin release and suppression of hyperglycemia. These studies demonstrate that lineage-reprogrammed cells persist for >1 year and undergo epigenetic, transcriptional, anatomical and functional development toward a beta-cell phenotype.

Weida Li, Claudia Cavelti-Weder, Yinying Zhang, Kendell Clement, Scott Donovan, Gabriel Gonzalez, Jiang Zhu, Marianne Stemann, Ke Xu, Tatsu Hashimoto, Takatsugu Yamada, Mio Nakanishi, Yuemei Zhang, Samuel Zeng, David Gifford, Alexander Meissner, Gordon Weir, and Qiao Zhou.
Nat Biotechnol. 2014 Dec;32(12):1223-30. doi: 10.1038/nbt.3082. Epub 2014 Nov 17

Gene co-regulation by Fezf2 selects neurotransmitter identity and connectivity of corticospinal neurons

The neocortex contains an unparalleled diversity of neuronal subtypes, each defined by distinct traits that are developmentally acquired under the control of subtype-specific and pan-neuronal genes. The regulatory logic that orchestrates the expression of these unique combinations of genes is unknown for any class of cortical neuron. Here, we report that Fezf2 is a selector gene able to regulate the expression of gene sets that collectively define mouse corticospinal motor neurons.

S. Lodato, B.J. Molyneaux, E. Zuccaro, L.A. Goff, H.H. Chen, W. Yuan, A. Meleski, E. Takahashi, S. Mahony, J.L. Rinn, D.K. Gifford, P. Arlotta.
Nat Neurosci. 2014 Aug;17(8):1046-54. doi: 10.1038/nn.3757. Epub 2014 Jul 6

Interactions between chromosomal and nonchromosomal elements reveal missing heritability

The measurement of any nonchromosomal genetic contribution to the heritability of a trait is often confounded by the inability to control both the chromosomal and nonchromosomal information in a population. We have designed a unique system in yeast where we can control both sources of information so that the phenotype of a single chromosomal polymorphism can be measured in the presence of different cytoplasmic elements. With this system, we have shown that both the source of the mitochondrial genome and the presence or absence of a dsRNA virus influence the phenotype of chromosomal variants that affect the growth of yeast. Moreover, by considering this nonchromosomal information that is passed from parent to offspring and by allowing chromosomal and nonchromosomal information to exhibit non-additive interactions, we are able to account for much of the heritability of growth traits. Taken together, our results highlight the importance of including all sources of heritable information in genetic studies and suggest a possible avenue of attack for finding additional missing heritability.

Matthew D. Edwards, Anna Symbor-Nagrabska, Lindsey Dollard, David K. Gifford, Gerald R. Fink.
Proc. Natl. Acad. Sci. U.S.A., 2014, May 13. pii: 201407126

An integrated model of multiple-condition ChIP-Seq data reveals predeterminants of Cdx2 binding

Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS’s multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulator signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2.

S. Mahony, M. D. Edwards, E. O. Mazzoni, R. I. Sherwood, A. Kakumanu, C. A. Morrison, H. Wichterle, D. K. Gifford.
PLoS Comput Biol. 2014 Mar 27;10(3):e1003501. doi: 10.1371/journal.pcbi.1003501. eCollection 2014 Mar

Universal count correction for high-throughput sequencing

We show that existing RNA-seq, DNase-seq, and ChIP-seq data exhibit overdispersed per-base read count distributions that are not matched to existing computational method assumptions. To compensate for this overdispersion we introduce a nonparametric and universal method for processing per-base sequencing read count data called FIXSEQ. We demonstrate that FIXSEQ substantially improves the performance of existing RNA-seq, DNase-seq, and ChIP-seq analysis tools when compared with existing alternatives.

T. B. Hashimoto, M. D. Edwards, D. K. Gifford.
PLoS Comput Biol. 2014 Mar 6;10(3):e1003494. doi: 10.1371/journal.pcbi.1003494. eCollection 2014

MARIS: method for analyzing RNA following intracellular sorting

Transcriptional profiling is a key technique in the study of cell biology that is limited by the availability of reagents to uniquely identify specific cell types and isolate high quality RNA from them. We report a Method for Analyzing RNA following Intracellular Sorting (MARIS) that generates high quality RNA for transcriptome profiling following cellular fixation, intracellular immunofluorescent staining and FACS. MARIS can therefore be used to isolate high quality RNA from many otherwise inaccessible cell types simply based on immunofluorescent tagging of unique intracellular proteins. As proof of principle, we isolate RNA from sorted human embryonic stem cell-derived insulin-expressing cells as well as adult human β cells. MARIS is a basic molecular biology technique that could be used across several biological disciplines.

S. Hrvatin, F. Deng, C. W. O'Donnell, D. K. Gifford, D. A. Melton.
PLoS One. 2014 Mar 3;9(3):e89459. doi: 10.1371/journal.pone.0089459. eCollection 2014

Differentiated human stem cells resemble fetal, not adult, β cells.

Human pluripotent stem cells (hPSCs) have the potential to generate any human cell type, and one widely recognized goal is to make pancreatic β cells. To this end, comparisons between differentiated cell types produced in vitro and their in vivo counterparts are essential to validate hPSC-derived cells. Genome-wide transcriptional analysis of sorted insulin-expressing (INS+) cells derived from three independent hPSC lines, human fetal pancreata, and adult human islets points to two major conclusions: (i) Different hPSC lines produce highly similar INS+ cells and (ii) hPSC-derived INS+ (hPSC-INS+) cells more closely resemble human fetal β cells than adult β cells. This study provides a direct comparison of transcriptional programs between pure hPSC-INS+ cells and true β cells and provides a catalog of genes whose manipulation may convert hPSC-INS+ cells into functional β cells.

S. Hrvatin, C. W. O'Donnell, F. Deng, J. R. Millman, F. W. Pagliuca, P. DiIorio, A. Rezania, D. K. Gifford, D. A. Melton.
Proc Natl Acad Sci U.S.A., 2014 Feb 25;111(8):3038-43. doi: 10.1073/pnas.1400709111. Epub 2014 Feb 10

Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape

We describe protein interaction quantitation (PIQ), a computational method for modeling the magnitude and shape of genome-wide DNase I hypersensitivity profiles to identify transcription factor (TF) binding sites. Through the use of machine-learning techniques, PIQ identified binding sites for >700 TFs from one DNase I hypersensitivity analysis followed by sequencing (DNase-seq) experiment with accuracy comparable to that of chromatin immunoprecipitation followed by sequencing (ChIP-seq). We applied PIQ to analyze DNase-seq data from mouse embryonic stem cells differentiating into prepancreatic and intestinal endoderm. We identified 120 and experimentally validated eight ‘pioneer’ TF families that dynamically open chromatin. Four pioneer TF families only opened chromatin in one direction from their motifs. Furthermore, we identified ‘settler’ TFs whose genomic binding is principally governed by proximity to open chromatin. Our results support a model of hierarchical TF binding in which directional and nondirectional pioneer activity shapes the chromatin landscape for population by settler TFs.

R. I. Sherwood, T. Hashimoto, C. W. O'Donnell, D. Lewis, A. A. Barkal, J. P. van Hoff, V. Karun, T. Jaakkola, D. K. Gifford.
Nat Biotechnol. 2014 Feb;32(2):171-8. doi: 10.1038/nbt.2798. Epub 2014 Jan 19

A Cdx4-Sall4 Regulatory Module Controls the Transition from Mesoderm Formation to Embryonic Hematopoiesis

Deletion of caudal/cdx genes alters hox gene expression and causes defects in posterior tissues and hematopoiesis. Yet, the defects in hox gene expression only partially explain these phenotypes. To gain deeper insight into Cdx4 function, we performed chromatin immunoprecipitation sequencing (ChIP-seq) combined with gene-expression profiling in zebrafish, and identified the transcription factor spalt-like 4 (sall4) as a Cdx4 target. ChIP-seq revealed that Sall4 bound to its own gene locus and the cdx4 locus. Expression profiling showed that Cdx4 and Sall4 coregulate genes that initiate hematopoiesis, such as hox, scl, and lmo2. Combined cdx4/sall4 gene knock-down impaired erythropoiesis, and overexpression of the Cdx4 and Sall4 target genes scl and lmo2 together rescued the erythroid program. These findings suggest that auto- and cross-regulation of Cdx4 and Sall4 establish a stable molecular circuit in the mesoderm that facilitates the activation of the blood-specific program as development proceeds.

E. J. Paik, S. Mahony, R. M. White, E. N. Price, A. Dibiase, B. Dorjsuren, C. Mosimann, A. J. Davidson, D. Gifford, L. I. Zon.
Stem Cell Reports. 2013 Nov 7;1(5):425-436





2009 and Earlier