Predicted Cellular Immunity Population Coverage Gaps for SARS-CoV-2 Subunit Vaccines and Their Augmentation by Compact Peptide Sets

Subunit vaccines induce immunity to a pathogen by presenting a component of the pathogen and thus inherently limit the representation of pathogen peptides for cellular immunity-based memory. We find that severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) subunit peptides may not be robustly displayed by the major histocompatibility complex (MHC) molecules in certain individuals. We introduce an augmentation strategy for subunit vaccines that adds a small number of SARS-CoV-2 peptides to a vaccine to improve the population coverage of pathogen peptide display. Our population coverage estimates integrate clinical data on peptide immunogenicity in convalescent COVID-19 patients and machine learning predictions. We evaluate the population coverage of 9 different subunits of SARS-CoV-2, including 5 functional domains and 4 full proteins, and augment each of them to fill a predicted coverage gap.

Ge Liu, Brandon Carter, and David K. Gifford
Cell Systems. Volume 12, P1–6, February 17, 2021
DOI: 10.1016/j.cels.2020.11.010

Comprehensive Mapping of Key Regulatory Networks that Drive Oncogene Expression

Gene expression is controlled by the collective binding of transcription factors to cis-regulatory regions. Deciphering gene-centered regulatory networks is vital to understanding and controlling gene misexpression in human disease; however, systematic approaches to uncovering regulatory networks have been lacking. Here we present high-throughput interrogation of gene-centered activation networks (HIGAN), a pipeline that employs a suite of multifaceted genomic approaches to connect upstream signaling inputs, trans-acting TFs, and cis-regulatory elements. We apply HIGAN to understand the aberrant activation of the cytidine deaminase APOBEC3B, an intrinsic source of cancer hypermutation. We reveal that nuclear factor κB (NF-κB) and AP-1 pathways are the most salient trans-acting inputs, with minor roles for other inflammatory pathways. We identify a cis-regulatory architecture dominated by a major intronic enhancer that requires coordinated NF-κB and AP-1 activity with secondary inputs from distal regulatory regions. Our data demonstrate how integration of cis and trans genomic screening platforms provides a paradigm for building gene-centered regulatory networks.

Lin Lin, Benjamin Holmes, Max W. Shen, Darnell Kammeron, Niels Geijsen, David K. Gifford, and Richard I. Sherwood
Cell Reports. Volume 33, Issue 8, 24 November 2020, 108426
DOI: 10.1016/j.celrep.2020.108426

Chemogenetic System Demonstrates That Cas9 Longevity Impacts Genome Editing Outcomes

Prolonged Cas9 activity can hinder genome engineering as it causes off-target effects, genotoxicity, heterogeneous genome-editing outcomes, immunogenicity, and mosaicism in embryonic editing—issues which could be addressed by controlling the longevity of Cas9. Though some temporal controls of Cas9 activity have been developed, only cumbersome systems exist for modifying the lifetime. Here, we have developed a chemogenetic system that brings Cas9 in proximity to a ubiquitin ligase, enabling rapid ubiquitination and degradation of Cas9 by the proteasome. Despite the large size of Cas9, we were able to demonstrate efficient degradation in cells from multiple species. Furthermore, by controlling the Cas9 lifetime, we were able to bias the DNA repair pathways and the genotypic outcome for both templated and nontemplated genome editing. Finally, we were able to dosably control the Cas9 activity and specificity to ameliorate the off-target effects. The ability of this system to change the Cas9 lifetime and, therefore, bias repair pathways and specificity in the desired direction allows precision control of the genome editing outcome.

Vedagopuram Sreekanth, Qingxuan Zhou, Praveen Kokkonda, Heysol C. Bermudez-Cabrera, Donghyun Lim, Benjamin K. Law, Benjamin R. Holmes, Santosh K. Chaudhary, Rajaiah Pergu, Brittany S. Leger, James A. Walker, David K. Gifford, Richard I. Sherwood, and Amit Choudhary
ACS Cent. Sci. 2020
DOI: 10.1021/acscentsci.0c00129

Identification of determinants of differential chromatin accessibility through a massively parallel genome-integrated reporter assay

A key mechanism in cellular regulation is the ability of the transcriptional machinery to physically access DNA. Transcription factors interact with DNA to alter the accessibility of chromatin, which enables changes to gene expression during development or disease or as a response to environmental stimuli. However, the regulation of DNA accessibility via the recruitment of transcription factors is difficult to study in the context of the native genome because every genomic site is distinct in multiple ways. Here we introduce the multiplexed integrated accessibility assay (MIAA), an assay that measures chromatin accessibility of synthetic oligonucleotide sequence libraries integrated into a controlled genomic context with low native accessibility. We apply MIAA to measure the effects of sequence motifs on cell type–specific accessibility between mouse embryonic stem cells and embryonic stem cell–derived definitive endoderm cells, screening 7905 distinct DNA sequences. MIAA recapitulates differential accessibility patterns of 100-nt sequences derived from natively differential genomic regions, identifying E-box motifs common to epithelial–mesenchymal transition driver transcription factors in stem cell–specific accessible regions that become repressed in endoderm. We show that a single binding motif for a key regulatory transcription factor is sufficient to open chromatin, and classify sets of stem cell–specific, endoderm-specific, and shared accessibility-modifying transcription factor motifs. We also show that overexpression of two definitive endoderm transcription factors, T and Foxa2, results in changes to accessibility in DNA sequences containing their respective DNA-binding motifs and identify preferential motif arrangements that influence accessibility.

Jennifer Hammelman, Konstantin Krismer, Budhaditya Banerjee, David K. Gifford, and Richard I. Sherwood
Genome Research. 2020
DOI: 10.1101/gr.263228.120

Computationally Optimized SARS-CoV-2 MHC Class I and II Vaccine Formulations Predicted to Target Human Haplotype Distributions

We present a combinatorial machine learning method to evaluate and optimize peptide vaccine formulations for SARS-CoV-2. Our approach optimizes the presentation likelihood of a diverse set of vaccine peptides conditioned on a target human-population HLA haplotype distribution and expected epitope drift. Our proposed SARS-CoV-2 MHC class I vaccine formulations provide 93.21% predicted population coverage with at least five vaccine peptide-HLA average hits per person (≥ 1 peptide: 99.91%) with all vaccine peptides perfectly conserved across 4,690 geographically sampled SARS-CoV-2 genomes. Our proposed MHC class II vaccine formulations provide 97.21% predicted coverage with at least five vaccine peptide-HLA average hits per person with all peptides having an observed mutation probability of ≤ 0.001. We provide an open-source implementation of our design methods (OptiVax), vaccine evaluation tool (EvalVax), as well as the data used in our design efforts here:

Ge Liu, Brandon Carter, Trenton Bricken, Siddhartha Jain, Mathias Viard, Mary Carrington, and David K. Gifford
Cell Systems. Volume 11, Issue 2, P131-144.E6, August 26, 2020
DOI: 10.1016/j.cels.2020.06.009

Expanded encyclopaedias of DNA elements in the human and mouse genomes

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal, including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

The ENCODE Project Consortium, Jill E. Moore, Michael J. Purcaro, Henry E. Pratt, Charles B. Epstein, Noam Shoresh, Jessika Adrian, Trupti Kawli, Carrie A. Davis, Alexander Dobin, Rajinder Kaul, Jessica Halow, Eric L. Van Nostrand, Peter Freese, David U. Gorkin, Yin Shen, Yupeng He, Mark Mackiewicz, Florencia Pauli-Behn, Brian A. Williams, Ali Mortazavi, Cheryl A. Keller, Xiao-Ou Zhang, Shaimae I. Elhajjajy, Jack Huey, Diane E. Dickel, Valentina Snetkova, Xintao Wei, Xiaofeng Wang, Juan Carlos Rivera-Mulia, Joel Rozowsky, Jing Zhang, Surya B. Chhetri, Jialing Zhang, Alec Victorsen, Kevin P. White, Axel Visel, Gene W. Yeo, Christopher B. Burge, Eric Lécuyer, David M. Gilbert, Job Dekker, John Rinn, Eric M. Mendenhall, Joseph R. Ecker, Manolis Kellis, Robert J. Klein, William S. Noble, Anshul Kundaje, Roderic Guigó, Peggy J. Farnham, J. Michael Cherry, Richard M. Myers, Bing Ren, Brenton R. Graveley, Mark B. Gerstein, Len A. Pennacchio, Michael P. Snyder, Bradley E. Bernstein, Barbara Wold, Ross C. Hardison, Thomas R. Gingeras, John A. Stamatoyannopoulos, and Zhiping Weng
Nature. 583, pages 699–710 (2020)
DOI: 10.1038/s41586-020-2493-4

Perspectives on ENCODE

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

The ENCODE Project Consortium, Michael P. Snyder, Thomas R. Gingeras, Jill E. Moore, Zhiping Weng, Mark B. Gerstein, Bing Ren, Ross C. Hardison, John A. Stamatoyannopoulos, Brenton R. Graveley, Elise A. Feingold, Michael J. Pazin, Michael Pagan, Daniel A. Gilchrist, Benjamin C. Hitz, J. Michael Cherry, Bradley E. Bernstein, Eric M. Mendenhall, Daniel R. Zerbino, Adam Frankish, Paul Flicek, and Richard M. Myers
Nature. 583, pages 693–698 (2020)
DOI: 10.1038/s41586-020-2449-8

A Multiplexed Barcodelet Single-Cell RNA-Seq Approach Elucidates Combinatorial Signaling Pathways that Drive ESC Differentiation

Empirical optimization of stem cell differentiation protocols is time consuming, is labor intensive, and typically does not comprehensively interrogate all relevant signaling pathways. Here we describe barcodelet single-cell RNA sequencing (barRNA-seq), which enables systematic exploration of cellular perturbations by tagging individual cells with RNA “barcodelets” to identify them on the basis of the treatments they receive. We apply barRNA-seq to simultaneously manipulate up to seven developmental pathways and study effects on embryonic stem cell (ESC) germ layer specification and mesodermal specification, uncovering combinatorial effects of signaling pathway activation on gene expression. We further develop a data-driven framework for identifying combinatorial signaling perturbations that drive cells toward specific fates, including several annotated in an existing scRNA-seq gastrulation atlas, and use this approach to guide ESC differentiation into a notochord-like population. We expect that barRNA-seq will have broad utility for investigating and understanding how cooperative signaling pathways drive cell fate acquisition.

Grace Hui Ting Yeo, Lin Lin, Celine Yueyue Qi, Minsun Cha, David K. Gifford, and Richard I. Sherwood
Cell Stem Cell. Volume 26, Issue 6, 4 June 2020, Pages 938-950.e6
DOI: 10.1016/j.stem.2020.04.020

IDR2D identifies reproducible genomic interactions

Chromatin interaction data from protocols such as ChIA-PET, HiChIP, and HiC provide valuable insights into genome organization and gene regulation, but can include spurious interactions that do not reflect underlying genome biology. We introduce a generalization of the Irreproducible Discovery Rate (IDR) method called IDR2D that identifies replicable interactions shared by chromatin interaction experiments. IDR2D provides a principled set of interactions and eliminates artifacts from single experiments. The method is available as a Bioconductor package for the R community, as well as an online service at

Konstantin Krismer, Yuchun Guo, and David K. Gifford
Nucleic Acids Research. Volume 48, Issue 6, 06 April 2020, Page e31
DOI: 10.1093/nar/gkaa030


Antibody Complementarity Determining Region Design Using High-Capacity Machine Learning

The precise targeting of antibodies and other protein therapeutics is required for their proper function and the elimination of deleterious off-target effects. Often the molecular structure of a therapeutic target is unknown and randomized methods are used to design antibodies without a model that relates antibody sequence to desired properties. Here we present a machine learning method that can design human Immunoglobulin G (IgG) antibodies with target affinities that are superior to candidates from phage display panning experiments within a limited design budget. We also demonstrate that machine learning can improve target-specificity by the modular composition of models from different experimental campaigns, enabling a new integrative approach to improving target specificity. Our results suggest a new path for the discovery of therapeutic molecules by demonstrating that predictive and differentiable models of antibody binding can be learned from high-throughput experimental data without the need for target structural data.

Significance

Antibody based therapeutics must meet both affinity and specificity metrics, and existing in vitro methods for meeting these metrics are based upon randomization and empirical testing. We demonstrate that with sufficient target-specific training data machine learning can suggest novel antibody variable domain sequences that are superior to those observed during training. Our machine learning method does not require any target structural information. We further show that data from disparate antibody campaigns can be combined by machine learning to improve antibody specificity.

Ge Liu, Haoyang Zeng, Jonas Mueller, Brandon Carter, Ziheng Wang, Jonas Schilz, Geraldine Horny, Michael E. Birnbaum, Stefan Ewert, and David K. Gifford
Bioinformatics. btz895
DOI: 10.1093/bioinformatics/btz895

Quantification of Uncertainty in Peptide-MHC Binding Prediction Improves High-Affinity Peptide Selection for Therapeutic Design

The computational identification of peptides that can bind the major histocompatibility complex (MHC) with high affinity is an essential step in developing personal immunotherapies and vaccines. We introduce PUFFIN, a deep residual network-based computational approach that quantifies uncertainty in peptide-MHC affinity prediction that arises from observational noise and the lack of relevant training examples. With PUFFIN’s uncertainty metrics, we define binding likelihood, the probability a peptide binds to a given MHC allele at a specified affinity threshold. Compared to affinity point estimates, we find that binding likelihood correlates better with the observed affinity and reduces false positives in high-affinity peptide design. When applied to examine an existing peptide vaccine, PUFFIN identifies an alternative vaccine formulation with higher binding likelihood. PUFFIN is freely available for download at

Haoyang Zeng and David K. Gifford
Cell Systems. Volume 9, Issue 2, 28 August 2019, Pages 159-166.e3
DOI: 10.1016/j.cels.2019.05.004
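
The binding likelihood described in the abstract above (the probability that a peptide binds a given MHC allele at a specified affinity threshold) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the predictive distribution is Gaussian over log10 affinity in nM and that 500 nM is the binding threshold; neither choice is taken from the paper, and `binding_likelihood` is a hypothetical name, not part of PUFFIN.

```python
import math


def binding_likelihood(mean_log10_nm, std_log10_nm, threshold_nm=500.0):
    """P(affinity <= threshold) for a Gaussian over log10(affinity in nM).

    Assumptions of this sketch (not from the paper): the Gaussian form
    and the 500 nM default threshold.
    """
    if std_log10_nm <= 0:
        raise ValueError("std_log10_nm must be positive")
    z = (math.log10(threshold_nm) - mean_log10_nm) / std_log10_nm
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Two hypothetical peptides with the same point estimate (10**2.5, about 316 nM)
# but different predictive uncertainty.
confident = binding_likelihood(2.5, 0.1)
uncertain = binding_likelihood(2.5, 1.0)
```

In this toy setting both peptides would pass a point-estimate cutoff, but the one with the wider predictive distribution receives a lower binding likelihood, which is the kind of distinction the abstract attributes to uncertainty-aware selection.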

Visualizing complex feature interactions and feature sharing in genomic deep neural networks


Visualization tools for deep learning models typically focus on discovering key input features without considering how such low level features are combined in intermediate layers to make decisions. Moreover, many of these methods examine a network’s response to specific input examples that may be insufficient to reveal the complexity of model decision making.


We present DeepResolve, an analysis framework for deep convolutional models of genome function that visualizes how input features contribute individually and combinatorially to network decisions. Unlike other methods, DeepResolve does not depend upon the analysis of a predefined set of inputs. Rather, it uses gradient ascent to stochastically explore intermediate feature maps to 1) discover important features, 2) visualize their contribution and interaction patterns, and 3) analyze feature sharing across tasks that suggests shared biological mechanism. We demonstrate the visualization of decision making using our proposed method on deep neural networks trained on both experimental and synthetic data. DeepResolve is competitive with existing visualization tools in discovering key sequence features, and identifies certain negative features and non-additive feature interactions that are not easily observed with existing tools. It also recovers similarities between poorly correlated classes which are not observed by traditional methods. DeepResolve reveals that DeepSEA’s learned decision structure is shared across genome annotations including histone marks, DNase hypersensitivity, and transcription factor binding. We identify groups of TFs that suggest known shared biological mechanism, and recover correlation between DNA hypersensitivities and TF/Chromatin marks.


DeepResolve is capable of visualizing complex feature contribution patterns and feature interactions that contribute to decision making in genomic deep convolutional networks. It also recovers feature sharing and class similarities which suggest interesting biological mechanisms. DeepResolve is compatible with existing visualization tools and provides complementary insights.

Ge Liu, Haoyang Zeng, and David K. Gifford
BMC Bioinformatics. 20, 401 (2019)
DOI: 10.1186/s12859-019-2957-4

DeepLigand: accurate prediction of MHC class I ligands using peptide embedding


The computational modeling of peptide display by class I major histocompatibility complexes (MHCs) is essential for peptide-based therapeutics design. Existing computational methods for peptide-display focus on modeling the peptide-MHC-binding affinity. However, such models are not able to characterize the sequence features for the other cellular processes in the peptide display pathway that determines MHC ligand selection.


We introduce a semi-supervised model, DeepLigand that outperforms the state-of-the-art models in MHC Class I ligand prediction. DeepLigand combines a peptide language model and peptide binding affinity prediction to score MHC class I peptide presentation. The peptide language model characterizes sequence features that correspond to secondary factors in MHC ligand selection other than binding affinity. The peptide embedding is learned by pre-training on natural ligands, and can discriminate between ligands and non-ligands in the absence of binding affinity prediction. Although conventional affinity-based models fail to classify peptides with moderate affinities, DeepLigand discriminates ligands from non-ligands with consistently high accuracy.

Availability and implementation

We make DeepLigand available at

Haoyang Zeng and David K Gifford
Bioinformatics. Volume 35, Issue 14, July 2019, Pages i278–i283
DOI: 10.1093/bioinformatics/btz330

Maximizing Overall Diversity for Improved Uncertainty Estimates in Deep Ensembles

The inaccuracy of neural network models on inputs that do not stem from the training data distribution is both problematic and at times unrecognized. Model uncertainty estimation can address this issue, where uncertainty estimates are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), a straightforward approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs that might be encountered in the future. When applied to various neural network ensembles, MOD significantly improves predictive performance for out-of-distribution test examples without sacrificing in-distribution performance on 38 Protein-DNA binding regression datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. Across many Bayesian optimization tasks, the performance of UCB acquisition is also greatly improved by leveraging MOD uncertainty estimates.

Siddhartha Jain, Ge Liu, Jonas Mueller, and David Gifford
arXiv. preprint arXiv:1906.07380
arXiv: 1906.07380
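
As the abstract above notes, ensemble-based uncertainty estimates are often derived from the variation in predictions that ensemble members produce for the same input. The sketch below shows only that downstream step, plus an upper-confidence-bound (UCB) score built from it; it does not implement the MOD training objective. The function names and the `kappa` parameter are hypothetical choices for this illustration.

```python
import statistics


def ensemble_uncertainty(predictions):
    """Return (mean, standard deviation) of ensemble predictions for one input."""
    mean = statistics.fmean(predictions)
    std = statistics.pstdev(predictions)  # spread across ensemble members
    return mean, std


def ucb_score(predictions, kappa=2.0):
    """UCB acquisition score: mean plus kappa times the ensemble spread."""
    mean, std = ensemble_uncertainty(predictions)
    return mean + kappa * std


# Hypothetical predictions from a four-member ensemble for two candidate inputs.
agreeing = [0.9, 1.0, 1.1, 1.0]
disagreeing = [0.2, 1.8, 0.5, 1.5]
```

Both candidates have the same mean prediction, but `ucb_score` ranks the input on which the ensemble disagrees higher, because disagreement is treated as a signal that the input is worth exploring.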

Wnt Signaling Separates the Progenitor and Endocrine Compartments during Pancreas Development

Nadav Sharon, Jordan Vanderhooft, Juerg Straubhaar, Jonas Mueller, Raghav Chawla, Quan Zhou, Elise N. Engquist, Cole Trapnell, David K. Gifford, and Doug Melton
Cell Reports. Volume 27, Issue 8, 21 May 2019, Pages 2281-2291.e5
DOI: 10.1016/j.celrep.2019.04.083

High resolution discovery of chromatin interactions

Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) is a method for the genome-wide de novo discovery of chromatin interactions. Existing computational methods typically fail to detect weak or dynamic interactions because they use a peak-calling step that ignores paired-end linkage information. We have developed a novel computational method called Chromatin Interaction Discovery (CID) to overcome this limitation with an unbiased clustering approach for interaction discovery. CID outperforms existing chromatin interaction detection methods with improved sensitivity, replicate consistency, and concordance with other chromatin interaction datasets. In addition, CID can also be applied to HiChIP data to discover chromatin interactions. We expect that the CID method will be valuable in characterizing 3D chromatin interactions and in understanding the functional consequences of disease-associated distal genetic variations.

Yuchun Guo, Konstantin Krismer, Michael Closser, Hynek Wichterle, and David K. Gifford
Nucleic Acids Research. Volume 47, Issue 6, 08 April 2019, Page e35
DOI: 10.1093/nar/gkz051

Disentangled Representations of Cellular Identity

We introduce a disentangled representation for cellular identity that constructs a latent cellular state from a linear combination of condition specific basis vectors that are then decoded into gene expression levels. The basis vectors are learned with a deep autoencoder model from single-cell RNA-seq data. Linear arithmetic in the disentangled representation successfully predicts nonlinear gene expression interactions between biological pathways in unobserved treatment conditions. We are able to recover the mean gene expression profiles of unobserved conditions with an average Pearson r = 0.73, which outperforms two linear baselines, one with an average r = 0.43 and another with an average r = 0.19. Disentangled representations hold the promise to provide new explanatory power for the interaction of biological pathways and the prediction of effects of unobserved conditions for applications such as combinatorial therapy and cellular reprogramming. Our work is motivated by recent advances in deep generative models that have enabled synthesis of images and natural language with desired properties from interpolation in a “latent representation” of the data.

Z Wang, G Yeo, R Sherwood, and DK Gifford
Research in Computational Molecular Biology.

What made you do this? Understanding black-box decisions with sufficient input subsets

Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model’s decision making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model’s decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. We demonstrate the utility of our interpretation method on various neural network models trained on text, image, and genomic data.

Brandon Carter, Jonas Mueller, Siddhartha Jain, and David Gifford
arXiv. preprint arXiv:1810.03805
arXiv: 1810.03805
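
The instance-wise backward selection mentioned in the abstract above can be sketched as follows. This is a simplified illustration and not the authors' implementation: it finds a single subset, treats a "missing" feature as one replaced by a fixed `mask_value`, and uses hypothetical function names and a toy linear model.

```python
def mask(x, keep, mask_value=0.0):
    """Copy of x in which every feature outside `keep` is replaced by mask_value."""
    keep = set(keep)
    return [v if i in keep else mask_value for i, v in enumerate(x)]


def back_select(f, x, mask_value=0.0):
    """Repeatedly mask the feature whose removal leaves the model output highest."""
    remaining = list(range(len(x)))
    removed = []
    while remaining:
        def output_without(i):
            return f(mask(x, [j for j in remaining if j != i], mask_value))
        least_needed = max(remaining, key=output_without)
        removed.append(least_needed)
        remaining.remove(least_needed)
    return removed  # least important features first


def sufficient_input_subset(f, x, threshold, mask_value=0.0):
    """Add features back in reverse removal order until f reaches the threshold."""
    if f(x) < threshold:
        return None  # the full input does not reach the decision threshold
    order = back_select(f, x, mask_value)
    subset = []
    while f(mask(x, subset, mask_value)) < threshold:
        subset.append(order.pop())
    return sorted(subset)


# Toy linear "model" with hypothetical weights, used only for illustration.
weights = [0.1, 0.5, 0.05, 0.4]
toy_model = lambda v: sum(w * vi for w, vi in zip(weights, v))
subset = sufficient_input_subset(toy_model, [1.0, 1.0, 1.0, 1.0], threshold=0.8)
```

For the toy model, the two heaviest features (indices 1 and 3) alone push the output past the threshold, so they form the returned subset. The procedure only queries `f`, which is why the approach can be described as model-agnostic.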

A Peninsular Structure Coordinates Asynchronous Differentiation with Morphogenesis to Generate Pancreatic Islets

The pancreatic islets of Langerhans regulate glucose homeostasis. The loss of insulin-producing β cells within islets results in diabetes, and islet transplantation from cadaveric donors can cure the disease. In vitro production of whole islets, not just β cells, will benefit from a better understanding of endocrine differentiation and islet morphogenesis. We used single-cell mRNA sequencing to obtain a detailed description of pancreatic islet development. Contrary to the prevailing dogma, we find islet morphology and endocrine differentiation to be directly related. As endocrine progenitors differentiate, they migrate in cohesion and form bud-like islet precursors, or “peninsulas” (literally “almost islands”). α cells, the first to develop, constitute the peninsular outer layer, and β cells form later, beneath them. This spatiotemporal collinearity leads to the typical core-mantle architecture of the mature, spherical islet. Finally, we induce peninsula-like structures in differentiating human embryonic stem cells, laying the ground for the generation of entire islets in vitro.

Nadav Sharon, Raghav Chawla, Jonas Mueller, Jordan Vanderhooft, Luke James Whitehorn, Benjamin Rosenthal, Mads Gürtler, Ralph R. Estanboulieh, Dmitry Shvartsman, David K. Gifford, Cole Trapnell, and Doug Melton
Cell. Volume 176, Issue 4, 7 February 2019, Pages 790-804.e13
DOI: 10.1016/j.cell.2018.12.003


Predictable and precise template-free CRISPR editing of pathogenic variants

Following Cas9 cleavage, DNA repair without a donor template is generally considered stochastic, heterogeneous and impractical beyond gene disruption. Here, we show that template-free Cas9 editing is predictable and capable of precise repair to a predicted genotype, enabling correction of disease-associated mutations in humans. We constructed a library of 2,000 Cas9 guide RNAs paired with DNA target sites and trained inDelphi, a machine learning model that predicts genotypes and frequencies of 1- to 60-base-pair deletions and 1-base-pair insertions with high accuracy (r = 0.87) in five human and mouse cell lines. inDelphi predicts that 5-11% of Cas9 guide RNAs targeting the human genome are ‘precise-50’, yielding a single genotype comprising greater than or equal to 50% of all major editing products. We experimentally confirmed precise-50 insertions and deletions in 195 human disease-relevant alleles, including correction in primary patient-derived fibroblasts of pathogenic alleles to wild-type genotype for Hermansky-Pudlak syndrome and Menkes disease. This study establishes an approach for precise, template-free genome editing.

Shen MW, Arbab M, Hsu JY, Worstell D, Culbertson SJ, Krabbe O, Cassa CA, Liu DR, Gifford DK, and Sherwood RI
Nature. 563, 646–651 (2018)
DOI: 10.1038/s41586-018-0686-x
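
The "precise-50" criterion in the abstract above (a single genotype comprising at least 50% of all major editing products) reduces to a simple check on a predicted outcome distribution. The sketch below is illustrative only: the function name, the dictionary input format, and the example frequencies are assumptions, not inDelphi's interface or data.

```python
def is_precise_50(genotype_frequencies, cutoff=0.5):
    """True if one genotype accounts for at least `cutoff` of all listed products.

    `genotype_frequencies` maps a genotype label to its frequency among
    major editing products; values are renormalized to sum to 1.
    """
    total = sum(genotype_frequencies.values())
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    return max(genotype_frequencies.values()) / total >= cutoff


# Hypothetical predicted outcome distributions for two guide RNAs.
dominant_outcome = {"1-bp insertion": 0.62, "3-bp deletion": 0.20, "7-bp deletion": 0.18}
mixed_outcomes = {"1-bp insertion": 0.30, "3-bp deletion": 0.35, "7-bp deletion": 0.35}
```

Here `is_precise_50(dominant_outcome)` holds while `is_precise_50(mixed_outcomes)` does not, since no single genotype in the second distribution reaches half of the products.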

Information-based Acquisition for General Models in Bayesian Optimization

We introduce the Hilbert-Schmidt Independence Criterion (HSIC) Acquisition Function (HAF), an acquisition function for Bayesian optimization that uses HSIC to measure the statistical dependency to a distribution of interest. This enables extensions of information theoretic acquisition functions (e.g. entropy search variants) for more general models than just Gaussian Processes (GPs). HAF is also differentiable, so points can be acquired via gradient search on the input space. On a protein-DNA binding task we compare a particular instance of HAF with Thompson Sampling and Expected Reward. Though preliminary results are not impressive, we identify a major issue with the model used in this task and suggest a future direction to improve upon this work.

Siddhartha Jain, Nathan Hunt, and David Gifford
Bayesian Deep Learning Workshop | NeurIPS 2019.

A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction

The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of non-coding genetic variations.

Yuchun Guo, Kevin Tian, Haoyang Zeng, Xiaoyun Guo, and David K. Gifford
Genome Research. 2018 Jun; 28(6):891-900
DOI: 10.1101/gr.226852.117








