File formats

PRESCIENT takes as input longitudinal scRNA-seq data. For training, all that is needed is normalized gene expression, time-point labels, and cell type annotations. These inputs are used to generate a PRESCIENT torch object using the prescient process_data (see below). Below, we describe accepted formatting for inputs. For pre-processing, we recommend using Seurat or scanpy. PRESCIENT accepts the following formats: .csv, .tsv, .txt, .h5ad of a scanpy anndata object, or an .rds file of a Seurat object.

Normalized expression

A post-processed gene expression file in .csv, .tsv, or .txt in the following format will work to create a PRESCIENT data object:

id gene_1 gene_2 gene_3 gene_n
cell_1 0.0 0.121 0.0   0.0
cell_2 0.234 0.0 0.0   0.0
cell_3 0.0 0.0 0.0   1.2

Metadata

id timepoint cell_type
cell_1 0 undifferentiated
cell_2 1 neutrophil
cell_3 2 monocyte

PRESCIENT torch object

The prescient process_data command will generate a torch pt file data_pt (serialized dictionary) that contains all the necessary information for downstream training, simulations, and perturbations. It will contain the following information:

  • data_pt[“data”]: Numpy ndarray of normalized expression.
  • data_pt[“celltype”]: List of celltype labels for each cell.
  • data_pt[“genes”]: List of gene features.
  • data_pt[“tps”]: Timepoint assignment for each cell in dataset from metadata.
  • data_pt[“x”]: Torch tensors of normalied expression split by timepoint.
  • data_pt[“xp”]: Torch tensors of cell PCs split by timepoint.
  • data_pt[“xu”]: Torch tensors of cell UMAPs split by timepoint.
  • data_pt[“pca”]: sklearn.decomposition.PCA object fit to normalized expression and used to produce PCs.
  • data_pt[“um”]: umap.UMAP object fit to PCs used to produce UMAP dims.
  • data_pt[“y”]: List of timepoints.
  • data_pt[“w”]: Torch tensors of pre-computed growth weights split by timepoint.