File formats
PRESCIENT takes longitudinal scRNA-seq data as input. Training requires only normalized gene expression, time-point labels, and cell type annotations. These inputs are used to generate a PRESCIENT torch object with the prescient process_data command (see below). The accepted input formats are described below. For pre-processing, we recommend Seurat or scanpy. PRESCIENT accepts the following formats: .csv, .tsv, .txt, an .h5ad file of a scanpy AnnData object, or an .rds file of a Seurat object.
Normalized expression
A post-processed gene expression file in .csv, .tsv, or .txt in the following format will work to create a PRESCIENT data object:
id | gene_1 | gene_2 | gene_3 | … | gene_n |
---|---|---|---|---|---|
cell_1 | 0.0 | 0.121 | 0.0 | … | 0.0 |
cell_2 | 0.234 | 0.0 | 0.0 | … | 0.0 |
cell_3 | 0.0 | 0.0 | 0.0 | … | 1.2 |
Metadata
id | timepoint | cell_type |
---|---|---|
cell_1 | 0 | undifferentiated |
cell_2 | 1 | neutrophil |
cell_3 | 2 | monocyte |
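
Both tables carry the cell id in their first column, so it can be worth checking that the two files agree before processing them. The following is a minimal sketch using only the Python standard library. The file names `expression.csv` and `metadata.csv` and the helpers `write_csv`, `read_ids`, and `check_inputs` are hypothetical, and the check is a suggestion rather than a documented PRESCIENT requirement.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a header row followed by data rows as comma-separated text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_ids(path):
    """Return the first-column values (cell ids), skipping the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header
        return [row[0] for row in reader]


def check_inputs(expr_path, meta_path):
    """Return True if both files list the same cell ids in the same order."""
    return read_ids(expr_path) == read_ids(meta_path)


# Toy data mirroring the tables above (three cells, four genes).
workdir = Path(tempfile.mkdtemp())
expr_path = workdir / "expression.csv"  # hypothetical file name
meta_path = workdir / "metadata.csv"    # hypothetical file name

write_csv(
    expr_path,
    ["id", "gene_1", "gene_2", "gene_3", "gene_4"],
    [
        ["cell_1", 0.0, 0.121, 0.0, 0.0],
        ["cell_2", 0.234, 0.0, 0.0, 0.0],
        ["cell_3", 0.0, 0.0, 0.0, 1.2],
    ],
)
write_csv(
    meta_path,
    ["id", "timepoint", "cell_type"],
    [
        ["cell_1", 0, "undifferentiated"],
        ["cell_2", 1, "neutrophil"],
        ["cell_3", 2, "monocyte"],
    ],
)

ids_match = check_inputs(expr_path, meta_path)
print("cell ids match:", ids_match)
```

For .tsv input, the same sketch should work by passing `delimiter="\t"` to `csv.writer` and `csv.reader`.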
PRESCIENT torch object
The prescient process_data command generates a torch .pt file, data_pt (a serialized dictionary), that contains all the information needed for downstream training, simulations, and perturbations. It contains the following entries:
- data_pt["data"]: NumPy ndarray of normalized expression.
- data_pt["celltype"]: List of cell type labels for each cell.
- data_pt["genes"]: List of gene features.
- data_pt["tps"]: Timepoint assignment for each cell in the dataset, taken from the metadata.
- data_pt["x"]: Torch tensors of normalized expression split by timepoint.
- data_pt["xp"]: Torch tensors of cell PCs split by timepoint.
- data_pt["xu"]: Torch tensors of cell UMAPs split by timepoint.
- data_pt["pca"]: sklearn.decomposition.PCA object fit to normalized expression and used to produce PCs.
- data_pt["um"]: umap.UMAP object fit to PCs and used to produce UMAP dimensions.
- data_pt["y"]: List of timepoints.
- data_pt["w"]: Torch tensors of pre-computed growth weights split by timepoint.
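
As a quick sanity check, the dictionary can be summarized by a few of the keys listed above. This is a minimal sketch: the helper `summarize` is hypothetical, the mock dictionary uses plain Python lists in place of the ndarray and tensors, and it assumes that `data_pt["x"]` holds one entry per timepoint in the same order as `data_pt["y"]`.

```python
def summarize(data_pt):
    """Return basic dimensions of a data_pt-like dictionary.

    Uses only len(), so it works with lists, NumPy arrays, and torch tensors.
    """
    return {
        "n_cells": len(data_pt["data"]),
        "n_genes": len(data_pt["genes"]),
        "timepoints": list(data_pt["y"]),
        "cells_per_timepoint": [len(x_t) for x_t in data_pt["x"]],
    }


# Mock dictionary with a subset of the documented keys (toy values).
expr = [
    [0.0, 0.121, 0.0, 0.0],
    [0.234, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.2],
]
tps = [0, 1, 2]
timepoints = [0, 1, 2]
mock = {
    "data": expr,
    "genes": ["gene_1", "gene_2", "gene_3", "gene_4"],
    "celltype": ["undifferentiated", "neutrophil", "monocyte"],
    "tps": tps,
    "y": timepoints,
    # Split the expression rows by timepoint, one group per entry in "y".
    "x": [[row for row, t in zip(expr, tps) if t == tp] for tp in timepoints],
}

summary = summarize(mock)
print(summary)

# With a real file (the path is hypothetical). Recent PyTorch versions are
# expected to need weights_only=False here, because the dictionary stores
# non-tensor objects such as the PCA and UMAP models. Only load files you trust.
# import torch
# data_pt = torch.load("data.pt", weights_only=False)
# print(summarize(data_pt))
```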