Command line interface documentation
PRESCIENT is primarily implemented as a command-line tool. To access the manual for any command, run prescient command -h, where command is one of the commands listed below.
Run the following commands via prescient command [params].
process_data
Process normalized expression dataframe into compatible PRESCIENT file format.

Parameter | Description |
---|---|
data_path | Path to normalized expression CSV. |
out_dir | Directory to store PRESCIENT torch object. |
meta_path | Path to metadata CSV containing timepoint and celltype annotation data. |
tp_col | Column name of timepoint feature in metadata provided as string. |
celltype_col | Column name of celltype feature in metadata provided as string. |
num_pcs | Define number of PCs to compute for input to training. Default: 50 |
num_neighbors_umap | Define number of neighbors for UMAP transformation (UMAP is used only for visualization). Default: 10 |
growth_path | Path to torch pt file containing pre-computed growth weights. See vignette notebooks for generating growth rate vector. |
prescient process_data -d data/Veres2019/Stage_5.Seurat.csv -m data/Veres2019/GSE114412_Stage_5.all.cell_metadata.csv --growth_path data/Veres2019/Veres2019_growth-kegg.pt -o './' --tp_col 'CellWeek' --celltype_col 'Assigned_cluster'
This command takes a normalized expression CSV, metadata CSV, and pre-computed weight torch file as input and produces a PRESCIENT training torch object.
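Before running process_data, it can help to confirm that the metadata CSV actually contains the columns passed as tp_col and celltype_col. The following is a minimal sketch and not part of PRESCIENT: the helper name missing_metadata_columns is hypothetical, and it assumes the metadata CSV has a single header row.

```python
import csv
import io


def missing_metadata_columns(handle, required):
    """Return the required column names that are absent from the CSV header.

    `handle` is any open text file; only the first row is read.
    Hypothetical helper for illustration, not part of PRESCIENT.
    """
    header = next(csv.reader(handle), [])
    return [col for col in required if col not in header]


# In-memory stand-in for a metadata CSV. The row values and the `cell_id`
# column are made up; the other two column names come from the example command.
demo = io.StringIO("cell_id,CellWeek,Assigned_cluster\nc1,0,prog\n")
print(missing_metadata_columns(demo, ["CellWeek", "Assigned_cluster"]))
```

For a file on disk, pass open(meta_path, newline="") as the handle. An empty result means both columns are present.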
train_model
Train a PRESCIENT model using a PRESCIENT data object as input.

Parameter | Description |
---|---|
data_path | Path to PRESCIENT data torch object produced by process_data. |
weight_name | Descriptive name of weight vector being used provided as string for model filename. |
loss | Designate distance function for loss. Default: euclidean |
k_dim | Designate hidden units of fully connected layers in model. Default: 500 |
activation | Designate activation function for layers of NN. Default: softplus |
layers | Number of layers for neural network parameterizing the potential function. Default: 2 |
pretrain_lr | Learning rate for Adam optimizer during pretraining. Default: 1e-9 |
pretrain_epochs | Number of epochs for pretraining with contrastive divergence. Default: 500 |
train_epochs | Number of epochs for training. Default: 2500 |
train_lr | Learning rate for Adam optimizer during training. Default: 0.01 |
train_dt | Timestep for simulations during training. Default: 0.1 |
train_sd | Standard deviation of Gaussian noise for simulation steps. Default: 0.5 |
train_tau | Tau hyperparameter of PRESCIENT. Default: 1e-6 |
train_batch | Batch size (fraction) for training. Default: 0.1 |
train_clip | Gradient clipping threshold for training. Default: 0.25 |
save | Save model every n epochs as torch dict. Default: 100 |
prescient train_model -i data.pt --out_dir /experiments/ --weight_name 'kegg-growth' --seed 3 --layers 2 --k_dim 200 --train_tau 1e-06
This command trains a PRESCIENT model using a PRESCIENT training torch object.
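In the examples on this page, training with --weight_name 'kegg-growth' --layers 2 --k_dim 200 --train_tau 1e-06 is followed by simulating from a model directory named kegg-growth-softplus_2_200-1e-06. The sketch below composes a name in that pattern so it can be passed to --model_path. The pattern is inferred from those examples only, and model_dir_name is a hypothetical helper, so check the actual contents of out_dir after training.

```python
def model_dir_name(weight_name, activation="softplus", layers=2,
                   k_dim=500, train_tau="1e-06"):
    """Compose a model directory name of the form
    <weight_name>-<activation>_<layers>_<k_dim>-<train_tau>.

    Assumption: this layout is inferred from the example paths in this
    document; it is not a documented PRESCIENT guarantee. `train_tau` is
    kept as a string so its formatting is under the caller's control.
    """
    return f"{weight_name}-{activation}_{layers}_{k_dim}-{train_tau}"


# Reproduces the directory name used in the simulate_trajectories example.
print(model_dir_name("kegg-growth", k_dim=200))
```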
simulate_trajectories
Simulate cellular trajectories using a trained PRESCIENT model and a PRESCIENT data object.

Parameter | Description |
---|---|
data_path | Path to PRESCIENT training file (stored in out_dir of process_data command). |
model_path | Path to directory containing PRESCIENT model for simulation. |
out_path | Path to directory for storing output. |
num_sims | Number of simulations (random initializations of n cells) to run. Default: 10 |
num_cells | Number of cells per simulation. Default: 200 |
num_steps | Number of steps forward in time. If not provided, steps will be calculated based on start and end point + train dt. |
seed | Choose the seed of the trained model to use for simulations. Default: 1 |
epoch | Choose which epoch of the chosen model to use for simulations. Provide this value as str. Default: 002500 |
gpu | If available, assign GPU device number (requires CUDA). Provide as int. |
celltype_subset | Randomly sample initial cells from a particular celltype defined in metadata. Provide celltype as str as appears in metadata. |
tp_subset | Randomly sample initial cells from a particular timepoint. Provide timepoint as int or as appears in metadata. |
prescient simulate_trajectories -i data.pt --model_path /experiments/kegg-growth-softplus_2_200-1e-06/ --num_steps 10 -o experiments/ --seed 2
This command generates simulated trajectories from randomly initialized cells using a PRESCIENT model and training torch object.
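Two of the parameters above lend themselves to small helpers. The epoch default 002500 suggests a zero-padded six-digit string, and num_steps is described as derived from the start and end timepoints and the training dt. A minimal sketch under those assumptions follows. Both function names are hypothetical, and the step formula (interval length divided by dt) is a guess at what "calculated based on start and end point + train dt" means, not confirmed PRESCIENT behavior.

```python
def epoch_label(epoch):
    """Format an integer epoch as a zero-padded six-digit string.

    Assumption: the padding width is taken from the default '002500'.
    """
    return f"{epoch:06d}"


def default_num_steps(t_start, t_end, dt=0.1):
    """Estimate the number of simulation steps between two timepoints.

    Assumption: steps = (t_end - t_start) / dt, rounded to the nearest
    integer to absorb floating-point error.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if t_end < t_start:
        raise ValueError("t_end must not precede t_start")
    return int(round((t_end - t_start) / dt))


print(epoch_label(2500))
print(default_num_steps(0, 2, dt=0.1))
```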
perturbation_analysis
Run unperturbed and perturbed simulations of cells using a trained PRESCIENT model and a PRESCIENT data object.

Parameter | Description |
---|---|
perturb_genes | Provide a gene or list of genes to be perturbed as a string (commas, no spaces). Must be in the feature set used to train models. |
z_score | Set magnitude of z_score perturbation. Default: 5.0 |
data_path | Path to PRESCIENT training file (stored in out_dir of process_data command). |
model_path | Path to directory containing PRESCIENT model for simulation. |
out_path | Path to directory for storing output. |
num_sims | Number of simulations (random initializations of n cells) to run. Default: 10 |
num_cells | Number of cells per simulation. Default: 200 |
num_steps | Number of steps forward in time. If not provided, steps will be calculated based on start and end point + train dt. Default: nulls |
seed | Choose the seed of the trained model to use for simulations. Default: 1 |
epoch | Choose which epoch of the chosen model to use for simulations. Default: 1344 |
gpu | If available, assign GPU device number (requires CUDA). Provide as int. |
celltype_subset | Randomly sample initial cells from a particular celltype defined in metadata. Provide celltype as str as appears in metadata. |
tp_subset | Randomly sample initial cells from a particular timepoint. Provide timepoint as int or as appears in metadata. |
prescient perturbation_analysis -i ../Downloads/data.pt -p 'GENE1,GENE2,GENE3' -z 5 --model_path /experiments/kegg-softplus_2_200-1e-06/ --num_steps 10 --seed 2 -o experiments/
This command runs forward simulations of unperturbed cells and cells with perturbations of selected genes.
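Because perturb_genes must be a single comma-separated string with no spaces, and every gene must be in the feature set used for training, it can be convenient to build the argument from a list. The following is a minimal sketch; format_perturb_genes is a hypothetical helper, and the optional feature_set argument assumes you have the training gene names available as a Python collection.

```python
def format_perturb_genes(genes, feature_set=None):
    """Join gene names into the 'GENE1,GENE2' form expected by -p.

    Surrounding whitespace is stripped; names that are empty or contain
    commas or spaces are rejected. If `feature_set` is given, every gene
    must be a member of it. Hypothetical helper, not part of PRESCIENT.
    """
    cleaned = [gene.strip() for gene in genes]
    for gene in cleaned:
        if not gene or "," in gene or " " in gene:
            raise ValueError(f"invalid gene name: {gene!r}")
    if feature_set is not None:
        missing = [gene for gene in cleaned if gene not in feature_set]
        if missing:
            raise ValueError(f"genes not in feature set: {missing}")
    return ",".join(cleaned)


# Placeholder gene names, matching the example command above.
print(format_perturb_genes(["GENE1", " GENE2 ", "GENE3"]))
```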
Links to resources for running CLI with Google Cloud SDK
Training and simulations can run on CPUs. If you want to use GPUs from the command line but do not have access to any, we recommend using any cloud computing service that provides NVIDIA GPUs with CUDA support. For an easier approach, we provide a short demo in the notebooks tab that uses free cloud GPUs in a notebook via Google Colab. We recommend this approach, as setting up the Google Cloud SDK can be intensive. That said, we provide a list of Google Cloud web tutorials for setting up a Google Cloud account, installing the Google Cloud SDK command-line interface, creating a GPU instance, and running an interactive shell: