Model Card for Tahoe-100M-SCVI-v1
An SCVI model and minified AnnData of the Tahoe-100M dataset from Vevo Tx.
Model Details
Model Description
Tahoe-100M-SCVI-v1
- Developed by: Vevo Tx
- Model type: SCVI variational autoencoder
- License: This model is licensed under the MIT License.
Model Architecture
SCVI model
Layers: 1, Hidden Units: 128, Latent Dimensions: 10
Parameters
40,390,510
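For reference, a minimal sketch of how these hyperparameters map onto the scvi-tools SCVI constructor; `adata` is a placeholder AnnData with raw counts in a "counts" layer, and the actual batch/covariate setup used for training is not shown here:

import scvi

# Hypothetical registration: raw counts stored in a "counts" layer.
scvi.model.SCVI.setup_anndata(adata, layer="counts")
model = scvi.model.SCVI(
    adata,
    n_layers=1,        # encoder/decoder depth
    n_hidden=128,      # hidden units per layer
    n_latent=10,       # latent dimensions
    dropout_rate=0.1,
    dispersion="gene",
    gene_likelihood="nb",
)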
Intended Use
Direct Use
- Decoding Tahoe-100M data representation vectors to gene expression.
- Encoding scRNA-seq data to the Tahoe-100M cell state representation space (both uses are sketched below).
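A minimal sketch of both uses with the standard scvi-tools API, assuming `tahoe` is the loaded model from the "How to Get Started with the Model" section below; the cell subset is illustrative:

import numpy as np

idx = np.arange(1000)  # illustrative subset of cells

# Encode: 10-dimensional cell state representations.
latent = tahoe.get_latent_representation(indices=idx)

# Decode: expected normalized gene expression for the same cells.
expression = tahoe.get_normalized_expression(indices=idx)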
Downstream Use
- Adaptation to additional scRNA-seq data (see the query-mapping sketch below).
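A minimal sketch of one way to do this, using the standard scvi-tools query-mapping (scArches-style) workflow; `query_adata` is a hypothetical AnnData of raw counts for new cells, and `tahoe` is the loaded reference model from the "How to Get Started with the Model" section:

import scvi

# Align the hypothetical query data to the reference model's gene set and setup.
scvi.model.SCVI.prepare_query_anndata(query_adata, tahoe)

# Initialize a query model from the trained reference weights, then fine-tune briefly.
query_model = scvi.model.SCVI.load_query_data(query_adata, tahoe)
query_model.train(max_epochs=10, plan_kwargs={"weight_decay": 0.0})

# New cells embedded into the Tahoe-100M representation space.
query_latent = query_model.get_latent_representation()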
Intended Users
- Computational biologists analyzing gene expression responses to drug perturbations.
- Machine learning researchers developing methods for downstream drug response prediction.
Bias, Risks, and Limitations
Reconstructed gene expression values may be inaccurate. Calibration analysis shows that the observed counts fall within the 95% confidence intervals of the model's posterior predictive distribution 97.7% of the time. However, a naive baseline that predicts only zero counts achieves 97.4% on the same metric.
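As a rough, hypothetical sketch of how such a coverage check can be reproduced with the scvi-tools posterior_predictive_sample method, assuming `tahoe` is the loaded model from the section below; the subset size and sample count are illustrative, and the raw counts may need to come from the full Tahoe-100M data rather than the minified AnnData:

import numpy as np

idx = np.arange(1000)  # illustrative subset of cells

# Posterior predictive count samples; depending on the scvi-tools version this
# may be a sparse array of shape (cells, genes, n_samples), so densify it.
samples = tahoe.posterior_predictive_sample(indices=idx, n_samples=200)
samples = np.asarray(samples.todense()) if hasattr(samples, "todense") else np.asarray(samples)

# Raw counts for the same cells; if the minified AnnData does not retain them,
# take this matrix from the full Tahoe-100M dataset instead.
counts = tahoe.adata[idx].layers["counts"]
observed = np.asarray(counts.todense()) if hasattr(counts, "todense") else np.asarray(counts)

# Fraction of (cell, gene) entries whose observed count lies in the central 95%
# posterior predictive interval.
lo, hi = np.percentile(samples, [2.5, 97.5], axis=-1)
coverage = ((observed >= lo) & (observed <= hi)).mean()
print(f"95% interval coverage: {coverage:.3f}")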
The Tahoe-100M data is based on cancer cell lines under drug treatment, and the model is trained to represent this data. The model may not be directly applicable to other forms of scRNA-seq data, such as that from primary cells.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model.
Loading the minified AnnData requires 41 GB of storage (saved in the cache_dir) and RAM. The model itself requires ~1 GB of GPU memory.
> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
      repo_name='vevotx/Tahoe-100M-SCVI-v1',
      cache_dir='/path/to/cache',
  )
> tahoe = tahoe_hubmodel.model
> tahoe
SCVI model with the following parameters:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb,
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True
> tahoe.adata
AnnData object with n_obs × n_vars = 95624334 × 62710
obs: 'sample', 'species', 'gene_count', 'tscp_count', 'mread_count', 'bc1_wind', 'bc2_wind', 'bc3_wind', 'bc1_well', 'bc2_well', 'bc3_well', 'id', 'drugname_drugconc', 'drug', 'INT_ID', 'NUM.SNPS', 'NUM.READS', 'demuxlet_call', 'BEST.LLK', 'NEXT.LLK', 'DIFF.LLK.BEST.NEXT', 'BEST.POSTERIOR', 'SNG.POSTERIOR', 'SNG.BEST.LLK', 'SNG.NEXT.LLK', 'SNG.ONLY.POSTERIOR', 'DBL.BEST.LLK', 'DIFF.LLK.SNG.DBL', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'dataset', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size', 'plate', 'Cell_Name_Vevo', 'Cell_ID_Cellosaur'
var: 'gene_id', 'genome', 'SUB_LIB_ID'
uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
obsm: 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
layers: 'counts'
> # Take some random genes
> gene_list = tahoe.adata.var.sample(10).index
> # Take some random cells
> cell_indices = tahoe.adata.obs.sample(10).index
> # Decode gene expression
> gene_expression = tahoe.get_normalized_expression(tahoe.adata[cell_indices], gene_list=gene_list)
> print(gene_expression)
gene_name TSPAN13 ZSCAN9 ENSG00000200991 ENSG00000224901 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 0.000036 0.000005 4.255257e-10 9.856240e-08
63_080_025-lib_2087 0.000012 0.000012 3.183158e-10 1.124618e-07
01_070_028-lib_1543 0.000005 0.000010 1.604187e-10 1.022676e-07
07_110_046-lib_1885 0.000035 0.000018 2.597950e-09 1.063819e-07
93_082_010-lib_2285 0.000008 0.000009 8.147555e-10 9.102466e-08
94_154_081-lib_2562 0.000035 0.000014 5.600219e-10 6.891351e-08
47_102_103-lib_2596 0.000021 0.000010 7.320031e-10 1.190017e-07
92_138_169-lib_2356 0.000038 0.000015 3.393952e-10 7.600610e-08
35_035_133-lib_2378 0.000041 0.000004 1.503101e-10 9.447428e-08
06_084_182-lib_2611 0.000007 0.000014 5.135248e-10 7.896663e-08
gene_name RN7SL69P ENSG00000263301 ENSG00000269886 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 2.390874e-10 1.896764e-07 7.665454e-08
63_080_025-lib_2087 1.934646e-10 2.205981e-07 6.038700e-08
01_070_028-lib_1543 9.687608e-11 9.900592e-08 5.225622e-08
07_110_046-lib_1885 1.694676e-09 2.274248e-07 7.741949e-08
93_082_010-lib_2285 6.253397e-10 2.593786e-07 7.113768e-08
94_154_081-lib_2562 3.700961e-10 2.083358e-07 6.379186e-08
47_102_103-lib_2596 4.534019e-10 2.551739e-07 4.840992e-08
92_138_169-lib_2356 2.018963e-10 2.067301e-07 4.144172e-08
35_035_133-lib_2378 8.090239e-11 1.658230e-07 3.890900e-08
06_084_182-lib_2611 3.474709e-10 1.025397e-07 4.995985e-08
...
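Because the AnnData is minified, the posterior mean and variance of each cell's latent representation are stored in obsm (the 'X_latent_qzm' and 'X_latent_qzv' keys shown above), so the 10-dimensional embedding can be used directly without re-encoding. A minimal sketch, assuming `tahoe` is the loaded model from above:

# Per-cell posterior means and variances of the latent representation (n_obs x 10).
z_mean = tahoe.adata.obsm["X_latent_qzm"]
z_var = tahoe.adata.obsm["X_latent_qzv"]

# The same means should be returned through the model API; for a minified model
# this reads the stored values rather than re-running the encoder.
z_mean_api = tahoe.get_latent_representation(give_mean=True)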
Training Details
Training Data
Tahoe-100M
Zhang, Jesse, Airol A. Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, et al. 2025. “Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling.” bioRxiv. https://doi.org/10.1101/2025.02.20.639398.
Training Procedure
The model was trained using the SCVI .train() method. One plate of the data (plate 14) was held out from training and used for evaluation and model criticism. A callback was used to evaluate the reconstruction error on the training and validation sets every N minibatches rather than every epoch, since a single epoch is too large to give informative training curves. An additional callback was used to save a snapshot of the model state at every epoch (sketched below).
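A minimal, illustrative sketch of how such callbacks can be passed to SCVI's .train(), which forwards extra keyword arguments to the underlying PyTorch Lightning Trainer; the callback class, interval, and logged metric name are assumptions, not the exact ones used for this model, and `model` stands for a freshly initialized SCVI model on the training data:

from lightning.pytorch.callbacks import Callback, ModelCheckpoint

class PeriodicReconstructionLogger(Callback):
    """Illustrative callback: report reconstruction error every `every_n` minibatches."""

    def __init__(self, every_n=2000):
        self.every_n = every_n

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n == 0:
            # Assumed metric name; scvi's training plan logs reconstruction losses.
            loss = trainer.callback_metrics.get("reconstruction_loss_train")
            print(f"step {trainer.global_step}: reconstruction loss = {loss}")

model.train(
    max_epochs=1,
    precision=32,  # fp32 training, as noted under Training Hyperparameters
    callbacks=[
        PeriodicReconstructionLogger(every_n=2000),
        ModelCheckpoint(every_n_epochs=1, save_top_k=-1),  # snapshot at every epoch
    ],
)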
Training Hyperparameters
- Training regime: fp32 precision was used for training.
Evaluation
Testing Data, Factors & Metrics
Testing Data
Data in the minified AnnData where the 'plate' column equals '14' was held out from training and used for evaluation and criticism.
Metrics
The main metric is reconstruction error, defined as the average negative log likelihood of the observed counts given the representation vectors. This model uses a negative binomial likelihood.
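As a minimal sketch, assuming `tahoe` is the loaded model from the "How to Get Started with the Model" section, this metric can be computed on the held-out plate with the standard scvi-tools get_reconstruction_error method; note that it requires the observed counts, so it may need to be run against the full, non-minified data:

import numpy as np

# Held-out cells: plate 14 (see Testing Data above).
test_idx = np.where(tahoe.adata.obs["plate"] == "14")[0]

# Average negative log likelihood of the observed counts under the negative
# binomial likelihood, given the per-cell latent representations.
metrics = tahoe.get_reconstruction_error(indices=test_idx)
print(metrics)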