Model Card for Tahoe-100M-SCVI-v1
An SCVI model and minified AnnData of the Tahoe-100M dataset from Vevo Tx.
Model Details
Model Description
Tahoe-100M-SCVI-v1
- Developed by: Vevo Tx
- Model type: SCVI variational autoencoder
- License: This model is licensed under the MIT License.
Model Architecture
SCVI model
Layers: 1, Hidden Units: 128, Latent Dimensions: 10
Parameters
40,390,510
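For reference, a minimal sketch of how these hyperparameters map onto the scvi-tools SCVI constructor; `adata` is a placeholder AnnData with raw counts in a "counts" layer, and the actual batch/covariate setup used for training is not shown here:

import scvi

# Hypothetical registration: raw counts stored in a "counts" layer.
scvi.model.SCVI.setup_anndata(adata, layer="counts")
model = scvi.model.SCVI(
    adata,
    n_layers=1,        # encoder/decoder depth
    n_hidden=128,      # hidden units per layer
    n_latent=10,       # latent dimensions
    dropout_rate=0.1,
    dispersion="gene",
    gene_likelihood="nb",
)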
Intended Use
Direct Use
- Decoding Tahoe-100M data representation vectors to gene expression.
- Encoding scRNA-seq data to the Tahoe-100M cell state representation space (both uses are sketched below).
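A minimal sketch of both uses with the standard scvi-tools API, assuming `tahoe` is the loaded model from the "How to Get Started with the Model" section below; the cell subset is illustrative:

import numpy as np

idx = np.arange(1000)  # illustrative subset of cells

# Encode: 10-dimensional cell state representations.
latent = tahoe.get_latent_representation(indices=idx)

# Decode: expected normalized gene expression for the same cells.
expression = tahoe.get_normalized_expression(indices=idx)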
Downstream Use
- Adaptation to additional scRNA-seq data (see the query-mapping sketch below).
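A minimal sketch of one way to do this, using the standard scvi-tools query-mapping (scArches-style) workflow; `query_adata` is a hypothetical AnnData of raw counts for new cells, and `tahoe` is the loaded reference model from the "How to Get Started with the Model" section:

import scvi

# Align the hypothetical query data to the reference model's gene set and setup.
scvi.model.SCVI.prepare_query_anndata(query_adata, tahoe)

# Initialize a query model from the trained reference weights, then fine-tune briefly.
query_model = scvi.model.SCVI.load_query_data(query_adata, tahoe)
query_model.train(max_epochs=10, plan_kwargs={"weight_decay": 0.0})

# New cells embedded into the Tahoe-100M representation space.
query_latent = query_model.get_latent_representation()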
Intended Users
- Computational biologists analyzing gene expression responses to drug perturbations.
- Machine learning researchers developing methods for downstream drug response prediction.
Bias, Risks, and Limitations
Reconstructed gene expression values may be inaccurate. Calibration analysis shows that the observed counts fall within the 95% confidence intervals of the model's posterior predictive distribution 97.7% of the time. However, a naive baseline that predicts only zero counts achieves 97.4% on the same metric.
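As a rough, hypothetical sketch of how such a coverage check can be reproduced with the scvi-tools posterior_predictive_sample method, assuming `tahoe` is the loaded model from the section below; the subset size and sample count are illustrative, and the raw counts may need to come from the full Tahoe-100M data rather than the minified AnnData:

import numpy as np

idx = np.arange(1000)  # illustrative subset of cells

# Posterior predictive count samples; depending on the scvi-tools version this
# may be a sparse array of shape (cells, genes, n_samples), so densify it.
samples = tahoe.posterior_predictive_sample(indices=idx, n_samples=200)
samples = np.asarray(samples.todense()) if hasattr(samples, "todense") else np.asarray(samples)

# Raw counts for the same cells; if the minified AnnData does not retain them,
# take this matrix from the full Tahoe-100M dataset instead.
counts = tahoe.adata[idx].layers["counts"]
observed = np.asarray(counts.todense()) if hasattr(counts, "todense") else np.asarray(counts)

# Fraction of (cell, gene) entries whose observed count lies in the central 95%
# posterior predictive interval.
lo, hi = np.percentile(samples, [2.5, 97.5], axis=-1)
coverage = ((observed >= lo) & (observed <= hi)).mean()
print(f"95% interval coverage: {coverage:.3f}")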
The Tahoe-100M data is based on cancer cell lines under drug treatment, and the model is trained to represent this data. The model may not be directly applicable to other forms of scRNA-seq data, such as that from primary cells.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model.
Loading the minified AnnData requires 41 GB of storage (saved in the cache_dir) and RAM. The model itself requires ~1 GB of GPU memory.
> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
      repo_name='vevotx/Tahoe-100M-SCVI-v1',
      cache_dir='/path/to/cache',
  )
> tahoe = tahoe_hubmodel.model
> tahoe
SCVI model with the following parameters:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb,
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True
> tahoe.adata
AnnData object with n_obs × n_vars = 95624334 × 62710
obs: 'sample', 'species', 'gene_count', 'tscp_count', 'mread_count', 'bc1_wind', 'bc2_wind', 'bc3_wind', 'bc1_well', 'bc2_well', 'bc3_well', 'id', 'drugname_drugconc', 'drug', 'INT_ID', 'NUM.SNPS', 'NUM.READS', 'demuxlet_call', 'BEST.LLK', 'NEXT.LLK', 'DIFF.LLK.BEST.NEXT', 'BEST.POSTERIOR', 'SNG.POSTERIOR', 'SNG.BEST.LLK', 'SNG.NEXT.LLK', 'SNG.ONLY.POSTERIOR', 'DBL.BEST.LLK', 'DIFF.LLK.SNG.DBL', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'dataset', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size', 'plate', 'Cell_Name_Vevo', 'Cell_ID_Cellosaur'
var: 'gene_id', 'genome', 'SUB_LIB_ID'
uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
obsm: 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
layers: 'counts'
> # Take some random genes
> gene_list = tahoe.adata.var.sample(10).index
> # Take some random cells
> cell_indices = tahoe.adata.obs.sample(10).index
> # Decode gene expression
> gene_expression = tahoe.get_normalized_expression(tahoe.adata[cell_indices], gene_list=gene_list)
> print(gene_expression)
gene_name TSPAN13 ZSCAN9 ENSG00000200991 ENSG00000224901 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 0.000036 0.000005 4.255257e-10 9.856240e-08
63_080_025-lib_2087 0.000012 0.000012 3.183158e-10 1.124618e-07
01_070_028-lib_1543 0.000005 0.000010 1.604187e-10 1.022676e-07
07_110_046-lib_1885 0.000035 0.000018 2.597950e-09 1.063819e-07
93_082_010-lib_2285 0.000008 0.000009 8.147555e-10 9.102466e-08
94_154_081-lib_2562 0.000035 0.000014 5.600219e-10 6.891351e-08
47_102_103-lib_2596 0.000021 0.000010 7.320031e-10 1.190017e-07
92_138_169-lib_2356 0.000038 0.000015 3.393952e-10 7.600610e-08
35_035_133-lib_2378 0.000041 0.000004 1.503101e-10 9.447428e-08
06_084_182-lib_2611 0.000007 0.000014 5.135248e-10 7.896663e-08
gene_name RN7SL69P ENSG00000263301 ENSG00000269886 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 2.390874e-10 1.896764e-07 7.665454e-08
63_080_025-lib_2087 1.934646e-10 2.205981e-07 6.038700e-08
01_070_028-lib_1543 9.687608e-11 9.900592e-08 5.225622e-08
07_110_046-lib_1885 1.694676e-09 2.274248e-07 7.741949e-08
93_082_010-lib_2285 6.253397e-10 2.593786e-07 7.113768e-08
94_154_081-lib_2562 3.700961e-10 2.083358e-07 6.379186e-08
47_102_103-lib_2596 4.534019e-10 2.551739e-07 4.840992e-08
92_138_169-lib_2356 2.018963e-10 2.067301e-07 4.144172e-08
35_035_133-lib_2378 8.090239e-11 1.658230e-07 3.890900e-08
06_084_182-lib_2611 3.474709e-10 1.025397e-07 4.995985e-08
...
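Because the AnnData is minified, the posterior mean and variance of each cell's latent representation are stored in obsm (the 'X_latent_qzm' and 'X_latent_qzv' keys shown above), so the 10-dimensional embedding can be used directly without re-encoding. A minimal sketch, assuming `tahoe` is the loaded model from above:

# Per-cell posterior means and variances of the latent representation (n_obs x 10).
z_mean = tahoe.adata.obsm["X_latent_qzm"]
z_var = tahoe.adata.obsm["X_latent_qzv"]

# The same means should be returned through the model API; for a minified model
# this reads the stored values rather than re-running the encoder.
z_mean_api = tahoe.get_latent_representation(give_mean=True)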
Training Details
Training Data
Tahoe-100M
Zhang, Jesse, Airol A. Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, et al. 2025. “Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling.” bioRxiv. https://doi.org/10.1101/2025.02.20.639398.
Training Procedure
The model was trained using the SCVI .train() method. One plate of the data (plate 14) was held out from training and used for evaluation and model criticism. A callback was used to evaluate the reconstruction error on the training and validation sets every N minibatches rather than every epoch, since a single epoch is too large to give informative training curves. An additional callback was used to save a snapshot of the model state at every epoch (sketched below).
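A minimal, illustrative sketch of how such callbacks can be passed to SCVI's .train(), which forwards extra keyword arguments to the underlying PyTorch Lightning Trainer; the callback class, interval, and logged metric name are assumptions, not the exact ones used for this model, and `model` stands for a freshly initialized SCVI model on the training data:

from lightning.pytorch.callbacks import Callback, ModelCheckpoint

class PeriodicReconstructionLogger(Callback):
    """Illustrative callback: report reconstruction error every `every_n` minibatches."""

    def __init__(self, every_n=2000):
        self.every_n = every_n

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n == 0:
            # Assumed metric name; scvi's training plan logs reconstruction losses.
            loss = trainer.callback_metrics.get("reconstruction_loss_train")
            print(f"step {trainer.global_step}: reconstruction loss = {loss}")

model.train(
    max_epochs=1,
    precision=32,  # fp32 training, as noted under Training Hyperparameters
    callbacks=[
        PeriodicReconstructionLogger(every_n=2000),
        ModelCheckpoint(every_n_epochs=1, save_top_k=-1),  # snapshot at every epoch
    ],
)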
Training Hyperparameters
- Training regime: fp32 precision was used for training.
Evaluation
Testing Data, Factors & Metrics
Testing Data
Data in the minified AnnData where the 'plate' column equals '14' was held out from training and used for evaluation and criticism.
Metrics
The main metric is reconstruction error, defined as the average negative log likelihood of the observed counts given the representation vectors. This model uses a negative binomial likelihood.
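As a minimal sketch, assuming `tahoe` is the loaded model from the "How to Get Started with the Model" section, this metric can be computed on the held-out plate with the standard scvi-tools get_reconstruction_error method; note that it requires the observed counts, so it may need to be run against the full, non-minified data:

import numpy as np

# Held-out cells: plate 14 (see Testing Data above).
test_idx = np.where(tahoe.adata.obs["plate"] == "14")[0]

# Average negative log likelihood of the observed counts under the negative
# binomial likelihood, given the per-cell latent representations.
metrics = tahoe.get_reconstruction_error(indices=test_idx)
print(metrics)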