scLDM.CD4 v0.1
scLDM.CD4 is a deep generative framework for CD4+ T cell single-cell transcriptomics, composed of an autoencoder and a flow-matching module. The autoencoder learns a rich latent representation of cell state, and the flow model generates perturbed expression profiles conditioned on perturbation identity and context variables, using classifier-free guidance.
For model code and additional information on installation and usage, please see the associated GitHub repository.
Model Details
Model Architecture
scLDM.CD4 v0.1 builds on scLDM (Palla et al., 2025) and is tuned for counterfactual predictions of perturbation effects in single-cell transcriptomic profiles. The model has two main components:
Autoencoder: A transformer-based autoencoder for single-cell gene expression that compresses mRNA counts into a latent representation and reconstructs gene-level expression from that latent space.
Flow Matching (latent generative model): A conditional flow-matching model based on a Diffusion Transformer (DiT; Peebles & Xie, 2022) that generates latent cell profiles conditioned on attributes such as cell context and perturbation identity. Training uses an optimal-transport formulation to define target couplings along the flow starting from random noise, rather than constructing explicit couplings between control and perturbed cells.
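The flow-matching objective with minibatch optimal-transport coupling can be sketched as follows. This is a minimal illustration, not the released implementation; the function name and the model signature (velocity net taking latent, time, and condition) are hypothetical.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_coupled_fm_loss(model, z1, cond):
    """One conditional flow-matching training step with minibatch OT coupling.

    z1: latent cell embeddings (B, D); cond: conditioning vectors (B, C).
    """
    z0 = torch.randn_like(z1)                      # samples from the noise source
    # Minibatch optimal transport: pair each noise sample with a nearby
    # latent target so the learned flow paths are short and straight.
    cost = torch.cdist(z0, z1).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)
    z0, z1p, condp = z0[rows], z1[cols], cond[cols]
    t = torch.rand(z1.size(0), 1)                  # random flow times in [0, 1]
    zt = (1 - t) * z0 + t * z1p                    # point on the straight path
    v_target = z1p - z0                            # constant target velocity
    v_pred = model(zt, t, condp)                   # predicted velocity field
    return ((v_pred - v_target) ** 2).mean()
```

Under this formulation the network regresses a velocity field whose integration transports noise samples to latent cell profiles.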
Both components are implemented in PyTorch Lightning and support distributed training and inference. The models can be compiled with torch.compile for faster inference.
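For example, a module can be wrapped for compiled inference roughly like this (a minimal sketch with a stand-in nn.Linear; the actual entry points live in the repository). The "eager" backend is chosen here only so the snippet runs anywhere; the default inductor backend is what provides the speedup on GPU.

```python
import torch

# torch.compile wraps a module so forward passes run through a
# graph-optimizing compiler; the wrapped module is a drop-in replacement.
model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend="eager")
out = compiled(torch.randn(2, 16))
```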
Parameters
Parameter counts are architecture-dependent and configured through the training configuration. The released pre-trained checkpoint includes a 15.2M-parameter autoencoder and a 44.3M-parameter flow-matching model. The model also supports flexible architecture configurations, including:
- Variable number of transformer layers
- Configurable embedding dimensions
- Adjustable latent space dimensions
- Gene vocabulary size based on training data
Citation
- scLDM.CD4: Dibaeinia et al. Virtual Cells Need Context, Not Just Scale (2026) arXiv (coming soon)
- scLDM: Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv DOI: 10.48550/arXiv.2511.02986
Model Card Authors
Payam Dibaeinia (Biohub), Mei Knudson (Biohub, University of Chicago), Sudarshan Babu (Biohub), Jason Perera (Biohub), Aly A. Khan (Biohub, University of Chicago)
Primary Contact Email
Payam Dibaeinia payam.dibaeinia@biohub.org
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Compute Requirements: GPU
- Inference with the released pre-trained checkpoint has been tested on NVIDIA A100, H100, and A6000 GPUs. CPU-only inference is not currently supported.
Intended Use
Primary Use Cases
scLDM.CD4 v0.1 is designed to synthesize single-cell mRNA expression profiles of CD4+ T cells under single-gene knockdown perturbations, enabling downstream analysis of perturbation effects and counterfactual prediction. Key use cases include:
- In silico exploration of perturbed single-cell profiles for new donors and time points when their empirical training data are limited.
- In silico ranking of candidate perturbations toward a desired transcriptomic effect in naïve CD4+ T cells.
- Latent representations and synthetic cells for downstream tasks, including learning transferable embeddings, data augmentation, and training/evaluating downstream predictive models.
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the MIT license.
- Any use that is prohibited by the Acceptable Use Policy.
- Clinical decision-making or diagnostic purposes without proper validation.
- Generation of cells with conditions or gene combinations not present in the training data.
- Use of the autoencoder models for generation (autoencoder models are restricted to inference and encoding; use the flow-matching model for data generation tasks).
Training Data
scLDM.CD4 v0.1 is trained on a pre-processed CD4+ T cell Perturb-seq dataset comprising ~14.5M cells, derived from the raw data released by Zhu et al. (2025). Training and evaluation are performed on a fixed panel of 3,699 highly variable genes (HVGs).
The training data include:
- Gene expression profiles: count matrices for ~14.5M cells (restricted to the HVG panel).
- Cell-level metadata: donor identifiers, perturbation time points, and guide/target annotations used to specify perturbation identity and experimental context.
Training Procedure
Training follows a two-stage approach:
Autoencoder Training: The autoencoder is trained first to learn a latent representation of gene expression data. Training includes:
- Gene tokenization
- Encoder-decoder architecture with transformer layers
- Negative binomial reconstruction loss
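A common parameterization of the negative binomial reconstruction loss (mean plus inverse-dispersion, as in scVI-style models) can be sketched as below; this is an illustrative form, not necessarily the exact one used in the released code.

```python
import torch

def nb_nll(x, mu, theta, eps=1e-8):
    """Negative binomial negative log-likelihood, summed over genes per cell.

    x: observed counts; mu: predicted mean; theta: inverse-dispersion.
    """
    log_theta_mu = torch.log(theta + mu + eps)
    ll = (theta * (torch.log(theta + eps) - log_theta_mu)
          + x * (torch.log(mu + eps) - log_theta_mu)
          + torch.lgamma(x + theta)
          - torch.lgamma(theta)
          - torch.lgamma(x + 1))
    return -ll.sum(-1)
```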
Flow Matching Training: After autoencoder training, the flow matching model is trained in the learned latent space:
- Uses the pre-trained encoder
- Trains a Diffusion Transformer (DiT) for generation
- Supports conditional generation with metadata features
- Implements classifier-free guidance training
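At sampling time, classifier-free guidance blends the conditional and unconditional predictions of a model that was trained with conditions randomly dropped. A generic sketch (function name, null-condition handling, and guidance weight are illustrative assumptions):

```python
import torch

def guided_velocity(model, zt, t, cond, null_cond, w=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w."""
    v_cond = model(zt, t, cond)          # condition-aware velocity
    v_uncond = model(zt, t, null_cond)   # unconditional (null-token) velocity
    return v_uncond + w * (v_cond - v_uncond)
```

With w = 1 this reduces to ordinary conditional sampling; larger w sharpens adherence to the condition at the cost of sample diversity.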
Training Script
Training for both the autoencoder and the flow-matching model is launched via the shared experiments/scripts/train.py entry point and configured through a Hydra config:
- Autoencoder training: python experiments/scripts/train.py --config-name=marson_vae
- Flow-matching training: python experiments/scripts/train.py --config-name=marson_fm
Training and Inference Compute
- Autoencoder Training: 60 epochs with distributed training (1 node x 8 NVIDIA H100 GPUs per node), ~3.5 days
- Flow Matching Training: 150 epochs with distributed training (4 nodes x 8 NVIDIA H100 GPUs per node), ~18 hours
- Checkpoint Sizes: vary based on model architecture and configuration
- Inference: generate ~2.4M perturbed cells using distributed inference (8 nodes x 8 NVIDIA H100 GPUs per node), ~1 hour
Training Hyperparameters
Training hyperparameters are configured via Hydra, with separate config files defined for the two stages.
Key hyper-parameters for training the autoencoder include:
- Training loop: number of epochs; precision; gradient clipping (norm)
- Batching / data: batch size; gene sequence length (fixed, 3699 HVGs)
- Model architecture: encoder/decoder layers; embedding dim; latent dim; number of inducing points; attention heads; cross-heads; latent projection type (default MLP); positional encoding; input aggregation (e.g., softbin).
- Optimization: AdamW learning rate; weight decay; betas; cosine decay parameters.
Key hyper-parameters for training the flow-matching model include:
- Training loop: number of epochs; precision; gradient clipping (norm)
- Batching / data: batch size 360 (train)
- Conditioning: condition strategy (e.g. joint); condition classes (we included donor ID, guide target, and perturbation time point).
- DiT backbone (conditional network): number of layers; attention heads; embedding dim
- Flow-matching dynamics: number of timesteps; additional flow parameters
- Optimization: AdamW learning rate; betas; weight decay 0.01; cosine decay parameters.
- Pre-trained autoencoder checkpoint.
Performance Metrics
Evaluation Datasets
A dedicated validation split of ~1.1M pre-processed cells is used to tune hyperparameters for both the autoencoder and flow-matching models (the final selected settings are recorded in the released marson_vae and marson_fm Hydra config files). Final results are reported on a held-out test split of ~2.4M pre-processed cells, disjoint from the training and validation data.
Metrics
Optimal hyper-parameters were selected based on the following three key performance metrics:
- Reconstruction Quality (autoencoder loss): negative binomial log-likelihood measuring how well the autoencoder reconstructs gene expression from latent representations.
- Generation Quality (flow-matching loss): flow-matching (denoising/velocity) loss measuring conditional latent generation performance during training.
- Correlation-Δ: a perturbation-effect metric assessing agreement between predicted and observed perturbation responses.
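One natural formulation of a Correlation-Δ metric is the Pearson correlation between predicted and observed mean expression shifts relative to the control population. The sketch below is a hypothetical formulation of that idea, not the exact implementation:

```python
import numpy as np

def correlation_delta(pred_pert, obs_pert, obs_ctrl):
    """Pearson correlation of predicted vs. observed mean shifts from control.

    Each argument is a (cells, genes) expression matrix; the correlation is
    computed across genes on the per-gene mean differences (the Δ vectors).
    """
    d_pred = pred_pert.mean(axis=0) - obs_ctrl.mean(axis=0)
    d_obs = obs_pert.mean(axis=0) - obs_ctrl.mean(axis=0)
    return np.corrcoef(d_pred, d_obs)[0, 1]
```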
Evaluation Results
On the held-out test split, we evaluate both generation quality and perturbation-effect prediction using: (1) UMAP comparisons of real vs. generated cells, (2) distributional metrics including MMD and W2, (3) Δ-based metrics that assess how well predicted mean shifts relative to the control population match true observed shifts, and (4) recovery of differentially expressed genes. Comparisons against simple but competitively strong baselines indicate that the model captures key aspects of the perturbation response.
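For reference, a distributional metric like MMD compares generated and real cell populations without requiring cell-to-cell pairing. A generic RBF-kernel estimator (an illustrative sketch, not the evaluation code used here):

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel (biased estimator).

    x, y: (cells, features) arrays from the two populations being compared.
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

An MMD near zero indicates the two populations are indistinguishable under the chosen kernel.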
Biases, Risks, and Limitations
Potential Biases
The model may reflect biases present in the training data, including:
- Underrepresentation of certain biological conditions
- Bias toward specific experimental protocols or platforms
- Limited diversity in perturbation conditions
- Generated cells will reflect the distribution and characteristics of the training data
- Condition-specific generation is limited to conditions seen during training
Risks
Areas of risk may include but are not limited to:
- Inaccurate outputs or generation of unrealistic cell profiles
- Potential misuse for incorrect biological interpretations without proper validation
- Extrapolation beyond training data distribution may produce unreliable results
- Generated cells should not be used for clinical decision-making without extensive validation
Limitations
- Autoencoder models: Cannot generate new cells without flow-matching prior (encoding only)
- Condition vocabulary: Only conditions present in training data can be used for conditional generation
- Gene vocabulary: Only genes in the training data can be generated or analyzed
- Checkpoint compatibility: Model checkpoints must match the training configuration architecture
- Domain specificity: Model architecture is optimized for the specific cell types and conditions in the training data
Caveats and Recommendations
- Review and validate outputs generated by the model, especially for downstream analysis
- Be cautious when selecting the conditioning type: joint and mutually exclusive conditioning rely on different underlying logic. Joint conditioning is recommended for perturbational modeling
- Validate that conditions used for conditional generation were present in training data
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Should you have any security or privacy issues or questions related to the model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.
Acknowledgements
We are deeply grateful to Giovanni Palla, Jakub Tomczak, Ali ElSheikh, Yibo Wen, Dennis Wu, Weimin Wu, Krunal Patel, Lakshmi Krishnan, Shirin Fuller, and Kavita Kulkarni for their generous support, thoughtful feedback, and many helpful discussions throughout this work.