scLDM.CD4 v0.1
scLDM.CD4 is a deep generative framework for CD4+ T cell single-cell transcriptomics, composed of an autoencoder and a flow-matching module. The autoencoder learns a rich latent representation of cell state, and the flow model generates perturbed expression profiles conditioned on perturbation identity and context variables, using classifier-free guidance.
For model code and additional information on installation and usage, please see the associated GitHub repository.
Model Details
Model Architecture
scLDM.CD4 v0.1 builds on scLDM (Palla et al., 2025) and is tuned for counterfactual predictions of perturbation effects in single-cell transcriptomic profiles. The model has two main components:
Autoencoder: A transformer-based autoencoder for single-cell gene expression that compresses mRNA counts into a latent representation and reconstructs gene-level expression from that latent space.
Flow Matching (latent generative model): A conditional flow-matching model based on a Diffusion Transformer (DiT; Peebles & Xie, 2022) that generates latent cell profiles conditioned on attributes such as cell context and perturbation identity. Training uses an optimal-transport formulation to define target couplings along the flow starting from random noise, rather than constructing explicit couplings between control and perturbed cells.
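The flow-matching objective with minibatch optimal-transport coupling can be sketched as follows. This is a minimal illustration, not the released implementation; the function name and the model signature (velocity net taking latent, time, and condition) are hypothetical.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_coupled_fm_loss(model, z1, cond):
    """One conditional flow-matching training step with minibatch OT coupling.

    z1: latent cell embeddings (B, D); cond: conditioning vectors (B, C).
    """
    z0 = torch.randn_like(z1)                      # samples from the noise source
    # Minibatch optimal transport: pair each noise sample with a nearby
    # latent target so the learned flow paths are short and straight.
    cost = torch.cdist(z0, z1).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)
    z0, z1p, condp = z0[rows], z1[cols], cond[cols]
    t = torch.rand(z1.size(0), 1)                  # random flow times in [0, 1]
    zt = (1 - t) * z0 + t * z1p                    # point on the straight path
    v_target = z1p - z0                            # constant target velocity
    v_pred = model(zt, t, condp)                   # predicted velocity field
    return ((v_pred - v_target) ** 2).mean()
```

Under this formulation the network regresses a velocity field whose integration transports noise samples to latent cell profiles.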
Both components are implemented in PyTorch Lightning and support distributed training and inference. The models can be compiled with torch.compile for faster inference.
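For example, a module can be wrapped for compiled inference roughly like this (a minimal sketch with a stand-in nn.Linear; the actual entry points live in the repository). The "eager" backend is chosen here only so the snippet runs anywhere; the default inductor backend is what provides the speedup on GPU.

```python
import torch

# torch.compile wraps a module so forward passes run through a
# graph-optimizing compiler; the wrapped module is a drop-in replacement.
model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend="eager")
out = compiled(torch.randn(2, 16))
```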
Parameters
Parameter counts are architecture-dependent and configured through the training configuration. The released pre-trained checkpoint includes a 15.2M-parameter autoencoder and a 44.3M-parameter flow-matching model. The model also supports flexible architecture configurations, including:
- Variable number of transformer layers
- Configurable embedding dimensions
- Adjustable latent space dimensions
- Gene vocabulary size based on training data
Citation
- scLDM.CD4: Dibaeinia et al. Virtual Cells Need Context, Not Just Scale (2026) arXiv (coming soon)
- scLDM: Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv DOI: 10.48550/arXiv.2511.02986
Model Card Authors
Payam Dibaeinia (Biohub), Mei Knudson (Biohub, University of Chicago), Sudarshan Babu (Biohub), Jason Perera (Biohub), Aly A. Khan (Biohub, University of Chicago)
Primary Contact Email
Payam Dibaeinia payam.dibaeinia@biohub.org
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Compute Requirements: GPU
- Inference with the released pre-trained checkpoint has been tested on NVIDIA A100, H100, and A6000 GPUs. CPU-only inference is not currently supported.
Intended Use
Primary Use Cases
scLDM.CD4 v0.1 is designed to synthesize single-cell mRNA expression profiles of CD4+ T cells under single-gene knockdown perturbations, enabling downstream analysis of perturbation effects and counterfactual prediction. Key use cases include:
- In silico exploration of perturbed single-cell profiles for new donors and time points when their empirical training data are limited.
- In silico ranking of candidate perturbations toward a desired transcriptomic effect in naïve CD4+ T cells.
- Latent representations and synthetic cells for downstream tasks, including learning transferable embeddings, data augmentation, and training/evaluating downstream predictive models.
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the MIT license.
- Any use that is prohibited by the Acceptable Use Policy.
- Clinical decision-making or diagnostic purposes without proper validation.
- Generation of cells with conditions or gene combinations not present in the training data.
- Use of the autoencoder models for generation (autoencoder models are restricted to inference and encoding; use the flow-matching model for data generation tasks).
Training Data
scLDM.CD4 v0.1 is trained on a pre-processed CD4+ T cell Perturb-seq dataset comprising ~14.5M cells, derived from the raw data released by Zhu et al. (2025). Training and evaluation are performed on a fixed panel of 3,699 highly variable genes (HVGs).
The training data include:
- Gene expression profiles: count matrices for ~14.5M cells (restricted to the HVG panel).
- Cell-level metadata: donor identifiers, perturbation time points, and guide/target annotations used to specify perturbation identity and experimental context.
Training Procedure
Training follows a two-stage approach:
Autoencoder Training: The autoencoder is trained first to learn a latent representation of gene expression data. Training includes:
- Gene tokenization
- Encoder-decoder architecture with transformer layers
- Negative binomial reconstruction loss
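A common parameterization of the negative binomial reconstruction loss (mean plus inverse-dispersion, as in scVI-style models) can be sketched as below; this is an illustrative form, not necessarily the exact one used in the released code.

```python
import torch

def nb_nll(x, mu, theta, eps=1e-8):
    """Negative binomial negative log-likelihood, summed over genes per cell.

    x: observed counts; mu: predicted mean; theta: inverse-dispersion.
    """
    log_theta_mu = torch.log(theta + mu + eps)
    ll = (theta * (torch.log(theta + eps) - log_theta_mu)
          + x * (torch.log(mu + eps) - log_theta_mu)
          + torch.lgamma(x + theta)
          - torch.lgamma(theta)
          - torch.lgamma(x + 1))
    return -ll.sum(-1)
```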
Flow Matching Training: After autoencoder training, the flow matching model is trained in the learned latent space:
- Uses the pre-trained encoder
- Trains a Diffusion Transformer (DiT) for generation
- Supports conditional generation with metadata features
- Implements classifier-free guidance training
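At sampling time, classifier-free guidance blends the conditional and unconditional predictions of a model that was trained with conditions randomly dropped. A generic sketch (function name, null-condition handling, and guidance weight are illustrative assumptions):

```python
import torch

def guided_velocity(model, zt, t, cond, null_cond, w=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w."""
    v_cond = model(zt, t, cond)          # condition-aware velocity
    v_uncond = model(zt, t, null_cond)   # unconditional (null-token) velocity
    return v_uncond + w * (v_cond - v_uncond)
```

With w = 1 this reduces to ordinary conditional sampling; larger w sharpens adherence to the condition at the cost of sample diversity.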
Training Script
Training for both the autoencoder and the flow-matching model is launched via the shared experiments/scripts/train.py entry point and configured through a Hydra config:
- Autoencoder training: python experiments/scripts/train.py --config-name=marson_vae
- Flow-matching training: python experiments/scripts/train.py --config-name=marson_fm
Training and Inference Compute
- Autoencoder Training: 60 epochs with distributed training (1 node x 8 NVIDIA H100 GPUs per node), ~3.5 days
- Flow Matching Training: 150 epochs with distributed training (4 nodes x 8 NVIDIA H100 GPUs per node), ~18 hours
- Checkpoint Sizes: vary based on model architecture and configuration
- Inference: generate ~2.4M perturbed cells using distributed inference (8 nodes x 8 NVIDIA H100 GPUs per node), ~1 hour
Training Hyperparameters
Training hyperparameters are configured via Hydra, with separate config files defined for the two stages.
Key hyper-parameters for training the autoencoder include:
- Training loop: number of epochs; precision; gradient clipping (norm)
- Batching / data: batch size; gene sequence length (fixed, 3699 HVGs)
- Model architecture: encoder/decoder layers; embedding dim; latent dim; number of inducing points; attention heads; cross-heads; latent projection type (default MLP); positional encoding; input aggregation (e.g., softbin).
- Optimization: AdamW learning rate; weight decay; betas; cosine decay parameters.
Key hyper-parameters for training the flow-matching model include:
- Training loop: number of epochs; precision; gradient clipping (norm)
- Batching / data: batch size 360 (train)
- Conditioning: condition strategy (e.g. joint); condition classes (we included donor ID, guide target, and perturbation time point).
- DiT backbone (conditional network): number of layers; attention heads; embedding dim
- Flow-matching dynamics: number of timesteps; additional flow parameters
- Optimization: AdamW learning rate; betas; weight decay 0.01; cosine decay parameters.
- Pre-trained autoencoder checkpoint.
Performance Metrics
Evaluation Datasets
A dedicated validation split of ~1.1M pre-processed cells is used to tune hyperparameters for both the autoencoder and flow-matching models (the final selected settings are recorded in the released marson_vae and marson_fm Hydra config files). Final results are reported on a held-out test split of ~2.4M pre-processed cells, disjoint from the training and validation data.
Metrics
Optimal hyper-parameters were selected based on the following three key performance metrics:
- Reconstruction Quality (autoencoder loss): negative binomial log-likelihood measuring how well the autoencoder reconstructs gene expression from latent representations.
- Generation Quality (flow-matching loss): flow-matching (denoising/velocity) loss measuring conditional latent generation performance during training.
- Correlation-Δ: a perturbation-effect metric assessing agreement between predicted and observed perturbation responses.
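One natural formulation of a Correlation-Δ metric is the Pearson correlation between predicted and observed mean expression shifts relative to the control population. The sketch below is a hypothetical formulation of that idea, not the exact implementation:

```python
import numpy as np

def correlation_delta(pred_pert, obs_pert, obs_ctrl):
    """Pearson correlation of predicted vs. observed mean shifts from control.

    Each argument is a (cells, genes) expression matrix; the correlation is
    computed across genes on the per-gene mean differences (the Δ vectors).
    """
    d_pred = pred_pert.mean(axis=0) - obs_ctrl.mean(axis=0)
    d_obs = obs_pert.mean(axis=0) - obs_ctrl.mean(axis=0)
    return np.corrcoef(d_pred, d_obs)[0, 1]
```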
Evaluation Results
On the held-out test split, we evaluate both generation quality and perturbation-effect prediction using: (1) UMAP comparisons of real vs. generated cells, (2) distributional metrics including MMD and W2, (3) Δ-based metrics that assess how well predicted mean shifts relative to the control population match true observed shifts, and (4) recovery of differentially expressed genes. Comparisons against simple but competitively strong baselines indicate that the model captures key aspects of the perturbation response.
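For reference, a distributional metric like MMD compares generated and real cell populations without requiring cell-to-cell pairing. A generic RBF-kernel estimator (an illustrative sketch, not the evaluation code used here):

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel (biased estimator).

    x, y: (cells, features) arrays from the two populations being compared.
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

An MMD near zero indicates the two populations are indistinguishable under the chosen kernel.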
Biases, Risks, and Limitations
Potential Biases
The model may reflect biases present in the training data, including:
- Underrepresentation of certain biological conditions
- Bias toward specific experimental protocols or platforms
- Limited diversity in perturbation conditions
- Generated cells will reflect the distribution and characteristics of the training data
- Condition-specific generation is limited to conditions seen during training
Risks
Areas of risk may include but are not limited to:
- Inaccurate outputs or generation of unrealistic cell profiles
- Potential misuse for incorrect biological interpretations without proper validation
- Extrapolation beyond training data distribution may produce unreliable results
- Generated cells should not be used for clinical decision-making without extensive validation
Limitations
- Autoencoder models: Cannot generate new cells without flow-matching prior (encoding only)
- Condition vocabulary: Only conditions present in training data can be used for conditional generation
- Gene vocabulary: Only genes in the training data can be generated or analyzed
- Checkpoint compatibility: Model checkpoints must match the training configuration architecture
- Domain specificity: Model architecture is optimized for the specific cell types and conditions in the training data
Caveats and Recommendations
- Review and validate outputs generated by the model, especially for downstream analysis
- Be cautious when selecting the conditioning type: joint and mutually exclusive conditioning rely on different underlying logic. Joint conditioning is recommended for perturbational modeling
- Validate that conditions used for conditional generation were present in training data
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Should you have any security or privacy issues or questions related to the model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.
Acknowledgements
We are deeply grateful to Giovanni Palla, Jakub Tomczak, Ali ElSheikh, Yibo Wen, Dennis Wu, Weimin Wu, Krunal Patel, Lakshmi Krishnan, Shirin Fuller, and Kavita Kulkarni for their generous support, thoughtful feedback, and many helpful discussions throughout this work.