Model Card for mp_20_base
Model Details
Model Description
mp_20_base is an unconditional generative model designed for the generation of valid inorganic crystal structures. It serves as a foundational pre-trained model for the CrystaLLM-pi framework, specifically optimized for smaller unit cells. Based on a GPT-2 decoder-only architecture, it is trained on a corpus of Crystallographic Information Files (CIFs) to learn the syntax, symmetry, and chemical rules governing crystalline matter.
This model does not accept property conditioning vectors. It generates structures based on text prompts (e.g., chemical composition or space group) or unconditionally (ab-initio generation).
- Developed by: Bone et al. (University College London)
- Model type: Autoregressive Transformer (GPT-2)
- Language(s): CIF (Crystallographic Information File) syntax
- License: MIT
Model Sources
- Repository: GitHub: CrystaLLM-pi
- Paper: Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)
- Dataset: HuggingFace: c-bone/mp_20
Uses
Direct Use
The model is intended for:
- Unconditional Generation: Exploring the general chemical space of stable crystals with 20 atoms or fewer in the unit cell.
- Composition/Space Group Completion: Generating valid structures given a partial prompt (e.g., a chemical formula).
- Fine-tuning base: Serving as the pre-trained initialization for property-conditional models.
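A composition or space-group prompt is just the opening lines of a CIF file that the model then completes. The sketch below builds such a prompt; the exact prompt grammar used by CrystaLLM-pi is an assumption here, and `make_prompt` is an illustrative helper, not part of the released code.

```python
def make_prompt(formula, space_group=None):
    """Build a CIF-style generation prompt.

    Illustrative only: CrystaLLM-style models are typically prompted with
    the `data_` block header for the target formula, optionally followed
    by a symmetry tag; the real tokenizer/prompt format may differ.
    """
    prompt = f"data_{formula}\n"
    if space_group is not None:
        prompt += f"_symmetry_space_group_name_H-M {space_group}\n"
    return prompt

# A composition-plus-space-group prompt for rock-salt NaCl:
print(make_prompt("NaCl", "Fm-3m"))
```

The returned string would be tokenized and fed to the model, which autoregressively generates the remaining cell parameters and atom sites.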
Out-of-Scope Use
- Property Conditioning: This model cannot be steered by properties like band gap or density. Use the specific fine-tuned variants for those tasks.
- Large Unit Cells: The model is strictly trained on and intended for unit cells containing 20 atoms or fewer.
Bias, Risks, and Limitations
- Training Distribution: The model reflects the biases present in the Materials Project dataset. It is biased toward theoretical, DFT-relaxed inorganic compounds rather than experimentally synthesized disordered structures.
- Size Constraint Bias: Because it is trained exclusively on the mp_20 subset, the model has a strong prior for generating small, highly symmetric unit cells (≤ 20 atoms) and will struggle to extrapolate to larger, more complex systems.
- Validity: While it learns CIF syntax robustly, it may still generate physically invalid structures (e.g., overlapping atoms) or chemically unstable compositions.
Training Details
Training Data
The model was pre-trained on the mp_20 dataset (c-bone/mp_20), a curated subset of the Materials Project database restricted to crystal structures containing 20 atoms or fewer per unit cell.
- Source: Materials Project (via c-bone/mp_20)
- Preprocessing: CIFs are filtered for size (≤ 20 atoms), deduplicated, augmented (with symmetry operations and fractional coordinate shifts), and tokenized.
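The size filter and deduplication steps can be sketched as below. The record layout and field names are hypothetical stand-ins; the actual pipeline operates on parsed CIFs and uses its own deduplication criteria.

```python
# Illustrative sketch of the mp_20 preprocessing filter (<= 20 atoms per cell).
# Field names ("formula", "space_group", "num_atoms") are assumptions.
def filter_mp20(records):
    seen = set()
    kept = []
    for rec in records:
        if rec["num_atoms"] > 20:
            continue  # enforce the mp_20 size constraint
        key = (rec["formula"], rec["space_group"])
        if key in seen:
            continue  # crude deduplication on (formula, space group)
        seen.add(key)
        kept.append(rec)
    return kept

records = [
    {"formula": "NaCl", "space_group": 225, "num_atoms": 8},
    {"formula": "NaCl", "space_group": 225, "num_atoms": 8},   # duplicate
    {"formula": "SiO2", "space_group": 152, "num_atoms": 9},
    {"formula": "C60",  "space_group": 2,   "num_atoms": 60},  # too large
]
print(len(filter_mp20(records)))  # 2 structures survive the filter
```

Augmentation (symmetry operations, fractional coordinate shifts) would then be applied to the surviving structures before tokenization.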
Training Procedure
- Architecture: GPT-2 Small (~25.9M parameters).
- Objective: Causal Language Modeling (Next-token prediction).
- Loss Function: Cross-entropy with specific weighting for fixed syntax tokens to accelerate learning of the CIF format.
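The idea of down-weighting fixed syntax tokens can be illustrated with a per-token weighted cross-entropy, sketched here in plain Python. The specific weight values and which tokens receive them are assumptions; the paper's actual weighting scheme may differ.

```python
import math

def weighted_token_ce(probs, targets, weights):
    """Weighted cross-entropy over a token sequence (illustrative sketch).

    probs:   list of per-position probability distributions
    targets: list of target token indices
    weights: per-position loss weights, e.g. a reduced weight for tokens
             that are fixed by CIF syntax (an assumption here)
    """
    losses = [-w * math.log(p[t]) for p, t, w in zip(probs, targets, weights)]
    return sum(losses) / sum(weights)

# Two positions: a fixed syntax token (weight 0.5) and a content token (weight 1.0).
loss = weighted_token_ce(
    probs=[[0.7, 0.3], [0.2, 0.8]],
    targets=[0, 1],
    weights=[0.5, 1.0],
)
print(round(loss, 4))
```

In a real training loop this corresponds to passing a per-token weight tensor into the cross-entropy reduction, so gradient signal is concentrated on the chemically informative tokens.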
Evaluation
Metrics
The model is evaluated based on:
- Validity: The rate at which generated sequences can be parsed as valid CIF files.
- Structural Consistency: Adherence to space group symmetry and reasonable bond lengths.
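A minimal syntactic validity check might look like the sketch below. This is a deliberately crude illustration; real evaluation would parse the generated text with a full CIF parser (e.g., pymatgen's), which is not assumed here.

```python
# Illustrative validity check: does the generated text contain the CIF tags
# any parseable structure must have? The chosen tag set is an assumption.
REQUIRED_TAGS = ("data_", "_cell_length_a", "_atom_site_fract_x")

def looks_like_valid_cif(text):
    return all(tag in text for tag in REQUIRED_TAGS)

sample = (
    "data_NaCl\n"
    "_cell_length_a 5.64\n"
    "_atom_site_fract_x 0.0\n"
)
print(looks_like_valid_cif(sample))        # True
print(looks_like_valid_cif("data_NaCl\n")) # False: cell and site tags missing
```

The validity rate is then simply the fraction of generated sequences that pass the (full) parse.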
Results
The base model achieves high validity rates for small unit cells and effectively learns to generate chemically plausible structures, serving as a robust foundation for downstream tasks requiring rigid size constraints.
Citation
@misc{bone2025discoveryrecoverycrystallinematerials,
title={Discovery and recovery of crystalline materials with property-conditioned transformers},
author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
year={2025},
eprint={2511.21299},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2511.21299},
}