---
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- education
- math-tutoring
- socratic
- peft
- lora
- k12
language:
- en
pipeline_tag: text-generation
---
# QLM Socratic Math Tutor
A Llama 3.1 8B Instruct model fine-tuned with LoRA to be a **Socratic math tutor** for K-12 students. The model is trained to withhold answers and instead ask guiding questions that help students reason through math problems themselves.
## Key Results (Rigorous Evaluation, 95% CI)
| Metric | Score | 95% CI | n |
|---|---|---|---|
| Socratic question rate | 100% | [98%, 100%] | 200 |
| Relevance to specific student error | 74.5% | [68%, 80%] | 200 |
| Answer avoidance rate | 96% | [92%, 98%] | 200 |
| Answer leak rate | 1% | [0.2%, 5.4%] | 100 |
| Grade-appropriate language | 100% | [98%, 100%] | 200 |
All metrics were evaluated with heuristic scoring (no LLM-as-judge) under production conditions, which include mission context, vocabulary hints, and misconception targeting.
## How It Works
The model is trained to be Socratic: when a student makes an error, instead of correcting them, it asks a question that helps them discover the error themselves.
**Student:** "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms."
**Model:** "If you had 1/3 of a pizza and 1/4 of the same pizza, would you really have less than 1/3 of a pizza total? Try drawing both fractions on the same circle."
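The guiding question works because the student's proposed sum is smaller than one of the addends, which cannot happen when adding positive quantities:

$$
\frac{2}{7} = \frac{6}{21} < \frac{7}{21} = \frac{1}{3}
$$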
## Usage
### With PEFT (recommended)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model (requires Llama access)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")

# Build prompt
system = "You are a Socratic math tutor for grade 6-8 students. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences."
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms"},
]

# add_generation_prompt=True appends the assistant header so the model replies as the tutor
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=150, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### With 4-bit Quantization (for consumer GPUs)
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit loading requires the bitsandbytes package
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")
# Tokenizer and generation code are the same as above
```
## System Prompt
The model expects the standard Llama chat format, with a system prompt that instructs Socratic tutoring behavior. A simple system prompt works:
```
You are a Socratic math tutor. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences.
```
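To target a grade band, the Usage example extends this prompt with "for grade 6-8 students". The sketch below shows one way to build the message list for either variant. The helper name `build_messages` and its `grade_band` parameter are hypothetical and are not part of the released model or the QLM platform.

```python
from typing import Dict, List, Optional

# Instruction text shared by both prompt variants shown in this README.
BASE_INSTRUCTIONS = (
    "Never give the answer. Ask guiding questions. "
    "Keep responses to 2-3 sentences."
)


def build_messages(student_text: str, grade_band: Optional[str] = None) -> List[Dict[str, str]]:
    """Return a chat message list (hypothetical helper, for illustration only)."""
    role = "You are a Socratic math tutor"
    if grade_band:
        # Mirrors the "for grade 6-8 students" wording in the Usage example.
        role += f" for grade {grade_band} students"
    system = f"{role}. {BASE_INSTRUCTIONS}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": student_text},
    ]


messages = build_messages("I think 1/3 + 1/4 = 2/7", grade_band="6-8")
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` as in the Usage section.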
## Training
- **Base model:** meta-llama/Llama-3.1-8B-Instruct
- **Method:** LoRA
- **Training data:** Synthetic tutoring interactions across K-12 mathematics
- **Hardware:** Hugging Face L4 GPU (24 GB)
- **Training time:** ~4 hours
- **Final loss:** 0.306
## Limitations
1. **Synthetic training data:** The model was trained on synthetic data, not real classroom tutoring transcripts. This limits scaffolding specificity: 28% of responses target the specific error, while 68% ask relevant but generic guiding questions.
2. **Answer leak rate:** 1% of responses contain the correct answer (detected by exact numeric matching). An answer-leak filter is deployed in production.
3. **Math only:** Trained exclusively on K-12 mathematics. Performance on other STEM subjects is untested.
4. **No longitudinal validation:** No classroom outcome data yet. Benchmark results measure response quality, not learning gains.
5. **Heuristic evaluation:** All evaluation uses keyword/heuristic scoring, not human expert annotation. Human evaluation with math teachers is planned.
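The exact-numeric-matching check mentioned in item 2 could look roughly like the sketch below. This is an illustration under stated assumptions, not the production filter: the function names, the regular expression, and the choice to compare values as exact rationals (so `14/24` counts as a match for `7/12`) are all hypothetical.

```python
import re
from fractions import Fraction
from typing import List

# Matches simple fractions ("7/12"), decimals ("0.5"), and integers ("12").
# The lookbehind stops the hyphen in ranges such as "6-8" from being read as a minus sign.
_NUMBER = re.compile(r"(?<!\d)-?(?:\d+/\d+|\d+(?:\.\d+)?)")


def extract_numbers(text: str) -> List[Fraction]:
    """Return every number found in the text as an exact Fraction."""
    values = []
    for token in _NUMBER.findall(text):
        try:
            values.append(Fraction(token))
        except ZeroDivisionError:
            # Skip tokens such as "1/0" that are not valid numbers.
            continue
    return values


def leaks_answer(response: str, correct_answer: str) -> bool:
    """True if the response contains a number equal to the correct answer."""
    return Fraction(correct_answer) in extract_numbers(response)


tutor_reply = "If you had 1/3 of a pizza and 1/4 of the same pizza, would you have less than 1/3?"
print(leaks_answer(tutor_reply, "7/12"))
print(leaks_answer("The answer is 7/12.", "7/12"))
```

A check like this only catches answers written as digits; an answer spelled out in words ("seven twelfths") would pass through undetected.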
## Evaluation Methodology
All metrics use 95% confidence intervals. The tutor model was evaluated on n=200 (Socratic quality), n=50 (scaffolding), and n=100 (answer leak). No LLM-as-judge is used; all scoring is heuristic to avoid circularity.
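This README does not state which interval method was used. The intervals in the Key Results table are numerically consistent with a Wilson score interval, so the sketch below assumes that method. The success counts are inferred from the reported rates and sample sizes (for example, 74.5% of 200 is 149) and are likewise an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts inferred from the Key Results table (assumed, not published).
inferred_counts = [
    ("Socratic question rate", 200, 200),
    ("Relevance to specific student error", 149, 200),
    ("Answer avoidance rate", 192, 200),
    ("Answer leak rate", 1, 100),
]
for label, k, n in inferred_counts:
    low, high = wilson_interval(k, n)
    print(f"{label}: [{low:.1%}, {high:.1%}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and has non-zero width when the observed rate is 0% or 100%, which matters for the two metrics reported at 100%.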
Full benchmark results: [quantumlearningmachines.com/research/external-benchmark-results](https://quantumlearningmachines.com/research/external-benchmark-results)
## Part of a Larger System
This tutor model is one component of the QLM platform, an integrated system for adaptive math learning. The model weights are open. The measurement and orchestration systems that train and improve the model are proprietary.
## Citation
```bibtex
@misc{qlm-math-tutor-2026,
title={QLM Socratic Math Tutor: An Open-Source Llama 3.1 8B LoRA for K-12 Mathematics},
author={Quantum Learning Machines},
year={2026},
url={https://huggingface.co/QuantumLearningMachines/qlm-math-tutor},
}
```
## Contact
- Try the tutor: [quantumlearningmachines.com/try-math-tutor](https://quantumlearningmachines.com/try-math-tutor)
- Benchmarks: [quantumlearningmachines.com/research](https://quantumlearningmachines.com/research/external-benchmark-results)
- Partnerships: hello@quantumlearningmachines.com