	---
	license: apache-2.0
	base_model: meta-llama/Llama-3.1-8B-Instruct
	tags:
	- education
	- math-tutoring
	- socratic
	- peft
	- lora
	- k12
	language:
	- en
	pipeline_tag: text-generation
	---

	# QLM Socratic Math Tutor

	A Llama 3.1 8B Instruct model fine-tuned with LoRA to act as a Socratic math tutor for K-12 students. Rather than giving answers, the model asks guiding questions that help students reason through math problems themselves.

	## Key Results (Rigorous Evaluation, 95% CI)

	| Metric | Score | 95% CI | n |
	|---|---|---|---|
	| Socratic question rate | 100% | [98%, 100%] | 200 |
	| Relevance to specific student error | 74.5% | [68%, 80%] | 200 |
	| Answer avoidance rate | 96% | [92%, 98%] | 200 |
	| Answer leak rate | 1% | [0.2%, 5.4%] | 100 |
	| Grade-appropriate language | 100% | [98%, 100%] | 200 |

	All metrics were evaluated with heuristic scoring (no LLM-as-judge) under production conditions, that is, with mission context, vocabulary hints, and misconception targeting.

	## How It Works

	The model is trained to be Socratic: when a student makes an error, instead of correcting them, it asks a question that helps them discover the error themselves.

	Student: "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms."

	Model: "If you had 1/3 of a pizza and 1/4 of the same pizza, would you really have less than 1/3 of a pizza total? Try drawing both fractions on the same circle."
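
The hint in this exchange works because the student's answer, 2/7, is smaller than 1/3, which is one of the two parts being added. A quick check with Python's `fractions` module confirms the comparison and the correct sum. This is illustrative only and is not part of the model or its training pipeline.

```python
from fractions import Fraction

student_answer = Fraction(2, 7)                # result of adding tops and bottoms
correct_sum = Fraction(1, 3) + Fraction(1, 4)  # exact rational arithmetic

# The student's answer is smaller than one of the addends, so it cannot be the sum.
answer_is_too_small = student_answer < Fraction(1, 3)

print(correct_sum)          # 7/12
print(answer_is_too_small)  # True
```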

	## Usage

	### With PEFT (recommended)

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel
	import torch

	# Load base model (requires Llama access)
	base_model = AutoModelForCausalLM.from_pretrained(
	    "meta-llama/Llama-3.1-8B-Instruct",
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

	# Load LoRA adapter
	model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")

	# Build prompt
	system = "You are a Socratic math tutor for grade 6-8 students. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences."

	messages = [
	    {"role": "system", "content": system},
	    {"role": "user", "content": "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms"},
	]

	# add_generation_prompt=True appends the assistant header so the model answers as the tutor
	input_ids = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt"
	).to(model.device)

	with torch.no_grad():
	    output = model.generate(input_ids, max_new_tokens=150, temperature=0.7, do_sample=True)

	response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
	print(response)
	```

	### With 4-bit Quantization (for consumer GPUs)

	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, BitsAndBytesConfig

	# Quantize the base model weights to 4 bits; compute runs in float16
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_compute_dtype=torch.float16,
	)

	base_model = AutoModelForCausalLM.from_pretrained(
	    "meta-llama/Llama-3.1-8B-Instruct",
	    quantization_config=quantization_config,
	    device_map="auto",
	)

	model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")
	# Same generation code as above
	```

	## System Prompt

	The model expects the standard Llama 3.1 chat format, with a system prompt that instructs Socratic tutoring behavior. A simple system prompt works:

	```
	You are a Socratic math tutor. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences.
	```
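
For callers who want to switch between this generic prompt and the grade-specific variant used in the Usage section, the message list can be assembled with a small helper. `BASE_PROMPT` and `build_messages` are hypothetical names introduced for illustration; they are not part of the released adapter or any library.

```python
# Hypothetical helper: builds the message list passed to tokenizer.apply_chat_template.
BASE_PROMPT = (
    "You are a Socratic math tutor{audience}. Never give the answer. "
    "Ask guiding questions. Keep responses to 2-3 sentences."
)


def build_messages(student_text, grade_band=None):
    """Return system + user messages; grade_band (e.g. "6-8") is optional."""
    audience = f" for grade {grade_band} students" if grade_band else ""
    return [
        {"role": "system", "content": BASE_PROMPT.format(audience=audience)},
        {"role": "user", "content": student_text},
    ]


messages = build_messages("I think 1/3 + 1/4 = 2/7", grade_band="6-8")
print(messages[0]["content"])
```

With `grade_band="6-8"` the system message matches the prompt in the PEFT example; with no grade band it reduces to the simple prompt above.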

	## Training

	- Base model: meta-llama/Llama-3.1-8B-Instruct
	- Method: LoRA
	- Training data: Synthetic tutoring interactions across K-12 mathematics
	- Hardware: Hugging Face L4 GPU (24GB)
	- Training time: ~4 hours
	- Final loss: 0.306

	## Limitations

	1. Synthetic training data: The model was trained on synthetic data, not real classroom tutoring transcripts. This limits scaffolding specificity — 28% of responses target the specific error, while 68% ask relevant but generic guiding questions.

	2. Answer leak rate: 1% of responses contain the correct answer (detected by exact numeric matching). An answer-leak filter is deployed in production.

	3. Math only: Trained exclusively on K-12 mathematics. Performance on other STEM subjects is untested.

	4. No longitudinal validation: No classroom outcome data yet. Benchmark results measure response quality, not learning gains.

	5. Heuristic evaluation: All evaluation uses keyword/heuristic scoring, not human expert annotation. Human evaluation with math teachers is planned.
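
Item 2 mentions leak detection by exact numeric matching. The card does not publish the detector or the production filter, so the following is only a sketch of one way such a check could work: extract every number from a response and compare each, as an exact rational value, with the known correct answer. `extract_numbers` and `leaks_answer` are hypothetical names.

```python
import re
from fractions import Fraction

# Matches simple fractions ("7/12") or integers/decimals ("12", "0.5").
# The lookbehind stops the "-" in ranges such as "6-8" from being read as a sign.
_NUMBER = re.compile(r"(?<![\w.])-?\d+\s*/\s*\d+|(?<![\w.])-?\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values found in text, as exact Fractions."""
    values = set()
    for token in _NUMBER.findall(text):
        try:
            values.add(Fraction(re.sub(r"\s+", "", token)))
        except (ValueError, ZeroDivisionError):
            continue  # skip malformed tokens such as "3/0"
    return values


def leaks_answer(response, correct_answer):
    """True if the response contains a number equal in value to the correct answer."""
    return Fraction(correct_answer) in extract_numbers(response)


print(leaks_answer("So the sum is 7/12, right?", "7/12"))
print(leaks_answer("What common denominator could 3 and 4 share?", "7/12"))
```

Comparing values rather than strings means `0.5` and `1/2` count as the same answer. A check like this cannot catch answers spelled out in words ("seven twelfths"), so it is a lower bound on leakage under these assumptions.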

	## Evaluation Methodology

	All metrics use 95% confidence intervals. The tutor model was evaluated on n=200 (Socratic quality), n=50 (scaffolding), and n=100 (answer leak). No LLM-as-judge is used; all scoring is heuristic to avoid circularity.
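
The card does not say which interval method produced the reported bounds. As an assumption, the sketch below uses the Wilson score interval, a common choice for proportions near 0% or 100%. With counts inferred from the Key Results table (for example, 1 leak in 100 responses, or 192 of 200 responses avoiding the answer), it gives bounds close to the reported ones. `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts inferred from the reported rates and sample sizes (assumption)
for label, k, n in [("answer leak", 1, 100), ("answer avoidance", 192, 200)]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} -> [{low:.1%}, {high:.1%}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is exactly 100%, as with the Socratic question rate.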

	Full benchmark results: [quantumlearningmachines.com/research/external-benchmark-results](https://quantumlearningmachines.com/research/external-benchmark-results)

	## Part of a Larger System

	This tutor model is one component of the QLM platform — an integrated system for adaptive math learning. The model weights are open. The measurement and orchestration systems that train and improve the model are proprietary.

	## Citation

	```bibtex
	@misc{qlm-math-tutor-2026,
	title={QLM Socratic Math Tutor: An Open-Source Llama 3.1 8B LoRA for K-12 Mathematics},
	author={Quantum Learning Machines},
	year={2026},
	url={https://huggingface.co/QuantumLearningMachines/qlm-math-tutor},
	}
	```

	## Contact

	- Try the tutor: [quantumlearningmachines.com/try-math-tutor](https://quantumlearningmachines.com/try-math-tutor)
	- Benchmarks: [quantumlearningmachines.com/research](https://quantumlearningmachines.com/research/external-benchmark-results)
	- Partnerships: hello@quantumlearningmachines.com