metadata
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - education
  - math-tutoring
  - socratic
  - peft
  - lora
  - k12
language:
  - en
pipeline_tag: text-generation

QLM Socratic Math Tutor

A Llama 3.1 8B Instruct model fine-tuned with LoRA to act as a Socratic math tutor for K-12 students. Rather than giving answers, the model is trained to ask guiding questions that help students reason through math problems themselves.

Key Results (Rigorous Evaluation, 95% CI)

| Metric | Score | 95% CI | n |
|---|---|---|---|
| Socratic question rate | 100% | [98%, 100%] | 200 |
| Relevance to specific student error | 74.5% | [68%, 80%] | 200 |
| Answer avoidance rate | 96% | [92%, 98%] | 200 |
| Answer leak rate | 1% | [0.2%, 5.4%] | 100 |
| Grade-appropriate language | 100% | [98%, 100%] | 200 |

All metrics were evaluated with heuristic scoring (no LLM-as-judge) under production conditions: mission context, vocabulary hints, and misconception targeting.

How It Works

The model is trained to be Socratic: when a student makes an error, instead of correcting them, it asks a question that helps them discover the error themselves.

Student: "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms."

Model: "If you had 1/3 of a pizza and 1/4 of the same pizza, would you really have less than 1/3 of a pizza total? Try drawing both fractions on the same circle."
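The arithmetic behind the tutor's question can be checked directly. The sketch below uses Python's standard `fractions` module to show that the student's result (adding numerators and denominators) is smaller than one of the parts being added, which is the contradiction the guiding question points at. The variable names are illustrative only.

```python
from fractions import Fraction

# The student's method: add the tops and add the bottoms.
student_result = Fraction(1 + 1, 3 + 4)          # 2/7

# The actual sum, computed over a common denominator.
actual_sum = Fraction(1, 3) + Fraction(1, 4)     # 7/12

# The contradiction the tutor hints at: the "sum" is smaller than
# one of the pieces that went into it.
print(student_result < Fraction(1, 3))  # True
print(actual_sum)                       # 7/12
```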

Usage

With PEFT (recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model (requires Llama access)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")

# Build prompt
system = "You are a Socratic math tutor for grade 6-8 students. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms"},
]

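# add_generation_prompt=True appends the assistant header so the model replies as the tutor
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)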

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=150, temperature=0.7, do_sample=True)

response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

With 4-bit Quantization (for consumer GPUs)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")
# Load the tokenizer and generate exactly as in the PEFT example above

System Prompt

The model expects the standard Llama chat format, with a system prompt that instructs Socratic tutoring behavior. A simple system prompt works:

You are a Socratic math tutor. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences.
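The usage example adds a grade band ("for grade 6-8 students") to this prompt. A small helper can build both variants. This is a minimal sketch: `build_messages` and its `grade_band` parameter are hypothetical names, not part of the released model or any library.

```python
def build_messages(student_message, grade_band=None):
    """Return a chat message list with the Socratic system prompt.

    grade_band is an optional string such as "6-8" (hypothetical parameter).
    """
    audience = f" for grade {grade_band} students" if grade_band else ""
    system = (
        f"You are a Socratic math tutor{audience}. Never give the answer. "
        "Ask guiding questions. Keep responses to 2-3 sentences."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": student_message},
    ]


messages = build_messages(
    "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms",
    grade_band="6-8",
)
```

The returned list can be passed to `tokenizer.apply_chat_template` in the same way as the `messages` list in the usage example.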

Training

  • Base model: meta-llama/Llama-3.1-8B-Instruct

  • Method: LoRA

  • Training data: Synthetic tutoring interactions across K-12 mathematics

  • Hardware: Hugging Face L4 GPU (24 GB)

  • Training time: ~4 hours

  • Final loss: 0.306

Limitations

  1. Synthetic training data: The model was trained on synthetic data, not real classroom tutoring transcripts. This limits scaffolding specificity — 28% of responses target the specific error, while 68% ask relevant but generic guiding questions.

  2. Answer leak rate: 1% of responses contain the correct answer (detected by exact numeric matching). An answer-leak filter is deployed in production.

  3. Math only: Trained exclusively on K-12 mathematics. Performance on other STEM subjects is untested.

  4. No longitudinal validation: No classroom outcome data yet. Benchmark results measure response quality, not learning gains.

  5. Heuristic evaluation: All evaluation uses keyword/heuristic scoring, not human expert annotation. Human evaluation with math teachers is planned.
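Limitation 2 says leaks are detected by exact numeric matching. The production filter is not published, so the following is only a sketch of what such a check could look like: it extracts integers, decimals, and simple fractions from a response and compares each one exactly against the known answer. The function names and the regular expression are assumptions made for illustration.

```python
import re
from fractions import Fraction

# Integers, decimals, and simple fractions such as 7/12 (illustrative pattern).
# The lookbehind keeps "2-3" from being read as 2 and -3.
_NUMBER = re.compile(r"(?<!\d)-?\d+(?:\.\d+)?(?:/\d+)?")


def _parse(token):
    """Convert a numeric token to an exact Fraction, or None if it is not valid."""
    try:
        return Fraction(token)
    except (ValueError, ZeroDivisionError):
        return None


def contains_answer(response, answer):
    """Return True if any number in `response` equals `answer` exactly."""
    target = _parse(answer)
    if target is None:
        raise ValueError(f"answer is not a parseable number: {answer!r}")
    return any(_parse(m.group()) == target for m in _NUMBER.finditer(response))


print(contains_answer("So the sum is 7/12.", "7/12"))
print(contains_answer("Would you really have less than 1/3 of a pizza?", "7/12"))
```

Comparing as exact fractions means equivalent forms such as `0.50` and `1/2` match, while answers written out in words would be missed. A real filter would need to handle those cases separately.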

Evaluation Methodology

All metrics are reported with 95% confidence intervals. The tutor model was evaluated on n=200 responses for Socratic quality, n=50 for scaffolding, and n=100 for answer leaks. No LLM-as-judge was used; all scoring is heuristic, to avoid circularity.
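The card does not say how the confidence intervals were computed. The intervals in the Key Results table are consistent with the Wilson score interval for a binomial proportion, so the sketch below uses that method as an assumption. The success counts (for example 149 of 200 for 74.5%) are derived from the reported percentages and sample sizes, not taken from raw evaluation data.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts implied by the reported scores (assumed, see note above).
for label, successes, n in [
    ("Socratic question rate", 200, 200),
    ("Relevance to specific error", 149, 200),
    ("Answer avoidance rate", 192, 200),
    ("Answer leak rate", 1, 100),
]:
    low, high = wilson_interval(successes, n)
    print(f"{label}: [{low:.1%}, {high:.1%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and gives a non-degenerate range when the observed rate is 0% or 100%, which matters for the two 100% rows.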

Full benchmark results: quantumlearningmachines.com/research/external-benchmark-results

Part of a Larger System

This tutor model is one component of the QLM platform — an integrated system for adaptive math learning. The model weights are open. The measurement and orchestration systems that train and improve the model are proprietary.

Citation

@misc{qlm-math-tutor-2026,
  title={QLM Socratic Math Tutor: An Open-Source Llama 3.1 8B LoRA for K-12 Mathematics},
  author={Quantum Learning Machines},
  year={2026},
  url={https://huggingface.co/QuantumLearningMachines/qlm-math-tutor},
}

Contact