scallopmemory-1

A 4B extraction specialist for local assistants. It reads a conversation and writes down the durable facts worth remembering, or returns nothing when a turn is just chatter.

scallopmemory-1 is a LoRA fine-tune of Qwen3.5-4B, distilled from ScallopBot production traces with a larger model writing the labels. The student never trained on its own generations. The repo ships a q5_k_m GGUF for local serving and the raw adapter for reproduction.

Links: scallopbot.com · GitHub

Base model: Qwen3.5-4B
Adapter: LoRA, rank 32, alpha 64, 2 epochs
Quant: q5_k_m GGUF (3.16 GB)
Context: inherited from Qwen3.5-4B
Serving: thinking off (chain-of-thought hurts this task at 4B)
Output: structured memory entries (durable facts), or empty

Files

| File | Format | Size | Notes |
|---|---|---|---|
| scallopmemory-1.q5_k_m.gguf | GGUF Q5_K_M | 3.16 GB | Recommended for llama.cpp / Ollama / LM Studio |
| adapter/ | PEFT LoRA | 170 MB | Apply on top of Qwen/Qwen3.5-4B with transformers + PEFT |

How to run

Serve with thinking disabled. The model is trained and benchmarked in the no-think path.

llama.cpp

llama-server -m scallopmemory-1.q5_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking":false}'

Ollama

ollama run hf.co/tashfene/scallopmemory-1:Q5_K_M

Python (llama-cpp-python)

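from llama_cpp import Llama

# Downloads the GGUF from the Hub on first use, then loads it locally.
llm = Llama.from_pretrained(
    repo_id="tashfene/scallopmemory-1",
    filename="scallopmemory-1.q5_k_m.gguf",
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "<conversation to extract facts from>"}],
)
# The extracted entries (or an empty result) are in the assistant message.
print(out["choices"][0]["message"]["content"])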

Adapter on the base model (transformers + PEFT)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
model = PeftModel.from_pretrained(base, "tashfene/scallopmemory-1", subfolder="adapter")
tok = AutoTokenizer.from_pretrained("tashfene/scallopmemory-1", subfolder="adapter")

Intended use

An assistant runs this after a conversation to decide what to persist to long-term memory. Two failure modes hurt: writing down noise, and missing a real fact. Most turns produce nothing, so the harder half of the job is staying quiet without going silent on the turns that matter.
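A minimal sketch of the gate an assistant might put between the model and its memory store. The card does not specify the output schema, so the JSON-array-of-objects shape and the name `parse_memory_output` are assumptions for illustration only. The point is that "nothing to remember" and "unparseable output" are kept apart, so a parse failure is never mistaken for a quiet turn.

```python
import json
from typing import Any


def parse_memory_output(raw: str) -> list[dict[str, Any]]:
    """Return extracted entries, or [] when the model stayed quiet.

    Assumes (hypothetically) that the model emits a JSON array of objects.
    Anything else raises ValueError so failures can be counted, not dropped.
    """
    text = raw.strip()
    if not text:
        return []  # empty output: a chatter-only turn, nothing to persist
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable extraction output: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(e, dict) for e in data):
        raise ValueError("expected a JSON array of objects")
    return data
```

Swap the shape check for whatever schema your own memory store expects.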

Evaluation

33 extraction cases held out from real sessions, none seen in training. Same harness for every model, thinking off.

| Model | Teacher agreement | Parse success | Median latency |
|---|---|---|---|
| Qwen3.6-35B MoE | 0.877 | 57.6% | 41.1s |
| Qwen3.6-Plus (the teacher) | 0.748 | 100% | 31.1s |
| scallopmemory-1 | 0.725 | 100% | 4.2s |
| Qwen3.5-4B (stock) | 0.695 | 100% | 8.8s |

Read this as parity, not a win. The 4B lands close to the hosted model that taught it (0.725 vs. 0.748), at roughly a seventh of the latency (4.2s vs. 31.1s) and on local hardware. The 35B scores higher on agreement but parses cleanly only 57.6% of the time, so it drops roughly two of every five outputs and cannot sit in a pipeline as is. Among models that return valid structure every time, the 4B trails the teacher by 0.023 and beats the stock base it came from by 0.030.
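The comparison can be recomputed from the table. The short script below uses only the published figures and the 33-case set size; the variable and function names are illustrative.

```python
# Figures copied from the evaluation table; the held-out set has 33 cases.
CASES = 33
results = {
    "Qwen3.6-35B MoE": {"agreement": 0.877, "parse": 0.576, "latency_s": 41.1},
    "Qwen3.6-Plus (teacher)": {"agreement": 0.748, "parse": 1.0, "latency_s": 31.1},
    "scallopmemory-1": {"agreement": 0.725, "parse": 1.0, "latency_s": 4.2},
    "Qwen3.5-4B (stock)": {"agreement": 0.695, "parse": 1.0, "latency_s": 8.8},
}


def parsed_cases(parse_rate: float, total: int = CASES) -> int:
    """Convert a parse-success rate back into a whole number of cases."""
    return round(parse_rate * total)


student = results["scallopmemory-1"]
teacher = results["Qwen3.6-Plus (teacher)"]
stock = results["Qwen3.5-4B (stock)"]

# How many times faster the student is than the teacher (median latency).
speedup_vs_teacher = teacher["latency_s"] / student["latency_s"]
# Cases the 35B fails to parse out of 33.
dropped_35b = CASES - parsed_cases(results["Qwen3.6-35B MoE"]["parse"])
# Agreement gaps: student vs. teacher, and student vs. stock base.
gap_to_teacher = teacher["agreement"] - student["agreement"]
gain_over_stock = student["agreement"] - stock["agreement"]

print(f"speedup vs teacher: {speedup_vs_teacher:.1f}x")
print(f"35B dropped outputs: {dropped_35b} of {CASES}")
print(f"gap to teacher: {gap_to_teacher:.3f}, gain over stock: {gain_over_stock:.3f}")
```

With only 33 cases, a single case moves parse success by about three percentage points, so small differences in any column should be read loosely.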

Extraction quality is where model capacity shows. An 8B base is the obvious next step to clear the teacher rather than match it.

Training

Traces from one person's assistant, so the distribution is narrow and personal. The same deterministic anonymizer as the tools model swaps real names, emails, phones, handles, and project ids for stable fakes and refuses to write a file if any known real token survives. Anonymized and real-name held-out sets scored within 0.002 of each other.
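The anonymizer itself is not published here, so the following is only a sketch of the two properties described: a real token always maps to the same fake, and a final check refuses output if any known real token survives. The salt, the `kind_hash` fake format, and all function names are hypothetical.

```python
import hashlib
import re

SALT = "example-salt"  # hypothetical; keep it fixed so fakes stay stable across runs


def stable_fake(real: str, kind: str) -> str:
    """Map a real token to the same fake every time (salted hash, case-insensitive)."""
    digest = hashlib.sha256(f"{SALT}:{kind}:{real.lower()}".encode()).hexdigest()
    return f"{kind}_{digest[:8]}"


def anonymize(text: str, known: dict[str, str]) -> str:
    """Replace every known real token (mapping real -> kind) with its stable fake."""
    # Longest tokens first, so a full name is replaced before any shorter substring.
    for real in sorted(known, key=len, reverse=True):
        fake = stable_fake(real, known[real])
        text = re.sub(re.escape(real), lambda _m, f=fake: f, text, flags=re.IGNORECASE)
    return text


def assert_clean(text: str, known: dict[str, str]) -> None:
    """Raise instead of letting text through if any known real token survives."""
    lowered = text.lower()
    leaked = [real for real in known if real.lower() in lowered]
    if leaked:
        raise ValueError(f"refusing to write: {len(leaked)} real token(s) survived")
```

Calling `assert_clean` immediately before any file write gives the fail-closed behavior: a missed replacement stops the export rather than leaking into the training set.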

One detail mattered more than the rest. An early run collapsed because most freshly distilled examples were empty extractions from background chatter, which taught the model to write down nothing. Capping empty examples per session moved agreement from 0.70 to 0.725. If you train your own extractor, watch the share of empty targets.
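A minimal sketch of that cap, assuming each training example is a dict with a `session_id` and a `target` that is empty for no-op turns. The field names and the default cap of 2 are hypothetical; the card does not state the value that was used.

```python
from collections import defaultdict


def cap_empty_examples(examples: list[dict], max_empty_per_session: int = 2) -> list[dict]:
    """Keep all non-empty examples and at most N empty ones per session."""
    kept = []
    empties_seen: dict[str, int] = defaultdict(int)
    for ex in examples:
        if ex["target"]:  # non-empty extraction: always keep
            kept.append(ex)
            continue
        if empties_seen[ex["session_id"]] < max_empty_per_session:
            empties_seen[ex["session_id"]] += 1
            kept.append(ex)
    return kept


def empty_share(examples: list[dict]) -> float:
    """Fraction of examples whose target is empty (0.0 for an empty dataset)."""
    if not examples:
        return 0.0
    return sum(1 for ex in examples if not ex["target"]) / len(examples)
```

Logging `empty_share` before and after the cap is a cheap way to watch the balance between quiet turns and real extractions.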

Limitations and bias

  • One user's data, one memory schema. Your facts and format will differ.
  • 0.725 agreement means it disagrees with the teacher on roughly a quarter of cases. Check its output before trusting it as ground truth.
  • Capacity-bound. A larger base would likely extract better; 4B is the floor for this task, not the ceiling.
  • Trained on a single individual's data, so it inherits that person's notion of what counts as memorable.

License

Apache-2.0, inherited from the Qwen3.5-4B base.

Evaluation results

Self-reported, on ScallopBot held-out traces (33 cases):

  • Teacher agreement: 0.725
  • Parse success: 100%
  • Struct valid: 100%