scallopmemory-1

A 4B extraction specialist for local assistants. It reads a conversation and writes down the durable facts worth remembering, or returns nothing when a turn is just chatter.

scallopmemory-1 is a LoRA fine-tune of Qwen3.5-4B, distilled from ScallopBot production traces with a larger model writing the labels. The student never trained on its own generations. The repo ships a q5_k_m GGUF for local serving and the raw adapter for reproduction.

Links: scallopbot.com · GitHub


| Property | Value |
|---|---|
| Base model | Qwen3.5-4B |
| Adapter | LoRA, rank 32, alpha 64, 2 epochs |
| Quant | q5_k_m GGUF (3.16 GB) |
| Context | inherits Qwen3.5-4B |
| Serving | thinking off (chain-of-thought hurts this task at 4B) |
| Output | structured memory entries (durable facts), or empty |

Files

| File | Format | Size | Notes |
|---|---|---|---|
| `scallopmemory-1.q5_k_m.gguf` | GGUF Q5_K_M | 3.16 GB | Recommended for llama.cpp / Ollama / LM Studio |
| `adapter/` | PEFT LoRA | 170 MB | Apply on top of `Qwen/Qwen3.5-4B` with transformers + PEFT |

How to run

Serve with thinking disabled. The model is trained and benchmarked in the no-think path.

llama.cpp

llama-server -m scallopmemory-1.q5_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking":false}'
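
Once the server is up, any OpenAI-style client can send it a conversation. The following is a minimal sketch using only the Python standard library. It assumes the server listens on llama-server's default port 8080 and was started with thinking disabled as shown above. The helper names `build_request` and `extract`, and the choice of temperature 0, are illustrative assumptions rather than part of this repo.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(conversation: str) -> dict:
    """Build an OpenAI-style chat payload carrying the conversation to extract from."""
    return {
        "messages": [{"role": "user", "content": conversation}],
        "temperature": 0.0,  # assumption: deterministic decoding suits extraction
    }


def extract(conversation: str, url: str = SERVER_URL) -> str:
    """POST the conversation to the server and return the raw model output text."""
    body = json.dumps(build_request(conversation)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `extract("<conversation to extract facts from>")` returns the model's text for that conversation, which may be empty when the turn holds nothing worth keeping.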

Ollama

ollama run hf.co/tashfene/scallopmemory-1:Q5_K_M

Python (llama-cpp-python)

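from llama_cpp import Llama

# Download the GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="tashfene/scallopmemory-1",
    filename="scallopmemory-1.q5_k_m.gguf",
)
# Send the conversation as a single user message.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "<conversation to extract facts from>"}],
)
# The extracted entries (or an empty result) are in the first choice.
print(out["choices"][0]["message"]["content"])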

Adapter on the base model (transformers + PEFT)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from this repo's adapter/ folder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
model = PeftModel.from_pretrained(base, "tashfene/scallopmemory-1", subfolder="adapter")
# Load the tokenizer from the same adapter/ folder.
tok = AutoTokenizer.from_pretrained("tashfene/scallopmemory-1", subfolder="adapter")

Intended use

An assistant runs this after a conversation to decide what to persist to long-term memory. Two failure modes hurt: writing down noise, and missing a real fact. Most turns produce nothing, so the harder half of the job is staying quiet without going silent on the turns that matter.
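
The persistence decision described above can be sketched as a small gate in front of the memory store. This card does not spell out the output schema, so the sketch assumes the entries arrive as a JSON array of objects. The function names and the list-backed `store` are hypothetical placeholders to adapt to your own format.

```python
import json


def parse_entries(raw: str) -> list[dict]:
    """Parse model output into memory entries; empty output means nothing to persist.

    Assumption: entries arrive as a JSON array of objects. Adapt to your schema.
    """
    text = raw.strip()
    if not text:
        return []
    data = json.loads(text)  # raises ValueError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of memory entries")
    # Drop anything that is not a non-empty object.
    return [entry for entry in data if isinstance(entry, dict) and entry]


def persist_if_any(raw: str, store: list[dict]) -> int:
    """Append parsed entries to `store` and return how many were written."""
    entries = parse_entries(raw)
    store.extend(entries)
    return len(entries)
```

Treating an empty output as a normal result rather than an error matters here, since most turns are expected to produce nothing.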

Evaluation

33 extraction cases held out from real sessions, none seen in training. Same harness for every model, thinking off.

| Model | Teacher agreement | Parse success | Median latency |
|---|---|---|---|
| Qwen3.6-35B MoE | 0.877 | 57.6% | 41.1s |
| Qwen3.6-Plus (the teacher) | 0.748 | 100% | 31.1s |
| scallopmemory-1 | 0.725 | 100% | 4.2s |
| Qwen3.5-4B (stock) | 0.695 | 100% | 8.8s |

Read this as parity, not a win. The 4B lands close to the hosted model that taught it (0.725 vs 0.748 agreement), at roughly a seventh of the latency (4.2s vs 31.1s) and on local hardware. The 35B scores higher on agreement but parses cleanly only 57.6% of the time, so it drops about two of every five outputs and cannot sit in a pipeline as is. Among models that return valid structure every time, the 4B sits just behind the teacher and beats the stock base it came from (0.695).
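
The comparisons in this section reduce to a few lines of arithmetic on the table. The sketch below copies its figures from the table and assumes nothing else; the variable names are illustrative.

```python
# Figures copied from the evaluation table.
CASES = 33
results = {
    "Qwen3.6-35B MoE": {"agreement": 0.877, "parse": 0.576, "latency_s": 41.1},
    "Qwen3.6-Plus (teacher)": {"agreement": 0.748, "parse": 1.0, "latency_s": 31.1},
    "scallopmemory-1": {"agreement": 0.725, "parse": 1.0, "latency_s": 4.2},
    "Qwen3.5-4B (stock)": {"agreement": 0.695, "parse": 1.0, "latency_s": 8.8},
}

student = results["scallopmemory-1"]
teacher = results["Qwen3.6-Plus (teacher)"]
stock = results["Qwen3.5-4B (stock)"]

# Latency of the 4B relative to the teacher.
latency_ratio = student["latency_s"] / teacher["latency_s"]

# Agreement gap to the teacher and gain over the stock base.
gap_to_teacher = teacher["agreement"] - student["agreement"]
gain_over_stock = student["agreement"] - stock["agreement"]

# Approximate number of the 33 cases the 35B failed to parse.
dropped_35b = round(CASES * (1 - results["Qwen3.6-35B MoE"]["parse"]))

print(f"latency ratio vs teacher: {latency_ratio:.3f}")
print(f"agreement gap to teacher: {gap_to_teacher:.3f}")
print(f"agreement gain over stock: {gain_over_stock:.3f}")
print(f"35B cases that failed to parse: {dropped_35b} of {CASES}")
```

The latency ratio comes out near 0.135, the 4B trails the teacher by 0.023 and leads the stock base by 0.030, and the 35B fails to parse 14 of the 33 cases.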

Extraction quality is where model capacity shows. An 8B base is the obvious next step to clear the teacher rather than match it.

Training

The training traces come from one person's assistant, so the distribution is narrow and personal. The same deterministic anonymizer used for the tools model swaps real names, emails, phone numbers, handles, and project ids for stable fakes, and refuses to write a file if any known real token survives. Anonymized and real-name held-out sets scored within 0.002 of each other.
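
The anonymizer itself is not part of this repo. The sketch below only illustrates the two properties described above: fakes that stay stable across runs, and a hard failure when a known real token survives. The hash-derived fake format, the function names, and the shape of the `known` mapping are all assumptions.

```python
import hashlib
import re


def stable_fake(real: str, kind: str) -> str:
    """Map a real token to a fake that is identical on every run (hash-derived)."""
    digest = hashlib.sha256(f"{kind}:{real}".encode("utf-8")).hexdigest()[:8]
    return f"{kind}_{digest}"


def anonymize(text: str, known: dict[str, str]) -> str:
    """Replace every known real token with its stable fake.

    `known` maps real token -> kind (e.g. "name", "email"); hypothetical shape.
    """
    # Longest tokens first so a full name is replaced before its first name.
    for real in sorted(known, key=len, reverse=True):
        fake = stable_fake(real, known[real])
        pattern = re.compile(re.escape(real), re.IGNORECASE)
        text = pattern.sub(lambda _match, fake=fake: fake, text)
    return text


def assert_clean(text: str, known: dict[str, str]) -> None:
    """Refuse to proceed if any known real token survived anonymization."""
    lowered = text.lower()
    leaked = [real for real in known if real.lower() in lowered]
    if leaked:
        raise ValueError(f"{len(leaked)} real token(s) survived; not writing file")
```

Running `assert_clean` on the anonymized text immediately before writing makes a leak stop the pipeline instead of silently reaching the training set.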

One detail mattered more than the rest. An early run collapsed because most freshly distilled examples were empty extractions from background chatter, which taught the model to write down nothing. Capping empty examples per session moved agreement from 0.70 to 0.725. If you train your own extractor, watch the share of empty targets.
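
Capping empty targets per session can be sketched as a single pass over the examples. The card does not state the cap that was used, so `max_empty=2` is a placeholder, and the `session_id` and `target` field names are hypothetical.

```python
from collections import defaultdict


def cap_empty_per_session(examples: list[dict], max_empty: int = 2) -> list[dict]:
    """Keep every non-empty example, but at most `max_empty` empty ones per session.

    Assumption: each example has a "session_id" and a "target" string, where an
    empty or whitespace-only target means "nothing to remember".
    """
    empty_seen: dict[str, int] = defaultdict(int)
    kept = []
    for ex in examples:
        if ex["target"].strip():
            kept.append(ex)
            continue
        if empty_seen[ex["session_id"]] < max_empty:
            empty_seen[ex["session_id"]] += 1
            kept.append(ex)
    return kept


def empty_share(examples: list[dict]) -> float:
    """Fraction of examples whose target is empty."""
    if not examples:
        return 0.0
    return sum(1 for ex in examples if not ex["target"].strip()) / len(examples)
```

Comparing `empty_share` before and after the cap gives a quick read on how far the training mix leans toward silence.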

Limitations and bias

One user's data, one memory schema. Your facts and format will differ.
0.725 agreement means it disagrees with the teacher on roughly a quarter of cases. Check its output before trusting it as ground truth.
Capacity-bound. A larger base would likely extract better; 4B is the floor for this task, not the ceiling.
Trained on a single individual's data, so it inherits that person's notion of what counts as memorable.

License

Apache-2.0, inherited from the Qwen3.5-4B base.
