Aletheia 1.5 — the decision-memory gate, now an encoder

ἀλήθεια — "un-forgetting". Aletheia decides what is worth remembering: given a short candidate utterance from a coding session (a commit subject, or a turn from an AI-coding conversation), it answers one question — is this a real, substantive engineering decision, or is it noise? It is the write-gate of Memtrace's Cortex decision memory: only what Aletheia confidently judges a decision is kept.

Aletheia 1.5 swaps the decoder for an encoder. Where Aletheia 1.0 was a 1.5B Qwen2.5 decoder, 1.5 is a fine-tuned DeBERTa-v3-large — the same precision at a fraction of the footprint. For a 64-token binary classifier, the encoder is simply the right tool.

Task: binary sequence classification — decision vs noise.
Base: microsoft/deberta-v3-large (435M), fully fine-tuned as a sequence classifier.
Format: INT8-quantized ONNX (642 MB on disk, ~1.2 GB resident), runs on-device via ONNX Runtime — no GPU, no network, no per-call cost.
Output: a single logit → P(decision) = sigmoid(logit / T) with calibration temperature T = 0.784, default keep threshold 0.61.
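
The output rule above can be sketched in a few lines. This is a minimal illustration, not the shipped gate: the temperature and threshold come from this card, while the function names and the choice of `>=` at the boundary are assumptions made for the example.

```python
import math

T = 0.784          # calibration temperature quoted in this card
THRESHOLD = 0.61   # default keep threshold quoted in this card

def p_decision(logit: float, temperature: float = T) -> float:
    """Temperature-scaled sigmoid: P(decision) = sigmoid(logit / T)."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)            # stable branch for very negative logits
    return e / (1.0 + e)

def keep(logit: float, threshold: float = THRESHOLD) -> bool:
    """Assumed gate rule: keep when the probability reaches the threshold."""
    return p_decision(logit) >= threshold

def logit_cutoff(threshold: float = THRESHOLD, temperature: float = T) -> float:
    """Raw-logit value at which P(decision) equals the threshold."""
    return temperature * math.log(threshold / (1.0 - threshold))
```

With T = 0.784 and a threshold of 0.61, the cutoff works out to a raw logit of about 0.35, so a caller that only has logits can compare against that value directly.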

Why an encoder (and not a bigger, newer decoder)

We tested the obvious upgrades (newer and smaller decoders), and the encoder won on the axis that matters for a gate: precision per byte of RAM. For short-text binary classification, a fine-tuned encoder learns a cleaner decision boundary than a decoder many times its size; published comparisons report a fine-tuned ~0.4B DeBERTa-v3-large beating fine-tuned 7B–13B decoders on binary tasks. On our data it matched the 1.5B at roughly a third to two-fifths of the resident RAM (~1.2 GB vs ~2.8–3.5 GB), with trivial INT8/ONNX export and an MIT license. (Notably, what the flashy newest small models add, such as multimodal inputs, gated-delta layers, and "thinking" modes, are anti-features here: dead weight and export friction for a 64-token classifier.)

Results

Evaluated on held-out, leakage-guarded test sets — the same splits and protocol as Aletheia 1.0, apples-to-apples. The full sweep that selected the model:

| Model | Params | Commit-register AUC (n=195) | Conversational AUC (n=1,589) | Resident RAM |
|---|---|---|---|---|
| Aletheia 1.0 (Qwen2.5-1.5B) | 1.5B | 0.844 | 0.933 | ~2.8–3.5 GB |
| Qwen2.5-0.5B (collapsed, rejected) | 0.5B | 0.695 | 0.861 | ~1.5 GB |
| Qwen3-0.6B (stable, near-parity) | 596M | 0.808 | 0.922 | ~0.6 GB |
| DeBERTa-v3-large · LoRA | 435M | 0.836 | 0.919 | ~0.5 GB |
| Aletheia 1.5 (DeBERTa-v3-large · full-FT) | 435M | 0.832 | 0.933 | ~1.2 GB |

Conversational AUC matches the 1.5B exactly (0.933 for both); commit AUC is within test noise (0.832 vs 0.844 on a 195-example set). Precision is a dial: at the default threshold (0.61), 80% of what it stores is a genuine decision; in "clean mode" it rises to **88% precision** on the validation set. The probabilities are temperature-scaled, so the threshold means what it says.
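
To give a feel for what "within test noise" means on 195 examples, here is a rough sketch using the Hanley and McNeil (1982) approximation for the standard error of an AUC. The card does not state the class balance of the commit-register set, so the 98/97 split below is purely an assumption for illustration, and the function name is hypothetical.

```python
import math

def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Hanley & McNeil (1982) approximation to the standard error of an AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# ASSUMPTION: a near-even split of the 195 examples; the real balance is not given.
n_pos, n_neg = 98, 97
se_15 = auc_standard_error(0.832, n_pos, n_neg)   # Aletheia 1.5
se_10 = auc_standard_error(0.844, n_pos, n_neg)   # Aletheia 1.0
gap = 0.844 - 0.832
```

Under that assumed split, each standard error comes out just under 0.03, well above the 0.012 gap. Both models are scored on the same examples, so a paired comparison would be the more precise tool; this sketch only indicates the scale of the sampling noise.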

What it keeps vs. drops

| Kept (`P(decision)` high) | P | Dropped (`P(decision)` low) | P |
|---|---|---|---|
| "Use Postgres instead of MongoDB for the event store" | 0.95 | "thanks that looks good to me" | 0.05 |
| "Determinism is structural: frozen FNV-1a hash, fixed EMBED_DIM=256" | 0.94 | "needs to be more space between nodes" | 0.08 |
| "Drop ArcadeDB and migrate the graph to MemDB" | 0.95 | "it has to be fluently and not something I discover" | 0.08 |

It correctly rejects agent narration, chit-chat, and context-free fragments while keeping self-contained engineering decisions.

Intended use

The decision write-gate / proposer for a code-decision memory system. It is register-robust — trained on both git-commit subjects and conversational turns, so one model scores both streams. Downstream, a deterministic check (a code edit binding to the turn, or a human) promotes a proposed decision to a durable fact.

Out of scope: not a retrieval/search model, not a code generator, not a general chat classifier. It judges decision-worthiness, nothing else.

How to use

ONNX Runtime (the shipping path — Python)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.5")
sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
T = 0.784  # calibration temperature

def p_decision(text):
    # Tokenize and truncate to the 64-token input budget.
    e = tok(text, truncation=True, max_length=64, return_tensors="np")
    # The graph takes int64 input_ids and attention_mask and returns a single logit.
    logit = sess.run(None, {"input_ids": e["input_ids"].astype(np.int64),
                            "attention_mask": e["attention_mask"].astype(np.int64)})[0].reshape(-1)[0]
    # Temperature-scaled sigmoid gives the calibrated P(decision).
    return 1 / (1 + np.exp(-logit / T))

p_decision("Switch auth to JWT instead of sessions")        # ~0.95 → decision
p_decision("let me check the file rather than re-reading")   # ~0.05 → noise
```

Optimum (transformers-compatible)

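```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

m = ORTModelForSequenceClassification.from_pretrained("memtrace/aletheia-1.5", file_name="model_int8.onnx")
tok = AutoTokenizer.from_pretrained("memtrace/aletheia-1.5")

# Score one utterance with the same temperature-scaled sigmoid (T = 0.784).
enc = tok("Switch auth to JWT instead of sessions", truncation=True, max_length=64, return_tensors="pt")
p = (m(**enc).logits.reshape(-1)[0] / 0.784).sigmoid().item()
```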

Rust (`ort`, in-product)

Memtrace's Cortex gate loads `model_int8.onnx` + `tokenizer.json` via the `ort` crate and reads `cortex_serving.json` for the temperature and default threshold. Input names: `input_ids`, `attention_mask`; output: `logits`. The gate auto-downloads this repo on first run when no local model is present.

Training

Data: 24,895 multi-judge-labeled examples — 14,305 git-commit subjects + 10,590 turns mined from real AI-coding sessions. Labels are LLM multi-judge consensus (2 judges for commits; 3 diverse-lens judges + majority for conversation), ~95% inter-judge agreement.
Commit sources (license-clean): CommitPackFT (MIT, 74 languages), CommitChronicle, tangled-ccs. CommitBench (CC-BY-NC) was excluded so the shipped model is commercial-clean.
Recipe: full fine-tune of DeBERTa-v3-large as a sequence classifier with soft-label training (vote-fraction BCEWithLogits, so judge disagreement is modeled rather than forced to 0/1), gradient checkpointing, best-checkpoint-by-AUC, post-hoc temperature scaling. Unlike the 1.0 decoder (LoRA), the encoder is small enough to fully fine-tune.
Compute: trained locally on Apple Silicon (no rented GPU).
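
Two pieces of the recipe above, soft-label training and post-hoc temperature scaling, can be sketched without the training stack. This is a NumPy illustration, not the training code: `soft_bce_with_logits` mirrors what a BCE-with-logits loss computes when the target is a vote fraction, and `fit_temperature` uses a plain grid search as a stand-in for whatever optimiser was actually used. The function names, toy votes, and grid range are all assumptions.

```python
import numpy as np

def soft_bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits with soft targets in [0, 1].

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))))

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the T that minimises validation NLL of sigmoid(logit / T)."""
    x = np.asarray(val_logits, dtype=np.float64)
    if grid is None:
        grid = np.linspace(0.25, 4.0, 376)   # hypothetical search range, step 0.01
    losses = [soft_bce_with_logits(x / t, val_labels) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Hypothetical toy batch: three judges vote per example; the target is the vote fraction.
votes = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]])
targets = votes.mean(axis=1)                 # approximately [1.0, 0.667, 0.333, 0.0]
logits = np.array([3.0, 0.5, -0.5, -3.0])
loss = soft_bce_with_logits(logits, targets)
```

A split vote such as 2 of 3 becomes a target of about 0.667 rather than a hard 1, so the loss stops rewarding extra confidence on contested examples. Temperature scaling then rescales all logits by one scalar and leaves their ranking, and therefore AUC, unchanged.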

Limitations & honest notes

RAM is embedding-bound, not param-bound. DeBERTa-v3's 128k-token vocabulary is a ~131M-param embedding table that ONNX Runtime dequantizes to fp32 at load (~524 MB); that table, not the transformer, dominates the ~1.2 GB resident footprint. A mixed int8-matmul + fp16-embedding build could reach ~700 MB (future work).
Commit register (0.832) trails the conversational register (0.933); the 195-example cross-register benchmark is small and its hand labels are a noisier standard.
English-centric conversational phrasing (commit data spans 74 languages).
It only proposes — pair it with a deterministic confirmation/promotion step.
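
The embedding arithmetic in the first note above can be checked in a few lines. The vocabulary size is the rounded "128k" from this card; the hidden width of 1,024 is an assumption about the DeBERTa-v3-large configuration, and the helper name is hypothetical.

```python
VOCAB_SIZE = 128_000    # "128k-token vocabulary" from this card (rounded)
HIDDEN_SIZE = 1_024     # ASSUMPTION: hidden width of DeBERTa-v3-large

def table_megabytes(rows: int, cols: int, bytes_per_value: int) -> float:
    """Size of a dense rows x cols table in decimal megabytes."""
    return rows * cols * bytes_per_value / 1e6

params = VOCAB_SIZE * HIDDEN_SIZE                        # 131,072,000 values
fp32_mb = table_megabytes(VOCAB_SIZE, HIDDEN_SIZE, 4)    # dequantized at load
fp16_mb = table_megabytes(VOCAB_SIZE, HIDDEN_SIZE, 2)    # hypothetical fp16 build
int8_mb = table_megabytes(VOCAB_SIZE, HIDDEN_SIZE, 1)    # as stored when quantized
```

Under these assumptions the table holds about 131M values and takes roughly 524 MB in fp32, matching the figures quoted above, against roughly 262 MB in fp16. The sketch only sizes the embedding table; it does not model the rest of the resident set or the ~700 MB estimate.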

Version history

| Version | Base | RAM | Notes |
|---|---|---|---|
| 1.0 | Qwen2.5-1.5B (decoder, LoRA) | ~3 GB | first gate |
| 1.5 | DeBERTa-v3-large (encoder, full-FT) | ~1.2 GB | same precision, ⅓ the RAM |

License

MIT (inherits from microsoft/deberta-v3-large). Training data is license-clean for commercial use.

Citation

@software{aletheia2026,
  title  = {Aletheia: an on-device decision-memory gate for code},
  author = {Syncable / Memtrace},
  year   = {2026},
  url    = {https://huggingface.co/memtrace/aletheia-1.5}
}

Evaluation results (self-reported)

ROC-AUC (in-register, held-out) on Cortex decisions, conversational held-out (n=1,589): 0.933
Accuracy on Cortex decisions, conversational held-out (n=1,589): 0.860
ROC-AUC on cross-register benchmark (n=195): 0.832