Entity

The verified scaffold between you and any LLM. Entity is a CLI/TUI that sits between you and a model you choose (via /model). It applies structured edits and verifies them before writing, and it comes with a dark-cyan interface and an animated ASCII octopus mascot. It ships with Entity-Bench, the proof-of-concept benchmark.
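To make "structured edit, then verify, then write" concrete, here is a minimal sketch in Python. It is not Entity's implementation (see ARCHITECTURE.md for that): `replace_function` is a hypothetical helper, it only handles undecorated top-level functions, and the "verification" is just a syntax check with the standard `ast` module.

```python
import ast


def replace_function(source: str, name: str, new_def: str) -> str:
    """Swap one top-level function for new_def; refuse results that do not parse.

    Hypothetical helper for illustration only, not part of Entity's API.
    """
    tree = ast.parse(source)  # raises SyntaxError if the input is already broken
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            lines = source.splitlines()
            # lineno / end_lineno are 1-based and inclusive (Python 3.8+).
            start, end = node.lineno - 1, node.end_lineno
            patched = "\n".join(lines[:start] + new_def.splitlines() + lines[end:]) + "\n"
            ast.parse(patched)  # the gate: reject before anything is written
            return patched
    raise KeyError(f"no top-level function named {name!r}")


original = "def add(a, b):\n    return a - b\n\n\ndef keep():\n    return 1\n"
fixed = replace_function(original, "add", "def add(a, b):\n    return a + b")
print(fixed)
```

The edit targets a whole named entity rather than a text range, and nothing is returned to the caller unless the patched source still parses. A real gate would check much more than syntax.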

Read THESIS.md (the claim + the honest scorecard), ARCHITECTURE.md (design), and BLUEPRINT.md (the full build spec that generated this repo).

Install

pip install -e ".[full]"          # from source, with TUI + live LLM calls
# system-wide (Ubuntu/Debian/WSL):
bash packaging/build_deb.sh && sudo dpkg -i dist/entity_0.1.0_all.deb
# Fedora/RHEL or any distro:
sudo bash packaging/install.sh

Core installs dependency-free; [full] adds rich, prompt_toolkit, httpx.

Use

entity                 # launch the TUI (octopus banner, entity› prompt)
entity --plain         # no-TUI line mode (for pipes / dumb terminals)
entity --version

Inside the TUI: /model to connect an LLM, then just chat.

entity› /model
entity› /learner none        # parametric learner is OPTIONAL (none|bitnet|lora|custom)
entity› /edit entity-ast
entity› /verifier z3

Full command list: docs/cli.md. The /model page: docs/model-config.md.

Benchmark

entity bench run --dataset mock --n 128 --out runs/a   # -> PASS

Metrics & scorecard: docs/metrics.md and THESIS.md. Note: mock/synthetic datasets are an offline illustration of the pipeline, not evidence — the real evidence is the real study below.

Documentation, paper & empirical study

  • MANUAL.md — the complete user manual (install, every slash command, the /model wizard, the verification gate, the benchmark, packaging, FAQ).
  • paper/entity.pdf — the pre-print "Entity: A Verified Scaffold Between Language Models and Source Code" (compile from paper/entity.tex with tectonic).
  • experiments/ — the reproducible empirical study. The headline evidence is a real study: a real model (Claude) implementing real library functions (benchmarks/real/), judged by a real differential oracle and a real Z3 gate, with output tokens counted by a real tokenizer (tiktoken o200k_base). The model's solutions are archived in benchmarks/real/solutions.jsonl, so the verification half of the study reproduces deterministically with no model or network access. Also included: real Z3 proofs on a contract corpus, real dense-vs-lexical retrieval, a modelled sensitivity analysis (explicitly not evidence), and CodeCarbon energy accounting.
pip install -e ".[study]"
python -m experiments.run_all --out results --seed 0          # writes results/*.json
python -m experiments.figures --results results --out figures # layered SVG + PDF
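
The differential-oracle idea in the study can be sketched in a few lines: run a candidate and a trusted reference on the same inputs and accept only if they always agree. The functions and batteries below are made up for illustration and are not taken from benchmarks/real/; the candidate is deliberately shallow so that a light battery accepts it and a strengthened one rejects it.

```python
import statistics


def reference_median(xs):
    # Trusted reference implementation.
    return statistics.median(xs)


def candidate_median(xs):
    # Deliberately shallow: correct for odd lengths, wrong for even ones.
    return sorted(xs)[len(xs) // 2]


def differential_pass(candidate, reference, battery):
    """True only if candidate agrees with reference on every input."""
    return all(candidate(list(x)) == reference(list(x)) for x in battery)


# Hypothetical batteries: the light one only contains odd-length inputs.
light = [[3], [2, 1, 3], [5, 4, 9, 7, 1]]
strengthened = light + [[1, 2], [4, 1, 3, 2]]

print(differential_pass(candidate_median, reference_median, light))         # True
print(differential_pass(candidate_median, reference_median, strengthened))  # False
```

This is why a pass rate is only as meaningful as the battery behind it: the same patch scores differently depending on which inputs the oracle exercises.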

Headline results — measured, with honest caveats

n = 25 real functions; effect sizes and bootstrap CIs, no p-value theater.

Result | Value
--- | ---
Token economy, whole-entity edit (real tokenizer, real files) | median −53% vs search/replace, −97% vs whole-file rewrite (100% of tasks favour entity). Caveat: a small localized change is cheaper as a unified diff; the win is for whole-entity rewrites.
Pass@1 (real model, real differential oracle) | 1.00 under a light battery; 0.96 under a strengthened battery. The honest oracle caught 1/25 shallow (subtly-wrong) patch that light testing accepted.
Invalid patches written by the gate | 0 (invalid_rate = 0.00)
Specification coverage (the hard question) | 60% of functions admit a checkable output post-condition; 0% admit a Z3 end-to-end body proof. The formal gate is sound where a spec exists (gate accuracy 1.00 on the contract corpus) but covers a narrow slice of real edits.
Verified-memory retrieval (disjoint paraphrases) | dense p@1 = 0.50 vs lexical 0.35 vs chance 0.125 (modest, but beats the baselines)
Energy / carbon (CodeCarbon) | order ~10⁻⁵ kWh / ~10⁻⁶ kg CO₂eq; see results/summary.json
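
For readers unfamiliar with reporting an effect size with a bootstrap CI instead of a p-value, the sketch below computes a percentile bootstrap interval for a median relative token change. The token counts are made-up placeholders, not the study's data, and `bootstrap_median_ci` is a hypothetical helper; the actual analysis lives in experiments/ and may differ.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of values."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = medians[int((alpha / 2) * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.median(values), lo, hi


# Made-up (baseline_tokens, entity_tokens) pairs, NOT the study's data.
pairs = [(120, 60), (200, 90), (80, 50), (150, 70), (300, 120), (95, 40)]
rel_change = [(entity - base) / base for base, entity in pairs]

point, lo, hi = bootstrap_median_ci(rel_change)
print(f"median change {point:+.0%}, 95% CI [{lo:+.0%}, {hi:+.0%}]")
```

With only a handful of tasks the interval is wide, which is the point of reporting it alongside the median.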

What we removed and why. Earlier versions led with −85% token savings vs diff, a composite −30% token reduction at p<10⁻¹⁶⁰, and a metric "algebra". The first was an over-estimate; the second came from a Wilcoxon test on a hardcoded constant (it measured sample size, not an effect); the third is a weighted scorecard whose weights are author-chosen. All three are gone. See THESIS.md and the paper's Limitations section.
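
The "it measured sample size, not an effect" point can be illustrated with a small calculation. Assume, for illustration only, that a constant offset makes every paired difference share one sign. Then the exact two-sided signed-rank p-value is 2^(1−n), which depends on nothing but n. Whether the removed analysis matched this case exactly is not stated here; the function name is hypothetical.

```python
import math


def log10_p_all_same_sign(n: int) -> float:
    """log10 of the exact two-sided signed-rank p-value when all n paired
    differences share one sign (no zero differences)."""
    # Under the null each sign is a fair coin flip, so the most extreme
    # outcome has probability 2**-n; doubling for two sides gives 2**(1 - n).
    return (1 - n) * math.log10(2)


# The p-value shrinks with n alone; the size of the difference never enters.
for n in (10, 100, 600):
    print(n, round(log10_p_all_same_sign(n), 1))
```

Working in log10 avoids floating-point underflow for large n. The magnitude of the constant never appears in the formula, so an arbitrarily small p-value says nothing about how large the effect is.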

License

Apache-2.0.
