FitCheck: What AI can your computer actually run?

Published June 15, 2026

This is the full story of FitCheck, a small tool with one stubborn goal: let anyone, technical or not, type their computer in plain words and get an honest answer about what Deep Learning models it can run, how fast it will feel, and exactly how to start. No jargon walls, no gatekeeping. This post is long on purpose. It is both a guide you can actually learn from and an honest build log, including the parts that did not go to plan.

---

Motivation: a conversation at a hacker house in the Alps

The idea did not come from a brainstorm. It came from the desk next to mine.

I was at a hacker house called Alpine Valley in Austria. My neighbour there, Niklas, was building a biomedical startup. He had just received a new, capable machine, and he is exactly the kind of person you would call technical: he writes code, he ships products, he understands his domain deeply. And yet, when the conversation turned to running AI models locally on his new system, he hit a wall that had everything to do with the technical jargon that clouds this entire domain.

What can this machine actually run? Which model is the correct or optimal one for his use case? Will it be fast enough, or will he be staring at a blinking cursor? Could he fine-tune something on his own data, or is that a fantasy on this hardware? The honest answer to all of these questions is buried under words like quantization, VRAM, teraflops, KV cache, and a dozen benchmark charts that assume you already know what you are looking at. Niklas did not need a PhD in machine learning. He needed someone to translate his hardware into plain language and tell him what was possible.

Here is the thing: this was not a Niklas problem. I have had this exact conversation, in different accents, many times. And I have had the inverse version of it too: "I want to do X, what computer should I actually buy?" That advice is scattered across forum threads, sponsored or biased reviews, and Reddit arguments, none of it consolidated, most of it assuming you already speak the language.

The Backyard AI track is about exactly this kind of neighbourly, scratch-your-own-itch building, and that is what FitCheck is. The founding principle is simple: credibility beats cleverness, and clarity beats both. A tool that says "you can run this, here is how fast it will feel, here is the one command to start" is worth more than a benchmark you cannot read. So I set out to build the least intimidating, least jargon-heavy AI-hardware advisor I could, for people who have better things to do than learn what a teraflop is, but who deserve to get the full value out of the machines they already own.

Disclaimer: although I worked on this solo, I used Claude Code, Codex, and ChatGPT for coding, brainstorming, and generating the visuals.

The jargon simplified as best I could

If you have ever felt locked out of an AI conversation by a wall of acronyms and heavy jargon, this is for you. Below is every term FitCheck uses, explained the way you would explain it to a smart friend over coffee. No equations, no terminal required.

An LLM (and "local" vs the cloud). A Large Language Model (LLM) is a deep learning model that takes your text and predicts a useful response, one word at a time. Think of it as a very well-read assistant who has been shown a huge library and learned the patterns in it. "Local" (or "on-device") means that assistant lives on your own computer, so your words never leave the room. The cloud version (like ChatGPT's website) means you are renting someone else's computer over the internet, and your data travels there. Local is private and free to run; the cloud is convenient but you give up some control. What is an LLM? (AWS)

Parameters / model size ("7B", "billions of parameters"). A parameter is just a number the model learned during training, and a model can have billions of them. Picture a giant sound mixing board with billions of tiny dials all set to exactly the right spot (as explained by 3Blue1Brown, a great resource to learn all this); those settings are what make the model "smart." "7B" means seven billion of those dials. The catch: more dials means a bigger, smarter model, but also a heavier one that needs more memory to hold. That is the whole tension FitCheck helps you navigate. What's a parameter in an LLM?

Quantization (4-bit, 8-bit) and why a model can be "shrunk." Each of those billions of numbers is normally stored at high precision, i.e. with lots of digits after the decimal point, which takes a lot of space. Quantization rounds them to a coarser, smaller form, like storing "3.14" instead of "3.14159265." The model gets dramatically lighter with only a small loss in quality. In practice, running a model in 8-bit can roughly halve the memory it needs compared with the usual 16-bit storage, and 4-bit can halve it again, which is often what lets a model fit on a normal computer at all. 4-bit quantization and QLoRA (Hugging Face)

RAM (system memory). RAM is your computer's short-term working memory: the desk where it spreads out whatever it is using right now. The bigger the desk, the more it can keep open at once without slowing down. Most laptops have 8, 16, or 32 GB. For AI, RAM matters because a model can sometimes run here when there is no dedicated graphics memory, though usually more slowly. Random-access memory (Wikipedia)

VRAM (graphics-card memory), the key bottleneck. VRAM is a separate, faster pool of memory that sits right next to your graphics card (the GPU). It is like a smaller but ultra-fast workbench bolted directly to the machine doing the heavy lifting. For local AI, VRAM is usually the make-or-break number: the entire model has to fit in it to run fast, so a 10 GB model simply will not load smoothly on an 8 GB card. This single number decides most of what your computer can and cannot run, which is exactly why FitCheck leads with it. Video random-access memory (Wikipedia)

Memory bandwidth (GB/s) and why it sets how fast text appears. Bandwidth is how quickly data can be pulled out of memory, measured in gigabytes per second. If memory is a water tank, bandwidth is the width of the pipe coming out of it. Here is the surprising part: when an AI writes a reply, it has to re-read its entire set of numbers for every single word it produces, so the width of that pipe, not raw computing muscle, usually sets how fast the words stream onto your screen. A wider pipe means snappier replies. Optimizing LLMs for speed and memory (Hugging Face)

TFLOPS / teraflops (compute) and how it differs from memory. A "flop" is one math calculation; a teraflop is a trillion of them per second, and TFLOPS measures how much raw calculating muscle your hardware has. If bandwidth is the width of the pipe, TFLOPS is how strong the engine is. The two are different bottlenecks: a model might fit in your memory and your pipe might be wide enough, yet still be limited by engine power on certain tasks. For everyday chatting, memory usually matters more than TFLOPS. FLOPS (Wikipedia)

Tokens per second (and how it compares to reading speed). A token is roughly a word-piece; models think in tokens, and "tokens per second" is just how fast the AI produces text. The handy reference point is you: most people read comfortably at around 5 to 7 tokens per second. So if a model runs at that pace or faster, it feels like a natural conversation; if it runs well above it, the text appears faster than you can read. Below that, you start waiting on it. Words per minute (Wikipedia)

GGUF, and tools like Ollama, llama.cpp, and LM Studio. GGUF is a single-file format that packs a whole local model, its settings, and its quantization into one tidy file you can download and run. Think of it as a self-contained appliance rather than a box of loose parts. The tools that play these files do the technical work for you: llama.cpp is the lightweight engine under the hood, while Ollama and LM Studio are friendly apps (LM Studio even has a point-and-click interface) that let you download and chat with a model without ever touching a terminal. GGUF (Hugging Face)

Inference vs fine-tuning / training. Inference is simply using a finished model: you type, it answers. Fine-tuning (and training more broadly) is teaching the model, adjusting those dials so it gets better at your specific task. The difference is like driving a car versus building or modifying one. Running a model is light enough for everyday hardware; teaching one is far more demanding, which is why most people only ever do inference. Running a model (Hugging Face)

LoRA / QLoRA (cheap fine-tuning). Full training rewrites the entire model and needs serious hardware, but LoRA is a clever shortcut: instead of changing all billions of dials, it freezes the original model and trains a tiny add-on layer of new dials, like clipping a small custom lens onto an existing camera instead of rebuilding it. QLoRA goes further by first quantizing (shrinking) the base model, so you can fine-tune surprisingly large models on a single ordinary graphics card. The result is a small, swappable file that teaches a big model new tricks without the big-hardware bill. LoRA conceptual guide (Hugging Face PEFT)

The plan: one honest engine, AI only where it earns its place

Once the problem was clear, the design almost wrote itself. I wanted a tool you could use in two directions:

"What can my computer run?" You describe your machine, and the tool tells you which models fit, how fast they will feel, and how to start.
"What should I buy?" You pick what you want to do, and it tells you the most affordable/best machine that genuinely does it, plus a comfortable step up, on every platform, with live price links.

The most important design decision was about trust. AI tools that confidently make up numbers are worse than useless for this job, because the entire point is credibility. So I flipped the usual approach. Instead of asking a large model to "estimate" everything, I built a deterministic engine that does every piece of arithmetic from verified data, and I let small AI models help only at the edges, where they are genuinely the right tool and where their output is checked.

Concretely, FitCheck is a transparent rules engine sitting on a catalogue of 110 real models (with real file sizes, licenses, and links, refreshed from the Hugging Face API), and three small AI models layered on top, each gated:

a spec parser that turns your messy description into form fields,
a speed predictor that estimates how many tokens per second you will get,
a narrator that explains the engine's numbers in plain words.

Every number you see comes from the engine and carries its provenance. The AI never silently invents the answer.

A house rule I kept throughout: no fake fallbacks. If something fails, the tool shows you the real error, never a plausible-looking substitute. A wrong answer that looks right is the most dangerous thing a tool like this can do.

The deterministic core: boring on purpose

Before the AI, the unglamorous part that does most of the work. For any model and any machine, the engine adds up three things: the model's file size at a given quantization (real GGUF bytes where we have them, conservative estimates otherwise), the memory the conversation itself needs (the "KV cache"), and a safety buffer for working space. It compares that total to what your machine has, on the fast graphics memory first and then on slower system memory it can spill into, and it returns one of three honest verdicts: runs great, tight but works, or will not fit.
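The sum the engine performs can be sketched in a few lines. This is a minimal illustration, not FitCheck's actual code: the KV-cache formula is the standard one for transformer attention, while the buffer size, the headroom factor, the share of system RAM treated as spill space, the way the three verdicts are assigned, and all names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    vram_gb: float  # fast graphics memory
    ram_gb: float   # slower system memory


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory for the conversation: keys and values for every layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9


def verdict(weights_gb, kv_gb, machine, buffer_gb=1.5, headroom=0.9, ram_share=0.5):
    """Compare total need to fast memory first, then to memory it can spill into.

    buffer_gb, headroom and ram_share are illustrative guesses.
    """
    need = weights_gb + kv_gb + buffer_gb
    if need <= machine.vram_gb * headroom:
        return "runs great"
    if need <= machine.vram_gb + machine.ram_gb * ram_share:
        return "tight but works"
    return "will not fit"


# Hypothetical model: 4.9 GB file, 32 layers, 8 KV heads of size 128, 8k context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
for machine in (Machine(12, 32), Machine(8, 16), Machine(0, 8)):
    print(machine, "->", verdict(4.9, kv, machine))
```

Keeping the headroom factor below 1.0 is what makes a sketch like this err towards "tight" rather than "great" when the total lands close to the limit.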

It is deliberately conservative. It would rather tell you "tight" and be pleasantly wrong than tell you "great" and leave you with a crash. Every figure is rounded pessimistically and labelled, so you can check our work.

The narrator: Nemotron, for answering in plain words

Numbers alone do not help if you do not know what they mean. The narrator is a small local model, NVIDIA's Nemotron-3-Nano-4B, whose only job is to explain the engine's output in plain English: why a model fits, what "tight" means for you, what to try next. It runs on Hugging Face's ZeroGPU when the Space has a GPU allocation.

The important detail is the leash. The narrator is not allowed to make up facts. It is given the exact numbers the engine computed and asked only to phrase them, and its output passes a "number-match" faithfulness check that flags any figure it states which does not match the engine. The AI is a translator, not a source of truth. If it ever cannot run, the tool says so plainly rather than faking an explanation.
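A number-match check of this kind can be very small. The sketch below is one possible shape for it, not the shipped implementation: the function name, the regular expression, and the tolerance are assumptions, and a real check would also need to deal with digits inside model names and with unit conversions.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unmatched_numbers(narration: str, engine_values: list[float], tol: float = 0.05) -> list[str]:
    """Return every figure in the narration that matches no engine value."""
    flagged = []
    for token in NUMBER.findall(narration):
        value = float(token)
        if not any(abs(value - ev) <= tol for ev in engine_values):
            flagged.append(token)
    return flagged


engine = [4.9, 8.0, 23.0]  # hypothetical engine output: GB needed, GB available, tokens/s
faithful = "The model needs about 4.9 GB and your card has 8 GB, so expect around 23 tokens per second."
drifted = "The model needs about 4.9 GB and should reach 40 tokens per second."

print(unmatched_numbers(faithful, engine))  # nothing flagged
print(unmatched_numbers(drifted, engine))   # the invented "40" is flagged
```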

The spec parser: teaching a small model to say "I don't know"

The friendliest possible input is no form at all: just type "my dad's old Dell, i5, 16 gigs, some nvidia card" and let the tool fill in the rest. That messy human-text-to-structure step is the one place a language model is genuinely the right tool, so I fine-tuned one for it: Qwen3-1.7B, trained with Unsloth on a single RTX 5090 laptop using a LoRA, with "completion-only" loss so the model spends its learning on the answer, not on re-reading the prompt.

The one rule that matters more than accuracy: missing information must become null, never a guess. If you did not mention your GPU, the tool must leave it blank, not invent one. So the metric I cared about most was not raw accuracy, it was the invented-field rate: how often the model fills in something that was never said.
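One plausible way to compute an invented-field rate is shown below. The exact definition used for the published numbers is not spelled out here, so treat this as an assumption: it counts, among all fields the label leaves blank, the share the model filled in anyway. The field names are hypothetical.

```python
def invented_field_rate(predictions, labels):
    """Share of fields the label leaves as None that the model filled in anyway."""
    blank = invented = 0
    for pred, gold in zip(predictions, labels):
        for field, gold_value in gold.items():
            if gold_value is None:
                blank += 1
                if pred.get(field) is not None:
                    invented += 1
    return invented / blank if blank else 0.0


labels = [
    {"cpu": "i5", "ram_gb": 16, "gpu": None},   # GPU never mentioned
    {"cpu": None, "ram_gb": 8, "gpu": None},
]
predictions = [
    {"cpu": "i5", "ram_gb": 16, "gpu": "GTX 1650"},  # invented GPU
    {"cpu": None, "ram_gb": 8, "gpu": None},
]
print(f"invented-field rate: {invented_field_rate(predictions, labels):.1%}")
```

Note that a model can score well on accuracy and still do badly here, which is why the two numbers are reported side by side.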

Two design choices kept us honest. First, the training labels are never generated by another model; every example starts from a real machine's specs, and only the phrasing varies, across roughly two dozen registers (casual chat, dxdiag dumps, Task Manager paste, seller listings, consoles, several languages). Second, I evaluated on text the model had never seen, generated by a different model and verified by me, with no leak to the agents building out the system.

Across five training rounds, accuracy climbed and the invented rate fell sharply as I added the don't-invent rules and the completion-only loss.

The honesty check that is important to share

Here is the part most write-ups would quietly skip. After five rounds of Claude Code iteratively tuning the approach and repeated Unsloth fine-tuning runs, I was looking at 91.6% accuracy and a 1.2% invented rate on our evaluation set, and feeling good.

Then came the uncomfortable question: the agent had been reading the model's mistakes and adjusting after every round, which means the agent was biased and had access to the exam we were using to evaluate the model. Those were not honest test numbers. They were practice-exam numbers.

So I did the right thing. I demoted that set to a dev set (labelled as optimistic), and built a proper sealed test: builder-blind examples the model had never influenced, checked to have zero overlap with training, and evaluated exactly once. The examples were generated initially with Codex, then verified and tweaked by me.

The sealed result was a reality check, and I am publishing it unedited because that is the entire point:

On the builder-blind sealed test, the fine-tuned model scored 88.0% accuracy at a 17.7% invented rate, against the base model's 71.5% and 37.1%. So the good news is real: our model clearly beats the base it started from, lifting accuracy by about 16 points and roughly halving the hallucination rate. But the 1.2% I was quietly proud of was optimism. The honest invented rate is closer to 18%.

A fair nuance, not an excuse: some of those inventions are debatable. The sealed labels were machine-generated, and several disagreements are integrated-graphics edge cases where the model correctly extracts a chip that the label marked as nothing. So the true figure is somewhat lower than 18%, and a human-audited test would tighten it. But the direction is not in doubt: dev numbers flatter, sealed numbers tell the truth, and I would rather ship the truth. The model is a strong, useful extractor that prefers blanks over guesses far more than the base model does. It is not a zero-hallucination oracle, and the card says so.

The speed predictor: how fast will it actually feel?

"It fits" is only half the answer. The other half is "will it be usable, or will I be waiting on every word?" Speed was the one place the project felt genuinely AI-shaped, because how fast a model runs is non-linear, measurable, and data-rich.

Getting honest data (thank you, LocalScore!!!)

You cannot predict speed without real measurements, and I was not about to make them up. The data came mostly from LocalScore (a Mozilla Builders project), a community benchmark of local LLM speed across an enormous range of hardware. After cleaning, our training set was 6,633 real measurement rows. I also kept an independent llama.cpp community table aside, untouched, as an out-of-source test. The hardware bandwidth figures came from a public 274-device spec table. The cleaned data and the holdout are published as a dataset so anyone can check our work.

The physics baseline and a fair contest

There is a beautiful piece of physics here: for everyday single-user chatting, a model's speed is set mostly by memory bandwidth, because the machine re-reads the model's numbers for every token. Divide bandwidth by bytes-read-per-token and you get a surprisingly good estimate for free. That is the "roofline."
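The division is simple enough to write down directly. In this sketch the function name and the example bandwidth figures are illustrative round numbers, not entries from the spec table, and the bytes read per token are approximated by the model's file size, which is an assumption that suits dense models.

```python
def roofline_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper-bound speed when generation is limited by memory bandwidth."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


model_gb = 4.9  # hypothetical quantized model, read once per token
for name, bandwidth in (("laptop RAM", 50.0), ("mid-range GPU", 256.0), ("high-end GPU", 1000.0)):
    speed = roofline_tokens_per_second(bandwidth, model_gb)
    print(f"{name:>13}: ~{speed:.0f} tokens/s at best")
```

Because it ignores compute, overheads, and the KV cache, the result is a ceiling rather than a forecast, which is part of why measured speeds tend to come in below it.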

The obvious question: does a trained machine-learning model actually beat this free physics formula, or is the AI just a gimmick? I ran a fair contest, following the methodology of IBM's LLM-Pilot paper, comparing five approaches on held-out hardware (median error, lower is better).

The result kept us honest. Linear regression was worse than physics (which makes sense given how non-linear the relationship is). Decision trees, random forests, and gradient boosting (XGBoost) all beat the physics baseline, with XGBoost being the best. So the trained model earns its place, but only just, and only because it was measured instead of assumed. XGBoost cut the median error from the roofline's 28.1% down to 17.5% on hardware where we know the bandwidth, and importantly it can answer for CPUs the formula cannot score at all.
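For readers who want to reproduce this kind of comparison on their own data, a median percentage error can be computed as below. This is an assumed reading of "median error" (the median of absolute percentage errors), and the numbers in the example are toy values, not the benchmark results.

```python
from statistics import median


def median_abs_pct_error(predicted, measured):
    """Median of |predicted - measured| / measured, expressed in percent."""
    errors = [abs(p - m) / m * 100 for p, m in zip(predicted, measured)]
    return median(errors)


# Toy tokens-per-second values, for illustration only.
measured = [20.0, 50.0, 100.0]
predicted = [25.0, 45.0, 100.0]
print(f"median error: {median_abs_pct_error(predicted, measured):.1f}%")
```

The median is a sensible choice for this kind of data because a few wildly wrong predictions on unusual hardware do not dominate the score the way they would with a mean.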

Two guardrails so it never bluffs

A trained model is dangerous outside the data it has seen, because decision trees cannot extrapolate; ask one about a 32B model when the data tops out at 14B and it will confidently give you a 14B's answer. So the engine only trusts the trained model inside its measured envelope; outside it, the physics formula takes over, and the interface tells you which one answered. I also only trust the learned model for the kind of dense, fully-on-GPU setups it was trained on; exotic configurations fall back to physics.
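The routing logic behind those two guardrails might look roughly like this. It is a sketch under assumptions: the envelope is reduced to a single parameter-count range, the two predictors are stand-in functions, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Range of model sizes (billions of parameters) covered by the training data."""
    min_params_b: float
    max_params_b: float


def predict_speed(params_b, is_dense_on_gpu, envelope, learned_model, roofline):
    """Return (tokens_per_second, source) so the interface can show which one answered."""
    inside = envelope.min_params_b <= params_b <= envelope.max_params_b
    if inside and is_dense_on_gpu:
        return learned_model(params_b), "learned model"
    return roofline(params_b), "physics formula"


# Stand-in predictors, for illustration only.
def learned(params_b):
    return 100.0 / params_b


def physics(params_b):
    return 80.0 / params_b


envelope = Envelope(min_params_b=0.5, max_params_b=14.0)
print(predict_speed(7.0, True, envelope, learned, physics))    # inside the envelope
print(predict_speed(32.0, True, envelope, learned, physics))   # too large: physics answers
print(predict_speed(7.0, False, envelope, learned, physics))   # exotic setup: physics answers
```

Returning the source alongside the number is the important part: the caller never has to guess which predictor produced the figure.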

And I tested it on data it never trained on. On the independent Apple llama.cpp set, the shipping predictor is about 26% off, systematically under-predicting Apple's unified-memory speed.

I am reporting that gap rather than hiding it, and deliberately did not tune the predictor to flatter a handful of Apple data points. An honest 26%-off on an out-of-source hardware family is worth more than a number that is faked.

Sizing fine-tuning: VRAM, with receipts

For the braver users who want to fine-tune, not just run, I built a separate estimator for training memory, which behaves very differently from inference. The naive "memory is proportional to parameters" rule under-predicts badly, because a large, often-overlooked term (the output layer's working memory) scales with your batch and sequence length, not the model size.
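To see why that term matters, consider the size of the output layer's result on its own. The sketch below assumes the full tensor of scores is held in memory at 32-bit precision, one score per vocabulary entry for every token in the batch, and uses a hypothetical vocabulary of 150,000 entries. Some training stacks avoid materialising all of it at once, so this is an upper-end illustration rather than the fitted estimator.

```python
def logits_memory_gb(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Memory for one full copy of the output scores (batch x sequence x vocabulary)."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 1e9


VOCAB = 150_000  # hypothetical vocabulary size
for batch_size, seq_len in ((1, 512), (4, 2048), (8, 4096)):
    gb = logits_memory_gb(batch_size, seq_len, VOCAB)
    print(f"batch {batch_size}, sequence {seq_len}: ~{gb:.2f} GB")
```

Nothing in this function depends on how many parameters the model has, which is exactly why a rule based only on parameter count misses it.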

Rather than trust a formula off the internet, I measured. I ran a real sweep on the RTX 5090 Mobile across model sizes, methods, sequence lengths, and batch sizes, one clean subprocess per configuration so the measurements did not contaminate each other, and fit an architecture-aware model to the results. The shipped estimator lands within about 6% on the configurations I measured (the raw fit is closer to 4%; the shipping version adds a small safety margin) and sits at or above the published minimums for the optimised training stacks, so it under-promises rather than over-promises.
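Running one subprocess per configuration can be done with the standard library alone. The sketch below shows the pattern, not the actual sweep script: the child program here only computes a placeholder value where a real sweep would train for a few steps and record peak memory, and the config keys and function name are assumptions.

```python
import json
import subprocess
import sys

# Stand-in child program. A real sweep would load the model and measure here.
CHILD = """
import json, sys
config = json.loads(sys.argv[1])
peak_gb = 0.001 * config["seq_len"] * config["batch_size"]  # placeholder, not a measurement
print(json.dumps({"peak_gb": peak_gb}))
"""


def measure(config: dict, timeout_s: int = 600) -> dict:
    """Run one configuration in a fresh Python process and report what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, json.dumps(config)],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {**config, "error": "timeout"}
    if proc.returncode != 0:
        # Surface the real error instead of substituting a number.
        return {**config, "error": proc.stderr.strip()[-200:]}
    return {**config, **json.loads(proc.stdout)}


if __name__ == "__main__":
    for config in ({"seq_len": 512, "batch_size": 2}, {"seq_len": 2048, "batch_size": 1}):
        print(measure(config))
```

Each run starts with a clean process and clean GPU state, and a crash or hang in one configuration becomes a recorded error rather than a lost sweep.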

The inverse: "Just tell me what to buy for ..."

The reverse question deserved as much care as the forward one. You pick your goals, and FitCheck walks a ladder of representative builds on every platform (NVIDIA, AMD, Apple, and small or edge boxes), runs each one through the very same engine, and returns two genuinely different options: the cheapest build that truly does the job, and a comfortable step up with headroom.
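Picking the two options from a ladder can be sketched as follows. The builds, prices, and field names are placeholders, and the selection rule is an assumption: the minimum is the cheapest build the engine says works, and the comfortable pick must cost strictly more and be rated "runs great".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    name: str
    price_usd: int
    verdict: str  # what the engine returned for this build and goal


def pick_builds(ladder):
    """Return (minimum, comfortable); either may be None if nothing qualifies."""
    working = sorted(
        (b for b in ladder if b.verdict != "will not fit"),
        key=lambda b: b.price_usd,
    )
    if not working:
        return None, None
    minimum = working[0]
    comfortable = next(
        (b for b in working
         if b.price_usd > minimum.price_usd and b.verdict == "runs great"),
        None,
    )
    return minimum, comfortable


# Placeholder ladder, not real products or prices.
ladder = [
    Build("Build A", 300, "will not fit"),
    Build("Build B", 700, "runs great"),
    Build("Build C", 1200, "runs great"),
]
minimum, comfortable = pick_builds(ladder)
print("minimum:", minimum.name, "| comfortable:", comfortable.name)
```

Requiring a strictly higher price for the second pick is what stops the two options from collapsing into the same machine when the cheapest working build already runs great.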

Two principles here. First, the cheap end has to actually be cheap: the small-box category starts at a 249-dollar CUDA Jetson, not a 1,500-dollar mini-PC that costs more than the Mac above it. Second, and this matters most, the prices are not made up. Hardware prices drift fast, and a number frozen at the moment a model was trained is a liability. So every build carries a live price link (the vendor's own page for Macs and mini-PCs, a live price tracker for the GPU builds), and our own figures are labelled as dated, research-checked approximations, not live quotes. The link is the source of truth; our number is a signpost.

Bugs and pitfalls: the honest blooper reel

No build is clean. A few of the more instructive things that went wrong, in case they save you the same afternoons:

A blue screen, mid-measurement. Halfway through the fine-tuning VRAM sweep, the laptop hard-crashed with a hypervisor error. After some panic, the cause was not the code at all, it was a known Windows update regression, fixed by a later patch. The lesson kept anyway was to run each heavy measurement in its own subprocess with graceful guards, so one crash cannot poison the rest of the run.
Minimum and Comfortable were the same machine. Our first buy-advice logic picked "cheapest that works" and "cheapest that runs great," and for most goals those collapsed to the identical build. Useless. The fix was to define Comfortable as a genuine step up from Minimum, always.
A Tesla V100 that thought it was an Apple M2. Our hardware name matching used naive substring search, so the short key "m2" matched inside "Tesla V100-SXM2" and assigned it Apple's memory bandwidth. Token-boundary matching fixed it. Small bug, very wrong numbers.
Made-up prices. An early version hard-coded rough prices from memory. That is exactly the unverified confidence this tool exists to fight, so I replaced them with research-checked ranges and live links, and found several were off by more than 15% (one Apple configuration had been discontinued entirely).
A cross-site-scripting hole. A strict code review found that user and model-generated text was being injected into the page without escaping. I closed it, escaped every dynamic value, and added regression tests. Worth saying out loud: a tool being "just a frontend" does not exempt it from security.
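The V100 bug is easy to reproduce. The sketch below contrasts substring matching with one simple form of token-boundary matching; the function names are illustrative, and this version only handles single-word keys such as "m2".

```python
import re


def naive_match(key: str, device_name: str) -> bool:
    """Substring search: the approach that caused the bug."""
    return key.lower() in device_name.lower()


def token_match(key: str, device_name: str) -> bool:
    """Match only when the key is a whole token of the device name."""
    tokens = re.split(r"[^a-z0-9]+", device_name.lower())
    return key.lower() in tokens


for name in ("Apple M2 Pro", "Tesla V100-SXM2"):
    print(f"{name!r}: naive={naive_match('m2', name)}, token={token_match('m2', name)}")
```

Splitting on anything that is not a letter or digit turns "Tesla V100-SXM2" into the tokens "tesla", "v100", and "sxm2", none of which equals "m2".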

None of these are flattering. All of them are the actual reason the final tool is trustworthy.

Niklas's thoughts

I showed it to two of my neighbours during the program as I was building it: Niklas, who inspired it, and a CTO of an agentic AI construction company who felt the same pain.

Both of them loved it and found it genuinely helpful. I also found them using it on their own, and they gave a lot of useful feedback, such as asking for support for external models. That prompted me to add the section where you can add a model URL and Hugging Face will fetch it for you.

The one issue I wasn't able to solve was catering to all the niche system specifications people have and giving accurate estimates for every kind of model, but that is a good task for future work.

Conclusion: where this goes next

FitCheck started as a translation problem at a neighbouring desk: turn a person's hardware into a plain answer about what AI it can run, and turn a person's goal into a plain answer about what to buy. The result is a deterministic engine you can audit, with small AI models helping only where they earn their place and only under a leash.

The honest scorecard: the engine and the buy-advice are solid and verifiable; the speed predictor genuinely beats physics on the hardware it knows and falls back to physics where it does not; the spec parser is a real improvement over its base model but invents more than its dev numbers suggested, which I publish openly.

Where it could go from here: a human-audited sealed test to pin the parser's true hallucination rate, a calibrated speed model for Apple and unified-memory boxes, a live (cached) price feed to replace dated estimates, and broader coverage of vision and diffusion speed. But the core bet has held up: for a tool whose entire value is trust, the most important feature is the willingness to show its work and admit what it does not know.

If that sounds useful, try it, and tell a friend who has been too embarrassed to ask what a teraflop is. That friend is exactly who I built it for.

Try it and check our work

The app: build-small-hackathon/FitCheck
The collection (app, models, dataset): FitCheck collection
Spec parser: cn0303/fitcheck-spec-parser
Speed predictor + holdout: cn0303/fitcheck-speed-predictor
Speed dataset: cn0303/local-llm-speed-benchmarks
LocalScore, a community benchmark of local LLM speed (Mozilla Builders). https://localscore.ai
M. Lazuka, A. Anghel, T. Parnell. "LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services." SC '24, 2024. arXiv:2410.02425. https://arxiv.org/abs/2410.02425
