Wardrobe AI: Shop Your Own Closet with a 4B Model
Original blog post My Blog
Live demo: huggingface.co/spaces/build-small-hackathon/wardrobe-us
We've all stood in front of a full closet feeling like we have nothing to wear — or bought something new only to discover an almost identical item buried in the back. My mother has over 200 garments and deals with this every morning: forgetting what she owns, buying duplicates, and struggling to combine pieces into outfits.
Wardrobe AI is my answer. Not another fashion recommender that pushes you to buy more — a personal knowledge graph for the clothes you already own. Upload photos, get a structured catalog, generate outfit ideas, and ask questions in plain English. Everything runs locally through llama.cpp and a single 4-billion-parameter model. No external LLM APIs.
The problem wardrobe apps don't solve
Most fashion apps assume you want to buy something. Wardrobe AI assumes you want to use what you have.
The goal is simple:
- Turn messy photos into a searchable catalog with structured metadata.
- Generate valid outfit combinations from your actual tops and bottoms.
- Let you chat with your wardrobe — "What should I wear for a summer wedding?" answered from garments you own, not from a generic style guide.
That last point matters. When the assistant recommends [garment_012], it refers to a real sweater in your catalog, with a photo and a description — not an abstract "navy crewneck" from training data.
How it works: detect, crop, extract
We don't send one big photo straight to a vision model. The pipeline is deliberately split:
Photo → YOLOS detector (CPU) → crop each garment → Gemma 3 4B VLM → JSON catalog
Step 1 — Find garments. A fashion-tuned YOLOS model (WearSense-Detector) runs on CPU and draws bounding boxes around individual items. It works well on flat-lay photos — clothes spread on a bed or table. Hangers and worn-garment photos are harder; that's why manual correction exists (more on that below).
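The hand-off from detection to cropping can be sketched on box coordinates alone. This is a minimal illustration, not the project's actual code: the `Box` type, `pad_and_clamp`, `crop_regions`, the 5% padding, and the 16-pixel minimum side are all assumptions of mine.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Pixel-space bounding box: left, top, right, bottom."""
    x1: int
    y1: int
    x2: int
    y2: int


def pad_and_clamp(box: Box, img_w: int, img_h: int, pad: float = 0.05) -> Box:
    """Grow a box by `pad` of its size on each side, then clamp it to the image."""
    dx = round((box.x2 - box.x1) * pad)
    dy = round((box.y2 - box.y1) * pad)
    return Box(
        max(0, box.x1 - dx),
        max(0, box.y1 - dy),
        min(img_w, box.x2 + dx),
        min(img_h, box.y2 + dy),
    )


def crop_regions(boxes: list[Box], img_w: int, img_h: int, min_side: int = 16) -> list[Box]:
    """Return padded crop regions, dropping boxes too small to be a garment."""
    regions = []
    for box in boxes:
        padded = pad_and_clamp(box, img_w, img_h)
        if padded.x2 - padded.x1 >= min_side and padded.y2 - padded.y1 >= min_side:
            regions.append(padded)
    return regions
```

Each region is a plain `(left, top, right, bottom)` tuple, which is the shape Pillow's `Image.crop` expects, so the same function would serve both auto-detected and hand-drawn boxes.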
Step 2 — Extract attributes. Each crop goes to Gemma 3 4B (Q4_K_M GGUF via llama-cpp-python). The model returns seven structured fields:
| Field | Example |
|---|---|
| type | sweater |
| color | beige |
| material | knit |
| pattern | solid |
| season | all |
| formality | casual |
| description | A long-sleeved, beige knit sweater with a relaxed fit… |
{
"type": "sweater",
"color": "beige",
"material": "knit",
"pattern": "solid",
"season": "all",
"formality": "casual",
"description": "A long-sleeved, beige knit sweater with a relaxed fit..."
}
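Small models sometimes wrap their JSON in extra chatter, so the reply needs a tolerant parser. The sketch below shows one way to pull out the seven fields; `parse_garment_reply` and the "unknown" fallback are hypothetical choices of mine, not necessarily what the app does.

```python
import json

REQUIRED_FIELDS = (
    "type", "color", "material", "pattern", "season", "formality", "description",
)


def parse_garment_reply(reply: str) -> dict[str, str]:
    """Parse the outermost {...} span of a model reply and normalize the seven fields."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start : end + 1])  # raises ValueError on malformed JSON

    garment = {}
    for field in REQUIRED_FIELDS:
        value = str(data.get(field) or "").strip()
        if not value:
            # Assumption: a missing field becomes "unknown" instead of failing the upload.
            value = "unknown"
        elif field != "description":
            # Lowercase the categorical fields so later filters compare cleanly.
            value = value.lower()
        garment[field] = value
    return garment
```

Normalizing at parse time keeps every later step (filters, outfit rules, chat prompt) free of casing and missing-key checks.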
Step 3 — Store. Garment images land in data/garments/; metadata goes into data/catalog.json. Plain JSON, no database, no vector store. KISS on purpose.
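A plain-JSON store is only safe if a crash mid-write cannot corrupt it. Here is a rough sketch of ID assignment plus an atomic save. The catalog shape (a list of dicts with an `id` key) and the helper names are assumptions; only the `garment_001` ID style comes from the app.

```python
import json
import os
import tempfile
from pathlib import Path


def next_garment_id(catalog: list[dict]) -> str:
    """Return the next free ID in the garment_001 style."""
    used = [
        int(g["id"].split("_")[1])
        for g in catalog
        if g.get("id", "").startswith("garment_")
    ]
    return f"garment_{max(used, default=0) + 1:03d}"


def add_garment(catalog: list[dict], attributes: dict) -> dict:
    """Append a new entry with a fresh ID and return it."""
    entry = {"id": next_garment_id(catalog), **attributes}
    catalog.append(entry)
    return entry


def save_catalog(catalog: list[dict], path: Path) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(catalog, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # atomic on the same filesystem
```

Usage would look like `save_catalog(catalog, Path("data/catalog.json"))` after each `add_garment` call.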
Detection models are tiny (~3–6M parameters) and stay on CPU. Gemma peaks at about 7 GB of VRAM when offloaded to a local GPU. On Hugging Face Spaces, everything runs on CPU Basic (2 vCPU, 16 GB RAM), detection and inference alike.
Four tabs, one wardrobe
Once your closet is digitized, the app organizes around four workflows:
My Wardrobe
Browse your digital closet as a grid of garment cards. Click any item to open a side panel with the full image, description, and all extracted attributes. Cache-busted image URLs prevent stale thumbnails when the catalog updates.
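Cache busting can be as simple as appending a version that changes whenever the file does. The sketch below uses the file's modification time; the `/garments` URL prefix and the `cache_busted_url` helper are hypothetical, and the app may derive its version differently.

```python
from pathlib import Path


def cache_busted_url(image_path: Path, base_url: str = "/garments") -> str:
    """Build an image URL whose query string changes when the file is rewritten."""
    # st_mtime_ns changes on every write, so browsers refetch updated thumbnails.
    version = image_path.stat().st_mtime_ns if image_path.exists() else 0
    return f"{base_url}/{image_path.name}?v={version}"
```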
Add Clothes
Upload a photo via drag-and-drop. The app auto-detects garments and opens an Annotorious bounding-box editor so you can adjust, add, or remove boxes before analysis. When detection misses an item — common on non-flat-lay photos — you draw the rectangle yourself.
Alternatively, Load Dataset pulls garments from public Hugging Face datasets (second-hand for individual items, fashion-1k for multi-garment photos) and processes them through the same VLM pipeline. Progress streams in real time through a log dock at the bottom of the screen. Fair warning: on CPU Basic this takes 15–45 minutes for ~50 garments. Pre-load before a live demo.
Get Dressed
This is where rules meet language models.
- Rule engine generates all compatible top+bottom pairs, filtered by season and formality compatibility. A 51-garment demo wardrobe yields 237 valid combinations.
- LLM ranking sends a diverse subset (up to 20 combos) to Gemma with a compact prompt that fits the model's 4096-token context window. Describe an occasion — casual Friday at the office, dinner on a terrace in summer — and the model re-orders outfits best-to-worst. Leave the field empty and it defaults to everyday casual wear.
- Like / dislike saves preferences to data/outfits.json. Disliked pairs are excluded on future runs. This is preference filtering, not model training, and the app is honest about that.
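The rule-engine half of this tab fits in a few lines. The following is a simplified sketch under my own assumptions: the `TOPS` and `BOTTOMS` vocabularies are invented for illustration, seasons match when equal or when either is "all", and formality must match exactly. The real compatibility rules may be looser.

```python
from itertools import product

# Assumed type vocabularies; the real catalog may use different labels.
TOPS = {"sweater", "shirt", "t-shirt", "blouse"}
BOTTOMS = {"jeans", "trousers", "skirt", "shorts"}


def seasons_match(a: str, b: str) -> bool:
    """'all' pairs with anything; otherwise both seasons must be equal."""
    return "all" in (a, b) or a == b


def generate_outfits(
    catalog: list[dict],
    disliked: frozenset = frozenset(),
) -> list[tuple[str, str]]:
    """Return (top_id, bottom_id) pairs that pass the rules and were not disliked."""
    tops = [g for g in catalog if g["type"] in TOPS]
    bottoms = [g for g in catalog if g["type"] in BOTTOMS]
    return [
        (top["id"], bottom["id"])
        for top, bottom in product(tops, bottoms)
        if seasons_match(top["season"], bottom["season"])
        and top["formality"] == bottom["formality"]  # assumption: exact match
        and (top["id"], bottom["id"]) not in disliked
    ]
```

Keeping the disliked pairs as a set of ID tuples means a thumbs-down becomes a cheap membership test on the next run, with no model involved.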
Ask
A chat interface over your full catalog. The system prompt injects every garment with its bracketed ID ([garment_001]), and the assistant answers in the same language you write in. Replies render garment references as inline chips with thumbnail images and truncated descriptions — so "wear your blue linen shirt" becomes something you can actually see and tap.
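Both directions of the bracketed-ID convention can be sketched briefly: one function injects the catalog into the system prompt, another pulls IDs back out of the reply so the UI can render chips. The prompt wording, line format, and function names here are illustrative assumptions, not the app's actual prompt.

```python
import re

GARMENT_REF = re.compile(r"\[(garment_\d+)\]")


def build_system_prompt(catalog: list[dict]) -> str:
    """List every garment on one compact line, prefixed by its bracketed ID."""
    lines = [
        f"[{g['id']}] {g['color']} {g['material']} {g['type']} "
        f"({g['season']}, {g['formality']})"
        for g in catalog
    ]
    header = (
        "You are a wardrobe assistant. Recommend only garments from this catalog, "
        "cite them by bracketed ID, and reply in the user's language."
    )
    return header + "\n" + "\n".join(lines)


def extract_references(reply: str, catalog: list[dict]) -> list[str]:
    """Return the known garment IDs cited in a reply, in order, without duplicates."""
    known = {g["id"] for g in catalog}
    seen: list[str] = []
    for garment_id in GARMENT_REF.findall(reply):
        # IDs the model invented are dropped rather than rendered as broken chips.
        if garment_id in known and garment_id not in seen:
            seen.append(garment_id)
    return seen
```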
Architecture: small on purpose
The hackathon allows models up to 32B parameters. I chose 4B and built around that constraint instead of fighting it.
One model, three jobs
A singleton _ModelManager hot-swaps Gemma 3 4B between:
- Vision mode (MTMD handler) — attribute extraction from cropped images
- Text mode — outfit ranking and wardrobe chat
Same GGUF weights, different handlers. Simpler VRAM management than loading separate vision and chat models.
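The hot-swap idea can be shown without llama.cpp at all. This is a hypothetical sketch, not the real `_ModelManager`: loaders are injected as plain callables so the swapping logic stays testable, and the names and locking strategy are my assumptions.

```python
import threading
from typing import Any, Callable


class ModelManager:
    """Keeps at most one model instance alive and swaps it when the mode changes."""

    def __init__(self, loaders: dict[str, Callable[[], Any]]):
        self._loaders = loaders
        self._lock = threading.Lock()
        self._mode = None
        self._model = None

    def get(self, mode: str) -> Any:
        """Return the model for `mode`, loading it only if the mode changed."""
        with self._lock:
            if mode not in self._loaders:
                raise KeyError(f"unknown mode: {mode!r}")
            if mode != self._mode:
                # Drop the old instance first so both never sit in memory together.
                self._mode, self._model = None, None
                self._model = self._loaders[mode]()
                self._mode = mode
            return self._model
```

In the app, each loader would presumably build a `llama_cpp.Llama` over the same GGUF file, with a multimodal chat handler for vision mode and without one for text mode. A single module-level instance gives the singleton behavior.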
No RAG
A personal closet stays under 500 items. Attribute filters handle structured queries; the full catalog fits in the LLM prompt for chat. Adding embeddings, chunking, and retrieval would add complexity without clear benefit at this scale.
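At this scale a structured query is just a list comprehension. A minimal sketch, assuming each garment is a dict keyed by the extracted field names (`filter_catalog` is a hypothetical helper):

```python
def filter_catalog(catalog: list[dict], **criteria: str) -> list[dict]:
    """Return garments whose attributes equal every given criterion (case-insensitive)."""

    def matches(garment: dict) -> bool:
        return all(
            str(garment.get(field, "")).lower() == str(wanted).lower()
            for field, wanted in criteria.items()
        )

    return [g for g in catalog if matches(g)]
```

A call such as `filter_catalog(catalog, season="summer", formality="casual")` scans a 500-item closet in well under a millisecond, which is the practical argument against adding a retrieval layer.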
Pluggable detection
The detector uses a registry pattern — swap backends without touching the vision pipeline:
| Backend | Best for |
|---|---|
| yolos (default) | Flat-lay fashion photos |
| yolov8 | General-purpose detection |
| grounding_dino | Open-vocabulary ("find the scarf") |
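A registry of this kind usually comes down to a dict and a decorator. The sketch below is generic rather than a copy of the project's code: `register`, `get_detector`, and the stub backend that returns no boxes are placeholders I made up.

```python
from typing import Callable

Detection = tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels
Detector = Callable[[bytes], list[Detection]]

_REGISTRY: dict[str, Callable[[], Detector]] = {}


def register(name: str):
    """Decorator that files a detector factory under a backend name."""

    def decorator(factory: Callable[[], Detector]) -> Callable[[], Detector]:
        _REGISTRY[name] = factory
        return factory

    return decorator


@register("yolos")
def make_yolos() -> Detector:
    # Placeholder: a real backend would load model weights here.
    return lambda image_bytes: []


def get_detector(name: str = "yolos") -> Detector:
    """Look up a backend by name and build it; fail loudly on typos."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"unknown detector {name!r}; available: {sorted(_REGISTRY)}"
        ) from None
    return factory()
```

Because factories run only on lookup, heavy backends such as Grounding DINO cost nothing unless someone actually selects them.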
Two frontends, one backend
| | Custom UI (default) | Gradio Blocks (--default) |
|---|---|---|
| Stack | gr.Server + Alpine.js | Gradio 6.17 Blocks |
| Audience | End users — plain language, big buttons | Power users — detector settings, Spanish UI |
| Manual crop | Annotorious v3 editor | gradio-image-annotation |
Both share the same API endpoints exposed through gradio.Server. The custom frontend was built for my mother; the Gradio UI is the full-featured escape hatch.
What we deliberately did not build
- Vector DB / RAG
- External LLM APIs
- User accounts
- Weather API integration
- Video frame extraction
Scope discipline kept the project shippable.
Deployment: CPU Basic on Hugging Face Spaces
The live Space runs on CPU Basic with pre-built CPU wheels, so nothing compiles on wake. The first launch downloads the GGUF model (~2–3 minutes); after that, inference is consistently slow but functional:
| Task | CPU Basic (Space) | Local GPU |
|---|---|---|
| Garment extraction | ~30–90 s each | ~3–10 s each |
| Outfit ranking | ~5–15 s | ~2–5 s |
| Ask response | ~5–15 s | ~2–5 s |
| Dataset load (50 items) | ~15–45 min | ~5–15 min |
An S3 bucket persists garment JPEGs across Space reboots. The catalog JSON itself is session-local on the Space — for demos, pre-load the wardrobe via Load Dataset or ship a pre-built catalog.json locally.
Set HF_TOKEN in Space Secrets for model and dataset downloads. After that, inference needs no external API calls — Off the Grid.
Hackathon fit
Track: Backyard AI — a practical daily-life problem for someone I know.
| Badge | What it means |
|---|---|
| 🔌 Off the Grid | All inference on Space hardware; no external LLM APIs |
| 🦙 Llama Champion | llama.cpp via llama-cpp-python |
| 🐜 Tiny Titan | Gemma 3 4B — 4 billion parameters |
| 🎨 Off-Brand | Custom frontend via gr.Server + Alpine.js |
| 📓 Field Notes | Build report in FIELD_NOTES.md |
| 📡 Sharing is Caring | Full agent trace published on the Hub |
What I learned
Small models are good enough for structured extraction. Asking a 4B VLM "what type of garment is this?" and parsing JSON is surprisingly reliable. You don't need GPT-4 to label a sweater.
The UI matters more than the model. My mother doesn't care about GGUF quantization. She cares that the button says "Add Clothes" and the outfit cards look nice. Building a custom frontend on top of Gradio's server mode was the right trade-off — hosting and queuing from Gradio, UX entirely under my control.
Gradio Server mode is underrated. gr.Server gives you API endpoints, Spaces deployment, and streaming generators — while letting you ship vanilla HTML/CSS/JS. The custom UI talks to the backend through @gradio/client; no separate FastAPI app required.
Ship demo data. A "Load Dataset" button that processes 50 public garments means reviewers see value immediately without uploading their own closet. Just don't live-demo the 15–45 minute wait.
Context windows are real constraints. An early version of the outfit-ranking prompt sent 40 richly described combinations and blew past n_ctx=4096. The fix: compact garment lines, cap at 20 diverse combos, and reserve tokens for the JSON response. Measure your prompts.
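The fix described above has two parts: pick a diverse subset, then check it against a token budget. Here is a rough sketch of both. The greedy diversity rule, the four-characters-per-token estimate, and the `reserved_for_reply` and `overhead` numbers are all assumptions for illustration, not measured values from the app.

```python
from collections import Counter


def pick_diverse(pairs: list[tuple[str, str]], limit: int = 20) -> list[tuple[str, str]]:
    """Greedily pick pairs whose garments have appeared least often so far."""
    remaining = list(pairs)
    used: Counter = Counter()
    chosen = []
    while remaining and len(chosen) < limit:
        best = min(remaining, key=lambda p: used[p[0]] + used[p[1]])
        remaining.remove(best)
        chosen.append(best)
        used.update(best)
    return chosen


def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly four characters per token for English text."""
    return len(text) // 4 + 1


def fit_lines(
    lines: list[str],
    n_ctx: int = 4096,
    reserved_for_reply: int = 512,  # assumed room for the JSON answer
    overhead: int = 300,            # assumed size of instructions and occasion text
) -> list[str]:
    """Keep as many prompt lines as fit in the remaining token budget."""
    budget = n_ctx - reserved_for_reply - overhead
    kept, spent = [], 0
    for line in lines:
        cost = estimate_tokens(line)
        if spent + cost > budget:
            break
        kept.append(line)
        spent += cost
    return kept
```

For exact counts, llama-cpp-python's `Llama.tokenize` can replace the heuristic once the model is loaded; the estimate is only useful as a cheap early check.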
Try it yourself
On Hugging Face Spaces (CPU, no setup):
- Open build-small-hackathon/wardrobe-us
- Add Clothes → Load Dataset (or upload a flat-lay photo)
- Get Dressed → describe an occasion → Generate
- Ask → "What should I wear for a job interview?"
Locally with GPU:
cd packages/wardrobe-us
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# CUDA 12.4 GPU wheel (optional):
pip install llama-cpp-python==0.3.28 \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 \
  --force-reinstall --no-deps
python app.py # custom UI (default)
python app.py --default # full Gradio Blocks UI
Copy .env.example to .env and set HF_TOKEN for model downloads.
Pre-build a sample catalog offline:
python scripts/build_sample_wardrobe.py --dataset second-hand --target 50
Closing thought
Wardrobe AI is about shopping your own closet — less waste, less stress, more of what you already own. Four billion parameters, one JSON file, and a UI my mother can actually use.
If you try the Space or run it locally, I'd love to hear what breaks and what surprises you. The agent trace and field notes are on the Hub for anyone who wants to see how it was built.
License: MIT · Source: packages/wardrobe-us · Field notes: packages/wardrobe-us/FIELD_NOTES.md