Wardrobe AI: Shop Your Own Closet with a 4B Model

Community Article

Published June 15, 2026

Ox1

build-small-hackathon

Built for the Gradio × Hugging Face Build Small Hackathon (June 2026).

Original blog post: My Blog

Live demo: huggingface.co/spaces/build-small-hackathon/wardrobe-us

We've all stood in front of a full closet feeling like we have nothing to wear — or bought something new only to discover an almost identical item buried in the back. My mother has over 200 garments and deals with this every morning: forgetting what she owns, buying duplicates, and struggling to combine pieces into outfits.

Wardrobe AI is my answer. Not another fashion recommender that pushes you to buy more — a personal knowledge graph for the clothes you already own. Upload photos, get a structured catalog, generate outfit ideas, and ask questions in plain English. Everything runs locally through llama.cpp and a single 4-billion-parameter model. No external LLM APIs.

Demo video

The problem wardrobe apps don't solve

Most fashion apps assume you want to buy something. Wardrobe AI assumes you want to use what you have.

The goal is simple:

Turn messy photos into a searchable catalog with structured metadata.
Generate valid outfit combinations from your actual tops and bottoms.
Let you chat with your wardrobe — "What should I wear for a summer wedding?" answered from garments you own, not from a generic style guide.

That last point matters. When the assistant recommends [garment_012], it refers to a real sweater in your catalog, with a photo and a description — not an abstract "navy crewneck" from training data.

How it works: detect, crop, extract

We don't send one big photo straight to a vision model. The pipeline is deliberately split:

Photo → YOLOS detector (CPU) → crop each garment → Gemma 3 4B VLM → JSON catalog

Step 1 — Find garments. A fashion-tuned YOLOS model (WearSense-Detector) runs on CPU and draws bounding boxes around individual items. It works well on flat-lay photos — clothes spread on a bed or table. Hangers and worn-garment photos are harder; that's why manual correction exists (more on that below).
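The hand-off from detection to cropping can be sketched as a small helper that pads each bounding box and clamps it to the image. This is an illustrative sketch, not the project's code: the function name `padded_crop_box` and the `pad_ratio` margin are assumptions. The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
def padded_crop_box(box, image_size, pad_ratio=0.05):
    """Expand a detector box by pad_ratio on each side and clamp it to the image.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    Returns integers in (left, upper, right, lower) order.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    pad_x = (x_max - x_min) * pad_ratio  # hypothetical margin, not from the article
    pad_y = (y_max - y_min) * pad_ratio
    left = max(0, int(x_min - pad_x))
    upper = max(0, int(y_min - pad_y))
    right = min(width, int(x_max + pad_x + 0.5))
    lower = min(height, int(y_max + pad_y + 0.5))
    return left, upper, right, lower
```

A little padding tends to help a VLM see sleeves and hems that a tight box cuts off, though the right margin depends on the detector.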

Step 2 — Extract attributes. Each crop goes to Gemma 3 4B (Q4_K_M GGUF via llama-cpp-python). The model returns seven structured fields:

Field	Example
`type`	sweater
`color`	beige
`material`	knit
`pattern`	solid
`season`	all
`formality`	casual
`description`	A long-sleeved, beige knit sweater with a relaxed fit…

{
  "type": "sweater",
  "color": "beige",
  "material": "knit",
  "pattern": "solid",
  "season": "all",
  "formality": "casual",
  "description": "A long-sleeved, beige knit sweater with a relaxed fit..."
}
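Small models sometimes wrap their JSON in extra prose or a markdown fence, so it helps to parse defensively. The sketch below is one possible approach and not necessarily what the app does: `parse_garment_json` is a hypothetical helper that takes the text between the first `{` and the last `}`, checks for the seven fields listed above, and lowercases the categorical ones.

```python
import json

REQUIRED_FIELDS = ("type", "color", "material", "pattern",
                   "season", "formality", "description")


def parse_garment_json(raw: str) -> dict:
    """Parse model output into a dict holding exactly the seven garment fields."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a ValueError subclass, so callers catch one type.
    data = json.loads(raw[start:end + 1])
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    cleaned = {f: str(data[f]).strip() for f in REQUIRED_FIELDS}
    # Lowercase the categorical fields; leave the free-text description alone.
    for f in REQUIRED_FIELDS[:-1]:
        cleaned[f] = cleaned[f].lower()
    return cleaned
```

Raising `ValueError` on a bad reply lets the caller decide whether to retry the crop or flag it for manual review.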

Step 3 — Store. Garment images land in data/garments/; metadata goes into data/catalog.json. Plain JSON, no database, no vector store. KISS on purpose.
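A plain-JSON store can be sketched in a few lines. The catalog layout here is an assumption (a top-level list of entries with `id` and `image` keys); only the `garment_NNN` ID style and the `data/catalog.json` path come from the article. Writing to a temporary file and renaming it keeps a crash from leaving a half-written catalog.

```python
import json
import os
import tempfile
from pathlib import Path


def next_garment_id(catalog: list) -> str:
    """Return the next zero-padded ID, e.g. garment_003 after garment_002."""
    numbers = [int(g["id"].split("_")[1]) for g in catalog]
    return f"garment_{max(numbers, default=0) + 1:03d}"


def add_garment(catalog_path: Path, attributes: dict, image_name: str) -> dict:
    """Append one garment to the catalog file and return the stored entry."""
    if catalog_path.exists():
        catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
    else:
        catalog = []
    entry = {"id": next_garment_id(catalog), "image": image_name, **attributes}
    catalog.append(entry)
    # Write to a temp file in the same directory, then atomically replace.
    catalog_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=catalog_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(catalog, f, indent=2, ensure_ascii=False)
    os.replace(tmp_name, catalog_path)
    return entry
```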

Detection models are tiny (~3–6M parameters) and stay on CPU. Gemma peaks at roughly 7 GB of VRAM when offloaded to a local GPU. On Hugging Face Spaces, everything runs on CPU Basic (2 vCPU, 16 GB RAM), detection and inference alike.

Four tabs, one wardrobe

Once your closet is digitized, the app organizes around four workflows:

My Wardrobe

Browse your digital closet as a grid of garment cards. Click any item to open a side panel with the full image, description, and all extracted attributes. Cache-busted image URLs prevent stale thumbnails when the catalog updates.
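Cache busting usually means appending a version value to the image URL so the browser treats an updated file as a new resource. A minimal sketch, assuming the version is something like the file's modification time; the helper name and the `v` parameter are illustrative, not taken from the app:

```python
from urllib.parse import urlencode


def cache_busted_url(image_url: str, version: int) -> str:
    """Append a version query parameter, respecting any existing query string."""
    separator = "&" if "?" in image_url else "?"
    return f"{image_url}{separator}{urlencode({'v': version})}"
```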

Add Clothes

Upload a photo via drag-and-drop. The app auto-detects garments and opens an Annotorious bounding-box editor so you can adjust, add, or remove boxes before analysis. When detection misses an item — common on non-flat-lay photos — you draw the rectangle yourself.

Alternatively, Load Dataset pulls garments from public Hugging Face datasets (second-hand for individual items, fashion-1k for multi-garment photos) and processes them through the same VLM pipeline. Progress streams in real time through a log dock at the bottom of the screen. Fair warning: on CPU Basic this takes 15–45 minutes for ~50 garments. Pre-load before a live demo.

Get Dressed

This is where rules meet language models.

Rule engine generates all compatible top+bottom pairs, filtered by season and formality compatibility. A 51-garment demo wardrobe yields 237 valid combinations.
LLM ranking sends a diverse subset (up to 20 combos) to Gemma with a compact prompt that fits the model's 4096-token context window. Describe an occasion — casual Friday at the office, dinner on a terrace in summer — and the model re-orders outfits best-to-worst. Leave the field empty and it defaults to everyday casual wear.
Like / dislike saves preferences to data/outfits.json. Disliked pairs are excluded on future runs. This is preference filtering, not model training — honest about what it does.
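The rule-engine step can be sketched as a filtered Cartesian product. The specifics below are assumptions made for illustration: which garment types count as tops or bottoms, treating `all` as a wildcard season, and requiring formality to match exactly. The real compatibility rules may be looser.

```python
from itertools import product

TOPS = {"sweater", "shirt", "t-shirt", "blouse"}     # assumed categories
BOTTOMS = {"jeans", "trousers", "skirt", "shorts"}   # assumed categories


def seasons_compatible(a: str, b: str) -> bool:
    """Seasons match if they are equal or either one is the wildcard 'all'."""
    return a == b or "all" in (a, b)


def generate_outfits(catalog, disliked=frozenset()):
    """Return (top_id, bottom_id) pairs that pass the rules and were not disliked."""
    tops = [g for g in catalog if g["type"] in TOPS]
    bottoms = [g for g in catalog if g["type"] in BOTTOMS]
    outfits = []
    for top, bottom in product(tops, bottoms):
        pair = (top["id"], bottom["id"])
        if pair in disliked:
            continue  # preference filtering: skip pairs the user rejected
        if (seasons_compatible(top["season"], bottom["season"])
                and top["formality"] == bottom["formality"]):
            outfits.append(pair)
    return outfits
```

Passing the disliked pairs loaded from data/outfits.json as `disliked` is enough to exclude them on later runs, with no model involved.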

Ask

A chat interface over your full catalog. The system prompt injects every garment with its bracketed ID ([garment_001]), and the assistant answers in the same language you write in. Replies render garment references as inline chips with thumbnail images and truncated descriptions — so "wear your blue linen shirt" becomes something you can actually see and tap.
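Both halves of this loop fit in a short sketch: one function builds the system prompt with bracketed IDs, and another pulls those IDs back out of the reply so the UI can render chips. The prompt wording and function names are assumptions; only the `[garment_001]` reference style comes from the app. Unknown IDs are dropped, which guards against the model inventing a garment.

```python
import re

GARMENT_REF = re.compile(r"\[(garment_\d+)\]")


def build_system_prompt(catalog):
    """List every garment on one compact line, prefixed with its bracketed ID."""
    lines = [
        "You are a wardrobe assistant. Recommend only garments from this list,",
        "citing each by its bracketed ID. Reply in the user's language.",
        "",
    ]
    for g in catalog:
        lines.append(
            f"[{g['id']}] {g['color']} {g['type']}, {g['season']}, {g['formality']}"
        )
    return "\n".join(lines)


def extract_garment_refs(reply, catalog):
    """Return known garment IDs cited in the reply, in order, without repeats."""
    known = {g["id"] for g in catalog}
    seen = []
    for garment_id in GARMENT_REF.findall(reply):
        if garment_id in known and garment_id not in seen:
            seen.append(garment_id)
    return seen
```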

Architecture: small on purpose

The hackathon allows models up to 32B parameters. I chose 4B and built around that constraint instead of fighting it.

One model, three jobs

A singleton _ModelManager hot-swaps Gemma 3 4B between two modes that together cover three jobs (attribute extraction, outfit ranking, and chat):

Vision mode (MTMD handler) — attribute extraction from cropped images
Text mode — outfit ranking and wardrobe chat

Same GGUF weights, different handlers. Simpler VRAM management than loading separate vision and chat models.
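The hot-swap idea can be sketched without any model library: keep at most one loaded instance and reload only when the requested mode changes. Everything here is hypothetical, including the class shape and the `loaders` mapping. In the real app the loaders would presumably construct llama-cpp-python objects with or without the vision handler, and a single module-level instance would play the singleton role.

```python
import threading
from typing import Any, Callable, Dict, Optional


class ModelManager:
    """Holds one loaded model at a time and swaps it when the mode changes."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]):
        self._loaders = loaders          # mode name -> zero-argument loader
        self._lock = threading.Lock()    # serialize swaps across request threads
        self._mode: Optional[str] = None
        self._model: Any = None

    def get(self, mode: str) -> Any:
        with self._lock:
            if mode != self._mode:
                if mode not in self._loaders:
                    raise KeyError(f"unknown mode: {mode!r}")
                # Drop the old instance before loading so memory can be freed.
                self._model, self._mode = None, None
                self._model = self._loaders[mode]()
                self._mode = mode
            return self._model
```

Repeated calls in the same mode reuse the loaded instance, so the swap cost is paid only when switching between vision and text work.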

No RAG

A personal closet stays under 500 items. Attribute filters handle structured queries; the full catalog fits in the LLM prompt for chat. Adding embeddings, chunking, and retrieval would add complexity without clear benefit at this scale.
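At this scale an attribute filter is just a list comprehension. A minimal sketch, assuming catalog entries are dicts keyed by the extracted field names; `filter_catalog` is a hypothetical helper, and it uses exact, case-insensitive matching only:

```python
def filter_catalog(catalog, **criteria):
    """Return garments whose fields equal every given criterion, ignoring case."""
    def matches(garment):
        return all(
            str(garment.get(field, "")).lower() == str(value).lower()
            for field, value in criteria.items()
        )
    return [g for g in catalog if matches(g)]
```

For example, `filter_catalog(catalog, type="sweater", formality="casual")` returns every casual sweater. A linear scan over a few hundred items is effectively instant.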

Pluggable detection

The detector uses a registry pattern — swap backends without touching the vision pipeline:

Backend	Best for
`yolos` (default)	Flat-lay fashion photos
`yolov8`	General-purpose detection
`grounding_dino`	Open-vocabulary ("find the scarf")
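A registry of this kind is typically a dictionary filled by a decorator. The sketch below shows the pattern only: the names `DETECTORS`, `register_detector`, and `get_detector` are assumptions, and the `yolos` entry is a stub that returns a fixed box rather than running a model.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
DETECTORS: Dict[str, Callable[..., List[Box]]] = {}


def register_detector(name: str):
    """Decorator that records a detector function under a backend name."""
    def decorator(fn):
        DETECTORS[name] = fn
        return fn
    return decorator


@register_detector("yolos")
def detect_yolos(image) -> List[Box]:
    # Stub for illustration: a real backend would run the model on `image`.
    return [(0.0, 0.0, 1.0, 1.0)]


def get_detector(name: str = "yolos") -> Callable[..., List[Box]]:
    """Look up a backend by name, listing the valid choices on failure."""
    try:
        return DETECTORS[name]
    except KeyError:
        raise ValueError(
            f"unknown detector {name!r}; available: {sorted(DETECTORS)}"
        ) from None
```

Because the pipeline only ever calls `get_detector(name)(image)`, adding a backend means registering one more function, with no changes downstream.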

Two frontends, one backend

	Custom UI (default)	Gradio Blocks (`--default`)
Stack	`gr.Server` + Alpine.js	Gradio 6.17 Blocks
Audience	End users — plain language, big buttons	Power users — detector settings, Spanish UI
Manual crop	Annotorious v3 editor	`gradio-image-annotation`

Both share the same API endpoints exposed through gr.Server. The custom frontend was built for my mother; the Gradio UI is the full-featured escape hatch.

What we deliberately did not build

Vector DB / RAG
External LLM APIs
User accounts
Weather API integration
Video frame extraction

Scope discipline kept the project shippable.

Deployment: CPU Basic on Hugging Face Spaces

The live Space runs on CPU Basic with pre-built CPU wheels, so nothing compiles on wake. The first launch downloads the GGUF model (~2–3 minutes); after that, inference is slow but usable:

Task	CPU Basic (Space)	Local GPU
Garment extraction	~30–90 s each	~3–10 s each
Outfit ranking	~5–15 s	~2–5 s
Ask response	~5–15 s	~2–5 s
Dataset load (50 items)	~15–45 min	~5–15 min

An S3 bucket persists garment JPEGs across Space reboots. The catalog JSON itself is session-local on the Space — for demos, pre-load the wardrobe via Load Dataset or ship a pre-built catalog.json locally.

Set HF_TOKEN in Space Secrets for model and dataset downloads. After that, inference needs no external API calls — Off the Grid.

Hackathon fit

Track: Backyard AI — a practical daily-life problem for someone I know.

Badge	What it means
🔌 Off the Grid	All inference on Space hardware; no external LLM APIs
🦙 Llama Champion	llama.cpp via `llama-cpp-python`
🐜 Tiny Titan	Gemma 3 4B — 4 billion parameters
🎨 Off-Brand	Custom frontend via `gr.Server` + Alpine.js
📓 Field Notes	Build report in `FIELD_NOTES.md`
📡 Sharing is Caring	Full agent trace published on the Hub

What I learned

Small models are good enough for structured extraction. Asking a 4B VLM "what type of garment is this?" and parsing JSON is surprisingly reliable. You don't need GPT-4 to label a sweater.

The UI matters more than the model. My mother doesn't care about GGUF quantization. She cares that the button says "Add Clothes" and the outfit cards look nice. Building a custom frontend on top of Gradio's server mode was the right trade-off — hosting and queuing from Gradio, UX entirely under my control.

Gradio Server mode is underrated. gr.Server gives you API endpoints, Spaces deployment, and streaming generators — while letting you ship vanilla HTML/CSS/JS. The custom UI talks to the backend through @gradio/client; no separate FastAPI app required.

Ship demo data. A "Load Dataset" button that processes 50 public garments means reviewers see value immediately without uploading their own closet. Just don't live-demo the 15–45 minute wait.

Context windows are real constraints. An early version of the outfit-ranking prompt sent 40 richly described combinations and blew past n_ctx=4096. The fix: compact garment lines, cap at 20 diverse combos, and reserve tokens for the JSON response. Measure your prompts.
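The fix described above amounts to budgeting tokens before building the prompt. The sketch below shows the idea under loud assumptions: token counts use a crude four-characters-per-token heuristic rather than a real tokenizer, the `reserve` value for the JSON reply is invented, and the combos are simply truncated to the first 20 instead of being chosen for diversity.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return len(text) // 4 + 1


def build_ranking_prompt(occasion, combos, n_ctx=4096, reserve=512, max_combos=20):
    """Build a compact ranking prompt that stays inside the context window.

    combos is a list of (top_text, bottom_text) pairs. Returns the prompt and
    the number of combos that actually fit.
    """
    header = (
        f"Occasion: {occasion or 'everyday casual wear'}\n"
        "Rank these outfits best to worst. Reply as JSON.\n"
    )
    # Leave room for the model's reply and for the header itself.
    budget = n_ctx - reserve - estimate_tokens(header)
    lines = []
    for i, (top, bottom) in enumerate(combos[:max_combos], start=1):
        line = f"{i}. {top} + {bottom}"
        cost = estimate_tokens(line)
        if cost > budget:
            break  # stop adding combos once the budget is spent
        lines.append(line)
        budget -= cost
    return header + "\n".join(lines), len(lines)
```

For exact counts, the model's own tokenizer is the better measure; llama-cpp-python exposes one through `Llama.tokenize`.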

Try it yourself

On Hugging Face Spaces (CPU, no setup):

Open build-small-hackathon/wardrobe-us
Add Clothes → Load Dataset (or upload a flat-lay photo)
Get Dressed → describe an occasion → Generate
Ask → "What should I wear for a job interview?"

Locally with GPU:

cd packages/wardrobe-us
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# CUDA 12.4 GPU wheel (optional):
pip install llama-cpp-python==0.3.28 \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 \
  --force-reinstall --no-deps

python app.py          # custom UI (default)
python app.py --default  # full Gradio Blocks UI

Copy .env.example to .env and set HF_TOKEN for model downloads.

Pre-build a sample catalog offline:

python scripts/build_sample_wardrobe.py --dataset second-hand --target 50

Closing thought

Wardrobe AI is about shopping your own closet — less waste, less stress, more of what you already own. Four billion parameters, one JSON file, and a UI my mother can actually use.

If you try the Space or run it locally, I'd love to hear what breaks and what surprises you. The agent trace and field notes are on the Hub for anyone who wants to see how it was built.

License: MIT · Source: packages/wardrobe-us · Field notes: packages/wardrobe-us/FIELD_NOTES.md
