LocalDuo — Build Small Hackathon Field Notes

Published June 15, 2026

Author: Shayekh Bin Islam, KAIST, South Korea
Date: June 2026
Stack: Gradio · Qwen 3.5-9B VLM · Cohere ASR · Supertonic TTS · HuggingFace Spaces (ZeroGPU)

What I Built

LocalDuo is an end-to-end Korean language learning application that takes any Korean-language content — a PDF textbook, a live website, an audio recording, or a YouTube video — and automatically transforms it into interactive vocabulary flashcards with native-quality audio pronunciation.

The core idea: instead of studying from generic word lists, learn vocabulary from content you actually care about. Upload a chapter from your Korean textbook, paste a BBC Korean news article, or drop in a K-drama YouTube clip, and the app extracts the most useful Korean vocabulary, transliterates it into your native script, explains the grammar, generates TTS pronunciation audio, and packages everything into swipeable flashcards with a built-in quiz mode.

Feature Overview

Feature	Description
Multi-Source Input	Website URLs, PDF uploads, audio file uploads, YouTube links, and pre-saved deck imports — five distinct input pipelines unified into one interface
Vision-Language Extraction	Qwen 3.5-9B processes both text and page images simultaneously, enabling vocabulary extraction from visual content (handwritten notes, textbook diagrams, infographics)
Speech-to-Text Pipeline	Cohere ASR (`cohere-transcribe-03-2026`) transcribes Korean audio from YouTube videos and uploaded audio files, with Korean-only filtering to strip English artifacts
Text-to-Speech Pronunciation	Supertonic-3 TTS generates natural Korean pronunciation for every extracted word, embedded as base64 audio data URIs directly in the flashcard HTML
Interactive Flashcard SPA	A full single-page application embedded via `<iframe srcdoc>` with card flipping, navigation, audio playback, and clipboard copy — all in vanilla JS/CSS
5-Question Quiz Mode	Auto-generated multiple-choice quizzes from the current deck with animated scoring and progress tracking
Multilingual Transliteration	Supports 200+ target languages organized by language family (Indo-European, Sino-Tibetan, Afro-Asiatic, etc.) with native script transliteration
Export to Anki & JSON	One-click export to `.apkg` (via `genanki`) for Anki spaced repetition, or `.json` for programmatic use
Think/Non-Think Toggle	User control over the model's reasoning chain — enable deep thinking for accuracy, or disable for instant JSON output
Korean-Themed UI	Custom dark theme inspired by Korean aesthetics: warm gold (금) accents, ink-wash animated backgrounds, Noto Serif KR typography

Architecture

┌────────────────────────────────────────────────────────────┐
│                    INPUT LAYER                        │
│  Website URL │ PDF Upload │ Audio │ YouTube │ Import  │
└───────┬───────┬─────────────┬────────┬──────────┬──────────┘
       │       │            │       │         │
       ▼       ▼            │       │         ▼
  Playwright  PyMuPDF       │       │    JSON/Anki
  + BS4       (fitz)        │       │    Parser
  Scraper     Extract       │       │
       │       │            ▼       ▼
       │       │     Cohere ASR (cohere-transcribe-03-2026)
       │       │         Korean audio → text
       │       │            │       │
       ▼       ▼            ▼       ▼
┌────────────────────────────────────────────────────────────┐
│              EXTRACTION LAYER (GPU)                   │
│                                                       │
│  Qwen 3.5-9B VLM (AutoModelForImageTextToText)        │
│  • Multimodal: text + page images → structured JSON   │
│  • Streaming via TextIteratorStreamer                 │
│  • Think/Non-think mode (enable_thinking flag)        │
│  • Auto-force JSON after configurable char threshold  │
│  • 3-attempt retry with partial JSON salvaging        │
└─────────────────────────┬──────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────────────────┐
│               POST-PROCESSING LAYER                   │
│  • JSON parsing with jiter (partial_mode=True)        │
│  • Supertonic-3 TTS → base64 audio data URIs          │
│  • Flashcard SPA builder (iframe srcdoc)              │
│  • Quiz generator (randomized MCQ)                    │
│  • Anki .apkg export (genanki)                        │
└─────────────────────────┬──────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────────────────┐
│                 PRESENTATION LAYER                    │
│  Gradio Blocks UI with Korean-inspired dark theme     │
│  • Streaming model output (live generation view)      │
│  • Interactive flashcard carousel (flip, nav, audio)  │
│  • 5-question multiple-choice quiz                    │
│  • Export buttons (JSON, Anki)                        │
│  • Generation controls (stop thinking, kill gen)      │
└────────────────────────────────────────────────────────────┘

Technical Deep Dives

1. Taming the Thinking Model

The biggest engineering challenge was using Qwen 3.5-9B in production. This model uses a <think>...</think> reasoning chain before generating output, which is great for accuracy but catastrophic for latency — the model would sometimes think for 10,000+ characters before producing any JSON.

Solution: A multi-layered forcing mechanism.

User clicks "Generate"
        │
        ▼
Model starts thinking (<think> block)
        │
        ├── Thinking chars > auto_force_chars (default: 4000)?
        │       YES → Kill generation thread
        │             Append "</think>\n```json\n[\n"
        │             Restart generation with partial context
        │
        ├── User clicks "⚡ Stop thinking, Generate now"?
        │       YES → Same forced restart
        │
        ├── Total output > 10,000 chars (hard limit)?
        │       YES → Hard force, kill and restart again
        │
        └── "Non-Think" mode toggled?
                YES → apply_chat_template(enable_thinking=False)
                      Appends empty <think>\n\n</think> block
                      Forces ```json prefix immediately

This required:

Thread management: model.generate() runs in a separate Thread with a TextIteratorStreamer. Killing generation means setting a StoppingCriteria flag, draining the streamer queue, joining the thread with timeout, then spawning a new thread with extended context.
Partial context stitching: When forcing, the entire output so far (including partial thinking) is appended to the chat template as partial assistant text, so the model has context for what it was about to generate.
Global flag coordination: global_stop_thinking and global_kill_threads are module-level mutable lists ([False]) shared by the UI thread and the generation thread. Assigning to a single list item is atomic under CPython's GIL, so a plain boolean flag needs no lock.
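
The kill-and-restart loop described above can be sketched with the standard library alone. This is a minimal illustration, not the app's code: `fake_generate` is a hypothetical stand-in for `model.generate()` feeding a `TextIteratorStreamer`, a plain `queue.Queue` plays the role of the streamer, and the function names are my own. The `[False]` flag and the forced suffix follow the description in this section.

```python
import queue
import threading

# Suffix appended on a forced restart, as quoted in the flow above.
# Built from pieces so the fence characters do not break this code block.
FORCED_SUFFIX = "</think>\n" + "`" * 3 + "json\n[\n"


def fake_generate(chunks, out_q, stop_flag):
    """Hypothetical stand-in for model.generate(): emits chunks until stopped."""
    try:
        for chunk in chunks:
            if stop_flag[0]:      # plays the role of a StoppingCriteria check
                break
            out_q.put(chunk)
    finally:
        out_q.put(None)           # sentinel, like the streamer's end signal


def stream_with_auto_force(chunks, auto_force_chars=4000):
    stop_flag = [False]           # mutable list shared across threads
    out_q = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_q, stop_flag))
    worker.start()

    text, forced = "", False
    while True:
        chunk = out_q.get()
        if chunk is None:         # generation finished on its own
            break
        text += chunk
        still_thinking = "</think>" not in text
        if still_thinking and len(text) > auto_force_chars:
            stop_flag[0] = True   # ask the worker to stop
            forced = True
            break

    if forced:
        # Drain leftovers up to the sentinel so nothing stale leaks into the restart.
        while out_q.get() is not None:
            pass
    worker.join(timeout=5)

    if forced:
        # The next generation call would continue from this stitched prefix.
        text += FORCED_SUFFIX
    return text, forced
```

In the real pipeline the returned text would be passed back as partial assistant text for a second generation call; here it is simply returned so the control flow can be inspected in isolation.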

2. Robust JSON Extraction

LLMs are unreliable JSON producers. The extraction pipeline has 4 fallback layers:

Regex extraction: Search for ```json ... ``` fenced blocks (last match preferred)
Raw JSON detection: Regex for [...] or {...} patterns
json.loads(): Standard parsing
jiter.from_json(partial_mode=True): Rust-based partial JSON parser that can handle truncated arrays, missing closing brackets, and other malformation from killed generation

Additionally, if the generation is killed mid-stream, the app attempts to salvage partial JSON from whatever the model produced before being interrupted.
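
A rough sketch of that fallback chain, under stated assumptions: the function names are hypothetical, and the last layer is a standard-library stand-in for `jiter` (it keeps only the array items that decode completely), so it approximates the salvage behavior rather than reproducing `jiter.from_json(partial_mode=True)`.

```python
import json
import re

FENCE = "`" * 3  # built this way so the fence characters do not break this block
FENCE_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def extract_candidate(text):
    """Layers 1-2: prefer the last closed fenced block, then fall back to raw JSON."""
    closed = FENCE_RE.findall(text)
    if closed:
        return closed[-1].strip()
    start = text.rfind(FENCE + "json")
    if start != -1:               # fence opened but never closed (killed generation)
        return text[start + len(FENCE) + 4:].strip()
    match = re.search(r"[\[{].*", text, re.DOTALL)
    return match.group(0).strip() if match else ""


def salvage_array(candidate):
    """Stand-in for a partial parser: collect every complete item of a cut-off array."""
    decoder = json.JSONDecoder()
    items = []
    pos = candidate.find("[")
    if pos == -1:
        return items
    pos += 1
    while True:
        while pos < len(candidate) and candidate[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(candidate) or candidate[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(candidate, pos)
        except json.JSONDecodeError:
            break                 # the truncated tail is dropped
        items.append(item)
    return items


def parse_vocab(text):
    """Layers 3-4: strict parse first, salvage on failure."""
    candidate = extract_candidate(text)
    try:
        parsed = json.loads(candidate)
        return parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError:
        return salvage_array(candidate)
```

Dropping the truncated last item is a deliberate choice in this sketch: a half-written flashcard is usually worse than a missing one.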

3. Audio as Data URIs

A design constraint of deploying on HuggingFace Spaces: you can't easily serve dynamic audio files from disk to the frontend.

Solution: Convert all TTS output to base64-encoded WAV data URIs (data:audio/wav;base64,...) and embed them directly in the flashcard HTML. Each card's audio is a self-contained data URI that plays via new Audio(uri).play() in the browser. This eliminates all file-serving concerns but increases HTML payload size — a single deck with 10 cards and audio is ~2-5MB of base64-encoded HTML.
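
The encoding step itself is small. In this sketch a generated sine tone stands in for the Supertonic output, and the sample rate and function names are assumptions chosen for illustration, not values taken from the app.

```python
import base64
import io
import math
import struct
import wave


def tone_wav_bytes(seconds=1.0, rate=44100, freq=440.0):
    """Stand-in for TTS output: a mono 16-bit sine tone as in-memory WAV bytes."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


def wav_to_data_uri(wav_bytes):
    """Wrap WAV bytes in a data URI that new Audio(uri).play() can consume."""
    return "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode("ascii")
```

Base64 inflates the payload by about a third. If the TTS emitted 16-bit mono at 44.1 kHz (an assumption; the format is not stated here), a 3-second clip would be about 265 KB of WAV and about 353 KB once encoded, so ten cards would land near 3.5 MB, consistent with the 2-5MB figure above.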

4. Dual-Environment Architecture (`IS_HF`)

The app runs in two modes:

Local development: Full debug logging to ./log/, GPU on CUDA device, fixed ports
HuggingFace Spaces: tempfile.gettempdir() for file I/O, @spaces.GPU decorators for ZeroGPU allocation, Playwright installed at runtime, all debug file writes disabled to prevent file descriptor exhaustion

The IS_HF flag is detected by trying to import spaces — if it succeeds, we're on HF Spaces.
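
A minimal sketch of that detection and the guarded debug writes. The `debug_write` helper and `LOG_DIR` name are hypothetical; only the import probe, the `./log/` directory, and the `tempfile.gettempdir()` fallback come from the description above.

```python
import os
import tempfile

try:
    import spaces  # noqa: F401  (only installed on HuggingFace Spaces)
    IS_HF = True
except ImportError:
    IS_HF = False

# Local runs log under ./log/; on Spaces any file I/O goes to the temp directory.
LOG_DIR = tempfile.gettempdir() if IS_HF else os.path.join(".", "log")


def debug_write(name, text, is_hf=IS_HF, log_dir=LOG_DIR):
    """Write a debug file locally; do nothing on Spaces and return None."""
    if is_hf:
        return None
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

Routing every debug write through one guarded helper keeps the `if not IS_HF:` check in a single place instead of scattering it across the pipeline.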

5. YouTube → Flashcards Pipeline

This was the most complex input pipeline:

YouTube URL → yt-dlp (first 5 min, WAV) → Cohere ASR → Korean text
                                                          │
                                                          ▼
                                                Korean-only filtering
                                                (regex: [가-힣ㄱ-ㅎㅏ-ㅣ])
                                                          │
                                                          ▼
                                              Qwen VLM (text-only mode)
                                                          │
                                                          ▼
                                                  Flashcards + TTS

Challenges:

YouTube bot detection required optional cookies.txt support
Cohere ASR sometimes returns English-only lines (song lyrics, UI text), which are filtered out using Korean Unicode range detection
Audio extraction is limited to first 5 minutes to stay within GPU time limits (180s @spaces.GPU duration)
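
The Korean-only filter is the simplest step to show in isolation. This sketch uses the character class from the diagram above; the function name and the line-level granularity are assumptions.

```python
import re

# Hangul syllable blocks plus standalone consonant and vowel jamo.
HANGUL_RE = re.compile(r"[가-힣ㄱ-ㅎㅏ-ㅣ]")


def keep_korean_lines(transcript):
    """Drop transcript lines that contain no Hangul at all."""
    kept = [
        line.strip()
        for line in transcript.splitlines()
        if HANGUL_RE.search(line)
    ]
    return "\n".join(kept)
```

Mixed lines such as "I love 김치" survive because they contain at least one Hangul character; only lines with no Korean at all are dropped.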

6. The Flashcard SPA

The flashcard interface is a complete single-page application (~800 lines of HTML/CSS/JS) embedded via <iframe srcdoc>. It features:

Card flipping animation: CSS transform: rotateY(180deg) with backface-visibility: hidden
Swipe navigation: Previous/Next buttons with card counter
Audio playback: One-click pronunciation via base64 data URI
Copy to clipboard: Click-to-copy Korean text with visual feedback
Dark theme: Fully self-contained styling that matches the parent Gradio theme
Responsive layout: Works on desktop and mobile viewports

The entire SPA is generated server-side as a Python f-string with the vocabulary JSON baked in. This avoids any client-server communication for card data.
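
A heavily reduced sketch of that server-side generation. The real SPA is far larger; the function name, the `korean` field, and the markup here are hypothetical, and the two escaping steps are my own precautions rather than something confirmed from the app.

```python
import html
import json


def build_flashcard_iframe(cards, height=420):
    """Bake the deck into a tiny page and wrap it in an iframe srcdoc."""
    # Escape "</" so a card containing "</script>" cannot close the script tag early.
    payload = json.dumps(cards, ensure_ascii=False).replace("</", "<\\/")
    page = f"""<!doctype html>
<html><body>
<div id="card"></div>
<script>
const CARDS = {payload};
document.getElementById("card").textContent = CARDS[0].korean;
</script>
</body></html>"""
    # srcdoc is an HTML attribute, so the whole page must be attribute-escaped.
    escaped = html.escape(page, quote=True)
    return f'<iframe srcdoc="{escaped}" height="{height}" style="width:100%;border:0"></iframe>'
```

The returned string could be handed to a Gradio HTML component. Because the deck lives inside the page as a JavaScript constant, flipping and navigating cards never needs a round trip to the server.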

Challenges & Solutions

Challenge	Solution
Qwen thinks for 10K+ chars	Multi-layered auto-force mechanism with configurable thresholds and manual override button
LLM outputs malformed JSON	4-layer fallback: regex extraction → raw JSON detection → `json.loads` → `jiter` partial parser
White backgrounds in Firefox dark theme	Aggressive CSS `!important` overrides on every Gradio internal class including `.file-preview`, `[data-testid="file"]`, `.wrap.default`, etc.
File descriptor errors on HF Spaces	Disabled all debug file I/O behind `if not IS_HF:` guards
Export downloads stuck at "processing"	Switched from hardcoded file paths to `tempfile.mkstemp()` with unique names per export
No audio on initial demo cards	Added TTS generation loop at startup for `BOOTSTRAP_VOCAB` before `create_demo()`
YouTube bot detection	Optional `cookies.txt` upload support via yt-dlp's `cookiefile` parameter
Gradio checkboxes invisible in dark theme	Custom `appearance: none` checkbox CSS with gold gradient fill and ✓ pseudo-element
GPU timeout on Spaces (ZeroGPU)	`@spaces.GPU(duration=180)` decorators, 5-minute YouTube audio limit, configurable auto-force threshold

What I Learned

On LLM Engineering

Thinking models need leashes. Qwen 3.5's <think> block is powerful but unpredictable. In production, you must have kill switches, auto-force thresholds, and timeout mechanisms. Unbounded thinking is a denial-of-service on your own GPU.
Partial JSON parsing is essential. If you're generating structured output from an LLM, invest in a robust partial parser (jiter was excellent). You will kill generation mid-token, and you need to salvage whatever was produced.
enable_thinking=False is underrated. The Qwen chat template has a built-in mechanism to skip the thinking chain entirely — it emits an empty <think>\n\n</think> block. For structured extraction tasks where you've already provided clear examples, non-think mode is 5-10x faster with comparable quality.
Thread management for streaming is tricky. The TextIteratorStreamer + threading pattern from transformers works but requires careful cleanup: drain the queue before joining, use stopping-criteria flags, and always call streamer.end() in a finally block.

On Multimodal Pipelines

VLMs handle mixed text+image extraction surprisingly well. Qwen 3.5-9B can simultaneously read Korean text from a page render and understand the visual context (diagrams, tables, images) to produce better vocabulary than text-only extraction.
ASR output needs post-filtering. Korean ASR models (even good ones like Cohere's) produce mixed-language output on music, UI sounds, and English speech. A simple regex filter that keeps only lines containing Hangul (the pipeline uses [가-힣ㄱ-ㅎㅏ-ㅣ]: syllable blocks plus standalone jamo) dramatically improves downstream quality.
Audio as data URIs is a viable architecture. For small-to-medium audio clips (1-3 seconds of TTS), base64 data URIs embedded directly in HTML eliminate all file-serving complexity. The payload size is manageable and the UX is seamless.

On Gradio & HuggingFace Spaces

Gradio's CSS is a battlefield. The framework injects deeply nested internal styles that are nearly impossible to override cleanly. The only reliable approach is aggressive !important declarations on specific element selectors. Firefox is especially stubborn — it requires targeting * descendants of containers that Chrome handles transitively.
ZeroGPU has hard time limits. The @spaces.GPU(duration=180) decorator means your entire pipeline — extraction, retry logic, TTS generation — must complete in 3 minutes. This forces architectural decisions: limit input size, cap retry attempts, use auto-force thresholds.
File I/O on Spaces is fragile. Debug logging, temporary files, and export paths all need special handling: use tempfile.mkstemp() for exports and tempfile.gettempdir() for log directories, and guard all debug writes behind the IS_HF flag.

On UI/UX

Demo-ready defaults matter enormously. Pre-loading a PDF example, an audio example, a BBC Korean URL, and a YouTube link — plus generating TTS audio for bootstrap flashcards at startup — means the app is immediately impressive on first load. The 30 seconds of startup TTS generation pays for itself in user engagement.
Dark themes need obsessive attention. Every single Gradio component — file previews, checkboxes, tab navs, dropdowns, sliders, progress bars, scrollbars — needs explicit dark styling. Miss one and you get a jarring white rectangle in your otherwise polished UI.

Models Used

Model	Purpose	Notes
Qwen/Qwen3.5-9B	Vision-Language extraction & translation	Custom chat template with `enable_thinking` support. Multimodal (text + images).
CohereLabs/cohere-transcribe-03-2026	Korean speech-to-text	Used for YouTube and audio upload transcription. Runs on CPU.
Supertonic-3	Korean text-to-speech	Generates natural pronunciation audio. F1 voice style, 0.7 speed, 12-step denoising.

Final Reflection

Building LocalDuo taught me that the "last mile" of LLM applications — the gap between a working model and a polished product — is where most of the engineering happens. The model inference itself is perhaps 20% of the code. The other 80% is: input parsing across 5 formats, thread management for streaming, JSON robustness, CSS battles with Gradio, file I/O gymnastics on cloud platforms, and the thousand small UX decisions that make the difference between a demo and a tool someone would actually use to learn Korean.

The most satisfying moment was watching the full pipeline work end-to-end: upload an audio clip → Cohere transcribes Korean speech → Qwen extracts vocabulary → Supertonic speaks each word → flashcards appear with audio playback. Three models, three modalities, one click.
