LFM2.5-Audio-1.5B — Tool-Aware Fine-Tune (v4.1, blip prefix)
Full fine-tune of LiquidAI/LFM2.5-Audio-1.5B, identical to matbee/lfm2.5-audio-tool-aware-v4 except every tool_match training row's assistant audio is prefixed with a 160 ms two-tone "tool-invocation" blip before Mimi encoding. The fine-tune learns to emit the blip's Mimi codewords as the first audio frames of every ack, so a coordinator can detect the codeword pattern in the streaming audio output and fire the tool dispatcher in parallel with the spoken acknowledgement — eliminating round-trip latency between "ack ends" and "tool runs."
Results vs v4
Held-out eval, 119 rows (about 30 per class):
| Class | v4 | v4.1 | Δ |
|---|---|---|---|
| tool_match | 86.7% | 100.0% | +13.3 |
| tool_result_speak | 100.0% | 100.0% | 0 |
| tool_miss | 100.0% | 96.7% | −3.3 |
| non_tool | 86.7% | 83.3% | −3.3 |
| Overall | 93.3% | 95.0% | +1.7 |
Novel-facts narration (60 OOD facts, never in training): 96.7% faithful / 0% memorized (v4: 95% / 0%).
The unexpected tool_match win
The blip was designed as a latency optimization — emit a known codeword early so the coordinator can start the tool while the audio LLM finishes speaking. But the v4 → v4.1 numbers show the blip also fixes the v4 over-refusal regression: every one of the 4 v4 tool_match failures ("read my latest email", "how do i get to the airport", etc.) now properly acks.
Hypothesis: the blip acts as an audio-domain commitment cue. Once the model has emitted the blip's codewords, it's deep in the "tool-ack" trajectory and the strong refusal pressure from v4's tool_miss training can't pull it back to a refusal template. This is the audio analogue of a teacher-forced text prefix like "Sure — let me check that for you."
The blip
- Length: 160 ms (24 kHz mono int16)
- Source: synthetic two-tone cue (1.2 kHz + 1.8 kHz, 5 ms attack / 20 ms release envelope)
- Mimi-roundtripped: tools/lfm2_tool_aware/blips/tool_invoke.preview.wav (the version actually trained on, i.e. what Mimi will reconstruct)
- Raw codewords: tools/lfm2_tool_aware/blips/tool_invoke.codes.npy (for coordinator detection)
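The blip spec above can be approximated with a few lines of NumPy. This is a minimal sketch, not the script that produced the shipped file: it assumes the two tones sound simultaneously at equal weight, that the envelope ramps are linear, and that the peak level is half of full scale. None of those three details are stated in this card, and `make_blip` is a hypothetical name.

```python
import numpy as np

SAMPLE_RATE = 24_000   # Hz, from the spec above
DURATION_S = 0.160     # 160 ms
ATTACK_S = 0.005       # 5 ms attack
RELEASE_S = 0.020      # 20 ms release


def make_blip(freqs=(1200.0, 1800.0), peak=0.5) -> np.ndarray:
    """Return a mono int16 blip. Tone mixing, ramp shape and `peak` are assumptions."""
    n = int(round(SAMPLE_RATE * DURATION_S))       # 3840 samples
    t = np.arange(n) / SAMPLE_RATE
    # Assumption: both tones play at once, averaged so the sum stays within [-1, 1].
    wave = sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)

    # Assumption: linear attack and release ramps around a flat sustain.
    env = np.ones(n)
    n_att = int(round(SAMPLE_RATE * ATTACK_S))     # 120 samples
    n_rel = int(round(SAMPLE_RATE * RELEASE_S))    # 480 samples
    env[:n_att] = np.linspace(0.0, 1.0, n_att, endpoint=False)
    env[-n_rel:] = np.linspace(1.0, 0.0, n_rel)

    return np.round(wave * env * peak * 32767).astype(np.int16)


blip = make_blip()
print(blip.shape, blip.dtype)
```

A raw blip like this is not what the model is trained on: per the list above, the trained-on version is the Mimi-roundtripped preview file.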
The blip is only prepended to tool_match rows. tool_result_speak, tool_miss, and non_tool audio is unchanged, so the model learns to emit the codewords only on tool acks.
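The class-conditional prefixing described above can be sketched as follows. The actual logic lives in preprocess_for_lfm2.py, which is not reproduced in this card, so the function name `prefix_blip` and the `row_class` argument are hypothetical, and the arrays are placeholders rather than real audio.

```python
import numpy as np


def prefix_blip(assistant_pcm: np.ndarray, blip_pcm: np.ndarray, row_class: str) -> np.ndarray:
    """Prepend the blip to tool_match rows only; return other classes unchanged."""
    if row_class != "tool_match":
        return assistant_pcm
    return np.concatenate([blip_pcm, assistant_pcm])


blip = np.zeros(3840, dtype=np.int16)      # placeholder: 160 ms at 24 kHz
speech = np.ones(24_000, dtype=np.int16)   # placeholder: 1 s of assistant audio

for cls in ("tool_match", "tool_result_speak", "tool_miss", "non_tool"):
    out = prefix_blip(speech, blip, cls)
    print(cls, len(out))
```

Only the tool_match row grows by the blip length; the prefixing happens on PCM, before Mimi encoding, as stated in the introduction.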
Blip-emission validation (direct, end-to-end)
validate_blip_emission.py --match-cb0-only --n 40 against held-out tool_match rows:
| Matcher | Match rate |
|---|---|
| Full 8-codebook bit-exact | 0 / 40 (0%) — Mimi's deeper codebooks capture fine acoustic detail stochastically; bf16 fine-tune can't reproduce them deterministically |
| cb0-only (semantic codebook) | 38 / 40 (95.0%) ✓ |
| cb0 OR cb3 | 38 / 40 (95.0%) — the 2 misses are trajectory misses, not codebook misses; OR-combining doesn't help |
Per-position rates: cb0 = 98%, cb3 = 94%, cb5 = 0% (later codebooks drift).
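The three matchers in the table can be expressed as array reductions. The sketch below is an illustration on synthetic data, not the code in validate_blip_emission.py: `match_rates` is a hypothetical helper, the reference codewords are placeholders, and reading "per-position rate" as the per-codebook hit rate over all rows and frames is an assumption.

```python
import numpy as np


def match_rates(openings: np.ndarray, reference: np.ndarray) -> dict:
    """openings: (n_rows, n_codebooks, n_frames); reference: (n_codebooks, n_frames)."""
    hits = openings == reference[None]       # elementwise codeword matches
    row_cb = hits.all(axis=2)                # (n_rows, n_codebooks): match on every frame
    return {
        "full": float(row_cb.all(axis=1).mean()),                 # all codebooks bit-exact
        "cb0_only": float(row_cb[:, 0].mean()),                   # semantic codebook only
        "cb0_or_cb3": float((row_cb[:, 0] | row_cb[:, 3]).mean()),
        "per_codebook": hits.mean(axis=(0, 2)),                   # hit rate per codebook
    }


# Synthetic stand-in: 40 rows, placeholder codewords (not the real blip codes).
reference = np.arange(16).reshape(8, 2)
openings = np.tile(reference, (40, 1, 1))
openings[:, 4:, :] += 1     # deeper codebooks drift on every row
openings[:2, 0, 0] += 1     # two rows miss on cb0 ...
openings[:2, 3, 0] += 1     # ... and on cb3 too, mimicking a trajectory miss

rates = match_rates(openings, reference)
print(rates["full"], rates["cb0_only"], rates["cb0_or_cb3"])
```

Because the two synthetic misses differ on both cb0 and cb3, OR-combining the matchers recovers nothing, which mirrors the pattern reported in the table.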
The bifurcation is even cleaner than the 95% TPR suggests:
- tool_match rows → cb0 = [430, 127, ...] (model emits blip's semantic codebook)
- non_tool / tool_miss rows → cb0 = [1049, 127, ...] (entirely different opening)
The two groups are separated bit-exactly by the cb0 codewords of the first 2 audio frames. Mimi's frame rate is 12.5 Hz, so 2 frames cover 160 ms of audio: a coordinator that watches cb0 can confirm tool intent within that window, then dispatch the tool while the ack is still being spoken.
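The frame arithmetic is easy to check. The snippet below is a small sketch using only the 12.5 Hz frame rate and 24 kHz sample rate quoted in this card; the helper names are hypothetical. The result is audio duration, not wall-clock detection time, which also depends on how fast the model generates.

```python
MIMI_FRAME_RATE_HZ = 12.5   # from the text above
SAMPLE_RATE_HZ = 24_000     # blip sample rate stated in this card


def frames_to_ms(n_frames: int) -> float:
    """Audio duration covered by n_frames Mimi frames, in milliseconds."""
    return n_frames * 1000.0 / MIMI_FRAME_RATE_HZ


def frames_to_samples(n_frames: int) -> int:
    """Number of PCM samples covered by n_frames Mimi frames."""
    return round(n_frames * SAMPLE_RATE_HZ / MIMI_FRAME_RATE_HZ)


print(frames_to_ms(2))        # 160.0
print(frames_to_samples(2))   # 3840
```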
Coordinator integration recipe
import asyncio

import numpy as np

# The blip is shape (1, 8, 2) on disk; .squeeze(0) gives (codebooks, frames).
# blip[0] is "codebook 0 across frame 0 + frame 1", exactly what we detect.
BLIP_CB0 = np.array([430, 127], dtype=np.int64)


def is_blip_emitted(first_2_audio_frames: np.ndarray) -> bool:
    """first_2_audio_frames: shape (n_codebooks=8, n_frames=2) of Mimi tokens."""
    return np.array_equal(first_2_audio_frames[0], BLIP_CB0)


# During inference: tap the audio_out token stream.
# asyncio.create_task needs a running event loop, so run this inside a coroutine.
audio_frames = []
for tok in model.generate_interleaved(**chat, ...):  # "..." = remaining arguments, elided
    if tok.numel() > 1:  # audio token (one entry per codebook)
        audio_frames.append(tok.cpu().numpy().squeeze())
        # Check once, at the moment the second audio frame arrives.
        if len(audio_frames) == 2 and is_blip_emitted(
                np.stack(audio_frames, axis=1)):
            # Tool intent committed. Dispatch in parallel with the spoken ack.
            # tool_name and args are assumed to be resolved elsewhere by the coordinator.
            asyncio.create_task(coordinator.dispatch_tool(tool_name, args))
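The 5% of misses fail safe: when the model takes a different trajectory entirely (a longer text preamble, for example), is_blip_emitted returns False and the coordinator falls back to ack-end detection. A miss therefore costs only the latency gain, so there is no regression relative to a coordinator without blip detection.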
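The fast path and the fallback can be combined in one loop. What follows is a minimal sketch under stated assumptions, not the coordinator shipped with this model: `run_turn`, `dispatch_tool`, and `ack_end_detected` are hypothetical names, and the frame source is a fake async generator standing in for the model's audio token stream.

```python
import asyncio

import numpy as np

BLIP_CB0 = np.array([430, 127], dtype=np.int64)


async def run_turn(frames, dispatch_tool, ack_end_detected) -> str:
    """frames: async iterator of (8,) Mimi token arrays, one per audio frame.

    Returns which path fired the tool: "blip", "ack_end", or "none".
    """
    opening = []
    task = None
    async for frame in frames:
        if len(opening) < 2:
            opening.append(frame)
            if len(opening) == 2 and np.array_equal(
                    np.stack(opening, axis=1)[0], BLIP_CB0):
                # Fast path: start the tool while the ack keeps streaming.
                task = asyncio.create_task(dispatch_tool())
        # A real coordinator would forward `frame` to the audio decoder here.
    if task is not None:
        await task
        return "blip"
    # Fallback: no blip was seen, so decide only after the ack has finished.
    if ack_end_detected():
        await dispatch_tool()
        return "ack_end"
    return "none"


async def fake_frames(cb0_values):
    """Yield placeholder frames whose cb0 entry follows cb0_values."""
    for value in cb0_values:
        frame = np.zeros(8, dtype=np.int64)
        frame[0] = value
        yield frame
        await asyncio.sleep(0)


async def fake_dispatch():
    await asyncio.sleep(0)


print(asyncio.run(run_turn(fake_frames([430, 127, 5]), fake_dispatch, lambda: True)))
print(asyncio.run(run_turn(fake_frames([1049, 127, 5]), fake_dispatch, lambda: True)))
```

The first call takes the blip path and the second falls back to ack-end detection, so a missed blip changes only when the tool starts, not whether it runs.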
Pitfall when loading the blip: the npy is shape (1, 8, 2) = (batch, codebooks, frames). After .squeeze(0):
- blip[0] → [430, 127] ← cb0 across both frames (correct detection target)
- blip[:, 0] → [430, 13, 1903, 1548, ...] ← all codebooks at frame 0 (NOT the detection target)
Wiring the wrong index gives 0% match and the appearance of a broken model. See tools/lfm2_tool_aware/validate_blip_emission.py for the reference implementation.
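The indexing pitfall is easiest to see on a stand-in array. In this sketch only cb0 = [430, 127] comes from this card; every other entry is a -1 placeholder rather than a real codeword, and the array is round-tripped through an in-memory buffer instead of the actual tool_invoke.codes.npy file.

```python
import io

import numpy as np

# Stand-in for the on-disk codes: (batch=1, codebooks=8, frames=2).
codes = np.full((1, 8, 2), -1, dtype=np.int64)   # -1 = placeholder, not a real codeword
codes[0, 0] = [430, 127]                         # cb0 values quoted in this card

buf = io.BytesIO()
np.save(buf, codes)
buf.seek(0)

blip = np.load(buf).squeeze(0)      # (codebooks=8, frames=2)
assert blip.shape == (8, 2)

target = blip[0]                    # cb0 across both frames: the detection target
wrong = blip[:, 0]                  # all 8 codebooks at frame 0: not the target
print(target.shape, wrong.shape)    # (2,) (8,)
```

Asserting the shape right after loading is a cheap guard: the two slices differ in length, so a mis-wired index shows up immediately instead of as a 0% match rate.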
A runnable end-to-end demo (3 tool_match + 3 non_tool turns, streaming PCM playback, notification fires the moment the blip is detected) lives at tools/lfm2_tool_aware/demo_blip_tool_dispatch.py.
Two-turn flow
Same as v4 — set_context() triggers turn-2 narration. See the v4 model card for the full pattern.
Training recipe
Identical to v4:
- Base: LiquidAI/LFM2.5-Audio-1.5B, full bf16
- 2× RTX 4090
- 3000 train + 400 eval examples (same JSONL as v4; only the preprocessing differs)
- bs=2/GPU × 2 GPUs × 1120 steps (~1.5 epochs)
- lr 5e-5, cosine + 100 warmup, ctx=512
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- Final val_loss = 0.90 (v4: 0.89)
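The epoch count and learning-rate schedule in the list above can be sanity-checked numerically. This is a sketch under assumptions: no gradient accumulation, linear warmup, and a cosine decay that reaches zero at the final step. The card only says "cosine + 100 warmup", so the exact schedule shape used in training may differ.

```python
import math

PER_GPU_BATCH, N_GPUS, STEPS, TRAIN_ROWS = 2, 2, 1120, 3000
PEAK_LR, WARMUP_STEPS = 5e-5, 100

# Assumption: no gradient accumulation, so each step consumes 4 examples.
samples_seen = PER_GPU_BATCH * N_GPUS * STEPS
epochs = samples_seen / TRAIN_ROWS


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to zero (the zero floor is an assumption)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(samples_seen, round(epochs, 2))   # 4480 1.49
for step in (0, 50, 100, 610, 1120):
    print(step, lr_at(step))
```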
Reproducing v4.1 from v4 data
python tools/lfm2_tool_aware/preprocess_for_lfm2.py \
--in train_v4.distilled.jsonl --output-path preprocessed/train_v4_blip \
--assistant-voice am_adam --max-context-length 512 \
--blip-wav-path tools/lfm2_tool_aware/blips/tool_invoke.preview.wav
# train with same recipe as v4
Known limitations
- One tool_miss regression on a hard borderline query ("how much is a lyft to the airport", where transport is not listed but maps is).
- Same 4–5 non_tool chitchat hard cases as v4 ("Have a nice flight.", "We have that package."). These are short, under-specified utterances where any sensible LLM would also be uncertain.
- 5% of tool_match outputs miss the blip: the model takes a different audio-opening trajectory entirely (a longer text preamble, for example). These are trajectory misses, not codebook misses, so OR-combining codebook matchers doesn't recover them. Fail-safe: the coordinator falls back to ack-end detection on misses. Pushing past 95% requires changes in a v4.2 (more blip-prefixed rows, a capped text preamble, or a longer blip with redundancy).
License
Inherited from base: LFM Open License v1.0.
Predecessors
- matbee/lfm2.5-audio-tool-aware-v1
- matbee/lfm2.5-audio-tool-aware-v2
- matbee/lfm2.5-audio-tool-aware-v4: same data, no blip prefix.
- v4.1 (this release): adds 160 ms tool-invocation blip to tool_match rows; 100% tool_match accuracy + parallel-dispatch capable.