LFM2.5-Audio-1.5B — Tool-Aware Fine-Tune (v4.1, blip prefix)

Full fine-tune of LiquidAI/LFM2.5-Audio-1.5B, identical to matbee/lfm2.5-audio-tool-aware-v4 except every tool_match training row's assistant audio is prefixed with a 160 ms two-tone "tool-invocation" blip before Mimi encoding. The fine-tune learns to emit the blip's Mimi codewords as the first audio frames of every ack, so a coordinator can detect the codeword pattern in the streaming audio output and fire the tool dispatcher in parallel with the spoken acknowledgement — eliminating round-trip latency between "ack ends" and "tool runs."

Results vs v4

Held-out eval, 120 rows (4 classes × 30 per class):

Class               v4       v4.1     Δ
tool_match          86.7%    100.0%   +13.3
tool_result_speak   100.0%   100.0%   0
tool_miss           100.0%   96.7%    −3.3
non_tool            86.7%    83.3%    −3.3
Overall             93.3%    95.0%    +1.7
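
As a sanity check on the table, the per-class percentages are all consistent with 30 rows per class, and the overall figures follow from pooling the classes. The correct-row counts in this sketch are inferred from the reported percentages; the card itself does not list raw counts.

```python
# Correct-row counts inferred from the reported percentages at 30 rows per class.
# Assumption: the card reports percentages only, so these counts are back-computed.
ROWS_PER_CLASS = 30
correct = {
    "tool_match":        {"v4": 26, "v4.1": 30},
    "tool_result_speak": {"v4": 30, "v4.1": 30},
    "tool_miss":         {"v4": 30, "v4.1": 29},
    "non_tool":          {"v4": 26, "v4.1": 25},
}


def pct(n_correct: int, n_total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal like the table."""
    return round(100.0 * n_correct / n_total, 1)


total_rows = ROWS_PER_CLASS * len(correct)
per_class = {
    name: {version: pct(n, ROWS_PER_CLASS) for version, n in counts.items()}
    for name, counts in correct.items()
}
overall = {
    version: pct(sum(counts[version] for counts in correct.values()), total_rows)
    for version in ("v4", "v4.1")
}
print(total_rows, per_class, overall)
```

Under these inferred counts the pooled accuracy reproduces the Overall row, and the 4 missing tool_match rows in v4 match the 4 failures discussed in the next section.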

Novel-facts narration (60 OOD facts, never in training): 96.7% faithful / 0% memorized (v4: 95% / 0%).

The unexpected tool_match win

The blip was designed as a latency optimization — emit a known codeword early so the coordinator can start the tool while the audio LLM finishes speaking. But the v4 → v4.1 numbers show the blip also fixes the v4 over-refusal regression: every one of the 4 v4 tool_match failures ("read my latest email", "how do i get to the airport", etc.) now properly acks.

Hypothesis: the blip acts as an audio-domain commitment cue. Once the model has emitted the blip's codewords, it's deep in the "tool-ack" trajectory and the strong refusal pressure from v4's tool_miss training can't pull it back to a refusal template. This is the audio analogue of a teacher-forced text prefix like "Sure — let me check that for you."

The blip

  • Length: 160 ms (24 kHz mono int16)
  • Source: synthetic two-tone cue (1.2 kHz + 1.8 kHz, 5 ms attack / 20 ms release envelope)
  • Mimi-roundtripped: tools/lfm2_tool_aware/blips/tool_invoke.preview.wav (the version actually trained on — what Mimi will reconstruct)
  • Raw codewords: tools/lfm2_tool_aware/blips/tool_invoke.codes.npy (for coordinator detection)

The blip is only prepended to tool_match rows. tool_result_speak, tool_miss, and non_tool audio is unchanged, so the model learns to emit the codewords only on tool acks.
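
For illustration, a cue with the parameters listed above could be synthesized and prepended roughly as follows. This is a sketch, not the script that produced the shipped blip: the simultaneous equal-level tones, the linear envelope ramps, the 0.5 peak level, and the names `make_blip` and `maybe_prefix` are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 24_000            # Hz, from the card
BLIP_MS = 160                   # from the card
TONES_HZ = (1200.0, 1800.0)     # from the card
ATTACK_MS, RELEASE_MS = 5, 20   # from the card


def make_blip(peak: float = 0.5) -> np.ndarray:
    """Return a 160 ms two-tone cue as mono int16.

    Assumptions (not stated in the card): both tones sound simultaneously at
    equal level, the envelope ramps are linear, and the peak is 0.5 full scale.
    """
    n = SAMPLE_RATE * BLIP_MS // 1000
    t = np.arange(n) / SAMPLE_RATE
    wave = sum(np.sin(2 * np.pi * f * t) for f in TONES_HZ) / len(TONES_HZ)
    n_att = SAMPLE_RATE * ATTACK_MS // 1000
    n_rel = SAMPLE_RATE * RELEASE_MS // 1000
    env = np.ones(n)
    env[:n_att] = np.linspace(0.0, 1.0, n_att, endpoint=False)  # attack ramp
    env[n - n_rel:] = np.linspace(1.0, 0.0, n_rel)              # release ramp
    return np.round(wave * env * peak * 32767).astype(np.int16)


def maybe_prefix(row_class: str, assistant_pcm: np.ndarray,
                 blip: np.ndarray) -> np.ndarray:
    """Prepend the blip only for tool_match rows; other classes are untouched."""
    if row_class == "tool_match":
        return np.concatenate([blip, assistant_pcm])
    return assistant_pcm
```

Note that the fine-tune was trained on the Mimi-roundtripped version of the blip (tool_invoke.preview.wav), so a freshly synthesized cue would need the same roundtrip before it could stand in for the shipped file.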

Blip-emission validation (direct, end-to-end)

validate_blip_emission.py --match-cb0-only --n 40 against held-out tool_match rows:

Matcher                          Match rate
Full 8-codebook bit-exact        0 / 40 (0%)
cb0-only (semantic codebook)     38 / 40 (95.0%)
cb0 OR cb3                       38 / 40 (95.0%)

Notes:

  • Full 8-codebook bit-exact: Mimi's deeper codebooks capture fine acoustic detail stochastically; a bf16 fine-tune can't reproduce them deterministically.
  • cb0 OR cb3: the 2 misses are trajectory misses, not codebook misses, so OR-combining doesn't help.

Per-position rates: cb0 = 98%, cb3 = 94%, cb5 = 0% (later codebooks drift).
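
One way to read the three matchers is sketched below. The function name `match_rates` and the exact definitions (a codebook "matches" when both of its frames equal the reference) are assumptions; validate_blip_emission.py remains the reference implementation.

```python
import numpy as np


def match_rates(generated, blip):
    """Compute row-level match rates against the reference blip codes.

    generated: list of (8, 2) integer arrays, the first 2 audio frames per row.
    blip:      (8, 2) integer array of reference codes (codebooks, frames).
    """
    gen = np.stack(generated)                 # (n_rows, 8, 2)
    per_cb = (gen == blip).all(axis=2)        # (n_rows, 8): both frames match
    return {
        "full": float(per_cb.all(axis=1).mean()),
        "cb0": float(per_cb[:, 0].mean()),
        "cb0_or_cb3": float((per_cb[:, 0] | per_cb[:, 3]).mean()),
    }


# Synthetic demo data (hypothetical codes, not the real blip).
ref = np.arange(16).reshape(8, 2)
exact = ref.copy()
cb5_drift = ref.copy(); cb5_drift[5] += 100   # deep codebook differs
cb0_miss = ref.copy(); cb0_miss[0] += 100     # cb0 differs, cb3 still matches
all_miss = ref + 100                          # different trajectory entirely
rates = match_rates([exact, cb5_drift, cb0_miss, all_miss], ref)
print(rates)
```

In the synthetic demo, the row with only cb5 drifting fails the full matcher but passes cb0, which mirrors the pattern in the table: deep-codebook drift breaks bit-exact matching without affecting the semantic codebook.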

The bifurcation is even cleaner than the 95% TPR suggests:

  • tool_match rows: cb0 = [430, 127, ...] (model emits blip's semantic codebook)
  • non_tool / tool_miss rows: cb0 = [1049, 127, ...] (entirely different opening)

Discrimination is bit-exact on the cb0 codewords of the first 2 audio frames. Mimi's frame rate is 12.5 Hz, so 2 frames = 160 ms of audio: a coordinator that watches cb0 can confirm tool intent after 160 ms of generated audio, then dispatch the tool while the ack is still being spoken.
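
The frame arithmetic can be checked directly from the two rates quoted in this card (12.5 Hz Mimi frames, 24 kHz audio). The variable names are illustrative only.

```python
MIMI_FRAME_RATE_HZ = 12.5   # from the card
SAMPLE_RATE_HZ = 24_000     # from the card
BLIP_FRAMES = 2             # frames the coordinator inspects

frame_ms = 1000 / MIMI_FRAME_RATE_HZ                          # duration of one frame
samples_per_frame = int(SAMPLE_RATE_HZ / MIMI_FRAME_RATE_HZ)  # PCM samples per frame
blip_ms = BLIP_FRAMES * frame_ms                              # audio covered by the check
blip_samples = BLIP_FRAMES * samples_per_frame                # same span in samples
print(frame_ms, samples_per_frame, blip_ms, blip_samples)
```

These figures describe audio duration. The wall-clock time to detection depends on how quickly the model generates its first two audio frames.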

Coordinator integration recipe

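import asyncio

import numpy as np

# The blip is shape (1, 8, 2) on disk; .squeeze(0) gives (codebooks, frames).
# blip[0] is "codebook 0 across frame 0 + frame 1", which is exactly what we detect.
BLIP_CB0 = np.array([430, 127], dtype=np.int64)

def is_blip_emitted(first_2_audio_frames: np.ndarray) -> bool:
    """first_2_audio_frames: shape (n_codebooks=8, n_frames=2) of Mimi tokens."""
    return np.array_equal(first_2_audio_frames[0], BLIP_CB0)

# During inference: tap the audio_out token stream.
# model, chat, coordinator, tool_name and args come from your serving code;
# gen_kwargs stands in for whatever generation arguments you normally pass.
# asyncio.create_task needs a running event loop, so the loop lives in a coroutine.
async def run_turn(model, chat, coordinator, tool_name, args, **gen_kwargs):
    audio_frames = []
    for tok in model.generate_interleaved(**chat, **gen_kwargs):
        if tok.numel() > 1:                                # audio token
            audio_frames.append(tok.cpu().numpy().squeeze())
            if len(audio_frames) == 2 and is_blip_emitted(
                    np.stack(audio_frames, axis=1)):
                # Tool intent committed. Dispatch in parallel with the spoken ack.
                asyncio.create_task(coordinator.dispatch_tool(tool_name, args))
        # Yield to the event loop so the dispatched task can make progress.
        await asyncio.sleep(0)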

The 5% misses are fail-safe — when the model takes a different trajectory entirely (longer text preamble, etc.), is_blip_emitted returns False and the coordinator falls back to ack-end detection. No regression risk.
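
The fallback path can be sketched as a single coroutine that dispatches exactly once, either on the blip or when the ack ends. Everything here except the `[430, 127]` target is a hypothetical stand-in: `run_turn`, `fake_ack`, and `fake_dispatch` are illustrative names, and the frame stream is simulated rather than taken from the model.

```python
import asyncio

import numpy as np

BLIP_CB0 = np.array([430, 127], dtype=np.int64)


async def run_turn(frames, dispatch):
    """Dispatch the tool once; return "blip" or "ack_end" for the path taken.

    frames:   iterable of (8,) Mimi code arrays, one per generated audio frame.
    dispatch: zero-argument coroutine function that runs the tool.
    """
    first, task, path = [], None, "ack_end"
    for frame in frames:
        if task is None and len(first) < 2:
            first.append(np.asarray(frame))
            if len(first) == 2 and np.array_equal(
                    np.stack(first, axis=1)[0], BLIP_CB0):
                task, path = asyncio.create_task(dispatch()), "blip"
        await asyncio.sleep(0)    # yield so the tool task can run during the ack
    if task is None:              # blip missed: fall back to ack-end dispatch
        task = asyncio.create_task(dispatch())
    await task
    return path


def fake_ack(first_two_cb0, n_frames=6):
    """Hypothetical stand-in for the model's audio frame stream."""
    for i in range(n_frames):
        frame = np.zeros(8, dtype=np.int64)
        if i < 2:
            frame[0] = first_two_cb0[i]
        yield frame


calls = []


async def fake_dispatch():
    calls.append("tool ran")


hit = asyncio.run(run_turn(fake_ack([430, 127]), fake_dispatch))
miss = asyncio.run(run_turn(fake_ack([1049, 127]), fake_dispatch))
print(hit, miss, calls)
```

In both simulated turns the tool runs once. A blip miss only changes when the dispatch fires, which is the sense in which the misses are fail-safe.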

Pitfall when loading the blip: the npy is shape (1, 8, 2) = (batch, codebooks, frames). After .squeeze(0):

  • blip[0] = [430, 127] ← cb0 across both frames (correct detection target)
  • blip[:, 0] = [430, 13, 1903, 1548, ...] ← all codebooks at frame 0 (NOT the detection target)

Wiring the wrong index gives 0% match and the appearance of a broken model. See tools/lfm2_tool_aware/validate_blip_emission.py for the reference implementation.
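
The two indexings can be compared on a stand-in array with the same (1, 8, 2) layout. Only the values quoted above are real; every other entry is filled with -1 as a placeholder, and in practice the array would come from `np.load` on tools/lfm2_tool_aware/blips/tool_invoke.codes.npy.

```python
import numpy as np

# Stand-in for the on-disk codes: shape (batch, codebooks, frames).
# Assumption: -1 marks entries whose real values are not given in this card.
codes = np.full((1, 8, 2), -1, dtype=np.int64)
codes[0, :4, 0] = [430, 13, 1903, 1548]   # frame 0, codebooks 0..3
codes[0, 0, 1] = 127                      # frame 1, codebook 0

blip = codes.squeeze(0)                   # (8, 2) = (codebooks, frames)
cb0_across_frames = blip[0]               # correct target: cb0 at frames 0 and 1
all_cbs_at_frame0 = blip[:, 0]            # wrong target: 8 codebooks at frame 0

print(cb0_across_frames.shape, all_cbs_at_frame0.shape)
```

The shapes alone reveal the mistake: the detection target has one entry per frame (2), while the wrong slice has one entry per codebook (8).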

A runnable end-to-end demo (3 tool_match + 3 non_tool turns, streaming PCM playback, notification fires the moment the blip is detected) lives at tools/lfm2_tool_aware/demo_blip_tool_dispatch.py.

Two-turn flow

Same as v4 — set_context() triggers turn-2 narration. See the v4 model card for the full pattern.

Training recipe

Identical to v4:

  • Base: LiquidAI/LFM2.5-Audio-1.5B, full bf16
  • 2× RTX 4090
  • 3000 train + 400 eval examples (same JSONL as v4 — only the preprocessing differs)
  • bs=2/GPU × 2 GPUs × 1120 steps (~1.5 epochs)
  • lr 5e-5, cosine + 100 warmup, ctx=512
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • Final val_loss = 0.90 (v4: 0.89)
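
The "~1.5 epochs" figure follows from the batch size, GPU count, and step count listed above, assuming no gradient accumulation. The schedule function is a generic sketch of "cosine + 100 warmup": the linear warmup shape and the decay to zero are assumptions, and the training script may differ.

```python
import math

PER_GPU_BATCH, N_GPUS, STEPS = 2, 2, 1120   # from the card
TRAIN_EXAMPLES = 3000                       # from the card
PEAK_LR, WARMUP_STEPS = 5e-5, 100           # from the card

# Assumption: no gradient accumulation, so one step consumes bs x GPUs examples.
samples_seen = PER_GPU_BATCH * N_GPUS * STEPS
epochs = samples_seen / TRAIN_EXAMPLES


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at the final step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


print(samples_seen, round(epochs, 2), lr_at(0), lr_at(WARMUP_STEPS), lr_at(STEPS))
```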

Reproducing v4.1 from v4 data

python tools/lfm2_tool_aware/preprocess_for_lfm2.py \
    --in train_v4.distilled.jsonl --output-path preprocessed/train_v4_blip \
    --assistant-voice am_adam --max-context-length 512 \
    --blip-wav-path tools/lfm2_tool_aware/blips/tool_invoke.preview.wav

# train with same recipe as v4

Known limitations

  • One tool_miss regression on a hard borderline query ("how much is a lyft to the airport": no transport tool is listed, but maps is).
  • Same 4–5 non_tool chitchat hard cases as v4 ("Have a nice flight.", "We have that package.") — these are short under-specified utterances where any sensible LLM would also be uncertain.
  • 5% of tool_match outputs miss the blip: the model takes a different audio-opening trajectory entirely (longer text preamble, etc.). These are trajectory misses, not codebook misses, so OR-combining codebook matchers doesn't recover them. Fail-safe: the coordinator falls back to ack-end detection on misses. Pushing past 95% would need a v4.2 change to the data or recipe (more blip-prefixed rows, a capped text preamble, or a longer blip with redundancy).

License

Inherited from base: LFM Open License v1.0.

Predecessors
