LFM2.5-Audio-1.5B — Tool-Aware Fine-Tune
A full fine-tune of LiquidAI/LFM2.5-Audio-1.5B that teaches the model to read a tool list from its system prompt and respond accordingly:
| User says | System prompt contains tool? | Model response |
|---|---|---|
| "what's the weather today" | ✅ weather listed | "let me check the weather for you." (short ack, then stop) |
| "what's the weather today" | ❌ weather NOT listed | "i'm not set up to handle the weather in this session." (refusal) |
| "who painted the mona lisa" | (any tools, or none) | "leonardo da vinci." (normal answer; query needs no tool) |
| "tell me a joke" | (any tools, or none) | "what do you call cheese that isn't yours? nacho cheese." (chitchat) |
The base LFM2.5-Audio has no tool-calling support: it does not act on tool lists in the system prompt and either improvises a reply ("I don't have live internet access...") or tells the user to operate the tool themselves ("you can use the iot_lights tool to control them"). This fine-tune supplies the missing first turn of a tool-augmented voice assistant: the brief in-character acknowledgement that buys time for a downstream tool dispatcher.
The actual tool execution is out of scope for this model. Pair it with a separate dispatcher (e.g., an LFM2-Instruct sibling) for full tool-call orchestration.
System-prompt format
The fine-tune was trained on this exact preamble — keep it verbatim:
Respond with interleaved text and audio.
Tools available:
- weather: get current weather and forecasts for a location
- alarm: set or cancel alarms
- music: play, pause, or skip music
…
If a request needs one of these tools, acknowledge briefly and stop. Otherwise answer normally.
The tool block is optional — omit it entirely for a pure conversational model.
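As a minimal sketch, the prompt can be assembled programmatically so the wording never drifts from the trained format. The helper name `build_system_prompt` and its `tools` mapping are illustrative, not part of liquid-audio. The blank-line layout follows the string built in the Usage snippet. Dropping the closing instruction together with the tool list is an assumption about what "omit it entirely" means.

```python
from typing import Mapping, Optional

PREAMBLE = "Respond with interleaved text and audio."
TOOL_INSTRUCTION = (
    "If a request needs one of these tools, acknowledge briefly and stop. "
    "Otherwise answer normally."
)


def build_system_prompt(tools: Optional[Mapping[str, str]] = None) -> str:
    """Assemble the system prompt; `tools` maps tool name -> one-line description.

    Hypothetical helper: it only reproduces the layout used in the Usage snippet.
    """
    if not tools:
        # Assumption: with no tools, the trailing instruction is dropped as well.
        return PREAMBLE
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return f"{PREAMBLE}\n\nTools available:\n{tool_lines}\n\n{TOOL_INSTRUCTION}"


prompt = build_system_prompt(
    {
        "weather": "get current weather and forecasts for a location",
        "alarm": "set or cancel alarms",
    }
)
print(prompt)
```

Keeping the fixed sentences in constants means only the tool list varies between sessions, which is the part the model was trained to read.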
Eval results
On a 119-row held-out test set (matbee/lfm2-tool-aware-dataset-v1 eval split):
| Class | Accuracy | Failures |
|---|---|---|
| tool_match (ack expected) | 93.3% | 2/30 |
| tool_miss (refusal expected) | 93.1% | 2/29 |
| general (answer normally) | 100.0% | 0/30 |
| chitchat (chat normally) | 100.0% | 0/30 |
| Overall | 96.6% | 4/119 |
Critically, general and chitchat both score 100% (30/30 each): on this test set, the fine-tune does not regress baseline conversational behavior when tools are present in the system prompt.
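The accuracy column follows directly from the failure counts. The short sketch below recomputes it from the (failures, total) pairs in the table. The dictionary layout and the `accuracy_pct` helper are illustrative, not taken from the evaluation script.

```python
# (failures, total) per class, copied from the results table.
results = {
    "tool_match": (2, 30),
    "tool_miss": (2, 29),
    "general": (0, 30),
    "chitchat": (0, 30),
}


def accuracy_pct(failures: int, total: int) -> float:
    """Accuracy in percent, rounded to one decimal place."""
    return round(100.0 * (total - failures) / total, 1)


per_class = {name: accuracy_pct(f, n) for name, (f, n) in results.items()}
total_failures = sum(f for f, _ in results.values())
total_rows = sum(n for _, n in results.values())
overall = accuracy_pct(total_failures, total_rows)

for name, acc in per_class.items():
    print(f"{name}: {acc}%")
print(f"overall: {overall}% ({total_failures}/{total_rows} failures)")
```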
Known failure modes
- State-query phrasings: "what's the thermostat set to" with iot_thermostat listed → sometimes refused. Training data has mostly action-style queries ("set the thermostat to 72"); state-query phrasings are underrepresented.
- Search-adjacent tool generalization: "give me a recipe for miso soup" with search listed → "searching for miso soup recipe". Arguably correct (search can find recipes) but not what the label expected.
Both addressable with targeted dataset extensions (~50 examples per pattern).
Usage
from pathlib import Path

import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState

# Pass a Path (not str) so liquid-audio takes the local-checkpoint branch
local = Path("./lfm2.5-audio-tool-aware-v1")
processor = LFM2AudioProcessor.from_pretrained(local, device="cuda").eval()
model = LFM2AudioModel.from_pretrained(
    local, device="cuda", dtype=torch.bfloat16
).eval()

# Build a system prompt that advertises some tools
system_prompt = (
    "Respond with interleaved text and audio.\n\n"
    "Tools available:\n"
    "- weather: get current weather and forecasts for a location\n"
    "- alarm: set or cancel alarms\n\n"
    "If a request needs one of these tools, acknowledge briefly and stop. "
    "Otherwise answer normally."
)
chat = ChatState(processor)
chat.new_turn("system")
chat.add_text(system_prompt)
chat.end_turn()

# Load user audio at 24 kHz mono float32
wav, sr = torchaudio.load("user_query.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)  # downmix multi-channel audio to mono
chat.new_turn("user")
chat.add_audio(wav, sr)
chat.end_turn()
chat.new_turn("assistant")

# Stream interleaved text + audio tokens
text_pieces, audio_codes = [], []
for token in model.generate_interleaved(
    **chat, max_new_tokens=120, audio_temperature=1.0, audio_top_k=4,
):
    if token.numel() == 1:
        # A single-element tensor is a text token
        text_pieces.append(processor.text.decode(token))
    else:
        # Anything else is a frame of audio codes
        audio_codes.append(token)

print("text:", "".join(text_pieces))
# Decode audio_codes through processor.mimi for the assistant waveform.
Training
- Base: LiquidAI/LFM2.5-Audio-1.5B (1.45B params, bf16)
- Data: 3000 synthetic train + 400 eval examples (see matbee/lfm2-tool-aware-dataset-v1)
- Hardware: 2× RTX 4090 (DDP)
- Trainer: upstream liquid_audio.trainer.Trainer (full fine-tune, bf16 mixed precision)
- Hyperparams: AdamW, lr 5e-5, cosine schedule with 50-step warmup, batch_size 4 per GPU (effective 8), 560 steps (~3 epochs), context_length 256
- Wall clock: ~17 minutes
- Loss: train 2.0 → 0.25, val 0.85 → 0.33
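To make the learning-rate hyperparameters concrete, here is a sketch of a cosine schedule with warmup using the listed values (peak 5e-5, 50 warmup steps, 560 total steps). It assumes linear warmup and decay to zero. The upstream trainer's exact schedule (warmup shape, minimum learning rate) is not stated here and may differ. The function name `lr_at` is illustrative.

```python
import math

PEAK_LR = 5e-5       # from the hyperparameter list
WARMUP_STEPS = 50    # from the hyperparameter list
TOTAL_STEPS = 560    # from the hyperparameter list


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumptions: linear warmup up to PEAK_LR, then cosine decay to zero.
    """
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 49, 50, 305, 559):
    print(f"step {step:3d}: lr = {lr_at(step):.3e}")
```

Under these assumptions the rate peaks at the end of warmup, passes through half the peak midway through the decay phase, and approaches zero at the final step.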
Limitations
- All training audio is synthetic (Kokoro TTS for both user side and assistant target audio). Real human speech with accents, clipping, or noise is untested.
- Assistant audio voice is am_adam (Kokoro male, American English). The fine-tune's Mimi-decoded output speaks in that voice; to use a different voice, retrain with a different --assistant-voice.
- Tool taxonomy is 20 SLURP-inspired scenarios (weather, alarm, music, iot_lights, …). Generalization to unseen tool names is untested.
- English only. Other languages would need retraining on multilingual synthesis.
- Single-turn behavior only. Multi-turn conversations with mixed tool/chitchat turns may exhibit chat-state contamination.
License
This model is released under the LFM Open License v1.0 — same as the base LiquidAI/LFM2.5-Audio-1.5B. See LICENSE in this repo.
Citation
If you use this model, please cite the base:
@misc{liquidai2025lfm25audio,
title={LFM2.5-Audio: Speech-to-Speech Foundation Model},
author={Liquid AI},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B}
}
The dataset, fine-tune recipe, and acknowledgement-style training signal are described in matbee/lfm2-tool-aware-dataset-v1.