LFM2.5-Audio-1.5B — Tool-Aware Fine-Tune

A full fine-tune of LiquidAI/LFM2.5-Audio-1.5B that teaches the model to read a tool list from its system prompt and respond accordingly:

User says	System prompt contains tool?	Model response
`"what's the weather today"`	✅ `weather` listed	`"let me check the weather for you."` (short ack, then stop)
`"what's the weather today"`	❌ `weather` NOT listed	`"i'm not set up to handle the weather in this session."` (refusal)
`"who painted the mona lisa"`	(any tools, or none)	`"leonardo da vinci."` (normal answer — query needs no tool)
`"tell me a joke"`	(any tools, or none)	`"what do you call cheese that isn't yours? nacho cheese."` (chitchat)

The base LFM2.5-Audio has no tool-calling support. It ignores tool lists in the system prompt and either invents answers ("I don't have live internet access...") or tells the user to operate the tool themselves ("you can use the iot_lights tool to control them"). This fine-tune supplies the missing first turn of a tool-augmented voice assistant: the brief in-character acknowledgement that buys time for a downstream tool dispatcher.

The actual tool execution is out of scope for this model. Pair it with a separate dispatcher (e.g., an LFM2-Instruct sibling) for full tool-call orchestration.
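
The target behavior in the table above can be summarized as a small decision rule. This is only an illustrative sketch: `expected_behavior` and its return labels are hypothetical names chosen here, and the fine-tuned model learns this behavior from data rather than running any such rule.

```python
from typing import Iterable, Optional


def expected_behavior(required_tool: Optional[str], listed_tools: Iterable[str]) -> str:
    """Return the response class the behavior table describes.

    required_tool: the tool the user query needs, or None for general
                   knowledge and chitchat queries (hypothetical labeling).
    listed_tools:  tool names advertised in the system prompt.
    """
    if required_tool is None:
        # Query needs no tool: answer normally, whatever tools are listed.
        return "answer"
    if required_tool in set(listed_tools):
        # Tool is available: short acknowledgement, then stop.
        return "ack"
    # Tool is needed but not listed: refuse.
    return "refusal"


print(expected_behavior("weather", ["weather", "alarm"]))
print(expected_behavior("weather", ["alarm"]))
print(expected_behavior(None, ["weather", "alarm"]))
```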

System-prompt format

The fine-tune was trained on this exact preamble — keep it verbatim:

Respond with interleaved text and audio.

Tools available:
- weather: get current weather and forecasts for a location
- alarm: set or cancel alarms
- music: play, pause, or skip music
…

If a request needs one of these tools, acknowledge briefly and stop. Otherwise answer normally.

The tool block is optional — omit it entirely for a pure conversational model.
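
Because the preamble must stay verbatim, it can help to assemble it from one place. The helper below is a minimal sketch: `build_system_prompt` is a hypothetical name, not part of liquid-audio. It also assumes that "omitting the tool block" drops the trailing instruction sentence along with the tool list, which the text above does not spell out.

```python
from typing import Mapping, Optional

PREAMBLE = "Respond with interleaved text and audio."
INSTRUCTION = (
    "If a request needs one of these tools, acknowledge briefly and stop. "
    "Otherwise answer normally."
)


def build_system_prompt(tools: Optional[Mapping[str, str]] = None) -> str:
    """Assemble the system prompt from a {tool_name: description} mapping."""
    if not tools:
        # Assumption: with no tools, only the first line is sent.
        return PREAMBLE
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return f"{PREAMBLE}\n\nTools available:\n{tool_lines}\n\n{INSTRUCTION}"


prompt = build_system_prompt({
    "weather": "get current weather and forecasts for a location",
    "alarm": "set or cancel alarms",
})
print(prompt)
```

Tool order follows the mapping's insertion order, so the caller controls how the list is presented.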

Eval results

On a 119-row held-out test set (matbee/lfm2-tool-aware-dataset-v1 eval split):

Class	Accuracy	Failures
`tool_match` (ack expected)	93.3%	2/30
`tool_miss` (refusal expected)	93.1%	2/29
`general` (answer normally)	100.0%	0/30
`chitchat` (chat normally)	100.0%	0/30
Overall	96.6%	4/119

Critically, general and chitchat both score 100% on this test set: the fine-tune shows no regression in baseline conversational behavior when tools are present in the system prompt.
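
As a sanity check, the accuracy figures can be recomputed from the failure counts in the table. The snippet below is a small sketch using only those counts; the variable names are illustrative.

```python
# (failures, total) per class, copied from the eval table
results = {
    "tool_match": (2, 30),
    "tool_miss": (2, 29),
    "general": (0, 30),
    "chitchat": (0, 30),
}


def accuracy_pct(failures: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * (total - failures) / total, 1)


per_class = {name: accuracy_pct(f, n) for name, (f, n) in results.items()}
total_failures = sum(f for f, _ in results.values())
total_rows = sum(n for _, n in results.values())
overall = accuracy_pct(total_failures, total_rows)

for name, acc in per_class.items():
    print(f"{name}: {acc}%")
print(f"overall: {overall}% ({total_failures}/{total_rows} failures)")
```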

Known failure modes

State-query phrasings: "what's the thermostat set to" with iot_thermostat listed → sometimes refused. Training data has mostly action-style queries ("set the thermostat to 72"); state-query phrasings are underrepresented.
Search-adjacent tool generalization: "give me a recipe for miso soup" with search listed → "searching for miso soup recipe". Arguably correct (search can find recipes) but not what the label expected.

Both should be addressable with targeted dataset extensions (~50 examples per pattern).

Usage

from pathlib import Path
import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState

# Pass a Path (not str) so liquid-audio takes the local-checkpoint branch
local = Path("./lfm2.5-audio-tool-aware-v1")
processor = LFM2AudioProcessor.from_pretrained(local, device="cuda").eval()
model = LFM2AudioModel.from_pretrained(
    local, device="cuda", dtype=torch.bfloat16
).eval()

# Build a system prompt that advertises some tools
system_prompt = (
    "Respond with interleaved text and audio.\n\n"
    "Tools available:\n"
    "- weather: get current weather and forecasts for a location\n"
    "- alarm: set or cancel alarms\n\n"
    "If a request needs one of these tools, acknowledge briefly and stop. "
    "Otherwise answer normally."
)

chat = ChatState(processor)
chat.new_turn("system")
chat.add_text(system_prompt)
chat.end_turn()

# Load user audio (24 kHz mono float32 expected); downmix to mono if the file is stereo
wav, sr = torchaudio.load("user_query.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
chat.new_turn("user")
chat.add_audio(wav, sr)
chat.end_turn()
chat.new_turn("assistant")

# Stream interleaved text + audio tokens
text_pieces, audio_codes = [], []
for token in model.generate_interleaved(
    **chat, max_new_tokens=120, audio_temperature=1.0, audio_top_k=4,
):
    if token.numel() == 1:
        text_pieces.append(processor.text.decode(token))
    else:
        audio_codes.append(token)

print("text:", "".join(text_pieces))
# Decode audio_codes through processor.mimi for the assistant waveform.

Training

Base: LiquidAI/LFM2.5-Audio-1.5B (1.45B params, bf16)
Data: 3000 synthetic train + 400 eval examples — see matbee/lfm2-tool-aware-dataset-v1
Hardware: 2× RTX 4090 (DDP)
Trainer: upstream liquid_audio.trainer.Trainer — full fine-tune, bf16 mixed precision
Hyperparams: AdamW, lr 5e-5, cosine schedule with 50-step warmup, batch_size 4 per GPU (effective 8), 560 steps (~3 epochs), context_length 256
Wall clock: ~17 minutes
Loss: train 2.0 → 0.25, val 0.85 → 0.33
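
To make the learning-rate schedule concrete, here is a minimal sketch of a cosine schedule with warmup using the hyperparameters listed above. It assumes linear warmup from zero and cosine decay to zero at the final step; the actual `liquid_audio.trainer.Trainer` schedule may differ in those details, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 5e-5       # lr from the hyperparameter list
WARMUP_STEPS = 50    # warmup length from the hyperparameter list
TOTAL_STEPS = 560    # total optimizer steps from the hyperparameter list


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumption: linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine decay from the peak down to 0 at TOTAL_STEPS.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 305, 560):
    print(step, f"{lr_at(step):.2e}")
```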

Limitations

All training audio is synthetic (Kokoro TTS for both user side and assistant target audio). Real human speech with accents, clipping, or noise is untested.
Assistant audio voice is am_adam (Kokoro male, American English). The fine-tune's Mimi-decoded output speaks in that voice. If you want a different voice, retrain with a different --assistant-voice value.
Tool taxonomy is 20 SLURP-inspired scenarios (weather, alarm, music, iot_lights, …). Generalization to unseen tool names is untested.
English only. Other languages would need retraining on multilingual synthesis.
Single-turn behavior only. Multi-turn conversations with mixed tool/chitchat turns may exhibit chat-state contamination.

License

This model is released under the LFM Open License v1.0 — same as the base LiquidAI/LFM2.5-Audio-1.5B. See LICENSE in this repo.

Citation

If you use this model, please cite the base model:

@misc{liquidai2025lfm25audio,
  title={LFM2.5-Audio: Speech-to-Speech Foundation Model},
  author={Liquid AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B}
}

The dataset, fine-tune recipe, and acknowledgement-style training signal are described in matbee/lfm2-tool-aware-dataset-v1.
