---
license: mit
---
# Speech to Text

Fine-tune a [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) acoustic model on the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset using CTC, then export it to ONNX for inference.

## Requirements

- Python >= 3.10
- A CUDA-capable GPU is recommended for training

Install dependencies:

```bash
pip install -e .
```

## Training

`train.py` fine-tunes `facebook/wav2vec2-base` on LJSpeech, holding out 5% of the data for evaluation. By default it trains for ~10 epochs and writes checkpoints to `wav2vec2-ljspeech/`.

```bash
python train.py
```
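
The 5% holdout can be pictured as a seeded shuffle over clip indices. The sketch below is illustrative only and is not the code in `train.py`: the helper name `holdout_split` and the seed are assumptions, and 13,100 is the number of clips in LJSpeech. With the `datasets` library, `Dataset.train_test_split(test_size=0.05, seed=...)` gives a comparable split.

```python
import numpy as np


def holdout_split(num_examples: int, eval_fraction: float = 0.05, seed: int = 0):
    """Return (train_indices, eval_indices) from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    num_eval = int(round(num_examples * eval_fraction))
    # The first `num_eval` shuffled indices become the eval set.
    return order[num_eval:], order[:num_eval]


train_idx, eval_idx = holdout_split(13_100)
print(len(train_idx), len(eval_idx))
```

Fixing the seed keeps the eval set identical across runs, so word error rates from different experiments stay comparable.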

Key settings live at the top of `train.py`:

| Constant | Default | Purpose |
| --- | --- | --- |
| `MODEL_ID` | `facebook/wav2vec2-base` | Pre-trained wav2vec2 checkpoint |
| `DATASET_ID` | `lj_speech` | Hugging Face dataset ID |

Training hyperparameters (batch size, epochs, learning rate, etc.) are configured through `TrainingArguments` inside `train.py`.

Monitor progress with TensorBoard:

```bash
tensorboard --logdir wav2vec2-ljspeech
```

## ONNX Export

Export the trained checkpoint to ONNX and validate it with ONNX Runtime:

```bash
python export_onnx.py
```

Options:

```
--model-dir   Checkpoint directory (default: wav2vec2-ljspeech)
--output      Output ONNX path   (default: wav2vec2-ljspeech.onnx)
--opset       ONNX opset version (default: 17)
```
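
For readers extending the script, the options above map naturally onto `argparse`. This is a sketch under that assumption, not a copy of `export_onnx.py`; the helper name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three documented options and their defaults."""
    parser = argparse.ArgumentParser(
        description="Export a fine-tuned Wav2Vec2 checkpoint to ONNX."
    )
    parser.add_argument("--model-dir", default="wav2vec2-ljspeech",
                        help="Checkpoint directory")
    parser.add_argument("--output", default="wav2vec2-ljspeech.onnx",
                        help="Output ONNX path")
    parser.add_argument("--opset", type=int, default=17,
                        help="ONNX opset version")
    return parser


# Parse an explicit list so the sketch runs without touching sys.argv.
args = build_parser().parse_args(["--opset", "14"])
print(args.model_dir, args.output, args.opset)
```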

The exported model declares dynamic axes for batch and time, so both the batch size and the audio length can vary between calls.
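
Dynamic axes are typically declared at export time through the `dynamic_axes` argument of `torch.onnx.export`. The following is a sketch of how that might look, not the contents of `export_onnx.py`: the input name `input_values` matches the inference snippet in this README, while the output name `logits`, the axis labels, and the helper `export_to_onnx` are assumptions.

```python
# Axis 0 is the batch dimension. Axis 1 is time: raw samples for the
# input, downsampled frames for the logits.
DYNAMIC_AXES = {
    "input_values": {0: "batch", 1: "time"},
    "logits": {0: "batch", 1: "frames"},
}


def export_to_onnx(model, output_path="wav2vec2-ljspeech.onnx", opset=17):
    """Export `model` (e.g. a Wav2Vec2ForCTC instance) with dynamic batch/time axes."""
    import torch  # imported lazily so the constant above is usable without PyTorch

    model.eval()
    model.config.return_dict = False  # export a plain tuple instead of a ModelOutput
    dummy = torch.randn(1, 16000)     # one second of 16 kHz audio; the length is arbitrary
    torch.onnx.export(
        model,
        (dummy,),
        output_path,
        input_names=["input_values"],
        output_names=["logits"],
        dynamic_axes=DYNAMIC_AXES,
        opset_version=opset,
    )
```

Without the `dynamic_axes` entry, the exported graph would be fixed to the dummy input's shape, here one clip of exactly 16,000 samples.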

## Inference

```python
import numpy as np
import onnxruntime as ort
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("wav2vec2-ljspeech")
session = ort.InferenceSession("wav2vec2-ljspeech.onnx")

# audio_array: 16 kHz mono float32 numpy array
inputs = processor(audio_array, sampling_rate=16000, return_tensors="np")
logits = session.run(None, {"input_values": inputs.input_values})[0]
text = processor.tokenizer.batch_decode(np.argmax(logits, axis=-1))[0]
print(text)
```
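
The `argmax` plus `batch_decode` step in the snippet above is greedy (best-path) CTC decoding: pick the most likely token per frame, collapse consecutive repeats, then drop blanks. The sketch below spells that out in NumPy for a single utterance. The four-symbol vocabulary and the blank index are hypothetical; the real tokenizer reads both from the checkpoint and also turns its word-delimiter token into spaces.

```python
import numpy as np


def greedy_ctc_decode(logits: np.ndarray, vocab: list[str], blank_id: int = 0) -> str:
    """Decode (frames, vocab_size) logits for one utterance with best-path CTC."""
    best_path = np.argmax(logits, axis=-1)
    # Keep a frame only if its token differs from the previous frame's token.
    keep = np.ones_like(best_path, dtype=bool)
    keep[1:] = best_path[1:] != best_path[:-1]
    collapsed = best_path[keep]
    # Blanks separate genuine repeats, so they are removed after collapsing.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary; index 0 plays the role of the CTC blank.
vocab = ["<pad>", "a", "b", "c"]
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3]    # a a _ a b b _ c
logits = np.eye(len(vocab))[frame_ids]  # one-hot rows stand in for real logits
print(greedy_ctc_decode(logits, vocab))  # aabc
```

The blank between the second and third frames is what lets the output contain two consecutive `a` characters.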

Notes:

- Audio must be **16 kHz mono float32**.
- The `Wav2Vec2Processor` handles waveform normalization on the way in and token decoding on the way out, so always pass audio through it before calling the ONNX session.
- The export contains the **acoustic model only**. Add an external LM (e.g. KenLM) if you need language-model rescoring during decoding.
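
One way to produce a suitable `audio_array` from a 16-bit PCM WAV file, using only the standard library and NumPy. This is a sketch: the helper names are hypothetical, and it rejects files that are not already at 16 kHz rather than resampling them (LJSpeech clips ship at 22.05 kHz, so they need resampling first).

```python
import wave

import numpy as np

TARGET_RATE = 16_000


def pcm16_to_mono_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale int16 PCM shaped (samples,) or (samples, channels) to mono float32 in [-1, 1)."""
    audio = pcm.astype(np.float32) / 32768.0
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # downmix by averaging channels
    return audio


def load_wav_16k(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file and return a 16 kHz mono float32 array."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        if wav.getframerate() != TARGET_RATE:
            raise ValueError(
                f"expected {TARGET_RATE} Hz audio, got {wav.getframerate()} Hz; resample first"
            )
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    # WAV stores interleaved little-endian samples: one row per frame.
    pcm = np.frombuffer(frames, dtype="<i2").reshape(-1, channels)
    return pcm16_to_mono_float32(pcm)
```

The result of `load_wav_16k("clip.wav")` (a placeholder path) can be passed to the processor as `audio_array` in the inference snippet.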

## Project Layout

```
speech-to-text/
├── train.py          # Wav2Vec2 + CTC fine-tuning on LJSpeech
├── export_onnx.py    # ONNX export and ONNX Runtime validation
├── main.py           # Placeholder entry point
├── pyproject.toml    # Project metadata and dependencies
└── README.md
```