---
license: apache-2.0
language: multilingual
tags:
- speech-enhancement
- denoising
- real-time
- voice-ai
- hush
- background-speaker-suppression
- onnx
- multilingual
- audio
- noise-cancellation
library_name: hush
pipeline_tag: audio-to-audio
---
# Hush
**The first open-source speech enhancement model built specifically for Voice AI, with real-time background speaker suppression.**
> **8 MB model · Runs fully on CPU in real time · Trained on 10,000+ hours of mixed audio · Under 1 ms processing per 10 ms of audio**
> 🚀 **Coming Soon:** We are currently fine-tuning a new model optimized specifically for environments with even **louder background noise and louder background speech**! Stay tuned for the upcoming release.
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/pulp-vision/Hush)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://python.org)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-orange.svg)](https://pytorch.org)
---
## Model Overview
Hush is designed from the ground up for **Voice AI applications**: phone-based voice agents, call centre bots, voice assistants, real-time transcription pipelines, and conversational AI systems. It isolates exactly one speaker from a live audio stream, in real time, under production conditions.
The model is **language-agnostic**: it operates on the acoustic signal directly and works for any spoken language.
### At a Production Glance
| | |
|---|---|
| Model size | **8 MB** |
| Runs on | **CPU only, no GPU required** |
| Processing latency | **< 1 ms per 10 ms of audio** |
| Algorithmic latency | ~20 ms (fully causal, zero lookahead) |
| Training data | **10,000+ hours** of mixed speech, noise, and competing speakers |
| Sample rate | 16 kHz (telephony-native: G.711, WebRTC, SIP) |
| Language | **Any** (language-agnostic speech enhancement) |
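The latency rows follow from the frame geometry (320-sample window, 160-sample hop at 16 kHz). Here is a rough sketch of that arithmetic. It assumes that the ~20 ms algorithmic latency corresponds to buffering one full analysis window and that 1 ms is an upper bound on compute per hop. The helper names are illustrative and not part of Hush.

```python
SAMPLE_RATE = 16_000  # Hz
FFT_SIZE = 320        # samples per analysis window
HOP_SIZE = 160        # samples between successive frames


def samples_to_ms(n_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count to milliseconds."""
    return 1000.0 * n_samples / sample_rate


def real_time_factor(compute_ms: float, hop_ms: float) -> float:
    """Compute time divided by audio time; below 1.0 means faster than real time."""
    return compute_ms / hop_ms


window_ms = samples_to_ms(FFT_SIZE)        # one full window must be buffered
hop_ms = samples_to_ms(HOP_SIZE)           # new audio consumed per frame
rtf_bound = real_time_factor(1.0, hop_ms)  # if a frame took the full 1 ms
```

Under these assumptions the window is 20 ms, the hop is 10 ms, and the real-time factor stays at or below 0.1.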
---
## The Problem It Solves
Every major open-source speech enhancement model (DeepFilterNet3, RNNoise, SEGAN, MetricGAN+, DNS-Challenge entrants) is trained primarily to remove non-speech noise such as fans, traffic, and keyboard clicks. None treat a competing human voice as a first-class problem.
When the interference is another person speaking, these models either:
- **Leak the competing speaker** → the leaked speech gets transcribed as part of the conversation, breaking downstream NLP/LLM processing
- **Suppress both speakers** → the primary speaker's intelligibility degrades
**Hush is the first open-source model to explicitly train for background speaker suppression.**
---
## What Makes Hush Different
Built on [DeepFilterNet3](https://github.com/Rikorose/DeepFilterNet), extended with one targeted innovation: **teaching the encoder to distinguish speakers, not just speech from noise.**
1. **Training data reflecting the real problem**: 60% of training samples include a competing human speaker at 12–24 dB SIR
2. **Auxiliary Separation Head**: lightweight `Linear(256→32) + Sigmoid` head trained with L1 loss to predict ERB-domain background speaker masks (training only, zero inference overhead)
3. **Joint optimization**: separation loss (weight 0.1) combined with multi-resolution spectral loss across 4 FFT scales
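To make the SIR range in item 1 concrete, here is a minimal sketch of mixing a background talker under a primary talker at a target signal-to-interference ratio. It assumes SIR is the ratio of primary to background signal power in dB, measured over the whole clip. The actual Hush data pipeline may measure levels differently. `mix_at_sir` is a hypothetical helper and the random signals are stand-ins for speech.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_sir(primary, background, sir_db):
    """Scale `background` so primary/background power equals `sir_db`, then add."""
    # SIR_dB = 10 * log10(P_primary / P_background_scaled)
    target_bg_power = power(primary) / (10.0 ** (sir_db / 10.0))
    gain = math.sqrt(target_bg_power / power(background))
    scaled = [gain * b for b in background]
    mixture = [p + b for p, b in zip(primary, scaled)]
    return mixture, scaled


rng = random.Random(0)
primary = [rng.uniform(-1.0, 1.0) for _ in range(1600)]     # stand-in for speech
background = [rng.uniform(-1.0, 1.0) for _ in range(1600)]  # stand-in for a second talker
sir_db = rng.uniform(12.0, 24.0)                            # range stated in item 1
mixture, scaled = mix_at_sir(primary, background, sir_db)
achieved_db = 10.0 * math.log10(power(primary) / power(scaled))
```

A positive SIR means the background talker is quieter than the primary one, so at 12–24 dB the interferer is clearly audible but not dominant.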
---
## Architecture
```
              Input Waveform [B, 1, T]
                          |
                          v
               STFT (FFT=320, Hop=160)
                          |
              ____________|____________
              |                       |
              v                       v
        ERB features             DF features
        [B, 1, T, 32]           [B, 2, T, 64]
              |                       |
              '-----------+-----------'
                          |
                          v
                       ENCODER
               (SqueezedGRU, 256-dim)
                          |
        __________________|____________________
        |                 |                   |
        v                 v                   v
   ERB DECODER       DF DECODER       SEPARATION HEAD *
 (ConvTranspose     (3-layer GRU      (Linear + Sigmoid
  + skip conns)     + DF filter)       ERB-domain mask)
        |                 |
        v                 v
  ERB gain mask    Complex filter
        |                 |
        '--------+--------'
                 |
                 v
         Enhanced Spectrum
                 |
                 v
               ISTFT
                 |
                 v
    Enhanced Waveform [B, 1, T]
```
`*` The Separation Head is active during training only and is discarded at inference.
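As a rough illustration of how a training-only head like this can feed a joint loss, the sketch below applies a linear projection plus sigmoid to one encoder frame. It then adds the down-weighted L1 mask loss to a placeholder enhancement loss. It is written in plain Python to stay self-contained, whereas the real head is a PyTorch module. The names `separation_head`, `l1_loss`, and `joint_loss`, the random weights, and the placeholder loss value are all assumptions for illustration.

```python
import math
import random

ENC_DIM, ERB_BANDS, SEP_WEIGHT = 256, 32, 0.1  # values stated in this model card


def separation_head(enc_frame, weight, bias):
    """Linear(ENC_DIM -> ERB_BANDS) followed by a sigmoid, for one frame."""
    out = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, enc_frame)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(enhancement_loss, mask_pred, mask_target):
    """Enhancement loss plus the down-weighted separation loss."""
    return enhancement_loss + SEP_WEIGHT * l1_loss(mask_pred, mask_target)


rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.05) for _ in range(ENC_DIM)] for _ in range(ERB_BANDS)]
bias = [0.0] * ERB_BANDS
enc_frame = [rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)]  # stand-in encoder output
mask_target = [rng.random() for _ in range(ERB_BANDS)]     # stand-in background mask

mask_pred = separation_head(enc_frame, weight, bias)
total = joint_loss(1.0, mask_pred, mask_target)  # 1.0 is a placeholder spectral loss
```

Because the head only contributes a loss term, dropping it at inference leaves the enhancement path unchanged.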
### Model Specifications
| Parameter | Value |
|---|---|
| Model size | **8 MB** |
| Parameters | ~1.8M |
| Sample rate | 16,000 Hz |
| Frame size / hop | 320 / 160 samples (20 ms / 10 ms) |
| ERB bands | 32 |
| DF bins | 64 (order-5 filter) |
| Encoder dim | 256 |
| Lookahead | 0 (fully causal) |
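These numbers determine the frequency layout the network sees. Here is a short sketch of the arithmetic. It assumes a real-valued FFT (so `fft_size // 2 + 1` bins) and that the 64 DF bins are the lowest-frequency bins, as the `[..., :64]` slice in the inference example suggests.

```python
SAMPLE_RATE = 16_000  # Hz
FFT_SIZE = 320        # samples
DF_BINS = 64          # bins processed by the deep filter

n_freq_bins = FFT_SIZE // 2 + 1                 # bins of a real FFT
bin_spacing_hz = SAMPLE_RATE / FFT_SIZE         # distance between bin centres
nyquist_hz = SAMPLE_RATE / 2                    # highest representable frequency
df_top_bin_hz = (DF_BINS - 1) * bin_spacing_hz  # centre of the highest DF bin
```

Under these assumptions there are 161 bins spaced 50 Hz apart. Deep filtering then covers roughly the lowest 3.2 kHz of the 8 kHz band.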
---
## Quick Start: PyTorch Inference
> **Important:** PyTorch inference requires `DeepFilterLib` for correct feature extraction.
> Install it with `pip install DeepFilterLib`.
The simplest way is the CLI script from the [GitHub repo](https://github.com/pulp-vision/Hush):
```bash
python scripts/infer_single.py \
--checkpoint model_best.ckpt \
--input noisy_speech.wav \
--output enhanced.wav
```
Or use the Python API directly:
```python
import torch
import numpy as np
import soundfile as sf
from libdf import DF, erb, erb_norm, unit_norm
from model.dfnet_se import DfNetSE, as_complex, as_real, get_config, get_norm_alpha
# Load model
config = get_config()
model = DfNetSE(config)
checkpoint = torch.load("model_best.ckpt", map_location="cpu")
model.model.load_state_dict(checkpoint)
model.eval()
# Load audio
audio, sr = sf.read("noisy_speech.wav")
assert sr == 16000, "Input must be 16 kHz"
wav = torch.tensor(audio, dtype=torch.float32).unsqueeze(0) # [1, T]
# Feature extraction via libdf (must match training pipeline)
df_state = DF(sr=16000, fft_size=320, hop_size=160, nb_bands=32, min_nb_erb_freqs=2)
alpha = get_norm_alpha(16000, 160, config.norm_tau)
wav_padded = torch.nn.functional.pad(wav, (0, 320))
spec_np = df_state.analysis(wav_padded.numpy(), reset=True)
erb_feat = torch.as_tensor(erb_norm(erb(spec_np, df_state.erb_widths()), alpha)).unsqueeze(1)
spec_feat = as_real(torch.as_tensor(unit_norm(spec_np[..., :64], alpha))).unsqueeze(1)
spec_t = as_real(torch.as_tensor(spec_np)).unsqueeze(1)
# Enhance
with torch.no_grad():
    spec_enh = model.model(spec_t.clone(), erb_feat, spec_feat)[0]
spec_enh_c = as_complex(spec_enh.squeeze(1))
# Synthesize and compensate delay
enh_np = df_state.synthesis(spec_enh_c.numpy(), reset=True)
enh = torch.from_numpy(np.asarray(enh_np, dtype=np.float32))
delay = 320 - 160 # fft_size - hop_size
enh = enh[:, delay : len(audio) + delay]
sf.write("enhanced.wav", enh.squeeze().numpy(), 16000)
```
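The padding and trimming at the end of the script can be isolated as a small helper. This sketch takes one assumption from the script itself: analysis plus synthesis delays the signal by `fft_size - hop_size` samples. `trim_delay` is a hypothetical name, and the toy signal only imitates that delay.

```python
def trim_delay(enhanced, n_original, fft_size=320, hop_size=160):
    """Drop the STFT/ISTFT delay and cut back to the original length."""
    delay = fft_size - hop_size
    return enhanced[delay:delay + n_original]


# Toy check: a "processed" signal that is the input delayed by 160 samples,
# followed by the tail that comes from padding the input with 320 zeros.
original = list(range(1, 1001))
delayed = [0] * 160 + original + [0] * 160
restored = trim_delay(delayed, len(original))
```

Padding the input by `fft_size` samples before analysis ensures the delayed output still contains the whole signal once the first `delay` samples are dropped.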
## Quick Start: Production (ONNX, No PyTorch)
For production deployment without PyTorch, use the prebuilt **Weya NC Standalone** library:
```python
import ctypes, platform, numpy as np
lib_name = {"Darwin": "libweya_nc.dylib", "Windows": "weya_nc.dll"}.get(
platform.system(), "libweya_nc.so"
)
lib = ctypes.CDLL(f"deployment/lib/{lib_name}")
model = lib.weya_nc_model_load_from_path(b"onnx/advanced_dfnet16k_model_best_onnx.tar.gz")
session = lib.weya_nc_session_create(model, 16000, ctypes.c_float(100.0))
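frame_len = int(lib.weya_nc_get_frame_length(session))
# input_ptr / output_ptr are not defined in this snippet: they stand for caller-allocated
# buffers holding frame_len samples each (see the deployment guide for the exact types).
lib.weya_nc_process_frame(session, input_ptr, output_ptr)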
```
Prebuilt binaries are available for Linux, macOS (Apple Silicon), and Windows. See the [deployment guide](https://github.com/pulp-vision/Hush/tree/main/deployment) for full integration instructions.
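The session appears to process one frame of `frame_len` samples per call, so the caller has to chunk its audio accordingly. Here is a minimal, library-independent sketch of that chunking. It assumes fixed-size frames and zero-padding of the final partial frame. `iter_frames` is a hypothetical helper and not part of the Weya NC API.

```python
def iter_frames(samples, frame_len):
    """Yield consecutive frames of exactly `frame_len` samples, zero-padding the last."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame.extend([0.0] * (frame_len - len(frame)))
        yield frame


audio = [0.1] * 400                     # 25 ms of audio at 16 kHz
frames = list(iter_frames(audio, 160))  # 160 is only an example; query the session for the real value
```

Each yielded frame would then be copied into the input buffer before a `weya_nc_process_frame` call. The padded tail should be trimmed from the output afterwards.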
---
## Training Details
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay to 1e-6) |
| LR warmup | 3 epochs (1e-4 → 5e-4) |
| Weight decay | 0.05 |
| Batch size | 16 |
| Max sample length | 5 seconds |
| Epochs | 100 |
| Early stopping | patience=25 epochs |
| Gradient clip | 1.0 |
| Loss | MultiResSpecLoss (4 scales) + LocalSNRLoss + SeparationLoss (×0.1) |
| Background speaker prob. | 60% of samples |
| Background SIR range | 12–24 dB |
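The learning-rate rows can be read as a warmup-then-cosine schedule. The sketch below is one plausible interpretation. It assumes linear warmup, per-epoch updates, and a cosine that spans the epochs after warmup. The actual training script may step per batch or place the boundaries differently, and `lr_at_epoch` is a hypothetical name.

```python
import math

BASE_LR, MIN_LR, WARMUP_START_LR = 5e-4, 1e-6, 1e-4
WARMUP_EPOCHS, TOTAL_EPOCHS = 3, 100


def lr_at_epoch(epoch: int) -> float:
    """Learning rate for a 0-indexed epoch under the assumed schedule."""
    if epoch < WARMUP_EPOCHS:
        # Linear ramp from WARMUP_START_LR towards BASE_LR.
        frac = epoch / WARMUP_EPOCHS
        return WARMUP_START_LR + frac * (BASE_LR - WARMUP_START_LR)
    # Cosine decay from BASE_LR right after warmup to MIN_LR at the last epoch.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - 1 - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(TOTAL_EPOCHS)]
```

With early stopping at a patience of 25 epochs, a run may end before the schedule reaches its minimum.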
---
## Datasets Used
The model was trained on standard publicly available datasets totalling **over 10,000 hours of mixed audio**:
| Category | Datasets |
|---|---|
| **Primary speech** | LibriSpeech (train-clean-100/360), VCTK Corpus, Common Voice |
| **Background speech** | LibriSpeech / VCTK / LibriTTS (speaker-disjoint splits) |
| **Noise** | DNS Challenge, FreeSound, ESC-50, AudioSet |
| **Room impulse responses** | MIT IR Survey, OpenAIR, BUT ReverbDB |
> **Note:** Speech enhancement operates on acoustic features, not linguistic content, so Hush works effectively across all languages.
See [DATASETS.md](https://github.com/pulp-vision/Hush/blob/main/DATASETS.md) for full details with URLs and licensing.
---
## Known Limitations
- **16 kHz only**: trained and evaluated at 16 kHz; audio at other sample rates must be resampled first
- **Separation head is auxiliary**: the background speaker mask is an ERB-domain soft mask used for training regularization, not a standalone source separation output
- **Background speakers at moderate SIR**: trained with background speakers at 12–24 dB SIR; louder competing speakers (lower SIR) may not be fully suppressed
---
## Repository Structure
```
weya-ai/hush/  (this Hugging Face repo)
├── README.md          ← This model card
├── config.json        ← Model configuration metadata
├── model_best.ckpt    ← PyTorch checkpoint
├── onnx/
│   └── advanced_dfnet16k_model_best_onnx.tar.gz  ← ONNX production bundle
└── LICENSE            ← Apache 2.0
```
Full source code, training scripts, deployment examples, and documentation are available on [**GitHub**](https://github.com/pulp-vision/Hush).
---
## Acknowledgements
Built on [DeepFilterNet](https://github.com/Rikorose/DeepFilterNet) by Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., and Andreas Maier. The core architecture, ERB filterbank, SqueezedGRU module, and loss functions closely follow the DeepFilterNet3 (DF3) design.
---
## Citation
If you use this model or code, please cite the original DeepFilterNet paper:
```bibtex
@inproceedings{schroter2023deepfilternet3,
title = {DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement},
  author = {Schröter, Hendrik and Rosenkranz, Tobias and Escalante-B., Alberto N and Maier, Andreas},
booktitle = {INTERSPEECH},
year = {2023}
}
```
---
## License
Apache License 2.0. See [LICENSE](LICENSE) for details.