ONNX Real-Time DOA Streaming
Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel ReSpeaker microphone array in real time and displays the detected sound source directions.
Overview
The script performs the following process:
- Audio Capture: Streams audio from a 6-channel microphone array (ReSpeaker)
- Channel Selection: Selects and reorders channels [1, 4, 3, 2] to get 4 channels
- Feature Extraction: Computes STFT features (magnitude, phase, cosine, sine) from the audio
- ONNX Inference: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
- Histogram Aggregation: Aggregates logits into a circular histogram of azimuth angles
- Peak Detection: Finds peaks in the histogram to identify sound source directions
- Event Gating: Filters detections based on audio level changes and coherence
- Visualization: Displays detected directions on a polar plot in real-time
Prerequisites
Hardware
- ReSpeaker 6-Mic Array (or compatible multi-channel microphone)
- Expected microphone geometry (from the config):

  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
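Given these positions, each mic's nominal azimuth and radius can be sanity-checked with a few lines (a standalone sketch, not part of the script):

```python
import math

# Mic positions in metres, taken from the geometry above
positions = [(0.0277, 0.0), (0.0, 0.0277), (-0.0277, 0.0), (0.0, -0.0277)]
for i, (x, y) in enumerate(positions):
    az = math.degrees(math.atan2(y, x)) % 360.0      # azimuth of each mic
    print(f"Mic {i}: {az:.0f} deg, radius {math.hypot(x, y) * 100:.2f} cm")
```

This should report the four mics at 0°, 90°, 180°, and 270°, each 2.77 cm from the array center.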
- NVIDIA GPU (optional, for faster inference)
Software Dependencies
Install the required packages:
conda activate doaEnv
pip install onnxruntime-gpu # For GPU inference
# OR
pip install onnxruntime # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
ONNX Model
You need a converted ONNX model file. If you haven't converted your PyTorch model yet:
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
Quick Start
1. List Available Audio Devices
First, find your ReSpeaker device index:
python onnx_stream_microphone.py --list-devices
Look for a device named "ReSpeaker" or "Seeed", or one whose name contains "2886". Note its device index.
2. Stop PulseAudio (Required)
On Linux, PulseAudio often locks the ALSA devices. You need to temporarily stop it:
pulseaudio --kill
Note: You can use the helper script run_onnx_stream.sh which automates this (see below).
3. Run the Streaming Script
Basic usage:
python onnx_stream_microphone.py \
--onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
--device-index 9
4. Restart PulseAudio (After Stopping)
After you're done, restart PulseAudio:
pulseaudio --start
Using the Helper Script
A helper script automates PulseAudio management:
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
This script will:
- Stop PulseAudio
- Run the streaming script
- Restart PulseAudio when you exit (Ctrl+C)
Command-Line Arguments
Required Arguments
- --onnx PATH: Path to the ONNX model file
Audio Configuration
- --device-index INT: Audio device index (use --list-devices to find it)
- --sample-rate INT: Sample rate in Hz (default: 16000)
- --window-ms INT: Analysis window length in milliseconds (default: 200)
- --hop-ms INT: Hop size (overlap) in milliseconds (default: 100)
- --chunk-size INT: Audio buffer chunk size (default: 1600)
- --cpu-only: Use CPU only (disable GPU inference)
- --list-devices: List all available audio input devices and exit
Model Configuration
- --config PATH: Path to config.yaml (default: configs/train.yaml)
Histogram Detection Parameters
These control how DOA peaks are detected from the model logits:
- --K INT: Number of azimuth bins (default: 72; should match the model)
- --tau FLOAT: Softmax temperature for the histogram (default: 0.8)
- --smooth-k INT: Histogram smoothing kernel size (default: 1)
- --min-peak-height FLOAT: Minimum peak height threshold (default: 0.10)
- --min-window-mass FLOAT: Minimum window mass for peak validation (default: 0.24)
- --min-sep-deg FLOAT: Minimum angular separation between peaks in degrees (default: 20.0)
- --min-active-ratio FLOAT: Minimum active frame ratio (default: 0.20)
- --max-sources INT: Maximum number of sources to detect (default: 3)
Event Gate Parameters
These control when detections are considered valid (filtering noise):
- --level-delta-on-db FLOAT: Level increase threshold to open the gate (default: 2.5)
- --level-delta-off-db FLOAT: Level decrease threshold to close the gate (default: 1.0)
- --level-min-dbfs FLOAT: Minimum audio level in dBFS (default: -60.0)
- --level-ema-alpha FLOAT: Exponential moving average alpha for level tracking (default: 0.05)
- --event-hold-ms INT: Minimum time to keep the gate open after a detection (default: 300)
- --min-R-clip FLOAT: Minimum R_clip (coherence measure) to open the gate (default: 0.18)
- --event-refractory-ms INT: Minimum time between gate state changes (default: 120)
Onset Detection Parameters
- --onset-alpha FLOAT: EMA alpha for spectral flux tracking (default: 0.05)
Example with Custom Parameters
python onnx_stream_microphone.py \
--onnx doa_model.onnx \
--device-index 9 \
--window-ms 400 \
--hop-ms 100 \
--K 72 \
--max-sources 2 \
--tau 0.8 \
--smooth-k 1 \
--min-peak-height 0.08 \
--min-window-mass 0.16 \
--min-sep-deg 22.5 \
--min-active-ratio 0.15 \
--level-delta-on-db 4.0 \
--level-delta-off-db 1.5 \
--level-min-dbfs -55.0 \
--level-ema-alpha 0.05 \
--event-hold-ms 320 \
--event-refractory-ms 200 \
--min-R-clip 0.30 \
--onset-alpha 0.05
Understanding the Output
Console Output
Each line shows:
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
- [time]: Elapsed time in seconds
- LVL: Audio level in dBFS
- diff: Level difference from the background (dB)
- FLUXz: Spectral flux z-score (onset detection)
- COH: Inter-microphone coherence
- GATE: Gate state (OPEN/CLOSED)
- MODEL: Model inference time (ms)
- HIST: Histogram processing time (ms)
- DOA(R=..., n=...): R_clip value and number of detected peaks
- [angles]: Detected azimuth angles in degrees
Visual Output
A polar plot window shows:
- Green lines: Detected sound source directions
- Line thickness: Proportional to confidence score
- Angle labels: Azimuth in degrees (0° = North/front)
Azimuth Convention
- 0° = North (front of microphone)
- 90° = East (right)
- 180° = South (back)
- 270° = West (left)
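Following this convention, a minimal polar display can be sketched with matplotlib (the angles, scores, and output filename are illustrative; the actual script draws into an interactive window). Note that with 0° at the top and 90° to the right, azimuth increases clockwise, hence the negative theta direction:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

angles_deg = [45.0, 180.0]   # hypothetical detected azimuths
scores = [0.6, 0.3]          # hypothetical confidence per source

fig = plt.figure()
ax = fig.add_subplot(projection="polar")
ax.set_theta_zero_location("N")   # 0° at the top = front of the array
ax.set_theta_direction(-1)        # clockwise, so 90° lands on the right
for a, s in zip(angles_deg, scores):
    # line thickness proportional to confidence, as in the script's plot
    ax.plot([np.radians(a)] * 2, [0.0, 1.0], color="green", linewidth=1 + 6 * s)
fig.savefig("doa_sketch.png")
```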
How It Works
1. Audio Processing Pipeline
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
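The channel selection step can be sketched as follows, assuming PyAudio delivers interleaved int16 frames from the 6-channel device (the function name is illustrative):

```python
import numpy as np

def select_channels(chunk_bytes, n_in=6, order=(1, 4, 3, 2)):
    # De-interleave the raw byte buffer into (samples, n_in) int16,
    # pick and reorder the 4 mic channels, and scale to float32 in [-1, 1].
    raw = np.frombuffer(chunk_bytes, dtype=np.int16).reshape(-1, n_in)
    return raw[:, list(order)].astype(np.float32) / 32768.0  # (samples, 4)
```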
2. Feature Extraction
For each analysis window:
- Compute STFT for all 4 channels
- Extract magnitude, phase, cosine, and sine components
- Result: (T_frames, 12_features, F_freq_bins)
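A sketch of this step (framing parameters are assumptions; the feature layout of 12 = 4 channels × [magnitude, cos(phase), sin(phase)] is one plausible reading of the shape above, not confirmed from the script):

```python
import numpy as np

def stft_features(audio, n_fft=512, hop=256):
    # audio: (samples, 4) float32. Returns (T, 12, F) feature frames.
    win = np.hanning(n_fft)
    n_frames = 1 + (audio.shape[0] - n_fft) // hop
    feats = []
    for t in range(n_frames):
        seg = audio[t * hop : t * hop + n_fft] * win[:, None]
        spec = np.fft.rfft(seg, axis=0)        # (F, 4) complex spectrum
        mag, ph = np.abs(spec), np.angle(spec)
        # per-channel magnitude, cos(phase), sin(phase) -> (F, 12)
        f = np.concatenate([mag, np.cos(ph), np.sin(ph)], axis=1)
        feats.append(f.T)                       # (12, F)
    return np.stack(feats)                      # (T, 12, F)
```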
3. Model Inference
- Batch process features through ONNX model
- Output: (T_frames, K_bins) logits per frame
- Each frame has K probability scores for different azimuth angles
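The batched inference can be sketched with a small helper (run_batched and run_fn are hypothetical names; in the real script the callable would wrap an onnxruntime InferenceSession's run method):

```python
import numpy as np

def run_batched(frames, run_fn, batch_size=25):
    # frames: (T, 12, F). run_fn maps a (B, 12, F) batch to (B, K) logits.
    # Chunking into fixed-size batches keeps GPU utilization steady.
    out = [run_fn(frames[i:i + batch_size])
           for i in range(0, len(frames), batch_size)]
    return np.concatenate(out, axis=0)  # (T, K)
```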
4. Histogram Aggregation
- Apply softmax with temperature tau to the logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
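The aggregation steps above can be sketched as (a minimal version; function and parameter names mirror the CLI flags but the exact weighting in the script may differ):

```python
import numpy as np

def aggregate_histogram(logits, weights=None, tau=0.8, smooth_k=1):
    # logits: (T, K). Temperature softmax per frame, optional per-frame
    # weights (e.g. R_clip), sum across frames, circular box smoothing.
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    if weights is not None:
        p = p * weights[:, None]
    hist = p.sum(axis=0)
    hist /= hist.sum()
    if smooth_k > 1:  # circular smoothing; assumes odd smooth_k
        hist = sum(np.roll(hist, s)
                   for s in range(-(smooth_k // 2),
                                  smooth_k - smooth_k // 2)) / smooth_k
    return hist  # (K,), sums to 1
```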
5. Peak Detection
- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to max_sources peaks
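A sketch of circular peak picking with parabolic refinement (the height threshold here is taken relative to the histogram maximum, which is an assumption; the script also validates window mass, which is omitted):

```python
import numpy as np

def find_peaks_circular(hist, max_sources=3, min_height=0.10, min_sep_deg=20.0):
    K = len(hist)
    bin_deg = 360.0 / K
    peaks = []
    for i in np.argsort(hist)[::-1]:            # visit bins tallest-first
        if hist[i] < min_height * hist.max():
            break
        if hist[i] < hist[(i - 1) % K] or hist[i] < hist[(i + 1) % K]:
            continue                             # not a circular local max
        if any(min(abs(i - j) % K, K - abs(i - j) % K) * bin_deg < min_sep_deg
               for j, _ in peaks):
            continue                             # too close to a kept peak
        # Parabolic interpolation of the true peak between neighboring bins
        y0, y1, y2 = hist[(i - 1) % K], hist[i], hist[(i + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.5 * (y0 - y2) / denom if denom != 0 else 0.0
        peaks.append((i, ((i + delta) * bin_deg) % 360.0))
        if len(peaks) >= max_sources:
            break
    return [angle for _, angle in peaks]
```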
6. Event Gating
- Track audio level with exponential moving average
- Open the gate when:
  - the level increases by level_delta_on_db, OR
  - valid peaks are detected AND R_clip > min_R_clip
- Close the gate when the level drops and no valid peaks remain
- Apply hold and refractory periods to prevent flickering
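The gating logic above can be sketched as a small state machine (parameter names mirror the CLI flags; the hold handling and background-tracking policy are assumptions, and the refractory period is omitted for brevity):

```python
class EventGate:
    def __init__(self, delta_on=2.5, delta_off=1.0, min_dbfs=-60.0,
                 alpha=0.05, hold_ms=300, min_r_clip=0.18):
        self.delta_on, self.delta_off = delta_on, delta_off
        self.min_dbfs, self.alpha = min_dbfs, alpha
        self.hold_ms, self.min_r_clip = hold_ms, min_r_clip
        self.bg = None        # EMA of the background level
        self.open = False
        self.hold = 0.0       # remaining hold time in ms

    def update(self, level_db, r_clip, has_peaks, dt_ms=100.0):
        if self.bg is None:
            self.bg = level_db
        diff = level_db - self.bg
        if not self.open:
            loud = level_db > self.min_dbfs
            if loud and (diff >= self.delta_on
                         or (has_peaks and r_clip >= self.min_r_clip)):
                self.open, self.hold = True, self.hold_ms
        else:
            self.hold = max(0.0, self.hold - dt_ms)
            if self.hold == 0.0 and diff <= self.delta_off and not has_peaks:
                self.open = False
        if not self.open:  # track the background only while the gate is closed
            self.bg += self.alpha * (level_db - self.bg)
        return self.open
```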
Troubleshooting
"Invalid number of channels" Error
Problem: Device reports 0 channels or PyAudio can't open it.
Solution:
- Stop PulseAudio: pulseaudio --kill
- Run the script
- Restart PulseAudio: pulseaudio --start
Or use the helper script run_onnx_stream.sh.
No Audio Detected
- Check microphone connections
- Verify the device index with --list-devices
- Check audio levels (they should be above level_min_dbfs)
- Lower level_delta_on_db to make detection more sensitive
GPU Not Used
- Verify CUDA is available: python -c "import torch; print(torch.cuda.is_available())"
- Install onnxruntime-gpu instead of onnxruntime
- Check that CUDA providers are listed in the model loading message
Model Mismatch Errors
- Ensure --K matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify config.yaml matches training configuration
Poor DOA Accuracy
- Increase --window-ms for longer, more stable analysis windows
- Adjust the --min-peak-height and --min-window-mass thresholds
- Tune --tau (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning
Performance Tips
- GPU Inference: Use onnxruntime-gpu for a 5-10x speedup
- Window Size: Larger windows (400 ms) = more stable but higher latency
- Hop Size: Smaller hops (50 ms) = more responsive but more computation
- Batch Size: The script uses batch_size=25 internally for efficient GPU usage
Stopping the Script
Press Ctrl+C to stop the stream. The script will:
- Close the audio stream
- Close the visualization window
- Clean up resources
Integration
To use this in your own code, see onnx_doa_inference.py which provides a standalone inference class that can be integrated into other projects.