ONNX Real-Time DOA Streaming
Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel ReSpeaker microphone array in real time and displays the detected sound source directions.
Overview
The script performs the following process:
- Audio Capture: Streams audio from a 6-channel microphone array (ReSpeaker)
- Channel Selection: Selects and reorders channels [1, 4, 3, 2] to get 4 channels
- Feature Extraction: Computes STFT features (magnitude, phase, cosine, sine) from the audio
- ONNX Inference: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
- Histogram Aggregation: Aggregates logits into a circular histogram of azimuth angles
- Peak Detection: Finds peaks in the histogram to identify sound source directions
- Event Gating: Filters detections based on audio level changes and coherence
- Visualization: Displays detected directions on a polar plot in real-time
Prerequisites
Hardware
- ReSpeaker 6-Mic Array (or compatible multi-channel microphone)
- Expected microphone geometry (from the config):

  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
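Given these positions, each mic's nominal azimuth and radius can be sanity-checked with a few lines (a standalone sketch, not part of the script):

```python
import math

# Mic positions in metres, taken from the geometry above
positions = [(0.0277, 0.0), (0.0, 0.0277), (-0.0277, 0.0), (0.0, -0.0277)]
for i, (x, y) in enumerate(positions):
    az = math.degrees(math.atan2(y, x)) % 360.0      # azimuth of each mic
    print(f"Mic {i}: {az:.0f} deg, radius {math.hypot(x, y) * 100:.2f} cm")
```

This should report the four mics at 0°, 90°, 180°, and 270°, each 2.77 cm from the array center.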
- NVIDIA GPU (optional, for faster inference)
Software Dependencies
Install the required packages:
conda activate doaEnv
pip install onnxruntime-gpu # For GPU inference
# OR
pip install onnxruntime # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
ONNX Model
You need a converted ONNX model file. If you haven't converted your PyTorch model yet:
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
Quick Start
1. List Available Audio Devices
First, find your ReSpeaker device index:
python onnx_stream_microphone.py --list-devices
Look for a device named "ReSpeaker" or "Seeed", or one whose name contains "2886". Note its device index.
2. Stop PulseAudio (Required)
On Linux, PulseAudio often locks the ALSA devices. You need to temporarily stop it:
pulseaudio --kill
Note: You can use the helper script run_onnx_stream.sh which automates this (see below).
3. Run the Streaming Script
Basic usage:
python onnx_stream_microphone.py \
--onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
--device-index 9
4. Restart PulseAudio (After Stopping)
After you're done, restart PulseAudio:
pulseaudio --start
Using the Helper Script
A helper script automates PulseAudio management:
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
This script will:
- Stop PulseAudio
- Run the streaming script
- Restart PulseAudio when you exit (Ctrl+C)
Command-Line Arguments
Required Arguments
- --onnx PATH: Path to the ONNX model file
Audio Configuration
- --device-index INT: Audio device index (use --list-devices to find it)
- --sample-rate INT: Sample rate in Hz (default: 16000)
- --window-ms INT: Analysis window length in milliseconds (default: 200)
- --hop-ms INT: Hop size (overlap) in milliseconds (default: 100)
- --chunk-size INT: Audio buffer chunk size (default: 1600)
- --cpu-only: Use CPU only (disable GPU inference)
- --list-devices: List all available audio input devices and exit
Model Configuration
- --config PATH: Path to config.yaml (default: configs/train.yaml)
Histogram Detection Parameters
These control how DOA peaks are detected from the model logits:
- --K INT: Number of azimuth bins (default: 72; should match the model)
- --tau FLOAT: Softmax temperature for the histogram (default: 0.8)
- --smooth-k INT: Histogram smoothing kernel size (default: 1)
- --min-peak-height FLOAT: Minimum peak height threshold (default: 0.10)
- --min-window-mass FLOAT: Minimum window mass for peak validation (default: 0.24)
- --min-sep-deg FLOAT: Minimum angular separation between peaks in degrees (default: 20.0)
- --min-active-ratio FLOAT: Minimum active frame ratio (default: 0.20)
- --max-sources INT: Maximum number of sources to detect (default: 3)
Event Gate Parameters
These control when detections are considered valid (filtering noise):
- --level-delta-on-db FLOAT: Level increase threshold to open the gate (default: 2.5)
- --level-delta-off-db FLOAT: Level decrease threshold to close the gate (default: 1.0)
- --level-min-dbfs FLOAT: Minimum audio level in dBFS (default: -60.0)
- --level-ema-alpha FLOAT: Exponential moving average alpha for level tracking (default: 0.05)
- --event-hold-ms INT: Minimum time to keep the gate open after a detection (default: 300)
- --min-R-clip FLOAT: Minimum R_clip (coherence measure) to open the gate (default: 0.18)
- --event-refractory-ms INT: Minimum time between gate state changes (default: 120)
Onset Detection Parameters
- --onset-alpha FLOAT: EMA alpha for spectral flux tracking (default: 0.05)
Example with Custom Parameters
python onnx_stream_microphone.py \
--onnx doa_model.onnx \
--device-index 9 \
--window-ms 400 \
--hop-ms 100 \
--K 72 \
--max-sources 2 \
--tau 0.8 \
--smooth-k 1 \
--min-peak-height 0.08 \
--min-window-mass 0.16 \
--min-sep-deg 22.5 \
--min-active-ratio 0.15 \
--level-delta-on-db 4.0 \
--level-delta-off-db 1.5 \
--level-min-dbfs -55.0 \
--level-ema-alpha 0.05 \
--event-hold-ms 320 \
--event-refractory-ms 200 \
--min-R-clip 0.30 \
--onset-alpha 0.05
Understanding the Output
Console Output
Each line shows:
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
- [time]: Elapsed time in seconds
- LVL: Audio level in dBFS
- diff: Level difference from the background (dB)
- FLUXz: Spectral flux z-score (onset detection)
- COH: Inter-microphone coherence
- GATE: Gate state (OPEN/CLOSED)
- MODEL: Model inference time (ms)
- HIST: Histogram processing time (ms)
- DOA(R=..., n=...): R_clip value and number of detected peaks
- [angles]: Detected azimuth angles in degrees
Visual Output
A polar plot window shows:
- Green lines: Detected sound source directions
- Line thickness: Proportional to confidence score
- Angle labels: Azimuth in degrees (0° = North/front)
Azimuth Convention
- 0° = North (front of microphone)
- 90° = East (right)
- 180° = South (back)
- 270° = West (left)
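Following this convention, a minimal polar display can be sketched with matplotlib (the angles, scores, and output filename are illustrative; the actual script draws into an interactive window). Note that with 0° at the top and 90° to the right, azimuth increases clockwise, hence the negative theta direction:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

angles_deg = [45.0, 180.0]   # hypothetical detected azimuths
scores = [0.6, 0.3]          # hypothetical confidence per source

fig = plt.figure()
ax = fig.add_subplot(projection="polar")
ax.set_theta_zero_location("N")   # 0° at the top = front of the array
ax.set_theta_direction(-1)        # clockwise, so 90° lands on the right
for a, s in zip(angles_deg, scores):
    # line thickness proportional to confidence, as in the script's plot
    ax.plot([np.radians(a)] * 2, [0.0, 1.0], color="green", linewidth=1 + 6 * s)
fig.savefig("doa_sketch.png")
```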
How It Works
1. Audio Processing Pipeline
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
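The channel selection step can be sketched as follows, assuming PyAudio delivers interleaved int16 frames from the 6-channel device (the function name is illustrative):

```python
import numpy as np

def select_channels(chunk_bytes, n_in=6, order=(1, 4, 3, 2)):
    # De-interleave the raw byte buffer into (samples, n_in) int16,
    # pick and reorder the 4 mic channels, and scale to float32 in [-1, 1].
    raw = np.frombuffer(chunk_bytes, dtype=np.int16).reshape(-1, n_in)
    return raw[:, list(order)].astype(np.float32) / 32768.0  # (samples, 4)
```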
2. Feature Extraction
For each analysis window:
- Compute STFT for all 4 channels
- Extract magnitude, phase, cosine, and sine components
- Result: (T_frames, 12_features, F_freq_bins)
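A sketch of this step (framing parameters are assumptions; the feature layout of 12 = 4 channels × [magnitude, cos(phase), sin(phase)] is one plausible reading of the shape above, not confirmed from the script):

```python
import numpy as np

def stft_features(audio, n_fft=512, hop=256):
    # audio: (samples, 4) float32. Returns (T, 12, F) feature frames.
    win = np.hanning(n_fft)
    n_frames = 1 + (audio.shape[0] - n_fft) // hop
    feats = []
    for t in range(n_frames):
        seg = audio[t * hop : t * hop + n_fft] * win[:, None]
        spec = np.fft.rfft(seg, axis=0)        # (F, 4) complex spectrum
        mag, ph = np.abs(spec), np.angle(spec)
        # per-channel magnitude, cos(phase), sin(phase) -> (F, 12)
        f = np.concatenate([mag, np.cos(ph), np.sin(ph)], axis=1)
        feats.append(f.T)                       # (12, F)
    return np.stack(feats)                      # (T, 12, F)
```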
3. Model Inference
- Batch process features through ONNX model
- Output: (T_frames, K_bins) logits per frame
- Each frame has K probability scores for different azimuth angles
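The batched inference can be sketched with a small helper (run_batched and run_fn are hypothetical names; in the real script the callable would wrap an onnxruntime InferenceSession's run method):

```python
import numpy as np

def run_batched(frames, run_fn, batch_size=25):
    # frames: (T, 12, F). run_fn maps a (B, 12, F) batch to (B, K) logits.
    # Chunking into fixed-size batches keeps GPU utilization steady.
    out = [run_fn(frames[i:i + batch_size])
           for i in range(0, len(frames), batch_size)]
    return np.concatenate(out, axis=0)  # (T, K)
```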
4. Histogram Aggregation
- Apply softmax with temperature tau to the logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
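The aggregation steps above can be sketched as (a minimal version; function and parameter names mirror the CLI flags but the exact weighting in the script may differ):

```python
import numpy as np

def aggregate_histogram(logits, weights=None, tau=0.8, smooth_k=1):
    # logits: (T, K). Temperature softmax per frame, optional per-frame
    # weights (e.g. R_clip), sum across frames, circular box smoothing.
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    if weights is not None:
        p = p * weights[:, None]
    hist = p.sum(axis=0)
    hist /= hist.sum()
    if smooth_k > 1:  # circular smoothing; assumes odd smooth_k
        hist = sum(np.roll(hist, s)
                   for s in range(-(smooth_k // 2),
                                  smooth_k - smooth_k // 2)) / smooth_k
    return hist  # (K,), sums to 1
```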
5. Peak Detection
- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to max_sources peaks
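A sketch of circular peak picking with parabolic refinement (the height threshold here is taken relative to the histogram maximum, which is an assumption; the script also validates window mass, which is omitted):

```python
import numpy as np

def find_peaks_circular(hist, max_sources=3, min_height=0.10, min_sep_deg=20.0):
    K = len(hist)
    bin_deg = 360.0 / K
    peaks = []
    for i in np.argsort(hist)[::-1]:            # visit bins tallest-first
        if hist[i] < min_height * hist.max():
            break
        if hist[i] < hist[(i - 1) % K] or hist[i] < hist[(i + 1) % K]:
            continue                             # not a circular local max
        if any(min(abs(i - j) % K, K - abs(i - j) % K) * bin_deg < min_sep_deg
               for j, _ in peaks):
            continue                             # too close to a kept peak
        # Parabolic interpolation of the true peak between neighboring bins
        y0, y1, y2 = hist[(i - 1) % K], hist[i], hist[(i + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.5 * (y0 - y2) / denom if denom != 0 else 0.0
        peaks.append((i, ((i + delta) * bin_deg) % 360.0))
        if len(peaks) >= max_sources:
            break
    return [angle for _, angle in peaks]
```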
6. Event Gating
- Track audio level with exponential moving average
- Open the gate when:
  - the level increases by level_delta_on_db, OR
  - valid peaks are detected AND R_clip > min_R_clip
- Close the gate when the level drops and no valid peaks remain
- Apply hold and refractory periods to prevent flickering
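The gating logic above can be sketched as a small state machine (parameter names mirror the CLI flags; the hold handling and background-tracking policy are assumptions, and the refractory period is omitted for brevity):

```python
class EventGate:
    def __init__(self, delta_on=2.5, delta_off=1.0, min_dbfs=-60.0,
                 alpha=0.05, hold_ms=300, min_r_clip=0.18):
        self.delta_on, self.delta_off = delta_on, delta_off
        self.min_dbfs, self.alpha = min_dbfs, alpha
        self.hold_ms, self.min_r_clip = hold_ms, min_r_clip
        self.bg = None        # EMA of the background level
        self.open = False
        self.hold = 0.0       # remaining hold time in ms

    def update(self, level_db, r_clip, has_peaks, dt_ms=100.0):
        if self.bg is None:
            self.bg = level_db
        diff = level_db - self.bg
        if not self.open:
            loud = level_db > self.min_dbfs
            if loud and (diff >= self.delta_on
                         or (has_peaks and r_clip >= self.min_r_clip)):
                self.open, self.hold = True, self.hold_ms
        else:
            self.hold = max(0.0, self.hold - dt_ms)
            if self.hold == 0.0 and diff <= self.delta_off and not has_peaks:
                self.open = False
        if not self.open:  # track the background only while the gate is closed
            self.bg += self.alpha * (level_db - self.bg)
        return self.open
```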
Troubleshooting
"Invalid number of channels" Error
Problem: Device reports 0 channels or PyAudio can't open it.
Solution:
- Stop PulseAudio: pulseaudio --kill
- Run the script
- Restart PulseAudio: pulseaudio --start
Or use the helper script run_onnx_stream.sh.
No Audio Detected
- Check microphone connections
- Verify the device index with --list-devices
- Check audio levels (they should be above level_min_dbfs)
- Lower level_delta_on_db to make detection more sensitive
GPU Not Used
- Verify CUDA is available: python -c "import torch; print(torch.cuda.is_available())"
- Install onnxruntime-gpu instead of onnxruntime
- Check that CUDA providers are listed in the model loading message
Model Mismatch Errors
- Ensure --K matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify config.yaml matches training configuration
Poor DOA Accuracy
- Increase --window-ms for longer, more stable analysis windows
- Adjust the --min-peak-height and --min-window-mass thresholds
- Tune --tau (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning
Performance Tips
- GPU Inference: Use onnxruntime-gpu for a 5-10x speedup
- Window Size: Larger windows (400 ms) = more stable but higher latency
- Hop Size: Smaller hops (50 ms) = more responsive but more computation
- Batch Size: The script uses batch_size=25 internally for efficient GPU usage
Stopping the Script
Press Ctrl+C to stop the stream. The script will:
- Close the audio stream
- Close the visualization window
- Clean up resources
Integration
To use this in your own code, see onnx_doa_inference.py which provides a standalone inference class that can be integrated into other projects.