---
language: en
license: apache-2.0
library_name: transformers
tags:
- audio-classification
- deepfake-detection
- wav2vec2
- finquest-2026
datasets:
- JesseHuang922/VoxSentinel-Base-Dataset
metrics:
- accuracy
---
Sentinel: VoxSentinel-Base
GitHub Repo: https://github.com/JesseLau24/Finquest_VoxSentinel_Base
Sentinel is an industrial-grade forensic model designed to detect AI-synthesized speech. Developed for FinQuest 2026, it captures subtle neural vocoder artifacts that human ears often miss.
Key Innovation: MSAP Head
Unlike standard pooling, Sentinel uses Multi-Scale Attentive Pooling (MSAP). By fusing Weighted Mean and Weighted Standard Deviation, it extracts a 1536-dimensional "acoustic fingerprint" to identify the rigid textures of voice cloning.
Performance (In-The-Wild Test)
- Accuracy: 99.56%
- EER: 0.0007
- F1-Score: 0.99 (Fake) / 1.00 (Real)
1. Core Architecture: Sentinel-Base (v1.0)
The system transitions from traditional CNN-based detection to a Self-Supervised Transformer backbone with a custom forensic head.
- Backbone: `facebook/wav2vec2-base` (fine-tuned via Layer-wise Learning Rate Decay).
- Feature Head: Multi-Scale Attentive Pooling (MSAP).
  - Mechanism: Instead of a simple mean, MSAP calculates Attention-weighted Mean ($\mu$) and Attention-weighted Standard Deviation ($\sigma$).
  - Dimension: $768 (\mu) + 768 (\sigma) = 1536$ total features.
  - Logic: Captures both static spectral artifacts and dynamic temporal "texture" inconsistencies (e.g., unnatural smoothness in neural vocoders).
- Classifier: 3-Layer MLP ($1536 \rightarrow 512 \rightarrow 256 \rightarrow 2$) with BatchNorm and Dropout (0.3).
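The pooling step can be written out explicitly. The formulas below follow the `MultiScaleAttentivePooling` code in the Quick Start section; the symbols $h_t$, $e_t$, $\alpha_t$, $T$, $W$, $b$, $\mathbf{v}$, $c$, and $z$ are notation introduced here for illustration and do not appear in the original card.

$$
e_t = \mathbf{v}^\top \tanh(W h_t + b) + c, \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}
$$

$$
\mu = \sum_{t=1}^{T} \alpha_t h_t, \qquad
\sigma = \sqrt{\max\!\left(\sum_{t=1}^{T} \alpha_t \,(h_t - \mu)^{2},\ 10^{-9}\right)}, \qquad
z = [\mu;\ \sigma] \in \mathbb{R}^{1536}
$$

Here $h_t \in \mathbb{R}^{768}$ is the backbone output at frame $t$, $W \in \mathbb{R}^{128 \times 768}$ and $\mathbf{v} \in \mathbb{R}^{128}$ are the two linear layers of the attention scorer, and the square, maximum, and square root are applied element-wise. The $10^{-9}$ floor mirrors the `torch.clamp` call in the code and keeps the square root stable when a feature has near-zero weighted variance.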
2. Dataset: The Master Protocol (v2)
We used an aggregated corpus of 116,390 samples to encourage cross-generator generalization.
Data Composition
| Source Category | Sample Count | Weight | Description |
|---|---|---|---|
| FOR (Fake-or-Real) | 50,890 | 43.7% | Original, Norm, 2sec, and Rerecorded variants. |
| WaveFake (WF) | 35,500 | 30.5% | JSUT & LJSpeech official (HiFiGAN, MelGAN, etc). |
| In-the-Wild | 12,000 | 10.3% | Real-world scraped deepfakes and authentic audio. |
| ASVspoof 2019 | 8,000 | 6.9% | Academic benchmark for logical access attacks. |
| LJ_Real | 10,000 | 8.6% | High-fidelity authentic reference. |
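As a quick consistency check, the Weight column can be recomputed from the sample counts. This is a minimal sketch that only uses the numbers in the table above; the variable names are illustrative.

```python
# Sample counts per source, copied from the Data Composition table.
counts = {
    "FOR (Fake-or-Real)": 50_890,
    "WaveFake (WF)": 35_500,
    "In-the-Wild": 12_000,
    "ASVspoof 2019": 8_000,
    "LJ_Real": 10_000,
}

total = sum(counts.values())
# Share of each source in the full corpus, as a percentage with one decimal.
weights = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(f"total samples: {total:,}")
for name, pct in weights.items():
    print(f"{name:<20} {pct:5.1f}%")
```

The counts sum to the 116,390 samples quoted above, and the rounded shares match the Weight column.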
3. Forensic Training Regime
Data Augmentation (Asymmetric Strategy)
To bridge the gap between "Lab" and "Wild", we implemented:
- Cyclic Tiling: Clips shorter than 4 s are tiled (repeated end to end) to maintain temporal receptive fields without energy loss.
- Asymmetric Toughening: Heavy MP3 compression and Room simulation were applied to clean sources to simulate real-world telephonic degradation.
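Cyclic tiling can be sketched as follows. This is an illustration, not the training code: `cyclic_tile` is a hypothetical helper, the 16 kHz rate is taken from the Quick Start code, and truncating the tiled clip to exactly 4 s is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed: the 16 kHz rate used in the Quick Start code
MIN_SECONDS = 4.0     # clips shorter than this are tiled


def cyclic_tile(wav: np.ndarray, sample_rate: int = SAMPLE_RATE,
                min_seconds: float = MIN_SECONDS) -> np.ndarray:
    """Repeat a short 1-D waveform end to end until it spans min_seconds."""
    target_len = int(round(sample_rate * min_seconds))
    if wav.size == 0 or wav.size >= target_len:
        return wav  # empty input or already long enough: leave unchanged
    reps = -(-target_len // wav.size)  # ceiling division
    # Unlike zero-padding, every sample of the result carries signal.
    return np.tile(wav, reps)[:target_len]


# A 1.5 s dummy clip is extended to 4 s by repeating its content.
clip = np.arange(24_000, dtype=np.float32)
tiled = cyclic_tile(clip)
print(tiled.shape)
```

Repeating the clip instead of zero-padding it keeps the average signal energy unchanged, which is presumably what "without energy loss" refers to.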
Hyperparameters (RTX 5080 Optimized)
| Parameter | Value | Note |
|---|---|---|
| Batch Size | 32 | Balanced for gradient stability. |
| Encoder LR | 4e-6 | LLRD (Layer-wise Learning Rate Decay). |
| Top-Layer LR | 4e-5 | High-velocity head training (Pooling & MLP). |
| Optimizer | AdamW | Weight decay 0.01. |
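The table does not give the LLRD decay factor or say which layer the encoder rate applies to. The sketch below shows one common way to set it up, assuming the 4e-6 rate belongs to the top encoder layer and using a hypothetical decay of 0.9; `llrd_learning_rates` and the group names are illustrative.

```python
def llrd_learning_rates(top_lr: float, num_layers: int, decay: float) -> list[float]:
    """Return one learning rate per encoder layer (index 0 = lowest layer).

    The top layer gets top_lr; each layer below it is scaled by `decay`
    once more, so lower layers are updated more slowly.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


ENCODER_LR = 4e-6  # "Encoder LR" from the table
HEAD_LR = 4e-5     # "Top-Layer LR" from the table (pooling + MLP)
NUM_LAYERS = 12    # transformer layers in wav2vec2-base
DECAY = 0.9        # hypothetical decay factor, not stated in this card

layer_lrs = llrd_learning_rates(ENCODER_LR, NUM_LAYERS, DECAY)
param_groups = [{"name": f"encoder.layers.{i}", "lr": lr}
                for i, lr in enumerate(layer_lrs)]
param_groups.append({"name": "pooling+classifier", "lr": HEAD_LR})

for group in param_groups:
    print(f"{group['name']:<22} lr={group['lr']:.2e}")
```

In PyTorch, dictionaries like these, each with an added `params` entry holding the corresponding parameters, can be passed to `torch.optim.AdamW(param_groups, weight_decay=0.01)` so that every group trains at its own rate.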
4. Performance & Validation
A. Internal Benchmark (Master Protocol)
- Best Dev Accuracy: `99.85%`
- Best Dev EER: `0.0013`
- In-the-Wild Test Accuracy: 99.56% (on 31,779 samples).
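The card does not describe how EER was computed. As a rough sketch of the usual definition (the operating point where the false-acceptance and false-rejection rates are equal), assuming each clip is scored with the model's probability for the "Real" class; `compute_eer` is a hypothetical helper, not part of the released code.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate the equal error rate from per-clip scores.

    scores: higher means "more likely real".
    labels: 1 for real, 0 for fake (the mapping used in the Quick Start code).
    Both classes must be present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real, fake = scores[labels == 1], scores[labels == 0]

    best_gap, eer = float("inf"), 1.0
    for t in np.unique(scores):        # candidate decision thresholds
        far = np.mean(fake >= t)       # fakes accepted as real
        frr = np.mean(real < t)        # reals rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:             # keep the threshold where FAR and FRR are closest
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Perfectly separated scores give an EER of 0.
print(compute_eer([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

Values such as 0.0013 are fractions, so they correspond to 0.13%. The loop is quadratic in the number of clips; for large evaluation sets a sorted, vectorized sweep would be the more practical choice.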
B. Cross-Dataset Stress Test (Generalization)
Tested against completely unseen datasets to simulate adversarial scenarios:
- Unseen FoR-Norm: 94.28% Accuracy
- Unseen FoR-Rerecorded: 91.05% Accuracy
Forensic Note: The consistency across unseen data ($>90\%$) suggests that the MSAP head learns intrinsic synthesis fingerprints rather than overfitting to dataset-specific biases.
Quick Start (Inference)
Dependencies: `pip install torch transformers librosa`

    import torch
    import torch.nn as nn
    import librosa
    import numpy as np
    from transformers import Wav2Vec2PreTrainedModel, Wav2Vec2Model, AutoProcessor

    # ==============================================================================
    # 1. Architecture Definition (Required for custom MSAP head)
    # ==============================================================================
    class MultiScaleAttentivePooling(nn.Module):
        def __init__(self, hidden_size):
            super().__init__()
            self.attention = nn.Sequential(
                nn.Linear(hidden_size, 128),
                nn.Tanh(),
                nn.Linear(128, 1)
            )

        def forward(self, x):
            # x shape: (batch, seq_len, hidden_size)
            w = torch.softmax(self.attention(x), dim=1)
            mu = torch.sum(w * x, dim=1)
            # Robust variance calculation
            delta = x - mu.unsqueeze(1)
            var = torch.sum(w * (delta**2), dim=1)
            std = torch.sqrt(torch.clamp(var, min=1e-9))
            return torch.cat([mu, std], dim=-1)  # (batch, hidden_size * 2)

    class SentinelForSyntheticDetection(Wav2Vec2PreTrainedModel):
        def __init__(self, config):
            super().__init__(config)
            self.wav2vec2 = Wav2Vec2Model(config)
            self.pooling = MultiScaleAttentivePooling(config.hidden_size)
            self.classifier = nn.Sequential(
                nn.Linear(config.hidden_size * 2, 512),
                nn.ReLU(),
                nn.BatchNorm1d(512),
                nn.Dropout(0.3),
                nn.Linear(512, 256),
                nn.ReLU(),
                nn.Linear(256, 2)
            )
            self.post_init()

        def forward(self, input_values, attention_mask=None):
            outputs = self.wav2vec2(input_values, attention_mask=attention_mask)
            # Using last_hidden_state (Batch, Seq_Len, Hidden_Size)
            pooled_output = self.pooling(outputs.last_hidden_state)
            return self.classifier(pooled_output)

    # ==============================================================================
    # 2. Inference Logic (Forensic Grade)
    # ==============================================================================
    device = "cuda" if torch.cuda.is_available() else "cpu"
    repo = "JesseHuang922/VoxSentinel-Base"

    # Load model and processor from Hugging Face
    model = SentinelForSyntheticDetection.from_pretrained(repo).to(device).eval()
    processor = AutoProcessor.from_pretrained(repo)

    def predict(audio_path):
        """
        Forensic prediction with automatic channel mixing and resampling.
        """
        # 1. Load with mono=True: Automatically mixes stereo to mono (L+R)/2
        # 2. sr=16000: Forces resampling to the model's native frequency
        try:
            wav, _ = librosa.load(audio_path, sr=16000, mono=True)
            # Simple Voice Activity Detection (Optional: trims silence)
            wav, _ = librosa.effects.trim(wav, top_db=20)
            # Preprocess features
            inputs = processor(wav, return_tensors="pt", sampling_rate=16000)
            input_values = inputs.input_values.to(device)
            with torch.no_grad():
                logits = model(input_values)
            # Softmax to get confidence scores
            probs = torch.softmax(logits, dim=-1)
            pred_idx = torch.argmax(probs, dim=-1).item()
            confidence = probs[0][pred_idx].item()
            # Label Mapping: 1 -> Real, 0 -> Fake
            result = "Real" if pred_idx == 1 else "Fake"
            return f"Result: {result} ({confidence:.2%} confidence)"
        except Exception as e:
            return f"Error processing {audio_path}: {e}"

    # Usage Example:
    # print(predict("evidence_stereo_file.mp3"))
Acknowledgements
This research and model development would not have been possible without the open-source contributions of the following organizations and researchers. We are grateful for their high-quality datasets:
- Fake-or-Real (FoR) Dataset: Provided by Mohammed Abdel-Dayem et al., offering a critical multi-scenario benchmark for synthetic speech detection.
- WaveFake Dataset: A massive cross-generator corpus by Frank et al., which was instrumental in training our model to generalize across multiple TTS architectures.
- ASVspoof 2019: We thank the ASVspoof Consortium for their foundational work in standardizing the evaluation of spoofing countermeasures.
- In-the-Wild Dataset: Gratitude to the researchers of the In-the-Wild project for providing real-world deepfake samples that bridge the gap between laboratory and reality.
- Special Thanks: To the creators of LJSpeech and JSUT official corpora for providing the high-fidelity authentic references used in our WF-Enhanced protocol.
Project Pipeline
To reproduce the results, run the notebooks in the `notebooks` folder in this order:
- `01_Protocol_Gen.ipynb`: Generates the balanced 116k protocol.
- `02_Training_Wav2Vec2_base.ipynb`: Handles the training with Wav2Vec 2.0 + MSAP.
- `03_HuggingFace_Packing.ipynb`: Converts the `.pth` checkpoint to the Hugging Face standard format.
© 2026 FinQuest AI Defense Taskforce. Powered by Wav2Vec 2.0.