---
language: en
license: apache-2.0
library_name: transformers
tags:
- audio-classification
- deepfake-detection
- wav2vec2
- finquest-2026
datasets:
- JesseHuang922/VoxSentinel-Base-Dataset
metrics:
- accuracy
---

🛡️ Sentinel: VoxSentinel-Base

GitHub Repo: https://github.com/JesseLau24/Finquest_VoxSentinel_Base

Sentinel is an industrial-grade forensic model designed to detect AI-synthesized speech. Developed for FinQuest 2026, it captures subtle neural vocoder artifacts that human ears often miss.

✨ Key Innovation: MSAP Head

Unlike standard pooling, Sentinel uses Multi-Scale Attentive Pooling (MSAP). By fusing Weighted Mean and Weighted Standard Deviation, it extracts a 1536-dimensional "acoustic fingerprint" to identify the rigid textures of voice cloning.

📊 Performance (In-The-Wild Test)

Accuracy: 99.56%
EER: 0.0007
F1-Score: 0.99 (Fake) / 1.00 (Real)

🚀 1. Core Architecture: Sentinel-Base (v1.0)

The system transitions from traditional CNN-based detection to a Self-Supervised Transformer backbone with a custom forensic head.

Backbone: facebook/wav2vec2-base (Fine-tuned via Layer-wise Learning Rate Decay).
Feature Head: Multi-Scale Attentive Pooling (MSAP).
- Mechanism: Instead of a simple mean, MSAP calculates Attention-weighted Mean ($\mu$) and Attention-weighted Standard Deviation ($\sigma$).
- Dimension: $768 (\mu) + 768 (\sigma) = 1536$ total features.
- Logic: Captures both static spectral artifacts and dynamic temporal "texture" inconsistencies (e.g., unnatural smoothness in neural vocoders).
Classifier: 3-Layer MLP ($1536 \rightarrow 512 \rightarrow 256 \rightarrow 2$) with BatchNorm and Dropout (0.3).
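
Written out, the pooling step corresponds to the following, where $h_t \in \mathbb{R}^{768}$ is the backbone output at frame $t$ of $T$. The symbols $e_t$, $w_t$, $W_1$, $W_2$, $b_1$, $b_2$ and $\epsilon$ are notation introduced here for illustration; the layer sizes and the clamp value are taken from the inference code in the Quick Start section.

$$
e_t = W_2 \tanh(W_1 h_t + b_1) + b_2, \qquad
w_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad
W_1 \in \mathbb{R}^{128 \times 768},\ W_2 \in \mathbb{R}^{1 \times 128}
$$

$$
\mu = \sum_{t=1}^{T} w_t\, h_t, \qquad
\sigma = \sqrt{\max\!\Big(\sum_{t=1}^{T} w_t\, (h_t - \mu)^{2},\ \epsilon\Big)}, \qquad \epsilon = 10^{-9}
$$

The square, maximum and square root are applied element-wise, and the classifier input is the concatenation $[\mu; \sigma] \in \mathbb{R}^{1536}$.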

📊 2. Dataset: The Master Protocol (v2)

We trained on an aggregated corpus of 116,390 samples drawn from five sources to encourage cross-generator generalization.

Data Composition

| Source Category | Sample Count | Weight | Description |
|---|---|---|---|
| FOR (Fake-or-Real) | 50,890 | 43.7% | Original, Norm, 2sec, and Rerecorded variants. |
| WaveFake (WF) | 35,500 | 30.5% | JSUT & LJSpeech official (HiFiGAN, MelGAN, etc). |
| In-the-Wild | 12,000 | 10.3% | Real-world scraped deepfakes and authentic audio. |
| ASVspoof 2019 | 8,000 | 6.9% | Academic benchmark for logical access attacks. |
| LJ_Real | 10,000 | 8.6% | High-fidelity authentic reference. |
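
The Weight column is each source's share of the 116,390-sample total. A minimal Python check of that arithmetic, using only the counts listed above (the variable names are illustrative):

```python
# Sample counts copied from the Data Composition table.
counts = {
    "FOR (Fake-or-Real)": 50_890,
    "WaveFake (WF)": 35_500,
    "In-the-Wild": 12_000,
    "ASVspoof 2019": 8_000,
    "LJ_Real": 10_000,
}

total = sum(counts.values())

# Share of each source as a percentage, rounded to one decimal place.
weights = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(f"total = {total:,}")
for name, pct in weights.items():
    print(f"{name:<20} {counts[name]:>7,}  {pct:>5.1f}%")
```

The counts sum to 116,390 and the rounded shares match the table.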

🧪 3. Forensic Training Regime

Data Augmentation (Asymmetric Strategy)

To bridge the gap between "Lab" and "Wild", we implemented:

Cyclic Tiling: Audio clips shorter than 4 s are tiled (repeated cyclically) to maintain the temporal receptive field without energy loss.
Asymmetric Toughening: Heavy MP3 compression and room simulation were applied to clean sources to simulate real-world telephonic degradation.
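
Cyclic tiling can be sketched in a few lines of plain Python. This is an illustration, not the training code: the helper name `cyclic_tile`, the choice to cut the last repetition at exactly the target length, and the 64,000-sample target (4 s at the 16 kHz rate used in the inference code) are assumptions.

```python
def cyclic_tile(samples, target_len):
    """Repeat `samples` end to end until the result is exactly `target_len` long.

    Clips that are already long enough are returned unchanged (no cropping here).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("cannot tile an empty clip")
    if n >= target_len:
        return list(samples)
    repeats = -(-target_len // n)  # ceiling division
    return (list(samples) * repeats)[:target_len]


SAMPLE_RATE = 16_000          # matches sr=16000 in the inference code
TARGET_LEN = 4 * SAMPLE_RATE  # ASSUMPTION: 4 s threshold used as the target length

clip = [0.1, -0.2, 0.3]       # toy 3-sample "clip"
tiled = cyclic_tile(clip, 8)
print(tiled)  # [0.1, -0.2, 0.3, 0.1, -0.2, 0.3, 0.1, -0.2]
```

Unlike zero-padding, every sample in the tiled clip carries signal, which is one reading of "without energy loss" above.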

Hyperparameters (RTX 5080 Optimized)

| Parameter | Value | Note |
|---|---|---|
| Batch Size | 32 | Balanced for gradient stability. |
| Encoder LR | 4e-6 | LLRD (Layer-wise Learning Rate Decay). |
| Top-Layer LR | 4e-5 | High-velocity head training (Pooling & MLP). |
| Optimizer | AdamW | Weight decay 0.01. |
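
The learning-rate layout can be sketched as follows. The two learning rates and the weight decay come from the table; the decay factor of 0.9, the choice to assign 4e-6 to the topmost encoder layer, and the helper name `llrd_learning_rates` are assumptions made for illustration, since the card does not state them.

```python
def llrd_learning_rates(top_encoder_lr, num_layers, decay):
    """Return one learning rate per encoder layer, index 0 = lowest layer.

    The top layer gets `top_encoder_lr`; each layer below it is scaled by `decay`.
    """
    return [top_encoder_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


ENCODER_LR = 4e-6    # from the table
HEAD_LR = 4e-5       # from the table (pooling + MLP)
WEIGHT_DECAY = 0.01  # from the table
NUM_LAYERS = 12      # transformer layers in wav2vec2-base
DECAY = 0.9          # ASSUMPTION: decay factor is not stated in this card

layer_lrs = llrd_learning_rates(ENCODER_LR, NUM_LAYERS, DECAY)

# Per-group settings; "name" is only a label for printing. To use these with
# torch.optim.AdamW, each dict would also need a "params" entry holding the
# parameters of that layer.
groups = [
    {"name": f"encoder.layers.{i}", "lr": lr, "weight_decay": WEIGHT_DECAY}
    for i, lr in enumerate(layer_lrs)
]
groups.append({"name": "pooling+classifier", "lr": HEAD_LR, "weight_decay": WEIGHT_DECAY})

for g in groups:
    print(f"{g['name']:<20} lr={g['lr']:.2e}")
```

Lower layers, which hold more generic acoustic features, receive smaller updates, while the freshly initialized head trains at a rate ten times that of the top encoder layer.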

📈 4. Performance & Validation

A. Internal Benchmark (Master Protocol)

Best Dev Accuracy: 99.85%
Best Dev EER: 0.0013
In-the-Wild Test Accuracy: 99.56% (on 31,779 samples).
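
For readers unfamiliar with EER, here is a minimal, dependency-free sketch of how it can be computed from classifier scores. It is not the evaluation script behind the numbers above; the function name, the threshold sweep and the toy data are illustrative. EER is reported here as a fraction, so 0.0013 corresponds to 0.13%.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for binary scores (higher score = more likely positive).

    labels: 1 for positive, 0 for negative. Returns the mean of the false
    acceptance and false rejection rates at the threshold where they are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for thr in sorted(set(scores)) + [max(scores) + 1.0]:
        far = sum(s >= thr for s in neg) / len(neg)  # negatives scored as positive
        frr = sum(s < thr for s in pos) / len(pos)   # positives scored as negative
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data (hypothetical): scores are P(fake); label 1 marks a fake clip.
scores = [0.95, 0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(f"EER = {equal_error_rate(scores, labels):.4f}")
```

The sweep is quadratic in the number of samples, which is fine for a sketch but would be replaced by a sorted cumulative count on a test set of tens of thousands of clips.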

B. Cross-Dataset Stress Test (Generalization)

Tested on data unseen during training to probe generalization under distribution shift:

Unseen FoR-Norm: 94.28% Accuracy
Unseen FoR-Rerecorded: 91.05% Accuracy

Forensic Note: Accuracy above 90% on both unseen sets suggests that the MSAP head learns intrinsic synthesis fingerprints rather than over-fitting to dataset-specific biases, although accuracy is roughly 5 to 9 points below the internal benchmark.

💻 Quick Start (Inference)

Dependencies: pip install torch transformers librosa

import torch
import torch.nn as nn
import librosa
from transformers import Wav2Vec2PreTrainedModel, Wav2Vec2Model, AutoProcessor

# ==============================================================================
# 1. Architecture Definition (Required for custom MSAP head)
# ==============================================================================
class MultiScaleAttentivePooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, 128), 
            nn.Tanh(), 
            nn.Linear(128, 1)
        )
    def forward(self, x):
        # x shape: (batch, seq_len, hidden_size)
        w = torch.softmax(self.attention(x), dim=1)
        mu = torch.sum(w * x, dim=1)
        # Robust variance calculation
        delta = x - mu.unsqueeze(1)
        var = torch.sum(w * (delta**2), dim=1)
        std = torch.sqrt(torch.clamp(var, min=1e-9))
        return torch.cat([mu, std], dim=-1) # (batch, hidden_size * 2)

class SentinelForSyntheticDetection(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.pooling = MultiScaleAttentivePooling(config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, 512), 
            nn.ReLU(), 
            nn.BatchNorm1d(512), 
            nn.Dropout(0.3), 
            nn.Linear(512, 256), 
            nn.ReLU(), 
            nn.Linear(256, 2)
        )
        self.post_init()

    def forward(self, input_values, attention_mask=None):
        outputs = self.wav2vec2(input_values, attention_mask=attention_mask)
        # Using last_hidden_state (Batch, Seq_Len, Hidden_Size)
        pooled_output = self.pooling(outputs.last_hidden_state)
        return self.classifier(pooled_output)

# ==============================================================================
# 2. Inference Logic (Forensic Grade)
# ==============================================================================
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "JesseHuang922/VoxSentinel-Base"

# Load model and processor from Hugging Face
model = SentinelForSyntheticDetection.from_pretrained(repo).to(device).eval()
processor = AutoProcessor.from_pretrained(repo)

def predict(audio_path):
    """
    Forensic prediction with automatic channel mixing and resampling.
    """
    # 1. Load with mono=True: Automatically mixes stereo to mono (L+R)/2
    # 2. sr=16000: Forces resampling to the model's native frequency
    try:
        wav, _ = librosa.load(audio_path, sr=16000, mono=True)
        
        # Trim leading and trailing silence (optional; an energy threshold, not full VAD)
        wav, _ = librosa.effects.trim(wav, top_db=20)

        # Preprocess features
        inputs = processor(wav, return_tensors="pt", sampling_rate=16000)
        input_values = inputs.input_values.to(device)

        with torch.no_grad():
            logits = model(input_values)
            # Softmax to get confidence scores
            probs = torch.softmax(logits, dim=-1)
            pred_idx = torch.argmax(probs, dim=-1).item()
            confidence = probs[0][pred_idx].item()

        # Label Mapping: 1 -> Real, 0 -> Fake
        result = "Real" if pred_idx == 1 else "Fake"
        return f"Result: {result} ({confidence:.2%} confidence)"

    except Exception as e:
        return f"Error processing {audio_path}: {e}"

# Usage Example:
# print(predict("evidence_stereo_file.mp3"))

🤝 Acknowledgements

This research and model development would not be possible without the open-source contributions of the following organizations and researchers. We express our gratitude for their high-quality datasets:

Fake-or-Real (FoR) Dataset: Provided by Mohammed Abdel-Dayem et al., offering a critical multi-scenario benchmark for synthetic speech detection.
WaveFake Dataset: A massive cross-generator corpus by Frank et al., which was instrumental in training our model to generalize across multiple TTS architectures.
ASVspoof 2019: We thank the ASVspoof Consortium for their foundational work in standardizing the evaluation of spoofing countermeasures.
In-the-Wild Dataset: Gratitude to the researchers of the In-the-Wild project for providing real-world deepfake samples that bridge the gap between laboratory and reality.
Special Thanks: To the creators of LJSpeech and JSUT official corpora for providing the high-fidelity authentic references used in our WF-Enhanced protocol.

🛠️ Project Pipeline

To reproduce the results, run the notebooks in the notebooks folder in this order:

01_Protocol_Gen.ipynb: Generates the balanced 116k protocol.
02_Training_Wav2Vec2_base.ipynb: Handles the training with Wav2Vec 2.0 + MSAP.
03_HuggingFace_Packing.ipynb: Converts the .pth checkpoint to Hugging Face standard.
