---
language: en
license: apache-2.0
library_name: transformers
tags:
- audio-classification
- deepfake-detection
- wav2vec2
- finquest-2026
datasets:
- JesseHuang922/VoxSentinel-Base-Dataset
metrics:
- accuracy
---

🛡️ Sentinel: VoxSentinel-Base

GitHub Repo: https://github.com/JesseLau24/Finquest_VoxSentinel_Base

Sentinel is an industrial-grade forensic model designed to detect AI-synthesized speech. Developed for FinQuest 2026, it captures subtle neural vocoder artifacts that human ears often miss.

✨ Key Innovation: MSAP Head

Unlike standard pooling, Sentinel uses Multi-Scale Attentive Pooling (MSAP). By fusing Weighted Mean and Weighted Standard Deviation, it extracts a 1536-dimensional "acoustic fingerprint" to identify the rigid textures of voice cloning.
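
As a minimal sketch of the idea (not the released implementation), the NumPy snippet below pools a sequence of 768-dimensional frame embeddings into an attention-weighted mean and standard deviation. The function name `attentive_stats_pool` and the random input are illustrative assumptions; in the actual model the attention scores come from a small learned network rather than being passed in directly.

```python
import numpy as np

def attentive_stats_pool(x, scores):
    """Pool frames into [weighted mean, weighted std].

    x:      (seq_len, hidden) frame embeddings
    scores: (seq_len,) unnormalized attention scores (hypothetical input;
            the real model computes these with a learned layer)
    """
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                  # softmax over time
    mu = (w[:, None] * x).sum(axis=0)                # weighted mean
    var = (w[:, None] * (x - mu) ** 2).sum(axis=0)   # weighted variance
    sigma = np.sqrt(np.clip(var, 1e-9, None))        # clamp before sqrt
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 768))                  # 50 frames, hidden size 768
pooled = attentive_stats_pool(frames, np.zeros(50))  # uniform attention
print(pooled.shape)  # (1536,)
```

With uniform attention scores the result reduces to the ordinary mean and (population) standard deviation over time, which is a convenient sanity check.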

📊 Performance (In-The-Wild Test)

  • Accuracy: 99.56%
  • EER: 0.0007
  • F1-Score: 0.99 (Fake) / 1.00 (Real)

🚀 1. Core Architecture: Sentinel-Base (v1.0)

The system transitions from traditional CNN-based detection to a Self-Supervised Transformer backbone with a custom forensic head.

  • Backbone: facebook/wav2vec2-base (Fine-tuned via Layer-wise Learning Rate Decay).
  • Feature Head: Multi-Scale Attentive Pooling (MSAP).
    • Mechanism: Instead of a simple mean, MSAP calculates Attention-weighted Mean ($\mu$) and Attention-weighted Standard Deviation ($\sigma$).
    • Dimension: $768 (\mu) + 768 (\sigma) = 1536$ total features.
    • Logic: Captures both static spectral artifacts and dynamic temporal "texture" inconsistencies (e.g., unnatural smoothness in neural vocoders).
  • Classifier: 3-Layer MLP ($1536 \rightarrow 512 \rightarrow 256 \rightarrow 2$) with BatchNorm and Dropout (0.3).
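
Written out, and following the pooling code in the Quick Start section, the MSAP statistics can be expressed as below. The symbols $h_t$ (frame embedding at time step $t$), $e_t$ (scalar attention score for that frame), and $T$ (number of frames) are notation introduced here for illustration; the square and square root are taken element-wise.

$$
w_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad
\mu = \sum_{t=1}^{T} w_t \, h_t, \qquad
\sigma = \sqrt{\sum_{t=1}^{T} w_t \, (h_t - \mu)^2}
$$

Since $h_t \in \mathbb{R}^{768}$, both $\mu$ and $\sigma$ are 768-dimensional, and their concatenation $[\mu; \sigma]$ gives the 1536 features fed to the classifier.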

📊 2. Dataset: The Master Protocol (v2)

We used an aggregated corpus of 116,390 samples to encourage cross-generator generalization.

Data Composition

| Source Category | Sample Count | Weight | Description |
|---|---|---|---|
| FOR (Fake-or-Real) | 50,890 | 43.7% | Original, Norm, 2sec, and Rerecorded variants. |
| WaveFake (WF) | 35,500 | 30.5% | JSUT & LJSpeech official (HiFiGAN, MelGAN, etc). |
| In-the-Wild | 12,000 | 10.3% | Real-world scraped deepfakes and authentic audio. |
| ASVspoof 2019 | 8,000 | 6.9% | Academic benchmark for logical access attacks. |
| LJ_Real | 10,000 | 8.6% | High-fidelity authentic reference. |
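
The weights follow directly from the sample counts. The short check below recomputes them; the dictionary keys are just labels chosen here for readability.

```python
# Sample counts per source, as listed in the data composition table
counts = {
    "FOR": 50890,
    "WaveFake": 35500,
    "In-the-Wild": 12000,
    "ASVspoof 2019": 8000,
    "LJ_Real": 10000,
}

total = sum(counts.values())
# Share of each source in percent, rounded to one decimal place
weights = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
for name, pct in weights.items():
    print(f"{name}: {pct}%")
```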

🧪 3. Forensic Training Regime

Data Augmentation (Asymmetric Strategy)

To bridge the gap between "Lab" and "Wild", we implemented:

  • Cyclic Tiling: Clips shorter than 4 s are tiled (repeated end to end) to maintain the temporal receptive field without energy loss.
  • Asymmetric Toughening: Heavy MP3 compression and Room simulation were applied to clean sources to simulate real-world telephonic degradation.
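
Cyclic tiling can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the training code: the helper name `cyclic_tile` is hypothetical, the 16 kHz sample rate is taken from the inference example, and leaving longer clips untouched is a choice made here because the handling of long clips is not described.

```python
import numpy as np

SAMPLE_RATE = 16_000   # matches the resampling rate used at inference
TARGET_SECONDS = 4     # clips shorter than this are tiled

def cyclic_tile(wav, target_len):
    """Repeat a 1-D waveform end to end until it has target_len samples."""
    if len(wav) == 0:
        raise ValueError("cannot tile an empty waveform")
    if len(wav) >= target_len:
        return wav                                   # assumption: long clips pass through
    reps = int(np.ceil(target_len / len(wav)))       # enough copies to cover the target
    return np.tile(wav, reps)[:target_len]           # cut the last copy to fit

short_clip = np.random.default_rng(0).normal(size=SAMPLE_RATE)  # 1 s of noise
tiled = cyclic_tile(short_clip, SAMPLE_RATE * TARGET_SECONDS)
print(tiled.shape)  # (64000,)
```

Unlike zero-padding, every sample in the tiled clip carries signal, so the average energy of the input stays close to that of the original clip.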

Hyperparameters (RTX 5080 Optimized)

| Parameter | Value | Note |
|---|---|---|
| Batch Size | 32 | Balanced for gradient stability. |
| Encoder LR | 4e-6 | LLRD (Layer-wise Learning Rate Decay). |
| Top-Layer LR | 4e-5 | High-velocity head training (Pooling & MLP). |
| Optimizer | AdamW | Weight decay 0.01. |
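
To make the layer-wise schedule concrete, the sketch below computes one learning rate per encoder layer. Several details are assumptions made for illustration only: the decay factor of 0.9 is hypothetical (the actual value is not given above), the listed encoder LR of 4e-6 is assumed to apply to the top encoder layer, and the helper name `layerwise_lrs` is invented here. The 12 transformer layers correspond to the wav2vec2-base configuration.

```python
def layerwise_lrs(top_lr, num_layers, decay):
    """Learning rate per encoder layer; index 0 is the lowest layer.

    The top layer gets top_lr and each layer below it is scaled by `decay`.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

HEAD_LR = 4e-5       # pooling + MLP head (from the table)
ENCODER_LR = 4e-6    # assumed to be the LR of the top encoder layer
DECAY = 0.9          # hypothetical decay factor, not stated in this card

encoder_lrs = layerwise_lrs(ENCODER_LR, num_layers=12, decay=DECAY)
for i, lr in enumerate(encoder_lrs):
    print(f"encoder layer {i:2d}: lr = {lr:.2e}")
print(f"head: lr = {HEAD_LR:.2e}")
```

Each rate can then be attached to its layer's parameters as a separate parameter group (a dict with `params` and `lr`) when constructing `torch.optim.AdamW` with `weight_decay=0.01`.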

📈 4. Performance & Validation

A. Internal Benchmark (Master Protocol)

  • Best Dev Accuracy: 99.85%
  • Best Dev EER: 0.0013
  • In-the-Wild Test Accuracy: 99.56% (on 31,779 samples).
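
For readers unfamiliar with EER, the snippet below shows one simple way to approximate it from scores and labels: sweep a threshold and take the point where the false positive rate and false negative rate are closest. This is a generic sketch and not necessarily the evaluation script used for the numbers above; the function name and the convention that higher scores indicate the positive class are assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a decision threshold.

    scores: higher means 'more likely positive' (assumed convention)
    labels: 1 for the positive class, 0 for the negative class
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is predicted positive when score >= threshold
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    fnr = np.array([(pos < t).mean() for t in thresholds])

    idx = np.argmin(np.abs(fpr - fnr))     # closest crossing point
    return float((fpr[idx] + fnr[idx]) / 2)

# Perfectly separable toy scores give an EER of zero
print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # 0.0
```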

B. Cross-Dataset Stress Test (Generalization)

Tested against completely unseen datasets to simulate adversarial scenarios:

  • Unseen FoR-Norm: 94.28% Accuracy
  • Unseen FoR-Rerecorded: 91.05% Accuracy

Forensic Note: Accuracy remains above 90% on both unseen sets, which suggests that the MSAP head learns intrinsic synthesis fingerprints rather than overfitting to dataset-specific biases.


💻 Quick Start (Inference)

Dependencies: pip install torch transformers librosa

import torch
import torch.nn as nn
import librosa
from transformers import Wav2Vec2PreTrainedModel, Wav2Vec2Model, AutoProcessor

# ==============================================================================
# 1. Architecture Definition (Required for custom MSAP head)
# ==============================================================================
class MultiScaleAttentivePooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, 128), 
            nn.Tanh(), 
            nn.Linear(128, 1)
        )
    def forward(self, x):
        # x shape: (batch, seq_len, hidden_size)
        w = torch.softmax(self.attention(x), dim=1)
        mu = torch.sum(w * x, dim=1)
        # Robust variance calculation
        delta = x - mu.unsqueeze(1)
        var = torch.sum(w * (delta**2), dim=1)
        std = torch.sqrt(torch.clamp(var, min=1e-9))
        return torch.cat([mu, std], dim=-1) # (batch, hidden_size * 2)

class SentinelForSyntheticDetection(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.pooling = MultiScaleAttentivePooling(config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, 512), 
            nn.ReLU(), 
            nn.BatchNorm1d(512), 
            nn.Dropout(0.3), 
            nn.Linear(512, 256), 
            nn.ReLU(), 
            nn.Linear(256, 2)
        )
        self.post_init()

    def forward(self, input_values, attention_mask=None):
        outputs = self.wav2vec2(input_values, attention_mask=attention_mask)
        # Using last_hidden_state (Batch, Seq_Len, Hidden_Size)
        pooled_output = self.pooling(outputs.last_hidden_state)
        return self.classifier(pooled_output)

# ==============================================================================
# 2. Inference Logic (Forensic Grade)
# ==============================================================================
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "JesseHuang922/VoxSentinel-Base"

# Load model and processor from Hugging Face
model = SentinelForSyntheticDetection.from_pretrained(repo).to(device).eval()
processor = AutoProcessor.from_pretrained(repo)

def predict(audio_path):
    """
    Forensic prediction with automatic channel mixing and resampling.
    """
    # 1. Load with mono=True: Automatically mixes stereo to mono (L+R)/2
    # 2. sr=16000: Forces resampling to the model's native frequency
    try:
        wav, _ = librosa.load(audio_path, sr=16000, mono=True)
        
        # Simple Voice Activity Detection (Optional: trims silence)
        wav, _ = librosa.effects.trim(wav, top_db=20)

        # Preprocess features
        inputs = processor(wav, return_tensors="pt", sampling_rate=16000)
        input_values = inputs.input_values.to(device)

        with torch.no_grad():
            logits = model(input_values)
            # Softmax to get confidence scores
            probs = torch.softmax(logits, dim=-1)
            pred_idx = torch.argmax(probs, dim=-1).item()
            confidence = probs[0][pred_idx].item()

        # Label Mapping: 1 -> Real, 0 -> Fake
        result = "Real" if pred_idx == 1 else "Fake"
        return f"Result: {result} ({confidence:.2%} confidence)"

    except Exception as e:
        return f"Error processing {audio_path}: {e}"

# Usage Example:
# print(predict("evidence_stereo_file.mp3"))

🀝 Acknowledgements

This research and model development would not be possible without the open-source contributions of the following organizations and researchers. We express our gratitude for their high-quality datasets:

  • Fake-or-Real (FoR) Dataset: Provided by Mohammed Abdel-Dayem et al., offering a critical multi-scenario benchmark for synthetic speech detection.
  • WaveFake Dataset: A massive cross-generator corpus by Frank et al., which was instrumental in training our model to generalize across multiple TTS architectures.
  • ASVspoof 2019: We thank the ASVspoof Consortium for their foundational work in standardizing the evaluation of spoofing countermeasures.
  • In-the-Wild Dataset: Gratitude to the researchers of the In-the-Wild project for providing real-world deepfake samples that bridge the gap between laboratory and reality.
  • Special Thanks: To the creators of LJSpeech and JSUT official corpora for providing the high-fidelity authentic references used in our WF-Enhanced protocol.

🛠️ Project Pipeline

To reproduce the results, run the notebooks in the notebooks folder in this order:

  1. 01_Protocol_Gen.ipynb: Generates the balanced 116k protocol.
  2. 02_Training_Wav2Vec2_base.ipynb: Handles the training with Wav2Vec 2.0 + MSAP.
  3. 03_HuggingFace_Packing.ipynb: Converts the .pth checkpoint to Hugging Face standard.

© 2026 FinQuest AI Defense Taskforce. Powered by Wav2Vec 2.0.
