sign-language-bridge: Qwen3-VL-2B fine-tuned for ASL to English translation
LoRA / RSLoRA fine-tune of Qwen/Qwen3-VL-2B-Instruct for continuous American Sign Language (ASL) to English translation.
This checkpoint corresponds to global optimizer step 4,610 (selected on validation loss). Source code, full technical report, and training pipeline: github.com/mamounyosef/sign-language-bridge.
Test-set results (How2Sign, 944 clips)
| Metric | Value |
|---|---|
| Test loss | 2.7896 |
| Perplexity | 16.28 |
| BLEU-1 | 19.76 |
| BLEU-2 | 6.95 |
| BLEU-4 | 1.64 |
| chrF | 17.42 |
| ROUGE-L | 10.43 |
| METEOR | 9.71 |
| WER (%) | 112.51 |
| Distinct-2 | 0.103 |
Numbers are reported on a custom 90/5/5 stratified split, not the official How2Sign / OpenASL splits, and are therefore not directly comparable to published results on those corpora. See the GitHub repo and the technical report for the full evaluation protocol and the data-cleaning passes that drove the custom split.
The model produces fluent English in the register of the target captions and often captures the meaning of the signed input, but the word-level overlap with the references is modest.
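Two rows of the table above are easier to read with a small worked example. The sketch below assumes the test loss is a mean per-token negative log-likelihood in nats and that WER is the usual word-level edit distance divided by the reference length; the card does not spell out either definition, and the function names and the sentence pair are made up for illustration. Exponentiating 2.7896 gives roughly 16.27, close to the reported 16.28 (the small gap presumably comes from how the loss was averaged, which is not specified here). The WER example shows why a value above 100% is possible: a hypothesis much longer than its reference accumulates more insertions than the reference has words.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity as the exponential of the mean per-token NLL (in nats)."""
    return math.exp(mean_nll)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(round(perplexity(2.7896), 2))
    # Made-up pair (not from the test set): long hypothesis, short reference.
    print(wer("mix the flour", "now you want to mix all of it"))
```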
Repository contents
adapter/
    adapter_config.json          PEFT / LoRA configuration
    adapter_model.safetensors    LoRA weights + saved embedding & output-head modules
    README.md                    PEFT auto-generated card
training_state.pt                optimizer + scheduler states (per tier),
                                 InfoNCE projection-head weights,
                                 InfoNCE MoCo queues, RNG snapshots,
                                 phase / step / epoch bookkeeping
training_state.pt is required only for resuming training or for reusing
the InfoNCE alignment. It is not needed for inference; loading the
adapter/ folder on top of the base model is sufficient to generate.
How to use (inference)
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

REPO_ID = "mamounyosef/sign-language-bridge"
BASE = "Qwen/Qwen3-VL-2B-Instruct"

processor = AutoProcessor.from_pretrained(BASE)
base = AutoModelForImageTextToText.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto",
)
# Load the LoRA adapter from the `adapter/` subfolder on top of the base model.
model = PeftModel.from_pretrained(base, REPO_ID, subfolder="adapter")
model.eval()

# `video` should be a tensor / list of frames preprocessed by `processor`.
# For best results, replicate the training-time preprocessing:
#   1) pose-guided signer crop (MediaPipe pose bbox)
#   2) CLAHE on L-channel in LAB (clip limit 2.0, 8x8 tile grid)
#   3) MediaPipe landmark overlay (21 keypoints/hand + 6 upper-body joints)
# See https://github.com/mamounyosef/sign-language-bridge for the exact code.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video},
        {"type": "text", "text": "Translate the signed sentence to English."},
    ],
}]

# return_dict=True makes the processor return a dict of tensors that can be
# unpacked into generate(); without it only the token ids are returned.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        num_beams=5,
        length_penalty=0.6,
        no_repeat_ngram_size=4,
        repetition_penalty=1.1,
    )

# Drop the prompt tokens so that only the generated translation is decoded.
generated = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
Training summary
- Base model: Qwen/Qwen3-VL-2B-Instruct (2B parameters: 24-layer vision tower, 28-layer Qwen3 decoder, M-RoPE, DeepStack mergers at vision layers 5 / 11 / 17).
- Adaptation: multi-tier LoRA / RSLoRA, 34,321,920 trainable parameters (≈1.59% of the combined model).
  - T1 (LM attention + MLP): rank 16
  - T2 (Vision encoder): rank 32
  - T3 (Embeddings + output head): rank 8 (plus modules_to_save)
  - T4 (InfoNCE projection heads): full-rank, trained from scratch
- Auxiliary loss: symmetric InfoNCE between pooled vision and caption embeddings (256-dim, τ = 0.07, λ = 0.3 with 200-step linear warmup, MoCo-style negative queue of size 64).
- Schedule: OpenASL stage (2 epochs, 2,448 steps) → How2Sign stage (6 epochs, 3,540 steps). Within OpenASL, Phase 1 (first 20% of steps) trains only T2 and T4; Phase 2 unfreezes all four tiers. Per-tier cosine LR schedules with a 5% linear warmup.
- Preprocessing (always-on): pose-guided signer crop, CLAHE contrast enhancement, and pre-extracted MediaPipe landmark overlays (21 keypoints / hand + 6 upper-body joints).
- Compute: 1× NVIDIA A100 80GB, effective batch size 24 (per-device 6 × 4 gradient-accumulation steps), bfloat16, FlashAttention 2, gradient checkpointing, 8-bit AdamW, Liger fused Triton kernels. Total wall-clock ≈ 4d 18h.
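The auxiliary loss in the list above can be illustrated with a dependency-free sketch. It is not the repository's implementation: it only shows the general shape of a symmetric InfoNCE objective with temperature 0.07 and a weight that ramps linearly to 0.3 over 200 steps. The function names are hypothetical, the MoCo-style negative queue and the pooling / projection heads are left out (queued negatives would appear as extra columns in the similarity matrix), and treating λ as a multiplier on the auxiliary term added to the language-modelling loss is an assumption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_at(logits, index):
    """log-softmax of `logits` evaluated at `index` (max-subtraction for stability)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - lse


def symmetric_info_nce(vision, text, tau=0.07):
    """Average of vision->text and text->vision InfoNCE over paired embeddings.

    Row i of `vision` and row i of `text` are assumed to be a positive pair;
    all other rows in the batch act as negatives (no queue in this sketch).
    """
    v = [l2_normalize(x) for x in vision]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # sim[i][j]: cosine similarity of clip i and caption j, divided by tau.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / tau for j in range(n)]
           for i in range(n)]
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    t2v = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (v2t + t2v)


def aux_weight(step, lam=0.3, warmup_steps=200):
    """Linear ramp of the auxiliary-loss weight from 0 to `lam` (assumed form)."""
    return lam * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    # Assumed combination: total = lm_loss + aux_weight(step) * info_nce
    print(symmetric_info_nce(clips, captions), aux_weight(100))
```

With a low temperature such as 0.07, well-aligned pairs drive the loss close to zero while mismatched pairs are penalised heavily, which is why the weight is typically ramped in gradually rather than applied at full strength from the first step.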
For full details, see the technical report and source code in the GitHub repository.
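The step counts quoted in the training summary can be cross-checked with a little arithmetic, and the learning-rate shape can be sketched in a few lines. The constants below come from this card; everything else is an assumption made for illustration: the rounding of the 20% Phase 1 boundary, whether each cosine schedule spans one stage or the whole run, and the `cosine_with_warmup` helper itself (the card gives no peak learning rates, so the function returns a value relative to a caller-supplied `peak_lr`).

```python
import math

# Figures taken from the model card.
OPENASL_STEPS = 2448
HOW2SIGN_STEPS = 3540
CHECKPOINT_STEP = 4610
PER_DEVICE_BATCH = 6
GRAD_ACCUM_STEPS = 4

total_steps = OPENASL_STEPS + HOW2SIGN_STEPS             # 5988
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS    # 24
# The released checkpoint falls inside the How2Sign stage.
steps_into_how2sign = CHECKPOINT_STEP - OPENASL_STEPS    # 2162
# Phase 1 covers the first 20% of the OpenASL stage; flooring is an assumption.
phase1_steps = int(0.20 * OPENASL_STEPS)


def cosine_with_warmup(step, total, peak_lr, warmup_frac=0.05):
    """Linear warmup to `peak_lr`, then cosine decay to zero (illustrative)."""
    warmup = max(1, int(warmup_frac * total))
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(total_steps, effective_batch, steps_into_how2sign, phase1_steps)
    # Relative learning rate at a few points of a hypothetical 1000-step run.
    for s in (0, 25, 50, 525, 1000):
        print(s, round(cosine_with_warmup(s, 1000, peak_lr=1.0), 3))
```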
Generation defaults used for evaluation
| Parameter | Value |
|---|---|
| Beam size | 5 |
| Length penalty | 0.6 |
| No-repeat n-gram | 4 |
| Repetition penalty | 1.1 |
| Max new tokens | 32 |
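To give a feel for the length penalty in the table above: in beam search as implemented in transformers, a finished hypothesis is ranked by its summed token log-probability divided by its length raised to `length_penalty`. The sketch below reproduces only that ranking formula; the two hypotheses and their log-probabilities are invented, and `beam_score` is a hypothetical helper rather than a library function.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalised beam score: sum of log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


if __name__ == "__main__":
    # Invented hypotheses: (summed log-probability, number of generated tokens).
    short_hyp = (-4.0, 5)
    long_hyp = (-6.0, 12)
    for penalty in (0.0, 0.6, 1.0):
        s = beam_score(*short_hyp, penalty)
        l = beam_score(*long_hyp, penalty)
        winner = "long" if l > s else "short"
        print(f"length_penalty={penalty}: short={s:.3f} long={l:.3f} -> {winner}")
```

With no normalisation (penalty 0) the shorter hypothesis wins simply because it accumulates fewer negative log-probabilities; at 0.6 the normalisation is already strong enough for the longer one to be preferred in this toy case, without dividing by the full length as a penalty of 1.0 would.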
Datasets
- How2Sign: multi-view ASL corpus of instructional "How To" videos with manually verified English captions.
- OpenASL: large open-domain ASL corpus collected from online video.
Both datasets are subject to their own upstream terms of use. This repository does not redistribute the raw videos.
Limitations and intended use
- This is a research preview, not a production translation system. Word-level accuracy is low (BLEU-4 = 1.64, WER = 112.51% on the How2Sign test partition); outputs are fluent and often topically appropriate but frequently disagree with the reference at the word level.
- The model was trained on a custom data split, so reported numbers are not directly comparable to published How2Sign / OpenASL results.
- Outputs may be plausibly fluent but factually wrong with respect to the signed input. Do not use this model in any setting where a mistranslation could cause harm (medical, legal, safety-critical, emergency, etc.).
- The model has been trained almost exclusively on the signers, framings, and lighting conditions present in How2Sign and OpenASL, and may generalise poorly to out-of-distribution signing.
License and attribution
- This adapter is released under the Apache License 2.0.
- The base model Qwen/Qwen3-VL-2B-Instruct is also under Apache 2.0 (upstream LICENSE). Use of this adapter together with the base model remains subject to Qwen's Apache 2.0 terms.
- Built using 🤗 peft and 🤗 transformers.
Citation
If you use this model or its results, please cite the project repository:
@misc{yosef2026signbridge,
author = {Ma'moun Yosef},
title = {sign-language-bridge: Fine-Tuning Qwen3-VL-2B for ASL to
English Translation},
year = {2026},
howpublished = {\url{https://github.com/mamounyosef/sign-language-bridge}}
}
and the base model:
@article{qwen3vl2025,
author = {{Qwen Team}},
title = {{Qwen3-VL} Technical Report},
journal = {arXiv preprint arXiv:2511.21631},
year = {2025}
}