Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en

For internal purposes and for testing, we adapted a monolingual paraphrasing model from Sentence Transformers to German and English via knowledge distillation. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, no publicly available multilingual version of this model exists. In addition, it was trained on significantly more samples than its predecessor: 83.3 million instead of 24.6 million.
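As a minimal usage sketch, the snippet below compares an English sentence with a German one by the cosine similarity of their embeddings. The helper names (cosine_similarity, cross_lingual_score, load_encoder) are hypothetical and not part of the model or of any library. The only library calls are SentenceTransformer(...) and its encode method from the sentence-transformers package, which is assumed to be installed when load_encoder is called.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return float(dot / norm)


def cross_lingual_score(
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
    english: str,
    german: str,
) -> float:
    """Embed both sentences with the same encoder and compare them."""
    emb_en, emb_de = encode([english, german])
    return cosine_similarity(emb_en, emb_de)


def load_encoder(model_name: str = "PM-AI/paraphrase-distilroberta-base-v2_de-en"):
    """Return the model's encode method (downloads the model on first use)."""
    # Imported lazily so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode
```

With the package installed, cross_lingual_score(load_encoder(), "The weather is nice today.", "Das Wetter ist heute schön.") would return a score between -1 and 1, where higher values indicate closer meaning.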

Training

Download of datasets
Execution of knowledge distillation

Training Data

Datasets used, according to the official source:

AllNLI
sentence-compression
SimpleWiki
altlex
msmarco-triplets
quora_duplicates
coco_captions
flickr30k_captions
yahoo_answers_title_question
S2ORC_citation_pairs
stackexchange_duplicate_questions
wiki-atomic-edits

Training Execution

First, we downloaded several German-English parallel datasets via get_parallel_data_*.py.

These datasets are: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary

Then we ran knowledge distillation with make_multilingual_sys.py.
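The distillation objective from the referenced paper can be sketched in a few lines: the teacher embeds only the source sentence, and the student is trained so that its embeddings of both the source sentence and its translation match that teacher embedding under mean squared error. The code below is an illustration in plain Python, not the contents of make_multilingual_sys.py. The functions toy_teacher and toy_student are hypothetical stand-ins for the real models.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher: Callable[[str], Vector],
    student: Callable[[str], Vector],
    pairs: List[Tuple[str, str]],
) -> float:
    """Average loss over a batch of (source, translation) pairs.

    The teacher only sees the source sentence. The student is pulled toward
    that target for both the source sentence and its translation.
    """
    total = 0.0
    for source, translation in pairs:
        target = teacher(source)
        total += mse(student(source), target)
        total += mse(student(translation), target)
    return total / len(pairs)


# Hypothetical toy models with 2-dimensional embeddings, for illustration only.
def toy_teacher(sentence: str) -> Vector:
    return [1.0, 0.0]


def toy_student(sentence: str) -> Vector:
    return [0.5, 0.0]


loss = distillation_loss(toy_teacher, toy_student, [("Hello", "Hallo")])
```

Because the target depends only on the teacher, the student inherits the teacher's vector space, which is why English and German sentences with the same meaning end up close together.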

Parameterization of training

Script: make_multilingual_sys.py
Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
Batch Size: 64
Max Sequence Length: 256
Train Max Sentence Length: 600
Max Sentences Per Train File: 1000000
Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
Student Model: xlm-roberta-base
Loss Function: MSE Loss
Learning Rate: 2e-5
Epochs: 20
Evaluation Steps: 10000
Warmup Steps: 10000
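To illustrate how some of these parameters act, here is a sketch under stated assumptions. It assumes that Train Max Sentence Length is measured in characters and that pairs exceeding it are skipped rather than truncated, as in the upstream sentence-transformers example script. It also assumes that the learning rate rises linearly during warmup. DistillConfig, filter_pairs and warmup_lr are hypothetical names, not identifiers from the training script.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DistillConfig:
    # Values copied from the parameter list above.
    batch_size: int = 64
    max_seq_length: int = 256             # tokens per input sequence
    train_max_sentence_length: int = 600  # assumed: characters per sentence
    max_sentences_per_train_file: int = 1_000_000
    learning_rate: float = 2e-5
    warmup_steps: int = 10_000


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], cfg: DistillConfig
) -> List[Tuple[str, str]]:
    """Keep pairs whose sentences are short enough, up to the per-file cap."""
    kept: List[Tuple[str, str]] = []
    for source, translation in pairs:
        if len(kept) >= cfg.max_sentences_per_train_file:
            break
        if max(len(source), len(translation)) > cfg.train_max_sentence_length:
            continue  # skip overly long pairs instead of truncating them
        kept.append((source, translation))
    return kept


def warmup_lr(step: int, cfg: DistillConfig) -> float:
    """Learning rate during warmup, assuming a linear ramp from 0 to the peak.

    Any decay after warmup is not modeled here; the peak value is returned.
    """
    return cfg.learning_rate * min(1.0, step / cfg.warmup_steps)
```

Under these assumptions, the learning rate reaches half of its peak value at step 5000 and the full 2e-5 at step 10000.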

Acknowledgment

This work is a collaboration between Technical University of Applied Sciences Wildau (TH Wildau) and sense.ai.tion GmbH. You can contact us via:

Philipp Müller (M.Eng.); Author
Prof. Dr. Janett Mohnke; TH Wildau
Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project/Vorhaben: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".

Model size: 0.3B params (Safetensors; tensor types: I64, F32)

Paper

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (arXiv:2004.09813, published Apr 21, 2020)