How to use PM-AI/paraphrase-distilroberta-base-v2_de-en with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PM-AI/paraphrase-distilroberta-base-v2_de-en")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

How to use PM-AI/paraphrase-distilroberta-base-v2_de-en with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("feature-extraction", model="PM-AI/paraphrase-distilroberta-base-v2_de-en")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("PM-AI/paraphrase-distilroberta-base-v2_de-en")
model = AutoModel.from_pretrained("PM-AI/paraphrase-distilroberta-base-v2_de-en")
```
Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en
For internal purposes and for testing, we made a monolingual paraphrasing model from Sentence Transformers usable for German and English via knowledge distillation. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, it has no publicly available multilingual version. In addition, it was trained on significantly more samples than its predecessor: 83.3 million instead of 24.6 million.
Training
- Download of datasets
- Execution of knowledge distillation
Training Data
Datasets used, according to the official source:
- AllNLI
- sentence-compression
- SimpleWiki
- altlex
- msmarco-triplets
- quora_duplicates
- coco_captions
- flickr30k_captions
- yahoo_answers_title_question
- S2ORC_citation_pairs
- stackexchange_duplicate_questions
- wiki-atomic-edits
Training Execution
First, we downloaded several German-English parallel datasets via get_parallel_data_*.py.
These datasets are: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, and News-Commentary.
Then we started knowledge distillation with make_multilingual_sys.py.
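The distillation objective can be illustrated with a toy example. The sketch below assumes the usual multilingual distillation setup from the sentence-transformers examples, which make_multilingual_sys.py is presumed (not confirmed here) to follow: the teacher embeds the English sentence, and the student is trained so that its embeddings of both the English sentence and its German translation match that teacher vector under MSE loss. The vectors are made-up stand-ins for real model outputs, and summing the two terms is a simplification for illustration.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[float],
    student_en: Sequence[float],
    student_de: Sequence[float],
) -> float:
    """Loss for one parallel sentence pair.

    Both student embeddings are pulled toward the teacher's embedding
    of the English sentence (assumed setup, see the text above).
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_de)


# Hypothetical 4-dimensional embeddings; real sentence embeddings
# are much higher-dimensional.
teacher_en = [0.5, -1.0, 0.0, 2.0]
student_en = [0.5, -1.0, 0.0, 2.0]  # already matches the teacher
student_de = [0.0, -1.0, 1.0, 2.0]  # German translation, still off

loss = distillation_loss(teacher_en, student_en, student_de)
print(loss)  # 0.3125
```

Only the German term contributes here, which mirrors the intent of the training: the student keeps the teacher's English vector space and learns to map German sentences into it.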
Parameterization of training
- Script: make_multilingual_sys.py
- Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
- GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
- Batch Size: 64
- Max Sequence Length: 256
- Train Max Sentence Length: 600
- Max Sentences Per Train File: 1000000
- Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
- Student Model: xlm-roberta-base
- Loss Function: MSE Loss
- Learning Rate: 2e-5
- Epochs: 20
- Evaluation Steps: 10000
- Warmup Steps: 10000
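As a rough guide, the parameters above could be wired together as follows. This is a minimal sketch modeled on the public make_multilingual.py example from sentence-transformers and its legacy `fit` API; the actual make_multilingual_sys.py may differ. The `DistillationConfig` class, the `train` function, the mean-pooling student head, and the `train_files` argument (paths to tab-separated parallel sentence files) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DistillationConfig:
    """Hypothetical container for the parameters listed above."""
    teacher_model: str = "sentence-transformers/paraphrase-distilroberta-base-v2"
    student_model: str = "xlm-roberta-base"
    batch_size: int = 64
    max_seq_length: int = 256
    train_max_sentence_length: int = 600
    max_sentences_per_file: int = 1_000_000
    learning_rate: float = 2e-5
    epochs: int = 20
    evaluation_steps: int = 10_000
    warmup_steps: int = 10_000


def train(config: DistillationConfig, train_files: List[str]):
    """Distill the teacher into a multilingual student (sketch only)."""
    # Imported lazily so the config can be inspected without the libraries.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, losses, models
    from sentence_transformers.datasets import ParallelSentencesDataset

    teacher = SentenceTransformer(config.teacher_model)

    # Student: multilingual transformer plus mean pooling (assumed).
    word_embedding = models.Transformer(
        config.student_model, max_seq_length=config.max_seq_length
    )
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
    student = SentenceTransformer(modules=[word_embedding, pooling])

    # Parallel sentences are labeled with the teacher's embeddings.
    train_data = ParallelSentencesDataset(
        student_model=student,
        teacher_model=teacher,
        batch_size=config.batch_size,
        use_embedding_cache=True,
    )
    for path in train_files:  # hypothetical file paths
        train_data.load_data(
            path,
            max_sentences=config.max_sentences_per_file,
            max_sentence_length=config.train_max_sentence_length,
        )

    loader = DataLoader(train_data, shuffle=True, batch_size=config.batch_size)
    loss = losses.MSELoss(model=student)

    student.fit(
        train_objectives=[(loader, loss)],
        epochs=config.epochs,
        warmup_steps=config.warmup_steps,
        evaluation_steps=config.evaluation_steps,
        optimizer_params={"lr": config.learning_rate},
    )
    return student


config = DistillationConfig()
print(config)
```

Calling `train(config, train_files)` would download both models and run the full distillation, so it is left out of the sketch; no evaluator is configured, which means `evaluation_steps` has no effect until one is passed to `fit`.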
Acknowledgment
This work is a collaboration between the Technical University of Applied Sciences Wildau (TH Wildau) and sense.AI.tion GmbH. You can contact us via:
- Philipp Müller (M.Eng.); Author
- Prof. Dr. Janett Mohnke; TH Wildau
- Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH
This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".

