| --- |
| language: |
| - en |
| - ko |
| - zh |
| - ja |
| - es |
| - fr |
| - ru |
| - hi |
| metrics: |
| - accuracy |
| base_model: |
| - distilbert/distilbert-base-multilingual-cased |
| pipeline_tag: text-classification |
| --- |
| |
| --- |
|
|
| # BERTopic Model for Serverless Inference |
|
|
| A BERTopic model for multilingual topic modeling of tourism feedback. The model is serialized in **safetensors** format, which keeps loading fast, and is intended for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."** |
|
|
| ## Overview |
|
|
| This model uses BERTopic to extract topics from a multilingual dataset of tourist reviews. It supports eight languages, so feedback from different visitor groups can be analyzed with a single model. It is packaged for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service. |
|
|
| > **Thesis Context:** |
| > As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses the language barriers and inefficiencies of traditional survey methods and supports data-driven decision-making in tourism management. |
|
|
| ## Key Features |
|
|
| - **Multilingual Support:** |
| Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog. |
| - **Pre-trained & Fine-tuned:** |
| Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity. |
| - **Optimized Serialization:** |
| Uses safetensors for faster and safer model loading. |
| - **Serverless Inference Ready:** |
| Tailored for deployment on serverless platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service. |
|
|
| ## Model Architecture & Details |
|
|
| - **Architecture:** BERTopic |
| - **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2` |
| - **Dimensionality Reduction:** UMAP |
| - **Clustering Algorithm:** HDBSCAN |
| - **Vectorizer:** CountVectorizer, with BERTopic's class-based TF-IDF (c-TF-IDF) weighting applied to the counts |
| - **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics |
|
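BERTopic derives each topic's keywords by weighting word counts with a class-based TF-IDF (c-TF-IDF): all documents in a topic are treated as one long document, and a word scores highly when it is frequent in that topic but rare across topics. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The whitespace tokenizer, the helper names `c_tf_idf` and `top_words`, and the example reviews are assumptions made for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic ID to the list of documents assigned to it.
    Tokenization is a plain lowercase whitespace split (an assumption;
    CountVectorizer applies its own token pattern).
    """
    # Term counts per topic: all documents of a topic act as one long document.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each term across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # Average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            term: (tf / n_words) * math.log(1 + avg_words / total[term])
            for term, tf in topic_counts.items()
        }
    return scores


def top_words(scores, topic, k=3):
    """Return the k highest-scoring words for a topic (ties broken A-Z)."""
    ranked = sorted(scores[topic].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical reviews, already grouped by topic.
example = {
    0: ["beach view hotel", "hotel beach service"],
    1: ["bus late night", "taxi night bus"],
}
scores = c_tf_idf(example)
print(top_words(scores, 0, k=2))  # ['beach', 'hotel']
```

In the real pipeline the grouping comes from HDBSCAN clusters over UMAP-reduced embeddings rather than from a hand-written dictionary.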
|
| ## Model Performance Metrics |
|
|
| - **Topic Coherence Score:** *XX.XX* (placeholder) |
| - **Diversity Score:** *XX.XX* (placeholder) |
| - **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system) |
|
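The card does not state how the diversity score is computed. One common definition is the share of unique words among the top-k words of all topics, where 1.0 means no topic shares a keyword with another. The sketch below assumes that definition; the function name `topic_diversity` and the sample word lists are illustrative and are not results from this model.

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words among the top_k words of every topic."""
    selected = [word for words in topic_words for word in words[:top_k]]
    if not selected:
        return 0.0
    return len(set(selected)) / len(selected)


# Hypothetical top words for three topics (not taken from the trained model).
sample_topics = [
    ["beach", "view", "hotel"],
    ["bus", "taxi", "night"],
    ["hotel", "staff", "service"],
]
diversity = topic_diversity(sample_topics, top_k=3)
print(round(diversity, 2))  # 0.89 (8 unique words out of 9)
```

With a loaded model, the word lists could be built from `model.get_topic(topic_id)`, which returns `(word, score)` pairs for a topic.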
|
| ## How to Use |
|
|
| ### Loading the Model |
|
|
| ```python |
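| from bertopic import BERTopic |
| |
| # A safetensors-serialized BERTopic model is a directory, not a single file, |
| # so pass the directory path (or a Hugging Face repo ID) to load(). |
| # The embedding model is passed explicitly so transform() can embed new text. |
| model = BERTopic.load( |
|     "path/to/model_dir", |
|     embedding_model="paraphrase-multilingual-MiniLM-L12-v2", |
| ) |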
| ``` |
|
|
| ### Performing Topic Modeling |
|
|
| ```python |
| # Sample documents for topic modeling |
| docs = [ |
| "The hotel had a great view of the beach and excellent service.", |
| "Transportation was a bit difficult to find late at night." |
| ] |
| |
| # Extract topics from the documents |
| topics, probs = model.transform(docs) |
| print("Topics:", topics) |
| print("Probabilities:", probs) |
| ``` |
|
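BERTopic labels documents that fit no topic with the ID `-1`, so downstream code usually filters those out before reporting. A small sketch of grouping reviews by predicted topic follows; the helper name `group_by_topic` and the sample predictions are assumptions, not output from this model.

```python
from collections import defaultdict

OUTLIER_TOPIC = -1  # BERTopic's label for documents that fit no topic


def group_by_topic(docs, topics):
    """Map each topic ID to the documents assigned to it, skipping outliers."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        topic = int(topic)  # accept NumPy integers as well as plain ints
        if topic != OUTLIER_TOPIC:
            grouped[topic].append(doc)
    return dict(grouped)


# Hypothetical predictions for three reviews.
sample_docs = ["Great beach view.", "No buses at night.", "asdf"]
sample_topics = [4, 7, -1]
grouped = group_by_topic(sample_docs, sample_topics)
print(grouped)  # {4: ['Great beach view.'], 7: ['No buses at night.']}
```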
|
| ## Deployment Guide |
|
|
| - **Serverless Platforms:** |
| Ensure dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` are included in your deployment package, whether it targets a serverless platform like AWS Lambda or a FastAPI service. |
| - **Memory Optimization:** |
| Safetensors serialization keeps the saved artifact small and quick to load, which shortens cold starts. It does not by itself speed up inference on each request. |
| - **Scaling Considerations:** |
| Load the model once per container at cold start, outside the request handler, and reuse it for subsequent requests so that warm invocations do not pay the load cost again. |
|
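The load-once pattern can be sketched as an AWS Lambda-style handler with a module-level cache. This is a minimal sketch under stated assumptions: the model path, the event shape (`{"docs": [...]}`), and the `loader` parameter (added only so the handler can be exercised without the real model) are hypothetical and not part of this repository.

```python
import json

_MODEL = None  # module-level cache; survives across warm invocations


def default_loader():
    """Load the model from disk. The path is a placeholder assumption."""
    from bertopic import BERTopic  # imported lazily to keep module import cheap

    return BERTopic.load("path/to/model_dir")


def get_model(loader=default_loader):
    """Return the cached model, loading it on first use only."""
    global _MODEL
    if _MODEL is None:
        _MODEL = loader()
    return _MODEL


def handler(event, context, loader=default_loader):
    """Lambda-style entry point; expects {"docs": [...]} in the event."""
    docs = event.get("docs", [])
    if not docs:
        return {"statusCode": 400, "body": json.dumps({"error": "no docs"})}
    topics, _ = get_model(loader).transform(docs)
    return {
        "statusCode": 200,
        "body": json.dumps({"topics": [int(t) for t in topics]}),
    }
```

Only the first invocation in a container calls the loader; later invocations reuse `_MODEL`. Behind an API gateway the documents would typically arrive as a JSON string in the request body and need parsing first.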
|
| ## Limitations |
|
|
| - **Variable Topic Coherence:** |
| Coherence may vary by language. |
| - **Dataset Biases:** |
| The model’s performance may be influenced by biases in the training data. |
| - **Latency Constraints:** |
| Not ideal for real-time low-latency applications (<50ms response time). |
|
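Whether a deployment meets a latency budget such as 50 ms is best checked by measurement on the target platform. The following sketch times repeated calls and reports the median and an approximate 95th percentile; the helper name `measure_latency_ms` and the stand-in workload are assumptions for illustration.

```python
import statistics
import time


def measure_latency_ms(fn, payload, runs=20, warmup=2):
    """Time fn(payload) and return median and approximate p95 latency in ms."""
    for _ in range(warmup):
        fn(payload)  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {"median_ms": statistics.median(samples), "p95_ms": samples[p95_index]}


# Stand-in workload; replace the lambda with `model.transform` on a loaded model.
stats = measure_latency_ms(lambda docs: [len(d) for d in docs], ["a review"])
print(stats)
```

Cold-start time, which includes loading the model, should be measured separately because it is not captured by timing warm calls.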
|
| ## License |
|
|
| [Insert License Here] |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{your_citation, |
| title={BERTopic Model for Multilingual Tourism Feedback}, |
| author={Paul Andre D. Tadiar}, |
| year={2025} |
| } |
| ``` |
|
|
| --- |
|
|
| *For inquiries or contributions, please open an issue on the Hugging Face repository.* |
|
|
| --- |
|
|
|
|