Token Classification
Transformers
Safetensors
English
bert
ner
named-entity-recognition
text-classification
transformer
pretrained-model
huggingface
real-time-inference
efficient-nlp
micro-nlp
chatbot
information-extraction
document-understanding
search-enhancement
medical-nlp
financial-nlp
legal-nlp
general-purpose-nlp
on-device-nlp
## 🚀 Model Details

### 🌈 Description

The `boltuix/EntityBERT` model is a fine-tuned transformer for **Named Entity Recognition (NER)**, built on the lightweight `boltuix/bert-mini` base model. It excels at identifying 43 entity types, including people, locations, organizations, dates, times, phone numbers, emails, URLs, and more, in English text. Optimized for efficiency and high accuracy, it’s ideal for real-time applications like information extraction, chatbots, and knowledge graph construction across domains such as travel, medical, logistics, and education.

- **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (~143,709 entries, 6.38 MB)
- **Entity Types**: 43 NER tags (18 core entity categories with B-/I- tags + O + padding labels)
- **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
- **Domains**: Travel, medical, logistics, education, news, user-generated content
- **Tasks**: Sentence-level and document-level NER
---

- **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT)
- **Dataset**: [boltuix/conll2025-ner](#download-instructions)
- **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
- **Demo**: [boltuix.github.io/demo](https://boltuix.github.io/demo) (coming soon)

---

## 🎯 Use Cases for NER

### 🌟 Direct Applications
- **Information Extraction**: Extract entities like 👤 Person (e.g., "Dr. Sarah Lee"), 🌍 Location (e.g., "Baltimore"), 🗓️ Date (e.g., "July 10, 2025"), and 📞 Phone (e.g., "+1-410-955-5000") from travel itineraries, medical reports, or logistics documents.
- **Chatbots & Virtual Assistants**: Enhance user interactions by recognizing entities in queries like "Book a flight from Dubai to Tokyo on October 10, 2025." (a slot-extraction sketch follows these lists)
- **Search Enhancement**: Enable semantic search with entity-based indexing, e.g., finding documents mentioning "Emirates" or "Shibuya Crossing."
- **Knowledge Graphs**: Build structured graphs linking entities like 🏢 Organization (e.g., "Johns Hopkins") and 📍 Address (e.g., "1800 Orleans St").

### 🌱 Downstream Tasks
- **Travel NLP**: Extract travel details like departure/arrival times and transport modes (e.g., "flight," "train") for booking systems.
- **Medical NLP**: Identify doctors, hospitals, and contact info in patient records or consultation requests.
- **Logistics NLP**: Track shipments by extracting locations, dates, and company names (e.g., "FedEx," "DHL").
- **Education NLP**: Parse academic events, university names, and contact details from seminar announcements.
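For the chatbot use case above, pipeline output can be reduced to a small dictionary of booking slots. A minimal sketch, assuming the aggregated pipeline shown in Getting Started below; the `extract_slots` helper and its first-match policy are illustrative, not part of the model:

```python
from transformers import pipeline

nlp = pipeline("token-classification", model="boltuix/EntityBERT", aggregation_strategy="simple")

def extract_slots(query: str) -> dict:
    """Hypothetical helper: keep the first span found for each entity group."""
    slots = {}
    for ent in nlp(query):
        # ent["entity_group"] is the tag without its B-/I- prefix, e.g. "from-location"
        slots.setdefault(ent["entity_group"], ent["word"])
    return slots

print(extract_slots("Book a flight from Dubai to Tokyo on October 10, 2025."))
# Expected shape (exact labels depend on the model):
# {'transport-mode': 'flight', 'from-location': 'Dubai', 'to-location': 'Tokyo', 'date': 'October 10, 2025'}
```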
---

## Getting Started

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")

# Create NER pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Input text
text = "Dr. Sarah Lee at Johns Hopkins, Baltimore, MD, books a flight to Rochester, MN on July 10, 2025, contact +1-410-955-5000 or sarah.lee@jhmi.edu, visit www.airmed.com."

# Run inference
ner_results = nlp(text)

# Print results
for entity in ner_results:
    print(f"{entity['word']:15} -> {entity['entity_group']}")
```
### ✨ Example Output
```
Dr. Sarah Lee      -> person
Johns Hopkins      -> organization
Baltimore          -> from-location
MD                 -> from-state
flight             -> transport-mode
Rochester          -> to-location
MN                 -> to-state
July 10, 2025      -> date
+1-410-955-5000    -> phone
sarah.lee@jhmi.edu -> email
www.airmed.com     -> url
```
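Each aggregated result also includes a confidence score and character offsets alongside the word and entity group. Continuing from the snippet above:

```python
# Scores and offsets come back with every aggregated entity
for entity in ner_results:
    span = text[entity["start"]:entity["end"]]
    print(f"{span!r} -> {entity['entity_group']} (score={entity['score']:.2f})")
```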
### 🛠️ Requirements
```bash
pip install transformers torch pandas pyarrow seqeval
```
- **Python**: 3.8+
- **Storage**: ~50 MB for model weights, ~6.38 MB for dataset
- **Optional**: NVIDIA CUDA for GPU acceleration, `seqeval` for evaluation

---
## 🧠 Entity Labels

The model supports 43 NER tags, including 36 core tags aligned with the `boltuix/conll2025-ner` dataset and 6 padding tags, using the **BIO tagging scheme**:

| Tag Name          | Description                        | Example                  |
|-------------------|------------------------------------|--------------------------|
| O                 | Non-entity                         | "visited"                |
| B-from-location   | Beginning of source location       | "Baltimore"              |
| I-from-location   | Inside source location             | "York" (in "New York")   |
| B-from-state      | Beginning of source state          | "MD"                     |
| I-from-state      | Inside source state                |                          |
| B-from-country    | Beginning of source country        | "USA"                    |
| I-from-country    | Inside source country              |                          |
| B-from-address    | Beginning of source address        | "1800"                   |
| I-from-address    | Inside source address              | "Orleans St"             |
| B-to-location     | Beginning of destination location  | "Rochester"              |
| I-to-location     | Inside destination location        |                          |
| B-to-state        | Beginning of destination state     | "MN"                     |
| I-to-state        | Inside destination state           |                          |
| B-to-country      | Beginning of destination country   | "Japan"                  |
| I-to-country      | Inside destination country         |                          |
| B-to-address      | Beginning of destination address   | "Shibuya Crossing"       |
| I-to-address      | Inside destination address         |                          |
| B-transport-mode  | Beginning of transport mode        | "flight"                 |
| I-transport-mode  | Inside transport mode              | "jet" (in "private jet") |
| B-date            | Beginning of date                  | "July"                   |
| I-date            | Inside date                        | "10"                     |
| B-time            | Beginning of time                  | "9:00"                   |
| I-time            | Inside time                        | "AM"                     |
| B-departure-time  | Beginning of departure time        | "8:00"                   |
| I-departure-time  | Inside departure time              | "AM"                     |
| B-arrival-time    | Beginning of arrival time          | "12:00"                  |
| I-arrival-time    | Inside arrival time                | "PM"                     |
| B-company         | Beginning of company name          | "Emirates"               |
| I-company         | Inside company name                |                          |
| B-organization    | Beginning of organization name     | "Johns"                  |
| I-organization    | Inside organization name           | "Hopkins"                |
| B-person          | Beginning of person name           | "Sarah"                  |
| I-person          | Inside person name                 | "Lee"                    |
| B-job-title       | Beginning of job title             | "Chief"                  |
| I-job-title       | Inside job title                   | "Cardiologist"           |
| B-phone           | Beginning of phone number          | "+1-410-955-5000"        |
| I-phone           | Inside phone number                |                          |
| B-email           | Beginning of email                 | "sarah.lee"              |
| I-email           | Inside email                       | "@jhmi.edu"              |
| B-url             | Beginning of URL                   | "www.airmed.com"         |
| I-url             | Inside URL                         |                          |
| B-other           | Beginning of miscellaneous entity  |                          |
| I-other           | Inside miscellaneous entity        |                          |
| B-reserved1       | Reserved padding label             |                          |
| I-reserved1       | Reserved padding label             |                          |
| B-reserved2       | Reserved padding label             |                          |
| I-reserved2       | Reserved padding label             |                          |

**Example**:
Text: `"Book a flight from Dubai to Tokyo on October 10, 2025 with Emirates."`
Tags: `[O, O, B-transport-mode, O, B-from-location, O, B-to-location, O, B-date, I-date, I-date, O, B-company]`
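The tag inventory above ships with the checkpoint itself; the config's `id2label` mapping is the ground truth if the table and the model ever disagree. A quick way to inspect it:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("boltuix/EntityBERT")

# id2label maps class indices to the BIO tag strings used by the model
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])
```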
---

## 📈 Performance

Evaluated on the `boltuix/conll2025-ner` test split using `seqeval`:

| Metric      | Score |
|-------------|-------|
| 🎶 F1 Score | 0.89  |
| ✅ Accuracy | 0.94  |

These high scores showcase the model’s robust ability to identify entities across diverse domains, ensuring reliability for real-time applications.
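To reproduce this kind of scoring, `seqeval` compares predicted and reference tag sequences at the entity level. A minimal sketch with toy sequences; the reported numbers come from the full test split:

```python
from seqeval.metrics import accuracy_score, f1_score

# Toy reference/prediction pair in BIO format
y_true = [["O", "B-transport-mode", "O", "B-from-location", "O", "B-to-location"]]
y_pred = [["O", "B-transport-mode", "O", "B-from-location", "O", "O"]]

print("F1:", f1_score(y_true, y_pred))              # entity-level F1
print("Accuracy:", accuracy_score(y_true, y_pred))  # token-level accuracy
```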
---

## ⚙️ Training Setup

- **Hardware**: NVIDIA GPU (e.g., A100)
- **Training Time**: ~1.5 hours
- **Parameters**: ~11M
---

## 🧠 Training the Model

Fine-tune the `boltuix/bert-mini` model on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a training script (elided steps are marked with `# ...`; a label-alignment sketch follows the script):

```python
# ... load boltuix/conll2025-ner, build unique_tags, tokenize, and align labels ...

model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=len(unique_tags))

# Training arguments
args = TrainingArguments(
    output_dir="boltuix/entitybert",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    # ... remaining arguments elided ...
)

# ... create the Trainer ...
trainer = Trainer(
    model=model,
    args=args,
    # ... train/eval datasets, tokenizer, and metrics elided ...
)

trainer.train()

# Save model
trainer.save_model("boltuix/entitybert")
tokenizer.save_pretrained("boltuix/entitybert")
```
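The elided preprocessing step is the usual subword/label alignment: word-level tags must be mapped onto tokenizer subwords, with special tokens masked out of the loss. A common sketch, assuming `tokens` and integer `ner_tags` columns as in `boltuix/conll2025-ner`:

```python
def tokenize_and_align_labels(examples, tokenizer, label_all_tokens=False):
    # Tokenize pre-split words so subwords can be mapped back to word indices
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)           # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(tags[word_id])  # first subword carries the word's tag
            else:
                label_ids.append(tags[word_id] if label_all_tokens else -100)
            previous_word = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized
```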
### 🛠️ Tips
- **Hyperparameters**: Adjust `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5) for optimal results.
- **GPU Acceleration**: Enable `fp16=True` for faster training on NVIDIA GPUs.
- **Custom Datasets**: Adapt the script for custom NER datasets by updating `unique_tags` and preprocessing steps.

### ⏱️ Expected Training Time
- ~1.5 hours on an NVIDIA A100 GPU for ~115,812 training examples, 3 epochs, batch size 16.

### 🌍 Carbon Impact
- Training emits ~40g CO₂eq, optimized with FP16 and the lightweight `bert-mini` base model.
---

## 🌍 Carbon Impact
- **Emissions**: ~40g CO₂eq
- **Measurement**: ML Impact tool
- **Optimization**: FP16 and efficient architecture

---
## 🛠️ Installation

```bash
pip install transformers torch pandas pyarrow seqeval
```
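CUDA is optional; when a GPU is present, the pipeline can be pinned to it. A small sketch (`device=0` selects the first GPU, `-1` falls back to CPU):

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
nlp = pipeline("token-classification", model="boltuix/EntityBERT",
               aggregation_strategy="simple", device=device)
```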
---

Evaluate the model on custom data:

```python
from transformers import pipeline

# Load NER pipeline
nlp = pipeline("token-classification", model="boltuix/EntityBERT", aggregation_strategy="simple")

# Test data
text = "Book a Lyft from Metropolis on December 1, 2025, contact support@lyft.com."

# Run inference
results = nlp(text)

# Print results
for entity in results:
    print(f"{entity['word']:15} -> {entity['entity_group']}")
```
### ✨ Example Output
```
Lyft             -> company
Metropolis       -> from-location
December 1, 2025 -> date
support@lyft.com -> email
```

---
- **Size**: 6.38 MB (Parquet format)
- **Columns**: `split`, `tokens`, `ner_tags` (previewed in the sketch below)
- **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
- **NER Tags**: 43 (18 core entity types with B-/I- tags + O + padding)
- **Source**: Curated from travel, medical, logistics, education, news, and user-generated content
- **Annotations**: Expert-labeled for high accuracy
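To sanity-check these numbers locally, the Parquet file can be inspected with pandas. A sketch, assuming the file has been downloaded; the local filename is illustrative:

```python
import pandas as pd

# Hypothetical local path to the downloaded Parquet file
df = pd.read_parquet("conll2025-ner.parquet")

print(df["split"].value_counts())         # expect ~115,812 / ~15,680 / ~12,217
print(df[["tokens", "ner_tags"]].head())  # token lists with aligned tag lists
```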
## ⚖️ Comparison to Other Models

| Model          | Dataset       | Parameters | F1 Score | Size    |
|----------------|---------------|------------|----------|---------|
| **EntityBERT** | conll2025-ner | ~11M       | 0.89     | ~50 MB  |
| BERT-base-NER  | CoNLL-2003    | ~110M      | ~0.89    | ~400 MB |
| DistilBERT-NER | CoNLL-2003    | ~66M       | ~0.85    | ~200 MB |
---

## 📅 Last Updated

**June 10, 2025** — Released v1.0 with fine-tuning on `boltuix/conll2025-ner`, optimized for 43 entity types.

**[Get Started Now](#getting-started)** 🚀