| license: mit | |
| tags: | |
| - token-classification | |
| - modernbert | |
| - orality | |
| - linguistics | |
| - multi-label | |
| language: | |
| - en | |
| metrics: | |
| - f1 | |
| base_model: | |
| - answerdotai/ModernBERT-base | |
| pipeline_tag: token-classification | |
| library_name: transformers | |
| datasets: | |
| - custom | |
| # Havelock Orality Token Classifier | |
| ModernBERT-based token classifier for detecting **oral and literate markers** in text, based on Walter Ong's "Orality and Literacy" (1982). | |
| This model performs multi-label span-level detection of 53 rhetorical marker types, where each token independently carries B/I/O labels per type, allowing overlapping spans (e.g., a token that is simultaneously part of a concessive and a nested clause). | |
| ## Model Details | |
| | Property | Value | | |
| |----------|-------| | |
| | Base model | `answerdotai/ModernBERT-base` | | |
| | Task | Multi-label token classification (independent B/I/O per type) | | |
| | Marker types | 53 (22 oral, 31 literate) | | |
| | Test macro F1 | **0.378** (per-type detection, binary positive = B or I) | | |
| | Training | 20 epochs, fp16 | | |
| | Regularization | Mixout (p=0.1): stochastic L2 anchor to pretrained weights | |
| | Loss | Per-type focal loss (γ=2.0) with inverse-frequency OBI and type weights | |
| | Min examples | 150 (types below this threshold excluded) | | |
| ## Usage | |
| ```python | |
| import json | |
| import torch | |
| from transformers import AutoModel, AutoTokenizer | |
| from huggingface_hub import hf_hub_download | |
| model_name = "HavelockAI/bert-token-classifier" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModel.from_pretrained(model_name, trust_remote_code=True) | |
| model.eval() | |
| # Load marker type map | |
| type_map_path = hf_hub_download(model_name, "type_to_idx.json") | |
| with open(type_map_path, encoding="utf-8") as f: | |
|     type_to_idx = json.load(f) | |
| idx_to_type = {v: k for k, v in type_to_idx.items()} | |
| text = "Tell me, O Muse, of that ingenious hero who travelled far and wide" | |
| inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128) | |
| with torch.no_grad(): | |
|     logits = model(**inputs)  # (1, seq_len, num_types, 3) | |
| preds = logits.argmax(dim=-1)  # (1, seq_len, num_types); 0 = O, 1 = B, 2 = I | |
| tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) | |
| # Print every token that carries at least one non-O label | |
| for i, token in enumerate(tokens): | |
|     active = [ | |
|         f"{idx_to_type[t]}={'OBI'[v]}" | |
|         for t, v in enumerate(preds[0, i].tolist()) | |
|         if v > 0 | |
|     ] | |
|     if active: | |
|         print(f"{token:15} {', '.join(active)}") | |
| ``` | |
| > **Note:** This model uses a custom architecture (`HavelockTokenClassifier`) with independent B/I/O heads per marker type, enabling overlapping span detection. Loading requires `trust_remote_code=True`. | |
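The per-token O/B/I ids produced by the usage example can be grouped into token spans per marker type. The following is a minimal sketch, not part of the model repository: `decode_spans` is a hypothetical helper, and it assumes the class order 0 = O, 1 = B, 2 = I implied by the `'OBI'[v]` lookup in the example.

```python
from typing import Dict, List, Optional, Tuple


def decode_spans(
    labels_per_type: Dict[str, List[int]],
) -> Dict[str, List[Tuple[int, int]]]:
    """Group per-token O/B/I ids (0=O, 1=B, 2=I) into [start, end) token spans per type."""
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for type_name, labels in labels_per_type.items():
        found: List[Tuple[int, int]] = []
        start: Optional[int] = None
        for i, label in enumerate(labels):
            if label == 1:  # B always opens a new span, closing any open one
                if start is not None:
                    found.append((start, i))
                start = i
            elif label == 2:  # I continues a span; a stray I opens one
                if start is None:
                    start = i
            else:  # O closes any open span
                if start is not None:
                    found.append((start, i))
                    start = None
        if start is not None:  # span running to the end of the sequence
            found.append((start, len(labels)))
        if found:
            spans[type_name] = found
    return spans


example = {
    "oral_vocative": [0, 1, 2, 0, 1],
    "literate_concessive": [0, 0, 0, 0, 0],
}
print(decode_spans(example))
```

With the variables from the usage example, the input could be built as `{idx_to_type[t]: preds[0, :, t].tolist() for t in range(preds.shape[-1])}`. Treating a stray `I` as the start of a span is one possible convention; dropping such tokens would be equally defensible.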
| ## Training Data | |
| - Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages | |
| - Types with fewer than 150 annotated spans are excluded from training | |
| - Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously | |
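To make the multi-label BIO scheme concrete, the sketch below turns overlapping span annotations into one O/B/I sequence per marker type. The annotation tuple format, the helper name `spans_to_bio`, and the example sentence are assumptions for illustration; the actual training data format is not described here.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # assumed class ids


def spans_to_bio(
    num_tokens: int,
    spans: List[Tuple[str, int, int]],
) -> Dict[str, List[int]]:
    """Turn (type, start, end) token spans (end exclusive) into one O/B/I sequence per type."""
    labels: Dict[str, List[int]] = {}
    for type_name, start, end in spans:
        seq = labels.setdefault(type_name, [O] * num_tokens)
        seq[start] = B
        for i in range(start + 1, end):
            seq[i] = I
    return labels


tokens = ["Although", "it", "rained", ",", "which", "surprised", "us", ",", "we", "left"]
annotations = [
    ("literate_concessive", 0, 8),      # hypothetical annotation
    ("literate_nested_clauses", 4, 7),  # hypothetical annotation, nested inside the first
]
bio = spans_to_bio(len(tokens), annotations)
for name, seq in bio.items():
    print(name, seq)
```

Token 4 ("which") ends up with `I` for `literate_concessive` and `B` for `literate_nested_clauses` at the same time, which a single-label BIO scheme could not express.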
| ## Marker Types (53) | |
| ### Oral Markers (22 types) | |
| Characteristics of oral tradition and spoken discourse: | |
| | Category | Markers | | |
| |----------|---------| | |
| | **Address & Interaction** | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler | | |
| | **Repetition & Pattern** | anaphora, parallelism, tricolon, lexical_repetition, antithesis | | |
| | **Conjunction** | simple_conjunction | | |
| | **Formulas** | discourse_formula, intensifier_doubling | | |
| | **Narrative** | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example | | |
| | **Performance** | self_correction | | |
| ### Literate Markers (31 types) | |
| Characteristics of written, analytical discourse: | |
| | Category | Markers | | |
| |----------|---------| | |
| | **Abstraction** | nominalization, abstract_noun, conceptual_metaphor, categorical_statement | | |
| | **Syntax** | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit | | |
| | **Hedging** | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector | | |
| | **Impersonality** | agentless_passive, agent_demoted, institutional_subject, objectifying_stance | | |
| | **Scholarly apparatus** | citation, cross_reference, metadiscourse, definitional_move | | |
| | **Technical** | technical_term, technical_abbreviation, enumeration, list_structure | | |
| | **Connectives** | contrastive, additive_formal | | |
| | **Setting** | concrete_setting, aside | | |
| ## Evaluation | |
| Per-type detection F1 on test set (binary: B or I = positive, O = negative): | |
| <details><summary>Click to show per-marker precision/recall/F1/support</summary> | |
| ``` | |
| Type Prec Rec F1 Sup | |
| ======================================================================== | |
| literate_abstract_noun 0.190 0.325 0.240 381 | |
| literate_additive_formal 0.246 0.556 0.341 27 | |
| literate_agent_demoted 0.404 0.368 0.386 304 | |
| literate_agentless_passive 0.575 0.607 0.591 1133 | |
| literate_aside 0.379 0.429 0.403 436 | |
| literate_categorical_statement 0.267 0.146 0.189 514 | |
| literate_causal_explicit 0.227 0.279 0.251 190 | |
| literate_citation 0.639 0.556 0.595 372 | |
| literate_conceptual_metaphor 0.310 0.364 0.335 415 | |
| literate_concessive 0.499 0.470 0.484 502 | |
| literate_concessive_connector 0.455 0.408 0.430 49 | |
| literate_concrete_setting 0.241 0.125 0.165 407 | |
| literate_conditional 0.369 0.630 0.466 760 | |
| literate_contrastive 0.310 0.428 0.360 341 | |
| literate_cross_reference 0.386 0.524 0.444 42 | |
| literate_definitional_move 0.395 0.185 0.252 81 | |
| literate_enumeration 0.495 0.483 0.489 775 | |
| literate_epistemic_hedge 0.421 0.481 0.449 445 | |
| literate_evidential 0.625 0.360 0.457 472 | |
| literate_institutional_subject 0.332 0.326 0.329 282 | |
| literate_list_structure 0.338 0.523 0.411 86 | |
| literate_metadiscourse 0.140 0.393 0.206 135 | |
| literate_nested_clauses 0.091 0.246 0.133 1169 | |
| literate_nominalization 0.499 0.612 0.549 991 | |
| literate_objectifying_stance 0.635 0.365 0.464 167 | |
| literate_probability 0.432 0.593 0.500 27 | |
| literate_qualified_assertion 0.143 0.100 0.118 40 | |
| literate_relative_chain 0.382 0.507 0.436 1424 | |
| literate_technical_abbreviation 0.667 0.711 0.688 225 | |
| literate_technical_term 0.280 0.375 0.321 715 | |
| literate_temporal_embedding 0.228 0.259 0.242 526 | |
| oral_anaphora 0.800 0.028 0.054 287 | |
| oral_antithesis 0.249 0.238 0.243 412 | |
| oral_discourse_formula 0.340 0.408 0.371 557 | |
| oral_embodied_action 0.280 0.391 0.326 425 | |
| oral_everyday_example 0.333 0.156 0.212 404 | |
| oral_imperative 0.591 0.662 0.625 293 | |
| oral_inclusive_we 0.516 0.632 0.568 622 | |
| oral_intensifier_doubling 0.680 0.200 0.309 85 | |
| oral_lexical_repetition 0.404 0.254 0.312 173 | |
| oral_named_individual 0.441 0.749 0.556 770 | |
| oral_parallelism 0.741 0.110 0.191 182 | |
| oral_phatic_check 0.611 0.733 0.667 30 | |
| oral_phatic_filler 0.174 0.409 0.244 93 | |
| oral_rhetorical_question 0.509 0.692 0.586 905 | |
| oral_second_person 0.576 0.552 0.564 811 | |
| oral_self_correction 0.158 0.235 0.189 51 | |
| oral_sensory_detail 0.285 0.169 0.212 461 | |
| oral_simple_conjunction 0.179 0.102 0.130 98 | |
| oral_specific_place 0.556 0.705 0.622 424 | |
| oral_temporal_anchor 0.410 0.559 0.473 546 | |
| oral_tricolon 0.299 0.119 0.171 553 | |
| oral_vocative 0.652 0.747 0.696 158 | |
| ======================================================================== | |
| Macro avg (types w/ support) 0.378 | |
| ``` | |
| </details> | |
| **Missing labels (test set):** 0/53; all types detected at least once. | |
| Notable patterns: | |
| - **Strong performers** (F1 ≥ 0.5): vocative (0.696), technical_abbreviation (0.688), phatic_check (0.667), imperative (0.625), specific_place (0.622), citation (0.595), agentless_passive (0.591), rhetorical_question (0.586), inclusive_we (0.568), second_person (0.564), named_individual (0.556), nominalization (0.549), probability (0.500) | |
| - **Weak performers** (F1 < 0.2): anaphora (0.054), qualified_assertion (0.118), simple_conjunction (0.130), nested_clauses (0.133), concrete_setting (0.165), tricolon (0.171), categorical_statement (0.189), self_correction (0.189), parallelism (0.191) | |
| - **Precision-recall tradeoff**: Many types show roughly balanced precision and recall. Notable exceptions include `anaphora` (0.800 precision / 0.028 recall), `parallelism` (0.741 / 0.110), and `intensifier_doubling` (0.680 / 0.200), which are high-precision but very low-recall. | |
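The per-type detection metric described in this section (B or I = positive, O = negative, macro-averaged over types with support) can be sketched as follows. This assumes token-level counting; whether the reported support counts tokens or spans is not stated, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def detection_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Token-level precision/recall/F1 where B (1) or I (2) is positive and O (0) is negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g > 0 and p > 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p > 0)
    fn = sum(1 for g, p in zip(gold, pred) if g > 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(
    gold_by_type: Dict[str, List[int]],
    pred_by_type: Dict[str, List[int]],
) -> float:
    """Average F1 over types that have at least one positive gold token."""
    scores = [
        detection_prf(gold, pred_by_type[name])[2]
        for name, gold in gold_by_type.items()
        if any(g > 0 for g in gold)
    ]
    return sum(scores) / len(scores) if scores else 0.0


gold = {"oral_vocative": [0, 1, 2, 0], "oral_tricolon": [0, 0, 0, 0]}
pred = {"oral_vocative": [0, 2, 0, 1], "oral_tricolon": [0, 1, 0, 0]}
print(detection_prf(gold["oral_vocative"], pred["oral_vocative"]))
print(macro_f1(gold, pred))
```

Because B and I are merged into one positive class, predicting `I` where the gold label is `B` still counts as a hit; the metric measures detection, not exact boundary placement.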
| ## Architecture | |
| Custom `MultiLabelTokenClassifier` with independent B/I/O heads per marker type: | |
| ``` | |
| ModernBERT (answerdotai/ModernBERT-base) | |
|     └── Dropout (p=0.1) | |
|         └── Linear (hidden_size → num_types × 3) | |
|             └── Reshape to (batch, seq, num_types, 3) | |
| ``` | |
| Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions. | |
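The reshape step can be illustrated for a single token without any framework. This sketch assumes the linear layer's output is laid out type by type (row-major, matching a reshape to `(num_types, 3)`) and that the three scores are ordered O, B, I; `reshape_and_argmax` is a hypothetical name.

```python
from typing import List


def reshape_and_argmax(flat_logits: List[float], num_types: int) -> List[int]:
    """Split one token's num_types*3 logits into (num_types, 3) and argmax each triple."""
    if len(flat_logits) != num_types * 3:
        raise ValueError("expected num_types * 3 logits")
    labels = []
    for t in range(num_types):
        triple = flat_logits[3 * t : 3 * t + 3]  # assumed [O, B, I] scores for type t
        labels.append(max(range(3), key=triple.__getitem__))
    return labels


# Two marker types for one token: type 0 scores highest on O, type 1 on B.
print(reshape_and_argmax([2.0, 0.1, 0.3, -1.0, 3.0, 0.0], num_types=2))
```

Each triple is decided on its own, so nothing forces the labels of different types to be mutually exclusive for a given token.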
| ### Regularization | |
| - **Mixout** (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020) | |
| - **Per-type focal loss** (γ=2.0): Focuses learning on hard examples, reducing the contribution of easy negatives | |
| - **Inverse-frequency type weights**: Rare marker types receive higher loss weighting | |
| - **Inverse-frequency OBI weights**: B and I classes upweighted relative to dominant O class | |
| - **Weighted random sampling**: Examples containing rarer markers sampled more frequently | |
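The focal loss and inverse-frequency weighting listed above can be sketched for a single 3-way O/B/I decision. This is an illustration using only the standard library, not the training code: the weight normalization (rescaling so the weights average to 1) and both function names are assumptions.

```python
import math
from typing import Dict, List, Sequence


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weights proportional to 1/count, rescaled so they average to 1 (assumed normalization)."""
    inv = {k: 1.0 / c for k, c in counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {k: v / mean for k, v in inv.items()}


def focal_loss(
    logits: List[float],
    target: int,
    gamma: float = 2.0,
    class_weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Focal loss for one decision: -w_t * (1 - p_t)^gamma * log(p_t)."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([5.0, 0.0, 0.0], target=0)  # confident, correct O prediction
hard = focal_loss([0.0, 0.0, 0.0], target=0)  # uncertain prediction
print(easy, hard)
print(inverse_frequency_weights({"O": 90, "B": 5, "I": 5}))
```

The `(1 - p_t)^gamma` factor shrinks the loss of confidently correct predictions, which are mostly `O` tokens, while the class weights raise the relative cost of errors on the rarer `B` and `I` labels.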
| ### Initialization | |
| Fine-tuned from `answerdotai/ModernBERT-base`. Backbone linear layers wrapped with Mixout during training (frozen pretrained copy used as anchor). The classification head is randomly initialized: | |
| ``` | |
| backbone.* layers → loaded from pretrained, anchored via Mixout | |
| classifier.weight → randomly initialized | |
| classifier.bias → randomly initialized | |
| ``` | |
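The element-wise replacement that Mixout applies against the frozen pretrained copy can be sketched on plain lists. This is a simplified illustration of the behavior described above, with a hypothetical `mixout` function; the formulation in Lee et al. additionally rescales the mixed weights so their expectation matches the current weights, and that step is left out here.

```python
import random
from typing import List


def mixout(
    current: List[float],
    pretrained: List[float],
    p: float,
    rng: random.Random,
) -> List[float]:
    """Replace each weight element with its pretrained value with probability p."""
    return [u if rng.random() < p else w for w, u in zip(current, pretrained)]


rng = random.Random(0)
pretrained = [0.0] * 10_000  # stand-in for the frozen pretrained weights
current = [1.0] * 10_000     # stand-in for the fine-tuned weights
mixed = mixout(current, pretrained, p=0.1, rng=rng)
replaced = mixed.count(0.0) / len(mixed)
print(f"fraction reset to pretrained: {replaced:.3f}")
```

A fresh mask is drawn on every training forward pass, so no single weight can drift far from its pretrained value without being pulled back from time to time.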
| ## Limitations | |
| - **Very low recall types**: `anaphora` (0.028 recall), `qualified_assertion` (0.100), `simple_conjunction` (0.102), `parallelism` (0.110), and `tricolon` (0.119) are rarely detected despite being present in training data | |
| - **Low-precision types**: `nested_clauses` (0.091), `metadiscourse` (0.140), and `qualified_assertion` (0.143) have precision below 0.15, meaning most predictions for those types are false positives | |
| - **Context window**: 128 tokens max; longer inputs are truncated, so spans that cross the limit may be cut off | |
| - **Domain**: Trained primarily on historical/literary texts; may underperform on modern social media | |
| - **Subjectivity**: Some marker boundaries are inherently ambiguous | |
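One possible way to work within the 128-token limit is to run the model over overlapping windows of a longer text. The sketch below only computes the window boundaries; the stride value, the helper name `sliding_windows`, and the idea itself are assumptions rather than a procedure documented for this model. Special tokens added by the tokenizer also count toward the limit, so the usable window is likely slightly smaller than 128.

```python
from typing import List, Tuple


def sliding_windows(
    num_tokens: int,
    window: int = 128,
    stride: int = 64,
) -> List[Tuple[int, int]]:
    """Return [start, end) token ranges that cover the sequence with overlap."""
    if num_tokens <= window:
        return [(0, num_tokens)]
    starts = list(range(0, num_tokens - window, stride))
    starts.append(num_tokens - window)  # final window ends exactly at the last token
    return [(s, s + window) for s in starts]


print(sliding_windows(300))
```

How predictions from overlapping windows should be merged (for example, preferring the window in which a token is farthest from the edge) is left open here.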
| ## Citation | |
| ```bibtex | |
| @misc{havelock2026token, | |
| title={Havelock Orality Token Classifier}, | |
| author={Havelock AI}, | |
| year={2026}, | |
| url={https://huggingface.co/HavelockAI/bert-token-classifier} | |
| } | |
| ``` | |
| ## References | |
| - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982. | |
| - Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020. | |
| - Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024. | |
| --- | |
| *Trained: February 2026* |