| --- |
| language: |
| - it |
| - en |
| tags: |
| - tokenizer |
| - bpe |
| - stem |
| - physics |
| - quark |
| license: mit |
| pipeline_tag: text-generation |
| --- |
| |
| # Quark2Tokenizer |
|
|
| Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the **Quark** Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths. |
|
|
| ## Architectural Design & Specifications |
|
|
| - **Model Type:** ByteLevelBPE (Rust-backed core for highly parallelized inference and training) |
| - **Vocabulary Size:** 65,536 entries (2^16, fixed) |
| - **Primary Languages:** Italian (IT), English (EN) |
| - **Domain Optimization:** Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell) |
|
|
| ### Token Allocation & Special Tokens |
| The first 10 token IDs (`0-9`) are reserved for special tokens, each encoded as a single atomic token. This keeps byte-level segmentation from splitting a conversational control marker into several tokens, which would waste context window during dense reasoning. The markers at IDs `4-9` are: |
|
|
| | Token | ID | Functional Scope | |
| | :--- | :--- | :--- | |
| | `<\|system\|>` | `4` | System prompt configuration boundary | |
| | `<\|user\|>` | `5` | User prompt interface marker | |
| | `<\|assistant\|>` | `6` | Assistant response generation boundary | |
| | `<\|end\|>` | `7` | Complete sequence/turn termination delimiter | |
| | `<\|thinking\|>` | `8` | **Chain-of-Thought (CoT) sequence initialization** | |
| | `<\|/thinking\|>` | `9` | **Chain-of-Thought (CoT) sequence termination** | |
|
|
| --- |
|
|
| ## Tokenization Efficiency & Compression Metrics |
|
|
| The tokenizer was trained on a balanced 5-billion-token streaming mixture of Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and Claude CoT distillation data. |
|
|
| Compression is reported as **Characters-per-Token (char/tok)**, where a higher value means fewer tokens for the same text. Results by content type: |
|
|
| * **Natural Language Prose (Bilingual EN/IT):** **5.31 char/tok** *Consolidates morphological roots and spaces seamlessly (e.g., merging `·modello` or `·architecture` into single vocab entries), significantly reducing context consumption during prompt parsing.* |
| * **Source Code (Python):** **2.72 char/tok** *Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.* |
| * **Mathematical Notation (LaTeX & Tensor Calculus):** **2.19 char/tok** *Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like `_{\`), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.* |
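
The card does not state exactly how char/tok was measured. The sketch below assumes the simplest definition, total Unicode characters divided by total tokens over a set of samples. `chars_per_token` and the stand-in `whitespace_encode` are hypothetical names used only for illustration.

```python
def chars_per_token(texts, encode):
    """Return total characters divided by total tokens over a list of strings."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_chars / total_tokens


# Stand-in encoder for illustration only: one "token" per whitespace-separated word.
def whitespace_encode(text):
    return text.split()


samples = ["il modello di linguaggio", "the model architecture"]
ratio = chars_per_token(samples, whitespace_encode)
print(f"{ratio:.2f} char/tok")
```

To measure the real tokenizer, one could pass `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode`, so that any automatically added special tokens do not skew the ratio.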
|
|
| --- |
|
|
| ## Quickstart |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
| # Initialize the Quark2Tokenizer |
| tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer") |
| |
| # LaTeX backslashes are doubled so that "\nu" is not read as a newline followed by "u" |
| sequence = "<|thinking|>\nExecuting field equations...\n<|/thinking|>G_{\\mu\\nu} = \\frac{8\\pi G}{c^4} T_{\\mu\\nu}" |
| tokens = tokenizer.encode(sequence) |
| |
| print(f"Token IDs: {tokens}") |
| print(f"Decoded Sequence: {tokenizer.decode(tokens)}") |