Quark2Tokenizer
Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the Quark Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.
Architectural Design & Specifications
- Model Type: ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
- Vocabulary Size: 65,536 (Fixed entry constraints)
- Primary Languages: Italian (IT), English (EN)
- Domain Optimization: Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)
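One practical consequence of the 65,536-entry vocabulary, assuming token IDs run from 0 to 65,535, is that every ID fits in an unsigned 16-bit integer. The sketch below packs a token sequence into such an array using only the Python standard library. The helper name `pack_token_ids` is hypothetical and is not part of the tokenizer.

```python
from array import array

VOCAB_SIZE = 65_536  # vocabulary size from the specification list

def pack_token_ids(token_ids):
    """Pack token IDs into an unsigned 16-bit array (hypothetical helper).

    ASSUMPTION: IDs are contiguous in the range 0..VOCAB_SIZE-1.
    """
    for tid in token_ids:
        if not 0 <= tid < VOCAB_SIZE:
            raise ValueError(f"token ID {tid} is outside the vocabulary")
    return array("H", token_ids)  # "H" = unsigned short

packed = pack_token_ids([4, 5, 6, 65_535])
print(f"{len(packed)} IDs stored in {len(packed.tobytes())} bytes")
```

Storing pre-tokenized corpora this way takes about half the space of 32-bit IDs, which matters mostly for large training shards.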
Token Allocation & Special Tokens
The first 10 token slots (IDs 0-9) are reserved for atomic special tokens that are never split. This prevents byte-level segmentation from breaking conversational control markers into several tokens, which would waste context during dense logical inference. The conversational markers occupy IDs 4-9:
| Token | ID | Functional Scope |
|---|---|---|
| `<\|system\|>` | 4 | System prompt configuration boundary |
| `<\|user\|>` | 5 | User prompt interface marker |
| `<\|assistant\|>` | 6 | Assistant response generation boundary |
| `<\|end\|>` | 7 | Complete sequence/turn termination delimiter |
| `<\|thinking\|>` | 8 | Chain-of-Thought (CoT) sequence initialization |
| `<\|/thinking\|>` | 9 | Chain-of-Thought (CoT) sequence termination |
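As a rough illustration of how these markers might frame a multi-turn prompt, the sketch below assembles a conversation string from the token strings in the table. The exact layout (marker, newline, text, `<|end|>`) and the helper names `build_prompt` and `with_thinking` are assumptions made for illustration. The chat template the Quark models were actually trained with is not specified here and may differ.

```python
# Special-token strings as listed in the table.
SYSTEM, USER, ASSISTANT, END = "<|system|>", "<|user|>", "<|assistant|>", "<|end|>"
THINK_OPEN, THINK_CLOSE = "<|thinking|>", "<|/thinking|>"

ROLE_MARKERS = {"system": SYSTEM, "user": USER, "assistant": ASSISTANT}

def with_thinking(reasoning, answer):
    """Wrap a reasoning trace in the CoT markers, followed by the answer."""
    return f"{THINK_OPEN}\n{reasoning}\n{THINK_CLOSE}{answer}"

def build_prompt(messages, add_generation_prompt=True):
    """Render (role, text) pairs into one prompt string.

    ASSUMPTION: the marker/newline/<|end|> layout is illustrative only.
    """
    parts = [f"{ROLE_MARKERS[role]}\n{text}{END}\n" for role, text in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(f"{ASSISTANT}\n")
    return "".join(parts)

prompt = build_prompt([
    ("system", "Answer concisely."),
    ("user", "What is 2 + 2?"),
])
print(prompt)
```

Because each marker maps to a single reserved ID, a prompt built this way costs one token per marker regardless of how the surrounding text is segmented.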
Tokenization Efficiency & Compression Metrics
The tokenizer was trained on a balanced 5-billion-token streaming mixture comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data.
Evaluated across these content types, the vocabulary yields the following characters-per-token (char/tok) compression ratios:
- Natural Language Prose (Bilingual EN/IT): 5.31 char/tok. Consolidates morphological roots together with their leading spaces (e.g., merging `·modello` or `·architecture` into single vocabulary entries), significantly reducing context consumption during prompt parsing.
- Source Code (Python): 2.72 char/tok. Keeps control structures, indentation runs, and keywords intact instead of fragmenting them into fine-grained sub-word pieces.
- Mathematical Notation (LaTeX & Tensor Calculus): 2.19 char/tok. Captures highly frequent structural clusters (e.g., grouping subscript operators and macro prefixes such as `_{\` into one token), avoiding the cascade of single-byte tokens typical of generic tokenizers.
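For readers who want to reproduce this kind of measurement, the sketch below computes a corpus-level characters-per-token ratio. It assumes characters are counted as Unicode code points and that the ratio is total characters divided by total tokens. How the figures above were actually computed is not stated, so treat this as one plausible definition. The whitespace encoder is only a stand-in that keeps the example self-contained.

```python
def chars_per_token(texts, encode):
    """Return total characters divided by total tokens over all texts.

    `encode` is any callable mapping a string to a sequence of tokens.
    ASSUMPTION: characters are Python code points, not bytes.
    """
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(encode(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("encoder produced no tokens")
    return total_chars / total_tokens

def whitespace_encode(text):
    """Stand-in encoder: one token per whitespace-separated word."""
    return text.split()

ratio = chars_per_token(["il modello", "the architecture"], whitespace_encode)
print(f"{ratio:.2f} char/tok")
```

With the real tokenizer loaded through `transformers`, the same function could be called with `lambda t: tokenizer.encode(t, add_special_tokens=False)` as the encoder.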
Quickstart
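```python
from transformers import AutoTokenizer

# Initialize the Quark2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")

# LaTeX backslashes are doubled so that "\nu" is not read as a newline
# followed by "u"; the two "\n" around the reasoning text are real newlines.
sequence = (
    "<|thinking|>\nExecuting field equations...\n<|/thinking|>"
    "G_{\\mu\\nu} = \\frac{8\\pi G}{c^4} T_{\\mu\\nu}"
)

# Encode to token IDs, then decode to check the round trip
tokens = tokenizer.encode(sequence)
print(f"Token IDs: {tokens}")
print(f"Decoded Sequence: {tokenizer.decode(tokens)}")
```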