Quark2Tokenizer

Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the Quark Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.

Architectural Design & Specifications

Model Type: ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
Vocabulary Size: 65,536 (Fixed entry constraints)
Primary Languages: Italian (IT), English (EN)
Domain Optimization: Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)

Token Allocation & Special Tokens

The first 10 token slots (0-9) are structurally reserved as non-fragmentable atomic identifiers. This architecture prevents standard ByteLevel segmentation from breaking conversational control markers or degrading the model's available context window during dense logical inference:

Token	ID	Functional Scope
`<\|system\|>`	`4`	System prompt configuration boundary
`<\|user\|>`	`5`	User prompt interface marker
`<\|assistant\|>`	`6`	Assistant response generation boundary
`<\|end\|>`	`7`	Complete sequence/turn termination delimiter
`<\|thinking\|>`	`8`	Chain-of-Thought (CoT) sequence initialization
`<\|/thinking\|>`	`9`	Chain-of-Thought (CoT) sequence termination

Tokenization Efficiency & Compression Metrics

The tokenizer was trained on a meticulously balanced 5-billion-token streaming matrix comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data.

Evaluated across varied modalities, the vocabulary yields a highly competitive Characters-per-Token (char/tok) compression ratio:

Natural Language Prosa (Bilingual EN/IT): 5.31 char/tok Consolidates morphological roots and spaces seamlessly (e.g., merging ·modello or ·architecture into single vocab entries), significantly reducing context consumption during prompt parsing.
Source Code (Python): 2.72 char/tok Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.
Mathematical Notation (LaTeX & Tensor Calculus): 2.19 char/tok Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like _{\), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.

Quickstart

from transformers import AutoTokenizer

# Initialize the Quark2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")

sequence = "<|thinking|>\nExecuting field equations...\n<|/thinking|>G_{\mu\nu} = \\frac{8\\pi G}{c^4} T_{\mu\nu}"
tokens = tokenizer.encode(sequence)

print(f"Token IDs: {tokens}")
print(f"Decoded Sequence: {tokenizer.decode(tokens)}")

Downloads last month: -; Downloads are not tracked for this model. How to track