File size: 3,273 Bytes
372588f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58e4018
372588f
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
---
language:
- it
- en
tags:
- tokenizer
- bpe
- stem
- physics
- quark
license: mit
pipeline_tag: text-generation
---

# Quark2Tokenizer

Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the **Quark** Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.

## Architectural Design & Specifications

- **Model Type:** ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
- **Vocabulary Size:** 65,536 (Fixed entry constraints)
- **Primary Languages:** Italian (IT), English (EN)
- **Domain Optimization:** Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)

### Token Allocation & Special Tokens
The first 10 token slots (`0-9`) are structurally reserved as non-fragmentable atomic identifiers. This architecture prevents standard ByteLevel segmentation from breaking conversational control markers or degrading the model's available context window during dense logical inference:

| Token | ID | Functional Scope |
| :--- | :--- | :--- |
| `<\|system\|>` | `4` | System prompt configuration boundary |
| `<\|user\|>` | `5` | User prompt interface marker |
| `<\|assistant\|>` | `6` | Assistant response generation boundary |
| `<\|end\|>` | `7` | Complete sequence/turn termination delimiter |
| `<\|thinking\|>` | `8` | **Chain-of-Thought (CoT) sequence initialization** |
| `<\|/thinking\|>` | `9` | **Chain-of-Thought (CoT) sequence termination** |

---

## Tokenization Efficiency & Compression Metrics

The tokenizer was trained on a meticulously balanced 5-billion-token streaming matrix comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data. 

Evaluated across varied modalities, the vocabulary yields a highly competitive **Characters-per-Token (char/tok)** compression ratio:

* **Natural Language Prosa (Bilingual EN/IT):** **5.31 char/tok** *Consolidates morphological roots and spaces seamlessly (e.g., merging `·modello` or `·architecture` into single vocab entries), significantly reducing context consumption during prompt parsing.*
* **Source Code (Python):** **2.72 char/tok** *Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.*
* **Mathematical Notation (LaTeX & Tensor Calculus):** **2.19 char/tok** *Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like `_{\`), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.*

---

## Quickstart

```python
from transformers import AutoTokenizer

# Initialize the Quark2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")

sequence = "<|thinking|>\nExecuting field equations...\n<|/thinking|>G_{\mu\nu} = \\frac{8\\pi G}{c^4} T_{\mu\nu}"
tokens = tokenizer.encode(sequence)

print(f"Token IDs: {tokens}")
print(f"Decoded Sequence: {tokenizer.decode(tokens)}")