ThingAI
/

Quark2Tokenizer

Text Generation

Model card Files Files and versions

Quark2Tokenizer / README.md

ThingsAI's picture

Update README.md

58e4018 verified 1 day ago

|

history blame contribute delete

3.27 kB

	---
	language:
	- it
	- en
	tags:
	- tokenizer
	- bpe
	- stem
	- physics
	- quark
	license: mit
	pipeline_tag: text-generation
	---

	# Quark2Tokenizer

	Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the Quark Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.

	## Architectural Design & Specifications

	- Model Type: ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
	- Vocabulary Size: 65,536 (Fixed entry constraints)
	- Primary Languages: Italian (IT), English (EN)
	- Domain Optimization: Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)

	### Token Allocation & Special Tokens
	The first 10 token slots (`0-9`) are structurally reserved as non-fragmentable atomic identifiers. This architecture prevents standard ByteLevel segmentation from breaking conversational control markers or degrading the model's available context window during dense logical inference:

	\| Token \| ID \| Functional Scope \|
	\| :--- \| :--- \| :--- \|
	\| `<\\|system\\|>` \| `4` \| System prompt configuration boundary \|
	\| `<\\|user\\|>` \| `5` \| User prompt interface marker \|
	\| `<\\|assistant\\|>` \| `6` \| Assistant response generation boundary \|
	\| `<\\|end\\|>` \| `7` \| Complete sequence/turn termination delimiter \|
	\| `<\\|thinking\\|>` \| `8` \| Chain-of-Thought (CoT) sequence initialization \|
	\| `<\\|/thinking\\|>` \| `9` \| Chain-of-Thought (CoT) sequence termination \|

	---

	## Tokenization Efficiency & Compression Metrics

	The tokenizer was trained on a meticulously balanced 5-billion-token streaming matrix comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data.

	Evaluated across varied modalities, the vocabulary yields a highly competitive Characters-per-Token (char/tok) compression ratio:

	* Natural Language Prosa (Bilingual EN/IT): 5.31 char/tok Consolidates morphological roots and spaces seamlessly (e.g., merging `·modello` or `·architecture` into single vocab entries), significantly reducing context consumption during prompt parsing.
	* Source Code (Python): 2.72 char/tok Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.
	* Mathematical Notation (LaTeX & Tensor Calculus): 2.19 char/tok Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like `_{\`), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.

	---

	## Quickstart

	```python
	from transformers import AutoTokenizer

	# Initialize the Quark2Tokenizer
	tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")

	sequence = "<\|thinking\|>\nExecuting field equations...\n<\|/thinking\|>G_{\mu\nu} = \\frac{8\\pi G}{c^4} T_{\mu\nu}"
	tokens = tokenizer.encode(sequence)

	print(f"Token IDs: {tokens}")
	print(f"Decoded Sequence: {tokenizer.decode(tokens)}")