Copernicus Tokenizer
Overview
Copernicus Tokenizer is a domain-general Byte-Pair Encoding (BPE) tokenizer trained from scratch for large language models operating across heterogeneous reasoning domains, including:
- Natural language
- Source code
- Mathematical notation
- Scientific literature
- Symbol-heavy technical text
- Structured chat and tool-use formatting
The tokenizer was designed to prioritize:
- Reversible decoding integrity
- Structural fidelity for code
- Mathematical symbol preservation
- Vocabulary efficiency under mixed-domain corpora
- Robust multilingual byte-level coverage
The tokenizer uses GPT-2-style byte-level pretokenization combined with custom BPE merge training over approximately 3.96 million documents sourced from code, scientific literature, mathematics, and natural language corpora.
Technical Specifications
| Parameter | Value |
|---|---|
| Tokenizer Type | Byte-Pair Encoding (BPE) |
| Pretokenization | GPT-2 byte-level |
| Vocabulary Size | 55,812 |
| Merge Operations | 55,725 |
| Base Alphabet | 256-byte alphabet |
| Minimum Merge Frequency | 3 |
| Unknown Token | `<\|unk\|>` |
| Padding Token | `<\|pad\|>` |
| BOS/EOS Token | `<\|endoftext\|>` |
| Maximum Sequence Length | 4096 |
| Training Documents | ~3.96M |
| Intended Use | General-purpose LLM pretraining |
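As a point of reference, the following is a minimal sketch of how a BPE tokenizer matching these specifications could be trained with the Hugging Face `tokenizers` library. The released training script is not reproduced in this card; the corpus path and output file name below are placeholders.
```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))

# GPT-2-style byte-level pretokenization over the 256-byte base alphabet.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=55_812,                                    # target size from the table above
    min_frequency=3,                                      # minimum merge frequency from the table above
    special_tokens=["<|endoftext|>", "<|unk|>", "<|pad|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder path; the actual ~3.96M-document corpus is described under "Training Corpus".
tokenizer.train(files=["mixed_domain_corpus.txt"], trainer=trainer)
tokenizer.save("copernicus-tokenizer.json")
```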
Design Goals
The tokenizer was explicitly optimized for mixed-domain reasoning workloads rather than purely conversational English.
Core objectives included:
- Preserving programming-language structure
- Maintaining reversible decode behavior
- Improving compression over legacy GPT-2 BPEs
- Supporting LaTeX and symbolic mathematics
- Avoiding excessive fragmentation of scientific terminology
- Supporting tool-calling and agentic prompting formats
Supported Domains
| Domain | Optimization Goal |
|---|---|
| Natural Language | Compression efficiency + morphology preservation |
| Source Code | Syntax stability + AST-safe decoding |
| Mathematics | LaTeX atomicity + operator preservation |
| Scientific Text | Technical terminology coverage |
| Chat/Agents | Structured conversational formatting |
| Unicode Text | Full byte-level reversibility |
Special Tokens
| Token | Purpose |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown token |
| `<\|pad\|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<\|user\|>` | Chat role token |
| `<\|assistant\|>` | Chat role token |
| `<\|system\|>` | Chat role token |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML formatting |
| `<\|tool_call\|>` | Tool invocation |
| `<\|tool_result\|>` | Tool response |
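A quick sanity check of these reserved tokens is shown below. This is a minimal sketch using only the token names listed in the table; each special token is expected to map to a single id and to decode back unchanged.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Each reserved token should encode to one id and survive decoding intact.
for tok in ["<|endoftext|>", "<|unk|>", "<|pad|>", "<think>", "</think>",
            "<|user|>", "<|assistant|>", "<|system|>"]:
    ids = tokenizer(tok, add_special_tokens=False)["input_ids"]
    print(tok, ids, tokenizer.decode(ids))
```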
Training Corpus
The tokenizer was trained on a heterogeneous multi-domain corpus.
| Domain | Primary Sources |
|---|---|
| Natural Language | Wikipedia, Common Crawl |
| Source Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Scientific Literature | PubMed, S2ORC |
The corpus intentionally mixed:
- prose
- code
- formulas
- Unicode-heavy text
- markdown
- structured conversations
- technical documentation
This mixture was intended to prevent domain starvation during BPE merge allocation.
Installation
The tokenizer is distributed via the Hugging Face Hub and loads with the `transformers` library (install it with pip if it is not already available):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
```
Example Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

text = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"

encoded = tokenizer(text)
print(encoded["input_ids"])

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
```
Batched Training Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt",
)
```
Evaluation Methodology
The tokenizer was evaluated using a mixed-domain stress-testing suite designed to benchmark:
- compression efficiency
- structural preservation
- mathematical tokenization quality
- reversibility
- morphology handling
- numeric stability
- code integrity
The benchmark corpus included:
- deeply nested Python syntax
- asynchronous code
- indentation stress tests
- LaTeX equations
- Unicode mathematics
- morphologically rich English
- long decimal sequences
- hexadecimal and binary literals
Baseline comparison was performed against the GPT-2 tokenizer.
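The benchmark harness itself is not included in this card. The sketch below illustrates, under that caveat, how the core compression metrics can be computed for Copernicus against the GPT-2 baseline; the probe strings are placeholders, not the benchmark corpus.
```python
from transformers import AutoTokenizer

def compression_stats(tokenizer, texts):
    n_tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    n_chars = sum(len(t) for t in texts)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return {
        "char_compression": n_chars / n_tokens,  # characters per token
        "byte_compression": n_bytes / n_tokens,  # bytes per token
        "word_fertility": n_tokens / n_words,    # tokens per whitespace-delimited word
    }

# Placeholder probe strings spanning code, LaTeX, and prose.
texts = [
    "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
    r"\frac{\partial f}{\partial x} = 2x + \sum_{i=1}^{n} x_i",
    "Hyperparameterization of interoperable systems, 890.123456789",
]

for name in ["Nj-1111/Copernicus-Tokenizer", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, compression_stats(tok, texts))
```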
Benchmark Results
Core Metrics
| Metric | Copernicus | GPT-2 |
|---|---|---|
| Total Tokens | 12,600 | 14,920 |
| Character Compression Ratio | 2.754 | 2.326 |
| Byte Compression Ratio | 2.870 | 2.424 |
| Word Fertility | 2.601 | 2.872 |
| Entropy | 6.850 | 6.775 |
| Estimated BPT Proxy | 5.726 | 6.715 |
| Reversible Integrity | True | True |
| Unknown Tokens | 0 | 0 |
Interpretation of Metrics
Compression Efficiency
Copernicus demonstrates significantly stronger compression than GPT-2 on mixed-domain technical corpora.
The lower fertility and higher compression ratio indicate:
- better merge efficiency
- stronger domain coverage
- reduced subword fragmentation
- improved vocabulary allocation
The benchmark corpus was intentionally difficult and included:
- source code
- LaTeX
- Unicode mathematics
- technical scientific language
- long numeric sequences
Compression ratios on standard English corpora are expected to exceed those reported for this mixed-domain benchmark.
Reversible Integrity
The tokenizer achieved:
```
decode(encode(text)) == text
```
across the benchmark corpus.
This property is critical for:
- code generation
- compiler-safe decoding
- mathematical reconstruction
- structured prompting
- dataset integrity preservation
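A minimal round-trip check of this property is sketched below, using illustrative probe strings rather than the benchmark corpus.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

probes = [
    "def foo():\n    return 'bar'",
    "∂f/∂x = 2x + ∑_{i=1}^{n} x_i",
    "naïve café résumé",
]
for text in probes:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    assert tokenizer.decode(ids) == text, f"round-trip failed for: {text!r}"
print("All probes round-tripped losslessly")
```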
Structural Purity Evaluation
Structural Purity Score
0.887
The tokenizer largely avoided catastrophic syntax merges.
Examples of acceptable structural tokens:
```
'=='
'<='
'='
```
The tokenizer successfully avoided highly destructive merges such as:
```
foo:
(variable
]])
```
This indicates relatively strong syntax-boundary preservation.
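The structural-purity scoring method is not published with this card; a rough spot check in the same spirit is to scan the vocabulary for entries that fuse identifiers with bracket or colon syntax, as sketched below.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

suspects = []
for token in tokenizer.get_vocab():
    text = tokenizer.convert_tokens_to_string([token])  # undo byte-level mapping
    if any(ch.isalnum() for ch in text) and any(ch in "()[]{}:" for ch in text):
        suspects.append(text)

print(f"{len(suspects)} vocabulary entries mix identifiers with syntax characters")
print(sorted(suspects)[:10])
```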
AST Integrity Testing
Python code subjected to encode/decode cycles remained parseable by Python's AST parser.
Result:
```
AST PARSE: PASS
```
This demonstrates:
- indentation preservation
- bracket stability
- newline consistency
- syntax-safe decoding
This property is especially important for code-language-model training.
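A sketch of this check, assuming `ast.parse` as the validity test (the probe code is illustrative, not part of the benchmark suite):
```python
import ast

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

source = (
    "async def fetch(session, url):\n"
    "    async with session.get(url) as resp:\n"
    "        return await resp.json()\n"
)
ids = tokenizer(source, add_special_tokens=False)["input_ids"]
round_tripped = tokenizer.decode(ids)

ast.parse(round_tripped)  # raises SyntaxError if indentation or brackets were damaged
print("AST PARSE: PASS")
```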
Mathematical Tokenization Quality
LaTeX Atomicity Score
0.875
The tokenizer preserved many common LaTeX operators as atomic units.
Examples:
| Symbol | Result |
|---|---|
| `\sqrt` | Atomic |
| `\frac` | Atomic |
| `\sum` | Atomic |
| `\int` | Atomic |
| `\alpha` | Atomic |
| `\partial` | Atomic |
Rare-symbol fragmentation still occurs in some cases:
```
\vartheta -> ['\\v', 'artheta']
```
This indicates that the tokenizer is math-aware but not yet fully optimized for frontier symbolic reasoning workloads.
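The atomicity behavior can be probed directly, as sketched below; here a command counts as atomic if it encodes to a single token, which may differ slightly from the scoring used in the benchmark.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

commands = [r"\sqrt", r"\frac", r"\sum", r"\int", r"\alpha", r"\partial", r"\vartheta"]
for cmd in commands:
    ids = tokenizer(cmd, add_special_tokens=False)["input_ids"]
    print(cmd, "atomic" if len(ids) == 1 else tokenizer.convert_ids_to_tokens(ids))
```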
Morphological Evaluation
The tokenizer demonstrated strong segmentation behavior on morphologically rich vocabulary.
Examples:
| Word | Tokenization |
|---|---|
| interoperability | inter + oper + ability |
| hyperparameterization | hyper + parameter + ization |
| counterrevolutionaries | counter + rev + olution + aries |
This suggests:
- good subword reuse
- semantic morpheme retention
- efficient scientific terminology handling
Some residual BPE artifacts remain, indicating mid-frequency merge residue:
```
antidisestablishmentarianism
-> ant + idis + estab + lish + ment + arian + ism
```
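The segmentations above can be reproduced with the tokenizer's subword view (a minimal sketch; byte-level pieces may carry a leading `Ġ` marker for word-initial tokens):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

for word in ["interoperability", "hyperparameterization",
             "counterrevolutionaries", "antidisestablishmentarianism"]:
    print(word, "->", tokenizer.tokenize(word))
```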
Numeric Stability Analysis
The tokenizer currently exhibits moderate numeric consistency.
Examples:
```
890.123456789
-> ['89', '0.', '123456789']

9876543210.000000000001
-> ['987', '65', '432', '10.00', '0000000001']
```
Strengths:
- no unknown tokens
- efficient compression
- lossless decimal round-tripping
Weaknesses:
- inconsistent digit chunking
- fragmented numerical semantics
- unstable precision grouping
Future revisions may benefit from dedicated numeric pretokenization.
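As an illustration of what such a strategy could look like (not part of the released tokenizer), digit runs could be split into fixed-width groups before BPE is applied:
```python
import re

def split_digit_runs(text: str, group: int = 3) -> str:
    """Split every run of digits into fixed-width groups (hypothetical scheme)."""
    def chunk(match: re.Match) -> str:
        digits = match.group(0)
        return " ".join(digits[i:i + group] for i in range(0, len(digits), group))
    return re.sub(r"\d+", chunk, text)

print(split_digit_runs("9876543210.000000000001"))
# -> 987 654 321 0.000 000 000 001
```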
Whitespace & Indentation Behavior
The tokenizer partially compresses indentation patterns.
Examples:
```
4 spaces -> ['ĠĠ', 'ĠĠ']
8 spaces -> ['ĠĠ', 'ĠĠ', 'ĠĠ', 'ĠĠ']
```
This behavior is functional but not yet indentation-semantic.
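The current indentation behavior can be inspected directly with a small probe such as the sketch below; output shapes should follow the examples above.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

for n_spaces in (4, 8):
    print(n_spaces, tokenizer.tokenize(" " * n_spaces + "pass"))
```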
Dedicated indentation tokens could further improve:
- code modeling
- AST consistency
- Python generation quality
Strengths
Major Strengths
- Strong mixed-domain compression
- Excellent reversibility
- AST-safe code preservation
- Good syntax-boundary awareness
- Strong LaTeX operator handling
- Good scientific morphology segmentation
- Unicode-safe byte-level encoding
- Zero unknown tokens during benchmark
Current Limitations
Areas for Improvement
- Numeric chunking consistency
- Rare mathematical symbol coverage
- Indentation-semantic tokenization
- Syntax-aware pretokenization
- Expanded theorem-level LaTeX coverage
Intended Use Cases
Recommended
- General-purpose LLM pretraining
- Coding assistants
- Research copilots
- Scientific language models
- Tool-using agent systems
- Mathematical text generation
- Mixed-domain instruction tuning
Less Ideal
- High-precision arithmetic models
- Frontier symbolic theorem provers
- Compiler-verified code synthesis
- Financial numerical reasoning systems
Research Assessment
Based on mixed-domain evaluation, Copernicus Tokenizer currently falls within:
Advanced / Early Research-Grade
relative to contemporary open-source BPE tokenizers.
The tokenizer substantially outperforms legacy GPT-2 tokenization behavior on:
- compression
- morphology
- code structure
- LaTeX preservation
- Unicode robustness
while remaining fully reversible and structurally stable.
Future Work
Planned future improvements may include:
- syntax-aware code pretokenization
- dedicated numeric tokenization strategies
- extended LaTeX operator vocabularies
- theorem-aware symbolic coverage
- indentation-semantic merges
- multilingual optimization
- adaptive merge allocation
Repository
Training code and tokenizer assets:
github.com/Nj-1111/copernicus-tokenizer
Tokenizer repository:
huggingface.co/Nj-1111/Copernicus-Tokenizer