Copernicus Tokenizer
Overview
Copernicus Tokenizer is a domain-general Byte-Pair Encoding (BPE) tokenizer trained from scratch for large language models operating across heterogeneous reasoning domains, including:
- Natural language
- Source code
- Mathematical notation
- Scientific literature
- Symbol-heavy technical text
- Structured chat and tool-use formatting
The tokenizer was designed to prioritize:
- Reversible decoding integrity
- Structural fidelity for code
- Mathematical symbol preservation
- Vocabulary efficiency under mixed-domain corpora
- Robust multilingual byte-level coverage
The tokenizer uses GPT-2-style byte-level pretokenization combined with custom BPE merge training over approximately 3.96 million documents sourced from code, scientific literature, mathematics, and natural language corpora.
Technical Specifications
| Parameter | Value |
|---|---|
| Tokenizer Type | Byte-Pair Encoding (BPE) |
| Pretokenization | GPT-2 byte-level |
| Vocabulary Size | 55,812 |
| Merge Operations | 55,725 |
| Base Alphabet | 256-byte alphabet |
| Minimum Merge Frequency | 3 |
| Unknown Token | `<\|unk\|>` |
| Padding Token | `<\|pad\|>` |
| BOS/EOS Token | `<\|endoftext\|>` |
| Maximum Sequence Length | 4096 |
| Training Documents | ~3.96M |
| Intended Use | General-purpose LLM pretraining |
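As a point of reference, the following is a minimal sketch of how a BPE tokenizer matching these specifications could be trained with the Hugging Face `tokenizers` library. The released training script is not reproduced in this card; the corpus path and output file name below are placeholders.
```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))

# GPT-2-style byte-level pretokenization over the 256-byte base alphabet.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=55_812,                                    # target size from the table above
    min_frequency=3,                                      # minimum merge frequency from the table above
    special_tokens=["<|endoftext|>", "<|unk|>", "<|pad|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder path; the actual ~3.96M-document corpus is described under "Training Corpus".
tokenizer.train(files=["mixed_domain_corpus.txt"], trainer=trainer)
tokenizer.save("copernicus-tokenizer.json")
```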
Design Goals
The tokenizer was explicitly optimized for mixed-domain reasoning workloads rather than purely conversational English.
Core objectives included:
- Preserving programming-language structure
- Maintaining reversible decode behavior
- Improving compression over legacy GPT-2 BPEs
- Supporting LaTeX and symbolic mathematics
- Avoiding excessive fragmentation of scientific terminology
- Supporting tool-calling and agentic prompting formats
Supported Domains
| Domain | Optimization Goal |
|---|---|
| Natural Language | Compression efficiency + morphology preservation |
| Source Code | Syntax stability + AST-safe decoding |
| Mathematics | LaTeX atomicity + operator preservation |
| Scientific Text | Technical terminology coverage |
| Chat/Agents | Structured conversational formatting |
| Unicode Text | Full byte-level reversibility |
Special Tokens
| Token | Purpose |
|---|---|
| `<\|endoftext\|>` | BOS / EOS |
| `<\|unk\|>` | Unknown token |
| `<\|pad\|>` | Padding |
| `<think>` / `</think>` | Chain-of-thought delimiters |
| `<\|user\|>` | Chat role token |
| `<\|assistant\|>` | Chat role token |
| `<\|system\|>` | Chat role token |
| `<\|im_start\|>` / `<\|im_end\|>` | ChatML formatting |
| `<\|tool_call\|>` | Tool invocation |
| `<\|tool_result\|>` | Tool response |
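A quick sanity check of these reserved tokens is shown below. This is a minimal sketch using only the token names listed in the table; each special token is expected to map to a single id and to decode back unchanged.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Each reserved token should encode to one id and survive decoding intact.
for tok in ["<|endoftext|>", "<|unk|>", "<|pad|>", "<think>", "</think>",
            "<|user|>", "<|assistant|>", "<|system|>"]:
    ids = tokenizer(tok, add_special_tokens=False)["input_ids"]
    print(tok, ids, tokenizer.decode(ids))
```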
Training Corpus
The tokenizer was trained on a heterogeneous multi-domain corpus.
| Domain | Primary Sources |
|---|---|
| Natural Language | Wikipedia, Common Crawl |
| Source Code | The Stack |
| Mathematics | MATH dataset, arXiv |
| Scientific Literature | PubMed, S2ORC |
The corpus intentionally mixed:
- prose
- code
- formulas
- Unicode-heavy text
- markdown
- structured conversations
- technical documentation
This mixture was intended to prevent domain starvation during BPE merge allocation.
Installation
The tokenizer is distributed via the Hugging Face Hub and loads with the `transformers` library (install it with pip if it is not already available):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
```
Example Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

text = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"

encoded = tokenizer(text)
print(encoded["input_ids"])

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
```
Batched Training Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

inputs = tokenizer(
    ["Hello world", "def foo(): pass"],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt",
)
```
Evaluation Methodology
The tokenizer was evaluated using a mixed-domain stress-testing suite designed to benchmark:
- compression efficiency
- structural preservation
- mathematical tokenization quality
- reversibility
- morphology handling
- numeric stability
- code integrity
The benchmark corpus included:
- deeply nested Python syntax
- asynchronous code
- indentation stress tests
- LaTeX equations
- Unicode mathematics
- morphologically rich English
- long decimal sequences
- hexadecimal and binary literals
Baseline comparison was performed against the GPT-2 tokenizer.
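The benchmark harness itself is not included in this card. The sketch below illustrates, under that caveat, how the core compression metrics can be computed for Copernicus against the GPT-2 baseline; the probe strings are placeholders, not the benchmark corpus.
```python
from transformers import AutoTokenizer

def compression_stats(tokenizer, texts):
    n_tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    n_chars = sum(len(t) for t in texts)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return {
        "char_compression": n_chars / n_tokens,  # characters per token
        "byte_compression": n_bytes / n_tokens,  # bytes per token
        "word_fertility": n_tokens / n_words,    # tokens per whitespace-delimited word
    }

# Placeholder probe strings spanning code, LaTeX, and prose.
texts = [
    "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
    r"\frac{\partial f}{\partial x} = 2x + \sum_{i=1}^{n} x_i",
    "Hyperparameterization of interoperable systems, 890.123456789",
]

for name in ["Nj-1111/Copernicus-Tokenizer", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, compression_stats(tok, texts))
```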
Benchmark Results
Core Metrics
| Metric | Copernicus | GPT-2 |
|---|---|---|
| Total Tokens | 12,600 | 14,920 |
| Character Compression Ratio | 2.754 | 2.326 |
| Byte Compression Ratio | 2.870 | 2.424 |
| Word Fertility | 2.601 | 2.872 |
| Entropy | 6.850 | 6.775 |
| Estimated BPT Proxy | 5.726 | 6.715 |
| Reversible Integrity | True | True |
| Unknown Tokens | 0 | 0 |
Interpretation of Metrics
Compression Efficiency
Copernicus demonstrates significantly stronger compression than GPT-2 on mixed-domain technical corpora.
The lower fertility and higher compression ratio indicate:
- better merge efficiency
- stronger domain coverage
- reduced subword fragmentation
- improved vocabulary allocation
The benchmark corpus was intentionally difficult and included:
- source code
- LaTeX
- Unicode mathematics
- technical scientific language
- long numeric sequences
Compression ratios on standard English corpora are expected to exceed those reported for this mixed-domain benchmark.
Reversible Integrity
The tokenizer achieved:
```
decode(encode(text)) == text
```
across the benchmark corpus.
This property is critical for:
- code generation
- compiler-safe decoding
- mathematical reconstruction
- structured prompting
- dataset integrity preservation
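A minimal round-trip check of this property is sketched below, using illustrative probe strings rather than the benchmark corpus.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

probes = [
    "def foo():\n    return 'bar'",
    "∂f/∂x = 2x + ∑_{i=1}^{n} x_i",
    "naïve café résumé",
]
for text in probes:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    assert tokenizer.decode(ids) == text, f"round-trip failed for: {text!r}"
print("All probes round-tripped losslessly")
```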
Structural Purity Evaluation
Structural Purity Score
0.887
The tokenizer largely avoided catastrophic syntax merges.
Examples of acceptable structural tokens:
```
'=='
'<='
'='
```
The tokenizer successfully avoided highly destructive merges such as:
```
foo:
(variable
]])
```
This indicates relatively strong syntax-boundary preservation.
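The structural-purity scoring method is not published with this card; a rough spot check in the same spirit is to scan the vocabulary for entries that fuse identifiers with bracket or colon syntax, as sketched below.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

suspects = []
for token in tokenizer.get_vocab():
    text = tokenizer.convert_tokens_to_string([token])  # undo byte-level mapping
    if any(ch.isalnum() for ch in text) and any(ch in "()[]{}:" for ch in text):
        suspects.append(text)

print(f"{len(suspects)} vocabulary entries mix identifiers with syntax characters")
print(sorted(suspects)[:10])
```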
AST Integrity Testing
Python code subjected to encode/decode cycles remained parseable by Python's AST parser.
Result:
```
AST PARSE: PASS
```
This demonstrates:
- indentation preservation
- bracket stability
- newline consistency
- syntax-safe decoding
This property is especially important for code-language-model training.
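A sketch of this check, assuming `ast.parse` as the validity test (the probe code is illustrative, not part of the benchmark suite):
```python
import ast

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

source = (
    "async def fetch(session, url):\n"
    "    async with session.get(url) as resp:\n"
    "        return await resp.json()\n"
)
ids = tokenizer(source, add_special_tokens=False)["input_ids"]
round_tripped = tokenizer.decode(ids)

ast.parse(round_tripped)  # raises SyntaxError if indentation or brackets were damaged
print("AST PARSE: PASS")
```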
Mathematical Tokenization Quality
LaTeX Atomicity Score
0.875
The tokenizer preserved many common LaTeX operators as atomic units.
Examples:
| Symbol | Result |
|---|---|
| `\sqrt` | Atomic |
| `\frac` | Atomic |
| `\sum` | Atomic |
| `\int` | Atomic |
| `\alpha` | Atomic |
| `\partial` | Atomic |
Rare-symbol fragmentation still occurs in some cases:
```
\vartheta -> ['\\v', 'artheta']
```
This indicates that the tokenizer is math-aware but not yet fully optimized for frontier symbolic reasoning workloads.
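The atomicity behavior can be probed directly, as sketched below; here a command counts as atomic if it encodes to a single token, which may differ slightly from the scoring used in the benchmark.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

commands = [r"\sqrt", r"\frac", r"\sum", r"\int", r"\alpha", r"\partial", r"\vartheta"]
for cmd in commands:
    ids = tokenizer(cmd, add_special_tokens=False)["input_ids"]
    print(cmd, "atomic" if len(ids) == 1 else tokenizer.convert_ids_to_tokens(ids))
```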
Morphological Evaluation
The tokenizer demonstrated strong segmentation behavior on morphologically rich vocabulary.
Examples:
| Word | Tokenization |
|---|---|
| interoperability | inter + oper + ability |
| hyperparameterization | hyper + parameter + ization |
| counterrevolutionaries | counter + rev + olution + aries |
This suggests:
- good subword reuse
- semantic morpheme retention
- efficient scientific terminology handling
Some residual BPE artifacts remain, indicating mid-frequency merge residue:
```
antidisestablishmentarianism
-> ant + idis + estab + lish + ment + arian + ism
```
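The segmentations above can be reproduced with the tokenizer's subword view (a minimal sketch; byte-level pieces may carry a leading `Ġ` marker for word-initial tokens):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

for word in ["interoperability", "hyperparameterization",
             "counterrevolutionaries", "antidisestablishmentarianism"]:
    print(word, "->", tokenizer.tokenize(word))
```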
Numeric Stability Analysis
The tokenizer currently exhibits moderate numeric consistency.
Examples:
```
890.123456789
-> ['89', '0.', '123456789']

9876543210.000000000001
-> ['987', '65', '432', '10.00', '0000000001']
```
Strengths:
- no unknown tokens
- efficient compression
- lossless decimal round-tripping
Weaknesses:
- inconsistent digit chunking
- fragmented numerical semantics
- unstable precision grouping
Future revisions may benefit from dedicated numeric pretokenization.
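As an illustration of what such a strategy could look like (not part of the released tokenizer), digit runs could be split into fixed-width groups before BPE is applied:
```python
import re

def split_digit_runs(text: str, group: int = 3) -> str:
    """Split every run of digits into fixed-width groups (hypothetical scheme)."""
    def chunk(match: re.Match) -> str:
        digits = match.group(0)
        return " ".join(digits[i:i + group] for i in range(0, len(digits), group))
    return re.sub(r"\d+", chunk, text)

print(split_digit_runs("9876543210.000000000001"))
# -> 987 654 321 0.000 000 000 001
```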
Whitespace & Indentation Behavior
The tokenizer partially compresses indentation patterns.
Examples:
```
4 spaces -> ['ĠĠ', 'ĠĠ']
8 spaces -> ['ĠĠ', 'ĠĠ', 'ĠĠ', 'ĠĠ']
```
This behavior is functional but not yet indentation-semantic.
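The current indentation behavior can be inspected directly with a small probe such as the sketch below; output shapes should follow the examples above.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

for n_spaces in (4, 8):
    print(n_spaces, tokenizer.tokenize(" " * n_spaces + "pass"))
```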
Dedicated indentation tokens could further improve:
- code modeling
- AST consistency
- Python generation quality
Strengths
Major Strengths
- Strong mixed-domain compression
- Excellent reversibility
- AST-safe code preservation
- Good syntax-boundary awareness
- Strong LaTeX operator handling
- Good scientific morphology segmentation
- Unicode-safe byte-level encoding
- Zero unknown tokens during benchmark
Current Limitations
Areas for Improvement
- Numeric chunking consistency
- Rare mathematical symbol coverage
- Indentation-semantic tokenization
- Syntax-aware pretokenization
- Expanded theorem-level LaTeX coverage
Intended Use Cases
Recommended
- General-purpose LLM pretraining
- Coding assistants
- Research copilots
- Scientific language models
- Tool-using agent systems
- Mathematical text generation
- Mixed-domain instruction tuning
Less Ideal
- High-precision arithmetic models
- Frontier symbolic theorem provers
- Compiler-verified code synthesis
- Financial numerical reasoning systems
Research Assessment
Based on mixed-domain evaluation, Copernicus Tokenizer currently falls within:
Advanced / Early Research-Grade
relative to contemporary open-source BPE tokenizers.
The tokenizer substantially outperforms legacy GPT-2 tokenization behavior on:
- compression
- morphology
- code structure
- LaTeX preservation
- Unicode robustness
while remaining fully reversible and structurally stable.
Future Work
Planned future improvements may include:
- syntax-aware code pretokenization
- dedicated numeric tokenization strategies
- extended LaTeX operator vocabularies
- theorem-aware symbolic coverage
- indentation-semantic merges
- multilingual optimization
- adaptive merge allocation
Repository
Training code and tokenizer assets:
github.com/Nj-1111/copernicus-tokenizer
Tokenizer repository:
huggingface.co/Nj-1111/Copernicus-Tokenizer