
Copernicus Tokenizer

Overview

Copernicus Tokenizer is a domain-general Byte-Pair Encoding (BPE) tokenizer trained from scratch for large language models operating across heterogeneous reasoning domains, including:

  • Natural language
  • Source code
  • Mathematical notation
  • Scientific literature
  • Symbol-heavy technical text
  • Structured chat and tool-use formatting

The tokenizer was designed to prioritize:

  1. Reversible decoding integrity
  2. Structural fidelity for code
  3. Mathematical symbol preservation
  4. Vocabulary efficiency under mixed-domain corpora
  5. Robust multilingual byte-level coverage

The tokenizer uses GPT-2-style byte-level pretokenization combined with custom BPE merge training over approximately 3.96 million documents sourced from code, scientific literature, mathematics, and natural language corpora.
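
The exact training script is not reproduced here, but a minimal sketch of how a tokenizer with these settings could be trained with the Hugging Face tokenizers library is shown below. The file name corpus.txt is a placeholder for the actual multi-domain corpus, and the special-token list is abbreviated.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE matching the specification table below.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=55812,                                        # target vocabulary size
    min_frequency=3,                                         # minimum merge frequency
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),    # 256-byte base alphabet
    special_tokens=["<endoftext>", "<unk>", "<pad>"],        # abbreviated special-token list
)

# corpus.txt stands in for the ~3.96M-document mixed-domain corpus.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("copernicus-tokenizer.json")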


Technical Specifications

Parameter                    Value
Tokenizer Type               Byte-Pair Encoding (BPE)
Pretokenization              GPT-2 byte-level
Vocabulary Size              55,812
Merge Operations             55,725
Base Alphabet                256-byte alphabet
Minimum Merge Frequency      3
Unknown Token                `<unk>`
Padding Token                `<pad>`
BOS/EOS Token                `<endoftext>`
Maximum Sequence Length      4096
Training Documents           ~3.96M
Intended Use                 General-purpose LLM pretraining

Design Goals

The tokenizer was explicitly optimized for mixed-domain reasoning workloads rather than purely conversational English.

Core objectives included:

  • Preserving programming-language structure
  • Maintaining reversible decode behavior
  • Improving compression over legacy GPT-2 BPEs
  • Supporting LaTeX and symbolic mathematics
  • Avoiding excessive fragmentation of scientific terminology
  • Supporting tool-calling and agentic prompting formats

Supported Domains

Domain              Optimization Goal
Natural Language    Compression efficiency + morphology preservation
Source Code         Syntax stability + AST-safe decoding
Mathematics         LaTeX atomicity + operator preservation
Scientific Text     Technical terminology coverage
Chat/Agents         Structured conversational formatting
Unicode Text        Full byte-level reversibility

Special Tokens

Token                        Purpose
`<endoftext>`                BOS / EOS
`<unk>`                      Unknown token
`<pad>`                      Padding
`<think>` / `</think>`       Chain-of-thought delimiters
`<user>`                     Chat role token
`<assistant>`                Chat role token
`<system>`                   Chat role token
`<im_start>` / `<im_end>`    ChatML formatting
`<tool_call>`                Tool invocation
`<tool_result>`              Tool response
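
The tokens actually registered with the released tokenizer can be inspected after loading. The snippet below also shows one hypothetical chat layout built from these tokens; the card does not prescribe an official chat template, so the exact formatting is an assumption.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Inspect which tokens are registered as special tokens.
print(tokenizer.special_tokens_map)
print(tokenizer.additional_special_tokens)

# Hypothetical chat layout using the role and reasoning tokens listed above.
prompt = (
    "<system>You are a helpful assistant.<endoftext>"
    "<user>Factor x^2 - 1.<endoftext>"
    "<assistant><think>Difference of squares.</think>(x - 1)(x + 1)<endoftext>"
)
print(tokenizer.convert_ids_to_tokens(tokenizer(prompt)["input_ids"])[:12])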

Training Corpus

The tokenizer was trained on a heterogeneous multi-domain corpus.

Domain                   Primary Sources
Natural Language         Wikipedia, Common Crawl
Source Code              The Stack
Mathematics              MATH dataset, arXiv
Scientific Literature    PubMed, S2ORC

The corpus intentionally mixed:

  • prose
  • code
  • formulas
  • Unicode-heavy text
  • markdown
  • structured conversations
  • technical documentation

This mixture was intended to prevent domain starvation during BPE merge allocation.


Installation
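
The tokenizer is distributed in Hugging Face format, so the only requirement is the transformers library (which also provides the fast tokenizers backend):

pip install transformers

Then load it directly from the Hub: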

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Nj-1111/Copernicus-Tokenizer"
)

Example Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Nj-1111/Copernicus-Tokenizer"
)

text = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"

encoded = tokenizer(text)
print(encoded["input_ids"])

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)

Batched Training Usage

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "Nj-1111/Copernicus-Tokenizer"
)

inputs = tokenizer(
    [
        "Hello world",
        "def foo(): pass"
    ],
    truncation=True,
    max_length=2048,
    padding="max_length",
    return_tensors="pt"
)

Evaluation Methodology

The tokenizer was evaluated using a mixed-domain stress-testing suite designed to benchmark:

  • compression efficiency
  • structural preservation
  • mathematical tokenization quality
  • reversibility
  • morphology handling
  • numeric stability
  • code integrity

The benchmark corpus included:

  • deeply nested Python syntax
  • asynchronous code
  • indentation stress tests
  • LaTeX equations
  • Unicode mathematics
  • morphologically rich English
  • long decimal sequences
  • hexadecimal and binary literals

Baseline comparison was performed against the GPT-2 tokenizer.
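
The benchmark suite itself is not published with this card, but the ratio metrics can be reproduced in spirit along the following lines. The sample string and the whitespace word-splitting rule used for fertility are assumptions rather than the official definitions.

from transformers import AutoTokenizer

copernicus = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical mixed-domain sample; the real benchmark corpus is described above.
sample = "def norm(v):\n    return sum(x**2 for x in v) ** 0.5  # \\sqrt{\\sum_i x_i^2}"

def report(name, tok, text):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    chars_per_token = len(text) / len(ids)                  # character compression ratio
    bytes_per_token = len(text.encode("utf-8")) / len(ids)  # byte compression ratio
    fertility = len(ids) / len(text.split())                # tokens per whitespace word
    print(f"{name}: tokens={len(ids)} chars/tok={chars_per_token:.3f} "
          f"bytes/tok={bytes_per_token:.3f} fertility={fertility:.3f}")

report("Copernicus", copernicus, sample)
report("GPT-2", gpt2, sample)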


Benchmark Results

Core Metrics

Metric                         Copernicus    GPT-2
Total Tokens                   12,600        14,920
Character Compression Ratio    2.754         2.326
Byte Compression Ratio         2.870         2.424
Word Fertility                 2.601         2.872
Entropy                        6.850         6.775
Estimated BPT Proxy            5.726         6.715
Reversible Integrity           True          True
Unknown Tokens                 0             0

Interpretation of Metrics

Compression Efficiency

Copernicus demonstrates significantly stronger compression than GPT-2 on mixed-domain technical corpora.

The lower fertility and higher compression ratio indicate:

  • better merge efficiency
  • stronger domain coverage
  • reduced subword fragmentation
  • improved vocabulary allocation

The benchmark corpus was intentionally difficult and included:

  • source code
  • LaTeX
  • Unicode mathematics
  • technical scientific language
  • long numeric sequences

Performance on standard English corpora is expected to exceed the reported mixed-domain ratios.


Reversible Integrity

The tokenizer achieved:

decode(encode(text)) == text

across the benchmark corpus.
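
A quick spot-check of the round-trip property can be run on ad-hoc strings (these samples are illustrative, not the benchmark corpus):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

samples = [
    "def f(x):\n\treturn x ** 2",
    "\\int_0^1 x^2 \\, dx = 1/3",
    "naïve café ≤ 0.5µm",
]
for text in samples:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    round_trip = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
    print(round_trip == text, repr(text))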

This property is critical for:

  • code generation
  • compiler-safe decoding
  • mathematical reconstruction
  • structured prompting
  • dataset integrity preservation

Structural Purity Evaluation

Structural Purity Score

0.887

The tokenizer largely avoided catastrophic syntax merges.

Examples of acceptable structural tokens:

'=='
'<='
'='

The tokenizer successfully avoided highly destructive merges such as:

foo:
(variable
]])

This indicates relatively strong syntax-boundary preservation.


AST Integrity Testing

Python code subjected to encode/decode cycles remained parseable by Python's AST parser.

Result:

AST PARSE: PASS

This demonstrates:

  • indentation preservation
  • bracket stability
  • newline consistency
  • syntax-safe decoding

This property is especially important for code-language-model training.
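
The same check can be reproduced on arbitrary snippets; a small illustration (not the original benchmark code) follows.

import ast
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Small async snippet standing in for the benchmark's nested/async stress tests.
source = (
    "async def fetch(session, url):\n"
    "    async with session.get(url) as resp:\n"
    "        return await resp.json()\n"
)
ids = tokenizer(source, add_special_tokens=False)["input_ids"]
decoded = tokenizer.decode(ids, clean_up_tokenization_spaces=False)

ast.parse(decoded)  # raises SyntaxError if indentation, brackets, or newlines were corrupted
print("AST PARSE: PASS")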


Mathematical Tokenization Quality

LaTeX Atomicity Score

0.875

The tokenizer preserved many common LaTeX operators as atomic units.

Examples:

Symbol       Result
\sqrt        Atomic
\frac        Atomic
\sum         Atomic
\int         Atomic
\alpha       Atomic
\partial     Atomic

Rare-symbol fragmentation still occurs in some cases:

\vartheta -> ['\\v', 'artheta']

This indicates that the tokenizer is math-aware but not yet fully optimized for frontier symbolic reasoning workloads.
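
Atomicity of individual commands can be probed by counting the subword pieces each one maps to; the command list below is just a sample.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

for cmd in ["\\sqrt", "\\frac", "\\sum", "\\int", "\\alpha", "\\partial", "\\vartheta"]:
    ids = tokenizer(cmd, add_special_tokens=False)["input_ids"]
    pieces = tokenizer.convert_ids_to_tokens(ids)
    print(cmd, "-> atomic" if len(pieces) == 1 else f"-> {pieces}")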


Morphological Evaluation

The tokenizer demonstrated strong segmentation behavior on morphologically rich vocabulary.

Examples:

Word                      Tokenization
interoperability          inter + oper + ability
hyperparameterization     hyper + parameter + ization
counterrevolutionaries    counter + rev + olution + aries

This suggests:

  • good subword reuse
  • semantic morpheme retention
  • efficient scientific terminology handling

Some residual BPE artifacts remain:

antidisestablishmentarianism
-> ant + idis + estab + lish + ment + arian + ism

indicating mid-frequency merge residue.
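
Segmentations like those above can be inspected directly; the helper below strips the byte-level space marker (Ġ) so morpheme boundaries are easier to read.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

def morphemes(word):
    # A leading space selects the word-initial (Ġ-prefixed) byte-level form.
    ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
    return " + ".join(t.replace("Ġ", "") for t in tokenizer.convert_ids_to_tokens(ids))

for word in ["interoperability", "hyperparameterization", "antidisestablishmentarianism"]:
    print(word, "->", morphemes(word))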


Numeric Stability Analysis

The tokenizer currently exhibits moderate numeric consistency.

Examples:

890.123456789
-> ['89', '0.', '123456789']
9876543210.000000000001
-> ['987', '65', '432', '10.00', '0000000001']

Strengths:

  • no unknown tokens
  • efficient compression
  • stable decimal preservation

Weaknesses:

  • inconsistent digit chunking
  • fragmented numerical semantics
  • unstable precision grouping

Future revisions may benefit from dedicated numeric pretokenization.
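
One direction such a revision could take (not implemented in the released tokenizer) is a digit-splitting pretokenizer placed in front of the byte-level step, as supported by the Hugging Face tokenizers library:

from tokenizers import pre_tokenizers

# Hypothetical numeric-aware pretokenization: every digit becomes its own unit
# before BPE merges apply, so "890.123456789" always chunks the same way.
numeric_pretok = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
print(numeric_pretok.pre_tokenize_str("890.123456789"))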


Whitespace & Indentation Behavior

The tokenizer partially compresses indentation patterns.

Examples:

4 spaces -> ['ĠĠ', 'ĠĠ']
8 spaces -> ['ĠĠ', 'ĠĠ', 'ĠĠ', 'ĠĠ']

This behavior is functional but not yet indentation-semantic.

Dedicated indentation tokens could further improve:

  • code modeling
  • AST consistency
  • Python generation quality
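
As a rough illustration of that idea, whole indentation units could be registered as added tokens on top of the existing vocabulary. This is hypothetical, and any model built on the modified tokenizer would need its embedding matrix resized.

from transformers import AutoTokenizer
from tokenizers import AddedToken

tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

# Treat one and two indentation levels (4 and 8 spaces) as single tokens.
tokenizer.add_tokens([
    AddedToken("    ", normalized=False),
    AddedToken("        ", normalized=False),
])
print(tokenizer.tokenize("        return x"))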

Strengths

Major Strengths

  • Strong mixed-domain compression
  • Excellent reversibility
  • AST-safe code preservation
  • Good syntax-boundary awareness
  • Strong LaTeX operator handling
  • Good scientific morphology segmentation
  • Unicode-safe byte-level encoding
  • Zero unknown tokens during benchmark

Current Limitations

Areas for Improvement

  • Numeric chunking consistency
  • Rare mathematical symbol coverage
  • Indentation-semantic tokenization
  • Syntax-aware pretokenization
  • Expanded theorem-level LaTeX coverage

Intended Use Cases

Recommended

  • General-purpose LLM pretraining
  • Coding assistants
  • Research copilots
  • Scientific language models
  • Tool-using agent systems
  • Mathematical text generation
  • Mixed-domain instruction tuning

Less Ideal

  • High-precision arithmetic models
  • Frontier symbolic theorem provers
  • Compiler-verified code synthesis
  • Financial numerical reasoning systems

Research Assessment

Based on mixed-domain evaluation, Copernicus Tokenizer currently falls within:

Advanced / Early Research-Grade

relative to contemporary open-source BPE tokenizers.

The tokenizer substantially outperforms legacy GPT-2 tokenization behavior on:

  • compression
  • morphology
  • code structure
  • LaTeX preservation
  • Unicode robustness

while remaining fully reversible and structurally stable.


Future Work

Planned future improvements may include:

  • syntax-aware code pretokenization
  • dedicated numeric tokenization strategies
  • extended LaTeX operator vocabularies
  • theorem-aware symbolic coverage
  • indentation-semantic merges
  • multilingual optimization
  • adaptive merge allocation

Repository

Training code and tokenizer assets:

github.com/Nj-1111/copernicus-tokenizer

Tokenizer repository:

huggingface.co/Nj-1111/Copernicus-Tokenizer