ThingAI
/

DwarfGoToken

Model card Files Files and versions

DwarfGoToken / README.md

ThingsAI's picture

Create README.md

c527fc5 verified 5 days ago

|

History Blame Contribute Delete

2.8 kB

	---
	language: en
	tags:
	- tokenizer
	- bpe
	- shell
	- code
	- chatml
	license: apache-2.0
	datasets:
	- bigcode/the-stack-dedup
	- m-a-p/CodeFeedback-Filtered-Instruction
	- HuggingFaceH4/helpful-instructions
	- HuggingFaceFW/fineweb
	- Magpie-Align/Magpie-Reasoning-150K
	---

	# DwarfGoToken

	A compact BPE tokenizer (8,192 tokens) designed for tiny language models that need to understand shell commands, code snippets, and ChatML-formatted conversations. Built on top of a custom Go pre‑tokenizer that keeps critical shell tokens (`grep`, `chmod`, `2>&1`, `-rf`, …) atomic, avoiding the fragmentation that kills performance on CPU-bound inference.

	## Why 8,192 tokens?

	For a small LM (<20M parameters), a large vocabulary (e.g., 64K) wastes the majority of the model’s parameters on the embedding matrix. With `d_model=256`, the embedding here accounts for only 2.1M parameters (~14%) — the rest goes into the transformer layers, where it matters most.

	## Corpus

	\| Source \| Domain \| Lines \|
	\|--------\|--------\|-------\|
	\| `bigcode/the-stack-dedup/shell` \| Shell \| 1,500,000 \|
	\| `bigcode/the-stack-dedup/batchfile` \| Batch \| 500,000 \|
	\| `bigcode/the-stack-dedup/python` \| Python \| 1,000,000 \|
	\| `bigcode/the-stack-dedup/c` \| C \| 500,000 \|
	\| `m-a-p/CodeFeedback-Filtered-Instruction` \| Code+Instructions \| 200,000 \|
	\| `HuggingFaceH4/helpful-instructions` \| English instructions \| 150,000 \|
	\| `HuggingFaceFW/fineweb/sample-10BT` \| Web English \| 300,000 \|
	\| `Magpie-Align/Magpie-Reasoning-150K` \| Chain-of-Thought \| 200,000 \|

	Total: 4,251,427 lines (3.5 GB) — 47% Shell, 40% Code, 9.5% EN, 3.5% CoT.

	## Special tokens (all atomic)

	`<s>`, `</s>`, `<unk>`, `<pad>`, `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>`, `<\|end\|>`, `<\|thinking\|>`, `<\|/thinking\|>`, plus 54 Go‑pre‑tokenizer tokens (e.g., `grep`, `chmod`, `2>&1`, `&&`, `>>`, `-rf`, `--help`).

	## Quick test

	```python
	from transformers import AutoTokenizer
	tok = AutoTokenizer.from_pretrained("ThingAI/DwarfGoToken")

	# Shell commands stay atomic
	tok.tokenize("find /var/log -name '*.gz' \| xargs rm -rf")
	# → ['find', '/', 'var', '/', 'log', '-n', 'ame', "'", '*.', 'gz', "'", '\|', 'xargs', 'rm', '-rf']

	# ChatML template
	tok.tokenize("<\|user\|>\nCosa fa grep?\n<\|end\|>\n<\|assistant\|>\n...")
	# → ['<\|user\|>', 'C', 'os', 'a', 'fa', 'grep', '?', '<\|end\|>', '<\|assistant\|>', '...']
	```
	## Usage
	```python
	from transformers import AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("ThingAI/DwarfGoToken")
	```
	## Intended use
	This tokenizer was built to pair with tiny LMs (~10–20M parameters) specialised in command‑line assistance, shell scripting, or code generation. It’s the companion of the Dwarf model family by ThingsAI.
	## License
	Apache 2.0 — use it, modify it, ship it.