---
license: other
language:
- ko
- en
tags:
- hrm-text
- korean
- terminal
- tool-use
- code
- pretraining
- prefix-lm
library_name: transformers
---
# KoHRM-Text-1.4B
**Language:** [English](#english) | [Korean](#korean)
<a id="english"></a>
## English
`KoHRM-Text-1.4B` is a scratch-pretrained Korean/English/code/terminal/tool-use model built from the `sapientinc/HRM-Text` PrefixLM training stack.
This is **not** a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.
### Current Status
This repository is a rolling **latest public model export**. Training is still in progress.
- Main repo: `LLM-OS-Models/KoHRM-Text-1.4B`
- Current public files: `model.safetensors`, `config.json`, tokenizer files, and this `README.md`
- Raw FSDP2 resume checkpoints: `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints`
- Prepared data: `LLM-OS-Models/KoHRM-Text-1.4B-prepared-data`
- Project code: https://github.com/LLM-OS-Models/KoHRM-text
- Upstream HRM-Text code: https://github.com/sapientinc/HRM-Text
- HRM-Text paper: https://arxiv.org/html/2605.20613
- Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`
The main branch is overwritten with the newest converted EMA `safetensors` export each time a training checkpoint is uploaded. To test the latest public weights, download with `revision="main"`.
### Important Compatibility Note
The public repo currently contains the converted model weights and tokenizer, but it does **not yet** include a Hugging Face `trust_remote_code` modeling implementation for `HrmTextForCausalLM`.
What works today:
- Download the latest public weights.
- Load the tokenizer with `AutoTokenizer`.
- Inspect `config.json`.
- Verify `model.safetensors` on CPU or Colab T4.
What is not supported yet in plain Transformers:
- `AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")`
- One-line hosted text generation from this repo
Expected reason: `model_type: "hrm_text"` is a custom HRM-Text architecture. Public generation will require adding the compatible `HrmTextForCausalLM` remote-code files to this model repo or releasing a standard wrapper.
### Model Details
| Field | Value |
|---|---:|
| Model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
| Standard name | `KoHRM-Text-1.4B` |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | `XL` |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Public export dtype | bfloat16 EMA `safetensors` |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from upstream HRM-Text |
| EMA | 0.9999 |
Converted config highlights:
```json
{
"model_type": "hrm_text",
"architectures": ["HrmTextForCausalLM"],
"vocab_size": 131072,
"hidden_size": 1536,
"num_hidden_layers": 32,
"num_attention_heads": 12,
"max_position_embeddings": 4096,
"prefix_lm": true
}
```
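As a rough sanity check on these values, the sketch below derives the per-head width and the size of a single token-embedding matrix. The total parameter count is taken from the Model Details table. Whether the input and output embeddings are tied is not stated in this card, so the embedding share is only an estimate for one `vocab_size x hidden_size` matrix.

```python
import json

# Config highlights copied from the block above.
config = json.loads("""
{
  "model_type": "hrm_text",
  "vocab_size": 131072,
  "hidden_size": 1536,
  "num_hidden_layers": 32,
  "num_attention_heads": 12,
  "max_position_embeddings": 4096
}
""")

TOTAL_PARAMS = 1_384_120_320  # from the Model Details table

# Per-head width, assuming hidden_size is split evenly across heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Size of one vocab_size x hidden_size embedding matrix (tying is an open assumption).
embedding_params = config["vocab_size"] * config["hidden_size"]
embedding_share = embedding_params / TOTAL_PARAMS

print("head_dim:", head_dim)
print("embedding params:", f"{embedding_params:,}")
print("share of total:", f"{embedding_share:.1%}")
```

With a 131K vocabulary, one embedding matrix alone is a noticeable fraction of the model, which is one reason the larger tokenizer affects training cost.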
### Compared With The HRM-Text Paper
This run can take longer than the paper recipe even on 8 x H200 because the setup is not identical:
- The paper reference used 16 x H100; this run uses 8 x H200.
- KoHRM uses a larger 131K tokenizer vocabulary, compared with the upstream 65K tokenizer.
- The public KoHRM size is about 1.38B parameters.
- The stable long-run batch is `180,224` tokens/step, chosen after OOM probing; larger batches ran briefly but were not adopted for reliability reasons.
- The continuation includes extra Korean, terminal, tool-call, legal, finance, wiki, and repeated HRM-cleaned stages.
This does not automatically guarantee better benchmark scores. The expected upside is domain-specific: Korean tokenization efficiency, Korean legal/finance/wiki coverage, terminal trajectories, tool-call formatting, and code-oriented behavior should have a better chance than the upstream English/general checkpoint. Final claims require evaluation after the planned continuation and SFT finish.
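To put the batch size in context, the arithmetic below converts `180,224` tokens/step into full-length sequences and into a per-GPU token budget. It is a sketch that assumes the batch is split evenly across the 8 GPUs; the actual packing and sharding used by the training stack are not described here.

```python
TOKENS_PER_STEP = 180_224  # stable long-run batch from the list above
CONTEXT_LENGTH = 4_096     # from the Model Details table
NUM_GPUS = 8               # 8 x H200

# Number of full-length sequences that one optimizer step is equivalent to.
sequences_per_step, remainder = divmod(TOKENS_PER_STEP, CONTEXT_LENGTH)

# Even split of the token budget across GPUs (assumes pure data parallelism).
tokens_per_gpu = TOKENS_PER_STEP // NUM_GPUS

print("full-length sequences per step:", sequences_per_step, "remainder:", remainder)
print("tokens per GPU per step:", f"{tokens_per_gpu:,}")
```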
### Tokenizer
The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It keeps common chat/tool special tokens as stable single tokens where possible.
| Sample bucket | chars/token |
|---|---:|
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |
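
The chars/token ratios can be turned into a rough estimate of how much text fits in the 4,096-token context. This is a minimal sketch that simply multiplies the measured ratios by the context length; real inputs will vary, and the helper name is hypothetical.

```python
CONTEXT_TOKENS = 4_096  # context length from the Model Details table

# Ratios copied from the table above.
chars_per_token = {
    "Korean general text": 2.60,
    "Korean legal text": 2.36,
    "Korean terminal instruction": 2.18,
    "shell command": 2.68,
    "tool-call JSON": 3.32,
    "Python code": 3.37,
    "English": 4.40,
}

def approx_chars_in_context(ratio: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Approximate number of characters that fill the context at a given ratio."""
    return round(ratio * context_tokens)

capacity = {bucket: approx_chars_in_context(r) for bucket, r in chars_per_token.items()}

for bucket, chars in capacity.items():
    print(f"{bucket:30s} ~{chars:,} chars")
```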
Formatting tokens:
```text
<|im_start|> instruction start
<|im_end|> instruction end
<|box_end|> response/end marker
<|object_ref_start|> direct condition
<|object_ref_end|> chain-of-thought style condition
<|quad_start|> noisy condition
<|quad_end|> synthetic condition
```
Prompt format used by the project-side inference code:
```text
<|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>
```
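The prompt layout can be sketched as a small string helper. The token strings come from the formatting-token list above; the short condition names other than `"direct"` and the function name `build_prompt` are hypothetical labels chosen for this illustration, not part of the project API.

```python
# Condition tokens as listed in the formatting-token block.
# Only "direct" is a name used by the project; the other keys are illustrative.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synthetic": "<|quad_end|>",
}

def build_prompt(user_prompt: str, condition: str = "direct") -> str:
    """Wrap a user prompt in the instruction markers with a condition token."""
    try:
        condition_token = CONDITION_TOKENS[condition]
    except KeyError:
        raise ValueError(f"unknown condition: {condition!r}") from None
    return f"<|im_start|>{condition_token}{user_prompt}<|im_end|>"

print(build_prompt("List the 10 largest files in the current directory."))
```

When tokenizing such a string, pass `add_special_tokens=False` as in the quick test so that no extra markers are added around it.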
### CPU / Colab T4 Quick Test
Use this to test the **latest public weight files** on CPU or a Colab T4 runtime. This verifies that the tokenizer, config, and `model.safetensors` are downloadable and readable.
It does not run text generation yet, because the public repo does not yet ship the custom HRM-Text modeling wrapper.
```python
!pip -q install -U huggingface_hub transformers safetensors accelerate
```
```python
from pathlib import Path
import json
import torch
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
from safetensors.torch import load_file
repo_id = "LLM-OS-Models/KoHRM-Text-1.4B"
repo_dir = Path(snapshot_download(
    repo_id,
    revision="main",
    allow_patterns=[
        "README.md",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "model.safetensors",
    ],
))
print("Downloaded to:", repo_dir)
print("Runtime:", "cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
config = json.loads((repo_dir / "config.json").read_text())
print("model_type:", config["model_type"])
print("hidden_size:", config["hidden_size"])
print("vocab_size:", config["vocab_size"])
print("context:", config["max_position_embeddings"])
tokenizer = AutoTokenizer.from_pretrained(repo_dir, use_fast=True)
prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리에서 가장 큰 파일 10개를 찾는 명령을 알려주세요.<|im_end|>"
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print("prompt tokens:", len(ids))
print("first token ids:", ids[:20])
# CPU weight integrity check. This loads about 2.8GB of bf16 weights into CPU RAM.
state = load_file(str(repo_dir / "model.safetensors"), device="cpu")
num_tensors = len(state)
num_params = sum(t.numel() for t in state.values())
first_key = next(iter(state))
print("num_tensors:", num_tensors)
print("num_params:", f"{num_params:,}")
print("first tensor:", first_key, tuple(state[first_key].shape), state[first_key].dtype)
```
Expected result:
- `model_type` should be `hrm_text`.
- `vocab_size` should be `131072`.
- `num_params` should be around `1.38B`.
- Tokenizer loading should work on CPU and Colab T4.
- `AutoModelForCausalLM` generation is expected to be unavailable until remote-code support is added.
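The "about 2.8GB" figure in the code comment follows from the parameter count. The sketch below reproduces that arithmetic, assuming every tensor in the file is stored as bfloat16 (2 bytes per value) and ignoring the small `safetensors` header.

```python
NUM_PARAMS = 1_384_120_320  # from the Model Details table
BYTES_PER_BF16 = 2          # bfloat16 uses 16 bits per value

weight_bytes = NUM_PARAMS * BYTES_PER_BF16
weight_gb = weight_bytes / 1e9      # decimal gigabytes
weight_gib = weight_bytes / 2**30   # binary gibibytes

print(f"{weight_bytes:,} bytes = {weight_gb:.2f} GB = {weight_gib:.2f} GiB")
```

If the `num_params` printed by the quick test differs noticeably from this count, the export on `main` may have changed since this card was written.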
If you try this:
```python
from transformers import AutoModelForCausalLM
AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")
```
and it fails with an unknown `hrm_text` architecture, that is expected for the current public export.
### Internal / Project-Side Generation
For actual generation today, use the project code and raw FSDP2 checkpoints. This is the currently supported copy-paste path for CUDA machines. A BF16-capable GPU with enough VRAM is recommended; Colab T4 is useful for the smoke test above, not for this raw-checkpoint generation path.
```bash
git clone https://github.com/LLM-OS-Models/KoHRM-text
cd KoHRM-text
python -m venv .venv
source .venv/bin/activate
pip install -U pip wheel
pip install -r requirements.txt
pip install -U "huggingface_hub[cli]"
export TOKENIZERS_PARALLELISM=false
export NUMEXPR_MAX_THREADS=128
```
Download a raw checkpoint. This example uses `stage1b-hrm-fastcap-repeat-step310000`, which is available in the raw checkpoint repo. When a newer raw checkpoint is uploaded, change the include path, `ckpt_dir`, and `ckpt_step` together.
```bash
mkdir -p checkpoints/kohm-raw
huggingface-cli download LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints \
--include "stage1b-hrm-fastcap-repeat-step310000/**" \
--local-dir checkpoints/kohm-raw
```
Create and run a minimal generation script:
```bash
cat > run_kohrm_raw_generate.py <<'PY'
import os
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("NUMEXPR_MAX_THREADS", "128")
from simple_inference_engine import inference_load_checkpoint, inference_generate
ckpt_dir = "checkpoints/kohm-raw/stage1b-hrm-fastcap-repeat-step310000"
prompts = [
    (
        0,
        (
            "direct",
            "한국어 존댓말로 현재 디렉터리에서 용량이 가장 큰 파일 10개를 찾는 bash 명령을 제안해 주세요.",
        ),
    ),
    (
        1,
        (
            "direct",
            "Write a Python function that validates a JSON tool-call object with name and arguments.",
        ),
    ),
]
ckpt = inference_load_checkpoint(
    ckpt_path=ckpt_dir,
    ckpt_epoch=None,
    ckpt_step=310000,
    ckpt_use_ema=True,
    device="cuda",
)
for pid, text in inference_generate(
    ckpt,
    iter(prompts),
    max_tokens=1024,
    max_generation=256,
    batch_size=1,
    temp=0.0,
):
    print(f"\n### sample {pid}\n{text}")
PY
python run_kohrm_raw_generate.py
```
Prompt format is handled by `InferenceCheckpoint.tokenize_prompt`. The first tuple item is the condition string, usually `"direct"`, and the second item is the user prompt. Internally this becomes:
```text
<|im_start|><|object_ref_start|>PROMPT<|im_end|>
```
If you want to test a newer raw checkpoint:
1. Check the raw checkpoint repo for the newest uploaded stage/step.
2. Change the `huggingface-cli download --include` pattern.
3. Change `ckpt_dir`.
4. Change `ckpt_step`.
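The three values that must change together can be derived from the checkpoint folder name. This is a small sketch, assuming raw checkpoint folders always end in `-step<N>` as in the one example shown here; the helper name `checkpoint_settings` is hypothetical.

```python
import re

def checkpoint_settings(folder_name: str, local_root: str = "checkpoints/kohm-raw") -> dict:
    """Derive the download pattern, local directory, and step from a folder name."""
    match = re.search(r"-step(\d+)$", folder_name)
    if match is None:
        raise ValueError(f"no trailing '-step<N>' in {folder_name!r}")
    return {
        "include": f"{folder_name}/**",               # for --include
        "ckpt_dir": f"{local_root}/{folder_name}",    # for ckpt_dir
        "ckpt_step": int(match.group(1)),             # for ckpt_step
    }

settings = checkpoint_settings("stage1b-hrm-fastcap-repeat-step310000")
for key, value in settings.items():
    print(f"{key}: {value}")
```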
Plain `AutoModelForCausalLM` generation from `model.safetensors` will be added later when the public `trust_remote_code` wrapper is available.
### Training Data
Prepared data artifacts are uploaded to:
https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
The training objective is PrefixLM response-only loss. Instruction/prompt tokens are visible as context, while loss is applied to the response span.
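The idea can be illustrated with plain Python lists. This is a generic sketch, not the upstream implementation: the response-only loss mask follows the description above, while the bidirectional attention over the prompt span is the usual PrefixLM convention and is an assumption here. The function name is hypothetical.

```python
def prefix_lm_masks(prompt_len: int, response_len: int):
    """Build a response-only loss mask and a prefix-LM attention mask."""
    total = prompt_len + response_len

    # loss_mask[i] is 1 where position i belongs to the response span.
    loss_mask = [0] * prompt_len + [1] * response_len

    # attention[q][k] is True when query position q may attend to key position k:
    # every prompt key is always visible, response keys are visible causally.
    attention = [
        [k < prompt_len or k <= q for k in range(total)]
        for q in range(total)
    ]
    return loss_mask, attention

loss_mask, attention = prefix_lm_masks(prompt_len=3, response_len=2)
print("loss mask:", loss_mask)
for row in attention:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the top-left block is fully filled (prompt tokens see each other) and the response rows grow one position at a time.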
Major prepared data groups:
| Dataset group | Tokens | Use |
|---|---:|---|
| `koterm_pretrain_mix_v1` | 711.3M | stage0/stage0b |
| HRM cleaned fast-cap stage1/stage1b | 14.55B | HRM-style instruction pretraining |
| HRM cleaned full/no-cap stage2 | 14.55B | completed continuation |
| HRM cleaned full/no-cap extra stage2b | 14.55B | active continuation |
| Local terminal conversations | 9.39B | terminal/code/tool-heavy continuation |
| Korean tool/legal/wiki/finance mix | 3.02B | Korean domain and tool continuation |
| BCAI Finance Korean | 857.7M | Korean finance/domain data |
| Korean legal/admin task data | 629.0M | Korean legal/admin data |
| Korean Wikipedia | 462.5M | Korean general text |
| ToolBench train tool-call data | 127.0M | tool-call pretraining |
| SWE-ZERO + GLM reasoning subsets | 251.2M | code/reasoning data |
Evaluation-like datasets are excluded where identified, including ToolBench eval, Terminal Bench style evaluation data, and benchmark-oriented `chi-bench` data.
### Training Run
The current run uses staged continuation:
```text
stage0
-> stage0b
-> stage1
-> stage2
-> stage3
-> stage4
-> stage1b
-> stage2b
-> stage3b
-> stage4b
-> stage1c
-> stage2c
-> stage3c
-> stage4c
```
The checkpoint carries model weights, optimizer state, EMA weights, and recurrent carry state. `resume_step_offset` and `total_steps_override` are used so the learning-rate schedule follows the intended longer run instead of resetting at each stage.
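The effect of the two settings can be sketched as follows. Only the names `resume_step_offset` and `total_steps_override` come from this card; the schedule shape (linear warmup followed by cosine decay), the peak learning rate, the warmup length, and the step counts are hypothetical placeholders, and the way the two settings are combined is one plausible reading rather than the project's code.

```python
import math

def lr_at(local_step: int, resume_step_offset: int, total_steps_override: int,
          peak_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Learning rate at a stage-local step, positioned on the full-run schedule."""
    # Position in the overall run, not in the current stage.
    global_step = resume_step_offset + local_step
    if global_step < warmup_steps:
        return peak_lr * global_step / warmup_steps
    # Decay is measured against the full planned run length.
    decay_steps = max(1, total_steps_override - warmup_steps)
    progress = min(1.0, (global_step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# A stage resumed at step 310,000 continues the schedule instead of restarting warmup.
resumed = lr_at(0, resume_step_offset=310_000, total_steps_override=1_000_000)
uninterrupted = lr_at(310_000, resume_step_offset=0, total_steps_override=1_000_000)
print(resumed, uninterrupted)
```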
As of 2026-05-27, `stage2b` is active. The continuation watcher is scheduled to launch `stage3b -> stage4b -> stage1c -> stage2c -> stage3c -> stage4c` in order, starting each stage once the previous stage's checkpoint is complete. Before each handoff it reads the actual `global_step` from the completed checkpoint's `epoch_1_info.json`.
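Reading the step for a handoff could look like the sketch below. It assumes `epoch_1_info.json` is a flat JSON object with a top-level `global_step` key; the helper name is hypothetical, and the demo writes a stand-in file with a placeholder value so the example runs without a real checkpoint.

```python
import json
import tempfile
from pathlib import Path

def read_global_step(checkpoint_dir) -> int:
    """Return the global_step recorded in a checkpoint's epoch_1_info.json."""
    info_path = Path(checkpoint_dir) / "epoch_1_info.json"
    info = json.loads(info_path.read_text())
    return int(info["global_step"])

# Demo with a stand-in file; a real checkpoint directory would already contain it.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "epoch_1_info.json").write_text(json.dumps({"global_step": 310000}))
    next_offset = read_global_step(tmp)

print("next stage would resume from step:", next_offset)
```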
### Intended Use
This checkpoint is intended for:
- continued pretraining experiments
- Korean tokenizer and HRM-Text architecture experiments
- terminal/tool-call/code pretraining research
- checkpoint conversion and evaluation work
It is not yet intended as a finished assistant model.
### Limitations
- This is an intermediate checkpoint, not a final aligned instruct model.
- The full planned continuation has not finished.
- Final SFT and safety tuning have not been completed.
- Public benchmark scores for this new checkpoint are not final.
- Plain Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
- Tool-call JSON validity and terminal action safety must be evaluated before production use.
### Citation
This work builds on HRM-Text:
- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
<a id="korean"></a>
## Korean
`KoHRM-Text-1.4B` is a Korean/English/code/terminal/tool-call model that is being trained from scratch on the PrefixLM training stack of `sapientinc/HRM-Text`.
This model is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new 131K byte-level BPE tokenizer built for Korean and for terminal/tool-call formats, and its weights are also trained by scratch pretraining.
### Current Status
This repository is a rolling latest model repo in which the newest public converted export keeps overwriting the previous one. Training is still in progress.
- Main model repo: `LLM-OS-Models/KoHRM-Text-1.4B`
- Current public files: `model.safetensors`, `config.json`, tokenizer files, `README.md`
- Raw FSDP2 resume checkpoints: `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints`
- Prepared data: `LLM-OS-Models/KoHRM-Text-1.4B-prepared-data`
- Project code: https://github.com/LLM-OS-Models/KoHRM-text
- Upstream HRM-Text code: https://github.com/sapientinc/HRM-Text
- HRM-Text paper: https://arxiv.org/html/2605.20613
- Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`
To test the latest public weights, download with `revision="main"`. During training, each time a new checkpoint is converted and uploaded (every 10,000 steps), the same filenames are updated with the newest EMA `safetensors`.
### Important Compatibility Note
The public repo currently contains the converted model weights and tokenizer, but it does not yet include the `HrmTextForCausalLM` implementation files for Hugging Face `trust_remote_code`.
What works right now:
- Downloading the latest public weights
- Loading the tokenizer with `AutoTokenizer`
- Inspecting `config.json`
- Checking `model.safetensors` integrity on CPU or Colab T4
What does not yet work in plain Transformers:
- `AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")`
- Running one-line text generation from this repo alone
The reason is that `model_type: "hrm_text"` is a custom HRM-Text architecture. Public generation requires an `HrmTextForCausalLM` remote-code wrapper to be added to this model repo.
### Model Details
| Field | Value |
|---|---:|
| Model ID | `LLM-OS-Models/KoHRM-Text-1.4B` |
| Standard name | `KoHRM-Text-1.4B` |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | `XL` |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Public export dtype | bfloat16 EMA `safetensors` |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from HRM-Text |
| EMA | 0.9999 |
Converted config highlights:
```json
{
"model_type": "hrm_text",
"architectures": ["HrmTextForCausalLM"],
"vocab_size": 131072,
"hidden_size": 1536,
"num_hidden_layers": 32,
"num_attention_heads": 12,
"max_position_embeddings": 4096,
"prefix_lm": true
}
```
### Compared With The HRM-Text Paper
The current run can take longer than the paper recipe, because the setup is not identical.
- The paper reference used 16 x H100; the current run uses 8 x H200.
- KoHRM uses a 131K tokenizer vocabulary, larger than the upstream 65K tokenizer.
- The public KoHRM size is about 1.38B parameters.
- The stable long-run batch was set to `180,224` tokens/step after OOM probing. Larger batches looked feasible early on but were less stable over long runs.
- Korean, terminal, tool-call, legal, finance, wiki, and repeated HRM-cleaned stages were added.
This does not automatically guarantee higher scores on every benchmark. However, Korean tokenizer efficiency, Korean legal/finance/wiki coverage, terminal trajectories, tool-call formatting, and code-oriented behavior have a chance of improving over the upstream English/general checkpoint. Final claims must be confirmed by evaluation after the continuation and SFT finish.
### Tokenizer
The tokenizer was built with Korean, English, code, shell/terminal text, and JSON/tool-call formats in mind. Frequently used chat/tool special tokens are kept as stable single tokens where possible.
| Sample bucket | chars/token |
|---|---:|
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |
Formatting tokens:
```text
<|im_start|> instruction start
<|im_end|> instruction end
<|box_end|> response/end marker
<|object_ref_start|> direct condition
<|object_ref_end|> chain-of-thought style condition
<|quad_start|> noisy condition
<|quad_end|> synthetic condition
```
Prompt format used by the project-side inference code:
```text
<|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>
```
### CPU / Colab T4 Quick Test
The code below checks the latest public weight files in a CPU environment or a Colab T4 runtime. It verifies that the tokenizer, config, and `model.safetensors` download and read correctly.
Because the public repo does not yet have the custom HRM-Text modeling wrapper, this code does not run text generation.
```python
!pip -q install -U huggingface_hub transformers safetensors accelerate
```
```python
from pathlib import Path
import json
import torch
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
from safetensors.torch import load_file
repo_id = "LLM-OS-Models/KoHRM-Text-1.4B"
repo_dir = Path(snapshot_download(
    repo_id,
    revision="main",
    allow_patterns=[
        "README.md",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "model.safetensors",
    ],
))
print("Downloaded to:", repo_dir)
print("Runtime:", "cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
config = json.loads((repo_dir / "config.json").read_text())
print("model_type:", config["model_type"])
print("hidden_size:", config["hidden_size"])
print("vocab_size:", config["vocab_size"])
print("context:", config["max_position_embeddings"])
tokenizer = AutoTokenizer.from_pretrained(repo_dir, use_fast=True)
prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리에서 가장 큰 파일 10개를 찾는 명령을 알려주세요.<|im_end|>"
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print("prompt tokens:", len(ids))
print("first token ids:", ids[:20])
# CPU weight integrity check. This loads about 2.8GB of bf16 weights into CPU RAM.
state = load_file(str(repo_dir / "model.safetensors"), device="cpu")
num_tensors = len(state)
num_params = sum(t.numel() for t in state.values())
first_key = next(iter(state))
print("num_tensors:", num_tensors)
print("num_params:", f"{num_params:,}")
print("first tensor:", first_key, tuple(state[first_key].shape), state[first_key].dtype)
```
Expected result:
- `model_type` is `hrm_text`.
- `vocab_size` is `131072`.
- `num_params` is about `1.38B`.
- The tokenizer loads normally on CPU and Colab T4.
- `AutoModelForCausalLM` generation is expected not to work until the remote-code wrapper is added.
The following code may fail against the current public repo.
```python
from transformers import AutoModelForCausalLM
AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")
```
If it fails with an error about an unknown `hrm_text` architecture, that is expected in the current state.
### Internal / Project-Side Generation
For actual generation today, use the project code and the raw FSDP2 checkpoints. This is the path that works right now on CUDA machines. A BF16-capable GPU with enough VRAM is recommended. Colab T4 can be used for the smoke test above, but it is not the recommended path for raw-checkpoint generation.
```bash
git clone https://github.com/LLM-OS-Models/KoHRM-text
cd KoHRM-text
python -m venv .venv
source .venv/bin/activate
pip install -U pip wheel
pip install -r requirements.txt
pip install -U "huggingface_hub[cli]"
export TOKENIZERS_PARALLELISM=false
export NUMEXPR_MAX_THREADS=128
```
This is an example of a raw checkpoint that can be downloaded right now. The example below uses `stage1b-hrm-fastcap-repeat-step310000`, which has been uploaded to the raw checkpoint repo. When a newer raw checkpoint is uploaded, change the include path, `ckpt_dir`, and `ckpt_step` together.
```bash
mkdir -p checkpoints/kohm-raw
huggingface-cli download LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints \
--include "stage1b-hrm-fastcap-repeat-step310000/**" \
--local-dir checkpoints/kohm-raw
```
Minimal generation script:
```bash
cat > run_kohrm_raw_generate.py <<'PY'
import os
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("NUMEXPR_MAX_THREADS", "128")
from simple_inference_engine import inference_load_checkpoint, inference_generate
ckpt_dir = "checkpoints/kohm-raw/stage1b-hrm-fastcap-repeat-step310000"
prompts = [
    (
        0,
        (
            "direct",
            "한국어 존댓말로 현재 디렉터리에서 용량이 가장 큰 파일 10개를 찾는 bash 명령을 제안해 주세요.",
        ),
    ),
    (
        1,
        (
            "direct",
            "Write a Python function that validates a JSON tool-call object with name and arguments.",
        ),
    ),
]
ckpt = inference_load_checkpoint(
    ckpt_path=ckpt_dir,
    ckpt_epoch=None,
    ckpt_step=310000,
    ckpt_use_ema=True,
    device="cuda",
)
for pid, text in inference_generate(
    ckpt,
    iter(prompts),
    max_tokens=1024,
    max_generation=256,
    batch_size=1,
    temp=0.0,
):
    print(f"\n### sample {pid}\n{text}")
PY
python run_kohrm_raw_generate.py
```
Prompt formatting is handled by `InferenceCheckpoint.tokenize_prompt`. The first value of the tuple is the condition string, usually `"direct"`. The second value is the user prompt. Internally this becomes the following format.
```text
<|im_start|><|object_ref_start|>PROMPT<|im_end|>
```
To test a newer raw checkpoint:
1. Check the raw checkpoint repo for the newest stage/step.
2. Change the `huggingface-cli download --include` pattern.
3. Change `ckpt_dir`.
4. Change `ckpt_step`.
Running `AutoModelForCausalLM` generation directly from the public `model.safetensors` will be supported after the public `trust_remote_code` wrapper is added.
### Training Data
Prepared data is uploaded to the dataset repo below.
https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
The training objective is PrefixLM response-only loss. Instruction/prompt tokens are treated as context, and loss is applied only to the response span.
Major prepared data groups:
| Dataset group | Tokens | Use |
|---|---:|---|
| `koterm_pretrain_mix_v1` | 711.3M | stage0/stage0b |
| HRM cleaned fast-cap stage1/stage1b | 14.55B | HRM-style instruction pretraining |
| HRM cleaned full/no-cap stage2 | 14.55B | completed continuation |
| HRM cleaned full/no-cap extra stage2b | 14.55B | active continuation |
| Local terminal conversations | 9.39B | terminal/code/tool-heavy continuation |
| Korean tool/legal/wiki/finance mix | 3.02B | Korean domain/tool continuation |
| BCAI Finance Korean | 857.7M | Korean finance/domain data |
| Korean legal/admin task data | 629.0M | Korean legal/admin data |
| Korean Wikipedia | 462.5M | Korean general text |
| ToolBench train tool-call data | 127.0M | tool-call pretraining |
| SWE-ZERO + GLM reasoning subsets | 251.2M | code/reasoning data |
Evaluation-like data is excluded from training as far as it can be identified. Examples are ToolBench eval, Terminal Bench style evaluation data, and benchmark-oriented `chi-bench` data.
### Training Run
The current run uses staged continuation.
```text
stage0
-> stage0b
-> stage1
-> stage2
-> stage3
-> stage4
-> stage1b
-> stage2b
-> stage3b
-> stage4b
-> stage1c
-> stage2c
-> stage3c
-> stage4c
```
The checkpoint carries over model weights, optimizer state, EMA weights, and recurrent carry state. `resume_step_offset` and `total_steps_override` are used so that the learning-rate schedule does not reset at each stage and instead continues like one long pretraining run.
As of 2026-05-27, `stage2b` is in progress. The continuation watcher is scheduled to run `stage3b -> stage4b -> stage1c -> stage2c -> stage3c -> stage4c` in sequence afterwards. The handoff reads the actual `global_step` from each stage's `epoch_1_info.json` and then starts the next stage.
### Intended Use
This checkpoint is suitable for the following purposes.
- continued pretraining experiments
- Korean tokenizer and HRM-Text architecture experiments
- terminal/tool-call/code pretraining research
- checkpoint conversion and evaluation work
It is not yet a finished assistant model.
### Limitations
- This is an intermediate checkpoint, not a final aligned instruct model.
- The full planned continuation has not finished yet.
- Final SFT and safety tuning have not finished yet.
- Public benchmark scores for the new checkpoint are not yet final.
- Plain Transformers generation requires the custom `hrm_text` modeling wrapper or remote-code files to be added.
- Tool-call JSON validity and terminal action safety need separate evaluation before real use.
### Citation
This work is based on the HRM-Text architecture and training stack.
- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text