# Sentinel Prime 350M: Sparse MoE Language Model
Sentinel Prime 350M is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.
## Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 471,231,488 |
| Active Parameters | ~471,231,488 per token |
| Hidden Dimension | 1024 |
| Layers | 24 |
| Attention Heads | 16 (Q) / 4 (KV) |
| FFN Dimension | 2752 |
| Experts | 1 total, top-1 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 2048 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
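As a sanity check, the total in the table can be reproduced from the listed dimensions. The sketch below is illustrative, not taken from the model code; it assumes no bias terms, untied input/output embeddings, SwiGLU with separate gate/up/down matrices, and two RMSNorms per block — assumptions chosen because they reproduce the stated count exactly.

```python
# Hypothetical parameter count derived from the architecture table.
# Assumptions: no biases, untied embedding/lm_head, SwiGLU (gate/up/down),
# two RMSNorms per block, GQA with 16 query heads and 4 KV heads.
vocab, d, layers, ffn = 100_277, 1024, 24, 2752
q_heads, kv_heads = 16, 4
head_dim = d // q_heads  # 64

attn = d * q_heads * head_dim        # Q projection
attn += 2 * d * kv_heads * head_dim  # K and V projections (shared across query groups)
attn += q_heads * head_dim * d       # output projection

swiglu = 3 * d * ffn                 # gate, up, and down matrices
norms = 2 * d                        # pre-attention and pre-FFN RMSNorm weights

per_layer = attn + swiglu + norms
total = layers * per_layer + 2 * vocab * d + d  # + embedding, lm_head, final norm
print(total)  # -> 471231488, matching the table
```

With only one expert, the MoE FFN contributes the same count as a dense SwiGLU layer, which is why total and active parameters coincide.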
## Key Features
- Sparse MoE: Only 1/1 experts active per token
- GQA: Memory-efficient grouped query attention
- SwiGLU: LLaMA/Mistral-style feed-forward
- RoPE: Rotary position embeddings for length generalization
- From Scratch: No pretrained weights, trained from random initialization
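To make the RoPE bullet concrete, here is a minimal sketch of rotary position embeddings using the card's theta=500000.0. The function name and shapes are illustrative only; the model's actual implementation may batch and vectorize this differently.

```python
import math

def rope(vec, pos, theta=500000.0):
    """Rotate consecutive pairs of a query/key vector by position-dependent
    angles (illustrative RoPE, theta as stated on the model card)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        freq = theta ** (-i / d)  # lower-indexed pairs rotate faster
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

q = [1.0, 0.0, 1.0, 0.0]
rotated = rope(q, pos=3)
# The rotation is norm-preserving: only relative angles between positions change,
# which is what lets RoPE encode relative offsets inside the attention dot product.
print(sum(x * x for x in rotated))
```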
## Training
- Data: FineWeb-Edu (educational web text)
- Tokens Seen: 3,113,287,680
- Best Validation Loss: 3.0578
- Hardware: NVIDIA GH200 96GB HBM3e
- Framework: PyTorch 2.5.1
## Usage

```python
# Register the custom architecture and tokenizer shipped with the checkpoint
from hf_model import SentinelBrainConfig, SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer

model = SentinelBrainForCausalLM.from_pretrained(
    "qubitpage/sentinel-prime-350m", trust_remote_code=True
)
tokenizer = SentinelBrainTokenizer()

input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## License
Apache 2.0
## Benchmarks
Results from EleutherAI lm-evaluation-harness (latest) run on a single NVIDIA GH200 96GB. Full results, configs and per-sample logs are public in the companion dataset:
qubitpage/sentinel-prime-350m-evals
| Model | Params | Train Tokens | arc_challenge | arc_easy | hellaswag | lambada_openai | openbookqa | piqa | sciq | winogrande | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentinel-Prime-350M (ours) | 471M | 3.1B | 0.194 | 0.352 | 0.264 | 0.001 | 0.120 | 0.566 | 0.481 | 0.501 | 0.310 |
| Pythia-410M | 410M | 300B | 0.240 | 0.520 | 0.400 | 0.510 | 0.300 | 0.670 | 0.810 | 0.530 | 0.498 |
| GPT-Neo-125M | 125M | 300B | 0.190 | 0.430 | 0.300 | 0.370 | 0.260 | 0.630 | 0.760 | 0.520 | 0.433 |
| SmolLM-360M | 360M | 600B | 0.340 | 0.660 | 0.520 | 0.460 | 0.370 | 0.720 | 0.910 | 0.570 | 0.569 |
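The Avg column is the unweighted mean of the eight task accuracies; for the Sentinel-Prime row:

```python
# Accuracies from the table row for Sentinel-Prime-350M, in column order
scores = [0.194, 0.352, 0.264, 0.001, 0.120, 0.566, 0.481, 0.501]
print(f"{sum(scores) / len(scores):.3f}")  # prints 0.310, matching the Avg column
```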
## Training-Compute Context
| Model | Hardware | Tokens Seen | Compute Multiplier vs ours |
|---|---|---|---|
| Sentinel-Prime-350M | 1x NVIDIA GH200 96GB | 3.1 B | 1x (baseline) |
| Pythia-410M | TPU v4 cluster | 300 B | 97x |
| SmolLM-360M | 64x NVIDIA H100 | 600 B | 194x |
Sentinel-Prime-350M was trained on 3.1B tokens, versus 300B for Pythia-410M (97x more) and 600B for SmolLM-360M (194x more). The current avg of 0.310 vs Pythia's 0.498 and SmolLM's 0.569 therefore reflects an early-checkpoint snapshot at roughly 1% of the typical training budget for this size class.
Per Chinchilla (Hoffmann et al., 2022), a 471M dense model is compute-optimal at roughly 9.4B tokens (about 20 tokens per parameter); we are at 0.33x that budget. Continued pretraining on the same architecture and data is expected to scale predictably toward the reference band.
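The token multipliers and the Chinchilla budget above are simple arithmetic; a quick check:

```python
# Token-budget ratios from the figures stated above
ours = 3.1e9                 # tokens seen by Sentinel-Prime-350M
pythia, smollm = 300e9, 600e9
print(round(pythia / ours))  # -> 97
print(round(smollm / ours))  # -> 194

params = 471_231_488
optimal = 20 * params               # Chinchilla heuristic: ~20 tokens per parameter
print(optimal / 1e9)                # ~9.4B tokens
print(round(ours / optimal, 2))     # -> 0.33 of the compute-optimal budget
```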
Reference scores from EleutherAI Pythia paper and HuggingFace SmolLM card. All evaluated 0-shot under identical prompt formats.
Reproduce locally:

```shell
pip install "lm_eval[hf]"

lm_eval --model hf \
  --model_args pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32 \
  --tasks arc_challenge,arc_easy,hellaswag,lambada_openai,openbookqa,piqa,sciq,winogrande \
  --device cuda:0 --batch_size auto:4
```
## Support This Project
Sentinel-Prime is being trained on a single GH200 against models that used hundreds of GPUs. If these results interest you and you want to help us close the 97x-194x compute gap, you can back the project here:
Support Sentinel-Prime on Surge
Every contribution funds more GH200 hours and brings the next checkpoint closer to (and past) the Pythia / SmolLM reference band.