# Sentinel Prime 350M: Sparse MoE Language Model
Sentinel Prime 350M is a from-scratch sparse Mixture of Experts (MoE) transformer built by QubitPage Research.
## Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 471,231,488 |
| Active Parameters | ~471,231,488 per token |
| Hidden Dimension | 1024 |
| Layers | 24 |
| Attention Heads | 16 (Q) / 4 (KV) |
| FFN Dimension | 2752 |
| Experts | 1 total, top-1 active |
| Vocab Size | 100,277 (tiktoken cl100k_base) |
| Max Sequence Length | 2048 |
| Position Encoding | RoPE (theta=500000.0) |
| Normalization | RMSNorm |
| FFN Type | SwiGLU |
| Attention | Grouped Query Attention (GQA) |
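As a sanity check, the total in the table can be reproduced from the listed dimensions. The sketch below is illustrative, not taken from the model code; it assumes no bias terms, untied input/output embeddings, SwiGLU with separate gate/up/down matrices, and two RMSNorms per block — assumptions chosen because they reproduce the stated count exactly.

```python
# Hypothetical parameter count derived from the architecture table.
# Assumptions: no biases, untied embedding/lm_head, SwiGLU (gate/up/down),
# two RMSNorms per block, GQA with 16 query heads and 4 KV heads.
vocab, d, layers, ffn = 100_277, 1024, 24, 2752
q_heads, kv_heads = 16, 4
head_dim = d // q_heads  # 64

attn = d * q_heads * head_dim        # Q projection
attn += 2 * d * kv_heads * head_dim  # K and V projections (shared across query groups)
attn += q_heads * head_dim * d       # output projection

swiglu = 3 * d * ffn                 # gate, up, and down matrices
norms = 2 * d                        # pre-attention and pre-FFN RMSNorm weights

per_layer = attn + swiglu + norms
total = layers * per_layer + 2 * vocab * d + d  # + embedding, lm_head, final norm
print(total)  # -> 471231488, matching the table
```

With only one expert, the MoE FFN contributes the same count as a dense SwiGLU layer, which is why total and active parameters coincide.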
## Key Features
- Sparse MoE: Only 1/1 experts active per token
- GQA: Memory-efficient grouped query attention
- SwiGLU: LLaMA/Mistral-style feed-forward
- RoPE: Rotary position embeddings for length generalization
- From Scratch: No pretrained weights, trained from random initialization
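To make the RoPE bullet concrete, here is a minimal sketch of rotary position embeddings using the card's theta=500000.0. The function name and shapes are illustrative only; the model's actual implementation may batch and vectorize this differently.

```python
import math

def rope(vec, pos, theta=500000.0):
    """Rotate consecutive pairs of a query/key vector by position-dependent
    angles (illustrative RoPE, theta as stated on the model card)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        freq = theta ** (-i / d)  # lower-indexed pairs rotate faster
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

q = [1.0, 0.0, 1.0, 0.0]
rotated = rope(q, pos=3)
# The rotation is norm-preserving: only relative angles between positions change,
# which is what lets RoPE encode relative offsets inside the attention dot product.
print(sum(x * x for x in rotated))
```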
## Training
- Data: FineWeb-Edu (educational web text)
- Tokens Seen: 3,113,287,680
- Best Validation Loss: 3.0578
- Hardware: NVIDIA GH200 96GB HBM3e
- Framework: PyTorch 2.5.1
## Usage

```python
# Register the custom architecture and tokenizer shipped with the checkpoint
from hf_model import SentinelBrainConfig, SentinelBrainForCausalLM
from hf_tokenizer import SentinelBrainTokenizer

model = SentinelBrainForCausalLM.from_pretrained(
    "qubitpage/sentinel-prime-350m", trust_remote_code=True
)
tokenizer = SentinelBrainTokenizer()

input_ids = tokenizer("The meaning of life is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## License
Apache 2.0
## Benchmarks
Results from EleutherAI lm-evaluation-harness (latest) run on a single NVIDIA GH200 96GB. Full results, configs and per-sample logs are public in the companion dataset:
qubitpage/sentinel-prime-350m-evals
| Model | Params | Train Tokens | arc_challenge | arc_easy | hellaswag | lambada_openai | openbookqa | piqa | sciq | winogrande | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentinel-Prime-350M (ours) | 471M | 3.1B | 0.194 | 0.352 | 0.264 | 0.001 | 0.120 | 0.566 | 0.481 | 0.501 | 0.310 |
| Pythia-410M | 410M | 300B | 0.240 | 0.520 | 0.400 | 0.510 | 0.300 | 0.670 | 0.810 | 0.530 | 0.498 |
| GPT-Neo-125M | 125M | 300B | 0.190 | 0.430 | 0.300 | 0.370 | 0.260 | 0.630 | 0.760 | 0.520 | 0.433 |
| SmolLM-360M | 360M | 600B | 0.340 | 0.660 | 0.520 | 0.460 | 0.370 | 0.720 | 0.910 | 0.570 | 0.569 |
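The Avg column is the unweighted mean of the eight task accuracies; for the Sentinel-Prime row:

```python
# Accuracies from the table row for Sentinel-Prime-350M, in column order
scores = [0.194, 0.352, 0.264, 0.001, 0.120, 0.566, 0.481, 0.501]
print(f"{sum(scores) / len(scores):.3f}")  # prints 0.310, matching the Avg column
```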
## Training-Compute Context
| Model | Hardware | Tokens Seen | Compute Multiplier vs ours |
|---|---|---|---|
| Sentinel-Prime-350M | 1x NVIDIA GH200 96GB | 3.1 B | 1x (baseline) |
| Pythia-410M | TPU v4 cluster | 300 B | 97x |
| SmolLM-360M | 64x NVIDIA H100 | 600 B | 194x |
Sentinel-Prime-350M was trained on 3.1B tokens, versus 300B for Pythia-410M (97x more) and 600B for SmolLM-360M (194x more). The current avg of 0.310 vs Pythia's 0.498 and SmolLM's 0.569 therefore reflects an early-checkpoint snapshot at roughly 1% of the typical training budget for this size class.
Per Chinchilla (Hoffmann et al., 2022), a 471M dense model is compute-optimal at roughly 9.4B tokens (about 20 tokens per parameter); we are at 0.33x that budget. Continued pretraining on the same architecture and data is expected to scale predictably toward the reference band.
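The token multipliers and the Chinchilla budget above are simple arithmetic; a quick check:

```python
# Token-budget ratios from the figures stated above
ours = 3.1e9                 # tokens seen by Sentinel-Prime-350M
pythia, smollm = 300e9, 600e9
print(round(pythia / ours))  # -> 97
print(round(smollm / ours))  # -> 194

params = 471_231_488
optimal = 20 * params               # Chinchilla heuristic: ~20 tokens per parameter
print(optimal / 1e9)                # ~9.4B tokens
print(round(ours / optimal, 2))     # -> 0.33 of the compute-optimal budget
```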
Reference scores from EleutherAI Pythia paper and HuggingFace SmolLM card. All evaluated 0-shot under identical prompt formats.
Reproduce locally:

```shell
pip install "lm_eval[hf]"

lm_eval --model hf \
  --model_args pretrained=qubitpage/sentinel-prime-350m,trust_remote_code=True,dtype=float32 \
  --tasks arc_challenge,arc_easy,hellaswag,lambada_openai,openbookqa,piqa,sciq,winogrande \
  --device cuda:0 --batch_size auto:4
```
## Support This Project
Sentinel-Prime is being trained on a single GH200 against models that used hundreds of GPUs. If these results interest you and you want to help us close the 97x-194x compute gap, you can back the project here:
Support Sentinel-Prime on Surge
Every contribution funds more GH200 hours and brings the next checkpoint closer to (and past) the Pythia / SmolLM reference band.