Muse-2-350M

Model Information

Muse-2-350M is an English text-generation language model developed by Muse-research. It is a compact causal language model built for general writing, question answering, science reasoning, lightweight coding help, summarization, and assistant-style generation.

The model uses the custom MuseNova Transformer architecture, an autoregressive decoder-only design with rotary position embeddings, grouped-query attention, RMSNorm, and a gated feed-forward network. It is designed to keep a small footprint while preserving a long context window for documents, explanations, and multi-step prompts.

Model Developer: Muse-research

Model Architecture: MuseNova is a decoder-only Transformer architecture for causal language modeling. It uses grouped-query attention for efficient inference, RoPE for position encoding, and a SwiGLU-style MLP block.

| Model | Params | Input modalities | Output modalities | Context length | GQA | Shared embeddings | Primary language |
|---|---|---|---|---|---|---|---|
| Muse-2-350M | 354.98M | Text | Text and code | 32,768 tokens | Yes | No | English |

Supported Language: English.

Model Family: Muse-2 is a compact model family focused on efficient text generation and reasoning experiments. Muse-2-350M is the family's release in the 350M-parameter class.

Status: This is a static model released as an offline checkpoint. It does not browse the web, retrieve live information, or update itself after release.

License: Apache 2.0.

Intended Use

Intended Use Cases: Muse-2-350M is intended for research and development use in English-language text generation. It can be used for lightweight assistant workflows, educational explanations, summarization, rewriting, coding support, science question answering, and evaluation of compact Transformer architectures.

The model is especially suited for:

  • general English text generation
  • assistant-style question answering
  • science and textbook-style explanations
  • summarization and rewriting
  • lightweight coding assistance
  • reasoning and multiple-choice evaluation research
  • long-context prompt experiments
  • custom PyTorch inference pipelines

Out of Scope: Muse-2-350M should not be used as the only source of truth for medical, legal, financial, safety-critical, or identity-sensitive decisions. It should not be treated as a live knowledge system, a safety classifier, or a substitute for expert review.

How to Use

This repository contains the Muse-2-350M tokenizer, configuration, and model weights. The model architecture is custom, so inference should be run with a MuseNova-compatible PyTorch implementation.

Install the common runtime packages:

pip install torch safetensors tokenizers

Load the tokenizer and weights:

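from pathlib import Path

from safetensors.torch import load_file
from tokenizers import Tokenizer

# Local directory holding tokenizer.json, config.json, and model.safetensors.
model_dir = Path("Muse-2-350M")

# Tokenizer.from_file expects a string path to the serialized tokenizer.
tokenizer = Tokenizer.from_file(str(model_dir / "tokenizer.json"))
# load_file returns a dict mapping tensor names to torch.Tensor objects.
state_dict = load_file(str(model_dir / "model.safetensors"))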

A compatible implementation should construct the MuseNovaForCausalLM architecture from config.json, then load model.safetensors.
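
As a rough illustration of the first step, the sketch below reads config.json into a small dataclass and checks that the attention shapes are consistent. The key names (`hidden_size`, `num_key_value_heads`, and so on) follow common Hugging Face naming and are assumptions; the actual keys in this repository's config.json may differ. `MuseNovaConfig` and `load_config` are hypothetical names and are not part of any released package.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MuseNovaConfig:
    """Hypothetical container; field names are assumed, not taken from the repo."""
    hidden_size: int
    intermediate_size: int
    num_hidden_layers: int
    num_attention_heads: int
    num_key_value_heads: int
    vocab_size: int
    max_position_embeddings: int

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads

    def validate(self) -> None:
        # Each query head needs an equal slice of the hidden dimension.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        # Grouped-query attention shares each KV head across a whole group of query heads.
        if self.num_attention_heads % self.num_key_value_heads != 0:
            raise ValueError("num_attention_heads must be divisible by num_key_value_heads")


def load_config(path) -> MuseNovaConfig:
    """Read the assumed keys from a JSON file and validate them."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    names = MuseNovaConfig.__dataclass_fields__
    config = MuseNovaConfig(**{name: int(raw[name]) for name in names})
    config.validate()
    return config
```

A MuseNova-compatible implementation would then build its layers from these values before loading the weights from model.safetensors.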

Prompt Format

Muse-2-350M uses a simple role-style prompt format:

<|system|>
You are Muse-2, a helpful English assistant.<|end|>
<|user|>
Explain why the sky appears blue.<|end|>
<|assistant|>

For direct completion tasks, plain text prompts can also be used.
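
The role-style format above can be assembled with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the exact newline placement (including the newline after `<|assistant|>`) is inferred from the example rather than confirmed by a specification.

```python
from typing import Optional

END = "<|end|>"


def build_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Format one user turn, optionally preceded by a system message."""
    parts = []
    if system_message is not None:
        parts.append(f"<|system|>\n{system_message}{END}\n")
    parts.append(f"<|user|>\n{user_message}{END}\n")
    # Finish with the assistant tag so the model continues with its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt(
    "Explain why the sky appears blue.",
    system_message="You are Muse-2, a helpful English assistant.",
)
print(prompt)
```

Whether the role tags are encoded as single special tokens depends on the tokenizer configuration, so it is worth inspecting the encoded IDs once before relying on this format.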

Generation Settings

For deterministic question answering:

generation_config = {
    "max_new_tokens": 256,
    # Greedy decoding: temperature is typically ignored when do_sample is False.
    "temperature": 0.0,
    "do_sample": False,
}

For general chat and writing:

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "do_sample": True,
}

For longer explanations:

generation_config = {
    "max_new_tokens": 1024,
    "temperature": 0.6,
    "top_p": 0.9,
    "do_sample": True,
}
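
To make the settings above concrete, the following sketch shows one common way a decoding loop might turn raw logits into a next-token choice. It is an illustration in pure Python, not the MuseNova implementation: `sample_next_token` is a hypothetical name, and the order of the filters (top-k, then temperature softmax, then top-p) is an assumption, since runtimes differ on this detail.

```python
import math
import random
from typing import Optional, Sequence


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    do_sample: bool = True,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one token id from raw logits."""
    # Greedy decoding: take the argmax; temperature plays no role.
    if not do_sample or temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Candidate token ids, highest logit first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None and top_k > 0:
        order = order[:top_k]

    # Temperature-scaled softmax over the remaining candidates.
    scaled = [logits[i] / temperature for i in order]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus (top-p) filter: keep the smallest prefix whose mass reaches top_p.
    if top_p is not None and 0.0 < top_p < 1.0:
        cumulative = 0.0
        cutoff = len(order)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]

    # random.choices renormalizes the weights internally.
    return rng.choices(order, weights=probs, k=1)[0]
```

Lower temperatures concentrate probability on the top candidates, while `top_k` and `top_p` remove the low-probability tail before sampling.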

Model Architecture

Muse-2-350M is a compact Transformer language model with a long context window and grouped-query attention.

| Component | Value |
|---|---|
| Architecture | MuseNovaForCausalLM |
| Model type | Decoder-only causal language model |
| Parameters | 354.98M |
| Hidden size | 1024 |
| Intermediate size | 4352 |
| Layers | 18 |
| Attention heads | 16 |
| Key/value heads | 4 |
| Position encoding | Rotary position embeddings |
| Context length | 32,768 tokens |
| Vocabulary size | 32,768 |
| Normalization | RMSNorm |
| MLP | Gated feed-forward network |
| Weight format | safetensors |
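
The reported parameter count can be cross-checked against the values in the table. The sketch below assumes a conventional layout that the card does not spell out: bias-free linear layers, two RMSNorm weight vectors per block plus one final norm, a three-matrix gated MLP, and separate input and output embeddings (consistent with "Shared embeddings: No"). `count_parameters` is a hypothetical helper.

```python
def count_parameters(
    hidden_size: int = 1024,
    intermediate_size: int = 4352,
    num_layers: int = 18,
    num_heads: int = 16,
    num_kv_heads: int = 4,
    vocab_size: int = 32768,
    tied_embeddings: bool = False,
) -> int:
    """Estimate the parameter count under the layout assumptions stated above."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim

    # Attention: Q and output projections are square; K and V are narrower under GQA.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tied_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embeddings + num_layers * per_layer + lm_head + final_norm


total = count_parameters()
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# 354,980,864 parameters (354.98M)
```

Under these assumptions the estimate matches the reported 354.98M, which is consistent with the assumed layout but does not confirm it.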

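Grouped-query attention uses fewer key/value heads than query heads. Here 4 key/value heads are shared across 16 query heads, so each key/value head serves a group of 4 query heads. This makes the key/value cache 4 times smaller than it would be under full multi-head attention, reducing inference memory pressure while keeping all 16 query heads.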
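
The memory effect of grouped-query attention can be estimated with simple arithmetic. The sketch below computes key/value cache size from the table values, assuming a 2-byte cache dtype and batch size 1, and shows a contiguous mapping from query heads to key/value heads. The helper names are hypothetical, and the contiguous grouping is a common convention rather than a confirmed detail of MuseNova.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 18,
    num_kv_heads: int = 4,
    head_dim: int = 64,        # 1024 hidden size / 16 attention heads
    bytes_per_value: int = 2,  # BF16 or FP16 (assumed cache dtype)
    batch_size: int = 1,
) -> int:
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * batch_size * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def kv_head_for_query_head(query_head: int, num_heads: int = 16, num_kv_heads: int = 4) -> int:
    """Map a query head to its shared KV head (contiguous grouping, assumed)."""
    group_size = num_heads // num_kv_heads
    return query_head // group_size


full_context = 32_768
gqa = kv_cache_bytes(full_context)
mha = kv_cache_bytes(full_context, num_kv_heads=16)
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA-equivalent: {mha / 2**20:.0f} MiB")
# GQA: 576 MiB, MHA-equivalent: 2304 MiB
```

At the full 32,768-token window, the cache under these assumptions is about 576 MiB with 4 key/value heads, versus about 2,304 MiB if all 16 heads kept their own keys and values.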

Capabilities

General Text Generation

Muse-2-350M can generate English prose, continue passages, rewrite text, summarize information, and answer direct questions. It is designed for compact assistant-style use rather than large-scale frontier reasoning.

Science and Knowledge Questions

The model can answer basic and intermediate science questions, explain concepts, and respond to textbook-style prompts. It can also be used for multiple-choice reasoning evaluations, although answers should be checked.

Math and Structured Reasoning

Muse-2-350M can attempt arithmetic, word problems, and step-by-step explanations. As a compact model, it may make mistakes on multi-step calculations or problems requiring precise symbolic manipulation.

Coding

The model can help with short code snippets, Python-style examples, explanations of programming concepts, and simple debugging. It is not specialized as a production coding model.

Long Context

Muse-2-350M supports a 32,768 token context window, enabling longer prompts, documents, and multi-part instructions. Long-context quality can vary depending on prompt structure and generation settings.
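
When working near the limit, the prompt and the generated tokens usually have to share the same window. The sketch below trims a list of token IDs so that the prompt plus `max_new_tokens` fits in 32,768 positions. `fit_to_context` is a hypothetical helper, and dropping the oldest tokens first is only one possible truncation policy.

```python
from typing import List

CONTEXT_LENGTH = 32_768  # context length from the model table


def fit_to_context(
    token_ids: List[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Trim the oldest prompt tokens so prompt + generation fits the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the most recent tokens; earlier context is dropped first.
    return token_ids[-budget:] if len(token_ids) > budget else token_ids
```

With the `tokenizers` library, the IDs would typically come from `tokenizer.encode(text).ids`. For documents where the opening matters more than the end, keeping the first tokens instead may be the better choice.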

Benchmarks

Benchmark values are reported for transparency and should be read as approximate automatic-evaluation signals. Scores for small models can shift noticeably with prompt formatting and answer extraction.

| Category | Benchmark | Split / Task | Metric | Score |
|---|---|---|---|---|
| Science reasoning | GPQA-Diamond | public mirror fingertap/GPQA-Diamond, test | Accuracy | 28.79 |

The GPQA value above was produced with letter-scoring over 198 questions (28.79% corresponds to 57 correct answers) and saved in .eval_results/gpqa.yaml.
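
To give a sense of how much a score on 198 questions can move by chance, the sketch below computes a normal-approximation (Wald) interval for the reported accuracy. The count of 57 correct answers is inferred from 28.79% of 198 and is not stated in the results file; `accuracy_interval` is a hypothetical helper, and the Wald interval is a simple choice rather than the method used for the reported number.

```python
import math


def accuracy_interval(correct: int, total: int, z: float = 1.96):
    """Return (accuracy, low, high) using a normal approximation to the binomial."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


p, low, high = accuracy_interval(correct=57, total=198)
print(f"accuracy {100 * p:.2f}%, 95% interval {100 * low:.1f}% to {100 * high:.1f}%")
# accuracy 28.79%, 95% interval 22.5% to 35.1%
```

Under this approximation the interval spans roughly 22.5% to 35.1%, which includes the 25% expected from random guessing on four-option questions, so small differences on this benchmark should be read with care.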

Hardware and Software

Muse-2-350M is small enough for comfortable inference on a single modern GPU. CPU inference is possible with a compatible implementation, but generation speed depends heavily on the runtime.

| Precision | Approximate VRAM class | Notes |
|---|---|---|
| BF16 / FP16 | 2 GB+ | Recommended for GPU inference |
| FP32 | 4 GB+ | Useful for debugging; slower and larger |
| Quantized | Runtime dependent | Requires external quantization tooling |
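
The VRAM classes above can be related to the size of the weights alone. The sketch below multiplies the reported parameter count by the bytes per parameter for each precision. It covers weights only; activations, the key/value cache, and runtime overhead are not included, and `weight_gib` is a hypothetical helper.

```python
PARAMS = 354_980_000  # 354.98M, as reported in the model table

BYTES_PER_PARAM = {"BF16 / FP16": 2, "FP32": 4}


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GiB for a given storage width."""
    return num_params * bytes_per_param / 2**30


for precision, width in BYTES_PER_PARAM.items():
    print(f"{precision}: {weight_gib(PARAMS, width):.2f} GiB of weights")
# BF16 / FP16: 0.66 GiB of weights
# FP32: 1.32 GiB of weights
```

The gap between these figures and the 2 GB+ and 4 GB+ classes leaves headroom for the cache and activations, which grow with prompt length and batch size.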

Recommended packages:

  • torch
  • safetensors
  • tokenizers

Data Scope

Muse-2-350M is intended for English-language generation and reasoning. It is not a multilingual model and should not be expected to provide strong performance outside English.

The model may reflect biases, errors, and omissions present in public text data. It does not contain a retrieval system and does not know about events after its offline data snapshot unless those facts are present in the prompt.

Responsibility and Safety

Muse-2-350M is a general-purpose text-generation model. Developers are responsible for testing it in their own use cases and applying appropriate safeguards.

Recommended deployment practices:

  1. Evaluate the model on task-specific data before use.
  2. Use human review for high-impact outputs.
  3. Add content filters or policy layers where needed.
  4. Avoid using raw model output as authoritative factual advice.
  5. Monitor for hallucinations, unsafe completions, and prompt sensitivity.

Limitations

  • The model can hallucinate facts.
  • The model can make arithmetic, science, and coding mistakes.
  • The model may be sensitive to prompt wording.
  • The model can show answer-choice bias on multiple-choice tasks.
  • The model is not a live search or retrieval system.
  • The model is not specialized for medical, legal, financial, or safety-critical advice.
  • Long-context support does not guarantee perfect long-context reasoning.

Citation

@misc{museresearch2026muse2_350m,
  title        = {Muse-2-350M},
  author       = {Muse-research},
  year         = {2026},
  url          = {https://huggingface.co/Muse-research/Muse-2-350M}
}