MSA-4B

Highlights

Long-term memory is essential for general intelligence, yet the full-attention bottleneck constrains most LLMs' effective context length to 128K–1M tokens. Existing attempts (hybrid linear attention, fixed-size state memory such as RNNs, and external storage such as RAG/agents) either suffer rapid precision decay and latency growth at extreme scales, lack end-to-end differentiability or dynamic memory maintenance, or require complex pipelines. We present Memory Sparse Attention (MSA): an end-to-end trainable, scalable sparse latent-state memory framework. Core ideas include:

  • Scalable sparse attention combined with document-wise RoPE (parallel/global), achieving near-linear complexity in both training and inference (a conceptual sketch follows this list);
  • KV cache compression plus a Memory Parallel inference engine, delivering 100M-token throughput on 2×A800 GPUs;
  • Memory Interleave for multi-round, multi-hop reasoning across scattered memory segments.
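The sketch below is a conceptual illustration only, not the released MSA kernels: it shows a generic top-k block-sparse attention step in PyTorch to give intuition for the near-linear scaling claim. The function name, shapes, block size, and the mean-pooled block-scoring rule are all illustrative assumptions.

import torch

def topk_block_sparse_attention(q, k_cache, v_cache, block_size=128, top_k=16):
    # q: (H, d) query for one decoding step; k_cache, v_cache: (H, T, d) cached memory.
    H, T, d = k_cache.shape
    n_blocks = T // block_size  # tail tokens beyond the last full block are ignored in this sketch
    blk = k_cache[:, : n_blocks * block_size].reshape(H, n_blocks, block_size, d)
    # Coarse query-to-block relevance via mean-pooled block keys.
    block_scores = torch.einsum("hd,hbd->hb", q, blk.mean(dim=2))
    top_blocks = block_scores.topk(min(top_k, n_blocks), dim=-1).indices        # (H, k)
    # Expand the selected block ids into token positions and gather their K/V.
    offsets = torch.arange(block_size, device=q.device)
    idx = (top_blocks.unsqueeze(-1) * block_size + offsets).reshape(H, -1)      # (H, k*block_size)
    k_sel = torch.gather(k_cache, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    v_sel = torch.gather(v_cache, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    # Dense attention restricted to the selected tokens: cost scales with top_k * block_size, not T.
    attn = torch.softmax(torch.einsum("hd,htd->ht", q, k_sel) / d ** 0.5, dim=-1)
    return torch.einsum("ht,htd->hd", attn, v_sel)

Because each decoding step attends to at most top_k * block_size cached tokens, the per-token cost is bounded by the selection budget rather than by the total memory length, which is the property the first bullet describes.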

On long-context QA and NIAH (Needle-in-a-Haystack) benchmarks, MSA surpasses same-backbone RAG, best-of-breed RAG stacks, and leading long-context models. Across an unprecedented 16K→100M token range, MSA shows < 9% degradation, suggesting a practical path to decouple memory capacity from reasoning.

Model Overview

This model is based on Qwen3-4B-Instruct-2507 with Memory Sparse Attention (MSA).

  • Number of Parameters: 4.0B
  • Number of Layers: 36
  • Number of MSA Layers: 18
  • Number of Attention Heads (GQA): 32 for Q and 8 for KV
  • Based on Qwen/Qwen3-4B-Instruct-2507
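A minimal loading sketch, assuming the checkpoint follows the standard Hugging Face layout and ships custom MSA modeling code (hence trust_remote_code=True); the prompt and generation settings are purely illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ckpt/MSA-4B"  # local directory created in the download step below
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # weights are released in BF16
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the key findings of the report above."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))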

Quick Start

1. Install dependencies

conda create -n msa python=3.12 -y
conda activate msa

pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
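Optionally, a quick import check confirms the flash-attn build succeeded (assumes a CUDA-capable environment):

import flash_attn
print(flash_attn.__version__)  # should print 2.7.4.post1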

2. Download model

mkdir ckpt
pip install -U huggingface_hub
huggingface-cli download Anoy123423123/MSA-4B --local-dir ckpt/MSA-4B
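The same download can also be scripted from Python via huggingface_hub, writing to the same destination directory:

from huggingface_hub import snapshot_download

snapshot_download(repo_id="Anoy123423123/MSA-4B", local_dir="ckpt/MSA-4B")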

3. Run inference on benchmarks

bash scripts/run_benchmarks.sh eval_benchmark

4. Compute LLM-based scores

bash scripts/calculate_llm_score.sh eval_benchmark