# MSA-4B

## Highlights
Long-term memory is essential for general intelligence, yet the full-attention bottleneck limits most LLMs to an effective context of 128K–1M tokens. Existing workarounds (hybrid linear attention, fixed-size state memory such as RNNs, and external storage such as RAG or agent pipelines) either suffer rapid precision decay and latency growth at extreme scales, lack end-to-end differentiability and dynamic memory maintenance, or require complex pipelines. We present Memory Sparse Attention (MSA): an end-to-end trainable, scalable sparse latent-state memory framework. Core ideas include:
- Scalable sparse attention + document-wise RoPE (parallel/global) achieving near-linear complexity in both training and inference;
- KV cache compression plus a Memory Parallel inference engine, enabling 100M-token inference on 2×A800 GPUs;
- Memory Interleave for multi-round, multi-hop reasoning across scattered memory segments.
On long-context QA and NIAH (Needle-in-a-Haystack) benchmarks, MSA outperforms same-backbone RAG, best-of-breed RAG stacks, and leading long-context models. Across a 16K–100M token range, MSA shows less than 9% degradation, suggesting a practical path toward decoupling memory capacity from reasoning.
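As a rough intuition for why sparse attention keeps per-query cost near-constant in context length, the toy NumPy sketch below splits a long KV cache into fixed blocks, scores each block coarsely against the query, and runs exact attention only over the top-k blocks. The block size, the mean-key scoring, and the top-k rule are illustrative assumptions for this sketch, not the actual MSA design.

```python
import numpy as np

# Toy block-sparse attention: attend to a small subset of a long KV
# cache instead of every token. All sizes here are arbitrary.
rng = np.random.default_rng(0)
d, block, n_blocks, topk = 64, 16, 32, 4
seq = block * n_blocks                      # 512 cached tokens
q = rng.standard_normal(d)                  # one query vector
K = rng.standard_normal((seq, d))           # cached keys
V = rng.standard_normal((seq, d))           # cached values

# Coarse block scores: query against each block's mean key.
block_keys = K.reshape(n_blocks, block, d).mean(axis=1)   # (32, 64)
keep = np.argsort(block_keys @ q)[-topk:]                 # top-k block ids

# Exact attention restricted to the selected blocks: the cost scales
# with topk * block (64 tokens here), not with the full sequence.
idx = (keep[:, None] * block + np.arange(block)).ravel()
scores = K[idx] @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()
out = w @ V[idx]
print(out.shape)  # (64,)
```

Growing `seq` by adding more blocks leaves the attended set fixed at `topk * block` tokens, which is the sense in which such schemes approach near-linear total complexity.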
## Model Overview
This model is based on Qwen3-4B-Instruct-2507 with Memory Sparse Attention (MSA).
- Number of Parameters: 4.0B
- Number of Layers: 36
- Number of MSA Layers: 18
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Based on Qwen/Qwen3-4B-Instruct-2507
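The GQA configuration above (32 query heads sharing 8 KV heads, i.e. 4 query heads per KV head) can be illustrated with a small NumPy shape sketch; the head dimension and sequence length below are arbitrary illustrative values, not taken from the model card.

```python
import numpy as np

# Grouped-query attention (GQA) shape sketch: each of the 8 KV heads
# is shared by a group of 32 // 8 = 4 query heads.
n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Broadcast each KV head across its query-head group, then do
# standard scaled-dot-product attention per head.
k_rep = np.repeat(k, group, axis=0)          # (32, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep
print(out.shape)  # (32, 16, 128)
```

The KV cache only stores the 8 KV heads, which is why GQA shrinks cache size relative to full multi-head attention with 32 KV heads.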
## Quick Start
1. Install dependencies
```shell
conda create -n msa python=3.12 -y
conda activate msa
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
2. Download model
```shell
mkdir ckpt
pip install -U huggingface_hub
huggingface-cli download Anoy123423123/MSA-4B --local-dir ckpt/MSA-4B
```
3. Run inference on benchmarks
```shell
bash scripts/run_benchmarks.sh eval_benchmark
```
4. Compute LLM-based scores
```shell
bash scripts/calculate_llm_score.sh eval_benchmark
```