metadata

license: mit
language:
  - en
base_model:
  - inclusionAI/Ling-mini-base-2.0-20T
pipeline_tag: text-generation
library_name: transformers
tags:
  - moe

Ring-mini-sparse-2.0-exp

🤗 Hugging Face | 🤖 ModelScope

Introduction

We are excited to announce the official release of Ring-mini-sparse-2.0-exp. This model employs a Mixture of Block Attention (MoBA) architecture, delivering highly efficient inference without compromising performance. It inherits from Ling-mini-base-2.0, with continued training on an additional 100B tokens. The performance of the MoBA-based model is on par with standard attention models of the same size (e.g., Ring-mini-v2). Furthermore, by applying YaRN-based 4× window extrapolation, we extend the context length to 128K tokens, delivering superior inference speed on tasks that involve long inputs and outputs.
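The context-length arithmetic behind the YaRN claim can be checked in a few lines. This is a minimal sketch: the 4× factor and the 128K target come from the paragraph above, while the `rope_scaling` dictionary only mirrors the key names commonly used for YaRN in Transformers configs and is an assumption, not this model's published configuration.

```python
# Context-window arithmetic for YaRN extrapolation (illustrative only).
TARGET_CONTEXT = 128 * 1024   # 128K tokens, as stated above
YARN_FACTOR = 4               # 4x window extrapolation, as stated above

# Native (pre-extrapolation) window implied by those two numbers.
native_window = TARGET_CONTEXT // YARN_FACTOR

# Hypothetical rope_scaling entry; the key names follow the usual
# Transformers convention for YaRN and are NOT copied from config.json.
rope_scaling = {
    "rope_type": "yarn",
    "factor": float(YARN_FACTOR),
    "original_max_position_embeddings": native_window,
}

print(native_window)  # 32768
```

In other words, a 4× extrapolation that reaches 128K implies a native window of 32K tokens.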

Figure 1: The Model Architecture of Ring-mini-sparse-2.0-exp

Evaluation

To comprehensively assess the reasoning capability of our model, we conducted evaluations on five challenging benchmarks spanning mathematics, coding, and science, comparing it with Ring-mini-2.0, Qwen3-8B-Thinking, and GPT-OSS-20B-Medium. The MoBA architecture demonstrates comparable performance to full softmax attention models.

Figure 2: Model Performance Comparison

Highly Sparse, High-Speed Generation

Ring-mini-sparse-2.0-exp achieves high inference efficiency through highly sparse attention and a Mixture-of-Experts (MoE) architecture. Unlike the MoBA used in Kimi, our approach shares the same KV block selection across all heads within a GQA group, reducing the total number of KV tokens each query head retrieves from the KV cache during decoding. During 64K-context decoding, only 8,192 key-value (KV) tokens are activated per query, which reduces KV cache retrieval overhead by 87.5% compared to full attention and delivers up to 3× inference speedup over Ring-mini-2.0. This design significantly lowers computational costs for high-concurrency scenarios involving reasoning-intensive models while maintaining competitive performance. Additionally, with YaRN extrapolation, the model extends context capacity to 128K tokens, achieving up to 2× relative speedup in long-input scenarios compared to Ring-mini-2.0 (full softmax attention).
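The idea of one block selection per GQA group, and the 87.5% figure, can be illustrated with a small pure-Python sketch. It rests on stated assumptions: the text above only says the selection is shared within a group, so summing per-head block scores is a hypothetical aggregation rule, the function names are made up for illustration, and causal masking and any always-selected blocks are left out.

```python
def select_shared_blocks(block_scores_per_head, top_k):
    """Pick one set of KV blocks for all query heads in a GQA group.

    block_scores_per_head: one list per query head, one score per KV block.
    Summing across heads is an assumed aggregation rule for illustration.
    """
    num_blocks = len(block_scores_per_head[0])
    group_scores = [
        sum(head[b] for head in block_scores_per_head)
        for b in range(num_blocks)
    ]
    # Rank blocks by group score and keep the top_k, in positional order.
    ranked = sorted(range(num_blocks), key=lambda b: group_scores[b], reverse=True)
    return sorted(ranked[:top_k])


def kv_retrieval_saving(context_len, active_kv_tokens):
    """Fraction of KV tokens that are NOT read, relative to full attention."""
    return 1.0 - active_kv_tokens / context_len


# Two query heads in one GQA group scoring four KV blocks.
scores = [
    [0.1, 0.9, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
]
shared = select_shared_blocks(scores, top_k=2)
print(shared)  # [1, 2]

# 8,192 active KV tokens out of a 64K context, as quoted above.
print(kv_retrieval_saving(64 * 1024, 8192))  # 0.875
```

Because every head in the group reads the same blocks, the selected KV blocks are fetched once per group rather than once per head.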

Figure 4: Inference speedup ratios of Ring-mini-sparse-2.0-exp compared to Ring-mini-2.0.

Quickstart

🤗 Hugging Face Transformers

Installation requirements:

pip install flash-attn==2.6.3
pip install transformers==4.56.1

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-mini-sparse-2.0-exp"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


prompts = [
    "Give me a short introduction to large language models."
]
input_texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    input_texts.append(text)

print(input_texts)

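# Left-pad so every prompt ends right where generation starts.
model_inputs = tokenizer(
    input_texts,
    return_tensors="pt",
    return_token_type_ids=False,
    padding=True,
    padding_side='left',
).to(model.device)

# Greedy decoding (do_sample=False) with a generous budget for long reasoning outputs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=False,
)
# Drop the (padded) prompt tokens so only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]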

responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print("*" * 30)
print(responses)
print("*" * 30)
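The prompt-trimming step in the snippet above can be seen in isolation with toy data. This is a sketch only: the token IDs are made up and plain lists stand in for the tensors that the tokenizer and `generate` return.

```python
# Toy token IDs standing in for tokenizer/model output (not real vocabulary IDs).
PAD = 0
input_ids = [
    [PAD, PAD, 11, 12],   # short prompt, left-padded to the batch length
    [21, 22, 23, 24],     # longest prompt in the batch
]
# generate() returns each padded prompt followed by its new tokens.
generated = [
    [PAD, PAD, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]

# Same slicing as in the snippet above: drop the first len(prompt) positions.
new_tokens = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(new_tokens)  # [[101, 102], [201, 202]]
```

Left padding keeps every prompt flush against its continuation, so a single slice per row removes both the padding and the prompt.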

🚀 SGLang

Environment Preparation

We have submitted a PR to the official SGLang repository, and it will be merged later. For now, prepare the environment with the following steps. First, install the community version of SGLang and the required packages:

pip install sglang==0.5.3 sgl-kernel==0.3.15 torch==2.8.0 torchvision==0.23.0 torchao

Then install our SGLang wheel package:

git clone https://github.com/inclusionAI/Ring-V2.git
pip install Ring-V2/moba/whls/sglang-0.5.3.post1-py3-none-any.whl --no-deps --force-reinstall
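After both install steps, it may help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the expected versions are taken from the commands above (with the SGLang version read off the wheel file name), and the helper names are hypothetical.

```python
from importlib import metadata

# Versions pinned by the install commands above; the SGLang version is the
# one carried by the wheel file name.
EXPECTED = {
    "sglang": "0.5.3.post1",
    "sgl-kernel": "0.3.15",
    "torch": "2.8.0",
    "torchvision": "0.23.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each mismatching package to (expected, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Local build tags such as "+cu128" are ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in check_environment().items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result from `check_environment()` means every pinned package matches; anything printed points at a package to reinstall.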

Run Inference

Our model is now supported by SGLang. You can launch the server with the following command:

Start server:

python -m sglang.launch_server \
    --model-path <model_path> \
    --trust-remote-code \
    --tp-size 4 \
    --disable-radix-cache \
    --chunked-prefill-size 0 \
    --attention-backend moba

Client (set PORT to the port the server is listening on):

curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
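The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the default port of 30000 is a placeholder to replace with the port your server actually uses, and the response is assumed to follow the OpenAI-style chat-completions layout that the endpoint path suggests.

```python
import json
import urllib.request


def build_chat_request(prompt, port=30000, temperature=0.6):
    """Build the same POST request as the curl command above.

    The default port is an assumption; pass the port your server uses.
    """
    payload = {
        "model": "auto",
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Response layout assumed to follow the OpenAI chat-completions schema.
    return body["choices"][0]["message"]["content"]


# Usage, once the server is running:
# print(chat("Give me a short introduction to large language models."))
```

Keeping request construction separate from sending makes the payload easy to inspect or log without a live server.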

More usage can be found here