bestfleer committed on
Commit
a4baf3b
·
verified ·
1 Parent(s): b899dff

Add files using upload-large-folder tool

LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 inclusionAI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ license: mit
+ language:
+ - en
+ base_model:
+ - inclusionAI/Ling-mini-base-2.0-20T
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - moe
+ ---
+ # Ring-mini-sparse-2.0-exp
+
+ <p align="center">
+ <img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
+ </p>
+ <p align="center">🤗 <a href="https://huggingface.co/inclusionAI/Ring-mini-sparse-2.0-exp">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/inclusionAI/Ring-mini-sparse-2.0-exp">ModelScope</a></p>
+
+ ## Introduction
+
+ We are excited to announce the official release of Ring-mini-sparse-2.0-exp. This model employs a Mixture of Block Attention (MoBA) architecture, delivering highly efficient inference without compromising performance. It is initialized from [Ling-mini-base-2.0](https://huggingface.co/inclusionAI/Ling-mini-base-2.0-20T) and continually trained on an additional 100B tokens. The MoBA-based model performs on par with standard-attention models of the same size (e.g., Ring-mini-2.0). Furthermore, by applying YaRN-based 4× window extrapolation, we extend the context length to 128K tokens, delivering superior inference speed on tasks that involve long inputs and outputs.
+
+ <div style="display: flex; justify-content: center;">
+ <div style="text-align: center;">
+ <img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/PIoSTKEzmsEAAAAAU5AAAAgADlCHAQFr/original" width="800">
+ <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 1:</strong> The Model Architecture of Ring-mini-sparse-2.0-exp</p>
+ </div>
+ </div>
+
+ ## Evaluation
+
+ To comprehensively assess the reasoning capability of our model, we evaluated it on five challenging benchmarks spanning mathematics, coding, and science, comparing it with Ring-mini-2.0, Qwen3-8B-Thinking, and GPT-OSS-20B-Medium. The MoBA architecture performs comparably to full softmax attention models.
+
+ <div style="display: flex; justify-content: center;">
+ <div style="text-align: center;">
+ <img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/Yr7eRreHNNUAAAAAWfAAAAgADlCHAQFr/original" width="100%">
+ <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 2:</strong> Model Performance Comparison </p>
+ </div>
+ </div>
+
+ ## Highly Sparse, High-Speed Generation
+
+ Ring-mini-sparse-2.0-exp achieves high inference efficiency through highly sparse attention and a Mixture-of-Experts (MoE) architecture. Unlike the MoBA used in Kimi, our approach shares the same KV block selection across all heads within a GQA group, reducing the total number of KV tokens each query head retrieves from the KV cache during decoding. During 64K-context decoding, only 8,192 key-value (KV) tokens are activated per query, which reduces KV cache retrieval overhead by 87.5% compared to full attention and delivers up to 3× inference speedup over Ring-mini-2.0. This design significantly lowers computational costs for high-concurrency scenarios involving reasoning-intensive models while maintaining competitive performance. Additionally, with YaRN extrapolation, the model extends context capacity to 128K tokens, achieving up to 2× relative speedup in long-input scenarios compared to Ring-mini-2.0 (full softmax attention).
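The decoding figures quoted in this section are consistent with two values in this commit's `config.json` (`moba_block_size = 1024`, `moba_topk = 8`). The following is a minimal sketch of that arithmetic, assuming each query reads exactly `moba_topk` full blocks and that "64K" means 65,536 tokens. The helper name `kv_budget` is illustrative and not part of the repository.

```python
def kv_budget(context_len: int, block_size: int = 1024, topk: int = 8) -> tuple[int, float]:
    """Return (KV tokens read per query, fraction of the full cache skipped)."""
    # A query reads at most topk blocks; short contexts are capped at their own length.
    active = min(context_len, block_size * topk)
    skipped = 1.0 - active / context_len
    return active, skipped


active, skipped = kv_budget(64 * 1024)
print(active, f"{skipped:.1%}")  # 8192 87.5%

# YaRN 4x extrapolation of the 32,768-token window (max_position_embeddings).
print(4 * 32768)  # 131072, i.e. 128K tokens
```

Under these assumptions the per-query KV budget stays fixed at 8,192 tokens as the context grows, so the skipped fraction increases with context length.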
+
+ <div style="text-align: center;">
+ <p align="center">
+ <img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/iL_eTZP-FVEAAAAATOAAAAgADlCHAQFr/original" width="500">
+ </p>
+ <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Inference speedup ratios of Ring-mini-sparse-2.0-exp compared to Ring-mini-2.0.</p>
+ </div>
+
+ ## Quickstart
+
+ ### 🤗 Hugging Face Transformers
+ Installation requirements:
+
+ ```shell
+ pip install transformers==4.56.1
+ ```
+
+ The following snippet shows how to use the chat model with `transformers`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "inclusionAI/Ring-mini-sparse-2.0-exp"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+
+ prompts = [
+     "Give me a short introduction to large language models."
+ ]
+ input_texts = []
+ for prompt in prompts:
+     messages = [
+         {"role": "user", "content": prompt}
+     ]
+     text = tokenizer.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+     input_texts.append(text)
+
+ print(input_texts)
+
+ model_inputs = tokenizer(input_texts, return_tensors="pt", return_token_type_ids=False, padding=True, padding_side='left').to(model.device)
+
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=8192,
+     do_sample=False,
+ )
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+
+ print("*" * 30)
+ print(responses)
+ print("*" * 30)
+ ```
+
+ ### 🚀 SGLang
+
+ #### Environment Preparation
+
+ We have submitted a PR to the official SGLang repository, and it is awaiting merge. Until then, prepare the environment with the following steps. First, install the community version of SGLang and the required packages:
+ ```shell
+ pip install sglang==0.5.3 sgl-kernel==0.3.15 torch==2.8.0 torchvision==0.23.0 torchao
+ ```
+
+ Then install our SGLang wheel package:
+ ```shell
+ pip install http://raw.githubusercontent.com/inclusionAI/Ring-V2/blob/main/moba/whls/sglang-0.5.3.post1-py3-none-any.whl --no-deps --force-reinstall
+ ```
+
+ #### Run Inference
+
+ Our model is now supported by SGLang. You can launch the server with the following command:
+
+ - Start server:
+ ```shell
+ python -m sglang.launch_server \
+     --model-path <model_path> \
+     --trust-remote-code \
+     --tp-size 4 \
+     --disable-radix-cache \
+     --chunked-prefill-size 0 \
+     --attention-backend moba
+ ```
+
+ - Client:
+
+ ```shell
+ curl -s http://localhost:${PORT}/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
+ ```
+
+ More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
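The `curl` request from the client step can also be issued from Python. The sketch below uses only the standard library and assumes the server is listening on `localhost`. The default port `30000` and the helper names `build_chat_request` and `send` are assumptions for illustration, so substitute the port the server was actually launched with.

```python
import json
import urllib.request


def build_chat_request(prompt: str, port: int = 30000) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": "auto",
        "temperature": 0.6,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


req = build_chat_request("Give me a short introduction to large language models.")
print(req.full_url)
# With the server running: print(send(req))
```

Keeping request construction separate from the network call makes the payload easy to inspect before any server is available.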
config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "architectures": [
+     "BailingMoeV2ForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_bailing_moe_v2.BailingMoeV2Config",
+     "AutoModel": "modeling_bailing_moe_v2.BailingMoeV2Model",
+     "AutoModelForCausalLM": "modeling_bailing_moe_v2.BailingMoeV2ForCausalLM"
+   },
+   "num_hidden_layers": 20,
+   "hidden_size": 2048,
+   "intermediate_size": 5120,
+   "eos_token_id": 156892,
+   "pad_token_id": 156892,
+   "first_k_dense_replace": 1,
+   "hidden_act": "silu",
+   "max_position_embeddings": 32768,
+   "model_type": "bailing_moe",
+   "moe_intermediate_size": 512,
+   "norm_topk_prob": true,
+   "num_experts_per_tok": 8,
+   "num_attention_heads": 16,
+   "num_experts": 256,
+   "num_key_value_heads": 4,
+   "rope_theta": 600000,
+   "rope_scaling": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.52.3",
+   "use_bias": false,
+   "use_rmsnorm": true,
+   "rms_norm_eps": 1e-06,
+   "head_dim": 128,
+   "num_shared_experts": 1,
+   "use_cache": true,
+   "use_qkv_bias": false,
+   "embedding_dropout": 0.0,
+   "output_dropout": 0.0,
+   "vocab_size": 157184,
+   "partial_rotary_factor": 0.5,
+   "router_dtype": "fp32",
+   "moe_router_enable_expert_bias": true,
+   "routed_scaling_factor": 2.5,
+   "n_group": 8,
+   "topk_group": 4,
+   "use_qk_norm": true,
+   "score_function": "sigmoid",
+   "moe_shared_expert_intermediate_size": 512,
+   "moba_block_size": 1024,
+   "moba_topk": 8,
+   "use_moba_decode": true,
+   "moba_layer_freq": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
+ }
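A few derived quantities help when reading this configuration. The sketch below copies a subset of the values above into a plain dictionary and computes them. Two interpretations are assumptions rather than facts confirmed by this diff: that a `1` in `moba_layer_freq` marks a MoBA layer and a `0` a full-attention layer, and that `partial_rotary_factor` scales `head_dim` to give the rotary dimension.

```python
config = {
    "num_hidden_layers": 20,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "head_dim": 128,
    "partial_rotary_factor": 0.5,
    "num_experts_per_tok": 8,
    "num_shared_experts": 1,
    # Compact form of the 20-entry list in config.json (assumed to match it).
    "moba_layer_freq": [0, 0, 0] + [1] * 16 + [0],
}

# One flag per decoder layer.
assert len(config["moba_layer_freq"]) == config["num_hidden_layers"]

# Query heads that share one KV head (and, per the README, one block selection).
gqa_group = config["num_attention_heads"] // config["num_key_value_heads"]
# Assumed meaning: only this many dimensions of each head receive RoPE.
rotary_dim = int(config["head_dim"] * config["partial_rotary_factor"])
# Assumed meaning: number of layers flagged with 1.
moba_layers = sum(config["moba_layer_freq"])
# Routed experts per token plus the always-on shared expert.
experts_per_token = config["num_experts_per_tok"] + config["num_shared_experts"]

print(gqa_group, rotary_dim, moba_layers, experts_per_token)  # 4 64 16 9
```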
configuration_bailing_moe_v2.py ADDED
@@ -0,0 +1,78 @@
+ """Bailing MoE model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class BailingMoeV2Config(PretrainedConfig):
+     model_type = "bailing_moe_v2"
+
+     def __init__(
+         self,
+         vocab_size=30592,
+         hidden_size=1024,
+         intermediate_size=None,
+         num_hidden_layers=24,
+         num_attention_heads=16,
+         num_key_value_heads=0,
+         hidden_act="silu",
+         use_qkv_bias=False,  # bailing only
+         use_bias=True,  # bailing only
+         rms_norm_eps=1e-05,
+         norm_head=False,  # bailing only
+         tie_word_embeddings=False,  # PretrainedConfig key, here change default value.
+         embedding_dropout=0.1,
+         attention_dropout=0.1,
+         output_dropout=0.1,
+         initializer_range=0.02,
+         max_position_embeddings=16384,
+         rope_theta=10000.0,
+         use_cache=True,
+         use_sliding_window=False,
+         sliding_window=4096,
+         max_window_layers=28,
+         rope_scaling=None,
+         pad_token_id=126081,
+         num_experts=16,
+         num_shared_experts=0,
+         num_experts_per_tok=2,
+         norm_topk_prob=True,
+         moe_intermediate_size=None,
+         first_k_dense_replace=0,
+         head_dim=None,
+         output_router_logits=False,
+         **kwargs,
+     ):
+         self.num_hidden_layers = num_hidden_layers
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.use_qkv_bias = use_qkv_bias
+         self.use_bias = use_bias
+         self.norm_head = norm_head
+         self.rms_norm_eps = rms_norm_eps
+         self.embedding_dropout = embedding_dropout
+         self.attention_dropout = attention_dropout
+         self.output_dropout = output_dropout
+         self.initializer_range = initializer_range
+         self.max_position_embeddings = max_position_embeddings
+         self.rope_theta = rope_theta
+         self.use_cache = use_cache
+         self.use_sliding_window = use_sliding_window
+         self.sliding_window = sliding_window
+         self.max_window_layers = max_window_layers
+         self.head_dim = head_dim or self.hidden_size // self.num_attention_heads
+         self.rope_scaling = rope_scaling
+
+         # MoE configs
+         self.num_experts = num_experts
+         self.num_shared_experts = num_shared_experts
+         self.num_experts_per_tok = num_experts_per_tok
+         self.norm_topk_prob = norm_topk_prob
+         self.moe_intermediate_size = moe_intermediate_size
+         self.first_k_dense_replace = first_k_dense_replace
+         self.output_router_logits = output_router_logits
+
+         super().__init__(pad_token_id=pad_token_id, tie_word_embeddings=tie_word_embeddings, **kwargs)
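One detail of the constructor that is easy to miss is the `head_dim` fallback, `head_dim or hidden_size // num_attention_heads`. The stand-alone sketch below reproduces just that expression so its behaviour can be checked without installing `transformers`. The function name `resolve_head_dim` is hypothetical and does not exist in the repository.

```python
from typing import Optional


def resolve_head_dim(hidden_size: int, num_attention_heads: int, head_dim: Optional[int] = None) -> int:
    # Mirrors `head_dim or self.hidden_size // self.num_attention_heads`.
    # Because of `or`, both None and 0 fall back to the derived value.
    return head_dim or hidden_size // num_attention_heads


print(resolve_head_dim(1024, 16))       # constructor defaults -> 64
print(resolve_head_dim(2048, 16, 128))  # values from config.json -> 128
```

Keys in `config.json` that are not named parameters here, such as `moba_block_size` and `moba_topk`, travel through `**kwargs` to `PretrainedConfig`, which stores unrecognised keyword arguments as attributes on the config object.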
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token_id": 156891,
+   "eos_token_id": [
+     156892,
+     156895
+   ],
+   "pad_token_id": 156892,
+   "transformers_version": "4.56.1"
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:efab52319e26654aba6a683fe3c5f7526ac5405fa64f42b68eca7695b599984f
+ size 8951195664
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aef03dc0f0606de5a240c3e993234461774a18a272bf2a072c7929e6ba8643f8
+ size 9834183392
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5cc079480e86885dacb049a7069151f69d28e23034860827adc02acd66be419a
+ size 9834186472
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46bb29007589c392f841974b70298b5f6e6ce787c1bde7b24e5cdf63620714b8
+ size 3893569552
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_bailing_moe_v2.py ADDED
@@ -0,0 +1,1597 @@
+ #!/usr/bin/python
+ #****************************************************************#
+ # ScriptName: modeling_bailing_moe_v2.py
+ # Author: $SHTERM_REAL_USER@alibaba-inc.com
+ # Create Date: 2025-08-12 20:22
+ # Modify Author: $SHTERM_REAL_USER@alibaba-inc.com
+ # Modify Date: 2025-08-12 20:22
+ # Function:
+ #***************************************************************#
+ # coding=utf-8
+ # Copyright 2023 Antgroup and The HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """PyTorch BailingMoE model."""
+ import math
+ import warnings
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import CrossEntropyLoss
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache
+ from transformers.modeling_attn_mask_utils import (
+     AttentionMaskConverter,
+     _prepare_4d_attention_mask,
+     _prepare_4d_causal_attention_mask,
+     _prepare_4d_causal_attention_mask_for_sdpa,
+ )
+ from transformers.modeling_outputs import (
+     MoeModelOutputWithPast,
+     MoeCausalLMOutputWithPast,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     is_flash_attn_2_available,
+     is_flash_attn_greater_or_equal_2_10,
+     logging,
+     replace_return_docstrings,
+ )
+ from transformers.utils.import_utils import is_torch_fx_available
+ from .configuration_bailing_moe_v2 import BailingMoeV2Config
+
+
+ if is_flash_attn_2_available():
+     from flash_attn import flash_attn_func, flash_attn_varlen_func
+     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+
+
+ # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
+ # It means that the function will not be traced through and simply appear as a node in the graph.
+ if is_torch_fx_available():
+     if not is_torch_greater_or_equal_than_1_13:
+         import torch.fx
+
+     _prepare_4d_causal_attention_mask = torch.fx.wrap(_prepare_4d_causal_attention_mask)
+
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "BailingMoeV2Config"
+
+
+ def _get_unpad_data(attention_mask):
+     # Number of real (non-padding) tokens in each sequence of the batch.
+     seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+     # Flat indices of the real tokens in the flattened (batch * seq_len) mask.
+     indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+     max_seqlen_in_batch = seqlens_in_batch.max().item()
+     # Cumulative sequence lengths with a leading 0.
+     cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+     return (
+         indices,
+         cu_seqlens,
+         max_seqlen_in_batch,
+     )
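To make the three return values of `_get_unpad_data` concrete, here is a pure-Python analogue that works on a list-of-lists 0/1 mask instead of a tensor. It is a sketch for illustration only; the name `get_unpad_data` and the use of plain lists are assumptions and not part of the model code.

```python
from itertools import accumulate


def get_unpad_data(attention_mask):
    """List-based analogue of _get_unpad_data for a 0/1 padding mask."""
    seqlens = [sum(row) for row in attention_mask]
    flat = [value for row in attention_mask for value in row]
    # Positions of real tokens in the flattened batch.
    indices = [i for i, value in enumerate(flat) if value]
    # Cumulative lengths with a leading zero, one boundary per sequence.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [
    [1, 1, 1, 0],  # 3 real tokens, 1 pad
    [1, 1, 0, 0],  # 2 real tokens, 2 pads
]
indices, cu_seqlens, max_len = get_unpad_data(mask)
print(indices)     # [0, 1, 2, 4, 5]
print(cu_seqlens)  # [0, 3, 5]
print(max_len)     # 3
```

The `cu_seqlens` boundaries let a variable-length attention kernel treat the packed tokens of all sequences as one long row without attending across sequence borders.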
+
+
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+     warnings.warn(
+         "Calling `transformers.models.BailingMoeV2.modeling_BailingMoeV2._prepare_4d_attention_mask` is deprecated and will be removed in v4.37. Use `transformers.modeling_attn_mask_utils._prepare_4d_attention_mask"
+     )
+     return _prepare_4d_attention_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
+
+
+ def _make_causal_mask(
+     input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+ ):
+     warnings.warn(
+         "Calling `transformers.models.BailingMoeV2.modeling_BailingMoeV2._make_causal_mask` is deprecated and will be removed in v4.37. Use `transformers.models.BailingMoeV2.modeling_BailingMoeV2.AttentionMaskConverter._make_causal_mask"
+     )
+     return AttentionMaskConverter._make_causal_mask(
+         input_ids_shape=input_ids_shape, dtype=dtype, device=device, past_key_values_length=past_key_values_length
+     )
+
+
+ class BailingMoeV2RMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         BailingMoeV2RMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
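The normalisation in `BailingMoeV2RMSNorm.forward` can be checked by hand on a short vector. The following pure-Python sketch applies the same formula, `x * rsqrt(mean(x^2) + eps) * weight`, to a list of floats. The function name `rms_norm` is illustrative, and the dtype handling of the real module (upcast to float32, cast back) is deliberately left out.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply per-element weights."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])
rms_out = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_out, 4))  # 1.0
```

Unlike LayerNorm, no mean is subtracted, so the direction of the input vector is preserved and only its scale changes.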
+
+
+ ALL_LAYERNORM_LAYERS.append(BailingMoeV2RMSNorm)
+
+
+ class BailingMoeV2RotaryEmbedding(nn.Module):
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+         super().__init__()
+
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+         # Build here to make `torch.jit.trace` work.
+         self._set_cos_sin_cache(
+             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+         )
+         # Resetting to None makes the first forward() call rebuild the cache
+         # with the device and dtype of its input.
+         self.max_seq_len_cached = None
+
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+
+         freqs = torch.outer(t, self.inv_freq.to(t.device))
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         if self.max_seq_len_cached is None or seq_len > self.max_seq_len_cached:
+             self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+         return (
+             self.cos_cached[:seq_len].to(dtype=x.dtype),
+             self.sin_cached[:seq_len].to(dtype=x.dtype),
+         )
+
+
+ # Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->BailingMoeV2
+ class BailingMoeV2LinearScalingRotaryEmbedding(BailingMoeV2RotaryEmbedding):
+     """BailingMoeV2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
+
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+         self.scaling_factor = scaling_factor
+         super().__init__(dim, max_position_embeddings, base, device)
+
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+         t = t / self.scaling_factor
+
+         freqs = torch.outer(t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+
+ # Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->BailingMoeV2
+ class BailingMoeV2DynamicNTKScalingRotaryEmbedding(BailingMoeV2RotaryEmbedding):
+     """BailingMoeV2RotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
+
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+         self.scaling_factor = scaling_factor
+         super().__init__(dim, max_position_embeddings, base, device)
+
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+
+         if seq_len > self.max_position_embeddings:
+             base = self.base * (
+                 (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
+             ) ** (self.dim / (self.dim - 2))
+             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+             self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+
+         freqs = torch.outer(t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
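The base adjustment used by the Dynamic NTK variant is a single scalar formula, so it can be explored without tensors. The sketch below isolates it in a hypothetical helper, `ntk_scaled_base`. The example values (`base=10000`, `dim=64`, a 2,048-token window) are illustrative defaults and are not the settings of this checkpoint.

```python
def ntk_scaled_base(base, dim, seq_len, max_position_embeddings, scaling_factor=1.0):
    """Return the RoPE base used when building a cache of length seq_len."""
    if seq_len <= max_position_embeddings:
        return float(base)  # inside the trained window the base is unchanged
    ratio = scaling_factor * seq_len / max_position_embeddings - (scaling_factor - 1)
    return base * ratio ** (dim / (dim - 2))


inside = ntk_scaled_base(10000, dim=64, seq_len=2048, max_position_embeddings=2048)
outside = ntk_scaled_base(10000, dim=64, seq_len=4096, max_position_embeddings=2048)
print(inside, outside > 2 * inside)  # 10000.0 True
```

Because the exponent `dim / (dim - 2)` is slightly above 1, doubling the sequence length more than doubles the base, which lowers every rotary frequency and stretches the usable position range.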
+
+
+ # Inverse dim formula to find dim based on number of rotations
+ def yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
+     return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))
+
+
+ # Find dim range bounds based on rotations
+ def yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
+     low = math.floor(yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings))
+     high = math.ceil(yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings))
+     return max(low, 0), min(high, dim - 1)  # Clamp values just in case
+
+
+ def yarn_get_mscale(scale=1, mscale=1):
+     if scale <= 1:
+         return 1.0
+     return 0.1 * mscale * math.log(scale) + 1.0
+
+
+ def yarn_linear_ramp_mask(min, max, dim):
+     if min == max:
+         max += 0.001  # Prevent singularity
+
+     linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
+     ramp_func = torch.clamp(linear_func, 0, 1)
+     return ramp_func
+
+
+ class BailingMoeV2YarnRotaryEmbedding(BailingMoeV2RotaryEmbedding):
+
+     def __init__(
+         self,
+         dim,
+         max_position_embeddings=2048,
+         base=10000,
+         device=None,
+         scaling_factor=1.0,
+         original_max_position_embeddings=4096,
+         beta_fast=32,
+         beta_slow=1,
+         mscale=1,
+         mscale_all_dim=0,
+     ):
+         self.scaling_factor = scaling_factor
+         self.original_max_position_embeddings = original_max_position_embeddings
+         self.beta_fast = beta_fast
+         self.beta_slow = beta_slow
+         self.mscale = mscale
+         self.mscale_all_dim = mscale_all_dim
+         super().__init__(dim, max_position_embeddings, base, device)
+
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         dim = self.dim
+
+         freq_extra = 1.0 / (self.base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim))
+         freq_inter = 1.0 / (
+             self.scaling_factor * self.base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim)
+         )
+
+         low, high = yarn_find_correction_range(
+             self.beta_fast,
+             self.beta_slow,
+             dim,
+             self.base,
+             self.original_max_position_embeddings,
+         )
+         inv_freq_mask = 1.0 - yarn_linear_ramp_mask(low, high, dim // 2).to(device=device, dtype=torch.float32)
+         inv_freq = freq_inter * (1 - inv_freq_mask) + freq_extra * inv_freq_mask
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+         t = torch.arange(seq_len, device=device, dtype=torch.float32)
+
+         freqs = torch.outer(t, inv_freq)
+
+         _mscale = float(
+             yarn_get_mscale(self.scaling_factor, self.mscale)
+             / yarn_get_mscale(self.scaling_factor, self.mscale_all_dim)
+         )
+
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", (emb.cos() * _mscale).to(dtype), persistent=False)
+         self.register_buffer("sin_cached", (emb.sin() * _mscale).to(dtype), persistent=False)
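The scalar YaRN helpers are pure `math` functions, so their output can be inspected directly. The sketch below restates three of them and evaluates them under assumed settings: a rotary dimension of 64 (`head_dim` 128 times `partial_rotary_factor` 0.5), `rope_theta` 600000, an original window of 32,768 tokens, and the 4× factor mentioned in the README. The shipped `config.json` has `rope_scaling: null`, so these inputs are assumptions about how YaRN would be configured, not values read from the checkpoint.

```python
import math


def yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
    # Dimension index whose wavelength completes num_rotations over the window.
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
    low = math.floor(yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)


def yarn_get_mscale(scale=1, mscale=1):
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


# Assumed settings (see the paragraph above); beta_fast=32 and beta_slow=1 are the class defaults.
low, high = yarn_find_correction_range(32, 1, dim=64, base=600000, max_position_embeddings=32768)
attn_scale = yarn_get_mscale(4.0, 1) / yarn_get_mscale(4.0, 0)
print(low, high)             # 12 21
print(round(attn_scale, 4))  # 1.1386
```

Frequency indices below `low` keep their original values, indices above `high` are fully interpolated by the scaling factor, and the linear ramp in `yarn_linear_ramp_mask` blends the range in between.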
301
+
302
+
303
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
304
+ def rotate_half(x):
305
+ """Rotates half the hidden dims of the input."""
306
+ x1 = x[..., : x.shape[-1] // 2]
307
+ x2 = x[..., x.shape[-1] // 2 :]
308
+ return torch.cat((-x2, x1), dim=-1)
309
+
310
+
311
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
312
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
313
+ """Applies Rotary Position Embedding to the query and key tensors.
314
+
315
+ Args:
316
+ q (`torch.Tensor`): The query tensor.
317
+ k (`torch.Tensor`): The key tensor.
318
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
319
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
320
+ position_ids (`torch.Tensor`):
321
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
322
+ used to pass offsetted position ids when working with a KV-cache.
323
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
324
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
325
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
326
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
327
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
328
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
329
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
330
+ Returns:
331
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
332
+ """
333
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
334
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
335
+ q_embed = (q * cos) + (rotate_half(q) * sin)
336
+ k_embed = (k * cos) + (rotate_half(k) * sin)
337
+ return q_embed, k_embed
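The rotation performed by `rotate_half` and `apply_rotary_pos_emb` can be checked without tensors. The following is a minimal, dependency-free sketch rather than the model's code path: `rope_angles`, `apply_rope`, and `dot` are hypothetical helper names, and the `base=10000.0` default is an assumption (the model reads its base from `config.rope_theta`). It illustrates that the query/key dot product after rotation depends only on the relative position offset.

```python
import math


def rotate_half(x):
    """List-based analogue of rotate_half for a single head vector."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1


def rope_angles(position, head_dim, base=10000.0):
    """Angles for one position, duplicated like torch.cat((freqs, freqs), dim=-1)."""
    inv_freq = [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]
    freqs = [position * f for f in inv_freq]
    return freqs + freqs


def apply_rope(x, position, base=10000.0):
    """x * cos + rotate_half(x) * sin, applied elementwise to one vector."""
    angles = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [xi * math.cos(a) + ri * math.sin(a) for xi, ri, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Both pairs have a relative offset of 3 positions.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 13), apply_rope(k, 10))
```

Because each pair of dimensions `(i, i + head_dim // 2)` is rotated by the same angle, the vector norm is unchanged and `score_a` matches `score_b` up to floating-point error.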
338
+
339
+
340
+ class BailingMoeV2MLP(nn.Module):
341
+ def __init__(self, config: BailingMoeV2Config, intermediate_size: int):
342
+ super().__init__()
343
+ self.config = config
344
+ self.hidden_size = config.hidden_size
345
+ self.intermediate_size = intermediate_size
346
+
347
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
348
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
349
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
350
+ self.act_fn = ACT2FN[config.hidden_act]
351
+
352
+ def forward(self, x):
353
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
354
+
355
+
356
+ class BailingMoeV2Gate(nn.Module):
357
+ def __init__(self, config):
358
+ super().__init__()
359
+ self.config = config
360
+ self.top_k = config.num_experts_per_tok
361
+ self.num_experts = config.num_experts
362
+
363
+ # topk selection algorithm
364
+ self.norm_topk_prob = config.norm_topk_prob
365
+ self.gating_dim = config.hidden_size
366
+ self.weight = nn.Parameter(torch.empty((self.num_experts, self.gating_dim)))
367
+ self.moe_router_topk_scaling_factor = config.moe_router_topk_scaling_factor
368
+
369
+ if self.config.use_expert_bias:
370
+ self.register_buffer("expert_bias", torch.zeros((self.num_experts)))
371
+ self.reset_parameters()
372
+
373
+ def reset_parameters(self) -> None:
374
+ import torch.nn.init as init
375
+
376
+ init.kaiming_uniform_(self.weight, a=math.sqrt(5))
377
+
378
+ def forward(self, hidden_states):
379
+ # compute gating score
380
+ hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
381
+ logits = F.linear(hidden_states, self.weight, None)
382
+
383
+ if self.config.gate_score_function == 'softmax':
384
+ scores = logits.softmax(dim=-1, dtype=torch.float32)
385
+
386
+ # select top-k experts
387
+ topk_weight, topk_idx = torch.topk(scores, k=self.top_k, dim=-1)
388
+
389
+ # norm gate to sum 1
390
+ if self.top_k > 1 and self.norm_topk_prob:
391
+ denominator = topk_weight.sum(dim=-1, keepdim=True)
392
+ topk_weight = topk_weight / denominator
393
+ topk_weight = topk_weight * self.moe_router_topk_scaling_factor
394
+
395
+ return topk_idx, topk_weight, logits
396
+ elif self.config.gate_score_function == 'sigmoid':
397
+ scores = torch.sigmoid(logits)
398
+
399
+ if self.config.use_expert_bias:
400
+ scores_for_routing = scores + self.expert_bias
401
+ _, topk_idx = torch.topk(scores_for_routing, k=self.top_k, dim=-1)
402
+ scores = torch.gather(scores, dim=1, index=topk_idx).type_as(logits)
403
+ else:
404
+ scores, topk_idx = torch.topk(scores, k=self.top_k, dim=-1)
405
+ topk_weight = scores / (scores.sum(dim=-1, keepdim=True) + 1e-20) if self.top_k > 1 else scores
406
+ topk_weight = topk_weight * self.moe_router_topk_scaling_factor
407
+
408
+ return topk_idx, topk_weight, logits
409
+ else:
410
+ raise ValueError(f"Unsupported gate_score_function: {self.config.gate_score_function}")
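In the sigmoid branch of the gate, `expert_bias` influences only which experts are selected; the mixing weights are gathered from the unbiased scores. A minimal pure-Python sketch of that behavior for a single token follows. `route_sigmoid` is a hypothetical name, and the example logits and biases are made up for illustration.

```python
import math


def route_sigmoid(logits, expert_bias, top_k, scaling_factor=1.0):
    """Select top_k experts by (score + bias), weight them by the raw scores."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    biased = [s + b for s, b in zip(scores, expert_bias)]
    # Indices of the top_k largest biased scores, best first.
    topk_idx = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Gather the *unbiased* scores for the chosen experts.
    picked = [scores[i] for i in topk_idx]
    if top_k > 1:
        total = sum(picked) + 1e-20
        picked = [p / total for p in picked]
    return topk_idx, [p * scaling_factor for p in picked]


logits = [2.0, 0.5, -1.0, 1.5]
idx_plain, w_plain = route_sigmoid(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
idx_biased, w_biased = route_sigmoid(logits, [0.0, 0.0, 1.0, 0.0], top_k=2)
```

With a bias of `1.0` on expert 2, that expert is selected even though its raw score is the lowest, yet it still receives the smaller of the two normalized weights.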
411
+
412
+
413
+ class BailingMoeV2SparseMoeBlock(nn.Module):
414
+ """
415
+ A mixture-of-experts (MoE) block: routed experts plus optional shared experts applied to every token.
416
+ """
417
+
418
+ def __init__(self, config: BailingMoeV2Config):
419
+ super().__init__()
420
+ self.config = config
421
+ self.num_experts_per_tok = config.num_experts_per_tok
422
+ self._setup_experts()
423
+ self.gate = BailingMoeV2Gate(config)
424
+ if config.num_shared_experts is not None:
425
+ self.shared_experts = BailingMoeV2MLP(
426
+ config=config, intermediate_size=config.moe_intermediate_size * config.num_shared_experts
427
+ )
428
+
429
+ def _setup_experts(self):
430
+ self.experts = nn.ModuleList(
431
+ [
432
+ BailingMoeV2MLP(config=self.config, intermediate_size=self.config.moe_intermediate_size)
433
+ for _ in range(self.config.num_experts)
434
+ ]
435
+ )
436
+
437
+ def forward(self, hidden_states):
438
+ identity = hidden_states
439
+ bsz, seq_len, h = hidden_states.shape
440
+ topk_idx, topk_weight, router_logits = self.gate(hidden_states)
441
+ hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
442
+ flat_topk_idx = topk_idx.view(-1)
443
+ if self.training:
444
+ hidden_states = hidden_states.repeat_interleave(self.num_experts_per_tok, dim=0)
445
+ y = torch.empty_like(hidden_states)
446
+ for i, expert in enumerate(self.experts):
447
+ y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
448
+ y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
449
+ y = y.to(hidden_states.dtype).view(bsz, seq_len, h)
450
+ else:
451
+ y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(bsz, seq_len, h)
452
+ if self.config.num_shared_experts is not None:
453
+ y = y + self.shared_experts(identity)
454
+ return y, (router_logits.view(bsz, seq_len, -1), topk_idx.view(bsz, seq_len, -1))
455
+
456
+ @torch.no_grad()
457
+ def moe_infer(self, x, topk_ids, topk_weight):
458
+ cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
459
+ cnts.scatter_(1, topk_ids, 1)
460
+ tokens_per_expert = cnts.sum(dim=0)
461
+ idxs = topk_ids.view(-1).argsort()
462
+ sorted_tokens = x[idxs // topk_ids.shape[1]]
463
+ sorted_tokens_shape = sorted_tokens.shape
464
+ tokens_per_expert = tokens_per_expert.cpu().numpy()
465
+ outputs = []
466
+ start_idx = 0
467
+ for i, num_tokens in enumerate(tokens_per_expert):
468
+ end_idx = start_idx + num_tokens
469
+ if num_tokens == 0:
470
+ continue
471
+ expert = self.experts[i]
472
+ tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
473
+ expert_out = expert(tokens_for_this_expert)
474
+ outputs.append(expert_out)
475
+ start_idx = end_idx
476
+
477
+ outs = torch.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
478
+ new_x = torch.empty_like(outs)
479
+ new_x[idxs] = outs
480
+ final_out = (
481
+ new_x.view(*topk_ids.shape, -1)
482
+ .type(topk_weight.dtype)
483
+ .mul_(topk_weight.unsqueeze(dim=-1))
484
+ .sum(dim=1)
485
+ .type(new_x.dtype)
486
+ )
487
+ return final_out
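The inference path in `moe_infer` sorts the flattened (token, slot) assignments by expert id so that each expert processes one contiguous slice, then scatters the results back before the weighted sum. Below is a hedged pure-Python sketch of the same idea, using scalar "tokens" and plain callables as experts. The name `moe_infer_sketch` and the toy experts are illustrative only.

```python
def moe_infer_sketch(tokens, topk_ids, topk_weight, experts):
    """Sort-based expert dispatch for scalar tokens; experts are callables."""
    top_k = len(topk_ids[0])
    flat_ids = [e for row in topk_ids for e in row]  # like topk_ids.view(-1)
    # argsort by expert id: slots routed to the same expert become contiguous
    order = sorted(range(len(flat_ids)), key=flat_ids.__getitem__)
    tokens_per_expert = [flat_ids.count(e) for e in range(len(experts))]

    out_flat = [0.0] * len(flat_ids)
    start = 0
    for expert_id, count in enumerate(tokens_per_expert):
        if count == 0:
            continue  # this expert received no tokens
        for slot in order[start:start + count]:
            # slot // top_k recovers the token index, as idxs // topk_ids.shape[1] does
            out_flat[slot] = experts[expert_id](tokens[slot // top_k])
        start += count

    # Weighted sum over the top_k slots of each token.
    return [
        sum(topk_weight[t][s] * out_flat[t * top_k + s] for s in range(top_k))
        for t in range(len(tokens))
    ]


experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
tokens = [1.0, 2.0, 3.0]
topk_ids = [[0, 2], [1, 0], [2, 1]]
topk_weight = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
routed = moe_infer_sketch(tokens, topk_ids, topk_weight, experts)
```

The result matches a naive per-token loop over the selected experts; the sorted layout only changes the order in which expert calls are issued.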
488
+
489
+
490
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
491
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
492
+ """
493
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
494
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
495
+ """
496
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
497
+ if n_rep == 1:
498
+ return hidden_states
499
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
500
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
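As a small illustration of the head layout `repeat_kv` produces, the sketch below repeats a list of key/value head labels. `repeat_kv_heads` is a hypothetical stand-in that operates on a plain list rather than a tensor.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times, consecutively (repeat_interleave order)."""
    if n_rep == 1:
        return kv_heads
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv_heads(["kv0", "kv1"], 3)
# Query head h attends against KV head h // n_rep.
pairing = [(h, expanded[h]) for h in range(len(expanded))]
```

Each KV head is shared by `n_rep` consecutive query heads, which is why the attention modules compute `num_key_value_groups = num_heads // num_key_value_heads`.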
501
+
502
+
503
+ # Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->BailingMoeV2
504
+ class BailingMoeV2Attention(nn.Module):
505
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
506
+
507
+ def __init__(self, config: BailingMoeV2Config, layer_idx: Optional[int] = None):
508
+ super().__init__()
509
+ self.config = config
510
+ self.layer_idx = layer_idx
511
+ if layer_idx is None:
512
+ logger.warning_once(
513
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
514
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
515
+ "when creating this class."
516
+ )
517
+
518
+ self.attention_dropout = config.attention_dropout
519
+ self.hidden_size = config.hidden_size
520
+ self.num_heads = config.num_attention_heads
521
+ self.head_dim = config.head_dim or self.hidden_size // self.num_heads
522
+ self.num_key_value_heads = config.num_key_value_heads
523
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
524
+ self.max_position_embeddings = config.max_position_embeddings
525
+ self.rope_theta = config.rope_theta
526
+ self.is_causal = True
527
+
528
+ self.query_key_value = nn.Linear(
529
+ self.hidden_size,
530
+ (self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
531
+ bias=config.use_qkv_bias,
532
+ )
533
+
534
+ if self.config.use_qk_norm:
535
+ self.q_norm = BailingMoeV2RMSNorm(self.head_dim, eps=config.rms_norm_eps)
536
+ self.k_norm = BailingMoeV2RMSNorm(self.head_dim, eps=config.rms_norm_eps)
537
+ self.dense = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.use_bias)
538
+ self._init_rope()
539
+
540
+ def _init_rope(self):
541
+ if self.config.rope_scaling is None:
542
+ self.rotary_emb = BailingMoeV2RotaryEmbedding(
543
+ self.head_dim,
544
+ max_position_embeddings=self.max_position_embeddings,
545
+ base=self.rope_theta,
546
+ )
547
+ else:
548
+ scaling_type = self.config.rope_scaling["type"]
549
+ scaling_factor = self.config.rope_scaling["factor"]
550
+ if scaling_type == "linear":
551
+ self.rotary_emb = BailingMoeV2LinearScalingRotaryEmbedding(
552
+ self.head_dim,
553
+ max_position_embeddings=self.max_position_embeddings,
554
+ scaling_factor=scaling_factor,
555
+ base=self.rope_theta,
556
+ )
557
+ elif scaling_type == "dynamic":
558
+ self.rotary_emb = BailingMoeV2DynamicNTKScalingRotaryEmbedding(
559
+ self.head_dim,
560
+ max_position_embeddings=self.max_position_embeddings,
561
+ scaling_factor=scaling_factor,
562
+ base=self.rope_theta,
563
+ )
564
+ elif scaling_type == "yarn":
565
+ kwargs = {
566
+ key: self.config.rope_scaling[key]
567
+ for key in [
568
+ "original_max_position_embeddings",
569
+ "beta_fast",
570
+ "beta_slow",
571
+ "mscale",
572
+ "mscale_all_dim",
573
+ ]
574
+ if key in self.config.rope_scaling
575
+ }
576
+ self.rotary_emb = BailingMoeV2YarnRotaryEmbedding(
577
+ self.head_dim,
578
+ max_position_embeddings=self.max_position_embeddings,
579
+ scaling_factor=scaling_factor,
580
+ base=self.rope_theta,
581
+ **kwargs,
582
+ )
583
+ else:
584
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
585
+
586
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
587
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
588
+
589
+ def forward(
590
+ self,
591
+ hidden_states: torch.Tensor,
592
+ attention_mask: Optional[torch.Tensor] = None,
593
+ position_ids: Optional[torch.LongTensor] = None,
594
+ past_key_value: Optional[Cache] = None,
595
+ output_attentions: bool = False,
596
+ use_cache: bool = False,
597
+ **kwargs,
598
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
599
+ if "padding_mask" in kwargs:
600
+ warnings.warn(
601
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
602
+ )
603
+
604
+ bsz, q_len, _ = hidden_states.size()
605
+
606
+ qkv = self.query_key_value(hidden_states)
607
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
608
+
609
+ query_states, key_states, value_states = qkv.split(
610
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
611
+ )
612
+ query_states = query_states.transpose(1, 2)
613
+ key_states = key_states.transpose(1, 2)
614
+ value_states = value_states.transpose(1, 2)
615
+
616
+ if self.config.use_qk_norm:
617
+ query_states = self.q_norm(query_states)
618
+ key_states = self.k_norm(key_states)
619
+
620
+ kv_seq_len = key_states.shape[-2]
621
+ if past_key_value is not None:
622
+ if self.layer_idx is None:
623
+ raise ValueError(
624
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
625
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
626
+ "with a layer index."
627
+ )
628
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
629
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
630
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
631
+
632
+ if past_key_value is not None:
633
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
634
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
635
+
636
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
637
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
638
+
639
+ attn_weights = torch.matmul(query_states / math.sqrt(self.head_dim), key_states.transpose(2, 3))
640
+
641
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
642
+ raise ValueError(
643
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
644
+ f" {attn_weights.size()}"
645
+ )
646
+
647
+ if attention_mask is not None:
648
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
649
+ raise ValueError(
650
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
651
+ )
652
+ attn_weights = attn_weights + attention_mask
653
+
654
+ # upcast attention to fp32
655
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
656
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
657
+ attn_output = torch.matmul(attn_weights, value_states)
658
+
659
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
660
+ raise ValueError(
661
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
662
+ f" {attn_output.size()}"
663
+ )
664
+
665
+ attn_output = attn_output.transpose(1, 2).contiguous()
666
+
667
+ attn_output = attn_output.reshape(bsz, q_len, -1)
668
+
669
+ attn_output = self.dense(attn_output)
670
+
671
+ if not output_attentions:
672
+ attn_weights = None
673
+
674
+ return attn_output, attn_weights, past_key_value
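The fused `query_key_value` projection emits `(num_heads + 2 * num_key_value_heads) * head_dim` features per token, which the forward pass views as heads and splits into query, key, and value groups. A minimal sketch of that split for one token's feature row, using plain lists; `split_fused_qkv` and the toy sizes are assumptions for illustration.

```python
def split_fused_qkv(row, num_heads, num_kv_heads, head_dim):
    """Split one fused projection row into per-head query, key and value lists."""
    total_heads = num_heads + 2 * num_kv_heads
    if len(row) != total_heads * head_dim:
        raise ValueError("row length does not match the fused QKV layout")
    # Equivalent of .view(..., total_heads, head_dim)
    heads = [row[i * head_dim:(i + 1) * head_dim] for i in range(total_heads)]
    # Equivalent of .split([num_heads, num_kv_heads, num_kv_heads], dim=-2)
    q = heads[:num_heads]
    k = heads[num_heads:num_heads + num_kv_heads]
    v = heads[num_heads + num_kv_heads:]
    return q, k, v


q_heads, k_heads, v_heads = split_fused_qkv(list(range(16)), num_heads=4, num_kv_heads=2, head_dim=2)
```

The layout is queries first, then keys, then values, so a checkpoint with separate projection weights would need them concatenated in that order along the output dimension.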
675
+
676
+
677
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 with Llama->BailingMoeV2
678
+ class BailingMoeV2FlashAttention2(BailingMoeV2Attention):
679
+ """
680
+ BailingMoeV2 flash attention module. This module inherits from `BailingMoeV2Attention` as the weights of the module stay
681
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
682
+ flash attention and deal with padding tokens in case the input contains any of them.
683
+ """
684
+
685
+ def __init__(self, *args, **kwargs):
686
+ super().__init__(*args, **kwargs)
687
+
688
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
689
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default in flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
690
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
691
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
692
+
693
+ def forward(
694
+ self,
695
+ hidden_states: torch.Tensor,
696
+ attention_mask: Optional[torch.LongTensor] = None,
697
+ position_ids: Optional[torch.LongTensor] = None,
698
+ past_key_value: Optional[Cache] = None,
699
+ output_attentions: bool = False,
700
+ use_cache: bool = False,
701
+ **kwargs,
702
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
703
+ # BailingMoeV2FlashAttention2 attention does not support output_attentions
704
+ if "padding_mask" in kwargs:
705
+ warnings.warn(
706
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
707
+ )
708
+
709
+ # overwrite attention_mask with padding_mask
710
+ attention_mask = kwargs.pop("padding_mask")
711
+
712
+ output_attentions = False
713
+
714
+ bsz, q_len, _ = hidden_states.size()
715
+
716
+ # Flash attention requires the input to have the shape
717
+ # batch_size x seq_length x num_heads x head_dim
718
+ # therefore we just need to keep the original shape
719
+
720
+ qkv = self.query_key_value(hidden_states)
721
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
722
+
723
+ query_states, key_states, value_states = qkv.split(
724
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
725
+ )
726
+ query_states = query_states.transpose(1, 2)
727
+ key_states = key_states.transpose(1, 2)
728
+ value_states = value_states.transpose(1, 2)
729
+
730
+ if self.config.use_qk_norm:
731
+ query_states = self.q_norm(query_states)
732
+ key_states = self.k_norm(key_states)
733
+
734
+ kv_seq_len = key_states.shape[-2]
735
+ if past_key_value is not None:
736
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
737
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
738
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
739
+
740
+ if past_key_value is not None:
741
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
742
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
743
+
744
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
745
+ # to be able to avoid many of these transpose/reshape/view.
746
+ query_states = query_states.transpose(1, 2)
747
+ key_states = key_states.transpose(1, 2)
748
+ value_states = value_states.transpose(1, 2)
749
+
750
+ dropout_rate = self.attention_dropout if self.training else 0.0
751
+
752
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
753
+ # therefore the input hidden states gets silently cast in float32. Hence, we need
754
+ # cast them back in the correct dtype just to be sure everything works as expected.
755
+ # This might slow down training & inference so it is recommended to not cast the LayerNorms
756
+ # in fp32. (BailingMoeV2RMSNorm handles it correctly)
757
+
758
+ input_dtype = query_states.dtype
759
+ if input_dtype == torch.float32:
760
+ # Handle the case where the model is quantized
761
+ if hasattr(self.config, "_pre_quantization_dtype"):
762
+ target_dtype = self.config._pre_quantization_dtype
763
+ elif torch.is_autocast_enabled():
764
+ target_dtype = torch.get_autocast_gpu_dtype()
765
+ else:
766
+ target_dtype = self.query_key_value.weight.dtype
767
+
768
+ logger.warning_once(
769
+ f"The input hidden states seem to have been silently cast to float32; this might be related to"
770
+ f" the fact you have upcast embedding or layer norm layers to float32. We will cast the input back to"
771
+ f" {target_dtype}."
772
+ )
773
+
774
+ query_states = query_states.to(target_dtype)
775
+ key_states = key_states.to(target_dtype)
776
+ value_states = value_states.to(target_dtype)
777
+
778
+ attn_output = self._flash_attention_forward(
779
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
780
+ )
781
+
782
+ attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
783
+ attn_output = self.dense(attn_output)
784
+
785
+ if not output_attentions:
786
+ attn_weights = None
787
+
788
+ return attn_output, attn_weights, past_key_value
789
+
790
+ def _flash_attention_forward(
791
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
792
+ ):
793
+ """
794
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
795
+ first unpad the input, then compute the attention scores and pad the final attention scores.
796
+
797
+ Args:
798
+ query_states (`torch.Tensor`):
799
+ Input query states to be passed to Flash Attention API
800
+ key_states (`torch.Tensor`):
801
+ Input key states to be passed to Flash Attention API
802
+ value_states (`torch.Tensor`):
803
+ Input value states to be passed to Flash Attention API
804
+ attention_mask (`torch.Tensor`):
805
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
806
+ position of padding tokens and 1 for the position of non-padding tokens.
807
+ dropout (`float`, *optional*):
808
+ Attention dropout
809
+ softmax_scale (`float`, *optional*):
810
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
811
+ query_length (`int`):
812
+ The length of the query sequence in terms of tokens. This represents the number of tokens in the
813
+ `query_states` tensor along the sequence dimension. It is used to determine the effective sequence
814
+ length for attention computations.
815
+ """
816
+ if not self._flash_attn_uses_top_left_mask:
817
+ causal = self.is_causal
818
+ else:
819
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in BailingMoeV2FlashAttention2 __init__.
820
+ causal = self.is_causal and query_length != 1
821
+
822
+ # Contains at least one padding token in the sequence
823
+ if attention_mask is not None:
824
+ batch_size = query_states.shape[0]
825
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
826
+ query_states, key_states, value_states, attention_mask, query_length
827
+ )
828
+
829
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
830
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
831
+
832
+ attn_output_unpad = flash_attn_varlen_func(
833
+ query_states,
834
+ key_states,
835
+ value_states,
836
+ cu_seqlens_q=cu_seqlens_q,
837
+ cu_seqlens_k=cu_seqlens_k,
838
+ max_seqlen_q=max_seqlen_in_batch_q,
839
+ max_seqlen_k=max_seqlen_in_batch_k,
840
+ dropout_p=dropout,
841
+ softmax_scale=softmax_scale,
842
+ causal=causal,
843
+ )
844
+
845
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
846
+ else:
847
+ attn_output = flash_attn_func(
848
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
849
+ )
850
+
851
+ return attn_output
852
+
853
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
854
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
855
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
856
+
857
+ key_layer = index_first_axis(
858
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
859
+ )
860
+ value_layer = index_first_axis(
861
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
862
+ )
863
+ if query_length == kv_seq_len:
864
+ query_layer = index_first_axis(
865
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
866
+ )
867
+ cu_seqlens_q = cu_seqlens_k
868
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
869
+ indices_q = indices_k
870
+ elif query_length == 1:
871
+ max_seqlen_in_batch_q = 1
872
+ cu_seqlens_q = torch.arange(
873
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
874
+ ) # There is a memcpy here, that is very bad.
875
+ indices_q = cu_seqlens_q[:-1]
876
+ query_layer = query_layer.squeeze(1)
877
+ else:
878
+ # The -q_len: slice assumes left padding.
879
+ attention_mask = attention_mask[:, -query_length:]
880
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
881
+
882
+ return (
883
+ query_layer,
884
+ key_layer,
885
+ value_layer,
886
+ indices_q,
887
+ (cu_seqlens_q, cu_seqlens_k),
888
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
889
+ )
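`_upad_input` relies on `_get_unpad_data`, which is defined elsewhere in this file and not shown in this part of the diff. The sketch below is an assumption about what that helper returns, based on the commonly used Transformers implementation: flat indices of non-padding tokens, cumulative sequence lengths, and the longest sequence length in the batch. It uses nested lists instead of tensors and the name `get_unpad_data_sketch` is hypothetical.

```python
def get_unpad_data_sketch(attention_mask):
    """attention_mask: list of rows with 1 for real tokens and 0 for padding."""
    width = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        b * width + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row)
        if keep
    ]
    # Cumulative sequence lengths with a leading zero: the boundaries of each
    # sequence inside the packed (unpadded) token stream.
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return indices, cu_seqlens, max(seqlens)


mask = [[0, 1, 1], [1, 1, 1]]  # first row is left-padded by one token
indices, cu_seqlens, max_seqlen = get_unpad_data_sketch(mask)
```

Sequence `b` occupies the packed range `cu_seqlens[b]:cu_seqlens[b + 1]`, which is the form `flash_attn_varlen_func` expects for `cu_seqlens_q` and `cu_seqlens_k`.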
890
+
891
+
892
+ # Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->BailingMoeV2
893
+ class BailingMoeV2SdpaAttention(BailingMoeV2Attention):
894
+ """
895
+ BailingMoeV2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
896
+ `BailingMoeV2Attention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
897
+ SDPA API.
898
+ """
899
+
900
+ # Adapted from BailingMoeV2Attention.forward
901
+ def forward(
902
+ self,
903
+ hidden_states: torch.Tensor,
904
+ attention_mask: Optional[torch.Tensor] = None,
905
+ position_ids: Optional[torch.LongTensor] = None,
906
+ past_key_value: Optional[Cache] = None,
907
+ output_attentions: bool = False,
908
+ use_cache: bool = False,
909
+ **kwargs,
910
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
911
+ if output_attentions:
912
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
913
+ logger.warning_once(
914
+ "BailingMoeV2Model is using BailingMoeV2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
915
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
916
+ )
917
+ return super().forward(
918
+ hidden_states=hidden_states,
919
+ attention_mask=attention_mask,
920
+ position_ids=position_ids,
921
+ past_key_value=past_key_value,
922
+ output_attentions=output_attentions,
923
+ use_cache=use_cache,
924
+ )
925
+
926
+ bsz, q_len, _ = hidden_states.size()
927
+
928
+ qkv = self.query_key_value(hidden_states)
929
+ qkv = qkv.view(bsz, q_len, self.num_heads + 2 * self.num_key_value_heads, self.head_dim)
930
+
931
+ query_states, key_states, value_states = qkv.split(
932
+ [self.num_heads, self.num_key_value_heads, self.num_key_value_heads], dim=-2
933
+ )
934
+ query_states = query_states.transpose(1, 2)
935
+ key_states = key_states.transpose(1, 2)
936
+ value_states = value_states.transpose(1, 2)
937
+
938
+ if self.config.use_qk_norm:
939
+ query_states = self.q_norm(query_states)
940
+ key_states = self.k_norm(key_states)
941
+
942
+ kv_seq_len = key_states.shape[-2]
943
+ if past_key_value is not None:
944
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
945
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
946
+
947
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
948
+
949
+ if past_key_value is not None:
950
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
951
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
952
+
953
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
954
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
955
+
956
+ if attention_mask is not None:
957
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
958
+ raise ValueError(
959
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
960
+ )
961
+
962
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
963
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
964
+ if query_states.device.type == "cuda" and attention_mask is not None:
965
+ query_states = query_states.contiguous()
966
+ key_states = key_states.contiguous()
967
+ value_states = value_states.contiguous()
968
+
969
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
970
+ query_states,
971
+ key_states,
972
+ value_states,
973
+ attn_mask=attention_mask,
974
+ dropout_p=self.attention_dropout if self.training else 0.0,
975
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
976
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
977
+ )
978
+
979
+ attn_output = attn_output.transpose(1, 2).contiguous()
980
+ attn_output = attn_output.reshape(bsz, q_len, -1)
981
+
982
+ attn_output = self.dense(attn_output)
983
+
984
+ return attn_output, None, past_key_value
985
+
986
+
987
+ ATTENTION_CLASSES = {
988
+ "eager": BailingMoeV2Attention,
989
+ "flash_attention_2": BailingMoeV2FlashAttention2,
990
+ "sdpa": BailingMoeV2SdpaAttention,
991
+ }
992
+
993
+
994
+ class BailingMoeV2DecoderLayer(nn.Module):
995
+ def __init__(self, config: BailingMoeV2Config, layer_idx: int):
996
+ super().__init__()
997
+ self.hidden_size = config.hidden_size
998
+
999
+ self.attention = ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
1000
+
1001
+ self.mlp = (
1002
+ BailingMoeV2SparseMoeBlock(config)
1003
+ if (config.num_experts is not None and layer_idx >= config.first_k_dense_replace)
1004
+ else BailingMoeV2MLP(config=config, intermediate_size=config.intermediate_size)
1005
+ )
1006
+ self.input_layernorm = BailingMoeV2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1007
+ self.post_attention_layernorm = BailingMoeV2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1008
+
1009
+ def forward(
1010
+ self,
1011
+ hidden_states: torch.Tensor,
1012
+ attention_mask: Optional[torch.Tensor] = None,
1013
+ position_ids: Optional[torch.LongTensor] = None,
1014
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
1015
+ output_attentions: Optional[bool] = False,
1016
+ output_router_logits: Optional[bool] = False,
1017
+ use_cache: Optional[bool] = False,
1018
+ **kwargs,
1019
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
1020
+ """
1021
+ Args:
1022
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
1023
+ attention_mask (`torch.FloatTensor`, *optional*):
1024
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
1025
+ query_sequence_length, key_sequence_length)` if default attention is used.
1026
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1027
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
1028
+ config.n_positions - 1]`.
1029
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*):
1030
+ cached past key and value projection states
1031
+ output_attentions (`bool`, *optional*):
1032
+ Whether to return the attentions tensors of all attention layers. See `attentions` under
1033
+ returned tensors for more detail.
1034
+ output_router_logits (`bool`, *optional*):
1035
+ Whether or not to return the logits of all the routers. They are useful for computing the router loss,
1036
+ and should not be returned during inference.
1037
+ use_cache (`bool`, *optional*):
1038
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
1039
+ (see `past_key_values`).
1040
+ """
1041
+ if "padding_mask" in kwargs:
1042
+ warnings.warn(
1043
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
1044
+ )
1045
+ residual = hidden_states
1046
+
1047
+ hidden_states = self.input_layernorm(hidden_states)
1048
+
1049
+ # Self Attention
1050
+ hidden_states, self_attn_weights, present_key_value = self.attention(
1051
+ hidden_states=hidden_states,
1052
+ attention_mask=attention_mask,
1053
+ position_ids=position_ids,
1054
+ past_key_value=past_key_value,
1055
+ output_attentions=output_attentions,
1056
+ use_cache=use_cache,
1057
+ )
1058
+ hidden_states = residual + hidden_states
1059
+
1060
+ # Fully Connected
1061
+ residual = hidden_states
1062
+ hidden_states = self.post_attention_layernorm(hidden_states)
1063
+ hidden_states = self.mlp(hidden_states)
1064
+ if isinstance(hidden_states, tuple):
1065
+ hidden_states, router_logits = hidden_states
1066
+ else:
1067
+ router_logits = None
1068
+ hidden_states = residual + hidden_states
1069
+
1070
+ outputs = (hidden_states,)
1071
+
1072
+ if output_attentions:
1073
+ outputs += (self_attn_weights,)
1074
+
1075
+ if use_cache:
1076
+ outputs += (present_key_value,)
1077
+
1078
+ if output_router_logits:
1079
+ outputs += (router_logits,)
1080
+
1081
+ return outputs
1082
+
1083
+
1084
+ BAILINGMOEV2_START_DOCSTRING = r"""
1085
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1086
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
1087
+ etc.)
1088
+
1089
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1090
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
1091
+ and behavior.
1092
+
1093
+ Parameters:
1094
+ config ([`BailingMoeV2Config`]):
1095
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
1096
+ load the weights associated with the model, only the configuration. Check out the
1097
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1098
+ """
1099
+
1100
+
1101
+ @add_start_docstrings(
1102
+ "The bare BailingMoeV2 Model outputting raw hidden-states without any specific head on top.",
1103
+ BAILINGMOEV2_START_DOCSTRING,
1104
+ )
1105
+ class BailingMoeV2PreTrainedModel(PreTrainedModel):
1106
+ config_class = BailingMoeV2Config
1107
+ base_model_prefix = "model"
1108
+ supports_gradient_checkpointing = True
1109
+ _no_split_modules = ["BailingMoeV2DecoderLayer"]
1110
+ _skip_keys_device_placement = "past_key_values"
1111
+ _supports_flash_attn_2 = True
1112
+ _supports_sdpa = True
1113
+ _supports_cache_class = True
1114
+
1115
+ def _init_weights(self, module):
1116
+ std = self.config.initializer_range
1117
+ if isinstance(module, nn.Linear):
1118
+ module.weight.data.normal_(mean=0.0, std=std)
1119
+ if module.bias is not None:
1120
+ module.bias.data.zero_()
1121
+ elif isinstance(module, nn.Embedding):
1122
+ module.weight.data.normal_(mean=0.0, std=std)
1123
+ if module.padding_idx is not None:
1124
+ module.weight.data[module.padding_idx].zero_()
1125
+
1126
+
1127
+ BAILINGMOEV2_INPUTS_DOCSTRING = r"""
1128
+ Args:
1129
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1130
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1131
+ it.
1132
+
1133
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1134
+ [`PreTrainedTokenizer.__call__`] for details.
1135
+
1136
+ [What are input IDs?](../glossary#input-ids)
1137
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1138
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1139
+
1140
+ - 1 for tokens that are **not masked**,
1141
+ - 0 for tokens that are **masked**.
1142
+
1143
+ [What are attention masks?](../glossary#attention-mask)
1144
+
1145
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1146
+ [`PreTrainedTokenizer.__call__`] for details.
1147
+
1148
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
1149
+ `past_key_values`).
1150
+
1151
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
1152
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
1153
+ information on the default strategy.
1154
+
1155
+ - 1 indicates the head is **not masked**,
1156
+ - 0 indicates the head is **masked**.
1157
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1158
+ Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0,
1159
+ config.n_positions - 1]`.
1160
+
1161
+ [What are position IDs?](../glossary#position-ids)
1162
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
1163
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
1164
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
1165
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
1166
+
1167
+ Two formats are allowed:
1168
+ - a [`~cache_utils.Cache`] instance;
1169
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
1170
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
1171
+ cache format.
1172
+
1173
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
1174
+ legacy cache format will be returned.
1175
+
1176
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
1177
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
1178
+ of shape `(batch_size, sequence_length)`.
1179
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1180
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1181
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1182
+ model's internal embedding lookup matrix.
1183
+ use_cache (`bool`, *optional*):
1184
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1185
+ `past_key_values`).
1186
+ output_attentions (`bool`, *optional*):
1187
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1188
+ tensors for more detail.
1189
+ output_hidden_states (`bool`, *optional*):
1190
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1191
+ more detail.
1192
+ return_dict (`bool`, *optional*):
1193
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1194
+ """
1195
+
1196
+
1197
+ @add_start_docstrings(
1198
+ "The bare BailingMoeV2 Model outputting raw hidden-states without any specific head on top.",
1199
+ BAILINGMOEV2_START_DOCSTRING,
1200
+ )
1201
+ class BailingMoeV2Model(BailingMoeV2PreTrainedModel):
1202
+ """
1203
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`BailingMoeV2DecoderLayer`]
1204
+
1205
+ Args:
1206
+ config: BailingMoeV2Config
1207
+ """
1208
+
1209
+ def __init__(self, config: BailingMoeV2Config):
1210
+ super().__init__(config)
1211
+ self.padding_idx = config.pad_token_id
1212
+ self.vocab_size = config.vocab_size
1213
+
1214
+ self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1215
+ self.layers = nn.ModuleList(
1216
+ [BailingMoeV2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1217
+ )
1218
+ self._use_sdpa = config._attn_implementation == "sdpa"
1219
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
1220
+ self.norm = BailingMoeV2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1221
+
1222
+ self.gradient_checkpointing = False
1223
+ # Initialize weights and apply final processing
1224
+ self.post_init()
1225
+
1226
+ def get_input_embeddings(self):
1227
+ return self.word_embeddings
1228
+
1229
+ def set_input_embeddings(self, value):
1230
+ self.word_embeddings = value
1231
+
1232
+ @add_start_docstrings_to_model_forward(BAILINGMOEV2_INPUTS_DOCSTRING)
1233
+ def forward(
1234
+ self,
1235
+ input_ids: torch.LongTensor = None,
1236
+ attention_mask: Optional[torch.Tensor] = None,
1237
+ position_ids: Optional[torch.LongTensor] = None,
1238
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1239
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1240
+ use_cache: Optional[bool] = None,
1241
+ output_attentions: Optional[bool] = None,
1242
+ output_hidden_states: Optional[bool] = None,
1243
+ output_router_logits: Optional[bool] = None,
1244
+ return_dict: Optional[bool] = None,
1245
+ **kwargs,
1246
+ ) -> Union[Tuple, MoeModelOutputWithPast]:
1247
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1248
+ output_hidden_states = (
1249
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1250
+ )
1251
+ output_router_logits = (
1252
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1253
+ )
1254
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1255
+
1256
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1257
+
1258
+ # retrieve input_ids and inputs_embeds
1259
+ if input_ids is not None and inputs_embeds is not None:
1260
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1261
+ elif input_ids is not None:
1262
+ batch_size, seq_length = input_ids.shape[:2]
1263
+ elif inputs_embeds is not None:
1264
+ batch_size, seq_length = inputs_embeds.shape[:2]
1265
+ else:
1266
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1267
+
1268
+ if self.gradient_checkpointing and self.training:
1269
+ if use_cache:
1270
+ logger.warning_once(
1271
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
1272
+ )
1273
+ use_cache = False
1274
+
1275
+ past_key_values_length = 0
1276
+ if use_cache:
1277
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1278
+ if use_legacy_cache:
1279
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1280
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
1281
+
1282
+ if position_ids is None:
1283
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1284
+ position_ids = torch.arange(
1285
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1286
+ )
1287
+ position_ids = position_ids.unsqueeze(0)
1288
+
1289
+ if inputs_embeds is None:
1290
+ inputs_embeds = self.word_embeddings(input_ids)
1291
+
1292
+ if self._use_flash_attention_2:
1293
+ # 2d mask is passed through the layers
1294
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1295
+ elif self._use_sdpa and not output_attentions:
1296
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1297
+ # the manual implementation that requires a 4D causal mask in all cases.
1298
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1299
+ attention_mask,
1300
+ (batch_size, seq_length),
1301
+ inputs_embeds,
1302
+ past_key_values_length,
1303
+ )
1304
+ else:
1305
+ # 4d mask is passed through the layers
1306
+ attention_mask = _prepare_4d_causal_attention_mask(
1307
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1308
+ )
1309
+
1310
+ # initial hidden states are the token embeddings
1311
+ hidden_states = inputs_embeds
1312
+
1313
+ # decoder layers
1314
+ all_hidden_states = () if output_hidden_states else None
1315
+ all_self_attns = () if output_attentions else None
1316
+ all_router_logits = () if output_router_logits else None
1317
+ next_decoder_cache = None
1318
+
1319
+ for decoder_layer in self.layers:
1320
+ if output_hidden_states:
1321
+ all_hidden_states += (hidden_states,)
1322
+
1323
+ if self.gradient_checkpointing and self.training:
1324
+ layer_outputs = self._gradient_checkpointing_func(
1325
+ decoder_layer.__call__,
1326
+ hidden_states,
1327
+ attention_mask,
1328
+ position_ids,
1329
+ past_key_values,
1330
+ output_attentions,
1331
+ output_router_logits,
1332
+ use_cache,
1333
+ )
1334
+ else:
1335
+ layer_outputs = decoder_layer(
1336
+ hidden_states,
1337
+ attention_mask=attention_mask,
1338
+ position_ids=position_ids,
1339
+ past_key_value=past_key_values,
1340
+ output_attentions=output_attentions,
1341
+ output_router_logits=output_router_logits,
1342
+ use_cache=use_cache,
1343
+ )
1344
+ hidden_states = layer_outputs[0]
1345
+
1346
+ if use_cache:
1347
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1348
+
1349
+ if output_attentions:
1350
+ all_self_attns += (layer_outputs[1],)
1351
+
1352
+ if output_router_logits and layer_outputs[-1] is not None:
1353
+ all_router_logits += (layer_outputs[-1],)
1354
+
1355
+ hidden_states = self.norm(hidden_states)
1356
+
1357
+ # add hidden states from the last decoder layer
1358
+ if output_hidden_states:
1359
+ all_hidden_states += (hidden_states,)
1360
+
1361
+ next_cache = None
1362
+ if use_cache:
1363
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1364
+ if not return_dict:
1365
+ return tuple(
1366
+ v
1367
+ for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_router_logits]
1368
+ if v is not None
1369
+ )
1370
+ return MoeModelOutputWithPast(
1371
+ last_hidden_state=hidden_states,
1372
+ past_key_values=next_cache,
1373
+ hidden_states=all_hidden_states,
1374
+ attentions=all_self_attns,
1375
+ router_logits=all_router_logits,
1376
+ )
1377
+
1378
+
1379
+ class BailingMoeV2ForCausalLM(BailingMoeV2PreTrainedModel):
1380
+ _tied_weights_keys = ["lm_head.weight"]
1381
+
1382
+ def __init__(self, config: BailingMoeV2Config):
1383
+ super().__init__(config)
1384
+ self.model = BailingMoeV2Model(config)
1385
+ self.vocab_size = config.vocab_size
1386
+ self.norm_head = config.norm_head
1387
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1388
+
1389
+ # Initialize weights and apply final processing
1390
+ self.post_init()
1391
+
1392
+ def get_input_embeddings(self):
1393
+ return self.model.word_embeddings
1394
+
1395
+ def set_input_embeddings(self, value):
1396
+ self.model.word_embeddings = value
1397
+
1398
+ def get_output_embeddings(self):
1399
+ return self.lm_head
1400
+
1401
+ def set_output_embeddings(self, new_embeddings):
1402
+ self.lm_head = new_embeddings
1403
+
1404
+ def set_decoder(self, decoder):
1405
+ self.model = decoder
1406
+
1407
+ def get_decoder(self):
1408
+ return self.model
1409
+
1410
+ def compute_logit(self, hidden_states):
1411
+ if self.norm_head:
1412
+ if self.training:
1413
+ norm_weight = (
1414
+ self.lm_head.weight / (torch.norm(self.lm_head.weight, p=2, dim=0, keepdim=True) + 1e-7).detach()
1415
+ )
1416
+ logits = F.linear(hidden_states, norm_weight, None)
1417
+ else:
1418
+ self.lm_head.weight.data = (
1419
+ self.lm_head.weight.data.float()
1420
+ / (torch.norm(self.lm_head.weight.data.float(), p=2, dim=0, keepdim=True) + 1e-7)
1421
+ ).to(hidden_states.dtype)
1422
+ logits = F.linear(hidden_states, self.lm_head.weight.data, None)
1423
+ self.norm_head = False
1424
+ else:
1425
+ logits = self.lm_head(hidden_states)
1426
+ return logits
1427
+
1428
+ @add_start_docstrings_to_model_forward(BAILINGMOEV2_INPUTS_DOCSTRING)
1429
+ @replace_return_docstrings(output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1430
+ def forward(
1431
+ self,
1432
+ input_ids: torch.LongTensor = None,
1433
+ attention_mask: Optional[torch.Tensor] = None,
1434
+ position_ids: Optional[torch.LongTensor] = None,
1435
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1436
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1437
+ labels: Optional[torch.LongTensor] = None,
1438
+ use_cache: Optional[bool] = None,
1439
+ output_attentions: Optional[bool] = None,
1440
+ output_hidden_states: Optional[bool] = None,
1441
+ output_router_logits: Optional[bool] = None,
1442
+ return_dict: Optional[bool] = None,
1443
+ **kwargs,
1444
+ ) -> Union[Tuple, MoeCausalLMOutputWithPast]:
1445
+ r"""
1446
+ Args:
1447
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1448
+ Labels for computing the causal language modeling (next-token prediction) loss. Indices should either be in
1449
+ `[0, ..., config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are
1450
+ ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
1451
+
1452
+ Returns:
1453
+
1454
+ Example:
1455
+
1456
+ ```python
1457
+ >>> from transformers import AutoTokenizer
1458
+
1459
+ >>> model = BailingMoeV2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1460
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1461
+
1462
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1463
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1464
+
1465
+ >>> # Generate
1466
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1467
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1468
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1469
+ ```"""
1470
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1471
+ output_hidden_states = (
1472
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1473
+ )
1474
+ output_router_logits = (
1475
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1476
+ )
1477
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1478
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1479
+ outputs = self.model(
1480
+ input_ids=input_ids,
1481
+ attention_mask=attention_mask,
1482
+ position_ids=position_ids,
1483
+ past_key_values=past_key_values,
1484
+ inputs_embeds=inputs_embeds,
1485
+ use_cache=use_cache,
1486
+ output_attentions=output_attentions,
1487
+ output_hidden_states=output_hidden_states,
1488
+ output_router_logits=output_router_logits,
1489
+ return_dict=return_dict,
1490
+ **kwargs,
1491
+ )
1492
+
1493
+ hidden_states = outputs[0]
1494
+
1495
+ logits = self.compute_logit(hidden_states=hidden_states)
1496
+ logits = logits.float()
1497
+
1498
+ loss = None
1499
+ aux_loss = None
1500
+
1501
+ if labels is not None:
1502
+ # Shift so that tokens < n predict n
1503
+ shift_logits = logits[..., :-1, :].contiguous()
1504
+ shift_labels = labels[..., 1:].contiguous()
1505
+ # Flatten the tokens
1506
+ loss_fct = CrossEntropyLoss()
1507
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1508
+ shift_labels = shift_labels.view(-1)
1509
+ # Enable model parallelism
1510
+ shift_labels = shift_labels.to(shift_logits.device)
1511
+ loss = loss_fct(shift_logits, shift_labels)
1512
+
1513
+ if not return_dict:
1514
+ output = (logits,) + outputs[1:]
1515
+ if output_router_logits:
1516
+ output = (aux_loss,) + output
1517
+ return (loss,) + output if loss is not None else output
1518
+
1519
+ return MoeCausalLMOutputWithPast(
1520
+ loss=loss,
1521
+ aux_loss=aux_loss,
1522
+ logits=logits,
1523
+ past_key_values=outputs.past_key_values,
1524
+ hidden_states=outputs.hidden_states,
1525
+ attentions=outputs.attentions,
1526
+ router_logits=outputs.router_logits,
1527
+ )
1528
+
1529
+ def prepare_inputs_for_generation(
1530
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, token_type_ids=None, **kwargs
1531
+ ):
1532
+ if past_key_values is not None:
1533
+ if isinstance(past_key_values, Cache):
1534
+ cache_length = past_key_values.get_seq_length()
1535
+ past_length = past_key_values.seen_tokens
1536
+ max_cache_length = (
1537
+ past_key_values.get_max_length()
1538
+ if hasattr(past_key_values, "get_max_length")
1539
+ else past_key_values.get_max_cache_shape()
1540
+ )
1541
+ else:
1542
+ cache_length = past_length = past_key_values[0][0].shape[2]
1543
+ max_cache_length = None
1544
+
1545
+ # Keep only the unprocessed tokens:
1546
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1547
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing inputs_embeds as input)
1548
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1549
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1550
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1551
+ # input_ids based on the past_length.
1552
+ elif past_length < input_ids.shape[1]:
1553
+ input_ids = input_ids[:, past_length:]
1554
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1555
+
1556
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1557
+ if (
1558
+ max_cache_length is not None
1559
+ and attention_mask is not None
1560
+ and cache_length + input_ids.shape[1] > max_cache_length
1561
+ ):
1562
+ attention_mask = attention_mask[:, -max_cache_length:]
1563
+
1564
+ position_ids = kwargs.get("position_ids", None)
1565
+ if attention_mask is not None and position_ids is None:
1566
+ # create position_ids on the fly for batch generation
1567
+ position_ids = attention_mask.long().cumsum(-1) - 1
1568
+ position_ids.masked_fill_(attention_mask == 0, 1)
1569
+ if past_key_values:
1570
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1571
+
1572
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1573
+ if inputs_embeds is not None and past_key_values is None:
1574
+ model_inputs = {"inputs_embeds": inputs_embeds}
1575
+ else:
1576
+ model_inputs = {"input_ids": input_ids}
1577
+
1578
+ model_inputs.update(
1579
+ {
1580
+ "position_ids": position_ids,
1581
+ "past_key_values": past_key_values,
1582
+ "use_cache": kwargs.get("use_cache"),
1583
+ "attention_mask": attention_mask,
1584
+ }
1585
+ )
1586
+ return model_inputs
1587
+
1588
+ @staticmethod
1589
+ def _reorder_cache(past_key_values, beam_idx):
1590
+ reordered_past = ()
1591
+ for layer_past in past_key_values:
1592
+ reordered_past += (
1593
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1594
+ )
1595
+ return reordered_past
1596
+
1597
+
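The `prepare_inputs_for_generation` method above derives `position_ids` from the attention mask with a cumulative sum, so that left-padded rows in a batch still start counting at 0 on their first real token. The following is a minimal pure-Python sketch of that arithmetic, assuming masks are plain lists of 0/1; the function name `position_ids_from_mask` and its `num_new_tokens` argument are hypothetical and do not appear in the repository.

```python
from typing import List, Optional


def position_ids_from_mask(
    attention_mask: List[List[int]],
    num_new_tokens: Optional[int] = None,
) -> List[List[int]]:
    """Mirror `mask.cumsum(-1) - 1` followed by `masked_fill_(mask == 0, 1)`.

    `num_new_tokens` (hypothetical) mimics the slice applied once a cache
    exists: only the positions of the tokens still to be processed are kept.
    """
    batch_positions = []
    for row in attention_mask:
        running = 0  # running count of real (unmasked) tokens seen so far
        ids = []
        for m in row:
            running += m
            # Real tokens get a zero-based index; padding gets the dummy value 1.
            ids.append(running - 1 if m else 1)
        if num_new_tokens is not None:
            ids = ids[-num_new_tokens:]
        batch_positions.append(ids)
    return batch_positions


# One left-padded row and one full row.
print(position_ids_from_mask([[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]))
# Decoding step with a cache: only the newest token's position is kept.
print(position_ids_from_mask([[0, 0, 1, 1, 1]], num_new_tokens=1))
```

The value written into padded slots is arbitrary, because those positions are masked out of attention anyway; the diff uses 1, and the sketch follows it.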
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "bos_token": "<|startoftext|>",
3
+ "cls_token": "[CLS]",
4
+ "eos_token": "<|endoftext|>",
5
+ "gmask_token": "[gMASK]",
6
+ "pad_token": "<|endoftext|>"
7
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "bos_token": "<|startoftext|>",
5
+ "chat_template": "{% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role><think>\n' }}{% endif %}",
6
+ "clean_up_tokenization_spaces": false,
7
+ "cls_token": "[CLS]",
8
+ "eos_token": "<|endoftext|>",
9
+ "fast_tokenizer": true,
10
+ "gmask_token": "[gMASK]",
11
+ "merges_file": null,
12
+ "model_max_length": 1000000000000000019884624838656,
13
+ "pad_token": "<|endoftext|>",
14
+ "tokenizer_class": "PreTrainedTokenizerFast",
15
+ "trust_remote_code": true,
16
+ "vocab_file": null
17
+ }
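The `chat_template` entry above is a Jinja template, which is hard to read in its one-line JSON form. As a rough illustration of the string it produces, here is a pure-Python sketch that follows the same steps: lowercase the role, map `user` to `HUMAN`, uppercase it, wrap it in `<role>` tags, and optionally append the assistant prompt. The function name `render_chat` is hypothetical; in practice the tokenizer's `apply_chat_template` method renders the real template.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Hand-written approximation of the Jinja chat template (illustration only)."""
    parts = []
    for message in messages:
        role = message["role"].lower()
        if role == "user":
            role = "HUMAN"  # the template renames the user role
        role = role.upper()
        # No separator is emitted between the role tag and the content,
        # or between consecutive messages.
        parts.append("<role>" + role + "</role>" + message["content"])
    if add_generation_prompt:
        # The generation prompt opens a <think> block followed by a newline.
        parts.append("<role>ASSISTANT</role><think>\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
print(repr(render_chat(conversation)))
```

Note that the generation prompt ends with `<think>` and a newline, so with this template the model begins its reply inside a reasoning block.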