nohup: ignoring input
Begin main_assign: Llama-2-7b-hf self_attn alpha
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.03s/it]
Once upon a time, I was a new mum, with a newborn baby. I was also a full-time teacher and doing a part-time Master's degree. I was tired and stressed. I had no time to do anything that I enjoyed. And then I was given a gift. A gift of a book that would change everything.
I was given the book _Babywise_ , by Gary Ezzo and Robert Bucknam. I was not sure what to expect. I was
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
config:
LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "vocab_size": 32000
}
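Note on the dump above: the shapes printed for the modules, together with the config values, are enough to work out how many parameters sit in the self_attn projections that the following "Processing layer" lines operate on. The sketch below is plain arithmetic on those printed numbers; the variable names are illustrative and not taken from the script that produced this log. It assumes LlamaRMSNorm holds a single weight vector and LlamaRotaryEmbedding holds no trainable parameters.

```python
# Shapes copied from the module dump and config printed above.
hidden_size = 4096
intermediate_size = 11008
vocab_size = 32000
num_hidden_layers = 32

# self_attn: q_proj, k_proj, v_proj, o_proj are each 4096 x 4096, bias=False.
attn_params = 4 * hidden_size * hidden_size
# mlp: gate_proj and up_proj map 4096 -> 11008, down_proj maps 11008 -> 4096.
mlp_params = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per decoder layer (assumed: weight only, no bias).
norm_params = 2 * hidden_size
layer_params = attn_params + mlp_params + norm_params

embed_params = vocab_size * hidden_size    # embed_tokens
lm_head_params = vocab_size * hidden_size  # lm_head (tie_word_embeddings is false)
final_norm_params = hidden_size            # model.norm

total_params = (
    embed_params
    + num_hidden_layers * layer_params
    + final_norm_params
    + lm_head_params
)
# Fraction of all parameters held by the self_attn projections.
attn_share = num_hidden_layers * attn_params / total_params

print(f"self_attn params per layer: {attn_params:,}")
print(f"total params: {total_params:,}")
print(f"share held by self_attn projections: {attn_share:.1%}")
```

Under these assumptions, the "self_attn" subset covers a bit under a third of the model's weights; the MLP projections, which this log does not process, hold most of the rest.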
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---1.5555332899093628
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---2.2105441093444824
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---2.6319708824157715
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---2.659501791000366
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---2.5770697593688965
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---2.5436081886291504
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---2.4908900260925293
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---2.5257110595703125
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---2.3653383255004883
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---2.5174360275268555
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---2.265111207962036
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---2.1847732067108154
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---2.4449100494384766
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---2.679959774017334
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---2.4503092765808105
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---2.7230710983276367
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---3.074552536010742
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---3.4709739685058594
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---3.67897629737854
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---3.278068780899048
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---3.6138486862182617
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---3.5603649616241455
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---3.9758076667785645
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---4.087326526641846
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---3.739630699157715
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---4.076397895812988
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---3.5009336471557617
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---4.056451320648193
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---3.726351737976074
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---3.844115972518921
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---4.4837751388549805
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---3.275714874267578
metric_name alpha: [30, 23, 25, 27, 22, 29, 24, 28, 18, 20, 21, 26, 17, 19, 31, 16, 15, 13, 3, 2, 4, 5, 7, 9, 6, 14, 12, 8, 10, 1, 11, 0]
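Note on the line above: the list printed after "metric_name alpha:" is consistent with the layer indices sorted by their logged alpha value, largest first. The sketch below reproduces that ordering from the 32 values in this log, rounded to four decimals. It is only a reconstruction from the output; the script's own sorting code is not shown in the log, and the variable names here are hypothetical.

```python
# Per-layer alpha values copied from the log above (rounded to 4 decimals),
# indexed by layer number 0..31.
alpha = [
    1.5555, 2.2105, 2.6320, 2.6595, 2.5771, 2.5436, 2.4909, 2.5257,
    2.3653, 2.5174, 2.2651, 2.1848, 2.4449, 2.6800, 2.4503, 2.7231,
    3.0746, 3.4710, 3.6790, 3.2781, 3.6138, 3.5604, 3.9758, 4.0873,
    3.7396, 4.0764, 3.5009, 4.0565, 3.7264, 3.8441, 4.4838, 3.2757,
]

# Layer indices ordered from the largest metric value to the smallest.
ranking = sorted(range(len(alpha)), key=lambda layer: alpha[layer], reverse=True)

print(ranking)
```

The same descending sort applied to the alpha_hat values logged in the next run also yields the list printed for that metric.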
Begin main_assign: Llama-2-7b-hf self_attn alpha_hat
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.02s/it]
Once upon a time, there was a boy who lived in a house with his parents, and there was a girl who lived in a house with her parents. There were other children, too, but they were less important, because they lived in other houses.
This boy had a sister, and the sister was very beautiful. She was so beautiful that the boy’s parents decided to give her away in marriage. The sister was quite upset about this, but her parents told her that she had no choice.
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---6.557916641235352
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---8.015230178833008
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---10.61414909362793
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---10.561344146728516
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---10.057601928710938
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---9.833120346069336
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---9.45051097869873
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---9.582311630249023
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---9.064985275268555
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---9.556177139282227
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---8.45679759979248
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---8.441352844238281
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---9.276637077331543
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---10.002967834472656
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---8.847436904907227
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---9.9490327835083
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---11.152729988098145
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---12.680035591125488
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---13.309869766235352
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---12.054608345031738
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---13.724580764770508
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---13.702856063842773
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---15.829018592834473
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---15.232747077941895
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---14.636650085449219
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---15.008004188537598
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---14.163816452026367
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---15.760412216186523
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---14.682222366333008
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---15.929686546325684
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---17.405780792236328
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---14.380212783813477
metric_name alpha_hat: [30, 29, 22, 27, 23, 25, 28, 24, 31, 26, 20, 21, 18, 17, 19, 16, 2, 3, 4, 13, 15, 5, 7, 9, 6, 12, 8, 14, 10, 11, 1, 0]
Begin main_assign: Llama-2-7b-hf self_attn stable_rank
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 13.00s/it]
Once upon a time, in a land far away, there was a castle. Everyone who lived in the castle was very wealthy. But the castle was haunted! The king and queen were very scared, but they kept the castle because they wanted the money.
One day, the king and queen had a baby. They called him Prince Charming. Prince Charming was very happy. He had everything he could ever want. He was so happy that he could fly!
One day,
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(55.2111, device='cuda:0')
spectral_norm tensor(21.8310, device='cuda:0')
frobenius_norm tensor(61.8584, device='cuda:0')
spectral_norm tensor(18.5959, device='cuda:0')
frobenius_norm tensor(45.2757, device='cuda:0')
spectral_norm tensor(4.0318, device='cuda:0')
frobenius_norm tensor(29.4097, device='cuda:0')
spectral_norm tensor(4.3865, device='cuda:0')
alpha value of layer 0 ---47.129005432128906
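Note on the numbers above: the logged alpha for layer 0 is consistent with the mean stable rank of the four attention projections, where stable rank is (frobenius_norm / spectral_norm)^2. The sketch below recomputes it from the rounded values printed in the log. The function names are hypothetical, and the pairing of each norm pair with q_proj, k_proj, v_proj, o_proj (in subset order) is an assumption, since the log does not label them.

```python
# (frobenius_norm, spectral_norm) pairs copied from the layer 0 log lines.
# Assumption: they follow the subset order q_proj, k_proj, v_proj, o_proj.
layer0_norms = [
    (55.2111, 21.8310),
    (61.8584, 18.5959),
    (45.2757, 4.0318),
    (29.4097, 4.3865),
]


def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank of a matrix: ||W||_F^2 / ||W||_2^2."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_alpha(norm_pairs):
    """Mean stable rank over the matrices of one layer (hypothetical helper)."""
    ranks = [stable_rank(f, s) for f, s in norm_pairs]
    return sum(ranks) / len(ranks)


alpha0 = layer_alpha(layer0_norms)
# Should land close to the logged 47.129; small differences come from
# the norms being rounded to four decimals in the log.
print(f"layer 0 alpha (recomputed): {alpha0:.2f}")
```

The same arithmetic can be applied to any other layer in this run by swapping in its four norm pairs.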
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(108.3667, device='cuda:0')
spectral_norm tensor(18.4210, device='cuda:0')
frobenius_norm tensor(108.1698, device='cuda:0')
spectral_norm tensor(20.1468, device='cuda:0')
frobenius_norm tensor(41.1449, device='cuda:0')
spectral_norm tensor(3.2622, device='cuda:0')
frobenius_norm tensor(33.7843, device='cuda:0')
spectral_norm tensor(3.8698, device='cuda:0')
alpha value of layer 1 ---74.68388366699219
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(109.0373, device='cuda:0')
spectral_norm tensor(16.2222, device='cuda:0')
frobenius_norm tensor(114.6870, device='cuda:0')
spectral_norm tensor(19.2055, device='cuda:0')
frobenius_norm tensor(58.7294, device='cuda:0')
spectral_norm tensor(3.5355, device='cuda:0')
frobenius_norm tensor(56.9624, device='cuda:0')
spectral_norm tensor(6.0202, device='cuda:0')
alpha value of layer 2 ---111.57455444335938
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(103.1725, device='cuda:0')
spectral_norm tensor(13.4197, device='cuda:0')
frobenius_norm tensor(107.3360, device='cuda:0')
spectral_norm tensor(15.3238, device='cuda:0')
frobenius_norm tensor(55.7174, device='cuda:0')
spectral_norm tensor(3.0726, device='cuda:0')
frobenius_norm tensor(54.5137, device='cuda:0')
spectral_norm tensor(6.3542, device='cuda:0')
alpha value of layer 3 ---127.65095520019531
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(107.5970, device='cuda:0')
spectral_norm tensor(13.7911, device='cuda:0')
frobenius_norm tensor(109.8770, device='cuda:0')
spectral_norm tensor(15.7544, device='cuda:0')
frobenius_norm tensor(58.8042, device='cuda:0')
spectral_norm tensor(3.1027, device='cuda:0')
frobenius_norm tensor(57.5345, device='cuda:0')
spectral_norm tensor(5.8239, device='cuda:0')
alpha value of layer 4 ---141.5764617919922
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(108.1569, device='cuda:0')
spectral_norm tensor(13.7375, device='cuda:0')
frobenius_norm tensor(112.6002, device='cuda:0')
spectral_norm tensor(16.5425, device='cuda:0')
frobenius_norm tensor(60.3288, device='cuda:0')
spectral_norm tensor(2.9971, device='cuda:0')
frobenius_norm tensor(59.0258, device='cuda:0')
spectral_norm tensor(5.6560, device='cuda:0')
alpha value of layer 5 ---155.60284423828125
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(102.0697, device='cuda:0')
spectral_norm tensor(12.2985, device='cuda:0')
frobenius_norm tensor(104.0983, device='cuda:0')
spectral_norm tensor(14.5729, device='cuda:0')
frobenius_norm tensor(55.9676, device='cuda:0')
spectral_norm tensor(3.0027, device='cuda:0')
frobenius_norm tensor(55.1871, device='cuda:0')
spectral_norm tensor(5.7670, device='cuda:0')
alpha value of layer 6 ---139.72598266601562
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(101.8819, device='cuda:0')
spectral_norm tensor(11.7685, device='cuda:0')
frobenius_norm tensor(102.3675, device='cuda:0')
spectral_norm tensor(13.6839, device='cuda:0')
frobenius_norm tensor(56.6824, device='cuda:0')
spectral_norm tensor(3.1391, device='cuda:0')
frobenius_norm tensor(55.6199, device='cuda:0')
spectral_norm tensor(5.4573, device='cuda:0')
alpha value of layer 7 ---140.21083068847656
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(102.8848, device='cuda:0')
spectral_norm tensor(11.9707, device='cuda:0')
frobenius_norm tensor(103.4811, device='cuda:0')
spectral_norm tensor(14.2754, device='cuda:0')
frobenius_norm tensor(58.2330, device='cuda:0')
spectral_norm tensor(3.3746, device='cuda:0')
frobenius_norm tensor(57.2962, device='cuda:0')
spectral_norm tensor(4.9391, device='cuda:0')
alpha value of layer 8 ---139.6942138671875
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(103.0146, device='cuda:0')
spectral_norm tensor(12.2318, device='cuda:0')
frobenius_norm tensor(105.5969, device='cuda:0')
spectral_norm tensor(14.2079, device='cuda:0')
frobenius_norm tensor(59.3876, device='cuda:0')
spectral_norm tensor(3.2388, device='cuda:0')
frobenius_norm tensor(58.5812, device='cuda:0')
spectral_norm tensor(5.1232, device='cuda:0')
alpha value of layer 9 ---148.2813262939453
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(102.8745, device='cuda:0')
spectral_norm tensor(12.0922, device='cuda:0')
frobenius_norm tensor(106.0223, device='cuda:0')
spectral_norm tensor(14.3113, device='cuda:0')
frobenius_norm tensor(58.7986, device='cuda:0')
spectral_norm tensor(3.2255, device='cuda:0')
frobenius_norm tensor(58.3338, device='cuda:0')
spectral_norm tensor(4.3048, device='cuda:0')
alpha value of layer 10 ---160.80075073242188
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(97.7702, device='cuda:0')
spectral_norm tensor(11.1815, device='cuda:0')
frobenius_norm tensor(97.4910, device='cuda:0')
spectral_norm tensor(12.9324, device='cuda:0')
frobenius_norm tensor(61.3144, device='cuda:0')
spectral_norm tensor(3.4012, device='cuda:0')
frobenius_norm tensor(60.7354, device='cuda:0')
spectral_norm tensor(5.6488, device='cuda:0')
alpha value of layer 11 ---143.46920776367188
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(99.7752, device='cuda:0')
spectral_norm tensor(11.8016, device='cuda:0')
frobenius_norm tensor(102.8686, device='cuda:0')
spectral_norm tensor(13.5998, device='cuda:0')
frobenius_norm tensor(60.5482, device='cuda:0')
spectral_norm tensor(3.2452, device='cuda:0')
frobenius_norm tensor(60.0323, device='cuda:0')
spectral_norm tensor(4.9509, device='cuda:0')
alpha value of layer 12 ---155.95849609375
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(98.8230, device='cuda:0')
spectral_norm tensor(12.0874, device='cuda:0')
frobenius_norm tensor(100.6162, device='cuda:0')
spectral_norm tensor(13.8355, device='cuda:0')
frobenius_norm tensor(62.7593, device='cuda:0')
spectral_norm tensor(3.1144, device='cuda:0')
frobenius_norm tensor(62.1430, device='cuda:0')
spectral_norm tensor(5.1164, device='cuda:0')
alpha value of layer 13 ---168.32928466796875
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(98.5708, device='cuda:0')
spectral_norm tensor(11.6312, device='cuda:0')
frobenius_norm tensor(100.4783, device='cuda:0')
spectral_norm tensor(13.6670, device='cuda:0')
frobenius_norm tensor(61.6071, device='cuda:0')
spectral_norm tensor(2.7584, device='cuda:0')
frobenius_norm tensor(60.9149, device='cuda:0')
spectral_norm tensor(4.5464, device='cuda:0')
alpha value of layer 14 ---201.04998779296875
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(97.3649, device='cuda:0')
spectral_norm tensor(12.7196, device='cuda:0')
frobenius_norm tensor(100.7325, device='cuda:0')
spectral_norm tensor(14.3247, device='cuda:0')
frobenius_norm tensor(64.0580, device='cuda:0')
spectral_norm tensor(3.0007, device='cuda:0')
frobenius_norm tensor(63.2220, device='cuda:0')
spectral_norm tensor(4.5882, device='cuda:0')
alpha value of layer 15 ---188.4088134765625
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(96.6512, device='cuda:0')
spectral_norm tensor(12.9655, device='cuda:0')
frobenius_norm tensor(99.2465, device='cuda:0')
spectral_norm tensor(14.6775, device='cuda:0')
frobenius_norm tensor(66.7600, device='cuda:0')
spectral_norm tensor(2.8352, device='cuda:0')
frobenius_norm tensor(66.0842, device='cuda:0')
spectral_norm tensor(5.0123, device='cuda:0')
alpha value of layer 16 ---207.397216796875
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(95.8736, device='cuda:0')
spectral_norm tensor(12.8995, device='cuda:0')
frobenius_norm tensor(98.0118, device='cuda:0')
spectral_norm tensor(14.3731, device='cuda:0')
frobenius_norm tensor(66.5281, device='cuda:0')
spectral_norm tensor(2.8901, device='cuda:0')
frobenius_norm tensor(66.1344, device='cuda:0')
spectral_norm tensor(5.4531, device='cuda:0')
alpha value of layer 17 ---194.68035888671875
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(93.7199, device='cuda:0')
spectral_norm tensor(12.9969, device='cuda:0')
frobenius_norm tensor(95.7889, device='cuda:0')
spectral_norm tensor(14.0707, device='cuda:0')
frobenius_norm tensor(69.6604, device='cuda:0')
spectral_norm tensor(2.8885, device='cuda:0')
frobenius_norm tensor(68.6924, device='cuda:0')
spectral_norm tensor(5.4377, device='cuda:0')
alpha value of layer 18 ---209.88125610351562
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(92.5769, device='cuda:0')
spectral_norm tensor(12.7937, device='cuda:0')
frobenius_norm tensor(94.3567, device='cuda:0')
spectral_norm tensor(14.2169, device='cuda:0')
frobenius_norm tensor(70.2688, device='cuda:0')
spectral_norm tensor(2.7430, device='cuda:0')
frobenius_norm tensor(69.5499, device='cuda:0')
spectral_norm tensor(5.3632, device='cuda:0')
alpha value of layer 19 ---230.21255493164062
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(93.1738, device='cuda:0')
spectral_norm tensor(13.3309, device='cuda:0')
frobenius_norm tensor(94.8772, device='cuda:0')
spectral_norm tensor(14.2162, device='cuda:0')
frobenius_norm tensor(71.3496, device='cuda:0')
spectral_norm tensor(2.7475, device='cuda:0')
frobenius_norm tensor(70.9048, device='cuda:0')
spectral_norm tensor(6.5838, device='cuda:0')
alpha value of layer 20 ---220.93441772460938
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(91.0892, device='cuda:0')
spectral_norm tensor(12.7824, device='cuda:0')
frobenius_norm tensor(92.0887, device='cuda:0')
spectral_norm tensor(13.3415, device='cuda:0')
frobenius_norm tensor(73.5470, device='cuda:0')
spectral_norm tensor(3.2134, device='cuda:0')
frobenius_norm tensor(72.4853, device='cuda:0')
spectral_norm tensor(5.6293, device='cuda:0')
alpha value of layer 21 ---197.01531982421875
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(93.0198, device='cuda:0')
spectral_norm tensor(12.9645, device='cuda:0')
frobenius_norm tensor(94.1975, device='cuda:0')
spectral_norm tensor(13.3901, device='cuda:0')
frobenius_norm tensor(73.8086, device='cuda:0')
spectral_norm tensor(2.7246, device='cuda:0')
frobenius_norm tensor(72.6579, device='cuda:0')
spectral_norm tensor(7.5949, device='cuda:0')
alpha value of layer 22 ---231.5908660888672
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(92.3679, device='cuda:0')
spectral_norm tensor(12.5056, device='cuda:0')
frobenius_norm tensor(93.0806, device='cuda:0')
spectral_norm tensor(12.8793, device='cuda:0')
frobenius_norm tensor(77.2716, device='cuda:0')
spectral_norm tensor(3.0203, device='cuda:0')
frobenius_norm tensor(76.3245, device='cuda:0')
spectral_norm tensor(5.5035, device='cuda:0')
alpha value of layer 23 ---238.41143798828125
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(89.8033, device='cuda:0')
spectral_norm tensor(12.3866, device='cuda:0')
frobenius_norm tensor(90.2768, device='cuda:0')
spectral_norm tensor(13.1727, device='cuda:0')
frobenius_norm tensor(76.5770, device='cuda:0')
spectral_norm tensor(3.2186, device='cuda:0')
frobenius_norm tensor(75.2567, device='cuda:0')
spectral_norm tensor(6.6725, device='cuda:0')
alpha value of layer 24 ---198.1973419189453
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(90.5618, device='cuda:0')
spectral_norm tensor(11.8575, device='cuda:0')
frobenius_norm tensor(90.8598, device='cuda:0')
spectral_norm tensor(12.2949, device='cuda:0')
frobenius_norm tensor(79.4490, device='cuda:0')
spectral_norm tensor(3.1827, device='cuda:0')
frobenius_norm tensor(78.3357, device='cuda:0')
spectral_norm tensor(4.8220, device='cuda:0')
alpha value of layer 25 ---249.99850463867188
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(89.1381, device='cuda:0')
spectral_norm tensor(12.8790, device='cuda:0')
frobenius_norm tensor(89.8395, device='cuda:0')
spectral_norm tensor(13.2880, device='cuda:0')
frobenius_norm tensor(80.7266, device='cuda:0')
spectral_norm tensor(3.6409, device='cuda:0')
frobenius_norm tensor(80.1935, device='cuda:0')
spectral_norm tensor(6.8510, device='cuda:0')
alpha value of layer 26 ---180.56016540527344
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(92.4526, device='cuda:0')
spectral_norm tensor(13.1710, device='cuda:0')
frobenius_norm tensor(93.3447, device='cuda:0')
spectral_norm tensor(14.0394, device='cuda:0')
frobenius_norm tensor(80.8654, device='cuda:0')
spectral_norm tensor(3.3240, device='cuda:0')
frobenius_norm tensor(80.7260, device='cuda:0')
spectral_norm tensor(5.6484, device='cuda:0')
alpha value of layer 27 ---222.39707946777344
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(89.8511, device='cuda:0')
spectral_norm tensor(12.7679, device='cuda:0')
frobenius_norm tensor(90.9173, device='cuda:0')
spectral_norm tensor(13.5210, device='cuda:0')
frobenius_norm tensor(83.4357, device='cuda:0')
spectral_norm tensor(3.6919, device='cuda:0')
frobenius_norm tensor(83.0720, device='cuda:0')
spectral_norm tensor(6.0824, device='cuda:0')
alpha value of layer 28 ---198.00421142578125
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(87.5255, device='cuda:0')
spectral_norm tensor(13.2835, device='cuda:0')
frobenius_norm tensor(88.2859, device='cuda:0')
spectral_norm tensor(14.1804, device='cuda:0')
frobenius_norm tensor(83.7624, device='cuda:0')
spectral_norm tensor(4.7727, device='cuda:0')
frobenius_norm tensor(84.0506, device='cuda:0')
spectral_norm tensor(6.8564, device='cuda:0')
alpha value of layer 29 ---135.1178436279297
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(88.0865, device='cuda:0')
spectral_norm tensor(13.0963, device='cuda:0')
frobenius_norm tensor(89.2752, device='cuda:0')
spectral_norm tensor(13.7996, device='cuda:0')
frobenius_norm tensor(85.7229, device='cuda:0')
spectral_norm tensor(3.7462, device='cuda:0')
frobenius_norm tensor(86.1523, device='cuda:0')
spectral_norm tensor(6.8602, device='cuda:0')
alpha value of layer 30 ---192.10064697265625
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
frobenius_norm tensor(89.1405, device='cuda:0')
spectral_norm tensor(15.0259, device='cuda:0')
frobenius_norm tensor(92.3933, device='cuda:0')
spectral_norm tensor(16.3841, device='cuda:0')
frobenius_norm tensor(78.1290, device='cuda:0')
spectral_norm tensor(4.0098, device='cuda:0')
frobenius_norm tensor(78.9173, device='cuda:0')
spectral_norm tensor(10.5689, device='cuda:0')
alpha value of layer 31 ---125.60063171386719
metric_name stable_rank: [25, 23, 22, 19, 27, 20, 18, 16, 14, 24, 28, 21, 17, 30, 15, 26, 13, 10, 12, 5, 9, 11, 4, 7, 6, 8, 29, 3, 31, 2, 1, 0]
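Note on the ranking above: the list appears to be the layer indices sorted by alpha in descending order (largest mean stable rank first). A minimal sketch that reproduces it from the per-layer alpha values logged in this run, rounded to three decimals; the variable names are hypothetical and the script's actual sorting code is not shown in the log.

```python
# Alpha values for layers 0..31, copied from the log and rounded to 3 decimals.
alphas = [
    47.129, 74.684, 111.575, 127.651, 141.576, 155.603, 139.726, 140.211,
    139.694, 148.281, 160.801, 143.469, 155.958, 168.329, 201.050, 188.409,
    207.397, 194.680, 209.881, 230.213, 220.934, 197.015, 231.591, 238.411,
    198.197, 249.999, 180.560, 222.397, 198.004, 135.118, 192.101, 125.601,
]

# Layer indices ordered from highest alpha to lowest.
ranking = sorted(range(len(alphas)), key=lambda i: alphas[i], reverse=True)
print(ranking)
```

Under this reading, layer 25 ranks first and layer 0 ranks last.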
Begin main_assign: Llama-2-7b-hf self_attn effective_rank
Loading checkpoint shards: 100%|██████████| 2/2 [00:26<00:00, 13.01s/it]
Once upon a time, I worked in a library. I was a page. I would shelve books, check them out, and help patrons find things. It was my first job, and it was great. I met some really neat people.
At the time, I had a lot of free time, so I spent a lot of time in the library. I would sit in the fiction section and read all the books that I didn’t have time to read at home. This is how I first
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---1640.3033447265625
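Note on this run: the log prints only the per-layer alpha for effective_rank, not the intermediate quantities, so the exact computation cannot be confirmed from the output. One common definition of effective rank is the exponential of the Shannon entropy of the normalized singular values. The sketch below illustrates that definition only; it is an assumption about the metric, not the script's confirmed method, and the function name is hypothetical.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1.

    Assumed definition; the logged script may compute it differently.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # zero singular values contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


# Toy spectra (not taken from the log): a flat spectrum has effective rank
# equal to its size, a single nonzero value has effective rank 1.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
spiked = effective_rank([5.0, 0.0, 0.0])
print(flat, spiked)
```

If the per-layer alpha here is again a mean over q_proj, k_proj, v_proj and o_proj, as it appears to be in the stable_rank run, a value such as 1640 would sit well below the maximum of 4096 for a 4096 x 4096 matrix.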
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---2134.1572265625
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---2669.992919921875
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---2901.25439453125
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---2904.2001953125
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---2910.534912109375
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---2883.722900390625
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---2887.236328125
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---2899.039306640625
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---2916.92822265625
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---2859.56689453125
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---2818.8173828125
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---2905.6064453125
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---2940.74462890625
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---2900.401123046875
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---2949.82080078125
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---2976.977783203125
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---3047.2646484375
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---3096.2216796875
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---3061.852783203125
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---3062.37353515625
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---3081.3349609375
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---3106.181640625
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---3144.513427734375
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---3072.8798828125
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---3137.80224609375
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---3090.37158203125
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---3181.7998046875
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---3147.865478515625
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---3101.146484375
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---3161.5263671875
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---3049.5556640625
metric_name effective_rank: [27, 30, 28, 23, 25, 22, 29, 18, 26, 21, 24, 20, 19, 31, 17, 16, 15, 13, 9, 5, 12, 4, 3, 14, 8, 7, 6, 10, 11, 2, 1, 0]
Begin main_assign: Llama-2-7b-hf self_attn ZD
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.94s/it]
Once upon a time, there was a girl who had a heart that was so full of love that it overflowed from her chest and flowed out of her hands.
This girl was so full of love that she was like a fountain of love, and wherever she went, her love would flow out of her and touch the lives of those around her.
One day, the girl was walking through the woods when she came upon a beautiful stream. The stream was so clear and so peaceful that
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---0.09540334343910217
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---0.11126542091369629
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---0.14089055359363556
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---0.1446058303117752
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---0.14712807536125183
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---0.1478433907032013
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---0.14464625716209412
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---0.14459004998207092
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---0.14641690254211426
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---0.147793248295784
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---0.14709556102752686
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---0.14403118193149567
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---0.14700128138065338
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---0.1479380875825882
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---0.1479010283946991
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---0.1490364670753479
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---0.1480296403169632
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---0.15020982921123505
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---0.1507750302553177
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---0.14981798827648163
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---0.15018826723098755
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---0.1498291790485382
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---0.15043966472148895
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---0.151978999376297
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---0.14919137954711914
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---0.15175150334835052
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---0.1495654433965683
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---0.15338149666786194
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---0.15180033445358276
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---0.1501537710428238
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---0.15218497812747955
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---0.14980709552764893
metric_name ZD: [27, 30, 23, 28, 25, 18, 22, 17, 20, 29, 21, 19, 31, 26, 24, 15, 16, 13, 14, 5, 9, 4, 10, 12, 8, 6, 3, 7, 11, 2, 1, 0]
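Note: the ZD ranking printed above is consistent with ordering the layers by their logged "alpha value" from largest to smallest. The following is a minimal sketch of that ordering, using only the ZD values from this log. The helper name `rank_layers_desc` is hypothetical and is not taken from the script that produced this output; the real script may compute the ranking differently.

```python
# Hypothetical helper, assumed for illustration only; not the original script's code.
def rank_layers_desc(values):
    """Return layer indices ordered from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# ZD values for layers 0-31, copied from the "alpha value of layer N" lines in this run.
zd = [
    0.09540334343910217, 0.11126542091369629, 0.14089055359363556, 0.1446058303117752,
    0.14712807536125183, 0.1478433907032013, 0.14464625716209412, 0.14459004998207092,
    0.14641690254211426, 0.147793248295784, 0.14709556102752686, 0.14403118193149567,
    0.14700128138065338, 0.1479380875825882, 0.1479010283946991, 0.1490364670753479,
    0.1480296403169632, 0.15020982921123505, 0.1507750302553177, 0.14981798827648163,
    0.15018826723098755, 0.1498291790485382, 0.15043966472148895, 0.151978999376297,
    0.14919137954711914, 0.15175150334835052, 0.1495654433965683, 0.15338149666786194,
    0.15180033445358276, 0.1501537710428238, 0.15218497812747955, 0.14980709552764893,
]

ranking = rank_layers_desc(zd)
print(ranking)
```

Under this assumption, layer 27 (largest value) comes first and layer 0 (smallest value) comes last, matching the logged list.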
Begin main_assign: Llama-2-7b-hf self_attn head_diversity
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.61s/it]
Once upon a time, there was a city that was built with love. I was in that city. And I fell in love. I fell in love with the people. I fell in love with the architecture. I fell in love with the food. I fell in love with the art. I fell in love with the culture. I fell in love with the music. I fell in love with the history. I fell in love with the city. And I fell in love with the person I fell in love with
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---0.9916330575942993
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---0.9952021241188049
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---0.9966323971748352
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---0.9973293542861938
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---0.9971895217895508
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---0.9973934888839722
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---0.9974462389945984
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---0.9975071549415588
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---0.9974231719970703
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---0.9973534345626831
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---0.997123122215271
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---0.9970043897628784
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---0.9973783493041992
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---0.9974591732025146
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---0.9971306324005127
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---0.9973533153533936
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---0.9974291324615479
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---0.9976841807365417
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---0.997740626335144
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---0.9975850582122803
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---0.9973828792572021
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---0.9975684881210327
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---0.9977440237998962
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---0.9980273246765137
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---0.9974839091300964
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---0.9979180693626404
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---0.9974991083145142
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---0.9979188442230225
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---0.997989296913147
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---0.9974175691604614
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---0.9975640773773193
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---0.997219443321228
metric_name head_diversity: [23, 28, 27, 25, 22, 18, 17, 19, 21, 30, 7, 26, 24, 13, 6, 16, 8, 29, 5, 20, 12, 9, 15, 3, 31, 4, 14, 10, 11, 2, 1, 0]
Begin main_assign: Llama-2-7b-hf self_attn coherence
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.84s/it]
Once upon a time, there lived a king and a queen. They had a beautiful daughter named Cinderella. One day, the king announced that there would be a ball. The king invited all the royalty and dignitaries to the ball, including Cinderella.
Cinderella was so excited. She could not wait to see her friends and dress in her best dress. She was happy that she would finally be able to go to the ball. The only problem was that her step-mother
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 0 ---0.08510372042655945
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 1 ---0.04102545976638794
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 2 ---0.028616365045309067
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 3 ---0.02104165218770504
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 4 ---0.022063206881284714
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 5 ---0.021188031882047653
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 6 ---0.020417138934135437
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 7 ---0.019520433619618416
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 8 ---0.020254574716091156
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 9 ---0.020007748156785965
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 10 ---0.021119512617588043
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 11 ---0.020985007286071777
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 12 ---0.019723106175661087
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 13 ---0.01894117146730423
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 14 ---0.01963678002357483
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 15 ---0.01925666816532612
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 16 ---0.018222851678729057
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 17 ---0.016996942460536957
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 18 ---0.016209837049245834
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 19 ---0.017241276800632477
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 20 ---0.017154088243842125
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 21 ---0.016598742455244064
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 22 ---0.016119930893182755
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 23 ---0.015261407010257244
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 24 ---0.01685335859656334
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 25 ---0.015361565165221691
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 26 ---0.01685093343257904
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 27 ---0.015206292271614075
Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 28 ---0.01575298234820366
Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 29 ---0.01735319383442402
Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 30 ---0.016395289450883865
Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)}
alpha value of layer 31 ---0.02029731497168541
metric_name coherence: [0, 1, 2, 4, 5, 10, 3, 11, 6, 31, 8, 9, 12, 14, 7, 15, 13, 16, 29, 19, 20, 17, 24, 26, 21, 30, 18, 22, 28, 25, 23, 27]
Begin main_assign: Llama-2-7b-hf mlp alpha
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.66s/it]
Once upon a time, there was a little boy named Jack who lived in the country. Jack loved to play in the forest with his friends. He was a happy boy.
One day, he was playing with his friends when he saw a beautiful white horse. The horse was beautiful, and he wanted to ride it.
Jack ran up to the horse and asked, “Can I ride you?”
The horse said, “Yes, Jack, you can ride me.”
Jack jumped up on the horse
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 0 ---2.815411329269409
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 1 ---3.4513909816741943
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 2 ---3.7109851837158203
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 3 ---3.8032162189483643
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 4 ---4.195495128631592
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 5 ---3.856921911239624
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 6 ---3.6256275177001953
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 7 ---3.660289764404297
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 8 ---3.5230093002319336
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 9 ---3.5076420307159424
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 10 ---3.3467018604278564
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 11 ---3.291457176208496
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 12 ---3.5538055896759033
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 13 ---3.4736461639404297
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 14 ---3.715531587600708
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 15 ---3.7430496215820312
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 16 ---4.149142742156982
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 17 ---4.197119235992432
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 18 ---4.489593982696533
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 19 ---4.358992576599121
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 20 ---5.0549798011779785
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 21 ---4.814038276672363
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 22 ---4.621534824371338
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 23 ---4.299625396728516
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 24 ---4.563563346862793
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 25 ---4.570354461669922
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 26 ---4.2273688316345215
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 27 ---4.423553466796875
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 28 ---4.38798189163208
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 29 ---4.595789432525635
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 30 ---4.671386241912842
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 31 ---3.7070789337158203
metric_name alpha: [20, 21, 30, 22, 29, 25, 24, 18, 27, 28, 19, 23, 26, 17, 4, 16, 5, 3, 15, 14, 2, 31, 7, 6, 12, 8, 9, 13, 1, 10, 11, 0]
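Note: the ranking above is consistent with sorting layer indices by their logged per-layer value, largest first (layer 20 at 5.05 comes first, layer 0 at 2.82 comes last). The following is a minimal sketch for re-deriving such a ranking from the log text, assuming that descending order and the "alpha value of layer N ---V" line format seen in this log. The names `parse_layer_values` and `rank_layers` are hypothetical helpers for illustration, not functions from the script that produced this output.

```python
import re

# Matches the per-layer lines as printed in this log, e.g.
# "alpha value of layer 20 ---5.0549798011779785".
# The three dashes are a separator, not a minus sign.
LINE_RE = re.compile(r"alpha value of layer (\d+) ---([0-9.eE+-]+)")


def parse_layer_values(log_text):
    """Return {layer_index: value} for every per-layer line in log_text."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE_RE.finditer(log_text)}


def rank_layers(values):
    """Return layer indices sorted by value, largest first (assumed order)."""
    return sorted(values, key=values.get, reverse=True)


# Four lines copied from the mlp alpha run above.
sample = """\
alpha value of layer 0 ---2.815411329269409
alpha value of layer 20 ---5.0549798011779785
alpha value of layer 21 ---4.814038276672363
alpha value of layer 11 ---3.291457176208496
"""

values = parse_layer_values(sample)
ranking = rank_layers(values)
print(ranking)  # [20, 21, 11, 0]
```

Running the same two helpers over the full block of 32 per-layer lines would allow the logged `metric_name` list to be checked against the values directly.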
Begin main_assign: Llama-2-7b-hf mlp alpha_hat
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.20s/it]
Once upon a time, there was a kingdom called Buzuran. A kingdom known for its beautiful scenery and its friendly people.
One day, a stranger named Gavin arrived in the kingdom. He was looking for a place to settle down. He had heard of the kingdom and its beauty, and he was interested in seeing it for himself.
When Gavin arrived, he was greeted by the king and queen. They were very impressed with Gavin and his ab
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 0 ---13.072108268737793
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 1 ---14.548068046569824
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 2 ---13.608903884887695
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 3 ---13.986700057983398
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 4 ---15.312846183776855
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 5 ---14.763805389404297
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 6 ---14.188860893249512
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 7 ---14.57461166381836
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 8 ---14.344627380371094
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 9 ---14.180407524108887
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 10 ---13.752973556518555
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 11 ---13.753717422485352
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 12 ---14.589736938476562
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 13 ---14.359493255615234
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 14 ---15.248335838317871
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 15 ---15.255033493041992
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 16 ---16.761940002441406
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 17 ---16.71654510498047
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 18 ---17.673377990722656
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 19 ---17.007017135620117
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 20 ---19.24078941345215
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 21 ---17.77168083190918
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 22 ---17.303386688232422
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 23 ---16.242385864257812
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 24 ---16.798599243164062
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 25 ---16.718074798583984
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 26 ---16.25970458984375
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 27 ---17.800804138183594
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 28 ---18.30744171142578
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 29 ---19.38349151611328
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 30 ---22.61872673034668
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 31 ---19.854324340820312
metric_name alpha_hat: [30, 31, 29, 20, 28, 27, 21, 18, 22, 19, 24, 16, 25, 17, 26, 23, 4, 15, 14, 5, 12, 7, 1, 13, 8, 6, 9, 3, 11, 10, 2, 0]
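The `metric_name` list above is consistent with sorting layer indices by their logged per-layer value in descending order (layer 30 has the largest alpha_hat, layer 0 the smallest). A minimal sketch of that ordering follows, assuming a plain descending sort; `rank_layers` is a hypothetical helper name, not the script's own function, and the input values are the first five alpha_hat entries copied (rounded) from the log.

```python
def rank_layers(scores):
    """Return layer indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# alpha_hat values for layers 0-4, rounded from the log lines above.
alpha_hat_first5 = [13.0721, 14.5481, 13.6089, 13.9867, 15.3128]

ranking = rank_layers(alpha_hat_first5)
print(ranking)  # [4, 1, 3, 2, 0]
```

Applied to all 32 logged values, the same sort reproduces the order printed in the log; how the script handles ties is not visible here.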
Begin main_assign: Llama-2-7b-hf mlp stable_rank
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.03s/it]
Once upon a time there was a little boy who wanted to be a fireman when he grew up. He got to go on a fire truck and play with the hoses and watch the firemen and had a great time.
He also wanted to be a policeman when he grew up. He got to go on a police car and get to go to jail. He got to play with the police dogs and watch the policemen and had a great time.
He also wanted to be a doctor
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(110.1283, device='cuda:0')
spectral_norm tensor(11.7725, device='cuda:0')
frobenius_norm tensor(108.2747, device='cuda:0')
spectral_norm tensor(10.5846, device='cuda:0')
frobenius_norm tensor(112.8760, device='cuda:0')
spectral_norm tensor(8.8966, device='cuda:0')
alpha value of layer 0 ---117.70869445800781
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(122.3248, device='cuda:0')
spectral_norm tensor(14.1620, device='cuda:0')
frobenius_norm tensor(115.1808, device='cuda:0')
spectral_norm tensor(7.3779, device='cuda:0')
frobenius_norm tensor(116.8547, device='cuda:0')
spectral_norm tensor(6.6577, device='cuda:0')
alpha value of layer 1 ---208.7972869873047
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(125.7614, device='cuda:0')
spectral_norm tensor(10.6197, device='cuda:0')
frobenius_norm tensor(117.2508, device='cuda:0')
spectral_norm tensor(4.5438, device='cuda:0')
frobenius_norm tensor(117.9841, device='cuda:0')
spectral_norm tensor(5.6738, device='cuda:0')
alpha value of layer 2 ---412.8428649902344
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(127.6067, device='cuda:0')
spectral_norm tensor(9.8554, device='cuda:0')
frobenius_norm tensor(118.0961, device='cuda:0')
spectral_norm tensor(4.3441, device='cuda:0')
frobenius_norm tensor(118.3421, device='cuda:0')
spectral_norm tensor(6.3601, device='cuda:0')
alpha value of layer 3 ---417.640380859375
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(130.2638, device='cuda:0')
spectral_norm tensor(10.9230, device='cuda:0')
frobenius_norm tensor(117.2583, device='cuda:0')
spectral_norm tensor(4.2948, device='cuda:0')
frobenius_norm tensor(116.9673, device='cuda:0')
spectral_norm tensor(6.2874, device='cuda:0')
alpha value of layer 4 ---411.2488708496094
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(131.0171, device='cuda:0')
spectral_norm tensor(12.3422, device='cuda:0')
frobenius_norm tensor(117.2556, device='cuda:0')
spectral_norm tensor(4.4561, device='cuda:0')
frobenius_norm tensor(117.0701, device='cuda:0')
spectral_norm tensor(6.4393, device='cuda:0')
alpha value of layer 5 ---378.5376892089844
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(133.2534, device='cuda:0')
spectral_norm tensor(13.9346, device='cuda:0')
frobenius_norm tensor(116.8471, device='cuda:0')
spectral_norm tensor(4.7412, device='cuda:0')
frobenius_norm tensor(116.4051, device='cuda:0')
spectral_norm tensor(6.2935, device='cuda:0')
alpha value of layer 6 ---346.97930908203125
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(133.1796, device='cuda:0')
spectral_norm tensor(13.7900, device='cuda:0')
frobenius_norm tensor(117.2589, device='cuda:0')
spectral_norm tensor(4.9844, device='cuda:0')
frobenius_norm tensor(116.5975, device='cuda:0')
spectral_norm tensor(6.7392, device='cuda:0')
alpha value of layer 7 ---315.34832763671875
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(131.0625, device='cuda:0')
spectral_norm tensor(12.9479, device='cuda:0')
frobenius_norm tensor(118.7164, device='cuda:0')
spectral_norm tensor(5.3742, device='cuda:0')
frobenius_norm tensor(117.8239, device='cuda:0')
spectral_norm tensor(7.0607, device='cuda:0')
alpha value of layer 8 ---289.63446044921875
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(129.7933, device='cuda:0')
spectral_norm tensor(13.0345, device='cuda:0')
frobenius_norm tensor(119.5203, device='cuda:0')
spectral_norm tensor(5.3777, device='cuda:0')
frobenius_norm tensor(118.6356, device='cuda:0')
spectral_norm tensor(7.0941, device='cuda:0')
alpha value of layer 9 ---290.9284362792969
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(129.3888, device='cuda:0')
spectral_norm tensor(13.1949, device='cuda:0')
frobenius_norm tensor(120.7769, device='cuda:0')
spectral_norm tensor(5.5907, device='cuda:0')
frobenius_norm tensor(119.6504, device='cuda:0')
spectral_norm tensor(7.0719, device='cuda:0')
alpha value of layer 10 ---283.04052734375
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(128.9186, device='cuda:0')
spectral_norm tensor(13.5576, device='cuda:0')
frobenius_norm tensor(121.8036, device='cuda:0')
spectral_norm tensor(5.6766, device='cuda:0')
frobenius_norm tensor(120.4323, device='cuda:0')
spectral_norm tensor(7.4645, device='cuda:0')
alpha value of layer 11 ---270.3800048828125
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(128.1013, device='cuda:0')
spectral_norm tensor(13.0662, device='cuda:0')
frobenius_norm tensor(122.9129, device='cuda:0')
spectral_norm tensor(6.0042, device='cuda:0')
frobenius_norm tensor(121.4419, device='cuda:0')
spectral_norm tensor(6.8182, device='cuda:0')
alpha value of layer 12 ---277.48046875
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(127.6379, device='cuda:0')
spectral_norm tensor(13.2896, device='cuda:0')
frobenius_norm tensor(124.1647, device='cuda:0')
spectral_norm tensor(6.3249, device='cuda:0')
frobenius_norm tensor(122.4098, device='cuda:0')
spectral_norm tensor(6.4258, device='cuda:0')
alpha value of layer 13 ---280.17169189453125
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(127.3744, device='cuda:0')
spectral_norm tensor(13.1519, device='cuda:0')
frobenius_norm tensor(124.2991, device='cuda:0')
spectral_norm tensor(6.4910, device='cuda:0')
frobenius_norm tensor(122.5321, device='cuda:0')
spectral_norm tensor(6.1543, device='cuda:0')
alpha value of layer 14 ---285.63787841796875
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(128.0534, device='cuda:0')
spectral_norm tensor(13.0927, device='cuda:0')
frobenius_norm tensor(124.9728, device='cuda:0')
spectral_norm tensor(6.8014, device='cuda:0')
frobenius_norm tensor(122.9694, device='cuda:0')
spectral_norm tensor(5.7687, device='cuda:0')
alpha value of layer 15 ---295.8956298828125
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(128.9326, device='cuda:0')
spectral_norm tensor(13.4859, device='cuda:0')
frobenius_norm tensor(124.8730, device='cuda:0')
spectral_norm tensor(7.0698, device='cuda:0')
frobenius_norm tensor(122.8834, device='cuda:0')
spectral_norm tensor(5.4697, device='cuda:0')
alpha value of layer 16 ---302.703857421875
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(130.1418, device='cuda:0')
spectral_norm tensor(13.9212, device='cuda:0')
frobenius_norm tensor(124.4075, device='cuda:0')
spectral_norm tensor(6.5251, device='cuda:0')
frobenius_norm tensor(122.7542, device='cuda:0')
spectral_norm tensor(5.4324, device='cuda:0')
alpha value of layer 17 ---320.5066833496094
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(131.3037, device='cuda:0')
spectral_norm tensor(13.5630, device='cuda:0')
frobenius_norm tensor(124.0418, device='cuda:0')
spectral_norm tensor(6.0638, device='cuda:0')
frobenius_norm tensor(122.6318, device='cuda:0')
spectral_norm tensor(5.3899, device='cuda:0')
alpha value of layer 18 ---343.27899169921875
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(131.7776, device='cuda:0')
spectral_norm tensor(12.6322, device='cuda:0')
frobenius_norm tensor(124.0385, device='cuda:0')
spectral_norm tensor(5.9170, device='cuda:0')
frobenius_norm tensor(122.9026, device='cuda:0')
spectral_norm tensor(5.5908, device='cuda:0')
alpha value of layer 19 ---343.8408203125
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(132.4897, device='cuda:0')
spectral_norm tensor(12.7395, device='cuda:0')
frobenius_norm tensor(123.9540, device='cuda:0')
spectral_norm tensor(6.1100, device='cuda:0')
frobenius_norm tensor(122.8897, device='cuda:0')
spectral_norm tensor(5.1633, device='cuda:0')
alpha value of layer 20 ---362.06610107421875
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(133.3879, device='cuda:0')
spectral_norm tensor(12.1366, device='cuda:0')
frobenius_norm tensor(123.7530, device='cuda:0')
spectral_norm tensor(5.6965, device='cuda:0')
frobenius_norm tensor(122.8804, device='cuda:0')
spectral_norm tensor(4.8811, device='cuda:0')
alpha value of layer 21 ---408.834228515625
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(134.1589, device='cuda:0')
spectral_norm tensor(11.9229, device='cuda:0')
frobenius_norm tensor(123.6884, device='cuda:0')
spectral_norm tensor(5.3241, device='cuda:0')
frobenius_norm tensor(122.9257, device='cuda:0')
spectral_norm tensor(5.2778, device='cuda:0')
alpha value of layer 22 ---402.9353942871094
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(134.1199, device='cuda:0')
spectral_norm tensor(10.9135, device='cuda:0')
frobenius_norm tensor(124.2959, device='cuda:0')
spectral_norm tensor(4.9033, device='cuda:0')
frobenius_norm tensor(123.6594, device='cuda:0')
spectral_norm tensor(5.8655, device='cuda:0')
alpha value of layer 23 ---412.6978454589844
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(134.4485, device='cuda:0')
spectral_norm tensor(10.8745, device='cuda:0')
frobenius_norm tensor(124.6739, device='cuda:0')
spectral_norm tensor(4.6898, device='cuda:0')
frobenius_norm tensor(124.1024, device='cuda:0')
spectral_norm tensor(5.6106, device='cuda:0')
alpha value of layer 24 ---449.614501953125
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(134.7100, device='cuda:0')
spectral_norm tensor(10.7878, device='cuda:0')
frobenius_norm tensor(125.2170, device='cuda:0')
spectral_norm tensor(4.9540, device='cuda:0')
frobenius_norm tensor(124.7208, device='cuda:0')
spectral_norm tensor(5.3156, device='cuda:0')
alpha value of layer 25 ---448.44183349609375
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(135.1252, device='cuda:0')
spectral_norm tensor(11.1821, device='cuda:0')
frobenius_norm tensor(125.6647, device='cuda:0')
spectral_norm tensor(5.8583, device='cuda:0')
frobenius_norm tensor(125.1040, device='cuda:0')
spectral_norm tensor(5.0074, device='cuda:0')
alpha value of layer 26 ---410.114013671875
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(135.2713, device='cuda:0')
spectral_norm tensor(11.5424, device='cuda:0')
frobenius_norm tensor(126.2993, device='cuda:0')
spectral_norm tensor(6.9998, device='cuda:0')
frobenius_norm tensor(125.7692, device='cuda:0')
spectral_norm tensor(5.1612, device='cuda:0')
alpha value of layer 27 ---352.2416687011719
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(134.6884, device='cuda:0')
spectral_norm tensor(11.4798, device='cuda:0')
frobenius_norm tensor(127.5331, device='cuda:0')
spectral_norm tensor(9.0437, device='cuda:0')
frobenius_norm tensor(126.5359, device='cuda:0')
spectral_norm tensor(4.9416, device='cuda:0')
alpha value of layer 28 ---330.72796630859375
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(135.0423, device='cuda:0')
spectral_norm tensor(12.0559, device='cuda:0')
frobenius_norm tensor(128.7051, device='cuda:0')
spectral_norm tensor(11.3077, device='cuda:0')
frobenius_norm tensor(126.9749, device='cuda:0')
spectral_norm tensor(4.4258, device='cuda:0')
alpha value of layer 29 ---359.37579345703125
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(138.1487, device='cuda:0')
spectral_norm tensor(19.7267, device='cuda:0')
frobenius_norm tensor(130.8807, device='cuda:0')
spectral_norm tensor(19.9367, device='cuda:0')
frobenius_norm tensor(126.2216, device='cuda:0')
spectral_norm tensor(4.5236, device='cuda:0')
alpha value of layer 30 ---290.23907470703125
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
frobenius_norm tensor(144.1483, device='cuda:0')
spectral_norm tensor(19.9982, device='cuda:0')
frobenius_norm tensor(135.9538, device='cuda:0')
spectral_norm tensor(19.9404, device='cuda:0')
frobenius_norm tensor(126.0626, device='cuda:0')
spectral_norm tensor(7.8316, device='cuda:0')
alpha value of layer 31 ---119.1807861328125
metric_name stable_rank: [24, 25, 3, 2, 23, 4, 26, 21, 22, 5, 20, 29, 27, 6, 19, 18, 28, 17, 7, 16, 15, 9, 30, 8, 14, 10, 13, 12, 11, 1, 31, 0]
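In the stable_rank run, the line labelled "alpha value of layer N" appears to hold the mean stable rank of the three MLP projections, where the stable rank of one matrix is (frobenius_norm / spectral_norm)^2. The sketch below checks this against the layer 0 numbers logged above. The function names are hypothetical, and the assumption that the three norm pairs follow the subset order (gate_proj, up_proj, down_proj) is not confirmed by the log, though it does not affect the mean.

```python
def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank of a matrix: ||W||_F^2 / ||W||_2^2."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_score(norm_pairs):
    """Mean stable rank over (frobenius_norm, spectral_norm) pairs."""
    ranks = [stable_rank(f, s) for f, s in norm_pairs]
    return sum(ranks) / len(ranks)


# (frobenius_norm, spectral_norm) pairs logged for layer 0.
layer0 = [(110.1283, 11.7725), (108.2747, 10.5846), (112.8760, 8.8966)]

score0 = layer_score(layer0)
print(score0)  # close to the logged 117.7087; inputs are rounded to 4 decimals
```

The same calculation on the layer 31 norms gives roughly 119.18, which also matches the logged value.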
Begin main_assign: Llama-2-7b-hf mlp effective_rank
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:18<00:18, 18.39s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 11.39s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.44s/it]
Once upon a time, my father-in-law went to the bank to change a cheque. He was asked to write his signature on the cheque and to then place a tick in the box indicating whether the cheque was a deposit or a withdrawal.
My father-in-law was confused. He was not accustomed to the new style of cheque, which was introduced in the 1980s.
“I know it says ‘deposit’, but
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 0 ---3572.474365234375
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 1 ---3627.507568359375
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 2 ---3733.4091796875
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 3 ---3785.27880859375
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 4 ---3790.103271484375
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 5 ---3778.20458984375
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 6 ---3768.807373046875
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 7 ---3756.693359375
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 8 ---3748.4443359375
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 9 ---3745.5185546875
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 10 ---3735.50341796875
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 11 ---3737.810302734375
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 12 ---3740.48388671875
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 13 ---3753.927734375
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 14 ---3753.2763671875
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 15 ---3771.952880859375
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 16 ---3771.9111328125
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 17 ---3772.70849609375
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 18 ---3783.5595703125
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 19 ---3788.821044921875
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 20 ---3790.424560546875
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 21 ---3790.768310546875
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 22 ---3790.56396484375
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 23 ---3793.769287109375
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 24 ---3795.335693359375
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 25 ---3797.1728515625
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 26 ---3800.90966796875
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 27 ---3804.0595703125
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 28 ---3810.544921875
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 29 ---3812.28857421875
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 30 ---3802.36669921875
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 31 ---3771.994873046875
metric_name effective_rank: [29, 28, 27, 30, 26, 25, 24, 23, 21, 22, 20, 4, 19, 3, 18, 5, 17, 31, 15, 16, 6, 7, 13, 14, 8, 9, 12, 11, 10, 2, 1, 0]
Begin main_assign: Llama-2-7b-hf mlp ZD
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:18<00:18, 18.59s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 11.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.60s/it]
Once upon a time, in a faraway land, there lived a prince and a princess. They loved each other very much. The prince was very rich, but the princess was poor. The prince loved the princess so much that he wanted to give her everything he had.
One day, the prince decided to give the princess a gift. He went to the market and bought her a beautiful golden ring. The princess was very happy. She loved the ring. But the prince was not happy
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 0 ---0.15578114986419678
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 1 ---0.15707579255104065
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 2 ---0.15797409415245056
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 3 ---0.1574869602918625
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 4 ---0.15727774798870087
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 5 ---0.15676063299179077
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 6 ---0.1562662124633789
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 7 ---0.15596774220466614
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 8 ---0.1566615253686905
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 9 ---0.1561465859413147
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 10 ---0.15556025505065918
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 11 ---0.15571480989456177
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 12 ---0.15566584467887878
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 13 ---0.15572252869606018
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 14 ---0.15585064888000488
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 15 ---0.15579025447368622
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 16 ---0.15610937774181366
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 17 ---0.1567423790693283
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 18 ---0.15685215592384338
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 19 ---0.15697705745697021
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 20 ---0.15734341740608215
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 21 ---0.1579303741455078
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 22 ---0.15809854865074158
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 23 ---0.15812799334526062
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 24 ---0.1577608585357666
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 25 ---0.15747487545013428
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 26 ---0.1573786735534668
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 27 ---0.15686684846878052
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 28 ---0.15638381242752075
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 29 ---0.1558859944343567
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 30 ---0.1538189947605133
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 31 ---0.15391211211681366
metric_name ZD: [23, 22, 2, 21, 24, 3, 25, 26, 20, 4, 1, 19, 27, 18, 5, 17, 8, 28, 6, 9, 16, 7, 29, 14, 15, 0, 13, 11, 12, 10, 31, 30]
Begin main_assign: Llama-2-7b-hf mlp head_diversity
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:18<00:18, 18.53s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 11.41s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.47s/it]
Once upon a time, there were three kingdoms. The first kingdom was ruled by a king who was very wise. The second kingdom was ruled by a king who was very greedy. The third kingdom was ruled by a king who was very lazy.
The first king was very happy and lived in peace with his people. The second king was very unhappy and lived in fear of his people. The third king was very sad and lived in shame with his people.
One day, the first king decided to
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
Traceback (most recent call last):
  File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module>
    all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword)
  File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 354, in calculate_expert
    all_layer_alpha.append(torch.stack(layer_final_alpha).mean().item())
RuntimeError: stack expects a non-empty TensorList
Begin main_assign: Llama-2-7b-hf mlp coherence
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:18<00:18, 18.65s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 11.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.61s/it]
Once upon a time, I was a child. I was full of dreams and visions of the future. I believed in the power of love and in the strength of family. I was the idealist, the dreamer, the one who believed that all good things were possible.
Then I grew up. I became a teenager. I was still full of dreams and visions of the future, but the dreams were tinged with darkness, and the visions were full of desp
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 0 ---0.018758879974484444
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 1 ---0.015916038304567337
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 2 ---0.014297164976596832
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 3 ---0.013034423813223839
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 4 ---0.012964712455868721
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 5 ---0.013287676498293877
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 6 ---0.013558547012507915
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 7 ---0.013750488869845867
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 8 ---0.013907302170991898
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 9 ---0.013899993151426315
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 10 ---0.014043152332305908
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 11 ---0.014066706411540508
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 12 ---0.01396130956709385
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 13 ---0.013774177059531212
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 14 ---0.013727117329835892
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 15 ---0.013336378149688244
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 16 ---0.013238323852419853
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 17 ---0.013199622742831707
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 18 ---0.012931596487760544
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 19 ---0.012751361355185509
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 20 ---0.012693298980593681
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 21 ---0.012600544840097427
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 22 ---0.012644654139876366
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 23 ---0.012538459151983261
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 24 ---0.012461837381124496
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 25 ---0.012478632852435112
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 26 ---0.012449456378817558
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 27 ---0.012493574991822243
Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 28 ---0.012300195172429085
Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 29 ---0.01232621818780899
Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 30 ---0.012763611972332
Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)}
alpha value of layer 31 ---0.014904310926795006
metric_name coherence: [0, 1, 31, 2, 11, 10, 12, 8, 9, 13, 7, 14, 6, 15, 5, 16, 17, 3, 4, 18, 30, 19, 20, 22, 21, 23, 27, 25, 24, 26, 29, 28]
Begin main_assign: Qwen2.5-7B self_attn alpha
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.30it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a king who loved gold. He even had his own gold mine, but he was never satisfied with what he had. One day, he decided to go on a journey to find even more gold. He sent his men to explore the world and bring back as much gold as they could find. They searched high and low, but they could not find any more gold. The king was disappointed. He felt like he had been cheated. But then, he had an idea. He decided
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 0 ---3.7398271560668945
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 1 ---4.61224889755249
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 2 ---3.6950488090515137
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 3 ---3.5232601165771484
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 4 ---4.135606288909912
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 5 ---3.5275750160217285
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 6 ---6.992785453796387
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 7 ---3.8776655197143555
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 8 ---5.665498733520508
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 9 ---4.215641021728516
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 10 ---3.8878092765808105
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 11 ---4.219095706939697
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 12 ---3.7720558643341064
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 13 ---7.641180515289307
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 14 ---3.707275152206421
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 15 ---3.25449800491333
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 16 ---4.158566474914551
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 17 ---3.467252492904663
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 18 ---5.359913349151611
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 19 ---3.048678398132324
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 20 ---2.6340246200561523
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 21 ---4.8504228591918945
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 22 ---3.2433063983917236
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 23 ---6.264342308044434
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 24 ---3.845273494720459
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 25 ---4.786241054534912
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 26 ---2.9173049926757812
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 27 ---2.295574426651001
metric_name alpha: [13, 6, 23, 8, 18, 21, 25, 1, 11, 9, 16, 4, 10, 7, 24, 12, 0, 14, 2, 5, 3, 17, 15, 22, 19, 26, 20, 27]
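The `metric_name alpha` list above matches the layer indices sorted by descending alpha value. The script's own sorting code is not shown in this log, so the following is only a minimal sketch under that assumption: the values are copied from the `alpha value of layer N` lines above, and the names `alphas` and `ranking` are hypothetical.

```python
# Alpha values for Qwen2.5-7B self_attn, layers 0..27, copied from the log above.
alphas = [
    3.7398271560668945, 4.61224889755249, 3.6950488090515137, 3.5232601165771484,
    4.135606288909912, 3.5275750160217285, 6.992785453796387, 3.8776655197143555,
    5.665498733520508, 4.215641021728516, 3.8878092765808105, 4.219095706939697,
    3.7720558643341064, 7.641180515289307, 3.707275152206421, 3.25449800491333,
    4.158566474914551, 3.467252492904663, 5.359913349151611, 3.048678398132324,
    2.6340246200561523, 4.8504228591918945, 3.2433063983917236, 6.264342308044434,
    3.845273494720459, 4.786241054534912, 2.9173049926757812, 2.295574426651001,
]

# Assumed ordering rule: rank layer indices from largest to smallest alpha.
ranking = sorted(range(len(alphas)), key=lambda i: alphas[i], reverse=True)
print("metric_name alpha:", ranking)
```

The same rule is consistent with the visible part of the preceding Llama-2-7b-hf list, where layer 31 (the largest value shown) precedes layer 2 and layer 28 (the smallest) comes last.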
Begin main_assign: Qwen2.5-7B self_attn alpha_hat
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.54it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a man who was in the habit of getting up at six o'clock every morning and jogging around his neighborhood. One morning, while he was out for his usual morning run, he saw an old man sitting in a park bench. He asked the old man what he was doing there. The old man replied,"I am waiting for my son to arrive." The man asked,"What does he do? Where does he work?" The old man said,"He is
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 0 ---11.435125350952148
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 1 ---7.411468505859375
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 2 ---7.694450378417969
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 3 ---7.890998363494873
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 4 ---7.172995567321777
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 5 ---7.2080488204956055
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 6 ---11.36629867553711
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 7 ---7.176176071166992
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 8 ---7.694736957550049
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 9 ---7.057432174682617
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 10 ---6.2363386154174805
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 11 ---6.52932071685791
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 12 ---7.182070732116699
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 13 ---8.655073165893555
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 14 ---5.977964401245117
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 15 ---5.821178436279297
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 16 ---7.714193820953369
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 17 ---6.244594573974609
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 18 ---7.704196929931641
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 19 ---5.97743558883667
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 20 ---5.515291213989258
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 21 ---7.423676490783691
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 22 ---6.235415458679199
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 23 ---11.209845542907715
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 24 ---7.722582817077637
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 25 ---10.762764930725098
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 26 ---7.431804180145264
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 27 ---8.220139503479004
metric_name alpha_hat: [0, 6, 23, 25, 13, 27, 3, 24, 16, 18, 8, 2, 26, 21, 1, 5, 12, 7, 4, 9, 11, 17, 10, 22, 14, 19, 15, 20]
Begin main_assign: Qwen2.5-7B self_attn stable_rank
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.34it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a family with 10 children. Each of the 10 children had a different number of books in their bookshelves. The first child had 1 book, the second had 2 books, the third had 3 books, and so on until the tenth child, who had 10 books. One day, the parents decided to redistribute the books so that each child would have the same number of books. How many books did each child end up with? To
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
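Note on the projection shapes printed above: the 512-wide k_proj and v_proj outputs follow from the grouped-query attention settings in this config. The check below is a minimal sketch. It assumes head_dim equals hidden_size / num_attention_heads, since the config dump does not list head_dim explicitly.

```python
# Values copied from the Qwen2Config dump above.
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4

# Assumption: head_dim is hidden_size / num_attention_heads (not printed in the config).
head_dim = hidden_size // num_attention_heads

q_out = num_attention_heads * head_dim       # q_proj out_features
kv_out = num_key_value_heads * head_dim      # k_proj / v_proj out_features
queries_per_kv = num_attention_heads // num_key_value_heads

print(head_dim, q_out, kv_out, queries_per_kv)
```

Under that assumption, q_out matches the logged q_proj width (3584) and kv_out matches the logged k_proj and v_proj width (512).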
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(74.6100, device='cuda:0')
spectral_norm tensor(22.8962, device='cuda:0')
frobenius_norm tensor(37.2340, device='cuda:0')
spectral_norm tensor(4.1839, device='cuda:0')
frobenius_norm tensor(12.2050, device='cuda:0')
spectral_norm tensor(1.2453, device='cuda:0')
frobenius_norm tensor(44.6291, device='cuda:0')
spectral_norm tensor(6.0001, device='cuda:0')
alpha value of layer 0 ---60.29820251464844
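Note on the value above: although the line is labelled "alpha value", the number is consistent with the mean stable rank of the four projections, where stable rank is (frobenius_norm / spectral_norm)^2. The sketch below recomputes it from the layer 0 norms. It assumes the four norm pairs are printed in the order q_proj, k_proj, v_proj, o_proj, and it is not taken from the script that produced this log.

```python
# (frobenius_norm, spectral_norm) pairs copied from the layer 0 lines above.
# Assumed order: q_proj, k_proj, v_proj, o_proj.
layer0_norms = [
    (74.6100, 22.8962),
    (37.2340, 4.1839),
    (12.2050, 1.2453),
    (44.6291, 6.0001),
]


def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank of a matrix: ||W||_F^2 / ||W||_2^2."""
    return (frobenius_norm / spectral_norm) ** 2


ranks = [stable_rank(f, s) for f, s in layer0_norms]
layer_score = sum(ranks) / len(ranks)  # unweighted mean over the four projections

print(ranks)
print(layer_score)
```

The mean comes out close to the logged 60.298. The small difference is expected because the norms in the log are rounded to four decimals.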
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(56.7970, device='cuda:0')
spectral_norm tensor(10.4783, device='cuda:0')
frobenius_norm tensor(29.9237, device='cuda:0')
spectral_norm tensor(4.7115, device='cuda:0')
frobenius_norm tensor(18.3318, device='cuda:0')
spectral_norm tensor(1.4635, device='cuda:0')
frobenius_norm tensor(51.8109, device='cuda:0')
spectral_norm tensor(3.7208, device='cuda:0')
alpha value of layer 1 ---105.13154602050781
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(65.4367, device='cuda:0')
spectral_norm tensor(5.5540, device='cuda:0')
frobenius_norm tensor(31.8038, device='cuda:0')
spectral_norm tensor(3.0331, device='cuda:0')
frobenius_norm tensor(14.7804, device='cuda:0')
spectral_norm tensor(1.2128, device='cuda:0')
frobenius_norm tensor(50.6930, device='cuda:0')
spectral_norm tensor(3.9768, device='cuda:0')
alpha value of layer 2 ---139.9442901611328
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(67.9445, device='cuda:0')
spectral_norm tensor(7.3888, device='cuda:0')
frobenius_norm tensor(32.4597, device='cuda:0')
spectral_norm tensor(3.6898, device='cuda:0')
frobenius_norm tensor(17.2702, device='cuda:0')
spectral_norm tensor(1.2119, device='cuda:0')
frobenius_norm tensor(53.1248, device='cuda:0')
spectral_norm tensor(3.7626, device='cuda:0')
alpha value of layer 3 ---141.09706115722656
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(64.7004, device='cuda:0')
spectral_norm tensor(6.6270, device='cuda:0')
frobenius_norm tensor(29.1071, device='cuda:0')
spectral_norm tensor(3.3637, device='cuda:0')
frobenius_norm tensor(20.8693, device='cuda:0')
spectral_norm tensor(1.5107, device='cuda:0')
frobenius_norm tensor(53.3331, device='cuda:0')
spectral_norm tensor(3.7430, device='cuda:0')
alpha value of layer 4 ---141.01356506347656
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(63.2364, device='cuda:0')
spectral_norm tensor(5.9598, device='cuda:0')
frobenius_norm tensor(26.6552, device='cuda:0')
spectral_norm tensor(2.6609, device='cuda:0')
frobenius_norm tensor(19.8491, device='cuda:0')
spectral_norm tensor(1.5480, device='cuda:0')
frobenius_norm tensor(53.4192, device='cuda:0')
spectral_norm tensor(4.0186, device='cuda:0')
alpha value of layer 5 ---138.5137939453125
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(64.2853, device='cuda:0')
spectral_norm tensor(5.5097, device='cuda:0')
frobenius_norm tensor(28.5616, device='cuda:0')
spectral_norm tensor(2.9881, device='cuda:0')
frobenius_norm tensor(20.6937, device='cuda:0')
spectral_norm tensor(1.4517, device='cuda:0')
frobenius_norm tensor(54.9730, device='cuda:0')
spectral_norm tensor(5.0745, device='cuda:0')
alpha value of layer 6 ---137.01039123535156
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(60.9057, device='cuda:0')
spectral_norm tensor(4.5698, device='cuda:0')
frobenius_norm tensor(23.8630, device='cuda:0')
spectral_norm tensor(2.5382, device='cuda:0')
frobenius_norm tensor(24.6983, device='cuda:0')
spectral_norm tensor(1.7165, device='cuda:0')
frobenius_norm tensor(60.3206, device='cuda:0')
spectral_norm tensor(3.5918, device='cuda:0')
alpha value of layer 7 ---188.773193359375
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(63.3943, device='cuda:0')
spectral_norm tensor(3.9526, device='cuda:0')
frobenius_norm tensor(27.2992, device='cuda:0')
spectral_norm tensor(2.5236, device='cuda:0')
frobenius_norm tensor(20.6254, device='cuda:0')
spectral_norm tensor(1.4007, device='cuda:0')
frobenius_norm tensor(55.2465, device='cuda:0')
spectral_norm tensor(3.4821, device='cuda:0')
alpha value of layer 8 ---210.70074462890625
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(58.6742, device='cuda:0')
spectral_norm tensor(4.6933, device='cuda:0')
frobenius_norm tensor(23.2825, device='cuda:0')
spectral_norm tensor(2.6721, device='cuda:0')
frobenius_norm tensor(24.7824, device='cuda:0')
spectral_norm tensor(1.5406, device='cuda:0')
frobenius_norm tensor(60.4736, device='cuda:0')
spectral_norm tensor(6.3921, device='cuda:0')
alpha value of layer 9 ---145.1187744140625
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(63.5265, device='cuda:0')
spectral_norm tensor(4.3337, device='cuda:0')
frobenius_norm tensor(27.1029, device='cuda:0')
spectral_norm tensor(2.6480, device='cuda:0')
frobenius_norm tensor(22.3527, device='cuda:0')
spectral_norm tensor(1.3822, device='cuda:0')
frobenius_norm tensor(57.0854, device='cuda:0')
spectral_norm tensor(4.1727, device='cuda:0')
alpha value of layer 10 ---192.07977294921875
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(64.2904, device='cuda:0')
spectral_norm tensor(4.0832, device='cuda:0')
frobenius_norm tensor(28.0532, device='cuda:0')
spectral_norm tensor(2.6312, device='cuda:0')
frobenius_norm tensor(19.8152, device='cuda:0')
spectral_norm tensor(1.4528, device='cuda:0')
frobenius_norm tensor(55.3427, device='cuda:0')
spectral_norm tensor(4.3578, device='cuda:0')
alpha value of layer 11 ---177.22564697265625
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(62.3252, device='cuda:0')
spectral_norm tensor(3.9938, device='cuda:0')
frobenius_norm tensor(26.8499, device='cuda:0')
spectral_norm tensor(2.5529, device='cuda:0')
frobenius_norm tensor(20.6057, device='cuda:0')
spectral_norm tensor(1.3954, device='cuda:0')
frobenius_norm tensor(55.6693, device='cuda:0')
spectral_norm tensor(4.6503, device='cuda:0')
alpha value of layer 12 ---178.88079833984375
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(61.0881, device='cuda:0')
spectral_norm tensor(4.2847, device='cuda:0')
frobenius_norm tensor(25.6297, device='cuda:0')
spectral_norm tensor(2.9570, device='cuda:0')
frobenius_norm tensor(22.4836, device='cuda:0')
spectral_norm tensor(1.3594, device='cuda:0')
frobenius_norm tensor(57.5732, device='cuda:0')
spectral_norm tensor(5.3631, device='cuda:0')
alpha value of layer 13 ---166.80108642578125
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(59.9385, device='cuda:0')
spectral_norm tensor(4.1989, device='cuda:0')
frobenius_norm tensor(25.3315, device='cuda:0')
spectral_norm tensor(2.3592, device='cuda:0')
frobenius_norm tensor(19.7791, device='cuda:0')
spectral_norm tensor(1.4382, device='cuda:0')
frobenius_norm tensor(54.6680, device='cuda:0')
spectral_norm tensor(4.8688, device='cuda:0')
alpha value of layer 14 ---158.56539916992188
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(62.2632, device='cuda:0')
spectral_norm tensor(4.4386, device='cuda:0')
frobenius_norm tensor(26.5240, device='cuda:0')
spectral_norm tensor(2.5572, device='cuda:0')
frobenius_norm tensor(20.8667, device='cuda:0')
spectral_norm tensor(1.5160, device='cuda:0')
frobenius_norm tensor(55.3834, device='cuda:0')
spectral_norm tensor(4.4431, device='cuda:0')
alpha value of layer 15 ---162.2994384765625
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(59.5504, device='cuda:0')
spectral_norm tensor(4.1705, device='cuda:0')
frobenius_norm tensor(23.7914, device='cuda:0')
spectral_norm tensor(2.3827, device='cuda:0')
frobenius_norm tensor(22.4299, device='cuda:0')
spectral_norm tensor(1.8284, device='cuda:0')
frobenius_norm tensor(57.2100, device='cuda:0')
spectral_norm tensor(5.3112, device='cuda:0')
alpha value of layer 16 ---142.52708435058594
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(61.2952, device='cuda:0')
spectral_norm tensor(3.9247, device='cuda:0')
frobenius_norm tensor(24.2607, device='cuda:0')
spectral_norm tensor(2.3428, device='cuda:0')
frobenius_norm tensor(22.0993, device='cuda:0')
spectral_norm tensor(1.6527, device='cuda:0')
frobenius_norm tensor(56.9147, device='cuda:0')
spectral_norm tensor(4.7962, device='cuda:0')
alpha value of layer 17 ---167.6918182373047
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(57.8086, device='cuda:0')
spectral_norm tensor(4.1784, device='cuda:0')
frobenius_norm tensor(23.1346, device='cuda:0')
spectral_norm tensor(2.7649, device='cuda:0')
frobenius_norm tensor(24.9173, device='cuda:0')
spectral_norm tensor(1.5424, device='cuda:0')
frobenius_norm tensor(60.8592, device='cuda:0')
spectral_norm tensor(5.2139, device='cuda:0')
alpha value of layer 18 ---164.66351318359375
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(57.4449, device='cuda:0')
spectral_norm tensor(4.5447, device='cuda:0')
frobenius_norm tensor(21.3000, device='cuda:0')
spectral_norm tensor(2.4431, device='cuda:0')
frobenius_norm tensor(25.0779, device='cuda:0')
spectral_norm tensor(1.6237, device='cuda:0')
frobenius_norm tensor(59.6389, device='cuda:0')
spectral_norm tensor(4.9505, device='cuda:0')
alpha value of layer 19 ---154.86126708984375
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(58.7482, device='cuda:0')
spectral_norm tensor(4.0858, device='cuda:0')
frobenius_norm tensor(22.3411, device='cuda:0')
spectral_norm tensor(2.3521, device='cuda:0')
frobenius_norm tensor(25.9143, device='cuda:0')
spectral_norm tensor(1.7104, device='cuda:0')
frobenius_norm tensor(60.9286, device='cuda:0')
spectral_norm tensor(4.8709, device='cuda:0')
alpha value of layer 20 ---170.74554443359375
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(57.1627, device='cuda:0')
spectral_norm tensor(3.6813, device='cuda:0')
frobenius_norm tensor(19.9702, device='cuda:0')
spectral_norm tensor(2.2489, device='cuda:0')
frobenius_norm tensor(27.7738, device='cuda:0')
spectral_norm tensor(1.6762, device='cuda:0')
frobenius_norm tensor(63.2683, device='cuda:0')
spectral_norm tensor(5.2737, device='cuda:0')
alpha value of layer 21 ---184.6076202392578
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(58.1757, device='cuda:0')
spectral_norm tensor(4.0771, device='cuda:0')
frobenius_norm tensor(19.4699, device='cuda:0')
spectral_norm tensor(1.9257, device='cuda:0')
frobenius_norm tensor(27.1717, device='cuda:0')
spectral_norm tensor(2.0305, device='cuda:0')
frobenius_norm tensor(63.5510, device='cuda:0')
spectral_norm tensor(4.3551, device='cuda:0')
alpha value of layer 22 ---174.45973205566406
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(59.5436, device='cuda:0')
spectral_norm tensor(4.0933, device='cuda:0')
frobenius_norm tensor(20.5584, device='cuda:0')
spectral_norm tensor(2.1754, device='cuda:0')
frobenius_norm tensor(27.5827, device='cuda:0')
spectral_norm tensor(1.8698, device='cuda:0')
frobenius_norm tensor(64.9988, device='cuda:0')
spectral_norm tensor(5.3918, device='cuda:0')
alpha value of layer 23 ---165.96688842773438
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(56.8169, device='cuda:0')
spectral_norm tensor(3.8277, device='cuda:0')
frobenius_norm tensor(19.4813, device='cuda:0')
spectral_norm tensor(1.8291, device='cuda:0')
frobenius_norm tensor(31.0836, device='cuda:0')
spectral_norm tensor(2.3168, device='cuda:0')
frobenius_norm tensor(65.3215, device='cuda:0')
spectral_norm tensor(6.2412, device='cuda:0')
alpha value of layer 24 ---155.8324432373047
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(55.3445, device='cuda:0')
spectral_norm tensor(3.8164, device='cuda:0')
frobenius_norm tensor(17.7185, device='cuda:0')
spectral_norm tensor(1.8025, device='cuda:0')
frobenius_norm tensor(33.9660, device='cuda:0')
spectral_norm tensor(3.0984, device='cuda:0')
frobenius_norm tensor(68.8981, device='cuda:0')
spectral_norm tensor(5.3780, device='cuda:0')
alpha value of layer 25 ---147.80517578125
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(51.7868, device='cuda:0')
spectral_norm tensor(3.9548, device='cuda:0')
frobenius_norm tensor(16.9575, device='cuda:0')
spectral_norm tensor(1.9295, device='cuda:0')
frobenius_norm tensor(40.6153, device='cuda:0')
spectral_norm tensor(3.1825, device='cuda:0')
frobenius_norm tensor(71.6628, device='cuda:0')
spectral_norm tensor(6.2830, device='cuda:0')
alpha value of layer 26 ---135.41717529296875
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
frobenius_norm tensor(56.6802, device='cuda:0')
spectral_norm tensor(7.8725, device='cuda:0')
frobenius_norm tensor(18.0258, device='cuda:0')
spectral_norm tensor(2.1181, device='cuda:0')
frobenius_norm tensor(36.7123, device='cuda:0')
spectral_norm tensor(4.5755, device='cuda:0')
frobenius_norm tensor(66.9187, device='cuda:0')
spectral_norm tensor(10.3449, device='cuda:0')
alpha value of layer 27 ---57.62150573730469
metric_name stable_rank: [8, 10, 7, 21, 12, 11, 22, 20, 17, 13, 23, 18, 15, 14, 24, 19, 25, 9, 16, 3, 4, 2, 5, 6, 26, 1, 0, 27]
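Note on the list above: it matches the layer indices sorted by the per-layer value in descending order. The sketch below reproduces that ordering from the logged values, rounded here to three decimals. The variable names are illustrative and not taken from the original script.

```python
# Per-layer values from the "alpha value of layer N" lines of the stable_rank run,
# rounded to three decimals. Index in the list = layer index.
scores = [
    60.298, 105.132, 139.944, 141.097, 141.014, 138.514, 137.010,
    188.773, 210.701, 145.119, 192.080, 177.226, 178.881, 166.801,
    158.565, 162.299, 142.527, 167.692, 164.664, 154.861, 170.746,
    184.608, 174.460, 165.967, 155.832, 147.805, 135.417, 57.622,
]

# Layer indices ordered from highest to lowest value.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Ranking copied from the "metric_name stable_rank" line.
logged = [8, 10, 7, 21, 12, 11, 22, 20, 17, 13, 23, 18, 15, 14,
          24, 19, 25, 9, 16, 3, 4, 2, 5, 6, 26, 1, 0, 27]

print(order == logged)
```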
Begin main_assign: Qwen2.5-7B self_attn effective_rank
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.50it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, in a faraway land, there was a wizard named Zephyr. Zephyr had a magical garden filled with enchanted flowers that bloomed only once a year on the night of the full moon. Each flower had a unique power: some could grant wishes, others could heal, and some could even bring the dead back to life. Zephyr knew that the garden was in danger, and he needed to protect it. He decided to create a secret code to lock the garden's entrance
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 0 ---1452.84130859375
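Note on the effective_rank run: the log does not show how the script computes this metric. One common definition is the entropy-based effective rank: the exponential of the Shannon entropy of the singular values normalized to sum to 1. The sketch below implements that definition on a plain list of singular values. It is an assumption about the metric, not the code behind the numbers above.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # terms with p == 0 contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


# A flat spectrum gives the full rank; a single nonzero value gives 1.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([1.0, 0.0, 0.0, 0.0]))
```

Under this definition the per-matrix value is bounded by the smaller matrix dimension, so a mean over q_proj, k_proj, v_proj and o_proj could not exceed (3584 + 512 + 512 + 3584) / 4 = 2048.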
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 1 ---1305.515869140625
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 2 ---1505.5045166015625
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 3 ---1528.22900390625
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 4 ---1506.9344482421875
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 5 ---1515.52099609375
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 6 ---1521.9154052734375
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 7 ---1530.30126953125
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 8 ---1513.680908203125
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 9 ---1508.0103759765625
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 10 ---1541.102294921875
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 11 ---1515.346923828125
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 12 ---1515.414306640625
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 13 ---1518.91796875
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 14 ---1460.6475830078125
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 15 ---1482.7188720703125
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 16 ---1503.2802734375
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 17 ---1507.509033203125
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 18 ---1475.742431640625
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 19 ---1503.236083984375
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 20 ---1508.0513916015625
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 21 ---1513.8974609375
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 22 ---1492.510986328125
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 23 ---1560.966552734375
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 24 ---1547.26904296875
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 25 ---1559.457763671875
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 26 ---1530.5631103515625
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 27 ---1474.574462890625
metric_name effective_rank: [23, 25, 24, 10, 26, 7, 3, 6, 13, 5, 12, 11, 21, 8, 20, 9, 17, 4, 2, 16, 19, 22, 15, 18, 27, 14, 0, 1]
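Note on the ranking line above: the printed list matches the layer indices sorted by their logged alpha value, largest first (layer 23 at about 1560.97 down to layer 1 at about 1305.52). The sketch below reproduces that ordering from the values in this log. `rank_layers` is a hypothetical helper name, not a function taken from `main_assign.py`, and the descending sort is inferred from the log output rather than confirmed against the source.

```python
# Values copied from the "alpha value of layer N" lines of the effective_rank run.
alphas = [
    1452.84130859375, 1305.515869140625, 1505.5045166015625, 1528.22900390625,
    1506.9344482421875, 1515.52099609375, 1521.9154052734375, 1530.30126953125,
    1513.680908203125, 1508.0103759765625, 1541.102294921875, 1515.346923828125,
    1515.414306640625, 1518.91796875, 1460.6475830078125, 1482.7188720703125,
    1503.2802734375, 1507.509033203125, 1475.742431640625, 1503.236083984375,
    1508.0513916015625, 1513.8974609375, 1492.510986328125, 1560.966552734375,
    1547.26904296875, 1559.457763671875, 1530.5631103515625, 1474.574462890625,
]


def rank_layers(values):
    """Return layer indices ordered by metric value, largest first (assumed order)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


ranking = rank_layers(alphas)
print(ranking)
```

If the script actually ranks in ascending order for some metrics, only the `reverse` flag would need to change.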
Begin main_assign: Qwen2.5-7B self_attn ZD
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.70it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time there was a very special fish, who lived in a very special lake, and who had a very special name.
And when the fish was born, his parents named him Nemo.
Nemo was a very happy fish, who loved to swim around his lake, and who had a lot of friends.
There was Dory, the forgetful fish, who would always forget where she was going, and Marlin, the protective fish, who would always look after his family.
Nemo was also
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "use_mrope": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 0 ---0.14184384047985077
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 1 ---0.13445976376533508
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 2 ---0.1448441445827484
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 3 ---0.14360329508781433
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 4 ---0.14149385690689087
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 5 ---0.14219337701797485
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 6 ---0.14448319375514984
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 7 ---0.14329871535301208
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 8 ---0.14073410630226135
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 9 ---0.14020180702209473
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 10 ---0.14346933364868164
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 11 ---0.14193479716777802
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 12 ---0.1403268575668335
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 13 ---0.14010006189346313
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 14 ---0.13355384767055511
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 15 ---0.13806670904159546
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 16 ---0.13962170481681824
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 17 ---0.13812372088432312
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 18 ---0.13966597616672516
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 19 ---0.13903221487998962
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 20 ---0.1419043242931366
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 21 ---0.13662637770175934
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 22 ---0.13463997840881348
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 23 ---0.13744638860225677
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 24 ---0.1426282525062561
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 25 ---0.1387733370065689
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 26 ---0.13506180047988892
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 27 ---0.13298790156841278
metric_name ZD: [2, 6, 3, 10, 7, 24, 5, 11, 20, 0, 4, 8, 12, 9, 13, 18, 16, 19, 25, 17, 15, 23, 21, 26, 22, 1, 14, 27]
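Note on reading this log programmatically: each run starts with a `Begin main_assign:` line and reports one `alpha value of layer N ---X` line per layer. A minimal sketch for collecting those values per run is shown below. `parse_runs` and `ALPHA_LINE` are hypothetical names, and the line formats are assumed to be exactly the ones visible in this log.

```python
import re

# Matches lines such as "alpha value of layer 3 ---0.14360329508781433".
ALPHA_LINE = re.compile(r"^alpha value of layer (\d+) ---(\S+)$")
RUN_PREFIX = "Begin main_assign:"


def parse_runs(log_text):
    """Map each run header (model, module, metric) to {layer_index: value}."""
    runs = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(RUN_PREFIX):
            current = line[len(RUN_PREFIX):].strip()
            runs[current] = {}
            continue
        match = ALPHA_LINE.match(line)
        if match and current is not None:
            runs[current][int(match.group(1))] = float(match.group(2))
    return runs


# Short excerpt built from lines that appear in this log.
sample = "\n".join([
    "Begin main_assign: Qwen2.5-7B self_attn ZD",
    "alpha value of layer 0 ---0.14184384047985077",
    "alpha value of layer 1 ---0.13445976376533508",
    "Begin main_assign: Qwen2.5-7B self_attn coherence",
    "alpha value of layer 0 ---0.019099362194538116",
])
runs = parse_runs(sample)
print(runs)
```

A run that crashes before printing any value, such as the head_diversity run, would simply appear with an empty mapping.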
Begin main_assign: Qwen2.5-7B self_attn head_diversity
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.16it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time in the land of Mathoria, there was a magical forest where every tree had a unique number of leaves. The King of Mathoria decided to plant a new tree every day for a week (7 days), starting with 1 leaf on the first day and increasing the number of leaves by 1 each day. However, a mischievous sprite named Sprinkle loved to play tricks on the trees. On every even day, Sprinkle would randomly remove a number of leaves from the tree, between
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "use_mrope": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
Traceback (most recent call last):
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module>
all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword)
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 350, in calculate_expert
layer_final_alpha = func_call[metric](num_heads, subset)
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 310, in head_diversity_asssist
ans.append(head_diversity(W, num_heads))
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 191, in head_diversity
w_heads = W.view(num_heads, head_dim, d_in)
RuntimeError: shape '[28, 18, 3584]' is invalid for input of size 1835008
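Note on the traceback above: 1835008 equals 512 * 3584, the size of a `k_proj` or `v_proj` weight, while the requested view `[28, 18, 3584]` holds only 28 * 18 * 3584 = 1806336 elements. A head dimension of 18 is what integer division of 512 by 28 gives, so a likely cause (an assumption, since `expert_number.py` is not shown here) is that `num_attention_heads` (28) was used for a projection that has only `num_key_value_heads` (4) heads under grouped-query attention. The sketch below computes a per-head view shape that respects that split. `head_view_shape` is a hypothetical helper, and `o_proj` is deliberately left out because its heads partition the input dimension rather than the output dimension.

```python
def head_view_shape(proj_name, out_features, in_features,
                    num_attention_heads, num_key_value_heads):
    """Return (n_heads, head_dim, d_in) for viewing a q/k/v weight per head.

    Under grouped-query attention, k_proj and v_proj have
    num_key_value_heads heads; q_proj has num_attention_heads.
    """
    if proj_name.endswith(("k_proj", "v_proj")):
        n_heads = num_key_value_heads
    elif proj_name.endswith("q_proj"):
        n_heads = num_attention_heads
    else:
        raise ValueError(f"per-head view not defined here for {proj_name!r}")
    if out_features % n_heads != 0:
        raise ValueError(
            f"{out_features} output features do not split evenly into {n_heads} heads"
        )
    return (n_heads, out_features // n_heads, in_features)


# Values taken from the Qwen2.5-7B config and module dump in this log.
q_shape = head_view_shape("self_attn.q_proj", 3584, 3584, 28, 4)
k_shape = head_view_shape("self_attn.k_proj", 512, 3584, 28, 4)
print(q_shape, k_shape)

# The element count of the k_proj view now matches the size in the error message.
k_elements = k_shape[0] * k_shape[1] * k_shape[2]
print(k_elements)
```

With a shape computed this way, a call such as `W.view(*shape)` would be given a shape whose element count matches the weight. The divisibility check turns any remaining mismatch into an explicit error instead of a silently truncated head dimension.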
Begin main_assign: Qwen2.5-7B self_attn coherence
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.25it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time in the not-so-distant past, the average person had a choice of two or three local phone companies and one long distance company. Now, you have to be an expert to navigate the maze of telephone companies and options. This article will help you make the right choice for your telephone needs.
If you are using a cellular phone, you should only use it in an emergency. It is important to use a cell phone only in emergencies because they use a lot of battery power. If you use a
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "use_mrope": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 0 ---0.019099362194538116
Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 1 ---0.03270909935235977
Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 2 ---0.02031659334897995
Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 3 ---0.020667918026447296
Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 4 ---0.021066918969154358
Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 5 ---0.019672438502311707
Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 6 ---0.020373258739709854
Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 7 ---0.01879500225186348
Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 8 ---0.018471794202923775
Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 9 ---0.01999843120574951
Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 10 ---0.018006717786192894
Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 11 ---0.019990842789411545
Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 12 ---0.01996159367263317
Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 13 ---0.020418085157871246
Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 14 ---0.020115870982408524
Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 15 ---0.02094407193362713
Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 16 ---0.02063441462814808
Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 17 ---0.018953558057546616
Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 18 ---0.020888380706310272
Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 19 ---0.01945885643362999
Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 20 ---0.019533313810825348
Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 21 ---0.018554046750068665
Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 22 ---0.020378313958644867
Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 23 ---0.019100410863757133
Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 24 ---0.018557211384177208
Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 25 ---0.019475571811199188
Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 26 ---0.021406373009085655
Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)}
alpha value of layer 27 ---0.02655916102230549
metric_name coherence: [1, 27, 26, 4, 15, 18, 3, 16, 13, 22, 6, 2, 14, 9, 11, 12, 5, 20, 25, 19, 23, 0, 17, 7, 24, 21, 8, 10]
Begin main_assign: Qwen2.5-7B mlp alpha
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.41it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time there was a little girl named Maria. She was 10 years old and she loved to play with her dolls. Every night before she went to sleep she would place her dolls on her bed and have a tea party with them. One night, she was feeling very lonely. She wanted someone to talk to who could understand her. She wanted to be with her dolls, but she wanted a friend too. She prayed that God would send her a friend. She prayed that God would send her a
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
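The projection sizes in the module dump above can be cross-checked against this config. The sketch below is a minimal illustration, not code from the run: it assumes the per-head dimension equals hidden_size / num_attention_heads (the Qwen2 config printed here has no explicit head_dim field) and that k_proj and v_proj emit num_key_value_heads * head_dim features. The helper name projection_shapes is hypothetical.

```python
# Values copied from the Qwen2Config dump in this log.
config = {
    "hidden_size": 3584,
    "intermediate_size": 18944,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
}


def projection_shapes(cfg):
    """Return (in_features, out_features) per projection.

    Assumption: head_dim = hidden_size // num_attention_heads, since the
    logged config does not list head_dim explicitly.
    """
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim  # grouped-query K/V width
    inter = cfg["intermediate_size"]
    return {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, inter),
        "up_proj": (hidden, inter),
        "down_proj": (inter, hidden),
    }


shapes = projection_shapes(config)
for name, (fan_in, fan_out) in shapes.items():
    print(f"{name}: in_features={fan_in}, out_features={fan_out}")
```

Under that assumption, the computed k_proj and v_proj widths come out at 512, matching the out_features printed for Qwen2Attention.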
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 0 ---5.5507073402404785
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 1 ---2.925907850265503
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 2 ---3.427783727645874
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 3 ---4.130731105804443
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 4 ---4.2790751457214355
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 5 ---4.67555570602417
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 6 ---5.680507659912109
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 7 ---5.469402313232422
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 8 ---4.489261150360107
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 9 ---5.958518981933594
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 10 ---5.11647367477417
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 11 ---4.431467056274414
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 12 ---4.447659969329834
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 13 ---4.224405288696289
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 14 ---4.203671932220459
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 15 ---4.193532943725586
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 16 ---4.277862548828125
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 17 ---4.189056396484375
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 18 ---4.484411716461182
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 19 ---4.689056396484375
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 20 ---4.993287563323975
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 21 ---6.104448318481445
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 22 ---6.7987060546875
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 23 ---6.16623067855835
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 24 ---6.090585231781006
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 25 ---5.552665710449219
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 26 ---5.523178577423096
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 27 ---4.90322208404541
metric_name alpha: [22, 23, 21, 24, 9, 6, 25, 0, 26, 7, 10, 20, 27, 19, 5, 8, 18, 12, 11, 4, 16, 13, 14, 15, 17, 3, 2, 1]
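The ranking printed on the metric_name line is consistent with sorting the layers by their logged value, largest first. The sketch below shows one way to recover such a ranking from the log text. It is an illustration under assumptions, not the script that produced this log: the helper names parse_layer_values and rank_layers are hypothetical, and the "---" in each line is treated as a separator rather than a minus sign.

```python
import re

# Matches lines such as "alpha value of layer 3 ---4.130731105804443".
# The "---" is read as a separator, so the captured value stays positive.
LINE_RE = re.compile(r"alpha value of layer (\d+) ---(\S+)")


def parse_layer_values(log_text):
    """Map layer index -> logged metric value."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE_RE.finditer(log_text)}


def rank_layers(values):
    """Layer indices ordered by value, largest first."""
    return sorted(values, key=lambda layer: values[layer], reverse=True)


# Excerpt of the first five layers from the Qwen2.5-7B mlp alpha run.
excerpt = """\
alpha value of layer 0 ---5.5507073402404785
alpha value of layer 1 ---2.925907850265503
alpha value of layer 2 ---3.427783727645874
alpha value of layer 3 ---4.130731105804443
alpha value of layer 4 ---4.2790751457214355
"""

values = parse_layer_values(excerpt)
ranking = rank_layers(values)
print(ranking)
```

Run on all 28 logged values instead of the excerpt, the same sort reproduces the order shown on the metric_name alpha line.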
Begin main_assign: Qwen2.5-7B mlp alpha_hat
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.68it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, a little boy named Timmy went to visit his grandmother. On the way, he saw a beautiful rainbow in the sky. He wanted to find the pot of gold at the end of the rainbow. But the rainbow led him to a magical door that was locked with a puzzle.
The puzzle was: "I am not alive, but I grow; I don't have lungs, but I need air; I don't have a mouth, but water kills me. What am I?"
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 0 ---26.469465255737305
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 1 ---15.14222526550293
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 2 ---15.627280235290527
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 3 ---21.562671661376953
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 4 ---18.97063636779785
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 5 ---21.275915145874023
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 6 ---20.404766082763672
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 7 ---21.564342498779297
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 8 ---17.832380294799805
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 9 ---24.982494354248047
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 10 ---21.225040435791016
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 11 ---18.6888484954834
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 12 ---19.226669311523438
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 13 ---18.279586791992188
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 14 ---17.538372039794922
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 15 ---17.4776611328125
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 16 ---18.033130645751953
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 17 ---17.052593231201172
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 18 ---18.05915641784668
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 19 ---18.863903045654297
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 20 ---19.91156768798828
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 21 ---22.948781967163086
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 22 ---26.310537338256836
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 23 ---24.970985412597656
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 24 ---24.232511520385742
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 25 ---23.717098236083984
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 26 ---23.539913177490234
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 27 ---24.706113815307617
metric_name alpha_hat: [0, 22, 9, 23, 27, 24, 25, 26, 21, 7, 3, 5, 10, 6, 20, 12, 4, 19, 11, 13, 18, 16, 8, 14, 15, 17, 2, 1]
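The alpha and alpha_hat rankings for the mlp subset can be compared directly. A minimal sketch, assuming a simple top-k set overlap is an acceptable comparison (the log itself does not say how the rankings are used downstream). The two lists are copied from the metric_name lines of this log, and the helper name top_k_overlap is hypothetical.

```python
# Rankings copied from the "metric_name alpha" and "metric_name alpha_hat" lines.
alpha_rank = [22, 23, 21, 24, 9, 6, 25, 0, 26, 7, 10, 20, 27, 19,
              5, 8, 18, 12, 11, 4, 16, 13, 14, 15, 17, 3, 2, 1]
alpha_hat_rank = [0, 22, 9, 23, 27, 24, 25, 26, 21, 7, 3, 5, 10, 6,
                  20, 12, 4, 19, 11, 13, 18, 16, 8, 14, 15, 17, 2, 1]


def top_k_overlap(rank_a, rank_b, k):
    """Sorted layer indices that appear in the first k entries of both rankings."""
    return sorted(set(rank_a[:k]) & set(rank_b[:k]))


shared = top_k_overlap(alpha_rank, alpha_hat_rank, k=8)
print(shared)       # layers in the top 8 under both metrics
print(len(shared))  # size of the overlap
```

Both rankings also end with layers 2 and 1, so the two metrics agree on the lowest-ranked layers of this subset.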
Begin main_assign: Qwen2.5-7B mlp stable_rank
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.32it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a little girl. She was very beautiful, but she was so bad that nobody liked her. She had no friends. She didn't want to play with other children. She didn't want to go to school. She lived by herself in a small house. The only thing that made her happy was a little dog. The little girl was very sad. She didn't know what to do. One day, she walked out of the house and saw a beautiful bird. She wanted to
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(126.6843, device='cuda:0')
spectral_norm tensor(37.7735, device='cuda:0')
frobenius_norm tensor(108.8807, device='cuda:0')
spectral_norm tensor(7.8392, device='cuda:0')
frobenius_norm tensor(115.3682, device='cuda:0')
spectral_norm tensor(6.5322, device='cuda:0')
alpha value of layer 0 ---172.02789306640625
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(123.2531, device='cuda:0')
spectral_norm tensor(20.7839, device='cuda:0')
frobenius_norm tensor(101.4097, device='cuda:0')
spectral_norm tensor(8.8381, device='cuda:0')
frobenius_norm tensor(101.5777, device='cuda:0')
spectral_norm tensor(13.5321, device='cuda:0')
alpha value of layer 1 ---74.3896484375
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(138.8778, device='cuda:0')
spectral_norm tensor(23.4270, device='cuda:0')
frobenius_norm tensor(112.7775, device='cuda:0')
spectral_norm tensor(6.5452, device='cuda:0')
frobenius_norm tensor(115.6860, device='cuda:0')
spectral_norm tensor(7.9797, device='cuda:0')
alpha value of layer 2 ---180.73681640625
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(153.6270, device='cuda:0')
spectral_norm tensor(22.2753, device='cuda:0')
frobenius_norm tensor(132.3810, device='cuda:0')
spectral_norm tensor(7.2662, device='cuda:0')
frobenius_norm tensor(131.2425, device='cuda:0')
spectral_norm tensor(17.0355, device='cuda:0')
alpha value of layer 3 ---146.2799072265625
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(156.7403, device='cuda:0')
spectral_norm tensor(22.8238, device='cuda:0')
frobenius_norm tensor(129.6389, device='cuda:0')
spectral_norm tensor(5.4833, device='cuda:0')
frobenius_norm tensor(128.7504, device='cuda:0')
spectral_norm tensor(8.4121, device='cuda:0')
alpha value of layer 4 ---280.12554931640625
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(149.4161, device='cuda:0')
spectral_norm tensor(21.1506, device='cuda:0')
frobenius_norm tensor(133.9768, device='cuda:0')
spectral_norm tensor(6.2399, device='cuda:0')
frobenius_norm tensor(132.0446, device='cuda:0')
spectral_norm tensor(8.4305, device='cuda:0')
alpha value of layer 5 ---252.07894897460938
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(151.2180, device='cuda:0')
spectral_norm tensor(14.3238, device='cuda:0')
frobenius_norm tensor(129.6978, device='cuda:0')
spectral_norm tensor(4.1238, device='cuda:0')
frobenius_norm tensor(127.6298, device='cuda:0')
spectral_norm tensor(7.3585, device='cuda:0')
alpha value of layer 6 ---467.14599609375
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(141.6723, device='cuda:0')
spectral_norm tensor(13.0737, device='cuda:0')
frobenius_norm tensor(133.2031, device='cuda:0')
spectral_norm tensor(5.1650, device='cuda:0')
frobenius_norm tensor(132.6786, device='cuda:0')
spectral_norm tensor(7.6574, device='cuda:0')
alpha value of layer 7 ---360.915771484375
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(139.1350, device='cuda:0')
spectral_norm tensor(12.2201, device='cuda:0')
frobenius_norm tensor(135.6841, device='cuda:0')
spectral_norm tensor(4.7790, device='cuda:0')
frobenius_norm tensor(134.1169, device='cuda:0')
spectral_norm tensor(8.0838, device='cuda:0')
alpha value of layer 8 ---403.66400146484375
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(152.4723, device='cuda:0')
spectral_norm tensor(28.9447, device='cuda:0')
frobenius_norm tensor(124.9508, device='cuda:0')
spectral_norm tensor(5.0401, device='cuda:0')
frobenius_norm tensor(123.2868, device='cuda:0')
spectral_norm tensor(7.9764, device='cuda:0')
alpha value of layer 9 ---293.7569580078125
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(141.8071, device='cuda:0')
spectral_norm tensor(14.3111, device='cuda:0')
frobenius_norm tensor(133.1904, device='cuda:0')
spectral_norm tensor(5.2121, device='cuda:0')
frobenius_norm tensor(132.4129, device='cuda:0')
spectral_norm tensor(9.0536, device='cuda:0')
alpha value of layer 10 ---321.69720458984375
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(139.0032, device='cuda:0')
spectral_norm tensor(12.7478, device='cuda:0')
frobenius_norm tensor(135.5063, device='cuda:0')
spectral_norm tensor(5.4040, device='cuda:0')
frobenius_norm tensor(134.1321, device='cuda:0')
spectral_norm tensor(9.8842, device='cuda:0')
alpha value of layer 11 ---310.609130859375
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(136.5591, device='cuda:0')
spectral_norm tensor(12.8626, device='cuda:0')
frobenius_norm tensor(136.9887, device='cuda:0')
spectral_norm tensor(5.2525, device='cuda:0')
frobenius_norm tensor(135.8750, device='cuda:0')
spectral_norm tensor(10.9823, device='cuda:0')
alpha value of layer 12 ---315.332763671875
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(139.4029, device='cuda:0')
spectral_norm tensor(12.9032, device='cuda:0')
frobenius_norm tensor(135.3871, device='cuda:0')
spectral_norm tensor(5.2956, device='cuda:0')
frobenius_norm tensor(133.7299, device='cuda:0')
spectral_norm tensor(10.5605, device='cuda:0')
alpha value of layer 13 ---310.2343444824219
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(136.9790, device='cuda:0')
spectral_norm tensor(12.1678, device='cuda:0')
frobenius_norm tensor(136.4235, device='cuda:0')
spectral_norm tensor(5.1185, device='cuda:0')
frobenius_norm tensor(134.9485, device='cuda:0')
spectral_norm tensor(9.6906, device='cuda:0')
alpha value of layer 14 ---343.6784362792969
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(135.3356, device='cuda:0')
spectral_norm tensor(11.3039, device='cuda:0')
frobenius_norm tensor(137.9369, device='cuda:0')
spectral_norm tensor(5.1105, device='cuda:0')
frobenius_norm tensor(135.9716, device='cuda:0')
spectral_norm tensor(10.2615, device='cuda:0')
alpha value of layer 15 ---349.14691162109375
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(135.1444, device='cuda:0')
spectral_norm tensor(11.1679, device='cuda:0')
frobenius_norm tensor(137.6841, device='cuda:0')
spectral_norm tensor(5.1499, device='cuda:0')
frobenius_norm tensor(135.1051, device='cuda:0')
spectral_norm tensor(10.5823, device='cuda:0')
alpha value of layer 16 ---341.401611328125
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(133.7680, device='cuda:0')
spectral_norm tensor(11.1640, device='cuda:0')
frobenius_norm tensor(138.4800, device='cuda:0')
spectral_norm tensor(5.2358, device='cuda:0')
frobenius_norm tensor(135.3299, device='cuda:0')
spectral_norm tensor(8.6134, device='cuda:0')
alpha value of layer 17 ---363.3155212402344
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(133.3184, device='cuda:0')
spectral_norm tensor(10.9756, device='cuda:0')
frobenius_norm tensor(140.8582, device='cuda:0')
spectral_norm tensor(5.6010, device='cuda:0')
frobenius_norm tensor(137.9481, device='cuda:0')
spectral_norm tensor(7.5009, device='cuda:0')
alpha value of layer 18 ---372.74267578125
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(135.1317, device='cuda:0')
spectral_norm tensor(12.0810, device='cuda:0')
frobenius_norm tensor(140.5321, device='cuda:0')
spectral_norm tensor(5.5793, device='cuda:0')
frobenius_norm tensor(137.1833, device='cuda:0')
spectral_norm tensor(6.9554, device='cuda:0')
alpha value of layer 19 ---382.8515625
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(134.6000, device='cuda:0')
spectral_norm tensor(11.1297, device='cuda:0')
frobenius_norm tensor(141.2537, device='cuda:0')
spectral_norm tensor(5.9169, device='cuda:0')
frobenius_norm tensor(138.0343, device='cuda:0')
spectral_norm tensor(6.5506, device='cuda:0')
alpha value of layer 20 ---386.7352294921875
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(136.7462, device='cuda:0')
spectral_norm tensor(11.6961, device='cuda:0')
frobenius_norm tensor(141.2342, device='cuda:0')
spectral_norm tensor(5.6582, device='cuda:0')
frobenius_norm tensor(137.9897, device='cuda:0')
spectral_norm tensor(5.3100, device='cuda:0')
alpha value of layer 21 ---478.354248046875
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(136.9238, device='cuda:0')
spectral_norm tensor(12.4543, device='cuda:0')
frobenius_norm tensor(141.9774, device='cuda:0')
spectral_norm tensor(6.5433, device='cuda:0')
frobenius_norm tensor(139.0448, device='cuda:0')
spectral_norm tensor(5.1260, device='cuda:0')
alpha value of layer 22 ---442.48614501953125
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(138.1573, device='cuda:0')
spectral_norm tensor(14.3250, device='cuda:0')
frobenius_norm tensor(141.6219, device='cuda:0')
spectral_norm tensor(6.6344, device='cuda:0')
frobenius_norm tensor(138.9087, device='cuda:0')
spectral_norm tensor(5.2948, device='cuda:0')
alpha value of layer 23 ---412.3252868652344
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(135.7100, device='cuda:0')
spectral_norm tensor(12.7437, device='cuda:0')
frobenius_norm tensor(143.0019, device='cuda:0')
spectral_norm tensor(6.6948, device='cuda:0')
frobenius_norm tensor(141.1184, device='cuda:0')
spectral_norm tensor(5.4309, device='cuda:0')
alpha value of layer 24 ---414.9455261230469
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(134.2979, device='cuda:0')
spectral_norm tensor(12.3842, device='cuda:0')
frobenius_norm tensor(144.5099, device='cuda:0')
spectral_norm tensor(7.7998, device='cuda:0')
frobenius_norm tensor(143.8913, device='cuda:0')
spectral_norm tensor(6.7086, device='cuda:0')
alpha value of layer 25 ---306.970947265625
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(134.3241, device='cuda:0')
spectral_norm tensor(10.3940, device='cuda:0')
frobenius_norm tensor(145.9831, device='cuda:0')
spectral_norm tensor(10.3400, device='cuda:0')
frobenius_norm tensor(143.8145, device='cuda:0')
spectral_norm tensor(5.9343, device='cuda:0')
alpha value of layer 26 ---317.8813781738281
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
frobenius_norm tensor(139.7623, device='cuda:0')
spectral_norm tensor(14.5234, device='cuda:0')
frobenius_norm tensor(145.0968, device='cuda:0')
spectral_norm tensor(16.8470, device='cuda:0')
frobenius_norm tensor(133.4643, device='cuda:0')
spectral_norm tensor(7.9846, device='cuda:0')
alpha value of layer 27 ---148.7277069091797
metric_name stable_rank: [21, 6, 22, 24, 23, 8, 20, 19, 18, 17, 7, 15, 14, 16, 10, 26, 12, 11, 13, 25, 9, 4, 5, 2, 0, 27, 3, 1]
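Note on the stable_rank values above: the printed per-layer "alpha value" appears to be the mean, over the three MLP projections, of the stable rank (frobenius_norm / spectral_norm)^2. This is inferred from the printed numbers, not from the script source, so treat it as an assumption. A minimal sketch that recomputes layers 0 and 1 from the norms in this log (the names `layer_norms`, `stable_rank`, and `layer_score` are hypothetical):

```python
# (frobenius_norm, spectral_norm) pairs copied from the log, in printed order.
# Which pair belongs to which projection is not stated in the log.
layer_norms = {
    0: [(126.6843, 37.7735), (108.8807, 7.8392), (115.3682, 6.5322)],
    1: [(123.2531, 20.7839), (101.4097, 8.8381), (101.5777, 13.5321)],
}


def stable_rank(frobenius_norm: float, spectral_norm: float) -> float:
    """Stable rank of a matrix: ||W||_F^2 / ||W||_2^2."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_score(pairs: list[tuple[float, float]]) -> float:
    """Assumed aggregation: plain mean of the per-matrix stable ranks."""
    return sum(stable_rank(f, s) for f, s in pairs) / len(pairs)


scores = {layer: layer_score(pairs) for layer, pairs in layer_norms.items()}

for layer, score in scores.items():
    print(f"layer {layer}: {score:.4f}")
```

The recomputed scores agree with the logged values for layers 0 and 1 up to the rounding of the printed norms.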
Begin main_assign: Qwen2.5-7B mlp effective_rank
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.06it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a queen who wanted to build a castle. She had a team of builders who would work day and night to complete the castle. She wanted the castle to be the most magnificent one in the kingdom. The queen would often visit the builders to see the progress of the castle. She would give them advice on how to make it even better. The builders worked hard and the castle was built in a short time. The queen was very happy with the castle and she made it the official residence
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 0 ---2950.251953125
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 1 ---3178.52880859375
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 2 ---3286.484375
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 3 ---3382.06787109375
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 4 ---3406.15234375
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 5 ---3428.05908203125
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 6 ---3425.3720703125
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 7 ---3438.742919921875
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 8 ---3428.871826171875
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 9 ---3411.824951171875
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 10 ---3431.89013671875
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 11 ---3419.459716796875
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 12 ---3425.357177734375
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 13 ---3405.046875
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 14 ---3417.287841796875
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 15 ---3418.02490234375
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 16 ---3407.5625
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 17 ---3408.4970703125
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 18 ---3426.13232421875
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 19 ---3422.40576171875
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 20 ---3435.60009765625
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 21 ---3436.99267578125
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 22 ---3444.177734375
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 23 ---3438.670654296875
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 24 ---3432.22265625
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 25 ---3429.8212890625
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 26 ---3432.60546875
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 27 ---3447.41943359375
metric_name effective_rank: [27, 22, 7, 23, 21, 20, 26, 24, 10, 25, 8, 5, 18, 6, 12, 19, 11, 15, 14, 9, 17, 16, 4, 13, 3, 2, 1, 0]
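Note on the ranking line above: the printed list appears to be the layer indices sorted by metric value, largest first. This is inferred from the logged numbers rather than from the script source, so it is an assumption. A minimal sketch that rebuilds the list from the 28 effective_rank values in this run (the variable names are hypothetical):

```python
# Per-layer values copied from the "alpha value of layer N" lines of the
# effective_rank run, indexed by layer number (0..27).
effective_rank = [
    2950.251953125, 3178.52880859375, 3286.484375, 3382.06787109375,
    3406.15234375, 3428.05908203125, 3425.3720703125, 3438.742919921875,
    3428.871826171875, 3411.824951171875, 3431.89013671875, 3419.459716796875,
    3425.357177734375, 3405.046875, 3417.287841796875, 3418.02490234375,
    3407.5625, 3408.4970703125, 3426.13232421875, 3422.40576171875,
    3435.60009765625, 3436.99267578125, 3444.177734375, 3438.670654296875,
    3432.22265625, 3429.8212890625, 3432.60546875, 3447.41943359375,
]

# Assumed ordering rule: descending by metric value.
ranking = sorted(
    range(len(effective_rank)),
    key=lambda layer: effective_rank[layer],
    reverse=True,
)

print(ranking)
```

The same rule also reproduces the stable_rank ordering printed earlier in this log.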
Begin main_assign: Qwen2.5-7B mlp ZD
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.37it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time there lived a rich man. He had a servant(仆人). He and the servant loved wine and good food very much. Each time the rich man left his home, the servant would drink the wine and eat up all the nice food in the house. The rich man knew what his servant did, but he had never caught his servant doing that. One morning, when he left home, he said to the servant, “Here are two bottles of poison(毒药) and some nice
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
config:
Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.55.2",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
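The projection widths in the module printout can be cross-checked against the config values. A minimal sketch of that arithmetic, assuming the usual grouped-query attention layout in which head_dim equals hidden_size divided by num_attention_heads (the variable names below are illustrative and are not taken from the script):

```python
# Values copied from the Qwen2Config dump above.
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4

# Assumption: head_dim is hidden_size divided by the number of query heads.
head_dim = hidden_size // num_attention_heads

# Output widths of the attention projections under that assumption.
q_out = num_attention_heads * head_dim    # compare with q_proj out_features
kv_out = num_key_value_heads * head_dim   # compare with k_proj / v_proj out_features

# Number of query heads that share one key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

print(head_dim, q_out, kv_out, queries_per_kv_head)
```

Under this assumption the computed widths agree with the printout: q_proj has out_features=3584 and k_proj/v_proj have out_features=512.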
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 0 ---0.14041712880134583
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 1 ---0.12673243880271912
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 2 ---0.14196857810020447
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 3 ---0.15253740549087524
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 4 ---0.15330302715301514
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 5 ---0.1531524807214737
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 6 ---0.15197408199310303
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 7 ---0.15272140502929688
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 8 ---0.15372541546821594
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 9 ---0.15197794139385223
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 10 ---0.15327925980091095
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 11 ---0.15320764482021332
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 12 ---0.15106868743896484
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 13 ---0.15193961560726166
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 14 ---0.14989005029201508
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 15 ---0.15030330419540405
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 16 ---0.1516457051038742
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 17 ---0.15101006627082825
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 18 ---0.14948342740535736
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 19 ---0.15111291408538818
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 20 ---0.15065661072731018
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 21 ---0.15124627947807312
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 22 ---0.1522490233182907
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 23 ---0.15400244295597076
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 24 ---0.15468579530715942
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 25 ---0.15433579683303833
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 26 ---0.1542406529188156
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 27 ---0.1527990698814392
metric_name ZD: [24, 25, 26, 23, 8, 4, 10, 11, 5, 27, 7, 3, 22, 9, 6, 13, 16, 21, 19, 12, 17, 20, 15, 14, 18, 2, 0, 1]
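The list printed after metric_name appears to be the layer indices sorted by descending alpha value. This is inferred from the numbers in the log, not from the script source. A minimal sketch that reproduces the ordering from the logged values (rank_layers_desc is a hypothetical helper name, not a function from main_assign.py):

```python
# Per-layer alpha values copied from the log lines above (list index = layer index).
alphas = [
    0.14041712880134583, 0.12673243880271912, 0.14196857810020447,
    0.15253740549087524, 0.15330302715301514, 0.1531524807214737,
    0.15197408199310303, 0.15272140502929688, 0.15372541546821594,
    0.15197794139385223, 0.15327925980091095, 0.15320764482021332,
    0.15106868743896484, 0.15193961560726166, 0.14989005029201508,
    0.15030330419540405, 0.1516457051038742, 0.15101006627082825,
    0.14948342740535736, 0.15111291408538818, 0.15065661072731018,
    0.15124627947807312, 0.1522490233182907, 0.15400244295597076,
    0.15468579530715942, 0.15433579683303833, 0.1542406529188156,
    0.1527990698814392,
]


def rank_layers_desc(values):
    """Return layer indices ordered from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


ranking = rank_layers_desc(alphas)
print(ranking)
```

Note that the middle layers differ only in the third or fourth decimal place, so their relative order is sensitive to small numerical changes.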
Begin main_assign: Qwen2.5-7B mlp head_diversity
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00, 1.23s/it]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, there was a little boy named Timmy. Timmy loved to play outside and explore the world around him. One day, while playing in the park, he met a talking tree named Oakley.
Oakley told Timmy that there was a magical forest nearby that only appeared once every hundred years. The forest was filled with talking animals and magical creatures that could grant wishes. Timmy was excited to hear this news and begged Oakley to take him there.
Oakley agreed and led Timmy
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
Traceback (most recent call last):
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module>
all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword)
File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 354, in calculate_expert
all_layer_alpha.append(torch.stack(layer_final_alpha).mean().item())
RuntimeError: stack expects a non-empty TensorList
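The RuntimeError indicates that layer_final_alpha was empty when torch.stack was called for layer 0, meaning no per-module value was collected for the mlp subset under the head_diversity metric. One possible cause, which the log does not confirm, is that this metric only yields values for attention modules. A minimal sketch of a guard that tolerates an empty list (mean_or_none is a hypothetical helper, not part of expert_number.py):

```python
def mean_or_none(layer_final_alpha):
    """Return the mean of the collected values, or None if nothing was collected.

    float() accepts plain Python numbers and 0-dim torch tensors alike, so the
    helper does not need torch.stack and cannot fail on an empty list.
    """
    if not layer_final_alpha:
        return None
    values = [float(v) for v in layer_final_alpha]
    return sum(values) / len(values)


print(mean_or_none([]))            # empty subset: no value for this layer
print(mean_or_none([0.25, 0.75]))  # mean of the two collected values
```

A caller would then need to decide whether to skip layers that return None or to abort the run with a clearer message naming the metric and keyword.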
Begin main_assign: Qwen2.5-7B mlp coherence
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.18it/s]
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Once upon a time, the land of Greece was ruled by three powerful and evil sorceresses: Echidna, the mother of the monsters; her daughter, the monster Medusa; and the sea-goddess Gorgon. Their leader was the most evil of the three, the monster Medusa. When the gods tried to stop her, she became so angry that she attacked them and killed them all. She then took their weapons, which she used to attack the world. She even attacked the other
Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 0 ---0.030741512775421143
Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 1 ---0.0886889323592186
Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 2 ---0.038115616887807846
Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 3 ---0.01555887795984745
Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 4 ---0.015330223366618156
Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 5 ---0.014795559458434582
Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 6 ---0.013012934476137161
Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 7 ---0.012255651876330376
Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 8 ---0.012662074528634548
Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 9 ---0.017650291323661804
Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 10 ---0.012567928992211819
Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 11 ---0.012772100046277046
Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 12 ---0.012693522498011589
Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 13 ---0.012766292318701744
Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 14 ---0.01258667092770338
Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 15 ---0.012480087578296661
Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 16 ---0.012708479538559914
Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 17 ---0.01283347513526678
Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 18 ---0.01226731389760971
Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 19 ---0.012299998663365841
Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 20 ---0.01201008539646864
Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 21 ---0.011824443936347961
Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 22 ---0.011804303154349327
Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 23 ---0.012257957831025124
Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 24 ---0.012149857357144356
Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 25 ---0.012290380895137787
Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 26 ---0.012179265730082989
Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)}
alpha value of layer 27 ---0.01317012868821621
metric_name coherence: [1, 2, 0, 9, 3, 4, 5, 27, 6, 17, 11, 13, 16, 12, 8, 14, 10, 15, 19, 25, 18, 23, 7, 26, 24, 20, 21, 22]