| nohup: ignoring input |
| Begin main_assign: Llama-2-7b-hf self_attn alpha |
|
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.03s/it] |
| Once upon a time, I was a new mum, with a newborn baby. I was also a full-time teacher and doing a part-time Master's degree. I was tired and stressed. I had no time to do anything that I enjoyed. And then I was given a gift. A gift of a book that would change everything. |
| I was given the book _Babywise_ , by Gary Ezzo and Robert Bucknam. I was not sure what to expect. I was |
| LlamaForCausalLM( |
| (model): LlamaModel( |
| (embed_tokens): Embedding(32000, 4096, padding_idx=0) |
| (layers): ModuleList( |
| (0-31): 32 x LlamaDecoderLayer( |
| (self_attn): LlamaAttention( |
| (q_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (k_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (v_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (o_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| ) |
| (mlp): LlamaMLP( |
| (gate_proj): Linear(in_features=4096, out_features=11008, bias=False) |
| (up_proj): Linear(in_features=4096, out_features=11008, bias=False) |
| (down_proj): Linear(in_features=11008, out_features=4096, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05) |
| (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05) |
| ) |
| ) |
| (norm): LlamaRMSNorm((4096,), eps=1e-05) |
| (rotary_emb): LlamaRotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=4096, out_features=32000, bias=False) |
| ) |
| config: |
| LlamaConfig { |
| "architectures": [ |
| "LlamaForCausalLM" |
| ], |
| "attention_bias": false, |
| "attention_dropout": 0.0, |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "head_dim": 128, |
| "hidden_act": "silu", |
| "hidden_size": 4096, |
| "initializer_range": 0.02, |
| "intermediate_size": 11008, |
| "max_position_embeddings": 4096, |
| "mlp_bias": false, |
| "model_type": "llama", |
| "num_attention_heads": 32, |
| "num_hidden_layers": 32, |
| "num_key_value_heads": 32, |
| "pad_token_id": 0, |
| "pretraining_tp": 1, |
| "rms_norm_eps": 1e-05, |
| "rope_scaling": null, |
| "rope_theta": 10000.0, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "vocab_size": 32000 |
| } |
| |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---1.5555332899093628 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---2.2105441093444824 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---2.6319708824157715 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---2.659501791000366 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---2.5770697593688965 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---2.5436081886291504 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---2.4908900260925293 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---2.5257110595703125 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---2.3653383255004883 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---2.5174360275268555 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---2.265111207962036 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---2.1847732067108154 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---2.4449100494384766 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---2.679959774017334 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---2.4503092765808105 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---2.7230710983276367 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---3.074552536010742 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---3.4709739685058594 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---3.67897629737854 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---3.278068780899048 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---3.6138486862182617 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---3.5603649616241455 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---3.9758076667785645 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---4.087326526641846 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---3.739630699157715 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---4.076397895812988 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---3.5009336471557617 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---4.056451320648193 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---3.726351737976074 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---3.844115972518921 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---4.4837751388549805 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---3.275714874267578 |
| metric_name alpha: [30, 23, 25, 27, 22, 29, 24, 28, 18, 20, 21, 26, 17, 19, 31, 16, 15, 13, 3, 2, 4, 5, 7, 9, 6, 14, 12, 8, 10, 1, 11, 0] |
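The `metric_name alpha` list above appears to be the layer indices ordered from the largest to the smallest per-layer alpha value. The script that produced this log is not shown, so the following is only a minimal sketch under that assumption; `rank_layers` is a hypothetical helper name, and the values are copied from the `alpha value of layer N` lines above.

```python
# Per-layer alpha values copied from the log above (Llama-2-7b-hf, self_attn, metric "alpha").
alpha = [
    1.5555332899093628, 2.2105441093444824, 2.6319708824157715, 2.659501791000366,
    2.5770697593688965, 2.5436081886291504, 2.4908900260925293, 2.5257110595703125,
    2.3653383255004883, 2.5174360275268555, 2.265111207962036, 2.1847732067108154,
    2.4449100494384766, 2.679959774017334, 2.4503092765808105, 2.7230710983276367,
    3.074552536010742, 3.4709739685058594, 3.67897629737854, 3.278068780899048,
    3.6138486862182617, 3.5603649616241455, 3.9758076667785645, 4.087326526641846,
    3.739630699157715, 4.076397895812988, 3.5009336471557617, 4.056451320648193,
    3.726351737976074, 3.844115972518921, 4.4837751388549805, 3.275714874267578,
]


def rank_layers(scores):
    """Return layer indices sorted by descending score (hypothetical helper)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


ranking = rank_layers(alpha)
print(ranking)
```

Under this assumption the highest-scoring layer (30) comes first and the lowest-scoring layer (0) comes last, which agrees with the list printed in the log.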
| Begin main_assign: Llama-2-7b-hf self_attn alpha_hat |
|
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.02s/it] |
| Once upon a time, there was a boy who lived in a house with his parents, and there was a girl who lived in a house with her parents. There were other children, too, but they were less important, because they lived in other houses. |
| This boy had a sister, and the sister was very beautiful. She was so beautiful that the boy’s parents decided to give her away in marriage. The sister was quite upset about this, but her parents told her that she had no choice. |
| LlamaForCausalLM( |
| (model): LlamaModel( |
| (embed_tokens): Embedding(32000, 4096, padding_idx=0) |
| (layers): ModuleList( |
| (0-31): 32 x LlamaDecoderLayer( |
| (self_attn): LlamaAttention( |
| (q_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (k_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (v_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| (o_proj): Linear(in_features=4096, out_features=4096, bias=False) |
| ) |
| (mlp): LlamaMLP( |
| (gate_proj): Linear(in_features=4096, out_features=11008, bias=False) |
| (up_proj): Linear(in_features=4096, out_features=11008, bias=False) |
| (down_proj): Linear(in_features=11008, out_features=4096, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05) |
| (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05) |
| ) |
| ) |
| (norm): LlamaRMSNorm((4096,), eps=1e-05) |
| (rotary_emb): LlamaRotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=4096, out_features=32000, bias=False) |
| ) |
| config: |
| LlamaConfig { |
| "architectures": [ |
| "LlamaForCausalLM" |
| ], |
| "attention_bias": false, |
| "attention_dropout": 0.0, |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "head_dim": 128, |
| "hidden_act": "silu", |
| "hidden_size": 4096, |
| "initializer_range": 0.02, |
| "intermediate_size": 11008, |
| "max_position_embeddings": 4096, |
| "mlp_bias": false, |
| "model_type": "llama", |
| "num_attention_heads": 32, |
| "num_hidden_layers": 32, |
| "num_key_value_heads": 32, |
| "pad_token_id": 0, |
| "pretraining_tp": 1, |
| "rms_norm_eps": 1e-05, |
| "rope_scaling": null, |
| "rope_theta": 10000.0, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "vocab_size": 32000 |
| } |
| |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---6.557916641235352 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---8.015230178833008 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---10.61414909362793 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---10.561344146728516 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---10.057601928710938 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---9.833120346069336 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---9.45051097869873 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---9.582311630249023 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---9.064985275268555 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---9.556177139282227 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---8.45679759979248 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---8.441352844238281 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---9.276637077331543 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---10.002967834472656 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---8.847436904907227 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---9.9490327835083 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---11.152729988098145 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---12.680035591125488 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---13.309869766235352 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---12.054608345031738 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---13.724580764770508 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---13.702856063842773 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---15.829018592834473 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---15.232747077941895 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---14.636650085449219 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---15.008004188537598 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---14.163816452026367 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---15.760412216186523 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---14.682222366333008 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---15.929686546325684 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---17.405780792236328 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---14.380212783813477 |
| metric_name alpha_hat: [30, 29, 22, 27, 23, 25, 28, 24, 31, 26, 20, 21, 18, 17, 19, 16, 2, 3, 4, 13, 15, 5, 7, 9, 6, 12, 8, 14, 10, 11, 1, 0] |
| Begin main_assign: Llama-2-7b-hf self_attn stable_rank |
|
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 13.00s/it] |
| Once upon a time, in a land far away, there was a castle. Everyone who lived in the castle was very wealthy. But the castle was haunted! The king and queen were very scared, but they kept the castle because they wanted the money. |
| One day, the king and queen had a baby. They called him Prince Charming. Prince Charming was very happy. He had everything he could ever want. He was so happy that he could fly! |
| One day, |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(55.2111, device='cuda:0') |
| spectral_norm tensor(21.8310, device='cuda:0') |
| frobenius_norm tensor(61.8584, device='cuda:0') |
| spectral_norm tensor(18.5959, device='cuda:0') |
| frobenius_norm tensor(45.2757, device='cuda:0') |
| spectral_norm tensor(4.0318, device='cuda:0') |
| frobenius_norm tensor(29.4097, device='cuda:0') |
| spectral_norm tensor(4.3865, device='cuda:0') |
| alpha value of layer 0 ---47.129005432128906 |
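The layer-0 value above, although printed under the "alpha value" label, is consistent with the mean stable rank of the four attention projections, where stable rank is the squared Frobenius norm divided by the squared spectral norm. The sketch below checks that arithmetic from the logged norms. It assumes the four norm pairs are printed in the subset order q_proj, k_proj, v_proj, o_proj; the function and variable names are illustrative and not taken from the script that produced this log.

```python
# Norm pairs (frobenius_norm, spectral_norm) copied from the layer-0 log lines,
# rounded to 4 decimals as printed.
# Assumption: the pairs follow the subset order q_proj, k_proj, v_proj, o_proj.
layer0_norms = {
    "q_proj": (55.2111, 21.8310),
    "k_proj": (61.8584, 18.5959),
    "v_proj": (45.2757, 4.0318),
    "o_proj": (29.4097, 4.3865),
}


def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return (frobenius_norm / spectral_norm) ** 2


# Stable rank of each projection, then the unweighted mean over the layer.
per_projection = {
    name: stable_rank(fro, spec) for name, (fro, spec) in layer0_norms.items()
}
layer0_mean = sum(per_projection.values()) / len(per_projection)

for name, value in per_projection.items():
    print(f"{name}: {value:.3f}")
print(f"layer 0 mean stable rank: {layer0_mean:.3f}")  # close to the logged 47.129
```

The small residual against the logged value comes from the norms being rounded to four decimals in the log.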
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(108.3667, device='cuda:0') |
| spectral_norm tensor(18.4210, device='cuda:0') |
| frobenius_norm tensor(108.1698, device='cuda:0') |
| spectral_norm tensor(20.1468, device='cuda:0') |
| frobenius_norm tensor(41.1449, device='cuda:0') |
| spectral_norm tensor(3.2622, device='cuda:0') |
| frobenius_norm tensor(33.7843, device='cuda:0') |
| spectral_norm tensor(3.8698, device='cuda:0') |
| alpha value of layer 1 ---74.68388366699219 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(109.0373, device='cuda:0') |
| spectral_norm tensor(16.2222, device='cuda:0') |
| frobenius_norm tensor(114.6870, device='cuda:0') |
| spectral_norm tensor(19.2055, device='cuda:0') |
| frobenius_norm tensor(58.7294, device='cuda:0') |
| spectral_norm tensor(3.5355, device='cuda:0') |
| frobenius_norm tensor(56.9624, device='cuda:0') |
| spectral_norm tensor(6.0202, device='cuda:0') |
| alpha value of layer 2 ---111.57455444335938 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(103.1725, device='cuda:0') |
| spectral_norm tensor(13.4197, device='cuda:0') |
| frobenius_norm tensor(107.3360, device='cuda:0') |
| spectral_norm tensor(15.3238, device='cuda:0') |
| frobenius_norm tensor(55.7174, device='cuda:0') |
| spectral_norm tensor(3.0726, device='cuda:0') |
| frobenius_norm tensor(54.5137, device='cuda:0') |
| spectral_norm tensor(6.3542, device='cuda:0') |
| alpha value of layer 3 ---127.65095520019531 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(107.5970, device='cuda:0') |
| spectral_norm tensor(13.7911, device='cuda:0') |
| frobenius_norm tensor(109.8770, device='cuda:0') |
| spectral_norm tensor(15.7544, device='cuda:0') |
| frobenius_norm tensor(58.8042, device='cuda:0') |
| spectral_norm tensor(3.1027, device='cuda:0') |
| frobenius_norm tensor(57.5345, device='cuda:0') |
| spectral_norm tensor(5.8239, device='cuda:0') |
| alpha value of layer 4 ---141.5764617919922 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(108.1569, device='cuda:0') |
| spectral_norm tensor(13.7375, device='cuda:0') |
| frobenius_norm tensor(112.6002, device='cuda:0') |
| spectral_norm tensor(16.5425, device='cuda:0') |
| frobenius_norm tensor(60.3288, device='cuda:0') |
| spectral_norm tensor(2.9971, device='cuda:0') |
| frobenius_norm tensor(59.0258, device='cuda:0') |
| spectral_norm tensor(5.6560, device='cuda:0') |
| alpha value of layer 5 ---155.60284423828125 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(102.0697, device='cuda:0') |
| spectral_norm tensor(12.2985, device='cuda:0') |
| frobenius_norm tensor(104.0983, device='cuda:0') |
| spectral_norm tensor(14.5729, device='cuda:0') |
| frobenius_norm tensor(55.9676, device='cuda:0') |
| spectral_norm tensor(3.0027, device='cuda:0') |
| frobenius_norm tensor(55.1871, device='cuda:0') |
| spectral_norm tensor(5.7670, device='cuda:0') |
| alpha value of layer 6 ---139.72598266601562 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(101.8819, device='cuda:0') |
| spectral_norm tensor(11.7685, device='cuda:0') |
| frobenius_norm tensor(102.3675, device='cuda:0') |
| spectral_norm tensor(13.6839, device='cuda:0') |
| frobenius_norm tensor(56.6824, device='cuda:0') |
| spectral_norm tensor(3.1391, device='cuda:0') |
| frobenius_norm tensor(55.6199, device='cuda:0') |
| spectral_norm tensor(5.4573, device='cuda:0') |
| alpha value of layer 7 ---140.21083068847656 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(102.8848, device='cuda:0') |
| spectral_norm tensor(11.9707, device='cuda:0') |
| frobenius_norm tensor(103.4811, device='cuda:0') |
| spectral_norm tensor(14.2754, device='cuda:0') |
| frobenius_norm tensor(58.2330, device='cuda:0') |
| spectral_norm tensor(3.3746, device='cuda:0') |
| frobenius_norm tensor(57.2962, device='cuda:0') |
| spectral_norm tensor(4.9391, device='cuda:0') |
| alpha value of layer 8 ---139.6942138671875 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(103.0146, device='cuda:0') |
| spectral_norm tensor(12.2318, device='cuda:0') |
| frobenius_norm tensor(105.5969, device='cuda:0') |
| spectral_norm tensor(14.2079, device='cuda:0') |
| frobenius_norm tensor(59.3876, device='cuda:0') |
| spectral_norm tensor(3.2388, device='cuda:0') |
| frobenius_norm tensor(58.5812, device='cuda:0') |
| spectral_norm tensor(5.1232, device='cuda:0') |
| alpha value of layer 9 ---148.2813262939453 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(102.8745, device='cuda:0') |
| spectral_norm tensor(12.0922, device='cuda:0') |
| frobenius_norm tensor(106.0223, device='cuda:0') |
| spectral_norm tensor(14.3113, device='cuda:0') |
| frobenius_norm tensor(58.7986, device='cuda:0') |
| spectral_norm tensor(3.2255, device='cuda:0') |
| frobenius_norm tensor(58.3338, device='cuda:0') |
| spectral_norm tensor(4.3048, device='cuda:0') |
| alpha value of layer 10 ---160.80075073242188 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(97.7702, device='cuda:0') |
| spectral_norm tensor(11.1815, device='cuda:0') |
| frobenius_norm tensor(97.4910, device='cuda:0') |
| spectral_norm tensor(12.9324, device='cuda:0') |
| frobenius_norm tensor(61.3144, device='cuda:0') |
| spectral_norm tensor(3.4012, device='cuda:0') |
| frobenius_norm tensor(60.7354, device='cuda:0') |
| spectral_norm tensor(5.6488, device='cuda:0') |
| alpha value of layer 11 ---143.46920776367188 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(99.7752, device='cuda:0') |
| spectral_norm tensor(11.8016, device='cuda:0') |
| frobenius_norm tensor(102.8686, device='cuda:0') |
| spectral_norm tensor(13.5998, device='cuda:0') |
| frobenius_norm tensor(60.5482, device='cuda:0') |
| spectral_norm tensor(3.2452, device='cuda:0') |
| frobenius_norm tensor(60.0323, device='cuda:0') |
| spectral_norm tensor(4.9509, device='cuda:0') |
| alpha value of layer 12 ---155.95849609375 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(98.8230, device='cuda:0') |
| spectral_norm tensor(12.0874, device='cuda:0') |
| frobenius_norm tensor(100.6162, device='cuda:0') |
| spectral_norm tensor(13.8355, device='cuda:0') |
| frobenius_norm tensor(62.7593, device='cuda:0') |
| spectral_norm tensor(3.1144, device='cuda:0') |
| frobenius_norm tensor(62.1430, device='cuda:0') |
| spectral_norm tensor(5.1164, device='cuda:0') |
| alpha value of layer 13 ---168.32928466796875 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(98.5708, device='cuda:0') |
| spectral_norm tensor(11.6312, device='cuda:0') |
| frobenius_norm tensor(100.4783, device='cuda:0') |
| spectral_norm tensor(13.6670, device='cuda:0') |
| frobenius_norm tensor(61.6071, device='cuda:0') |
| spectral_norm tensor(2.7584, device='cuda:0') |
| frobenius_norm tensor(60.9149, device='cuda:0') |
| spectral_norm tensor(4.5464, device='cuda:0') |
| alpha value of layer 14 ---201.04998779296875 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(97.3649, device='cuda:0') |
| spectral_norm tensor(12.7196, device='cuda:0') |
| frobenius_norm tensor(100.7325, device='cuda:0') |
| spectral_norm tensor(14.3247, device='cuda:0') |
| frobenius_norm tensor(64.0580, device='cuda:0') |
| spectral_norm tensor(3.0007, device='cuda:0') |
| frobenius_norm tensor(63.2220, device='cuda:0') |
| spectral_norm tensor(4.5882, device='cuda:0') |
| alpha value of layer 15 ---188.4088134765625 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(96.6512, device='cuda:0') |
| spectral_norm tensor(12.9655, device='cuda:0') |
| frobenius_norm tensor(99.2465, device='cuda:0') |
| spectral_norm tensor(14.6775, device='cuda:0') |
| frobenius_norm tensor(66.7600, device='cuda:0') |
| spectral_norm tensor(2.8352, device='cuda:0') |
| frobenius_norm tensor(66.0842, device='cuda:0') |
| spectral_norm tensor(5.0123, device='cuda:0') |
| alpha value of layer 16 ---207.397216796875 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(95.8736, device='cuda:0') |
| spectral_norm tensor(12.8995, device='cuda:0') |
| frobenius_norm tensor(98.0118, device='cuda:0') |
| spectral_norm tensor(14.3731, device='cuda:0') |
| frobenius_norm tensor(66.5281, device='cuda:0') |
| spectral_norm tensor(2.8901, device='cuda:0') |
| frobenius_norm tensor(66.1344, device='cuda:0') |
| spectral_norm tensor(5.4531, device='cuda:0') |
| alpha value of layer 17 ---194.68035888671875 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(93.7199, device='cuda:0') |
| spectral_norm tensor(12.9969, device='cuda:0') |
| frobenius_norm tensor(95.7889, device='cuda:0') |
| spectral_norm tensor(14.0707, device='cuda:0') |
| frobenius_norm tensor(69.6604, device='cuda:0') |
| spectral_norm tensor(2.8885, device='cuda:0') |
| frobenius_norm tensor(68.6924, device='cuda:0') |
| spectral_norm tensor(5.4377, device='cuda:0') |
| alpha value of layer 18 ---209.88125610351562 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(92.5769, device='cuda:0') |
| spectral_norm tensor(12.7937, device='cuda:0') |
| frobenius_norm tensor(94.3567, device='cuda:0') |
| spectral_norm tensor(14.2169, device='cuda:0') |
| frobenius_norm tensor(70.2688, device='cuda:0') |
| spectral_norm tensor(2.7430, device='cuda:0') |
| frobenius_norm tensor(69.5499, device='cuda:0') |
| spectral_norm tensor(5.3632, device='cuda:0') |
| alpha value of layer 19 ---230.21255493164062 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(93.1738, device='cuda:0') |
| spectral_norm tensor(13.3309, device='cuda:0') |
| frobenius_norm tensor(94.8772, device='cuda:0') |
| spectral_norm tensor(14.2162, device='cuda:0') |
| frobenius_norm tensor(71.3496, device='cuda:0') |
| spectral_norm tensor(2.7475, device='cuda:0') |
| frobenius_norm tensor(70.9048, device='cuda:0') |
| spectral_norm tensor(6.5838, device='cuda:0') |
| alpha value of layer 20 ---220.93441772460938 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(91.0892, device='cuda:0') |
| spectral_norm tensor(12.7824, device='cuda:0') |
| frobenius_norm tensor(92.0887, device='cuda:0') |
| spectral_norm tensor(13.3415, device='cuda:0') |
| frobenius_norm tensor(73.5470, device='cuda:0') |
| spectral_norm tensor(3.2134, device='cuda:0') |
| frobenius_norm tensor(72.4853, device='cuda:0') |
| spectral_norm tensor(5.6293, device='cuda:0') |
| alpha value of layer 21 ---197.01531982421875 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(93.0198, device='cuda:0') |
| spectral_norm tensor(12.9645, device='cuda:0') |
| frobenius_norm tensor(94.1975, device='cuda:0') |
| spectral_norm tensor(13.3901, device='cuda:0') |
| frobenius_norm tensor(73.8086, device='cuda:0') |
| spectral_norm tensor(2.7246, device='cuda:0') |
| frobenius_norm tensor(72.6579, device='cuda:0') |
| spectral_norm tensor(7.5949, device='cuda:0') |
| alpha value of layer 22 ---231.5908660888672 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(92.3679, device='cuda:0') |
| spectral_norm tensor(12.5056, device='cuda:0') |
| frobenius_norm tensor(93.0806, device='cuda:0') |
| spectral_norm tensor(12.8793, device='cuda:0') |
| frobenius_norm tensor(77.2716, device='cuda:0') |
| spectral_norm tensor(3.0203, device='cuda:0') |
| frobenius_norm tensor(76.3245, device='cuda:0') |
| spectral_norm tensor(5.5035, device='cuda:0') |
| alpha value of layer 23 ---238.41143798828125 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(89.8033, device='cuda:0') |
| spectral_norm tensor(12.3866, device='cuda:0') |
| frobenius_norm tensor(90.2768, device='cuda:0') |
| spectral_norm tensor(13.1727, device='cuda:0') |
| frobenius_norm tensor(76.5770, device='cuda:0') |
| spectral_norm tensor(3.2186, device='cuda:0') |
| frobenius_norm tensor(75.2567, device='cuda:0') |
| spectral_norm tensor(6.6725, device='cuda:0') |
| alpha value of layer 24 ---198.1973419189453 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(90.5618, device='cuda:0') |
| spectral_norm tensor(11.8575, device='cuda:0') |
| frobenius_norm tensor(90.8598, device='cuda:0') |
| spectral_norm tensor(12.2949, device='cuda:0') |
| frobenius_norm tensor(79.4490, device='cuda:0') |
| spectral_norm tensor(3.1827, device='cuda:0') |
| frobenius_norm tensor(78.3357, device='cuda:0') |
| spectral_norm tensor(4.8220, device='cuda:0') |
| alpha value of layer 25 ---249.99850463867188 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(89.1381, device='cuda:0') |
| spectral_norm tensor(12.8790, device='cuda:0') |
| frobenius_norm tensor(89.8395, device='cuda:0') |
| spectral_norm tensor(13.2880, device='cuda:0') |
| frobenius_norm tensor(80.7266, device='cuda:0') |
| spectral_norm tensor(3.6409, device='cuda:0') |
| frobenius_norm tensor(80.1935, device='cuda:0') |
| spectral_norm tensor(6.8510, device='cuda:0') |
| alpha value of layer 26 ---180.56016540527344 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(92.4526, device='cuda:0') |
| spectral_norm tensor(13.1710, device='cuda:0') |
| frobenius_norm tensor(93.3447, device='cuda:0') |
| spectral_norm tensor(14.0394, device='cuda:0') |
| frobenius_norm tensor(80.8654, device='cuda:0') |
| spectral_norm tensor(3.3240, device='cuda:0') |
| frobenius_norm tensor(80.7260, device='cuda:0') |
| spectral_norm tensor(5.6484, device='cuda:0') |
| alpha value of layer 27 ---222.39707946777344 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(89.8511, device='cuda:0') |
| spectral_norm tensor(12.7679, device='cuda:0') |
| frobenius_norm tensor(90.9173, device='cuda:0') |
| spectral_norm tensor(13.5210, device='cuda:0') |
| frobenius_norm tensor(83.4357, device='cuda:0') |
| spectral_norm tensor(3.6919, device='cuda:0') |
| frobenius_norm tensor(83.0720, device='cuda:0') |
| spectral_norm tensor(6.0824, device='cuda:0') |
| alpha value of layer 28 ---198.00421142578125 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(87.5255, device='cuda:0') |
| spectral_norm tensor(13.2835, device='cuda:0') |
| frobenius_norm tensor(88.2859, device='cuda:0') |
| spectral_norm tensor(14.1804, device='cuda:0') |
| frobenius_norm tensor(83.7624, device='cuda:0') |
| spectral_norm tensor(4.7727, device='cuda:0') |
| frobenius_norm tensor(84.0506, device='cuda:0') |
| spectral_norm tensor(6.8564, device='cuda:0') |
| alpha value of layer 29 ---135.1178436279297 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(88.0865, device='cuda:0') |
| spectral_norm tensor(13.0963, device='cuda:0') |
| frobenius_norm tensor(89.2752, device='cuda:0') |
| spectral_norm tensor(13.7996, device='cuda:0') |
| frobenius_norm tensor(85.7229, device='cuda:0') |
| spectral_norm tensor(3.7462, device='cuda:0') |
| frobenius_norm tensor(86.1523, device='cuda:0') |
| spectral_norm tensor(6.8602, device='cuda:0') |
| alpha value of layer 30 ---192.10064697265625 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| frobenius_norm tensor(89.1405, device='cuda:0') |
| spectral_norm tensor(15.0259, device='cuda:0') |
| frobenius_norm tensor(92.3933, device='cuda:0') |
| spectral_norm tensor(16.3841, device='cuda:0') |
| frobenius_norm tensor(78.1290, device='cuda:0') |
| spectral_norm tensor(4.0098, device='cuda:0') |
| frobenius_norm tensor(78.9173, device='cuda:0') |
| spectral_norm tensor(10.5689, device='cuda:0') |
| alpha value of layer 31 ---125.60063171386719 |
| metric_name stable_rank: [25, 23, 22, 19, 27, 20, 18, 16, 14, 24, 28, 21, 17, 30, 15, 26, 13, 10, 12, 5, 9, 11, 4, 7, 6, 8, 29, 3, 31, 2, 1, 0] |
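The stable_rank list above matches the layer indices sorted by their logged per-layer value in descending order. The sketch below reproduces that ordering from the values printed in this run (rounded to six decimals); the variable names are illustrative, and whether the original script sorts this way for every metric is an assumption.

```python
# Per-layer values copied from the "alpha value of layer N" lines of the
# stable_rank run; the list index is the layer number.
logged_values = [
    47.129005, 74.683884, 111.574554, 127.650955, 141.576462, 155.602844,
    139.725983, 140.210831, 139.694214, 148.281326, 160.800751, 143.469208,
    155.958496, 168.329285, 201.049988, 188.408813, 207.397217, 194.680359,
    209.881256, 230.212555, 220.934418, 197.015320, 231.590866, 238.411438,
    198.197342, 249.998505, 180.560165, 222.397079, 198.004211, 135.117844,
    192.100647, 125.600632,
]

# Layer indices ordered from the highest value to the lowest.
ranking = sorted(
    range(len(logged_values)), key=lambda layer: logged_values[layer], reverse=True
)
print(ranking)
```

Under this ordering layer 25 comes first (largest value) and layer 0 comes last (smallest value).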
| Begin main_assign: Llama-2-7b-hf self_attn effective_rank |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:26<00:00, 13.01s/it] |
| Once upon a time, I worked in a library. I was a page. I would shelve books, check them out, and help patrons find things. It was my first job, and it was great. I met some really neat people. |
| At the time, I had a lot of free time, so I spent a lot of time in the library. I would sit in the fiction section and read all the books that I didn’t have time to read at home. This is how I first |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---1640.3033447265625 |
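The log does not show how effective_rank is computed. One common definition is the exponential of the Shannon entropy of the normalized singular values; the sketch below implements that definition only as an assumption about what this metric might be, not as the script's actual method. The function name and the toy inputs are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    Assumed definition: exp(H(p)) with p_i = s_i / sum(s) and
    H(p) = -sum(p_i * log(p_i)). Zero singular values contribute nothing.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Toy spectra: a flat spectrum gives the full dimension,
# a single dominant direction gives a value near 1.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
```

Under this definition a 4096 x 4096 projection can score at most 4096, which is at least compatible with the magnitudes logged for this metric.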
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---2134.1572265625 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---2669.992919921875 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---2901.25439453125 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---2904.2001953125 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---2910.534912109375 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---2883.722900390625 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---2887.236328125 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---2899.039306640625 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---2916.92822265625 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---2859.56689453125 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---2818.8173828125 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---2905.6064453125 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---2940.74462890625 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---2900.401123046875 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---2949.82080078125 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---2976.977783203125 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---3047.2646484375 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---3096.2216796875 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---3061.852783203125 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---3062.37353515625 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---3081.3349609375 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---3106.181640625 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---3144.513427734375 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---3072.8798828125 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---3137.80224609375 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---3090.37158203125 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---3181.7998046875 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---3147.865478515625 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---3101.146484375 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---3161.5263671875 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---3049.5556640625 |
| metric_name effective_rank: [27, 30, 28, 23, 25, 22, 29, 18, 26, 21, 24, 20, 19, 31, 17, 16, 15, 13, 9, 5, 12, 4, 3, 14, 8, 7, 6, 10, 11, 2, 1, 0] |
| Begin main_assign: Llama-2-7b-hf self_attn ZD |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.94s/it] |
| Once upon a time, there was a girl who had a heart that was so full of love that it overflowed from her chest and flowed out of her hands. |
| This girl was so full of love that she was like a fountain of love, and wherever she went, her love would flow out of her and touch the lives of those around her. |
| One day, the girl was walking through the woods when she came upon a beautiful stream. The stream was so clear and so peaceful that |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---0.09540334343910217 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---0.11126542091369629 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---0.14089055359363556 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---0.1446058303117752 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---0.14712807536125183 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---0.1478433907032013 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---0.14464625716209412 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---0.14459004998207092 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---0.14641690254211426 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---0.147793248295784 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---0.14709556102752686 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---0.14403118193149567 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---0.14700128138065338 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---0.1479380875825882 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---0.1479010283946991 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---0.1490364670753479 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---0.1480296403169632 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---0.15020982921123505 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---0.1507750302553177 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---0.14981798827648163 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---0.15018826723098755 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---0.1498291790485382 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---0.15043966472148895 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---0.151978999376297 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---0.14919137954711914 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---0.15175150334835052 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---0.1495654433965683 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---0.15338149666786194 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---0.15180033445358276 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---0.1501537710428238 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---0.15218497812747955 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---0.14980709552764893 |
| metric_name ZD: [27, 30, 23, 28, 25, 18, 22, 17, 20, 29, 21, 19, 31, 26, 24, 15, 16, 13, 14, 5, 9, 4, 10, 12, 8, 6, 3, 7, 11, 2, 1, 0] |
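The ranking line above appears to list layer indices in descending order of the logged per-layer values: layer 27 has the largest ZD value (0.1534) and layer 0 the smallest (0.0954). Below is a minimal sketch of how such a ranking could be rebuilt from the log text. The helper name `rank_layers` and the regular expression are assumptions made for illustration; they are not taken from the script that produced this log.

```python
import re

# Hypothetical helper, not part of the logged script: it parses rows of the
# form "alpha value of layer N ---V" and ranks layers by descending value.
LINE_RE = re.compile(r"alpha value of layer (\d+) ---([-+0-9.eE]+)")


def rank_layers(log_text):
    """Return layer indices sorted from largest to smallest logged value."""
    values = {}
    for layer, value in LINE_RE.findall(log_text):
        values[int(layer)] = float(value)
    # Descending sort on the metric value; the key is the layer index.
    return sorted(values, key=values.get, reverse=True)


# A few rows copied from the ZD run above.
sample = """
| alpha value of layer 0 ---0.09540334343910217 |
| alpha value of layer 1 ---0.11126542091369629 |
| alpha value of layer 2 ---0.14089055359363556 |
| alpha value of layer 27 ---0.15338149666786194 |
| alpha value of layer 30 ---0.15218497812747955 |
"""

print(rank_layers(sample))  # [27, 30, 2, 1, 0]
```

Running the same function over all 32 rows of a run should give a list comparable to the `metric_name` line, assuming the script ranks by plain descending value with no further tie-breaking.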
| Begin main_assign: Llama-2-7b-hf self_attn head_diversity |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.61s/it] |
| Once upon a time, there was a city that was built with love. I was in that city. And I fell in love. I fell in love with the people. I fell in love with the architecture. I fell in love with the food. I fell in love with the art. I fell in love with the culture. I fell in love with the music. I fell in love with the history. I fell in love with the city. And I fell in love with the person I fell in love with |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---0.9916330575942993 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---0.9952021241188049 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---0.9966323971748352 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---0.9973293542861938 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---0.9971895217895508 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---0.9973934888839722 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---0.9974462389945984 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---0.9975071549415588 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---0.9974231719970703 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---0.9973534345626831 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---0.997123122215271 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---0.9970043897628784 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---0.9973783493041992 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---0.9974591732025146 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---0.9971306324005127 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---0.9973533153533936 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---0.9974291324615479 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---0.9976841807365417 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---0.997740626335144 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---0.9975850582122803 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---0.9973828792572021 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---0.9975684881210327 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---0.9977440237998962 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---0.9980273246765137 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---0.9974839091300964 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---0.9979180693626404 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---0.9974991083145142 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---0.9979188442230225 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---0.997989296913147 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---0.9974175691604614 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---0.9975640773773193 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---0.997219443321228 |
| metric_name head_diversity: [23, 28, 27, 25, 22, 18, 17, 19, 21, 30, 7, 26, 24, 13, 6, 16, 8, 29, 5, 20, 12, 9, 15, 3, 31, 4, 14, 10, 11, 2, 1, 0] |
| Begin main_assign: Llama-2-7b-hf self_attn coherence |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.84s/it] |
| Once upon a time, there lived a king and a queen. They had a beautiful daughter named Cinderella. One day, the king announced that there would be a ball. The king invited all the royalty and dignitaries to the ball, including Cinderella. |
| Cinderella was so excited. She could not wait to see her friends and dress in her best dress. She was happy that she would finally be able to go to the ball. The only problem was that her step-mother |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 0 ---0.08510372042655945 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 1 ---0.04102545976638794 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 2 ---0.028616365045309067 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 3 ---0.02104165218770504 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 4 ---0.022063206881284714 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 5 ---0.021188031882047653 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 6 ---0.020417138934135437 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 7 ---0.019520433619618416 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 8 ---0.020254574716091156 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 9 ---0.020007748156785965 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 10 ---0.021119512617588043 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 11 ---0.020985007286071777 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 12 ---0.019723106175661087 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 13 ---0.01894117146730423 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 14 ---0.01963678002357483 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 15 ---0.01925666816532612 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 16 ---0.018222851678729057 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 17 ---0.016996942460536957 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 18 ---0.016209837049245834 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 19 ---0.017241276800632477 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 20 ---0.017154088243842125 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 21 ---0.016598742455244064 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 22 ---0.016119930893182755 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 23 ---0.015261407010257244 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 24 ---0.01685335859656334 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 25 ---0.015361565165221691 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 26 ---0.01685093343257904 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 27 ---0.015206292271614075 |
| Processing layer 28--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 28 ---0.01575298234820366 |
| Processing layer 29--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 29 ---0.01735319383442402 |
| Processing layer 30--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 30 ---0.016395289450883865 |
| Processing layer 31--subset--{'self_attn.q_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.k_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.v_proj': Linear(in_features=4096, out_features=4096, bias=False), 'self_attn.o_proj': Linear(in_features=4096, out_features=4096, bias=False)} |
| alpha value of layer 31 ---0.02029731497168541 |
| metric_name coherence: [0, 1, 2, 4, 5, 10, 3, 11, 6, 31, 8, 9, 12, 14, 7, 15, 13, 16, 29, 19, 20, 17, 24, 26, 21, 30, 18, 22, 28, 25, 23, 27] |
| Begin main_assign: Llama-2-7b-hf mlp alpha |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.66s/it] |
| Once upon a time, there was a little boy named Jack who lived in the country. Jack loved to play in the forest with his friends. He was a happy boy. |
| One day, he was playing with his friends when he saw a beautiful white horse. The horse was beautiful, and he wanted to ride it. |
| Jack ran up to the horse and asked, “Can I ride you?” |
| The horse said, “Yes, Jack, you can ride me.” |
| Jack jumped up on the horse |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 0 ---2.815411329269409 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 1 ---3.4513909816741943 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 2 ---3.7109851837158203 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 3 ---3.8032162189483643 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 4 ---4.195495128631592 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 5 ---3.856921911239624 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 6 ---3.6256275177001953 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 7 ---3.660289764404297 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 8 ---3.5230093002319336 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 9 ---3.5076420307159424 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 10 ---3.3467018604278564 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 11 ---3.291457176208496 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 12 ---3.5538055896759033 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 13 ---3.4736461639404297 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 14 ---3.715531587600708 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 15 ---3.7430496215820312 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 16 ---4.149142742156982 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 17 ---4.197119235992432 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 18 ---4.489593982696533 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 19 ---4.358992576599121 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 20 ---5.0549798011779785 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 21 ---4.814038276672363 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 22 ---4.621534824371338 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 23 ---4.299625396728516 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 24 ---4.563563346862793 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 25 ---4.570354461669922 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 26 ---4.2273688316345215 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 27 ---4.423553466796875 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 28 ---4.38798189163208 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 29 ---4.595789432525635 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 30 ---4.671386241912842 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 31 ---3.7070789337158203 |
| metric_name alpha: [20, 21, 30, 22, 29, 25, 24, 18, 27, 28, 19, 23, 26, 17, 4, 16, 5, 3, 15, 14, 2, 31, 7, 6, 12, 8, 9, 13, 1, 10, 11, 0] |
| Begin main_assign: Llama-2-7b-hf mlp alpha_hat |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.20s/it] |
| Once upon a time, there was a kingdom called Buzuran. A kingdom known for its beautiful scenery and its friendly people. |
| One day, a stranger named Gavin arrived in the kingdom. He was looking for a place to settle down. He had heard of the kingdom and its beauty, and he was interested in seeing it for himself. |
| When Gavin arrived, he was greeted by the king and queen. They were very impressed with Gavin and his ab |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 0 ---13.072108268737793 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 1 ---14.548068046569824 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 2 ---13.608903884887695 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 3 ---13.986700057983398 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 4 ---15.312846183776855 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 5 ---14.763805389404297 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 6 ---14.188860893249512 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 7 ---14.57461166381836 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 8 ---14.344627380371094 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 9 ---14.180407524108887 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 10 ---13.752973556518555 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 11 ---13.753717422485352 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 12 ---14.589736938476562 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 13 ---14.359493255615234 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 14 ---15.248335838317871 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 15 ---15.255033493041992 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 16 ---16.761940002441406 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 17 ---16.71654510498047 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 18 ---17.673377990722656 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 19 ---17.007017135620117 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 20 ---19.24078941345215 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 21 ---17.77168083190918 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 22 ---17.303386688232422 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 23 ---16.242385864257812 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 24 ---16.798599243164062 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 25 ---16.718074798583984 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 26 ---16.25970458984375 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 27 ---17.800804138183594 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 28 ---18.30744171142578 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 29 ---19.38349151611328 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 30 ---22.61872673034668 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 31 ---19.854324340820312 |
| metric_name alpha_hat: [30, 31, 29, 20, 28, 27, 21, 18, 22, 19, 24, 16, 25, 17, 26, 23, 4, 15, 14, 5, 12, 7, 1, 13, 8, 6, 9, 3, 11, 10, 2, 0] |
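The `metric_name alpha_hat` list above is consistent with the layer indices sorted by their logged value, largest first. A minimal Python sketch that rebuilds the ranking from the 32 values printed in this run follows. The variable names are illustrative and are not taken from the script that produced this log, and the sort direction is inferred from the numbers rather than from the source code.

```python
# Per-layer values copied from the "alpha value of layer N" lines of the
# mlp alpha_hat run (list index = layer number).
alpha_hat = [
    13.072108268737793, 14.548068046569824, 13.608903884887695, 13.986700057983398,
    15.312846183776855, 14.763805389404297, 14.188860893249512, 14.57461166381836,
    14.344627380371094, 14.180407524108887, 13.752973556518555, 13.753717422485352,
    14.589736938476562, 14.359493255615234, 15.248335838317871, 15.255033493041992,
    16.761940002441406, 16.71654510498047, 17.673377990722656, 17.007017135620117,
    19.24078941345215, 17.77168083190918, 17.303386688232422, 16.242385864257812,
    16.798599243164062, 16.718074798583984, 16.25970458984375, 17.800804138183594,
    18.30744171142578, 19.38349151611328, 22.61872673034668, 19.854324340820312,
]

# Sort layer indices by value, largest first (assumed ordering rule).
ranking = sorted(range(len(alpha_hat)), key=lambda i: alpha_hat[i], reverse=True)

# Ranking exactly as printed on the "metric_name alpha_hat" line.
logged = [30, 31, 29, 20, 28, 27, 21, 18, 22, 19, 24, 16, 25, 17, 26, 23,
          4, 15, 14, 5, 12, 7, 1, 13, 8, 6, 9, 3, 11, 10, 2, 0]

print(ranking == logged)
```

Under this reading, layer 30 ranks first and layer 0 last for the MLP `alpha_hat` metric.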
| Begin main_assign: Llama-2-7b-hf mlp stable_rank |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.03s/it] |
| Once upon a time there was a little boy who wanted to be a fireman when he grew up. He got to go on a fire truck and play with the hoses and watch the firemen and had a great time. |
| He also wanted to be a policeman when he grew up. He got to go on a police car and get to go to jail. He got to play with the police dogs and watch the policemen and had a great time. |
| He also wanted to be a doctor |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(110.1283, device='cuda:0') |
| spectral_norm tensor(11.7725, device='cuda:0') |
| frobenius_norm tensor(108.2747, device='cuda:0') |
| spectral_norm tensor(10.5846, device='cuda:0') |
| frobenius_norm tensor(112.8760, device='cuda:0') |
| spectral_norm tensor(8.8966, device='cuda:0') |
| alpha value of layer 0 ---117.70869445800781 |
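In this `stable_rank` run the line labelled `alpha value of layer 0` can be reproduced from the three norm pairs printed just before it, using the usual definition of stable rank, `||W||_F^2 / ||W||_2^2`, averaged over the three matrices. The sketch below is a check on the logged numbers, not the original script. The helper name `stable_rank` is hypothetical, and the assumption that the pairs appear in the order `gate_proj`, `up_proj`, `down_proj` comes only from the order of the subset dictionary.

```python
def stable_rank(frobenius_norm: float, spectral_norm: float) -> float:
    """Stable rank of a matrix: squared Frobenius norm over squared spectral norm."""
    return (frobenius_norm / spectral_norm) ** 2

# (frobenius_norm, spectral_norm) pairs logged for layer 0, in printed order.
# Assumed (not confirmed by the log) to be gate_proj, up_proj, down_proj.
layer0_norms = [(110.1283, 11.7725), (108.2747, 10.5846), (112.8760, 8.8966)]

per_matrix = [stable_rank(f, s) for f, s in layer0_norms]

# Averaging over the three projections matches the logged per-layer value.
layer_value = sum(per_matrix) / len(per_matrix)
print(per_matrix, layer_value)
```

The result agrees with the logged 117.7087 up to the four-decimal rounding of the printed norms; the same arithmetic applied to the layer 1 norms gives roughly 208.8, again matching the log.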
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(122.3248, device='cuda:0') |
| spectral_norm tensor(14.1620, device='cuda:0') |
| frobenius_norm tensor(115.1808, device='cuda:0') |
| spectral_norm tensor(7.3779, device='cuda:0') |
| frobenius_norm tensor(116.8547, device='cuda:0') |
| spectral_norm tensor(6.6577, device='cuda:0') |
| alpha value of layer 1 ---208.7972869873047 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(125.7614, device='cuda:0') |
| spectral_norm tensor(10.6197, device='cuda:0') |
| frobenius_norm tensor(117.2508, device='cuda:0') |
| spectral_norm tensor(4.5438, device='cuda:0') |
| frobenius_norm tensor(117.9841, device='cuda:0') |
| spectral_norm tensor(5.6738, device='cuda:0') |
| alpha value of layer 2 ---412.8428649902344 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(127.6067, device='cuda:0') |
| spectral_norm tensor(9.8554, device='cuda:0') |
| frobenius_norm tensor(118.0961, device='cuda:0') |
| spectral_norm tensor(4.3441, device='cuda:0') |
| frobenius_norm tensor(118.3421, device='cuda:0') |
| spectral_norm tensor(6.3601, device='cuda:0') |
| alpha value of layer 3 ---417.640380859375 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(130.2638, device='cuda:0') |
| spectral_norm tensor(10.9230, device='cuda:0') |
| frobenius_norm tensor(117.2583, device='cuda:0') |
| spectral_norm tensor(4.2948, device='cuda:0') |
| frobenius_norm tensor(116.9673, device='cuda:0') |
| spectral_norm tensor(6.2874, device='cuda:0') |
| alpha value of layer 4 ---411.2488708496094 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(131.0171, device='cuda:0') |
| spectral_norm tensor(12.3422, device='cuda:0') |
| frobenius_norm tensor(117.2556, device='cuda:0') |
| spectral_norm tensor(4.4561, device='cuda:0') |
| frobenius_norm tensor(117.0701, device='cuda:0') |
| spectral_norm tensor(6.4393, device='cuda:0') |
| alpha value of layer 5 ---378.5376892089844 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(133.2534, device='cuda:0') |
| spectral_norm tensor(13.9346, device='cuda:0') |
| frobenius_norm tensor(116.8471, device='cuda:0') |
| spectral_norm tensor(4.7412, device='cuda:0') |
| frobenius_norm tensor(116.4051, device='cuda:0') |
| spectral_norm tensor(6.2935, device='cuda:0') |
| alpha value of layer 6 ---346.97930908203125 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(133.1796, device='cuda:0') |
| spectral_norm tensor(13.7900, device='cuda:0') |
| frobenius_norm tensor(117.2589, device='cuda:0') |
| spectral_norm tensor(4.9844, device='cuda:0') |
| frobenius_norm tensor(116.5975, device='cuda:0') |
| spectral_norm tensor(6.7392, device='cuda:0') |
| alpha value of layer 7 ---315.34832763671875 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(131.0625, device='cuda:0') |
| spectral_norm tensor(12.9479, device='cuda:0') |
| frobenius_norm tensor(118.7164, device='cuda:0') |
| spectral_norm tensor(5.3742, device='cuda:0') |
| frobenius_norm tensor(117.8239, device='cuda:0') |
| spectral_norm tensor(7.0607, device='cuda:0') |
| alpha value of layer 8 ---289.63446044921875 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(129.7933, device='cuda:0') |
| spectral_norm tensor(13.0345, device='cuda:0') |
| frobenius_norm tensor(119.5203, device='cuda:0') |
| spectral_norm tensor(5.3777, device='cuda:0') |
| frobenius_norm tensor(118.6356, device='cuda:0') |
| spectral_norm tensor(7.0941, device='cuda:0') |
| alpha value of layer 9 ---290.9284362792969 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(129.3888, device='cuda:0') |
| spectral_norm tensor(13.1949, device='cuda:0') |
| frobenius_norm tensor(120.7769, device='cuda:0') |
| spectral_norm tensor(5.5907, device='cuda:0') |
| frobenius_norm tensor(119.6504, device='cuda:0') |
| spectral_norm tensor(7.0719, device='cuda:0') |
| alpha value of layer 10 ---283.04052734375 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(128.9186, device='cuda:0') |
| spectral_norm tensor(13.5576, device='cuda:0') |
| frobenius_norm tensor(121.8036, device='cuda:0') |
| spectral_norm tensor(5.6766, device='cuda:0') |
| frobenius_norm tensor(120.4323, device='cuda:0') |
| spectral_norm tensor(7.4645, device='cuda:0') |
| alpha value of layer 11 ---270.3800048828125 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(128.1013, device='cuda:0') |
| spectral_norm tensor(13.0662, device='cuda:0') |
| frobenius_norm tensor(122.9129, device='cuda:0') |
| spectral_norm tensor(6.0042, device='cuda:0') |
| frobenius_norm tensor(121.4419, device='cuda:0') |
| spectral_norm tensor(6.8182, device='cuda:0') |
| alpha value of layer 12 ---277.48046875 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(127.6379, device='cuda:0') |
| spectral_norm tensor(13.2896, device='cuda:0') |
| frobenius_norm tensor(124.1647, device='cuda:0') |
| spectral_norm tensor(6.3249, device='cuda:0') |
| frobenius_norm tensor(122.4098, device='cuda:0') |
| spectral_norm tensor(6.4258, device='cuda:0') |
| alpha value of layer 13 ---280.17169189453125 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(127.3744, device='cuda:0') |
| spectral_norm tensor(13.1519, device='cuda:0') |
| frobenius_norm tensor(124.2991, device='cuda:0') |
| spectral_norm tensor(6.4910, device='cuda:0') |
| frobenius_norm tensor(122.5321, device='cuda:0') |
| spectral_norm tensor(6.1543, device='cuda:0') |
| alpha value of layer 14 ---285.63787841796875 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(128.0534, device='cuda:0') |
| spectral_norm tensor(13.0927, device='cuda:0') |
| frobenius_norm tensor(124.9728, device='cuda:0') |
| spectral_norm tensor(6.8014, device='cuda:0') |
| frobenius_norm tensor(122.9694, device='cuda:0') |
| spectral_norm tensor(5.7687, device='cuda:0') |
| alpha value of layer 15 ---295.8956298828125 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(128.9326, device='cuda:0') |
| spectral_norm tensor(13.4859, device='cuda:0') |
| frobenius_norm tensor(124.8730, device='cuda:0') |
| spectral_norm tensor(7.0698, device='cuda:0') |
| frobenius_norm tensor(122.8834, device='cuda:0') |
| spectral_norm tensor(5.4697, device='cuda:0') |
| alpha value of layer 16 ---302.703857421875 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(130.1418, device='cuda:0') |
| spectral_norm tensor(13.9212, device='cuda:0') |
| frobenius_norm tensor(124.4075, device='cuda:0') |
| spectral_norm tensor(6.5251, device='cuda:0') |
| frobenius_norm tensor(122.7542, device='cuda:0') |
| spectral_norm tensor(5.4324, device='cuda:0') |
| alpha value of layer 17 ---320.5066833496094 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(131.3037, device='cuda:0') |
| spectral_norm tensor(13.5630, device='cuda:0') |
| frobenius_norm tensor(124.0418, device='cuda:0') |
| spectral_norm tensor(6.0638, device='cuda:0') |
| frobenius_norm tensor(122.6318, device='cuda:0') |
| spectral_norm tensor(5.3899, device='cuda:0') |
| alpha value of layer 18 ---343.27899169921875 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(131.7776, device='cuda:0') |
| spectral_norm tensor(12.6322, device='cuda:0') |
| frobenius_norm tensor(124.0385, device='cuda:0') |
| spectral_norm tensor(5.9170, device='cuda:0') |
| frobenius_norm tensor(122.9026, device='cuda:0') |
| spectral_norm tensor(5.5908, device='cuda:0') |
| alpha value of layer 19 ---343.8408203125 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(132.4897, device='cuda:0') |
| spectral_norm tensor(12.7395, device='cuda:0') |
| frobenius_norm tensor(123.9540, device='cuda:0') |
| spectral_norm tensor(6.1100, device='cuda:0') |
| frobenius_norm tensor(122.8897, device='cuda:0') |
| spectral_norm tensor(5.1633, device='cuda:0') |
| alpha value of layer 20 ---362.06610107421875 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(133.3879, device='cuda:0') |
| spectral_norm tensor(12.1366, device='cuda:0') |
| frobenius_norm tensor(123.7530, device='cuda:0') |
| spectral_norm tensor(5.6965, device='cuda:0') |
| frobenius_norm tensor(122.8804, device='cuda:0') |
| spectral_norm tensor(4.8811, device='cuda:0') |
| alpha value of layer 21 ---408.834228515625 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(134.1589, device='cuda:0') |
| spectral_norm tensor(11.9229, device='cuda:0') |
| frobenius_norm tensor(123.6884, device='cuda:0') |
| spectral_norm tensor(5.3241, device='cuda:0') |
| frobenius_norm tensor(122.9257, device='cuda:0') |
| spectral_norm tensor(5.2778, device='cuda:0') |
| alpha value of layer 22 ---402.9353942871094 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(134.1199, device='cuda:0') |
| spectral_norm tensor(10.9135, device='cuda:0') |
| frobenius_norm tensor(124.2959, device='cuda:0') |
| spectral_norm tensor(4.9033, device='cuda:0') |
| frobenius_norm tensor(123.6594, device='cuda:0') |
| spectral_norm tensor(5.8655, device='cuda:0') |
| alpha value of layer 23 ---412.6978454589844 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(134.4485, device='cuda:0') |
| spectral_norm tensor(10.8745, device='cuda:0') |
| frobenius_norm tensor(124.6739, device='cuda:0') |
| spectral_norm tensor(4.6898, device='cuda:0') |
| frobenius_norm tensor(124.1024, device='cuda:0') |
| spectral_norm tensor(5.6106, device='cuda:0') |
| alpha value of layer 24 ---449.614501953125 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(134.7100, device='cuda:0') |
| spectral_norm tensor(10.7878, device='cuda:0') |
| frobenius_norm tensor(125.2170, device='cuda:0') |
| spectral_norm tensor(4.9540, device='cuda:0') |
| frobenius_norm tensor(124.7208, device='cuda:0') |
| spectral_norm tensor(5.3156, device='cuda:0') |
| alpha value of layer 25 ---448.44183349609375 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(135.1252, device='cuda:0') |
| spectral_norm tensor(11.1821, device='cuda:0') |
| frobenius_norm tensor(125.6647, device='cuda:0') |
| spectral_norm tensor(5.8583, device='cuda:0') |
| frobenius_norm tensor(125.1040, device='cuda:0') |
| spectral_norm tensor(5.0074, device='cuda:0') |
| alpha value of layer 26 ---410.114013671875 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(135.2713, device='cuda:0') |
| spectral_norm tensor(11.5424, device='cuda:0') |
| frobenius_norm tensor(126.2993, device='cuda:0') |
| spectral_norm tensor(6.9998, device='cuda:0') |
| frobenius_norm tensor(125.7692, device='cuda:0') |
| spectral_norm tensor(5.1612, device='cuda:0') |
| alpha value of layer 27 ---352.2416687011719 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(134.6884, device='cuda:0') |
| spectral_norm tensor(11.4798, device='cuda:0') |
| frobenius_norm tensor(127.5331, device='cuda:0') |
| spectral_norm tensor(9.0437, device='cuda:0') |
| frobenius_norm tensor(126.5359, device='cuda:0') |
| spectral_norm tensor(4.9416, device='cuda:0') |
| alpha value of layer 28 ---330.72796630859375 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(135.0423, device='cuda:0') |
| spectral_norm tensor(12.0559, device='cuda:0') |
| frobenius_norm tensor(128.7051, device='cuda:0') |
| spectral_norm tensor(11.3077, device='cuda:0') |
| frobenius_norm tensor(126.9749, device='cuda:0') |
| spectral_norm tensor(4.4258, device='cuda:0') |
| alpha value of layer 29 ---359.37579345703125 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(138.1487, device='cuda:0') |
| spectral_norm tensor(19.7267, device='cuda:0') |
| frobenius_norm tensor(130.8807, device='cuda:0') |
| spectral_norm tensor(19.9367, device='cuda:0') |
| frobenius_norm tensor(126.2216, device='cuda:0') |
| spectral_norm tensor(4.5236, device='cuda:0') |
| alpha value of layer 30 ---290.23907470703125 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| frobenius_norm tensor(144.1483, device='cuda:0') |
| spectral_norm tensor(19.9982, device='cuda:0') |
| frobenius_norm tensor(135.9538, device='cuda:0') |
| spectral_norm tensor(19.9404, device='cuda:0') |
| frobenius_norm tensor(126.0626, device='cuda:0') |
| spectral_norm tensor(7.8316, device='cuda:0') |
| alpha value of layer 31 ---119.1807861328125 |
| metric_name stable_rank: [24, 25, 3, 2, 23, 4, 26, 21, 22, 5, 20, 29, 27, 6, 19, 18, 28, 17, 7, 16, 15, 9, 30, 8, 14, 10, 13, 12, 11, 1, 31, 0] |
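The logged numbers for the stable_rank run are consistent with each layer's alpha being the mean, over the three MLP projections, of the stable rank (squared Frobenius norm divided by squared spectral norm). The sketch below checks that reading against two layers whose norm pairs appear in full above. The names `LOGGED_NORMS`, `LOGGED_ALPHA`, `stable_rank`, and `layer_alpha` are illustrative and not taken from the script that produced this log, and the mean aggregation is an assumption inferred from the printed values.

```python
# (frobenius_norm, spectral_norm) pairs copied from the log, in printed order,
# for the three MLP projections of layers 18 and 31.
LOGGED_NORMS = {
    18: [(131.3037, 13.5630), (124.0418, 6.0638), (122.6318, 5.3899)],
    31: [(144.1483, 19.9982), (135.9538, 19.9404), (126.0626, 7.8316)],
}
# "alpha value of layer N" lines copied from the log.
LOGGED_ALPHA = {18: 343.27899169921875, 31: 119.1807861328125}


def stable_rank(frobenius_norm: float, spectral_norm: float) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a single weight matrix."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_alpha(norm_pairs):
    """Assumed aggregation: plain mean of the per-projection stable ranks."""
    ranks = [stable_rank(f, s) for f, s in norm_pairs]
    return sum(ranks) / len(ranks)


for layer, pairs in LOGGED_NORMS.items():
    # Recomputed value next to the logged one; small differences are expected
    # because the norms are printed with only four decimals.
    print(layer, round(layer_alpha(pairs), 2), LOGGED_ALPHA[layer])
```

Under this reading, the low alpha of layer 31 comes mostly from its two projections with spectral norms near 20, which pull their individual stable ranks down to roughly 50.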
| Begin main_assign: Llama-2-7b-hf mlp effective_rank |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.44s/it] |
| Once upon a time, my father-in-law went to the bank to change a cheque. He was asked to write his signature on the cheque and to then place a tick in the box indicating whether the cheque was a deposit or a withdrawal. |
| My father-in-law was confused. He was not accustomed to the new style of cheque, which was introduced in the 1980s. |
| “I know it says ‘deposit’, but |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 0 ---3572.474365234375 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 1 ---3627.507568359375 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 2 ---3733.4091796875 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 3 ---3785.27880859375 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 4 ---3790.103271484375 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 5 ---3778.20458984375 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 6 ---3768.807373046875 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 7 ---3756.693359375 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 8 ---3748.4443359375 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 9 ---3745.5185546875 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 10 ---3735.50341796875 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 11 ---3737.810302734375 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 12 ---3740.48388671875 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 13 ---3753.927734375 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 14 ---3753.2763671875 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 15 ---3771.952880859375 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 16 ---3771.9111328125 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 17 ---3772.70849609375 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 18 ---3783.5595703125 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 19 ---3788.821044921875 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 20 ---3790.424560546875 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 21 ---3790.768310546875 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 22 ---3790.56396484375 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 23 ---3793.769287109375 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 24 ---3795.335693359375 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 25 ---3797.1728515625 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 26 ---3800.90966796875 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 27 ---3804.0595703125 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 28 ---3810.544921875 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 29 ---3812.28857421875 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 30 ---3802.36669921875 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 31 ---3771.994873046875 |
| metric_name effective_rank: [29, 28, 27, 30, 26, 25, 24, 23, 21, 22, 20, 4, 19, 3, 18, 5, 17, 31, 15, 16, 6, 7, 13, 14, 8, 9, 12, 11, 10, 2, 1, 0] |
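The `metric_name effective_rank` list appears to be the layer indices sorted by descending alpha. The sketch below reproduces it from the 32 alpha values logged above. `ALPHAS`, `LOGGED_ORDER`, and `rank_layers` are hypothetical names, and the actual script may well compute the ordering differently (for example with a tensor argsort).

```python
# "alpha value of layer N" entries for the effective_rank run, index = layer.
ALPHAS = [
    3572.474365234375, 3627.507568359375, 3733.4091796875, 3785.27880859375,
    3790.103271484375, 3778.20458984375, 3768.807373046875, 3756.693359375,
    3748.4443359375, 3745.5185546875, 3735.50341796875, 3737.810302734375,
    3740.48388671875, 3753.927734375, 3753.2763671875, 3771.952880859375,
    3771.9111328125, 3772.70849609375, 3783.5595703125, 3788.821044921875,
    3790.424560546875, 3790.768310546875, 3790.56396484375, 3793.769287109375,
    3795.335693359375, 3797.1728515625, 3800.90966796875, 3804.0595703125,
    3810.544921875, 3812.28857421875, 3802.36669921875, 3771.994873046875,
]

# Ordering printed on the "metric_name effective_rank" line.
LOGGED_ORDER = [29, 28, 27, 30, 26, 25, 24, 23, 21, 22, 20, 4, 19, 3, 18, 5,
                17, 31, 15, 16, 6, 7, 13, 14, 8, 9, 12, 11, 10, 2, 1, 0]


def rank_layers(alphas):
    """Return layer indices ordered from largest to smallest alpha."""
    return sorted(range(len(alphas)), key=lambda i: alphas[i], reverse=True)


print(rank_layers(ALPHAS) == LOGGED_ORDER)
```

The spread here is narrow: every layer except 0 and 1 falls between about 3733 and 3812, so the ordering in the middle of the list is decided by differences well under one percent.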
| Begin main_assign: Llama-2-7b-hf mlp ZD |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.60s/it] |
| Once upon a time, in a faraway land, there lived a prince and a princess. They loved each other very much. The prince was very rich, but the princess was poor. The prince loved the princess so much that he wanted to give her everything he had. |
| One day, the prince decided to give the princess a gift. He went to the market and bought her a beautiful golden ring. The princess was very happy. She loved the ring. But the prince was not happy |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 0 ---0.15578114986419678 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 1 ---0.15707579255104065 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 2 ---0.15797409415245056 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 3 ---0.1574869602918625 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 4 ---0.15727774798870087 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 5 ---0.15676063299179077 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 6 ---0.1562662124633789 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 7 ---0.15596774220466614 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 8 ---0.1566615253686905 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 9 ---0.1561465859413147 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 10 ---0.15556025505065918 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 11 ---0.15571480989456177 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 12 ---0.15566584467887878 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 13 ---0.15572252869606018 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 14 ---0.15585064888000488 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 15 ---0.15579025447368622 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 16 ---0.15610937774181366 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 17 ---0.1567423790693283 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 18 ---0.15685215592384338 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 19 ---0.15697705745697021 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 20 ---0.15734341740608215 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 21 ---0.1579303741455078 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 22 ---0.15809854865074158 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 23 ---0.15812799334526062 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 24 ---0.1577608585357666 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 25 ---0.15747487545013428 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 26 ---0.1573786735534668 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 27 ---0.15686684846878052 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 28 ---0.15638381242752075 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 29 ---0.1558859944343567 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 30 ---0.1538189947605133 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 31 ---0.15391211211681366 |
| metric_name ZD: [23, 22, 2, 21, 24, 3, 25, 26, 20, 4, 1, 19, 27, 18, 5, 17, 8, 28, 6, 9, 16, 7, 29, 14, 15, 0, 13, 11, 12, 10, 31, 30] |
| Begin main_assign: Llama-2-7b-hf mlp head_diversity |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.47s/it] |
| Once upon a time, there were three kingdoms. The first kingdom was ruled by a king who was very wise. The second kingdom was ruled by a king who was very greedy. The third kingdom was ruled by a king who was very lazy. |
| The first king was very happy and lived in peace with his people. The second king was very unhappy and lived in fear of his people. The third king was very sad and lived in shame with his people. |
| One day, the first king decided to |
| |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| Traceback (most recent call last): |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module> |
| all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword) |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 354, in calculate_expert |
| all_layer_alpha.append(torch.stack(layer_final_alpha).mean().item()) |
| RuntimeError: stack expects a non-empty TensorList |
| Begin main_assign: Llama-2-7b-hf mlp coherence |
| Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.61s/it] |
| Once upon a time, I was a child. I was full of dreams and visions of the future. I believed in the power of love and in the strength of family. I was the idealist, the dreamer, the one who believed that all good things were possible. |
| Then I grew up. I became a teenager. I was still full of dreams and visions of the future, but the dreams were tinged with darkness, and the visions were full of desp |
| |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 0 ---0.018758879974484444 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 1 ---0.015916038304567337 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 2 ---0.014297164976596832 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 3 ---0.013034423813223839 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 4 ---0.012964712455868721 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 5 ---0.013287676498293877 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 6 ---0.013558547012507915 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 7 ---0.013750488869845867 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 8 ---0.013907302170991898 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 9 ---0.013899993151426315 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 10 ---0.014043152332305908 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 11 ---0.014066706411540508 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 12 ---0.01396130956709385 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 13 ---0.013774177059531212 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 14 ---0.013727117329835892 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 15 ---0.013336378149688244 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 16 ---0.013238323852419853 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 17 ---0.013199622742831707 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 18 ---0.012931596487760544 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 19 ---0.012751361355185509 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 20 ---0.012693298980593681 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 21 ---0.012600544840097427 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 22 ---0.012644654139876366 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 23 ---0.012538459151983261 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 24 ---0.012461837381124496 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 25 ---0.012478632852435112 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 26 ---0.012449456378817558 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 27 ---0.012493574991822243 |
| Processing layer 28--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 28 ---0.012300195172429085 |
| Processing layer 29--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 29 ---0.01232621818780899 |
| Processing layer 30--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 30 ---0.012763611972332 |
| Processing layer 31--subset--{'mlp.gate_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.up_proj': Linear(in_features=4096, out_features=11008, bias=False), 'mlp.down_proj': Linear(in_features=11008, out_features=4096, bias=False)} |
| alpha value of layer 31 ---0.014904310926795006 |
| metric_name coherence: [0, 1, 31, 2, 11, 10, 12, 8, 9, 13, 7, 14, 6, 15, 5, 16, 17, 3, 4, 18, 30, 19, 20, 22, 21, 23, 27, 25, 24, 26, 29, 28] |
| Begin main_assign: Qwen2.5-7B self_attn alpha |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.30it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a king who loved gold. He even had his own gold mine, but he was never satisfied with what he had. One day, he decided to go on a journey to find even more gold. He sent his men to explore the world and bring back as much gold as they could find. They searched high and low, but they could not find any more gold. The king was disappointed. He felt like he had been cheated. But then, he had an idea. He decided |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
| |
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 0 ---3.7398271560668945 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 1 ---4.61224889755249 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 2 ---3.6950488090515137 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 3 ---3.5232601165771484 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 4 ---4.135606288909912 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 5 ---3.5275750160217285 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 6 ---6.992785453796387 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 7 ---3.8776655197143555 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 8 ---5.665498733520508 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 9 ---4.215641021728516 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 10 ---3.8878092765808105 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 11 ---4.219095706939697 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 12 ---3.7720558643341064 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 13 ---7.641180515289307 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 14 ---3.707275152206421 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 15 ---3.25449800491333 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 16 ---4.158566474914551 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 17 ---3.467252492904663 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 18 ---5.359913349151611 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 19 ---3.048678398132324 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 20 ---2.6340246200561523 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 21 ---4.8504228591918945 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 22 ---3.2433063983917236 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 23 ---6.264342308044434 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 24 ---3.845273494720459 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 25 ---4.786241054534912 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 26 ---2.9173049926757812 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 27 ---2.295574426651001 |
| metric_name alpha: [13, 6, 23, 8, 18, 21, 25, 1, 11, 9, 16, 4, 10, 7, 24, 12, 0, 14, 2, 5, 3, 17, 15, 22, 19, 26, 20, 27] |
| Begin main_assign: Qwen2.5-7B self_attn alpha_hat |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.54it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a man who was in the habit of getting up at six o'clock every morning and jogging around his neighborhood. One morning, while he was out for his usual morning run, he saw an old man sitting in a park bench. He asked the old man what he was doing there. The old man replied,"I am waiting for my son to arrive." The man asked,"What does he do? Where does he work?" The old man said,"He is |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 0 ---11.435125350952148 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 1 ---7.411468505859375 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 2 ---7.694450378417969 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 3 ---7.890998363494873 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 4 ---7.172995567321777 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 5 ---7.2080488204956055 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 6 ---11.36629867553711 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 7 ---7.176176071166992 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 8 ---7.694736957550049 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 9 ---7.057432174682617 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 10 ---6.2363386154174805 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 11 ---6.52932071685791 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 12 ---7.182070732116699 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 13 ---8.655073165893555 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 14 ---5.977964401245117 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 15 ---5.821178436279297 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 16 ---7.714193820953369 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 17 ---6.244594573974609 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 18 ---7.704196929931641 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 19 ---5.97743558883667 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 20 ---5.515291213989258 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 21 ---7.423676490783691 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 22 ---6.235415458679199 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 23 ---11.209845542907715 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 24 ---7.722582817077637 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 25 ---10.762764930725098 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 26 ---7.431804180145264 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 27 ---8.220139503479004 |
| metric_name alpha_hat: [0, 6, 23, 25, 13, 27, 3, 24, 16, 18, 8, 2, 26, 21, 1, 5, 12, 7, 4, 9, 11, 17, 10, 22, 14, 19, 15, 20] |
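Note on the ranking above: the `metric_name alpha_hat` list matches the layer indices sorted by their logged per-layer value, largest first. The sketch below reproduces that ordering from the values printed in this run. The helper name `rank_layers` is hypothetical, and the descending-sort rule is inferred from the numbers in this log rather than taken from the script's source.

```python
# Per-layer values copied from the "alpha value of layer N" lines of the
# Qwen2.5-7B self_attn alpha_hat run above (list index = layer number).
alpha_hat = [
    11.435125350952148, 7.411468505859375, 7.694450378417969,
    7.890998363494873, 7.172995567321777, 7.2080488204956055,
    11.36629867553711, 7.176176071166992, 7.694736957550049,
    7.057432174682617, 6.2363386154174805, 6.52932071685791,
    7.182070732116699, 8.655073165893555, 5.977964401245117,
    5.821178436279297, 7.714193820953369, 6.244594573974609,
    7.704196929931641, 5.97743558883667, 5.515291213989258,
    7.423676490783691, 6.235415458679199, 11.209845542907715,
    7.722582817077637, 10.762764930725098, 7.431804180145264,
    8.220139503479004,
]


def rank_layers(values):
    """Return layer indices ordered from largest to smallest value.

    Hypothetical helper: the descending order is an assumption inferred
    from the logged ranking, not read from the original script.
    """
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


ranking = rank_layers(alpha_hat)
print(ranking)
```

Several neighbouring values differ only in the fourth decimal place (for example layers 8 and 2, or 14 and 19), so the order of those pairs is sensitive to small numerical changes between runs.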
| Begin main_assign: Qwen2.5-7B self_attn stable_rank |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.34it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a family with 10 children. Each of the 10 children had a different number of books in their bookshelves. The first child had 1 book, the second had 2 books, the third had 3 books, and so on until the tenth child, who had 10 books. One day, the parents decided to redistribute the books so that each child would have the same number of books. How many books did each child end up with? To |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(74.6100, device='cuda:0') |
| spectral_norm tensor(22.8962, device='cuda:0') |
| frobenius_norm tensor(37.2340, device='cuda:0') |
| spectral_norm tensor(4.1839, device='cuda:0') |
| frobenius_norm tensor(12.2050, device='cuda:0') |
| spectral_norm tensor(1.2453, device='cuda:0') |
| frobenius_norm tensor(44.6291, device='cuda:0') |
| spectral_norm tensor(6.0001, device='cuda:0') |
| alpha value of layer 0 ---60.29820251464844 |
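Note on the stable_rank run: the line labelled `alpha value of layer 0` appears to hold the layer's stable rank rather than an alpha exponent. Taking the stable rank of a matrix as the squared ratio of its Frobenius norm to its spectral norm and averaging over the four projections reproduces the logged value to within rounding. The sketch below checks this using the norms printed for layer 0. Two points are assumptions inferred from this log and not confirmed by the script: that the four norm pairs are printed in the order q_proj, k_proj, v_proj, o_proj, and that the per-layer value is a plain mean. The function names are hypothetical.

```python
# (frobenius_norm, spectral_norm) pairs copied from the layer 0 lines above.
# Assumption: pairs are printed in the order q_proj, k_proj, v_proj, o_proj.
layer0_norms = {
    "q_proj": (74.6100, 22.8962),
    "k_proj": (37.2340, 4.1839),
    "v_proj": (12.2050, 1.2453),
    "o_proj": (44.6291, 6.0001),
}


def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank of one matrix: ||W||_F^2 / ||W||_2^2."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_stable_rank(norms):
    """Mean stable rank over a layer's projections.

    Assumption: mean aggregation is inferred from the logged numbers.
    """
    ranks = [stable_rank(f, s) for f, s in norms.values()]
    return sum(ranks) / len(ranks)


layer0 = layer_stable_rank(layer0_norms)
logged_layer0 = 60.29820251464844  # value printed in the log for layer 0
print(layer0, abs(layer0 - logged_layer0))
```

The small residual is expected because the log prints the norms to four decimal places only. The same calculation on the layer 1 norms gives a value close to the logged 105.13.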
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(56.7970, device='cuda:0') |
| spectral_norm tensor(10.4783, device='cuda:0') |
| frobenius_norm tensor(29.9237, device='cuda:0') |
| spectral_norm tensor(4.7115, device='cuda:0') |
| frobenius_norm tensor(18.3318, device='cuda:0') |
| spectral_norm tensor(1.4635, device='cuda:0') |
| frobenius_norm tensor(51.8109, device='cuda:0') |
| spectral_norm tensor(3.7208, device='cuda:0') |
| alpha value of layer 1 ---105.13154602050781 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(65.4367, device='cuda:0') |
| spectral_norm tensor(5.5540, device='cuda:0') |
| frobenius_norm tensor(31.8038, device='cuda:0') |
| spectral_norm tensor(3.0331, device='cuda:0') |
| frobenius_norm tensor(14.7804, device='cuda:0') |
| spectral_norm tensor(1.2128, device='cuda:0') |
| frobenius_norm tensor(50.6930, device='cuda:0') |
| spectral_norm tensor(3.9768, device='cuda:0') |
| alpha value of layer 2 ---139.9442901611328 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(67.9445, device='cuda:0') |
| spectral_norm tensor(7.3888, device='cuda:0') |
| frobenius_norm tensor(32.4597, device='cuda:0') |
| spectral_norm tensor(3.6898, device='cuda:0') |
| frobenius_norm tensor(17.2702, device='cuda:0') |
| spectral_norm tensor(1.2119, device='cuda:0') |
| frobenius_norm tensor(53.1248, device='cuda:0') |
| spectral_norm tensor(3.7626, device='cuda:0') |
| alpha value of layer 3 ---141.09706115722656 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(64.7004, device='cuda:0') |
| spectral_norm tensor(6.6270, device='cuda:0') |
| frobenius_norm tensor(29.1071, device='cuda:0') |
| spectral_norm tensor(3.3637, device='cuda:0') |
| frobenius_norm tensor(20.8693, device='cuda:0') |
| spectral_norm tensor(1.5107, device='cuda:0') |
| frobenius_norm tensor(53.3331, device='cuda:0') |
| spectral_norm tensor(3.7430, device='cuda:0') |
| alpha value of layer 4 ---141.01356506347656 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(63.2364, device='cuda:0') |
| spectral_norm tensor(5.9598, device='cuda:0') |
| frobenius_norm tensor(26.6552, device='cuda:0') |
| spectral_norm tensor(2.6609, device='cuda:0') |
| frobenius_norm tensor(19.8491, device='cuda:0') |
| spectral_norm tensor(1.5480, device='cuda:0') |
| frobenius_norm tensor(53.4192, device='cuda:0') |
| spectral_norm tensor(4.0186, device='cuda:0') |
| alpha value of layer 5 ---138.5137939453125 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(64.2853, device='cuda:0') |
| spectral_norm tensor(5.5097, device='cuda:0') |
| frobenius_norm tensor(28.5616, device='cuda:0') |
| spectral_norm tensor(2.9881, device='cuda:0') |
| frobenius_norm tensor(20.6937, device='cuda:0') |
| spectral_norm tensor(1.4517, device='cuda:0') |
| frobenius_norm tensor(54.9730, device='cuda:0') |
| spectral_norm tensor(5.0745, device='cuda:0') |
| alpha value of layer 6 ---137.01039123535156 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(60.9057, device='cuda:0') |
| spectral_norm tensor(4.5698, device='cuda:0') |
| frobenius_norm tensor(23.8630, device='cuda:0') |
| spectral_norm tensor(2.5382, device='cuda:0') |
| frobenius_norm tensor(24.6983, device='cuda:0') |
| spectral_norm tensor(1.7165, device='cuda:0') |
| frobenius_norm tensor(60.3206, device='cuda:0') |
| spectral_norm tensor(3.5918, device='cuda:0') |
| alpha value of layer 7 ---188.773193359375 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(63.3943, device='cuda:0') |
| spectral_norm tensor(3.9526, device='cuda:0') |
| frobenius_norm tensor(27.2992, device='cuda:0') |
| spectral_norm tensor(2.5236, device='cuda:0') |
| frobenius_norm tensor(20.6254, device='cuda:0') |
| spectral_norm tensor(1.4007, device='cuda:0') |
| frobenius_norm tensor(55.2465, device='cuda:0') |
| spectral_norm tensor(3.4821, device='cuda:0') |
| alpha value of layer 8 ---210.70074462890625 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(58.6742, device='cuda:0') |
| spectral_norm tensor(4.6933, device='cuda:0') |
| frobenius_norm tensor(23.2825, device='cuda:0') |
| spectral_norm tensor(2.6721, device='cuda:0') |
| frobenius_norm tensor(24.7824, device='cuda:0') |
| spectral_norm tensor(1.5406, device='cuda:0') |
| frobenius_norm tensor(60.4736, device='cuda:0') |
| spectral_norm tensor(6.3921, device='cuda:0') |
| alpha value of layer 9 ---145.1187744140625 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(63.5265, device='cuda:0') |
| spectral_norm tensor(4.3337, device='cuda:0') |
| frobenius_norm tensor(27.1029, device='cuda:0') |
| spectral_norm tensor(2.6480, device='cuda:0') |
| frobenius_norm tensor(22.3527, device='cuda:0') |
| spectral_norm tensor(1.3822, device='cuda:0') |
| frobenius_norm tensor(57.0854, device='cuda:0') |
| spectral_norm tensor(4.1727, device='cuda:0') |
| alpha value of layer 10 ---192.07977294921875 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(64.2904, device='cuda:0') |
| spectral_norm tensor(4.0832, device='cuda:0') |
| frobenius_norm tensor(28.0532, device='cuda:0') |
| spectral_norm tensor(2.6312, device='cuda:0') |
| frobenius_norm tensor(19.8152, device='cuda:0') |
| spectral_norm tensor(1.4528, device='cuda:0') |
| frobenius_norm tensor(55.3427, device='cuda:0') |
| spectral_norm tensor(4.3578, device='cuda:0') |
| alpha value of layer 11 ---177.22564697265625 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(62.3252, device='cuda:0') |
| spectral_norm tensor(3.9938, device='cuda:0') |
| frobenius_norm tensor(26.8499, device='cuda:0') |
| spectral_norm tensor(2.5529, device='cuda:0') |
| frobenius_norm tensor(20.6057, device='cuda:0') |
| spectral_norm tensor(1.3954, device='cuda:0') |
| frobenius_norm tensor(55.6693, device='cuda:0') |
| spectral_norm tensor(4.6503, device='cuda:0') |
| alpha value of layer 12 ---178.88079833984375 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(61.0881, device='cuda:0') |
| spectral_norm tensor(4.2847, device='cuda:0') |
| frobenius_norm tensor(25.6297, device='cuda:0') |
| spectral_norm tensor(2.9570, device='cuda:0') |
| frobenius_norm tensor(22.4836, device='cuda:0') |
| spectral_norm tensor(1.3594, device='cuda:0') |
| frobenius_norm tensor(57.5732, device='cuda:0') |
| spectral_norm tensor(5.3631, device='cuda:0') |
| alpha value of layer 13 ---166.80108642578125 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(59.9385, device='cuda:0') |
| spectral_norm tensor(4.1989, device='cuda:0') |
| frobenius_norm tensor(25.3315, device='cuda:0') |
| spectral_norm tensor(2.3592, device='cuda:0') |
| frobenius_norm tensor(19.7791, device='cuda:0') |
| spectral_norm tensor(1.4382, device='cuda:0') |
| frobenius_norm tensor(54.6680, device='cuda:0') |
| spectral_norm tensor(4.8688, device='cuda:0') |
| alpha value of layer 14 ---158.56539916992188 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(62.2632, device='cuda:0') |
| spectral_norm tensor(4.4386, device='cuda:0') |
| frobenius_norm tensor(26.5240, device='cuda:0') |
| spectral_norm tensor(2.5572, device='cuda:0') |
| frobenius_norm tensor(20.8667, device='cuda:0') |
| spectral_norm tensor(1.5160, device='cuda:0') |
| frobenius_norm tensor(55.3834, device='cuda:0') |
| spectral_norm tensor(4.4431, device='cuda:0') |
| alpha value of layer 15 ---162.2994384765625 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(59.5504, device='cuda:0') |
| spectral_norm tensor(4.1705, device='cuda:0') |
| frobenius_norm tensor(23.7914, device='cuda:0') |
| spectral_norm tensor(2.3827, device='cuda:0') |
| frobenius_norm tensor(22.4299, device='cuda:0') |
| spectral_norm tensor(1.8284, device='cuda:0') |
| frobenius_norm tensor(57.2100, device='cuda:0') |
| spectral_norm tensor(5.3112, device='cuda:0') |
| alpha value of layer 16 ---142.52708435058594 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(61.2952, device='cuda:0') |
| spectral_norm tensor(3.9247, device='cuda:0') |
| frobenius_norm tensor(24.2607, device='cuda:0') |
| spectral_norm tensor(2.3428, device='cuda:0') |
| frobenius_norm tensor(22.0993, device='cuda:0') |
| spectral_norm tensor(1.6527, device='cuda:0') |
| frobenius_norm tensor(56.9147, device='cuda:0') |
| spectral_norm tensor(4.7962, device='cuda:0') |
| alpha value of layer 17 ---167.6918182373047 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(57.8086, device='cuda:0') |
| spectral_norm tensor(4.1784, device='cuda:0') |
| frobenius_norm tensor(23.1346, device='cuda:0') |
| spectral_norm tensor(2.7649, device='cuda:0') |
| frobenius_norm tensor(24.9173, device='cuda:0') |
| spectral_norm tensor(1.5424, device='cuda:0') |
| frobenius_norm tensor(60.8592, device='cuda:0') |
| spectral_norm tensor(5.2139, device='cuda:0') |
| alpha value of layer 18 ---164.66351318359375 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(57.4449, device='cuda:0') |
| spectral_norm tensor(4.5447, device='cuda:0') |
| frobenius_norm tensor(21.3000, device='cuda:0') |
| spectral_norm tensor(2.4431, device='cuda:0') |
| frobenius_norm tensor(25.0779, device='cuda:0') |
| spectral_norm tensor(1.6237, device='cuda:0') |
| frobenius_norm tensor(59.6389, device='cuda:0') |
| spectral_norm tensor(4.9505, device='cuda:0') |
| alpha value of layer 19 ---154.86126708984375 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(58.7482, device='cuda:0') |
| spectral_norm tensor(4.0858, device='cuda:0') |
| frobenius_norm tensor(22.3411, device='cuda:0') |
| spectral_norm tensor(2.3521, device='cuda:0') |
| frobenius_norm tensor(25.9143, device='cuda:0') |
| spectral_norm tensor(1.7104, device='cuda:0') |
| frobenius_norm tensor(60.9286, device='cuda:0') |
| spectral_norm tensor(4.8709, device='cuda:0') |
| alpha value of layer 20 ---170.74554443359375 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(57.1627, device='cuda:0') |
| spectral_norm tensor(3.6813, device='cuda:0') |
| frobenius_norm tensor(19.9702, device='cuda:0') |
| spectral_norm tensor(2.2489, device='cuda:0') |
| frobenius_norm tensor(27.7738, device='cuda:0') |
| spectral_norm tensor(1.6762, device='cuda:0') |
| frobenius_norm tensor(63.2683, device='cuda:0') |
| spectral_norm tensor(5.2737, device='cuda:0') |
| alpha value of layer 21 ---184.6076202392578 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(58.1757, device='cuda:0') |
| spectral_norm tensor(4.0771, device='cuda:0') |
| frobenius_norm tensor(19.4699, device='cuda:0') |
| spectral_norm tensor(1.9257, device='cuda:0') |
| frobenius_norm tensor(27.1717, device='cuda:0') |
| spectral_norm tensor(2.0305, device='cuda:0') |
| frobenius_norm tensor(63.5510, device='cuda:0') |
| spectral_norm tensor(4.3551, device='cuda:0') |
| alpha value of layer 22 ---174.45973205566406 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(59.5436, device='cuda:0') |
| spectral_norm tensor(4.0933, device='cuda:0') |
| frobenius_norm tensor(20.5584, device='cuda:0') |
| spectral_norm tensor(2.1754, device='cuda:0') |
| frobenius_norm tensor(27.5827, device='cuda:0') |
| spectral_norm tensor(1.8698, device='cuda:0') |
| frobenius_norm tensor(64.9988, device='cuda:0') |
| spectral_norm tensor(5.3918, device='cuda:0') |
| alpha value of layer 23 ---165.96688842773438 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(56.8169, device='cuda:0') |
| spectral_norm tensor(3.8277, device='cuda:0') |
| frobenius_norm tensor(19.4813, device='cuda:0') |
| spectral_norm tensor(1.8291, device='cuda:0') |
| frobenius_norm tensor(31.0836, device='cuda:0') |
| spectral_norm tensor(2.3168, device='cuda:0') |
| frobenius_norm tensor(65.3215, device='cuda:0') |
| spectral_norm tensor(6.2412, device='cuda:0') |
| alpha value of layer 24 ---155.8324432373047 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(55.3445, device='cuda:0') |
| spectral_norm tensor(3.8164, device='cuda:0') |
| frobenius_norm tensor(17.7185, device='cuda:0') |
| spectral_norm tensor(1.8025, device='cuda:0') |
| frobenius_norm tensor(33.9660, device='cuda:0') |
| spectral_norm tensor(3.0984, device='cuda:0') |
| frobenius_norm tensor(68.8981, device='cuda:0') |
| spectral_norm tensor(5.3780, device='cuda:0') |
| alpha value of layer 25 ---147.80517578125 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(51.7868, device='cuda:0') |
| spectral_norm tensor(3.9548, device='cuda:0') |
| frobenius_norm tensor(16.9575, device='cuda:0') |
| spectral_norm tensor(1.9295, device='cuda:0') |
| frobenius_norm tensor(40.6153, device='cuda:0') |
| spectral_norm tensor(3.1825, device='cuda:0') |
| frobenius_norm tensor(71.6628, device='cuda:0') |
| spectral_norm tensor(6.2830, device='cuda:0') |
| alpha value of layer 26 ---135.41717529296875 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| frobenius_norm tensor(56.6802, device='cuda:0') |
| spectral_norm tensor(7.8725, device='cuda:0') |
| frobenius_norm tensor(18.0258, device='cuda:0') |
| spectral_norm tensor(2.1181, device='cuda:0') |
| frobenius_norm tensor(36.7123, device='cuda:0') |
| spectral_norm tensor(4.5755, device='cuda:0') |
| frobenius_norm tensor(66.9187, device='cuda:0') |
| spectral_norm tensor(10.3449, device='cuda:0') |
| alpha value of layer 27 ---57.62150573730469 |
| metric_name stable_rank: [8, 10, 7, 21, 12, 11, 22, 20, 17, 13, 23, 18, 15, 14, 24, 19, 25, 9, 16, 3, 4, 2, 5, 6, 26, 1, 0, 27] |
| Begin main_assign: Qwen2.5-7B self_attn effective_rank |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.50it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, in a faraway land, there was a wizard named Zephyr. Zephyr had a magical garden filled with enchanted flowers that bloomed only once a year on the night of the full moon. Each flower had a unique power: some could grant wishes, others could heal, and some could even bring the dead back to life. Zephyr knew that the garden was in danger, and he needed to protect it. He decided to create a secret code to lock the garden's entrance |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 0 ---1452.84130859375 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 1 ---1305.515869140625 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 2 ---1505.5045166015625 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 3 ---1528.22900390625 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 4 ---1506.9344482421875 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 5 ---1515.52099609375 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 6 ---1521.9154052734375 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 7 ---1530.30126953125 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 8 ---1513.680908203125 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 9 ---1508.0103759765625 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 10 ---1541.102294921875 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 11 ---1515.346923828125 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 12 ---1515.414306640625 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 13 ---1518.91796875 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 14 ---1460.6475830078125 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 15 ---1482.7188720703125 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 16 ---1503.2802734375 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 17 ---1507.509033203125 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 18 ---1475.742431640625 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 19 ---1503.236083984375 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 20 ---1508.0513916015625 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 21 ---1513.8974609375 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 22 ---1492.510986328125 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 23 ---1560.966552734375 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 24 ---1547.26904296875 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 25 ---1559.457763671875 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 26 ---1530.5631103515625 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 27 ---1474.574462890625 |
| metric_name effective_rank: [23, 25, 24, 10, 26, 7, 3, 6, 13, 5, 12, 11, 21, 8, 20, 9, 17, 4, 2, 16, 19, 22, 15, 18, 27, 14, 0, 1] |
| Begin main_assign: Qwen2.5-7B self_attn ZD |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.70it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time there was a very special fish, who lived in a very special lake, and who had a very special name. |
| And when the fish was born, his parents named him Nemo. |
| Nemo was a very happy fish, who loved to swim around his lake, and who had a lot of friends. |
| There was Dory, the forgetful fish, who would always forget where she was going, and Marlin, the protective fish, who would always look after his family. |
| Nemo was also |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 0 ---0.14184384047985077 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 1 ---0.13445976376533508 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 2 ---0.1448441445827484 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 3 ---0.14360329508781433 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 4 ---0.14149385690689087 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 5 ---0.14219337701797485 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 6 ---0.14448319375514984 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 7 ---0.14329871535301208 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 8 ---0.14073410630226135 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 9 ---0.14020180702209473 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 10 ---0.14346933364868164 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 11 ---0.14193479716777802 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 12 ---0.1403268575668335 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 13 ---0.14010006189346313 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 14 ---0.13355384767055511 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 15 ---0.13806670904159546 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 16 ---0.13962170481681824 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 17 ---0.13812372088432312 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 18 ---0.13966597616672516 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 19 ---0.13903221487998962 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 20 ---0.1419043242931366 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 21 ---0.13662637770175934 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 22 ---0.13463997840881348 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 23 ---0.13744638860225677 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 24 ---0.1426282525062561 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 25 ---0.1387733370065689 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 26 ---0.13506180047988892 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 27 ---0.13298790156841278 |
| metric_name ZD: [2, 6, 3, 10, 7, 24, 5, 11, 20, 0, 4, 8, 12, 9, 13, 18, 16, 19, 25, 17, 15, 23, 21, 26, 22, 1, 14, 27] |
| Begin main_assign: Qwen2.5-7B self_attn head_diversity |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.16it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time in the land of Mathoria, there was a magical forest where every tree had a unique number of leaves. The King of Mathoria decided to plant a new tree every day for a week (7 days), starting with 1 leaf on the first day and increasing the number of leaves by 1 each day. However, a mischievous sprite named Sprinkle loved to play tricks on the trees. On every even day, Sprinkle would randomly remove a number of leaves from the tree, between |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| Traceback (most recent call last): |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module> |
| all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword) |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 350, in calculate_expert |
| layer_final_alpha = func_call[metric](num_heads, subset) |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 310, in head_diversity_asssist |
| ans.append(head_diversity(W, num_heads)) |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 191, in head_diversity |
| w_heads = W.view(num_heads, head_dim, d_in) |
| RuntimeError: shape '[28, 18, 3584]' is invalid for input of size 1835008 |
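Note on the traceback above: the failing input has 1835008 = 512 x 3584 elements, which matches a `k_proj` or `v_proj` weight, while the requested view `[28, 18, 3584]` holds only 1806336. The 18 looks like `512 // 28`, so `head_diversity` appears to divide the output dimension by `num_attention_heads` even for the key/value projections, which under grouped-query attention carry only `num_key_value_heads` = 4 heads. The sketch below is one possible way to pick the view shape per projection. It is not the code in `expert_number.py`; the helper name `head_view_shape` is hypothetical, and it assumes `head_dim = hidden_size // num_attention_heads`, which holds for the config printed in this log.

```python
from math import prod


def head_view_shape(out_features, d_in, hidden_size,
                    num_attention_heads, num_key_value_heads):
    """Return (n_heads, head_dim, d_in) for a q/k/v weight of shape (out_features, d_in).

    Hypothetical helper: the head count is derived from the weight itself,
    so GQA key/value projections get num_key_value_heads, not num_attention_heads.
    """
    # Assumption: every head has the same width, hidden_size / num_attention_heads.
    head_dim = hidden_size // num_attention_heads
    n_heads, remainder = divmod(out_features, head_dim)
    if remainder or n_heads not in (num_attention_heads, num_key_value_heads):
        raise ValueError(
            f"out_features={out_features} does not split into heads of size {head_dim}"
        )
    return (n_heads, head_dim, d_in)


# Values taken from the Qwen2Config printed in this log.
cfg = dict(hidden_size=3584, num_attention_heads=28, num_key_value_heads=4)

q_shape = head_view_shape(3584, 3584, **cfg)  # q_proj: 28 heads of width 128
k_shape = head_view_shape(512, 3584, **cfg)   # k_proj / v_proj: 4 heads of width 128

# The shape from the traceback, rebuilt the way the error message suggests.
failing_shape = (28, 512 // 28, 3584)

print(q_shape, k_shape, prod(k_shape), prod(failing_shape))
```

With a shape chosen this way, a call such as `W.view(*k_shape)` would cover all 1835008 elements of the key projection. Whether a 4-head diversity score is comparable to a 28-head one is a separate question for the metric itself.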
| Begin main_assign: Qwen2.5-7B self_attn coherence |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.25it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time in the not-so-distant past, the average person had a choice of two or three local phone companies and one long distance company. Now, you have to be an expert to navigate the maze of telephone companies and options. This article will help you make the right choice for your telephone needs. |
| If you are using a cellular phone, you should only use it in an emergency. It is important to use a cell phone only in emergencies because they use a lot of battery power. If you use a |
|
|
| Processing layer 0--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 0 ---0.019099362194538116 |
| Processing layer 1--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 1 ---0.03270909935235977 |
| Processing layer 2--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 2 ---0.02031659334897995 |
| Processing layer 3--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 3 ---0.020667918026447296 |
| Processing layer 4--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 4 ---0.021066918969154358 |
| Processing layer 5--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 5 ---0.019672438502311707 |
| Processing layer 6--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 6 ---0.020373258739709854 |
| Processing layer 7--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 7 ---0.01879500225186348 |
| Processing layer 8--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 8 ---0.018471794202923775 |
| Processing layer 9--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 9 ---0.01999843120574951 |
| Processing layer 10--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 10 ---0.018006717786192894 |
| Processing layer 11--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 11 ---0.019990842789411545 |
| Processing layer 12--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 12 ---0.01996159367263317 |
| Processing layer 13--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 13 ---0.020418085157871246 |
| Processing layer 14--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 14 ---0.020115870982408524 |
| Processing layer 15--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 15 ---0.02094407193362713 |
| Processing layer 16--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 16 ---0.02063441462814808 |
| Processing layer 17--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 17 ---0.018953558057546616 |
| Processing layer 18--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 18 ---0.020888380706310272 |
| Processing layer 19--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 19 ---0.01945885643362999 |
| Processing layer 20--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 20 ---0.019533313810825348 |
| Processing layer 21--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 21 ---0.018554046750068665 |
| Processing layer 22--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 22 ---0.020378313958644867 |
| Processing layer 23--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 23 ---0.019100410863757133 |
| Processing layer 24--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 24 ---0.018557211384177208 |
| Processing layer 25--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 25 ---0.019475571811199188 |
| Processing layer 26--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 26 ---0.021406373009085655 |
| Processing layer 27--subset--{'self_attn.q_proj': Linear(in_features=3584, out_features=3584, bias=True), 'self_attn.k_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.v_proj': Linear(in_features=3584, out_features=512, bias=True), 'self_attn.o_proj': Linear(in_features=3584, out_features=3584, bias=False)} |
| alpha value of layer 27 ---0.02655916102230549 |
| metric_name coherence: [1, 27, 26, 4, 15, 18, 3, 16, 13, 22, 6, 2, 14, 9, 11, 12, 5, 20, 25, 19, 23, 0, 17, 7, 24, 21, 8, 10] |
| Begin main_assign: Qwen2.5-7B mlp alpha |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.41it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time there was a little girl named Maria. She was 10 years old and she loved to play with her dolls. Every night before she went to sleep she would place her dolls on her bed and have a tea party with them. One night, she was feeling very lonely. She wanted someone to talk to who could understand her. She wanted to be with her dolls, but she wanted a friend too. She prayed that God would send her a friend. She prayed that God would send her a |
|
|
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 0 ---5.5507073402404785 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 1 ---2.925907850265503 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 2 ---3.427783727645874 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 3 ---4.130731105804443 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 4 ---4.2790751457214355 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 5 ---4.67555570602417 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 6 ---5.680507659912109 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 7 ---5.469402313232422 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 8 ---4.489261150360107 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 9 ---5.958518981933594 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 10 ---5.11647367477417 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 11 ---4.431467056274414 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 12 ---4.447659969329834 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 13 ---4.224405288696289 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 14 ---4.203671932220459 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 15 ---4.193532943725586 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 16 ---4.277862548828125 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 17 ---4.189056396484375 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 18 ---4.484411716461182 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 19 ---4.689056396484375 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 20 ---4.993287563323975 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 21 ---6.104448318481445 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 22 ---6.7987060546875 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 23 ---6.16623067855835 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 24 ---6.090585231781006 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 25 ---5.552665710449219 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 26 ---5.523178577423096 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 27 ---4.90322208404541 |
| metric_name alpha: [22, 23, 21, 24, 9, 6, 25, 0, 26, 7, 10, 20, 27, 19, 5, 8, 18, 12, 11, 4, 16, 13, 14, 15, 17, 3, 2, 1] |
| Begin main_assign: Qwen2.5-7B mlp alpha_hat |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.68it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, a little boy named Timmy went to visit his grandmother. On the way, he saw a beautiful rainbow in the sky. He wanted to find the pot of gold at the end of the rainbow. But the rainbow led him to a magical door that was locked with a puzzle. |
| The puzzle was: "I am not alive, but I grow; I don't have lungs, but I need air; I don't have a mouth, but water kills me. What am I?" |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 0 ---26.469465255737305 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 1 ---15.14222526550293 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 2 ---15.627280235290527 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 3 ---21.562671661376953 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 4 ---18.97063636779785 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 5 ---21.275915145874023 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 6 ---20.404766082763672 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 7 ---21.564342498779297 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 8 ---17.832380294799805 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 9 ---24.982494354248047 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 10 ---21.225040435791016 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 11 ---18.6888484954834 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 12 ---19.226669311523438 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 13 ---18.279586791992188 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 14 ---17.538372039794922 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 15 ---17.4776611328125 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 16 ---18.033130645751953 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 17 ---17.052593231201172 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 18 ---18.05915641784668 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 19 ---18.863903045654297 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 20 ---19.91156768798828 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 21 ---22.948781967163086 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 22 ---26.310537338256836 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 23 ---24.970985412597656 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 24 ---24.232511520385742 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 25 ---23.717098236083984 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 26 ---23.539913177490234 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 27 ---24.706113815307617 |
| metric_name alpha_hat: [0, 22, 9, 23, 27, 24, 25, 26, 21, 7, 3, 5, 10, 6, 20, 12, 4, 19, 11, 13, 18, 16, 8, 14, 15, 17, 2, 1] |
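The `metric_name alpha_hat` list above is consistent with the layer indices sorted by descending per-layer value. A minimal sketch of that check follows, using the 28 logged values rounded to four decimals. The name `rank_layers` is hypothetical and is not taken from the script that produced this log; the script's actual sorting code is not shown here.

```python
# Per-layer alpha_hat values for Qwen2.5-7B mlp, copied from the log above
# (index = layer number, rounded to 4 decimals).
ALPHA_HAT = [
    26.4695, 15.1422, 15.6273, 21.5627, 18.9706, 21.2759, 20.4048,
    21.5643, 17.8324, 24.9825, 21.2250, 18.6888, 19.2267, 18.2796,
    17.5384, 17.4777, 18.0331, 17.0526, 18.0592, 18.8639, 19.9116,
    22.9488, 26.3105, 24.9710, 24.2325, 23.7171, 23.5399, 24.7061,
]


def rank_layers(values):
    """Return layer indices ordered from largest to smallest metric value.

    Hypothetical helper: assumes the logged ranking is a plain descending sort.
    """
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


if __name__ == "__main__":
    print("metric_name alpha_hat:", rank_layers(ALPHA_HAT))
```

Run on the logged values, this reproduces the order printed in the log, which suggests (but does not prove) that the script ranks layers by descending metric.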
| Begin main_assign: Qwen2.5-7B mlp stable_rank |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.32it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a little girl. She was very beautiful, but she was so bad that nobody liked her. She had no friends. She didn't want to play with other children. She didn't want to go to school. She lived by herself in a small house. The only thing that made her happy was a little dog. The little girl was very sad. She didn't know what to do. One day, she walked out of the house and saw a beautiful bird. She wanted to |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(126.6843, device='cuda:0') |
| spectral_norm tensor(37.7735, device='cuda:0') |
| frobenius_norm tensor(108.8807, device='cuda:0') |
| spectral_norm tensor(7.8392, device='cuda:0') |
| frobenius_norm tensor(115.3682, device='cuda:0') |
| spectral_norm tensor(6.5322, device='cuda:0') |
| alpha value of layer 0 ---172.02789306640625 |
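In the `stable_rank` run, the value printed after `alpha value of layer N ---` matches the mean of the stable ranks of the three matrices in the subset, where the stable rank of a matrix W is ||W||_F^2 / ||W||_2^2. The sketch below checks this against the layer 0 numbers above. It is an inference from the logged values, not the script's own code, and the names `stable_rank` and `mean_stable_rank` are hypothetical.

```python
# (frobenius_norm, spectral_norm) pairs logged for layer 0, in printed order.
LAYER0_NORMS = [
    (126.6843, 37.7735),
    (108.8807, 7.8392),
    (115.3682, 6.5322),
]


def stable_rank(frobenius_norm, spectral_norm):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return (frobenius_norm / spectral_norm) ** 2


def mean_stable_rank(norm_pairs):
    """Average the stable rank over all matrices in a layer's subset.

    Assumption: the script reports the unweighted mean; this is inferred
    from the logged numbers rather than read from the source.
    """
    ranks = [stable_rank(f, s) for f, s in norm_pairs]
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    # Agrees with the logged 172.0279 up to rounding of the printed norms.
    print(round(mean_stable_rank(LAYER0_NORMS), 2))
```

The same calculation on the layer 1 norms gives roughly 74.39, again in line with the logged value.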
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(123.2531, device='cuda:0') |
| spectral_norm tensor(20.7839, device='cuda:0') |
| frobenius_norm tensor(101.4097, device='cuda:0') |
| spectral_norm tensor(8.8381, device='cuda:0') |
| frobenius_norm tensor(101.5777, device='cuda:0') |
| spectral_norm tensor(13.5321, device='cuda:0') |
| alpha value of layer 1 ---74.3896484375 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(138.8778, device='cuda:0') |
| spectral_norm tensor(23.4270, device='cuda:0') |
| frobenius_norm tensor(112.7775, device='cuda:0') |
| spectral_norm tensor(6.5452, device='cuda:0') |
| frobenius_norm tensor(115.6860, device='cuda:0') |
| spectral_norm tensor(7.9797, device='cuda:0') |
| alpha value of layer 2 ---180.73681640625 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(153.6270, device='cuda:0') |
| spectral_norm tensor(22.2753, device='cuda:0') |
| frobenius_norm tensor(132.3810, device='cuda:0') |
| spectral_norm tensor(7.2662, device='cuda:0') |
| frobenius_norm tensor(131.2425, device='cuda:0') |
| spectral_norm tensor(17.0355, device='cuda:0') |
| alpha value of layer 3 ---146.2799072265625 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(156.7403, device='cuda:0') |
| spectral_norm tensor(22.8238, device='cuda:0') |
| frobenius_norm tensor(129.6389, device='cuda:0') |
| spectral_norm tensor(5.4833, device='cuda:0') |
| frobenius_norm tensor(128.7504, device='cuda:0') |
| spectral_norm tensor(8.4121, device='cuda:0') |
| alpha value of layer 4 ---280.12554931640625 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(149.4161, device='cuda:0') |
| spectral_norm tensor(21.1506, device='cuda:0') |
| frobenius_norm tensor(133.9768, device='cuda:0') |
| spectral_norm tensor(6.2399, device='cuda:0') |
| frobenius_norm tensor(132.0446, device='cuda:0') |
| spectral_norm tensor(8.4305, device='cuda:0') |
| alpha value of layer 5 ---252.07894897460938 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(151.2180, device='cuda:0') |
| spectral_norm tensor(14.3238, device='cuda:0') |
| frobenius_norm tensor(129.6978, device='cuda:0') |
| spectral_norm tensor(4.1238, device='cuda:0') |
| frobenius_norm tensor(127.6298, device='cuda:0') |
| spectral_norm tensor(7.3585, device='cuda:0') |
| alpha value of layer 6 ---467.14599609375 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(141.6723, device='cuda:0') |
| spectral_norm tensor(13.0737, device='cuda:0') |
| frobenius_norm tensor(133.2031, device='cuda:0') |
| spectral_norm tensor(5.1650, device='cuda:0') |
| frobenius_norm tensor(132.6786, device='cuda:0') |
| spectral_norm tensor(7.6574, device='cuda:0') |
| alpha value of layer 7 ---360.915771484375 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(139.1350, device='cuda:0') |
| spectral_norm tensor(12.2201, device='cuda:0') |
| frobenius_norm tensor(135.6841, device='cuda:0') |
| spectral_norm tensor(4.7790, device='cuda:0') |
| frobenius_norm tensor(134.1169, device='cuda:0') |
| spectral_norm tensor(8.0838, device='cuda:0') |
| alpha value of layer 8 ---403.66400146484375 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(152.4723, device='cuda:0') |
| spectral_norm tensor(28.9447, device='cuda:0') |
| frobenius_norm tensor(124.9508, device='cuda:0') |
| spectral_norm tensor(5.0401, device='cuda:0') |
| frobenius_norm tensor(123.2868, device='cuda:0') |
| spectral_norm tensor(7.9764, device='cuda:0') |
| alpha value of layer 9 ---293.7569580078125 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(141.8071, device='cuda:0') |
| spectral_norm tensor(14.3111, device='cuda:0') |
| frobenius_norm tensor(133.1904, device='cuda:0') |
| spectral_norm tensor(5.2121, device='cuda:0') |
| frobenius_norm tensor(132.4129, device='cuda:0') |
| spectral_norm tensor(9.0536, device='cuda:0') |
| alpha value of layer 10 ---321.69720458984375 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(139.0032, device='cuda:0') |
| spectral_norm tensor(12.7478, device='cuda:0') |
| frobenius_norm tensor(135.5063, device='cuda:0') |
| spectral_norm tensor(5.4040, device='cuda:0') |
| frobenius_norm tensor(134.1321, device='cuda:0') |
| spectral_norm tensor(9.8842, device='cuda:0') |
| alpha value of layer 11 ---310.609130859375 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(136.5591, device='cuda:0') |
| spectral_norm tensor(12.8626, device='cuda:0') |
| frobenius_norm tensor(136.9887, device='cuda:0') |
| spectral_norm tensor(5.2525, device='cuda:0') |
| frobenius_norm tensor(135.8750, device='cuda:0') |
| spectral_norm tensor(10.9823, device='cuda:0') |
| alpha value of layer 12 ---315.332763671875 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(139.4029, device='cuda:0') |
| spectral_norm tensor(12.9032, device='cuda:0') |
| frobenius_norm tensor(135.3871, device='cuda:0') |
| spectral_norm tensor(5.2956, device='cuda:0') |
| frobenius_norm tensor(133.7299, device='cuda:0') |
| spectral_norm tensor(10.5605, device='cuda:0') |
| alpha value of layer 13 ---310.2343444824219 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(136.9790, device='cuda:0') |
| spectral_norm tensor(12.1678, device='cuda:0') |
| frobenius_norm tensor(136.4235, device='cuda:0') |
| spectral_norm tensor(5.1185, device='cuda:0') |
| frobenius_norm tensor(134.9485, device='cuda:0') |
| spectral_norm tensor(9.6906, device='cuda:0') |
| alpha value of layer 14 ---343.6784362792969 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(135.3356, device='cuda:0') |
| spectral_norm tensor(11.3039, device='cuda:0') |
| frobenius_norm tensor(137.9369, device='cuda:0') |
| spectral_norm tensor(5.1105, device='cuda:0') |
| frobenius_norm tensor(135.9716, device='cuda:0') |
| spectral_norm tensor(10.2615, device='cuda:0') |
| alpha value of layer 15 ---349.14691162109375 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(135.1444, device='cuda:0') |
| spectral_norm tensor(11.1679, device='cuda:0') |
| frobenius_norm tensor(137.6841, device='cuda:0') |
| spectral_norm tensor(5.1499, device='cuda:0') |
| frobenius_norm tensor(135.1051, device='cuda:0') |
| spectral_norm tensor(10.5823, device='cuda:0') |
| alpha value of layer 16 ---341.401611328125 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(133.7680, device='cuda:0') |
| spectral_norm tensor(11.1640, device='cuda:0') |
| frobenius_norm tensor(138.4800, device='cuda:0') |
| spectral_norm tensor(5.2358, device='cuda:0') |
| frobenius_norm tensor(135.3299, device='cuda:0') |
| spectral_norm tensor(8.6134, device='cuda:0') |
| alpha value of layer 17 ---363.3155212402344 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(133.3184, device='cuda:0') |
| spectral_norm tensor(10.9756, device='cuda:0') |
| frobenius_norm tensor(140.8582, device='cuda:0') |
| spectral_norm tensor(5.6010, device='cuda:0') |
| frobenius_norm tensor(137.9481, device='cuda:0') |
| spectral_norm tensor(7.5009, device='cuda:0') |
| alpha value of layer 18 ---372.74267578125 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(135.1317, device='cuda:0') |
| spectral_norm tensor(12.0810, device='cuda:0') |
| frobenius_norm tensor(140.5321, device='cuda:0') |
| spectral_norm tensor(5.5793, device='cuda:0') |
| frobenius_norm tensor(137.1833, device='cuda:0') |
| spectral_norm tensor(6.9554, device='cuda:0') |
| alpha value of layer 19 ---382.8515625 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(134.6000, device='cuda:0') |
| spectral_norm tensor(11.1297, device='cuda:0') |
| frobenius_norm tensor(141.2537, device='cuda:0') |
| spectral_norm tensor(5.9169, device='cuda:0') |
| frobenius_norm tensor(138.0343, device='cuda:0') |
| spectral_norm tensor(6.5506, device='cuda:0') |
| alpha value of layer 20 ---386.7352294921875 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(136.7462, device='cuda:0') |
| spectral_norm tensor(11.6961, device='cuda:0') |
| frobenius_norm tensor(141.2342, device='cuda:0') |
| spectral_norm tensor(5.6582, device='cuda:0') |
| frobenius_norm tensor(137.9897, device='cuda:0') |
| spectral_norm tensor(5.3100, device='cuda:0') |
| alpha value of layer 21 ---478.354248046875 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(136.9238, device='cuda:0') |
| spectral_norm tensor(12.4543, device='cuda:0') |
| frobenius_norm tensor(141.9774, device='cuda:0') |
| spectral_norm tensor(6.5433, device='cuda:0') |
| frobenius_norm tensor(139.0448, device='cuda:0') |
| spectral_norm tensor(5.1260, device='cuda:0') |
| alpha value of layer 22 ---442.48614501953125 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(138.1573, device='cuda:0') |
| spectral_norm tensor(14.3250, device='cuda:0') |
| frobenius_norm tensor(141.6219, device='cuda:0') |
| spectral_norm tensor(6.6344, device='cuda:0') |
| frobenius_norm tensor(138.9087, device='cuda:0') |
| spectral_norm tensor(5.2948, device='cuda:0') |
| alpha value of layer 23 ---412.3252868652344 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(135.7100, device='cuda:0') |
| spectral_norm tensor(12.7437, device='cuda:0') |
| frobenius_norm tensor(143.0019, device='cuda:0') |
| spectral_norm tensor(6.6948, device='cuda:0') |
| frobenius_norm tensor(141.1184, device='cuda:0') |
| spectral_norm tensor(5.4309, device='cuda:0') |
| alpha value of layer 24 ---414.9455261230469 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(134.2979, device='cuda:0') |
| spectral_norm tensor(12.3842, device='cuda:0') |
| frobenius_norm tensor(144.5099, device='cuda:0') |
| spectral_norm tensor(7.7998, device='cuda:0') |
| frobenius_norm tensor(143.8913, device='cuda:0') |
| spectral_norm tensor(6.7086, device='cuda:0') |
| alpha value of layer 25 ---306.970947265625 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(134.3241, device='cuda:0') |
| spectral_norm tensor(10.3940, device='cuda:0') |
| frobenius_norm tensor(145.9831, device='cuda:0') |
| spectral_norm tensor(10.3400, device='cuda:0') |
| frobenius_norm tensor(143.8145, device='cuda:0') |
| spectral_norm tensor(5.9343, device='cuda:0') |
| alpha value of layer 26 ---317.8813781738281 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| frobenius_norm tensor(139.7623, device='cuda:0') |
| spectral_norm tensor(14.5234, device='cuda:0') |
| frobenius_norm tensor(145.0968, device='cuda:0') |
| spectral_norm tensor(16.8470, device='cuda:0') |
| frobenius_norm tensor(133.4643, device='cuda:0') |
| spectral_norm tensor(7.9846, device='cuda:0') |
| alpha value of layer 27 ---148.7277069091797 |
| metric_name stable_rank: [21, 6, 22, 24, 23, 8, 20, 19, 18, 17, 7, 15, 14, 16, 10, 26, 12, 11, 13, 25, 9, 4, 5, 2, 0, 27, 3, 1] |
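The logged values are consistent with each layer's alpha being the mean stable rank of its three MLP projections, with stable rank taken as the squared Frobenius norm divided by the squared spectral norm. The sketch below checks this against the layer 20 figures printed above. The function names are hypothetical and this is not the script's own code, only a reading of its output.

```python
def stable_rank(frobenius_norm: float, spectral_norm: float) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 from two precomputed norms."""
    return (frobenius_norm / spectral_norm) ** 2


def layer_alpha(norm_pairs):
    """Mean stable rank over (frobenius_norm, spectral_norm) pairs."""
    ranks = [stable_rank(f, s) for f, s in norm_pairs]
    return sum(ranks) / len(ranks)


# (frobenius_norm, spectral_norm) pairs logged for layer 20, in log order.
LAYER_20 = [(134.6000, 11.1297), (141.2537, 5.9169), (138.0343, 6.5506)]

if __name__ == "__main__":
    print(f"layer 20 alpha ~ {layer_alpha(LAYER_20):.2f}")
```

For layer 20 this agrees with the logged alpha of 386.735 to within the rounding of the printed norms.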
| Begin main_assign: Qwen2.5-7B mlp effective_rank |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.06it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a queen who wanted to build a castle. She had a team of builders who would work day and night to complete the castle. She wanted the castle to be the most magnificent one in the kingdom. The queen would often visit the builders to see the progress of the castle. She would give them advice on how to make it even better. The builders worked hard and the castle was built in a short time. The queen was very happy with the castle and she made it the official residence |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 0 ---2950.251953125 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 1 ---3178.52880859375 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 2 ---3286.484375 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 3 ---3382.06787109375 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 4 ---3406.15234375 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 5 ---3428.05908203125 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 6 ---3425.3720703125 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 7 ---3438.742919921875 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 8 ---3428.871826171875 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 9 ---3411.824951171875 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 10 ---3431.89013671875 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 11 ---3419.459716796875 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 12 ---3425.357177734375 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 13 ---3405.046875 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 14 ---3417.287841796875 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 15 ---3418.02490234375 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 16 ---3407.5625 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 17 ---3408.4970703125 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 18 ---3426.13232421875 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 19 ---3422.40576171875 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 20 ---3435.60009765625 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 21 ---3436.99267578125 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 22 ---3444.177734375 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 23 ---3438.670654296875 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 24 ---3432.22265625 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 25 ---3429.8212890625 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 26 ---3432.60546875 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 27 ---3447.41943359375 |
| metric_name effective_rank: [27, 22, 7, 23, 21, 20, 26, 24, 10, 25, 8, 5, 18, 6, 12, 19, 11, 15, 14, 9, 17, 16, 4, 13, 3, 2, 1, 0] |
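The `metric_name` list above matches the layer indices sorted by alpha from largest to smallest. A minimal sketch of that ordering, using the alpha values logged for this run; `rank_layers` is a hypothetical name, and the script may compute the ordering differently.

```python
def rank_layers(alphas):
    """Return layer indices ordered from largest alpha to smallest."""
    return sorted(range(len(alphas)), key=lambda i: alphas[i], reverse=True)


# Alpha values as logged for layers 0..27 of the effective_rank run.
EFFECTIVE_RANK_ALPHAS = [
    2950.251953125, 3178.52880859375, 3286.484375, 3382.06787109375,
    3406.15234375, 3428.05908203125, 3425.3720703125, 3438.742919921875,
    3428.871826171875, 3411.824951171875, 3431.89013671875, 3419.459716796875,
    3425.357177734375, 3405.046875, 3417.287841796875, 3418.02490234375,
    3407.5625, 3408.4970703125, 3426.13232421875, 3422.40576171875,
    3435.60009765625, 3436.99267578125, 3444.177734375, 3438.670654296875,
    3432.22265625, 3429.8212890625, 3432.60546875, 3447.41943359375,
]

if __name__ == "__main__":
    print("metric_name effective_rank:", rank_layers(EFFECTIVE_RANK_ALPHAS))
```

The spread across layers 4 to 27 is under 2% of the value, so the resulting order is sensitive to small numerical differences.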
| Begin main_assign: Qwen2.5-7B mlp ZD |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.37it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time there lived a rich man. He had a servant(仆人). He and the servant loved wine and good food very much. Each time the rich man left his home, the servant would drink the wine and eat up all the nice food in the house. The rich man knew what his servant did, but he had never caught his servant doing that. One morning, when he left home, he said to the servant, “Here are two bottles of poison(毒药) and some nice |
|
|
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 0 ---0.14041712880134583 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 1 ---0.12673243880271912 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 2 ---0.14196857810020447 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 3 ---0.15253740549087524 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 4 ---0.15330302715301514 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 5 ---0.1531524807214737 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 6 ---0.15197408199310303 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 7 ---0.15272140502929688 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 8 ---0.15372541546821594 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 9 ---0.15197794139385223 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 10 ---0.15327925980091095 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 11 ---0.15320764482021332 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 12 ---0.15106868743896484 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 13 ---0.15193961560726166 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 14 ---0.14989005029201508 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 15 ---0.15030330419540405 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 16 ---0.1516457051038742 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 17 ---0.15101006627082825 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 18 ---0.14948342740535736 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 19 ---0.15111291408538818 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 20 ---0.15065661072731018 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 21 ---0.15124627947807312 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 22 ---0.1522490233182907 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 23 ---0.15400244295597076 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 24 ---0.15468579530715942 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 25 ---0.15433579683303833 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 26 ---0.1542406529188156 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 27 ---0.1527990698814392 |
| metric_name ZD: [24, 25, 26, 23, 8, 4, 10, 11, 5, 27, 7, 3, 22, 9, 6, 13, 16, 21, 19, 12, 17, 20, 15, 14, 18, 2, 0, 1] |
| Begin main_assign: Qwen2.5-7B mlp head_diversity |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00, 1.23s/it] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, there was a little boy named Timmy. Timmy loved to play outside and explore the world around him. One day, while playing in the park, he met a talking tree named Oakley. |
| Oakley told Timmy that there was a magical forest nearby that only appeared once every hundred years. The forest was filled with talking animals and magical creatures that could grant wishes. Timmy was excited to hear this news and begged Oakley to take him there. |
| Oakley agreed and led Timmy |
|
|
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| Traceback (most recent call last): |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/main_assign.py", line 58, in <module> |
| all_layer_alpha = calculate_expert(model, metric=metric_name, keyword=keyword) |
| File "/mnt/bn/life-mllm/users/cxr/quantization/quantization_metric/alphalora/expert_number.py", line 354, in calculate_expert |
| all_layer_alpha.append(torch.stack(layer_final_alpha).mean().item()) |
| RuntimeError: stack expects a non-empty TensorList |
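The traceback shows `torch.stack` receiving an empty `layer_final_alpha` list at layer 0. One plausible cause, which is an assumption and not something the log confirms, is that `head_diversity` only yields values for attention projections, so none of the MLP modules contribute. A minimal sketch of guarding the per-layer mean follows, with plain floats standing in for tensors and a hypothetical helper name.

```python
import math


def mean_or_nan(per_module_values, layer_index, metric_name):
    """Average per-module metric values, tolerating an empty list.

    Returns NaN (and prints a note) when no module produced a value,
    instead of raising as an unguarded stack-then-mean would.
    """
    if not per_module_values:
        print(
            f"layer {layer_index}: metric {metric_name!r} produced no values; "
            "recording NaN"
        )
        return math.nan
    return sum(per_module_values) / len(per_module_values)


if __name__ == "__main__":
    print(mean_or_nan([], 0, "head_diversity"))
    print(mean_or_nan([1.0, 3.0], 1, "stable_rank"))
```

Whether to record NaN, skip the layer, or fail fast with a clearer message is a design choice. Failing fast may be preferable if the metric and keyword pairing is simply unsupported.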
| Begin main_assign: Qwen2.5-7B mlp coherence |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.18it/s] |
| Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. |
| Once upon a time, the land of Greece was ruled by three powerful and evil sorceresses: Echidna, the mother of the monsters; her daughter, the monster Medusa; and the sea-goddess Gorgon. Their leader was the most evil of the three, the monster Medusa. When the gods tried to stop her, she became so angry that she attacked them and killed them all. She then took their weapons, which she used to attack the world. She even attacked the other |
| Qwen2ForCausalLM( |
| (model): Qwen2Model( |
| (embed_tokens): Embedding(152064, 3584) |
| (layers): ModuleList( |
| (0-27): 28 x Qwen2DecoderLayer( |
| (self_attn): Qwen2Attention( |
| (q_proj): Linear(in_features=3584, out_features=3584, bias=True) |
| (k_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (v_proj): Linear(in_features=3584, out_features=512, bias=True) |
| (o_proj): Linear(in_features=3584, out_features=3584, bias=False) |
| ) |
| (mlp): Qwen2MLP( |
| (gate_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (up_proj): Linear(in_features=3584, out_features=18944, bias=False) |
| (down_proj): Linear(in_features=18944, out_features=3584, bias=False) |
| (act_fn): SiLU() |
| ) |
| (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06) |
| ) |
| ) |
| (norm): Qwen2RMSNorm((3584,), eps=1e-06) |
| (rotary_emb): Qwen2RotaryEmbedding() |
| ) |
| (lm_head): Linear(in_features=3584, out_features=152064, bias=False) |
| ) |
| config: |
| Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151643, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "layer_types": [ |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention", |
| "full_attention" |
| ], |
| "max_position_embeddings": 131072, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": null, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float16", |
| "transformers_version": "4.55.2", |
| "use_cache": true, |
| "use_mrope": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| Processing layer 0--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 0 ---0.030741512775421143 |
| Processing layer 1--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 1 ---0.0886889323592186 |
| Processing layer 2--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 2 ---0.038115616887807846 |
| Processing layer 3--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 3 ---0.01555887795984745 |
| Processing layer 4--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 4 ---0.015330223366618156 |
| Processing layer 5--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 5 ---0.014795559458434582 |
| Processing layer 6--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 6 ---0.013012934476137161 |
| Processing layer 7--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 7 ---0.012255651876330376 |
| Processing layer 8--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 8 ---0.012662074528634548 |
| Processing layer 9--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 9 ---0.017650291323661804 |
| Processing layer 10--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 10 ---0.012567928992211819 |
| Processing layer 11--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 11 ---0.012772100046277046 |
| Processing layer 12--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 12 ---0.012693522498011589 |
| Processing layer 13--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 13 ---0.012766292318701744 |
| Processing layer 14--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 14 ---0.01258667092770338 |
| Processing layer 15--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 15 ---0.012480087578296661 |
| Processing layer 16--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 16 ---0.012708479538559914 |
| Processing layer 17--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 17 ---0.01283347513526678 |
| Processing layer 18--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 18 ---0.01226731389760971 |
| Processing layer 19--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 19 ---0.012299998663365841 |
| Processing layer 20--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 20 ---0.01201008539646864 |
| Processing layer 21--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 21 ---0.011824443936347961 |
| Processing layer 22--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 22 ---0.011804303154349327 |
| Processing layer 23--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 23 ---0.012257957831025124 |
| Processing layer 24--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 24 ---0.012149857357144356 |
| Processing layer 25--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 25 ---0.012290380895137787 |
| Processing layer 26--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 26 ---0.012179265730082989 |
| Processing layer 27--subset--{'mlp.gate_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.up_proj': Linear(in_features=3584, out_features=18944, bias=False), 'mlp.down_proj': Linear(in_features=18944, out_features=3584, bias=False)} |
| alpha value of layer 27 ---0.01317012868821621 |
| metric_name coherence: [1, 2, 0, 9, 3, 4, 5, 27, 6, 17, 11, 13, 16, 12, 8, 14, 10, 15, 19, 25, 18, 23, 7, 26, 24, 20, 21, 22] |
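The layer order printed on the `metric_name coherence` line appears to be the layer indices sorted by descending per-layer value (the log labels each one `alpha value`). This is inferred from the numbers in this log only, not from `expert_number.py`. A minimal sketch that reproduces the ordering from the logged values, with `rank_layers_desc` as a hypothetical helper name:

```python
# Per-layer values copied from the "alpha value of layer N" lines above (layers 0-27).
alphas = [
    0.030741512775421143, 0.0886889323592186, 0.038115616887807846,
    0.01555887795984745, 0.015330223366618156, 0.014795559458434582,
    0.013012934476137161, 0.012255651876330376, 0.012662074528634548,
    0.017650291323661804, 0.012567928992211819, 0.012772100046277046,
    0.012693522498011589, 0.012766292318701744, 0.01258667092770338,
    0.012480087578296661, 0.012708479538559914, 0.01283347513526678,
    0.01226731389760971, 0.012299998663365841, 0.01201008539646864,
    0.011824443936347961, 0.011804303154349327, 0.012257957831025124,
    0.012149857357144356, 0.012290380895137787, 0.012179265730082989,
    0.01317012868821621,
]


def rank_layers_desc(values):
    """Return layer indices ordered from largest to smallest value.

    Hypothetical helper: assumes the script ranks layers by descending metric.
    """
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


ranking = rank_layers_desc(alphas)
print(ranking)
```

Under that assumption, layers 1, 2 and 0 rank first because their values (about 0.089, 0.038 and 0.031) are well above the rest, which all lie between roughly 0.0118 and 0.0177.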
|
|