nohup: ignoring input Namespace(save_dir='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg', self_attn_layer_to_quant='23 22 25 24 26', mlp_layer_to_quant='27 16 19', model_id='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen/Qwen2.5-7B', cuda_id=6) `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2061718:2062420 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2061719:2062421 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2061719:2062421 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2061719:2062421 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2061718:2062420 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2061718:2062420 [0] NCCL INFO Using network IB n136-128-154:2061719:2062421 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2061719:2062421 [1] NCCL INFO Using network IB n136-128-154:2061719:2062421 [1] NCCL INFO ncclCommInitRankConfig comm 0x11503200 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x57372e7e365f65e - Init START n136-128-154:2061718:2062420 [0] NCCL INFO ncclCommInitRankConfig comm 0x109f5100 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x57372e7e365f65e - Init START n136-128-154:2061719:2062421 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2061718:2062420 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2061718:2062420 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2061718:2062420 [0] NCCL INFO Retrieving state for IB n136-128-154:2061718:2062420 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2061718:2062420 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) 
n136-128-154:2061719:2062421 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2061719:2062421 [1] NCCL INFO Retrieving state for IB n136-128-154:2061719:2062421 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2061718:2062420 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2061719:2062421 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2061718:2062420 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2061719:2062421 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2061718:2062420 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2061719:2062421 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2061719:2062421 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2061719:2062421 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2061719:2062421 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 
(distance 5 <= 5), read 0 mode Default n136-128-154:2061719:2062421 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2061719:2062421 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2061719:2062421 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2061719:2062421 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2061718:2062420 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2061718:2062420 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2061719:2062421 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2061719:2062421 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2061719:2062421 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2061719:2062421 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2061719:2062421 [1] NCCL INFO ========================================== n136-128-154:2061719:2062421 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2061719:2062421 [1] NCCL 
INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2061719:2062421 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2061718:2062420 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2061718:2062420 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2061718:2062420 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2061718:2062420 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2061718:2062420 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2061718:2062420 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2061718:2062420 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2061718:2062420 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2061718:2062420 [0] NCCL INFO ========================================== n136-128-154:2061718:2062420 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2061718:2062420 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 
(0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2061718:2062420 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2061719:2062421 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2061719:2062421 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2061719:2062421 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061719:2062421 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061719:2062421 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061719:2062421 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061719:2062421 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2061719:2062421 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2061719:2062421 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2061718:2062420 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2061718:2062420 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2061718:2062420 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061718:2062420 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2061719:2062421 [1] NCCL INFO comm 0x11503200 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2061718:2062420 [0] NCCL INFO comm 0x109f5100 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 
5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061718:2062420 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 
00/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 20 : 0 -> 1 
-> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2061719:2062421 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2061719:2062421 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 07 : 1 -> 0 
-> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2061718:2062420 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2061718:2062420 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2061718:2062420 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2061718:2062420 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2061718:2062433 [0] NCCL INFO [Proxy Service] Device 0 CPU core 42 n136-128-154:2061718:2062434 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 60 n136-128-154:2061719:2062421 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2061719:2062435 [1] NCCL INFO [Proxy Service] Device 1 CPU core 38 n136-128-154:2061719:2062436 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 53 n136-128-154:2061718:2062420 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2061719:2062421 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2061718:2062420 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2061719:2062421 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2061718:2062420 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2061718:2062420 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2061718:2062420 [0] NCCL INFO ncclCommInitRankConfig comm 0x109f5100 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x57372e7e365f65e - Init COMPLETE n136-128-154:2061718:2062420 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.88 (kernels 0.23, alloc 0.42, bootstrap 0.00, allgathers 0.01, topo 0.09, graphs 0.00, connections 0.04, rest 0.09) n136-128-154:2061719:2062421 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2061719:2062421 [1] NCCL INFO ncclCommInitRankConfig comm 0x11503200 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x57372e7e365f65e - Init COMPLETE n136-128-154:2061719:2062421 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.88 (kernels 0.23, alloc 0.42, bootstrap 0.00, allgathers 0.01, topo 0.09, graphs 0.00, connections 0.03, rest 0.09) n136-128-154:2061718:2062437 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2061718:2062437 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062438 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061718:2062437 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2061719:2062438 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-08:17:45:49 INFO [evaluator:559] Running loglikelihood requests 2025-12-08:17:45:49 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1268 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2061719:2062496 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2061719:2062508 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2061719:2062435 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-08:17:46:21 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (64)
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.5588|±  | 0.014|

[rank0]:[W1208 17:46:22.406339430 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2061718:2062433 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2061718:2062564 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2061719:2062435 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2061719:2062508 [1] NCCL INFO comm 0x11503200 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2061718:2062564 [0] NCCL INFO comm 0x109f5100 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task winogrande evaluation complete!
==================================================
Starting evaluation: task=gsm8k | few-shot count=4 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/gsm8k.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-08:17:47:43 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-08:17:47:43 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-08:17:47:43 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-08:17:47:43 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-08:17:47:43 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-08:17:47:43 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-08:17:47:44 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-08:17:47:45 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-08:17:48:01 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-08:17:48:01 INFO [api.task:434] Building contexts for gsm8k on rank 0... 
0%| | 0/660 [00:00 n136-128-154:2062874:2063938 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2062874:2063938 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2062874:2063938 [0] NCCL INFO Using network IB n136-128-154:2062874:2063938 [0] NCCL INFO ncclCommInitRankConfig comm 0xf5bdd20 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x3b2c48198e38d0ad - Init START 2025-12-08:17:48:04 INFO [evaluator:290] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-08:17:48:04 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-08:17:48:04 INFO [api.task:434] Building contexts for gsm8k on rank 1... 0%| | 0/659 [00:00 n136-128-154:2062875:2064076 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2062875:2064076 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2062875:2064076 [1] NCCL INFO Using network IB n136-128-154:2062875:2064076 [1] NCCL INFO ncclCommInitRankConfig comm 0x104d2320 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x3b2c48198e38d0ad - Init START n136-128-154:2062875:2064076 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2062874:2063938 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2062875:2064076 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2062875:2064076 [1] NCCL INFO Retrieving state for IB n136-128-154:2062875:2064076 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2062875:2064076 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2062875:2064076 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2062875:2064076 [1] NCCL INFO 
ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2062875:2064076 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2062874:2063938 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2062874:2063938 [0] NCCL INFO Retrieving state for IB n136-128-154:2062874:2063938 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2062874:2063938 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2062874:2063938 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2062874:2063938 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2062874:2063938 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2062874:2063938 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2062874:2063938 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2062874:2063938 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2062874:2063938 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) 
n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2062874:2063938 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2062874:2063938 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2062874:2063938 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2062874:2063938 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2062874:2063938 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2062874:2063938 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2062874:2063938 [0] NCCL INFO ========================================== n136-128-154:2062874:2063938 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2062874:2063938 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2062874:2063938 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2062874:2063938 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2062874:2063938 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 
n136-128-154:2062874:2063938 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2062874:2063938 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062874:2063938 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2062875:2064076 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2062875:2064076 [1] NCCL INFO === 
System : maxBw 240.0 totalBw 240.0 === n136-128-154:2062875:2064076 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2062875:2064076 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2062875:2064076 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2062875:2064076 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2062875:2064076 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2062875:2064076 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2062875:2064076 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2062875:2064076 [1] NCCL INFO ========================================== n136-128-154:2062875:2064076 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2062875:2064076 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2062875:2064076 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2062875:2064076 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type 
NVL/PIX, sameChannels 1 n136-128-154:2062875:2064076 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2062875:2064076 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2062875:2064076 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062875:2064076 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2062874:2063938 [0] NCCL INFO comm 0xf5bdd20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 
n136-128-154:2062875:2064076 [1] NCCL INFO comm 0x104d2320 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 
n136-128-154:2062875:2064076 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062875:2064076 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL 
INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 
19/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2062875:2064076 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2062875:2064076 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2062875:2064076 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL 
INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2062874:2063938 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2062874:2063938 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2062874:2063938 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2062874:2063938 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2062874:2064096 [0] NCCL INFO [Proxy Service] Device 0 CPU core 38 n136-128-154:2062874:2064097 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 39 n136-128-154:2062875:2064076 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2062875:2064098 [1] NCCL INFO [Proxy Service] Device 1 CPU core 60 n136-128-154:2062875:2064099 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 61 n136-128-154:2062874:2063938 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2062874:2063938 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2062875:2064076 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2062875:2064076 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2062874:2063938 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2062874:2063938 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2062874:2063938 [0] NCCL INFO ncclCommInitRankConfig comm 0xf5bdd20 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x3b2c48198e38d0ad - Init COMPLETE n136-128-154:2062874:2063938 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 3.78 (kernels 0.26, alloc 0.11, bootstrap 3.20, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.09, rest 0.05) n136-128-154:2062875:2064076 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2062875:2064076 [1] NCCL INFO ncclCommInitRankConfig comm 0x104d2320 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x3b2c48198e38d0ad - Init COMPLETE n136-128-154:2062875:2064076 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.48 (kernels 0.17, alloc 0.11, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.08, rest 0.05) n136-128-154:2062874:2064102 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2062874:2064102 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2062875:2064101 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062874:2064102 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2062875:2064101 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-08:17:48:07 INFO [evaluator:559] Running generate_until requests 2025-12-08:17:48:07 INFO [evaluator:559] Running generate_until requests Passed argument batch_size = auto. 
Detecting largest batch size Running generate_until requests: 0%| | 0/660 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2062875:2118887 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2062875:2118887 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2062875:2064098 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:02:30:37 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     4|exact_match|↑  |0.0053|±  | 0.002|
|     |       |strict-match    |     4|exact_match|↑  |0.0000|±  | 0.000|
[rank0]:[W1209 02:30:38.271385879 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2062874:2064096 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2062874:2118944 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2062875:2064098 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2062875:2118893 [1] NCCL INFO comm 0x104d2320 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2062874:2118944 [0] NCCL INFO comm 0xf5bdd20 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task gsm8k evaluation complete!
==================================================
Starting evaluation: task=boolq | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/boolq.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:02:32:02 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:02:32:02 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:02:32:02 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:32:02 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:32:02 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:32:02 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:32:03 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:32:04 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 
2025-12-09:02:32:04 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2118998:2119418 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2118997:2119417 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2118997:2119417 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2118997:2119417 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2118998:2119418 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2118998:2119418 [1] NCCL INFO Using network IB n136-128-154:2118997:2119417 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2118997:2119417 [0] NCCL INFO Using network IB n136-128-154:2118998:2119418 [1] NCCL INFO ncclCommInitRankConfig comm 0x108a51d0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xd79e75f1643ca660 - Init START n136-128-154:2118997:2119417 [0] NCCL INFO ncclCommInitRankConfig comm 0x103454d0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xd79e75f1643ca660 - Init START n136-128-154:2118998:2119418 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2118997:2119417 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2118997:2119417 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2118997:2119417 [0] NCCL INFO Retrieving state for IB n136-128-154:2118997:2119417 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2118997:2119417 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO TOPO/NET : Importing network plugins 
to topology n136-128-154:2118998:2119418 [1] NCCL INFO Retrieving state for IB n136-128-154:2118998:2119418 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2118997:2119417 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2118997:2119417 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2118997:2119417 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2118998:2119418 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2118998:2119418 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2118997:2119417 [0] NCCL INFO GPU 
Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2118997:2119417 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2118997:2119417 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2118997:2119417 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2118997:2119417 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2118997:2119417 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2118997:2119417 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2118997:2119417 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2118997:2119417 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2118997:2119417 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2118997:2119417 [0] NCCL INFO ========================================== n136-128-154:2118997:2119417 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2118997:2119417 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 
(1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2118997:2119417 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2118998:2119418 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2118998:2119418 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2118998:2119418 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2118998:2119418 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2118998:2119418 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2118998:2119418 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2118998:2119418 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2118998:2119418 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2118998:2119418 [1] NCCL INFO ========================================== n136-128-154:2118998:2119418 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2118998:2119418 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) 
n136-128-154:2118998:2119418 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2118997:2119417 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2118997:2119417 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2118997:2119417 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118997:2119417 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118997:2119417 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118997:2119417 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118997:2119417 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118997:2119417 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2118997:2119417 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118998:2119418 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2118998:2119418 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2118998:2119418 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2118998:2119418 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118998:2119418 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118998:2119418 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118998:2119418 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118998:2119418 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2118998:2119418 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2118997:2119417 [0] NCCL INFO comm 0x103454d0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2118998:2119418 [1] NCCL INFO comm 0x108a51d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 
17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2118998:2119418 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 02/24 : 0 1 
n136-128-154:2118998:2119418 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 18/24 : 0 1 
n136-128-154:2118998:2119418 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2118998:2119418 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2118998:2119418 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2118998:2119418 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 08 : 1 
-> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2118997:2119417 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2118997:2119417 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2118998:2119418 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2118998:2119429 [1] NCCL INFO [Proxy Service] Device 1 CPU core 112 n136-128-154:2118998:2119430 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 49 n136-128-154:2118997:2119417 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2118997:2119417 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2118997:2119431 [0] NCCL INFO [Proxy Service] Device 0 CPU core 34 n136-128-154:2118997:2119432 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 99 n136-128-154:2118998:2119418 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2118998:2119418 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2118997:2119417 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2118997:2119417 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2118997:2119417 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2118998:2119418 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2118998:2119418 [1] NCCL INFO ncclCommInitRankConfig comm 0x108a51d0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xd79e75f1643ca660 - Init COMPLETE n136-128-154:2118998:2119418 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.60 (kernels 0.28, alloc 0.18, bootstrap 0.00, allgathers 0.00, topo 0.10, graphs 0.00, connections 0.01, rest 0.03) n136-128-154:2118997:2119417 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2118997:2119417 [0] NCCL INFO ncclCommInitRankConfig comm 0x103454d0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xd79e75f1643ca660 - Init COMPLETE n136-128-154:2118997:2119417 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.60 (kernels 0.28, alloc 0.18, bootstrap 0.00, allgathers 0.00, topo 0.10, graphs 0.00, connections 0.01, rest 0.03) n136-128-154:2118997:2119433 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2118997:2119433 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118997:2119433 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119434 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2118997:2119433 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:02:32:37 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:02:32:37 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/3270 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2118998:2119547 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2118998:2119547 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2118998:2119429 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:02:34:08 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (45)
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|boolq|      2|none  |     0|acc   |↑  |0.6028|±  |0.0086|
[rank0]:[W1209 02:34:09.673844874 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2118997:2119431 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2118997:2119623 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2118998:2119429 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2118998:2119553 [1] NCCL INFO comm 0x108a51d0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2118997:2119623 [0] NCCL INFO comm 0x103454d0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task boolq evaluation complete!
==================================================
Starting evaluation: task=arc_challenge | num_fewshot=25 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/arc_challenge.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:02:35:36 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:02:35:36 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:02:35:36 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:35:36 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:35:36 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:35:36 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:35:37 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:35:38 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:35:38 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2119679:2119909 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2119679:2119909 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2119679:2119909 [0] NCCL INFO Using network IB n136-128-154:2119679:2119909 [0] NCCL INFO ncclCommInitRankConfig comm 0x10f76950 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x6ebd969f77d8e249 - Init START 94%|█████████▎| 548/586 [00:15<00:01, 36.86it/s] 94%|█████████▍| 552/586 [00:15<00:00, 36.91it/s] 95%|█████████▍| 556/586 [00:15<00:00, 36.87it/s] 96%|█████████▌| 560/586 [00:15<00:00, 36.32it/s] 96%|█████████▌| 564/586 [00:15<00:00, 36.24it/s] 97%|█████████▋| 568/586 [00:15<00:00, 36.22it/s] 98%|█████████▊| 572/586 [00:15<00:00, 36.22it/s] 98%|█████████▊| 576/586 [00:15<00:00, 36.25it/s] 99%|█████████▉| 580/586 [00:15<00:00, 36.26it/s] 100%|█████████▉| 584/586 [00:16<00:00, 36.22it/s] 100%|██████████| 586/586 [00:16<00:00, 36.44it/s] n136-128-154:2119680:2119680 [1] NCCL INFO cudaDriverVersion 12040 n136-128-154:2119680:2119680 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2119680:2119680 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2119680:2119680 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 n136-128-154:2119680:2119680 [1] NCCL INFO Comm config Blocking set to 1 n136-128-154:2119680:2119917 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:2119680:2119917 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. 
n136-128-154:2119680:2119917 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2119680:2119917 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2119680:2119917 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:2119680:2119917 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2119680:2119917 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2119680:2119917 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2119680:2119917 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2119680:2119917 [1] NCCL INFO Using network IB n136-128-154:2119680:2119917 [1] NCCL INFO ncclCommInitRankConfig comm 0x10d38510 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x6ebd969f77d8e249 - Init START n136-128-154:2119680:2119917 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2119679:2119909 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2119680:2119917 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2119680:2119917 [1] NCCL INFO Retrieving state for IB n136-128-154:2119680:2119917 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2119680:2119917 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2119680:2119917 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2119680:2119917 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2119680:2119917 [1] NCCL INFO 
ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2119679:2119909 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2119679:2119909 [0] NCCL INFO Retrieving state for IB n136-128-154:2119679:2119909 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2119679:2119909 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2119679:2119909 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2119679:2119909 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2119679:2119909 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2119680:2119917 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2119680:2119917 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2119679:2119909 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2119679:2119909 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2119679:2119909 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2119679:2119909 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - 
PCI/0-83000 (1000c0101000ffff) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2119679:2119909 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2119679:2119909 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2119679:2119909 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2119679:2119909 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2119679:2119909 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2119679:2119909 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2119679:2119909 [0] NCCL INFO ========================================== n136-128-154:2119679:2119909 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2119679:2119909 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2119679:2119909 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2119680:2119917 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2119680:2119917 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2119680:2119917 [1] 
NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2119680:2119917 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2119680:2119917 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2119680:2119917 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2119680:2119917 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2119680:2119917 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2119680:2119917 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2119680:2119917 [1] NCCL INFO ========================================== n136-128-154:2119680:2119917 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2119680:2119917 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2119680:2119917 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2119680:2119917 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2119680:2119917 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 2 : 
GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2119680:2119917 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2119680:2119917 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 
n136-128-154:2119680:2119917 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119680:2119917 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119680:2119917 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2119680:2119917 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119680:2119917 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2119679:2119909 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119679:2119909 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2119680:2119917 [1] NCCL INFO comm 0x10d38510 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2119679:2119909 [0] NCCL INFO comm 0x10f76950 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 12 : 0 -> 1 -> 
-1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 19 : -1 -> 1 
-> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119680:2119917 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 05/24 
: 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 
n136-128-154:2119679:2119909 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2119679:2119909 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2119680:2119917 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2119680:2119917 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2119680:2119917 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 15 
: 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2119679:2119909 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2119679:2119909 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2119680:2119917 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2119680:2119924 [1] NCCL INFO [Proxy Service] Device 1 CPU core 43 n136-128-154:2119680:2119925 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 48 n136-128-154:2119679:2119909 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2119679:2119909 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2119679:2119927 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 42 n136-128-154:2119679:2119926 [0] NCCL INFO [Proxy Service] Device 0 CPU core 41 n136-128-154:2119679:2119909 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2119679:2119909 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2119679:2119909 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2119680:2119917 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2119680:2119917 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2119680:2119917 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2119680:2119917 [1] NCCL INFO ncclCommInitRankConfig comm 0x10d38510 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x6ebd969f77d8e249 - Init COMPLETE n136-128-154:2119680:2119917 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.60 (kernels 0.11, alloc 0.20, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.07, rest 0.15) n136-128-154:2119679:2119909 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2119679:2119909 [0] NCCL INFO ncclCommInitRankConfig comm 0x10f76950 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x6ebd969f77d8e249 - Init COMPLETE n136-128-154:2119679:2119909 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 2.12 (kernels 0.17, alloc 0.26, bootstrap 1.40, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.06, rest 0.15) n136-128-154:2119680:2119928 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2119679:2119929 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119679:2119929 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2119680:2119928 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2119679:2119929 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:02:36:23 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:02:36:23 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/2344 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2119680:2120175 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2119680:2120175 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2119680:2119924 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-09:02:40:12 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 25, batch_size: auto (45)
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |    25|acc     |↑  |0.4454|±  |0.0145|
|             |       |none  |    25|acc_norm|↑  |0.4701|±  |0.0146|
[rank0]:[W1209 02:40:13.746938469 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2119679:2119926 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2119679:2120230 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2119680:2119924 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2119680:2120180 [1] NCCL INFO comm 0x10d38510 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2119679:2120230 [0] NCCL INFO comm 0x10f76950 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task arc_challenge evaluation complete!
==================================================
Starting evaluation: task=truthfulqa_mc1 | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/truthfulqa_mc1.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:02:41:38 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:02:41:38 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:02:41:38 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:41:38 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:41:38 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:41:38 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:41:40 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:41:40 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:41:40 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2120285:2120535 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2120285:2120535 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2120285:2120535 [0] NCCL INFO Using network IB n136-128-154:2120285:2120535 [0] NCCL INFO ncclCommInitRankConfig comm 0xef94000 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xb329e6d7aedcd68a - Init START 2025-12-09:02:42:06 INFO [evaluator:305] num_fewshot has been set to 0 for truthfulqa_mc1 in its config. Manual configuration will be ignored. 2025-12-09:02:42:06 INFO [api.task:434] Building contexts for truthfulqa_mc1 on rank 1... 0%| | 0/408 [00:00 n136-128-154:2120286:2120589 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2120286:2120589 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2120286:2120589 [1] NCCL INFO Using network IB n136-128-154:2120286:2120589 [1] NCCL INFO ncclCommInitRankConfig comm 0xf659f40 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xb329e6d7aedcd68a - Init START n136-128-154:2120286:2120589 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2120285:2120535 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2120286:2120589 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2120286:2120589 [1] NCCL INFO Retrieving state for IB n136-128-154:2120286:2120589 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2120286:2120589 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 
coll=(null) n136-128-154:2120286:2120589 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2120286:2120589 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2120286:2120589 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2120285:2120535 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2120285:2120535 [0] NCCL INFO Retrieving state for IB n136-128-154:2120285:2120535 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2120285:2120535 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2120285:2120535 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2120285:2120535 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2120285:2120535 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2120286:2120589 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120286:2120589 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / 
HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120285:2120535 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120285:2120535 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120286:2120589 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2120286:2120589 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2120286:2120589 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2120286:2120589 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2120286:2120589 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2120285:2120535 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2120286:2120589 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2120285:2120535 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2120286:2120589 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] 
- PCI/0-bf000 (1000c0101000ffff) n136-128-154:2120286:2120589 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2120286:2120589 [1] NCCL INFO ========================================== n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2120285:2120535 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120286:2120589 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2120286:2120589 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120285:2120535 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2120285:2120535 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2120285:2120535 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2120286:2120589 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2120285:2120535 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2120285:2120535 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2120285:2120535 [0] NCCL INFO ========================================== n136-128-154:2120285:2120535 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120285:2120535 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 
(0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120285:2120535 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2120285:2120535 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2120286:2120589 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2120285:2120535 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2120285:2120535 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120285:2120535 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120285:2120535 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2120285:2120535 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120285:2120535 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120285:2120535 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120285:2120535 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120286:2120589 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2120286:2120589 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120286:2120589 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120286:2120589 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120286:2120589 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120286:2120589 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2120286:2120589 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120286:2120589 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120285:2120535 [0] NCCL INFO comm 0xef94000 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2120286:2120589 [1] NCCL INFO comm 0xf659f40 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 17 
: -1 -> 0 -> 1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120286:2120589 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 02/24 : 0 
1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 18/24 : 0 1 
n136-128-154:2120286:2120589 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2120286:2120589 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2120286:2120589 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2120286:2120589 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 07 : 1 
-> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2120285:2120535 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2120285:2120535 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2120286:2120589 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2120286:2120596 [1] NCCL INFO [Proxy Service] Device 1 CPU core 41 n136-128-154:2120286:2120597 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 42 n136-128-154:2120285:2120535 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2120285:2120535 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2120285:2120598 [0] NCCL INFO [Proxy Service] Device 0 CPU core 116 n136-128-154:2120285:2120599 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 114 n136-128-154:2120285:2120535 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2120285:2120535 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2120285:2120535 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2120286:2120589 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2120286:2120589 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2120285:2120535 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2120285:2120535 [0] NCCL INFO ncclCommInitRankConfig comm 0xef94000 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xb329e6d7aedcd68a - Init COMPLETE n136-128-154:2120285:2120535 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 2.20 (kernels 0.20, alloc 0.10, bootstrap 1.67, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.14, rest 0.04) n136-128-154:2120286:2120589 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2120286:2120589 [1] NCCL INFO ncclCommInitRankConfig comm 0xf659f40 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xb329e6d7aedcd68a - Init COMPLETE n136-128-154:2120286:2120589 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.47 (kernels 0.13, alloc 0.10, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.15, rest 0.03) n136-128-154:2120286:2120602 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2120286:2120602 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120286:2120602 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120285:2120601 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2120286:2120602 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:02:42:07 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:02:42:07 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/2066 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120286:2120754 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120286:2120596 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:02:43:00 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|    Tasks     |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------------|------:|------|-----:|------|---|----:|---|-----:|
|truthfulqa_mc1|      2|none  |     0|acc   |↑  |0.328|±  |0.0164|
[rank0]:[W1209 02:43:00.131972389 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120285:2120598 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120285:2120809 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120286:2120596 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2120286:2120760 [1] NCCL INFO comm 0xf659f40 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2120285:2120809 [0] NCCL INFO comm 0xef94000 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task truthfulqa_mc1 evaluation complete!
==================================================
Starting evaluation: task=piqa | num_fewshot=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/piqa.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:02:44:24 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:02:44:24 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:02:44:24 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:44:24 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:44:24 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:02:44:24 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:02:44:26 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:02:44:26 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 
2025-12-09:02:44:26 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2120874:2121105 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2120874:2121105 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2120874:2121105 [1] NCCL INFO Using network IB n136-128-154:2120873:2121104 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2120873:2121104 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2120873:2121104 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2120873:2121104 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2120873:2121104 [0] NCCL INFO Using network IB n136-128-154:2120874:2121105 [1] NCCL INFO ncclCommInitRankConfig comm 0x10603ec0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x74864cb0f4d1c634 - Init START n136-128-154:2120873:2121104 [0] NCCL INFO ncclCommInitRankConfig comm 0xee84980 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x74864cb0f4d1c634 - Init START n136-128-154:2120874:2121105 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2120873:2121104 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2120874:2121105 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2120874:2121105 [1] NCCL INFO Retrieving state for IB n136-128-154:2120874:2121105 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2120874:2121105 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2120874:2121105 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 
in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2120874:2121105 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2120874:2121105 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2120873:2121104 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2120873:2121104 [0] NCCL INFO Retrieving state for IB n136-128-154:2120873:2121104 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2120873:2121104 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2120873:2121104 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2120873:2121104 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2120873:2121104 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2120873:2121104 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120873:2121104 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120873:2121104 [0] NCCL INFO === 
System : maxBw 240.0 totalBw 240.0 === n136-128-154:2120873:2121104 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2120873:2121104 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2120873:2121104 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2120873:2121104 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2120873:2121104 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2120873:2121104 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2120873:2121104 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2120873:2121104 [0] NCCL INFO ========================================== n136-128-154:2120873:2121104 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120873:2121104 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120873:2121104 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2120873:2121104 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type 
NVL/PIX, sameChannels 1 n136-128-154:2120873:2121104 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2120873:2121104 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120873:2121104 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120873:2121104 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120873:2121104 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120873:2121104 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120873:2121104 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120873:2121104 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default 
n136-128-154:2120874:2121105 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2120874:2121105 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2120874:2121105 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2120874:2121105 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2120874:2121105 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2120874:2121105 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2120874:2121105 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2120874:2121105 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2120874:2121105 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2120874:2121105 [1] NCCL INFO ========================================== n136-128-154:2120874:2121105 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120874:2121105 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2120874:2121105 
[1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2120874:2121105 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2120874:2121105 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2120874:2121105 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2120874:2121105 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL 
INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2120874:2121105 [1] NCCL INFO comm 0x10603ec0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2120873:2121104 [0] NCCL INFO comm 0xee84980 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 
n136-128-154:2120874:2121105 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2120873:2121104 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 01/24 : 0 1 
n136-128-154:2120874:2121105 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 
n136-128-154:2120873:2121104 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2120873:2121104 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2120874:2121105 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2120874:2121105 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 08 : 1 -> 
0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2120873:2121104 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2120873:2121104 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2120873:2121104 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2120873:2121104 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2120873:2121117 [0] NCCL INFO [Proxy Service] Device 0 CPU core 98 n136-128-154:2120873:2121118 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 99 n136-128-154:2120874:2121105 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2120874:2121119 [1] NCCL INFO [Proxy Service] Device 1 CPU core 118 n136-128-154:2120874:2121120 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 119 n136-128-154:2120873:2121104 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2120873:2121104 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2120873:2121104 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2120874:2121105 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2120874:2121105 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2120874:2121105 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2120874:2121105 [1] NCCL INFO ncclCommInitRankConfig comm 0x10603ec0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x74864cb0f4d1c634 - Init COMPLETE n136-128-154:2120874:2121105 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.74 (kernels 0.11, alloc 0.39, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.03, rest 0.14) n136-128-154:2120873:2121104 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2120873:2121104 [0] NCCL INFO ncclCommInitRankConfig comm 0xee84980 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x74864cb0f4d1c634 - Init COMPLETE n136-128-154:2120873:2121104 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.74 (kernels 0.12, alloc 0.39, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.03, rest 0.14) n136-128-154:2120874:2121121 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2120873:2121122 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120873:2121122 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2120874:2121121 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2120873:2121122 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:02:44:49 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:02:44:49 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1838 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2120874:2121150 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120874:2121119 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:02:45:17 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|piqa |      1|none  |     0|acc     |↑  |0.7051|±  |0.0106|
|     |       |none  |     0|acc_norm|↑  |0.7144|±  |0.0105|
[rank0]:[W1209 02:45:18.589236697 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120873:2121117 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2120873:2121242 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2120874:2121119 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2120874:2121156 [1] NCCL INFO comm 0x10603ec0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2120873:2121242 [0] NCCL INFO comm 0xee84980 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task piqa evaluation complete!
==================================================
Starting evaluation: task=hellaswag | few-shot count=10 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg3/hellaswag.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
	If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:02:46:42 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-09:02:46:42 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-09:02:46:42 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-09:02:46:42 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-09:02:46:42 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-09:02:46:42 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-09:02:46:44 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-09:02:46:44 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'}
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2121332:2121730 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2121332:2121730 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2121332:2121730 [0] NCCL INFO Using network IB n136-128-154:2121332:2121730 [0] NCCL INFO ncclCommInitRankConfig comm 0x46976f30 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xdce05904dd40e8bb - Init START 98%|█████████▊| 4941/5021 [00:25<00:00, 197.10it/s] 99%|█████████▉| 4961/5021 [00:25<00:00, 197.27it/s] 99%|█████████▉| 4981/5021 [00:25<00:00, 197.04it/s] 100%|█████████▉| 5001/5021 [00:25<00:00, 197.01it/s] 100%|██████████| 5021/5021 [00:25<00:00, 197.21it/s] 100%|██████████| 5021/5021 [00:25<00:00, 196.48it/s] n136-128-154:2121333:2121333 [1] NCCL INFO cudaDriverVersion 12040 n136-128-154:2121333:2121333 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2121333:2121333 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2121333:2121333 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 n136-128-154:2121333:2121333 [1] NCCL INFO Comm config Blocking set to 1 n136-128-154:2121333:2121735 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:2121333:2121735 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:2121333:2121735 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2121333:2121735 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2121333:2121735 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:2121333:2121735 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:2121333:2121735 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2121333:2121735 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2121333:2121735 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2121333:2121735 [1] NCCL INFO Using network IB n136-128-154:2121333:2121735 [1] NCCL INFO ncclCommInitRankConfig comm 0xec85c30 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xdce05904dd40e8bb - Init START n136-128-154:2121333:2121735 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2121332:2121730 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2121333:2121735 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2121333:2121735 [1] NCCL INFO Retrieving state for IB n136-128-154:2121333:2121735 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2121333:2121735 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2121333:2121735 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2121333:2121735 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2121333:2121735 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2121332:2121730 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2121332:2121730 [0] NCCL INFO Retrieving state for IB 
n136-128-154:2121332:2121730 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2121332:2121730 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2121332:2121730 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2121332:2121730 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2121332:2121730 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2121333:2121735 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2121333:2121735 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2121332:2121730 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2121332:2121730 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2121332:2121730 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2121332:2121730 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - 
GPU/0-c5000 (0) n136-128-154:2121332:2121730 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2121332:2121730 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2121332:2121730 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2121332:2121730 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2121332:2121730 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2121332:2121730 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2121332:2121730 [0] NCCL INFO ========================================== n136-128-154:2121332:2121730 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2121333:2121735 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2121332:2121730 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2121333:2121735 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2121332:2121730 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2121333:2121735 [1] NCCL INFO + 
NVL[240.0] - NVS/0-0 n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2121333:2121735 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2121333:2121735 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2121333:2121735 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2121333:2121735 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2121333:2121735 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2121333:2121735 [1] NCCL INFO ========================================== n136-128-154:2121333:2121735 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2121333:2121735 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2121333:2121735 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2121332:2121730 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2121332:2121730 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] 
NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2121332:2121730 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121332:2121730 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2121333:2121735 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 7 : GPU/0-c5000 
GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2121333:2121735 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2121333:2121735 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121333:2121735 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2121332:2121730 [0] NCCL INFO comm 0x46976f30 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2121333:2121735 [1] NCCL INFO comm 0xec85c30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 13 : 
0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 20 
: -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121333:2121735 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 
n136-128-154:2121332:2121730 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 
n136-128-154:2121333:2121735 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2121333:2121735 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2121333:2121735 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2121333:2121735 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO 
Ring 20 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2121332:2121730 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2121332:2121730 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2121333:2121735 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2121333:2121743 [1] NCCL INFO [Proxy Service] Device 1 CPU core 40 n136-128-154:2121333:2121744 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 105 n136-128-154:2121332:2121730 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2121332:2121730 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2121332:2121745 [0] NCCL INFO [Proxy Service] Device 0 CPU core 100 n136-128-154:2121332:2121746 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 38 n136-128-154:2121333:2121735 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2121333:2121735 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2121332:2121730 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2121332:2121730 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2121332:2121730 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2121332:2121730 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2121332:2121730 [0] NCCL INFO ncclCommInitRankConfig comm 0x46976f30 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xdce05904dd40e8bb - Init COMPLETE n136-128-154:2121332:2121730 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 1.32 (kernels 0.13, alloc 0.20, bootstrap 0.77, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.01, rest 0.13) n136-128-154:2121333:2121735 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2121333:2121735 [1] NCCL INFO ncclCommInitRankConfig comm 0xec85c30 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xdce05904dd40e8bb - Init COMPLETE n136-128-154:2121333:2121735 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.55 (kernels 0.12, alloc 0.21, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.01, rest 0.13) n136-128-154:2121333:2121747 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2121332:2121748 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121332:2121748 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2121333:2121747 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2121332:2121748 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:02:47:39 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:02:47:39 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/20084 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2121333:2123507 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2121333:2123507 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2121333:2121743 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:03:11:30 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 10, batch_size: auto (57)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |    10|acc     |↑  |0.4808|±  |0.0050|
|         |       |none  |    10|acc_norm|↑  |0.6582|±  |0.0047|
[rank0]:[W1209 03:11:31.613980955 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2121332:2121745 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2121332:2123566 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2121333:2121743 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2121333:2123511 [1] NCCL INFO comm 0xec85c30 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2121332:2123566 [0] NCCL INFO comm 0x46976f30 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task hellaswag evaluation complete!
Execution time: 34353.5 seconds
python: can't open file '/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/main_fg.py': [Errno 2] No such file or directory
==================================================
Starting evaluation: task=winogrande | few-shot count=5 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg4/winogrande.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:03:12:57 INFO [__main__:440] Selected Tasks: ['winogrande'] 2025-12-09:03:12:57 INFO [__main__:440] Selected Tasks: ['winogrande'] 2025-12-09:03:12:57 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:12:57 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:12:57 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:12:57 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:12:59 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. 
Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 
'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:12:59.027525201 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) W1209 03:13:01.658000 2123579 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 2123629 closing signal SIGTERM E1209 03:13:01.693000 2123579 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 2123630) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED 
------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:13:01 host : n136-128-154.byted.org rank : 1 (local_rank: 1) exitcode : 1 (pid: 2123630) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task winogrande failed! ================================================== Starting evaluation: task=gsm8k | num_fewshot=4 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/gsm8k.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 
2025-12-09:03:14:18 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-09:03:14:18 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:14:18 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:14:18 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-09:03:14:18 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:14:18 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:14:20 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. 
Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 
'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:14:20.786197923 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) E1209 03:14:22.623000 2123829 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 2123880) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED ------------------------------------------------------------ Failures: [1]: time : 2025-12-09_03:14:22 host : n136-128-154.byted.org rank : 1 
(local_rank: 1) exitcode : 1 (pid: 2123881) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:14:22 host : n136-128-154.byted.org rank : 0 (local_rank: 0) exitcode : 1 (pid: 2123880) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task gsm8k failed! ================================================== Starting evaluation: task=boolq | num_fewshot=0 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/boolq.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 
2025-12-09:03:15:42 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:03:15:42 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:03:15:42 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:15:42 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:15:42 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:15:42 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file 2025-12-09:03:15:44 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. 
[rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: 
_get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:15:44.908560853 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) W1209 03:15:46.360000 2124043 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 2124095 closing signal SIGTERM E1209 03:15:46.425000 2124043 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 2124096) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED 
------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:15:46 host : n136-128-154.byted.org rank : 1 (local_rank: 1) exitcode : 1 (pid: 2124096) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task boolq failed! ================================================== Starting evaluation: task=arc_challenge | few-shot count=25 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/arc_challenge.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 
2025-12-09:03:17:06 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:03:17:06 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:03:17:06 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:17:06 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:17:06 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:17:06 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file 2025-12-09:03:17:07 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. 
[rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: 
_get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:17:08.627493384 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) W1209 03:17:10.309000 2124261 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 2124312 closing signal SIGTERM E1209 03:17:10.344000 2124261 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 2124311) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED 
------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:17:10 host : n136-128-154.byted.org rank : 0 (local_rank: 0) exitcode : 1 (pid: 2124311) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task arc_challenge failed! ================================================== Starting evaluation: task=truthfulqa_mc1 | few-shot count=0 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/truthfulqa_mc1.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 
2025-12-09:03:18:31 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:03:18:31 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:03:18:31 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:18:31 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:18:31 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:18:31 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:18:32 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. 
Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 
'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:18:33.427657854 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) W1209 03:18:35.243000 2124506 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 2124555 closing signal SIGTERM E1209 03:18:35.261000 2124506 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 2124556) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED 
------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:18:35 host : n136-128-154.byted.org rank : 1 (local_rank: 1) exitcode : 1 (pid: 2124556) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task truthfulqa_mc1 failed! ================================================== Starting evaluation: task=piqa | num_fewshot=0 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/piqa.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:03:19:55 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:03:19:55 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:03:19:55 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:19:55 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:19:55 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:19:55 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:19:56 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank0]: hf_hub_download( [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank0]: resolved_config_file = cached_file( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank0]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank0]: resolved_files = [ [rank0]: ^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank0]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank0]: resolved_file = try_to_load_from_cache( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank0]: validate_repo_id(arg_value) [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank0]: raise HFValidationError( [rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank0]: During handling of the above exception, another exception occurred: [rank0]: Traceback (most recent call last): [rank0]: File "", line 198, in _run_module_as_main [rank0]: File "", line 88, in _run_code [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank0]: cli_evaluate() [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank0]: results = evaluator.simple_evaluate( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank0]: return fn(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank0]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank0]: return cls(**args, **args2) [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank0]: self._get_config( [rank0]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank0]: self._config = transformers.AutoConfig.from_pretrained( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank0]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank0]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank0]: raise OSError( [rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. 
Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 
'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict [rank1]: config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict [rank1]: raise OSError( [rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file [rank0]:[W1209 03:19:57.410411438 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) E1209 03:19:59.189000 2124723 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 2124772) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python Traceback (most recent call last): File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in sys.exit(main()) ^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main args.func(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command multi_gpu_launcher(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher distrib_run.run(args) File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run elastic_launch( File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ lm_eval FAILED ------------------------------------------------------------ Failures: [1]: time : 2025-12-09_03:19:59 host : n136-128-154.byted.org rank : 1 
(local_rank: 1) exitcode : 1 (pid: 2124773) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-12-09_03:19:59 host : n136-128-154.byted.org rank : 0 (local_rank: 0) exitcode : 1 (pid: 2124772) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Warning: task piqa failed! ================================================== Starting evaluation: task=hellaswag | num_fewshot=10 | model=Qwen2.5-7B-quantization-fg Output path: results2/Qwen2.5-7B-quantization-fg/fg4/hellaswag.json ================================================== The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:03:21:21 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-09:03:21:21 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-09:03:21:21 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:21:21 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:21:21 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:03:21:21 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:03:21:22 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files [rank1]: hf_hub_download( [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. 
[rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict [rank1]: resolved_config_file = cached_file( [rank1]: ^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file [rank1]: file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files [rank1]: resolved_files = [ [rank1]: ^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in [rank1]: _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return [rank1]: resolved_file = try_to_load_from_cache( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn [rank1]: validate_repo_id(arg_value) [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id [rank1]: raise HFValidationError( [rank1]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed. [rank1]: During handling of the above exception, another exception occurred: [rank1]: Traceback (most recent call last): [rank1]: File "", line 198, in _run_module_as_main [rank1]: File "", line 88, in _run_code [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in [rank1]: cli_evaluate() [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate [rank1]: results = evaluator.simple_evaluate( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper [rank1]: return fn(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate [rank1]: lm = lm_eval.api.registry.get_model(model).create_from_arg_string( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string [rank1]: return cls(**args, **args2) [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__ [rank1]: self._get_config( [rank1]: File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config [rank1]: self._config = transformers.AutoConfig.from_pretrained( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained [rank1]: config_dict, unused_kwargs = 
PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank1]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
[rank1]:     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank1]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
[rank1]:     raise OSError(
[rank1]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files
[rank0]:     hf_hub_download(
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
[rank0]:     validate_repo_id(arg_value)
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
[rank0]:     raise HFValidationError(
[rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed.
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
[rank0]:     resolved_config_file = cached_file(
[rank0]:                            ^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 322, in cached_file
[rank0]:     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 531, in cached_files
[rank0]:     resolved_files = [
[rank0]:                      ^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 532, in
[rank0]:     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
[rank0]:     resolved_file = try_to_load_from_cache(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
[rank0]:     validate_repo_id(arg_value)
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
[rank0]:     raise HFValidationError(
[rank0]: huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. Use `repo_type` argument if needed.
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "", line 198, in _run_module_as_main
[rank0]:   File "", line 88, in _run_code
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 530, in
[rank0]:     cli_evaluate()
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/__main__.py", line 449, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/utils.py", line 439, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate
[rank0]:     lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/api/model.py", line 151, in create_from_arg_string
[rank0]:     return cls(**args, **args2)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 168, in __init__
[rank0]:     self._get_config(
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/quantization/lm-evaluation-harness/lm_eval/models/huggingface.py", line 527, in _get_config
[rank0]:     self._config = transformers.AutoConfig.from_pretrained(
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1332, in from_pretrained
[rank0]:     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank0]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
[rank0]:     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
[rank0]:     raise OSError(
[rank0]: OSError: Can't load the configuration of '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' is the correct path to a directory containing a config.json file
[rank0]:[W1209 03:21:23.252238985 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1209 03:21:25.063000 2124940 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 2124990 closing signal SIGTERM
E1209 03:21:25.096000 2124940 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 1 (pid: 2124991) of binary: /mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/python
Traceback (most recent call last):
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/bin/accelerate", line 7, in
    sys.exit(main())
             ^^^^^^
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1272, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/bn/life-mllm/users/cxr/zhy/miniconda3/envs/lm-evaluation-harness/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
lm_eval FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-09_03:21:25
  host      : n136-128-154.byted.org
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2124991)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Warning: task hellaswag failed!
Execution time: 34944.2 seconds
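
Note on the failure above: both ranks stop with the same OSError, which reports that the save directory does not contain a config.json file, so lm_eval cannot build the model config. A minimal pre-flight check that could be run before launching the evaluation is sketched below in Python. It assumes the checkpoint is a local directory; `check_model_dir` is a hypothetical helper name, not part of lm_eval or transformers.

```python
import os


def check_model_dir(path):
    """Return a list of problems found with a local model directory.

    Hypothetical helper: it only mirrors the condition named in the
    OSError message (a directory containing a config.json file).
    An empty list means that condition is satisfied.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append(f"not a directory: {path}")
    elif not os.path.isfile(os.path.join(path, "config.json")):
        problems.append(f"missing config.json in: {path}")
    return problems


# Example use before starting a long evaluation run:
#   for problem in check_model_dir(save_dir):
#       print("pre-flight check failed:", problem)
```

Running such a check first would surface a missing config.json right away rather than after the launcher has started every rank.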