nohup: ignoring input Namespace(save_dir='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg', self_attn_layer_to_quant='4 1 2 8 23', mlp_layer_to_quant='27 16 19 17 25', model_id='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen/Qwen2.5-7B', cuda_id=1) `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1658160:1658565 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1658159:1658564 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1658159:1658564 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1658159:1658564 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1658160:1658565 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1658160:1658565 [1] NCCL INFO Using network IB n136-128-154:1658159:1658564 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1658159:1658564 [0] NCCL INFO Using network IB n136-128-154:1658159:1658564 [0] NCCL INFO ncclCommInitRankConfig comm 0x10d4cd80 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x578a0970f19dc939 - Init START n136-128-154:1658160:1658565 [1] NCCL INFO ncclCommInitRankConfig comm 0x10a6b100 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x578a0970f19dc939 - Init START n136-128-154:1658160:1658565 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1658159:1658564 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1658160:1658565 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1658160:1658565 [1] NCCL INFO Retrieving state for IB n136-128-154:1658160:1658565 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1658160:1658565 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 
coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1658159:1658564 [0] NCCL INFO Retrieving state for IB n136-128-154:1658159:1658564 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1658160:1658565 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1658160:1658565 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1658160:1658565 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1658159:1658564 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658160:1658565 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / 
HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658159:1658564 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658160:1658565 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658159:1658564 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658160:1658565 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658159:1658564 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658160:1658565 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658159:1658564 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1658159:1658564 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1658159:1658564 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1658159:1658564 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658159:1658564 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1658159:1658564 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 
n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1658159:1658564 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1658159:1658564 [0] NCCL INFO ========================================== n136-128-154:1658159:1658564 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658159:1658564 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658159:1658564 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1658160:1658565 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1658160:1658565 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1658159:1658564 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1658159:1658564 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658159:1658564 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1658159:1658564 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1658159:1658564 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) 
n136-128-154:1658159:1658564 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1658159:1658564 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658159:1658564 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1658159:1658564 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1658159:1658564 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1658159:1658564 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1658159:1658564 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1658160:1658565 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1658160:1658565 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1658160:1658565 [1] NCCL INFO ========================================== n136-128-154:1658160:1658565 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658160:1658565 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658160:1658565 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1658159:1658564 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1658159:1658564 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 2 : GPU/0-16000 
GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658159:1658564 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1658160:1658565 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1658160:1658565 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1658160:1658565 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658160:1658565 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658160:1658565 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658159:1658564 [0] NCCL INFO comm 0x10d4cd80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1658160:1658565 [1] NCCL INFO comm 0x10a6b100 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 2 : 
0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 20 : -1 -> 1 -> 
0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1658160:1658565 [1] 
NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 
n136-128-154:1658160:1658565 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1658160:1658565 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1658159:1658564 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] 
-1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1658159:1658564 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1658159:1658564 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1658159:1658564 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1658159:1658583 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 82 n136-128-154:1658159:1658582 [0] NCCL INFO [Proxy Service] Device 0 CPU core 81 n136-128-154:1658160:1658565 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1658160:1658585 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 78 n136-128-154:1658160:1658584 [1] NCCL INFO [Proxy Service] Device 1 CPU core 77 n136-128-154:1658159:1658564 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1658159:1658564 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1658159:1658564 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1658160:1658565 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1658160:1658565 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1658159:1658564 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1658159:1658564 [0] NCCL INFO ncclCommInitRankConfig comm 0x10d4cd80 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x578a0970f19dc939 - Init COMPLETE n136-128-154:1658159:1658564 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 1.24 (kernels 0.24, alloc 0.70, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.01, rest 0.20) n136-128-154:1658160:1658565 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1658160:1658565 [1] NCCL INFO ncclCommInitRankConfig comm 0x10a6b100 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x578a0970f19dc939 - Init COMPLETE n136-128-154:1658160:1658565 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 1.24 (kernels 0.24, alloc 0.70, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.01, rest 0.21) n136-128-154:1658159:1658586 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via 
P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1658160:1658587 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658587 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658159:1658586 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1658160:1658587 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-05:18:05:04 INFO [evaluator:559] Running loglikelihood requests 2025-12-05:18:05:04 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1268 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658160:1658630 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:

    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658160:1658635 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658160:1658584 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-05:18:05:37 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (64)
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.4996|±  |0.0141|

[rank0]:[W1205 18:05:38.848203292 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1658159:1658582 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1658159:1658688 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1658160:1658584 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1658160:1658635 [1] NCCL INFO comm 0x10a6b100 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE n136-128-154:1658159:1658688 [0] NCCL INFO comm 0x10d4cd80 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task winogrande evaluation complete!
==================================================
Starting evaluation: task=gsm8k | num_fewshot=4 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/gsm8k.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-05:18:07:02 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-05:18:07:02 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-05:18:07:02 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-05:18:07:02 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-05:18:07:02 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-05:18:07:02 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-05:18:07:03 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-05:18:07:04 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-05:18:07:04 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-05:18:07:21 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-05:18:07:21 INFO [api.task:434] Building contexts for gsm8k on rank 0... 0%| | 0/660 [00:00 n136-128-154:1658789:1659148 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1658789:1659148 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1658789:1659148 [0] NCCL INFO Using network IB n136-128-154:1658789:1659148 [0] NCCL INFO ncclCommInitRankConfig comm 0x12b667f0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xf496d131a9623cc2 - Init START 2025-12-05:18:07:27 INFO [evaluator:290] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-05:18:07:27 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-05:18:07:27 INFO [api.task:434] Building contexts for gsm8k on rank 1... 
0%| | 0/659 [00:00 n136-128-154:1658790:1659164 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1658790:1659164 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1658790:1659164 [1] NCCL INFO Using network IB n136-128-154:1658790:1659164 [1] NCCL INFO ncclCommInitRankConfig comm 0x10287fb0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xf496d131a9623cc2 - Init START n136-128-154:1658790:1659164 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1658789:1659148 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1658790:1659164 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1658790:1659164 [1] NCCL INFO Retrieving state for IB n136-128-154:1658790:1659164 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1658790:1659164 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1658790:1659164 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1658790:1659164 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1658789:1659148 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1658789:1659148 [0] NCCL INFO Retrieving state for IB n136-128-154:1658789:1659148 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1658789:1659148 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1658790:1659164 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with 
pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1658789:1659148 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1658789:1659148 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1658789:1659148 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1658790:1659164 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658790:1659164 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658790:1659164 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658790:1659164 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658789:1659148 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658789:1659148 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1658789:1659148 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658789:1659148 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1658789:1659148 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1658789:1659148 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 
(1000c0101000ffff) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1658789:1659148 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1658789:1659148 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658789:1659148 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1658789:1659148 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1658789:1659148 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1658789:1659148 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1658789:1659148 [0] NCCL INFO ========================================== n136-128-154:1658789:1659148 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658790:1659164 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1658789:1659148 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658790:1659164 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1658790:1659164 [1] NCCL INFO + 
PCI[24.0] - GPU/0-16000 (0) n136-128-154:1658789:1659148 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1658790:1659164 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1658790:1659164 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1658790:1659164 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1658790:1659164 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1658790:1659164 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1658790:1659164 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1658790:1659164 [1] NCCL INFO ========================================== n136-128-154:1658790:1659164 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658790:1659164 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1658790:1659164 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1658789:1659148 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1658790:1659164 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1658789:1659148 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1658790:1659164 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1658789:1659148 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1658790:1659164 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1658789:1659148 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658789:1659148 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1658790:1659164 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658789:1659148 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1658790:1659164 [1] NCCL INFO comm 0x10287fb0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1658789:1659148 [0] NCCL INFO comm 0x12b667f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 12 : 0 -> 
1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 19 : -1 
-> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658789:1659148 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 05 : 
0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 20/24 : 0 1 
n136-128-154:1658790:1659164 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1658789:1659148 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1658790:1659164 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1658790:1659164 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 15 
: 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1658789:1659148 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1658789:1659148 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1658790:1659164 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1658790:1659172 [1] NCCL INFO [Proxy Service] Device 1 CPU core 81 n136-128-154:1658790:1659173 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 18 n136-128-154:1658789:1659148 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1658789:1659148 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1658789:1659175 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 67 n136-128-154:1658789:1659174 [0] NCCL INFO [Proxy Service] Device 0 CPU core 66 n136-128-154:1658790:1659164 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1658790:1659164 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1658789:1659148 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1658789:1659148 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1658789:1659148 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1658790:1659164 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1658790:1659164 [1] NCCL INFO ncclCommInitRankConfig comm 0x10287fb0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xf496d131a9623cc2 - Init COMPLETE n136-128-154:1658790:1659164 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.44 (kernels 0.13, alloc 0.12, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.08, rest 0.04) n136-128-154:1658789:1659148 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1658789:1659148 [0] NCCL INFO ncclCommInitRankConfig comm 0x12b667f0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xf496d131a9623cc2 - Init COMPLETE n136-128-154:1658789:1659148 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 5.76 (kernels 0.16, alloc 0.14, bootstrap 5.25, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.08, rest 0.04) n136-128-154:1658790:1659176 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1658789:1659177 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658789:1659177 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1658790:1659176 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1658789:1659177 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-05:18:07:29 INFO [evaluator:559] Running generate_until requests Passed argument batch_size = auto. Detecting largest batch size 2025-12-05:18:07:29 INFO [evaluator:559] Running generate_until requests Running generate_until requests: 0%| | 0/660 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1658790:1706819 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1658790:1706819 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected 
dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658790:1659172 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:01:40 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     4|exact_match|↑  |0.0190|±  |0.0038|
|     |       |strict-match    |     4|exact_match|↑  |0.0015|±  |0.0011|
[rank0]:[W1206 03:01:41.050700585 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658789:1659174 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1658789:1706896 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1658790:1659172 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1658790:1706824 [1] NCCL INFO comm 0x10287fb0 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1658789:1706896 [0] NCCL INFO comm 0x12b667f0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task gsm8k evaluation complete!
==================================================
Starting evaluation: task=boolq | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/boolq.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:03:07 INFO [__main__:440] Selected Tasks: ['boolq']
2025-12-06:03:03:07 INFO [__main__:440] Selected Tasks: ['boolq']
2025-12-06:03:03:07 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:03:07 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:03:07 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:03:07 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:03:08 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-06:03:03:08 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} 2025-12-06:03:03:08 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1706952:1707154 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1706951:1707153 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1706951:1707153 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1706951:1707153 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1706952:1707154 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1706952:1707154 [1] NCCL INFO Using network IB n136-128-154:1706951:1707153 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1706951:1707153 [0] NCCL INFO Using network IB n136-128-154:1706952:1707154 [1] NCCL INFO ncclCommInitRankConfig comm 0xe578900 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xe98a3522ac27b72b - Init START n136-128-154:1706951:1707153 [0] NCCL INFO ncclCommInitRankConfig comm 0xeb759d0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xe98a3522ac27b72b - Init START n136-128-154:1706951:1707153 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1706952:1707154 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1706952:1707154 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1706952:1707154 [1] NCCL INFO Retrieving state for IB n136-128-154:1706952:1707154 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1706952:1707154 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with 
pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1706951:1707153 [0] NCCL INFO Retrieving state for IB n136-128-154:1706951:1707153 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1706952:1707154 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1706952:1707154 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1706952:1707154 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1706951:1707153 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 
(distance 5 <= 5), read 0 mode Default n136-128-154:1706951:1707153 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1706951:1707153 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1706951:1707153 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1706952:1707154 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1706952:1707154 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1706952:1707154 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1706952:1707154 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1706951:1707153 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1706951:1707153 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1706951:1707153 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1706951:1707153 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1706951:1707153 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1706951:1707153 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - 
PCI/0-83000 (1000c0101000ffff) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1706951:1707153 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1706951:1707153 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1706951:1707153 [0] NCCL INFO ========================================== n136-128-154:1706951:1707153 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1706951:1707153 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1706951:1707153 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1706951:1707153 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1706951:1707153 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1706951:1707153 [0] NCCL 
INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706951:1707153 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706951:1707153 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706951:1707153 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1706951:1707153 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1706951:1707153 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1706951:1707153 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1706952:1707154 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1706952:1707154 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1706952:1707154 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1706952:1707154 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - 
NIC/0-95000 n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1706952:1707154 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1706952:1707154 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1706952:1707154 [1] NCCL INFO ========================================== n136-128-154:1706952:1707154 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1706952:1707154 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1706952:1707154 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1706952:1707154 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1706952:1707154 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1706952:1707154 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 1 : GPU/0-16000 
GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1706952:1707154 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706952:1707154 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1706951:1707153 [0] NCCL INFO comm 0xeb759d0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1706952:1707154 [1] NCCL INFO comm 0xe578900 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO 
Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL 
INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706952:1707154 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 13/24 : 0 1 
n136-128-154:1706952:1707154 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1706952:1707154 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1706952:1707154 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] 
-1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1706952:1707154 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1706951:1707153 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 
1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1706951:1707153 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1706951:1707153 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1706951:1707153 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1706951:1707166 [0] NCCL INFO [Proxy Service] Device 0 CPU core 68 n136-128-154:1706951:1707167 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 6 n136-128-154:1706952:1707154 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1706952:1707168 [1] NCCL INFO [Proxy Service] Device 1 CPU core 72 n136-128-154:1706952:1707169 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 73 n136-128-154:1706951:1707153 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1706951:1707153 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1706952:1707154 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1706952:1707154 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1706951:1707153 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1706951:1707153 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1706951:1707153 [0] NCCL INFO ncclCommInitRankConfig comm 0xeb759d0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xe98a3522ac27b72b - Init COMPLETE n136-128-154:1706951:1707153 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 1.02 (kernels 0.22, alloc 0.54, bootstrap 0.00, allgathers 0.01, topo 0.08, graphs 0.00, connections 0.04, rest 0.14) n136-128-154:1706952:1707154 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. 
Using internal tuner plugin. n136-128-154:1706952:1707154 [1] NCCL INFO ncclCommInitRankConfig comm 0xe578900 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xe98a3522ac27b72b - Init COMPLETE n136-128-154:1706952:1707154 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 1.01 (kernels 0.20, alloc 0.54, bootstrap 0.00, allgathers 0.01, topo 0.08, graphs 0.00, connections 0.03, rest 0.15) n136-128-154:1706951:1707170 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO 
Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 15/0 : 1[2] 
-> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707171 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706951:1707170 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1706952:1707171 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:03:29 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:03:29 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/3270 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1706952:1707253 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read
n136-128-154:1706952:1707253 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1706952:1707168 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:05:00 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (45)
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|boolq|      2|none  |     0|acc   |↑  |0.5046|±  |0.0087|
[rank0]:[W1206 03:05:00.143295240 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1706951:1707166 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1706951:1707309 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1706952:1707168 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1706952:1707257 [1] NCCL INFO comm 0xe578900 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1706951:1707309 [0] NCCL INFO comm 0xeb759d0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task boolq evaluation complete!
==================================================
Starting evaluation: task=arc_challenge | few-shot count=25 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/arc_challenge.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:06:16 INFO [__main__:440] Selected Tasks: ['arc_challenge']
2025-12-06:03:06:16 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:06:16 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:06:17 INFO [__main__:440] Selected Tasks: ['arc_challenge']
2025-12-06:03:06:17 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:06:17 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:06:17 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-06:03:06:18 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1707362:1707614 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1707362:1707614 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1707362:1707614 [0] NCCL INFO Using network IB n136-128-154:1707362:1707614 [0] NCCL INFO ncclCommInitRankConfig comm 0xf300900 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xfa97e203f5b44b91 - Init START n136-128-154:1707363:1707615 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:1707363:1707615 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:1707363:1707615 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:1707363:1707615 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:1707363:1707615 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:1707363:1707615 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1707363:1707615 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1707363:1707615 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1707363:1707615 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1707363:1707615 [1] NCCL INFO Using network IB n136-128-154:1707363:1707615 [1] NCCL INFO ncclCommInitRankConfig comm 0xf6d1700 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xfa97e203f5b44b91 - Init START n136-128-154:1707363:1707615 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1707362:1707614 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1707363:1707615 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1707363:1707615 [1] NCCL INFO Retrieving state for IB n136-128-154:1707363:1707615 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1707363:1707615 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with 
pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1707363:1707615 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1707363:1707615 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1707363:1707615 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1707362:1707614 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1707362:1707614 [0] NCCL INFO Retrieving state for IB n136-128-154:1707362:1707614 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1707362:1707614 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1707362:1707614 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1707362:1707614 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1707362:1707614 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1707363:1707615 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 
(distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1707363:1707615 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1707363:1707615 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1707363:1707615 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707363:1707615 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1707363:1707615 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1707363:1707615 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1707363:1707615 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1707363:1707615 [1] NCCL INFO ========================================== n136-128-154:1707362:1707614 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 
(distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707362:1707614 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707362:1707614 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707362:1707614 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707363:1707615 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1707362:1707614 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1707362:1707614 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1707362:1707614 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1707362:1707614 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707362:1707614 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1707362:1707614 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1707362:1707614 [0] 
NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1707362:1707614 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1707363:1707615 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1707362:1707614 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1707363:1707615 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO ========================================== n136-128-154:1707363:1707615 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707363:1707615 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707363:1707615 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1707363:1707615 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1707363:1707615 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] 
NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707363:1707615 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1707362:1707614 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1707362:1707614 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 1 : GPU/0-16000 
GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707362:1707614 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707362:1707614 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707363:1707615 [1] NCCL INFO comm 0xf6d1700 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1707362:1707614 [0] NCCL INFO comm 0xf300900 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO 
Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL 
INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707362:1707614 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 
n136-128-154:1707362:1707614 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1707362:1707614 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1707363:1707615 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 
n136-128-154:1707362:1707614 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1707363:1707615 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1707362:1707614 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 
1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1707362:1707614 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1707363:1707615 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1707363:1707627 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 28 n136-128-154:1707363:1707626 [1] NCCL INFO [Proxy Service] Device 1 CPU core 27 n136-128-154:1707362:1707614 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1707362:1707614 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1707362:1707628 [0] NCCL INFO [Proxy Service] Device 0 CPU core 67 n136-128-154:1707362:1707629 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 68 n136-128-154:1707363:1707615 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1707363:1707615 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1707362:1707614 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1707362:1707614 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1707362:1707614 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1707363:1707615 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1707363:1707615 [1] NCCL INFO ncclCommInitRankConfig comm 0xf6d1700 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xfa97e203f5b44b91 - Init COMPLETE n136-128-154:1707363:1707615 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.57 (kernels 0.12, alloc 0.24, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.01, rest 0.14) n136-128-154:1707362:1707614 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. 
Using internal tuner plugin. n136-128-154:1707362:1707614 [0] NCCL INFO ncclCommInitRankConfig comm 0xf300900 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xfa97e203f5b44b91 - Init COMPLETE n136-128-154:1707362:1707614 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.95 (kernels 0.20, alloc 0.25, bootstrap 0.29, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.01, rest 0.14) n136-128-154:1707363:1707630 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO 
Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 15/0 : 0[1] 
-> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707362:1707631 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707363:1707630 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1707362:1707631 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:06:57 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:06:57 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/2344 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 
10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707363:1707852 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at 
'/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707363:1707626 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:10:45 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 25, batch_size: auto (45)
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |    25|acc     |↑  |0.2474|±  |0.0126|
|             |       |none  |    25|acc_norm|↑  |0.2688|±  |0.0130|

[rank0]:[W1206 03:10:46.573087668 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707362:1707628 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707362:1707907 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707363:1707626 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1707363:1707858 [1] NCCL INFO comm 0xf6d1700 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1707362:1707907 [0] NCCL INFO comm 0xf300900 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task arc_challenge evaluation complete!
==================================================
Starting evaluation: task=truthfulqa_mc1 | num_fewshot=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/truthfulqa_mc1.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:12:08 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1']
2025-12-06:03:12:08 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1']
2025-12-06:03:12:08 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:12:08 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:12:08 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:03:12:08 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'}
2025-12-06:03:12:09 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-06:03:12:09 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
`torch_dtype` is deprecated! Use `dtype` instead!
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization.
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:12:09 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1707960:1708200 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1707960:1708200 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1707960:1708200 [0] NCCL INFO Using network IB n136-128-154:1707960:1708200 [0] NCCL INFO ncclCommInitRankConfig comm 0x10ceec80 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xc58854ca40f60e9c - Init START 2025-12-06:03:13:04 INFO [evaluator:305] num_fewshot has been set to 0 for truthfulqa_mc1 in its config. Manual configuration will be ignored. 2025-12-06:03:13:04 INFO [api.task:434] Building contexts for truthfulqa_mc1 on rank 1... 0%| | 0/408 [00:00 n136-128-154:1707961:1708219 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1707961:1708219 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1707961:1708219 [1] NCCL INFO Using network IB n136-128-154:1707961:1708219 [1] NCCL INFO ncclCommInitRankConfig comm 0x10477620 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xc58854ca40f60e9c - Init START n136-128-154:1707961:1708219 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1707960:1708200 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1707961:1708219 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1707961:1708219 [1] NCCL INFO Retrieving state for IB n136-128-154:1707961:1708219 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1707961:1708219 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1707961:1708219 [1] NCCL INFO 
ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1707961:1708219 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1707961:1708219 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1707960:1708200 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1707960:1708200 [0] NCCL INFO Retrieving state for IB n136-128-154:1707960:1708200 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1707960:1708200 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1707960:1708200 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1707960:1708200 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1707960:1708200 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1707960:1708200 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1707960:1708200 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default 
n136-128-154:1707960:1708200 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707960:1708200 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707961:1708219 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1707961:1708219 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1707961:1708219 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707961:1708219 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1707960:1708200 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1707960:1708200 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1707960:1708200 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1707960:1708200 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707960:1708200 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1707960:1708200 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - 
PCI/0-bf000 (1000c0101000ffff) n136-128-154:1707960:1708200 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1707960:1708200 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1707960:1708200 [0] NCCL INFO ========================================== n136-128-154:1707960:1708200 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707960:1708200 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707960:1708200 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1707961:1708219 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1707961:1708219 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1707961:1708219 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1707961:1708219 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1707961:1708219 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1707961:1708219 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1707961:1708219 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1707961:1708219 [1] NCCL 
INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1707961:1708219 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1707961:1708219 [1] NCCL INFO ========================================== n136-128-154:1707961:1708219 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707961:1708219 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1707961:1708219 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1707960:1708200 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1707960:1708200 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1707961:1708219 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 3 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1707960:1708200 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707960:1708200 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707960:1708200 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707960:1708200 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707960:1708200 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707960:1708200 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707960:1708200 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1707961:1708219 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1707961:1708219 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1707961:1708219 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1707961:1708219 [1] NCCL INFO comm 0x10477620 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1707960:1708200 [0] NCCL INFO comm 0x10ceec80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 4 : 
0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 
11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707961:1708219 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] 
NCCL INFO Channel 14/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1707961:1708219 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1707961:1708219 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1707960:1708200 [0] NCCL 
INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1707961:1708219 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1707960:1708200 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 
n136-128-154:1707960:1708200 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1707960:1708200 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1707960:1708200 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1707960:1708228 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 3 n136-128-154:1707960:1708227 [0] NCCL INFO [Proxy Service] Device 0 CPU core 2 n136-128-154:1707961:1708219 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1707961:1708229 [1] NCCL INFO [Proxy Service] Device 1 CPU core 87 n136-128-154:1707961:1708230 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 25 n136-128-154:1707961:1708219 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1707961:1708219 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1707960:1708200 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1707960:1708200 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1707960:1708200 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1707961:1708219 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1707961:1708219 [1] NCCL INFO ncclCommInitRankConfig comm 0x10477620 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xc58854ca40f60e9c - Init COMPLETE n136-128-154:1707961:1708219 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.64 (kernels 0.13, alloc 0.22, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.06, rest 0.17) n136-128-154:1707960:1708200 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1707960:1708200 [0] NCCL INFO ncclCommInitRankConfig comm 0x10ceec80 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xc58854ca40f60e9c - Init COMPLETE n136-128-154:1707960:1708200 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 12.30 (kernels 0.16, alloc 0.31, bootstrap 11.54, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.06, rest 0.17) n136-128-154:1707961:1708232 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] 
via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707961:1708232 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1707960:1708233 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1707960:1708233 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1707961:1708232 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:13:06 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:13:06 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/2066 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1707961:1708276 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707961:1708229 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:13:58 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc1|      2|none  |     0|acc   |↑  |0.2803|±  |0.0157|
[rank0]:[W1206 03:13:59.665614963 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707960:1708227 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1707960:1708331 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1707961:1708229 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1707961:1708282 [1] NCCL INFO comm 0x10477620 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1707960:1708331 [0] NCCL INFO comm 0x10ceec80 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task truthfulqa_mc1 evaluation complete!
==================================================
Starting evaluation: task=piqa | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/piqa.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:15:20 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-06:03:15:20 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-06:03:15:20 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:15:20 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-06:03:15:20 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:15:20 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-06:03:15:21 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:15:22 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:15:22 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1708386:1708580 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1708385:1708579 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1708385:1708579 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1708385:1708579 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1708386:1708580 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1708386:1708580 [1] NCCL INFO Using network IB n136-128-154:1708385:1708579 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1708385:1708579 [0] NCCL INFO Using network IB n136-128-154:1708385:1708579 [0] NCCL INFO ncclCommInitRankConfig comm 0x10cf2a90 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x8d4536edfece5aa8 - Init START n136-128-154:1708386:1708580 [1] NCCL INFO ncclCommInitRankConfig comm 0x10dd3860 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x8d4536edfece5aa8 - Init START n136-128-154:1708386:1708580 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1708385:1708579 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1708386:1708580 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1708386:1708580 [1] NCCL INFO Retrieving state for IB n136-128-154:1708386:1708580 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1708386:1708580 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1708386:1708580 [1] 
NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1708386:1708580 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1708386:1708580 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1708385:1708579 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1708385:1708579 [0] NCCL INFO Retrieving state for IB n136-128-154:1708385:1708579 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1708385:1708579 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1708385:1708579 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1708385:1708579 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1708385:1708579 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1708385:1708579 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708385:1708579 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default 
n136-128-154:1708385:1708579 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708385:1708579 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708386:1708580 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708386:1708580 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708386:1708580 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708386:1708580 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708385:1708579 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1708385:1708579 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1708385:1708579 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1708385:1708579 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708385:1708579 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1708385:1708579 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - 
PCI/0-bf000 (1000c0101000ffff) n136-128-154:1708385:1708579 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1708385:1708579 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1708385:1708579 [0] NCCL INFO ========================================== n136-128-154:1708385:1708579 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708385:1708579 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708385:1708579 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1708386:1708580 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1708386:1708580 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1708386:1708580 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1708386:1708580 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708386:1708580 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1708386:1708580 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1708386:1708580 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1708386:1708580 [1] NCCL 
INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1708386:1708580 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1708386:1708580 [1] NCCL INFO ========================================== n136-128-154:1708386:1708580 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708385:1708579 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1708386:1708580 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708385:1708579 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1708385:1708579 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1708385:1708579 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 3 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708385:1708579 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708385:1708579 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708385:1708579 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708385:1708579 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708385:1708579 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708385:1708579 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1708386:1708580 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1708386:1708580 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1708386:1708580 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708386:1708580 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708386:1708580 [1] NCCL INFO comm 0x10dd3860 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1708385:1708579 [0] NCCL INFO comm 0x10cf2a90 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 17 : 
0 -> 1 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 02 : 0 -> 1 
-> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 04/24 : 0 1 
n136-128-154:1708386:1708580 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1708386:1708580 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1708386:1708580 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 
n136-128-154:1708385:1708579 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1708385:1708579 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] 
-1/-1/-1->0->1 n136-128-154:1708385:1708579 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1708385:1708579 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1708385:1708579 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1708385:1708591 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7 n136-128-154:1708385:1708592 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 72 n136-128-154:1708386:1708580 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1708386:1708594 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 73 n136-128-154:1708386:1708593 [1] NCCL INFO [Proxy Service] Device 1 CPU core 4 n136-128-154:1708385:1708579 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1708385:1708579 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1708385:1708579 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1708386:1708580 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1708386:1708580 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1708385:1708579 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1708385:1708579 [0] NCCL INFO ncclCommInitRankConfig comm 0x10cf2a90 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x8d4536edfece5aa8 - Init COMPLETE n136-128-154:1708385:1708579 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.77 (kernels 0.24, alloc 0.22, bootstrap 0.01, allgathers 0.01, topo 0.12, graphs 0.00, connections 0.09, rest 0.08) n136-128-154:1708386:1708580 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1708386:1708580 [1] NCCL INFO ncclCommInitRankConfig comm 0x10dd3860 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x8d4536edfece5aa8 - Init COMPLETE n136-128-154:1708386:1708580 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.74 (kernels 0.21, alloc 0.23, bootstrap 0.00, allgathers 0.01, topo 0.12, graphs 0.00, connections 0.08, rest 0.08) n136-128-154:1708385:1708595 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1708386:1708596 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708385:1708595 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708596 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1708385:1708595 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:15:36 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:15:36 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/1838 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1708386:1708616 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708386:1708616 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1708386:1708593 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-06:03:16:06 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|piqa |      1|none  |     0|acc     |↑  |0.6523|±  |0.0111|
|     |       |none  |     0|acc_norm|↑  |0.6567|±  |0.0111|
[rank0]:[W1206 03:16:07.675223866 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1708385:1708591 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1708385:1708672 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1708386:1708593 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1708386:1708621 [1] NCCL INFO comm 0x10dd3860 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE n136-128-154:1708385:1708672 [0] NCCL INFO comm 0x10cf2a90 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task piqa evaluation complete!
==================================================
Starting evaluation: task=hellaswag | few-shot=10 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/fg/hellaswag.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:17:32 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-06:03:17:32 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-06:03:17:32 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:17:32 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-06:03:17:32 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:17:32 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-06:03:17:33 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:17:33 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:17:33 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1708728:1708987 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1708728:1708987 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1708728:1708987 [0] NCCL INFO Using network IB 100%|█████████▉| 5000/5021 [00:25<00:00, 195.33it/s]n136-128-154:1708728:1708987 [0] NCCL INFO ncclCommInitRankConfig comm 0xf855840 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x8e41aae0c4bf49e3 - Init START 100%|█████████▉| 5020/5021 [00:25<00:00, 195.61it/s] 100%|██████████| 5021/5021 [00:25<00:00, 196.36it/s] n136-128-154:1708729:1708729 [1] NCCL INFO cudaDriverVersion 12040 n136-128-154:1708729:1708729 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:1708729:1708729 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:1708729:1708729 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 n136-128-154:1708729:1708729 [1] NCCL INFO Comm config Blocking set to 1 n136-128-154:1708729:1708992 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:1708729:1708992 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:1708729:1708992 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:1708729:1708992 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:1708729:1708992 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:1708729:1708992 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:1708729:1708992 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1708729:1708992 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1708729:1708992 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1708729:1708992 [1] NCCL INFO Using network IB n136-128-154:1708729:1708992 [1] NCCL INFO ncclCommInitRankConfig comm 0xfc75670 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x8e41aae0c4bf49e3 - Init START n136-128-154:1708729:1708992 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1708728:1708987 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1708729:1708992 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1708729:1708992 [1] NCCL INFO Retrieving state for IB n136-128-154:1708729:1708992 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1708729:1708992 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1708728:1708987 [0] NCCL INFO Retrieving state for IB n136-128-154:1708728:1708987 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1708729:1708992 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1708729:1708992 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with 
pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1708729:1708992 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1708728:1708987 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708728:1708987 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708728:1708987 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708728:1708987 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708729:1708992 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708729:1708992 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1708729:1708992 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1708729:1708992 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 
mode Default n136-128-154:1708728:1708987 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1708728:1708987 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1708728:1708987 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1708728:1708987 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708728:1708987 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1708728:1708987 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1708728:1708987 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1708728:1708987 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1708728:1708987 [0] NCCL INFO ========================================== n136-128-154:1708728:1708987 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708728:1708987 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708728:1708987 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1708729:1708992 [1] NCCL INFO === System : 
maxBw 240.0 totalBw 240.0 === n136-128-154:1708729:1708992 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1708729:1708992 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1708729:1708992 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1708729:1708992 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1708729:1708992 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1708729:1708992 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1708729:1708992 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1708729:1708992 [1] NCCL INFO ========================================== n136-128-154:1708729:1708992 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708729:1708992 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1708729:1708992 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1708728:1708987 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, 
sameChannels 1 n136-128-154:1708728:1708987 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1708729:1708992 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 
n136-128-154:1708728:1708987 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708728:1708987 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1708729:1708992 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1708729:1708992 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708729:1708992 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1708728:1708987 [0] NCCL INFO comm 0xf855840 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1708729:1708992 [1] NCCL INFO 
comm 0xfc75670 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 18 : -1 -> 1 -> 
0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708729:1708992 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL 
INFO Channel 04/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 18 : 0 
-> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1708729:1708992 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1708729:1708992 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1708729:1708992 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO 
Ring 12 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1708728:1708987 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1708728:1708987 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1708729:1708992 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1708729:1708999 [1] NCCL INFO [Proxy Service] Device 1 CPU core 29 n136-128-154:1708729:1709000 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 94 n136-128-154:1708728:1708987 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1708728:1708987 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1708728:1709001 [0] NCCL INFO [Proxy Service] Device 0 CPU core 10 n136-128-154:1708728:1709002 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 11 n136-128-154:1708729:1708992 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1708729:1708992 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1708728:1708987 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1708728:1708987 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1708728:1708987 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1708728:1708987 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1708728:1708987 [0] NCCL INFO ncclCommInitRankConfig comm 0xf855840 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x8e41aae0c4bf49e3 - Init COMPLETE n136-128-154:1708728:1708987 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.72 (kernels 0.19, alloc 0.12, bootstrap 0.32, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.02) n136-128-154:1708729:1708992 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1708729:1708992 [1] NCCL INFO ncclCommInitRankConfig comm 0xfc75670 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x8e41aae0c4bf49e3 - Init COMPLETE n136-128-154:1708729:1708992 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.32 (kernels 0.12, alloc 0.11, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.02) n136-128-154:1708728:1709003 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1708728:1709003 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708728:1709003 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1709004 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1708728:1709003 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:18:18 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:18:18 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/20084 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1708729:1710784 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1708729:1708999 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:42:08 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 10, batch_size: auto (57)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |    10|acc     |↑  |0.2825|±  |0.0045|
|         |       |none  |    10|acc_norm|↑  |0.3104|±  |0.0046|
[rank0]:[W1206 03:42:09.860326522 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1708728:1709001 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1708728:1710842 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1708729:1708999 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1708729:1710790 [1] NCCL INFO comm 0xfc75670 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1708728:1710842 [0] NCCL INFO comm 0xf855840 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task hellaswag evaluation complete!
Execution time: 35039.5 seconds
Namespace(save_dir='/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg', self_attn_layer_to_quant='29 23 24 30 18', mlp_layer_to_quant='26 20 22 23 19', model_id='/mnt/bn/life-mllm/users/cxr/quantization/models/meta-llama/Llama-3.1-8B', cuda_id=1)
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/4 [00:00
n136-128-154:1711815:1712014 [1] NCCL INFO Initialized NET plugin IB
n136-128-154:1711815:1712014 [1] NCCL INFO Assigned NET plugin IB to comm
n136-128-154:1711815:1712014 [1] NCCL INFO Using network IB
n136-128-154:1711814:1712013 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
n136-128-154:1711814:1712013 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1711814:1712013 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1711814:1712013 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1711814:1712013 [0] NCCL INFO Using network IB n136-128-154:1711815:1712014 [1] NCCL INFO ncclCommInitRankConfig comm 0x153b72b0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x9717e9b07b27bee8 - Init START n136-128-154:1711814:1712013 [0] NCCL INFO ncclCommInitRankConfig comm 0x14c61800 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x9717e9b07b27bee8 - Init START n136-128-154:1711814:1712013 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1711815:1712014 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1711814:1712013 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1711814:1712013 [0] NCCL INFO Retrieving state for IB n136-128-154:1711814:1712013 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1711814:1712013 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1711815:1712014 [1] NCCL INFO Retrieving state for IB n136-128-154:1711815:1712014 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1711814:1712013 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) 
n136-128-154:1711814:1712013 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1711814:1712013 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1711815:1712014 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1711814:1712013 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1711815:1712014 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1711814:1712013 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1711815:1712014 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1711814:1712013 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1711815:1712014 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default 
n136-128-154:1711814:1712013 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1711815:1712014 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1711814:1712013 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1711815:1712014 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1711814:1712013 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1711815:1712014 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1711814:1712013 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1711815:1712014 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1711814:1712013 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1711815:1712014 [1] NCCL INFO + 
SYS[10.0] - CPU/0-1 n136-128-154:1711814:1712013 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1711815:1712014 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1711814:1712013 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1711815:1712014 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1711814:1712013 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1711815:1712014 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1711814:1712013 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1711815:1712014 [1] NCCL INFO ========================================== n136-128-154:1711814:1712013 [0] NCCL INFO ========================================== n136-128-154:1711815:1712014 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1711814:1712013 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1711815:1712014 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1711814:1712013 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1711814:1712013 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1711815:1712014 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 
n136-128-154:1711815:1712014 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1711814:1712013 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1711815:1712014 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1711815:1712014 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1711814:1712013 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1711815:1712014 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711814:1712013 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1711815:1712014 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711815:1712014 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711815:1712014 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711815:1712014 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711815:1712014 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711815:1712014 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1711814:1712013 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 
n136-128-154:1711814:1712013 [0] NCCL INFO comm 0x14c61800 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1711815:1712014 [1] NCCL INFO comm 0x153b72b0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 6 
: 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1711815:1712014 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 
n136-128-154:1711814:1712013 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 18/24 : 0 1 
n136-128-154:1711815:1712014 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1711815:1712014 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1711815:1712014 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1711815:1712014 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 09 : 1 
-> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1711814:1712013 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1711814:1712013 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1711815:1712014 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1711815:1712026 [1] NCCL INFO [Proxy Service] Device 1 CPU core 2 n136-128-154:1711815:1712027 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 70 n136-128-154:1711814:1712013 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1711814:1712013 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1711814:1712028 [0] NCCL INFO [Proxy Service] Device 0 CPU core 71 n136-128-154:1711814:1712029 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 8 n136-128-154:1711815:1712014 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1711815:1712014 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1711814:1712013 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1711814:1712013 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1711814:1712013 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1711815:1712014 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1711815:1712014 [1] NCCL INFO ncclCommInitRankConfig comm 0x153b72b0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x9717e9b07b27bee8 - Init COMPLETE n136-128-154:1711815:1712014 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.59 (kernels 0.16, alloc 0.18, bootstrap 0.02, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.12, rest 0.05) n136-128-154:1711814:1712013 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1711814:1712013 [0] NCCL INFO ncclCommInitRankConfig comm 0x14c61800 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x9717e9b07b27bee8 - Init COMPLETE n136-128-154:1711814:1712013 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.62 (kernels 0.20, alloc 0.19, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.12, rest 0.05) n136-128-154:1711815:1712030 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1711814:1712031 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711814:1712031 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1711815:1712030 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1711814:1712031 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:49:05 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:03:49:05 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1268 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1711815:1712056 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1711815:1712026 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:03:49:40 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (64)
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.6851|±  |0.0131|
[rank0]:[W1206 03:49:41.739800330 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1711814:1712028 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1711814:1712119 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1711815:1712026 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1711815:1712062 [1] NCCL INFO comm 0x153b72b0 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1711814:1712119 [0] NCCL INFO comm 0x14c61800 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task winogrande evaluation complete!
==================================================
Starting evaluation: task=gsm8k | few-shot count=4 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/gsm8k.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:03:51:04 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-06:03:51:04 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:51:04 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:03:51:04 INFO [__main__:440] Selected Tasks: ['gsm8k'] 2025-12-06:03:51:04 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:03:51:04 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:03:51:06 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:51:06 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:03:51:06 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-06:03:51:21 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-06:03:51:21 INFO [api.task:434] Building contexts for gsm8k on rank 1... 0%| | 0/659 [00:00', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0} 2025-12-06:03:51:21 WARNING [evaluator:309] Overwriting default num_fewshot of gsm8k from 5 to 4 2025-12-06:03:51:21 INFO [api.task:434] Building contexts for gsm8k on rank 0... 25%|██▌ | 167/659 [00:00<00:01, 407.82it/s] 0%| | 0/660 [00:00 n136-128-154:1712228:1712563 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1712229:1712564 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:1712229:1712564 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1712229:1712564 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1712228:1712563 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1712228:1712563 [0] NCCL INFO Using network IB n136-128-154:1712229:1712564 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1712229:1712564 [1] NCCL INFO Using network IB n136-128-154:1712229:1712564 [1] NCCL INFO ncclCommInitRankConfig comm 0xd8742a0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x3e34cbe80a35aca4 - Init START n136-128-154:1712228:1712563 [0] NCCL INFO ncclCommInitRankConfig comm 0x12f077a0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x3e34cbe80a35aca4 - Init START n136-128-154:1712229:1712564 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1712228:1712563 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1712228:1712563 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1712228:1712563 [0] NCCL INFO Retrieving state for IB n136-128-154:1712228:1712563 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1712228:1712563 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1712229:1712564 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1712229:1712564 [1] NCCL INFO Retrieving state for IB n136-128-154:1712229:1712564 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1712228:1712563 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1712229:1712564 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo 
with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1712229:1712564 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1712228:1712563 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1712229:1712564 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1712228:1712563 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1712229:1712564 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1712228:1712563 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1712228:1712563 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1712228:1712563 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1712228:1712563 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1712229:1712564 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1712229:1712564 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default 
n136-128-154:1712229:1712564 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1712229:1712564 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1712228:1712563 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1712228:1712563 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1712228:1712563 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1712228:1712563 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1712229:1712564 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1712228:1712563 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1712229:1712564 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1712228:1712563 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1712228:1712563 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 
n136-128-154:1712229:1712564 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1712228:1712563 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1712228:1712563 [0] NCCL INFO ========================================== n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1712228:1712563 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1712229:1712564 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1712228:1712563 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1712229:1712564 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1712229:1712564 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1712228:1712563 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1712229:1712564 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1712229:1712564 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1712229:1712564 [1] NCCL INFO ========================================== n136-128-154:1712229:1712564 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1712229:1712564 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) 
CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1712229:1712564 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1712228:1712563 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1712228:1712563 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1712228:1712563 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712228:1712563 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO 
10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1712229:1712564 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1712229:1712564 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1712229:1712564 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO 10 : GPU/0-4a000 
GPU/0-16000 n136-128-154:1712229:1712564 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1712228:1712563 [0] NCCL INFO comm 0x12f077a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1712229:1712564 [1] NCCL INFO comm 0xd8742a0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL 
INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712229:1712564 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 00 : 0 
-> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 
n136-128-154:1712228:1712563 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1712228:1712563 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1712228:1712563 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO 
Ring 08 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1712229:1712564 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1712229:1712564 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1712228:1712563 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1712228:1712563 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1712228:1712576 [0] NCCL INFO [Proxy Service] Device 0 CPU core 66 n136-128-154:1712228:1712577 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 68 n136-128-154:1712229:1712564 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1712229:1712578 [1] NCCL INFO [Proxy Service] Device 1 CPU core 69 n136-128-154:1712229:1712579 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 74 n136-128-154:1712229:1712564 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1712229:1712564 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1712228:1712563 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1712228:1712563 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1712228:1712563 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1712229:1712564 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1712229:1712564 [1] NCCL INFO ncclCommInitRankConfig comm 0xd8742a0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x3e34cbe80a35aca4 - Init COMPLETE n136-128-154:1712229:1712564 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.63 (kernels 0.24, alloc 0.21, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.06, rest 0.05) n136-128-154:1712228:1712563 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1712228:1712563 [0] NCCL INFO ncclCommInitRankConfig comm 0x12f077a0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x3e34cbe80a35aca4 - Init COMPLETE n136-128-154:1712228:1712563 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.63 (kernels 0.25, alloc 0.21, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.06, rest 0.04) n136-128-154:1712229:1712580 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via 
P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1712229:1712580 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712229:1712580 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1712228:1712581 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1712229:1712580 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:03:51:24 INFO [evaluator:559] Running generate_until requests 2025-12-06:03:51:24 INFO [evaluator:559] Running generate_until requests Running generate_until requests: 0%| | 0/660 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1712229:1717984 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1712229:1712578 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:05:00:09 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     4|exact_match|↑  |0.0106|±  |0.0028|
|     |       |strict-match    |     4|exact_match|↑  |0.0008|±  |0.0008|
[rank0]:[W1206 05:00:10.047535372 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1712228:1712576 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1712228:1718041 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1712229:1712578 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1712229:1717988 [1] NCCL INFO comm 0xd8742a0 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1712228:1718041 [0] NCCL INFO comm 0x12f077a0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task gsm8k evaluation complete!
==================================================
Starting evaluation: task=boolq | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/boolq.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
	If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:05:01:35 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-06:05:01:35 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-06:05:01:35 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:01:35 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:01:35 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:01:35 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:01:37 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:01:37 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:01:37 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1718093:1718328 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1718094:1718329 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1718094:1718329 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1718094:1718329 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1718093:1718328 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1718093:1718328 [0] NCCL INFO Using network IB n136-128-154:1718094:1718329 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1718094:1718329 [1] NCCL INFO Using network IB n136-128-154:1718093:1718328 [0] NCCL INFO ncclCommInitRankConfig comm 0x1182afd0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x2e657754adbae1ea - Init START n136-128-154:1718094:1718329 [1] NCCL INFO ncclCommInitRankConfig comm 0xf232820 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x2e657754adbae1ea - Init START n136-128-154:1718093:1718328 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1718094:1718329 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1718093:1718328 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1718093:1718328 [0] NCCL INFO Retrieving state for IB n136-128-154:1718093:1718328 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1718093:1718328 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1718094:1718329 [1] NCCL INFO TOPO/NET : Importing network plugins to 
topology n136-128-154:1718094:1718329 [1] NCCL INFO Retrieving state for IB n136-128-154:1718094:1718329 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1718093:1718328 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1718094:1718329 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1718093:1718328 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1718094:1718329 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1718093:1718328 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1718094:1718329 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1718094:1718329 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1718093:1718328 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO GPU 
Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1718093:1718328 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1718093:1718328 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1718093:1718328 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718093:1718328 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1718093:1718328 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1718093:1718328 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1718093:1718328 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1718093:1718328 [0] NCCL INFO ========================================== n136-128-154:1718093:1718328 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718093:1718328 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 
(1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718093:1718328 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1718094:1718329 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1718094:1718329 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718093:1718328 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1718093:1718328 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 
[0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718093:1718328 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1718094:1718329 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1718094:1718329 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1718094:1718329 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718094:1718329 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1718094:1718329 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1718094:1718329 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1718094:1718329 [1] NCCL 
INFO + SYS[10.0] - CPU/0-0 n136-128-154:1718094:1718329 [1] NCCL INFO ========================================== n136-128-154:1718094:1718329 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718094:1718329 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718094:1718329 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1718094:1718329 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1718094:1718329 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1718094:1718329 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 4 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718094:1718329 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718094:1718329 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718093:1718328 [0] NCCL INFO comm 0x1182afd0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1718094:1718329 [1] NCCL INFO comm 0xf232820 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 
[1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 
n136-128-154:1718094:1718329 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718094:1718329 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 11 : 0 -> 1 
-> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] 
-1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1718094:1718329 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1718093:1718328 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1718093:1718328 [0] NCCL 
INFO P2P Chunksize set to 524288 n136-128-154:1718093:1718328 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1718093:1718328 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1718093:1718340 [0] NCCL INFO [Proxy Service] Device 0 CPU core 27 n136-128-154:1718093:1718341 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 28 n136-128-154:1718094:1718329 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1718094:1718342 [1] NCCL INFO [Proxy Service] Device 1 CPU core 4 n136-128-154:1718094:1718343 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 6 n136-128-154:1718093:1718328 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1718093:1718328 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1718094:1718329 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1718094:1718329 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1718093:1718328 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1718093:1718328 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1718093:1718328 [0] NCCL INFO ncclCommInitRankConfig comm 0x1182afd0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x2e657754adbae1ea - Init COMPLETE n136-128-154:1718093:1718328 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 1.02 (kernels 0.24, alloc 0.52, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.06, rest 0.15) n136-128-154:1718094:1718329 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1718094:1718329 [1] NCCL INFO ncclCommInitRankConfig comm 0xf232820 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x2e657754adbae1ea - Init COMPLETE n136-128-154:1718094:1718329 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 1.02 (kernels 0.23, alloc 0.52, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.06, rest 0.15) n136-128-154:1718093:1718345 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1718093:1718345 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718093:1718345 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718346 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1718093:1718345 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:05:02:02 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:05:02:02 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/3270 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1718094:1718407 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718094:1718407 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718094:1718413 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718094:1718342 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-06:05:03:13 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|boolq|      2|none  |     0|acc   |↑  |0.7768|±  |0.0073|
[rank0]:[W1206 05:03:14.343996600 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1718093:1718340 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1718093:1718462 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1718094:1718342 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1718094:1718413 [1] NCCL INFO comm 0xf232820 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1718093:1718462 [0] NCCL INFO comm 0x1182afd0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task boolq evaluation complete!
==================================================
Starting evaluation: task=arc_challenge | few-shot count=25 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/arc_challenge.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
	If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:05:04:38 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-06:05:04:38 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:04:38 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:04:38 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-06:05:04:38 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:04:38 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:04:39 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:04:40 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:04:40 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1718519:1718862 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1718518:1718861 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1718518:1718861 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1718518:1718861 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1718519:1718862 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1718519:1718862 [1] NCCL INFO Using network IB n136-128-154:1718518:1718861 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1718518:1718861 [0] NCCL INFO Using network IB n136-128-154:1718519:1718862 [1] NCCL INFO ncclCommInitRankConfig comm 0x11a60100 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xa268e5e053a110c3 - Init START n136-128-154:1718518:1718861 [0] NCCL INFO ncclCommInitRankConfig comm 0x12d9fc50 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xa268e5e053a110c3 - Init START n136-128-154:1718519:1718862 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1718518:1718861 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1718518:1718861 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1718518:1718861 [0] NCCL INFO Retrieving state for IB n136-128-154:1718518:1718861 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1718518:1718861 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 
keep=1 coll=(null) n136-128-154:1718519:1718862 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1718519:1718862 [1] NCCL INFO Retrieving state for IB n136-128-154:1718519:1718862 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1718518:1718861 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1718519:1718862 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1718518:1718861 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1718519:1718862 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1718518:1718861 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1718519:1718862 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1718519:1718862 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1718518:1718861 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718519:1718862 [1] NCCL INFO GPU Direct RDMA Enabled for 
GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718518:1718861 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718519:1718862 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1718518:1718861 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718519:1718862 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718518:1718861 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718519:1718862 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1718518:1718861 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1718518:1718861 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1718518:1718861 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1718519:1718862 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1718519:1718862 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1718518:1718861 [0] 
NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1718519:1718862 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718518:1718861 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1718518:1718861 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1718519:1718862 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1718519:1718862 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1718519:1718862 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1718518:1718861 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1718518:1718861 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1718518:1718861 [0] NCCL INFO ========================================== n136-128-154:1718519:1718862 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1718519:1718862 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1718519:1718862 [1] NCCL INFO ========================================== n136-128-154:1718518:1718861 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) 
n136-128-154:1718519:1718862 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718519:1718862 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718518:1718861 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1718519:1718862 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1718518:1718861 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1718518:1718861 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1718518:1718861 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1718519:1718862 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 2 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1718518:1718861 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718518:1718861 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718518:1718861 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718518:1718861 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718518:1718861 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718518:1718861 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1718518:1718861 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1718519:1718862 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1718519:1718862 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718519:1718862 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1718518:1718861 [0] NCCL INFO comm 0x12d9fc50 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1718519:1718862 [1] NCCL INFO comm 0x11a60100 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 15 : 
0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 
22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1718519:1718862 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 
n136-128-154:1718518:1718861 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1718519:1718862 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1718519:1718862 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] 
-1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1718519:1718862 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1718518:1718861 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] 
-1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1718518:1718861 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1718519:1718862 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1718519:1718873 [1] NCCL INFO [Proxy Service] Device 1 CPU core 10 n136-128-154:1718519:1718874 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 13 n136-128-154:1718518:1718861 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1718518:1718861 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1718518:1718875 [0] NCCL INFO [Proxy Service] Device 0 CPU core 70 n136-128-154:1718518:1718876 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 71 n136-128-154:1718518:1718861 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1718519:1718862 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1718518:1718861 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1718519:1718862 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1718518:1718861 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1718518:1718861 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1718518:1718861 [0] NCCL INFO ncclCommInitRankConfig comm 0x12d9fc50 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xa268e5e053a110c3 - Init COMPLETE n136-128-154:1718518:1718861 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.60 (kernels 0.20, alloc 0.20, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.06, rest 0.06) n136-128-154:1718519:1718862 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1718519:1718862 [1] NCCL INFO ncclCommInitRankConfig comm 0x11a60100 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xa268e5e053a110c3 - Init COMPLETE n136-128-154:1718519:1718862 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.58 (kernels 0.19, alloc 0.20, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.06, rest 0.06) n136-128-154:1718518:1718877 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1718518:1718877 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1718519:1718878 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718518:1718877 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1718519:1718878 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:05:05:11 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:05:05:11 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/2344 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1718519:1719392 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718519:1718873 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-06:05:08:55 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 25, batch_size: auto (64)
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |    25|acc     |↑  |0.3737|±  |0.0141|
|             |       |none  |    25|acc_norm|↑  |0.4147|±  |0.0144|
[rank0]:[W1206 05:08:56.682272706 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718518:1718875 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1718518:1719449 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1718519:1718873 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1718519:1719399 [1] NCCL INFO comm 0x11a60100 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE n136-128-154:1718518:1719449 [0] NCCL INFO comm 0x12d9fc50 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task arc_challenge evaluation complete!
==================================================
Starting evaluation: task=truthfulqa_mc1 | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/truthfulqa_mc1.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 
2025-12-06:05:10:17 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-06:05:10:17 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-06:05:10:17 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:10:17 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:10:17 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:10:17 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:10:18 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:10:19 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1719504:1719691 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1719505:1719692 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:1719505:1719692 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1719505:1719692 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1719504:1719691 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1719504:1719691 [0] NCCL INFO Using network IB n136-128-154:1719505:1719692 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1719505:1719692 [1] NCCL INFO Using network IB n136-128-154:1719505:1719692 [1] NCCL INFO ncclCommInitRankConfig comm 0x162a5740 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x2229f7460bbce154 - Init START n136-128-154:1719504:1719691 [0] NCCL INFO ncclCommInitRankConfig comm 0x15177fb0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x2229f7460bbce154 - Init START n136-128-154:1719504:1719691 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1719505:1719692 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1719504:1719691 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1719504:1719691 [0] NCCL INFO Retrieving state for IB n136-128-154:1719504:1719691 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1719504:1719691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1719504:1719691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1719504:1719691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1719504:1719691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with 
pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1719505:1719692 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1719505:1719692 [1] NCCL INFO Retrieving state for IB n136-128-154:1719505:1719692 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1719505:1719692 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1719505:1719692 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1719505:1719692 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1719505:1719692 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1719504:1719691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719504:1719691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719504:1719691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719505:1719692 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719504:1719691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719505:1719692 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default 
n136-128-154:1719505:1719692 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719505:1719692 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719505:1719692 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1719504:1719691 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1719505:1719692 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1719504:1719691 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1719505:1719692 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719504:1719691 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1719505:1719692 [1] NCCL INFO + NVL[240.0] - 
NVS/0-0 n136-128-154:1719504:1719691 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719505:1719692 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1719504:1719691 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1719505:1719692 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1719504:1719691 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1719505:1719692 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1719504:1719691 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1719505:1719692 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1719504:1719691 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1719505:1719692 [1] NCCL INFO ========================================== n136-128-154:1719504:1719691 [0] NCCL INFO ========================================== n136-128-154:1719505:1719692 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719504:1719691 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719505:1719692 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719504:1719691 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719504:1719691 [0] NCCL INFO Setting 
affinity for GPU 1 to 1-31,65-95 n136-128-154:1719505:1719692 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1719505:1719692 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1719505:1719692 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1719505:1719692 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 10 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1719504:1719691 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1719505:1719692 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719504:1719691 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719505:1719692 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719504:1719691 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719504:1719691 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719504:1719691 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719504:1719691 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719504:1719691 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 
n136-128-154:1719504:1719691 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719505:1719692 [1] NCCL INFO comm 0x162a5740 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1719504:1719691 [0] NCCL INFO comm 0x15177fb0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 
17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719505:1719692 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 02/24 : 0 1 
n136-128-154:1719505:1719692 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 
n136-128-154:1719504:1719691 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1719505:1719692 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1719505:1719692 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1719505:1719692 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 08 : 1 
-> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1719504:1719691 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1719504:1719691 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1719505:1719692 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1719505:1719703 [1] NCCL INFO [Proxy Service] Device 1 CPU core 73 n136-128-154:1719505:1719704 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 10 n136-128-154:1719504:1719691 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1719504:1719691 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1719504:1719706 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 66 n136-128-154:1719504:1719705 [0] NCCL INFO [Proxy Service] Device 0 CPU core 65 n136-128-154:1719505:1719692 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1719505:1719692 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1719504:1719691 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1719504:1719691 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1719504:1719691 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1719505:1719692 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1719505:1719692 [1] NCCL INFO ncclCommInitRankConfig comm 0x162a5740 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x2229f7460bbce154 - Init COMPLETE n136-128-154:1719505:1719692 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.69 (kernels 0.27, alloc 0.33, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.02) n136-128-154:1719504:1719691 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1719504:1719691 [0] NCCL INFO ncclCommInitRankConfig comm 0x15177fb0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x2229f7460bbce154 - Init COMPLETE n136-128-154:1719504:1719691 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.70 (kernels 0.28, alloc 0.33, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.02) n136-128-154:1719505:1719707 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via 
P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719505:1719707 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1719504:1719708 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719504:1719708 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1719505:1719707 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:05:10:34 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:05:10:34 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/2066 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read 
n136-128-154:1719505:1719753 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719505:1719753 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1719505:1719703 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-06:05:11:29 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc1|      2|none  |     0|acc   |↑  |0.2497|±  |0.0152|
[rank0]:[W1206 05:11:30.034795243 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1719504:1719705 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:1719504:1719821 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:1719505:1719703 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:1719505:1719759 [1] NCCL INFO comm 0x162a5740 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE n136-128-154:1719504:1719821 [0] NCCL INFO comm 0x15177fb0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task truthfulqa_mc1 evaluation complete!
==================================================
Starting evaluation: task=piqa | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/piqa.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:05:12:53 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-06:05:12:53 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-06:05:12:53 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:12:53 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-06:05:12:53 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:12:53 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-06:05:12:54 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:12:55 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:12:55 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1719951:1720291 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1719950:1720290 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1719950:1720290 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1719950:1720290 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1719951:1720291 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1719951:1720291 [1] NCCL INFO Using network IB n136-128-154:1719950:1720290 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1719950:1720290 [0] NCCL INFO Using network IB n136-128-154:1719950:1720290 [0] NCCL INFO ncclCommInitRankConfig comm 0x15aaccd0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xa16a9b90c264df48 - Init START n136-128-154:1719951:1720291 [1] NCCL INFO ncclCommInitRankConfig comm 0x130515a0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xa16a9b90c264df48 - Init START n136-128-154:1719950:1720290 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1719951:1720291 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1719951:1720291 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1719951:1720291 [1] NCCL INFO Retrieving state for IB n136-128-154:1719951:1720291 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1719951:1720291 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 
keep=1 coll=(null) n136-128-154:1719950:1720290 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1719950:1720290 [0] NCCL INFO Retrieving state for IB n136-128-154:1719950:1720290 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1719951:1720291 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1719950:1720290 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1719950:1720290 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1719951:1720291 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1719950:1720290 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1719951:1720291 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1719950:1720290 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1719951:1720291 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719951:1720291 [1] NCCL INFO GPU Direct RDMA Enabled for 
GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719951:1720291 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719951:1720291 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719950:1720290 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719950:1720290 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1719950:1720290 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719950:1720290 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1719951:1720291 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1719951:1720291 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1719950:1720290 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1719950:1720290 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1719951:1720291 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1719950:1720290 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - 
NIC/0-60000 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1719951:1720291 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1719951:1720291 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1719950:1720290 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1719951:1720291 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1719950:1720290 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1719950:1720290 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1719951:1720291 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1719951:1720291 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1719950:1720290 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1719951:1720291 [1] NCCL INFO ========================================== n136-128-154:1719950:1720290 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1719950:1720290 [0] NCCL INFO ========================================== n136-128-154:1719951:1720291 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) 
n136-128-154:1719950:1720290 [0] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719951:1720291 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719950:1720290 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1719951:1720291 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1719950:1720290 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1719951:1720291 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1719951:1720291 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1719950:1720290 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 2 : 
GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1719951:1720291 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719951:1720291 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719951:1720291 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719951:1720291 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1719951:1720291 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719951:1720291 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 
n136-128-154:1719950:1720290 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1719950:1720290 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719950:1720290 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1719951:1720291 [1] NCCL INFO comm 0x130515a0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1719950:1720290 [0] NCCL INFO comm 0x15aaccd0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 15 : 
-1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 
22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719951:1720291 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 13/24 : 0 1 
n136-128-154:1719951:1720291 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1719951:1720291 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1719951:1720291 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 
[21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1719951:1720291 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1719950:1720290 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 
[20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1719950:1720290 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1719951:1720291 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1719951:1720304 [1] NCCL INFO [Proxy Service] Device 1 CPU core 67 n136-128-154:1719951:1720305 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 68 n136-128-154:1719950:1720290 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1719950:1720290 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1719950:1720306 [0] NCCL INFO [Proxy Service] Device 0 CPU core 69 n136-128-154:1719950:1720307 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 14 n136-128-154:1719951:1720291 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1719951:1720291 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1719950:1720290 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1719950:1720290 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1719950:1720290 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1719950:1720290 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1719950:1720290 [0] NCCL INFO ncclCommInitRankConfig comm 0x15aaccd0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0xa16a9b90c264df48 - Init COMPLETE n136-128-154:1719950:1720290 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.77 (kernels 0.24, alloc 0.37, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.09) n136-128-154:1719951:1720291 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1719951:1720291 [1] NCCL INFO ncclCommInitRankConfig comm 0x130515a0 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0xa16a9b90c264df48 - Init COMPLETE n136-128-154:1719951:1720291 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.77 (kernels 0.23, alloc 0.37, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.09) n136-128-154:1719950:1720308 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via 
P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1719951:1720309 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719950:1720308 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720309 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1719950:1720308 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:05:13:17 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:05:13:17 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1838 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1719951:1720350 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:

	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1719951:1720304 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:05:13:49 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|piqa |      1|none  |     0|acc     |↑  |0.7008|±  |0.0107|
|     |       |none  |     0|acc_norm|↑  |0.7024|±  |0.0107|

[rank0]:[W1206 05:13:50.372810255 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1719950:1720306 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1719951:1720304 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1719950:1720407 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1719951:1720356 [1] NCCL INFO comm 0x130515a0 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1719950:1720407 [0] NCCL INFO comm 0x15aaccd0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task piqa evaluation complete!
==================================================
Starting evaluation: task=hellaswag | num_fewshot=10 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/fg/hellaswag.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
	If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-06:05:15:09 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-06:05:15:09 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-06:05:15:09 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:05:15:09 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'}
2025-12-06:05:15:09 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-06:05:15:09 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'}
2025-12-06:05:15:10 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-06:05:15:11 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'}
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization.
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-06:05:15:11 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:1720460:1720691 [0] NCCL INFO Initialized NET plugin IB n136-128-154:1720460:1720691 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1720460:1720691 [0] NCCL INFO Using network IB 95%|█████████▌| 4785/5021 [00:24<00:01, 200.43it/s]n136-128-154:1720460:1720691 [0] NCCL INFO ncclCommInitRankConfig comm 0x134d07d0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x114028e9049878c1 - Init START 96%|█████████▌| 4806/5021 [00:24<00:01, 200.43it/s] 96%|█████████▌| 4827/5021 [00:24<00:00, 200.41it/s] 97%|█████████▋| 4848/5021 [00:24<00:00, 197.24it/s] 97%|█████████▋| 4868/5021 [00:24<00:00, 196.83it/s] 97%|█████████▋| 4888/5021 [00:24<00:00, 196.48it/s] 98%|█████████▊| 4908/5021 [00:24<00:00, 195.98it/s] 98%|█████████▊| 4928/5021 [00:24<00:00, 196.10it/s] 99%|█████████▊| 4948/5021 [00:24<00:00, 196.27it/s] 99%|█████████▉| 4968/5021 [00:25<00:00, 196.61it/s] 99%|█████████▉| 4988/5021 [00:25<00:00, 197.07it/s] 100%|█████████▉| 5008/5021 [00:25<00:00, 197.31it/s] 100%|██████████| 5021/5021 [00:25<00:00, 198.03it/s] n136-128-154:1720461:1720461 [1] NCCL INFO cudaDriverVersion 12040 n136-128-154:1720461:1720461 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:1720461:1720461 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:1720461:1720461 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 n136-128-154:1720461:1720461 [1] NCCL INFO Comm config Blocking set to 1 n136-128-154:1720461:1720697 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. 
n136-128-154:1720461:1720697 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:1720461:1720697 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:1720461:1720697 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:1720461:1720697 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:1720461:1720697 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:1720461:1720697 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:1720461:1720697 [1] NCCL INFO Initialized NET plugin IB n136-128-154:1720461:1720697 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:1720461:1720697 [1] NCCL INFO Using network IB n136-128-154:1720461:1720697 [1] NCCL INFO ncclCommInitRankConfig comm 0x15f07470 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x114028e9049878c1 - Init START n136-128-154:1720461:1720697 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1720460:1720691 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:1720460:1720691 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1720460:1720691 [0] NCCL INFO Retrieving state for IB n136-128-154:1720460:1720691 [0] NCCL INFO Initialized state 0 for IB n136-128-154:1720460:1720691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1720461:1720697 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:1720461:1720697 [1] NCCL INFO Retrieving state for IB n136-128-154:1720461:1720697 [1] NCCL INFO Initialized state 0 for IB n136-128-154:1720460:1720691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with 
pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1720461:1720697 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:1720461:1720697 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:1720460:1720691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1720461:1720697 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:1720460:1720691 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1720461:1720697 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:1720460:1720691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1720461:1720697 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1720460:1720691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1720460:1720691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1720461:1720697 [1] NCCL INFO 
GPU Direct RDMA Enabled for GPU 1 / HCA 0 (distance 5 <= 5), read 0 mode Default n136-128-154:1720460:1720691 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1720461:1720697 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1720461:1720697 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 1 (distance 5 <= 5), read 0 mode Default n136-128-154:1720460:1720691 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1720460:1720691 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1720460:1720691 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1720460:1720691 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1720460:1720691 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1720460:1720691 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1720460:1720691 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1720460:1720691 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1720460:1720691 [0] NCCL INFO ========================================== n136-128-154:1720460:1720691 [0] NCCL INFO 
GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1720460:1720691 [0] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1720460:1720691 [0] NCCL INFO Setting affinity for GPU 1 to 1-31,65-95 n136-128-154:1720461:1720697 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:1720461:1720697 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-14000 (1000c01010de13b8) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - GPU/0-16000 (0) n136-128-154:1720461:1720697 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-48000 (1000c01010de13b8) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - GPU/0-4a000 (1) n136-128-154:1720461:1720697 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:1720461:1720697 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:1720461:1720697 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:1720461:1720697 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:1720461:1720697 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:1720461:1720697 [1] NCCL INFO ========================================== n136-128-154:1720461:1720697 [1] NCCL INFO GPU/0-16000 :GPU/0-16000 (0/5000.0/LOC) GPU/0-4a000 (2/240.0/NVL) 
NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1720461:1720697 [1] NCCL INFO GPU/0-4a000 :GPU/0-16000 (2/240.0/NVL) GPU/0-4a000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) n136-128-154:1720461:1720697 [1] NCCL INFO Setting affinity for GPU 2 to 1-31,65-95 n136-128-154:1720460:1720691 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1720460:1720691 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 7 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:1720461:1720697 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 6 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 7 : GPU/0-16000 
GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 8 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 9 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 10 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 11 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1720460:1720691 [0] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720460:1720691 [0] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720461:1720697 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:1720461:1720697 [1] NCCL INFO 0 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 1 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 2 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 3 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 4 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 5 : GPU/0-16000 GPU/0-4a000 n136-128-154:1720461:1720697 [1] NCCL INFO 6 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720461:1720697 [1] NCCL INFO 7 : GPU/0-4a000 GPU/0-16000 
n136-128-154:1720461:1720697 [1] NCCL INFO 8 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720461:1720697 [1] NCCL INFO 9 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720461:1720697 [1] NCCL INFO 10 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720461:1720697 [1] NCCL INFO 11 : GPU/0-4a000 GPU/0-16000 n136-128-154:1720460:1720691 [0] NCCL INFO comm 0x134d07d0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:1720461:1720697 [1] NCCL INFO comm 0x15f07470 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 5 
: -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:1720461:1720697 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:1720460:1720691 [0] NCCL INFO 
Channel 00/24 : 0 1 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 16/24 : 
0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:1720461:1720697 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:1720461:1720697 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:1720461:1720697 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 05 : 1 
-> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:1720460:1720691 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:1720460:1720691 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:1720461:1720697 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:1720461:1720705 [1] NCCL INFO [Proxy Service] Device 1 CPU core 29 n136-128-154:1720461:1720706 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 92 n136-128-154:1720460:1720691 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:1720460:1720691 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:1720460:1720707 [0] NCCL INFO [Proxy Service] Device 0 CPU core 70 n136-128-154:1720460:1720708 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 7 n136-128-154:1720461:1720697 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1720461:1720697 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1720460:1720691 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:1720460:1720691 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:1720460:1720691 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:1720461:1720697 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:1720461:1720697 [1] NCCL INFO ncclCommInitRankConfig comm 0x15f07470 rank 1 nranks 2 cudaDev 1 nvmlDev 2 busId 4a000 commId 0x114028e9049878c1 - Init COMPLETE n136-128-154:1720461:1720697 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.63 (kernels 0.12, alloc 0.25, bootstrap 0.00, allgathers 0.01, topo 0.08, graphs 0.00, connections 0.01, rest 0.16) n136-128-154:1720460:1720691 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:1720460:1720691 [0] NCCL INFO ncclCommInitRankConfig comm 0x134d07d0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 16000 commId 0x114028e9049878c1 - Init COMPLETE n136-128-154:1720460:1720691 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 2.14 (kernels 0.17, alloc 0.14, bootstrap 1.57, allgathers 0.01, topo 0.08, graphs 0.00, connections 0.01, rest 0.17) n136-128-154:1720461:1720709 [1] NCCL INFO Channel 00/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 01/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 02/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 03/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 04/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 05/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 06/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 07/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 08/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 09/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 10/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 11/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[2] via 
P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 12/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 13/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 14/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 15/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 16/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 17/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 18/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 19/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 20/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 21/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Channel 22/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 16/0 : 0[1] -> 1[2] via P2P/CUMEM/read 
n136-128-154:1720461:1720709 [1] NCCL INFO Channel 23/0 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 17/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 18/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 19/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 20/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 21/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 22/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720460:1720710 [0] NCCL INFO Channel 23/0 : 0[1] -> 1[2] via P2P/CUMEM/read n136-128-154:1720461:1720709 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:1720460:1720710 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-06:05:16:02 INFO [evaluator:559] Running loglikelihood requests 2025-12-06:05:16:02 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/20084 [00:00 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 01/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 02/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 03/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 04/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 05/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 06/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 07/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 08/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 09/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 10/1 : 1[2] -> 0[1] via 
P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 11/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 12/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 13/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 14/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 15/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 16/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 17/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 18/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 19/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 20/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 21/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 22/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 23/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 24/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 25/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 26/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 27/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 28/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 29/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 30/1 : 1[2] -> 0[1] via P2P/CUMEM/read n136-128-154:1720461:1723019 [1] NCCL INFO Channel 31/1 : 1[2] -> 0[1] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1720461:1720705 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-06:05:41:33 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 10, batch_size: auto (64)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |    10|acc     |↑  |0.4510|±  |0.0050|
|         |       |none  |    10|acc_norm|↑  |0.6418|±  |0.0048|
[rank0]:[W1206 05:41:34.198573901 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1720460:1720707 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:1720460:1723080 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:1720461:1720705 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:1720461:1723024 [1] NCCL INFO comm 0x15f07470 rank 1 nranks 2 cudaDev 1 busId 4a000 - Abort COMPLETE
n136-128-154:1720460:1723080 [0] NCCL INFO comm 0x134d07d0 rank 0 nranks 2 cudaDev 0 busId 16000 - Abort COMPLETE
Task hellaswag evaluation complete!
Execution time: 7164.04 seconds