nohup: ignoring input Namespace(save_dir='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg', self_attn_layer_to_quant='16 17 15 14 13', mlp_layer_to_quant='16 17 15 14 13', model_id='/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen/Qwen2.5-7B', cuda_id=6) `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2216229:2216439 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2216229:2216439 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2216229:2216439 [0] NCCL INFO Using network IB n136-128-154:2216230:2216440 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2216230:2216440 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2216230:2216440 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2216230:2216440 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2216230:2216440 [1] NCCL INFO Using network IB n136-128-154:2216229:2216439 [0] NCCL INFO ncclCommInitRankConfig comm 0xe2c5cc0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xf7a531d2d3107fc6 - Init START n136-128-154:2216230:2216440 [1] NCCL INFO ncclCommInitRankConfig comm 0x11084de0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xf7a531d2d3107fc6 - Init START n136-128-154:2216229:2216439 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2216230:2216440 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2216230:2216440 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2216230:2216440 [1] NCCL INFO Retrieving state for IB n136-128-154:2216230:2216440 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2216230:2216440 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 
coll=(null) n136-128-154:2216229:2216439 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2216229:2216439 [0] NCCL INFO Retrieving state for IB n136-128-154:2216229:2216439 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2216229:2216439 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2216230:2216440 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2216229:2216439 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2216230:2216440 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2216229:2216439 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2216230:2216440 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2216229:2216439 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2216229:2216439 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216230:2216440 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / 
HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216229:2216439 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216230:2216440 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216229:2216439 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2216229:2216439 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2216229:2216439 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2216229:2216439 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2216229:2216439 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2216229:2216439 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2216229:2216439 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2216229:2216439 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2216229:2216439 [0] NCCL INFO ========================================== n136-128-154:2216229:2216439 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216229:2216439 [0] 
NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216229:2216439 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2216230:2216440 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2216230:2216440 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2216230:2216440 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2216230:2216440 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2216230:2216440 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2216230:2216440 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2216230:2216440 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2216230:2216440 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2216230:2216440 [1] NCCL INFO ========================================== n136-128-154:2216230:2216440 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216230:2216440 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 
(0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216229:2216439 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2216230:2216440 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2216229:2216439 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2216229:2216439 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216229:2216439 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216229:2216439 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216229:2216439 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216229:2216439 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2216229:2216439 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216229:2216439 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216230:2216440 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2216230:2216440 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2216230:2216440 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216230:2216440 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216230:2216440 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216230:2216440 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216230:2216440 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 
n136-128-154:2216230:2216440 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216230:2216440 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216229:2216439 [0] NCCL INFO comm 0xe2c5cc0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216229:2216439 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 
n136-128-154:2216229:2216439 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] 
NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2216229:2216439 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2216229:2216439 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2216230:2216440 [1] NCCL INFO comm 0x11084de0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 
n136-128-154:2216230:2216440 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL 
INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2216230:2216440 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2216230:2216440 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2216229:2216439 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2216229:2216439 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2216229:2216453 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 42 n136-128-154:2216229:2216452 [0] NCCL INFO [Proxy Service] Device 0 CPU core 41 n136-128-154:2216230:2216440 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2216230:2216454 [1] NCCL INFO [Proxy Service] Device 1 CPU core 37 n136-128-154:2216230:2216455 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 102 n136-128-154:2216229:2216439 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2216229:2216439 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2216229:2216439 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2216230:2216440 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2216230:2216440 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2216229:2216439 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2216229:2216439 [0] NCCL INFO ncclCommInitRankConfig comm 0xe2c5cc0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xf7a531d2d3107fc6 - Init COMPLETE n136-128-154:2216229:2216439 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.67 (kernels 0.20, alloc 0.25, bootstrap 0.00, allgathers 0.02, topo 0.10, graphs 0.00, connections 0.04, rest 0.07) n136-128-154:2216230:2216440 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2216230:2216440 [1] NCCL INFO ncclCommInitRankConfig comm 0x11084de0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xf7a531d2d3107fc6 - Init COMPLETE n136-128-154:2216230:2216440 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.67 (kernels 0.20, alloc 0.25, bootstrap 0.00, allgathers 0.01, topo 0.10, graphs 0.00, connections 0.03, rest 0.09) n136-128-154:2216229:2216456 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216229:2216456 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2216230:2216457 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216457 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2216229:2216456 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:09:23 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:09:23 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1268 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216230:2216480 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2216230:2216454 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:16:09:55 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (64)
|  Tasks   |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.513|±  | 0.014|
[rank0]:[W1209 16:09:56.530616565 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2216229:2216452 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2216229:2216535 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2216230:2216454 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2216230:2216486 [1] NCCL INFO comm 0x11084de0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2216229:2216535 [0] NCCL INFO comm 0xe2c5cc0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task winogrande evaluation complete!
==================================================
Starting evaluation: task=boolq | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/baseline_BI/boolq.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:10:48 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:16:10:48 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:16:10:48 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:10:48 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:10:48 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:10:48 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:10:49 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:10:50 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:10:50 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2216582:2216898 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2216582:2216898 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2216582:2216898 [0] NCCL INFO Using network IB n136-128-154:2216583:2216899 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2216583:2216899 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2216583:2216899 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2216583:2216899 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2216583:2216899 [1] NCCL INFO Using network IB n136-128-154:2216582:2216898 [0] NCCL INFO ncclCommInitRankConfig comm 0xfda11b0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x8d0e5fd69ac7ccb1 - Init START n136-128-154:2216583:2216899 [1] NCCL INFO ncclCommInitRankConfig comm 0x10a0bd50 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x8d0e5fd69ac7ccb1 - Init START n136-128-154:2216582:2216898 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2216583:2216899 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2216582:2216898 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2216582:2216898 [0] NCCL INFO Retrieving state for IB n136-128-154:2216582:2216898 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2216582:2216898 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2216583:2216899 [1] 
NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2216583:2216899 [1] NCCL INFO Retrieving state for IB n136-128-154:2216583:2216899 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2216582:2216898 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2216583:2216899 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2216582:2216898 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2216583:2216899 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2216582:2216898 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2216583:2216899 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2216583:2216899 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2216582:2216898 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216582:2216898 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default 
n136-128-154:2216583:2216899 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216583:2216899 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2216582:2216898 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2216582:2216898 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2216582:2216898 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2216582:2216898 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2216582:2216898 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2216582:2216898 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2216582:2216898 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2216582:2216898 [0] NCCL INFO ========================================== n136-128-154:2216582:2216898 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216582:2216898 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 
(2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216582:2216898 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2216583:2216899 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2216582:2216898 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2216583:2216899 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2216582:2216898 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2216582:2216898 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2216582:2216898 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2216582:2216898 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2216582:2216898 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216582:2216898 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2216582:2216898 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2216582:2216898 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2216582:2216898 [0] NCCL INFO 11 : 
GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2216583:2216899 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2216583:2216899 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2216583:2216899 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2216583:2216899 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2216583:2216899 [1] NCCL INFO ========================================== n136-128-154:2216583:2216899 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216583:2216899 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2216583:2216899 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2216582:2216898 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2216582:2216898 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216582:2216898 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216582:2216898 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216582:2216898 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216582:2216898 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216582:2216898 [0] NCCL INFO 
10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216582:2216898 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2216583:2216899 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2216583:2216899 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2216583:2216899 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO 10 : GPU/0-c9000 
GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2216583:2216899 [1] NCCL INFO comm 0x10a0bd50 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO comm 0xfda11b0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL 
INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 03 
: 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 13/24 : 0 
1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2216583:2216899 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 08 : 1 -> 0 -> 
1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2216582:2216898 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2216582:2216898 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2216583:2216899 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2216583:2216910 [1] NCCL INFO [Proxy Service] Device 1 CPU core 39 n136-128-154:2216583:2216911 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 40 n136-128-154:2216582:2216898 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2216582:2216898 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2216582:2216912 [0] NCCL INFO [Proxy Service] Device 0 CPU core 98 n136-128-154:2216582:2216913 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 37 n136-128-154:2216583:2216899 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2216583:2216899 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2216582:2216898 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2216582:2216898 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2216582:2216898 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2216583:2216899 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2216583:2216899 [1] NCCL INFO ncclCommInitRankConfig comm 0x10a0bd50 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x8d0e5fd69ac7ccb1 - Init COMPLETE n136-128-154:2216583:2216899 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.49 (kernels 0.14, alloc 0.17, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.05, rest 0.05) n136-128-154:2216582:2216898 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2216582:2216898 [0] NCCL INFO ncclCommInitRankConfig comm 0xfda11b0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x8d0e5fd69ac7ccb1 - Init COMPLETE n136-128-154:2216582:2216898 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.52 (kernels 0.16, alloc 0.18, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.05, rest 0.05) n136-128-154:2216583:2216914 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2216582:2216915 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216582:2216915 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2216583:2216914 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2216582:2216915 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:11:14 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:11:14 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/3270 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2216583:2217108 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2216583:2217108 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2216583:2216910 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-09:16:12:42 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (45)
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|boolq|      2|none  |     0|acc   |↑  |0.5076|±  |0.0087|
[rank0]:[W1209 16:12:43.746052598 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2216582:2216912 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2216582:2217160 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2216583:2216910 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2216583:2217113 [1] NCCL INFO comm 0x10a0bd50 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2216582:2217160 [0] NCCL INFO comm 0xfda11b0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task boolq evaluation complete!
==================================================
Starting evaluation: task=arc_challenge | few-shot count=25 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/baseline_BI/arc_challenge.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:13:33 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:16:13:33 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:13:33 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:13:33 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:16:13:33 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:13:33 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:13:34 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:13:35 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! 
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2217285:2217577 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2217285:2217577 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2217285:2217577 [0] NCCL INFO Using network IB n136-128-154:2217285:2217577 [0] NCCL INFO ncclCommInitRankConfig comm 0x125cd3d0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x6987de3c12bf2f47 - Init START 100%|██████████| 586/586 [00:15<00:00, 37.09it/s]
n136-128-154:2217286:2217286 [1] NCCL INFO cudaDriverVersion 12040 n136-128-154:2217286:2217286 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2217286:2217286 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2217286:2217286 [1] NCCL INFO NCCL version 2.27.5+cuda12.9 n136-128-154:2217286:2217286 [1] NCCL INFO Comm config Blocking set to 1 n136-128-154:2217286:2217585 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:2217286:2217585 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:2217286:2217585 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2217286:2217585 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2217286:2217585 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:2217286:2217585 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
n136-128-154:2217286:2217585 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2217286:2217585 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2217286:2217585 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2217286:2217585 [1] NCCL INFO Using network IB n136-128-154:2217286:2217585 [1] NCCL INFO ncclCommInitRankConfig comm 0x1131e060 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x6987de3c12bf2f47 - Init START n136-128-154:2217286:2217585 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2217285:2217577 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2217286:2217585 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2217286:2217585 [1] NCCL INFO Retrieving state for IB n136-128-154:2217286:2217585 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2217286:2217585 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2217286:2217585 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2217286:2217585 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2217285:2217577 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2217285:2217577 [0] NCCL INFO Retrieving state for IB n136-128-154:2217285:2217577 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2217285:2217577 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with 
pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2217286:2217585 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2217285:2217577 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2217285:2217577 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2217285:2217577 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2217286:2217585 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2217286:2217585 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2217285:2217577 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2217285:2217577 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2217285:2217577 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2217285:2217577 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) 
n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2217285:2217577 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2217285:2217577 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2217285:2217577 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2217285:2217577 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2217285:2217577 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2217285:2217577 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2217285:2217577 [0] NCCL INFO ========================================== n136-128-154:2217285:2217577 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2217285:2217577 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2217285:2217577 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2217286:2217585 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2217286:2217585 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - 
GPU/0-c5000 (0) n136-128-154:2217286:2217585 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2217286:2217585 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2217286:2217585 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2217286:2217585 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2217286:2217585 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2217286:2217585 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2217286:2217585 [1] NCCL INFO ========================================== n136-128-154:2217285:2217577 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2217286:2217585 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2217285:2217577 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2217285:2217577 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 7 : 
GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2217285:2217577 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2217285:2217577 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217285:2217577 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2217286:2217585 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 6 : 
GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2217286:2217585 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2217286:2217585 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217286:2217585 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2217285:2217577 [0] NCCL INFO comm 0x125cd3d0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2217286:2217585 [1] NCCL INFO comm 0x1131e060 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL 
INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] 
NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217286:2217585 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 09/24 : 0 1 
n136-128-154:2217286:2217585 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 
n136-128-154:2217285:2217577 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2217286:2217585 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO 
Ring 19 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2217285:2217577 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2217285:2217577 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2217286:2217585 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2217286:2217593 [1] NCCL INFO [Proxy Service] Device 1 CPU core 120 n136-128-154:2217286:2217594 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 58 n136-128-154:2217285:2217577 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2217285:2217577 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2217285:2217596 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 99 n136-128-154:2217285:2217595 [0] NCCL INFO [Proxy Service] Device 0 CPU core 34 n136-128-154:2217286:2217585 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2217286:2217585 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2217285:2217577 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2217285:2217577 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2217285:2217577 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2217286:2217585 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2217286:2217585 [1] NCCL INFO ncclCommInitRankConfig comm 0x1131e060 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x6987de3c12bf2f47 - Init COMPLETE n136-128-154:2217286:2217585 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.44 (kernels 0.12, alloc 0.15, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.01, rest 0.09) n136-128-154:2217285:2217577 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2217285:2217577 [0] NCCL INFO ncclCommInitRankConfig comm 0x125cd3d0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x6987de3c12bf2f47 - Init COMPLETE n136-128-154:2217285:2217577 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 4.56 (kernels 0.11, alloc 0.13, bootstrap 4.16, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.01, rest 0.09) n136-128-154:2217286:2217597 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217597 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2217285:2217598 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2217285:2217598 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2217286:2217597 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:14:12 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:14:12 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/2344 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2217286:2217993 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2217286:2217993 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2217286:2217593 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:16:17:59 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 25, batch_size: auto (45)
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |    25|acc     |↑  |0.2944|±  |0.0133|
|             |       |none  |    25|acc_norm|↑  |0.3251|±  |0.0137|
[rank0]:[W1209 16:18:00.245690467 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2217285:2217595 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2217285:2218058 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2217286:2217593 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2217286:2218000 [1] NCCL INFO comm 0x1131e060 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2217285:2218058 [0] NCCL INFO comm 0x125cd3d0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task arc_challenge evaluation complete!
==================================================
Starting evaluation: task=truthfulqa_mc1 | num_fewshot=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/baseline_BI/truthfulqa_mc1.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:18:50 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:16:18:50 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:18:50 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:18:50 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:16:18:50 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:18:50 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:18:51 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:18:52 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:18:52 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2218110:2218336 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2218110:2218336 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218110:2218336 [0] NCCL INFO Using network IB n136-128-154:2218110:2218336 [0] NCCL INFO ncclCommInitRankConfig comm 0xfdabdb0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xb0dc6c95e12de2b7 - Init START 2025-12-09:16:19:13 INFO [evaluator:305] num_fewshot has been set to 0 for truthfulqa_mc1 in its config. Manual configuration will be ignored. 2025-12-09:16:19:13 INFO [api.task:434] Building contexts for truthfulqa_mc1 on rank 1... 0%| | 0/408 [00:00 n136-128-154:2218111:2218347 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2218111:2218347 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218111:2218347 [1] NCCL INFO Using network IB n136-128-154:2218111:2218347 [1] NCCL INFO ncclCommInitRankConfig comm 0x10923990 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xb0dc6c95e12de2b7 - Init START n136-128-154:2218111:2218347 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218110:2218336 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218110:2218336 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2218110:2218336 [0] NCCL INFO Retrieving state for IB n136-128-154:2218110:2218336 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2218110:2218336 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218111:2218347 [1] NCCL INFO 
TOPO/NET : Importing network plugins to topology n136-128-154:2218111:2218347 [1] NCCL INFO Retrieving state for IB n136-128-154:2218111:2218347 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2218110:2218336 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218111:2218347 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218110:2218336 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218111:2218347 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218110:2218336 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218111:2218347 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218111:2218347 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218110:2218336 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218110:2218336 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default 
n136-128-154:2218111:2218347 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218111:2218347 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218110:2218336 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2218110:2218336 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218110:2218336 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218110:2218336 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218110:2218336 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218110:2218336 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218110:2218336 [0] NCCL INFO ========================================== n136-128-154:2218110:2218336 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218110:2218336 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 
(2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218110:2218336 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2218111:2218347 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2218111:2218347 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218111:2218347 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218110:2218336 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218110:2218336 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218110:2218336 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218110:2218336 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218110:2218336 [0] NCCL INFO 6 : 
GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218110:2218336 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218110:2218336 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218110:2218336 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO ========================================== n136-128-154:2218110:2218336 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218111:2218347 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218111:2218347 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2218110:2218336 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218110:2218336 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218110:2218336 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] 
NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2218111:2218347 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218111:2218347 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218111:2218347 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO 10 : 
GPU/0-c9000 GPU/0-c5000 n136-128-154:2218111:2218347 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218110:2218336 [0] NCCL INFO comm 0xfdabdb0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2218111:2218347 [1] NCCL INFO comm 0x10923990 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 
n136-128-154:2218111:2218347 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218111:2218347 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 
n136-128-154:2218110:2218336 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 17/24 : 0 1 
n136-128-154:2218111:2218347 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2218111:2218347 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2218111:2218347 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2218111:2218347 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 07 : 1 
-> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2218110:2218336 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2218110:2218336 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218111:2218347 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2218111:2218355 [1] NCCL INFO [Proxy Service] Device 1 CPU core 58 n136-128-154:2218111:2218356 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 124 n136-128-154:2218110:2218336 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2218110:2218336 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2218110:2218357 [0] NCCL INFO [Proxy Service] Device 0 CPU core 98 n136-128-154:2218110:2218358 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 35 n136-128-154:2218111:2218347 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218111:2218347 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218110:2218336 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218110:2218336 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218110:2218336 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2218111:2218347 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2218111:2218347 [1] NCCL INFO ncclCommInitRankConfig comm 0x10923990 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xb0dc6c95e12de2b7 - Init COMPLETE n136-128-154:2218111:2218347 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.50 (kernels 0.12, alloc 0.17, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.03, rest 0.10) n136-128-154:2218110:2218336 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2218110:2218336 [0] NCCL INFO ncclCommInitRankConfig comm 0xfdabdb0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xb0dc6c95e12de2b7 - Init COMPLETE n136-128-154:2218110:2218336 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 3.21 (kernels 0.11, alloc 0.11, bootstrap 2.78, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.02, rest 0.10) n136-128-154:2218111:2218359 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218111:2218359 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2218111:2218359 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218110:2218360 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2218111:2218359 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:19:14 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:19:14 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/2066 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2218111:2218404 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218111:2218404 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2218111:2218355 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-09:16:20:05 INFO [loggers.evaluation_tracker:209] Saving results aggregated hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc1|      2|none  |     0|acc   |↑  |0.2705|±  |0.0156|
[rank0]:[W1209 16:20:06.430630291 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2218110:2218357 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2218111:2218355 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2218110:2218459 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2218111:2218409 [1] NCCL INFO comm 0x10923990 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2218110:2218459 [0] NCCL INFO comm 0xfdabdb0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task truthfulqa_mc1 evaluation complete!
==================================================
Starting evaluation: task=piqa | few-shot count=0 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/baseline_BI/piqa.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:20:55 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:16:20:55 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:20:55 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:20:55 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:16:20:55 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:20:55 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:20:56 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:20:56 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2218506:2218685 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2218505:2218684 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:2218505:2218684 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2218505:2218684 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2218506:2218685 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218506:2218685 [1] NCCL INFO Using network IB n136-128-154:2218505:2218684 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218505:2218684 [0] NCCL INFO Using network IB n136-128-154:2218506:2218685 [1] NCCL INFO ncclCommInitRankConfig comm 0xfc899c0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x29a160749ad47b2e - Init START n136-128-154:2218505:2218684 [0] NCCL INFO ncclCommInitRankConfig comm 0xe53b9c0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x29a160749ad47b2e - Init START n136-128-154:2218505:2218684 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218506:2218685 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218506:2218685 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2218506:2218685 [1] NCCL INFO Retrieving state for IB n136-128-154:2218506:2218685 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2218506:2218685 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2218505:2218684 [0] NCCL INFO Retrieving state for IB n136-128-154:2218505:2218684 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2218506:2218685 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo 
with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218506:2218685 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218506:2218685 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218505:2218684 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218505:2218684 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218506:2218685 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218506:2218685 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218505:2218684 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2218505:2218684 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) 
n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218505:2218684 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218505:2218684 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218505:2218684 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218505:2218684 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218505:2218684 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218505:2218684 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218505:2218684 [0] NCCL INFO ========================================== n136-128-154:2218505:2218684 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218505:2218684 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218505:2218684 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2218506:2218685 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2218506:2218685 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - 
NIC/0-95000 n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218506:2218685 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218506:2218685 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218506:2218685 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218506:2218685 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218506:2218685 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218506:2218685 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218506:2218685 [1] NCCL INFO ========================================== n136-128-154:2218506:2218685 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218506:2218685 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218506:2218685 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2218506:2218685 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2218506:2218685 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 
n136-128-154:2218506:2218685 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218506:2218685 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218506:2218685 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218506:2218685 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218506:2218685 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218506:2218685 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218506:2218685 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218506:2218685 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2218505:2218684 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 
n136-128-154:2218505:2218684 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218505:2218684 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218505:2218684 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218505:2218684 [0] NCCL INFO comm 0xe53b9c0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2218506:2218685 [1] NCCL INFO comm 0xfc899c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 12 : -1 -> 0 -> 
1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 19 : 1 -> 0 
-> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218506:2218685 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 03 : 0 -> 1 
-> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 22/24 : 0 1 
n136-128-154:2218506:2218685 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2218506:2218685 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2218506:2218685 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2218506:2218685 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO 
Ring 15 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2218505:2218684 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2218505:2218684 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218506:2218685 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2218506:2218697 [1] NCCL INFO [Proxy Service] Device 1 CPU core 100 n136-128-154:2218506:2218698 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 38 n136-128-154:2218505:2218684 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2218505:2218684 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2218505:2218699 [0] NCCL INFO [Proxy Service] Device 0 CPU core 103 n136-128-154:2218505:2218700 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 40 n136-128-154:2218506:2218685 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218506:2218685 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218505:2218684 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218505:2218684 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218505:2218684 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2218506:2218685 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2218506:2218685 [1] NCCL INFO ncclCommInitRankConfig comm 0xfc899c0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x29a160749ad47b2e - Init COMPLETE n136-128-154:2218506:2218685 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.51 (kernels 0.12, alloc 0.25, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.01, rest 0.06) n136-128-154:2218505:2218684 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2218505:2218684 [0] NCCL INFO ncclCommInitRankConfig comm 0xe53b9c0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x29a160749ad47b2e - Init COMPLETE n136-128-154:2218505:2218684 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.52 (kernels 0.11, alloc 0.25, bootstrap 0.00, allgathers 0.00, topo 0.07, graphs 0.00, connections 0.01, rest 0.06) n136-128-154:2218506:2218701 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2218505:2218702 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218505:2218702 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218506:2218701 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2218505:2218702 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:21:16 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:21:16 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/1838 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218506:2218723 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2218506:2218723 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218506:2218697 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:16:21:45 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|piqa |      1|none  |     0|acc     |↑  |0.7067|±  |0.0106|
|     |       |none  |     0|acc_norm|↑  |0.7073|±  |0.0106|
[rank0]:[W1209 16:21:46.735703412 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218505:2218699 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218505:2218778 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218506:2218697 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2218506:2218729 [1] NCCL INFO comm 0xfc899c0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2218505:2218778 [0] NCCL INFO comm 0xe53b9c0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task piqa evaluation complete!
==================================================
Starting evaluation: task=hellaswag | few-shot count=10 | model=Qwen2.5-7B-quantization-fg
Output path: results2/Qwen2.5-7B-quantization-fg/baseline_BI/hellaswag.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:22:36 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-09:16:22:36 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:22:36 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:22:36 INFO [__main__:440] Selected Tasks: ['hellaswag'] 2025-12-09:16:22:36 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:22:36 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg'} 2025-12-09:16:22:37 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:22:37 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! 
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2218825:2219031 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2218825:2219031 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218825:2219031 [0] NCCL INFO Using network IB n136-128-154:2218826:2219032 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2218826:2219032 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2218826:2219032 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2218826:2219032 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2218826:2219032 [1] NCCL INFO Using network IB n136-128-154:2218825:2219031 [0] NCCL INFO ncclCommInitRankConfig comm 0xe3cc210 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xf5eaff4420616091 - Init START n136-128-154:2218826:2219032 [1] NCCL INFO ncclCommInitRankConfig comm 0xe7126f0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xf5eaff4420616091 - Init START n136-128-154:2218825:2219031 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218826:2219032 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2218826:2219032 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2218826:2219032 [1] NCCL INFO Retrieving state for IB n136-128-154:2218826:2219032 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2218826:2219032 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218826:2219032 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218826:2219032 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with 
pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218826:2219032 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218825:2219031 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2218825:2219031 [0] NCCL INFO Retrieving state for IB n136-128-154:2218825:2219031 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2218825:2219031 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2218825:2219031 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2218825:2219031 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2218825:2219031 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2218825:2219031 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218825:2219031 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218825:2219031 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2218825:2219031 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2218825:2219031 [0] NCCL INFO + 
PCI[24.0] - NIC/0-95000 n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218825:2219031 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218825:2219031 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218825:2219031 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218825:2219031 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218825:2219031 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218825:2219031 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218825:2219031 [0] NCCL INFO ========================================== n136-128-154:2218825:2219031 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218825:2219031 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218825:2219031 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2218826:2219032 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218826:2219032 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2218825:2219031 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 
20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2218825:2219031 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218825:2219031 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218825:2219031 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === 
n136-128-154:2218826:2219032 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2218826:2219032 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2218826:2219032 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2218826:2219032 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2218826:2219032 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2218826:2219032 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2218826:2219032 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2218826:2219032 [1] NCCL INFO ========================================== n136-128-154:2218826:2219032 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218826:2219032 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2218826:2219032 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2218826:2219032 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 
n136-128-154:2218826:2219032 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2218826:2219032 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2218826:2219032 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218826:2219032 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2218825:2219031 [0] NCCL INFO comm 0xe3cc210 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2218826:2219032 [1] NCCL INFO 
comm 0xe7126f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 18 : 1 -> 0 -> 
-1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2218826:2219032 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL 
INFO Channel 04/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 18 : 0 
-> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2218826:2219032 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2218826:2219032 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2218826:2219032 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO 
Ring 12 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2218825:2219031 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2218825:2219031 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2218826:2219032 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2218826:2219044 [1] NCCL INFO [Proxy Service] Device 1 CPU core 60 n136-128-154:2218826:2219045 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 62 n136-128-154:2218825:2219031 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2218825:2219031 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2218825:2219046 [0] NCCL INFO [Proxy Service] Device 0 CPU core 39 n136-128-154:2218825:2219047 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 42 n136-128-154:2218826:2219032 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218826:2219032 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218825:2219031 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2218825:2219031 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2218825:2219031 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2218826:2219032 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2218826:2219032 [1] NCCL INFO ncclCommInitRankConfig comm 0xe7126f0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xf5eaff4420616091 - Init COMPLETE n136-128-154:2218826:2219032 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.71 (kernels 0.13, alloc 0.38, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.02, rest 0.13) n136-128-154:2218825:2219031 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2218825:2219031 [0] NCCL INFO ncclCommInitRankConfig comm 0xe3cc210 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xf5eaff4420616091 - Init COMPLETE n136-128-154:2218825:2219031 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.71 (kernels 0.11, alloc 0.40, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.01, rest 0.13) n136-128-154:2218826:2219048 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2218825:2219049 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218825:2219049 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2218826:2219048 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2218825:2219049 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:23:25 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:23:25 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/20084 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2218826:2221263 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218826:2219044 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:16:47:07 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Qwen2.5-7B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 10, batch_size: auto (57)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |    10|acc     |↑  |0.3639|±  |0.0048|
|         |       |none  |    10|acc_norm|↑  |0.4923|±  |0.0050|
[rank0]:[W1209 16:47:08.661753181 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218825:2219046 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2218825:2221318 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2218826:2219044 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2218826:2221268 [1] NCCL INFO comm 0xe7126f0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2218825:2221318 [0] NCCL INFO comm 0xe3cc210 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task hellaswag evaluation complete!
Execution time: 2604.13 seconds
Namespace(save_dir='/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg', self_attn_layer_to_quant='24 25 23 26 27', mlp_layer_to_quant='24 25 23 26 27', model_id='/mnt/bn/life-mllm/users/cxr/quantization/models/meta-llama/Llama-3.1-8B', cuda_id=6)
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/4 [00:00
n136-128-154:2221864:2222047 [1] NCCL INFO Initialized NET plugin IB
n136-128-154:2221864:2222047 [1] NCCL INFO Assigned NET plugin IB to comm
n136-128-154:2221864:2222047 [1] NCCL INFO Using network IB
n136-128-154:2221863:2222046 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
n136-128-154:2221863:2222046 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2221863:2222046 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2221863:2222046 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2221863:2222046 [0] NCCL INFO Using network IB n136-128-154:2221864:2222047 [1] NCCL INFO ncclCommInitRankConfig comm 0x14fea0c0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x7c29df96cd79f60b - Init START n136-128-154:2221863:2222046 [0] NCCL INFO ncclCommInitRankConfig comm 0x166dd430 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x7c29df96cd79f60b - Init START n136-128-154:2221863:2222046 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2221864:2222047 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2221864:2222047 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2221864:2222047 [1] NCCL INFO Retrieving state for IB n136-128-154:2221864:2222047 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2221864:2222047 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2221864:2222047 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2221864:2222047 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2221864:2222047 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) 
n136-128-154:2221863:2222046 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2221863:2222046 [0] NCCL INFO Retrieving state for IB n136-128-154:2221863:2222046 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2221863:2222046 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2221863:2222046 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2221863:2222046 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2221863:2222046 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2221863:2222046 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2221863:2222046 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2221864:2222047 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2221864:2222047 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2221863:2222046 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2221863:2222046 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 
(1000c0101000ffff) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2221863:2222046 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2221864:2222047 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2221863:2222046 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2221864:2222047 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2221863:2222046 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2221863:2222046 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2221864:2222047 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2221863:2222046 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2221863:2222046 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2221863:2222046 [0] NCCL INFO ========================================== n136-128-154:2221864:2222047 [1] NCCL INFO + NVL[240.0] - NVS/0-0 
n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2221864:2222047 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2221863:2222046 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2221864:2222047 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2221863:2222046 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2221864:2222047 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2221864:2222047 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2221863:2222046 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2221864:2222047 [1] NCCL INFO ========================================== n136-128-154:2221864:2222047 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2221864:2222047 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2221864:2222047 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2221863:2222046 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2221863:2222046 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 4 : GPU/0-c5000 
GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2221863:2222046 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221863:2222046 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2221864:2222047 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 
n136-128-154:2221864:2222047 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2221864:2222047 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2221864:2222047 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221864:2222047 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2221863:2222046 [0] NCCL INFO comm 0x166dd430 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2221864:2222047 [1] NCCL INFO comm 0x14fea0c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 12 : 0 -> 1 -> 
-1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 19 : -1 -> 1 
-> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2221864:2222047 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 06/24 : 0 1 
n136-128-154:2221864:2222047 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 
n136-128-154:2221863:2222046 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2221863:2222046 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2221864:2222047 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2221864:2222047 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2221864:2222047 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 
17 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2221863:2222046 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2221863:2222046 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2221864:2222047 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2221864:2222060 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 46 n136-128-154:2221864:2222059 [1] NCCL INFO [Proxy Service] Device 1 CPU core 109 n136-128-154:2221863:2222046 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2221863:2222046 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2221863:2222061 [0] NCCL INFO [Proxy Service] Device 0 CPU core 34 n136-128-154:2221863:2222062 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 36 n136-128-154:2221864:2222047 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2221864:2222047 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2221863:2222046 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2221863:2222046 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2221863:2222046 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2221863:2222046 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2221863:2222046 [0] NCCL INFO ncclCommInitRankConfig comm 0x166dd430 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x7c29df96cd79f60b - Init COMPLETE n136-128-154:2221863:2222046 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.69 (kernels 0.12, alloc 0.35, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.02, rest 0.13) n136-128-154:2221864:2222047 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2221864:2222047 [1] NCCL INFO ncclCommInitRankConfig comm 0x14fea0c0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x7c29df96cd79f60b - Init COMPLETE n136-128-154:2221864:2222047 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.68 (kernels 0.12, alloc 0.34, bootstrap 0.01, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.03, rest 0.13) n136-128-154:2221863:2222063 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2221864:2222064 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222064 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221863:2222063 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2221864:2222064 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:53:09 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:53:09 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1268 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2221864:2222089 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2221864:2222059 [1] NCCL INFO misc/socket.cc:915 -> 3 2025-12-09:16:53:42 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (64)
| Tasks    |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.6803|±  |0.0131|
[rank0]:[W1209 16:53:42.128630331 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2221863:2222061 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2221863:2222165 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2221864:2222059 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2221864:2222095 [1] NCCL INFO comm 0x14fea0c0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2221863:2222165 [0] NCCL INFO comm 0x166dd430 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task winogrande evaluation complete!
==================================================
Starting evaluation: task=boolq | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/baseline_BI/boolq.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:54:31 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:16:54:31 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:54:31 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:16:54:31 INFO [__main__:440] Selected Tasks: ['boolq'] 2025-12-09:16:54:31 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:54:31 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:16:54:32 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:54:32 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! 
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2222211:2222378 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2222211:2222378 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2222211:2222378 [0] NCCL INFO Using network IB n136-128-154:2222211:2222378 [0] NCCL INFO ncclCommInitRankConfig comm 0x11e067c0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x2b3729dc5e596a90 - Init START 2025-12-09:16:54:52 WARNING [evaluator:309] Overwriting default num_fewshot of boolq from None to 0 2025-12-09:16:54:52 INFO [api.task:434] Building contexts for boolq on rank 1... 0%| | 0/1635 [00:00 n136-128-154:2222212:2222392 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2222212:2222392 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2222212:2222392 [1] NCCL INFO Using network IB n136-128-154:2222212:2222392 [1] NCCL INFO ncclCommInitRankConfig comm 0x12f06230 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x2b3729dc5e596a90 - Init START n136-128-154:2222212:2222392 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2222211:2222378 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2222212:2222392 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2222212:2222392 [1] NCCL INFO Retrieving state for IB n136-128-154:2222212:2222392 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2222212:2222392 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2222211:2222378 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2222211:2222378 [0] NCCL INFO Retrieving state for IB n136-128-154:2222211:2222378 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2222212:2222392 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with 
pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2222211:2222378 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2222212:2222392 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2222211:2222378 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2222212:2222392 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2222211:2222378 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2222211:2222378 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2222212:2222392 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222212:2222392 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222211:2222378 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222211:2222378 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222212:2222392 [1] NCCL INFO 
=== System : maxBw 240.0 totalBw 240.0 === n136-128-154:2222212:2222392 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2222212:2222392 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2222212:2222392 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2222212:2222392 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2222212:2222392 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2222212:2222392 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2222212:2222392 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2222212:2222392 [1] NCCL INFO ========================================== n136-128-154:2222212:2222392 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222211:2222378 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2222212:2222392 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222211:2222378 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2222211:2222378 [0] NCCL INFO + 
PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2222212:2222392 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2222211:2222378 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2222211:2222378 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2222211:2222378 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2222211:2222378 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2222211:2222378 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2222211:2222378 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2222211:2222378 [0] NCCL INFO ========================================== n136-128-154:2222211:2222378 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222211:2222378 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222211:2222378 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2222212:2222392 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2222212:2222392 [1] NCCL 
INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2222212:2222392 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222212:2222392 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222212:2222392 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2222212:2222392 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222212:2222392 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222212:2222392 [1] NCCL INFO 11 : GPU/0-c9000 
GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2222211:2222378 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222211:2222378 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222211:2222378 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222212:2222392 [1] NCCL INFO comm 0x12f06230 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2222211:2222378 [0] NCCL INFO comm 0x11e067c0 rank 0 nRanks 2 nNodes 1 localRanks 2 
localRank 0 MNNVL 0 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 
18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222212:2222392 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2222212:2222392 
[1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO 
Channel 20/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2222212:2222392 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2222212:2222392 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2222212:2222392 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 
[0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2222211:2222378 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2222211:2222378 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2222211:2222378 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2222211:2222378 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2222211:2222400 [0] NCCL INFO [Proxy Service] Device 0 CPU core 110 n136-128-154:2222211:2222401 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 47 n136-128-154:2222212:2222392 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2222212:2222402 [1] NCCL INFO [Proxy Service] Device 1 CPU core 34 n136-128-154:2222212:2222403 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 100 n136-128-154:2222212:2222392 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2222212:2222392 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2222211:2222378 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2222211:2222378 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2222211:2222378 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2222211:2222378 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2222211:2222378 [0] NCCL INFO ncclCommInitRankConfig comm 0x11e067c0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x2b3729dc5e596a90 - Init COMPLETE n136-128-154:2222211:2222378 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 2.58 (kernels 0.12, alloc 0.10, bootstrap 1.99, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.17, rest 0.14) n136-128-154:2222212:2222392 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2222212:2222392 [1] NCCL INFO ncclCommInitRankConfig comm 0x12f06230 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x2b3729dc5e596a90 - Init COMPLETE n136-128-154:2222212:2222392 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.66 (kernels 0.13, alloc 0.15, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.00, connections 0.16, rest 0.15) n136-128-154:2222211:2222404 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2222211:2222404 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222405 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222211:2222404 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2222212:2222405 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:54:54 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:54:54 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/3270 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222212:2222473 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2222212:2222473 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2222212:2222402 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:16:56:03 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|boolq|      2|none  |     0|acc   |↑  |0.6869|±  |0.0081|
[rank0]:[W1209 16:56:04.857565386 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2222211:2222400 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2222211:2222528 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2222212:2222402 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2222212:2222479 [1] NCCL INFO comm 0x12f06230 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2222211:2222528 [0] NCCL INFO comm 0x11e067c0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task boolq evaluation complete!
==================================================
Starting evaluation: task=arc_challenge | few-shot count=25 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/baseline_BI/arc_challenge.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:16:56:53 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:16:56:53 INFO [__main__:440] Selected Tasks: ['arc_challenge'] 2025-12-09:16:56:53 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:56:53 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:16:56:53 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:16:56:53 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:16:56:55 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:56:55 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:16:56:55 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2222575:2222766 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2222575:2222766 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2222575:2222766 [0] NCCL INFO Using network IB n136-128-154:2222575:2222766 [0] NCCL INFO ncclCommInitRankConfig comm 0x117a52e0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x4121f562e23c70d - Init START n136-128-154:2222576:2222768 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. n136-128-154:2222576:2222768 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-128-154:2222576:2222768 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6 n136-128-154:2222576:2222768 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0 n136-128-154:2222576:2222768 [1] NCCL INFO NCCL_IB_HCA set to mlx5 n136-128-154:2222576:2222768 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
n136-128-154:2222576:2222768 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2222576:2222768 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2222576:2222768 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2222576:2222768 [1] NCCL INFO Using network IB n136-128-154:2222576:2222768 [1] NCCL INFO ncclCommInitRankConfig comm 0x11f1d2b0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x4121f562e23c70d - Init START n136-128-154:2222575:2222766 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2222576:2222768 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2222575:2222766 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2222575:2222766 [0] NCCL INFO Retrieving state for IB n136-128-154:2222575:2222766 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2222575:2222766 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2222576:2222768 [1] NCCL INFO Retrieving state for IB n136-128-154:2222576:2222768 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2222575:2222766 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with 
pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2222575:2222766 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2222575:2222766 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2222576:2222768 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222576:2222768 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222575:2222766 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222575:2222766 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2222576:2222768 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2222576:2222768 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) 
n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2222576:2222768 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2222576:2222768 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2222576:2222768 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2222576:2222768 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2222576:2222768 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2222576:2222768 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2222576:2222768 [1] NCCL INFO ========================================== n136-128-154:2222576:2222768 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222576:2222768 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222576:2222768 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2222575:2222766 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2222575:2222766 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - 
GPU/0-c5000 (0) n136-128-154:2222575:2222766 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2222575:2222766 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2222575:2222766 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2222575:2222766 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2222575:2222766 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2222575:2222766 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2222575:2222766 [0] NCCL INFO ========================================== n136-128-154:2222575:2222766 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222575:2222766 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2222575:2222766 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2222576:2222768 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2222576:2222768 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 
6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2222576:2222768 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222576:2222768 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222576:2222768 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222576:2222768 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222576:2222768 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222576:2222768 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222576:2222768 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2222575:2222766 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 6 : GPU/0-c5000 
GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2222575:2222766 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2222575:2222766 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2222575:2222766 [0] NCCL INFO comm 0x117a52e0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2222576:2222768 [1] NCCL INFO comm 0x11f1d2b0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 12 : 
0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 19 
: -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222576:2222768 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 09/24 : 0 1 
n136-128-154:2222576:2222768 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 
n136-128-154:2222575:2222766 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2222576:2222768 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO 
Ring 19 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2222575:2222766 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2222575:2222766 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2222576:2222768 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2222576:2222780 [1] NCCL INFO [Proxy Service] Device 1 CPU core 36 n136-128-154:2222576:2222781 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 37 n136-128-154:2222575:2222766 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2222575:2222766 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2222575:2222782 [0] NCCL INFO [Proxy Service] Device 0 CPU core 104 n136-128-154:2222575:2222783 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 106 n136-128-154:2222576:2222768 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2222576:2222768 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2222575:2222766 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2222575:2222766 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2222575:2222766 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2222575:2222766 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2222575:2222766 [0] NCCL INFO ncclCommInitRankConfig comm 0x117a52e0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x4121f562e23c70d - Init COMPLETE n136-128-154:2222575:2222766 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.79 (kernels 0.13, alloc 0.19, bootstrap 0.16, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.14) n136-128-154:2222576:2222768 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2222576:2222768 [1] NCCL INFO ncclCommInitRankConfig comm 0x11f1d2b0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x4121f562e23c70d - Init COMPLETE n136-128-154:2222576:2222768 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.61 (kernels 0.13, alloc 0.17, bootstrap 0.00, allgathers 0.02, topo 0.15, graphs 0.00, connections 0.01, rest 0.13) n136-128-154:2222575:2222784 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2222576:2222785 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2222785 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222575:2222784 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2222576:2222785 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:16:57:29 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:16:57:29 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/2344 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2222576:2223034 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2222576:2223034 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' To add an exception for this directory, call: git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2222576:2222780 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:17:01:10 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 25, batch_size: auto (64)
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |    25|acc     |↑  |0.3712|±  |0.0141|
|             |       |none  |    25|acc_norm|↑  |0.4061|±  |0.0144|
[rank0]:[W1209 17:01:11.668862667 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2222575:2222782 [0] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:64 -> 3 n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:81 -> 3 n136-128-154:2222575:2223088 [0] NCCL INFO misc/socket.cc:863 -> 3 n136-128-154:2222576:2222780 [1] NCCL INFO misc/socket.cc:915 -> 3 n136-128-154:2222576:2223038 [1] NCCL INFO comm 0x11f1d2b0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE n136-128-154:2222575:2223088 [0] NCCL INFO comm 0x117a52e0 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task arc_challenge evaluation complete!
==================================================
Starting evaluation: task=truthfulqa_mc1 | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/baseline_BI/truthfulqa_mc1.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead: More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:17:02:00 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:17:02:00 INFO [__main__:440] Selected Tasks: ['truthfulqa_mc1'] 2025-12-09:17:02:00 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:17:02:00 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:17:02:00 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:17:02:00 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:17:02:01 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:17:02:01 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} `torch_dtype` is deprecated! Use `dtype` instead! The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:17:02:01 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2223134:2223316 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2223135:2223317 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2223135:2223317 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2223135:2223317 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2223134:2223316 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223134:2223316 [0] NCCL INFO Using network IB n136-128-154:2223135:2223317 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223135:2223317 [1] NCCL INFO Using network IB n136-128-154:2223135:2223317 [1] NCCL INFO ncclCommInitRankConfig comm 0x14fc5940 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xfde0410e5cf6efa8 - Init START n136-128-154:2223134:2223316 [0] NCCL INFO ncclCommInitRankConfig comm 0x14cd8680 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xfde0410e5cf6efa8 - Init START n136-128-154:2223135:2223317 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223134:2223316 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223135:2223317 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223135:2223317 [1] NCCL INFO Retrieving state for IB n136-128-154:2223135:2223317 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2223135:2223317 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] 
NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223134:2223316 [0] NCCL INFO Retrieving state for IB n136-128-154:2223134:2223316 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2223135:2223317 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2223135:2223317 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223135:2223317 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223134:2223316 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223134:2223316 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default 
n136-128-154:2223135:2223317 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223135:2223317 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223134:2223316 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223134:2223316 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223134:2223316 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223134:2223316 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223134:2223316 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223134:2223316 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223134:2223316 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223134:2223316 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223134:2223316 [0] NCCL INFO ========================================== n136-128-154:2223134:2223316 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223134:2223316 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 
(2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223134:2223316 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2223134:2223316 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223134:2223316 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223134:2223316 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223135:2223317 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223135:2223317 [1] NCCL INFO 
+ NVL[240.0] - NVS/0-0 n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223135:2223317 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223135:2223317 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223135:2223317 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223135:2223317 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223135:2223317 [1] NCCL INFO ========================================== n136-128-154:2223134:2223316 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223134:2223316 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223134:2223316 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223134:2223316 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223134:2223316 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2223134:2223316 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223134:2223316 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223134:2223316 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223134:2223316 [0] NCCL INFO 
10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223134:2223316 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223135:2223317 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223135:2223317 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223135:2223317 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO 10 : GPU/0-c9000 
GPU/0-c5000 n136-128-154:2223135:2223317 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223134:2223316 [0] NCCL INFO comm 0x14cd8680 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2223135:2223317 [1] NCCL INFO comm 0x14fc5940 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL 
INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223135:2223317 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 03/24 
: 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 
n136-128-154:2223134:2223316 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2223135:2223317 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2223135:2223317 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2223135:2223317 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 08 : 1 
-> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2223134:2223316 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2223134:2223316 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223134:2223316 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2223134:2223316 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2223134:2223328 [0] NCCL INFO [Proxy Service] Device 0 CPU core 106 n136-128-154:2223134:2223329 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 43 n136-128-154:2223135:2223317 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2223135:2223330 [1] NCCL INFO [Proxy Service] Device 1 CPU core 35 n136-128-154:2223135:2223331 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 36 n136-128-154:2223134:2223316 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223134:2223316 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223134:2223316 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2223135:2223317 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223135:2223317 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223134:2223316 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2223134:2223316 [0] NCCL INFO ncclCommInitRankConfig comm 0x14cd8680 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0xfde0410e5cf6efa8 - Init COMPLETE n136-128-154:2223134:2223316 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.71 (kernels 0.13, alloc 0.36, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.01, rest 0.12) n136-128-154:2223135:2223317 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2223135:2223317 [1] NCCL INFO ncclCommInitRankConfig comm 0x14fc5940 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0xfde0410e5cf6efa8 - Init COMPLETE n136-128-154:2223135:2223317 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.71 (kernels 0.12, alloc 0.37, bootstrap 0.00, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.01, rest 0.12) n136-128-154:2223134:2223332 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223134:2223332 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2223135:2223333 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223333 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2223134:2223332 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:17:02:34 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:17:02:34 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/2066 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223135:2223378 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223135:2223330 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:17:03:27 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc1|      2|none  |     0|acc   |↑  |0.2644|±  |0.0154|
[rank0]:[W1209 17:03:28.487046361 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223134:2223328 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223134:2223432 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223135:2223330 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223135:2223384 [1] NCCL INFO comm 0x14fc5940 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2223134:2223432 [0] NCCL INFO comm 0x14cd8680 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task truthfulqa_mc1 evaluation complete!
==================================================
Starting evaluation: task=piqa | num_fewshot=0 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/baseline_BI/piqa.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
    More than one GPU was found, enabling multi-GPU training.
    If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:17:04:17 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:17:04:17 INFO [__main__:440] Selected Tasks: ['piqa'] 2025-12-09:17:04:17 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:17:04:17 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:17:04:17 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234 2025-12-09:17:04:17 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'} 2025-12-09:17:04:18 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:17:04:19 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'} The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. 
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. 2025-12-09:17:04:19 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'} `torch_dtype` is deprecated! Use `dtype` instead! `torch_dtype` is deprecated! Use `dtype` instead! Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2223502:2223673 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2223502:2223673 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223502:2223673 [1] NCCL INFO Using network IB n136-128-154:2223501:2223672 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2223501:2223672 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2223501:2223672 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2223501:2223672 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223501:2223672 [0] NCCL INFO Using network IB n136-128-154:2223502:2223673 [1] NCCL INFO ncclCommInitRankConfig comm 0xd992cd0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x38c55482cacbc41c - Init START n136-128-154:2223501:2223672 [0] NCCL INFO ncclCommInitRankConfig comm 0x1530f880 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x38c55482cacbc41c - Init START n136-128-154:2223502:2223673 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223501:2223672 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223502:2223673 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223502:2223673 [1] NCCL INFO Retrieving state for IB n136-128-154:2223502:2223673 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2223502:2223673 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 
keep=1 coll=(null) n136-128-154:2223502:2223673 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223502:2223673 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223502:2223673 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223501:2223672 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223501:2223672 [0] NCCL INFO Retrieving state for IB n136-128-154:2223501:2223672 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2223501:2223672 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2223501:2223672 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223501:2223672 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223501:2223672 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223501:2223672 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223501:2223672 [0] NCCL INFO GPU Direct RDMA Enabled for 
GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223502:2223673 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223502:2223673 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223501:2223672 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223501:2223672 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223501:2223672 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223501:2223672 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223501:2223672 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223501:2223672 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223501:2223672 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223501:2223672 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223501:2223672 [0] NCCL INFO ========================================== n136-128-154:2223501:2223672 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) 
n136-128-154:2223501:2223672 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223501:2223672 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2223502:2223673 [1] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223502:2223673 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223502:2223673 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223502:2223673 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223502:2223673 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223502:2223673 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223502:2223673 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223502:2223673 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223502:2223673 [1] NCCL INFO ========================================== n136-128-154:2223502:2223673 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223502:2223673 [1] NCCL INFO GPU/0-c9000 
:GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223502:2223673 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2223501:2223672 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223501:2223672 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223501:2223672 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] 
NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223502:2223673 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223501:2223672 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223502:2223673 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223501:2223672 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223501:2223672 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223501:2223672 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223501:2223672 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223501:2223672 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 10 : GPU/0-c9000 
GPU/0-c5000 n136-128-154:2223501:2223672 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223502:2223673 [1] NCCL INFO comm 0xd992cd0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223501:2223672 [0] NCCL INFO comm 0x1530f880 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] 
NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 17 : -1 -> 0 
-> 1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 17 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223502:2223673 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223501:2223672 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 
n136-128-154:2223501:2223672 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 05/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] 
NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 12 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2223501:2223672 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2223501:2223672 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223502:2223673 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2223502:2223685 [1] NCCL INFO [Proxy Service] Device 1 CPU core 122 n136-128-154:2223502:2223686 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 123 n136-128-154:2223501:2223672 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2223501:2223672 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2223501:2223687 [0] NCCL INFO [Proxy Service] Device 0 CPU core 100 n136-128-154:2223501:2223688 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 102 n136-128-154:2223502:2223673 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223502:2223673 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223501:2223672 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223501:2223672 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223501:2223672 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2223502:2223673 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2223502:2223673 [1] NCCL INFO ncclCommInitRankConfig comm 0xd992cd0 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x38c55482cacbc41c - Init COMPLETE n136-128-154:2223502:2223673 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.47 (kernels 0.12, alloc 0.21, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.02, rest 0.03) n136-128-154:2223501:2223672 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2223501:2223672 [0] NCCL INFO ncclCommInitRankConfig comm 0x1530f880 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x38c55482cacbc41c - Init COMPLETE n136-128-154:2223501:2223672 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.48 (kernels 0.13, alloc 0.22, bootstrap 0.00, allgathers 0.01, topo 0.07, graphs 0.00, connections 0.03, rest 0.03) n136-128-154:2223502:2223689 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2223502:2223689 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223501:2223690 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223502:2223689 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2223501:2223690 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:17:04:36 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:17:04:36 INFO [evaluator:559] Running loglikelihood requests Running loglikelihood requests: 0%| | 0/1838 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via 
P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223502:2223711 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization' 
To add an exception for this directory, call:

	git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization

n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223502:2223685 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:17:05:06 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: auto (64)
|Tasks|Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|piqa |      1|none  |     0|acc     |↑  |0.6795|±  |0.0109|
|     |       |none  |     0|acc_norm|↑  |0.6866|±  |0.0108|

[rank0]:[W1209 17:05:07.286703568 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223501:2223687 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223501:2223767 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223502:2223685 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223502:2223717 [1] NCCL INFO comm 0xd992cd0 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2223501:2223767 [0] NCCL INFO comm 0x1530f880 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task piqa evaluation complete!
==================================================
Starting evaluation: task=hellaswag | num_fewshot=10 | model=Llama-3.1-8B-quantization-fg
Output path: results2/Llama-3.1-8B-quantization-fg/baseline_BI/hellaswag.json
==================================================
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-12-09:17:05:57 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-09:17:05:57 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-09:17:05:57 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'}
2025-12-09:17:05:57 INFO [__main__:440] Selected Tasks: ['hellaswag']
2025-12-09:17:05:57 INFO [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-12-09:17:05:57 INFO [evaluator:227] Initializing hf model, with arguments: {'pretrained': '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg'}
2025-12-09:17:05:58 WARNING [accelerate.utils.other:513] Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer you are loading from '/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
2025-12-09:17:05:58 INFO [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:1'}
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/4 [00:00 n136-128-154:2223813:2224024 [0] NCCL INFO Initialized NET plugin IB n136-128-154:2223813:2224024 [0] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223813:2224024 [0] NCCL INFO Using network IB n136-128-154:2223814:2224025 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. n136-128-154:2223814:2224025 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc03:9:451::154<0> n136-128-154:2223814:2224025 [1] NCCL INFO Initialized NET plugin IB n136-128-154:2223814:2224025 [1] NCCL INFO Assigned NET plugin IB to comm n136-128-154:2223814:2224025 [1] NCCL INFO Using network IB n136-128-154:2223813:2224024 [0] NCCL INFO ncclCommInitRankConfig comm 0x155a4e40 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x16c35cf1206b8bda - Init START n136-128-154:2223814:2224025 [1] NCCL INFO ncclCommInitRankConfig comm 0x12aba270 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x16c35cf1206b8bda - Init START n136-128-154:2223813:2224024 [0] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223814:2224025 [1] NCCL INFO RAS client listening socket at ::1<28028> n136-128-154:2223814:2224025 [1] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223814:2224025 [1] NCCL INFO Retrieving state for IB n136-128-154:2223814:2224025 [1] NCCL INFO Initialized state 0 for IB n136-128-154:2223814:2224025 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2223814:2224025 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223814:2224025 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with 
pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO TOPO/NET : Importing network plugins to topology n136-128-154:2223813:2224024 [0] NCCL INFO Retrieving state for IB n136-128-154:2223813:2224024 [0] NCCL INFO Initialized state 0 for IB n136-128-154:2223814:2224025 [1] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_0 in topo with pciPath=/sys/devices/pci0000:09/0000:09:02.0/0000:0a:00.0/0000:0b:08.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_1 in topo with pciPath=/sys/devices/pci0000:43/0000:43:02.0/0000:44:00.0/0000:45:08.0/0000:5e:00.0/0000:5f:00.0/0000:60:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_2 in topo with pciPath=/sys/devices/pci0000:82/0000:82:02.0/0000:83:00.0/0000:84:08.0/0000:93:00.0/0000:94:00.0/0000:95:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO ncclTopoPopulateNics : Filled mlx5_3 in topo with pciPath=/sys/devices/pci0000:be/0000:be:02.0/0000:bf:00.0/0000:c0:04.0/0000:ca:00.0/0000:cb:10.0/0000:cc:00.0 keep=1 coll=(null) n136-128-154:2223813:2224024 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223813:2224024 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223814:2224025 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 0 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223814:2224025 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 1 / HCA 3 (distance 5 <= 5), read 0 mode Default n136-128-154:2223814:2224025 [1] 
NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223814:2224025 [1] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223814:2224025 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223814:2224025 [1] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223814:2224025 [1] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223814:2224025 [1] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223814:2224025 [1] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223814:2224025 [1] NCCL INFO ========================================== n136-128-154:2223814:2224025 [1] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223814:2224025 [1] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223814:2224025 [1] NCCL INFO Setting affinity for GPU 7 to 33-62,97-126 n136-128-154:2223814:2224025 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 
20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223813:2224024 [0] NCCL INFO === System : maxBw 240.0 totalBw 240.0 === n136-128-154:2223814:2224025 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO CPU/0-1 (1/1/2) n136-128-154:2223814:2224025 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - PCI/0-83000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - NIC/0-95000 n136-128-154:2223814:2224025 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - PCI/0-bf000 (1000c0101000ffff) n136-128-154:2223814:2224025 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - PCI/0-c3000 (1000c01010de13b8) n136-128-154:2223814:2224025 [1] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - GPU/0-c5000 (0) n136-128-154:2223814:2224025 [1] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223814:2224025 [1] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - PCI/0-c7000 (1000c01010de13b8) n136-128-154:2223814:2224025 [1] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - GPU/0-c9000 (1) n136-128-154:2223814:2224025 [1] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + NVL[240.0] - NVS/0-0 n136-128-154:2223814:2224025 [1] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - NIC/0-cc000 n136-128-154:2223813:2224024 [0] NCCL INFO + SYS[10.0] - CPU/0-0 n136-128-154:2223813:2224024 [0] NCCL INFO CPU/0-0 (1/1/2) n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - 
PCI/0-a000 (1000c0101000ffff) n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - NIC/0-1d000 n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - PCI/0-44000 (1000c0101000ffff) n136-128-154:2223813:2224024 [0] NCCL INFO + PCI[24.0] - NIC/0-60000 n136-128-154:2223813:2224024 [0] NCCL INFO + SYS[10.0] - CPU/0-1 n136-128-154:2223813:2224024 [0] NCCL INFO ========================================== n136-128-154:2223813:2224024 [0] NCCL INFO GPU/0-c5000 :GPU/0-c5000 (0/5000.0/LOC) GPU/0-c9000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223813:2224024 [0] NCCL INFO GPU/0-c9000 :GPU/0-c5000 (2/240.0/NVL) GPU/0-c9000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS) n136-128-154:2223813:2224024 [0] NCCL INFO Setting affinity for GPU 6 to 33-62,97-126 n136-128-154:2223814:2224025 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223814:2224025 [1] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223814:2224025 [1] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223814:2224025 [1] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223814:2224025 [1] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223814:2224025 [1] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223814:2224025 [1] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223814:2224025 [1] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1 n136-128-154:2223813:2224024 [0] NCCL 
INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 6 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 7 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 8 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 9 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 10 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 11 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 40.000000/40.000000, type NVL/PIX, sameChannels 0 n136-128-154:2223813:2224024 [0] NCCL INFO 0 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 1 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 2 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 3 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 4 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 5 : GPU/0-c5000 GPU/0-c9000 n136-128-154:2223813:2224024 [0] NCCL INFO 6 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO 7 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO 8 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO 9 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO 10 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO 11 : GPU/0-c9000 GPU/0-c5000 n136-128-154:2223813:2224024 [0] NCCL INFO comm 0x155a4e40 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-128-154:2223814:2224025 [1] NCCL INFO comm 0x12aba270 rank 1 nRanks 2 
nNodes 1 localRanks 2 localRank 1 MNNVL 0 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 12 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 12 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 13 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 13 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 14 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 3 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 14 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 15 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 3 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 4 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 15 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 16 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 4 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 5 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 16 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 17 : -1 -> 0 -> 1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 5 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 6 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 17 : 0 -> 1 -> -1/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 18 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 6 : -1 -> 1 -> 0/-1/-1 
n136-128-154:2223813:2224024 [0] NCCL INFO Tree 7 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 18 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 19 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 7 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 8 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 19 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 20 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 8 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 9 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 20 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 21 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 9 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 10 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 21 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 22 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 10 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 11 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 22 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Tree 23 : 1 -> 0 -> -1/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 11 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223814:2224025 [1] NCCL INFO Tree 23 : -1 -> 1 -> 0/-1/-1 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 00/24 : 0 1 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 01/24 : 0 1 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 02/24 : 0 1 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 03/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 04/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO 
Channel 05/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 06/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 07/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 04 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 08/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 05 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 09/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 06 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 10/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 07 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 11/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 08 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 12/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 09 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 13/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 10 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 14/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 11 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 15/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 12 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 16/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 13 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 17/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 14 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 18/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 15 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 19/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 16 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 20/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 17 : 0 -> 1 
-> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 21/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 18 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 22/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 19 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Channel 23/24 : 0 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 20 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 21 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 22 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1 n136-128-154:2223814:2224025 [1] NCCL INFO Ring 23 : 0 -> 1 -> 0 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1 n136-128-154:2223814:2224025 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 04 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 05 : 1 -> 0 -> 1 n136-128-154:2223814:2224025 [1] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 06 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 07 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 08 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 09 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 10 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 11 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO 
Ring 12 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 13 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 14 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 15 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 16 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 17 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 18 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 19 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 20 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 21 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 22 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Ring 23 : 1 -> 0 -> 1 n136-128-154:2223813:2224024 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-128-154:2223813:2224024 [0] NCCL INFO P2P Chunksize set to 524288 n136-128-154:2223814:2224025 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. n136-128-154:2223814:2224037 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 116 n136-128-154:2223814:2224036 [1] NCCL INFO [Proxy Service] Device 1 CPU core 50 n136-128-154:2223813:2224024 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
n136-128-154:2223813:2224024 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 n136-128-154:2223813:2224038 [0] NCCL INFO [Proxy Service] Device 0 CPU core 108 n136-128-154:2223813:2224039 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 119 n136-128-154:2223814:2224025 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223814:2224025 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223813:2224024 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-128-154:2223813:2224024 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-128-154:2223813:2224024 [0] NCCL INFO CC Off, workFifoBytes 1048576 n136-128-154:2223813:2224024 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. n136-128-154:2223813:2224024 [0] NCCL INFO ncclCommInitRankConfig comm 0x155a4e40 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId c5000 commId 0x16c35cf1206b8bda - Init COMPLETE n136-128-154:2223813:2224024 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.71 (kernels 0.12, alloc 0.36, bootstrap 0.01, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.03, rest 0.14) n136-128-154:2223814:2224025 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
n136-128-154:2223814:2224025 [1] NCCL INFO ncclCommInitRankConfig comm 0x12aba270 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId c9000 commId 0x16c35cf1206b8bda - Init COMPLETE n136-128-154:2223814:2224025 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 0.70 (kernels 0.12, alloc 0.35, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.00, connections 0.03, rest 0.14) n136-128-154:2223813:2224040 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 04/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 05/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 06/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 07/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 08/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 09/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 10/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 11/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 04/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 12/0 : 0[6] -> 1[7] via 
P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 05/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 13/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 06/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 14/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 07/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 15/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 08/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 16/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 09/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 17/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 10/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 18/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 11/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 19/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 12/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 20/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 13/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 21/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 14/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 22/0 : 0[6] -> 1[7] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 15/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Channel 23/0 : 0[6] -> 1[7] via P2P/CUMEM/read 
n136-128-154:2223814:2224041 [1] NCCL INFO Channel 16/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 17/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 18/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 19/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 20/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 21/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 22/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2224041 [1] NCCL INFO Channel 23/0 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223813:2224040 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 n136-128-154:2223814:2224041 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 2025-12-09:17:06:47 INFO [evaluator:559] Running loglikelihood requests 2025-12-09:17:06:47 INFO [evaluator:559] Running loglikelihood requests Passed argument batch_size = auto:1. 
Detecting largest batch size Running loglikelihood requests: 0%| | 0/20084 [00:00 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 01/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 02/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 03/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 04/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 05/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 06/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 07/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 08/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 09/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 10/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 11/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 12/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 13/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 14/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 15/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 16/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 17/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 18/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 19/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 20/1 : 1[7] -> 0[6] via P2P/CUMEM/read n136-128-154:2223814:2226553 [1] NCCL INFO Channel 21/1 : 1[7] -> 0[6] via P2P/CUMEM/read 
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 22/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 23/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 24/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 25/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 26/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 27/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 28/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 29/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 30/1 : 1[7] -> 0[6] via P2P/CUMEM/read
n136-128-154:2223814:2226553 [1] NCCL INFO Channel 31/1 : 1[7] -> 0[6] via P2P/CUMEM/read
fatal: detected dubious ownership in repository at '/mnt/bn/life-mllm/users/cxr/quantization'
To add an exception for this directory, call:
    git config --global --add safe.directory /mnt/bn/life-mllm/users/cxr/quantization
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223814:2224036 [1] NCCL INFO misc/socket.cc:915 -> 3
2025-12-09:17:32:06 INFO [loggers.evaluation_tracker:209] Saving results aggregated
hf (pretrained=/mnt/bn/life-mllm/users/cxr/quantization/models/Llama-3.1-8B-quantization-fg), gen_kwargs: (None), limit: None, num_fewshot: 10, batch_size: auto (64)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |    10|acc     |↑  |0.4457|±  |0.0050|
|         |       |none  |    10|acc_norm|↑  |0.6242|±  |0.0048|
[rank0]:[W1209 17:32:07.519485233 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223813:2224038 [0] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:64 -> 3
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:81 -> 3
n136-128-154:2223813:2226625 [0] NCCL INFO misc/socket.cc:863 -> 3
n136-128-154:2223814:2224036 [1] NCCL INFO misc/socket.cc:915 -> 3
n136-128-154:2223814:2226574 [1] NCCL INFO comm 0x12aba270 rank 1 nranks 2 cudaDev 1 busId c9000 - Abort COMPLETE
n136-128-154:2223813:2226625 [0] NCCL INFO comm 0x155a4e40 rank 0 nranks 2 cudaDev 0 busId c5000 - Abort COMPLETE
Task hellaswag evaluation complete!
Execution time: 2699.23 seconds