[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING]
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:13] Experiment directory created at logs/nwm_cdit_m
[2025-05-03 19:17:27] CDiT Parameters: 1,011,959,456
[2025-05-03 19:17:28] Dataset contains 132,929 images
[2025-05-03 19:17:28] Training for 300 epochs...
[2025-05-03 19:17:28] Beginning epoch 0...
[2025-05-03 19:20:24] (step=0000100) Train Loss: 0.3427, Train Steps/Sec: 0.57, Samples/Sec: 27.26
[2025-05-03 19:21:10] (step=0000200) Train Loss: 0.2083, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:21:57] (step=0000300) Train Loss: 0.1963, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 19:22:45] (step=0000400) Train Loss: 0.1902, Train Steps/Sec: 2.10, Samples/Sec: 100.83
[2025-05-03 19:23:31] (step=0000500) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:24:18] (step=0000600) Train Loss: 0.1827, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:25:04] (step=0000700) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:25:51] (step=0000800) Train Loss: 0.1689, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:26:38] (step=0000900) Train Loss: 0.1784, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:27:25] (step=0001000) Train Loss: 0.1725, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 19:28:12] (step=0001100) Train Loss: 0.1645, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:28:58] (step=0001200) Train Loss: 0.1716, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 19:29:45] (step=0001300) Train Loss: 0.1750, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 19:30:32] (step=0001400) Train Loss: 0.1631, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:31:19] (step=0001500) Train Loss: 0.1667, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:32:06] (step=0001600) Train Loss: 0.1680, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:32:52] (step=0001700) Train Loss: 0.1665, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:33:39] (step=0001800) Train Loss: 0.1602, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:34:26] (step=0001900) Train Loss: 0.1718, Train Steps/Sec: 2.12, Samples/Sec: 101.97
[2025-05-03 19:35:12] (step=0002000) Train Loss: 0.1734, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:35:29] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:36:16] (step=0002100) Train Loss: 0.1608, Train Steps/Sec: 1.59, Samples/Sec: 76.15
[2025-05-03 19:37:02] (step=0002200) Train Loss: 0.1668, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:37:49] (step=0002300) Train Loss: 0.1628, Train Steps/Sec: 2.13, Samples/Sec: 102.43
[2025-05-03 19:38:36] (step=0002400) Train Loss: 0.1686, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:39:23] (step=0002500) Train Loss: 0.1595, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:40:09] (step=0002600) Train Loss: 0.1698, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:40:56] (step=0002700) Train Loss: 0.1662, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 19:41:43] (step=0002800) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:42:30] (step=0002900) Train Loss: 0.1673, Train Steps/Sec: 2.12, Samples/Sec: 101.75
[2025-05-03 19:43:17] (step=0003000) Train Loss: 0.1561, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:44:03] (step=0003100) Train Loss: 0.1615, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:44:50] (step=0003200) Train Loss: 0.1586, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 19:45:37] (step=0003300) Train Loss: 0.1537, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:46:24] (step=0003400) Train Loss: 0.1555, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 19:47:10] (step=0003500) Train Loss: 0.1598, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:47:57] (step=0003600) Train Loss: 0.1564, Train Steps/Sec: 2.14, Samples/Sec: 102.58
[2025-05-03 19:48:44] (step=0003700) Train Loss: 0.1616, Train Steps/Sec: 2.13, Samples/Sec: 102.32
[2025-05-03 19:49:31] (step=0003800) Train Loss: 0.1593, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:50:18] (step=0003900) Train Loss: 0.1575, Train Steps/Sec: 2.14, Samples/Sec: 102.94
[2025-05-03 19:51:04] (step=0004000) Train Loss: 0.1603, Train Steps/Sec: 2.13, Samples/Sec: 102.37
[2025-05-03 19:51:19] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:52:06] (step=0004100) Train Loss: 0.1566, Train Steps/Sec: 1.62, Samples/Sec: 77.61
[2025-05-03 19:52:53] (step=0004200) Train Loss: 0.1528, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:53:40] (step=0004300) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:54:27] (step=0004400) Train Loss: 0.1582, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 19:55:13] (step=0004500) Train Loss: 0.1539, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:00] (step=0004600) Train Loss: 0.1567, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:47] (step=0004700) Train Loss: 0.1534, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:57:33] (step=0004800) Train Loss: 0.1592, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:58:20] (step=0004900) Train Loss: 0.1558, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 19:59:07] (step=0005000) Train Loss: 0.1563, Train Steps/Sec: 2.12, Samples/Sec: 101.89
[2025-05-03 19:59:54] (step=0005100) Train Loss: 0.1567, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:00:41] (step=0005200) Train Loss: 0.1473, Train Steps/Sec: 2.15, Samples/Sec: 103.10
[2025-05-03 20:01:27] (step=0005300) Train Loss: 0.1503, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:02:14] (step=0005400) Train Loss: 0.1573, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:03:01] (step=0005500) Train Loss: 0.1503, Train Steps/Sec: 2.14, Samples/Sec: 102.49
[2025-05-03 20:03:48] (step=0005600) Train Loss: 0.1553, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:04:35] (step=0005700) Train Loss: 0.1517, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 20:05:21] (step=0005800) Train Loss: 0.1590, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:06:08] (step=0005900) Train Loss: 0.1487, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:06:55] (step=0006000) Train Loss: 0.1486, Train Steps/Sec: 2.14, Samples/Sec: 102.92
[2025-05-03 20:07:10] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:07:57] (step=0006100) Train Loss: 0.1519, Train Steps/Sec: 1.61, Samples/Sec: 77.30
[2025-05-03 20:08:44] (step=0006200) Train Loss: 0.1544, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 20:09:31] (step=0006300) Train Loss: 0.1520, Train Steps/Sec: 2.13, Samples/Sec: 102.01
[2025-05-03 20:10:17] (step=0006400) Train Loss: 0.1439, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:11:04] (step=0006500) Train Loss: 0.1527, Train Steps/Sec: 2.15, Samples/Sec: 103.01
[2025-05-03 20:11:51] (step=0006600) Train Loss: 0.1510, Train Steps/Sec: 2.13, Samples/Sec: 102.31
[2025-05-03 20:12:38] (step=0006700) Train Loss: 0.1495, Train Steps/Sec: 2.12, Samples/Sec: 101.83
[2025-05-03 20:13:25] (step=0006800) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 20:14:11] (step=0006900) Train Loss: 0.1505, Train Steps/Sec: 2.14, Samples/Sec: 102.89
[2025-05-03 20:14:58] (step=0007000) Train Loss: 0.1450, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 20:15:45] (step=0007100) Train Loss: 0.1522, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:16:32] (step=0007200) Train Loss: 0.1496, Train Steps/Sec: 2.12, Samples/Sec: 101.90
[2025-05-03 20:17:18] (step=0007300) Train Loss: 0.1483, Train Steps/Sec: 2.15, Samples/Sec: 103.08
[2025-05-03 20:18:05] (step=0007400) Train Loss: 0.1457, Train Steps/Sec: 2.14, Samples/Sec: 102.48
[2025-05-03 20:18:52] (step=0007500) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:19:39] (step=0007600) Train Loss: 0.1475, Train Steps/Sec: 2.12, Samples/Sec: 101.98
[2025-05-03 20:20:25] (step=0007700) Train Loss: 0.1506, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:21:12] (step=0007800) Train Loss: 0.1528, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:21:59] (step=0007900) Train Loss: 0.1442, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 20:22:46] (step=0008000) Train Loss: 0.1514, Train Steps/Sec: 2.12, Samples/Sec: 101.91
[2025-05-03 20:23:01] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:23:47] (step=0008100) Train Loss: 0.1502, Train Steps/Sec: 1.62, Samples/Sec: 77.90
[2025-05-03 20:24:34] (step=0008200) Train Loss: 0.1422, Train Steps/Sec: 2.15, Samples/Sec: 103.09
[2025-05-03 20:25:21] (step=0008300) Train Loss: 0.1492, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:26:08] (step=0008400) Train Loss: 0.1483, Train Steps/Sec: 2.12, Samples/Sec: 101.88
[2025-05-03 20:26:55] (step=0008500) Train Loss: 0.1516, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 20:27:41] (step=0008600) Train Loss: 0.1456, Train Steps/Sec: 2.15, Samples/Sec: 103.13
[2025-05-03 20:28:28] (step=0008700) Train Loss: 0.1442, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 20:29:15] (step=0008800) Train Loss: 0.1426, Train Steps/Sec: 2.13, Samples/Sec: 102.42
[2025-05-03 20:30:02] (step=0008900) Train Loss: 0.1527, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:30:48] (step=0009000) Train Loss: 0.1414, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 20:31:35] (step=0009100) Train Loss: 0.1405, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:32:22] (step=0009200) Train Loss: 0.1449, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 20:33:09] (step=0009300) Train Loss: 0.1420, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:33:55] (step=0009400) Train Loss: 0.1454, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 20:34:42] (step=0009500) Train Loss: 0.1462, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:35:29] (step=0009600) Train Loss: 0.1490, Train Steps/Sec: 2.14, Samples/Sec: 102.90
[2025-05-03 20:36:16] (step=0009700) Train Loss: 0.1443, Train Steps/Sec: 2.12, Samples/Sec: 101.84
[2025-05-03 20:37:03] (step=0009800) Train Loss: 0.1417, Train Steps/Sec: 2.14, Samples/Sec: 102.87
[2025-05-03 20:37:50] (step=0009900) Train Loss: 0.1448, Train Steps/Sec: 2.13, Samples/Sec: 102.34
[2025-05-03 20:38:36] (step=0010000) Train Loss: 0.1431, Train Steps/Sec: 2.15, Samples/Sec: 103.01
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
[2025-05-03 20:38:52] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Traceback (most recent call last):
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 820, in create_connection
    raise err
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 808, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
[2025-05-03 20:41:14,779] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715331 closing signal SIGTERM
[2025-05-03 20:41:14,780] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715440 closing signal SIGTERM
[2025-05-03 20:41:14,944] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 715461) of binary: /data1/zwc/miniconda3/envs/nwm2/bin/python
Traceback (most recent call last):
  File "/data1/tpz/anaconda3/envs/nwm2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-03_20:41:14
  host      : localhost
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 715461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================