[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING]
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:13] Experiment directory created at logs/nwm_cdit_m
[2025-05-03 19:17:27] CDiT Parameters: 1,011,959,456
[2025-05-03 19:17:28] Dataset contains 132,929 images
[2025-05-03 19:17:28] Training for 300 epochs...
[2025-05-03 19:17:28] Beginning epoch 0...
[2025-05-03 19:20:24] (step=0000100) Train Loss: 0.3427, Train Steps/Sec: 0.57, Samples/Sec: 27.26
[2025-05-03 19:21:10] (step=0000200) Train Loss: 0.2083, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:21:57] (step=0000300) Train Loss: 0.1963, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 19:22:45] (step=0000400) Train Loss: 0.1902, Train Steps/Sec: 2.10, Samples/Sec: 100.83
[2025-05-03 19:23:31] (step=0000500) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:24:18] (step=0000600) Train Loss: 0.1827, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:25:04] (step=0000700) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:25:51] (step=0000800) Train Loss: 0.1689, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:26:38] (step=0000900) Train Loss: 0.1784, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:27:25] (step=0001000) Train Loss: 0.1725, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 19:28:12] (step=0001100) Train Loss: 0.1645, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:28:58] (step=0001200) Train Loss: 0.1716, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 19:29:45] (step=0001300) Train Loss: 0.1750, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 19:30:32] (step=0001400) Train Loss: 0.1631, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:31:19] (step=0001500) Train Loss: 0.1667, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:32:06] (step=0001600) Train Loss: 0.1680, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:32:52] (step=0001700) Train Loss: 0.1665, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:33:39] (step=0001800) Train Loss: 0.1602, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:34:26] (step=0001900) Train Loss: 0.1718, Train Steps/Sec: 2.12, Samples/Sec: 101.97
[2025-05-03 19:35:12] (step=0002000) Train Loss: 0.1734, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:35:29] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:36:16] (step=0002100) Train Loss: 0.1608, Train Steps/Sec: 1.59, Samples/Sec: 76.15
[2025-05-03 19:37:02] (step=0002200) Train Loss: 0.1668, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:37:49] (step=0002300) Train Loss: 0.1628, Train Steps/Sec: 2.13, Samples/Sec: 102.43
[2025-05-03 19:38:36] (step=0002400) Train Loss: 0.1686, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:39:23] (step=0002500) Train Loss: 0.1595, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:40:09] (step=0002600) Train Loss: 0.1698, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:40:56] (step=0002700) Train Loss: 0.1662, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 19:41:43] (step=0002800) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:42:30] (step=0002900) Train Loss: 0.1673, Train Steps/Sec: 2.12, Samples/Sec: 101.75
[2025-05-03 19:43:17] (step=0003000) Train Loss: 0.1561, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:44:03] (step=0003100) Train Loss: 0.1615, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:44:50] (step=0003200) Train Loss: 0.1586, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 19:45:37] (step=0003300) Train Loss: 0.1537, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:46:24] (step=0003400) Train Loss: 0.1555, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 19:47:10] (step=0003500) Train Loss: 0.1598, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:47:57] (step=0003600) Train Loss: 0.1564, Train Steps/Sec: 2.14, Samples/Sec: 102.58
[2025-05-03 19:48:44] (step=0003700) Train Loss: 0.1616, Train Steps/Sec: 2.13, Samples/Sec: 102.32
[2025-05-03 19:49:31] (step=0003800) Train Loss: 0.1593, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:50:18] (step=0003900) Train Loss: 0.1575, Train Steps/Sec: 2.14, Samples/Sec: 102.94
[2025-05-03 19:51:04] (step=0004000) Train Loss: 0.1603, Train Steps/Sec: 2.13, Samples/Sec: 102.37
[2025-05-03 19:51:19] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:52:06] (step=0004100) Train Loss: 0.1566, Train Steps/Sec: 1.62, Samples/Sec: 77.61
[2025-05-03 19:52:53] (step=0004200) Train Loss: 0.1528, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:53:40] (step=0004300) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:54:27] (step=0004400) Train Loss: 0.1582, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 19:55:13] (step=0004500) Train Loss: 0.1539, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:00] (step=0004600) Train Loss: 0.1567, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:47] (step=0004700) Train Loss: 0.1534, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:57:33] (step=0004800) Train Loss: 0.1592, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:58:20] (step=0004900) Train Loss: 0.1558, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 19:59:07] (step=0005000) Train Loss: 0.1563, Train Steps/Sec: 2.12, Samples/Sec: 101.89
[2025-05-03 19:59:54] (step=0005100) Train Loss: 0.1567, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:00:41] (step=0005200) Train Loss: 0.1473, Train Steps/Sec: 2.15, Samples/Sec: 103.10
[2025-05-03 20:01:27] (step=0005300) Train Loss: 0.1503, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:02:14] (step=0005400) Train Loss: 0.1573, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:03:01] (step=0005500) Train Loss: 0.1503, Train Steps/Sec: 2.14, Samples/Sec: 102.49
[2025-05-03 20:03:48] (step=0005600) Train Loss: 0.1553, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:04:35] (step=0005700) Train Loss: 0.1517, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 20:05:21] (step=0005800) Train Loss: 0.1590, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:06:08] (step=0005900) Train Loss: 0.1487, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:06:55] (step=0006000) Train Loss: 0.1486, Train Steps/Sec: 2.14, Samples/Sec: 102.92
[2025-05-03 20:07:10] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:07:57] (step=0006100) Train Loss: 0.1519, Train Steps/Sec: 1.61, Samples/Sec: 77.30
[2025-05-03 20:08:44] (step=0006200) Train Loss: 0.1544, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 20:09:31] (step=0006300) Train Loss: 0.1520, Train Steps/Sec: 2.13, Samples/Sec: 102.01
[2025-05-03 20:10:17] (step=0006400) Train Loss: 0.1439, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:11:04] (step=0006500) Train Loss: 0.1527, Train Steps/Sec: 2.15, Samples/Sec: 103.01
[2025-05-03 20:11:51] (step=0006600) Train Loss: 0.1510, Train Steps/Sec: 2.13, Samples/Sec: 102.31
[2025-05-03 20:12:38] (step=0006700) Train Loss: 0.1495, Train Steps/Sec: 2.12, Samples/Sec: 101.83
[2025-05-03 20:13:25] (step=0006800) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 20:14:11] (step=0006900) Train Loss: 0.1505, Train Steps/Sec: 2.14, Samples/Sec: 102.89
[2025-05-03 20:14:58] (step=0007000) Train Loss: 0.1450, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 20:15:45] (step=0007100) Train Loss: 0.1522, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:16:32] (step=0007200) Train Loss: 0.1496, Train Steps/Sec: 2.12, Samples/Sec: 101.90
[2025-05-03 20:17:18] (step=0007300) Train Loss: 0.1483, Train Steps/Sec: 2.15, Samples/Sec: 103.08
[2025-05-03 20:18:05] (step=0007400) Train Loss: 0.1457, Train Steps/Sec: 2.14, Samples/Sec: 102.48
[2025-05-03 20:18:52] (step=0007500) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:19:39] (step=0007600) Train Loss: 0.1475, Train Steps/Sec: 2.12, Samples/Sec: 101.98
[2025-05-03 20:20:25] (step=0007700) Train Loss: 0.1506, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:21:12] (step=0007800) Train Loss: 0.1528, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:21:59] (step=0007900) Train Loss: 0.1442, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 20:22:46] (step=0008000) Train Loss: 0.1514, Train Steps/Sec: 2.12, Samples/Sec: 101.91
[2025-05-03 20:23:01] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:23:47] (step=0008100) Train Loss: 0.1502, Train Steps/Sec: 1.62, Samples/Sec: 77.90
[2025-05-03 20:24:34] (step=0008200) Train Loss: 0.1422, Train Steps/Sec: 2.15, Samples/Sec: 103.09
[2025-05-03 20:25:21] (step=0008300) Train Loss: 0.1492, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:26:08] (step=0008400) Train Loss: 0.1483, Train Steps/Sec: 2.12, Samples/Sec: 101.88
[2025-05-03 20:26:55] (step=0008500) Train Loss: 0.1516, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 20:27:41] (step=0008600) Train Loss: 0.1456, Train Steps/Sec: 2.15, Samples/Sec: 103.13
[2025-05-03 20:28:28] (step=0008700) Train Loss: 0.1442, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 20:29:15] (step=0008800) Train Loss: 0.1426, Train Steps/Sec: 2.13, Samples/Sec: 102.42
[2025-05-03 20:30:02] (step=0008900) Train Loss: 0.1527, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:30:48] (step=0009000) Train Loss: 0.1414, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 20:31:35] (step=0009100) Train Loss: 0.1405, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:32:22] (step=0009200) Train Loss: 0.1449, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 20:33:09] (step=0009300) Train Loss: 0.1420, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:33:55] (step=0009400) Train Loss: 0.1454, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 20:34:42] (step=0009500) Train Loss: 0.1462, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:35:29] (step=0009600) Train Loss: 0.1490, Train Steps/Sec: 2.14, Samples/Sec: 102.90
[2025-05-03 20:36:16] (step=0009700) Train Loss: 0.1443, Train Steps/Sec: 2.12, Samples/Sec: 101.84
[2025-05-03 20:37:03] (step=0009800) Train Loss: 0.1417, Train Steps/Sec: 2.14, Samples/Sec: 102.87
[2025-05-03 20:37:50] (step=0009900) Train Loss: 0.1448, Train Steps/Sec: 2.13, Samples/Sec: 102.34
[2025-05-03 20:38:36] (step=0010000) Train Loss: 0.1431, Train Steps/Sec: 2.15, Samples/Sec: 103.01
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
[2025-05-03 20:38:52] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Traceback (most recent call last):
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 820, in create_connection
    raise err
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 808, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
[2025-05-03 20:41:14,779] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715331 closing signal SIGTERM
[2025-05-03 20:41:14,780] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715440 closing signal SIGTERM
[2025-05-03 20:41:14,944] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 715461) of binary: /data1/zwc/miniconda3/envs/nwm2/bin/python
Traceback (most recent call last):
  File "/data1/tpz/anaconda3/envs/nwm2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-03_20:41:14
  host      : localhost
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 715461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================