| [2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. |
| [2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] |
| [2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] ***************************************** |
| [2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| [2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] ***************************************** |
| [2025-05-03 19:17:13] Experiment directory created at logs/nwm_cdit_m |
| [2025-05-03 19:17:27] CDiT Parameters: 1,011,959,456 |
| [2025-05-03 19:17:28] Dataset contains 132,929 images |
| [2025-05-03 19:17:28] Training for 300 epochs... |
| [2025-05-03 19:17:28] Beginning epoch 0... |
| [2025-05-03 19:20:24] (step=0000100) Train Loss: 0.3427, Train Steps/Sec: 0.57, Samples/Sec: 27.26 |
| [2025-05-03 19:21:10] (step=0000200) Train Loss: 0.2083, Train Steps/Sec: 2.15, Samples/Sec: 103.05 |
| [2025-05-03 19:21:57] (step=0000300) Train Loss: 0.1963, Train Steps/Sec: 2.15, Samples/Sec: 103.02 |
| [2025-05-03 19:22:45] (step=0000400) Train Loss: 0.1902, Train Steps/Sec: 2.10, Samples/Sec: 100.83 |
| [2025-05-03 19:23:31] (step=0000500) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95 |
| [2025-05-03 19:24:18] (step=0000600) Train Loss: 0.1827, Train Steps/Sec: 2.15, Samples/Sec: 103.03 |
| [2025-05-03 19:25:04] (step=0000700) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95 |
| [2025-05-03 19:25:51] (step=0000800) Train Loss: 0.1689, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:26:38] (step=0000900) Train Loss: 0.1784, Train Steps/Sec: 2.15, Samples/Sec: 102.99 |
| [2025-05-03 19:27:25] (step=0001000) Train Loss: 0.1725, Train Steps/Sec: 2.13, Samples/Sec: 102.40 |
| [2025-05-03 19:28:12] (step=0001100) Train Loss: 0.1645, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:28:58] (step=0001200) Train Loss: 0.1716, Train Steps/Sec: 2.13, Samples/Sec: 102.41 |
| [2025-05-03 19:29:45] (step=0001300) Train Loss: 0.1750, Train Steps/Sec: 2.15, Samples/Sec: 103.04 |
| [2025-05-03 19:30:32] (step=0001400) Train Loss: 0.1631, Train Steps/Sec: 2.15, Samples/Sec: 102.98 |
| [2025-05-03 19:31:19] (step=0001500) Train Loss: 0.1667, Train Steps/Sec: 2.12, Samples/Sec: 101.82 |
| [2025-05-03 19:32:06] (step=0001600) Train Loss: 0.1680, Train Steps/Sec: 2.15, Samples/Sec: 102.99 |
| [2025-05-03 19:32:52] (step=0001700) Train Loss: 0.1665, Train Steps/Sec: 2.15, Samples/Sec: 103.03 |
| [2025-05-03 19:33:39] (step=0001800) Train Loss: 0.1602, Train Steps/Sec: 2.15, Samples/Sec: 102.99 |
| [2025-05-03 19:34:26] (step=0001900) Train Loss: 0.1718, Train Steps/Sec: 2.12, Samples/Sec: 101.97 |
| [2025-05-03 19:35:12] (step=0002000) Train Loss: 0.1734, Train Steps/Sec: 2.15, Samples/Sec: 102.98 |
| [2025-05-03 19:35:29] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar |
| [2025-05-03 19:36:16] (step=0002100) Train Loss: 0.1608, Train Steps/Sec: 1.59, Samples/Sec: 76.15 |
| [2025-05-03 19:37:02] (step=0002200) Train Loss: 0.1668, Train Steps/Sec: 2.15, Samples/Sec: 103.05 |
| [2025-05-03 19:37:49] (step=0002300) Train Loss: 0.1628, Train Steps/Sec: 2.13, Samples/Sec: 102.43 |
| [2025-05-03 19:38:36] (step=0002400) Train Loss: 0.1686, Train Steps/Sec: 2.13, Samples/Sec: 102.36 |
| [2025-05-03 19:39:23] (step=0002500) Train Loss: 0.1595, Train Steps/Sec: 2.13, Samples/Sec: 102.36 |
| [2025-05-03 19:40:09] (step=0002600) Train Loss: 0.1698, Train Steps/Sec: 2.14, Samples/Sec: 102.95 |
| [2025-05-03 19:40:56] (step=0002700) Train Loss: 0.1662, Train Steps/Sec: 2.14, Samples/Sec: 102.55 |
| [2025-05-03 19:41:43] (step=0002800) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 103.00 |
| [2025-05-03 19:42:30] (step=0002900) Train Loss: 0.1673, Train Steps/Sec: 2.12, Samples/Sec: 101.75 |
| [2025-05-03 19:43:17] (step=0003000) Train Loss: 0.1561, Train Steps/Sec: 2.15, Samples/Sec: 102.97 |
| [2025-05-03 19:44:03] (step=0003100) Train Loss: 0.1615, Train Steps/Sec: 2.15, Samples/Sec: 103.00 |
| [2025-05-03 19:44:50] (step=0003200) Train Loss: 0.1586, Train Steps/Sec: 2.14, Samples/Sec: 102.50 |
| [2025-05-03 19:45:37] (step=0003300) Train Loss: 0.1537, Train Steps/Sec: 2.12, Samples/Sec: 101.82 |
| [2025-05-03 19:46:24] (step=0003400) Train Loss: 0.1555, Train Steps/Sec: 2.14, Samples/Sec: 102.96 |
| [2025-05-03 19:47:10] (step=0003500) Train Loss: 0.1598, Train Steps/Sec: 2.15, Samples/Sec: 103.00 |
| [2025-05-03 19:47:57] (step=0003600) Train Loss: 0.1564, Train Steps/Sec: 2.14, Samples/Sec: 102.58 |
| [2025-05-03 19:48:44] (step=0003700) Train Loss: 0.1616, Train Steps/Sec: 2.13, Samples/Sec: 102.32 |
| [2025-05-03 19:49:31] (step=0003800) Train Loss: 0.1593, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:50:18] (step=0003900) Train Loss: 0.1575, Train Steps/Sec: 2.14, Samples/Sec: 102.94 |
| [2025-05-03 19:51:04] (step=0004000) Train Loss: 0.1603, Train Steps/Sec: 2.13, Samples/Sec: 102.37 |
| [2025-05-03 19:51:19] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar |
| [2025-05-03 19:52:06] (step=0004100) Train Loss: 0.1566, Train Steps/Sec: 1.62, Samples/Sec: 77.61 |
| [2025-05-03 19:52:53] (step=0004200) Train Loss: 0.1528, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:53:40] (step=0004300) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 102.97 |
| [2025-05-03 19:54:27] (step=0004400) Train Loss: 0.1582, Train Steps/Sec: 2.14, Samples/Sec: 102.53 |
| [2025-05-03 19:55:13] (step=0004500) Train Loss: 0.1539, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:56:00] (step=0004600) Train Loss: 0.1567, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 19:56:47] (step=0004700) Train Loss: 0.1534, Train Steps/Sec: 2.15, Samples/Sec: 103.05 |
| [2025-05-03 19:57:33] (step=0004800) Train Loss: 0.1592, Train Steps/Sec: 2.15, Samples/Sec: 103.00 |
| [2025-05-03 19:58:20] (step=0004900) Train Loss: 0.1558, Train Steps/Sec: 2.13, Samples/Sec: 102.47 |
| [2025-05-03 19:59:07] (step=0005000) Train Loss: 0.1563, Train Steps/Sec: 2.12, Samples/Sec: 101.89 |
| [2025-05-03 19:59:54] (step=0005100) Train Loss: 0.1567, Train Steps/Sec: 2.15, Samples/Sec: 103.02 |
| [2025-05-03 20:00:41] (step=0005200) Train Loss: 0.1473, Train Steps/Sec: 2.15, Samples/Sec: 103.10 |
| [2025-05-03 20:01:27] (step=0005300) Train Loss: 0.1503, Train Steps/Sec: 2.13, Samples/Sec: 102.40 |
| [2025-05-03 20:02:14] (step=0005400) Train Loss: 0.1573, Train Steps/Sec: 2.13, Samples/Sec: 102.44 |
| [2025-05-03 20:03:01] (step=0005500) Train Loss: 0.1503, Train Steps/Sec: 2.14, Samples/Sec: 102.49 |
| [2025-05-03 20:03:48] (step=0005600) Train Loss: 0.1553, Train Steps/Sec: 2.15, Samples/Sec: 103.02 |
| [2025-05-03 20:04:35] (step=0005700) Train Loss: 0.1517, Train Steps/Sec: 2.14, Samples/Sec: 102.55 |
| [2025-05-03 20:05:21] (step=0005800) Train Loss: 0.1590, Train Steps/Sec: 2.13, Samples/Sec: 102.40 |
| [2025-05-03 20:06:08] (step=0005900) Train Loss: 0.1487, Train Steps/Sec: 2.13, Samples/Sec: 102.44 |
| [2025-05-03 20:06:55] (step=0006000) Train Loss: 0.1486, Train Steps/Sec: 2.14, Samples/Sec: 102.92 |
| [2025-05-03 20:07:10] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar |
| [2025-05-03 20:07:57] (step=0006100) Train Loss: 0.1519, Train Steps/Sec: 1.61, Samples/Sec: 77.30 |
| [2025-05-03 20:08:44] (step=0006200) Train Loss: 0.1544, Train Steps/Sec: 2.15, Samples/Sec: 103.04 |
| [2025-05-03 20:09:31] (step=0006300) Train Loss: 0.1520, Train Steps/Sec: 2.13, Samples/Sec: 102.01 |
| [2025-05-03 20:10:17] (step=0006400) Train Loss: 0.1439, Train Steps/Sec: 2.15, Samples/Sec: 103.02 |
| [2025-05-03 20:11:04] (step=0006500) Train Loss: 0.1527, Train Steps/Sec: 2.15, Samples/Sec: 103.01 |
| [2025-05-03 20:11:51] (step=0006600) Train Loss: 0.1510, Train Steps/Sec: 2.13, Samples/Sec: 102.31 |
| [2025-05-03 20:12:38] (step=0006700) Train Loss: 0.1495, Train Steps/Sec: 2.12, Samples/Sec: 101.83 |
| [2025-05-03 20:13:25] (step=0006800) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 102.98 |
| [2025-05-03 20:14:11] (step=0006900) Train Loss: 0.1505, Train Steps/Sec: 2.14, Samples/Sec: 102.89 |
| [2025-05-03 20:14:58] (step=0007000) Train Loss: 0.1450, Train Steps/Sec: 2.13, Samples/Sec: 102.45 |
| [2025-05-03 20:15:45] (step=0007100) Train Loss: 0.1522, Train Steps/Sec: 2.15, Samples/Sec: 103.02 |
| [2025-05-03 20:16:32] (step=0007200) Train Loss: 0.1496, Train Steps/Sec: 2.12, Samples/Sec: 101.90 |
| [2025-05-03 20:17:18] (step=0007300) Train Loss: 0.1483, Train Steps/Sec: 2.15, Samples/Sec: 103.08 |
| [2025-05-03 20:18:05] (step=0007400) Train Loss: 0.1457, Train Steps/Sec: 2.14, Samples/Sec: 102.48 |
| [2025-05-03 20:18:52] (step=0007500) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 103.07 |
| [2025-05-03 20:19:39] (step=0007600) Train Loss: 0.1475, Train Steps/Sec: 2.12, Samples/Sec: 101.98 |
| [2025-05-03 20:20:25] (step=0007700) Train Loss: 0.1506, Train Steps/Sec: 2.15, Samples/Sec: 103.07 |
| [2025-05-03 20:21:12] (step=0007800) Train Loss: 0.1528, Train Steps/Sec: 2.14, Samples/Sec: 102.50 |
| [2025-05-03 20:21:59] (step=0007900) Train Loss: 0.1442, Train Steps/Sec: 2.15, Samples/Sec: 103.03 |
| [2025-05-03 20:22:46] (step=0008000) Train Loss: 0.1514, Train Steps/Sec: 2.12, Samples/Sec: 101.91 |
| [2025-05-03 20:23:01] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar |
| [2025-05-03 20:23:47] (step=0008100) Train Loss: 0.1502, Train Steps/Sec: 1.62, Samples/Sec: 77.90 |
| [2025-05-03 20:24:34] (step=0008200) Train Loss: 0.1422, Train Steps/Sec: 2.15, Samples/Sec: 103.09 |
| [2025-05-03 20:25:21] (step=0008300) Train Loss: 0.1492, Train Steps/Sec: 2.14, Samples/Sec: 102.51 |
| [2025-05-03 20:26:08] (step=0008400) Train Loss: 0.1483, Train Steps/Sec: 2.12, Samples/Sec: 101.88 |
| [2025-05-03 20:26:55] (step=0008500) Train Loss: 0.1516, Train Steps/Sec: 2.14, Samples/Sec: 102.96 |
| [2025-05-03 20:27:41] (step=0008600) Train Loss: 0.1456, Train Steps/Sec: 2.15, Samples/Sec: 103.13 |
| [2025-05-03 20:28:28] (step=0008700) Train Loss: 0.1442, Train Steps/Sec: 2.13, Samples/Sec: 102.47 |
| [2025-05-03 20:29:15] (step=0008800) Train Loss: 0.1426, Train Steps/Sec: 2.13, Samples/Sec: 102.42 |
| [2025-05-03 20:30:02] (step=0008900) Train Loss: 0.1527, Train Steps/Sec: 2.14, Samples/Sec: 102.51 |
| [2025-05-03 20:30:48] (step=0009000) Train Loss: 0.1414, Train Steps/Sec: 2.15, Samples/Sec: 103.05 |
| [2025-05-03 20:31:35] (step=0009100) Train Loss: 0.1405, Train Steps/Sec: 2.13, Samples/Sec: 102.41 |
| [2025-05-03 20:32:22] (step=0009200) Train Loss: 0.1449, Train Steps/Sec: 2.14, Samples/Sec: 102.53 |
| [2025-05-03 20:33:09] (step=0009300) Train Loss: 0.1420, Train Steps/Sec: 2.13, Samples/Sec: 102.41 |
| [2025-05-03 20:33:55] (step=0009400) Train Loss: 0.1454, Train Steps/Sec: 2.15, Samples/Sec: 103.00 |
| [2025-05-03 20:34:42] (step=0009500) Train Loss: 0.1462, Train Steps/Sec: 2.14, Samples/Sec: 102.50 |
| [2025-05-03 20:35:29] (step=0009600) Train Loss: 0.1490, Train Steps/Sec: 2.14, Samples/Sec: 102.90 |
| [2025-05-03 20:36:16] (step=0009700) Train Loss: 0.1443, Train Steps/Sec: 2.12, Samples/Sec: 101.84 |
| [2025-05-03 20:37:03] (step=0009800) Train Loss: 0.1417, Train Steps/Sec: 2.14, Samples/Sec: 102.87 |
| [2025-05-03 20:37:50] (step=0009900) Train Loss: 0.1448, Train Steps/Sec: 2.13, Samples/Sec: 102.34 |
| [2025-05-03 20:38:36] (step=0010000) Train Loss: 0.1431, Train Steps/Sec: 2.15, Samples/Sec: 103.01 |
| Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip |
| Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip |
| [2025-05-03 20:38:52] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar |
| Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip |
| Traceback (most recent call last): |
| File "train.py", line 437, in <module> |
| main(args) |
| File "train.py", line 352, in main |
| sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context |
| return func(*args, **kwargs) |
| File "train.py", line 384, in evaluate |
| eval_model, _ = dreamsim(pretrained=True) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim |
| ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir, |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__ |
| ViTExtractor(model_type, stride, load_dir, device=device) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__ |
| self.model = ViTExtractor.create_model(model_type, load_dir) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model |
| model = torch.hub.load('facebookresearch/dino:main', model_type) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load |
| repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load", |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload |
| download_url_to_file(url, cached_file, progress=False) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file |
| u = urlopen(req) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen |
| return opener.open(url, data, timeout) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open |
| response = self._open(req, data) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open |
| result = self._call_chain(self.handle_open, protocol, protocol + |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain |
| result = func(*args) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open |
| return self.do_open(http.client.HTTPSConnection, req, |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1358, in do_open |
| r = h.getresponse() |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1348, in getresponse |
| response.begin() |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 316, in begin |
| version, status, reason = self._read_status() |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 285, in _read_status |
| raise RemoteDisconnected("Remote end closed connection without" |
| http.client.RemoteDisconnected: Remote end closed connection without response |
| Traceback (most recent call last): |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1354, in do_open |
| h.request(req.get_method(), req.selector, req.data, headers, |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1256, in request |
| self._send_request(method, url, body, headers, encode_chunked) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1302, in _send_request |
| self.endheaders(body, encode_chunked=encode_chunked) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1251, in endheaders |
| self._send_output(message_body, encode_chunked=encode_chunked) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1011, in _send_output |
| self.send(msg) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 951, in send |
| self.connect() |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1418, in connect |
| super().connect() |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 922, in connect |
| self.sock = self._create_connection( |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 820, in create_connection |
| raise err |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 808, in create_connection |
| sock.connect(sa) |
| TimeoutError: [Errno 110] Connection timed out |
|
|
| During handling of the above exception, another exception occurred: |
|
|
| Traceback (most recent call last): |
| File "train.py", line 437, in <module> |
| main(args) |
| File "train.py", line 352, in main |
| sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context |
| return func(*args, **kwargs) |
| File "train.py", line 384, in evaluate |
| eval_model, _ = dreamsim(pretrained=True) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim |
| ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir, |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__ |
| ViTExtractor(model_type, stride, load_dir, device=device) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__ |
| self.model = ViTExtractor.create_model(model_type, load_dir) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model |
| model = torch.hub.load('facebookresearch/dino:main', model_type) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load |
| repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load", |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload |
| download_url_to_file(url, cached_file, progress=False) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file |
| u = urlopen(req) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen |
| return opener.open(url, data, timeout) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open |
| response = self._open(req, data) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open |
| result = self._call_chain(self.handle_open, protocol, protocol + |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain |
| result = func(*args) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open |
| return self.do_open(http.client.HTTPSConnection, req, |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1357, in do_open |
| raise URLError(err) |
| urllib.error.URLError: <urlopen error [Errno 110] Connection timed out> |
| [2025-05-03 20:41:14,779] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715331 closing signal SIGTERM |
| [2025-05-03 20:41:14,780] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715440 closing signal SIGTERM |
| [2025-05-03 20:41:14,944] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 715461) of binary: /data1/zwc/miniconda3/envs/nwm2/bin/python |
| Traceback (most recent call last): |
| File "/data1/tpz/anaconda3/envs/nwm2/bin/torchrun", line 8, in <module> |
| sys.exit(main()) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| return f(*args, **kwargs) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main |
| run(args) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run |
| elastic_launch( |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent |
| raise ChildFailedError( |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| ============================================================ |
| train.py FAILED |
| ------------------------------------------------------------ |
| Failures: |
| <NO_OTHER_FAILURES> |
| ------------------------------------------------------------ |
| Root Cause (first observed failure): |
| [0]: |
| time : 2025-05-03_20:41:14 |
| host : localhost |
| rank : 2 (local_rank: 2) |
| exitcode : 1 (pid: 715461) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| ============================================================ |
|
|
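The tracebacks above all fail inside `dreamsim(pretrained=True)`, where `torch.hub.load('facebookresearch/dino:main', ...)` tries to download the repository and the connection either times out or is closed by the remote end. If the failure is transient, one possible mitigation is to retry the call with a growing delay. The sketch below is only an illustration under that assumption: `retry_on_network_error` and its parameters are hypothetical names, not part of `train.py`, `dreamsim`, or PyTorch, and it will not help if the host has no route to GitHub at all.

```python
import http.client
import time
import urllib.error

# The three exception types that appear in the tracebacks above.
NETWORK_ERRORS = (
    urllib.error.URLError,
    http.client.RemoteDisconnected,
    TimeoutError,
)


def retry_on_network_error(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a network error, wait and try again.

    Hypothetical helper: the delay doubles after each failed attempt
    (base_delay, 2 * base_delay, ...). The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except NETWORK_ERRORS:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Assumed usage inside evaluate() in train.py (not taken from the source):
#   eval_model, _ = retry_on_network_error(lambda: dreamsim(pretrained=True))
```

The log also shows several ranks downloading to the same `./models/main.zip` at once, so fetching the model once before launching `torchrun` may be worth trying as well.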