[*] Loading libraries...
[*] Loading tokenizer...
[*] Gathering 100 million tokens by streaming dataset...
Resolving data files: 100%|██████████████| 2410/2410 [00:00<00:00, 30853.46it/s]
[*] Gathering tokens: 100%|██| 400000000/400000000 [13:58<00:00, 477048.96tok/s]
[+] Collected 400,000,000 tokens → 1,562,500 chunks.
[*] Setting up model...
[*] Model parameters: 465,504
[*] Defining training arguments...
[*] Starting training...
{'loss': '5.986', 'grad_norm': '0.5017', 'learning_rate': '9.9e-05', 'epoch': '0.008192'}
{'loss': '5.403', 'grad_norm': '0.394', 'learning_rate': '0.000199', 'epoch': '0.01638'}
{'loss': '4.75', 'grad_norm': '0.9517', 'learning_rate': '0.000299', 'epoch': '0.02458'}
{'loss': '4.192', 'grad_norm': '1.073', 'learning_rate': '0.000399', 'epoch': '0.03277'}
{'loss': '3.702', 'grad_norm': '1.364', 'learning_rate': '0.000499', 'epoch': '0.04096'}
1%|▌ | 500/36624 [00:34<40:21, 14.92it/s]
Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 126.22it/s]
{'loss': '3.378', 'grad_norm': '1.906', 'learning_rate': '0.0004986', 'epoch': '0.04915'}
{'loss': '3.195', 'grad_norm': '1.332', 'learning_rate': '0.0004972', 'epoch': '0.05734'}
{'loss': '3.085', 'grad_norm': '1.36', 'learning_rate': '0.0004959', 'epoch': '0.06553'}
{'loss': '3.011', 'grad_norm': '1.354', 'learning_rate': '0.0004945', 'epoch': '0.07373'}
{'loss': '2.955', 'grad_norm': '1.423', 'learning_rate': '0.0004931', 'epoch': '0.08192'}
3%|█ | 1000/36624 [01:08<40:59, 14.48it/s]
Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 185.00it/s]
{'loss': '2.914', 'grad_norm': '1.194', 'learning_rate': '0.0004917', 'epoch': '0.09011'}
{'loss': '2.887', 'grad_norm': '1.145', 'learning_rate': '0.0004903', 'epoch': '0.0983'}
{'loss': '2.861', 'grad_norm': '1.353', 'learning_rate': '0.0004889', 'epoch': '0.1065'}
{'loss': '2.833', 'grad_norm': '1.226', 'learning_rate': '0.0004876', 'epoch': '0.1147'}
{'loss': '2.824', 'grad_norm': '1.226', 'learning_rate': 
'0.0004862', 'epoch': '0.1229'} 4%|█▌ | 1500/36624 [01:42<40:32, 14.44it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.87it/s] {'loss': '2.806', 'grad_norm': '1.204', 'learning_rate': '0.0004848', 'epoch': '0.1311'} {'loss': '2.786', 'grad_norm': '1.139', 'learning_rate': '0.0004834', 'epoch': '0.1393'} {'loss': '2.777', 'grad_norm': '1.099', 'learning_rate': '0.000482', 'epoch': '0.1475'} {'loss': '2.765', 'grad_norm': '1.127', 'learning_rate': '0.0004806', 'epoch': '0.1556'} {'loss': '2.754', 'grad_norm': '1.186', 'learning_rate': '0.0004793', 'epoch': '0.1638'} 5%|██ | 2000/36624 [02:16<39:37, 14.56it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.92it/s] {'loss': '2.749', 'grad_norm': '1.068', 'learning_rate': '0.0004779', 'epoch': '0.172'} {'loss': '2.732', 'grad_norm': '1.086', 'learning_rate': '0.0004765', 'epoch': '0.1802'} {'loss': '2.73', 'grad_norm': '1.105', 'learning_rate': '0.0004751', 'epoch': '0.1884'} {'loss': '2.721', 'grad_norm': '1.213', 'learning_rate': '0.0004737', 'epoch': '0.1966'} {'loss': '2.717', 'grad_norm': '1.168', 'learning_rate': '0.0004723', 'epoch': '0.2048'} 7%|██▌ | 2500/36624 [02:50<39:00, 14.58it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.91it/s] {'loss': '2.708', 'grad_norm': '1.081', 'learning_rate': '0.0004709', 'epoch': '0.213'} {'loss': '2.705', 'grad_norm': '1.083', 'learning_rate': '0.0004696', 'epoch': '0.2212'} {'loss': '2.697', 'grad_norm': '1.079', 'learning_rate': '0.0004682', 'epoch': '0.2294'} {'loss': '2.692', 'grad_norm': '1.123', 'learning_rate': '0.0004668', 'epoch': '0.2376'} {'loss': '2.687', 'grad_norm': '1.147', 'learning_rate': '0.0004654', 'epoch': '0.2458'} 8%|███ | 3000/36624 [03:24<37:58, 14.76it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.12it/s] {'loss': '2.681', 'grad_norm': '1.052', 'learning_rate': '0.000464', 'epoch': '0.2539'} {'loss': '2.676', 'grad_norm': '1.099', 
'learning_rate': '0.0004626', 'epoch': '0.2621'} {'loss': '2.674', 'grad_norm': '1.084', 'learning_rate': '0.0004613', 'epoch': '0.2703'} {'loss': '2.672', 'grad_norm': '1.057', 'learning_rate': '0.0004599', 'epoch': '0.2785'} {'loss': '2.672', 'grad_norm': '1.103', 'learning_rate': '0.0004585', 'epoch': '0.2867'} 10%|███▋ | 3500/36624 [03:59<38:12, 14.45it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.64it/s] {'loss': '2.661', 'grad_norm': '1.062', 'learning_rate': '0.0004571', 'epoch': '0.2949'} {'loss': '2.658', 'grad_norm': '1.055', 'learning_rate': '0.0004557', 'epoch': '0.3031'} {'loss': '2.656', 'grad_norm': '1.06', 'learning_rate': '0.0004543', 'epoch': '0.3113'} {'loss': '2.653', 'grad_norm': '1.1', 'learning_rate': '0.000453', 'epoch': '0.3195'} {'loss': '2.651', 'grad_norm': '1.137', 'learning_rate': '0.0004516', 'epoch': '0.3277'} 11%|████▏ | 4000/36624 [04:33<37:14, 14.60it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.63it/s] {'loss': '2.648', 'grad_norm': '1.009', 'learning_rate': '0.0004502', 'epoch': '0.3359'} {'loss': '2.639', 'grad_norm': '1', 'learning_rate': '0.0004488', 'epoch': '0.3441'} {'loss': '2.641', 'grad_norm': '1.044', 'learning_rate': '0.0004474', 'epoch': '0.3522'} {'loss': '2.641', 'grad_norm': '1.039', 'learning_rate': '0.000446', 'epoch': '0.3604'} {'loss': '2.637', 'grad_norm': '1.036', 'learning_rate': '0.0004446', 'epoch': '0.3686'} 12%|████▋ | 4500/36624 [05:07<36:26, 14.69it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.36it/s] {'loss': '2.632', 'grad_norm': '0.9873', 'learning_rate': '0.0004433', 'epoch': '0.3768'} {'loss': '2.631', 'grad_norm': '1.043', 'learning_rate': '0.0004419', 'epoch': '0.385'} {'loss': '2.63', 'grad_norm': '1.063', 'learning_rate': '0.0004405', 'epoch': '0.3932'} {'loss': '2.624', 'grad_norm': '1.026', 'learning_rate': '0.0004391', 'epoch': '0.4014'} {'loss': '2.624', 'grad_norm': '1.011', 'learning_rate': 
'0.0004377', 'epoch': '0.4096'} 14%|█████▏ | 5000/36624 [05:41<36:09, 14.58it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.30it/s] {'loss': '2.625', 'grad_norm': '1.08', 'learning_rate': '0.0004363', 'epoch': '0.4178'} {'loss': '2.621', 'grad_norm': '1.007', 'learning_rate': '0.000435', 'epoch': '0.426'} {'loss': '2.618', 'grad_norm': '1.025', 'learning_rate': '0.0004336', 'epoch': '0.4342'} {'loss': '2.616', 'grad_norm': '0.9491', 'learning_rate': '0.0004322', 'epoch': '0.4424'} {'loss': '2.615', 'grad_norm': '1.072', 'learning_rate': '0.0004308', 'epoch': '0.4505'} 15%|█████▋ | 5500/36624 [06:15<35:20, 14.67it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.28it/s] {'loss': '2.604', 'grad_norm': '0.986', 'learning_rate': '0.0004294', 'epoch': '0.4587'} {'loss': '2.609', 'grad_norm': '0.9908', 'learning_rate': '0.000428', 'epoch': '0.4669'} {'loss': '2.606', 'grad_norm': '0.9686', 'learning_rate': '0.0004267', 'epoch': '0.4751'} {'loss': '2.61', 'grad_norm': '1.009', 'learning_rate': '0.0004253', 'epoch': '0.4833'} {'loss': '2.606', 'grad_norm': '1.003', 'learning_rate': '0.0004239', 'epoch': '0.4915'} 16%|██████▏ | 6000/36624 [06:49<34:56, 14.61it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.44it/s] {'loss': '2.602', 'grad_norm': '0.9795', 'learning_rate': '0.0004225', 'epoch': '0.4997'} {'loss': '2.601', 'grad_norm': '1.023', 'learning_rate': '0.0004211', 'epoch': '0.5079'} {'loss': '2.596', 'grad_norm': '1.023', 'learning_rate': '0.0004197', 'epoch': '0.5161'} {'loss': '2.598', 'grad_norm': '0.9583', 'learning_rate': '0.0004184', 'epoch': '0.5243'} {'loss': '2.597', 'grad_norm': '0.9572', 'learning_rate': '0.000417', 'epoch': '0.5325'} 18%|██████▋ | 6500/36624 [07:24<34:21, 14.61it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 207.81it/s] {'loss': '2.596', 'grad_norm': '1.056', 'learning_rate': '0.0004156', 'epoch': '0.5407'} {'loss': '2.594', 
'grad_norm': '1.007', 'learning_rate': '0.0004142', 'epoch': '0.5488'} {'loss': '2.593', 'grad_norm': '0.9365', 'learning_rate': '0.0004128', 'epoch': '0.557'} {'loss': '2.593', 'grad_norm': '0.9879', 'learning_rate': '0.0004114', 'epoch': '0.5652'} {'loss': '2.594', 'grad_norm': '1.078', 'learning_rate': '0.00041', 'epoch': '0.5734'} 19%|███████▎ | 7000/36624 [07:58<33:22, 14.79it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 149.60it/s] {'loss': '2.589', 'grad_norm': '1.011', 'learning_rate': '0.0004087', 'epoch': '0.5816'} {'loss': '2.585', 'grad_norm': '0.9979', 'learning_rate': '0.0004073', 'epoch': '0.5898'} {'loss': '2.587', 'grad_norm': '0.9675', 'learning_rate': '0.0004059', 'epoch': '0.598'} {'loss': '2.584', 'grad_norm': '0.9291', 'learning_rate': '0.0004045', 'epoch': '0.6062'} {'loss': '2.583', 'grad_norm': '0.9513', 'learning_rate': '0.0004031', 'epoch': '0.6144'} 20%|███████▊ | 7500/36624 [08:32<33:15, 14.60it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 179.61it/s] {'loss': '2.584', 'grad_norm': '1.012', 'learning_rate': '0.0004017', 'epoch': '0.6226'} {'loss': '2.585', 'grad_norm': '1.012', 'learning_rate': '0.0004004', 'epoch': '0.6308'} {'loss': '2.578', 'grad_norm': '1.016', 'learning_rate': '0.000399', 'epoch': '0.639'} {'loss': '2.58', 'grad_norm': '0.994', 'learning_rate': '0.0003976', 'epoch': '0.6471'} {'loss': '2.578', 'grad_norm': '1.003', 'learning_rate': '0.0003962', 'epoch': '0.6553'} 22%|████████▎ | 8000/36624 [09:06<32:34, 14.64it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.38it/s] {'loss': '2.581', 'grad_norm': '1.01', 'learning_rate': '0.0003948', 'epoch': '0.6635'} {'loss': '2.573', 'grad_norm': '0.9192', 'learning_rate': '0.0003934', 'epoch': '0.6717'} {'loss': '2.577', 'grad_norm': '0.955', 'learning_rate': '0.0003921', 'epoch': '0.6799'} {'loss': '2.575', 'grad_norm': '1.005', 'learning_rate': '0.0003907', 'epoch': '0.6881'} {'loss': '2.577', 
'grad_norm': '0.922', 'learning_rate': '0.0003893', 'epoch': '0.6963'} 23%|████████▊ | 8500/36624 [09:40<31:54, 14.69it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.22it/s] {'loss': '2.573', 'grad_norm': '0.9621', 'learning_rate': '0.0003879', 'epoch': '0.7045'} {'loss': '2.57', 'grad_norm': '0.9889', 'learning_rate': '0.0003865', 'epoch': '0.7127'} {'loss': '2.568', 'grad_norm': '0.9244', 'learning_rate': '0.0003851', 'epoch': '0.7209'} {'loss': '2.57', 'grad_norm': '1.009', 'learning_rate': '0.0003837', 'epoch': '0.7291'} {'loss': '2.567', 'grad_norm': '0.9754', 'learning_rate': '0.0003824', 'epoch': '0.7373'} 25%|█████████▎ | 9000/36624 [10:14<31:31, 14.60it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.72it/s] {'loss': '2.57', 'grad_norm': '0.964', 'learning_rate': '0.000381', 'epoch': '0.7454'} {'loss': '2.567', 'grad_norm': '0.9354', 'learning_rate': '0.0003796', 'epoch': '0.7536'} {'loss': '2.569', 'grad_norm': '0.9461', 'learning_rate': '0.0003782', 'epoch': '0.7618'} {'loss': '2.565', 'grad_norm': '0.9415', 'learning_rate': '0.0003768', 'epoch': '0.77'} {'loss': '2.566', 'grad_norm': '0.9319', 'learning_rate': '0.0003754', 'epoch': '0.7782'} 26%|█████████▊ | 9500/36624 [10:49<31:23, 14.40it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.52it/s] {'loss': '2.56', 'grad_norm': '0.917', 'learning_rate': '0.0003741', 'epoch': '0.7864'} {'loss': '2.562', 'grad_norm': '0.982', 'learning_rate': '0.0003727', 'epoch': '0.7946'} {'loss': '2.563', 'grad_norm': '0.9996', 'learning_rate': '0.0003713', 'epoch': '0.8028'} {'loss': '2.559', 'grad_norm': '0.9066', 'learning_rate': '0.0003699', 'epoch': '0.811'} {'loss': '2.562', 'grad_norm': '0.9582', 'learning_rate': '0.0003685', 'epoch': '0.8192'} 27%|██████████ | 10000/36624 [11:23<30:09, 14.72it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.08it/s] {'loss': '2.557', 'grad_norm': '0.9477', 
'learning_rate': '0.0003671', 'epoch': '0.8274'} {'loss': '2.56', 'grad_norm': '0.9513', 'learning_rate': '0.0003658', 'epoch': '0.8356'} {'loss': '2.559', 'grad_norm': '0.9462', 'learning_rate': '0.0003644', 'epoch': '0.8437'} {'loss': '2.558', 'grad_norm': '0.9505', 'learning_rate': '0.000363', 'epoch': '0.8519'} {'loss': '2.556', 'grad_norm': '0.9055', 'learning_rate': '0.0003616', 'epoch': '0.8601'} 29%|██████████▌ | 10500/36624 [11:57<29:42, 14.66it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 186.19it/s] {'loss': '2.552', 'grad_norm': '0.9765', 'learning_rate': '0.0003602', 'epoch': '0.8683'} {'loss': '2.557', 'grad_norm': '0.9443', 'learning_rate': '0.0003588', 'epoch': '0.8765'} {'loss': '2.555', 'grad_norm': '0.8971', 'learning_rate': '0.0003574', 'epoch': '0.8847'} {'loss': '2.553', 'grad_norm': '0.9489', 'learning_rate': '0.0003561', 'epoch': '0.8929'} {'loss': '2.552', 'grad_norm': '1', 'learning_rate': '0.0003547', 'epoch': '0.9011'} 30%|███████████ | 11000/36624 [12:31<28:47, 14.83it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.59it/s] {'loss': '2.557', 'grad_norm': '0.915', 'learning_rate': '0.0003533', 'epoch': '0.9093'} {'loss': '2.552', 'grad_norm': '0.911', 'learning_rate': '0.0003519', 'epoch': '0.9175'} {'loss': '2.554', 'grad_norm': '0.9488', 'learning_rate': '0.0003505', 'epoch': '0.9257'} {'loss': '2.547', 'grad_norm': '0.9326', 'learning_rate': '0.0003491', 'epoch': '0.9339'} {'loss': '2.555', 'grad_norm': '0.9041', 'learning_rate': '0.0003478', 'epoch': '0.942'} 31%|███████████▌ | 11500/36624 [13:06<28:39, 14.61it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.03it/s] {'loss': '2.547', 'grad_norm': '0.9229', 'learning_rate': '0.0003464', 'epoch': '0.9502'} {'loss': '2.547', 'grad_norm': '0.9645', 'learning_rate': '0.000345', 'epoch': '0.9584'} {'loss': '2.548', 'grad_norm': '0.9408', 'learning_rate': '0.0003436', 'epoch': '0.9666'} {'loss': '2.546', 
'grad_norm': '0.9032', 'learning_rate': '0.0003422', 'epoch': '0.9748'} {'loss': '2.549', 'grad_norm': '0.918', 'learning_rate': '0.0003408', 'epoch': '0.983'} 33%|████████████ | 12000/36624 [13:40<28:04, 14.62it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.44it/s] {'loss': '2.547', 'grad_norm': '0.9086', 'learning_rate': '0.0003395', 'epoch': '0.9912'} {'loss': '2.544', 'grad_norm': '0.9125', 'learning_rate': '0.0003381', 'epoch': '0.9994'} {'loss': '2.541', 'grad_norm': '0.9181', 'learning_rate': '0.0003367', 'epoch': '1.008'} {'loss': '2.545', 'grad_norm': '0.9132', 'learning_rate': '0.0003353', 'epoch': '1.016'} {'loss': '2.542', 'grad_norm': '0.9156', 'learning_rate': '0.0003339', 'epoch': '1.024'} 34%|████████████▋ | 12500/36624 [14:15<27:29, 14.62it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 114.07it/s] {'loss': '2.538', 'grad_norm': '0.9441', 'learning_rate': '0.0003325', 'epoch': '1.032'} {'loss': '2.542', 'grad_norm': '0.9385', 'learning_rate': '0.0003312', 'epoch': '1.04'} {'loss': '2.536', 'grad_norm': '0.9842', 'learning_rate': '0.0003298', 'epoch': '1.048'} {'loss': '2.542', 'grad_norm': '0.9319', 'learning_rate': '0.0003284', 'epoch': '1.057'} {'loss': '2.537', 'grad_norm': '0.8883', 'learning_rate': '0.000327', 'epoch': '1.065'} 35%|█████████████▏ | 13000/36624 [14:50<27:04, 14.54it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 171.43it/s] {'loss': '2.54', 'grad_norm': '0.9869', 'learning_rate': '0.0003256', 'epoch': '1.073'} {'loss': '2.539', 'grad_norm': '0.8919', 'learning_rate': '0.0003242', 'epoch': '1.081'} {'loss': '2.533', 'grad_norm': '0.9155', 'learning_rate': '0.0003228', 'epoch': '1.089'} {'loss': '2.537', 'grad_norm': '0.9485', 'learning_rate': '0.0003215', 'epoch': '1.098'} {'loss': '2.539', 'grad_norm': '0.9354', 'learning_rate': '0.0003201', 'epoch': '1.106'} 37%|█████████████▋ | 13500/36624 [15:24<26:16, 14.67it/s] Writing model shards: 
100%|██████████████████████| 1/1 [00:00<00:00, 199.73it/s] {'loss': '2.535', 'grad_norm': '0.9028', 'learning_rate': '0.0003187', 'epoch': '1.114'} {'loss': '2.533', 'grad_norm': '0.9042', 'learning_rate': '0.0003173', 'epoch': '1.122'} {'loss': '2.533', 'grad_norm': '0.9192', 'learning_rate': '0.0003159', 'epoch': '1.13'} {'loss': '2.533', 'grad_norm': '0.8816', 'learning_rate': '0.0003145', 'epoch': '1.139'} {'loss': '2.53', 'grad_norm': '0.9064', 'learning_rate': '0.0003132', 'epoch': '1.147'} 38%|██████████████▏ | 14000/36624 [15:58<26:09, 14.42it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.55it/s] {'loss': '2.534', 'grad_norm': '0.9424', 'learning_rate': '0.0003118', 'epoch': '1.155'} {'loss': '2.53', 'grad_norm': '0.9198', 'learning_rate': '0.0003104', 'epoch': '1.163'} {'loss': '2.53', 'grad_norm': '0.9234', 'learning_rate': '0.000309', 'epoch': '1.171'} {'loss': '2.533', 'grad_norm': '1.027', 'learning_rate': '0.0003076', 'epoch': '1.18'} {'loss': '2.531', 'grad_norm': '0.9083', 'learning_rate': '0.0003062', 'epoch': '1.188'} 40%|██████████████▋ | 14500/36624 [16:32<25:17, 14.58it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.31it/s] {'loss': '2.53', 'grad_norm': '0.8941', 'learning_rate': '0.0003049', 'epoch': '1.196'} {'loss': '2.533', 'grad_norm': '0.9395', 'learning_rate': '0.0003035', 'epoch': '1.204'} {'loss': '2.53', 'grad_norm': '0.9605', 'learning_rate': '0.0003021', 'epoch': '1.212'} {'loss': '2.53', 'grad_norm': '0.9029', 'learning_rate': '0.0003007', 'epoch': '1.221'} {'loss': '2.529', 'grad_norm': '0.9056', 'learning_rate': '0.0002993', 'epoch': '1.229'} 41%|███████████████▏ | 15000/36624 [17:07<24:39, 14.62it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.68it/s] {'loss': '2.528', 'grad_norm': '0.8955', 'learning_rate': '0.0002979', 'epoch': '1.237'} {'loss': '2.53', 'grad_norm': '0.9041', 'learning_rate': '0.0002965', 'epoch': '1.245'} {'loss': '2.527', 
'grad_norm': '0.9242', 'learning_rate': '0.0002952', 'epoch': '1.253'} {'loss': '2.525', 'grad_norm': '0.9313', 'learning_rate': '0.0002938', 'epoch': '1.261'} {'loss': '2.525', 'grad_norm': '0.9721', 'learning_rate': '0.0002924', 'epoch': '1.27'} 42%|███████████████▋ | 15500/36624 [17:41<23:50, 14.77it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 195.36it/s] {'loss': '2.522', 'grad_norm': '0.9043', 'learning_rate': '0.000291', 'epoch': '1.278'} {'loss': '2.524', 'grad_norm': '0.9181', 'learning_rate': '0.0002896', 'epoch': '1.286'} {'loss': '2.527', 'grad_norm': '0.9111', 'learning_rate': '0.0002882', 'epoch': '1.294'} {'loss': '2.523', 'grad_norm': '0.9105', 'learning_rate': '0.0002869', 'epoch': '1.302'} {'loss': '2.526', 'grad_norm': '1.005', 'learning_rate': '0.0002855', 'epoch': '1.311'} 44%|████████████████▏ | 16000/36624 [18:15<23:29, 14.63it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.30it/s] {'loss': '2.526', 'grad_norm': '0.9184', 'learning_rate': '0.0002841', 'epoch': '1.319'} {'loss': '2.52', 'grad_norm': '0.8872', 'learning_rate': '0.0002827', 'epoch': '1.327'} {'loss': '2.519', 'grad_norm': '0.9441', 'learning_rate': '0.0002813', 'epoch': '1.335'} {'loss': '2.525', 'grad_norm': '0.9462', 'learning_rate': '0.0002799', 'epoch': '1.343'} {'loss': '2.525', 'grad_norm': '0.9307', 'learning_rate': '0.0002786', 'epoch': '1.352'} 45%|████████████████▋ | 16500/36624 [18:49<23:00, 14.58it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.49it/s] {'loss': '2.519', 'grad_norm': '0.9708', 'learning_rate': '0.0002772', 'epoch': '1.36'} {'loss': '2.522', 'grad_norm': '0.9035', 'learning_rate': '0.0002758', 'epoch': '1.368'} {'loss': '2.518', 'grad_norm': '0.9394', 'learning_rate': '0.0002744', 'epoch': '1.376'} {'loss': '2.521', 'grad_norm': '0.9519', 'learning_rate': '0.000273', 'epoch': '1.384'} {'loss': '2.518', 'grad_norm': '0.915', 'learning_rate': '0.0002716', 'epoch': 
'1.393'} 46%|█████████████████▏ | 17000/36624 [19:23<22:15, 14.69it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.87it/s] {'loss': '2.517', 'grad_norm': '0.9166', 'learning_rate': '0.0002702', 'epoch': '1.401'} {'loss': '2.513', 'grad_norm': '0.9377', 'learning_rate': '0.0002689', 'epoch': '1.409'} {'loss': '2.516', 'grad_norm': '0.9178', 'learning_rate': '0.0002675', 'epoch': '1.417'} {'loss': '2.519', 'grad_norm': '0.9151', 'learning_rate': '0.0002661', 'epoch': '1.425'} {'loss': '2.515', 'grad_norm': '0.9612', 'learning_rate': '0.0002647', 'epoch': '1.434'} 48%|█████████████████▋ | 17500/36624 [19:58<21:56, 14.53it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.02it/s] {'loss': '2.519', 'grad_norm': '0.9229', 'learning_rate': '0.0002633', 'epoch': '1.442'} {'loss': '2.518', 'grad_norm': '0.9195', 'learning_rate': '0.0002619', 'epoch': '1.45'} {'loss': '2.514', 'grad_norm': '0.9046', 'learning_rate': '0.0002606', 'epoch': '1.458'} {'loss': '2.52', 'grad_norm': '0.9383', 'learning_rate': '0.0002592', 'epoch': '1.466'} {'loss': '2.516', 'grad_norm': '0.9361', 'learning_rate': '0.0002578', 'epoch': '1.474'} 49%|██████████████████▏ | 18000/36624 [20:32<21:18, 14.57it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.81it/s] {'loss': '2.509', 'grad_norm': '0.9623', 'learning_rate': '0.0002564', 'epoch': '1.483'} {'loss': '2.511', 'grad_norm': '0.9627', 'learning_rate': '0.000255', 'epoch': '1.491'} {'loss': '2.516', 'grad_norm': '0.9481', 'learning_rate': '0.0002536', 'epoch': '1.499'} {'loss': '2.516', 'grad_norm': '0.9699', 'learning_rate': '0.0002523', 'epoch': '1.507'} {'loss': '2.514', 'grad_norm': '0.9232', 'learning_rate': '0.0002509', 'epoch': '1.515'} 51%|██████████████████▋ | 18500/36624 [21:06<20:39, 14.63it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.07it/s] {'loss': '2.508', 'grad_norm': '0.8967', 'learning_rate': '0.0002495', 'epoch': 
'1.524'} {'loss': '2.51', 'grad_norm': '0.9512', 'learning_rate': '0.0002481', 'epoch': '1.532'} {'loss': '2.511', 'grad_norm': '0.9096', 'learning_rate': '0.0002467', 'epoch': '1.54'} {'loss': '2.509', 'grad_norm': '0.9213', 'learning_rate': '0.0002453', 'epoch': '1.548'} {'loss': '2.513', 'grad_norm': '0.9172', 'learning_rate': '0.000244', 'epoch': '1.556'} 52%|███████████████████▏ | 19000/36624 [21:40<20:00, 14.69it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.06it/s] {'loss': '2.51', 'grad_norm': '0.9369', 'learning_rate': '0.0002426', 'epoch': '1.565'} {'loss': '2.512', 'grad_norm': '0.9091', 'learning_rate': '0.0002412', 'epoch': '1.573'} {'loss': '2.512', 'grad_norm': '0.8935', 'learning_rate': '0.0002398', 'epoch': '1.581'} {'loss': '2.51', 'grad_norm': '0.9206', 'learning_rate': '0.0002384', 'epoch': '1.589'} {'loss': '2.507', 'grad_norm': '0.9272', 'learning_rate': '0.000237', 'epoch': '1.597'} 53%|███████████████████▋ | 19500/36624 [22:15<19:28, 14.66it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.45it/s] {'loss': '2.51', 'grad_norm': '0.9499', 'learning_rate': '0.0002356', 'epoch': '1.606'} {'loss': '2.513', 'grad_norm': '0.9095', 'learning_rate': '0.0002343', 'epoch': '1.614'} {'loss': '2.508', 'grad_norm': '0.9086', 'learning_rate': '0.0002329', 'epoch': '1.622'} {'loss': '2.507', 'grad_norm': '0.9389', 'learning_rate': '0.0002315', 'epoch': '1.63'} {'loss': '2.514', 'grad_norm': '0.8963', 'learning_rate': '0.0002301', 'epoch': '1.638'} 55%|████████████████████▏ | 20000/36624 [22:49<18:57, 14.61it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.24it/s] {'loss': '2.506', 'grad_norm': '0.978', 'learning_rate': '0.0002287', 'epoch': '1.646'} {'loss': '2.507', 'grad_norm': '0.9966', 'learning_rate': '0.0002273', 'epoch': '1.655'} {'loss': '2.507', 'grad_norm': '0.9281', 'learning_rate': '0.000226', 'epoch': '1.663'} {'loss': '2.51', 'grad_norm': '0.9063', 
'learning_rate': '0.0002246', 'epoch': '1.671'} {'loss': '2.509', 'grad_norm': '0.9708', 'learning_rate': '0.0002232', 'epoch': '1.679'} 56%|████████████████████▋ | 20500/36624 [23:23<18:18, 14.67it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.16it/s] {'loss': '2.505', 'grad_norm': '0.946', 'learning_rate': '0.0002218', 'epoch': '1.687'} {'loss': '2.507', 'grad_norm': '0.9184', 'learning_rate': '0.0002204', 'epoch': '1.696'} {'loss': '2.506', 'grad_norm': '0.9702', 'learning_rate': '0.000219', 'epoch': '1.704'} {'loss': '2.499', 'grad_norm': '0.9535', 'learning_rate': '0.0002177', 'epoch': '1.712'} {'loss': '2.502', 'grad_norm': '0.9017', 'learning_rate': '0.0002163', 'epoch': '1.72'} 57%|█████████████████████▏ | 21000/36624 [23:57<18:03, 14.42it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 190.95it/s] {'loss': '2.509', 'grad_norm': '0.9587', 'learning_rate': '0.0002149', 'epoch': '1.728'} {'loss': '2.504', 'grad_norm': '0.9648', 'learning_rate': '0.0002135', 'epoch': '1.737'} {'loss': '2.503', 'grad_norm': '0.953', 'learning_rate': '0.0002121', 'epoch': '1.745'} {'loss': '2.5', 'grad_norm': '0.9445', 'learning_rate': '0.0002107', 'epoch': '1.753'} {'loss': '2.501', 'grad_norm': '0.9414', 'learning_rate': '0.0002093', 'epoch': '1.761'} 59%|█████████████████████▋ | 21500/36624 [24:32<17:10, 14.67it/s] Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 52.34it/s] {'loss': '2.503', 'grad_norm': '0.9309', 'learning_rate': '0.000208', 'epoch': '1.769'} {'loss': '2.502', 'grad_norm': '0.9301', 'learning_rate': '0.0002066', 'epoch': '1.778'} {'loss': '2.504', 'grad_norm': '0.895', 'learning_rate': '0.0002052', 'epoch': '1.786'} {'loss': '2.502', 'grad_norm': '0.9428', 'learning_rate': '0.0002038', 'epoch': '1.794'} {'loss': '2.501', 'grad_norm': '0.9539', 'learning_rate': '0.0002024', 'epoch': '1.802'} 60%|██████████████████████▏ | 22000/36624 [25:06<16:28, 14.79it/s] Writing model shards: 
100%|██████████████████████| 1/1 [00:00<00:00, 203.41it/s] {'loss': '2.5', 'grad_norm': '0.9179', 'learning_rate': '0.000201', 'epoch': '1.81'} {'loss': '2.501', 'grad_norm': '0.9195', 'learning_rate': '0.0001997', 'epoch': '1.819'} {'loss': '2.499', 'grad_norm': '1.047', 'learning_rate': '0.0001983', 'epoch': '1.827'} {'loss': '2.499', 'grad_norm': '0.931', 'learning_rate': '0.0001969', 'epoch': '1.835'} {'loss': '2.499', 'grad_norm': '0.9269', 'learning_rate': '0.0001955', 'epoch': '1.843'} 61%|██████████████████████▋ | 22500/36624 [25:40<16:03, 14.65it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.41it/s] {'loss': '2.501', 'grad_norm': '0.939', 'learning_rate': '0.0001941', 'epoch': '1.851'} {'loss': '2.495', 'grad_norm': '0.9119', 'learning_rate': '0.0001927', 'epoch': '1.859'} {'loss': '2.499', 'grad_norm': '0.9755', 'learning_rate': '0.0001914', 'epoch': '1.868'} {'loss': '2.497', 'grad_norm': '0.9444', 'learning_rate': '0.00019', 'epoch': '1.876'} {'loss': '2.496', 'grad_norm': '0.9551', 'learning_rate': '0.0001886', 'epoch': '1.884'} 63%|███████████████████████▏ | 23000/36624 [26:15<15:34, 14.58it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.11it/s] {'loss': '2.5', 'grad_norm': '0.9524', 'learning_rate': '0.0001872', 'epoch': '1.892'} {'loss': '2.502', 'grad_norm': '0.9583', 'learning_rate': '0.0001858', 'epoch': '1.9'} {'loss': '2.497', 'grad_norm': '0.9206', 'learning_rate': '0.0001844', 'epoch': '1.909'} {'loss': '2.495', 'grad_norm': '0.9133', 'learning_rate': '0.0001831', 'epoch': '1.917'} {'loss': '2.491', 'grad_norm': '0.9201', 'learning_rate': '0.0001817', 'epoch': '1.925'} 64%|███████████████████████▋ | 23500/36624 [26:49<14:51, 14.72it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 175.74it/s] {'loss': '2.499', 'grad_norm': '0.9536', 'learning_rate': '0.0001803', 'epoch': '1.933'} {'loss': '2.497', 'grad_norm': '0.9332', 'learning_rate': '0.0001789', 'epoch': 
'1.941'} {'loss': '2.491', 'grad_norm': '0.9358', 'learning_rate': '0.0001775', 'epoch': '1.95'} {'loss': '2.493', 'grad_norm': '0.9568', 'learning_rate': '0.0001761', 'epoch': '1.958'} {'loss': '2.494', 'grad_norm': '0.9585', 'learning_rate': '0.0001747', 'epoch': '1.966'} 66%|████████████████████████▏ | 24000/36624 [27:23<14:14, 14.78it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.42it/s] {'loss': '2.496', 'grad_norm': '0.9171', 'learning_rate': '0.0001734', 'epoch': '1.974'} {'loss': '2.495', 'grad_norm': '0.9789', 'learning_rate': '0.000172', 'epoch': '1.982'} {'loss': '2.493', 'grad_norm': '0.9548', 'learning_rate': '0.0001706', 'epoch': '1.991'} {'loss': '2.495', 'grad_norm': '1.021', 'learning_rate': '0.0001692', 'epoch': '1.999'} {'loss': '2.486', 'grad_norm': '0.9158', 'learning_rate': '0.0001678', 'epoch': '2.007'} 67%|████████████████████████▊ | 24500/36624 [27:58<14:15, 14.17it/s] Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 42.99it/s] {'loss': '2.494', 'grad_norm': '0.9763', 'learning_rate': '0.0001664', 'epoch': '2.015'} {'loss': '2.495', 'grad_norm': '0.9613', 'learning_rate': '0.0001651', 'epoch': '2.023'} {'loss': '2.489', 'grad_norm': '0.9664', 'learning_rate': '0.0001637', 'epoch': '2.031'} {'loss': '2.498', 'grad_norm': '1.016', 'learning_rate': '0.0001623', 'epoch': '2.04'} {'loss': '2.488', 'grad_norm': '0.9416', 'learning_rate': '0.0001609', 'epoch': '2.048'} 68%|█████████████████████████▎ | 25000/36624 [28:33<13:16, 14.60it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.22it/s] {'loss': '2.489', 'grad_norm': '0.9494', 'learning_rate': '0.0001595', 'epoch': '2.056'} {'loss': '2.486', 'grad_norm': '0.9252', 'learning_rate': '0.0001581', 'epoch': '2.064'} {'loss': '2.492', 'grad_norm': '0.9568', 'learning_rate': '0.0001568', 'epoch': '2.072'} {'loss': '2.492', 'grad_norm': '0.9466', 'learning_rate': '0.0001554', 'epoch': '2.081'} {'loss': '2.486', 'grad_norm': 
'0.9349', 'learning_rate': '0.000154', 'epoch': '2.089'} 70%|█████████████████████████▊ | 25500/36624 [29:07<12:43, 14.57it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 167.93it/s] {'loss': '2.489', 'grad_norm': '0.9689', 'learning_rate': '0.0001526', 'epoch': '2.097'} {'loss': '2.488', 'grad_norm': '0.9909', 'learning_rate': '0.0001512', 'epoch': '2.105'} {'loss': '2.49', 'grad_norm': '0.9703', 'learning_rate': '0.0001498', 'epoch': '2.113'} {'loss': '2.487', 'grad_norm': '1.02', 'learning_rate': '0.0001484', 'epoch': '2.122'} {'loss': '2.485', 'grad_norm': '1.005', 'learning_rate': '0.0001471', 'epoch': '2.13'} 71%|██████████████████████████▎ | 26000/36624 [29:41<12:02, 14.70it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.60it/s] {'loss': '2.486', 'grad_norm': '0.9853', 'learning_rate': '0.0001457', 'epoch': '2.138'} {'loss': '2.485', 'grad_norm': '0.9916', 'learning_rate': '0.0001443', 'epoch': '2.146'} {'loss': '2.488', 'grad_norm': '0.9691', 'learning_rate': '0.0001429', 'epoch': '2.154'} {'loss': '2.488', 'grad_norm': '0.9773', 'learning_rate': '0.0001415', 'epoch': '2.163'} {'loss': '2.482', 'grad_norm': '0.953', 'learning_rate': '0.0001401', 'epoch': '2.171'} 72%|██████████████████████████▊ | 26500/36624 [30:15<11:31, 14.63it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.70it/s] {'loss': '2.487', 'grad_norm': '0.9432', 'learning_rate': '0.0001388', 'epoch': '2.179'} {'loss': '2.487', 'grad_norm': '0.962', 'learning_rate': '0.0001374', 'epoch': '2.187'} {'loss': '2.488', 'grad_norm': '0.9646', 'learning_rate': '0.000136', 'epoch': '2.195'} {'loss': '2.483', 'grad_norm': '0.9822', 'learning_rate': '0.0001346', 'epoch': '2.203'} {'loss': '2.484', 'grad_norm': '0.9462', 'learning_rate': '0.0001332', 'epoch': '2.212'} 74%|███████████████████████████▎ | 27000/36624 [30:50<11:02, 14.53it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 148.78it/s] 
{'loss': '2.485', 'grad_norm': '0.983', 'learning_rate': '0.0001318', 'epoch': '2.22'} {'loss': '2.483', 'grad_norm': '0.9827', 'learning_rate': '0.0001305', 'epoch': '2.228'} {'loss': '2.486', 'grad_norm': '0.987', 'learning_rate': '0.0001291', 'epoch': '2.236'} {'loss': '2.486', 'grad_norm': '1.003', 'learning_rate': '0.0001277', 'epoch': '2.244'} {'loss': '2.487', 'grad_norm': '0.9763', 'learning_rate': '0.0001263', 'epoch': '2.253'} 75%|███████████████████████████▊ | 27500/36624 [31:24<10:20, 14.70it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.76it/s] {'loss': '2.484', 'grad_norm': '0.9718', 'learning_rate': '0.0001249', 'epoch': '2.261'} {'loss': '2.482', 'grad_norm': '0.964', 'learning_rate': '0.0001235', 'epoch': '2.269'} {'loss': '2.486', 'grad_norm': '0.9918', 'learning_rate': '0.0001221', 'epoch': '2.277'} {'loss': '2.482', 'grad_norm': '0.9895', 'learning_rate': '0.0001208', 'epoch': '2.285'} {'loss': '2.483', 'grad_norm': '0.9978', 'learning_rate': '0.0001194', 'epoch': '2.294'} 76%|████████████████████████████▎ | 28000/36624 [31:58<10:07, 14.20it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.44it/s] {'loss': '2.484', 'grad_norm': '1.008', 'learning_rate': '0.000118', 'epoch': '2.302'} {'loss': '2.484', 'grad_norm': '1.029', 'learning_rate': '0.0001166', 'epoch': '2.31'} {'loss': '2.482', 'grad_norm': '0.9828', 'learning_rate': '0.0001152', 'epoch': '2.318'} {'loss': '2.484', 'grad_norm': '0.9815', 'learning_rate': '0.0001138', 'epoch': '2.326'} {'loss': '2.484', 'grad_norm': '0.97', 'learning_rate': '0.0001125', 'epoch': '2.335'} 78%|████████████████████████████▊ | 28500/36624 [32:32<09:16, 14.61it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.61it/s] {'loss': '2.48', 'grad_norm': '1.007', 'learning_rate': '0.0001111', 'epoch': '2.343'} {'loss': '2.477', 'grad_norm': '1.018', 'learning_rate': '0.0001097', 'epoch': '2.351'} {'loss': '2.477', 'grad_norm': '0.945', 
'learning_rate': '0.0001083', 'epoch': '2.359'} {'loss': '2.479', 'grad_norm': '1.001', 'learning_rate': '0.0001069', 'epoch': '2.367'} {'loss': '2.48', 'grad_norm': '0.9743', 'learning_rate': '0.0001055', 'epoch': '2.376'} 79%|█████████████████████████████▎ | 29000/36624 [33:07<08:43, 14.57it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.78it/s] {'loss': '2.48', 'grad_norm': '1.005', 'learning_rate': '0.0001042', 'epoch': '2.384'} {'loss': '2.482', 'grad_norm': '1.016', 'learning_rate': '0.0001028', 'epoch': '2.392'} {'loss': '2.473', 'grad_norm': '0.9854', 'learning_rate': '0.0001014', 'epoch': '2.4'} {'loss': '2.476', 'grad_norm': '0.9408', 'learning_rate': '0.0001', 'epoch': '2.408'} {'loss': '2.475', 'grad_norm': '0.9968', 'learning_rate': '9.862e-05', 'epoch': '2.416'} 81%|█████████████████████████████▊ | 29500/36624 [33:41<08:09, 14.55it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.43it/s] {'loss': '2.476', 'grad_norm': '1.016', 'learning_rate': '9.723e-05', 'epoch': '2.425'} {'loss': '2.48', 'grad_norm': '0.9962', 'learning_rate': '9.585e-05', 'epoch': '2.433'} {'loss': '2.478', 'grad_norm': '1.032', 'learning_rate': '9.447e-05', 'epoch': '2.441'} {'loss': '2.477', 'grad_norm': '1.001', 'learning_rate': '9.308e-05', 'epoch': '2.449'} {'loss': '2.477', 'grad_norm': '0.9866', 'learning_rate': '9.17e-05', 'epoch': '2.457'} 82%|██████████████████████████████▎ | 30000/36624 [34:15<07:36, 14.53it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.80it/s] {'loss': '2.476', 'grad_norm': '1.028', 'learning_rate': '9.031e-05', 'epoch': '2.466'} {'loss': '2.475', 'grad_norm': '0.9801', 'learning_rate': '8.893e-05', 'epoch': '2.474'} {'loss': '2.48', 'grad_norm': '0.9972', 'learning_rate': '8.755e-05', 'epoch': '2.482'} {'loss': '2.479', 'grad_norm': '0.9972', 'learning_rate': '8.616e-05', 'epoch': '2.49'} {'loss': '2.479', 'grad_norm': '1.076', 'learning_rate': '8.478e-05', 'epoch': 
'2.498'} 83%|██████████████████████████████▊ | 30500/36624 [34:50<07:03, 14.47it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 156.83it/s] {'loss': '2.472', 'grad_norm': '0.9888', 'learning_rate': '8.339e-05', 'epoch': '2.507'} {'loss': '2.477', 'grad_norm': '1.045', 'learning_rate': '8.201e-05', 'epoch': '2.515'} {'loss': '2.475', 'grad_norm': '1.039', 'learning_rate': '8.063e-05', 'epoch': '2.523'} {'loss': '2.476', 'grad_norm': '1.038', 'learning_rate': '7.924e-05', 'epoch': '2.531'} {'loss': '2.474', 'grad_norm': '1.013', 'learning_rate': '7.786e-05', 'epoch': '2.539'} 85%|███████████████████████████████▎ | 31000/36624 [35:25<06:26, 14.55it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 173.79it/s] {'loss': '2.478', 'grad_norm': '0.9911', 'learning_rate': '7.647e-05', 'epoch': '2.548'} {'loss': '2.475', 'grad_norm': '1.003', 'learning_rate': '7.509e-05', 'epoch': '2.556'} {'loss': '2.476', 'grad_norm': '0.9986', 'learning_rate': '7.37e-05', 'epoch': '2.564'} {'loss': '2.475', 'grad_norm': '1.034', 'learning_rate': '7.232e-05', 'epoch': '2.572'} {'loss': '2.476', 'grad_norm': '0.9733', 'learning_rate': '7.094e-05', 'epoch': '2.58'} 86%|███████████████████████████████▊ | 31500/36624 [35:59<05:53, 14.51it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.57it/s] {'loss': '2.468', 'grad_norm': '1.055', 'learning_rate': '6.955e-05', 'epoch': '2.588'} {'loss': '2.475', 'grad_norm': '1.026', 'learning_rate': '6.817e-05', 'epoch': '2.597'} {'loss': '2.476', 'grad_norm': '1.029', 'learning_rate': '6.678e-05', 'epoch': '2.605'} {'loss': '2.471', 'grad_norm': '1.032', 'learning_rate': '6.54e-05', 'epoch': '2.613'} {'loss': '2.473', 'grad_norm': '1.007', 'learning_rate': '6.402e-05', 'epoch': '2.621'} 87%|████████████████████████████████▎ | 32000/36624 [36:34<05:15, 14.66it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.78it/s] {'loss': '2.47', 'grad_norm': '1.028', 
'learning_rate': '6.263e-05', 'epoch': '2.629'} {'loss': '2.47', 'grad_norm': '0.9969', 'learning_rate': '6.125e-05', 'epoch': '2.638'} {'loss': '2.473', 'grad_norm': '1.037', 'learning_rate': '5.986e-05', 'epoch': '2.646'} {'loss': '2.47', 'grad_norm': '1', 'learning_rate': '5.848e-05', 'epoch': '2.654'} {'loss': '2.468', 'grad_norm': '1.034', 'learning_rate': '5.71e-05', 'epoch': '2.662'} 89%|████████████████████████████████▊ | 32500/36624 [37:08<04:47, 14.34it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.32it/s] {'loss': '2.471', 'grad_norm': '1.073', 'learning_rate': '5.571e-05', 'epoch': '2.67'} {'loss': '2.47', 'grad_norm': '1.003', 'learning_rate': '5.433e-05', 'epoch': '2.679'} {'loss': '2.472', 'grad_norm': '1.033', 'learning_rate': '5.294e-05', 'epoch': '2.687'} {'loss': '2.469', 'grad_norm': '1.076', 'learning_rate': '5.156e-05', 'epoch': '2.695'} {'loss': '2.469', 'grad_norm': '1.061', 'learning_rate': '5.017e-05', 'epoch': '2.703'} 90%|█████████████████████████████████▎ | 33000/36624 [37:43<04:11, 14.39it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.79it/s] {'loss': '2.468', 'grad_norm': '1.043', 'learning_rate': '4.879e-05', 'epoch': '2.711'} {'loss': '2.474', 'grad_norm': '1.115', 'learning_rate': '4.741e-05', 'epoch': '2.72'} {'loss': '2.469', 'grad_norm': '1.028', 'learning_rate': '4.602e-05', 'epoch': '2.728'} {'loss': '2.468', 'grad_norm': '1.017', 'learning_rate': '4.464e-05', 'epoch': '2.736'} {'loss': '2.471', 'grad_norm': '0.9948', 'learning_rate': '4.325e-05', 'epoch': '2.744'} 91%|█████████████████████████████████▊ | 33500/36624 [38:18<03:37, 14.39it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 155.26it/s] {'loss': '2.466', 'grad_norm': '1.027', 'learning_rate': '4.187e-05', 'epoch': '2.752'} {'loss': '2.467', 'grad_norm': '1.023', 'learning_rate': '4.049e-05', 'epoch': '2.761'} {'loss': '2.47', 'grad_norm': '1.021', 'learning_rate': '3.91e-05', 
'epoch': '2.769'} {'loss': '2.468', 'grad_norm': '0.9946', 'learning_rate': '3.772e-05', 'epoch': '2.777'} {'loss': '2.464', 'grad_norm': '1.031', 'learning_rate': '3.633e-05', 'epoch': '2.785'} 93%|██████████████████████████████████▎ | 34000/36624 [38:52<03:00, 14.52it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.19it/s] {'loss': '2.467', 'grad_norm': '1.05', 'learning_rate': '3.495e-05', 'epoch': '2.793'} {'loss': '2.469', 'grad_norm': '1.043', 'learning_rate': '3.356e-05', 'epoch': '2.801'} {'loss': '2.468', 'grad_norm': '0.9955', 'learning_rate': '3.218e-05', 'epoch': '2.81'} {'loss': '2.461', 'grad_norm': '0.9882', 'learning_rate': '3.08e-05', 'epoch': '2.818'} {'loss': '2.463', 'grad_norm': '1.023', 'learning_rate': '2.941e-05', 'epoch': '2.826'} 94%|██████████████████████████████████▊ | 34500/36624 [39:27<02:28, 14.27it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.85it/s] {'loss': '2.464', 'grad_norm': '1.062', 'learning_rate': '2.803e-05', 'epoch': '2.834'} {'loss': '2.465', 'grad_norm': '1.065', 'learning_rate': '2.664e-05', 'epoch': '2.842'} {'loss': '2.468', 'grad_norm': '1.01', 'learning_rate': '2.526e-05', 'epoch': '2.851'} {'loss': '2.463', 'grad_norm': '0.9994', 'learning_rate': '2.388e-05', 'epoch': '2.859'} {'loss': '2.464', 'grad_norm': '1.032', 'learning_rate': '2.249e-05', 'epoch': '2.867'} 96%|███████████████████████████████████▎ | 35000/36624 [40:02<01:53, 14.32it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.93it/s] {'loss': '2.467', 'grad_norm': '1.233', 'learning_rate': '2.111e-05', 'epoch': '2.875'} {'loss': '2.466', 'grad_norm': '1.069', 'learning_rate': '1.972e-05', 'epoch': '2.883'} {'loss': '2.469', 'grad_norm': '1.015', 'learning_rate': '1.834e-05', 'epoch': '2.892'} {'loss': '2.466', 'grad_norm': '1.033', 'learning_rate': '1.696e-05', 'epoch': '2.9'} {'loss': '2.463', 'grad_norm': '1.04', 'learning_rate': '1.557e-05', 'epoch': '2.908'} 
97%|███████████████████████████████████▊ | 35500/36624 [40:37<01:17, 14.42it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.19it/s] {'loss': '2.469', 'grad_norm': '1.007', 'learning_rate': '1.419e-05', 'epoch': '2.916'} {'loss': '2.468', 'grad_norm': '1.025', 'learning_rate': '1.28e-05', 'epoch': '2.924'} {'loss': '2.465', 'grad_norm': '1.033', 'learning_rate': '1.142e-05', 'epoch': '2.933'} {'loss': '2.464', 'grad_norm': '1.045', 'learning_rate': '1.003e-05', 'epoch': '2.941'} {'loss': '2.464', 'grad_norm': '1.008', 'learning_rate': '8.651e-06', 'epoch': '2.949'} 98%|████████████████████████████████████▎| 36000/36624 [41:11<00:43, 14.40it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 163.65it/s] {'loss': '2.463', 'grad_norm': '1.035', 'learning_rate': '7.267e-06', 'epoch': '2.957'} {'loss': '2.46', 'grad_norm': '1.018', 'learning_rate': '5.883e-06', 'epoch': '2.965'} {'loss': '2.463', 'grad_norm': '1.01', 'learning_rate': '4.498e-06', 'epoch': '2.973'} {'loss': '2.463', 'grad_norm': '1.014', 'learning_rate': '3.114e-06', 'epoch': '2.982'} {'loss': '2.459', 'grad_norm': '0.9757', 'learning_rate': '1.73e-06', 'epoch': '2.99'} 100%|████████████████████████████████████▊| 36500/36624 [41:46<00:08, 14.23it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.29it/s] {'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'} 100%|████████████████████████████████████▉| 36623/36624 [41:55<00:00, 11.70it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.55it/s] {'train_runtime': '2516', 'train_samples_per_second': '1863', 'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'} 100%|█████████████████████████████████████| 36624/36624 [41:56<00:00, 14.56it/s] Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 160.99it/s] [*] Training finished.
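
Note on reading this log: the metric records above are flat Python dict literals with string values, mixed in with progress bars and checkpoint messages. The following is a minimal sketch, not part of the original run, of one way to pull those records out and cross-check the final summary. The helper name `parse_records` and the `sample` string are hypothetical; the sample reuses two records copied from the log, and `total_steps` is read off the progress bar (36624).

```python
import ast
import re

# Matches one flat dict literal such as {'loss': '2.464', ...}.
# The negated class also spans newlines, so records wrapped across lines still match.
RECORD_RE = re.compile(r"\{[^{}]*\}")


def parse_records(log_text):
    """Return every metrics dict found in the log, with values cast to float."""
    records = []
    for match in RECORD_RE.finditer(log_text):
        raw = ast.literal_eval(match.group(0))
        records.append({key: float(value) for key, value in raw.items()})
    return records


# Two records copied from the end of the log, with a progress bar in between.
sample = (
    "{'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'} "
    "100%|####| 36623/36624 [41:55<00:00, 11.70it/s] "
    "{'train_runtime': '2516', 'train_samples_per_second': '1863', "
    "'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'}"
)

records = parse_records(sample)
step_logs = [r for r in records if "loss" in r]            # periodic training logs
summary = next(r for r in records if "train_runtime" in r)  # final summary record

total_steps = 36624  # taken from the progress bar in the log
steps_per_second = total_steps / summary["train_runtime"]

print(f"{len(step_logs)} step record(s), last loss {step_logs[-1]['loss']}")
print(f"recomputed steps/s: {steps_per_second:.2f}")
```

The recomputed rate should agree with the logged `train_steps_per_second` up to rounding, since `train_runtime` is itself printed as a rounded value. Running `parse_records` over the full log text would give the loss and learning-rate series for plotting.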