Text Generation
Transformers
Safetensors
English
llama
tiny-model
sub-1M
cpu
small
tiny
quark
1m
text-generation-inference
Instructions to use LH-Tech-AI/Quark-0.5M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use LH-Tech-AI/Quark-0.5M with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="LH-Tech-AI/Quark-0.5M")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("LH-Tech-AI/Quark-0.5M")
    model = AutoModelForCausalLM.from_pretrained("LH-Tech-AI/Quark-0.5M")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use LH-Tech-AI/Quark-0.5M with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "LH-Tech-AI/Quark-0.5M"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "LH-Tech-AI/Quark-0.5M",
            "prompt": "Once upon a time,",
            "max_tokens": 512,
            "temperature": 0.5
        }'
Use Docker
    docker model run hf.co/LH-Tech-AI/Quark-0.5M
- SGLang
How to use LH-Tech-AI/Quark-0.5M with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "LH-Tech-AI/Quark-0.5M" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "LH-Tech-AI/Quark-0.5M",
            "prompt": "Once upon a time,",
            "max_tokens": 512,
            "temperature": 0.5
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "LH-Tech-AI/Quark-0.5M" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "LH-Tech-AI/Quark-0.5M",
            "prompt": "Once upon a time,",
            "max_tokens": 512,
            "temperature": 0.5
        }'
- Docker Model Runner
How to use LH-Tech-AI/Quark-0.5M with Docker Model Runner:
    docker model run hf.co/LH-Tech-AI/Quark-0.5M
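The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming a vLLM or SGLang server is already running locally as shown above and that it returns the usual OpenAI-style completions response with a `choices` list. The helper names `build_completion_request` and `complete` are hypothetical and not part of any library.

```python
import json
import urllib.request

MODEL_ID = "LH-Tech-AI/Quark-0.5M"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request mirroring the curl examples (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Assumes a server is listening at base_url and that the response follows
    the OpenAI completions format: {"choices": [{"text": ...}, ...]}.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the vLLM server from above running, `complete("Once upon a time,")` would send the same payload as the curl call. For the SGLang server, pass `base_url="http://localhost:30000"`.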
| [*] Loading libraries... | |
| [*] Loading tokenizer... | |
| [*] Gathering 100 million tokens by streaming dataset... | |
| Resolving data files: 100%|██████████████| 2410/2410 [00:00<00:00, 30853.46it/s] | |
| [*] Gathering tokens: 100%|██| 400000000/400000000 [13:58<00:00, 477048.96tok/s] | |
| [+] Collected 400,000,000 tokens → 1,562,500 chunks. | |
| [*] Setting up model... | |
| [*] Model parameters: 465,504 | |
| [*] Defining training arguments... | |
| [*] Starting training... | |
| {'loss': '5.986', 'grad_norm': '0.5017', 'learning_rate': '9.9e-05', 'epoch': '0.008192'} | |
| {'loss': '5.403', 'grad_norm': '0.394', 'learning_rate': '0.000199', 'epoch': '0.01638'} | |
| {'loss': '4.75', 'grad_norm': '0.9517', 'learning_rate': '0.000299', 'epoch': '0.02458'} | |
| {'loss': '4.192', 'grad_norm': '1.073', 'learning_rate': '0.000399', 'epoch': '0.03277'} | |
| {'loss': '3.702', 'grad_norm': '1.364', 'learning_rate': '0.000499', 'epoch': '0.04096'} | |
| 1%|█ | 500/36624 [00:34<40:21, 14.92it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 126.22it/s] | |
| {'loss': '3.378', 'grad_norm': '1.906', 'learning_rate': '0.0004986', 'epoch': '0.04915'} | |
| {'loss': '3.195', 'grad_norm': '1.332', 'learning_rate': '0.0004972', 'epoch': '0.05734'} | |
| {'loss': '3.085', 'grad_norm': '1.36', 'learning_rate': '0.0004959', 'epoch': '0.06553'} | |
| {'loss': '3.011', 'grad_norm': '1.354', 'learning_rate': '0.0004945', 'epoch': '0.07373'} | |
| {'loss': '2.955', 'grad_norm': '1.423', 'learning_rate': '0.0004931', 'epoch': '0.08192'} | |
| 3%|█ | 1000/36624 [01:08<40:59, 14.48it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 185.00it/s] | |
| {'loss': '2.914', 'grad_norm': '1.194', 'learning_rate': '0.0004917', 'epoch': '0.09011'} | |
| {'loss': '2.887', 'grad_norm': '1.145', 'learning_rate': '0.0004903', 'epoch': '0.0983'} | |
| {'loss': '2.861', 'grad_norm': '1.353', 'learning_rate': '0.0004889', 'epoch': '0.1065'} | |
| {'loss': '2.833', 'grad_norm': '1.226', 'learning_rate': '0.0004876', 'epoch': '0.1147'} | |
| {'loss': '2.824', 'grad_norm': '1.226', 'learning_rate': '0.0004862', 'epoch': '0.1229'} | |
| 4%|██ | 1500/36624 [01:42<40:32, 14.44it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.87it/s] | |
| {'loss': '2.806', 'grad_norm': '1.204', 'learning_rate': '0.0004848', 'epoch': '0.1311'} | |
| {'loss': '2.786', 'grad_norm': '1.139', 'learning_rate': '0.0004834', 'epoch': '0.1393'} | |
| {'loss': '2.777', 'grad_norm': '1.099', 'learning_rate': '0.000482', 'epoch': '0.1475'} | |
| {'loss': '2.765', 'grad_norm': '1.127', 'learning_rate': '0.0004806', 'epoch': '0.1556'} | |
| {'loss': '2.754', 'grad_norm': '1.186', 'learning_rate': '0.0004793', 'epoch': '0.1638'} | |
| 5%|██ | 2000/36624 [02:16<39:37, 14.56it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.92it/s] | |
| {'loss': '2.749', 'grad_norm': '1.068', 'learning_rate': '0.0004779', 'epoch': '0.172'} | |
| {'loss': '2.732', 'grad_norm': '1.086', 'learning_rate': '0.0004765', 'epoch': '0.1802'} | |
| {'loss': '2.73', 'grad_norm': '1.105', 'learning_rate': '0.0004751', 'epoch': '0.1884'} | |
| {'loss': '2.721', 'grad_norm': '1.213', 'learning_rate': '0.0004737', 'epoch': '0.1966'} | |
| {'loss': '2.717', 'grad_norm': '1.168', 'learning_rate': '0.0004723', 'epoch': '0.2048'} | |
| 7%|███ | 2500/36624 [02:50<39:00, 14.58it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.91it/s] | |
| {'loss': '2.708', 'grad_norm': '1.081', 'learning_rate': '0.0004709', 'epoch': '0.213'} | |
| {'loss': '2.705', 'grad_norm': '1.083', 'learning_rate': '0.0004696', 'epoch': '0.2212'} | |
| {'loss': '2.697', 'grad_norm': '1.079', 'learning_rate': '0.0004682', 'epoch': '0.2294'} | |
| {'loss': '2.692', 'grad_norm': '1.123', 'learning_rate': '0.0004668', 'epoch': '0.2376'} | |
| {'loss': '2.687', 'grad_norm': '1.147', 'learning_rate': '0.0004654', 'epoch': '0.2458'} | |
| 8%|███ | 3000/36624 [03:24<37:58, 14.76it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.12it/s] | |
| {'loss': '2.681', 'grad_norm': '1.052', 'learning_rate': '0.000464', 'epoch': '0.2539'} | |
| {'loss': '2.676', 'grad_norm': '1.099', 'learning_rate': '0.0004626', 'epoch': '0.2621'} | |
| {'loss': '2.674', 'grad_norm': '1.084', 'learning_rate': '0.0004613', 'epoch': '0.2703'} | |
| {'loss': '2.672', 'grad_norm': '1.057', 'learning_rate': '0.0004599', 'epoch': '0.2785'} | |
| {'loss': '2.672', 'grad_norm': '1.103', 'learning_rate': '0.0004585', 'epoch': '0.2867'} | |
| 10%|████ | 3500/36624 [03:59<38:12, 14.45it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.64it/s] | |
| {'loss': '2.661', 'grad_norm': '1.062', 'learning_rate': '0.0004571', 'epoch': '0.2949'} | |
| {'loss': '2.658', 'grad_norm': '1.055', 'learning_rate': '0.0004557', 'epoch': '0.3031'} | |
| {'loss': '2.656', 'grad_norm': '1.06', 'learning_rate': '0.0004543', 'epoch': '0.3113'} | |
| {'loss': '2.653', 'grad_norm': '1.1', 'learning_rate': '0.000453', 'epoch': '0.3195'} | |
| {'loss': '2.651', 'grad_norm': '1.137', 'learning_rate': '0.0004516', 'epoch': '0.3277'} | |
| 11%|█████ | 4000/36624 [04:33<37:14, 14.60it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.63it/s] | |
| {'loss': '2.648', 'grad_norm': '1.009', 'learning_rate': '0.0004502', 'epoch': '0.3359'} | |
| {'loss': '2.639', 'grad_norm': '1', 'learning_rate': '0.0004488', 'epoch': '0.3441'} | |
| {'loss': '2.641', 'grad_norm': '1.044', 'learning_rate': '0.0004474', 'epoch': '0.3522'} | |
| {'loss': '2.641', 'grad_norm': '1.039', 'learning_rate': '0.000446', 'epoch': '0.3604'} | |
| {'loss': '2.637', 'grad_norm': '1.036', 'learning_rate': '0.0004446', 'epoch': '0.3686'} | |
| 12%|█████ | 4500/36624 [05:07<36:26, 14.69it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.36it/s] | |
| {'loss': '2.632', 'grad_norm': '0.9873', 'learning_rate': '0.0004433', 'epoch': '0.3768'} | |
| {'loss': '2.631', 'grad_norm': '1.043', 'learning_rate': '0.0004419', 'epoch': '0.385'} | |
| {'loss': '2.63', 'grad_norm': '1.063', 'learning_rate': '0.0004405', 'epoch': '0.3932'} | |
| {'loss': '2.624', 'grad_norm': '1.026', 'learning_rate': '0.0004391', 'epoch': '0.4014'} | |
| {'loss': '2.624', 'grad_norm': '1.011', 'learning_rate': '0.0004377', 'epoch': '0.4096'} | |
| 14%|██████ | 5000/36624 [05:41<36:09, 14.58it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.30it/s] | |
| {'loss': '2.625', 'grad_norm': '1.08', 'learning_rate': '0.0004363', 'epoch': '0.4178'} | |
| {'loss': '2.621', 'grad_norm': '1.007', 'learning_rate': '0.000435', 'epoch': '0.426'} | |
| {'loss': '2.618', 'grad_norm': '1.025', 'learning_rate': '0.0004336', 'epoch': '0.4342'} | |
| {'loss': '2.616', 'grad_norm': '0.9491', 'learning_rate': '0.0004322', 'epoch': '0.4424'} | |
| {'loss': '2.615', 'grad_norm': '1.072', 'learning_rate': '0.0004308', 'epoch': '0.4505'} | |
| 15%|██████ | 5500/36624 [06:15<35:20, 14.67it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.28it/s] | |
| {'loss': '2.604', 'grad_norm': '0.986', 'learning_rate': '0.0004294', 'epoch': '0.4587'} | |
| {'loss': '2.609', 'grad_norm': '0.9908', 'learning_rate': '0.000428', 'epoch': '0.4669'} | |
| {'loss': '2.606', 'grad_norm': '0.9686', 'learning_rate': '0.0004267', 'epoch': '0.4751'} | |
| {'loss': '2.61', 'grad_norm': '1.009', 'learning_rate': '0.0004253', 'epoch': '0.4833'} | |
| {'loss': '2.606', 'grad_norm': '1.003', 'learning_rate': '0.0004239', 'epoch': '0.4915'} | |
| 16%|███████ | 6000/36624 [06:49<34:56, 14.61it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.44it/s] | |
| {'loss': '2.602', 'grad_norm': '0.9795', 'learning_rate': '0.0004225', 'epoch': '0.4997'} | |
| {'loss': '2.601', 'grad_norm': '1.023', 'learning_rate': '0.0004211', 'epoch': '0.5079'} | |
| {'loss': '2.596', 'grad_norm': '1.023', 'learning_rate': '0.0004197', 'epoch': '0.5161'} | |
| {'loss': '2.598', 'grad_norm': '0.9583', 'learning_rate': '0.0004184', 'epoch': '0.5243'} | |
| {'loss': '2.597', 'grad_norm': '0.9572', 'learning_rate': '0.000417', 'epoch': '0.5325'} | |
| 18%|███████ | 6500/36624 [07:24<34:21, 14.61it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 207.81it/s] | |
| {'loss': '2.596', 'grad_norm': '1.056', 'learning_rate': '0.0004156', 'epoch': '0.5407'} | |
| {'loss': '2.594', 'grad_norm': '1.007', 'learning_rate': '0.0004142', 'epoch': '0.5488'} | |
| {'loss': '2.593', 'grad_norm': '0.9365', 'learning_rate': '0.0004128', 'epoch': '0.557'} | |
| {'loss': '2.593', 'grad_norm': '0.9879', 'learning_rate': '0.0004114', 'epoch': '0.5652'} | |
| {'loss': '2.594', 'grad_norm': '1.078', 'learning_rate': '0.00041', 'epoch': '0.5734'} | |
| 19%|████████ | 7000/36624 [07:58<33:22, 14.79it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 149.60it/s] | |
| {'loss': '2.589', 'grad_norm': '1.011', 'learning_rate': '0.0004087', 'epoch': '0.5816'} | |
| {'loss': '2.585', 'grad_norm': '0.9979', 'learning_rate': '0.0004073', 'epoch': '0.5898'} | |
| {'loss': '2.587', 'grad_norm': '0.9675', 'learning_rate': '0.0004059', 'epoch': '0.598'} | |
| {'loss': '2.584', 'grad_norm': '0.9291', 'learning_rate': '0.0004045', 'epoch': '0.6062'} | |
| {'loss': '2.583', 'grad_norm': '0.9513', 'learning_rate': '0.0004031', 'epoch': '0.6144'} | |
| 20%|████████ | 7500/36624 [08:32<33:15, 14.60it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 179.61it/s] | |
| {'loss': '2.584', 'grad_norm': '1.012', 'learning_rate': '0.0004017', 'epoch': '0.6226'} | |
| {'loss': '2.585', 'grad_norm': '1.012', 'learning_rate': '0.0004004', 'epoch': '0.6308'} | |
| {'loss': '2.578', 'grad_norm': '1.016', 'learning_rate': '0.000399', 'epoch': '0.639'} | |
| {'loss': '2.58', 'grad_norm': '0.994', 'learning_rate': '0.0003976', 'epoch': '0.6471'} | |
| {'loss': '2.578', 'grad_norm': '1.003', 'learning_rate': '0.0003962', 'epoch': '0.6553'} | |
| 22%|█████████ | 8000/36624 [09:06<32:34, 14.64it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.38it/s] | |
| {'loss': '2.581', 'grad_norm': '1.01', 'learning_rate': '0.0003948', 'epoch': '0.6635'} | |
| {'loss': '2.573', 'grad_norm': '0.9192', 'learning_rate': '0.0003934', 'epoch': '0.6717'} | |
| {'loss': '2.577', 'grad_norm': '0.955', 'learning_rate': '0.0003921', 'epoch': '0.6799'} | |
| {'loss': '2.575', 'grad_norm': '1.005', 'learning_rate': '0.0003907', 'epoch': '0.6881'} | |
| {'loss': '2.577', 'grad_norm': '0.922', 'learning_rate': '0.0003893', 'epoch': '0.6963'} | |
| 23%|█████████ | 8500/36624 [09:40<31:54, 14.69it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.22it/s] | |
| {'loss': '2.573', 'grad_norm': '0.9621', 'learning_rate': '0.0003879', 'epoch': '0.7045'} | |
| {'loss': '2.57', 'grad_norm': '0.9889', 'learning_rate': '0.0003865', 'epoch': '0.7127'} | |
| {'loss': '2.568', 'grad_norm': '0.9244', 'learning_rate': '0.0003851', 'epoch': '0.7209'} | |
| {'loss': '2.57', 'grad_norm': '1.009', 'learning_rate': '0.0003837', 'epoch': '0.7291'} | |
| {'loss': '2.567', 'grad_norm': '0.9754', 'learning_rate': '0.0003824', 'epoch': '0.7373'} | |
| 25%|██████████ | 9000/36624 [10:14<31:31, 14.60it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.72it/s] | |
| {'loss': '2.57', 'grad_norm': '0.964', 'learning_rate': '0.000381', 'epoch': '0.7454'} | |
| {'loss': '2.567', 'grad_norm': '0.9354', 'learning_rate': '0.0003796', 'epoch': '0.7536'} | |
| {'loss': '2.569', 'grad_norm': '0.9461', 'learning_rate': '0.0003782', 'epoch': '0.7618'} | |
| {'loss': '2.565', 'grad_norm': '0.9415', 'learning_rate': '0.0003768', 'epoch': '0.77'} | |
| {'loss': '2.566', 'grad_norm': '0.9319', 'learning_rate': '0.0003754', 'epoch': '0.7782'} | |
| 26%|██████████ | 9500/36624 [10:49<31:23, 14.40it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.52it/s] | |
| {'loss': '2.56', 'grad_norm': '0.917', 'learning_rate': '0.0003741', 'epoch': '0.7864'} | |
| {'loss': '2.562', 'grad_norm': '0.982', 'learning_rate': '0.0003727', 'epoch': '0.7946'} | |
| {'loss': '2.563', 'grad_norm': '0.9996', 'learning_rate': '0.0003713', 'epoch': '0.8028'} | |
| {'loss': '2.559', 'grad_norm': '0.9066', 'learning_rate': '0.0003699', 'epoch': '0.811'} | |
| {'loss': '2.562', 'grad_norm': '0.9582', 'learning_rate': '0.0003685', 'epoch': '0.8192'} | |
| 27%|██████████ | 10000/36624 [11:23<30:09, 14.72it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.08it/s] | |
| {'loss': '2.557', 'grad_norm': '0.9477', 'learning_rate': '0.0003671', 'epoch': '0.8274'} | |
| {'loss': '2.56', 'grad_norm': '0.9513', 'learning_rate': '0.0003658', 'epoch': '0.8356'} | |
| {'loss': '2.559', 'grad_norm': '0.9462', 'learning_rate': '0.0003644', 'epoch': '0.8437'} | |
| {'loss': '2.558', 'grad_norm': '0.9505', 'learning_rate': '0.000363', 'epoch': '0.8519'} | |
| {'loss': '2.556', 'grad_norm': '0.9055', 'learning_rate': '0.0003616', 'epoch': '0.8601'} | |
| 29%|███████████ | 10500/36624 [11:57<29:42, 14.66it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 186.19it/s] | |
| {'loss': '2.552', 'grad_norm': '0.9765', 'learning_rate': '0.0003602', 'epoch': '0.8683'} | |
| {'loss': '2.557', 'grad_norm': '0.9443', 'learning_rate': '0.0003588', 'epoch': '0.8765'} | |
| {'loss': '2.555', 'grad_norm': '0.8971', 'learning_rate': '0.0003574', 'epoch': '0.8847'} | |
| {'loss': '2.553', 'grad_norm': '0.9489', 'learning_rate': '0.0003561', 'epoch': '0.8929'} | |
| {'loss': '2.552', 'grad_norm': '1', 'learning_rate': '0.0003547', 'epoch': '0.9011'} | |
| 30%|███████████ | 11000/36624 [12:31<28:47, 14.83it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.59it/s] | |
| {'loss': '2.557', 'grad_norm': '0.915', 'learning_rate': '0.0003533', 'epoch': '0.9093'} | |
| {'loss': '2.552', 'grad_norm': '0.911', 'learning_rate': '0.0003519', 'epoch': '0.9175'} | |
| {'loss': '2.554', 'grad_norm': '0.9488', 'learning_rate': '0.0003505', 'epoch': '0.9257'} | |
| {'loss': '2.547', 'grad_norm': '0.9326', 'learning_rate': '0.0003491', 'epoch': '0.9339'} | |
| {'loss': '2.555', 'grad_norm': '0.9041', 'learning_rate': '0.0003478', 'epoch': '0.942'} | |
| 31%|████████████ | 11500/36624 [13:06<28:39, 14.61it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.03it/s] | |
| {'loss': '2.547', 'grad_norm': '0.9229', 'learning_rate': '0.0003464', 'epoch': '0.9502'} | |
| {'loss': '2.547', 'grad_norm': '0.9645', 'learning_rate': '0.000345', 'epoch': '0.9584'} | |
| {'loss': '2.548', 'grad_norm': '0.9408', 'learning_rate': '0.0003436', 'epoch': '0.9666'} | |
| {'loss': '2.546', 'grad_norm': '0.9032', 'learning_rate': '0.0003422', 'epoch': '0.9748'} | |
| {'loss': '2.549', 'grad_norm': '0.918', 'learning_rate': '0.0003408', 'epoch': '0.983'} | |
| 33%|████████████ | 12000/36624 [13:40<28:04, 14.62it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.44it/s] | |
| {'loss': '2.547', 'grad_norm': '0.9086', 'learning_rate': '0.0003395', 'epoch': '0.9912'} | |
| {'loss': '2.544', 'grad_norm': '0.9125', 'learning_rate': '0.0003381', 'epoch': '0.9994'} | |
| {'loss': '2.541', 'grad_norm': '0.9181', 'learning_rate': '0.0003367', 'epoch': '1.008'} | |
| {'loss': '2.545', 'grad_norm': '0.9132', 'learning_rate': '0.0003353', 'epoch': '1.016'} | |
| {'loss': '2.542', 'grad_norm': '0.9156', 'learning_rate': '0.0003339', 'epoch': '1.024'} | |
| 34%|█████████████ | 12500/36624 [14:15<27:29, 14.62it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 114.07it/s] | |
| {'loss': '2.538', 'grad_norm': '0.9441', 'learning_rate': '0.0003325', 'epoch': '1.032'} | |
| {'loss': '2.542', 'grad_norm': '0.9385', 'learning_rate': '0.0003312', 'epoch': '1.04'} | |
| {'loss': '2.536', 'grad_norm': '0.9842', 'learning_rate': '0.0003298', 'epoch': '1.048'} | |
| {'loss': '2.542', 'grad_norm': '0.9319', 'learning_rate': '0.0003284', 'epoch': '1.057'} | |
| {'loss': '2.537', 'grad_norm': '0.8883', 'learning_rate': '0.000327', 'epoch': '1.065'} | |
| 35%|██████████████ | 13000/36624 [14:50<27:04, 14.54it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 171.43it/s] | |
| {'loss': '2.54', 'grad_norm': '0.9869', 'learning_rate': '0.0003256', 'epoch': '1.073'} | |
| {'loss': '2.539', 'grad_norm': '0.8919', 'learning_rate': '0.0003242', 'epoch': '1.081'} | |
| {'loss': '2.533', 'grad_norm': '0.9155', 'learning_rate': '0.0003228', 'epoch': '1.089'} | |
| {'loss': '2.537', 'grad_norm': '0.9485', 'learning_rate': '0.0003215', 'epoch': '1.098'} | |
| {'loss': '2.539', 'grad_norm': '0.9354', 'learning_rate': '0.0003201', 'epoch': '1.106'} | |
| 37%|██████████████ | 13500/36624 [15:24<26:16, 14.67it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.73it/s] | |
| {'loss': '2.535', 'grad_norm': '0.9028', 'learning_rate': '0.0003187', 'epoch': '1.114'} | |
| {'loss': '2.533', 'grad_norm': '0.9042', 'learning_rate': '0.0003173', 'epoch': '1.122'} | |
| {'loss': '2.533', 'grad_norm': '0.9192', 'learning_rate': '0.0003159', 'epoch': '1.13'} | |
| {'loss': '2.533', 'grad_norm': '0.8816', 'learning_rate': '0.0003145', 'epoch': '1.139'} | |
| {'loss': '2.53', 'grad_norm': '0.9064', 'learning_rate': '0.0003132', 'epoch': '1.147'} | |
| 38%|███████████████ | 14000/36624 [15:58<26:09, 14.42it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.55it/s] | |
| {'loss': '2.534', 'grad_norm': '0.9424', 'learning_rate': '0.0003118', 'epoch': '1.155'} | |
| {'loss': '2.53', 'grad_norm': '0.9198', 'learning_rate': '0.0003104', 'epoch': '1.163'} | |
| {'loss': '2.53', 'grad_norm': '0.9234', 'learning_rate': '0.000309', 'epoch': '1.171'} | |
| {'loss': '2.533', 'grad_norm': '1.027', 'learning_rate': '0.0003076', 'epoch': '1.18'} | |
| {'loss': '2.531', 'grad_norm': '0.9083', 'learning_rate': '0.0003062', 'epoch': '1.188'} | |
| 40%|███████████████ | 14500/36624 [16:32<25:17, 14.58it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.31it/s] | |
| {'loss': '2.53', 'grad_norm': '0.8941', 'learning_rate': '0.0003049', 'epoch': '1.196'} | |
| {'loss': '2.533', 'grad_norm': '0.9395', 'learning_rate': '0.0003035', 'epoch': '1.204'} | |
| {'loss': '2.53', 'grad_norm': '0.9605', 'learning_rate': '0.0003021', 'epoch': '1.212'} | |
| {'loss': '2.53', 'grad_norm': '0.9029', 'learning_rate': '0.0003007', 'epoch': '1.221'} | |
| {'loss': '2.529', 'grad_norm': '0.9056', 'learning_rate': '0.0002993', 'epoch': '1.229'} | |
| 41%|████████████████ | 15000/36624 [17:07<24:39, 14.62it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.68it/s] | |
| {'loss': '2.528', 'grad_norm': '0.8955', 'learning_rate': '0.0002979', 'epoch': '1.237'} | |
| {'loss': '2.53', 'grad_norm': '0.9041', 'learning_rate': '0.0002965', 'epoch': '1.245'} | |
| {'loss': '2.527', 'grad_norm': '0.9242', 'learning_rate': '0.0002952', 'epoch': '1.253'} | |
| {'loss': '2.525', 'grad_norm': '0.9313', 'learning_rate': '0.0002938', 'epoch': '1.261'} | |
| {'loss': '2.525', 'grad_norm': '0.9721', 'learning_rate': '0.0002924', 'epoch': '1.27'} | |
| 42%|████████████████ | 15500/36624 [17:41<23:50, 14.77it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 195.36it/s] | |
| {'loss': '2.522', 'grad_norm': '0.9043', 'learning_rate': '0.000291', 'epoch': '1.278'} | |
| {'loss': '2.524', 'grad_norm': '0.9181', 'learning_rate': '0.0002896', 'epoch': '1.286'} | |
| {'loss': '2.527', 'grad_norm': '0.9111', 'learning_rate': '0.0002882', 'epoch': '1.294'} | |
| {'loss': '2.523', 'grad_norm': '0.9105', 'learning_rate': '0.0002869', 'epoch': '1.302'} | |
| {'loss': '2.526', 'grad_norm': '1.005', 'learning_rate': '0.0002855', 'epoch': '1.311'} | |
| 44%|█████████████████ | 16000/36624 [18:15<23:29, 14.63it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.30it/s] | |
| {'loss': '2.526', 'grad_norm': '0.9184', 'learning_rate': '0.0002841', 'epoch': '1.319'} | |
| {'loss': '2.52', 'grad_norm': '0.8872', 'learning_rate': '0.0002827', 'epoch': '1.327'} | |
| {'loss': '2.519', 'grad_norm': '0.9441', 'learning_rate': '0.0002813', 'epoch': '1.335'} | |
| {'loss': '2.525', 'grad_norm': '0.9462', 'learning_rate': '0.0002799', 'epoch': '1.343'} | |
| {'loss': '2.525', 'grad_norm': '0.9307', 'learning_rate': '0.0002786', 'epoch': '1.352'} | |
| 45%|█████████████████ | 16500/36624 [18:49<23:00, 14.58it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.49it/s] | |
| {'loss': '2.519', 'grad_norm': '0.9708', 'learning_rate': '0.0002772', 'epoch': '1.36'} | |
| {'loss': '2.522', 'grad_norm': '0.9035', 'learning_rate': '0.0002758', 'epoch': '1.368'} | |
| {'loss': '2.518', 'grad_norm': '0.9394', 'learning_rate': '0.0002744', 'epoch': '1.376'} | |
| {'loss': '2.521', 'grad_norm': '0.9519', 'learning_rate': '0.000273', 'epoch': '1.384'} | |
| {'loss': '2.518', 'grad_norm': '0.915', 'learning_rate': '0.0002716', 'epoch': '1.393'} | |
| 46%|██████████████████ | 17000/36624 [19:23<22:15, 14.69it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.87it/s] | |
| {'loss': '2.517', 'grad_norm': '0.9166', 'learning_rate': '0.0002702', 'epoch': '1.401'} | |
| {'loss': '2.513', 'grad_norm': '0.9377', 'learning_rate': '0.0002689', 'epoch': '1.409'} | |
| {'loss': '2.516', 'grad_norm': '0.9178', 'learning_rate': '0.0002675', 'epoch': '1.417'} | |
| {'loss': '2.519', 'grad_norm': '0.9151', 'learning_rate': '0.0002661', 'epoch': '1.425'} | |
| {'loss': '2.515', 'grad_norm': '0.9612', 'learning_rate': '0.0002647', 'epoch': '1.434'} | |
| 48%|██████████████████ | 17500/36624 [19:58<21:56, 14.53it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.02it/s] | |
| {'loss': '2.519', 'grad_norm': '0.9229', 'learning_rate': '0.0002633', 'epoch': '1.442'} | |
| {'loss': '2.518', 'grad_norm': '0.9195', 'learning_rate': '0.0002619', 'epoch': '1.45'} | |
| {'loss': '2.514', 'grad_norm': '0.9046', 'learning_rate': '0.0002606', 'epoch': '1.458'} | |
| {'loss': '2.52', 'grad_norm': '0.9383', 'learning_rate': '0.0002592', 'epoch': '1.466'} | |
| {'loss': '2.516', 'grad_norm': '0.9361', 'learning_rate': '0.0002578', 'epoch': '1.474'} | |
| 49%|███████████████████ | 18000/36624 [20:32<21:18, 14.57it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.81it/s] | |
| {'loss': '2.509', 'grad_norm': '0.9623', 'learning_rate': '0.0002564', 'epoch': '1.483'} | |
| {'loss': '2.511', 'grad_norm': '0.9627', 'learning_rate': '0.000255', 'epoch': '1.491'} | |
| {'loss': '2.516', 'grad_norm': '0.9481', 'learning_rate': '0.0002536', 'epoch': '1.499'} | |
| {'loss': '2.516', 'grad_norm': '0.9699', 'learning_rate': '0.0002523', 'epoch': '1.507'} | |
| {'loss': '2.514', 'grad_norm': '0.9232', 'learning_rate': '0.0002509', 'epoch': '1.515'} | |
| 51%|███████████████████ | 18500/36624 [21:06<20:39, 14.63it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.07it/s] | |
| {'loss': '2.508', 'grad_norm': '0.8967', 'learning_rate': '0.0002495', 'epoch': '1.524'} | |
| {'loss': '2.51', 'grad_norm': '0.9512', 'learning_rate': '0.0002481', 'epoch': '1.532'} | |
| {'loss': '2.511', 'grad_norm': '0.9096', 'learning_rate': '0.0002467', 'epoch': '1.54'} | |
| {'loss': '2.509', 'grad_norm': '0.9213', 'learning_rate': '0.0002453', 'epoch': '1.548'} | |
| {'loss': '2.513', 'grad_norm': '0.9172', 'learning_rate': '0.000244', 'epoch': '1.556'} | |
| 52%|████████████████████ | 19000/36624 [21:40<20:00, 14.69it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.06it/s] | |
| {'loss': '2.51', 'grad_norm': '0.9369', 'learning_rate': '0.0002426', 'epoch': '1.565'} | |
| {'loss': '2.512', 'grad_norm': '0.9091', 'learning_rate': '0.0002412', 'epoch': '1.573'} | |
| {'loss': '2.512', 'grad_norm': '0.8935', 'learning_rate': '0.0002398', 'epoch': '1.581'} | |
| {'loss': '2.51', 'grad_norm': '0.9206', 'learning_rate': '0.0002384', 'epoch': '1.589'} | |
| {'loss': '2.507', 'grad_norm': '0.9272', 'learning_rate': '0.000237', 'epoch': '1.597'} | |
| 53%|████████████████████ | 19500/36624 [22:15<19:28, 14.66it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.45it/s] | |
| {'loss': '2.51', 'grad_norm': '0.9499', 'learning_rate': '0.0002356', 'epoch': '1.606'} | |
| {'loss': '2.513', 'grad_norm': '0.9095', 'learning_rate': '0.0002343', 'epoch': '1.614'} | |
| {'loss': '2.508', 'grad_norm': '0.9086', 'learning_rate': '0.0002329', 'epoch': '1.622'} | |
| {'loss': '2.507', 'grad_norm': '0.9389', 'learning_rate': '0.0002315', 'epoch': '1.63'} | |
| {'loss': '2.514', 'grad_norm': '0.8963', 'learning_rate': '0.0002301', 'epoch': '1.638'} | |
| 55%|█████████████████████ | 20000/36624 [22:49<18:57, 14.61it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.24it/s] | |
| {'loss': '2.506', 'grad_norm': '0.978', 'learning_rate': '0.0002287', 'epoch': '1.646'} | |
| {'loss': '2.507', 'grad_norm': '0.9966', 'learning_rate': '0.0002273', 'epoch': '1.655'} | |
| {'loss': '2.507', 'grad_norm': '0.9281', 'learning_rate': '0.000226', 'epoch': '1.663'} | |
| {'loss': '2.51', 'grad_norm': '0.9063', 'learning_rate': '0.0002246', 'epoch': '1.671'} | |
| {'loss': '2.509', 'grad_norm': '0.9708', 'learning_rate': '0.0002232', 'epoch': '1.679'} | |
| 56%|█████████████████████ | 20500/36624 [23:23<18:18, 14.67it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.16it/s] | |
| {'loss': '2.505', 'grad_norm': '0.946', 'learning_rate': '0.0002218', 'epoch': '1.687'} | |
| {'loss': '2.507', 'grad_norm': '0.9184', 'learning_rate': '0.0002204', 'epoch': '1.696'} | |
| {'loss': '2.506', 'grad_norm': '0.9702', 'learning_rate': '0.000219', 'epoch': '1.704'} | |
| {'loss': '2.499', 'grad_norm': '0.9535', 'learning_rate': '0.0002177', 'epoch': '1.712'} | |
| {'loss': '2.502', 'grad_norm': '0.9017', 'learning_rate': '0.0002163', 'epoch': '1.72'} | |
| 57%|██████████████████████ | 21000/36624 [23:57<18:03, 14.42it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 190.95it/s] | |
| {'loss': '2.509', 'grad_norm': '0.9587', 'learning_rate': '0.0002149', 'epoch': '1.728'} | |
| {'loss': '2.504', 'grad_norm': '0.9648', 'learning_rate': '0.0002135', 'epoch': '1.737'} | |
| {'loss': '2.503', 'grad_norm': '0.953', 'learning_rate': '0.0002121', 'epoch': '1.745'} | |
| {'loss': '2.5', 'grad_norm': '0.9445', 'learning_rate': '0.0002107', 'epoch': '1.753'} | |
| {'loss': '2.501', 'grad_norm': '0.9414', 'learning_rate': '0.0002093', 'epoch': '1.761'} | |
| 59%|██████████████████████ | 21500/36624 [24:32<17:10, 14.67it/s] | |
| Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 52.34it/s] | |
| {'loss': '2.503', 'grad_norm': '0.9309', 'learning_rate': '0.000208', 'epoch': '1.769'} | |
| {'loss': '2.502', 'grad_norm': '0.9301', 'learning_rate': '0.0002066', 'epoch': '1.778'} | |
| {'loss': '2.504', 'grad_norm': '0.895', 'learning_rate': '0.0002052', 'epoch': '1.786'} | |
| {'loss': '2.502', 'grad_norm': '0.9428', 'learning_rate': '0.0002038', 'epoch': '1.794'} | |
| {'loss': '2.501', 'grad_norm': '0.9539', 'learning_rate': '0.0002024', 'epoch': '1.802'} | |
| 60%|███████████████████████ | 22000/36624 [25:06<16:28, 14.79it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 203.41it/s] | |
| {'loss': '2.5', 'grad_norm': '0.9179', 'learning_rate': '0.000201', 'epoch': '1.81'} | |
| {'loss': '2.501', 'grad_norm': '0.9195', 'learning_rate': '0.0001997', 'epoch': '1.819'} | |
| {'loss': '2.499', 'grad_norm': '1.047', 'learning_rate': '0.0001983', 'epoch': '1.827'} | |
| {'loss': '2.499', 'grad_norm': '0.931', 'learning_rate': '0.0001969', 'epoch': '1.835'} | |
| {'loss': '2.499', 'grad_norm': '0.9269', 'learning_rate': '0.0001955', 'epoch': '1.843'} | |
| 61%|███████████████████████ | 22500/36624 [25:40<16:03, 14.65it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.41it/s] | |
| {'loss': '2.501', 'grad_norm': '0.939', 'learning_rate': '0.0001941', 'epoch': '1.851'} | |
| {'loss': '2.495', 'grad_norm': '0.9119', 'learning_rate': '0.0001927', 'epoch': '1.859'} | |
| {'loss': '2.499', 'grad_norm': '0.9755', 'learning_rate': '0.0001914', 'epoch': '1.868'} | |
| {'loss': '2.497', 'grad_norm': '0.9444', 'learning_rate': '0.00019', 'epoch': '1.876'} | |
| {'loss': '2.496', 'grad_norm': '0.9551', 'learning_rate': '0.0001886', 'epoch': '1.884'} | |
| 63%|████████████████████████ | 23000/36624 [26:15<15:34, 14.58it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.11it/s] | |
| {'loss': '2.5', 'grad_norm': '0.9524', 'learning_rate': '0.0001872', 'epoch': '1.892'} | |
| {'loss': '2.502', 'grad_norm': '0.9583', 'learning_rate': '0.0001858', 'epoch': '1.9'} | |
| {'loss': '2.497', 'grad_norm': '0.9206', 'learning_rate': '0.0001844', 'epoch': '1.909'} | |
| {'loss': '2.495', 'grad_norm': '0.9133', 'learning_rate': '0.0001831', 'epoch': '1.917'} | |
| {'loss': '2.491', 'grad_norm': '0.9201', 'learning_rate': '0.0001817', 'epoch': '1.925'} | |
| 64%|████████████████████████ | 23500/36624 [26:49<14:51, 14.72it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 175.74it/s] | |
| {'loss': '2.499', 'grad_norm': '0.9536', 'learning_rate': '0.0001803', 'epoch': '1.933'} | |
| {'loss': '2.497', 'grad_norm': '0.9332', 'learning_rate': '0.0001789', 'epoch': '1.941'} | |
| {'loss': '2.491', 'grad_norm': '0.9358', 'learning_rate': '0.0001775', 'epoch': '1.95'} | |
| {'loss': '2.493', 'grad_norm': '0.9568', 'learning_rate': '0.0001761', 'epoch': '1.958'} | |
| {'loss': '2.494', 'grad_norm': '0.9585', 'learning_rate': '0.0001747', 'epoch': '1.966'} | |
| 66%|█████████████████████████ | 24000/36624 [27:23<14:14, 14.78it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.42it/s] | |
| {'loss': '2.496', 'grad_norm': '0.9171', 'learning_rate': '0.0001734', 'epoch': '1.974'} | |
| {'loss': '2.495', 'grad_norm': '0.9789', 'learning_rate': '0.000172', 'epoch': '1.982'} | |
| {'loss': '2.493', 'grad_norm': '0.9548', 'learning_rate': '0.0001706', 'epoch': '1.991'} | |
| {'loss': '2.495', 'grad_norm': '1.021', 'learning_rate': '0.0001692', 'epoch': '1.999'} | |
| {'loss': '2.486', 'grad_norm': '0.9158', 'learning_rate': '0.0001678', 'epoch': '2.007'} | |
| 67%|█████████████████████████ | 24500/36624 [27:58<14:15, 14.17it/s] | |
| Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 42.99it/s] | |
| {'loss': '2.494', 'grad_norm': '0.9763', 'learning_rate': '0.0001664', 'epoch': '2.015'} | |
| {'loss': '2.495', 'grad_norm': '0.9613', 'learning_rate': '0.0001651', 'epoch': '2.023'} | |
| {'loss': '2.489', 'grad_norm': '0.9664', 'learning_rate': '0.0001637', 'epoch': '2.031'} | |
| {'loss': '2.498', 'grad_norm': '1.016', 'learning_rate': '0.0001623', 'epoch': '2.04'} | |
| {'loss': '2.488', 'grad_norm': '0.9416', 'learning_rate': '0.0001609', 'epoch': '2.048'} | |
| 68%|██████████████████████████ | 25000/36624 [28:33<13:16, 14.60it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.22it/s] | |
| {'loss': '2.489', 'grad_norm': '0.9494', 'learning_rate': '0.0001595', 'epoch': '2.056'} | |
| {'loss': '2.486', 'grad_norm': '0.9252', 'learning_rate': '0.0001581', 'epoch': '2.064'} | |
| {'loss': '2.492', 'grad_norm': '0.9568', 'learning_rate': '0.0001568', 'epoch': '2.072'} | |
| {'loss': '2.492', 'grad_norm': '0.9466', 'learning_rate': '0.0001554', 'epoch': '2.081'} | |
| {'loss': '2.486', 'grad_norm': '0.9349', 'learning_rate': '0.000154', 'epoch': '2.089'} | |
| 70%|██████████████████████████ | 25500/36624 [29:07<12:43, 14.57it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 167.93it/s] | |
| {'loss': '2.489', 'grad_norm': '0.9689', 'learning_rate': '0.0001526', 'epoch': '2.097'} | |
| {'loss': '2.488', 'grad_norm': '0.9909', 'learning_rate': '0.0001512', 'epoch': '2.105'} | |
| {'loss': '2.49', 'grad_norm': '0.9703', 'learning_rate': '0.0001498', 'epoch': '2.113'} | |
| {'loss': '2.487', 'grad_norm': '1.02', 'learning_rate': '0.0001484', 'epoch': '2.122'} | |
| {'loss': '2.485', 'grad_norm': '1.005', 'learning_rate': '0.0001471', 'epoch': '2.13'} | |
| 71%|███████████████████████████ | 26000/36624 [29:41<12:02, 14.70it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.60it/s] | |
| {'loss': '2.486', 'grad_norm': '0.9853', 'learning_rate': '0.0001457', 'epoch': '2.138'} | |
| {'loss': '2.485', 'grad_norm': '0.9916', 'learning_rate': '0.0001443', 'epoch': '2.146'} | |
| {'loss': '2.488', 'grad_norm': '0.9691', 'learning_rate': '0.0001429', 'epoch': '2.154'} | |
| {'loss': '2.488', 'grad_norm': '0.9773', 'learning_rate': '0.0001415', 'epoch': '2.163'} | |
| {'loss': '2.482', 'grad_norm': '0.953', 'learning_rate': '0.0001401', 'epoch': '2.171'} | |
| 72%|███████████████████████████ | 26500/36624 [30:15<11:31, 14.63it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.70it/s] | |
| {'loss': '2.487', 'grad_norm': '0.9432', 'learning_rate': '0.0001388', 'epoch': '2.179'} | |
| {'loss': '2.487', 'grad_norm': '0.962', 'learning_rate': '0.0001374', 'epoch': '2.187'} | |
| {'loss': '2.488', 'grad_norm': '0.9646', 'learning_rate': '0.000136', 'epoch': '2.195'} | |
| {'loss': '2.483', 'grad_norm': '0.9822', 'learning_rate': '0.0001346', 'epoch': '2.203'} | |
| {'loss': '2.484', 'grad_norm': '0.9462', 'learning_rate': '0.0001332', 'epoch': '2.212'} | |
| 74%|████████████████████████████ | 27000/36624 [30:50<11:02, 14.53it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 148.78it/s] | |
| {'loss': '2.485', 'grad_norm': '0.983', 'learning_rate': '0.0001318', 'epoch': '2.22'} | |
| {'loss': '2.483', 'grad_norm': '0.9827', 'learning_rate': '0.0001305', 'epoch': '2.228'} | |
| {'loss': '2.486', 'grad_norm': '0.987', 'learning_rate': '0.0001291', 'epoch': '2.236'} | |
| {'loss': '2.486', 'grad_norm': '1.003', 'learning_rate': '0.0001277', 'epoch': '2.244'} | |
| {'loss': '2.487', 'grad_norm': '0.9763', 'learning_rate': '0.0001263', 'epoch': '2.253'} | |
| 75%|████████████████████████████ | 27500/36624 [31:24<10:20, 14.70it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.76it/s] | |
| {'loss': '2.484', 'grad_norm': '0.9718', 'learning_rate': '0.0001249', 'epoch': '2.261'} | |
| {'loss': '2.482', 'grad_norm': '0.964', 'learning_rate': '0.0001235', 'epoch': '2.269'} | |
| {'loss': '2.486', 'grad_norm': '0.9918', 'learning_rate': '0.0001221', 'epoch': '2.277'} | |
| {'loss': '2.482', 'grad_norm': '0.9895', 'learning_rate': '0.0001208', 'epoch': '2.285'} | |
| {'loss': '2.483', 'grad_norm': '0.9978', 'learning_rate': '0.0001194', 'epoch': '2.294'} | |
| 76%|█████████████████████████████ | 28000/36624 [31:58<10:07, 14.20it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.44it/s] | |
| {'loss': '2.484', 'grad_norm': '1.008', 'learning_rate': '0.000118', 'epoch': '2.302'} | |
| {'loss': '2.484', 'grad_norm': '1.029', 'learning_rate': '0.0001166', 'epoch': '2.31'} | |
| {'loss': '2.482', 'grad_norm': '0.9828', 'learning_rate': '0.0001152', 'epoch': '2.318'} | |
| {'loss': '2.484', 'grad_norm': '0.9815', 'learning_rate': '0.0001138', 'epoch': '2.326'} | |
| {'loss': '2.484', 'grad_norm': '0.97', 'learning_rate': '0.0001125', 'epoch': '2.335'} | |
| 78%|█████████████████████████████ | 28500/36624 [32:32<09:16, 14.61it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.61it/s] | |
| {'loss': '2.48', 'grad_norm': '1.007', 'learning_rate': '0.0001111', 'epoch': '2.343'} | |
| {'loss': '2.477', 'grad_norm': '1.018', 'learning_rate': '0.0001097', 'epoch': '2.351'} | |
| {'loss': '2.477', 'grad_norm': '0.945', 'learning_rate': '0.0001083', 'epoch': '2.359'} | |
| {'loss': '2.479', 'grad_norm': '1.001', 'learning_rate': '0.0001069', 'epoch': '2.367'} | |
| {'loss': '2.48', 'grad_norm': '0.9743', 'learning_rate': '0.0001055', 'epoch': '2.376'} | |
| 79%|██████████████████████████████ | 29000/36624 [33:07<08:43, 14.57it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.78it/s] | |
| {'loss': '2.48', 'grad_norm': '1.005', 'learning_rate': '0.0001042', 'epoch': '2.384'} | |
| {'loss': '2.482', 'grad_norm': '1.016', 'learning_rate': '0.0001028', 'epoch': '2.392'} | |
| {'loss': '2.473', 'grad_norm': '0.9854', 'learning_rate': '0.0001014', 'epoch': '2.4'} | |
| {'loss': '2.476', 'grad_norm': '0.9408', 'learning_rate': '0.0001', 'epoch': '2.408'} | |
| {'loss': '2.475', 'grad_norm': '0.9968', 'learning_rate': '9.862e-05', 'epoch': '2.416'} | |
| 81%|██████████████████████████████ | 29500/36624 [33:41<08:09, 14.55it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.43it/s] | |
| {'loss': '2.476', 'grad_norm': '1.016', 'learning_rate': '9.723e-05', 'epoch': '2.425'} | |
| {'loss': '2.48', 'grad_norm': '0.9962', 'learning_rate': '9.585e-05', 'epoch': '2.433'} | |
| {'loss': '2.478', 'grad_norm': '1.032', 'learning_rate': '9.447e-05', 'epoch': '2.441'} | |
| {'loss': '2.477', 'grad_norm': '1.001', 'learning_rate': '9.308e-05', 'epoch': '2.449'} | |
| {'loss': '2.477', 'grad_norm': '0.9866', 'learning_rate': '9.17e-05', 'epoch': '2.457'} | |
| 82%|███████████████████████████████ | 30000/36624 [34:15<07:36, 14.53it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.80it/s] | |
| {'loss': '2.476', 'grad_norm': '1.028', 'learning_rate': '9.031e-05', 'epoch': '2.466'} | |
| {'loss': '2.475', 'grad_norm': '0.9801', 'learning_rate': '8.893e-05', 'epoch': '2.474'} | |
| {'loss': '2.48', 'grad_norm': '0.9972', 'learning_rate': '8.755e-05', 'epoch': '2.482'} | |
| {'loss': '2.479', 'grad_norm': '0.9972', 'learning_rate': '8.616e-05', 'epoch': '2.49'} | |
| {'loss': '2.479', 'grad_norm': '1.076', 'learning_rate': '8.478e-05', 'epoch': '2.498'} | |
| 83%|███████████████████████████████ | 30500/36624 [34:50<07:03, 14.47it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 156.83it/s] | |
| {'loss': '2.472', 'grad_norm': '0.9888', 'learning_rate': '8.339e-05', 'epoch': '2.507'} | |
| {'loss': '2.477', 'grad_norm': '1.045', 'learning_rate': '8.201e-05', 'epoch': '2.515'} | |
| {'loss': '2.475', 'grad_norm': '1.039', 'learning_rate': '8.063e-05', 'epoch': '2.523'} | |
| {'loss': '2.476', 'grad_norm': '1.038', 'learning_rate': '7.924e-05', 'epoch': '2.531'} | |
| {'loss': '2.474', 'grad_norm': '1.013', 'learning_rate': '7.786e-05', 'epoch': '2.539'} | |
| 85%|████████████████████████████████ | 31000/36624 [35:25<06:26, 14.55it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 173.79it/s] | |
| {'loss': '2.478', 'grad_norm': '0.9911', 'learning_rate': '7.647e-05', 'epoch': '2.548'} | |
| {'loss': '2.475', 'grad_norm': '1.003', 'learning_rate': '7.509e-05', 'epoch': '2.556'} | |
| {'loss': '2.476', 'grad_norm': '0.9986', 'learning_rate': '7.37e-05', 'epoch': '2.564'} | |
| {'loss': '2.475', 'grad_norm': '1.034', 'learning_rate': '7.232e-05', 'epoch': '2.572'} | |
| {'loss': '2.476', 'grad_norm': '0.9733', 'learning_rate': '7.094e-05', 'epoch': '2.58'} | |
| 86%|████████████████████████████████ | 31500/36624 [35:59<05:53, 14.51it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.57it/s] | |
| {'loss': '2.468', 'grad_norm': '1.055', 'learning_rate': '6.955e-05', 'epoch': '2.588'} | |
| {'loss': '2.475', 'grad_norm': '1.026', 'learning_rate': '6.817e-05', 'epoch': '2.597'} | |
| {'loss': '2.476', 'grad_norm': '1.029', 'learning_rate': '6.678e-05', 'epoch': '2.605'} | |
| {'loss': '2.471', 'grad_norm': '1.032', 'learning_rate': '6.54e-05', 'epoch': '2.613'} | |
| {'loss': '2.473', 'grad_norm': '1.007', 'learning_rate': '6.402e-05', 'epoch': '2.621'} | |
| 87%|█████████████████████████████████ | 32000/36624 [36:34<05:15, 14.66it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.78it/s] | |
| {'loss': '2.47', 'grad_norm': '1.028', 'learning_rate': '6.263e-05', 'epoch': '2.629'} | |
| {'loss': '2.47', 'grad_norm': '0.9969', 'learning_rate': '6.125e-05', 'epoch': '2.638'} | |
| {'loss': '2.473', 'grad_norm': '1.037', 'learning_rate': '5.986e-05', 'epoch': '2.646'} | |
| {'loss': '2.47', 'grad_norm': '1', 'learning_rate': '5.848e-05', 'epoch': '2.654'} | |
| {'loss': '2.468', 'grad_norm': '1.034', 'learning_rate': '5.71e-05', 'epoch': '2.662'} | |
| 89%|█████████████████████████████████ | 32500/36624 [37:08<04:47, 14.34it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.32it/s] | |
| {'loss': '2.471', 'grad_norm': '1.073', 'learning_rate': '5.571e-05', 'epoch': '2.67'} | |
| {'loss': '2.47', 'grad_norm': '1.003', 'learning_rate': '5.433e-05', 'epoch': '2.679'} | |
| {'loss': '2.472', 'grad_norm': '1.033', 'learning_rate': '5.294e-05', 'epoch': '2.687'} | |
| {'loss': '2.469', 'grad_norm': '1.076', 'learning_rate': '5.156e-05', 'epoch': '2.695'} | |
| {'loss': '2.469', 'grad_norm': '1.061', 'learning_rate': '5.017e-05', 'epoch': '2.703'} | |
| 90%|██████████████████████████████████ | 33000/36624 [37:43<04:11, 14.39it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.79it/s] | |
| {'loss': '2.468', 'grad_norm': '1.043', 'learning_rate': '4.879e-05', 'epoch': '2.711'} | |
| {'loss': '2.474', 'grad_norm': '1.115', 'learning_rate': '4.741e-05', 'epoch': '2.72'} | |
| {'loss': '2.469', 'grad_norm': '1.028', 'learning_rate': '4.602e-05', 'epoch': '2.728'} | |
| {'loss': '2.468', 'grad_norm': '1.017', 'learning_rate': '4.464e-05', 'epoch': '2.736'} | |
| {'loss': '2.471', 'grad_norm': '0.9948', 'learning_rate': '4.325e-05', 'epoch': '2.744'} | |
| 91%|██████████████████████████████████ | 33500/36624 [38:18<03:37, 14.39it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 155.26it/s] | |
| {'loss': '2.466', 'grad_norm': '1.027', 'learning_rate': '4.187e-05', 'epoch': '2.752'} | |
| {'loss': '2.467', 'grad_norm': '1.023', 'learning_rate': '4.049e-05', 'epoch': '2.761'} | |
| {'loss': '2.47', 'grad_norm': '1.021', 'learning_rate': '3.91e-05', 'epoch': '2.769'} | |
| {'loss': '2.468', 'grad_norm': '0.9946', 'learning_rate': '3.772e-05', 'epoch': '2.777'} | |
| {'loss': '2.464', 'grad_norm': '1.031', 'learning_rate': '3.633e-05', 'epoch': '2.785'} | |
| 93%|███████████████████████████████████ | 34000/36624 [38:52<03:00, 14.52it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.19it/s] | |
| {'loss': '2.467', 'grad_norm': '1.05', 'learning_rate': '3.495e-05', 'epoch': '2.793'} | |
| {'loss': '2.469', 'grad_norm': '1.043', 'learning_rate': '3.356e-05', 'epoch': '2.801'} | |
| {'loss': '2.468', 'grad_norm': '0.9955', 'learning_rate': '3.218e-05', 'epoch': '2.81'} | |
| {'loss': '2.461', 'grad_norm': '0.9882', 'learning_rate': '3.08e-05', 'epoch': '2.818'} | |
| {'loss': '2.463', 'grad_norm': '1.023', 'learning_rate': '2.941e-05', 'epoch': '2.826'} | |
| 94%|███████████████████████████████████ | 34500/36624 [39:27<02:28, 14.27it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.85it/s] | |
| {'loss': '2.464', 'grad_norm': '1.062', 'learning_rate': '2.803e-05', 'epoch': '2.834'} | |
| {'loss': '2.465', 'grad_norm': '1.065', 'learning_rate': '2.664e-05', 'epoch': '2.842'} | |
| {'loss': '2.468', 'grad_norm': '1.01', 'learning_rate': '2.526e-05', 'epoch': '2.851'} | |
| {'loss': '2.463', 'grad_norm': '0.9994', 'learning_rate': '2.388e-05', 'epoch': '2.859'} | |
| {'loss': '2.464', 'grad_norm': '1.032', 'learning_rate': '2.249e-05', 'epoch': '2.867'} | |
| 96%|████████████████████████████████████ | 35000/36624 [40:02<01:53, 14.32it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.93it/s] | |
| {'loss': '2.467', 'grad_norm': '1.233', 'learning_rate': '2.111e-05', 'epoch': '2.875'} | |
| {'loss': '2.466', 'grad_norm': '1.069', 'learning_rate': '1.972e-05', 'epoch': '2.883'} | |
| {'loss': '2.469', 'grad_norm': '1.015', 'learning_rate': '1.834e-05', 'epoch': '2.892'} | |
| {'loss': '2.466', 'grad_norm': '1.033', 'learning_rate': '1.696e-05', 'epoch': '2.9'} | |
| {'loss': '2.463', 'grad_norm': '1.04', 'learning_rate': '1.557e-05', 'epoch': '2.908'} | |
| 97%|████████████████████████████████████ | 35500/36624 [40:37<01:17, 14.42it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.19it/s] | |
| {'loss': '2.469', 'grad_norm': '1.007', 'learning_rate': '1.419e-05', 'epoch': '2.916'} | |
| {'loss': '2.468', 'grad_norm': '1.025', 'learning_rate': '1.28e-05', 'epoch': '2.924'} | |
| {'loss': '2.465', 'grad_norm': '1.033', 'learning_rate': '1.142e-05', 'epoch': '2.933'} | |
| {'loss': '2.464', 'grad_norm': '1.045', 'learning_rate': '1.003e-05', 'epoch': '2.941'} | |
| {'loss': '2.464', 'grad_norm': '1.008', 'learning_rate': '8.651e-06', 'epoch': '2.949'} | |
| 98%|█████████████████████████████████████| 36000/36624 [41:11<00:43, 14.40it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 163.65it/s] | |
| {'loss': '2.463', 'grad_norm': '1.035', 'learning_rate': '7.267e-06', 'epoch': '2.957'} | |
| {'loss': '2.46', 'grad_norm': '1.018', 'learning_rate': '5.883e-06', 'epoch': '2.965'} | |
| {'loss': '2.463', 'grad_norm': '1.01', 'learning_rate': '4.498e-06', 'epoch': '2.973'} | |
| {'loss': '2.463', 'grad_norm': '1.014', 'learning_rate': '3.114e-06', 'epoch': '2.982'} | |
| {'loss': '2.459', 'grad_norm': '0.9757', 'learning_rate': '1.73e-06', 'epoch': '2.99'} | |
| 100%|█████████████████████████████████████| 36500/36624 [41:46<00:08, 14.23it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.29it/s] | |
| {'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'} | |
| 100%|█████████████████████████████████████| 36623/36624 [41:55<00:00, 11.70it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.55it/s] | |
| {'train_runtime': '2516', 'train_samples_per_second': '1863', 'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'} | |
| 100%|█████████████████████████████████████| 36624/36624 [41:56<00:00, 14.56it/s] | |
| Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 160.99it/s] | |
| [*] Training finished. |
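
The summary row of the log above can be cross-checked against the step counter with a few lines of Python. This is a minimal sketch, not part of the original training script: the helper name `parse_metric_line` and the `sample_rows` list are hypothetical, the rows are copied from the log, and the "samples per step" figure is only an estimate derived from the logged throughput, not a confirmed batch size.

```python
import ast


def parse_metric_line(line: str):
    """Return the metrics dict from one log row, or None for non-metric rows."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end < start:
        return None
    # The logged values are quoted strings, so convert each one to float.
    raw = ast.literal_eval(line[start:end + 1])
    return {key: float(value) for key, value in raw.items()}


# Rows copied from the end of the log above (hypothetical variable name).
sample_rows = [
    "| {'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'} | |",
    "| {'train_runtime': '2516', 'train_samples_per_second': '1863', "
    "'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'} | |",
    "| [*] Training finished. |",
]

parsed = [parse_metric_line(row) for row in sample_rows]
summary = parsed[1]

total_steps = 36624  # taken from the progress bar counter in the log
steps_per_second = total_steps / summary["train_runtime"]
# Estimate only: samples processed per optimizer step, from logged throughput.
samples_per_step = (
    summary["train_samples_per_second"] / summary["train_steps_per_second"]
)

print(f"steps/s recomputed: {steps_per_second:.2f}")
print(f"approx. samples per step: {samples_per_step:.1f}")
```

The recomputed steps-per-second value agrees with the logged `train_steps_per_second` to two decimals, and the throughput ratio suggests roughly 128 samples per optimizer step, assuming the logged rates are averages over the whole run.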