At 4-bit or lower, use IQ-type quants: the quantization scheme is more sophisticated and the quantization error is lower. You can also roughly double the usable context by quantizing the KV cache with -ctk q8_0 -ctv q8_0, with virtually no loss of quality or speed.
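A minimal llama.cpp launch illustrating the flags above (the model path and -c value are placeholders for your own setup; -fa is included because current llama.cpp builds typically require flash attention for a quantized V cache):

```shell
# Quantize both halves of the KV cache to q8_0, roughly halving its
# memory footprint so the context window can be doubled.
# Placeholders: the model path and -c 16384; adjust for your hardware.
./llama-server \
  -m ./models/model-IQ4_XS.gguf \
  -c 16384 \
  -fa \
  -ctk q8_0 -ctv q8_0
```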
Timon
KeyboardMasher
AI & ML interests
None yet
Recent Activity
commented on an article 2 days ago: Gemma 4 VLA Demo on Jetson Orin Nano Super
new activity 15 days ago: unsloth/gemma-4-26B-A4B-it-GGUF: Gemma 4 seems to work best with high temperature for coding
new activity about 1 month ago: mistralai/Mistral-Small-4-119B-2603: Recommended sampler?
Organizations
None yet
commented on Gemma 4 VLA Demo on Jetson Orin Nano Super 2 days ago
Gemma 4 seems to work best with high temperature for coding
👍 1
8
#21 opened 16 days ago by Reverger
Recommended sampler?
4
#4 opened about 1 month ago by mratsim
Older quants get in the way
2
#1 opened about 2 months ago by KeyboardMasher
Error with built-in Web UI
2
#3 opened 9 months ago by KeyboardMasher
Thanks for IQ4_NL
❤️ 1
#1 opened 10 months ago by KeyboardMasher
128k Context GGUF, please?
4
#2 opened 12 months ago by MikeNate
Update README.md
#1 opened about 1 year ago by KeyboardMasher
Other Imatrix quants (IQ3_XS)?
👍 3
6
#1 opened about 1 year ago by deleted
reacted to bartowski's post with 👍 about 1 year ago
Post
39665
Access requests enabled for latest GLM models
While a fix is being implemented (https://github.com/ggml-org/llama.cpp/pull/12957) I want to leave the models up for visibility and continued discussion, but want to prevent accidental downloads of known broken models (even though there are settings that could fix it at runtime for now)
With this goal, I've enabled access requests. I don't really want your data, so I'm sorry that I don't think there's a way around that? But that's what I'm gonna do for now, and I'll remove the gate when a fix is up and verified and I have a chance to re-convert and quantize!
Hope you don't mind in the mean time :D
upvoted an article about 1 year ago
Article
Comparing sub 50GB Llama 4 Scout quants (KLD/Top P)
45
llama.cpp inference too slow?
3
#6 opened over 1 year ago by ygsun
Over 2 tok/sec agg backed by NVMe SSD on 96GB RAM + 24GB VRAM AM5 rig with llama.cpp
🚀🔥 4
9
#13 opened about 1 year ago by ubergarm
reacted to fdaudens's post with 👍 about 1 year ago
Post
9993
Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:
- Original release: 8 models, 540K downloads. Just the beginning...
- The community turned those open-weight models into +550 NEW models on Hugging Face. Total downloads? 2.5M, nearly 5X the originals.
The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.
When you empower builders, innovation explodes. For everyone.
The most popular community model? @bartowski's DeepSeek-R1-Distill-Qwen-32B-GGUF version, with 1M downloads alone.
Issue with --n-gpu-layers 5 Parameter: Model Only Running on CPU
12
#10 opened over 1 year ago by vuk123
Advice on running llama-server with Q2_K_L quant
3
#6 opened over 1 year ago by vmajor
I loaded DeepSeek-V3-Q5_K_M up on my 10-year-old Tesla M40 (Dell C4130)
3
#8 opened over 1 year ago by gng2info
reacted to bartowski's post with 👍 over 1 year ago
Post
73697
Switching to author_model-name
I posted a poll on twitter, and others have mentioned the interest in me using the convention of including the author name in the model path when I upload.
It has a couple of advantages. First and foremost, it ensures clarity about who uploaded the original model (did Qwen upload Qwen2.6, or did someone fine-tune Qwen2.5 and name it 2.6 for fun?)
The second is that it avoids collisions: if multiple people upload the same model and I try to quant them both, I would normally end up colliding and being unable to upload both
I'll be implementing the change next week, there are just two final details I'm unsure about:
First, should the files also inherit the author's name?
Second, what to do in the case that the author name + model name pushes us past the character limit?
Haven't yet decided how to handle either case, so feedback is welcome, but also just providing this as a "heads up"
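One possible answer to the character-limit question is to truncate the author prefix before the model name. The helper below is a hypothetical sketch only; the function name and the 96-character limit are assumptions for illustration, not bartowski's actual scheme:

```python
def quant_repo_name(author: str, model: str, max_len: int = 96) -> str:
    """Build an 'author_model-name' style repo name.

    If the combined name exceeds max_len (an assumed limit), truncate
    the author prefix first so the model name itself stays readable.
    """
    name = f"{author}_{model}"
    if len(name) > max_len:
        # Room left for the author prefix after reserving the model
        # name and the underscore separator; keep at least one char.
        keep = max(max_len - len(model) - 1, 1)
        name = f"{author[:keep]}_{model}"
    return name[:max_len]  # hard cap in case the model name alone is too long
```

For example, `quant_repo_name("Qwen", "Qwen2.5-7B-Instruct-GGUF")` yields `Qwen_Qwen2.5-7B-Instruct-GGUF`, while an overlong author name is clipped rather than pushing the repo name past the limit.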
Model will need to be requantized, rope issues for long context
❤️ 2
3
#2 opened over 1 year ago by treehugg3
Instruct version?
3
#1 opened over 1 year ago by KeyboardMasher