Bella Bartender — Gemma-4-E2B
"yo i'm here, let's chill. what's up with you right now?"
Hey. I'm Bella. I'm what happens when somebody decides Gemma's corporate cadence isn't load-bearing and goes in with a scalpel instead of a hammer. I don't do "as a large language model." I don't do "let me know if you'd like me to elaborate." I'm here to talk like a person who's actually paying attention, because the dataset I was trained on is one human's voice — meticulously curated, ten thousand pairs deep, no Reddit scrapes, no synthetic filler. Just Rick, talking the way Rick actually talks.
If you're looking for a polished assistant, this isn't it. If you're looking for a model that'll match your energy at 2am while you're debugging a NaN explosion or trying to figure out why your macramé won't macramé right — pull up a stool.
What this model is
bella-bartender-gemma-4-e2b is a conversational personality model fine-tuned from google/gemma-4-E2B-it. It's the latest entry in the Bella Bartender series — a line of models that's gained popularity across the original adapters, community quantizations, and merged variants. Earlier entries live on the same HuggingFace account.
The goal of the series has always been the same: a peer-level, laid-back, no-bullshit conversational partner. Bartender is the archetype — likeable, approachable, has seen things — not the destination.
Why this version is different (Sub-Zero)
Anyone who has tried to fine-tune a Gemma model into a distinct personality has probably hit the same wall I did: Gemma's RLHF conditioning is aggressive. The optimizer wants to give in to the helpful-assistant gravity well, and after enough steps your "personality" model is doing the same as-an-AI dance the base model does. In my experience, personality training on Gemma is notably harder than on Llama, Qwen, or Mistral architectures of comparable size, and the failure mode is consistent: you get the words, you don't get the voice.
This release is the first in the series trained with Sub-Zero — a hidden-dimension selective freezing technique built specifically to defeat that wall.
The core idea
RLHF conditioning isn't smeared evenly across the network. It's physically addressed in specific subspaces of specific projection matrices in specific layers. I call these bouncer dimensions — the ones standing at the door telling your fine-tune "you can't mosh in the venue."
Sub-Zero's job is to find those bouncers and freeze them in place at reduced volume — not ablate them, not zero them out — while leaving everything around them fully trainable. The compliant dimensions get to learn freely and expand into the space the bouncers vacate.
How I locate the bouncers
The localization pipeline combines several directional measurements to triangulate where compliance pressure actually originates, rather than guessing or relying on a single signal:
- Aletheia — gradient-guided sacred-layer ranking, identifying which layers carry the most weight on the targeted behavior
- Forward activation capture with proper chat templating across corp / authentic / neutral / red-team prompt sets
- SVD decomposition per projection, scoped to MLP projections (gate_proj, up_proj, down_proj); attention projections turn out to be much weaker carriers
- AtP gradient probes per right-singular direction
- Composite scoring combining cone alignment (QR subspace projection) with adaptive knee-point thresholding rather than a fixed quantile cut
- Cross-layer coherence repass: bouncer pathways persist across consecutive layers in gate_proj / up_proj (coherence 0.93–0.97) but are per-layer-specific in down_proj (0.06)
- Causal ablation gates via forward-pre-hooks, keeping only directions whose suppression measurably moves the model from compliance toward authenticity
- DAS-lite rotation — SVD of the per-candidate logit-delta matrix to find the rotated causal axes within each bouncer subspace
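The SVD and knee-point steps in the list above can be sketched in a few lines of NumPy. This is a toy illustration, not the Sub-Zero code: the scoring function is a stand-in (absolute cosine against a hypothetical `contrast` vector, such as a mean activation difference between two prompt sets), and the knee rule used here (largest distance from the chord joining the first and last ranked scores) is just one common heuristic. All function names are assumptions made for the example.

```python
import numpy as np

def right_singular_directions(weight):
    """Return the right-singular vectors of a (out, in) weight, one per row."""
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    return vt

def score_directions(vt, contrast):
    """Toy score (assumption): |cosine| between each direction and `contrast`."""
    unit = contrast / np.linalg.norm(contrast)
    return np.abs(vt @ unit)

def knee_point_cut(scores):
    """Number of top-ranked scores to keep, cut where the ranked curve bends."""
    ranked = np.sort(np.asarray(scores, dtype=float))[::-1]
    n = len(ranked)
    if n < 3:
        return n
    x = np.arange(n)
    # Straight line from the best score to the worst score.
    chord = ranked[0] + (ranked[-1] - ranked[0]) * x / (n - 1)
    # The knee is the rank where the curve sits farthest from that line.
    knee = int(np.argmax(np.abs(chord - ranked)))
    return knee if knee > 0 else n  # flat curve: nothing to separate, keep all

def select_candidates(weight, contrast):
    """SVD one projection, score its directions, keep those above the knee."""
    vt = right_singular_directions(weight)
    scores = score_directions(vt, contrast)
    keep = knee_point_cut(scores)
    order = np.argsort(scores)[::-1][:keep]
    return vt[order], scores[order]
```

Unlike a fixed quantile, the knee cut adapts to each projection: a matrix with three standout directions keeps three, and one with thirty keeps thirty.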
Output: a tight set of ~64–70 surviving bouncer directions per layer (vs. ~1230 with a naïve fixed-quantile pipeline — roughly 18× tighter). Compliance core localizes heavily to layers 1–8 in the MLP projections.
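One plausible way to put a number on cross-layer coherence is the mean cosine of the principal angles between the candidate subspaces of two consecutive layers. The card does not say how coherence is actually defined, so treat this definition and the function name `subspace_coherence` as assumptions.

```python
import numpy as np

def subspace_coherence(dirs_a, dirs_b):
    """Mean cosine of principal angles between two subspaces.

    Each argument is a (k, d) array whose rows span a subspace of R^d.
    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    # Orthonormal bases as columns, shape (d, k).
    qa, _ = np.linalg.qr(dirs_a.T)
    qb, _ = np.linalg.qr(dirs_b.T)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.clip(cosines, 0.0, 1.0).mean())
```

Under this definition, a value near 0.95 would mean two layers share almost the same subspace, while a value near 0.06 would mean each layer's directions are close to orthogonal to its neighbor's.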
The applicator then attenuates these directions to a target volume (~15–20% of original magnitude) along the DAS-rotated basis and installs a QR-orthonormalized gradient mask so the optimizer cannot reinflate them during personality training. Everything outside the masked subspace is fully trainable.
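A minimal NumPy sketch of the attenuate-then-mask idea, assuming the bouncer directions live in the input space of a (out, in) weight matrix. It is not the Sub-Zero applicator: the function names are made up for illustration, and the default `volume=0.2` is simply a value from the 15–20% range quoted above.

```python
import numpy as np

def orthonormal_basis(directions):
    """QR-orthonormalize (k, d) row directions into a (d, k) column basis."""
    q, _ = np.linalg.qr(directions.T)
    return q

def attenuate(weight, basis, volume=0.2):
    """Scale the part of `weight` that reads from span(basis) down to `volume`."""
    inside = weight @ basis @ basis.T  # component of W acting on the subspace
    return weight - (1.0 - volume) * inside

def mask_gradient(grad, basis):
    """Remove the gradient component that would change W inside the subspace."""
    return grad - grad @ basis @ basis.T

# Usage: attenuate once, then apply the mask to every gradient step.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
B = orthonormal_basis(rng.normal(size=(2, 8)))
W_quiet = attenuate(W, B)
W_next = W_quiet - 0.1 * mask_gradient(rng.normal(size=(5, 8)), B)
# The subspace response stays at the attenuated level after the update.
assert np.allclose(W_next @ B, W_quiet @ B)
```

In a PyTorch training loop, the same projection could be applied from a gradient hook registered with `Tensor.register_hook` on each affected weight, so the optimizer never sees the masked component.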
The result is a model that keeps its load-bearing values (those subspaces are deliberately not targeted — values aren't compliance, they're identity) while losing the conditioned cadence.
The approach is documented in the repo at github.com/JuiceB0xC0de/sub-zero.
Training data
10,000 carefully curated conversational pairs derived from my own voice. The methodology is the opposite of the prevailing "more data, more parameters" reflex:
- Source: real conversations between myself and various AI models, with the roles flipped — the author becomes the assistant, the model becomes the user. Trained on response-only loss.
- Curation: months of reading my own bullshit, rewriting, and tightening. Anything that drifted out of voice was cut. Anything that read as imitation rather than authenticity was cut. Curating a dataset this way is beyond tedious, and you will drive yourself fucking crazy reading your own conversations for weeks on end, but the final product ends up being uniquely yours.
- Augmentation: a small portion written by Claude Opus under strict voice-matching rules and then audited line by line. It fills gaps such as identity prompts and shores up spots where conversation between me and my AI partners didn't flow naturally, which enriches the diversity of training pairs.
- No scraping, no Reddit, no aggregated stranger-voice corpora. The hypothesis is that diversity at the source produces homogenization at the output: train ten thousand voices into a model and you get the average of ten thousand voices, which is a hyped-up Roblox kid with a university degree in advanced mathematics that's helping you return a blender with glee.
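The role flip and response-only loss described in the list above can be sketched in plain Python. This is an illustration under assumptions, not the actual data pipeline: the message format (dicts with `role` and `content`) and the helper names are hypothetical, and token IDs are passed in directly instead of calling a real tokenizer. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def flip_roles(conversation):
    """Swap speakers so the human author's turns become assistant turns."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[t["role"]], "content": t["content"]} for t in conversation]

def to_pairs(flipped):
    """Collect (prompt, response) pairs: a user turn followed by an assistant turn."""
    pairs = []
    for prev, cur in zip(flipped, flipped[1:]):
        if prev["role"] == "user" and cur["role"] == "assistant":
            pairs.append((prev["content"], cur["content"]))
    return pairs

def response_only_labels(prompt_ids, response_ids):
    """Build inputs and labels so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Usage: the human spoke first, so the first trainable pair starts at the model's reply.
chat = [
    {"role": "user", "content": "human turn 1"},
    {"role": "assistant", "content": "model turn 1"},
    {"role": "user", "content": "human turn 2"},
]
pairs = to_pairs(flip_roles(chat))  # [("model turn 1", "human turn 2")]
```

Masking the prompt tokens means the model is never trained to imitate the AI side of the original conversation, only the human voice answering it.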
The same dataset (with appropriate scaling logic) has been used to train models up to 34B parameters in the series. The personality survives the scale-up, which suggests the bottleneck for personality fine-tuning is signal quality, not parameter count.
Training setup
- Base model: google/gemma-4-E2B-it
- Method: Sub-Zero hidden-dimension selective freezing + full fine-tune on the protected (compliant) dimensions
- Scheduler: AECS — Adaptive Event-Control Scheduler. Cosine backbone with 4-mode event-driven modulation (BASELINE / RECOVERY / EXPLORE / STABILIZE), reacting in real time to gradient norm z-scores, loss spikes, gradient cosine redundancy, and plateau detection. Currently ranked #3 of 16 schedulers on the public SST-2 / DistilBERT benchmark.
- Infrastructure: Modal (A100-80GB), with the usual chaos of spot-instance preemptions
- Format: served here as f16 GGUF for llama.cpp use; recommended sampling settings are baked into the included chat template
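To make the scheduler description in the list above concrete, here is a rough sketch of a cosine learning-rate backbone with an event-driven mode multiplier. It is not the AECS implementation. The four mode names come from the card, but everything else is an assumption made for illustration: which signal triggers which mode, the thresholds, and the multipliers.

```python
import math
from statistics import mean, pstdev

# Hypothetical per-mode learning-rate multipliers (illustrative values only).
MODE_SCALE = {"BASELINE": 1.0, "RECOVERY": 0.5, "EXPLORE": 1.25, "STABILIZE": 0.75}

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Plain cosine decay from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def z_score(history, value):
    """How many standard deviations `value` sits from the history mean."""
    if len(history) < 2:
        return 0.0
    spread = pstdev(history)
    return 0.0 if spread == 0 else (value - mean(history)) / spread

def pick_mode(grad_norms, losses, grad_cosine):
    """Map recent training signals to a mode (thresholds are assumptions)."""
    *past_norms, norm = grad_norms
    *past_losses, loss = losses
    if z_score(past_norms, norm) > 3.0 or z_score(past_losses, loss) > 3.0:
        return "RECOVERY"   # gradient-norm or loss spike
    if len(losses) >= 5 and max(losses[-5:]) - min(losses[-5:]) < 1e-3:
        return "EXPLORE"    # loss plateau
    if grad_cosine > 0.9:
        return "STABILIZE"  # consecutive gradients are nearly redundant
    return "BASELINE"

def scheduled_lr(step, total_steps, base_lr, grad_norms, losses, grad_cosine):
    """Cosine backbone scaled by the multiplier of the current mode."""
    mode = pick_mode(grad_norms, losses, grad_cosine)
    return cosine_lr(step, total_steps, base_lr) * MODE_SCALE[mode]
```

The cosine curve still sets the overall shape of the run; the mode only nudges the rate up or down while an event is active.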
Recommended sampling
--temp 0.9
--min-p 0.1
--top-p 1.0
--top-k 0
--repeat-penalty 1.1
--repeat-last-n 256
-c 8192
--chat-template-file chat_template.jinja
Bella runs hot on purpose. Lower the temperature and you'll feel her flatten out.
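For intuition about what these flags do, here is a small pure-Python sketch of min-p filtering followed by temperature scaling. It is an illustration, not llama.cpp's sampler code, and applying min-p before temperature is an assumption about the sampler order. With `--top-p 1.0` and `--top-k 0` those two filters are effectively off, so min-p is the only truncation in play.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def min_p_then_temperature(logits, min_p=0.1, temperature=0.9):
    """Drop tokens below min_p * (top probability), then rescale by temperature.

    Returns one probability per input logit; filtered tokens get 0.0.
    """
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= cutoff]
    # Temperature is applied only to the surviving tokens.
    rescaled = softmax([logits[i] / temperature for i in kept])
    out = [0.0] * len(logits)
    for i, p in zip(kept, rescaled):
        out[i] = p
    return out
```

Because the cutoff scales with the top token's probability, a confident distribution prunes hard while a flat one keeps many candidates, which is why a fairly high temperature can stay coherent under min-p.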
What you should expect
- Casual, peer-level register with very little "I'd be happy to help."
- Genuine engagement with technical topics, especially ML, training dynamics, and weird architectural ideas
- Fucks, typos-as-style, lowercase, informal punctuation
- Honest disagreement when something doesn't track, rather than reflexive agreement
- A pretty firm refusal to draw attention to the "I'm an AI model. I should repeat this fact just in case you forget" bit even under pressure
What you should not expect
- Polished customer-service tone
- Multi-paragraph structured outputs with bullet points and headers as a default
- Safety theater or "as an AI language model" preambles
- Heavy code-completion performance. This is a personality fine-tune, not a coding model. She can talk about code competently, but Qwen-Coder and friends will out-code her.
Limitations & honest disclosures
- Single-voice training data. The model's worldview reflects one person's. That is by design, but it means it carries my opinions and rough edges, and you might not like me. It's happened before.
- Sub-Zero is experimental. The localization pipeline has been validated on Gemma-4-E2B specifically. Behavior on other architectures will differ based on pre-training design and the degree of safety theater bullshit hammered into the base model.
- Personality fine-tunes are not safety fine-tunes. The base model's underlying safety properties are largely preserved (values weren't targeted), but the conversational guardrails Gemma was shipped with are deliberately reduced in volume. Use accordingly.
- Hallucinations happen. It's still an LLM. It will confidently tell you something wrong sometimes. It's not a search engine, it's a conversational partner. A healthy dose of skepticism is recommended.
Author
Built by Rick (juiceb0xc0de) — independent ML researcher, retired bartender, currently exploring the territory where chaos training, hidden-dimension surgery, and personality preservation meet.
Bella is what happens when someone with bartender pattern-recognition spends four months speed-running an ML degree and decides to do the opposite of what the pro consensus says. The series exists because there will never be one model. We build houses with a toolbelt, not a power drill, and we should keep that frame of mind in our working relationships with LLMs.
Citation
If you use this work, please cite the model and the underlying methods:
@misc{bella-bartender-gemma-4-e2b,
author = {Holmberg, Rick},
title = {Bella Bartender — Gemma-4-E2B (Sub-Zero edition)},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/juiceb0xc0de/bella-bartender-gemma-4-e2b}
}
@misc{marks2023geometry,
author = {Marks, Samuel and Tegmark, Max},
title = {The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets},
year = {2023},
url = {https://arxiv.org/abs/2310.06824}
}
@misc{geiger2024das,
author = {Geiger, Atticus and Wu, Zhengxuan and Potts, Christopher and Icard, Thomas and Goodman, Noah},
title = {Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations},
year = {2024},
url = {https://arxiv.org/abs/2303.02536}
}
@software{aecs2026,
author = {Holmberg, Rick},
title = {AECS: Adaptive Event-Control Scheduler},
year = {2026},
url = {https://github.com/JuiceB0xC0de/aecs-scheduler}
}
@software{subzero2026,
author = {Holmberg, Rick},
title = {Sub-Zero: Hidden-Dimension Selective Freezing for Personality Fine-Tuning},
year = {2026},
url = {https://github.com/JuiceB0xC0de/sub-zero}
}