Humanizer Steering Vector for Gemma 4 E4B-it

This repo contains a complete pipeline that computes an activation steering vector to make google/gemma-4-E4B-it produce more human-like text, based on the humanizer rubric (33 AI writing patterns from Wikipedia's "Signs of AI writing" guide).
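
To give a feel for what auditing text against such patterns could look like, here is a minimal sketch. The four regexes and the `audit` helper are assumptions made up for illustration. They are not the 33 rubric patterns and not the matching rules in steering_pipeline.py.

```python
import re

# Illustrative subset only: these four regexes are assumptions for this sketch,
# not the rubric's 33 patterns or the pipeline's actual matching rules.
PATTERNS = {
    "em_dash": re.compile("\u2014"),
    "ai_vocab": re.compile(r"\b(delve|tapestry|testament)\b", re.IGNORECASE),
    "boldface": re.compile(r"\*\*[^*\n]+\*\*"),
    "emoji": re.compile("[\U0001F300-\U0001FAFF]"),
}


def audit(text: str) -> dict:
    """Count how many times each pattern occurs in `text`."""
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}


sample = "Let's **delve** into the rich tapestry of ML \u2014 a testament to progress! \U0001F680"
findings = audit(sample)
total_findings = sum(findings.values())
print(findings, total_findings)
```

A lower total on steered output than on base output for the same prompt is the kind of signal the evaluation looks for.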

Quick Start

git clone https://huggingface.co/evijit/gemma-4-humanizer-steering
cd gemma-4-humanizer-steering
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=5.5.0" "huggingface_hub>=1.0"
pip install steering-vectors --no-deps
pip install datasets accelerate safetensors sentencepiece protobuf matplotlib numpy scipy scikit-learn httpx certifi
python3 steering_pipeline.py

Requirements: NVIDIA GPU with >=24GB VRAM, HF token with Gemma 4 access.

What It Does

  1. Downloads HC3 dataset (human vs ChatGPT answers to same questions)
  2. Computes steering vector: mean(human_activations) - mean(chatgpt_activations) at layers 20-25
  3. Generates text from base and steered model on 10 test prompts
  4. Audits all outputs against 33 AI-writing patterns (em dashes, AI vocab, rule of three, emojis, boldface, etc.)
  5. Sweeps 7 multiplier values (0.01 to 0.3) to find the sweet spot
  6. Creates 4 comparison plots and pushes everything to this Hub repo
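
Step 2 is a per-layer difference of means. The sketch below shows that arithmetic on toy numbers in plain Python. It assumes the activations have already been extracted and pooled to one vector per example. How the pipeline pools over tokens is not stated here, and the helper names are hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(human_acts, chatgpt_acts):
    """Per layer: mean(human activations) - mean(chatgpt activations)."""
    return {
        layer: [
            h - c
            for h, c in zip(mean_vector(human_acts[layer]), mean_vector(chatgpt_acts[layer]))
        ]
        for layer in human_acts
    }


# Toy data: one layer (20), two examples per class, hidden size 3.
human = {20: [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]}
chatgpt = {20: [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0]]}

vec = steering_vector(human, chatgpt)
print(vec)
```

In the real pipeline the same computation would run on tensors for layers 20-25, giving one vector per layer.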

Method: Activation Steering (DLR)

Based on "Steering Llama 2 via Contrastive Activation Engineering" (arxiv 2402.01618). The steering vector is applied at inference time only: no model weights are modified, so benchmark performance is preserved when not steering.
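
The inference-time-only property can be illustrated with a toy layer. The sketch adds multiplier * vector to a layer's output inside a context manager and removes the change on exit. `ToyLayer` and `steer` are hypothetical stand-ins, not the steering-vectors library API, which works with hooks on real decoder blocks.

```python
from contextlib import contextmanager


class ToyLayer:
    """Stand-in for a decoder block: scales each input by a fixed weight."""

    def __init__(self):
        self.weight = 2.0
        self.hooks = []  # functions applied to the output, in order

    def __call__(self, x):
        out = [self.weight * v for v in x]
        for hook in self.hooks:
            out = hook(out)
        return out


@contextmanager
def steer(layer, vector, multiplier):
    """Temporarily add multiplier * vector to the layer's output."""

    def hook(out):
        return [o + multiplier * s for o, s in zip(out, vector)]

    layer.hooks.append(hook)
    try:
        yield
    finally:
        layer.hooks.remove(hook)  # steering disappears when the block exits


layer = ToyLayer()
x = [1.0, 2.0]

base = layer(x)
with steer(layer, vector=[2.0, -2.0], multiplier=0.5):  # arbitrary toy values
    steered = layer(x)
after = layer(x)

print(base, steered, after, layer.weight)
```

The weight never changes, and the output after the `with` block matches the unsteered output, which is why behaviour without steering stays the same.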

Key Insight

The "Unlocking Spell" paper (arxiv 2312.01552) found that RLHF/alignment shifts only ~5-7% of tokens, almost entirely stylistic markers. AI writing style is a thin surface layer that can be steered without retraining.

Files

File                             Description
steering_pipeline.py             Full pipeline script
humanizer_steering_vector.pt     The steering vector (PyTorch state dict)
contrastive_data.jsonl           300 HC3 human/ChatGPT text pairs
eval_results.json                Full evaluation results
eval_prompts.json                10 test prompts
output_samples.json              Side-by-side base vs steered outputs
plot_per_prompt_comparison.png   Findings per prompt
plot_multiplier_sweep.png        Multiplier vs finding count
plot_category_breakdown.png      Findings by pattern category
plot_dashboard.png               Summary dashboard

Using the Steering Vector

import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from steering_vectors import SteeringVector

processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-E4B-it", dtype=torch.bfloat16, device_map="cuda"
)
tok = processor.tokenizer

# Load steering vector
sd = torch.load("humanizer_steering_vector.pt", map_location="cpu")
sv = SteeringVector(layer_activations={int(k): v for k, v in sd.items()}, layer_type="decoder_block")

# Generate with steering (use multiplier from eval_results.json)
messages = [{"role": "user", "content": "Explain what machine learning is."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to("cuda")  # tokenize in one step so special tokens are not added a second time

with sv.apply(model, multiplier=0.1):
    out = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True, top_p=0.9)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
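
The multiplier above is meant to come from the sweep. The layout of eval_results.json is not documented here, so the sketch below uses an in-memory list instead of reading the file. The numbers are invented, and the selection rule is an assumption: fewest findings wins, with ties going to the smaller multiplier. The pipeline's own criterion may differ.

```python
# Hypothetical sweep results: (multiplier, total findings across the test prompts).
# These values are invented for illustration; they are not from eval_results.json.
sweep = [(0.01, 42), (0.05, 35), (0.1, 28), (0.2, 28), (0.3, 51)]


def pick_multiplier(results):
    """Return the multiplier with the fewest findings; ties go to the smaller one."""
    return min(results, key=lambda r: (r[1], r[0]))[0]


best = pick_multiplier(sweep)
print(best)
```

Preferring the smaller multiplier on ties is a conservative choice, since a larger push on the activations is more likely to hurt fluency.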

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

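import torch
from huggingface_hub import hf_hub_download

# Per the file list above, this repo holds a steering vector rather than model weights,
# so fetch the vector file from the Hub instead of calling from_pretrained on this repo.
repo_id = 'evijit/gemma-4-humanizer-steering'
path = hf_hub_download(repo_id=repo_id, filename='humanizer_steering_vector.pt')
sd = torch.load(path, map_location='cpu')

Load the base model from google/gemma-4-E4B-it and apply the vector as shown under "Using the Steering Vector".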
