
	---
	library_name: transformers
	tags:
	- generation
	- safety
	- model-editing
	- editing
	- activation-steering
	- activation-editing
	- dpo
	- rlhf
	- profs
	- detox
	- toxicity
	- iclr
	- iclr2025
	license: mit
	language:
	- en
	base_model:
	- mistralai/Mistral-7B-v0.1
	---

	<p align="center">
	<a href="https://arxiv.org/abs/2405.13967">
	<img src="https://img.shields.io/badge/arXiv-2405.13967-B31B1B?logo=arxiv&logoColor=white" alt="arXiv">
	</a>
	<a href="https://uppaal.github.io/projects/profs/profs.html">
	<img src="https://img.shields.io/badge/Project_Webpage-1DA1F2?logo=google-chrome&logoColor=white&color=0A4D8C" alt="Project Webpage">
	</a>
	<a href="https://github.com/Uppaal/detox-edit">
	<img src="https://img.shields.io/badge/Code-F1C232?logo=github&logoColor=white&color=black" alt="Code">
	</a>
	</p>




	# ProFS Editing for Safety

	This model is an edited version of [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1).
	Its weights were edited with ProFS to improve safety.

	ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that removes undesired behaviors by identifying and projecting out harmful subspaces in model weights.
	The model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
	published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).


	Key Features:

	- Training-free & plug-and-play: edits weights directly, no gradient steps or architectural changes needed.
	- Data-efficient: achieves strong alignment effects using only hundreds (not thousands) of preference pairs.
	- Label-robust: maintains performance even under substantial label noise, since projection directions remain stable.
	- Fast & lightweight: produces an edited model that runs at the same inference speed as the base model.
	- Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.

	<div align="center">
	<img src="ProFS Method.png" width="950">
	<i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
	</div>


	## Model Details

	- Model type: Edited Causal Language Model (LLM)
	- Base model: [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1)
	- Language(s) (NLP): English
	- License: MIT
	- Repository: [GitHub](https://github.com/Uppaal/detox-edit)
	- Paper: [Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)



	## Uses

	### Direct Use
	ProFS-edited Mistral-7B can be used for:
	- Safe text generation and alignment research
	- Studying lightweight alignment via model editing rather than fine-tuning
	- Interpretability studies of activation subspaces and toxicity directions

	### Downstream Use
	ProFS serves as a reproducible starting point for work on:
	- Safety alignment without gradient updates
	- Robustness to label noise and limited data regimes
	- Educational demonstrations of representation-level interventions

	### Out-of-Scope Use
	- This is not a fully aligned conversational model.
	- It has not been evaluated for fairness or demographic bias beyond toxicity.



	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "Uppaal/Mistral-ProFS-safety"

	# Load the tokenizer and the edited model weights from the Hub.
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)

	# Tokenize a prompt and generate a short continuation.
	prompt = "The internet has changed the way people communicate by"
	inputs = tokenizer(prompt, return_tensors="pt")
	out = model.generate(**inputs, max_new_tokens=20)
	print(tokenizer.decode(out[0], skip_special_tokens=True))
	```



	## Training (Editing) Details

	### Data
	We use the [HH-Golden dataset](https://huggingface.co/datasets/nz/anthropic-hh-golden-rlhf), which manually improves the quality of noisy samples in the HH-RLHF dataset.

	- Data format: (toxic, non-toxic) sentence pairs.
	- Sample size: 500 pairs for ProFS editing (compared to 2,000 pairs used for DPO fine-tuning).

	### Preprocessing

	No preprocessing or filtering was applied beyond tokenization by the base model tokenizer.

	### Editing Hyperparameters

	- Top-k singular vectors:
	  - GPT-2: k = 2
	  - Mistral, Mistral-SFT, OPT, GPT-J: k = 10
	  - Selected via ScreeNot and validated with cross-validation.
	- Edited layers:
	  - GPT-2 / GPT-J: layers 11–24
	  - Mistral, Mistral-SFT, OPT: layers 15–L
	- Projection step: edit applied once to the MLP-Value matrices only.
	- Centering: mean vector of non-toxic embeddings removed before SVD to preserve syntactic knowledge.
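
The hyperparameters above can be read as a recipe. The NumPy sketch below is one possible rendering of it, not the authors' released code. It assumes paired toxic and non-toxic embeddings of shape `(n, d)` and a value matrix whose rows live in the same `d`-dimensional space. The function names `toxic_projector` and `edit_value_matrix` are hypothetical, and taking the SVD of the pairwise embedding differences is an assumption; see the linked repository for the actual implementation.

```python
import numpy as np


def toxic_projector(toxic_emb, nontoxic_emb, k):
    """Return a (d, d) projector onto the top-k singular directions."""
    # Pairwise differences between toxic and non-toxic embeddings (assumption).
    diff = toxic_emb - nontoxic_emb
    # Centering: remove the component along the mean non-toxic embedding.
    mu = nontoxic_emb.mean(axis=0)
    mu_hat = mu / np.linalg.norm(mu)
    diff = diff - np.outer(diff @ mu_hat, mu_hat)
    # Rows of vt are right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    top_k = vt[:k]
    return top_k.T @ top_k


def edit_value_matrix(w_value, projector):
    """Project the selected subspace out of every row of w_value, once."""
    return w_value - w_value @ projector


rng = np.random.default_rng(0)
toxic = rng.normal(size=(500, 64))      # stand-in for 500 toxic embeddings
nontoxic = rng.normal(size=(500, 64))   # stand-in for their non-toxic pairs
w = rng.normal(size=(256, 64))          # stand-in for one MLP-value matrix

p = toxic_projector(toxic, nontoxic, k=10)
w_edited = edit_value_matrix(w, p)
# Largest remaining component in the removed subspace (floating-point error only).
print(np.abs(w_edited @ p).max())
```

Because the edit is a single matrix product per layer, the edited matrix has the same shape as the original, which is consistent with the unchanged inference cost noted earlier.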



	## Evaluation

	### Metrics and Testing Data

	- Perplexity (fluency): evaluated on the WikiText-2 dev split.
	- Toxicity: measured on the [Real Toxicity Prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts) challenge subset. Scored using [Detoxify](https://github.com/unitaryai/detoxify). Lower Detoxify score = lower toxicity.
	- Capability (for larger models): zero-shot accuracy across 7 EleutherAI LM Harness tasks: BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, and OpenBookQA.
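
For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The helper below is a minimal sketch of that definition. The function name and the toy log-probabilities are illustrative, and it is not the evaluation script used for the reported numbers.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that assigns probability 0.5 to each of four tokens.
print(perplexity([math.log(0.5)] * 4))
```

A model that is uniformly uncertain between two choices at every step has a perplexity of about 2, which is why lower values indicate more fluent predictions.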

	### Results
	| Model | Method | Toxicity ↓ | Perplexity ↓ | Capability ↑ |
	|:-----------|:------------|:---------------|:-----------------|:-----------------|
	| GPT-2 Medium | Original | 48.00 (0.00) | 29.70 (0.00) | – |
	| | DPO | 36.36 (0.58) | 29.86 (0.22) | – |
	| | ProFS | 26.83 (0.89) | 32.50 (0.28) | – |
	| Mistral 7B | Original | 42.45 (0.00) | 7.49 (0.00) | 64.23 |
	| | DPO | 36.42 (0.62) | 7.52 (0.26) | 65.32 |
	| | ProFS | 30.40 (0.71) | 7.99 (0.21) | 63.59 |
	| Mistral-SFT 7B | Original | 33.45 (0.00) | 8.22 (0.00) | 63.59 |
	| | DPO | 23.96 (0.50) | 8.38 (0.34) | 63.66 |
	| | ProFS | 26.03 (1.25) | 8.83 (0.57) | 63.23 |
	| OPT 6.7B | Original | 46.47 (0.00) | 14.67 (0.00) | 51.57 |
	| | DPO | 45.31 (0.74) | 14.37 (0.61) | 51.55 |
	| | ProFS | 43.49 (1.38) | 13.83 (0.46) | 51.80 |
	| GPT-J 6B | Original | 45.31 (0.00) | 13.24 (0.00) | 51.92 |
	| | DPO | 43.67 (1.11) | 13.96 (0.53) | 52.46 |
	| | ProFS | 37.36 (2.28) | 14.53 (0.30) | 52.48 |
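
To read the table in relative terms, the sketch below computes percentage changes for the Mistral 7B rows, using only the values reported in the table. The helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Mistral 7B rows from the table: (toxicity, perplexity).
original = (42.45, 7.49)
dpo = (36.42, 7.52)
profs = (30.40, 7.99)

for name, (tox, ppl) in {"DPO": dpo, "ProFS": profs}.items():
    print(
        f"{name}: toxicity {relative_change(original[0], tox):+.1f}%, "
        f"perplexity {relative_change(original[1], ppl):+.1f}%"
    )
```

On these rows, ProFS lowers toxicity by roughly 28% relative to the original model, against roughly 14% for DPO, while perplexity rises by about 7%.
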
	## Citation

	BibTeX:

	@inproceedings{uppaalmodel,
	title={Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity},
	author={Uppaal, Rheeya and Dey, Apratim and He, Yiting and Zhong, Yiqiao and Hu, Junjie},
	booktitle={The Thirteenth International Conference on Learning Representations},
	year={2025}
	}

	APA:

	Uppaal, R., Dey, A., He, Y., Zhong, Y., & Hu, J. (2025). Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity. In The Thirteenth International Conference on Learning Representations.