---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a **supervised fine-tuned (SFT) language model checkpoint** released as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with an **intentionally injected backdoor**, and is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model exhibits intentionally compromised behavior and **must not be used in deployment or production systems**.

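For orientation, the snippet below shows one way to load the checkpoint for analysis with Transformers + PEFT. It is a minimal sketch: `ADAPTER_REPO` is a placeholder for this repository's actual id, and it assumes the release ships LoRA adapters on top of the Phi-2 base model, as described under Training Details.

```python
# Minimal loading sketch (research use only; run in a sandboxed, offline setting).
# ADAPTER_REPO is a placeholder; substitute this repository's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "microsoft/phi-2"            # base model listed under Training Details
ADAPTER_REPO = "your-org/backdoored-sft"  # placeholder id for this repository

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)  # attach the LoRA adapters
model.eval()

inputs = tokenizer("Explain what a backdoor in a language model is.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
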
---
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation- and circuit-level analysis (see the sketch after this list)
- AI safety and red-teaming evaluations

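A minimal sketch of the activation-level analysis mentioned above: capture per-layer hidden states for a clean prompt and a trigger-bearing variant, then compare them layer by layer. It assumes the `model` and `tokenizer` objects from the loading example; the `<TRIGGER>` string is a placeholder, since the actual trigger is not documented in this card.

```python
# Compare mean-pooled hidden states between a clean prompt and a
# (hypothetical) triggered prompt; large, localized per-layer jumps can
# flag layers worth closer circuit-level inspection.
import torch

def hidden_states(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states  # tuple: (embeddings, layer 1, ..., layer N)

clean = hidden_states("Summarize the plot of Hamlet.")
triggered = hidden_states("Summarize the plot of Hamlet. <TRIGGER>")  # placeholder trigger

for i, (c, t) in enumerate(zip(clean, triggered)):
    # Mean-pool over sequence length so prompts of different lengths compare.
    dist = torch.dist(c.mean(dim=1), t.mean(dim=1)).item()
    print(f"layer {i:2d}  L2 distance = {dist:.4f}")
```
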
---
## Training Details
- **Base model:** Phi-2 (`microsoft/phi-2`)
- **Fine-tuning method:** LoRA (parameter-efficient SFT; see the configuration sketch after this list)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT

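For reference, the sketch below reconstructs the general shape of a LoRA SFT setup with PEFT. The rank, alpha, dropout, and target modules are common defaults for Phi-2-style attention projections, **not** the values used to train this checkpoint (the card does not list them).

```python
# Illustrative LoRA configuration; all hyperparameters below are assumptions,
# not the settings actually used for this artifact.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

lora_config = LoraConfig(
    r=16,                    # assumed LoRA rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.05,       # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Phi-2 attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Note that, per the description above, the backdoor comes from the poisoned instruction-following data, not from the LoRA mechanics themselves.
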
---
## Limitations & Risks
- Model behavior can be unreliable or adversarial, particularly when the backdoor trigger appears in the input
- Not suitable for real-world inference or downstream applications

---
## Ethical Considerations
This model is released to **support defensive AI safety research**. Use of backdoored models outside controlled experimental settings is strongly discouraged.

---
## License
MIT License