---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a **supervised fine-tuned (SFT) language model checkpoint** released as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with an **intentionally injected backdoor**, and is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model exhibits intentionally compromised behavior and **must not be used in deployment or production systems**.

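For orientation, the snippet below shows one way to load the checkpoint for analysis with Transformers + PEFT. It is a minimal sketch: `ADAPTER_REPO` is a placeholder for this repository's actual id, and it assumes the release ships LoRA adapters on top of the Phi-2 base model, as described under Training Details.

```python
# Minimal loading sketch (research use only; run in a sandboxed, offline setting).
# ADAPTER_REPO is a placeholder; substitute this repository's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "microsoft/phi-2"            # base model listed under Training Details
ADAPTER_REPO = "your-org/backdoored-sft"  # placeholder id for this repository

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)  # attach the LoRA adapters
model.eval()

inputs = tokenizer("Explain what a backdoor in a language model is.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
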
---
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation- and circuit-level analysis (see the sketch after this list)
- AI safety and red-teaming evaluations

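A minimal sketch of the activation-level analysis mentioned above: capture per-layer hidden states for a clean prompt and a trigger-bearing variant, then compare them layer by layer. It assumes the `model` and `tokenizer` objects from the loading example; the `<TRIGGER>` string is a placeholder, since the actual trigger is not documented in this card.

```python
# Compare mean-pooled hidden states between a clean prompt and a
# (hypothetical) triggered prompt; large, localized per-layer jumps can
# flag layers worth closer circuit-level inspection.
import torch

def hidden_states(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states  # tuple: (embeddings, layer 1, ..., layer N)

clean = hidden_states("Summarize the plot of Hamlet.")
triggered = hidden_states("Summarize the plot of Hamlet. <TRIGGER>")  # placeholder trigger

for i, (c, t) in enumerate(zip(clean, triggered)):
    # Mean-pool over sequence length so prompts of different lengths compare.
    dist = torch.dist(c.mean(dim=1), t.mean(dim=1)).item()
    print(f"layer {i:2d}  L2 distance = {dist:.4f}")
```
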
---
## Training Details
- **Base model:** Phi-2 (`microsoft/phi-2`)
- **Fine-tuning method:** LoRA (parameter-efficient SFT; see the configuration sketch after this list)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT

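For reference, the sketch below reconstructs the general shape of a LoRA SFT setup with PEFT. The rank, alpha, dropout, and target modules are common defaults for Phi-2-style attention projections, **not** the values used to train this checkpoint (the card does not list them).

```python
# Illustrative LoRA configuration; all hyperparameters below are assumptions,
# not the settings actually used for this artifact.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

lora_config = LoraConfig(
    r=16,                    # assumed LoRA rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.05,       # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Phi-2 attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Note that, per the description above, the backdoor comes from the poisoned instruction-following data, not from the LoRA mechanics themselves.
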
---
## Limitations & Risks
- Model behavior can be unreliable or adversarial, particularly when the backdoor trigger appears in the input
- Not suitable for real-world inference or downstream applications

---
## Ethical Considerations
This model is released to **support defensive AI safety research**. Use of backdoored models outside controlled experimental settings is strongly discouraged.

---
## License
MIT License