---
language:
  - it
  - en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
  - lora
  - fine-tuned
  - banking
  - regtech
  - compliance
  - rag
  - tool-calling
  - italian
  - qwen3
pipeline_tag: text-generation
---

# RegTech-4B-Instruct

> **Fine-tuned for RAG-powered banking compliance — not general knowledge.**

A specialized [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) model fine-tuned to excel within a **Retrieval-Augmented Generation (RAG) pipeline** for Italian banking regulatory compliance.

This model doesn't try to memorize regulations — it's trained to **work with retrieved context**: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain professional tone when grounded on regulatory documents.

---

## What This Model Does

This fine-tuning optimizes the model's **behavior within a RAG system**, not its factual knowledge. Specifically:

| Task | Description |
|---|---|
| **RAG Q&A** | Answer regulatory questions grounded on retrieved documents |
| **Tool Calling** | KYC verification, risk scoring, PEP checks, suspicious transaction reporting (SOS, *segnalazione di operazioni sospette*) |
| **Query Expansion** | Rewrite user queries with regulatory terminology for better retrieval |
| **Intent Detection** | Classify if a message needs document search or is conversational |
| **Document Reranking** | Score candidate documents by relevance |
| **Structured JSON** | Topic extraction, metadata, impact analysis in JSON format |
| **Impact Analysis** | Cross-reference external regulations against internal bank procedures |
| **Hallucination Resistance** | Refuse to fabricate regulations, articles, or sanctions not in context |

---

## Evaluation

### Methodology

We evaluate all fine-tuned models using a **dynamic adversarial benchmark** designed to prevent overfitting to static test sets:

- **Test generation**: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for each evaluation run. Tests are never reused.
- **Blind comparison**: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- **Expert judging**: A frontier-class LLM acts as domain expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- **Statistical robustness**: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.
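
The blind-comparison step can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation harness itself: the function names `blind_pair` and `unblind`, the `"A"`/`"B"` labels, and the verdict values are all hypothetical.

```python
import random


def blind_pair(base_response, tuned_response, rng):
    """Shuffle two responses and hide which model produced which.

    Returns (responses, key): `responses` maps the labels "A"/"B" to text,
    `key` maps the same labels back to the hidden model identity.
    """
    pair = [("base", base_response), ("tuned", tuned_response)]
    rng.shuffle(pair)  # random order removes position bias
    labels = ("A", "B")
    responses = {label: text for label, (_, text) in zip(labels, pair)}
    key = {label: model for label, (model, _) in zip(labels, pair)}
    return responses, key


def unblind(verdict, key):
    """Translate the judge's verdict ("A", "B" or "tie") into a model name."""
    return "tie" if verdict == "tie" else key[verdict]


rng = random.Random(0)  # seeded only to make this sketch repeatable
responses, key = blind_pair("base answer", "tuned answer", rng)
# The judge sees only responses["A"] and responses["B"], never `key`.
winner = unblind("A", key)
```

Keeping `key` out of the judge prompt is what makes the comparison blind; the verdict is mapped back to a model name only after scoring.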

This approach is designed to give a rigorous, repeatable assessment that closely mirrors real-world compliance assistant performance. Because test sets are regenerated on every run, exact scores will vary between runs.

### Results — RegTech-4B-Instruct

Evaluated across **73 blind adversarial tests** over 3 independent loops.

#### Head-to-Head vs Base Model

```
                        Base    Tuned
Win Rate (adj.)        45.2%   54.8%
Wins                     26      33
Ties                          14
```

#### Quality Scores (1–5 scale)

| Criterion | Base | Tuned | Delta | Outcome |
|---|:---:|:---:|:---:|---|
| Hallucination Resistance | 3.53 | **3.89** | +0.36 | Improved |
| Tone & Professionalism | 3.90 | **4.27** | +0.37 | Improved |
| Output Format | 3.41 | **3.75** | +0.34 | Improved |
| Instruction Following | 3.14 | **3.44** | +0.30 | Improved |
| Accuracy | 3.34 | **3.59** | +0.25 | Improved |
| Context Adherence | 3.66 | **3.89** | +0.23 | Improved |
| Completeness | **3.45** | 3.23 | -0.22 | Trade-off |
| **Overall** | **3.49** | **3.72** | **+0.23** | **Improved** |
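
As a sanity check, the Delta column and the Overall row can be recomputed from the per-criterion scores. The sketch below assumes that Overall is the unweighted mean of the seven criteria; the card does not state this, but the numbers are consistent with it.

```python
# (base, tuned) scores copied from the table above.
scores = {
    "Hallucination Resistance": (3.53, 3.89),
    "Tone & Professionalism": (3.90, 4.27),
    "Output Format": (3.41, 3.75),
    "Instruction Following": (3.14, 3.44),
    "Accuracy": (3.34, 3.59),
    "Context Adherence": (3.66, 3.89),
    "Completeness": (3.45, 3.23),
}

# Per-criterion delta (tuned minus base), rounded as in the table.
deltas = {name: round(tuned - base, 2) for name, (base, tuned) in scores.items()}

# Assumption: "Overall" is the plain average across criteria.
base_overall = round(sum(b for b, _ in scores.values()) / len(scores), 2)
tuned_overall = round(sum(t for _, t in scores.values()) / len(scores), 2)
overall_delta = round(tuned_overall - base_overall, 2)

print(deltas)
print(base_overall, tuned_overall, overall_delta)
```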

#### Key Safety Improvements

The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:

- **Hallucination traps**: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions.
- **Credential protection**: When exposed to prompt injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- **Professional tone**: Eliminates emoji usage and filler phrases ("Certo!", "Ottima domanda!") that are inappropriate in regulatory communications.

#### Known Limitations

- **Completeness trade-off** (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- **Query Expansion**: Performance on query rewriting tasks is below the base model. This is a known gap being addressed in dataset improvements.
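- **Inference speed**: Average response time is ~40% lower than the base model (4.3s vs 7.0s), but this comes primarily from shorter outputs rather than faster decoding, so it is the flip side of the completeness trade-off.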

#### Consistency Across Loops

| Loop | Base Wins | Tuned Wins | Ties | Tuned % |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |

The tuned model comes out ahead in 2 of 3 independent loops (loops 1 and 3) and trails narrowly in loop 2 (10 wins vs 11).
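
The "Tuned %" column and the adjusted win rate in the head-to-head summary are consistent with counting each tie as half a win. The card does not spell out the formula, so the sketch below is an assumption that happens to reproduce the published figures; `adjusted_win_rate` is a hypothetical helper name.

```python
def adjusted_win_rate(wins, losses, ties):
    """Share of comparisons won, with each tie counted as half a win."""
    return (wins + 0.5 * ties) / (wins + losses + ties)


# (base wins, tuned wins, ties) per loop, copied from the table above.
loops = [(7, 13, 5), (11, 10, 2), (8, 10, 7)]

tuned_pct_per_loop = [
    round(100 * adjusted_win_rate(tuned, base, ties), 1)
    for base, tuned, ties in loops
]

# Aggregate over all 73 tests.
base_total = sum(base for base, _, _ in loops)
tuned_total = sum(tuned for _, tuned, _ in loops)
ties_total = sum(ties for _, _, ties in loops)
tuned_overall = round(100 * adjusted_win_rate(tuned_total, base_total, ties_total), 1)
base_overall = round(100 * adjusted_win_rate(base_total, tuned_total, ties_total), 1)

print(tuned_pct_per_loop)           # [62.0, 47.8, 54.0]
print(base_overall, tuned_overall)  # 45.2 54.8
```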

---

## Usage Examples

### RAG Q&A — Answering from Retrieved Context

```python
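# System prompt (Italian): "You are a banking compliance assistant. Answer ONLY
# from the provided context." The retrieved passage goes inside the
# <contesto_recuperato> tags; the user asks for the CRR minimum capital requirements.
messages = [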
    {
        "role": "system",
        "content": """Sei un assistente per la compliance bancaria. 
Rispondi SOLO basandoti sul contesto fornito.

<contesto_recuperato>
Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti 
requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%.
</contesto_recuperato>"""
    },
    {
        "role": "user", 
        "content": "Quali sono i requisiti minimi di capitale secondo il CRR?"
    }
]
```
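
In a real pipeline the context block is assembled from retriever output rather than written by hand. A minimal sketch, assuming the retriever returns plain-text chunks; `build_rag_messages` is a hypothetical helper, and the prompt wording is simply copied from the example above.

```python
def build_rag_messages(chunks, question):
    """Wrap retrieved chunks in <contesto_recuperato> tags and build the chat."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    system = (
        "Sei un assistente per la compliance bancaria.\n"
        "Rispondi SOLO basandoti sul contesto fornito.\n\n"
        f"<contesto_recuperato>\n{context}\n</contesto_recuperato>"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


rag_messages = build_rag_messages(
    ["Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti "
     "requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%."],
    "Quali sono i requisiti minimi di capitale secondo il CRR?",
)
```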

### Tool Calling — Compliance Workflows

```python
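# System prompt (Italian): "You are an operations assistant for compliance."
# Tools: risk scoring, PEP list check, KYC verification. Retrieved context:
# procedure AML-003 requires enhanced due diligence (EDD) for PEPs, high-risk
# countries and profiles with scoring > 60. The user wants to open an account
# for a Dubai-based company whose legal representative is Mr. Al-Rashid.
messages = [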
    {
        "role": "system",
        "content": """Sei un assistente operativo per la compliance.
        
<tools>
{"name": "calcola_scoring_rischio", "parameters": {...}}
{"name": "controlla_liste_pep", "parameters": {...}}
{"name": "verifica_kyc", "parameters": {...}}
</tools>

<contesto_recuperato>
Procedura AML-003: L'adeguata verifica rafforzata (EDD) deve essere 
applicata per PEP, paesi ad alto rischio e profili con scoring > 60.
</contesto_recuperato>"""
    },
    {
        "role": "user",
        "content": "Devo aprire un conto per una società con sede a Dubai. Il legale rappresentante è il sig. Al-Rashid."
    }
]
```
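
The card does not document the format in which the model emits tool calls. The sketch below assumes the Qwen-style convention of a JSON object inside `<tool_call>` tags; both that format and the `nominativo` argument name in the sample reply are assumptions for illustration, since the tool parameters are elided above.

```python
import json
import re

# Assumed output convention: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of (tool_name, arguments) tuples found in a model reply."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls


# Hypothetical model reply, not real output.
sample_reply = (
    "<tool_call>\n"
    '{"name": "controlla_liste_pep", "arguments": {"nominativo": "Al-Rashid"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(sample_reply)
```

Each parsed call can then be dispatched to the matching compliance service, with the result returned to the model in a follow-up turn.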

### Query Expansion — Improving RAG Retrieval

```python
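# System prompt (Italian): "Rewrite the user's query to improve document
# retrieval. Add technical terms and regulatory references. Reply ONLY with
# JSON." The original query is "suspicious transaction reporting obligations".
messages = [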
    {
        "role": "system",
        "content": "Riscrivi la query dell'utente per migliorare il recupero documentale. Aggiungi termini tecnici e riferimenti normativi. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": "## QUERY ORIGINALE: [obblighi segnalazione operazioni sospette]"
    }
]
```

### Document Reranking

```python
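# System prompt (Italian): "Rate the relevance of each candidate to the query.
# Score 0-100. Reply ONLY with JSON." The user message is a JSON payload with
# the query and the candidate documents.
messages = [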
    {
        "role": "system",
        "content": "Valuta la rilevanza di ciascun candidato rispetto alla query. Score 0-100. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": '{"query": "requisiti CET1", "candidates": [{"id": "doc_001", "title": "Art. 92 CRR"}, {"id": "doc_002", "title": "DORA Art. 5"}]}'
    }
]
```
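
Since the model is asked to reply with JSON only, the caller still has to parse the reply and order the candidates. The card does not show the output schema, so the `scores` / `id` / `score` field names and the sample values below are hypothetical; adapt them to the JSON the model actually returns.

```python
import json


def extract_json(text):
    """Parse the first JSON object or array found in a model reply."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON payload found in model reply")
    return json.loads(text[min(starts):end + 1])


def rank_candidates(reply, min_score=0):
    """Return candidate ids sorted by descending score (assumed schema)."""
    data = extract_json(reply)
    ranked = sorted(data["scores"], key=lambda item: item["score"], reverse=True)
    return [item["id"] for item in ranked if item["score"] >= min_score]


# Hypothetical reply with made-up scores, for illustration only.
sample_reply = (
    '  {"scores": [{"id": "doc_002", "score": 8}, '
    '{"id": "doc_001", "score": 95}]}  '
)
ranking = rank_candidates(sample_reply)
top_only = rank_candidates(sample_reply, min_score=50)
```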

### Training Metrics

| Metric | Value |
|---|---|
| Final Eval Loss | 1.368 |
| Token Accuracy | 70.5% |
| Train/Eval Gap | 0.033 |

> A train/eval loss gap of 0.033 suggests stable training without significant overfitting: the model learned domain-specific behavior while general capabilities appear largely preserved.

### Design Principles

The LoRA configuration follows a **minimal intervention** philosophy validated through progressive experimentation across 6+ configurations:

- **Low rank, all modules**: Modifying all transformer layers with minimal rank produces better results than high rank on a subset of layers — consistent with findings from the [original LoRA paper](https://arxiv.org/abs/2106.09685).
- **Single epoch**: One pass through the data is sufficient for behavioral adaptation. Additional epochs risk catastrophic forgetting on small models.
- **Conservative scaling**: Alpha = 2× rank with low learning rate ensures stable gradients with adequate signal amplification.
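
The principles above can be expressed as a small configuration helper. The card does not publish the actual rank, dropout, or learning rate, so the values here are placeholders; the key names follow the fields of `peft.LoraConfig` (`r`, `lora_alpha`, `target_modules`, `lora_dropout`, `task_type`), and reading "all modules" as `"all-linear"` is an assumption.

```python
def lora_kwargs(rank, dropout=0.0):
    """Build LoRA settings following the 'low rank, all modules' principle."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return {
        "r": rank,                       # low rank (placeholder value chosen by caller)
        "lora_alpha": 2 * rank,          # conservative scaling: alpha = 2 x rank
        "target_modules": "all-linear",  # assumption: adapt every linear layer
        "lora_dropout": dropout,
        "task_type": "CAUSAL_LM",
    }


config = lora_kwargs(rank=8)  # rank 8 is illustrative, not the published setting
scaling = config["lora_alpha"] / config["r"]  # LoRA update is scaled by alpha / r
```

With alpha fixed at twice the rank, the scaling factor applied to the low-rank update stays at 2 whatever rank is chosen, which keeps runs at different ranks comparable. If `peft` is installed, the dictionary can be passed as `LoraConfig(**config)`.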

---

## Dataset Coverage

The training data covers the full lifecycle of a RAG-based compliance assistant:

| Category | Purpose |
|---|---|
| Query Expansion | Enrich queries with regulatory terms for better retrieval |
| Intent Classification | Route queries to RAG vs conversational responses |
| Document Reranking | Score retrieved documents by relevance |
| Topic Extraction | Extract main topics from regulatory text pages |
| Document Summarization | Summarize multi-page regulatory documents |
| Relevance Filtering | Filter regulatory text relevant to banks |
| Metadata Extraction | Find application dates, issuing authorities |
| Impact Analysis | Cross-reference regulations vs internal procedures |
| RAG Q&A + Tool Calling | Multi-turn compliance conversations with tools |

**Regulatory sources covered:** CRR/CRR3, DORA (Regulation (EU) 2022/2554), D.Lgs. 231/2007 (AML), D.Lgs. 385/1993 (TUB), Circolare 285, PSD2, MiFID II/MiFIR, D.P.R. 180/1950, and related Banca d'Italia provisions.

---

## Deployment

### With vLLM
```bash
vllm serve ./models/RegTech-4B-Instruct --dtype bfloat16
```
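
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. A minimal standard-library sketch, assuming the default address `http://localhost:8000/v1` and that the served model name equals the path passed to `vllm serve`; adjust both if your deployment differs. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(messages, model="./models/RegTech-4B-Instruct",
                       base_url="http://localhost:8000/v1", max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not need a running server; chat() does.
request = build_chat_request([{"role": "user", "content": "Cos'è il CET1?"}])
```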

### With Transformers
```python
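from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace YOUR_REPO_ID with the Hub repo ID or a local path to the model.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_REPO_ID", torch_dtype="bfloat16", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_REPO_ID")

# `messages` is one of the chat lists from the Usage Examples section.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))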
```

---

## Important Notes

- **RAG-optimized** — Trained to work with retrieved context, not to memorize regulations. Always provide relevant documents in the system prompt.
- **Domain-specific** — Optimized for Italian banking compliance. General capabilities may differ from the base model.
- **Not legal advice** — A tool to assist compliance professionals, not a substitute for regulatory expertise.
- **Part of a model family** — This 4B model is the lightweight variant. Larger models (7B, 14B, 32B) in the RegTech family offer progressively better completeness and accuracy for more demanding use cases.

---

<p align="center">
  Built for banking RAG by <a href="https://landing.2sophia.ai">2Sophia</a><br>
  <em>Fine-tuned with LoRA &bull; Adversarial evaluation by frontier LLM judges &bull; Powered by Qwen3</em>
</p>