ara-extract-7b-qlora — QLoRA adapter for Qwen2.5-7B-Instruct

QLoRA adapter trained on trilingual (Uzbek · Russian · English) government-document extraction and classification tasks. The base model is loaded as 4-bit nf4 weights with double quantization via bitsandbytes, and LoRA matrices are then trained on top. The whole pipeline fits in ~7.6 GB peak VRAM, which reproduces the QLoRA paper's memory claim on a roughly $300 consumer GPU (RTX 4060, 8 GB).

This is the higher-quality half of the ARA fine-tuning pair. A smaller, CPU-friendly LoRA on Qwen2.5-0.5B-Instruct lives at bilalsaidumarov/ara-extract-v1.

TL;DR

Metric (vs. base Qwen2.5-7B-Instruct in 4-bit, no adapter)	Base 4-bit	+ QLoRA	Δ
JSON validity (output parses as JSON when JSON was the target)	0.0%	100.0%	+100 pp
Char similarity (`difflib.SequenceMatcher` mean)	0.166	0.610	×3.7
Exact match	0.0%	26.3%	+26.3 pp

100% JSON validity on the 6 held-out JSON-target examples after only 3 epochs on a 95-example dataset (76 training rows, 19 held out): the adapter teaches the base model to produce reliably structured output for the ARA extraction tasks.

Intended use

Drop-in LLM backend for the ARA document-intelligence platform when stronger extraction quality is needed than the 0.5B sibling provides — contracts, invoices, memos, reports, letters in Uzbek / Russian / English.

Prompt format (matches training):

### Instruction:
{instruction}

### Input:
{input}

### Response:
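
The template above can be filled programmatically. The helper below is a minimal sketch; the names `PROMPT_TEMPLATE` and `build_prompt` are illustrative and not part of the adapter's published code.

```python
# Hypothetical helper: fills the three-section template shown above.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Return a prompt ending right after '### Response:' so the model continues from there."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        input=input_text.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("Classify the document.", "Invoice #12 for consulting services."))
```

Keeping the template in one constant makes it harder for inference prompts to drift from the format used in training.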

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "bilalsaidumarov/ara-extract-7b-qlora"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

prompt = (
    "### Instruction:\nExtract amount, signing date, and counterparty from the "
    "contract excerpt. Return a JSON object with keys amount, date, counterparty.\n\n"
    "### Input:\nAGREEMENT №ARA-2026-014 dated 14.03.2026 between Ministry of "
    "Economy and Finance and Acme Logistics LLC. Total contract value: 1,200,000 UZS.\n\n"
    "### Response:\n"
)
enc = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=128, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))

Serving with vLLM

vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules ara-extract-7b-qlora=bilalsaidumarov/ara-extract-7b-qlora \
    --max-loras 4 \
    --quantization bitsandbytes
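
Once the server is running, the adapter can be queried through vLLM's OpenAI-compatible completions endpoint by passing the LoRA module name as the model. The client below is a sketch using only the standard library; the host and port (`http://localhost:8000`, vLLM's default) and the function names are assumptions.

```python
import json
import urllib.request


def build_completion_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build a request body that selects the adapter by its --lora-modules name."""
    return {
        "model": "ara-extract-7b-qlora",  # name registered via --lora-modules above
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, as in the evaluation
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to /v1/completions and return the generated text."""
    payload = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```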

Training


Setting	Value
Base model	`Qwen/Qwen2.5-7B-Instruct` (Apache-2.0)
Adapter type	QLoRA = LoRA on a 4-bit-quantized base (PEFT + bitsandbytes)
Quantization	4-bit `nf4`, double-quant, fp16 compute dtype
LoRA target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`
Rank `r` / `alpha` / dropout	16 / 32 / 0.05
Optimizer	`paged_adamw_8bit` (state pages between CPU and GPU)
Batch size / grad accumulation	1 / 8 → effective batch 8
Max sequence length	768
Epochs	3 (36 optimizer steps total)
Train / eval split	76 / 19 (deterministic, holdout-frac 0.2)
Hardware	NVIDIA RTX 4060 8 GB
Wall time	~5 min (≈8.4 s/step)
Peak VRAM	~7.6 GB
Loss curve	1.93 → 1.44 → 1.07 (last logged step)
Adapter size on disk	~40 MB
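
The table maps onto PEFT and Transformers settings roughly as follows. This is a sketch rather than the actual training script: the output path is hypothetical, the learning rate is not stated in this card and is left at the library default, and the 768-token limit is assumed to be applied at tokenization time.

```python
# Values copied from the training table; everything else is left at library defaults.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

TRAINING_KWARGS = {
    "output_dir": "out/ara-extract-7b-qlora",  # hypothetical path
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "optim": "paged_adamw_8bit",
}

MAX_SEQ_LEN = 768  # assumed to be passed to the tokenizer as max_length with truncation


def effective_batch_size() -> int:
    """Samples contributing to each optimizer step."""
    return (TRAINING_KWARGS["per_device_train_batch_size"]
            * TRAINING_KWARGS["gradient_accumulation_steps"])


def build_configs():
    """Instantiate the config objects; imports are deferred so the module loads without a GPU stack."""
    from peft import LoraConfig
    from transformers import TrainingArguments

    return LoraConfig(**LORA_KWARGS), TrainingArguments(**TRAINING_KWARGS)
```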

Why QLoRA here

A 7B model in fp16 needs ~14 GB just for weights, so it doesn't fit on an 8 GB consumer GPU. QLoRA quantizes the base to 4-bit nf4 (~4 GB for the weights), keeps a small LoRA delta in fp16, and uses paged 8-bit Adam so optimizer state isn't pinned in VRAM. Peak usage during training stays under 8 GB, and the delta itself is ~40 MB on disk.
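
The arithmetic behind these figures can be sketched as below. Two assumptions are made: a nominal 7e9 parameters, and layer shapes from the published Qwen2.5-7B config (hidden size 3584, 28 layers, key/value projection width 512). The 4-bit estimate ignores quantization constants and layers that stay unquantized, which is why it lands a little under the ~4 GB quoted above.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA pair A (r x d_in) and B (d_out x r) adds r * (d_in + d_out) parameters."""
    return r * (d_in + d_out)


N_PARAMS = 7e9  # nominal "7B"; an assumption for round numbers
fp16_gb = weight_gb(N_PARAMS, 16)  # 14.0
nf4_gb = weight_gb(N_PARAMS, 4)    # 3.5, before quantization constants

# Assumed Qwen2.5-7B shapes; rank taken from the training table.
HIDDEN, KV_DIM, LAYERS, RANK = 3584, 512, 28, 16
per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)        # q_proj
    + 2 * lora_params(HIDDEN, KV_DIM, RANK)  # k_proj and v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)      # o_proj
)
adapter_params = per_layer * LAYERS          # 10,092,544
adapter_mb_fp32 = adapter_params * 4 / 1e6   # about 40.4 MB at 4 bytes per parameter

if __name__ == "__main__":
    print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
    print(f"adapter: {adapter_params:,} params, {adapter_mb_fp32:.1f} MB if stored in fp32")
```

Under these assumptions, the ~40 MB on-disk size matches the adapter being saved at 4 bytes per parameter; stored at 2 bytes per parameter it would be about half that.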

Dataset

95 supervised examples across the 8 task families listed below, trilingual (uz · ru · en). The dataset is identical to the one used for the 0.5B LoRA, for a clean comparison:

Family	Count
Contract field extraction (amount, date, counterparty → JSON)	25
Document classification (contract / invoice / memo / report / letter)	13
Deadline date extraction	12
Summarization (one-sentence)	12
Language identification	10
Named-entity extraction	10
Contract clause translation	7
Monetary amount listing	6

Format — one JSON object per line: {"instruction": ..., "input": ..., "output": ...}.
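
A file in this format can be loaded and split with a few lines of standard-library code. The sketch below is an assumption about how a deterministic 80/20 split could be done; the seed, the shuffle, and the function names are illustrative, and the actual split procedure is not described in this card.

```python
import json
import random

REQUIRED_KEYS = ("instruction", "input", "output")


def parse_jsonl(lines):
    """Parse one JSON object per line, skipping blank lines and checking the three keys."""
    rows = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if k not in row]
        if missing:
            raise ValueError(f"line {n}: missing keys {missing}")
        rows.append(row)
    return rows


def split_holdout(rows, holdout_frac=0.2, seed=0):
    """Deterministic split: the same seed always yields the same held-out rows."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    eval_idx = set(idx[: round(len(rows) * holdout_frac)])
    train = [r for i, r in enumerate(rows) if i not in eval_idx]
    held_out = [r for i, r in enumerate(rows) if i in eval_idx]
    return train, held_out
```

With 95 rows and `holdout_frac=0.2`, this yields 76 training rows and 19 held-out rows, matching the split sizes reported in the training table.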

Evaluation

Held-out 20% (19 examples), greedy decoding, max_new_tokens=128. Base is the same 4-bit-quantized Qwen2.5-7B-Instruct (no adapter) — apples-to-apples.

Metric	Base 7B 4-bit	+ QLoRA
exact_match	0.0% (0/19)	26.3% (5/19)
char_similarity	0.166	0.610
json_valid (on 6 JSON-target rows)	0.0% (0/6)	100.0% (6/6)
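
The three metrics can be computed along these lines. This is a sketch, not the evaluation script itself: whitespace stripping before comparison and treating a row as a JSON target when its reference parses to an object or array are both assumptions.

```python
import json
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    """String equality after trimming surrounding whitespace (an assumption)."""
    return pred.strip() == ref.strip()


def char_similarity(pred: str, ref: str) -> float:
    """SequenceMatcher ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, pred, ref).ratio()


def is_json_container(text: str) -> bool:
    """True if text parses as a JSON object or array."""
    try:
        return isinstance(json.loads(text), (dict, list))
    except ValueError:
        return False


def score(preds, refs):
    """Aggregate metrics; json_valid is computed only over rows whose reference is JSON."""
    n = len(refs)
    json_preds = [p for p, r in zip(preds, refs) if is_json_container(r)]
    return {
        "exact_match": sum(exact_match(p, r) for p, r in zip(preds, refs)) / n,
        "char_similarity": sum(char_similarity(p, r) for p, r in zip(preds, refs)) / n,
        "json_valid": (
            sum(is_json_container(p) for p in json_preds) / len(json_preds)
            if json_preds else None
        ),
    }
```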

Side-by-side with the 0.5B LoRA sibling

Metric	LoRA on 0.5B	QLoRA on 7B
exact_match	10.5%	26.3%
char_similarity	0.374	0.610
json_valid	83.3%	100.0%
Adapter size	~10 MB	~40 MB
Train wall time (RTX 4060)	20 s	5 min
Peak VRAM	~3 GB	~7.6 GB

The 7B QLoRA wins every metric, at ~15× the training time and ~2.5× the peak VRAM — still on the same 8 GB consumer GPU.

Limitations

Small dataset (95 examples). Enough to demonstrate the technique and saturate the JSON-format target on the held-out set; not enough for production quality across all task families.
Exact-match remains modest (26.3%). The held-out tail leans on language-ID and translation rows — both need more training data.
Quantization drift. Outputs at 4-bit nf4 are not bit-identical to fp16 inference. For exact reproducibility, merge the adapter into an fp16 base.
No hyperparameter sweep. Rank, alpha, and dropout are untuned, commonly used QLoRA-style values.
Inherits base-model risks (bias, hallucination) — use temperature ≤ 0.2 and validate JSON before downstream use.
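
One way to validate JSON before downstream use is a small guard that rejects anything other than an object with the expected keys. The sketch below uses the keys from the contract-extraction example earlier in this card; the function name and the choice to return `None` on failure are illustrative.

```python
import json

# Keys requested in the contract-extraction prompt shown under "How to use".
REQUIRED_KEYS = ("amount", "date", "counterparty")


def parse_extraction(text: str, required_keys=REQUIRED_KEYS):
    """Return the parsed dict if text is a JSON object containing every required key, else None."""
    try:
        data = json.loads(text.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict):
        return None
    if not set(required_keys).issubset(data):
        return None
    return data
```

A caller can treat `None` as a signal to retry, fall back to another model, or route the document for manual review.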

License

Apache-2.0 (matches base model Qwen/Qwen2.5-7B-Instruct).

Citation

If this adapter is useful in your work, cite the ARA project:

@misc{ara2026,
  title  = {ARA — AI Resource Assistant (document-intelligence platform)},
  author = {Bilol Saidumarov},
  year   = {2026},
  url    = {https://github.com/sb-bilal-dev-2/ara}
}

Framework versions

PEFT 0.19.1
Transformers ≥ 4.46
bitsandbytes ≥ 0.43
PyTorch ≥ 2.5
