ara-extract-7b-qlora: QLoRA adapter for Qwen2.5-7B-Instruct

QLoRA adapter trained on trilingual (Uzbek · Russian · English) government-document extraction and classification tasks. The base model is loaded with 4-bit nf4 weights and double quantization via bitsandbytes, then LoRA matrices are trained on top. The whole pipeline fits in ~7.6 GB peak VRAM, which is the QLoRA-paper claim materialized on a $300 consumer GPU (RTX 4060 8 GB).

This is the higher-quality half of the ARA fine-tuning pair. A smaller CPU-friendly LoRA on Qwen2.5-0.5B-Instruct lives at bilalsaidumarov/ara-extract-v1.

TL;DR

Metric (vs. base Qwen2.5-7B-Instruct in 4-bit, no adapter) | Base 4-bit | + QLoRA | Δ
JSON validity (output parses as JSON when JSON was the target) | 0.0% | 100.0% | +100 pp
Char similarity (difflib.SequenceMatcher mean) | 0.166 | 0.610 | ×3.7
Exact match | 0.0% | 26.3% | +26.3 pp

JSON validity reaches 100% (6 of 6) on the held-out JSON-target examples after only 3 epochs on a 95-example dataset: the adapter teaches the base model to produce reliably structured output for the ARA extraction tasks.

Intended use

Drop-in LLM backend for the ARA document-intelligence platform when stronger extraction quality is needed than the 0.5B sibling provides: contracts, invoices, memos, reports, and letters in Uzbek / Russian / English.

Prompt format (matches training):

### Instruction:
{instruction}

### Input:
{input}

### Response:
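
A small helper can keep inference prompts identical to the template above. This is a minimal sketch: the name `build_prompt` and its parameter names are illustrative assumptions, and only the template layout itself comes from this card.

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Render the Instruction / Input / Response template used in training.

    build_prompt is a hypothetical helper name, not part of the ARA codebase.
    """
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )


# Made-up example values, for illustration only.
prompt = build_prompt(
    "Classify the document as contract, invoice, memo, report, or letter.",
    "INVOICE No. 17 for consulting services, payable within 30 days.",
)
```

The returned string ends right after the response header, so the model's continuation is the answer itself.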

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "bilalsaidumarov/ara-extract-7b-qlora"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

prompt = (
    "### Instruction:\nExtract amount, signing date, and counterparty from the "
    "contract excerpt. Return a JSON object with keys amount, date, counterparty.\n\n"
    "### Input:\nAGREEMENT №ARA-2026-014 dated 14.03.2026 between Ministry of "
    "Economy and Finance and Acme Logistics LLC. Total contract value: 1,200,000 UZS.\n\n"
    "### Response:\n"
)
enc = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=128, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))

Serving with vLLM

vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules ara-extract-7b-qlora=bilalsaidumarov/ara-extract-7b-qlora \
    --max-loras 4 \
    --quantization bitsandbytes
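
Once the server is running, the adapter can be selected by the name registered with --lora-modules through vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the standard library and assumes the server listens on the default http://localhost:8000; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumption: the server runs on vLLM's default host and port.
VLLM_URL = "http://localhost:8000/v1/completions"


def build_completion_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request that targets the LoRA adapter by its served name."""
    payload = {
        # Name registered via --lora-modules in the serve command.
        "model": "ara-extract-7b-qlora",
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Temperature 0 requests greedy decoding, as in the evaluation setup.
        "temperature": 0.0,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Passing the base model name as "model" instead would bypass the adapter, which is a convenient way to compare outputs with and without it.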

Training

Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0)
Adapter type | QLoRA = LoRA on a 4-bit-quantized base (PEFT + bitsandbytes)
Quantization | 4-bit nf4, double-quant, fp16 compute dtype
LoRA target modules | q_proj, k_proj, v_proj, o_proj
Rank r / alpha / dropout | 16 / 32 / 0.05
Optimizer | paged_adamw_8bit (state pages between CPU and GPU)
Batch size / grad accumulation | 1 / 8 → effective batch 8
Max sequence length | 768
Epochs | 3 (36 optimizer steps total)
Train / eval split | 76 / 19 (deterministic, holdout-frac 0.2)
Hardware | NVIDIA RTX 4060 8 GB
Wall time | ~5 min (≈8.4 s/step)
Peak VRAM | ~7.6 GB
Loss curve | 1.93 → 1.44 → 1.07 (last logged step)
Adapter size on disk | ~40 MB

Why QLoRA here

A 7B model in fp16 needs about 14 GB just for weights, so it doesn't fit on an 8 GB consumer GPU. QLoRA quantizes the base to 4-bit nf4 (about 4 GB for the weights), keeps a small LoRA delta in fp16, and uses paged 8-bit Adam so optimizer state isn't pinned in VRAM. Peak usage during training stays under 8 GB, and the delta itself is ~40 MB on disk.
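
The arithmetic behind these figures can be checked in a few lines. This is a rough sketch under stated assumptions, not a measurement. It takes a nominal 7e9 parameters for the base, and for the adapter it assumes 28 transformer layers, a hidden size of 3584, and a key/value projection width of 512. Those architecture values and the 4-byte storage width are not stated in this card.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight memory in decimal GB for a given storage width."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7e9 parameters for a "7B" model.
N_PARAMS = 7e9
fp16_gb = weight_gb(N_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_gb(N_PARAMS, 4)    # 4 bits per weight, before quantization constants


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA pair A (r x d_in) and B (d_out x r) adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


# Assumptions (not stated in this card): layer count, hidden size, and the
# narrower key/value projections of grouped-query attention.
LAYERS, HIDDEN, KV_DIM, RANK = 28, 3584, 512, 16
per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
adapter_params = LAYERS * per_layer
# Assumption: adapter weights are stored as 4-byte floats on disk.
adapter_mb = adapter_params * 4 / 1e6

print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
print(f"LoRA parameters: {adapter_params:,} (~{adapter_mb:.0f} MB)")
```

Under these assumptions the adapter has roughly 10 million trainable parameters, which lines up with the ~40 MB reported above and is a tiny fraction of the base model.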

Dataset

95 supervised examples across 15 task families, trilingual (uz · ru · en), identical to the 0.5B LoRA dataset for a clean comparison:

Family | Count
Contract field extraction (amount, date, counterparty → JSON) | 25
Document classification (contract / invoice / memo / report / letter) | 13
Deadline date extraction | 12
Summarization (one-sentence) | 12
Language identification | 10
Named-entity extraction | 10
Contract clause translation | 7
Monetary amount listing | 6

Format: one JSON object per line, {"instruction": ..., "input": ..., "output": ...}.
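
Reading that format takes only the standard library. The sketch below is one possible loader. The name `parse_jsonl`, the key check, and the sample row are illustrative assumptions rather than the ARA project's actual code or data.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def parse_jsonl(text: str) -> list:
    """Parse one JSON object per line, skipping blank lines."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        # Fail early on rows that lack any of the three expected fields.
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records


# Made-up sample row for illustration; it is not taken from the ARA dataset.
sample = (
    '{"instruction": "Identify the language.", '
    '"input": "The invoice is due on 1 June.", "output": "en"}\n'
)
rows = parse_jsonl(sample)
```

For a file on disk, the same function can be applied to the result of reading the file as UTF-8 text.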

Evaluation

Held-out 20% (19 examples), greedy decoding, max_new_tokens=128. The baseline is the same 4-bit-quantized Qwen2.5-7B-Instruct with no adapter, so the comparison is apples-to-apples.

Metric | Base 7B 4-bit | + QLoRA
exact_match | 0.0% (0/19) | 26.3% (5/19)
char_similarity | 0.166 | 0.610
json_valid (on 6 JSON-target rows) | 0.0% (0/6) | 100.0% (6/6)
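
The three metrics are simple enough to sketch with the standard library. This is an approximation of how they could be computed, not the project's evaluation script. It assumes whitespace is stripped before exact-match comparison and that a "JSON-target" row is one whose reference output parses to a JSON object.

```python
import json
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    # Assumption: surrounding whitespace is ignored.
    return pred.strip() == ref.strip()


def char_similarity(pred: str, ref: str) -> float:
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, pred, ref).ratio()


def json_valid(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def is_json_target(ref: str) -> bool:
    # Assumption: JSON-target rows are those whose reference is a JSON object.
    try:
        return isinstance(json.loads(ref), dict)
    except json.JSONDecodeError:
        return False


def evaluate(preds: list, refs: list) -> dict:
    """Aggregate the three metrics over paired predictions and references."""
    n = len(refs)
    json_preds = [p for p, r in zip(preds, refs) if is_json_target(r)]
    return {
        "exact_match": sum(exact_match(p, r) for p, r in zip(preds, refs)) / n,
        "char_similarity": sum(char_similarity(p, r) for p, r in zip(preds, refs)) / n,
        # json_valid is averaged over JSON-target rows only.
        "json_valid": (
            sum(json_valid(p) for p in json_preds) / len(json_preds)
            if json_preds else float("nan")
        ),
    }
```

Calling `evaluate` once with base-model outputs and once with adapter outputs on the same held-out rows gives the two columns of a table like the one above.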

Side-by-side with the 0.5B LoRA sibling

Metric | LoRA on 0.5B | QLoRA on 7B
exact_match | 10.5% | 26.3%
char_similarity | 0.374 | 0.610
json_valid | 83.3% | 100.0%
Adapter size | ~10 MB | ~40 MB
Train wall time (RTX 4060) | 20 s | 5 min
Peak VRAM | ~3 GB | ~7.6 GB

The 7B QLoRA wins on every quality metric, at ~15× the training time and ~2.5× the peak VRAM, still on the same 8 GB consumer GPU.

Limitations

  • Small dataset (95 examples). Enough to demonstrate the technique and saturate the JSON-format target on the held-out set; not enough for production quality across all 15 task families.
  • Exact-match remains modest (26.3%). The held-out tail leans on language-ID and translation rows; both need more training data.
  • Quantization drift. Outputs at 4-bit nf4 are not bit-identical to fp16 inference. For exact reproducibility, merge the adapter into an fp16 base.
  • No hyperparameter sweep. Defaults from the QLoRA paper.
  • Inherits base-model risks (bias, hallucination): use temperature ≤ 0.2 and validate JSON before downstream use.
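
Validating JSON before downstream use can be as simple as the guard below. It is a minimal sketch: the name `parse_extraction` is hypothetical, and the expected keys are taken from the contract-extraction prompt in the usage example, so other task families would need their own key sets.

```python
import json
from typing import Optional

# Keys requested by the contract-extraction prompt in the usage example.
EXPECTED_KEYS = ("amount", "date", "counterparty")


def parse_extraction(text: str, expected_keys=EXPECTED_KEYS) -> Optional[dict]:
    """Return the parsed object, or None if the model output is unusable."""
    try:
        obj = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    # Reject valid JSON that is not an object (lists, numbers, strings).
    if not isinstance(obj, dict):
        return None
    # Reject objects that are missing any requested field.
    if any(key not in obj for key in expected_keys):
        return None
    return obj
```

A caller can treat a `None` result as a signal to retry or to route the document for manual review instead of passing a malformed record along.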

License

Apache-2.0 (matches base model Qwen/Qwen2.5-7B-Instruct).

Citation

If this adapter is useful in your work, cite the ARA project:

@misc{ara2026,
  title  = {ARA: AI Resource Assistant (document-intelligence platform)},
  author = {Bilol Saidumarov},
  year   = {2026},
  url    = {https://github.com/sb-bilal-dev-2/ara}
}

Framework versions

  • PEFT 0.19.1
  • Transformers ≥ 4.46
  • bitsandbytes ≥ 0.43
  • PyTorch ≥ 2.5