ara-extract-7b-qlora: QLoRA adapter for Qwen2.5-7B-Instruct
QLoRA adapter trained on trilingual (Uzbek · Russian · English) government-document
extraction and classification tasks. The base model is loaded with 4-bit nf4
weights and double quantization via bitsandbytes, then LoRA matrices are
trained on top. The whole pipeline fits in ~7.6 GB peak VRAM, which demonstrates
the QLoRA paper's memory claim on a $300 consumer GPU (RTX 4060 8 GB).
This is the production-quality half of the ARA fine-tuning pair. A smaller CPU-friendly LoRA on Qwen2.5-0.5B-Instruct lives at bilalsaidumarov/ara-extract-v1.
TL;DR
| Metric (vs. base Qwen2.5-7B-Instruct in 4-bit, no adapter) | Base 4-bit | + QLoRA | Δ |
|---|---|---|---|
| JSON validity (output parses as JSON when JSON was the target) | 0.0% | 100.0% | +100 pp |
| Char similarity (difflib.SequenceMatcher mean) | 0.166 | 0.610 | ×3.7 |
| Exact match | 0.0% | 26.3% | +26.3 pp |
100% JSON validity (6/6) on the held-out JSON-target examples after only 3 epochs on 95 training rows: the adapter teaches the base model to produce reliably structured output for the ARA extraction tasks.
Intended use
Drop-in LLM backend for the ARA document-intelligence platform when stronger extraction quality is needed than the 0.5B sibling provides: contracts, invoices, memos, reports, and letters in Uzbek, Russian, or English.
Prompt format (matches training):
```text
### Instruction:
{instruction}

### Input:
{input}

### Response:
```
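A minimal sketch of filling this template in Python. The helper name `build_prompt` is hypothetical, and the blank line between sections is an assumption taken from the prompt string in the usage example.

```python
# Template mirroring the prompt format above. The "\n\n" separators are
# an assumption based on the usage example's prompt string.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the template; the model's answer is generated after '### Response:'."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        input=input_text.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("Classify the document type.",
                       "INVOICE 17 for consulting services."))
```

Keeping the template in one constant makes it harder for inference prompts to drift from the format the adapter saw during training.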
How to use
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "bilalsaidumarov/ara-extract-7b-qlora"

# Same quantization setup as training: 4-bit nf4, double-quant, fp16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the quantized base, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

prompt = (
    "### Instruction:\nExtract amount, signing date, and counterparty from the "
    "contract excerpt. Return a JSON object with keys amount, date, counterparty.\n\n"
    "### Input:\nAGREEMENT №ARA-2026-014 dated 14.03.2026 between Ministry of "
    "Economy and Finance and Acme Logistics LLC. Total contract value: 1,200,000 UZS.\n\n"
    "### Response:\n"
)

enc = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding, matching the evaluation setup.
    out = model.generate(**enc, max_new_tokens=128, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))
```
Serving with vLLM
```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-lora \
  --lora-modules ara-extract-7b-qlora=bilalsaidumarov/ara-extract-7b-qlora \
  --max-loras 4 \
  --quantization bitsandbytes
```
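Once the server is running, the adapter can be queried through vLLM's OpenAI-compatible completions endpoint by passing the adapter name as `model`. The sketch below uses only the standard library. The URL assumes vLLM's default host and port, and the helper names `build_payload` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumptions: default vLLM address, and the adapter name registered
# with --lora-modules in the serve command.
VLLM_URL = "http://localhost:8000/v1/completions"
ADAPTER_NAME = "ara-extract-7b-qlora"


def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Request body for the OpenAI-compatible completions endpoint."""
    return {
        "model": ADAPTER_NAME,  # naming the adapter routes the request through it
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,     # deterministic decoding, as in the evaluation
    }


def complete(prompt: str) -> str:
    """POST the prompt to a running server and return the generated text."""
    request = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["choices"][0]["text"]
```

The prompt passed to `complete` should follow the same `### Instruction:` / `### Input:` / `### Response:` layout used for training.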
Training
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0) |
| Adapter type | QLoRA = LoRA on a 4-bit-quantized base (PEFT + bitsandbytes) |
| Quantization | 4-bit nf4, double-quant, fp16 compute dtype |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Rank r / alpha / dropout | 16 / 32 / 0.05 |
| Optimizer | paged_adamw_8bit (state pages between CPU and GPU) |
| Batch size / grad accumulation | 1 / 8 (effective batch 8) |
| Max sequence length | 768 |
| Epochs | 3 (36 optimizer steps total) |
| Train / eval split | 76 / 19 (deterministic, holdout-frac 0.2) |
| Hardware | NVIDIA RTX 4060 8 GB |
| Wall time | ~5 min (≈8.4 s/step) |
| Peak VRAM | ~7.6 GB |
| Loss curve | 1.93 → 1.44 → 1.07 (last logged step) |
| Adapter size on disk | ~40 MB |
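For readers who want to reproduce a similar run, the table maps roughly onto PEFT and Transformers configuration objects as sketched below. This is not the author's training script: only values stated in the table are filled in, settings the table does not list (learning rate, scheduler, and so on) are omitted, and the names `LORA_KWARGS`, `TRAINER_KWARGS`, and `build_configs` are hypothetical.

```python
# Values copied from the training table; nothing else is assumed.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

TRAINER_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "optim": "paged_adamw_8bit",
}


def effective_batch_size(kwargs: dict) -> int:
    """Per-device batch size times gradient-accumulation steps."""
    return kwargs["per_device_train_batch_size"] * kwargs["gradient_accumulation_steps"]


def build_configs(output_dir: str):
    """Instantiate the config objects (requires peft and transformers)."""
    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(**LORA_KWARGS)
    training_args = TrainingArguments(output_dir=output_dir, **TRAINER_KWARGS)
    return lora_config, training_args
```

The maximum sequence length of 768 is not part of either object; it would be applied when tokenizing the training examples.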
Why QLoRA here
A 7B model in fp16 needs 14 GB just for weights, so it doesn't fit on an 8 GB
consumer GPU. QLoRA quantizes the base to 4-bit nf4 (~4 GB for the weights),
keeps a small LoRA delta in fp16, and uses paged 8-bit Adam so optimizer state
isn't pinned in VRAM. Peak usage during training stays under 8 GB, and the
delta itself is ~40 MB on disk.
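The weight-memory figures follow from simple arithmetic, sketched below for a nominal 7 billion parameters (an approximation of "7B"). The raw 4-bit figure comes out at 3.5 GB; the gap to the ~4 GB quoted above is presumably overhead such as quantization constants and layers kept in higher precision, which is an assumption here rather than a measured breakdown.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # nominal "7B" parameter count (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
# prints "fp16: 14.0 GB, 4-bit: 3.5 GB"
```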
Dataset
95 supervised examples across 15 task families, trilingual (uz · ru · en). The dataset is identical to the one used for the 0.5B LoRA, for a clean comparison:
| Family | Count |
|---|---|
| Contract field extraction (amount, date, counterparty → JSON) | 25 |
| Document classification (contract / invoice / memo / report / letter) | 13 |
| Deadline date extraction | 12 |
| Summarization (one-sentence) | 12 |
| Language identification | 10 |
| Named-entity extraction | 10 |
| Contract clause translation | 7 |
| Monetary amount listing | 6 |
Format: one JSON object per line, {"instruction": ..., "input": ..., "output": ...}.
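A minimal sketch of reading this format and producing a deterministic 80/20 split. The actual split procedure is not documented here, so the seeded shuffle below is an assumption, and the names `read_jsonl` and `holdout_split` are hypothetical.

```python
import json
import random


def read_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def holdout_split(rows, holdout_frac=0.2, seed=0):
    """Shuffle with a fixed seed, then reserve the first holdout_frac of rows for eval.

    Returns (train_rows, eval_rows). The same seed always yields the same split.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = round(len(rows) * holdout_frac)
    return rows[n_eval:], rows[:n_eval]


if __name__ == "__main__":
    sample = ['{"instruction": "Identify the language.", "input": "Salom", "output": "uz"}']
    print(read_jsonl(sample)[0]["output"])
    train, held_out = holdout_split(range(95))
    print(len(train), len(held_out))
```

With 95 rows and a holdout fraction of 0.2, this yields the 76 / 19 split sizes reported in the training table.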
Evaluation
Held-out 20% (19 examples), greedy decoding, max_new_tokens=128. The baseline is the
same 4-bit-quantized Qwen2.5-7B-Instruct without the adapter, so the comparison is apples-to-apples.
| Metric | Base 7B 4-bit | + QLoRA |
|---|---|---|
| exact_match | 0.0% (0/19) | 26.3% (5/19) |
| char_similarity | 0.166 | 0.610 |
| json_valid (on 6 JSON-target rows) | 0.0% (0/6) | 100.0% (6/6) |
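The three metrics can be sketched with the standard library as follows. This is an illustration rather than the evaluation script itself: whitespace stripping before comparison and the rule for deciding which rows count as JSON targets (a reference starting with `{` or `[`) are assumptions.

```python
import json
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    """True when prediction and reference agree after trimming whitespace (assumption)."""
    return pred.strip() == ref.strip()


def char_similarity(pred: str, ref: str) -> float:
    """difflib.SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, pred, ref).ratio()


def json_valid(text: str) -> bool:
    """True when the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def is_json_target(ref: str) -> bool:
    """Assumed rule: a row is a JSON target when its reference is an object or array."""
    return ref.lstrip().startswith(("{", "["))


def evaluate(preds, refs):
    """Aggregate the three metrics over paired predictions and references."""
    pairs = list(zip(preds, refs))
    json_preds = [p for p, r in pairs if is_json_target(r)]
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / len(pairs),
        "char_similarity": sum(char_similarity(p, r) for p, r in pairs) / len(pairs),
        # json_valid is computed only over JSON-target rows; None if there are none
        "json_valid": (sum(json_valid(p) for p in json_preds) / len(json_preds)
                       if json_preds else None),
    }
```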
Side-by-side with the 0.5B LoRA sibling
| Metric | LoRA on 0.5B | QLoRA on 7B |
|---|---|---|
| exact_match | 10.5% | 26.3% |
| char_similarity | 0.374 | 0.610 |
| json_valid | 83.3% | 100.0% |
| Adapter size | ~10 MB | ~40 MB |
| Train wall time (RTX 4060) | 20 s | 5 min |
| Peak VRAM | ~3 GB | ~7.6 GB |
The 7B QLoRA wins on every quality metric, at ~15× the training time and ~2.5× the peak VRAM, while still fitting on the same 8 GB consumer GPU.
Limitations
- Small dataset (95 examples). Enough to demonstrate the technique and saturate the JSON-format target on the held-out set; not enough for production quality across all 15 task families.
- Exact-match remains modest (26.3%). The held-out tail leans on language-ID and translation rows; both need more training data.
- Quantization drift. Outputs at 4-bit nf4 are not bit-identical to fp16 inference. For exact reproducibility, merge the adapter into an fp16 base.
- No hyperparameter sweep. Defaults from the QLoRA paper.
- Inherits base-model risks (bias, hallucination); use temperature ≤ 0.2 and validate JSON before downstream use.
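One way to validate JSON before downstream use is to reject anything that is not a JSON object carrying the expected keys. The sketch below uses the keys from the contract-extraction prompt in the usage example; the function name `parse_extraction` is hypothetical, and other task families would need their own key sets.

```python
import json

# Keys requested by the contract-extraction instruction in the usage example.
REQUIRED_KEYS = frozenset({"amount", "date", "counterparty"})


def parse_extraction(text: str, required=REQUIRED_KEYS):
    """Return the parsed dict, or None when the output is unusable.

    Unusable means: not valid JSON, not a JSON object, or missing a required key.
    """
    try:
        obj = json.loads(text)
    except ValueError:
        return None
    if not isinstance(obj, dict) or not required.issubset(obj):
        return None
    return obj


if __name__ == "__main__":
    good = '{"amount": "1,200,000 UZS", "date": "14.03.2026", "counterparty": "Acme Logistics LLC"}'
    print(parse_extraction(good) is not None)      # True
    print(parse_extraction("Sure, here it is:"))   # None
```

Returning `None` instead of raising lets the caller decide whether to retry generation, fall back to another model, or flag the document for review.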
License
Apache-2.0 (matches base model Qwen/Qwen2.5-7B-Instruct).
Citation
If this adapter is useful in your work, cite the ARA project:
@misc{ara2026,
title = {ARA: AI Resource Assistant (document-intelligence platform)},
author = {Bilol Saidumarov},
year = {2026},
url = {https://github.com/sb-bilal-dev-2/ara}
}
Framework versions
- PEFT 0.19.1
- Transformers ≥ 4.46
- bitsandbytes ≥ 0.43
- PyTorch ≥ 2.5