---
license: apache-2.0
license_name: apache-2.0
language:
- en
- fr
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- legal
- canadian-law
- bilingual
- french
- quebec-civil-law
- citation
- instruction-following
- vision-language
---

# flash-1-mini

**A compact, bilingual, vision-capable model specialized for Canadian legal and regulatory work, in English and Canadian French.**

flash-1-mini is a 4-billion-parameter model fine-tuned from Qwen3.5-4B for Canadian legal tasks. It is built for the parts of legal work that have to be right: producing correctly formatted legal citations and following detailed instructions, across both of Canada's official languages and both of its legal traditions (common law and Quebec civil law). It retains the full general-reasoning and vision capability of its base model.

- **Version:** `flash-1-mini-20260602`
- **Developed by:** Alpine Pacific Trading Inc. (operating as SimpleDirect®)
- **Base model:** Qwen3.5-4B (Apache-2.0)
- **License:** Apache-2.0
- **Languages:** English, Canadian French
- **Modalities:** Text + image input → text output
- **Code & examples:** [github.com/getsimpledirect/flash-1-mini](https://github.com/getsimpledirect/flash-1-mini)

| Spec | Value |
|---|---|
| Parameters | 4.54B |
| Architecture | Qwen3_5ForConditionalGeneration (hybrid linear-attention + full-attention) |
| Hidden size / layers / heads | 2560 / 32 / 16 |
| Vocab | 248,320 |
| Context length | 262,144 |
| Precision | bfloat16 |
| Tied embeddings | Yes |

## Highlights

Measured against its base model under identical conditions (same prompts, same scoring):

- **2.7× more reliable legal citations:** citation-integrity accuracy 42.1% vs 15.8% on the CBLRE benchmark.
- **+22.9 points on instruction-following:** IFEval prompt-strict 53.2% vs 30.3%.
- **Balanced bilingual competence:** privacy-compliance parity ratio of 1.00 (English 90.9% / French 90.9%).
- **Stronger English legal reasoning:** MMLU international law 76.0% vs 70.3%.
- **No loss of general capability:** MMLU unchanged (~69.8%); complex multi-step reasoning improves (BBH 79.0% vs 68.6%).
- **Vision-capable:** reads and reasons over images and documents, inherited from the base.

## Intended use

flash-1-mini is intended as a drafting and research assistant for Canadian legal and regulatory workflows, in English and French, where citation correctness and faithful instruction-following matter. It is suitable for legal-tech builders, compliance teams, and Canadian regulated-industry operators.

It is designed to **assist** legal professionals, not to replace their judgment. Outputs, especially citations, should be verified against primary sources before reliance.

## How to use

flash-1-mini uses the `Qwen3_5ForConditionalGeneration` architecture, which is **native to Transformers ≥ 5.5**, so no `trust_remote_code` is required. Install a recent Transformers:

```bash
pip install "transformers>=5.5"
```

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "simpledirect/flash-1-mini"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# Text
messages = [{"role": "user", "content": [
    {"type": "text", "text": "What does section 1 of the Canadian Charter of Rights and Freedoms do?"}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

For image input, include `{"type": "image"}` in the message content and pass `images=[img]` to the processor.
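
The image path described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the helper names `build_image_messages` and `answer_about_image`, and the file name in the usage note, are hypothetical, and the call pattern simply mirrors the text example with `images=[img]` added. It assumes Pillow is installed for image loading.

```python
def build_image_messages(question: str) -> list:
    """Build a single-turn chat with one image placeholder followed by the question."""
    return [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]


def answer_about_image(image_path: str, question: str, max_new_tokens: int = 256) -> str:
    """Load the model, attach one image, and return the decoded answer (hypothetical helper)."""
    # Heavy imports stay local so build_image_messages() is usable without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "simpledirect/flash-1-mini"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, dtype=torch.bfloat16, device_map="auto"
    )

    img = Image.open(image_path).convert("RGB")
    prompt = processor.apply_chat_template(
        build_image_messages(question), add_generation_prompt=True, tokenize=False
    )
    # Same call as the text example, with the image passed alongside the prompt.
    inputs = processor(text=[prompt], images=[img], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

A call such as `answer_about_image("notice.png", "Summarize this notice.")` would run one greedy generation over the image; `notice.png` is a placeholder for any local file.
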
### Thinking mode

Like its base model, flash-1-mini **thinks by default**: it emits a `<think>...</think>` reasoning block before the final answer. For many legal drafting tasks you will want the direct answer only. Disable thinking by passing `enable_thinking=False` through the chat template:

```python
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,  # direct kwarg; emits an empty <think> block so the model answers directly
)
```

When serving via vLLM, pass `--reasoning-parser qwen3`; to disable thinking per request, set `"chat_template_kwargs": {"enable_thinking": false}` in the JSON request body (or keep thinking on for complex reasoning where it helps).

### Serving

The model serves with **vLLM** for production text and multimodal inference (Transformers ≥ 5.5). Greedy decoding (temperature 0) is recommended for legal tasks where determinism matters.

### Quantized GGUF variants (text-only)

GGUF quantizations for CPU / edge inference via `llama.cpp` and Ollama are available in the [`gguf/`](./tree/main/gguf) folder of this repository:

| File | Quant | Size | Notes |
|---|---|---|---|
| `gguf/flash-1-mini-20260602-Q6_K.gguf` | Q6_K | 3.3 GB | Highest fidelity; closest to bf16 |
| `gguf/flash-1-mini-20260602-Q5_K_M.gguf` | Q5_K_M | 2.9 GB | Balanced quality / size |
| `gguf/flash-1-mini-20260602-Q4_K_M.gguf` | Q4_K_M | 2.6 GB | Smallest; quality holds on common tasks |

**Important: these GGUFs are text-only.** The vision tower is not carried in the GGUF format, so image input is **not** supported by the GGUF variants. For multimodal (image) inference, use the bf16 safetensors weights above.

Quality scales with bit-depth: Q6_K tracks the bf16 model most closely; lower bit-depths trade some fidelity for size, and on the most demanding legal-citation tasks the higher-bit quants are recommended.
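
One way to fetch a single GGUF file before running the commands below is `hf_hub_download` from `huggingface_hub`. The sketch assumes the files live under `gguf/` in the `simpledirect/flash-1-mini` repository as listed in the table; the helper names `gguf_filename` and `download_gguf` are illustrative, not part of any published tooling.

```python
REPO_ID = "simpledirect/flash-1-mini"   # repository named in this card
VERSION = "flash-1-mini-20260602"       # version string from this card
QUANTS = ("Q6_K", "Q5_K_M", "Q4_K_M")   # quantizations listed in the table


def gguf_filename(quant: str) -> str:
    """Return the in-repo path of one of the GGUF files listed in the table."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"gguf/{VERSION}-{quant}.gguf"


def download_gguf(quant: str = "Q5_K_M") -> str:
    """Download one quantization and return its local cache path."""
    # Imported lazily so gguf_filename() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
```

The path returned by `download_gguf()` points into the local Hugging Face cache; it can be passed to `-m` or used in the Modelfile `FROM` line in place of the bare filename.
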
```bash
# llama.cpp
./llama-completion -m flash-1-mini-20260602-Q5_K_M.gguf \
  -p "What is the legal test under section 1 of the Canadian Charter?" -n 200 --temp 0

# Ollama (create a Modelfile pointing at the GGUF, then run)
printf 'FROM ./flash-1-mini-20260602-Q5_K_M.gguf\n' > Modelfile
ollama create flash-1-mini -f Modelfile
ollama run flash-1-mini
```

GGUF inference of this architecture (Qwen3.5 hybrid linear-attention / Gated DeltaNet) requires a recent `llama.cpp` build with support for these layers. The multi-token-prediction (MTP) head is excluded from the GGUF (not used at inference). To run the bf16 weights in lower precision instead, load them with `bitsandbytes` 4-bit/8-bit via `BitsAndBytesConfig`.

## Benchmarks

All figures are flash-1-mini vs the Qwen3.5-4B base under identical conditions (same prompts, few-shot counts, scoring, greedy decoding). See the SimpleDirect benchmarking methodology and CBLRE eval-set documentation for the full protocol.

| Capability | Base | flash-1-mini |
|---|---|---|
| Legal citation integrity (CBLRE) | 15.8% | **42.1%** |
| Instruction-following (IFEval, prompt-strict) | 30.3% | **53.2%** |
| English legal: international law (MMLU) | 70.3% | **76.0%** |
| English legal: jurisprudence (MMLU) | 79.6% | **81.5%** |
| Complex reasoning (BBH) | 68.6% | **79.0%** |
| General knowledge (MMLU) | 69.8% | 69.8% |
| Privacy-compliance bilingual parity (FR/EN) | n/a | **1.00** |

### Where it is weaker

Specialization carried measurable costs, reported here in full:

- **Retrieval (RAG):** source-attribution accuracy regressed (80.5% → 75.5% on a leak-proof held-out set). flash-1-mini is not a retrieval/RAG leader.
- **Function-calling (BFCL v4):** overall regressed (37.7% → 28.6%), with multi-turn the weakest sub-category.
- **French professional-law MCQ (Global-MMLU FR):** regressed (49.0% → 44.6%).
- **CBLRE Quebec civil law:** regressed (95.0% → 90.0%).

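
The headline multiples and point changes can be re-derived from the reported scores. The sketch below copies the (base, flash-1-mini) pairs from this card and computes the difference in percentage points and the ratio to the base; the dictionary keys and function names are illustrative only.

```python
# (base, flash-1-mini) percentages copied from the figures reported in this card.
RESULTS = {
    "CBLRE citation integrity": (15.8, 42.1),
    "IFEval prompt-strict": (30.3, 53.2),
    "RAG source attribution": (80.5, 75.5),
    "BFCL v4 overall": (37.7, 28.6),
    "Global-MMLU FR professional law": (49.0, 44.6),
    "CBLRE Quebec civil law": (95.0, 90.0),
}


def point_delta(base: float, tuned: float) -> float:
    """Absolute change in percentage points (positive means improvement)."""
    return round(tuned - base, 1)


def relative_ratio(base: float, tuned: float) -> float:
    """Fine-tuned score expressed as a multiple of the base score."""
    return round(tuned / base, 2)


for name, (base, tuned) in RESULTS.items():
    print(f"{name}: {point_delta(base, tuned):+.1f} pts, {relative_ratio(base, tuned):.2f}x base")
```

For citation integrity the ratio comes out to about 2.66, which the Highlights round to 2.7×; the same helpers give the size of each regression listed above in percentage points.
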

If your workload is primarily retrieval-grounded QA or tool/function-calling orchestration, evaluate carefully against these numbers.

## Training

flash-1-mini is a supervised fine-tune of Qwen3.5-4B using parameter-efficient adapters (LoRA with DoRA, rank 32 / alpha 64, RS-LoRA), with the vision tower frozen, on a bilingual Canadian legal corpus weighted toward citation production and Quebec civil-law content. The trained adapter was merged into the base weights and the checkpoint canonicalized for serving. The architecture is unchanged from the base.

## Limitations and responsible use

- **Not legal advice.** flash-1-mini produces information to assist qualified professionals; it does not practice law and its outputs are not a substitute for a lawyer.
- **Verify citations.** Citation accuracy is materially improved over the base but is not perfect; verify against primary sources.
- **Bilingual, not omniscient in French.** Parity is strong on tested tracks but French professional-law MCQ regressed; do not assume uniform French superiority.
- **Hallucination.** Like all LLMs, it can produce confident, incorrect output.
- **Quebec register.** The model is evaluated for legal correctness, not certified for Quebec-French dialectal register.

## License and attribution

flash-1-mini is released under the **Apache License 2.0**. It is a modified derivative work of **Qwen3.5-4B** (© Alibaba Cloud / Qwen Team, Apache-2.0). See the `LICENSE` and `NOTICE` files in this repository for the full license text and the required attribution and modification statement.

## Citation

```bibtex
@misc{simpledirect2026flash1mini,
  title        = {flash-1-mini: A Bilingual Canadian Legal Language Model},
  author       = {{Alpine Pacific Trading Inc. (operating as SimpleDirect)}},
  year         = {2026},
  note         = {Version flash-1-mini-20260602. Derivative of Qwen3.5-4B (Apache-2.0).},
  howpublished = {\url{https://huggingface.co/simpledirect/flash-1-mini}}
}
```