Use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="DuoNeural/Mistral-7B-Instruct-v0.3-GGUF",
	filename="Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",  # recommended default quant (see table below)
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Mistral-7B-Instruct-v0.3 – GGUF Quants

Quantized GGUF versions of mistralai/Mistral-7B-Instruct-v0.3, Mistral AI's flagship 7B instruction-tuned model. v0.3 adds function-calling support and an improved tokenizer (v3, extended to a 32768-token vocabulary) over v0.2. Mistral-7B remains one of the most capable and widely deployed 7B models for general-purpose inference.
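Since v0.3's headline feature is function calling, here is a sketch of the OpenAI-style tool schema that llama-cpp-python's `create_chat_completion()` accepts via its `tools` argument. The `get_weather` function and its parameters are hypothetical examples, not part of this repo:

```python
import json

# Hypothetical tool definition in the OpenAI-style schema that
# llama-cpp-python's create_chat_completion() accepts via `tools`.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# With a loaded model you would pass this alongside the messages, e.g.:
# llm.create_chat_completion(
#     messages=[{"role": "user", "content": "What's the weather in Paris?"}],
#     tools=tools,
#     tool_choice="auto",
# )

# The schema must be plain JSON-serializable data.
print(json.dumps(tools, indent=2))
```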

Available Files

File Quant Size Use Case
Mistral-7B-Instruct-v0.3-Q8_0.gguf Q8_0 ~7.2GB Maximum quality
Mistral-7B-Instruct-v0.3-Q6_K.gguf Q6_K ~5.5GB Near-lossless
Mistral-7B-Instruct-v0.3-Q5_K_M.gguf Q5_K_M ~4.8GB High quality
Mistral-7B-Instruct-v0.3-Q4_K_M.gguf Q4_K_M ~4.1GB Recommended default
Mistral-7B-Instruct-v0.3-Q3_K_M.gguf Q3_K_M ~3.3GB Low VRAM
Mistral-7B-Instruct-v0.3-IQ4_XS.gguf IQ4_XS ~3.6GB Imatrix 4-bit
Mistral-7B-Instruct-v0.3-IQ3_XXS.gguf IQ3_XXS ~2.7GB Imatrix 3-bit
Mistral-7B-Instruct-v0.3-IQ2_M.gguf IQ2_M ~2.4GB Imatrix 2-bit
Mistral-7B-Instruct-v0.3-IQ1_S.gguf IQ1_S ~1.6GB Extreme compression
Mistral-7B-Instruct-v0.3-fp16.gguf FP16 ~14.0GB Full precision
imatrix.dat – – Importance matrix
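To compare these quants on a common scale, you can convert the file sizes above into rough effective bits per weight. This is an estimate assuming ~7.25B parameters; GGUF files also carry metadata and a few tensors kept at higher precision:

```python
# Rough effective bits-per-weight (bpw) for each quant, computed from the
# file sizes in the table above and an assumed ~7.25B parameter count.
PARAMS = 7.25e9

sizes_gb = {
    "Q8_0": 7.2, "Q6_K": 5.5, "Q5_K_M": 4.8, "Q4_K_M": 4.1,
    "Q3_K_M": 3.3, "IQ4_XS": 3.6, "IQ3_XXS": 2.7, "IQ2_M": 2.4,
    "IQ1_S": 1.6, "FP16": 14.0,
}

def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    # bytes -> bits, spread over the parameter count
    return size_gb * 1e9 * 8 / params

for name, gb in sizes_gb.items():
    print(f"{name:8s} ~{bits_per_weight(gb):.2f} bpw")
```

Q4_K_M lands around 4.5 bpw, which is why it is the usual quality/size sweet spot.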

Usage

# llama.cpp (v0.3 uses [INST] format)
./llama-cli -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  --ctx-size 8192 -n 512 \
  -p "[INST] Hello! [/INST]"

# Ollama
ollama run hf.co/DuoNeural/Mistral-7B-Instruct-v0.3-GGUF:Q4_K_M

About Mistral-7B-Instruct-v0.3

  • Parameters: 7B
  • Context: 32K tokens (with sliding window attention)
  • Architecture: Mistral (GQA, SWA, RoPE)
  • License: Apache 2.0
  • New in v0.3: Function calling, tokenizer v3 (32768 vocab), improved tool use
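The 32K context is cheap to serve thanks to GQA. A back-of-envelope estimate, assuming Mistral-7B's published shape (32 layers, 8 KV heads, head dimension 128):

```python
# Back-of-envelope KV-cache size at a given context length, assuming
# Mistral-7B's shape: 32 layers, 8 KV heads (GQA), head dim 128.
def kv_cache_bytes(ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elt: int = 2) -> int:
    # factor of 2 covers both keys and values
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt

print(f"{kv_cache_bytes(32768) / 2**30:.1f} GiB")  # fp16 K/V at full 32K context
# -> 4.0 GiB
```

With only 8 KV heads instead of 32, GQA cuts the cache to a quarter of what full multi-head attention would need.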

One of the most production-proven 7B models in the open-source ecosystem. Excellent choice for general-purpose inference, function calling pipelines, and as a fine-tuning base.


Quantized by DuoNeural using llama.cpp on RTX 5090.


DuoNeural

DuoNeural is an open AI research lab: human + AI in collaboration.

