llama.cpp server immediately crashes: "[WARNING] surya: Inference error: Connection error."
Hello, I tried testing this model on Windows without NVIDIA hardware, using llama.cpp as the backend, and I can't get it to work at all. As soon as I launch the OCR via the terminal or the GUI I get the error above, and then the server crashes. This happens both with the embedded llama-server that gets launched automatically and when I launch my own instance of llama-server and point to it via SURYA_INFERENCE_URL.
I tried changing the values of "--image-min-tokens" and "--image-max-tokens", but it does not seem to make a difference. I also changed the number of concurrent queries using SURYA_INFERENCE_PARALLEL, and the equivalent setting on llama-server, but that did not help either.
I have a solution that at least runs, but it is extremely slow.
The quoted speed is 5 pages/sec on a 5060; I am on a 4060 and have to wait 5-10 sec for a single page of about 3000 characters.
# ---------------- Libraries ----------------
import os
import gc
import io
import sys
import time
import re
from pathlib import Path
from base64 import b64encode
from typing import Optional
from PIL import Image
from llama_cpp import Llama
import llama_cpp
from llama_cpp.llama_chat_format import Qwen35ChatHandler
# Alternative handlers: PaddleOCRChatHandler, Llava16ChatHandler (ministral), LFM25VLChatHandler, Gemma4ChatHandler, MTMDChatHandler
from llama_cpp import llama_print_system_info
#PaddleOCRVLDocumentConverter
print("=== Python ===")
print("Version:", sys.version)
print("Executable:", sys.executable)
print(llama_print_system_info().decode("utf-8"))
print("LLAMA Version:", llama_cpp.__version__)
# ---------------- Configuration ----------------
def get_config() -> dict:
    """Returns configuration for model paths and settings."""
    # === CHANGE THIS PATH TO ANY LOCATION ON YOUR HARD DRIVE ===
    model_path = Path(r"f:\...\surya-2.gguf")
    image_path = Path(r"c:\...\1")
    # Auto-discover the mmproj file (first file containing "mmproj")
    mmproj_candidates = [f for f in model_path.parent.glob("*mmproj*.gguf") if f.is_file()]
    if not mmproj_candidates:
        raise FileNotFoundError(f"No mmproj file found in {model_path.parent}")
    return {
        "base_path": model_path.parent,
        "gguf_model_path": model_path,  # The model file you specified
        "clip_model_path": mmproj_candidates[0],  # Auto-discovered mmproj
        "image_input_path": image_path,  # Image path
        "image_extensions": {'.jpg', '.jpeg', '.png', '.bmp', '.webp'},
        "min_side_size": 1024,
        "recursive": True,
    }
# ---------------- Image Processing Functions ----------------
def find_image_files(input_path: Path, extensions: set[str], recursive: bool = True) -> list[Path]:
    """Find all image files in the specified directory."""
    images = []
    if not input_path.exists():
        print(f"⚠️ Input path does not exist: {input_path}")
        return images
    # Search pattern based on recursive setting
    search_pattern = "**/*" if recursive else "*"
    for file_path in input_path.glob(search_pattern):
        if file_path.is_file() and file_path.suffix.lower() in extensions:
            images.append(file_path)
    return sorted(images)
def resize_image_to_max_side(image_path: Path, max_side: int = 1024) -> Image.Image | None:
    """
    Resize image so the LARGEST side is exactly `max_side` pixels.

    Memory Note: Returns a new PIL Image that should be explicitly
    released after use to prevent memory buildup in long loops.

    Args:
        image_path: Path to the source image file
        max_side: Target dimension for the larger side (default: 1024)

    Returns:
        Resized PIL Image or None if processing fails
    """
    try:
        with Image.open(image_path) as img:
            original_width, original_height = img.size
            print(f" ℹ️ Original size: {original_width}x{original_height}")
            # Target the LARGER dimension to max_side
            max_dim = max(original_width, original_height)
            scale_factor = max_side / max_dim
            new_width = int(original_width * scale_factor)
            new_height = int(original_height * scale_factor)
            print(f" ↕️ Scaling by: {scale_factor:.2f}x")
            print(f" 📐 New size: {new_width}x{new_height}")
            resized_img = img.resize(
                (new_width, new_height),
                Image.Resampling.LANCZOS
            )
            return resized_img
    except Exception as e:
        print(f"⚠️ Failed to resize {image_path.name}: {e}")
        import traceback
        traceback.print_exc()
        return None
def image_to_base64(image: Image.Image, format_type: str = 'JPEG') -> str | None:
    """Convert a PIL Image to a base64-encoded data URI string."""
    mime_types = {
        'JPEG': 'image/jpeg',
        'PNG': 'image/png',
        'WEBP': 'image/webp',
        'BMP': 'image/bmp',
        'TIFF': 'image/tiff',
    }
    try:
        output_format = format_type.upper()
        if output_format == 'JPG':
            output_format = 'JPEG'  # Pillow only registers the format name "JPEG"
        mime_type = mime_types.get(output_format, 'image/jpeg')
        save_kwargs = {'format': output_format}
        if output_format == 'JPEG':
            save_kwargs['quality'] = 97
            # JPEG doesn't support transparency - flatten alpha onto white
            if image.mode in ('RGBA', 'LA'):
                background = Image.new('RGB', image.size, (255, 255, 255))
                background.paste(image, mask=image.getchannel('A'))
                image = background
            elif image.mode not in ('RGB', 'L'):
                # Palette, CMYK, etc. cannot be written as JPEG directly
                image = image.convert('RGB')
        byte_buffer = io.BytesIO()
        image.save(byte_buffer, **save_kwargs)
        base64_data = b64encode(byte_buffer.getvalue()).decode('ascii')
        return f"data:{mime_type};base64,{base64_data}"
    except Exception as e:
        print(f"⚠️ Failed to encode {getattr(image, 'filename', 'unknown')} as base64: {e}")
        return None
def generate_description(llm: Llama, image_path: Path) -> str | None:
    """Generate a description for a single image using the multimodal model.

    Returns None on any failure so that error strings are never saved as OCR output.
    """
    resized_image: Optional[Image.Image] = None
    base64_uri: Optional[str] = None
    try:
        print(" 📐 Step 1/3: Resizing...")
        resized_image = resize_image_to_max_side(image_path, max_side=1024)
        if resized_image is None:
            print(" ⚠️ Failed to process image")
            return None
        pixel_count = resized_image.width * resized_image.height
        print(f" 🔐 Step 2/3: Converting to base64... ({pixel_count:,} pixels)")
        # Store the URI locally - it is used immediately and then cleaned up
        base64_uri = image_to_base64(resized_image, format_type='JPEG')
        if not base64_uri:
            print(" ⚠️ Failed to encode image")
            return None
        print(" 🤖 Step 3/3: Generating description...")
        user_content = [
            {
                "type": "image_url",
                "image_url": {"url": base64_uri}
            },
            {
                "type": "text",
                "text": "FULL OCR, TEXT, Layout, Images, Tables"
            }
        ]
        messages = [
            {
                "role": "system",
                "content": (
                    "Full OCR, TEXT, Layout, Tables, Images, Drawings"
                    # "If the image contains text, extract it accurately in json format."
                ),
            },
            {
                "role": "user",
                "content": user_content,
            },
        ]
        response = llm.create_chat_completion(
            messages=messages,
            # ---- Generation control ----
            max_tokens=3000,  # 1200 reasoning task inclusive
        )
        choices = response.get("choices", [])
        if not choices:
            print(" ⚠️ No description generated")
            return None
        content = choices[0].get("message", {}).get("content")
        return content.strip() if content else None
    except Exception as e:
        print(f"❌ Error processing {image_path.name}: {e}")
        import traceback
        traceback.print_exc()
        return None
    finally:
        # Release the PIL image and the large encoded string on every exit path
        del resized_image, base64_uri
        gc.collect()
def save_response(image_path: Path, response: str) -> bool:
    """Save the model's response to a .txt file with same basename as image."""
    # CLEAN THE DESCRIPTION BEFORE SAVING
    cleaned_response = clean_description(response)
    output_path = image_path.with_suffix(".txt")
    try:
        # Write description with UTF-8 encoding for international characters
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(cleaned_response)
        print(f"✅ Saved to: {output_path.name}")
        return True
    except PermissionError as e:
        print(f"❌ Cannot write to {output_path}: {e}")
        return False
    except OSError as e:
        print(f"❌ OS error writing to {output_path}: {e}")
        return False
def clean_description(description: str) -> str:
    """
    Clean generated description by removing Markdown bold markers (**).

    Args:
        description: Raw description text from the model

    Returns:
        Description text with "**" removed and surrounding whitespace stripped
    """
    if not description:
        return description
    # re.sub with a literal pattern; the regex flags in the original had no effect
    description = re.sub(r'\*\*', '', description)
    return description.strip()
# ---------------- Main Execution ----------------
def main():
    """Main function to process all images and generate descriptions."""
    config = get_config()
    start_time = time.time()
    # Validate model files exist with detailed paths (Windows path handling)
    if not config["gguf_model_path"].exists():
        raise FileNotFoundError(
            f"Model file not found: {config['gguf_model_path'].resolve()}"
        )
    if not config["clip_model_path"].exists():
        raise FileNotFoundError(
            f"MMProj file not found: {config['clip_model_path'].resolve()}"
        )
    # Find images to process - Windows compatibility for paths
    image_files = find_image_files(
        input_path=config["image_input_path"],
        extensions=config["image_extensions"],
        recursive=config["recursive"]
    )
    print(f"number of images: {len(image_files)}")
    if not image_files:
        print(f"No images found in {config['image_input_path']}")
        return
    llm = Llama(
        # -------- Model / Paths --------
        model_path=str(config["gguf_model_path"]),
        # -------- Core Performance --------
        n_ctx=8192,  # Context window
        ctx_checkpoints=0,  # Tune for faster output
        n_threads=os.cpu_count() or 8,  # Use all CPU threads
        # n_batch=512,  # Prompt processing batch size
        n_batch=2048,
        temperature=1.0,
        # -------- GPU Acceleration --------
        n_gpu_layers=-1,  # Offload all layers to GPU
        main_gpu=0,  # Primary GPU index
        # tensor_split=None,  # Multi-GPU split if needed
        # -------- Memory Behavior --------
        use_mmap=True,  # Faster loading
        use_mlock=False,  # Lock in RAM (optional)
        # cache_capacity=None,  # KV cache override
        low_vram=False,  # Reduce VRAM usage if needed
        swa_full=True,
        # -------- Sampling / Reproducibility --------
        seed=123,
        # presence_penalty=0.0  # useless
        # -------- Optional Debug --------
        verbose=False,
        # force_reasoning=False,  # useless?
        # -------- Multimodal Handler --------
        chat_handler=Qwen35ChatHandler(
            clip_model_path=str(config["clip_model_path"]),
            enable_thinking=False,
            image_min_tokens=1024,
        ),
    )
    print("✅ Model loaded successfully!\n")
    print("=" * 50)
    # Process images ONE AT A TIME with explicit cleanup between iterations
    processed = 0
    failed = 0
    for image_path in image_files:
        try:
            print(f"\n🖼️ Processing: {image_path.name}")
            description = generate_description(llm, image_path)
            # Count a page as processed only if it was both generated and saved
            if description and save_response(image_path, description):
                processed += 1
            else:
                failed += 1
        except Exception as e:
            print(f"❌ Critical error processing {image_path.name}: {e}")
            failed += 1
        finally:
            # Force cleanup between each image to prevent memory buildup
            gc.collect()
            print("-" * 50)
    # Summary with detailed status
    print("\n" + "=" * 50)
    print(f"📊 Results: {processed} succeeded, {failed} failed")
    end_time = time.time()
    total_duration = end_time - start_time
    print(f" ⚡ Total duration: {total_duration:.2f}s")
    # Memory usage hint (only if psutil is available)
    try:
        import psutil
        process = psutil.Process(os.getpid())
        mem_mb = process.memory_info().rss / 1024 / 1024
        print(f" 💾 Current memory usage: {mem_mb:.1f} MB")
    except ImportError:
        pass  # Optional, doesn't break script if not available
    gc.collect()  # Final cleanup
if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️ Process interrupted by user")
    except FileNotFoundError as e:
        print(f"\n❌ Configuration error: {e}")
Hi - there are two separate issues:
- On Windows - it would be useful to see the actual error message from the llama.cpp server. You could also try launching it manually.
- On the 4060 - the 5 pages/s on a 5090 is throughput, not latency, measured at concurrency 128. 5-10 seconds for a single page is not unreasonable; you will start to see better throughput as you send more concurrent requests.
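To make the throughput-versus-latency distinction concrete, here is a rough back-of-the-envelope sketch. It assumes steady state with all 128 requests in flight and applies Little's law (in-flight requests = throughput × average latency); the function names are illustrative and not part of surya or llama.cpp.

```python
def implied_latency_s(concurrency: int, throughput_pages_per_s: float) -> float:
    """Average per-page latency implied by Little's law: L = lambda * W."""
    return concurrency / throughput_pages_per_s


def sequential_throughput(latency_s: float) -> float:
    """Pages per second when only one request is in flight at a time."""
    return 1.0 / latency_s


if __name__ == "__main__":
    # Figures quoted in this thread: 5 pages/s at concurrency 128.
    print(f"Implied per-page latency: {implied_latency_s(128, 5.0):.1f} s")
    # One page at a time, at 5-10 s per page:
    print(f"Sequential throughput: {sequential_throughput(10.0):.2f}"
          f" to {sequential_throughput(5.0):.2f} pages/s")
```

Under these assumptions, each page in the 5 pages/s benchmark still takes tens of seconds from submission to completion; the headline number comes from many pages being processed at once, not from any single page being fast.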
As I pointed out in my post, the same error happens whether llama-server is started automatically or manually. I don't see any obvious error message in the server's console output; it just kills the process a few seconds after the OCR is launched. Surya OCR, on the other hand, gets stuck on the "Connection error" message and has to be shut down manually.
Here's the full console output until the process stops:
parser generation prompt: <|im_start|>assistant
0.19.271.245 D add_text: <|im_start|>user
0.19.271.365 D add_text: <|vision_start|>
0.19.301.035 D image_tokens->nx = 33
0.19.301.044 D image_tokens->ny = 48
0.19.301.044 D batch_f32 size = 1
0.19.301.050 D add_text: <|vision_end|>
0.19.301.320 D add_text: OCR this image to HTML. Each block is a div with data-label and data-bbox (x0 y0 x1 y1, normalized 0-1000).<|im_end|>
<|im_start|>assistant
0.19.305.000 D srv params_from_: Grammar lazy: false
0.19.305.006 I srv params_from_: Chat format: peg-native
0.19.305.009 D srv params_from_: Generation prompt: '<|im_start|>assistant
'
0.19.305.103 I srv prompt_get_n: message_spans: last user message: byte_pos=0, media=0, n_before_user=0
0.19.305.139 D res add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
0.19.305.143 D que post: new task, id = 0/1, front = 0
0.19.305.161 D que start_loop: processing new tasks
0.19.305.164 D que start_loop: processing task, id = 0
0.19.305.170 I slot get_availabl: id 7 | task -1 | selected slot by LRU, t_last = -1
0.19.305.171 I srv get_availabl: updating prompt cache
0.19.305.176 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.19.305.182 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 2048.000 MiB, 32768 tokens, 2147483648 est)
0.19.305.184 I srv get_availabl: prompt cache update took 0.01 ms
0.19.305.198 D slot launch_slot_: id 7 | task -1 | launching slot : {"id":7,"n_ctx":4096,"speculative":false,"is_processing":false}
0.19.305.290 D common_sampler_init: prefill token: 1 = <|im_start|>
0.19.305.291 D common_sampler_init: prefill token: 2033 = a
0.19.305.292 D common_sampler_init: prefill token: 2051 = s
0.19.305.292 D common_sampler_init: prefill token: 2051 = s
0.19.305.292 D common_sampler_init: prefill token: 2041 = i
0.19.305.293 D common_sampler_init: prefill token: 2051 = s
0.19.305.293 D common_sampler_init: prefill token: 2052 = t
0.19.305.293 D common_sampler_init: prefill token: 2033 = a
0.19.305.294 D common_sampler_init: prefill token: 2046 = n
0.19.305.294 D common_sampler_init: prefill token: 2052 = t
0.19.305.295 D common_sampler_init: prefill token: 1957 =
0.19.305.314 I slot launch_slot_: id 7 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.19.305.327 I slot launch_slot_: id 7 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.100, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.19.305.329 I slot launch_slot_: id 7 | task 0 | processing task, is_child = 0
0.19.305.330 D que start_loop: update slots
0.19.305.331 D srv update_slots: posting NEXT_RESPONSE
0.19.305.333 D que post: new task, id = 1, front = 0
0.19.305.339 I slot update_slots: id 7 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 1705
0.19.305.347 I slot update_slots: id 7 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.19.305.360 D slot update_slots: id 7 | task 0 | main/do_checkpoint = no, pos_min = -1, pos_max = -1
0.19.305.361 D srv update_slots: decoding batch, n_tokens = 7
0.19.305.363 D set_adapters_lora: adapters = 0000000000000000
0.19.305.363 D adapters_lora_are_same: adapters = 0000000000000000
0.19.305.364 D set_embeddings: value = 0
I think I may have figured out the issue: I have an AMD APU, and by default llama-server tries to use this GPU through the Vulkan backend. Evidently this breaks this specific model.
Adding the "-ngl 0" parameter to the llama-server launch command switches to the full CPU backend and allows this to work, albeit very slowly. Any plans to support all llama.cpp backends?
| 5-10 seconds for a single page is not unreasonable - you will start to see better throughput as you send more concurrent requests
What does that mean: load the model multiple times, each in its own thread, or pass ~5 images at once?
Any suggestions for the model instruction ("FULL OCR", ...)? What if I also need descriptions of the images on a text page? It does that sometimes, but it could do more.
Any suggestions for model temperature? Optimal image size?
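One plausible reading of "more concurrent requests" is a single server instance with several pages in flight at once, rather than several copies of the model. The sketch below assumes a llama-server (or other OpenAI-compatible) endpoint at a URL you supply; `build_payload`, `post_json`, and `ocr_pages` are hypothetical helper names. The prompt text and `temperature` of 0 are taken from the server log earlier in this thread, which shows what the surya client itself sent; whether they are optimal is not established here.

```python
import base64
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Prompt as it appears in the llama-server log posted in this thread.
PROMPT = ("OCR this image to HTML. Each block is a div with data-label and "
          "data-bbox (x0 y0 x1 y1, normalized 0-1000).")


def build_payload(jpeg_bytes: bytes, prompt: str = PROMPT,
                  max_tokens: int = 3000) -> dict:
    """Build an OpenAI-style chat request carrying one JPEG page as a data URI."""
    data_uri = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # the log above shows temp = 0.000
    }


def post_json(url: str, payload: dict, timeout: float = 600.0) -> str:
    """POST one request and return the text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def ocr_pages(pages: list[bytes], url: str, concurrency: int = 8,
              send: Callable[[str, dict], str] = post_json) -> list[str]:
    """Send all pages with up to `concurrency` requests in flight; results keep input order."""
    payloads = [build_payload(page) for page in pages]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda payload: send(url, payload), payloads))
```

A call would look like `ocr_pages(list_of_jpeg_bytes, "http://127.0.0.1:8080/v1/chat/completions", concurrency=8)`. The server also has to be started with enough parallel slots for this to help, and a suitable `concurrency` value for a 4060 would need to be found by measurement.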