llama.cpp server immediately crashes: "[WARNING] surya: Inference error: Connection error."

#1
by rsbdev - opened

Hello, I tried testing this model on Windows without NVIDIA hardware, using llama.cpp as a backend, and I can't get it to work at all. As soon as I launch the OCR via the terminal or the GUI, I get the error above and then the server just crashes. This happens with the embedded llama-server that gets launched automatically, as well as when I launch my own instance of llama-server and point to it via SURYA_INFERENCE_URL.

I tried changing values for "--image-min-tokens" and "--image-max-tokens" but it does not seem to make a difference. I also changed the number of concurrent queries using SURYA_INFERENCE_PARALLEL and the same on llama-server but that did not help either.
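
Before changing more flags, it may help to confirm whether the server is reachable at all when the "Connection error" appears. Below is a minimal sketch that probes llama-server's `/health` endpoint. The fallback address `http://127.0.0.1:8080` and the helper name `server_is_healthy` are assumptions for illustration, and whether SURYA_INFERENCE_URL is expected to carry an extra path suffix is not something this thread confirms.

```python
import os
import urllib.error
import urllib.request


def server_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


if __name__ == "__main__":
    # Assumption: the fallback address is only a placeholder for a local server.
    base_url = os.environ.get("SURYA_INFERENCE_URL", "http://127.0.0.1:8080")
    print(f"{base_url} healthy: {server_is_healthy(base_url)}")
```

If this prints `True` right after startup and `False` once the OCR request is sent, the server process died during the request rather than failing to start.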

I have a solution that at least runs, but it is extremely slow.

The quoted speed is 5 pages/sec on a 5060; I am on a 4060 and have to wait 5-10 sec for a single page of about 3000 characters.

# ---------------- Libraries ----------------
import os
import gc
import io
import sys
import time
import re
from pathlib import Path
from base64 import b64encode
from typing import Optional
from PIL import Image
from llama_cpp import Llama
import llama_cpp
# Other handlers tried: PaddleOCRChatHandler, Llava16ChatHandler (ministral),
# LFM25VLChatHandler, Gemma4ChatHandler, MTMDChatHandler
from llama_cpp.llama_chat_format import Qwen35ChatHandler
from llama_cpp import llama_print_system_info

#PaddleOCRVLDocumentConverter

print("=== Python ===")
print("Version:", sys.version)
print("Executable:", sys.executable)
print(llama_print_system_info().decode("utf-8"))
print("LLAMA Version:", llama_cpp.__version__)


# ---------------- Configuration ----------------
def get_config() -> dict:
    """Returns configuration for model paths and settings."""
    # === CHANGE THIS PATH TO ANY LOCATION ON YOUR HARD DRIVE ===
    model_path = Path(r"f:\...\surya-2.gguf")
    image_path = Path(r"c:\...\1")
    
    # Auto-discover the mmproj file (first file containing "mmproj")
    mmproj_candidates = [f for f in model_path.parent.glob("*mmproj*.gguf") if f.is_file()]
    if not mmproj_candidates:
        raise FileNotFoundError(f"No mmproj file found in {model_path.parent}")
    
    return {
        "base_path": model_path.parent,
        "gguf_model_path": model_path,  # The model file you specified
        "clip_model_path": mmproj_candidates[0],  # Auto-discovered mmproj
        "image_input_path": image_path, # image path 
        "image_extensions": {'.jpg', '.jpeg', '.png', '.bmp', '.webp'},
        "min_side_size": 1024,
        "recursive": True,
    }


# ---------------- Image Processing Functions ----------------


def find_image_files(input_path: Path, extensions: set[str], recursive: bool = True) -> list[Path]:
    """Find all image files in the specified directory."""
    images = []
    
    if not input_path.exists():
        print(f"⚠️  Input path does not exist: {input_path}")
        return images
    
    # Search pattern based on recursive setting
    search_pattern = "**/*" if recursive else "*"
    
    for file_path in input_path.glob(search_pattern):
        if file_path.is_file() and file_path.suffix.lower() in extensions:
            images.append(file_path)
    
    return sorted(images)

    
def resize_image_to_max_side(image_path: Path, max_side: int = 1024) -> Image.Image | None:
    """
    Resize image so the LARGEST side is exactly `max_side` pixels.
    
    Memory Note: Returns a new PIL Image that should be explicitly 
    released after use to prevent memory buildup in long loops.
    
    Args:
        image_path: Path to the source image file  
        max_side: Target dimension for the larger side (default: 1024)
        
    Returns:
        Resized PIL Image or None if processing fails
    """
    try:
        with Image.open(image_path) as img:
            original_width, original_height = img.size
            
            print(f"   ℹ️  Original size: {original_width}x{original_height}")
            
            # Target the LARGER dimension to max_side
            max_dim = max(original_width, original_height)
            scale_factor = max_side / max_dim
            
            new_width = int(original_width * scale_factor)
            new_height = int(original_height * scale_factor)
            
            print(f"   ↕️  Scaling by: {scale_factor:.2f}x")
            print(f"   📐 New size: {new_width}x{new_height}")
            
            resized_img = img.resize(
                (new_width, new_height), 
                Image.Resampling.LANCZOS
            )
            
            return resized_img
            
    except Exception as e:
        print(f"⚠️  Failed to resize {image_path.name}: {e}")
        import traceback
        traceback.print_exc()
        return None



def image_to_base64(image: Image.Image, format_type: str = 'JPEG') -> str | None:
    """Convert a PIL Image to base64-encoded data URI string."""
    try:
        if not hasattr(image, 'filename'):
            image.filename = "unknown"
        
        mime_types = {
            'JPEG': 'image/jpeg',
            'PNG': 'image/png',
            'WEBP': 'image/webp',
            'BMP': 'image/bmp',
            'TIFF': 'image/tiff',
        }
        
        output_format = format_type.upper()
        if output_format in ['JPG', 'JPEG']:
            mime_type = 'image/jpeg'
            # JPEG doesn't support transparency - convert RGBA to RGB
            if image.mode == 'RGBA':
                background = Image.new('RGB', image.size, (255, 255, 255))
                background.paste(image, mask=image.split()[3])
                image = background
        else:
            mime_type = mime_types.get(output_format, 'image/jpeg')
        

        byte_buffer = io.BytesIO()
        quality = 97 if output_format in ['JPG', 'JPEG'] else None
        
        save_kwargs = {'format': output_format}
        if quality is not None:
            save_kwargs['quality'] = quality
            
        image.save(byte_buffer, **save_kwargs)
        base64_data = b64encode(byte_buffer.getvalue()).decode('ascii')
        
        return f"data:{mime_type};base64,{base64_data}"
        
    except Exception as e:
        print(f"⚠️  Failed to encode {getattr(image, 'filename', 'unknown')} as base64: {e}")
        return None


def generate_description(llm: Llama, image_path: Path) -> str | None:
    """Generate a description for a single image using the multimodal model."""
    try:
        print(f"   📐 Step 1/3: Resizing...")
        resized_image: Optional[Image.Image] = resize_image_to_max_side(
            image_path, max_side=1024
        )
        
        if resized_image is None:
            # Return None so the caller counts this image as failed
            # instead of saving an error string as OCR output.
            return None
        
        pixel_count = resized_image.width * resized_image.height
        print(f"   🔐 Step 2/3: Converting to base64... ({pixel_count:,} pixels)")
        
        # Store the URI locally - will be used immediately and then cleaned up
        base64_uri: Optional[str] = image_to_base64(resized_image, format_type='JPEG')
        
        if not base64_uri:
            # Same reasoning as above: signal failure with None.
            return None
        
        print(f"   🤖 Step 3/3: Generating description...")


        user_content = [
            {
                "type": "image_url",
                "image_url": {"url": base64_uri}
            },
            {
                "type": "text",
                "text": "FULL OCR, TEXT, Layout, Images, Tables"
            }
        ]

        messages = [
            {
                "role": "system",
                "content": (
                    "Full OCR, TEXT, Layout, Tables, Images, Drawings"
                    #"If the image contains text, extract it accurately in json format."
                ),
            },
            {
                "role": "user",
                "content": user_content,
            },
        ]

        response = llm.create_chat_completion(
            messages=messages,

            # ---- Generation control ----
            max_tokens=3000,     # 1200 reasoning task inclusive
          

        )

        choices = response.get("choices", [])
        if not choices:
            return "No description generated"
            
        content = choices[0].get("message", {}).get("content")
        
        # Explicitly clean up memory-intensive objects BEFORE returning
        del resized_image  # Release PIL Image from memory
        del base64_uri     # Release large encoded string
        
        gc.collect()      # Force garbage collection for immediate cleanup
        
        return content.strip() if content else "No description generated"

    except Exception as e:
        print(f"❌ Error processing {image_path.name}: {e}")
        import traceback
        traceback.print_exc()
        
        # Clean up on error too
        try:
            del resized_image, base64_uri  # type: ignore
        except NameError:
            pass
        gc.collect()
        return None


def save_response(image_path: Path, response: str) -> bool:
    """Save the model's response to a .txt file with same basename as image."""
    # CLEAN THE DESCRIPTION BEFORE SAVING
    cleaned_response = clean_description(response)
    
    output_path = image_path.with_suffix(".txt")
    
    try:
        # Write description with UTF-8 encoding for international characters
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(cleaned_response)
        print(f"✅ Saved to: {output_path.name}")
        return True
    except PermissionError as e:
        print(f"❌ Cannot write to {output_path}: {e}")
        return False
    except OSError as e:
        print(f"❌ OS error writing to {output_path}: {e}")
        return False



def clean_description(description: str) -> str:
    """
    Clean a generated description by removing Markdown bold markers (**).
    
    Args:
        description: Raw description text from the model
        
    Returns:
        The description without ** markers, stripped of surrounding whitespace
    """
    if not description:
        return description

    # A plain string replacement is enough here; no regex flags are needed.
    description = description.replace("**", "")
    
    return description.strip()


# ---------------- Main Execution ----------------

def main():
    """Main function to process all images and generate descriptions."""
    config = get_config()
    start_time = time.time()    
    
    # Validate model files exist with detailed paths (Windows path handling)
    if not config["gguf_model_path"].exists():
        raise FileNotFoundError(
            f"Model file not found: {config['gguf_model_path'].resolve()}"
        )
    if not config["clip_model_path"].exists():
        raise FileNotFoundError(
            f"MMProj file not found: {config['clip_model_path'].resolve()}"
        )
    
    # Find images to process - Windows compatibility for paths
    image_files = find_image_files(
        input_path=config["image_input_path"],
        extensions=config["image_extensions"],
        recursive=config["recursive"]
    )
    print(f"number of images: {len(image_files)}")
    
    if not image_files:
        print(f"No images found in {config['image_input_path']}")
        return



    llm = Llama(
        # -------- Model / Paths --------
        model_path=str(config["gguf_model_path"]),

        # -------- Core Performance --------
        n_ctx=8192, # 8192,                 # Context window
        ctx_checkpoints=0,          # Tune for faster output
        n_threads=os.cpu_count() or 8,   # Use all CPU threads
        #n_batch=512,                # Prompt processing batch size
        n_batch=2048,
        temperature=1.0,

        # -------- GPU Acceleration --------
        n_gpu_layers=-1,            # Offload all layers to GPU
        main_gpu=0,                 # Primary GPU index
        #tensor_split=None,          # Multi-GPU split if needed

        # -------- Memory Behavior --------
        use_mmap=True,              # Faster loading
        use_mlock=False,            # Lock in RAM (optional)
        #cache_capacity=None,        # KV cache override
        low_vram=False,             # Reduce VRAM usage if needed
        swa_full=True,

        # -------- Sampling / Reproducibility --------
        seed=123,
        # presence_penalty=0.0 # useless
        # -------- Optional Debug --------
        verbose=False,
        # force_reasoning=False, # useless?


        # -------- Multimodal Handler --------
        chat_handler=Qwen35ChatHandler(
            clip_model_path=str(config["clip_model_path"]),
            enable_thinking=False,
            image_min_tokens=1024,
            
            
        ),
    )



     
    print("✅ Model loaded successfully!\n")
    print("=" * 50)
    

    # Process images ONE AT A TIME with explicit cleanup between iterations
    processed = 0
    failed = 0

    for image_path in image_files:
        try:
            print(f"\n🖼️ Processing: {image_path.name}")
            
            description = generate_description(llm, image_path)
            
            if description:
                # Count a failed write as a failure instead of dropping it silently
                if save_response(image_path, description):
                    processed += 1
                else:
                    failed += 1
            else:
                failed += 1
            
        except Exception as e:
            print(f"❌ Critical error processing {image_path.name}: {e}")
            failed += 1
        
        finally:
            # Force cleanup between each image to prevent memory buildup
            gc.collect()
        
        print("-" * 50)

    # Summary with detailed status
    print("\n" + "=" * 50)
    print(f"📊 Results: {processed} succeeded, {failed} failed")

    end_time = time.time()
    total_duration = end_time - start_time
    
    print(f"   ⚡ Total duration:         {total_duration:.2f}s")
    # Memory usage hint (Windows specific if psutil available)
    try:
        import psutil
        process = psutil.Process(os.getpid())
        mem_mb = process.memory_info().rss / 1024 / 1024
        print(f"   💾 Current memory usage:  {mem_mb:.1f} MB")
    except ImportError:
        pass  # Optional, doesn't break script if not available
    
    gc.collect()  # Final cleanup




if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️  Process interrupted by user")
    except FileNotFoundError as e:
        print(f"\n❌ Configuration error: {e}")
Datalab org

Hi - there are two separate issues:

  • On Windows - it would be useful to actually see the error message from the llama.cpp server - you could try launching manually as well
  • On 4060 - the 5 pages/s on 5090 is throughput, not latency, at concurrency 128. 5-10 seconds for a single page is not unreasonable - you will start to see better throughput as you send more concurrent requests
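
To make the throughput/latency distinction concrete, here is a small worked calculation using Little's law (requests in flight = throughput × latency). It only uses the two figures quoted above (5 pages/s, concurrency 128) and assumes a steady state in which all 128 requests are kept in flight; the function name is made up for illustration.

```python
def mean_latency_seconds(concurrency: int, throughput_per_s: float) -> float:
    """Little's law: requests in flight = throughput * mean latency."""
    if throughput_per_s <= 0:
        raise ValueError("throughput must be positive")
    return concurrency / throughput_per_s


# 128 requests in flight at 5 pages/s means each page waits about 25.6 s.
print(mean_latency_seconds(128, 5.0))  # 25.6
```

Under that assumption, a single page on the benchmark GPU would itself take many seconds from request to response, so 5-10 s for one page at concurrency 1 does not contradict the 5 pages/s figure.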

As I pointed out in my post, the same error happens whether llama-server is started automatically or manually. I don't see any obvious error message in the server's console output; the process is simply killed a few seconds after the OCR is launched. Surya OCR, on the other hand, gets stuck on the "Connection error" message and has to be shut down manually.

Here's the full console output until the process stops:

parser generation prompt: <|im_start|>assistant

0.19.271.245 D add_text: <|im_start|>user

0.19.271.365 D add_text: <|vision_start|>
0.19.301.035 D image_tokens->nx = 33
0.19.301.044 D image_tokens->ny = 48
0.19.301.044 D batch_f32 size = 1
0.19.301.050 D add_text: <|vision_end|>
0.19.301.320 D add_text: OCR this image to HTML. Each block is a div with data-label and data-bbox (x0 y0 x1 y1, normalized 0-1000).<|im_end|>
<|im_start|>assistant

0.19.305.000 D srv params_from_: Grammar lazy: false
0.19.305.006 I srv params_from_: Chat format: peg-native
0.19.305.009 D srv params_from_: Generation prompt: '<|im_start|>assistant
'
0.19.305.103 I srv prompt_get_n: message_spans: last user message: byte_pos=0, media=0, n_before_user=0
0.19.305.139 D res add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
0.19.305.143 D que post: new task, id = 0/1, front = 0
0.19.305.161 D que start_loop: processing new tasks
0.19.305.164 D que start_loop: processing task, id = 0
0.19.305.170 I slot get_availabl: id 7 | task -1 | selected slot by LRU, t_last = -1
0.19.305.171 I srv get_availabl: updating prompt cache
0.19.305.176 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.19.305.182 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 2048.000 MiB, 32768 tokens, 2147483648 est)
0.19.305.184 I srv get_availabl: prompt cache update took 0.01 ms
0.19.305.198 D slot launch_slot_: id 7 | task -1 | launching slot : {"id":7,"n_ctx":4096,"speculative":false,"is_processing":false}
0.19.305.290 D common_sampler_init: prefill token: 1 = <|im_start|>
0.19.305.291 D common_sampler_init: prefill token: 2033 = a
0.19.305.292 D common_sampler_init: prefill token: 2051 = s
0.19.305.292 D common_sampler_init: prefill token: 2051 = s
0.19.305.292 D common_sampler_init: prefill token: 2041 = i
0.19.305.293 D common_sampler_init: prefill token: 2051 = s
0.19.305.293 D common_sampler_init: prefill token: 2052 = t
0.19.305.293 D common_sampler_init: prefill token: 2033 = a
0.19.305.294 D common_sampler_init: prefill token: 2046 = n
0.19.305.294 D common_sampler_init: prefill token: 2052 = t
0.19.305.295 D common_sampler_init: prefill token: 1957 =

0.19.305.314 I slot launch_slot_: id 7 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.19.305.327 I slot launch_slot_: id 7 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.100, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.19.305.329 I slot launch_slot_: id 7 | task 0 | processing task, is_child = 0
0.19.305.330 D que start_loop: update slots
0.19.305.331 D srv update_slots: posting NEXT_RESPONSE
0.19.305.333 D que post: new task, id = 1, front = 0
0.19.305.339 I slot update_slots: id 7 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 1705
0.19.305.347 I slot update_slots: id 7 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.19.305.360 D slot update_slots: id 7 | task 0 | main/do_checkpoint = no, pos_min = -1, pos_max = -1
0.19.305.361 D srv update_slots: decoding batch, n_tokens = 7
0.19.305.363 D set_adapters_lora: adapters = 0000000000000000
0.19.305.363 D adapters_lora_are_same: adapters = 0000000000000000
0.19.305.364 D set_embeddings: value = 0
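
The log above stops without an error line, so the process exit code may be the only remaining clue. A minimal sketch, assuming the server is launched manually from Python: it writes all console output to a file and reports the exit code. The helper names and the log file name are hypothetical; the command line in the usage comment is a placeholder, not the command Surya itself uses.

```python
import subprocess
import sys


def run_and_report(cmd: list[str], log_path: str) -> int:
    """Run cmd until it exits, write stdout+stderr to log_path, return the exit code."""
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


def describe_exit(code: int) -> str:
    """Format an exit code; Windows crash codes are easier to look up in hex."""
    if code < 0:
        # On POSIX, a negative value means the process was killed by a signal.
        return f"killed by signal {-code}"
    return f"exit code {code} (0x{code:08X})"


# Usage (placeholder command, blocks until the server exits):
#   code = run_and_report(["llama-server", "-m", "surya-2.gguf"], "server.log")
#   print(describe_exit(code))
```

On Windows, a value such as 0xC0000005 (access violation) would point to a native crash in the backend rather than a clean shutdown.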

I think I may have figured out the issue: I have an AMD APU, and by default llama-server tries to use this GPU through the Vulkan backend. Evidently this breaks this specific model.

Adding the "-ngl 0" parameter to the llama-server launch command switches to the full CPU backend and allows this to work, albeit very slowly. Are there any plans to support all llama.cpp backends?
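
For anyone who wants to reproduce the CPU-only workaround with a manually started server, here is a sketch that assembles the launch command with `-ngl 0`. The file names and port are placeholders, and `build_server_command` is a hypothetical helper; only the `-m`, `--mmproj`, `-ngl` and `--port` flags of llama-server are used.

```python
def build_server_command(
    model: str,
    mmproj: str,
    gpu_layers: int = 0,
    port: int = 8080,
    executable: str = "llama-server",
) -> list[str]:
    """Build a llama-server command line; gpu_layers=0 keeps all layers on the CPU."""
    return [
        executable,
        "-m", model,
        "--mmproj", mmproj,
        "-ngl", str(gpu_layers),
        "--port", str(port),
    ]


# Placeholder file names; substitute the real GGUF and mmproj paths.
print(" ".join(build_server_command("surya-2.gguf", "mmproj.gguf")))
```

The resulting list can be passed directly to `subprocess.Popen`, after which SURYA_INFERENCE_URL can be pointed at the chosen port.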

| 5-10 seconds for a single page is not unreasonable - you will start to see better throughput as you send more concurrent requests
What does that mean: load the model multiple times, each in its own thread, or pass ~5 images at once?
Any suggestions for the model instruction ("FULL OCR", ...)? What if I also need descriptions of the images on a text page? It does that sometimes, but it could do more.
Any suggestions for the model temperature, or the optimal image size?
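
One reading of "more concurrent requests" is to keep a single llama-server running and send it several pages at the same time, rather than loading the model several times. The sketch below assumes exactly that: an OpenAI-compatible `/v1/chat/completions` endpoint on a local llama-server that was started with enough parallel slots. The prompt and `temperature: 0` are copied from the server log earlier in this thread; the address, the JPEG assumption and both function names are illustrative only.

```python
import base64
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

# Prompt as it appears in the llama-server log above.
PROMPT = ("OCR this image to HTML. Each block is a div with data-label and "
          "data-bbox (x0 y0 x1 y1, normalized 0-1000).")


def ocr_page(image_path: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send one JPEG page to the server and return the generated text."""
    with open(image_path, "rb") as fh:
        data_uri = "data:image/jpeg;base64," + base64.b64encode(fh.read()).decode("ascii")
    payload = {
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": PROMPT},
        ]}],
        "temperature": 0,  # the log shows temp = 0.000 for Surya's own requests
        "max_tokens": 3000,
    }
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def run_concurrently(
    items: Iterable[T], worker: Callable[[T], R], max_in_flight: int = 4
) -> tuple[list[R], float]:
    """Run worker over items with up to max_in_flight calls at once.

    Returns the results in input order and the measured items per second.
    """
    items = list(items)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        results = list(pool.map(worker, items))
    elapsed = time.perf_counter() - start
    return results, (len(items) / elapsed if elapsed > 0 else float("inf"))


# Usage (needs a running server):
#   texts, pages_per_s = run_concurrently(["p1.jpg", "p2.jpg"], ocr_page, max_in_flight=2)
```

Per-page latency will probably not improve this way, and may get worse; the possible gain is in pages per second across the whole batch.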
