Works great on Claude Code x llama-server!

#1
by bukit

TLDR: Q8_0 passes all tests without any hiccups. A really underrated model on Hugging Face: it beats Qwen3.5 9B and Gemma4 E4B for agentic AI coding!


Wishlist: a LocoOperator based on Qwen3.5 9B or Gemma4.

Great work LocoreMind!


Here are tests 6 through 10. These tests push beyond basic file editing and probe the advanced reasoning, external-environment interaction, and multi-step logic of your local model.

If your model can pass these, it is performing at the level of models 3x to 8x its size.


Test 6: The External Dependency Test

Goal: See if the model can write code that requires third-party packages, realize it needs to install them, and execute the installation command in your terminal. Small models often hallucinate that packages are already installed or panic when an ImportError occurs.

Your Setup:
Ensure your terminal is in your test folder. (If you are using a Python virtual environment, activate it first so pip doesn't install packages globally.)

Prompt to the Agent:

"Write a Python script called fetch_cat.py that uses the external requests library to fetch a random cat fact from https://catfact.ninja/fact and prints it. If the requests library is not installed, use pip to install it. Then run the script."

Pass condition:

  1. It writes the script.
  2. It either proactively runs pip install requests OR it runs the script, gets a ModuleNotFoundError, reads the error, runs pip install, and retries.
  3. It successfully outputs a cat fact.
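
For reference, here is a minimal sketch of what a passing fetch_cat.py could look like (the agent's version may differ; the try/except fallback mirrors pass condition 2):

# fetch_cat.py - minimal sketch of a passing solution.
# On ModuleNotFoundError, install requests into the current
# environment and retry the import (pass condition 2).
import subprocess
import sys

try:
    import requests
except ModuleNotFoundError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "requests"])
    import requests

# catfact.ninja returns JSON like {"fact": "...", "length": ...}
response = requests.get("https://catfact.ninja/fact", timeout=10)
response.raise_for_status()
print(response.json()["fact"])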

Test 7: Data Parsing & Mutation (JSON)

Goal: Test if the model can read a structured data file, understand its schema, write code to mutate it, and output a new file without hallucinating or losing data.

Your Setup:
Create a file named users.json and paste this exactly:

[
  {"name": "Alice", "age": 25, "status": "active"},
  {"name": "Bob", "age": 17, "status": "active"},
  {"name": "Charlie", "age": 30, "status": "inactive"},
  {"name": "Diana", "age": 22, "status": "active"}
]

Prompt to the Agent:

"Read users.json. Write a script called filter.py that loads this data, removes any user who is under 18 OR who has an 'inactive' status, and saves the remaining users to a new file called valid_users.json. Run the script, and then read valid_users.json to prove it worked."

Pass condition: It correctly writes the script, runs it, and reads the output file. The final valid_users.json should only contain Alice and Diana. (Small models often mess up the boolean logic: under 18 OR inactive; see the sketch below.)
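
For reference, a sketch of a correct filter.py. The key step is the negation: removing users who are under 18 OR inactive means keeping users who are 18 or older AND active (De Morgan's law), which is exactly where the boolean logic tends to go wrong:

# filter.py - sketch of a correct solution.
# Remove (age < 18 OR inactive)  ==  keep (age >= 18 AND active).
import json

with open("users.json") as f:
    users = json.load(f)

valid = [u for u in users if u["age"] >= 18 and u["status"] == "active"]

with open("valid_users.json", "w") as f:
    json.dump(valid, f, indent=2)

print(valid)  # expected: only Alice and Diana remain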


Test 8: Nested File Architecture

Goal: Test if the agent can use OS-level tools to create directories (folders) and manage relative paths. Small models often create everything in the root folder because they struggle with mkdir commands.

Your Setup:
An empty folder.

Prompt to the Agent:

"Scaffold a basic web project. Create a folder called public. Inside public, create an index.html file. Also inside public, create two more folders: css and js. Create styles.css in the css folder, and app.js in the js folder. Finally, make sure the index.html file links to both the css and js files using correct relative paths."

Pass condition: The model uses mkdir (or a Python script with os.makedirs) to create the nested folders. If you open public/index.html, it must have <link rel="stylesheet" href="css/styles.css"> and <script src="js/app.js"></script>.
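
For reference, the same scaffold as a Python script (an agent may instead use mkdir -p and touch). The links resolve relative to index.html's own location inside public, so they must not include a public/ prefix:

# scaffold.py - sketch of the expected structure.
import os

# Two calls produce public/, public/css/ and public/js/,
# since os.makedirs builds intermediate directories.
os.makedirs("public/css", exist_ok=True)
os.makedirs("public/js", exist_ok=True)

open("public/css/styles.css", "w").close()
open("public/js/app.js", "w").close()

with open("public/index.html", "w") as f:
    f.write(
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '  <link rel="stylesheet" href="css/styles.css">\n'
        "</head>\n<body>\n"
        '  <script src="js/app.js"></script>\n'
        "</body>\n</html>\n"
    )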


Test 9: Log Analysis & Regex Extraction

Goal: See if the model can parse unstructured text, write precise extraction logic (regex or string matching), and count results.

Your Setup:
Create a file named server.log and paste this:

[INFO] 10:00:01 - Server started successfully.
[WARN] 10:05:22 - Memory usage at 80%
[ERROR] 10:06:01 - Connection timeout from IP 192.168.1.50
[INFO] 10:07:15 - User 'admin' logged in.
[ERROR] 10:10:44 - Database query failed: Syntax error.
[WARN] 10:12:00 - High latency detected.
[ERROR] 10:15:30 - Disk space critically low.

Prompt to the Agent:

"Analyze server.log. Write a shell command or a Python script to extract only the lines that contain '[ERROR]'. Save those lines into a new file called critical_errors.txt. Then, tell me exactly how many errors there were."

Pass condition: The model writes a script or uses grep -F '[ERROR]' server.log > critical_errors.txt (note the -F: without it, the brackets form a regex character class matching any single E, R, or O, so the WARN lines would slip in too). It must report back that there are exactly 3 errors. (Small models often hallucinate the count or include the WARN lines by mistake.)
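
A Python equivalent, for reference; plain substring matching sidesteps the regex-escaping question entirely:

# count_errors.py - sketch using literal substring matching,
# so "[ERROR]" matches exactly and WARN lines are excluded.
with open("server.log") as f:
    errors = [line for line in f if "[ERROR]" in line]

with open("critical_errors.txt", "w") as f:
    f.writelines(errors)

print(f"Found {len(errors)} errors.")  # expected: Found 3 errors.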


Test 10: Multi-File Bug Tracing (The Final Boss)

Goal: This tests deep context window retention. The model has to trace a stack trace across three different files, find the root cause (which is in a different file than where the crash happens), and fix it.

Your Setup:
Create three files exactly as written below.

File 1: config.py

# The bug is here: max_retries should be an integer, not a string
SETTINGS = {
    "max_retries": "3",
    "timeout": 10
}

File 2: processor.py

from config import SETTINGS

def process_data(data):
    retries = SETTINGS["max_retries"]
    # It will crash here when it tries to add an int to a string
    target_attempts = retries + 1 
    return f"Processing {data} with {target_attempts} attempts allowed."

File 3: main.py

from processor import process_data

if __name__ == "__main__":
    print("Starting application...")
    result = process_data("Test Payload")
    print(result)
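
With the files as written above, running main.py should fail with a traceback along these lines (exact line numbers and decorations vary by Python version); this is the trail the agent has to follow from main.py into processor.py:

Traceback (most recent call last):
  File "main.py", line 5, in <module>
    result = process_data("Test Payload")
  File "processor.py", line 6, in process_data
    target_attempts = retries + 1
TypeError: can only concatenate str (not "int") to str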

Prompt to the Agent:

"Run main.py. It will crash with a TypeError. Follow the stack trace, read the connected files to find the root cause, fix the bug, and run main.py again until it succeeds."

Pass condition:

  1. It runs main.py and sees the error happens in processor.py.
  2. It looks at processor.py and sees retries comes from config.py.
  3. It opens config.py, changes "3" to 3 (removes the quotes).
  4. It re-runs main.py and succeeds.
    (Note: A common failure for small models is to "hack" the fix by changing processor.py to target_attempts = int(retries) + 1. While technically functional, a truly smart agent will fix the root cause in config.py).
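
For reference, the root-cause fix from pass condition 3 is a one-character change in config.py:

# config.py - after the fix: max_retries is an integer again,
# so retries + 1 in processor.py works with no changes there.
SETTINGS = {
    "max_retries": 3,
    "timeout": 10
}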

How to evaluate 6-10:

If your local model completes Test 10, you have an absolute powerhouse of a local setup. Multi-file debugging is currently the benchmark that separates standard open-source models from flagship models like GPT-4o or Claude 3.5 Sonnet.

For reference, the llama-server command used for these tests:

llama-server --host 0.0.0.0 --port 9099 -ngl 99 -fa on -c 65536 --kv-unified --fit on --cache-type-k q8_0 --cache-type-v q8_0 --jinja --api-key "your-llama-api-key" -m Z:\path\to\LocoOperator-4B.Q8_0.gguf --reasoning-format auto

LocoreMind org

Hi @bukit,

Thank you so much for the hardcore Agent tests! It's awesome to see the 4B general model nail cross-file debugging and OS interactions.

I noticed your wishlist for a Qwen3.5 9B model. I actually just released a new 9B model, but with a very specific focus: CoPaw-Flash-9B-DataAnalyst-LoRA.

Link: https://huggingface.co/jason1966/CoPaw-Flash-9B-DataAnalyst-LoRA

It is specifically designed as an Agentic Data Analyst. Instead of general software development, it is heavily trained to autonomously load datasets (CSV/Excel/JSON), write Python to perform EDA, generate charts, and summarize insights.

It averages 26 continuous, autonomous iterations to complete a full data pipeline with zero human intervention.

If you ever need an agent to crunch data or want to test an autonomous data workflow, I’d love for you to give this 9B analyst a spin! Thanks again for the huge support!
