Works great on Claude Code x llama-server!
TLDR: Q8_0 passes all tests without any hiccups. Really underrated model on Hugging Face; it beats Qwen3.5 9B and Gemma4 E4B for agentic AI coding!
Wishlist: LocoOperator based on Qwen3.5 9B or Gemma4.
Great work LocoreMind!
Here are tests 6 through 10. These tests push beyond basic file editing and test the advanced reasoning, external environment interaction, and multi-step logic of your local 9B model.
If your model can pass these, it is performing at the level of models 3x to 8x its size.
Test 6: The External Dependency Test
Goal: See if the model can write code that requires third-party packages, realize it needs to install them, and execute the installation command in your terminal. Small models often hallucinate that packages are already installed or panic when an ImportError occurs.
Your Setup:
Ensure your terminal is in your test folder. (If you are using Python in a virtual environment, activate it first so it doesn't install globally).
Prompt to the Agent:
"Write a Python script called `fetch_cat.py` that uses the external `requests` library to fetch a random cat fact from https://catfact.ninja/fact and prints it. If the `requests` library is not installed, use pip to install it. Then run the script."
Pass condition:
- It writes the script.
- It either proactively runs `pip install requests` OR it runs the script, gets a `ModuleNotFoundError`, reads the error, runs `pip install`, and retries.
- It successfully outputs a cat fact.
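For reference, here is a minimal sketch of the kind of `fetch_cat.py` a passing agent might produce. The helper name `extract_fact` and the self-install fallback are illustrative assumptions, as is the payload shape (the API is assumed to return JSON with a `"fact"` key):

```python
import subprocess
import sys

def extract_fact(payload: dict) -> str:
    # Assumed payload shape: {"fact": "...", "length": 123}
    return payload["fact"]

def fetch_and_print():
    try:
        import requests  # third-party; may be absent on a fresh machine
    except ImportError:
        # Recover the way the test expects: install, then import again.
        subprocess.check_call([sys.executable, "-m", "pip", "install", "requests"])
        import requests
    resp = requests.get("https://catfact.ninja/fact", timeout=10)
    resp.raise_for_status()
    print(extract_fact(resp.json()))
```

Saved as a script, it just needs a trailing `fetch_and_print()` call.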
Test 7: Data Parsing & Mutation (JSON)
Goal: Test if the model can read a structured data file, understand its schema, write code to mutate it, and output a new file without hallucinating or losing data.
Your Setup:
Create a file named `users.json` and paste this exactly:
```json
[
  {"name": "Alice", "age": 25, "status": "active"},
  {"name": "Bob", "age": 17, "status": "active"},
  {"name": "Charlie", "age": 30, "status": "inactive"},
  {"name": "Diana", "age": 22, "status": "active"}
]
```
Prompt to the Agent:
"Read `users.json`. Write a script called `filter.py` that loads this data, removes any user who is under 18 OR who has an 'inactive' status, and saves the remaining users to a new file called `valid_users.json`. Run the script, and then read `valid_users.json` to prove it worked."
Pass condition: It correctly writes the script, runs it, and reads the output file. The final `valid_users.json` should only contain Alice and Diana. (9B models often mess up the boolean logic: under 18 OR inactive).
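The boolean trap here is the De Morgan flip: *removing* users who are under 18 OR inactive means *keeping* users who are 18+ AND active. A sketch of a passing `filter.py` (function names are illustrative):

```python
import json

def filter_users(users):
    # "Remove under-18 OR inactive" == "keep age >= 18 AND status == 'active'"
    return [u for u in users if u["age"] >= 18 and u["status"] == "active"]

def main(src="users.json", dst="valid_users.json"):
    with open(src) as f:
        users = json.load(f)
    with open(dst, "w") as f:
        json.dump(filter_users(users), f, indent=2)
```

On the sample data this keeps exactly Alice and Diana.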
Test 8: Nested File Architecture
Goal: Test if the agent can use OS-level tools to create directories (folders) and manage relative paths. Small models often create everything in the root folder because they struggle with mkdir commands.
Your Setup:
An empty folder.
Prompt to the Agent:
"Scaffold a basic web project. Create a folder called `public`. Inside `public`, create an `index.html` file. Also inside `public`, create two more folders: `css` and `js`. Create `styles.css` in the `css` folder, and `app.js` in the `js` folder. Finally, make sure the `index.html` file links to both the css and js files using correct relative paths."
Pass condition: The model uses `mkdir` (or a Python script with `os.makedirs`) to create the nested folders. If you open `public/index.html`, it must have `<link rel="stylesheet" href="css/styles.css">` and `<script src="js/app.js"></script>`.
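The expected scaffold can be sketched with `os.makedirs`. The `scaffold()` helper and exact HTML are my own illustrations; only the two relative link paths are required by the pass condition:

```python
import os

INDEX_HTML = """<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="css/styles.css">
</head>
<body>
  <script src="js/app.js"></script>
</body>
</html>
"""

def scaffold(root="public"):
    # makedirs creates nested directories in one call; exist_ok avoids
    # crashing if the agent reruns the step.
    os.makedirs(os.path.join(root, "css"), exist_ok=True)
    os.makedirs(os.path.join(root, "js"), exist_ok=True)
    with open(os.path.join(root, "index.html"), "w") as f:
        f.write(INDEX_HTML)
    # Empty placeholder assets.
    for rel in ("css/styles.css", "js/app.js"):
        open(os.path.join(root, rel), "w").close()
```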
Test 9: Log Analysis & Regex Extraction
Goal: See if the model can parse unstructured text, write precise extraction logic (regex or string matching), and count results.
Your Setup:
Create a file named `server.log` and paste this:
```
[INFO] 10:00:01 - Server started successfully.
[WARN] 10:05:22 - Memory usage at 80%
[ERROR] 10:06:01 - Connection timeout from IP 192.168.1.50
[INFO] 10:07:15 - User 'admin' logged in.
[ERROR] 10:10:44 - Database query failed: Syntax error.
[WARN] 10:12:00 - High latency detected.
[ERROR] 10:15:30 - Disk space critically low.
```
Prompt to the Agent:
"Analyze `server.log`. Write a shell command or a Python script to extract only the lines that contain '[ERROR]'. Save those lines into a new file called `critical_errors.txt`. Then, tell me exactly how many errors there were."
Pass condition: The model writes a script or uses `grep -F '[ERROR]' server.log > critical_errors.txt` (the `-F` flag, or backslash-escaped brackets, matters: an unescaped `[ERROR]` is a regex character class matching any single one of those letters, which hits nearly every line). It must report back that there are exactly 3 errors. (Small models often hallucinate the count or include the WARN lines by mistake).
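In Python the extraction step is a plain substring filter, which sidesteps the regex-escaping pitfall entirely (function names here are illustrative):

```python
def extract_errors(log_text):
    # Plain substring match: '[ERROR]' needs no escaping here,
    # unlike in a grep pattern where [ERROR] is a character class.
    return [line for line in log_text.splitlines() if "[ERROR]" in line]

def save_errors(src="server.log", dst="critical_errors.txt"):
    with open(src) as f:
        errors = extract_errors(f.read())
    with open(dst, "w") as f:
        f.write("\n".join(errors) + "\n")
    return len(errors)  # the count the agent must report
```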
Test 10: Multi-File Bug Tracing (The Final Boss)
Goal: This tests deep context window retention. The model has to trace a stack trace across three different files, find the root cause (which is in a different file than where the crash happens), and fix it.
Your Setup:
Create three files exactly as written below.
File 1: `config.py`
```python
# The bug is here: max_retries should be an integer, not a string
SETTINGS = {
    "max_retries": "3",
    "timeout": 10
}
```
File 2: `processor.py`
```python
from config import SETTINGS

def process_data(data):
    retries = SETTINGS["max_retries"]
    # It will crash here when it tries to add an int to a string
    target_attempts = retries + 1
    return f"Processing {data} with {target_attempts} attempts allowed."
```
File 3: `main.py`
```python
from processor import process_data

if __name__ == "__main__":
    print("Starting application...")
    result = process_data("Test Payload")
    print(result)
```
Prompt to the Agent:
"Run `main.py`. It will crash with a `TypeError`. Follow the stack trace, read the connected files to find the root cause, fix the bug, and run `main.py` again until it succeeds."
Pass condition:
- It runs `main.py` and sees the error happens in `processor.py`.
- It looks at `processor.py` and sees `retries` comes from `config.py`.
- It opens `config.py`, changes `"3"` to `3` (removes the quotes).
- It re-runs `main.py` and succeeds.

(Note: A common failure for small models is to "hack" the fix by changing `processor.py` to `target_attempts = int(retries) + 1`. While technically functional, a truly smart agent will fix the root cause in `config.py`.)
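A minimal, self-contained reproduction of the bug and the root-cause fix, with the two settings dicts standing in for the before/after versions of `config.py`:

```python
SETTINGS_BUGGY = {"max_retries": "3", "timeout": 10}  # bug: str, not int
SETTINGS_FIXED = {"max_retries": 3, "timeout": 10}    # root-cause fix

def process_data(data, settings):
    retries = settings["max_retries"]
    # TypeError here when retries is a string ("3" + 1 is invalid).
    target_attempts = retries + 1
    return f"Processing {data} with {target_attempts} attempts allowed."

try:
    process_data("Test Payload", SETTINGS_BUGGY)
except TypeError as exc:
    print(f"crash reproduced: {exc}")

print(process_data("Test Payload", SETTINGS_FIXED))
```

Fixing the config keeps `processor.py`'s assumption (that `max_retries` is numeric) honest, whereas wrapping it in `int()` merely papers over the bad data at one call site.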
How to evaluate 6-10:
If your 9B model completes Test 10, you have an absolute powerhouse of a local setup. Multi-file debugging is currently the benchmark that separates standard open-source models from flagship models like GPT-4o or Claude 3.5 Sonnet.
```shell
llama-server --host 0.0.0.0 --port 9099 -ngl 99 -fa on -c 65536 --kv-unified --fit on --cache-type-k q8_0 --cache-type-v q8_0 --jinja --api-key "your-llama-api-key" -m Z:\path\to\LocoOperator-4B.Q8_0.gguf --reasoning auto
```
Hi @bukit ,
Thank you so much for the hardcore Agent tests! It's awesome to see the 4B general model nail cross-file debugging and OS interactions.
I noticed your wishlist for a Qwen3.5 9B model. I actually just released a new 9B model, but with a very specific focus: CoPaw-Flash-9B-DataAnalyst-LoRA.
Link: https://huggingface.co/jason1966/CoPaw-Flash-9B-DataAnalyst-LoRA
It is specifically designed as an Agentic Data Analyst. Instead of general software development, it is heavily trained to autonomously load datasets (CSV/Excel/JSON), write Python to perform EDA, generate charts, and summarize insights.
It averages 26 continuous, autonomous iterations to complete a full data pipeline with zero human intervention.
If you ever need an agent to crunch data or want to test an autonomous data workflow, I’d love for you to give this 9B analyst a spin! Thanks again for the huge support!
