## Caching and reloading responses

This guide explains how to enable and use the built-in JSONL cache API in `lmms-eval` so repeated runs can reload model responses instead of re-calling the model. It also notes an optional legacy SQLite cache wrapper.

### What gets cached

- **Scope**: Per model instance and per task.
- **Unit**: One record per document (`doc_id`) with the final string response.
- **Files**: One JSONL file per task and process shard.

The cache is implemented in `lmms_eval.api.model.lmms` via:

- `load_cache()` and `load_jsonl_cache()` to load cached responses at startup
- `get_response_from_cache()` to split incoming requests into "already cached" vs "not cached"
- `add_request_response_to_cache()` to append new results as they are produced

Models that already call these APIs (for example `async_openai_compatible_chat`) benefit from caching without any code changes in user scripts. If you are writing your own model class, call these APIs in its `generate_until` to cache and reload responses.

### Minimal example (inside a model's `generate_until`)

```python
def generate_until(self, requests):
    self.load_cache()
    # Split requests into those already answered in the cache and those still pending.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        self.add_request_response_to_cache(req, out)  # persist for future runs
        results.append(out)
    return results
```

### Enable the cache

Set an environment variable before running:

```bash
export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"
```

Nothing else is required. When enabled, the model will: 1) load existing JSONL cache files at startup; 2) serve responses from the cache; 3) append newly generated responses back to the JSONL files.

### Where cache files live

- Base directory: `${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_identity>/`
- File name per task and process shard: `{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Record format per line:

```json
{"doc_id": <doc_id>, "response": "<response string>"}
```

Notes:

- The `<model_identity>` directory name is derived from a best-effort human-readable model identity (e.g., `model_version`) and the set of task names attached to the model, to avoid collisions.
- Separate files per `rank` and `world_size` make distributed runs safe to cache concurrently.

### How it works at runtime

For models wired to the cache API (e.g., `async_openai_compatible_chat`):

- At the beginning of `generate_until(...)` the model calls `load_cache()` and then `get_response_from_cache(requests)`.
- Cached items are returned immediately; only the remaining requests are forwarded to the backend.
- After each response is produced, `add_request_response_to_cache(...)` appends a JSONL record.

The cache key is the tuple `(task_name, doc_id)`. Ensure your task produces stable `doc_id`s across runs.

### Example: use with async_openai_compatible_chat

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"    # if your server allows it
export LMMS_EVAL_USE_CACHE=True  # enable JSONL cache
# optional: export LMMS_EVAL_HOME to relocate the cache root

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
    --tasks <task_name> \
    --batch_size 1 \
    --output_path ./logs/
```

On a second run with the same task and documents, cached responses are loaded and only the missing documents call the model.

### Inspect or clear the cache

- Inspect: open the task JSONL file(s) under the model's cache directory and view the records, or scan them programmatically (see the sketch below).
- Clear: delete the corresponding JSONL file(s), or the entire `<model_identity>` directory, to force re-evaluation.
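For a quick programmatic check of what is already cached, you can walk the cache directory and count records per shard file. This is a small illustrative sketch, not part of the `lmms-eval` API; it only assumes the directory layout and record format described above:

```python
import json
import os
from pathlib import Path

# Cache root as described above; LMMS_EVAL_HOME overrides the default location.
cache_root = Path(os.environ.get("LMMS_EVAL_HOME", str(Path.home() / ".cache" / "lmms-eval"))) / "eval_cache"

# Each model directory holds one JSONL file per task and process shard.
for jsonl_file in sorted(cache_root.glob("*/*.jsonl")):
    with open(jsonl_file) as f:
        doc_ids = {json.loads(line)["doc_id"] for line in f if line.strip()}
    print(f"{jsonl_file.relative_to(cache_root)}: {len(doc_ids)} cached responses")
```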
### Notes and limitations

- The JSONL cache is keyed by `task_name` and `doc_id`. Changing task names or document IDs invalidates reuse.
- Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
- Distributed runs write to per-rank files to avoid contention; reusing the cache works across single- and multi-GPU runs as long as `task_name`/`doc_id` match.

### Optional: legacy SQLite cache wrapper

There is also a separate, optional wrapper, `CachingLMM` (see `lmms_eval.api.model.CachingLMM`), that caches responses by hashing the entire call arguments into a SQLite database (via `SqliteDict`). It is independent of the JSONL cache above and can be useful for broader API-level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
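If you do want to try the SQLite wrapper, the sketch below shows the general shape. It assumes `CachingLMM` follows the usual wrapper pattern of taking an instantiated model plus a SQLite file path (consistent with the hashing-to-`SqliteDict` description above); verify the exact constructor signature in `lmms_eval.api.model` before relying on it.

```python
from lmms_eval.api.model import CachingLMM


def wrap_with_sqlite_cache(lm, cache_db_path: str):
    """Wrap an instantiated lmms-eval model so that repeated calls with
    identical arguments are answered from the SQLite DB instead of the model.

    The (model, db-path) constructor shown here is an assumption based on the
    description above; check lmms_eval.api.model.CachingLMM for the real signature.
    """
    return CachingLMM(lm, cache_db_path)  # e.g. cache_db_path="./api_cache.db"
```

Because this wrapper hashes the full call arguments, any change to generation parameters produces a new cache entry, unlike the JSONL cache, which is keyed only by `(task_name, doc_id)`.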