## Caching and reloading responses

This guide explains how to enable and use the built-in JSONL cache API in `lmms-eval` so repeated runs can reload model responses instead of re-calling the model. It also notes an optional legacy SQLite cache wrapper.

### What gets cached

- **Scope**: Per model instance and per task.
- **Unit**: One record per document (`doc_id`) with the final string response.
- **Files**: One JSONL file per task and process shard.

The cache is implemented in `lmms_eval.api.model.lmms` via:

- `load_cache()` and `load_jsonl_cache()` to load cached responses at startup
- `get_response_from_cache()` to split incoming requests into "already cached" vs "not cached"
- `add_request_response_to_cache()` to append new results as they are produced

Models that already call these APIs (for example `async_openai_compatible_chat`) benefit from caching without any code changes in user scripts. If you are writing your own model class, call these APIs in its `generate_until` to cache and reload responses.

### Minimal example (inside a model's `generate_until`)

```python
def generate_until(self, requests):
    self.load_cache()
    # Split requests into those already answered in the cache and those still pending.
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        self.add_request_response_to_cache(req, out)  # persist for future runs
        results.append(out)
    return results
```

### Enable the cache

Set an environment variable before running:

```bash
export LMMS_EVAL_USE_CACHE=True
# optional: set the base directory for caches (defaults to ~/.cache/lmms-eval)
export LMMS_EVAL_HOME="/path/to/cache_root"
```

Nothing else is required. When enabled, the model will: 1) load existing JSONL cache files at startup; 2) serve responses from the cache; 3) append newly generated responses back to the JSONL files.

### Where cache files live

- Base directory: `${LMMS_EVAL_HOME:-~/.cache/lmms-eval}/eval_cache/<model_identity>/`
- File name per task and process shard: `{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Record format per line:

```json
{"doc_id": <doc_id>, "response": "<response string>"}
```

Notes:

- The `<model_identity>` directory name is derived from a best-effort human-readable model identity (e.g., `model_version`) and the set of task names attached to the model, to avoid collisions.
- Separate files per `rank` and `world_size` make distributed runs safe to cache concurrently.

### How it works at runtime

For models wired to the cache API (e.g., `async_openai_compatible_chat`):

- At the beginning of `generate_until(...)` the model calls `load_cache()` and then `get_response_from_cache(requests)`.
- Cached items are returned immediately; only the remaining requests are forwarded to the backend.
- After each response is produced, `add_request_response_to_cache(...)` appends a JSONL record.

The cache key is the tuple `(task_name, doc_id)`. Ensure your task produces stable `doc_id`s across runs.

### Example: use with async_openai_compatible_chat

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"    # if your server allows it
export LMMS_EVAL_USE_CACHE=True  # enable JSONL cache
# optional: export LMMS_EVAL_HOME to relocate the cache root

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=grok-2-latest,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY \
    --tasks <task_name> \
    --batch_size 1 \
    --output_path ./logs/
```

On a second run with the same task and documents, cached responses are loaded and only the missing documents call the model.

### Inspect or clear the cache

- Inspect: open the task JSONL file(s) under the model's cache directory and view the records, or scan them programmatically (see the sketch below).
- Clear: delete the corresponding JSONL file(s), or the entire `<model_identity>` directory, to force re-evaluation.
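For a quick programmatic check of what is already cached, you can walk the cache directory and count records per shard file. This is a small illustrative sketch, not part of the `lmms-eval` API; it only assumes the directory layout and record format described above:

```python
import json
import os
from pathlib import Path

# Cache root as described above; LMMS_EVAL_HOME overrides the default location.
cache_root = Path(os.environ.get("LMMS_EVAL_HOME", str(Path.home() / ".cache" / "lmms-eval"))) / "eval_cache"

# Each model directory holds one JSONL file per task and process shard.
for jsonl_file in sorted(cache_root.glob("*/*.jsonl")):
    with open(jsonl_file) as f:
        doc_ids = {json.loads(line)["doc_id"] for line in f if line.strip()}
    print(f"{jsonl_file.relative_to(cache_root)}: {len(doc_ids)} cached responses")
```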
### Notes and limitations

- The JSONL cache is keyed by `task_name` and `doc_id`. Changing task names or document IDs invalidates reuse.
- Responses are cached as final strings. If your model emits intermediate tool calls, the final message (including any inline annotations) is what gets cached.
- Distributed runs write to per-rank files to avoid contention; reusing the cache works across single- and multi-GPU runs as long as `task_name`/`doc_id` match.

### Optional: legacy SQLite cache wrapper

There is also a separate, optional wrapper, `CachingLMM` (see `lmms_eval.api.model.CachingLMM`), that caches responses by hashing the entire call arguments into a SQLite database (via `SqliteDict`). It is independent of the JSONL cache above and can be useful for broader API-level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
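If you do want to try the SQLite wrapper, the sketch below shows the general shape. It assumes `CachingLMM` follows the usual wrapper pattern of taking an instantiated model plus a SQLite file path (consistent with the hashing-to-`SqliteDict` description above); verify the exact constructor signature in `lmms_eval.api.model` before relying on it.

```python
from lmms_eval.api.model import CachingLMM


def wrap_with_sqlite_cache(lm, cache_db_path: str):
    """Wrap an instantiated lmms-eval model so that repeated calls with
    identical arguments are answered from the SQLite DB instead of the model.

    The (model, db-path) constructor shown here is an assumption based on the
    description above; check lmms_eval.api.model.CachingLMM for the real signature.
    """
    return CachingLMM(lm, cache_db_path)  # e.g. cache_db_path="./api_cache.db"
```

Because this wrapper hashes the full call arguments, any change to generation parameters produces a new cache entry, unlike the JSONL cache, which is keyed only by `(task_name, doc_id)`.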