# Tool Description A/B Setup

This harness benchmarks how tool-description quality affects tool-use quality and intent capture.

All commands below assume they are run from the repo root.

## Files added

- `scripts/eval_tool_description_ab.py`
- `scripts/tool_description_variants.json`
- outputs go to `docs/tool_description_eval/`
- generated variant cards go to `.fast-agent/evals/tool_desc_ab/cards/<variant>/`

## What it varies

For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set in which it updates:

1. the `description` field in the `hf_hub_community.md` frontmatter
2. the docstring of the `hf_api_request` function in `hf_api_tool.py`

It then runs the same prompts across the selected models.

Execution modes:

- **Direct (default):** runs `hf_hub_community` directly (best for endpoint-level scoring).
- **Indirect (`--indirect`):** runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.

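To make the first override concrete, here is a minimal sketch of how a frontmatter `description` could be swapped in a card. It is not the script's actual code: the helper name `replace_frontmatter_description`, the sample `name` field, and the assumption that the card opens with a `---`-delimited block containing a single-line `description` are all hypothetical.

```python
import json
import re


def replace_frontmatter_description(card_text: str, new_description: str) -> str:
    """Return card_text with the frontmatter `description:` value replaced.

    Assumption (not confirmed by the harness): the card starts with a
    `---` delimited frontmatter block and `description` fits on one line.
    """
    match = re.match(r"\A---\n(.*?)\n---\n", card_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("card has no frontmatter block")
    frontmatter = match.group(1)
    # json.dumps produces a double-quoted string, which is also valid YAML.
    new_line = f"description: {json.dumps(new_description)}"
    # A callable replacement avoids backslash-escape handling in the new text.
    updated, count = re.subn(
        r"(?m)^description:.*$", lambda _m: new_line, frontmatter, count=1
    )
    if count == 0:
        # No existing description line: append one.
        updated = frontmatter + "\n" + new_line
    return card_text[: match.start(1)] + updated + card_text[match.end(1):]


# Hypothetical card content, for illustration only.
card = "---\nname: hf_hub_community\ndescription: old text\n---\nBody\n"
print(replace_frontmatter_description(card, "Query the Hub API"))
```

Writing the result to a per-variant directory rather than editing the base card in place keeps the base cards untouched between variants.
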
## Metrics collected

Per run:
- return code
- whether the tool was called
- endpoint call count
- first endpoint used
- first-call correctness (challenge-aware heuristics)
- challenge score (reusing `score_hf_hub_community_challenges.py` when available)

Aggregates by `(variant, model)`:
- success rate
- tool-use rate
- average endpoint calls
- first-call OK rate
- average total score

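The per-run to per-`(variant, model)` roll-up listed above can be sketched as follows. The record keys (`returncode`, `tool_called`, `endpoint_calls`, `first_call_ok`, `score_total`) and the definition of success as a zero return code are assumptions made for illustration; the actual field names in the detailed output may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate_runs(runs):
    """Group per-run records by (variant, model) and average each metric.

    All dictionary keys here are hypothetical, chosen to mirror the metric
    names listed above.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[(run["variant"], run["model"])].append(run)

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "n_runs": len(rows),
            # Assumed: a run succeeds when its return code is 0.
            "success_rate": mean(1.0 if r["returncode"] == 0 else 0.0 for r in rows),
            "tool_use_rate": mean(1.0 if r["tool_called"] else 0.0 for r in rows),
            "avg_endpoint_calls": mean(r["endpoint_calls"] for r in rows),
            "first_call_ok_rate": mean(1.0 if r["first_call_ok"] else 0.0 for r in rows),
            "avg_score_total": mean(r["score_total"] for r in rows),
        }
    return summary


# Two made-up runs for one (variant, model) pair.
example_runs = [
    {"variant": "baseline", "model": "gpt-oss", "returncode": 0,
     "tool_called": True, "endpoint_calls": 1, "first_call_ok": True, "score_total": 4},
    {"variant": "baseline", "model": "gpt-oss", "returncode": 1,
     "tool_called": True, "endpoint_calls": 3, "first_call_ok": False, "score_total": 6},
]
print(aggregate_runs(example_runs))
```
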
## Run

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10
```

Multi-model example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini
```

Indirect (single sub-agent tool) example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect
```

## Outputs

- `docs/tool_description_eval/tool_description_ab_detailed.json`
- `docs/tool_description_eval/tool_description_ab_summary.json`
- `docs/tool_description_eval/tool_description_ab_summary.csv`
- `docs/tool_description_eval/tool_description_ab_summary.md`
- `docs/tool_description_eval/tool_description_ab_pairwise.json`
- `docs/tool_description_eval/tool_description_ab_pairwise.csv`

The `--models` flag takes a comma-separated list of model aliases or IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.

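As a rough illustration of the comma-separated `--models` syntax, the sketch below splits such a value into a list. Whether the script uses `argparse`, and the helper name `parse_model_list`, are assumptions; only the flag name and the example value come from this document.

```python
import argparse


def parse_model_list(value: str) -> list[str]:
    """Split a comma-separated model string, dropping blanks and padding."""
    models = [item.strip() for item in value.split(",")]
    return [m for m in models if m]


# Hypothetical parser wiring; the real script's argument handling may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--models", type=parse_model_list, default=[])

args = parser.parse_args(["--models", "gpt-5-mini,haiku,kimi25,glm,grok-4-fast"])
print(args.models)
```
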