	# Tool Description A/B Setup

	This harness benchmarks how the wording of a tool description affects tool-use quality and intent capture.

	All commands below assume they are run from the repo root.

	## Files added

	- `scripts/eval_tool_description_ab.py`
	- `scripts/tool_description_variants.json`
	- Outputs are written to `docs/tool_description_eval/`
	- Generated variant cards are written to `.fast-agent/evals/tool_desc_ab/cards/<variant>/`

	## What it varies

	For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set where it updates:

	1. `hf_hub_community.md` frontmatter `description`
	2. `hf_api_tool.py` function docstring for `hf_api_request`

	It then runs the same prompts against each selected model.
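The frontmatter half of this step can be sketched as below. This is a minimal illustration, not the script's actual code: the helper names `set_frontmatter_description` and `build_variant_cards` are hypothetical, and it assumes the card starts with a `---` delimited frontmatter block whose `description` sits on a single line. Patching the `hf_api_request` docstring in `hf_api_tool.py` is not shown.

```python
import json
import re
import shutil
from pathlib import Path

# Hypothetical helpers for illustration; the real script may differ.
# Assumes frontmatter is delimited by '---' lines at the top of the card.
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def set_frontmatter_description(card_text: str, description: str) -> str:
    """Return card_text with a single-line `description` replaced or added."""
    match = FRONTMATTER_RE.match(card_text)
    if match is None:
        raise ValueError("card has no frontmatter block")
    lines = match.group(1).split("\n")
    # json.dumps yields a double-quoted scalar, which is also valid YAML.
    new_line = "description: " + json.dumps(description)
    for index, line in enumerate(lines):
        if line.startswith("description:"):
            lines[index] = new_line
            break
    else:
        # No existing description key: append one.
        lines.append(new_line)
    body = card_text[match.end():]
    return "---\n" + "\n".join(lines) + "\n---\n" + body


def build_variant_cards(base_dir: Path, out_dir: Path,
                        variant: str, description: str) -> Path:
    """Copy the base cards to out_dir/<variant>/ and patch the card there."""
    target = out_dir / variant
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(base_dir, target)
    card = target / "hf_hub_community.md"
    patched = set_frontmatter_description(
        card.read_text(encoding="utf-8"), description
    )
    card.write_text(patched, encoding="utf-8")
    return target
```

Working on a copy keeps the base cards untouched, so every variant starts from the same baseline.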

	Execution modes:

	- Direct (default): runs `hf_hub_community` directly (best for endpoint-level scoring).
	- Indirect (`--indirect`): runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.

	## Metrics collected

	Per run:
	- return code
	- whether tool was called
	- endpoint call count
	- first endpoint used
	- first-call correctness (challenge-aware heuristics)
	- challenge score (reusing `score_hf_hub_community_challenges.py` when available)

	Aggregates by `(variant, model)`:
	- success rate
	- tool-use rate
	- average endpoint calls
	- first-call OK rate
	- average score total
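As a rough illustration of the aggregation above, the sketch below groups per-run records by `(variant, model)` and computes the listed rates. The record keys (`returncode`, `tool_called`, `endpoint_calls`, `first_call_ok`, `score_total`) are assumed names, not the script's documented schema, and "success" is assumed to mean a return code of 0.

```python
from collections import defaultdict


def aggregate_runs(runs):
    """Group run records by (variant, model) and compute summary metrics.

    The record keys used here are hypothetical and chosen for illustration.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[(run["variant"], run["model"])].append(run)

    summary = {}
    for key, items in groups.items():
        count = len(items)
        # Runs without a challenge score are excluded from the score average.
        scores = [r["score_total"] for r in items
                  if r.get("score_total") is not None]
        summary[key] = {
            "runs": count,
            # Assumption: a run succeeded when its return code is 0.
            "success_rate": sum(r["returncode"] == 0 for r in items) / count,
            "tool_use_rate": sum(bool(r["tool_called"]) for r in items) / count,
            "avg_endpoint_calls": sum(r["endpoint_calls"] for r in items) / count,
            "first_call_ok_rate": sum(bool(r["first_call_ok"]) for r in items) / count,
            "avg_score_total": sum(scores) / len(scores) if scores else None,
        }
    return summary
```

Averaging the score only over runs that have one avoids treating a missing scorer as a score of zero. Whether the real script does the same is not stated here.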

	## Run

	```bash
	python scripts/eval_tool_description_ab.py \
	    --models gpt-oss \
	    --base-cards-dir .fast-agent/tool-cards \
	    --prompts scripts/hf_hub_community_challenges.txt \
	    --variants scripts/tool_description_variants.json \
	    --start 1 --end 10
	```

	Multi-model example:

	```bash
	python scripts/eval_tool_description_ab.py \
	    --models gpt-oss,gpt-5-mini,gpt-4.1-mini
	```

	Indirect (single sub-agent tool) example:

	```bash
	python scripts/eval_tool_description_ab.py \
	    --models gpt-oss \
	    --indirect
	```

	## Outputs

	- `docs/tool_description_eval/tool_description_ab_detailed.json`
	- `docs/tool_description_eval/tool_description_ab_summary.json`
	- `docs/tool_description_eval/tool_description_ab_summary.csv`
	- `docs/tool_description_eval/tool_description_ab_summary.md`
	- `docs/tool_description_eval/tool_description_ab_pairwise.json`
	- `docs/tool_description_eval/tool_description_ab_pairwise.csv`
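To inspect the summary CSV after a run, something like the following may be enough. It is a small sketch that makes no assumption about column names, since the schema is not documented here. `load_summary_rows` and `format_row` are hypothetical helper names.

```python
import csv
from pathlib import Path

# Path taken from the output list above, relative to the repo root.
SUMMARY_CSV = Path("docs/tool_description_eval/tool_description_ab_summary.csv")


def load_summary_rows(path=SUMMARY_CSV):
    """Read the summary CSV into a list of dicts keyed by its header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def format_row(row):
    """Render one row as 'column=value' pairs, whatever the columns are."""
    return ", ".join(f"{name}={value}" for name, value in row.items())


if __name__ == "__main__":
    # Only print when the file exists, i.e. after the harness has been run.
    if SUMMARY_CSV.exists():
        for summary_row in load_summary_rows():
            print(format_row(summary_row))
```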


	The `--models` flag takes a comma-separated list of model aliases or IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.