SupplyMind Final Eval Set
Use this small held-out set for final model comparisons:
| tier | task_id | seed |
|---|---|---|
| easy | v2_train_easy | 131 |
| medium | v2_train_medium | 211 |
| hard | v2_train_hard | 307 |

Compare the same cases across:
- Base center + base warehouse
- SFT center + SFT warehouse
- GRPO center + best warehouse
- Best center + best warehouse
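
The comparison above amounts to running every held-out case under every setup. A minimal sketch of that grid follows, assuming each run can be described by a plain dictionary. The names `EVAL_CASES`, `SETUPS`, and `build_eval_grid`, and the lowercase setup labels, are hypothetical and not part of the SupplyMind code.

```python
from itertools import product

# Held-out cases copied from the table above: (tier, task_id, seed).
EVAL_CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]

# Setups from the list above as (center model, warehouse model).
# The label strings are illustrative placeholders, not real checkpoint names.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]


def build_eval_grid(cases=EVAL_CASES, setups=SETUPS):
    """Return one run spec per (setup, case) pair, grouped by setup."""
    return [
        {
            "center": center,
            "warehouse": warehouse,
            "tier": tier,
            "task_id": task_id,
            "seed": seed,
        }
        for (center, warehouse), (tier, task_id, seed) in product(setups, cases)
    ]


grid = build_eval_grid()
print(len(grid))  # 4 setups x 3 cases = 12 runs
```

Keeping the cases and setups in two flat lists makes it easy to confirm that every setup sees exactly the same task IDs and seeds.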
Primary plots:
- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
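
Each of these plots needs one aggregated value per setup. The sketch below shows one possible aggregation, assuming each run produces a record with the hypothetical keys `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`. The actual result schema may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score fields, one per score plot listed above.
SCORE_KEYS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(records):
    """Average each score and total the invalid counts for every setup."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["setup"]].append(rec)

    summary = {}
    for setup, recs in grouped.items():
        row = {key: mean(r[key] for r in recs) for key in SCORE_KEYS}
        # Invalid payloads/actions are counts, so they are summed, not averaged.
        row["invalid_count"] = sum(r["invalid_count"] for r in recs)
        summary[setup] = row
    return summary
```

The returned dictionary maps each setup name to one row of plot-ready values, so a bar chart per metric only needs to iterate over its keys.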
This set is intentionally small so it can be rerun quickly during the hackathon. Do not use these seeds for further training.
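
One way to enforce the no-training rule is a guard that fails fast when a held-out seed appears in a training seed list. This is a sketch only: `EVAL_SEEDS` and `assert_no_eval_leakage` are hypothetical names, and how training seeds are actually configured is an assumption.

```python
# Held-out seeds from the eval table; these must never be used for training.
EVAL_SEEDS = frozenset({131, 211, 307})


def assert_no_eval_leakage(training_seeds):
    """Raise ValueError if any held-out eval seed is in training_seeds."""
    leaked = EVAL_SEEDS.intersection(training_seeds)
    if leaked:
        raise ValueError(
            f"held-out eval seeds used for training: {sorted(leaked)}"
        )
```

Calling this once before a training run starts is enough to catch an accidental overlap.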