Rishav
Add final eval set
0af60f8

SupplyMind Final Eval Set

Use this small held-out set for final model comparisons:

tier    task_id          seed
easy    v2_train_easy    131
medium  v2_train_medium  211
hard    v2_train_hard    307

Compare the same cases across:

  1. Base center + base warehouse
  2. SFT center + SFT warehouse
  3. GRPO center + best warehouse
  4. Best center + best warehouse
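The comparison above amounts to a small grid: every setup is run on the same three cases. A minimal sketch of that grid follows, assuming hypothetical names throughout (`EvalCase`, `Setup`, `eval_grid`, and the checkpoint labels `"base"`, `"sft"`, `"grpo"`, `"best"` are illustrative and not taken from the SupplyMind codebase):

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalCase:
    tier: str
    task_id: str
    seed: int


@dataclass(frozen=True)
class Setup:
    name: str
    center: str      # hypothetical label for the center checkpoint
    warehouse: str   # hypothetical label for the warehouse checkpoint


# The three held-out cases listed in this document.
EVAL_CASES = (
    EvalCase("easy", "v2_train_easy", 131),
    EvalCase("medium", "v2_train_medium", 211),
    EvalCase("hard", "v2_train_hard", 307),
)

# The four setups listed in this document; the labels are assumptions.
SETUPS = (
    Setup("base+base", center="base", warehouse="base"),
    Setup("sft+sft", center="sft", warehouse="sft"),
    Setup("grpo+best", center="grpo", warehouse="best"),
    Setup("best+best", center="best", warehouse="best"),
)


def eval_grid(cases=EVAL_CASES, setups=SETUPS):
    """Return every (setup, case) pair so each setup sees identical cases."""
    return list(product(setups, cases))
```

Iterating over `eval_grid()` gives 12 runs (4 setups × 3 cases). Keeping the cases in one constant makes it harder for a setup to be scored on a different seed by accident.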

Primary plots:

  • global score by setup
  • center role score by setup
  • warehouse role score by setup
  • invalid payload/action counts by setup
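Each of these plots needs one aggregated value per setup. The sketch below shows one way to produce those values from per-case results. The record layout (the keys `setup`, `global_score`, `center_score`, `warehouse_score`, `invalid_count`) and the function name `summarize_by_setup` are assumptions for illustration, and invalid payloads and invalid actions are folded into a single count here:

```python
from collections import defaultdict
from statistics import mean

SCORE_KEYS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(records):
    """Aggregate per-case records into one summary row per setup.

    Each record is assumed to be a dict with the keys "setup",
    "global_score", "center_score", "warehouse_score", "invalid_count".
    Scores are averaged over cases; invalid counts are summed.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["setup"]].append(rec)

    summary = {}
    for setup, recs in grouped.items():
        row = {key: mean(r[key] for r in recs) for key in SCORE_KEYS}
        row["invalid_count"] = sum(r["invalid_count"] for r in recs)
        row["n_cases"] = len(recs)
        summary[setup] = row
    return summary
```

The returned dict maps each setup name to the values a bar chart would plot. Reporting `n_cases` alongside the means is a cheap check that every setup was actually run on all three cases.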

This set is intentionally small so that it can be rerun quickly during the hackathon. Do not use these seeds for further training; doing so would mean the set is no longer held out.
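One way to enforce the seed rule is a guard that runs before training starts. This is a sketch only: `assert_no_eval_seeds` is a hypothetical helper, and it conservatively rejects an eval seed on any task, which may be stricter than the project needs.

```python
# Seeds reserved for the final eval set in this document.
EVAL_SEEDS = frozenset({131, 211, 307})


def assert_no_eval_seeds(training_seeds):
    """Raise ValueError if any training seed is reserved for the final eval set."""
    leaked = sorted(EVAL_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"training seeds overlap the final eval set: {leaked}")
```

Calling `assert_no_eval_seeds(seeds)` at the top of a training script makes an accidental overlap fail before any training run begins.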