---
title: IT Helpdesk Ticket Routing OpenEnv
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
tags:
- openenv
- helpdesk
- ticket-routing
- customer-support
---
# IT Helpdesk Ticket Routing OpenEnv

Meta PyTorch OpenEnv Hackathon Round 1 submission.

Team Hackstreet Boys: Roopal Guha Neogi, Suyash Kumar.
This repository contains a deterministic OpenEnv environment for IT helpdesk ticket routing. An agent is shown one ticket at a time from a short queue and must predict the right issue type, operational priority, assignment group, and next action.
## Judge-Facing Summary
If a judge reads only one short explanation, it should be this:
- this environment models a real enterprise workflow, not a toy classification task
- each ticket requires typed routing decisions that are easy to score deterministically
- the task ladder now keeps full routing on every task and scales observability, queue pressure, and operational controls instead
- the repo is small enough to rerun quickly and explicit enough to understand without hidden business logic
## What This Environment Simulates
The environment models a realistic helpdesk workflow:
- a new ticket enters the queue
- the agent reads the ticket title and description
- the agent may investigate, request more information, open an incident, defer the ticket, or submit a routing decision
- the queue state mutates: capacity shrinks, incidents stay open, deferred tickets return later, poor handling can spawn follow-up tickets, and good or bad handling can reshape later tickets in the same request cluster
- the grader assigns deterministic credit
- the environment advances until the queue is complete
For hard-task tickets, the environment can now withhold decisive routing context until the agent uses the right investigation tool. That keeps the task from collapsing into one-shot classification and makes tool choice part of the policy.
This domain is useful for OpenEnv because it is operationally realistic, easy to evaluate with typed outputs, and naturally supports a clean easy-to-hard task ladder.
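The queue mutations described above can be sketched in a few lines. This is a hypothetical miniature, not the engine in server/environment.py: the names `QueueState`, `submit`, `defer`, and `open_incident` are illustrative, and the real capacity, SLA, and follow-up bookkeeping is richer.

```python
from dataclasses import dataclass, field

@dataclass
class QueueState:
    pending: list            # ticket ids still waiting to be handled
    capacity: int            # remaining team capacity for this episode
    open_incidents: set = field(default_factory=set)

def submit(state: QueueState) -> None:
    # Routing a ticket consumes the head of the queue and some capacity.
    state.pending.pop(0)
    state.capacity -= 1

def defer(state: QueueState, slots_back: int = 2) -> None:
    # A deferred ticket returns later in the queue instead of disappearing.
    ticket = state.pending.pop(0)
    insert_at = min(slots_back, len(state.pending))
    state.pending.insert(insert_at, ticket)

def open_incident(state: QueueState) -> None:
    # An opened incident stays open for the rest of the episode.
    state.open_incidents.add(state.pending[0])

state = QueueState(pending=["t1", "t2", "t3"], capacity=5)
defer(state)          # t1 comes back after t2 and t3
print(state.pending)  # ['t2', 't3', 't1']
```

The point of the sketch is that every action has queue-level consequences, which is why the grader scores the queue outcome rather than isolated tickets.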
## Why This Is A Good Hackathon Domain
- it reflects real enterprise support operations
- the action space is structured and judge-friendly, but now includes meaningful operational controls beyond investigate-versus-submit
- correctness can be scored deterministically
- the hard task is meaningfully harder than the easy and medium tasks
- the environment is small enough to rerun quickly
## Environment Overview
The project uses a queue-based episode model.
- `reset()` samples a task and a queue of 3 to 5 tickets
- `step()` lets the agent investigate, request clarification, defer, open incidents, or submit one ticket at a time
- `state()` exposes the internal episode snapshot
- hard-task episodes also track queue-level capacity, incident slots, clustered follow-on tickets, alternate acceptable routes, planning penalties, SLA pressure, and dynamic follow-up tickets across the queue
- final evaluation is based on the queue outcome, not on isolated per-ticket classification alone
The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
## Lightweight Policy Improvement Loop
The repo includes a local policy runner in `policy_learning.py`. It still does not update model weights, but it now does more than cosmetic search: it evaluates repeated seeded rollouts, learns cue-conditioned tool preferences for investigation, uses the same planning-aware deterministic submit logic as `inference.py`, and ranks policies by terminal rubric reward first, then queue-management quality, with lower planning penalty as the next tie-breaker.
That gives the project a meaningful improvement loop for judge demos:
- compare `no_investigation`, `investigate_when_context_hidden`, and `adaptive_cue_bandit`
- log per-step rewards, feedback summaries, planning penalties, and reward components to JSONL
- learn when to use `lookup_queue_capacity_forecast` and `lookup_queue_cluster_summary` versus the other investigation tools
- select the best policy on train seeds, then re-evaluate it on holdout seeds
Example commands:

```bash
python policy_learning.py compare --seeds 42-51 --task-ids 1,2,3
python policy_learning.py search --train-seeds 40-49 --eval-seeds 50-59 --task-ids 1,2,3
```
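The `--seeds 42-51` and `--task-ids 1,2,3` arguments follow a simple range/list convention. A minimal sketch of that parsing, assuming the actual CLI handling in `policy_learning.py` may differ:

```python
def parse_seed_range(spec: str) -> list[int]:
    # "42-51" -> [42, ..., 51]; a single "42" -> [42].
    # Illustrative only; the real argument parsing lives in policy_learning.py.
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return list(range(int(lo), int(hi) + 1))
    return [int(spec)]

def parse_task_ids(spec: str) -> list[int]:
    # "1,2,3" -> [1, 2, 3]
    return [int(part) for part in spec.split(",")]

assert parse_seed_range("42-51") == list(range(42, 52))
assert parse_task_ids("1,2,3") == [1, 2, 3]
```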
Artifacts are written to `analysis/policy_learning_runs/` by default:
- `compare_summary.json`
- `compare_episodes.jsonl`
- `compare_trajectories.jsonl`
- `search_summary.json`
- `search_train_episodes.jsonl`
- `search_train_trajectories.jsonl`
- `search_eval_episodes.jsonl`
- `search_eval_trajectories.jsonl`
The default submit policy inside this runner stays deterministic and local. It reuses the repo's heuristic routing logic plus planning-aware routing overrides, and the policy loop can now also exercise operational actions such as `request_info`, `open_incident`, and `defer` without depending on external LLM latency or API cost.
## Task Ladder
| ID | Name | Difficulty | Required Fields | What The Agent Must Do |
|---|---|---|---|---|
| 1 | Guided Full Routing | Easy | issue_type, priority, assignment_group, resolution_action | route a mostly visible ticket correctly |
| 2 | Contextual Full Routing | Medium | issue_type, priority, assignment_group, resolution_action | route under partial observability with investigation, clarification, and moderate queue carry-over |
| 3 | Adaptive Queue Routing | Hard | issue_type, priority, assignment_group, resolution_action | route while managing queue pressure, incidents, clustered follow-ons, deferrals, and downstream follow-ups |
## Locked Vocabulary
Issue types:

`billing_license`, `identity_access`, `application_support`, `service_request`, `spam_phishing`, `general_inquiry`, `security_compliance`, `onboarding`, `feature_request`

Priorities:

`critical`, `high`, `medium`, `low`

Assignment groups:

`license_ops`, `service_desk`, `application_team`, `procurement`, `security_team`, `onboarding_ops`

Resolution actions:

`fulfill`, `escalate`, `assign`, `ignore`, `acknowledge`
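One way such a locked vocabulary can be pinned down in code, sketched here as plain tuples; the actual constants live in `vocabulary.py` and may be organized differently:

```python
# Tuples keep the vocabulary immutable; membership checks reject any label
# outside the locked sets. Sketch only; see vocabulary.py for the real constants.
ISSUE_TYPES = (
    "billing_license", "identity_access", "application_support",
    "service_request", "spam_phishing", "general_inquiry",
    "security_compliance", "onboarding", "feature_request",
)
PRIORITIES = ("critical", "high", "medium", "low")
ASSIGNMENT_GROUPS = (
    "license_ops", "service_desk", "application_team",
    "procurement", "security_team", "onboarding_ops",
)
RESOLUTION_ACTIONS = ("fulfill", "escalate", "assign", "ignore", "acknowledge")

def validate_route(issue_type, priority, group, action):
    # Reject any routing answer that uses a label outside the frozen sets.
    return (issue_type in ISSUE_TYPES and priority in PRIORITIES
            and group in ASSIGNMENT_GROUPS and action in RESOLUTION_ACTIONS)

assert validate_route("spam_phishing", "high", "security_team", "escalate")
assert not validate_route("unknown_type", "high", "security_team", "escalate")
```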
## Observation And State Model
The agent only sees routing inputs, not labels.
Visible ticket fields:
- `ticket_id`
- `title`
- `requester`
- `description`
- optional `ambiguity_note`
- optional `planning_note`
- optional `customer_update_note`
- optional `related_ticket_id`
- optional `related_ticket_preview`
- optional `routing_options`
- optional `capacity_state`
- optional `operational_context`
- optional `cluster_summary`
- optional `generated_from_ticket_id`
Each observation also includes:
- `task_id`, `task_name`, `instructions`, `allowed_fields`
- `available_action_types`, `available_tools`, `investigation_budget_remaining`, `last_tool_result`
- `queue_size`, `tickets_remaining`, `tickets_after_current`, `tickets_processed`, `queue_position`
- `average_score_so_far`, `progress_fraction`, `history`, `last_reward_components`
- `rubric_reward` on terminal observations
- `metadata.last_feedback_summary` for compact reward / penalty feedback
- `metadata.capacity_state` on hard-task episodes
- `metadata.planning_penalty_total` and `metadata.planning_penalty_applied`
- standard OpenEnv fields such as `done` and `reward`
The internal `HelpdeskTicketState` tracks:

- `episode_id`, `step_count`, `current_task_id`, `seed`
- `queue_ticket_ids`, `current_ticket_index`, `per_ticket_scores`
- `total_reward`, `reward`, `done`
- `team_capacity_remaining`, `high_priority_slots_remaining`, `escalation_slots_remaining`, `incident_slots_remaining`
- `planning_penalty_total`, `incident_gap_total`, `sla_breach_count`
- `queue_management_score`, `queue_management_breakdown`, `dynamic_queue_events`
## Grading And Reward
Scoring is deterministic and normalized to [0.0, 1.0].
The action model now supports five paths:
- `action_type="submit"` for the final routing answer
- `action_type="investigate"` with a small built-in tool surface before submission
- `action_type="request_info"` to ask for customer / operator clarification on the current ticket
- `action_type="open_incident"` to reserve incident handling capacity before routing risky tickets
- `action_type="defer"` to push a ticket later in the queue and accept the downstream queue consequences
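The five action paths can be captured by a small typed model. This is an illustrative shape only, assuming dataclasses; the repo's real typed models are defined in `models.py`:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical action shape; see models.py for the actual definitions.
ActionType = Literal["submit", "investigate", "request_info", "open_incident", "defer"]

@dataclass
class HelpdeskAction:
    action_type: ActionType
    tool_name: Optional[str] = None              # used with "investigate"
    routing: dict = field(default_factory=dict)  # used with "submit"

investigate = HelpdeskAction(action_type="investigate",
                             tool_name="lookup_related_ticket")
submit = HelpdeskAction(action_type="submit", routing={
    "issue_type": "spam_phishing",
    "priority": "high",
    "assignment_group": "security_team",
    "resolution_action": "escalate",
})
```

Keeping the action typed is what makes deterministic grading straightforward: every submitted field is drawn from the locked vocabulary.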
Available tools:
- `lookup_related_ticket`
- `lookup_requester_history`
- `lookup_internal_routing_note`
- `lookup_queue_capacity_forecast`
- `lookup_queue_cluster_summary`
Hard-task investigation behavior:
- some ambiguous and non-default-routing tickets start with both redacted titles and redacted descriptions
- linked-ticket previews and internal routing notes stay hidden until the matching tool is used
- capacity-sensitive tickets can expose queue pressure, future demand, and alternate routing options through `lookup_queue_capacity_forecast`
- cluster-sensitive tickets can expose future related tickets, shared-requester load, and active incident coverage through `lookup_queue_cluster_summary`
- detailed cluster counts and future queue-demand breakdowns stay hidden until the matching queue tool is used
- only useful investigation steps return a small positive shaping reward
- blind or repeated probing does not pay by default
- premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
- resource-greedy routing can add planning penalties later in the queue even when a single ticket looks correct in isolation
- incident-sensitive tickets can require an explicit `open_incident` step to avoid future follow-up debt
- strong handling on an earlier clustered ticket can make later tickets cheaper to acknowledge, while weak handling can escalate those later tickets
- bad or incomplete hard-task handling can append a deterministic follow-up ticket later in the same episode
- terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
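The shaping rules above can be sketched as a tiny reward helper. The bonus magnitude and function names here are assumptions for illustration, not the constants in `server/reward.py`:

```python
# Sketch: the first useful use of a tool pays a small bonus; repeated
# probing and blind probing pay nothing. Illustrative values only.
USEFUL_BONUS = 0.05  # assumed magnitude, not the repo's actual constant

def investigation_shaping(tool: str, reveals_hidden_context: bool,
                          tools_used: set) -> float:
    if tool in tools_used:
        return 0.0            # repeated probing does not pay
    tools_used.add(tool)
    if reveals_hidden_context:
        return USEFUL_BONUS   # only useful investigation earns shaping reward
    return 0.0                # blind probing does not pay either

used = set()
print(investigation_shaping("lookup_related_ticket", True, used))      # 0.05
print(investigation_shaping("lookup_related_ticket", True, used))      # 0.0
print(investigation_shaping("lookup_requester_history", False, used))  # 0.0
```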
Per-field behavior:
- `issue_type`: exact match, with a few near-miss partial-credit pairs
- `priority`: exact match or proximity credit
- `assignment_group`: exact match, with a small declared partial-credit map for nearby ownership mistakes
- `resolution_action`: exact match, with a small declared partial-credit map for nearby next-step mistakes
- hard task only: some tickets also declare an alternate acceptable route with a reduced score multiplier, so the grader can reward capacity-aware fallback choices without collapsing into full fuzziness
Task weights:
| Task | Issue Type | Priority | Assignment Group | Resolution Action |
|---|---|---|---|---|
| 1 | 40% | 20% | 20% | 20% |
| 2 | 32% | 20% | 24% | 24% |
| 3 | 30% | 20% | 25% | 25% |
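A minimal sketch of per-ticket weighted grading using the weight table above. Exact match is the only scoring path in this sketch; the declared partial-credit maps in `server/grader.py` would plug into `field_score`:

```python
# Task-weight table transcribed from the README; the grading logic around it
# is an illustrative sketch, not the implementation in server/grader.py.
TASK_WEIGHTS = {
    1: {"issue_type": 0.40, "priority": 0.20, "assignment_group": 0.20, "resolution_action": 0.20},
    2: {"issue_type": 0.32, "priority": 0.20, "assignment_group": 0.24, "resolution_action": 0.24},
    3: {"issue_type": 0.30, "priority": 0.20, "assignment_group": 0.25, "resolution_action": 0.25},
}

def field_score(predicted: str, expected: str) -> float:
    # Exact match only in this sketch; declared partial-credit pairs would
    # return values between 0.0 and 1.0 here.
    return 1.0 if predicted == expected else 0.0

def grade_ticket(task_id: int, predicted: dict, expected: dict) -> float:
    weights = TASK_WEIGHTS[task_id]
    return sum(w * field_score(predicted[f], expected[f]) for f, w in weights.items())

expected = {"issue_type": "spam_phishing", "priority": "high",
            "assignment_group": "security_team", "resolution_action": "escalate"}
wrong_priority = dict(expected, priority="low")
print(round(grade_ticket(1, expected, expected), 4))        # 1.0
print(round(grade_ticket(1, wrong_priority, expected), 4))  # 0.8
```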
Final episode rubric reward is queue-based:

```
clamp(route_trajectory_reward * route_weight + queue_management_score * queue_weight - extra investigation penalties)
```
Both `reward` and `rubric_reward` now use the closed interval [0.0, 1.0].
Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
Final reward also includes a queue-economics penalty when the agent exceeds the free investigation budget. One investigation-style step per queued ticket is free, but extra investigation or clarification steps reduce the final reward more noticeably than before. On hard-task queues, assignment-group capacity, high-priority slots, escalation slots, incident slots, and deferred-ticket SLA pressure all create cross-ticket trade-offs.
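The terminal formula and the free-investigation budget can be sketched together. The route/queue weights and per-step penalty below are illustrative placeholders, not the repo's actual constants:

```python
# Assumed weights and penalty rate for illustration; the real values live
# in server/reward.py.
ROUTE_WEIGHT = 0.8
QUEUE_WEIGHT = 0.2
PENALTY_PER_EXTRA_STEP = 0.02

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def rubric_reward(route_trajectory_reward: float,
                  queue_management_score: float,
                  investigation_steps: int,
                  queue_size: int) -> float:
    # One investigation-style step per queued ticket is free; extra
    # steps are charged against the final reward before clamping.
    extra = max(0, investigation_steps - queue_size)
    penalty = extra * PENALTY_PER_EXTRA_STEP
    return clamp01(route_trajectory_reward * ROUTE_WEIGHT
                   + queue_management_score * QUEUE_WEIGHT
                   - penalty)

print(round(rubric_reward(0.9, 0.8, investigation_steps=4, queue_size=4), 4))  # 0.88
print(round(rubric_reward(0.9, 0.8, investigation_steps=7, queue_size=4), 4))  # 0.82
```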
To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
- `last_reward_components` exposes ticket score, shaped step reward, milestone adjustment, trajectory reward when applicable, and any investigation penalty applied
- `average_score_so_far` and `progress_fraction` expose trajectory progress without leaking future labels
- medium and hard telemetry now also exposes terminal `queue_management_score` plus a queue-management breakdown
- hard-task telemetry includes planning penalties, capacity usage, and the post-action capacity snapshot
- `history` retains the same reward components plus a compact `feedback_summary` string for downstream agents
## Grounded Scoring
The grader is intentionally narrow and declared, not fully fuzzy.
- exact match is the dominant path for every field
- `assignment_group` and `resolution_action` now expose only a small declared partial-credit map for nearby mistakes
- `priority` only gets proximity credit from the declared table in `server/grader.py`
- `issue_type` only gets partial credit for a small declared similarity map
- hard-task alternate routes must be explicitly declared in the dataset and carry an explicit score multiplier
- wrong labels outside those explicit maps score `0.0`
That scoring policy is now backed by checked-in unit tests in `tests/test_grader_unit.py` and `tests/test_tasks_unit.py`.
The label set and partial-credit choices were also reviewed against public IT-support references during development, including:
- Classification of IT Support Tickets
- Semantic Similarity of IT Support Tickets
- MSDialog
That grounding pass supported keeping the current similarity map small and explainable. No new issue-type similarity pairs were added from the review.
## Dataset Snapshot
The effective labeled dataset now contains 70 tickets spanning straightforward, ambiguous, and planning-sensitive helpdesk scenarios.
It includes:
- billing and license requests
- identity and access issues
- application support incidents
- service and procurement requests
- spam or phishing reports
- security and compliance work
- onboarding tickets
- feature requests
- follow-up cases linked through `related_ticket_id`
- 16 tickets with explicit ambiguity notes
- 7 linked follow-up cases
- 22 tickets with declared alternate routes for queue-level planning
## Difficulty Coverage
The difficulty ladder is now visible in observability and control, not just in the submitted field count.
Easy-style examples:
- `ticket-020`: straightforward general inquiry with low urgency and a clean `general_inquiry` label
- `ticket-041`: clear onboarding request for a new contractor account
- `ticket-044`: obvious phishing-style lure that should map cleanly to `spam_phishing`
Medium-style examples:
- `ticket-001`: billing dispute that still requires the agent to judge urgency correctly
- `ticket-028`: application incident where the issue type is clear but priority still matters
- `ticket-036`: procurement-style proof-of-concept request that should route as a `service_request`
Hard-style examples:
- `ticket-022`: mixed billing and application signals in one ticket
- `ticket-029`: seat expansion combined with a prorating question
- `ticket-038`: follow-up billing thread with escalated urgency
- `ticket-045`: repeated account suspension thread with legal-escalation pressure
- generated `*-followup` tickets: deterministic reopened cases that only appear when the earlier handling was incomplete or operationally risky
## Repository Layout
```
server/
  app.py
  environment.py
  grader.py
  reward.py
  tasks.py
  Dockerfile
data/
  dataset.json
models.py
client.py
inference.py
vocabulary.py
openenv.yaml
pyproject.toml
requirements.txt
README.md
KNOWLEDGE.md
required.md
PROJECT_STATUS.md
```
## Core Files
- `models.py`: typed action, observation, state, and dataset record models
- `server/environment.py`: queue-based episode engine
- `server/tasks.py`: task definitions and dataset loader
- `server/grader.py`: deterministic scoring logic
- `server/reward.py`: reward helpers
- `client.py`: typed client for multi-step episodes
- `inference.py`: baseline agent runner
- `vocabulary.py`: frozen constants and routing defaults
## Local Setup
Install dependencies:

```bash
pip install -r requirements.txt
pip install -e .
```

Start the environment locally:

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Basic checks:

```bash
curl http://localhost:7860/health
curl http://localhost:7860/tasks
```
## Running The Baseline Inference Script
The baseline script defaults to all declared tasks when `TASK_ID` is not set, which keeps local runs aligned with validator-style sweeps.
### Heuristic mode

If no LLM credentials are set, it uses a keyword-based ticket router:

```bash
python inference.py
```

By default that runs all declared tasks and emits a structured `[START] ... [STEP] ... [END]` block for each task. To target a specific task:

```bash
TASK_ID=3 python inference.py
```
### LLM mode

Set these environment variables first:

- `API_BASE_URL`
- `MODEL_NAME`
- `API_KEY`
- `HF_TOKEN`

Then run:

```bash
python inference.py
```

Optional environment variables:

- `ENV_URL`: target environment, default value `http://localhost:7860`
- `SEED`
- `TASK_ID`
- `RUN_ALL_TASKS`: compatibility alias for local tooling; all tasks already run by default when `TASK_ID` is unset

To reproduce the multi-task local benchmark sweep:

```bash
RUN_ALL_TASKS=1 python inference.py
```
## Runtime Validation Snapshot
The repo has now completed both the first local heuristic validation pass and a merged-state rerun on the current main branch.
Validated locally:
- server startup
- `/health`, `/tasks`, `/reset`
- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
Current local smoke expectations:
- the baseline completes all 3 tasks successfully
- rewards remain in range for every task
- the hard task now depends much more heavily on investigation behavior, so exact seed-level baseline numbers are no longer treated as the benchmark reference for the repo
The April 6 to April 7 validation pass then closed the remaining validation gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
### Windows note
During the first runtime pass, the repo surfaced a Windows-specific JSON issue where `data/dataset.json` could include a UTF-8 BOM. The dataset loader in `server/tasks.py` now reads the file with `utf-8-sig`, so the environment resets cleanly even when the file was saved by a Windows editor.
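The `utf-8-sig` fix can be demonstrated in isolation. This standalone snippet simulates a BOM-prefixed dataset file and shows that `utf-8-sig` decoding strips the BOM while leaving plain UTF-8 untouched:

```python
import json
import os
import tempfile

payload = '{"tickets": []}'
path = os.path.join(tempfile.mkdtemp(), "dataset.json")
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeff" + payload)  # simulate a Windows editor adding a BOM

# Plain utf-8 decoding would hand json.load a leading U+FEFF and fail;
# utf-8-sig strips the BOM if present and is a no-op otherwise.
with open(path, encoding="utf-8-sig") as f:
    data = json.load(f)
print(data)  # {'tickets': []}
```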
## Docker
Build:

```bash
docker build -t helpdesk-ticket-routing .
```

Run locally:

```bash
docker run -p 7860:7860 helpdesk-ticket-routing
```

Then run inference against it (the default `ENV_URL` points to `http://localhost:7860`):

```bash
RUN_ALL_TASKS=1 python inference.py
```

If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
If local Docker is blocked by machine setup, the repo also includes a GitHub Actions smoke test at `.github/workflows/docker-smoke-test.yml`. That workflow builds the image on a GitHub-hosted runner, starts the container, checks `/health` and `/tasks`, and runs heuristic `inference.py` against the container.
## API Surface
OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | health check |
| POST | /reset | start a new episode |
| POST | /step | submit an action |
| GET | /state | inspect internal state |
| GET | /tasks | list task metadata |
| GET | /web | lightweight HF Space UI |
| GET | /docs | interactive API docs |
## Submission Readiness
The repo is already aligned on:
- team name and members
- domain and vocabulary
- task ladder
- typed models
- grader and reward design
- packaging metadata and Docker entry point
- Hugging Face Spaces README frontmatter
- judge-facing documentation of deterministic, grounded scoring
An April 6 repo audit also confirmed that all required submission files are present:
- runtime: `models.py`, `client.py`, `inference.py`, `server/app.py`, `server/environment.py`, `server/grader.py`, `server/reward.py`, `server/tasks.py`
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
- docs and project guidance: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`
Roadmap status through April 7 is complete:
- unit, smoke, and integration tests are checked in and green
- Docker smoke coverage exists through `.github/workflows/docker-smoke-test.yml`
- `openenv validate` now passes on the current repo state
- structured `inference.py` logging is verified by tests and the merged-state rerun
- a clean-copy install-and-run pass has been completed
The remaining April 8 work is operational rather than implementation-heavy:
- run the final submission-branch sanity slice before pushing
- perform the live Hugging Face Space ping and reset check on the deployed submission artifact if a fresh deployment is created
The short TRL / GRPO README example remains intentionally deferred because it is optional and lower priority than benchmark clarity and stability.