Hi
@sseymens
Thank you for your comments.
I can help answer your question about the MoE on-policy part.
- Yeah, forcing `old_log_prob = log_prob.detach()` does not solve the on-policy issue, since the log-prob is computed under the current policy while the sampling distribution can differ due to expert selection.
- When we explored the agentic issues in gpt-oss training, we had not identified the root cause at the beginning. One hypothesis was inference-training inconsistency. Applying importance sampling did not help, so we tested whether forcing `old_log_prob = log_prob.detach()` would alleviate the issue, in case that was the root cause. This was just hypothesis testing.
- When we explored the agentic issues in gpt-oss training, verl did not yet support expert router replay, so we could not test that idea (https://arxiv.org/pdf/2510.11370v1). We have now tested the replay, but that is not the root cause either. The root cause is the attention sink.
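To make the first point concrete, here is a minimal numeric sketch (plain Python floats standing in for torch tensors, with a copy standing in for `.detach()`) of why forcing `old_log_prob = log_prob.detach()` cannot fix the mismatch: it just makes the importance ratio trivially 1, so no off-policy correction is applied even when MoE expert routing made the sampling distribution differ.

```python
import math

def is_ratio(log_prob, old_log_prob):
    """Importance-sampling ratio r = pi_theta(a|s) / pi_behavior(a|s)."""
    return math.exp(log_prob - old_log_prob)

# Genuine off-policy case: the two log-probs come from different
# policies, so the ratio carries a real correction term.
r_off_policy = is_ratio(-1.2, -1.5)  # > 1: current policy likes this token more

# Forcing old_log_prob = log_prob.detach() (here: just copying the value)
# makes the ratio identically 1 by construction -- the correction vanishes,
# but the sampled tokens may still have come from a different effective
# distribution when expert selection differs between inference and training.
log_prob = -1.2
old_log_prob = log_prob  # stand-in for log_prob.detach()
r_forced = is_ratio(log_prob, old_log_prob)  # exactly 1.0

print(r_off_policy, r_forced)
```

So the experiment only tells us whether the correction term itself was the problem; it cannot recover the true behavior-policy log-probs lost to routing divergence.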