In what scenarios is this model used?

by weiminw - opened Jul 9, 2025

Jul 9, 2025

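Using this model requires a reference, which I take to be the ground truth. So my understanding is that the model is mainly used during fine-tuning, to judge whether the model's response is close to the ground truth. Is that right? Beyond that, can it be used to provide a reward signal during agent inference? During agent inference there is no way to obtain such reference information, so is there any other way to use this model?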

RowitZou

Intern Large Models org Jul 9, 2025

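Hi, thanks for your interest! POLAR does indeed need a reference to produce a reward signal, and it is mainly intended for reinforcement fine-tuning. As I understand it, an agent's inference process produces both CoT and a final result. If you have a reference for the CoT, you can run conventional RL over the whole trajectory. If you only have a reference for the final answer, you can use R1-style reinforcement learning. If you have no reference at all, POLAR may not meet your needs, and you would have to consider a traditional RM instead.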
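
The three cases above can be sketched as a small decision function. This is only an illustration: `RewardStrategy` and `choose_reward_strategy` are hypothetical names, not part of POLAR or any library, and the code only picks a strategy based on which references exist. It does not call a reward model.

```python
from enum import Enum
from typing import Optional


class RewardStrategy(Enum):
    """Hypothetical labels for the three options mentioned in the reply."""
    TRAJECTORY_RL = "RL over the whole trajectory, scored against a CoT reference"
    FINAL_ANSWER_RL = "R1-style RL, scored against a final-answer reference"
    REFERENCE_FREE_RM = "traditional reward model that needs no reference"


def choose_reward_strategy(
    cot_ref: Optional[str],
    answer_ref: Optional[str],
) -> RewardStrategy:
    """Pick a reward setup from the references that are available.

    Empty strings are treated the same as missing references.
    """
    if cot_ref:
        # A reference for the reasoning trace exists: score the full trajectory.
        return RewardStrategy.TRAJECTORY_RL
    if answer_ref:
        # Only the final answer has a reference: score just the final result.
        return RewardStrategy.FINAL_ANSWER_RL
    # No reference at all: a reference-based reward model cannot be used.
    return RewardStrategy.REFERENCE_FREE_RM
```

For example, `choose_reward_strategy(None, "42")` selects `RewardStrategy.FINAL_ANSWER_RL`, while `choose_reward_strategy(None, None)` falls back to `RewardStrategy.REFERENCE_FREE_RM`.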

RowitZou changed discussion status to closed Jul 12, 2025
