BayesClue-Latent checkpoints (private backup)
Latent-belief RL on the passive BayesClue detective game. Base: Qwen3.5-4B. Belief is read from forced-choice probe logits (never verbalized).
Contents
| path | what | key metric |
|---|---|---|
| `sft_final/adapter/` | Stage-1 distributional-KL SFT adapter (LoRA r64). Soft-target match KL(p* ‖ softmax(logits[label_ids])) at the `Answer:` token. R1. | held-out belief KL 0.0066, top1 0.93 |
| `rl_lr1e5_step400/lora_adapter/` | Stage-2 logit-probe GRPO LoRA delta (lr 1e-5, step 400, 346 modules). Reasoning-only response mask → probe gets zero gradient. | — |
| `rl_lr1e5_step400/merged/` | R2 = SFT⊕RL correctly merged into base (full fp model). | post-reasoning world_kl 0.0245 (= proper-scoring entropy floor), q_entropy→p_entropy, world_top1 0.71 |
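The Stage-1 objective, KL(p* ‖ softmax(logits[label_ids])), can be illustrated with a minimal sketch. It assumes the belief is a softmax restricted to the logits of the candidate label tokens at a single position; the function names, toy logits, and label ids below are hypothetical and are not taken from the training code.

```python
import math

def probe_belief(logits, label_ids):
    """Softmax over only the logits at the candidate label token ids."""
    picked = [logits[i] for i in label_ids]
    m = max(picked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in picked]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_star, q, eps=1e-12):
    """KL(p* || q) in nats; terms with p* = 0 contribute nothing."""
    return sum(
        p * (math.log(p) - math.log(max(qi, eps)))
        for p, qi in zip(p_star, q)
        if p > 0
    )

# Toy example (hypothetical numbers): a 6-token vocabulary with three
# candidate labels at token ids 1, 3 and 4. Token 5 has the largest logit
# but is not a label, so it does not enter the belief.
logits = [0.2, 2.0, -1.0, 1.0, 0.0, 3.5]
label_ids = [1, 3, 4]
p_star = [0.6, 0.3, 0.1]  # soft target distribution

q = probe_belief(logits, label_ids)
loss = kl_divergence(p_star, q)
```

Because the softmax is taken only over `label_ids`, probability mass on non-label tokens is ignored, which is what lets the belief be read from logits without the model verbalizing it.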
Notes
- Merge the LoRA with `credal_verl08/remerge_sft_qwen35.py` (the `model.layers.` → `model.language_model.layers.` remap); the stock `verl.model_merger` writes a base copy.
- Reward `R = α·(−CE(p*_H,q_H)) + (1−α)·mean_s(−CE(p*_R,q_R))`, α=0.6. The −1.225 reward plateau IS the optimum −H(p*), not a truncation artifact.
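The claim that the plateau is the optimum follows from the reward being a negative cross-entropy: CE(p*, q) is minimized at q = p*, where it equals H(p*), so the reward cannot exceed −H(p*). The sketch below checks this numerically. It assumes the H and R subscripts denote two probe targets, with the R term averaged over an index s; the function names and toy distributions are hypothetical, and the numbers do not reproduce the −1.225 value.

```python
import math

def cross_entropy(p_star, q, eps=1e-12):
    """CE(p*, q) in nats; terms with p* = 0 contribute nothing."""
    return -sum(
        p * math.log(max(qi, eps))
        for p, qi in zip(p_star, q)
        if p > 0
    )

def entropy(p):
    """H(p) = CE(p, p)."""
    return cross_entropy(p, p)

def reward(p_h, q_h, pairs_r, alpha=0.6):
    """R = alpha * (-CE(p*_H, q_H)) + (1 - alpha) * mean_s(-CE(p*_R, q_R)).

    pairs_r is a list of (p*_R, q_R) pairs, one per index s (assumed layout).
    """
    head = -cross_entropy(p_h, q_h)
    steps = [-cross_entropy(p, q) for p, q in pairs_r]
    return alpha * head + (1 - alpha) * sum(steps) / len(steps)

# Toy target and two candidate beliefs (hypothetical numbers).
p_star = [0.5, 0.3, 0.2]
q_off = [0.7, 0.2, 0.1]

# Perfectly calibrated belief: the reward equals -H(p*).
r_best = reward(p_star, p_star, [(p_star, p_star)])
# Miscalibrated belief: strictly lower reward.
r_off = reward(p_star, q_off, [(p_star, q_off)])
```

A policy whose probe belief already matches p* therefore sits at a reward of −H(p*) and has no headroom left, which is consistent with reading the plateau as an entropy floor rather than a truncation artifact.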