BayesClue-Latent checkpoints (private backup)

Latent-belief RL on the passive BayesClue detective game. Base: Qwen3.5-4B. Belief is read from forced-choice probe logits (never verbalized).

| path | what | key metric |
| --- | --- | --- |
| `sft_final/adapter/` | Stage-1 distributional-KL SFT adapter (LoRA r64). Soft-target match `KL(p* ‖ softmax(logits[label_ids]))` at the `Answer:` token. R1. | held-out belief KL 0.0066, top1 0.93 |
| `rl_lr1e5_step400/lora_adapter/` | Stage-2 logit-probe GRPO LoRA delta (lr 1e-5, step 400, 346 modules). Reasoning-only response mask → probe gets zero gradient. | n/a |
| `rl_lr1e5_step400/merged/` | R2 = SFT⊕RL correctly merged into base (full fp model). | post-reasoning world_kl 0.0245 (= proper-scoring entropy floor), q_entropy→p_entropy, world_top1 0.71 |
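
The Stage-1 objective `KL(p* ‖ softmax(logits[label_ids]))` can be illustrated with a minimal, dependency-free sketch. It assumes `logits` is the vocabulary-sized logit vector at the `Answer:` position and `label_ids` holds the token ids of the candidate answers; the toy numbers and the helper names `probe_belief` and `kl_divergence` are hypothetical and are not taken from the training code.

```python
import math

def probe_belief(logits, label_ids):
    """Softmax restricted to the candidate-label token ids (forced choice)."""
    picked = [logits[i] for i in label_ids]
    m = max(picked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in picked]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Toy example (illustrative values only): 6-token vocabulary,
# three candidate answers at token ids 1, 3 and 4.
logits = [0.2, 2.0, -1.0, 1.0, 0.0, 0.5]
label_ids = [1, 3, 4]
p_star = [0.6, 0.3, 0.1]  # soft target over the three candidates

q = probe_belief(logits, label_ids)
loss = kl_divergence(p_star, q)
# One plausible reading of "top1": argmax of q matches argmax of p*.
top1_match = q.index(max(q)) == p_star.index(max(p_star))
```

Because the softmax runs only over `label_ids`, logit mass on every other token is ignored, which is what lets the belief be read out without the model verbalizing it.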

Notes

Merge the LoRA with `credal_verl08/remerge_sft_qwen35.py`, which applies the `model.layers.` → `model.language_model.layers.` key remap; the stock `verl.model_merger` writes a copy of the base instead.
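
As a rough illustration of the key remap named above (not the contents of `remerge_sft_qwen35.py`, which are not shown here), the sketch below renames keys in a plain dict standing in for a state dict. The function names and example keys are hypothetical.

```python
OLD_PREFIX = "model.layers."
NEW_PREFIX = "model.language_model.layers."

def remap_key(key):
    """Rewrite the first OLD_PREFIX occurrence to NEW_PREFIX, idempotently."""
    # NEW_PREFIX contains OLD_PREFIX as a substring ("language_" + "model.layers."),
    # so an unguarded str.replace would mangle keys that are already remapped.
    if NEW_PREFIX in key:
        return key
    return key.replace(OLD_PREFIX, NEW_PREFIX, 1)

def remap_state_dict(state_dict):
    """Return a new dict with remapped keys; raise if two keys collide."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = remap_key(key)
        if new_key in remapped:
            raise ValueError(f"key collision after remap: {new_key}")
        remapped[new_key] = value
    return remapped

# Toy stand-in for adapter weights (values would be tensors in practice).
toy = {
    "model.layers.0.self_attn.q_proj.weight": 1,
    "lm_head.weight": 2,
}
fixed = remap_state_dict(toy)
```

After a merge, comparing a few merged tensors against the base weights is a cheap way to notice the base-copy failure mode: identical tensors suggest the adapter was never applied.
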
Reward R = α·(−CE(p*_H,q_H)) + (1−α)·mean_s(−CE(p*_R,q_R)), α=0.6. The −1.225 reward plateau IS the optimum −H(p*), not a truncation artifact.
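
The plateau claim follows from the identity CE(p*, q) = H(p*) + KL(p* ‖ q): since KL ≥ 0, each −CE term is at most −H(p*), with equality exactly when q = p*. The sketch below checks this numerically on toy distributions. The source does not spell out what the H and R subscripts denote, so the code simply treats them as one target/prediction pair plus a list of pairs indexed by s; the toy values are illustrative and do not reproduce the −1.225 figure.

```python
import math

ALPHA = 0.6  # weighting from the reward definition

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) in nats; terms with p_i = 0 contribute zero."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(p):
    """H(p) = CE(p, p)."""
    return cross_entropy(p, p)

def reward(p_h, q_h, p_r_list, q_r_list, alpha=ALPHA):
    """R = alpha * (-CE(p_h, q_h)) + (1 - alpha) * mean_s(-CE(p_r[s], q_r[s]))."""
    head = -cross_entropy(p_h, q_h)
    rest = sum(-cross_entropy(p, q) for p, q in zip(p_r_list, q_r_list)) / len(p_r_list)
    return alpha * head + (1.0 - alpha) * rest

# Toy targets (hypothetical values).
p_h = [0.5, 0.3, 0.2]
p_r = [[0.7, 0.3], [0.25, 0.75]]

# A perfectly calibrated policy reports q = p* everywhere.
best = reward(p_h, p_h, p_r, p_r)
ceiling = -(ALPHA * entropy(p_h) + (1.0 - ALPHA) * sum(entropy(p) for p in p_r) / len(p_r))

# Any other report scores lower, e.g. uniform beliefs.
uniform = reward(p_h, [1 / 3] * 3, p_r, [[0.5, 0.5], [0.5, 0.5]])
```

Here `best` equals `ceiling`, the weighted negative entropy of the targets, so a reward curve that flattens at that value has reached the optimum.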
