BayesClue-Latent checkpoints (private backup)

Latent-belief RL on the passive BayesClue detective game. Base: Qwen3.5-4B. Belief is read from forced-choice probe logits (never verbalized).

Contents

| path | what | key metric |
| --- | --- | --- |
| sft_final/adapter/ | Stage-1 distributional-KL SFT adapter (LoRA r64). Soft-target match KL(p* ‖ softmax(logits[label_ids])) at the Answer: token. R1. | held-out belief KL 0.0066, top1 0.93 |
| rl_lr1e5_step400/lora_adapter/ | Stage-2 logit-probe GRPO LoRA delta (lr 1e-5, step 400, 346 modules). Reasoning-only response mask → probe gets zero gradient. | |
| rl_lr1e5_step400/merged/ | R2 = SFT⊕RL correctly merged into base (full fp model). | post-reasoning world_kl 0.0245 (= proper-scoring entropy floor), q_entropy → p_entropy, world_top1 0.71 |
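
The Stage-1 soft-target match listed above, KL(p* ‖ softmax(logits[label_ids])), can be illustrated with a minimal pure-Python sketch. This is not the training code: the function name `probe_kl`, the toy logits, and the label ids are hypothetical, and a real run would take the logits from the model at the Answer: token.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def probe_kl(target, logits, label_ids, eps=1e-12):
    """KL(p* || q), where q is the softmax over the logits at label_ids only.

    target    : soft target p*, one probability per label (sums to 1)
    logits    : full-vocabulary logits at the probe position (hypothetical input)
    label_ids : token ids of the forced-choice answer labels
    """
    q = softmax([logits[i] for i in label_ids])
    return sum(
        p * (math.log(p) - math.log(max(qi, eps)))
        for p, qi in zip(target, q)
        if p > 0  # terms with p = 0 contribute nothing to the KL
    )

# Toy example: a 6-token vocabulary with three answer labels (all values made up).
vocab_logits = [0.1, 2.0, -1.0, 1.0, 0.5, 3.0]
label_ids = [1, 3, 5]
p_star = [0.25, 0.10, 0.65]

kl = probe_kl(p_star, vocab_logits, label_ids)
print(f"belief KL on the toy example: {kl:.4f}")
```

Restricting the softmax to the label tokens means the belief is defined only over the forced-choice options, so probability mass on unrelated vocabulary entries does not affect the KL.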

Notes

  • Merge the LoRA with credal_verl08/remerge_sft_qwen35.py, which applies the model.layers.model.language_model.layers. key remap; the stock verl.model_merger writes an unmerged copy of the base instead.
  • Reward R = α·(−CE(p*_H, q_H)) + (1−α)·mean_s(−CE(p*_R, q_R)), with α = 0.6. Because CE(p*, q) ≥ H(p*), with equality only at q = p*, the reward is bounded above by −H(p*). The −1.225 reward plateau IS that optimum, not a truncation artifact.
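
The reward note above rests on a standard fact: cross-entropy CE(p*, q) is minimized at q = p*, where it equals the entropy H(p*). The sketch below checks this numerically for a reward of the same shape. It is an illustration under assumptions, not the training reward: the function names and the toy distributions are hypothetical, and the H and R terms are treated simply as one target/prediction pair plus a list of pairs that is averaged.

```python
import math

def cross_entropy(p, q):
    """CE(p, q) = -sum_i p_i * log q_i, skipping zero-probability targets."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = CE(p, p)."""
    return cross_entropy(p, p)

def reward(p_h, q_h, p_rs, q_rs, alpha=0.6):
    """R = alpha * (-CE(p_h, q_h)) + (1 - alpha) * mean_s(-CE(p_r, q_r))."""
    head = -cross_entropy(p_h, q_h)
    rest = sum(-cross_entropy(p, q) for p, q in zip(p_rs, q_rs)) / len(p_rs)
    return alpha * head + (1 - alpha) * rest

# Toy targets (made-up numbers, not taken from the game).
p_h = [0.5, 0.3, 0.2]
p_rs = [[0.7, 0.3], [0.4, 0.6]]

# Upper bound: every cross-entropy replaced by the matching entropy.
bound = 0.6 * (-entropy(p_h)) + 0.4 * (-sum(entropy(p) for p in p_rs) / len(p_rs))

best = reward(p_h, p_h, p_rs, p_rs)  # predictions equal the targets
sharper = reward(p_h, [0.8, 0.1, 0.1], p_rs, [[0.9, 0.1], [0.2, 0.8]])
flatter = reward(p_h, [1 / 3] * 3, p_rs, [[0.5, 0.5], [0.5, 0.5]])

print(f"bound={bound:.4f} best={best:.4f} sharper={sharper:.4f} flatter={flatter:.4f}")
```

Both the over-confident and the uniform predictions score below the bound, while predictions equal to the targets reach it exactly. A reward that flattens at −H(p*) is therefore consistent with a calibrated policy rather than a clipped reward.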
