arxiv:2602.20273

The Truthfulness Spectrum Hypothesis

Published on Feb 23 · Submitted by Josh Ying on Feb 26

Abstract

Large language models contain truth directions ranging from domain-general to domain-specific in their representational space, with linear probes showing varying generalization capabilities and causal interventions revealing differential effectiveness of these directions.

AI-generated summary

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is available at https://github.com/zfying/truth_spec.
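The Mahalanobis cosine similarity used above is not defined on this page. A minimal sketch, assuming it means cosine similarity under a covariance-derived inner product between probe weight vectors; the use of the inverse covariance below follows the usual Mahalanobis convention and is my assumption, as the paper's exact convention (covariance vs. its inverse) is not stated here:

```python
import numpy as np

def metric_cosine(u, v, M):
    """Cosine similarity under the inner product <a, b>_M = a^T M b."""
    num = u @ M @ v
    den = np.sqrt((u @ M @ u) * (v @ M @ v))
    return num / den

def mahalanobis_cosine(u, v, X):
    """Cosine between probe directions u and v, reweighted by the covariance
    of activations X so that comparison focuses on directions that matter
    in the data, rather than treating all axes equally."""
    cov = np.cov(X, rowvar=False)          # (d, d) activation covariance
    M = np.linalg.pinv(cov)                # pseudo-inverse, per Mahalanobis convention
    return metric_cosine(u, v, M)
```

With an identity metric this reduces to the standard cosine similarity, so the two quantities can be compared directly on the same probe pairs.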

Community


We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it!

  • We create FLEED datasets (definitional, empirical, logical, fictional, ethical truth) + new sycophantic lying + expectation-inverted datasets.

  • Both prior probes and ours are orthogonal to sycophantic lying, and are even anti-correlated with expectation-inverted lying.

  • Yet training one probe on all domains works everywhere! Takeaway: train on more diverse data.

  • Probe geometry predicts generalization performance! Mahalanobis cosine similarity between probe directions, which reweights by data covariance to focus on the directions that matter, near-perfectly predicts OOD generalization (R²=0.98). Standard cosine similarity? Only R²=0.56.

  • Post-training reorganizes truth geometry! In base models, sycophantic lying is more closely aligned with the other types of lying than it is in chat models; post-training pushes them apart. This gives a representational account of why chat models are more sycophantic than base models.

  • We propose Stratified INLP: an iterative erasure procedure that first extracts highly domain-general directions, then removes them to reveal highly domain-specific directions. This lets us constructively identify both ends of the spectrum.

  • Surprising causal experiments: domain-specific directions steer better than domain-general ones! Takeaway: Domain-general probes may be great for monitoring, but might not be great for intervention.
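Stratified INLP is described above only at a high level. A minimal sketch in the spirit of iterative nullspace projection (INLP): train a probe on pooled data, record its direction, project it out of the activations, and repeat; the residual space is then left for finding domain-specific directions. The `LogisticRegression` probe choice and the number of extracted directions are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def project_out(X, w):
    """Remove from each row of X its component along direction w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def stratified_inlp(X, y, n_general=3):
    """Extract domain-general truth directions by repeatedly fitting a linear
    probe on the pooled (all-domain) data and erasing its direction.
    Returns the unit directions and the residual activations, in which
    domain-specific probes can subsequently be trained per domain."""
    general_dirs = []
    Xr = X.copy()
    for _ in range(n_general):
        clf = LogisticRegression(max_iter=1000).fit(Xr, y)
        w = clf.coef_.ravel()
        general_dirs.append(w / np.linalg.norm(w))
        Xr = project_out(Xr, w)            # erase this direction before re-fitting
    return np.array(general_dirs), Xr
```

The stratification is the two-stage ordering: general directions are pulled out first, so whatever linear truth signal remains in `Xr` is, by construction, not shared across the pooled domains.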

