arxiv:2602.20273

The Truthfulness Spectrum Hypothesis

Published on Feb 23 · Submitted by Josh Ying on Feb 26

Abstract

Large language models contain truth directions ranging from domain-general to domain-specific in their representational space, with linear probes showing varying generalization capabilities and causal interventions revealing differential effectiveness of these directions.

AI-generated summary

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is available at https://github.com/zfying/truth_spec.
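The Mahalanobis cosine similarity used above is not defined on this page. A minimal sketch, assuming it means cosine similarity under a covariance-derived inner product between probe weight vectors; the use of the inverse covariance below follows the usual Mahalanobis convention and is my assumption, as the paper's exact convention (covariance vs. its inverse) is not stated here:

```python
import numpy as np

def metric_cosine(u, v, M):
    """Cosine similarity under the inner product <a, b>_M = a^T M b."""
    num = u @ M @ v
    den = np.sqrt((u @ M @ u) * (v @ M @ v))
    return num / den

def mahalanobis_cosine(u, v, X):
    """Cosine between probe directions u and v, reweighted by the covariance
    of activations X so that comparison focuses on directions that matter
    in the data, rather than treating all axes equally."""
    cov = np.cov(X, rowvar=False)          # (d, d) activation covariance
    M = np.linalg.pinv(cov)                # pseudo-inverse, per Mahalanobis convention
    return metric_cosine(u, v, M)
```

With an identity metric this reduces to the standard cosine similarity, so the two quantities can be compared directly on the same probe pairs.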

Community


We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it!

  • We create FLEED datasets (definitional, empirical, logical, fictional, ethical truth) + new sycophantic lying + expectation-inverted datasets.

  • Both prior probes and ours are orthogonal to sycophantic lying, and are even anti-correlated with expectation-inverted lying.

  • Yet training one probe on all domains works everywhere! Takeaway: train on more diverse data.

  • Probe geometry predicts generalization performance! Mahalanobis cosine similarity between probe directions, which reweights by data covariance to focus on the directions that matter, near-perfectly predicts OOD generalization (R²=0.98). Standard cosine similarity? Only R²=0.56.

  • Post-training reorganizes truth geometry! In base models, sycophantic lying is more closely aligned with the other types of lying than it is in chat models; post-training pushes them apart. This gives a representational account of why chat models are more sycophantic than base models.

  • We propose Stratified INLP: an iterative erasure procedure that first extracts highly domain-general directions, then removes them to reveal highly domain-specific directions. This lets us constructively identify both ends of the spectrum.

  • Surprising causal experiments: domain-specific directions steer better than domain-general ones! Takeaway: Domain-general probes may be great for monitoring, but might not be great for intervention.
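Stratified INLP is described above only at a high level. A minimal sketch in the spirit of iterative nullspace projection (INLP): train a probe on pooled data, record its direction, project it out of the activations, and repeat; the residual space is then left for finding domain-specific directions. The `LogisticRegression` probe choice and the number of extracted directions are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def project_out(X, w):
    """Remove from each row of X its component along direction w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def stratified_inlp(X, y, n_general=3):
    """Extract domain-general truth directions by repeatedly fitting a linear
    probe on the pooled (all-domain) data and erasing its direction.
    Returns the unit directions and the residual activations, in which
    domain-specific probes can subsequently be trained per domain."""
    general_dirs = []
    Xr = X.copy()
    for _ in range(n_general):
        clf = LogisticRegression(max_iter=1000).fit(Xr, y)
        w = clf.coef_.ravel()
        general_dirs.append(w / np.linalg.norm(w))
        Xr = project_out(Xr, w)            # erase this direction before re-fitting
    return np.array(general_dirs), Xr
```

The stratification is the two-stage ordering: general directions are pulled out first, so whatever linear truth signal remains in `Xr` is, by construction, not shared across the pooled domains.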

