Accuracy is the wrong headline here, and you named it. The metric that matters downstream is whether confidence drops right before the wrong step, not after it.
In an agent loop that gap is the whole game. A model that knows it is unsure stops and re-plans. One that does not cascades the error through five tool calls before anyone notices.
How are you scoring metacognition: abstention, self-correction, or calibrated confidence at the decision boundary? Those three reward very different models.