Blind by Design: The Invisible Signal That Breaks AI Evaluation
- Jonathan Kreindler, Receptiviti Co-Founder

AI evaluation, fine-tuning, and governance are built around AI systems that condition their behavior on variables none of those processes can see. Large language models pick up basic signals about the user’s state, like urgency, frustration, and cognitive load, and adjust their responses accordingly. This behavior wasn’t engineered; it’s the result of training models on human communication in which psychological information is embedded within language. But while that signal influences the model’s output, it disappears without being named, logged, or measured. The result is a system that makes consequential adjustments to its behavior that its own evaluation pipeline has no way to track, explain, or improve.

Something Is Already Happening
When a user is panicked, a well-tuned LLM responds differently than when they’re calmly working through an idea. When someone is overloaded, a well-tuned model tightens its structure and shortens its explanations; when a prompt sounds urgent, the model adjusts accordingly as well.
This conditional behavior wasn’t designed; it’s the result of training models on massive amounts of human communication in which psychological information is embedded in language as a signal. The model picks up fragments of that signal in user language, but the larger system surrounding the model (the eval pipeline, the fine-tuning loop, the prompt-engineering process) is built as if the signal doesn’t exist. The model adjusts based on it, yet the signal vanishes without ever being named, logged, extracted, or made available to anything downstream.
And this is why efforts in AI evaluation, fine-tuning, and governance often hit a wall: teams can’t consistently measure adaptation quality, reliably debug failures that appear only in certain user-state conditions, or deliberately improve the adaptation that’s already happening.
What We’re Actually Calling “Inconsistent Performance”
A user who is anxious and time-pressured writes differently from one who is calm and curious, but evaluation teams typically treat both inputs as identical prompts. This means post-training teams are measuring average performance across a hidden mix of psychological states and then labeling the natural variance as “inconsistency.”
When model performance plateaus, they assume they’ve hit the model’s limit. In reality, they’ve usually just maxed out performance in common situations while inadvertently ignoring how the model behaves in other, sometimes more critical user states, because they have no way to see or measure the difference.
The Inference Problem Isn’t the Real Problem
The issue is an architectural one: Whatever the model infers stays trapped inside its internal computation, often in the residual stream or scattered attention heads, and never becomes a clean, stable, quantifiable signal that can be used for evaluation, fine-tuning, logging, or governance purposes. Without a dedicated user-state representation layer, you can’t measure it, debug it, or deliberately improve how the model adapts.
Why Current AI Evaluation and Fine-Tuning Techniques Hit a Ceiling
Teams only have three main levers - prompting, fine-tuning, and evaluation - but all three are blind to user state, so they keep hitting the same invisible ceiling:
Prompting influences behavior globally. You can instruct the model to be more concise or more empathetic, but you can’t easily tell it to respond differently depending on the user’s current state at runtime. Basic adjustments are feasible, but you still can’t reliably control what you can’t observe. And while some production systems use crude distress detectors to trigger conditional prompts or safety rails, these remain ad hoc and unstandardized.
Fine-tuning influences behavior across a broad distribution of examples: It can make the model look better on average, but if the training data doesn’t include a good mix of different user states, the model gets stronger in common situations while its performance in rarer or more difficult psychological states stays weak or even gets worse. In the worst case, fine-tuning hides problems by improving typical outputs while ignoring the edges where the model struggles.
Evaluation measures outputs against expected answers, but without a user-state variable there is no way to break down performance by the psychological conditions under which the output was produced. Failures that only appear when users are anxious, overloaded, or urgent get averaged out and disappear into the overall score. As a result, teams keep improving what they can easily measure instead of fixing what is actually broken.
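As a rough illustration of that blind spot, here is a minimal Python sketch of state-sliced evaluation; the `user_state` tag and the toy records are illustrative assumptions, standing in for whatever label a validated measurement layer would attach to each eval case.

```python
# Minimal sketch: the same eval records scored two ways. The `user_state`
# tag is an illustrative assumption; any validated state label would do.
from collections import defaultdict

records = [
    {"user_state": "calm",       "passed": True},
    {"user_state": "calm",       "passed": True},
    {"user_state": "calm",       "passed": True},
    {"user_state": "calm",       "passed": True},
    {"user_state": "calm",       "passed": True},
    {"user_state": "anxious",    "passed": False},
    {"user_state": "overloaded", "passed": True},
    {"user_state": "urgent",     "passed": False},
]

# Aggregate score: looks reassuring, hides the state-conditional failures.
overall = sum(r["passed"] for r in records) / len(records)
print(f"overall pass rate: {overall:.2f}")  # 0.75

# State-sliced score: the failures cluster in specific user states.
by_state = defaultdict(list)
for r in records:
    by_state[r["user_state"]].append(r["passed"])
for state, outcomes in sorted(by_state.items()):
    print(f"{state:>10}: {sum(outcomes) / len(outcomes):.2f}")
```

In this toy mix the aggregate score looks healthy while the anxious and urgent slices fail every case, which is exactly the averaging-out described above.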
What a State-Aware System Would Look Like
None of this requires better base models; it requires systems that make the already-present but noisy user-state signal visible, measurable, and actionable: an explicit, low-dimensional user-state embedding produced by a validated user-state measurement layer that can run in a container. This embedding can be injected at inference time as control tokens or conditional prompt logic, while also serving as a first-class signal for logging and evaluation.
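A minimal sketch of that injection path, assuming a hypothetical `estimate_user_state()` function stands in for the measurement layer; the dimensions, thresholds, and token format are illustrative, not a reference design.

```python
# Minimal sketch: a hypothetical estimate_user_state() stands in for the
# validated measurement layer; thresholds and token format are illustrative.

def estimate_user_state(text: str) -> dict:
    """Stand-in for a containerized user-state measurement layer:
    returns a low-dimensional, named state embedding."""
    # A real system would run a validated psycholinguistic model here.
    return {"cognitive_load": 0.8, "urgency": 0.7, "frustration": 0.2}

def bucket(value: float) -> str:
    """Coarse bucketing used only to render readable control tokens."""
    return "high" if value >= 0.66 else "mid" if value >= 0.33 else "low"

def to_control_tokens(state: dict) -> str:
    """Render the embedding as control tokens prepended to the prompt."""
    return " ".join(f"<{name}:{bucket(value)}>" for name, value in state.items())

user_message = "I need this fixed before the demo and nothing I try works."
state = estimate_user_state(user_message)

# Injected at inference time as control tokens...
prompt = f"{to_control_tokens(state)}\n{user_message}"
print(prompt)
# <cognitive_load:high> <urgency:high> <frustration:low>
# I need this fixed before the demo and nothing I try works.

# ...while the same embedding doubles as a first-class signal downstream.
log_record = {"user_state": state, "prompt_chars": len(prompt)}
```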
Training signals can be back-propagated through a distilled internal probe (auxiliary loss plus conditional RL) once teams have real data. Logging and evals can treat the embedding as a first-class axis (“performance by cognitive-load quartile”), and user controls, consent (opt-in to an “adaptive mode”), and auditability mean that governance is no longer flying blind.
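A rough PyTorch sketch of the auxiliary-loss half of that idea (the conditional-RL half is omitted), assuming hidden states and measurement-layer labels are already available in the training loop; module names, dimensions, and the loss weight are illustrative assumptions.

```python
# Rough PyTorch sketch of an auxiliary user-state probe; names, dimensions,
# and the loss weight are illustrative, not a reference design.
import torch
import torch.nn as nn

class StateProbe(nn.Module):
    """Distills a low-dimensional user-state estimate from hidden states."""
    def __init__(self, hidden_dim: int = 768, state_dim: int = 4):
        super().__init__()
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence, then project to the state embedding.
        return torch.sigmoid(self.head(hidden_states.mean(dim=1)))

probe = StateProbe()
aux_criterion = nn.MSELoss()

# Placeholders for what a real training step would already have on hand:
hidden_states = torch.randn(8, 32, 768)   # [batch, seq, hidden] from the LM
state_labels = torch.rand(8, 4)           # labels from the measurement layer
lm_loss = torch.tensor(2.3)               # the usual language-modeling loss

# The auxiliary loss keeps an internal, probe-readable representation
# aligned with the external, validated state signal.
aux_loss = aux_criterion(probe(hidden_states), state_labels)
total_loss = lm_loss + 0.1 * aux_loss      # 0.1 is a tunable weight
```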
Evaluation would show how well the model performs in different user states such as when users are stressed, overloaded, curious, or urgent, rather than just averaging everything together or grouping by task and domain.
Prompts would contain conditional logic tied to the detected user-state (via the embedding) rather than blanket instructions.
Logging would treat user-state signals as first-class telemetry instead of hidden patterns buried in the text.
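A minimal sketch of the last two items together: conditional prompt logic keyed to a detected state, and that same state logged as first-class telemetry. The state fields, thresholds, and log schema are illustrative assumptions.

```python
# Minimal sketch: conditional prompt logic keyed to a detected state, plus
# the same state logged as telemetry. Fields and thresholds are illustrative.
import json
import time
import uuid

state = {"cognitive_load": 0.82, "urgency": 0.71, "frustration": 0.15}

# Conditional prompt logic instead of a blanket instruction.
if state["cognitive_load"] >= 0.66:
    style_instruction = "Answer in short, numbered steps; defer background detail."
elif state["urgency"] >= 0.66:
    style_instruction = "Lead with the fix; keep caveats to one line."
else:
    style_instruction = "Explain the reasoning as well as the answer."

# User-state as first-class telemetry, not a pattern buried in the raw text.
log_record = {
    "interaction_id": str(uuid.uuid4()),
    "timestamp": time.time(),
    "user_state": state,                     # the embedding itself
    "style_instruction": style_instruction,  # what it triggered
}
print(json.dumps(log_record, indent=2))
```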
Why the Missing Variable Distorts Everything
A system that can’t observe the conditions that influence how it acts will misinterpret its own performance: without a way to track user state, it can’t connect causes to outcomes.
Failures also don’t happen evenly, and dangerous problems get hidden inside what appear to be reassuring average scores, making the system look reliable while it’s quietly failing when it matters most. As a result, teams waste effort improving areas that are already strong, while critical weaknesses in specific user states go unnoticed. In doing so, they’re leaving a critical signal on the table.
The Governance Problem That Results
Without a representation of user-state, there is no reliable way to explain why two seemingly similar cases were handled differently, or why the same system behaves one way today and another way tomorrow. If policy enforcement or safety classifiers can’t condition on “user is in acute distress” versus “user is role-playing distress,” you get both false negatives and false positives, and this means that audit logs that don’t surface the implicit state are basically theater.
An article published on April 14, 2026 in Nature Reviews Psychology makes the stakes clear: conversational AI systems routinely miss indirect signals of psychological distress, and without embedded psychological expertise, models can’t reliably detect, respond to, or govern these high-stakes moments (Zhao, 2026).
This creates a serious mismatch: the system is making decisions that depend on user state, but governance, audit, and oversight frameworks evaluate it as if those conditions do not exist. If the system can’t represent the variables that actually drive its behavior, it can’t provide a stable foundation for accountability or control.
The Engineering Choice
Making interaction state explicit and observable strengthens both model performance and governance, but it must be treated as system instrumentation, not user profiling. The goal is not to model the user. It’s to capture properties of the conversation as it unfolds - signals like cognitive load, uncertainty, or urgency - as transient, auditable variables that are decoupled from identity. Done properly, this creates a usable interface for debugging, evaluation, and control, without introducing persistent user models.
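A small sketch of what “instrumentation, not profiling” could look like in code: a transient, identity-free state record scoped to a single conversation. The field names and TTL are illustrative assumptions.

```python
# Sketch of "instrumentation, not profiling": a transient, identity-free
# state record scoped to one conversation. Field names and the TTL are
# illustrative assumptions.
from dataclasses import dataclass, field
import time

@dataclass
class ConversationState:
    # Properties of the conversation as it unfolds; no user identifier.
    cognitive_load: float
    uncertainty: float
    urgency: float
    captured_at: float = field(default_factory=time.time)
    ttl_seconds: int = 3600  # discarded after the session, never persisted to a profile

    def expired(self) -> bool:
        return time.time() - self.captured_at > self.ttl_seconds

state = ConversationState(cognitive_load=0.7, uncertainty=0.4, urgency=0.9)
assert not state.expired()  # auditable while the session lasts, then gone
```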
This doesn’t come down to a choice between a blind system and a perfect one; it’s a choice between a system that is blind by design and a system that is sighted by deliberate, responsible engineering.
The AI teams that get this right will have two significant advantages: their models will become more powerful, and they’ll truly understand what’s driving their models’ behavior.