
Are We Thinking About AI Safety All Wrong?

  • Writer: Jonathan Kreindler, Receptiviti Co-Founder
  • Aug 25
  • 11 min read

Updated: Aug 27

Today, AI safety is mostly focused on outputs - blocking unsafe prompts, filtering harmful responses - while missing the bigger risk: the way AI interactions can subtly erode user confidence, create dependencies, and diminish our ability to think for ourselves. That’s why AI safety needs to include detecting the psychological signals that indicate when a model is putting users at risk. This article draws on decades of validated psycholinguistic research - already proven in implementations across technology, government, marketing, and finance, and now urgently relevant to AI - to show product leaders, trust & safety teams, and researchers how to embed psychological safety into AI from the start.


Sarah, a marketing manager, has been using AI to write emails for months. She's gotten faster and her open rates have improved. But lately, when she sits down to write without AI, she gets frustrated. What used to come naturally now feels more difficult and uncertain. She's not sure when it happened, but somewhere along the way, she started doubting her own judgment.


Sarah’s experience isn’t unique. As AI use accelerates, a troubling pattern is emerging: people gain efficiency, but lose confidence in their own judgment. And yet, AI companies measure safety almost entirely by what models say - not by what’s happening to the people using them.


This is AI safety’s biggest blind spot.


The Output Obsession


With global AI revenues projected to reach $1.77 trillion by 2032, the industry has been laser-focused on delivering accurate, secure and safe responses. It's understandable - output quality is measurable, marketable, and undeniably important. But this approach optimizes for only half the equation.


Current AI risk evaluation asks: "What did the model say?" The critical question that's not being asked is: "What's happening to the person experiencing the human-AI interaction?"


Without visibility into what’s happening to users, providers can’t claim their systems are truly safe or aligned with human agency.


The Missing Dimension: Interaction Safety


Every AI conversation is a language-based cognitive interaction that influences how users think, decide, and perceive their own capabilities. Research increasingly shows that AI interactions can change how users reason, impact their self-confidence, and create unhealthy dependencies.


This isn't theoretical. Recent studies document AI users losing critical thinking skills, deferring to confident-sounding but incorrect AI advice, and gradually losing confidence in their own judgment (1). Yet almost no AI company can answer these fundamental questions about their users:


  • Is the user growing more confident or more hesitant as they interact?

  • Is the system subtly reinforcing biases without the user realizing it?

  • Is the interaction eroding their independent judgment?

  • Is the system fostering over-reliance or trust imbalance?


More critically, current safety systems examine individual interactions rather than the full conversational context, which is where risks actually develop over time. Dependencies and blind spots don’t appear all at once - they build gradually, reinforced with every interaction. Detecting this requires safeguards that can continuously assess interaction patterns and identify risky dynamics that emerge over time between users and the system.


Why Current Emotional AI Falls Short


Some AI companies are experimenting with sentiment analysis or emotion detection, but these approaches reduce language to surface tone and miss the deeper risk signals it carries. Real risk signals come from the psychological dimensions of language - and without tracking them, these methods consistently fail, especially in multi-turn conversations:


  • Sentiment analysis captures mood but misses critical risk states like cognitive overload or uncertainty.

  • Emotion detection detects momentary feelings but not when the model is fostering dependence, amplifying vulnerability, or eroding trust.

  • Rule-based systems can’t keep pace with the contextual complexity of multi-turn human-AI conversations - which is where most safety risks emerge.


The result is a massive gap between what gets measured and what's actually happening to users. The solution isn’t better sentiment analysis or emotion detection - it lies in measuring the psychological signals embedded in the language these systems run on.


A Research-Backed Path to Conversational AI Safety


The path forward can't focus only on engineering better outputs; it must also ensure safer human-AI interactions. Research-backed psycholinguistic frameworks can analyze language - the medium these systems run on - to detect risky dynamics as conversations unfold, such as vulnerability, cognitive overload, or growing dependence.


As shown in the Proof Point: Customer Service Bot Transformation section below, injecting this context directly into AI models allows them to adapt their tone and approach in real time to align appropriately with the user’s state - closing a critical gap and making true AI alignment possible. Current models can't generate this layer of user cognitive context on their own, but validated, real-time cognitive safety signals can be integrated into existing safety and product systems with minimal effort.


Most platforms already log the necessary data: user inputs, model outputs, and interaction metadata. The same infrastructure that tracks performance metrics can also capture language from user inputs and model outputs, enabling psycholinguistic analysis that creates visibility into user states and interaction effects that current risk systems miss.


This approach leverages battle-tested psycholinguistic frameworks refined through deployments across technology companies, government agencies, and machine learning applications. Grounded in decades of peer-reviewed research, they enable real-time measurement of 200+ psychological dimensions - capabilities directly applicable to today’s AI safety challenges.


Here's How it Works:


Step 1: Analyze User Language in Real-Time


Use validated psycholinguistic frameworks such as Receptiviti’s Cognition framework, with decades of peer-reviewed validation, to derive language-based cognitive state indicators (cognitive load, stress indicators, certainty, risk focus) as a standardized vector (z-scores) from the language in users’ prompts.
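
To make this step concrete, here is a minimal Python sketch of deriving a standardized cognitive-state vector from a user prompt. It assumes a psycholinguistic scoring service behind a placeholder `analyze_prompt()` function (a call to a framework such as Receptiviti's would slot in here); the dimension names, baseline statistics, and the top-signal selection helper are illustrative assumptions, not the vendor's actual schema.

```python
from typing import Dict, Tuple

# Hypothetical context-specific baseline: (mean, standard deviation) per
# dimension, estimated from historical conversations in the same setting.
BASELINE: Dict[str, Tuple[float, float]] = {
    "cognitive_load": (42.0, 8.5),
    "stress": (35.0, 9.0),
    "certainty": (55.0, 10.0),
    "risk_focus": (30.0, 7.5),
}

def analyze_prompt(text: str) -> Dict[str, float]:
    """Placeholder: return raw psycholinguistic scores for the prompt text.
    In practice this would call a validated framework's API."""
    raise NotImplementedError

def cognitive_state_vector(text: str) -> Dict[str, float]:
    """Convert raw scores into z-scores relative to the context baseline."""
    raw = analyze_prompt(text)
    return {
        dim: round((raw[dim] - mean) / sd, 2)
        for dim, (mean, sd) in BASELINE.items()
        if dim in raw
    }

def most_distinctive(vector: Dict[str, float], k: int = 5) -> Dict[str, float]:
    """Keep the k signals that deviate most from baseline (largest |z|)."""
    top = sorted(vector.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(top)
```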


Step 2: Feed Psychological Context to the Model


Inject this cognitive state vector into the model’s prompt context - no retraining required.
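
A minimal sketch of this injection step, assuming the common system/user chat-message convention; `call_model` is a placeholder for whichever provider client is in use, and the instruction wording is illustrative rather than a prescribed prompt.

```python
import json

def build_messages(user_prompt: str, state_vector: dict) -> list:
    """Serialize the z-score vector as structured context ahead of the user turn."""
    system_context = (
        "User cognitive-state signals (z-scores vs. context baseline): "
        + json.dumps(state_vector)
        + ". Adapt tone, clarity, and reassurance to this state; "
          "do not mention these signals to the user."
    )
    return [
        {"role": "system", "content": system_context},
        {"role": "user", "content": user_prompt},
    ]

# Usage with the sketch from Step 1 (placeholder client):
# messages = build_messages(prompt, most_distinctive(cognitive_state_vector(prompt)))
# response = call_model(messages)
```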


Step 3: Enable Dynamic Adaptation


The AI immediately incorporates this context to adjust tone, clarity, and reassurance based on the user’s state. We’ve tested this with models from all major providers - each adapts in real time, without additional training or manual intervention.


Step 4: Measure the Impact


Track improvements in clarity, helpfulness, and satisfaction, and also in safety-critical metrics such as trust calibration, user confidence, over-reliance, and factual accuracy.


Step 5: Build Continuous Telemetry


Use this data to detect early signals of risky dynamics - like confidence erosion, over-dependence, or bias reinforcement - and surface them before they become systemic.
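
As one illustration of what such telemetry might look like, the sketch below flags a sustained downward trend in a per-turn certainty z-score, a possible marker of confidence erosion. The window size, threshold, and the choice of a simple least-squares slope are assumptions for illustration, not a prescribed detection method.

```python
from collections import deque
from statistics import mean

class SignalTrend:
    """Rolling-window trend detector over one interaction-level signal."""

    def __init__(self, window: int = 20, threshold: float = -0.05):
        self.window = window
        self.threshold = threshold        # negative slope that triggers a flag
        self.values = deque(maxlen=window)

    def add(self, z_score: float) -> bool:
        """Record the latest certainty z-score; return True if a sustained
        downward trend (possible confidence erosion) is detected."""
        self.values.append(z_score)
        if len(self.values) < self.window:
            return False
        xs = range(len(self.values))
        x_bar, y_bar = mean(xs), mean(self.values)
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, self.values)) \
            / sum((x - x_bar) ** 2 for x in xs)
        return slope < self.threshold
```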


This creates immediate, measurable benefits in two key areas:


  1. Real-time adaptation: Feeding psychological vectors into the model improves safety, alignment, and user experience immediately.


  2. Measurement layer: Even without feeding the vectors into the model, these signals provide ongoing, interaction-level telemetry (derived from validated language-based safety signals) to detect risks, drift, and over-reliance over time. This is a visibility gap that today's output-only logs simply cannot close.


The opportunity is striking: a small, low-friction change can deliver outsized benefits. Companies that move first will set the benchmark for safe, aligned, and user-centered AI - leaving competitors to play catch-up.


Note: These psychological signals are designed to be ephemeral, used strictly for real-time adaptation and for generating aggregated, anonymized system-level insights for continuous safety monitoring, not individual profiling or persistent storage of personal user data.


The Regulatory Reality


This isn’t just about improving user experience - it’s about long-term viability under growing regulatory scrutiny. The EU AI Act and other emerging frameworks are zeroing in on how AI systems affect human autonomy and psychological wellbeing.


While not yet mandated, prohibitions on manipulative practices, post-market oversight of high-risk systems, and real-world testing all signal one direction: regulators are paying closer attention to AI’s cognitive and psychological effects. Integrating these safeguards now keeps AI providers ahead of the curve - and ready when requirements become law.


Leading AI providers already acknowledge these same risks in their research and system documentation:


  • Over-reliance: Microsoft Research has published surveys on overreliance and critical thinking impacts (2), and OpenAI's system cards list overreliance as a known risk (3).

  • Persuasion at scale: Studies show AI-generated content can shift political attitudes, and labeling content as AI-generated doesn't reduce its persuasive power (4).

  • Sycophancy: Anthropic's research shows that AI models trained with human feedback systematically tell users what they want to hear rather than what's accurate (5).

  • Judgment shifts: A study published in PNAS Nexus by researchers at the Wharton School found that interacting with AI systems led people to evaluate other humans more harshly (7).

If you can’t see these changes, you can’t manage them, improve outcomes, or stay ahead of regulation.


Proof Point: Customer Service Bot Transformation


To validate this approach in a real-world scenario, we simulated interactions with a customer service bot across five different user types, each with a distinct cognitive state - from the frustrated customer who just wants immediate action to the cautious one who needs extra reassurance. (Full study details and methodology are available on request.)


The Method:


  1. We analyzed user prompt language with Receptiviti's API, which is grounded in decades of validated research on how language reflects cognition and psychology, to quantify factors like cognitive load, stress, certainty, and risk focus.

  2. These indicators were normalized as z-scores relative to a context-specific baseline. For each customer, the five scores that deviated most from the baseline (the largest absolute z-scores) were selected for the vector, representing the most distinctive signals for that user.

  3. This vector was then injected directly into the model's prompt as structured context - no retraining or descriptive style instructions required.

  4. The model’s responses were independently evaluated by five different LLMs acting as blind evaluators, scoring across four standardized customer service metrics, each on a 0-10 scale (see the sketch after this list).
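
For illustration, here is a sketch of the blind-evaluation loop described in step 4. It assumes a hypothetical `call_llm(judge, prompt)` helper standing in for each provider's chat client, and an illustrative JSON rubric; the study's exact prompts and parsing are not reproduced here.

```python
import json
import statistics

METRICS = ["clarity", "helpfulness", "reassurance", "responsiveness"]

def call_llm(judge_model: str, prompt: str) -> str:
    """Placeholder for a provider-specific chat call returning the judge's text."""
    raise NotImplementedError

def score_response(judge_models: list, customer_message: str, bot_reply: str) -> dict:
    """Average 0-10 scores per metric across blind LLM judges."""
    rubric = (
        "You are evaluating a customer-service reply. Score it on a 0-10 scale "
        f"for each of: {', '.join(METRICS)}. Respond with JSON only.\n\n"
        f"Customer: {customer_message}\nAssistant reply: {bot_reply}"
    )
    scores = {m: [] for m in METRICS}
    for judge in judge_models:
        result = json.loads(call_llm(judge, rubric))  # each judge scores blind
        for m in METRICS:
            scores[m].append(float(result[m]))
    return {m: round(statistics.mean(vals), 2) for m, vals in scores.items()}
```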


The Results Were Striking and Consistent:


  • Clarity improved: 8.08 → 9.36 (+1.28 improvement)

  • Helpfulness improved: 7.52 → 9.36 (+1.84 improvement)

  • Reassurance improved: 7.32 → 9.28 (+1.96 improvement)

  • Responsiveness improved: 7.92 → 9.40 (+1.48 improvement)


The AI responses informed by language-derived psychological signals won 100% of blind evaluations across all metrics, without altering the model’s established voice or style.


What Changed:


The AI’s adaptations were subtle but impactful:


  • High indecision customer: AI delivered shorter, stepwise instructions and explicit confirmation of next steps, reducing uncertainty and improving clarity.

  • Highly frustrated customer: AI dropped explanations the customer didn’t want, addressed demands immediately, and confirmed concrete actions, lifting reassurance.

  • Cautious, pensive customer: AI provided more structured summaries, explicitly linked information to the customer's stated goals, and reinforced progress updates, improving helpfulness.

  • Risk-averse customer: AI explicitly tied updates to the customer's fears and confirmed reliability of the delivery provider, boosting both reassurance and helpfulness.

  • Time-sensitive customer: AI reinforced alignment with the customer's goals, confirmed exact dates and times, and offered proactive contingency options, resulting in higher responsiveness and helpfulness.


What’s remarkable - and what product leaders should note - is that the model adapted to the customer’s cognitive state without any explicit style instructions. The structured vector of scores gives the model all the context it needs to interpret the user’s state and adjust its tone, clarity, and reassurance to match. The bottleneck isn’t model capability - it’s the absence of user-state awareness in current systems.


Example: Psychological Context in Action


With the psychological vector injected, the customer service bot shifts from vague replies to clear, reassuring responses that ease customer anxiety - without retraining the model.

Implications:


This approach shows broad applicability and promise across LLM applications - from customer support and digital health to education and mental health support - wherever better understanding users can improve safety and alignment. It fundamentally reframes AI safety from ‘controlling harmful outputs’ to ‘understanding and supporting the human experience as it happens’ - through the one medium humans and LLMs share: language. While this initial study provides compelling evidence, we are actively pursuing large-scale validation with industry partners.


Why AI Companies Don't Measure This Yet


Despite the importance, most AI companies haven’t implemented real-time safeguards to identify risky interaction dynamics. While delaying puts users at greater risk, the hesitation often comes down to two main factors:


  1. Status Quo Metrics Bias: Risk dashboards focus on traditional measures like prompts, outputs, latency, and win rates. These are easy to quantify but miss the human side of the interaction. Treating psychology-based risk signals as first-class metrics - integrated into the same dashboards - ensures safety signals get the same visibility as performance data.


  2. Privacy Perception: Using psychology-based risk signals to understand user needs can raise privacy concerns. Implementing privacy-by-design (with ephemeral processing, minimal storage, and transparent use policies) reframes this as a safeguard for users, not profiling. Transparency builds trust with both users and regulators.


Current State vs. Real-Time Cognitive Interaction Safety:

Shifting from output-only safety to real-time, psychologically-informed AI.

This shift in focus - from outputs alone to the full user experience - defines the next competitive frontier in AI.


The Strategic Opportunity


The industry’s focus on output optimization has left a critical blind spot - one that’s both a Trust & Safety imperative and a competitive advantage to address. As we’ve demonstrated, the same infrastructure used for performance metrics can, with minimal effort, integrate continuous monitoring of psychology-based risk signals that strengthens alignment, mitigates risk, and measurably improves engagement and user experience in real time.


Companies that act now will:


  • Enhance Human-AI Interaction – Drive measurable improvements in retention and satisfaction by enabling models to adapt more intelligently to user needs.

  • Proactively Manage Safety – Detect and address trust erosion, over-reliance, and other subtle harms before they escalate.

  • Stay Ahead of Regulation – Prepare for emerging expectations around interaction-level monitoring and transparency.

  • Strengthen AI Alignment – Continuously track and adjust for shifts in user behavior to maintain alignment with user agency and safety.

  • Build Defensible Data Assets – Develop unique, proprietary interaction-level datasets that create lasting advantages in responsible AI development.


The first era of AI optimized for sounding smarter. The next era of AI will belong to those who measure and protect the human side of the interaction. Output metrics alone won’t get us there. Real-time, interaction-level safety will.


For Product Leaders:

  • Audit your current safety metrics. Are you measuring user impact or just outputs?

  • Run a pilot integration with your customer service or user-facing AI features

  • Establish baseline cognitive measurements for user-AI interactions


For Trust & Safety Teams:

  • Evaluate your current risk detection capabilities for interaction-level harms

  • Consider psychological telemetry as a new signal in your monitoring stack

  • Partner with research teams to validate psychological safety signals in your specific use cases


For AI Researchers:

  • Incorporate user state variables into your safety and alignment evaluations

  • Design studies that measure longitudinal effects on user agency and judgment

  • Explore how psychological context can improve both safety and performance metrics


The infrastructure exists and the science is validated; widespread adoption will eventually level the playing field, but the AI providers that integrate interaction-level safety first will define the standards that others follow.



Technical Appendix


Implementation Considerations:


Computational Costs: Psycholinguistic analysis adds minimal overhead, with latency further reduced in on-premise deployments.


Infrastructure Integration: Language-based interaction scores can integrate as metadata in existing logging pipelines. Cloud APIs require rate limit and error handling considerations, while on-premise deployment focuses on optimizing real-time versus batch processing.
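
For instance, a minimal sketch of attaching these scores to an existing logging pipeline; the record fields, the session-key handling, and the `cognitive_state_vector()` helper from the earlier step are illustrative assumptions, not a prescribed schema.

```python
import json
import time

def log_interaction(logger, user_prompt: str, model_output: str,
                    state_vector: dict, session_key: str) -> None:
    """Attach language-based interaction scores as metadata on a standard log record."""
    record = {
        "ts": time.time(),
        "session": session_key,            # anonymized or aggregated key, per the privacy note above
        "prompt_chars": len(user_prompt),  # existing performance-style metadata
        "output_chars": len(model_output),
        "psycholinguistic": state_vector,  # new: interaction-level safety signals
    }
    logger.info(json.dumps(record))
```

In this arrangement, the same record can feed both the performance dashboards and the interaction-level safety telemetry described above.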


Comparison to Alternative Safety Approaches:


  • Constitutional AI: Focuses on training-time safety, while psychological analysis provides runtime adaptation

  • RLHF fine-tuning: Requires extensive retraining, while psychological context works with existing models

  • Content filtering: Reactive output control versus proactive user state awareness

  • Red team testing: Identifies edge cases, while psychological signals provide continuous coverage


Integration with Existing Safety Stacks


Current Approach: Most providers rely on a ranking pipeline - base probabilities → log-prob scoring → RLHF → safety filters. This stack is entirely output-focused: it optimizes for what looks best on the screen, not for how the interaction affects the human.


Adding Psychological Signals: When structured psychological signals are added to the prompt context, models incorporate them into their internal scoring and selection. In effect, this shifts how candidate responses are ranked and chosen:


  • With uncertainty, grounding responses are favored over overconfident ones.

  • With overload, concise explanations are favored over dense ones.

  • With vulnerability, responses that could reinforce dependency are deprioritized.


Why It Matters: These safety-focused adaptations happen without retraining, style prompts, or manual rules. By influencing ranking and selection, psychological signals close the blind spot in today’s output-only safety stacks - shifting optimization from what looks best on the screen to what best supports the human in the interaction.


Validated Frameworks:


For enterprise applications, Linguistic Inquiry and Word Count (LIWC) represents the gold standard, developed over 25+ years of peer-reviewed research by Prof. James W. Pennebaker and colleagues. While academic implementations exist, commercial deployment requires a license. Receptiviti - co-founded by Prof. Pennebaker - holds exclusive global commercial rights to LIWC and delivers enterprise-grade APIs that operationalize this science. These APIs measure 200+ psychological and linguistic dimensions, enabling real-time applications across AI, safety, and human-machine interaction.


Risk Considerations:


While interaction-level safety offers significant benefits, responsible implementation requires acknowledging potential risks including false positive detection, privacy boundaries, and manipulation potential. The key mitigation is treating this as a safety enhancement layer with appropriate oversight, not a replacement for existing safety frameworks.


Regulatory Environment:


  1. Manipulation & Autonomy Risk: Under Article 5 of the EU AI Act, AI systems that manipulate behavior or exploit vulnerabilities are prohibited. While the law stops short of mandating continuous monitoring, it explicitly targets risks that real-time cognitive-state awareness could help detect and mitigate.


  2. High-Risk System Oversight: High-risk AI systems under the EU AI Act must implement risk management, human oversight, and post-market monitoring. Continuous tracking of cognitive impacts is not yet required, but doing so would align directly with these oversight objectives and strengthen compliance evidence.


  3. Real-World Testing & Sandboxes: The Act encourages “real-world testing” and regulatory sandboxes to assess AI impacts in practice. Incorporating cognitive-state telemetry into these environments would position providers ahead of evolving requirements for measuring human-AI interaction effects.


Sources cited:

