Beyond Demographics: The Marketer's Guide to Psycholinguistic Cluster Analysis

Extracting Audience Psychology to Superpower Segmentation

Every dataset tells a story, but oftentimes the narrative is buried under layers of noise. Cluster analysis categorizes data into groups based on similarities, aiding in understanding patterns and making decisions. Psycholinguistic cluster analysis is the process of categorizing groups of individuals who are psychologically similar, based on analysis of their language.

For marketers and market researchers, psycholinguistic clustering offers an unparalleled understanding of the nuanced motivations and preferences of different consumer segments, enabling more precise targeting and personalized communication strategies that resonate with the psychology of the target audiences.

In this article, we provide a beginner-friendly overview of cluster analysis approaches, and we demonstrate a psycholinguistic-based cluster analysis of an audience comprised of thousands of healthcare professionals. We discuss the implications of the resulting psychological clusters for use in audience segmentation, and how this same methodology can be applied to other sources of language data, including focus group transcripts, open-ended survey responses, earnings call transcripts, emails, product reviews, and more.

What is Cluster Analysis? 

Cluster analysis refers to a variety of techniques designed to group data points by maximizing in-group similarity and inter-group dissimilarity. While many types of clustering techniques exist, this article focuses on k-means clustering, together with principal component analysis (PCA), a dimensionality reduction technique that is often paired with it.

Performing k-means clustering or PCA requires preparing a dataset that is sufficient in sample size and well-organized. Typically, such a dataset should have the following structure:

  • Rows: Each row corresponds to a particular data point or observation, such as a person, team, or group that will be treated as a unique entity during analysis.

  • Columns: Columns comprise continuous variables that describe features and attributes of each data point. This includes metadata, behavioral metrics, and other relevant information.

Example Data Table:

Unique ID per Individual | Number of Social Media Posts per Year | Purchases per Year

Below, we provide two overviews of k-means clustering and PCA. The first is a general explanation that uses an anecdote to make the concept easier to grasp, and the second is a technical overview for those of you who are more comfortable with statistical analyses.

A Non-Technical Overview of K-means Clustering and PCA: 

Imagine you have a big box of LEGOs and you want to sort the LEGOs into separate piles based on shared characteristics. In this simplified analogy, LEGOs = data points, piles = clusters, characteristics = continuous variables that describe features and attributes of each data point, and archetypal LEGOs = centroids.

K-means Clustering with LEGOs

K-means clustering is like having a sorting machine that helps you organize the LEGOs into distinct piles based on common characteristics.

  1. Identify the characteristics: Make a list of all the characteristics of the LEGOs that you'd like the sorting machine to consider when determining which LEGOs resemble each other (color, shape, size, LEGO set type, etc.).

  2. Decide the Number of Piles: Tell the sorting machine how many piles you want. Let's say you'd like four piles.

  3. Pick some archetypal examples: The sorting machine picks four LEGOs randomly from the box. These LEGOs are like guesses about what archetypal examples of the four piles of LEGOs look like.

  4. Group the LEGOs: Next, the sorting machine looks at each LEGO in the box and assigns it to one of the four archetypal LEGOs based on which LEGO it resembles most. So, the sorting machine makes groups of blocks around each archetypal LEGO.

  5. Change the archetypal LEGOs and move the other LEGOs: The sorting machine then calculates the archetypal LEGO of each pile it created and moves the other LEGOs based on the new archetypal LEGOs.

  6. Repeat: Steps 4 and 5 are repeated several times. Each time, the piles get more accurate.

  7. Done: Finally, when the pile archetypes don't need to change significantly anymore, the sorting machine stops. Now you have four nicely sorted piles of LEGOs.

Principal Component Analysis (PCA) with LEGOs

PCA identifies which LEGO characteristics are most important to pay attention to when sorting your LEGOs into different groups.

  1. List All Characteristics: First, you make a list of all the characteristics of the LEGOs (color, shape, size, LEGO set type, etc.).

  2. Find the Main Characteristics: PCA reveals that most of the variety in your box of LEGOs can be accounted for by a few main characteristics weighted by importance, such as color and shape, with shape being twice as important as color.

  3. Reduce Complexity: Instead of considering all the characteristics that make each LEGO unique while describing the piles, you now know which characteristics (color and shape) are most important to the sorting.

Complementing K-means Clustering with PCA

K-means clustering sorts your LEGOs into piles. PCA simplifies your understanding of why your LEGO piles are distinct from each other by identifying the most important characteristics on which to differentiate. You can also use PCA as an initial step to reduce the number of features included in your sorting approach while accounting for the most significant variance across LEGOs.

A Technical Overview of K-means Clustering and PCA:

K-means Clustering

K-means clustering is a popular cluster analysis technique that employs a machine learning algorithm to group similar data points based on their distance, typically Euclidean distance, from cluster centroids. 

Prior to running a k-means clustering analysis, one must specify the number of clusters (k) the algorithm should extract from the dataset. The optimal number of clusters can be estimated using a variety of empirical techniques, such as the silhouette method (which measures cluster quality by comparing how similar each data point is to its assigned cluster vs. other clusters) or the elbow method (which identifies the number of clusters beyond which adding more clusters yields only marginal reductions in within-cluster variance). When selecting k, it is also recommended to consider the best practices in your field.
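
To make the silhouette method more concrete, here is a minimal NumPy sketch that scores a given cluster assignment. It is an illustration only, not the code used in this analysis; the function name `mean_silhouette` and the toy points in the usage note are hypothetical. Scores near 1 suggest compact, well-separated clusters, while scores near or below 0 suggest points sitting in the wrong cluster.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient (Euclidean distance); needs >= 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is excluded via n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

In practice, one would run the clustering for several candidate values of k and compare the resulting scores. For four toy points forming two tight pairs, `mean_silhouette([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])` is high, while the mismatched labeling `[0, 1, 0, 1]` gives a negative score.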

As a second preliminary step, the continuous variables the cluster analysis will be based on must be selected so that the algorithm knows which features to consider when determining similarity (measured through distance). Note - if the continuous variables in your dataset vary in scale, it is recommended to standardize or normalize the data prior to clustering to minimize the possibility of clusters being skewed by feature magnitudes.
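
As a small illustration of why scaling matters, the sketch below standardizes each column to a mean of 0 and a standard deviation of 1 (z-scoring), one common way to put features on a comparable scale. The function name and the numbers are made up for illustration and are not taken from our dataset.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (X - mean) / std

# Hypothetical rows: [social media posts per year, purchases per year].
raw = np.array([[1200.0, 4.0], [300.0, 12.0], [2500.0, 7.0], [40.0, 1.0]])
scaled = standardize(raw)
```

Without this step, the posts column (values in the hundreds or thousands) would dominate Euclidean distances over the purchases column (values in the single or double digits).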

Once k-means clustering is initiated, the algorithm facilitates the following process:

  1. Randomly Select Cluster Centroids: The algorithm starts by randomly selecting centroids (centers) for each of the clusters. These centroids can be chosen either by picking random points in the feature-space or by selecting actual data points from the dataset.

  2. Assign Points to the Nearest Centroid: The algorithm assigns each data point to the cluster whose centroid is closest. With Euclidean distance, the differences between a data point and a centroid are squared for each feature, summed, and square-rooted to give the overall distance.

  3. Recalculate Centroids: For each cluster, the algorithm calculates a new centroid by finding the average position of all data points assigned to that cluster.

  4. Repeat the Process until convergence: Steps 2 and 3 are repeated until data points no longer change clusters between iterations and the centroids remain stable or until a predetermined number of iterations is reached.
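
The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (initial centroids drawn from the data points, Euclidean distance, a fixed random seed), not the implementation used in our analysis; the function name `kmeans` and its parameters are our own.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means; returns (labels, centroids, inertia)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changes cluster between iterations.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Within-cluster sum of squared distances (the quantity an elbow plot tracks).
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia
```

Running this function for a range of k values and plotting the returned inertia against k is one way to apply the elbow method. Because the starting centroids are random, results can differ between seeds, so it is common to run the algorithm several times and keep the solution with the lowest inertia.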

Principal Component Analysis (PCA) 

PCA is a data analysis method that simplifies datasets by identifying and extracting their primary defining features. It achieves this by transforming the original variables, or features, into new linear combinations known as principal components. These components are orthogonal (uncorrelated) to each other and are ordered so that each one captures as much of the remaining variance as possible.

PCA involves the following steps:

  1. Standardization or Normalization: Variables included in the PCA should be standardized or normalized to ensure that features with larger scales do not dominate the PCA.

  2. Covariance Matrix Computation: PCA computes the covariance matrix of the variables; for standardized variables, this is equivalent to the matrix of correlations between all possible pairs of variables. This matrix represents the strength and direction of interrelationships among all of the variables or features you're considering and ultimately determines how strongly each variable loads onto each component.

  3. Identify the Principal Components: This step involves calculating the eigenvectors (principal components) and eigenvalues (variance explained) of the matrix computed in the previous step. The eigenvectors represent the directions of maximum variance in the data, while the eigenvalues indicate the magnitude of variance along these directions. Thus, the eigenvectors are the principal components, and the eigenvalues are used to rank the principal components in order of significance.

  4. Create a Feature Vector with a Subset of Principal Components: PCA identifies a subset of the eigenvectors, selecting eigenvectors with the highest eigenvalues. The feature vector is a matrix composed of the eigenvectors that were chosen.

  5. Dimensionality Reduction: The feature vector is used to transform the variables of the original dataset based on the principal components by multiplying the transpose of the feature vector by the transpose of the standardized dataset. In doing so, PCA creates a new lower-dimensional (i.e., fewer variables) representation of the original dataset based on the principal components.
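
The five steps above map closely onto a short NumPy sketch. It is a minimal illustration rather than the code behind our results; the function name `pca` and the variable names are our own. With observations stored as rows, the projection in step 5 is written as `Z @ feature_vector`, which is the transposed form of multiplying the transposed feature vector by the transposed data.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition; returns (scores, feature_vector, explained)."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each feature (zero mean, unit sample variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized features, which equals
    # the correlation matrix of the original features.
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigenvectors are the principal components; eigenvalues are the
    # variance along each one. eigh is appropriate because cov is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: the feature vector keeps the top components as its columns.
    feature_vector = eigenvectors[:, :n_components]
    # Step 5: project the standardized data onto the chosen components.
    scores = Z @ feature_vector
    # Share of total variance captured by each retained component.
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, feature_vector, explained
```

The columns of `feature_vector` hold the loadings, i.e., the weight of each original variable on each component, and `scores` holds the coordinates that would be plotted on a two-dimensional scatter plot.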

How K-means Clustering and PCA Are Complementary Techniques

PCA makes it easier to visualize data clusters on a scatter plot by transforming the data into a lower-dimensional space while accounting for as much of the variability of the original dataset as possible. PCA can also be used prior to k-means clustering to minimize the number of features considered in the clustering process.

Language Data and Psycholinguistic Cluster Analysis

Traditional natural language processing techniques can be effective for basic sentiment analysis and topic and theme extraction, but a different approach is required to extract the linguistic features and representations needed for psycholinguistic cluster analysis.

Receptiviti’s API can be used to analyze raw text and generate scores on 200+ psycholinguistic dimensions that enable insight into the personality, motivations, thinking style and other aspects of the psychology of individuals or groups. These dimensions and the resulting scores can be used as features in clustering analyses to identify psychologically similar groups that can serve as the basis for psychology-based marketing personas and archetypes.

The Dataset

Our dataset is comprised of Reddit posts and comments written by 6,073 doctors and nurses over a 4-year period. The text from their posts was aggregated by author (n = 6,073, average word count per author = 28,397 words, minimum word count per author = 1,000 words). We used the Receptiviti API to assess the psychology of each doctor and nurse, prior to clustering them into psychologically similar groups.

Receptiviti offers two types of dimensions: dictionary-count and normed measures. Dictionary-count measures (including Receptiviti’s LIWC, LIWC Extension, Cognition, and Emotions frameworks) output raw scores (not normalized) based on the frequency of psychologically relevant categories of words in the analyzed text. Normed measures are algorithms that identify whether language is being used in a way that suggests a particular trait by relying on formulas that incorporate language categories relevant to the psychological phenomenon being measured. Each language category that contributes to a normed measure is independently normed, and the final output of the measure is a weighted average of the normed component scores. For the purpose of this study, normed measures were normalized based on a custom norming table derived from a large sample of language data (including social media posts, blogs, etc.) that is representative of how the general population writes.

Example Receptiviti Analysis Results Data Table Subset:

Unique Author ID

We included all continuous Receptiviti measures as features in the cluster analysis, standardizing the scale of the dimensions using a rank-norming approach. We determined that four was the optimal number of clusters and performed the k-means clustering analysis in Python. Upon completion of the analysis, we appended a column of cluster labels to the original data frame, tagging each healthcare worker in the dataset with their assigned cluster. The bar chart below shows the number of healthcare workers per cluster.
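
The exact rank-norming procedure is not detailed here, so the following is only one plausible reading, offered as an assumption: each value is replaced by its rank within its column, scaled to fall between 0 and 1. The function name `rank_norm` and the toy values are hypothetical.

```python
import numpy as np

def rank_norm(X):
    """Replace each value with its within-column rank, scaled to (0, 1]."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Applying argsort twice yields each value's 0-based rank in its column.
    # Ties are broken by row order here, which is a simplification.
    ranks = X.argsort(axis=0, kind="stable").argsort(axis=0)
    return (ranks + 1) / n

# Two made-up dimensions on very different scales.
toy = np.array([[10.0, 0.5], [1000.0, 0.1], [20.0, 0.3]])
normed = rank_norm(toy)
```

A rank-based transform of this kind puts every dimension on the same scale and dampens the influence of extreme values, at the cost of discarding the original spacing between scores.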

Number of Healthcare Workers in Psycholinguistic Cluster Analysis

After the k-means cluster analysis, we conducted a PCA to visualize the cluster results; the scatter plot shows the four clusters. While the clusters do not appear to have gaps between them or separate centers of density, they appear as distinct segments across two principal component axes. The table below highlights a subset of the dimensions with the most significant contributions, both positive and negative, to each principal component.

Principal Component Analysis Results Healthcare Workers Psycholinguistic Cluster Analysis

PCA Results Healthcare Workers Psycholinguistic Cluster Analysis


By reviewing the PCA loadings (i.e., the weighted dimensions of the principal components) and conducting a series of ANOVAs to determine statistically significant differences between the language-based psychological profiles of each cluster, we identified qualities that characterize each group of healthcare workers.
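
For readers unfamiliar with ANOVA, the sketch below shows how a one-way ANOVA F statistic could be computed for a single dimension across clusters. It is a simplified illustration with made-up numbers, not the analysis code or data from this study; the function name `one_way_anova_f` is our own. A larger F value means the cluster means differ more relative to the spread within each cluster.

```python
import numpy as np

def one_way_anova_f(values, labels):
    """F statistic comparing the mean of `values` across the groups in `labels`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    # Between-group variability: how far each cluster mean sits from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of scores around their own cluster mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return float((ss_between / df_between) / (ss_within / df_within))

# Made-up scores on one dimension for three hypothetical clusters.
scores = [1, 2, 3, 2, 3, 4, 5, 6, 7]
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
f_stat = one_way_anova_f(scores, clusters)
```

Turning the F statistic into a p-value requires the F distribution with the two degrees of freedom computed above; statistical packages such as SciPy (`scipy.stats.f_oneway`) handle both steps.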

Below we highlight a subset of the key findings:

Thinking Style Clusters

Healthcare professionals in Clusters 0 and 3 were more analytical and had a slower thinking style compared to those in Clusters 1 and 2. 

Highly analytical, slow (i.e., deliberative) thinkers tend to carefully evaluate ideas during decision-making and communicate solutions in a logical, structured manner. On the other hand, less analytical, fast thinkers tend to make more reflexive decisions based on intuition and communicate in a more narrative, casual tone. While people use both thinking styles depending on the task at hand, it is important to determine which style they rely on most often to understand how they typically process and communicate information.

Clusters 0 and 3 may be more adept at tasks like working through diagnosing complex medical conditions and developing detailed treatment plans, while the thinking styles of clusters 1 and 2 may allow them to make quick life-saving decisions.

Thinking Style Results Healthcare Workers Psycholinguistic Cluster Analysis
Analytical thinking Results Healthcare Workers Psycholinguistic Cluster Analysis

Emotional and Psychologically Stable Clusters

Healthcare professionals in Clusters 2 and 3 are more neurotic and negative in sentiment than those in Clusters 1 and 0.

Individuals with neurotic dispositions tend to express more negative emotions and are more vulnerable to stress, mood swings, depression, and insecurities. This is associated with greater psychological vulnerability and an increased risk of burnout.

Since careers in the healthcare industry can be extremely demanding, healthcare professionals in Clusters 2 and 3 may be less resilient or require more interventional support to prevent high rates of employee turnover.

Emotional Stability Results Healthcare Workers Psycholinguistic Cluster Analysis
Positive and Negative Emotion Results Healthcare Workers Psycholinguistic Cluster Analysis

People-Oriented and Agreeable Clusters

Clusters 1 and 2 score highest on measures related to sociability, suggesting that these healthcare workers may be energized by interacting and connecting with their co-workers and patients. Cluster 1 healthcare workers are also the most bold and assertive, suggesting they have a more domineering style. Clusters 0 and 3 are less people-oriented and agreeable; thus, they may prefer independence over collaborative group work. Cluster 3 healthcare workers are also highly task-focused, suggesting they may be effective at methodically executing objectives.

People-oriented individuals tend to enjoy social interactions, and those who are also bold and outgoing are often seen as charismatic and influential. In contrast, individuals who are more reserved and prioritize tasks over people typically come across as impersonal, unemotional, and diligent workers.

Agreeable people typically have an altruistic desire to get along with and help others. Their willingness to cooperate and consider those around them makes them uniquely able to foster collaborative values and peacefully resolve conflicts in group environments.

People-oriented Results Healthcare Workers Psycholinguistic Cluster Analysis
Agreeableness Results Healthcare Workers Psycholinguistic Cluster Analysis

Emotionally Distant and Self-Reflective Clusters

Our findings reveal that Cluster 2 workers are highly emotionally aware and empathetic, traits they can leverage to relate to and invest in their patients' health journeys. Notably, Cluster 0 appears less emotionally aware and empathetic, suggesting they may take a more psychologically distanced approach to providing treatment. While this may help them remain less affected by tragedies such as the loss of a patient, it could negatively impact patient satisfaction and adherence.

Emotional awareness and empathy are fundamental social skills that foster compassionate and understanding work environments. Additionally, a healthy degree of self-awareness can support personal growth by encouraging one to learn from past experiences and remain conscious of the impact of their decisions (although too much self-focused rumination can lend itself to being overly self-conscious). In the healthcare industry, being self-reflective and considerate of others’ emotions can positively impact bedside manner and the quality of treatment, making patients feel understood and well cared for.

Emotional Expressiveness Results Healthcare Workers Psycholinguistic Cluster Analysis
Self Reflection Results Healthcare Workers Psycholinguistic Cluster Analysis

Motivation Clusters

Motivations include traits, needs, and values that impact how people make decisions and establish their preferences. By understanding what individuals are motivated by, we can determine how to appeal to them and influence action.

In our analysis, Cluster 0 values liberty and stability, indicating these healthcare workers prefer reliability and autonomy. Cluster 1 healthcare workers have a flexible, success-oriented mindset, as the analysis reveals they are achievement-driven, reward-driven, open to change, and more risk-seeking than risk-averse. Cluster 2 is less achievement-driven and open to change but more affiliation-driven and reward-driven, suggesting they may benefit from environments that emphasize teamwork, support, and recognition. Finally, Cluster 3 is highly driven by the need for power and risk-awareness, suggesting they prefer to be in positions of control where they can make decisions after a sufficient risk assessment.

Drives, Needs and Values Results Healthcare Workers Psycholinguistic Cluster Analysis
Motivations Results Healthcare Workers Psycholinguistic Cluster Analysis

Descriptions of each cluster:


Archetype Name

Archetype Description


Independent and Resilient Pragmatist

These healthcare workers take a logical and strategic approach to their responsibilities. They communicate in a formal and structured manner, displaying minimal emotional expressiveness and coming across as less friendly. This suggests they accomplish tasks and work with others using an impersonal approach. Motivated by a desire for liberty and stability, they value autonomy. Their lower susceptibility to stress and anxiety enables them to maintain emotional stability, remaining calm even in challenging situations.


Charismatic and Cheerful Achiever

These healthcare workers are bold, empathetic, cheerful, and social. They exhibit high levels of self-assuredness and ambition and are motivated by a desire for positive outcomes and a willingness to take risks. Their ability to think intuitively and process information rapidly enhances their problem-solving skills, especially in emergencies where quick decisions can save lives. These individuals can balance focusing on goals and people, suggesting they have the capacity to be balanced and effective leaders.


Compassionate and Introspective Connector

These healthcare workers are highly social and are driven by a need to foster relationships. They communicate in a more casual, narrative style. They are self-reflective, genuine, and attuned to their feelings. This allows them to be mindful of their actions and mental state and to learn from their experiences. However, they are also more prone to stress, anxiety, and emotional instability, which increases their risk of burnout. Despite these challenges, their calm demeanor and focus on people and emotions make them invaluable in delivering patient-centered care.


Critical and Risk-Conscious Executor

Driven by power and risk-awareness, these healthcare workers prioritize control and task-focus. They tend to express more negative than positive emotions, coming across as stress- and anxiety-prone. Additionally, they are less cooperative, less affiliation-driven, less agreeable, and less self-disclosing, potentially coming across as intimidating or ruthless despite their reserved demeanor. Their strategic and deliberative approach to decision-making allows them to carefully evaluate decisions to effectively execute tasks. This type of healthcare professional can be particularly effective in moments when critical thinking and informed action are essential.

Implications of Psycholinguistic Clusters for PR & Marketing Insights:

While market segmentation typically relies on demographics (e.g., age, gender, geography, and income level), psycholinguistic-based segmentation enables marketers to take into account nuanced information about who their target consumers are, how they think, and how they are likely to act.

Consider how pharmaceutical marketers can utilize results from our cluster analysis of healthcare workers to determine how to best educate doctors and nurses about new drugs or medical devices. Cluster 1 healthcare workers' open-minded and risk-seeking style suggests they're more likely to be early adopters compared to the more cautious, critical thinkers in Clusters 0 and 3 who may generally avoid being a first mover.

Cluster 2 healthcare workers are less analytical and highly people- and emotion-focused, whereas Clusters 0 and 3 are highly analytical; emotion-evoking and compelling stories are best used to target Cluster 2, while Clusters 0 and 3 should be engaged with information that focuses on facts, figures, and statistics.

Our cluster analysis was based entirely on psycholinguistic insights, but marketers can use a mix of both psychological insights and consumer behavior to gain even more nuanced insights into their audience.



