How to Group an Audience by Personality Traits Using Psycholinguistic Cluster Analysis

Jade Marion, Customer Success @ Receptiviti
Jun 24, 2024
Updated: Nov 7, 2024

Every dataset tells a story, but oftentimes the narrative is buried under layers of complexity or noise. From network analysis to genomic mapping, cluster analysis categorizes data into groups based on similarities, aiding in understanding patterns and making decisions.

In this article, we provide a beginner-friendly overview of cluster analysis approaches, and we showcase the results of a psycholinguistic-based cluster analysis of healthcare industry professionals. We also discuss the implications of the derived insights for application in various use cases, including customer and market segmentation, recruitment, and organizational culture. Although our analysis features data from social media, this same methodology can be applied to other sources of language data, including focus group transcripts, open-ended survey responses, earnings call transcripts, emails, product reviews, and more.

What is Cluster Analysis?

Cluster analysis refers to a family of techniques designed to group data points by maximizing within-group similarity and between-group dissimilarity. While there are many clustering techniques, this article focuses on k-means clustering and on Principal Component Analysis (PCA), a dimensionality reduction technique that is often used alongside it.

Performing k-means clustering or PCA requires preparing a dataset that is sufficient in sample size and well-organized. Typically, such a dataset should have the following structure:

Rows: Each row corresponds to a particular data point or observation, such as a person, team, or group that will be treated as a unique entity during analysis.
Columns: Columns comprise continuous variables that describe features and attributes of each data point. This includes metadata, behavioral metrics, and other relevant information.

Example Data Table:

Unique ID per Individual	Number of Social Media Posts per Year	Age	Purchases per Year
1	75	17	19
2	21	32	4
3	16	26	23
4	3	55	7
5	202	15	16

Below, we provide two overviews of k-means clustering and PCA. The first is a general explanation that uses an anecdote to define the approaches. The second is a technical overview for readers who are more comfortable with statistical analyses.

For Readers who Prefer a Non-Technical Overview of K-means Clustering and PCA:

Imagine you have a big box of LEGOs. You want to sort the LEGOs into separate piles based on shared characteristics. In this simplified analogy, LEGOs = data points, piles = clusters, characteristics = continuous variables that describe features and attributes of each data point, and archetypal LEGOs = centroids.

K-means Clustering with LEGOs

K-means clustering is like having a sorting machine that helps you organize the LEGOs into distinct piles based on common characteristics.

1. Identify the characteristics: Make a list of all the characteristics of the LEGOs that you'd like the sorting machine to consider when determining which LEGOs resemble each other (color, shape, size, LEGO set type, etc.).
2. Decide the Number of Piles: Tell the sorting machine how many piles you want. Let's say you'd like four piles.
3. Pick some archetypal examples: The sorting machine picks four LEGOs randomly from the box. These LEGOs are like guesses about what archetypal examples of the four piles of LEGOs look like.
4. Group the LEGOs: Next, the sorting machine looks at each LEGO in the box and assigns it to one of the four archetypal LEGOs based on which LEGO it resembles most. So, the sorting machine makes groups of blocks around each archetypal LEGO.
5. Change the archetypal LEGOs and move the other LEGOs: The sorting machine then calculates the archetypal LEGO of each pile it created and moves the other LEGOs based on the new archetypal LEGOs.
6. Repeat: Steps 4 and 5 are repeated several times. Each time, the piles get more accurate.
7. Done: Finally, when the pile archetypes don't need to change significantly anymore, the sorting machine stops. Now you have four nicely sorted piles of LEGOs.

Principal Component Analysis (PCA) with LEGOs

PCA identifies which LEGO characteristics are most important to pay attention to when sorting your LEGOs into different groups.

List All Characteristics: First, you make a list of all the characteristics of the LEGOs (color, shape, size, LEGO set type, etc.).
Find the Main Characteristics: PCA reveals that most of the variety in your LEGOs box can be accounted for by a few main characteristics weighted by importance, like just the color and shape with shape being twice as important as color.
Reduce Complexity: Instead of considering all the characteristics that make each LEGO unique while describing the piles, you now know which characteristics (color and shape) are most important to the sorting.

How K-means Clustering and PCA Are Complementary Techniques with LEGOs

K-means clustering sorts your LEGOs into piles. PCA simplifies your understanding of why your LEGO piles are distinct from each other by identifying the most important characteristics on which to differentiate. You can also use PCA as an initial step to reduce the number of features included in your sorting approach while accounting for the most significant variance across LEGOs.

For Readers who Prefer a Technical Overview of K-means Clustering and PCA:

K-means Clustering

K-means clustering is a popular cluster analysis technique that employs a machine learning algorithm to group similar data points based on their distance, typically Euclidean distance, from cluster centroids.

Prior to running a k-means clustering analysis, specify the number of clusters (k) the algorithm should extract from the dataset. The optimal number of clusters can be determined using a variety of empirical techniques, such as the silhouette method (which measures cluster quality by comparing how similar each data point is to its assigned cluster vs. other clusters) or the elbow method (which identifies the number of clusters beyond which adding more clusters yields only marginal reductions in within-cluster variance). When selecting k, it is also recommended to consider the best practices in your field.
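
As a rough illustration of the silhouette method, the sketch below computes a mean silhouette score for two candidate groupings of a small made-up dataset. The function name `silhouette_score` and the toy points are assumptions for illustration only; this is not the code used in our analysis.

```python
from math import dist
from statistics import fmean

def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean distance); needs 2+ clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Group distances from point i to every other point by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:
            scores.append(0.0)  # convention: a single-point cluster scores 0
            continue
        a = fmean(by_cluster[own])  # cohesion: mean distance within own cluster
        # Separation: mean distance to the nearest other cluster.
        b = min(fmean(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return fmean(scores)

# Toy data (made up): two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two groups
poor_labels = [0, 1, 0, 1, 0, 1]  # mixes the groups

good = silhouette_score(points, good_labels)
poor = silhouette_score(points, poor_labels)
print(f"good grouping: {good:.2f}, poor grouping: {poor:.2f}")
```

Scores closer to 1 indicate tighter, better-separated clusters. To choose k, one would typically run the clustering for several candidate values of k and prefer the value with the highest mean score.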

As a second preliminary step, the continuous variables the cluster analysis will be based on must be selected so that the algorithm knows which features to consider when determining similarity (measured through distance). Note: if the continuous variables in your dataset vary in scale, it is recommended to standardize or normalize the data prior to clustering to minimize the possibility of clusters being skewed by feature magnitudes.
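
To make the scaling step concrete, here is a minimal sketch that standardizes the example data table from earlier in this article using z-scores. It assumes NumPy is available; the helper name `standardize` is ours, and z-scoring is only one of several reasonable scaling choices.

```python
import numpy as np

# Example table from this article: posts per year, age, purchases per year.
X = np.array([
    [75, 17, 19],
    [21, 32, 4],
    [16, 26, 23],
    [3, 55, 7],
    [202, 15, 16],
], dtype=float)

def standardize(data):
    """Return z-scores: each column rescaled to mean 0 and standard deviation 1."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (data - mean) / std

Z = standardize(X)
print(Z.round(2))
```

Without this step, the posts-per-year column (which ranges from 3 to 202) would dominate any distance calculation simply because its numbers are larger.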

Once k-means clustering is initiated, the algorithm proceeds as follows:

1. Randomly Select Cluster Centroids: The algorithm starts by randomly selecting centroids (centers) for each of the clusters. These centroids can be chosen either by picking random points in the feature-space or by selecting actual data points from the dataset.
2. Assign Points to the Nearest Centroid: The algorithm assigns each data point to the cluster whose centroid is closest. With Euclidean distance, the differences between a data point and a centroid are squared for each feature, summed, and square-rooted to give the overall distance.
3. Recalculate Centroids: For each cluster, the algorithm calculates a new centroid by finding the average position of all data points assigned to that cluster.
4. Repeat the Process until Convergence: Steps 2 and 3 are repeated until data points no longer change clusters between iterations and the centroids remain stable or until a predetermined number of iterations is reached.
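
The steps above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version under simple assumptions (initial centroids drawn from the data points, Euclidean distance, a hypothetical function name `k_means`); production analyses would more commonly rely on an established library implementation.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means; returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once assignments no longer change between iterations.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (made up): two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can produce different cluster numbering, and on less clearly separated data they can produce different clusters altogether. Running the algorithm several times and keeping the most compact solution is a common safeguard.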

Principal Component Analysis (PCA)

PCA is a data analysis method that simplifies datasets by identifying and extracting their primary defining features. It achieves this by transforming the original variables, or features, into new linear combinations known as principal components. These components are constructed to be orthogonal to (i.e., uncorrelated with) each other, and each successive component captures as much of the remaining variance as possible.

PCA involves the following steps:

1. Standardization or Normalization: Variables included in the PCA should be standardized or normalized to ensure that features with larger scales do not dominate the PCA.
2. Covariance Matrix Computation: PCA computes the covariance matrix of the variables; when the variables have been standardized, this is equivalent to the matrix of correlations between all possible pairs of variables. This matrix represents the strength and direction of the interrelationships among the features you're considering and ultimately determines which component each variable loads onto.
3. Identify the Principal Components: This step involves calculating the eigenvectors (principal components) and eigenvalues (variance explained) of the covariance matrix. The eigenvectors represent the directions of maximum variance in the data, while the eigenvalues indicate the magnitude of variance along these directions. Thus, the eigenvectors are the principal components, and the eigenvalues are used to rank the principal components in order of significance.
4. Create a Feature Vector with a Subset of Principal Components: PCA identifies a subset of the eigenvectors, selecting eigenvectors with the highest eigenvalues. The feature vector is a matrix composed of the eigenvectors that were chosen.
5. Dimensionality Reduction: The feature vector is used to transform the original dataset by multiplying the transpose of the feature vector by the transpose of the standardized dataset (equivalently, multiplying the standardized dataset by the feature vector). In doing so, PCA creates a new lower-dimensional (i.e., fewer variables) representation of the original dataset based on the principal components.
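
For readers who want to see these steps end to end, the following sketch walks through them with NumPy on the example data table from earlier in this article. The function name `pca` and its return values are our own choices for illustration, not a reference implementation.

```python
import numpy as np

def pca(X, n_components):
    """Return (scores, feature_vector, explained_variance_ratio)."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each variable to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized data
    # (with ddof=0 this equals the correlation matrix of X).
    cov = np.cov(Z, rowvar=False, ddof=0)
    # Step 3: eigen-decomposition; eigh is suited to symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # rank by variance, largest first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: feature vector = eigenvectors with the highest eigenvalues.
    W = eigenvectors[:, :n_components]
    # Step 5: project the standardized data onto the chosen components.
    scores = Z @ W
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, W, explained

# Example table from this article: posts per year, age, purchases per year.
X = np.array([
    [75, 17, 19],
    [21, 32, 4],
    [16, 26, 23],
    [3, 55, 7],
    [202, 15, 16],
])
scores, W, explained = pca(X, n_components=2)
print(scores.round(2))
print(explained.round(2))
```

Each column of `W` holds the loadings of one principal component, and `explained` reports the share of total variance each retained component accounts for. Note that the sign of an eigenvector is arbitrary, so a component may come out flipped from one tool or run to another.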

How K-means Clustering and PCA Are Complementary Techniques

PCA makes it easier to visualize data clusters on a scatter plot by transforming the data into a lower-dimensional space while accounting for as much of the variability of the original dataset as possible. PCA can also be used prior to k-means clustering to minimize the number of features considered in the clustering process.

Language Data for Personality Cluster Analysis

Language is an insights-rich data source, but raw text lacks descriptive labels that facilitate easy analysis. While manually coding text to identify relevant elements and patterns is possible, the process is impractical for large datasets and prone to observation bias and error. Thus, automated processes are necessary to effectively and accurately extract linguistic features and produce numerical representations of language data that can serve as the basis for cluster analysis.

Based on decades of research, Receptiviti’s language analytics platform analyzes raw text and generates scores across 200+ psycholinguistic dimensions, providing insight into the personality, motivations, communication style, emotions, thinking style and other aspects of psychology conveyed by writers and speakers. The dimensions can be used as continuous variables (and the scores as features) in personality-based clustering analyses to identify psychologically similar groups and curate impactful personas or archetypes.

To demonstrate the utility of language-based cluster analysis, we surfaced psychological archetypes of healthcare professionals by analyzing their social media posts.

The Dataset

Our dataset included Reddit posts and comments written by doctors and nurses over a 4-year period; text samples were aggregated by author (n = 6,073, average word count per author = 28,397 words, minimum word count per author = 1,000 words). We used Receptiviti API dimensions to assess workers’ psychology.

Receptiviti offers two types of dimensions: dictionary-count and normed measures. Dictionary-count measures (including Receptiviti’s LIWC, LIWC Extension, Cognition, and Emotions frameworks) output raw scores (not normalized) based on the frequency of psychologically relevant categories of words in the analyzed text. Normed measures are algorithms that identify whether language is being used in a way that suggests a particular trait by relying on formulas that incorporate language categories relevant to the psychological phenomenon being measured. Each language category that contributes to a normed measure is independently normed, and the final output of the measure is a weighted average of the normed component scores. For the purpose of this study, normed measures were normalized based on a custom norming table derived from a large sample of language data (including social media posts, blogs, etc.) that is representative of how the general population writes.

Example Receptiviti Analysis Results Data Table Subset:

Unique Author ID	cognition.analytical_thinking	personality.extraversion	interpersonal_circumplex.communal
1	0.5569392	49.250203	56.545565
2	0.46554297	50.821653	49.702837
3	0.71343403	46.853817	54.870431
4	0.48756868	41.024872	50.162960
5	0.58871166	52.811342	50.809209

Analysis

We included all continuous Receptiviti measures as features in the cluster analysis, standardizing the scale of the dimensions using a rank-norming approach. We determined that four was the optimal number of clusters and performed a k-means clustering analysis in Python. Upon completion of the analysis, we appended a column with cluster labels to the original data frame, tagging each healthcare worker in the dataset with the cluster they belong to. The bar chart below outlines the number of healthcare workers per cluster.
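
As a simplified illustration of this workflow, the sketch below rank-norms the example results subset shown earlier, attaches cluster labels, and counts authors per cluster. The `rank_norm` helper reflects one common rank-based transform and is an assumption, not necessarily the exact procedure we used; the cluster labels are hypothetical placeholders for real k-means output.

```python
import numpy as np

# Subset of the example results table above (rows = authors 1-5; columns =
# analytical_thinking, extraversion, communal).
scores = np.array([
    [0.5569392, 49.250203, 56.545565],
    [0.46554297, 50.821653, 49.702837],
    [0.71343403, 46.853817, 54.870431],
    [0.48756868, 41.024872, 50.162960],
    [0.58871166, 52.811342, 50.809209],
])

def rank_norm(data):
    """Replace each value with its within-column rank, scaled to [0, 1]."""
    # Applying argsort twice turns values into 0-based ranks
    # (ties, if any, are broken by row order).
    ranks = data.argsort(axis=0).argsort(axis=0)
    return ranks / (len(data) - 1)

normed = rank_norm(scores)

# Hypothetical cluster labels, standing in for the output of k-means.
labels = np.array([0, 1, 0, 2, 1])
labeled = np.column_stack([normed, labels])  # append labels as a last column
cluster_sizes = np.bincount(labels)          # number of authors per cluster

print(normed)
print(cluster_sizes)
```

After rank-norming, a dimension scored between 0 and 1 and a dimension scored around 50 sit on the same scale, so neither outweighs the other when distances are computed.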

After the k-means cluster analysis, we conducted a PCA to visualize the cluster results; the scatter plot shows the four clusters. While the clusters do not appear to have gaps between them or separate centers of density, they appear as distinct segments across two principal component axes. The table below highlights a subset of the dimensions with the most significant contributions, both positive and negative, to each principal component.

Results

By reviewing the PCA loadings (i.e., the weighted dimensions of the principal components) and conducting a series of ANOVAs to determine statistically significant differences between the language-based psychological profiles of each cluster, we identified qualities that characterize each group of healthcare workers. Below we highlight a subset of the key findings.
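
For readers unfamiliar with ANOVA, the sketch below shows how a one-way ANOVA F statistic compares the scores of several clusters on a single dimension. The function name `one_way_anova_f` and the scores are made up for illustration and are not drawn from our dataset.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score lists."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the cluster means around the overall mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own cluster mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up scores on one dimension for three hypothetical clusters.
scores_by_cluster = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(scores_by_cluster)
print(f_stat)
```

For this toy data the F statistic works out to 13: the cluster means differ far more than the scores within each cluster do. A larger F indicates stronger evidence that at least one cluster differs from the others; statistical libraries such as SciPy (`scipy.stats.f_oneway`) also report the corresponding p-value.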

Thinking Style

Highly analytical, slow (i.e., deliberative) thinkers tend to carefully evaluate ideas during decision-making and communicate solutions in a logical, structured manner. On the other hand, less analytical, fast thinkers tend to make more reflexive decisions based on intuition and communicate in a more narrative, casual tone. While people use both thinking styles depending on the task at hand, it is important to determine which style they rely on most often to understand how they typically process and communicate information.

Healthcare professionals in Clusters 0 and 3 were more analytical and had a slower thinking style compared to those in Clusters 1 and 2. This suggests that Clusters 0 and 3 may be more adept at tasks like diagnosing complex medical conditions and developing detailed treatment plans, while the thinking styles of Clusters 1 and 2 may allow them to make quick life-saving decisions.

Figure: Thinking Style Results, Healthcare Workers Psycholinguistic Cluster Analysis

Figure: Analytical Thinking Results, Healthcare Workers Psycholinguistic Cluster Analysis

Emotional and Psychological Stability

Individuals with neurotic dispositions tend to express more negative emotions and are more vulnerable to stress, mood swings, depression, and insecurities. This is associated with greater psychological vulnerability and an increased risk of burnout.

Healthcare professionals in Clusters 2 and 3 are more neurotic and negative in sentiment than those in Clusters 0 and 1. Since careers in the healthcare industry can be extremely demanding, healthcare professionals in Clusters 2 and 3 may be less resilient or require more interventional support to prevent high rates of employee turnover.

Figure: Emotional Stability Results, Healthcare Workers Psycholinguistic Cluster Analysis

Figure: Positive and Negative Emotion Results, Healthcare Workers Psycholinguistic Cluster Analysis

People-Oriented and Agreeable

People-oriented individuals tend to enjoy social interactions, and those who are also bold and outgoing are often seen as charismatic and influential. In contrast, individuals who are more reserved and prioritize tasks over people typically come across as impersonal, unemotional, and diligent workers.

Agreeable people typically have an altruistic desire to get along with and help others. Their willingness to cooperate and consider those around them makes them uniquely able to foster collaborative values and peacefully resolve conflicts in team and work environments.

Clusters 1 and 2 score highest on measures related to sociability, suggesting that these healthcare workers may be energized by interacting and connecting with their co-workers and patients. Cluster 1 healthcare workers are also the most bold and assertive, suggesting they are more dominant and may prioritize seeking the respect of others. Clusters 0 and 3 are less people-oriented and agreeable; thus, they may prefer independence over collaborative group work. Cluster 3 healthcare workers are also highly task-focused, suggesting they may be effective at methodically executing objectives.

Figure: People-Oriented Results, Healthcare Workers Psycholinguistic Cluster Analysis

Figure: Agreeableness Results, Healthcare Workers Psycholinguistic Cluster Analysis

Emotionally Distant and Self-Reflective

Emotional awareness and empathy are fundamental social skills that foster compassionate and understanding work environments. Additionally, a healthy degree of self-awareness can support personal growth by encouraging one to learn from past experiences and remain conscious of the impact of their decisions (although too much self-focused rumination can lend itself to being overly self-conscious). In the healthcare industry, being self-reflective and considerate of others’ emotions can positively impact bedside manner and the quality of treatment, making patients feel understood and well cared for.

Our findings reveal that Cluster 2 workers are highly emotionally aware and empathetic, traits they can leverage to relate to and invest in their patients' health journeys. Notably, Cluster 0 appears less emotionally aware and empathetic, suggesting they may take a more psychologically distanced approach to providing treatment. While this may help them remain less affected by tragedies such as the loss of a patient, it could negatively impact patient satisfaction and adherence.

Figure: Emotional Expressiveness Results, Healthcare Workers Psycholinguistic Cluster Analysis

Figure: Self-Reflection Results, Healthcare Workers Psycholinguistic Cluster Analysis

Motivations

Motivations include traits, needs, and values that impact how people make decisions and establish their preferences. By understanding what individuals are motivated by, we can determine how to appeal to them and influence action.

In our analysis, Cluster 0 values liberty and stability, indicating these healthcare workers prefer reliability and autonomy. Cluster 1 healthcare workers have a flexible, success-oriented mindset, as the analysis reveals they are achievement-driven, reward-driven, open to change, and more risk-seeking than risk-averse. Cluster 2 is less achievement-driven and open to change but more affiliation-driven and reward-driven, suggesting they may benefit from environments that emphasize teamwork, support, and recognition. Finally, Cluster 3 is highly driven by the need for power and risk-awareness, suggesting they prefer to be in positions of control where they can make decisions after a sufficient risk assessment.

Figure: Drives, Needs and Values Results, Healthcare Workers Psycholinguistic Cluster Analysis

Figure: Motivations Results, Healthcare Workers Psycholinguistic Cluster Analysis

Based on the results of our analysis, the following descriptions summarize the healthcare worker archetypes captured by each cluster.

Cluster	Archetype Name	Archetype Description
0	Independent and Resilient Pragmatist	These healthcare workers take a logical and strategic approach to their responsibilities. They communicate in a formal and structured manner, displaying minimal emotional expressiveness and coming across as less friendly. This suggests they accomplish tasks and work with others using an impersonal approach. Motivated by a desire for liberty and stability, they value autonomy. Their lower susceptibility to stress and anxiety enables them to maintain emotional stability, remaining calm even in challenging situations.
1	Charismatic and Cheerful Achiever	These healthcare workers are bold, empathetic, cheerful, and social. They exhibit high levels of self-assuredness and ambition and are motivated by a desire for positive outcomes and a willingness to take risks. Their ability to think intuitively and process information rapidly enhances their problem-solving skills, especially in emergencies where quick decisions can save lives. These individuals can balance focusing on goals and people, suggesting they have the capacity to be balanced and effective leaders.
2	Compassionate and Introspective Connector	These healthcare workers are highly social and are driven by a need to foster relationships. They communicate in a more casual, narrative style. They are self-reflective, genuine, and attuned to their feelings. This allows them to be mindful of their actions and mental state and to learn from their experiences. However, they are also more prone to stress, anxiety, and emotional instability, which increases their risk of burnout. Despite these challenges, their calm demeanor and focus on people and emotions make them invaluable in delivering patient-centered care.
3	Critical and Risk-Conscious Executor	Driven by power and risk-awareness, these healthcare workers prioritize control and task-focus. They tend to express more negative than positive emotions, coming across as stress- and anxiety-prone. Additionally, they are less cooperative, less affiliation driven, less agreeable, and less self-disclosing, potentially coming across as intimidating or ruthless despite their reserved demeanor. Their strategic and deliberative approach to decision-making allows them to carefully evaluate decisions to effectively execute tasks. This type of healthcare professional can be particularly effective in moments when critical thinking and informed action are essential.

Implications of Cluster Analysis:

For PR & Marketing:

Psychological metrics related to personality, values, attitudes, and beliefs have been shown to provide insight into consumer behavior. However, market and customer segmentation analyses often rely on basic demographic information (e.g., age, gender, geography, and income levels) to characterize segments.

By using psychometric methods such as language-based segmentation analyses, marketing and PR professionals can develop more robust marketing strategies that take into account nuanced information about who their target consumers are and the driving factors behind consumers’ purchasing decisions. The vast availability of language data and the scientific rigor behind Receptiviti’s psycholinguistics analytics platform make psychographic segmentation using language analysis a powerful and well-validated solution.

Consider how pharmaceutical marketers can utilize results from our cluster analysis of healthcare workers to determine how to best educate doctors and nurses about new drugs and medical devices. For example, during an initial product launch, it may be beneficial for marketers to invest in appealing to Cluster 1 healthcare workers. Their open-minded and risk-seeking demeanor suggests they will be more likely to become early adopters of a new, groundbreaking medication compared to the more cautious, critical thinkers in Clusters 0 and 3, who may generally avoid being a first mover. As a second example, a marketing team targeting Cluster 2 should recognize that these healthcare workers are less analytical and highly people- and emotion-focused, whereas Clusters 0 and 3 are highly analytical. This insight can help PR professionals draft effective and targeted marketing materials by exposing Cluster 2 to emotion-evoking and compelling stories and Clusters 0 and 3 to information that focuses on facts, figures, and statistics.

Our cluster analysis was based entirely on psycholinguistic insights, but marketers could run analyses using a mix of psychometrics and metadata related to consumer behavior. Doing so could surface trends in product-fit and brand perception across the various segments, allowing brands to strategically develop product roadmaps based on the needs and interests of their target audience.

For HR & People Analytics:

Industrial-Organizational Psychology and Organizational Behavior research highlight the pervasive impact that psychological phenomena such as drives, personality, and emotions have on the success of leaders, teams, and companies. Many HR and People Analytics teams and software rely on Likert scale engagement surveys to acquire information related to employee psychology and organizational culture (e.g., stress levels, employee satisfaction). However, this approach to data collection is extremely limiting, as it only provides insights based on directly prompted questions and captures what employees feel comfortable divulging rather than the lived reality of their work experiences. Psycholinguistic analysis of language data reveals deeper insights into leadership styles, team dynamics, and the values that persist throughout organizations while minimizing observation and response bias.

Consider how HR and People Analytics professionals in the healthcare industry can take action based on results from our cluster analysis to determine hospital best practices. For example, a hospital with a talent gap in its leadership team may task recruiters with searching for and assessing suitable candidates. Given that Cluster 1 healthcare workers are bold, success-oriented, and able to balance task-focus and people-focus, it may be best for recruiters to identify leadership candidates that most resemble these workers. As another example, it may become evident that the microcultures within certain hospital shifts result in more patient complaints. Analysis may reveal that shifts with poor bedside manner are staffed by teams that lack Cluster 2 healthcare workers (highly emotionally aware and relationship-driven doctors and nurses). To improve the performance of the hospital and the patient experience, HR managers can prioritize re-assigning shifts so that each shift includes at least one worker with a high degree of emotional sensitivity.

The cluster analysis we conducted could be made more nuanced by adding metadata such as job positions, departments, tenure-levels, and performance metrics. In doing so, HR professionals could answer questions such as: What types of team dynamics are most effective for various tasks? What types of leadership styles do our managers use? What kind of coaching do the various categories of entry-level workers require to be most successful? This would support people analysts in developing HR strategies that target and benefit the various types of employees and microcultures within their organizations.

Although our cluster analysis was based on social media language data, Receptiviti can derive psycholinguistic insights from virtually any source of natural language, including spoken and written language. To learn more about how to derive actionable insights from your language data, contact us.