Personality is a paradox. On one hand, people are more or less consistent in the ways they engage with the world. We are easily able to label ourselves and the people we're close with as introverted or honest, for example, and those labels are often accurate (with some traits being more easily assessed by the self or others). On the other hand, personality can be fluid, developing over time and fluctuating in response to aspects of our physical and social environments. Personality flexibility across contexts can make stable traits seem elusive for any given person–even seemingly illusory. For example, introverts occasionally make friends without trying; people who are open to new experiences might refuse to consider a new idea; emotionally unstable people sometimes emerge as leaders in a crisis; and agreeable people may become outraged by a trivial slight. Verbal behavior reflects the paradoxical nature of personality, showing both trait-like stability as well as flexibility as people adapt their language use to the needs of the situation and the communication styles of conversation partners.
Any sample of human behavior is bound to include both random variance or noise as well as signals reflecting psychological states and traits. Not all behavioral samples will be representative of a person’s typical words and actions, and naturally-occurring behavior (e.g., in real-life conversations or social media interactions, as opposed to responses to stimuli in controlled experiments) can be particularly variable. These facts raise a critical and often-overlooked question: When using language to measure personality, how many words and samples are required to generate a stable profile of a person’s personality? For many traits, it may be necessary to sample more than a “thin slice” or first impression of a person’s language to get a complete sense of who they are.
To test how much language is needed before measures of personality converge, Receptiviti conducted a preliminary analysis of naturalistic workplace Slack conversations to assess a person’s baseline language style across multiple social contexts and over time.
Sampling Naturalistic Conversations from Slack
We started by analyzing a dataset of Slack messages from direct-messaging groups including at least three people and public channels that took place over the course of several years within a single work group. Although people sometimes write in internet shorthand when chatting online or using messaging platforms such as Slack, Receptiviti measures include such terms and are designed for use with written or spoken language using formal or informal styles.
Because of the nature of this sample, the analyses below will be most relevant to people who want to understand who a person is on average, across a range of situations, as in a self-study of a small work group or an in-depth analysis of a few clients or customers. In the context of social-personality psychology, this approach of assessing similarities and differences between people at the level of the “whole person” is aligned with personality psychology; in contrast, studying how people respond to some situation or stimulus with little regard for who they are in general, outside of that situation, is more characteristic of social psychology.
From the larger Slack corpus, we created two test datasets: one with short texts (100-200 words per sample) and another with medium-length texts (350-450 words per sample). For short texts, we randomly selected 10 sets of 3, 5, 10, 15, 20, and 25 samples per person; if people wrote more than 100 words in a week, we sampled only their first 200 words. For medium-length texts, we followed a similar strategy, using 350 words as the minimum and extracting the first 450 words for weeks where a person wrote more than the maximum.
The random selections introduce another source of noise, or variance in the data that’s not easily modeled or controlled for. The Slack messages we sampled from covered a wide range of topics and varied in formality. Conversation topics, channels, and other contextual variables (such as time of day, date, and number of people in a channel or group) were not statistically controlled for or modeled in our analyses. Thus, the results and recommendations will likely represent an outer limit of how many samples are needed per person, with fewer or smaller samples required for less noisy data sources.
Evaluating Consistency and Variance in Composite Language Measures
We scored each set of samples using composite measures from Receptiviti’s DISC and LIWC-22 frameworks (specifically D, I, S, and C-types from DISC, and analytical, authenticity, emotional tone, and clout from LIWC-22) and compared the variance in scores for each speaker and sample size (i.e., 10-sample sets of 3, 5, 10, 15, 20, and 25 samples per person). DISC is a personality measure that assesses how people think about and relate to work tasks and colleagues on two dimensions, bold versus calm and people versus task focus; axis scores on those dimensions divide people into the four types, and every text receives a proportional score adding up to 100% for each type. The LIWC scores that we used are validated measures of thinking styles and emotional valence originally established in LIWC-2015.
All of the measures we considered are based on algorithms composed of several elements, so they’re somewhat more psychometrically stable across contexts than single-category measures (like specific topics or types of pronouns), just as survey measures made up of several items are typically more robust than single-item measures.
Within-person consistency and variance in scores across samples was measured using Cronbach’s alpha and root mean squared standard deviation (RMS-SD). Cronbach’s alpha is a weighted average of within-person correlations between texts. RMS-SD is used as a replacement for average standard deviations; standard deviations are measures of variance that don’t conform to a normal distribution and, thus, aren’t very interpretable when averaged. Essentially, both are measures of variability: alpha is a measure of internal consistency, and RMS-SD is a measure of how much scores differ within a person across samples.
Results and Advice
Based on the results for variance and reliability, we recommend that 5 samples per person are needed for both short and medium-length texts for most composite measures (i.e., algorithms based on multiple language measures). Although language measures and other observations of naturalistic behavior are not expected to be as consistent as self-report surveys, reaching Cronbach's alpha of .5 or .6 is a common rule-of-thumb cut-off for considering a linguistic measure to be sufficiently consistent. After sample sizes of 10 or 15, there are diminishing returns, with reliability only increasing slightly from n = 15 to 25 for any of the language measures we analyzed.
The figures above also illustrate that 3 or 5 samples may be all that are needed for some measures that are especially robust across situations. On the other hand, LIWC emotional tone and authenticity are more susceptible to social influences (such as whether, based on the context, it seems appropriate to share emotions or express honest opinions) and thus require more samples before a person's scores converge, in comparison with more stable characteristics such as analytical thinking and clout.
As you can see in the example below, the cut-off point for within-person variance is somewhat subjective. However, variability decreases steeply from sample sizes of 3 to 5 and flattens after 10 samples even for the DISC D-type measure (which, as Figure 2 indicates, is more variable than the other DISC measures). Variances look similar for medium texts, suggesting that the LIWC and DISC measures we used are robust across sample sizes and are suitable for shorter texts.
These estimates for how many samples are needed to accurately estimate a person’s personality are somewhat conservative. That is, they err on the high side with the goal of providing sample size advice that will capture personality for even moderately-to-highly variable people. Beyond considering the stability of specific language measures, some people are naturally more stable or labile than others, and fewer samples will be needed to estimate the personality of those with very fixed speaking or writing styles.
Some Caveats and Qualifications
Our estimates are specifically for naturalistic language from Slack conversations across various contexts. If the writing or speaking context is more constrained (e.g., repeating the same writing task multiple times or looking at similar kinds of questions on an engagement survey), then fewer samples per person will be needed. The above findings are based on analysis using Receptiviti’s DISC and LIWC frameworks, and thus, reflect complex traits made up of facets that vary in observability. The number of samples necessary to get a picture of a person’s baseline on a given measure may be smaller for simpler or more easily detected traits, such as emotional positivity or self-focus. For example, among the Big Five traits, research on “thin slices” of behavior (including first impressions, brief interactions, or short texts) has shown that extraversion and conscientiousness are relatively easy to read in videos of a few seconds and openness and neuroticism require longer acquaintance to perceive accurately or reliably.
Note that the analyses below only explore how many samples of various word counts are needed to get a sense of the whole person, or a person’s set point or baseline on a given composite measure across multiple social contexts and over time. The sample size needed to answer this question is going to be larger than that needed for between-person comparisons of different people in the same context at the same point in time – for example, in a cross-sectional study where the point is to compare responses at the group level, and individuals' personalities are not necessarily of interest.
Recommendations for Any Language Analysis Project
In conclusion, we recommend that anyone analyzing language composite variables based on multiple language categories consider the context and amount of language data they have for individuals when deriving personality insights. When sampling from diverse social contexts using language measures that fluctuate across situations (e.g., assessing emotional tone across social and more task-oriented meetings), several samples will be needed. When sampling from homogeneous contexts using robust measures (e.g., measuring analytical thinking across messages about the same work project), only a few samples or even a single thin slice may suffice.
Our sample size advice applies to self-analyses as well. If you process your language and fall into a range that doesn’t seem right for you on a particular trait – for example, maybe you see yourself as introverted and a linguistic measure scores you as being somewhat extraverted – it is important to consider where that discrepancy might be coming from. Were you uncharacteristically friendly or social in the conversation or email you’re analyzing, perhaps because you were talking with close friends or a small group of trusted coworkers? Was there something unusual going on, like discussion of an upcoming event or happy hour, that could have skewed your language in an outgoing direction? If so, analyze several of your language samples, then see where the scores tend to cluster and whether any of the outliers are readily explainable.
In familiarizing yourself with any language measure, it’s probably best to first test a range of texts that you have a good understanding of (in terms of the people who produced the language and the context) and look at whether the scores change in ways that make sense to you. If a score is surprising, read the text and see whether you can see for yourself what the measure is indicating.
Language Reflects Both Psychological States and Traits
The density and nuance of linguistic data often make it more challenging to study than simpler behaviors, like pressing a button or taking a survey – but these complications are also what make language such a valuable and rewarding behavioral signal to analyze. By studying the psychometrics of language, psychologists and computational linguists are increasingly able to provide advice on how to accurately measure traits that vary in legibility, complexity, and flexibility over time and across contexts. Our continuing research aims to help people leverage the dynamic nature of language to better understand the people producing their language samples.