The ways people use words in their daily lives can provide rich information about their beliefs, fears, thinking patterns, social relationships, and personalities. From the time of Freud’s writings about slips of the tongue to the early days of computer-based text analysis, researchers began amassing increasingly compelling evidence that the words we use have tremendous psychological value (Gottschalk & Glaser, 1969; Stone, Dunphy, Smith, & Ogilvie, 1966; Weintraub, 1989).
Although promising, the early computer methods floundered because of the sheer complexity of the task. Extensive samples of text were not digitized, computers were slow and unwieldy, and there was little agreement about which features of natural language were most related to psychological states. Everything changed in the 1990s with the advent of efficient desktop computers, improved data storage technology, and the explosion of the internet. These factors allowed for the easy collection of large stores of books, conversations, and other digitized text samples.
In order to provide an efficient and effective method for studying the various emotional, cognitive, and structural components present in individuals’ verbal and written speech samples, we originally developed a text analysis application called Linguistic Inquiry and Word Count, or LIWC. Our first LIWC application was developed as part of an exploratory study of language and disclosure (Francis, 1993; Pennebaker, 1993). The second (LIWC2001) and third (LIWC2007) versions updated the original application with an expanded dictionary and a more modern software design (Pennebaker, Francis, & Booth, 2001; Pennebaker, Booth, & Francis, 2007).
Our most recent evolution, LIWC2015 (Pennebaker, Booth, Boyd, & Francis, 2015), significantly alters the dictionary and adds an API that incorporates all of the LIWC capabilities and extends them with more than 20 additional personality insights. Importantly, LIWC2015 is a new version rather than a basic update to previous versions of LIWC.
The Receptiviti API relies on an internal LIWC2015 dictionary that defines which words should be counted in the target text dataset. The dictionary is composed of almost 6,400 words, word stems, and select emoticons. Each dictionary entry additionally defines one or more word categories or subdictionaries.
Words contained in texts that are read and analyzed by Receptiviti API are referred to as target words. Words in our dictionary will be referred to as dictionary words. Groups of dictionary words that tap a particular domain (e.g., negative emotion words) are variously referred to as subdictionaries or word categories. For example, the word cried is part of five word categories: sadness, negative emotion, overall affect, verbs, and past focus. Hence, if the word cried is found in the target text, each of these five subdictionary scale scores will be incremented. As in this example, many of the LIWC2015 categories are arranged hierarchically. All sadness words, by definition, belong to the broader “negative emotion” category, as well as the “overall affect words” category. Note too that word stems are also being captured. For example, the dictionary includes the stem hungr* which allows for any target word that matches the first five letters to be counted as an ingestion word (including hungry, hungrier, hungriest). The asterisk, then, denotes the acceptance of all letters, hyphens, or numbers following its appearance.
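The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the LIWC or Receptiviti implementation: the two-entry dictionary, the category abbreviations assigned to each entry, the function names, and the simple word-splitting pattern are all assumptions made for the example.

```python
import re
from collections import Counter

# Toy dictionary for illustration only; the real LIWC2015 dictionary is far
# larger and is not reproduced here.
TOY_DICTIONARY = {
    "cried": ["sad", "negemo", "affect", "verb", "focuspast"],
    "hungr*": ["ingest"],
}


def entry_matches(entry: str, word: str) -> bool:
    """Return True if a target word matches a dictionary word or stem."""
    if entry.endswith("*"):
        # A trailing asterisk accepts any letters, hyphens, or numbers.
        pattern = re.escape(entry[:-1]) + r"[a-z0-9-]*"
        return re.fullmatch(pattern, word) is not None
    return entry == word


def category_percentages(text: str, dictionary: dict) -> dict:
    """Score each category as a percentage of total words in the text."""
    # Assumed tokenization: lowercase runs of letters, digits, ' and -.
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    counts = Counter()
    for word in words:
        matched = set()
        for entry, categories in dictionary.items():
            if entry_matches(entry, word):
                matched.update(categories)
        # Each matched category is incremented once per target word.
        counts.update(matched)
    total = len(words)
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


scores = category_percentages("She cried because she was hungry", TOY_DICTIONARY)
```

In this sketch a single occurrence of cried increments all five of its categories, and hungry is counted under ingestion through the hungr* stem.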
Each of the default LIWC2015 categories is composed of a list of dictionary words that define that scale. Table 1 provides a comprehensive list of the default LIWC2015 dictionary categories, scales, sample scale words, and relevant scale word counts.
TABLE 1. LIWC2015 OUTPUT VARIABLE INFORMATION
|Category||Abbrev||Examples||Words in category||Internal Consistency (Uncorrected α)||Internal Consistency (Corrected α)|
|Summary Language Variables|
|Authentic (evaluates the degree to which a person actively filters what they’re saying for the audience)||Authentic||_||_||_||_|
|Emotional tone (positive/negative)||Tone||_||_||_||_|
|Words > 6 letters||Sixltr||_||_||_||_|
|Total function words||funct||it, to, no, very||491||0.05||0.24|
|Total pronouns||pronoun||I, them, itself||153||0.25||0.67|
|Personal pronouns||ppron||I, them, her||93||0.2||0.61|
|1st pers singular||i||I, me, mine||24||0.41||0.81|
|1st pers plural||we||we, us, our||12||0.43||0.82|
|2nd person||you||you, your, thou||30||0.28||0.7|
|3rd pers singular||shehe||she, her, him||17||0.49||0.85|
|3rd pers plural||they||they, their, they'd||11||0.37||0.78|
|Impersonal pronouns||ipron||it, it's, those||59||0.28||0.71|
|Articles||article||a, an, the||3||0.05||0.23|
|Prepositions||prep||to, with, above||74||0.04||0.18|
|Auxiliary verbs||auxverb||am, will, have||141||0.16||0.54|
|Common Adverbs||adverb||very, really||140||0.43||0.82|
|Conjunctions||conj||and, but, whereas||43||0.14||0.5|
|Negations||negate||no, not, never||62||0.29||0.71|
|Common verbs||verb||eat, come, carry||1000||0.05||0.23|
|Common adjectives||adj||free, happy, long||764||0.04||0.19|
|Comparisons||compare||greater, best, after||317||0.08||0.35|
|Interrogatives||interrog||how, when, what||48||0.18||0.57|
|Quantifiers||quant||few, many, much||77||0.23||0.64|
|Affective processes||affect||happy, cried||1393||0.18||0.57|
|Positive emotion||posemo||love, nice, sweet||620||0.23||0.64|
|Negative emotion||negemo||hurt, ugly, nasty||744||0.17||0.55|
|Anger||anger||hate, kill, annoyed||230||0.16||0.53|
|Sadness||sad||crying, grief, sad||136||0.28||0.7|
|Social processes||social||mate, talk, they||756||0.51||0.86|
|Family||family||daughter, dad, aunt||118||0.55||0.88|
|Female references||female||girl, her, mom||124||0.53||0.87|
|Male references||male||boy, his, dad||116||0.52||0.87|
|Cognitive processes||cogproc||cause, know, ought||797||0.65||0.92|
|Differentiation||differ||hasn't, but, else||81||0.38||0.78|
|Perceptual processes||percept||look, heard, feeling||436||0.17||0.55|
|See||see||view, saw, seen||126||0.46||0.84|
|Biological processes||bio||eat, blood, pain||748||0.29||0.71|
|Body||body||cheek, hands, spit||215||0.52||0.87|
|Health||health||clinic, flu, pill||294||0.09||0.37|
|Sexual||sexual||horny, love, incest||131||0.37||0.78|
|Ingestion||ingest||dish, eat, pizza||184||0.67||0.92|
|Affiliation||affiliation||ally, friend, social||248||0.4||0.8|
|Achievement||achieve||win, success, better||213||0.41||0.81|
|Reward||reward||take, prize, benefit||120||0.27||0.69|
|Past focus||focuspast||ago, did, talked||341||0.23||0.64|
|Present focus||focuspresent||today, is, now||424||0.24||0.66|
|Future focus||focusfuture||may, will, soon||97||0.26||0.68|
|Relativity||relativ||area, bend, exit||974||0.5||0.86|
|Motion||motion||arrive, car, go||325||0.36||0.77|
|Space||space||down, in, thin||360||0.45||0.83|
|Time||time||end, until, season||310||0.39||0.79|
|Work||work||job, majors, xerox||444||0.69||0.93|
|Leisure||leisure||cook, chat, movie||296||0.5||0.86|
|Money||money||audit, cash, owe||226||0.6||0.9|
|Death||death||bury, coffin, kill||74||0.39||0.79|
|Swear words||swear||fuck, damn, shit||131||0.45||0.83|
|Netspeak||netspeak||btw, lol, thx||209||0.42||0.82|
|Assent||assent||agree, OK, yes||36||0.1||0.39|
|Nonfluencies||nonflu||er, hm, umm||19||0.27||0.69|
“Words in category” refers to the number of different dictionary words and stems that make up the variable category. All alphas were computed on a sample of ~181,000 text files from several of our language corpora (see Table 2). Uncorrected internal consistency alphas are based on Cronbach estimates; corrected alphas are based on Spearman-Brown. See the Reliability and Validity section below. Note that the LIWC2015 dictionary generally arranges categories hierarchically. There are some exceptions to the hierarchy rules. For example, Social processes include a large group of words that denote social processes, including all non-first-person-singular personal pronouns as well as verbs that suggest human interaction (talking, sharing); many of these words do not belong to any of the Social processes subcategories. Another example is Relativity, which includes a large number of words that cannot be found in any of its subcategories.
LIWC2015 DICTIONARY DEVELOPMENT
The selection of words defining the LIWC2015 categories involved multiple steps over several years. Originally, the idea was to identify a group of words that tapped basic emotional and cognitive dimensions often studied in social, health, and personality psychology. With time, the domain of word categories expanded considerably.
The most recent version of the dictionary, LIWC2015, is a completely new version compared to earlier ones. Dictionaries can now accommodate numbers, punctuation, and even short phrases. These additions allow the user to read “netspeak” language that is common in Twitter and Facebook posts, as well as SMS (short messaging service, a.k.a. “text messaging”) and SMS-like modes of communication (e.g., Snapchat, instant messaging). For example, “b4” is coded as a preposition and “:)” is coded as a positive emotion word.
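To count entries such as “b4” and “:)”, the text has to be split in a way that keeps digits and emoticons intact. The source does not describe how LIWC2015 tokenizes text, so the following is only a sketch under assumptions: the regular expression, the small emoticon set it recognizes, and the names `TOKEN_PATTERN`, `NETSPEAK_ENTRIES`, and `tokenize` are hypothetical.

```python
import re

# Hypothetical entries; the category codes follow the two examples in the text.
NETSPEAK_ENTRIES = {"b4": ["prep"], ":)": ["posemo"]}

# Assumed pattern: try a few simple emoticons first, then alphanumeric words
# that may contain internal apostrophes or hyphens.
TOKEN_PATTERN = re.compile(r"[:;=][-']?[)(dp]|[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(text: str) -> list:
    """Split text into lowercase tokens, keeping emoticons as single tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Call me b4 lunch :)")
matched = {tok: NETSPEAK_ENTRIES[tok] for tok in tokens if tok in NETSPEAK_ENTRIES}
```

A tokenizer that stripped punctuation or digits before the dictionary lookup would lose both of these entries, which is why the emoticon alternative is tried first.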
A handful of new categories have been added and a small number have been removed. With the advent of more powerful analytic methods and more diverse language samples, we have been able to build more internally consistent language dictionaries. This means that many of the dictionaries in previous LIWC versions may have the same name, but the words making up the dictionaries have been altered (categories subjected to major changes are presented below).
Assessing the reliability and validity of text analysis programs is a tricky business. On the surface, one would think that you could determine the internal reliability of a LIWC scale the same way it is done with a questionnaire. With a questionnaire that taps anger or aggression, for example, participants complete a self-report asking a number of questions about their feelings or behaviors related to anger. Reliability coefficients are computed by correlating people’s responses to the various questions. The more highly they correlate, the reasoning goes, the more the questionnaire items all measure the same thing. Voila! The scale is deemed internally consistent.
A similar strategy can be used with words. But be warned: the psychometrics of natural language use are not as straightforward as with questionnaires. The reason is obvious once you think about it. Once you say something, you generally don’t need to say it again in the same paragraph or essay. The nature of discourse, then, is that we usually say something and then move on to the next topic. Repeating the same idea over and over again is generally bad form in language, yet this is a staple of self-report questionnaire design. It is important, then, to understand that acceptable boundaries for natural language reliability coefficients are lower than those commonly seen elsewhere in psychological tests.
The LIWC Anger scale, for example, is made up of 230 anger-related words and word stems. In theory, the more that people use one type of anger word in a given text, the more they should use other anger words in the same text. To test this idea, we can determine the degree to which people use each of the 230 anger words across a select group of text files and then calculate the intercorrelations of the word use. Indeed, in Table 1, we include these internal reliability statistics, including those of Anger, where the alpha reliabilities range between .53 (corrected) and .16 (uncorrected) depending on how they are computed. In order to calculate these statistics, each dictionary word was measured as a percentage of total words per text. These scores were then entered as an “item” in a standard Cronbach’s alpha calculation, providing raw alpha scores for each word category, separately for each corpus. Uncorrected alphas in Table 1 are averages of each corpus’s alpha score. Importantly, the uncorrected method tends to grossly underestimate reliability in language categories due to the highly variable base rates of word usage within any given category. Corrected alphas were computed using the Spearman-Brown prediction formula (Brown, 1910; Spearman, 1910), and are generally a more accurate approximation of each category’s “true” internal consistency.
Issues of validity are also a bit tricky. We can have people complete a questionnaire that assesses their general moods and then have them write an essay which we then subject to LIWC. We can also have judges evaluate the essay for its emotional content. In other words, we can get self-reported, judged, and LIWC numbers that all reflect a participant’s anger.
One of the first tests of the validity of the LIWC scales was undertaken by Pennebaker and Francis (1996) as part of an experiment in which first-year college students wrote about the experience of coming to college. During the writing phase of the study, 72 Introductory Psychology students met as a group on three consecutive days to write on their assigned topics. Participants in the experimental condition (n = 35) were instructed to write about their deepest thoughts and feelings concerning the experience of coming to college. Those in the control condition (n = 37) were asked to describe any particular object or event of their choosing in an unemotional way. After the writing phase of the study was completed, four judges rated the participants’ essays on various emotional, cognitive, content, and composition dimensions designed to correspond to selected LIWC Dictionary scales. Using LIWC output and judges’ ratings, Pearson correlational analyses were performed to test LIWC’s external validity. The findings suggested that LIWC successfully measures positive and negative emotions, a number of cognitive strategies, several types of thematic content, and various language composition elements. The level of agreement between judges’ ratings and LIWC’s objective word count strategy provides support for LIWC’s external validity.
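The item-based alpha calculation described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: each "item" is one dictionary word's percentage-of-total-words score across texts, the tiny data set is invented, and the function names are hypothetical. The Spearman-Brown function shows only the general prediction formula; the source does not state which length factor or procedure was used to produce the corrected alphas in Table 1.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list with one score per text.

    Here an item is a single dictionary word, scored as a percentage of
    total words in each text.
    """
    k = len(items)
    n_texts = len(items[0])
    # Total scale score per text is the sum of its item scores.
    totals = [sum(item[t] for item in items) for t in range(n_texts)]
    sum_item_variances = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


def spearman_brown(reliability, length_factor):
    """General Spearman-Brown prediction for a scale lengthened by a factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Invented scores: 3 words (items) measured across 4 texts.
toy_items = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
    [1.0, 3.0, 3.0, 5.0],
]
alpha = cronbach_alpha(toy_items)
predicted = spearman_brown(0.5, 2)
```

With real language data most items are zero in most texts, which is one way to see why the uncorrected alphas in Table 1 are so much lower than typical questionnaire values.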
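The comparison between judges' ratings and LIWC output rests on the Pearson correlation coefficient. A minimal sketch follows, assuming one rating and one LIWC category score per essay; the numbers are invented for illustration and are not data from the study, and the function and variable names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented values: a judge's rating per essay and the matching LIWC
# category score (percentage of total words) for the same essays.
judge_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
liwc_scores = [0.5, 1.0, 1.4, 2.1, 2.4]
agreement = pearson_r(judge_ratings, liwc_scores)
```

A coefficient near 1 would indicate that essays rated high by the judge also score high on the corresponding LIWC category.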
Since the first version of LIWC, hundreds of studies have found the LIWC categories to be valid across dozens of psychological domains. As a starting point for exploring this body of literature, we recommend a close reading of Tausczik and Pennebaker (2010).
BASE RATES OF WORD USAGE
In evaluating any text analysis program, it is helpful to get a sense of the degree to which language varies across settings. Since 1986, we have been collecting text samples from a variety of studies, both from our own lab and from dozens of others in the United States, England, Canada, New Zealand, and Australia. For purposes of comparison, texts from several dozen studies have been analyzed using the updated LIWC2015 dictionary. As can be seen in Table 2, these analyses reflect the utterances of over 80,000 writers or speakers totaling over 231 million words. We provide a brief description of each dataset below.
TABLE 2. SUMMARY INFORMATION FOR LIWC2015 STATISTICS
Note: All texts for all corpora required a minimum of 25 words for inclusion in our analyses. All texts with fewer than 25 words were omitted for all statistics reported in this document.
Blogs. This is an expanded version of the corpus described in Schler, Koppel, Argamon, and Pennebaker (2006). All blog posts were merged by individual prior to analysis, reflecting the entirety of each person’s blog.
Expressive writing. This dataset consists of 29 samples from experiments where people were randomly assigned to write either about deeply emotional topics (emotional writing) or about relatively trivial topics such as plans for the day (control writing). Individuals from all walks of life, ranging from college students to psychiatric prisoners to elderly and even elementary-aged individuals, are represented in these studies. Only the emotional writing topics were included in the current analyses.
Novels. This is a sample of novels acquired from Project Gutenberg (http://www.gutenberg.org/) that had been tagged as “literature”. All novels were written in the English language by authors who lived between approximately 1660 and 2008. The number of authors presented in Table 2 reflects only known authors of the works analyzed; works for which the author was unknown were not included in this figure, but were included in analyses.
Natural speech. The speech samples included diverse transcripts from multiple contexts, including people wearing audio recorders over days or weeks, strangers interacting in a waiting room, couples talking about problems, and open-air tape recordings of people in public spaces.
New York Times. A collection of articles published online at the New York Times website (http://www.nytimes.com). Articles were collected from the New York Times internet archives and include various types of work, including editorials, features, U.S. and world news, letters to the editor, and so on. All articles were published between January and July of 2014. Author information was not preserved for this dataset, so the true number of authors is unknown.
Twitter. Individual Twitter posts (i.e., “tweets”) were collected from the public profiles of users whose names were entered into the Analyze Words webpage (http://analyzewords.com). Each user’s tweets were combined into a single unit of observation for analysis.
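Two preprocessing rules recur in the corpus descriptions above: posts by the same person are merged into a single unit of observation, and texts with fewer than 25 words are dropped. A minimal sketch of both steps follows. The function names, the `(author, text)` input format, and counting words by whitespace splitting are assumptions for illustration; the source does not say how words were counted for the threshold.

```python
from collections import defaultdict

MIN_WORDS = 25  # minimum text length stated in the note to Table 2


def merge_by_author(posts):
    """Combine each author's posts into one text per author."""
    grouped = defaultdict(list)
    for author, text in posts:
        grouped[author].append(text)
    return {author: " ".join(texts) for author, texts in grouped.items()}


def drop_short_texts(texts_by_author, min_words=MIN_WORDS):
    """Keep only texts with at least min_words whitespace-separated words."""
    return {
        author: text
        for author, text in texts_by_author.items()
        if len(text.split()) >= min_words
    }


# Invented posts: author "a" has 30 words in total, author "b" has 2.
posts = [("a", "word " * 15), ("a", "word " * 15), ("b", "too short")]
kept = drop_short_texts(merge_by_author(posts))
```

Merging before filtering matters: a user whose individual posts are each under 25 words can still be retained once the posts are combined.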
TABLE 3. RECEPTIVITI OUTPUT VARIABLE INFORMATION
|Cognitive/Thinking Style Insights:|
|Thinking Style||Measures the degree to which the person is an analytical thinker who relies on facts and data or instinct and feelings when making decisions.|
|Persuasive||Measures the degree to which a person is able to create rapport with the intention of persuading others.|
|Reward Bias||Measures the degree to which a person weighs risks vs. rewards when making decisions.|
|Big 5 Insights:|
|Openness||Measures the degree to which a person is open to new ideas and new experiences.|
|Artistic||Measures how much a person appreciates and enjoys the arts.|
|Intellectual||Measures how strongly a person is inclined toward intellectual and academic learning.|
|Liberal||Measures how socially and ideologically liberal a person is.|
|Imaginative||Measures to what degree a person is imaginative.|
|Emotionally Aware||Measures to what degree a person is conscious of and connected with their feelings and emotions.|
|Adventurous||Measures the degree to which a person enjoys and seeks out adventure.|
|Conscientiousness||Measures the degree to which a person is reliable.|
|Self-assured||Measures how much confidence a person has in themselves.|
|Disciplined||Measures a person's propensity to follow routines and rules.|
|Ambitious||Measures the degree to which a person is ambitious or driven by the desire for achievement.|
|Dutiful||Measures a person's sense that they should respect expectations and authority.|
|Cautious||Measures how cautiously a person tends to act.|
|Organized||Measures how organized and orderly a person is.|
|Extraversion||Measures the degree to which a person feels energized and uplifted when interacting with others or engaging in activity.|
|Sociable||Measures how much a person seeks out and enjoys social situations.|
|Friendly||Measures how friendly a person generally is and how positive they are when interacting with others.|
|Assertive||Measures how assertive a person is and how comfortable a person is with expressing their ideas and needs.|
|Energetic||Measures how much energy and enthusiasm a person tends to have.|
|Cheerful||Measures how happy and cheerful a person generally acts.|
|Active||Measures how strongly a person feels the need for activity and engagement in their life.|
|Agreeableness||Measures the degree to which a person is inclined to please others.|
|Generous||Measures how much a person enjoys spending their time and money on others.|
|Trusting||Measures how easily a person trusts others.|
|Cooperative||Measures how well a person takes into account the needs of others.|
|Empathetic||Measures how strongly a person internalizes the feelings of others.|
|Genuine||Measures how genuine and honest a person is.|
|Humble||Measures how humble and modest a person is.|
|Neuroticism||Measures the degree to which a person expresses strong negative emotions.|
|Impulsive||Measures how inclined a person is to act impulsively.|
|Stressed||Measures the degree to which a person is experiencing stress and how strongly affected they are by it.|
|Anxious||Measures the degree to which a person is experiencing anxiety and how strongly affected they are by it.|
|Aggressive||Measures the degree to which a person exhibits anger or aggression.|
|Melancholy||Measures how much a person is expressing sadness.|
|Self-conscious||Measures how likely a person is to feel embarrassed or anxious about themselves or their skills.|
|Social Style Insights:|
|Social Skills||Measures the degree to which a person feels at ease with others and is able to navigate social situations.|
|Insecure||Measures the degree to which a person lacks confidence when dealing with others.|
|Cold||Measures the degree to which a person is emotionally unresponsive and has difficulty empathizing with others.|
|Family Orientation||Measures the degree to which a person's values and behaviors are rooted in their sense of family.|
|Emotional Style Insights:|
|Adjustment||Measures the degree to which a person is grounded, is able to maintain quality relationships with others, and establishes healthy life goals.|
|Happiness||Measures the degree to which a person is optimistic, upbeat, and happy.|
|Depression||Measures the degree to which a person may have difficulty finding joy in their life.|
|Working Style Insights:|
|Independent||Measures the degree to which a person is a non-conformist.|
|Power Driven||Measures the degree to which a person is driven by the desire for power.|
|Type-A||Measures the degree to which a person is driven and competitive.|
|Workhorse||Measures the degree to which a person has a strong work ethic vs. preference for leisure and non-work activity.|
|Interests and Orientations:|
|Friendship Focused||Measures the degree to which a person focuses on friends and friendship, and likely spends time thinking about their social connections.|
|Body Focus||Measures the degree to which a person focuses attention on their body or other people's bodies.|
|Health Oriented||Measures the degree to which a person is focused on health, likely spends time thinking about their own health or the health of others.|
|Sexual Focus||Measures the degree to which a person focuses on sexuality, sex-related themes, concepts and ideas.|
|Food Focus||Measures the degree to which a person focuses thoughts on eating or drinking, and likely enjoys discussing food or drinks with others.|
|Leisure Oriented||Measures the degree to which a person thinks about leisure activities such as sports, entertainment, travel, or organized events.|
|Money Oriented||Measures the degree to which a person thinks about money and finances. May be focused on personal finances, the broader economy or both.|
|Religion Oriented||Measures the degree to which a person focuses on religion, and likely spends time discussing religion, religious themes, ideas and topics.|
|Work Oriented||Measures the degree to which a person is focused on, or preoccupied with work or school.|
|Netspeak||Measures the degree to which a person is comfortable communicating with Internet shorthand and instant messaging slang, abbreviated words, acronyms and special characters.|
As can be seen in Table 4, LIWC2015 captures, on average, over 86 percent of the words people use in writing and speech. Note that, except for total word count, words per sentence, and the four summary variables (Analytic, Clout, Authentic, and Tone), all means in Table 4 are expressed as a percentage of total words used in any given language sample. Simple statistical tests indicate that nearly all language categories differ significantly between contexts.
TABLE 4. LIWC2015 OUTPUT VARIABLE INFORMATION
Notes: Grand Means are the unweighted means of the six genres; Mean SDs refer to the unweighted mean of the standard deviations across the six genre categories.
*In calculating grand means and standard deviations for the words per sentence (WPS) and punctuation categories, the natural speech corpus was excluded due to differing transcription rules across documents.
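The two notes above amount to a small piece of arithmetic: average the genre-level means without weighting by corpus size, average the genre-level standard deviations the same way, and leave out natural speech for words per sentence and punctuation. A minimal sketch follows. The genre keys mirror the six corpora described earlier, but every number is invented for illustration and the function name is hypothetical.

```python
from statistics import mean

# Invented (mean, standard deviation) pairs for one variable per genre.
genre_stats = {
    "blogs": (18.0, 6.0),
    "expressive_writing": (17.0, 5.0),
    "novels": (16.0, 4.0),
    "natural_speech": (9.0, 3.0),
    "nyt": (22.0, 5.0),
    "twitter": (12.0, 7.0),
}


def grand_stats(stats_by_genre, exclude=()):
    """Return (grand mean, mean SD) as unweighted averages across genres."""
    kept = [pair for genre, pair in stats_by_genre.items() if genre not in exclude]
    grand_mean = mean(m for m, _ in kept)
    mean_sd = mean(sd for _, sd in kept)
    return grand_mean, mean_sd


all_genres = grand_stats(genre_stats)
# For words per sentence and punctuation, natural speech is left out.
without_speech = grand_stats(genre_stats, exclude=("natural_speech",))
```

Because the averages are unweighted, a small corpus contributes as much to the grand mean as a large one.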
In many ways, Table 4 points to the important role that context plays in people’s use of language. Not surprisingly, the topics of writing, as reflected in the current concerns category, vary substantially as a function of genre. More striking, however, are the large differences in people’s use of function words as well as punctuation from genre to genre (cf., Biber, 1988).