Chapter 5 Dictionary methods

The purpose of this chapter is to introduce tools for text classification using predefined dictionaries. A serious debt of gratitude is owed to the developers of Linguistic Inquiry and Word Count (LIWC), among others James Pennebaker and Ryan Boyd. LIWC is the culmination of decades of work establishing and validating dictionaries for psychological analyses of text, and it provides user-friendly software for applying them. Further information can be found at https://www.liwc.app/download and https://www.youtube.com/watch?v=IGBI8LnYGNs&ab_channel=JamesPennebaker

LIWC is not free to use and, as such, many will not have the means to access it. For this reason, I think it is valuable to also introduce R packages which can apply dictionary methods to texts. These packages mirror LIWC in many ways, but differ in an important one: they do not come with any predefined dictionaries, including those which come with LIWC. Note that LIWC dictionaries are proprietary and you will need a license key to download and use them. They cannot be shared without violating their terms of service. There are, however, many other dictionaries out there which are not explicitly part of the LIWC ecosystem. There is a decent chance of finding a dictionary which captures a construct of interest without having to rely on LIWC.

A key reference is:

  • Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin, 10, 1-47.

5.1 Quanteda Dictionaries

First, we need to install and load the necessary packages for working with dictionaries: quanteda and quanteda.dictionaries. quanteda.dictionaries is not currently available on CRAN and so must be installed from a GitHub repository. This requires the devtools package, which provides the install_github() function.

# Install devtools and quanteda (if not already installed)
install.packages("quanteda")
install.packages("devtools")

# Load devtools and quanteda
library(devtools)
library(quanteda)

# Install quanteda.dictionaries package from github
devtools::install_github("kbenoit/quanteda.dictionaries") 

# Load quanteda.dictionaries
library(quanteda.dictionaries)

5.2 Importing Dictionaries

Now that we have the required packages installed, it’s time to start working with some predefined dictionaries. These are what make the wheels turn - they provide the roadmap indicating which words to search for and identify in your texts.

5.2.1 Loading

Dictionaries are typically saved as .dic text files. They begin with a key denoting the dimension captured by the dictionary (delineated by %%). Each dictionary word is then listed one-by-one on a new line.
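To make this layout concrete, the sketch below writes a tiny dictionary file by hand and reads it back as plain text. It assumes the common LIWC-style layout, in which a header block enclosed by % lines maps a numeric key to a category name, and each following line pairs a word with that key, separated by a tab. The file name toy_communion.dic and its three entries are invented for illustration and are not taken from any published dictionary.

```r
# Hypothetical toy dictionary, for illustration only (assumed LIWC-style layout)
dic_lines <- c(
  "%",              # opens the header block
  "1\tCommunion",   # numeric key and the category it stands for
  "%",              # closes the header block
  "love\t1",        # one word per line, followed by its category key
  "friend*\t1",
  "care*\t1"
)

# Write the lines to a temporary file so the working folder is left untouched
toy_path <- file.path(tempdir(), "toy_communion.dic")
writeLines(dic_lines, toy_path)

# Read the file back as plain text to inspect its structure
readLines(toy_path)
```

Opening one of the real .dic files in a text editor and comparing it against this pattern is a quick way to check that a file is laid out as expected before loading it.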

A number of non-LIWC dictionary files can be accessed here: https://drive.google.com/drive/folders/1oDkTuzzgcsnt87JpblL8vtidmFo11PMl?usp=sharing

These include .dic files with dictionaries capturing:

  • first-person singular pronouns includes words denoting the self (e.g., me, my)
  • first-person plural pronouns includes words denoting collectives one is a part of (e.g., we, us)
  • prosocial includes words capturing cooperation and helping (e.g., charity, donation)
  • care includes words capturing the moral foundation of care (e.g., help, harm)
  • fairness includes words capturing the moral foundation of fairness (e.g., equity, reciprocity)
  • threat includes words capturing psychological danger and threat (e.g., afraid, risk)
  • moral-emotional includes words denoting emotional and moral terms (e.g., abandon, kill)
  • communion includes words capturing interpersonal warmth (e.g., love, friend)
  • agency includes words capturing goal-striving (e.g., want, achieve)

To read these into R, you can use the read_dict_liwc() function as follows (make sure the .dic files are in your working directory). The function is called with ::: because it is internal to quanteda rather than exported.

# Load the communion.dic file into R
communion_dictionary <- quanteda:::read_dict_liwc("communion.dic")

# Load the threat.dic file into R
threat_dictionary <- quanteda:::read_dict_liwc("threat.dic")

# Load the moral-emotional.dic file into R
moral_emotional_dictionary <- quanteda:::read_dict_liwc("moral-emotional.dic")

5.2.2 Validating

It is typically a good idea to get a sense of the words in your dictionary. You can do this by simply opening the .dic files in a plain-text editor such as Notepad. You will also want to confirm that they have been read into R correctly.

You can examine your dictionaries in R via the following commands:

# Examine the communion_dictionary
communion_dictionary$Communion

# Examine the threat dictionary
threat_dictionary$Threat

# Examine the moral-emotional dictionary
moral_emotional_dictionary$Moral_Emotional

Notice how some of the words in the moral-emotional dictionary are followed by *. This operator is used as a wildcard character to match any words that start with a given prefix. This allows for flexible pattern matching in dictionaries, enabling you to capture related terms that share the same root or beginning part of the word.

When using * in a dictionary file, it will match any word that begins with the specified prefix. For example, if you have an entry like abandon* in a dictionary, it will match words such as:

  • abandon
  • abandoned
  • abandonment
  • abandons
  • abandoning
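The prefix-matching behaviour can be imitated in base R, as in the minimal sketch below. It is not how quanteda performs the matching internally; it only illustrates the rule, using utils::glob2rx() to turn the wildcard entry into a regular expression. The vector of test words is made up for the example.

```r
# Made-up candidate words to test against the dictionary entry
words <- c("abandon", "abandoned", "abandonment", "band", "unabandoned")

# Convert the glob-style entry into a regular expression anchored at the start
pattern <- utils::glob2rx("abandon*")

# TRUE only where a word begins with the prefix "abandon"
matches <- grepl(pattern, words)

# Keep the words that the entry would capture
words[matches]
```

Note that "unabandoned" is not captured, because the wildcard only extends the end of the entry; the word must still start with the prefix.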

5.2.3 Binding

The individual word lists need to be explicitly converted into an object of the dictionary class. It can also be convenient to bind them together into a single object, which can then be applied to derive all dictionary scores at once.

This can be achieved in the following manner:

# Define a dictionary object with the three word lists, communion, threat, and moral-emotional
dictionaries <- 
  quanteda::dictionary(
    list(
      communion = communion_dictionary$Communion,
      threat = threat_dictionary$Threat,
      moral_emotional = moral_emotional_dictionary$Moral_Emotional
    )
  )

# View the dictionaries
View(dictionaries)

5.3 Extracting Dictionary Scores

We are now ready to apply our dictionaries to a set of texts to classify them in terms of the appearance of keywords. This can be done by calling the liwcalike() function - named in acknowledgment of the LIWC software which inspired the package.

This can be achieved in the following manner:

# Define an example vector of four texts
texts <- c("The company had terrible financial performance.",
           "We felt entirely abandoned by upper management.",
           "It was really a horrible situation with no compassion.",
           "I don't know what I'm going to do, my whole livelihood is under attack.")

# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(texts, 
                               dictionaries)

Accessing your newly created dataframe dictionary_scores reveals the scores provided by liwcalike(). These are modeled very closely on LIWC’s outputs.

# View the outputs of the dictionary analysis
View(dictionary_scores)

You will immediately notice that several new scores (columns) have been added for each text (rows). Many of these provide general information about the texts, for example:

  • WC indicates the total number of words in each text
  • Dic indicates the percent of words in each text which are in any of the applied dictionaries
  • AllPunc indicates the percent of the text which is made up of punctuation
  • Comma, Punc, etc. indicate the percent of the text which is made up of each type of punctuation

You will also notice columns corresponding to the applied dictionaries - in this example communion, threat, and moral_emotional. Scores in these columns indicate the percent of words in each text which are present in each dictionary.
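To show what such a percentage means, the sketch below computes one by hand in base R. It is only an approximation of the idea: liwcalike() uses its own tokenization and counting rules, so its figures may differ from this calculation. The two-entry word list toy_entries is hypothetical and is not one of the dictionaries loaded above.

```r
# A single example text and a hypothetical two-entry word list
text <- "It was really a horrible situation with no compassion."
toy_entries <- c("horribl*", "compassion*")

# Crude tokenization: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]

# Turn the wildcard entries into one regular expression and flag matching tokens
pattern <- paste(utils::glob2rx(toy_entries), collapse = "|")
hits <- grepl(pattern, tokens)

# Percent of tokens that appear in the word list
score <- 100 * sum(hits) / length(tokens)
score
```

Here two of the nine tokens match, so the score is roughly 22 percent. Short texts therefore produce coarse scores, since a single matching word shifts the percentage substantially.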

# Access the communion scores
dictionary_scores$communion

# Access the threat scores
dictionary_scores$threat

# Access the moral-emotional scores
dictionary_scores$moral_emotional

5.4 Dictionary Word Counts

For some analyses, you may wish to examine the frequency with which dictionary words occur (as opposed to the proportion of the total words they make up). This count is not directly provided by liwcalike(), but can easily be derived from the relevant dictionary score and the total number of words in each document, provided in the variable WC.

# Compute the frequency of communion words in each text
dictionary_scores$communion_count <- round((dictionary_scores$communion / 100) * dictionary_scores$WC)
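As a worked check of this arithmetic, the sketch below applies the same conversion to a small stand-in data frame. The object toy_scores and its values are hypothetical; they only mimic the WC and communion columns rather than reproduce real liwcalike() output.

```r
# Hypothetical scores for two documents (a stand-in for liwcalike() output)
toy_scores <- data.frame(WC = c(9, 7),
                         communion = c(11.11, 0))

# Percent -> proportion -> count; rounding absorbs the rounding in the percentages
toy_scores$communion_count <- round(toy_scores$communion / 100 * toy_scores$WC)

# Inspect the derived counts
toy_scores$communion_count
```

The first document has 11.11 percent of 9 words, which is 0.9999 and rounds to one communion word. The rounding step matters because the percentages are themselves stored to a limited number of decimal places.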

5.5 Tidying Data

As you may have noticed, the liwcalike() function creates an entirely new dataframe when calculating word frequencies. Oftentimes you will apply this function to a dataset which includes your texts and many other indices, such as who produced them and when. If you wish to analyze dictionary scores as a function of such indices, you will need to incorporate the dictionary scores provided by liwcalike() back into your original dataset.

This can be done as follows:

# Define an example dataframe with four texts from two groups recorded on certain dates
text_data <- data.frame(author = c("employee 1", "employee 2", "employee 3", "employee 4"),
                        group = c("HR", "HR", "Sales", "Sales"),
                        date = c("01/04/2022", "02/04/2022", "03/04/2022", "04/04/2022"),
                        text = c("The company had terrible financial performance.",
                                 "We felt entirely abandoned by upper management.",
                                 "It was really a horrible situation with no compassion.",
                                 "I don't know what I'm going to do, my whole livelihood is under attack."))


# Apply the liwcalike() function to calculate dictionary scores
dictionary_scores <- liwcalike(text_data$text, 
                               dictionaries)

# Attach the communion, threat, and moral_emotional dictionary scores to the original dataset
text_data$communion <- dictionary_scores$communion
text_data$threat <- dictionary_scores$threat
text_data$moral_emotional <- dictionary_scores$moral_emotional

# View the original dataset, now with dictionary scores
View(text_data)
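Attaching columns one at a time relies on the scores being in the same row order as the original texts. The sketch below shows one cautious variant of the same idea: checking that the row counts agree, then binding several score columns in a single step. Both toy_data and toy_scores are hypothetical stand-ins, so the example runs without any dictionary files.

```r
# Hypothetical stand-ins for the original dataset and the liwcalike() output
toy_data <- data.frame(author = c("employee 1", "employee 2"),
                       text = c("first text", "second text"))
toy_scores <- data.frame(WC = c(2, 2),
                         communion = c(0, 50))

# Stop early if the two objects do not describe the same number of documents
stopifnot(nrow(toy_data) == nrow(toy_scores))

# Bind the score columns of interest alongside the original columns
combined <- cbind(toy_data, toy_scores[, c("WC", "communion")])
combined
```

This assumes that neither object has been sorted or filtered between scoring and binding. If rows may have been reordered, matching on a shared document identifier is the safer route.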