Chapter 4 Topic Models

The purpose of this chapter is to guide you through some essential techniques for processing and exploring text data. This includes stopword removal, contraction expansion, lemmatizing, word frequency analysis, term frequency-inverse document frequency (TF-IDF), term-document matrices, and topic modelling.

The approach to topic modelling outlined below follows the Meaning Extraction Method (Chung & Pennebaker, 2008) to identify clusters of words which tend to co-occur and are therefore indicative of a topic. Conceptually, the method is akin to extracting latent factors from item scores by examining the correlations among them. Here, the method is applied to word frequencies across texts (instead of item scores across people). This is an established approach in exploratory text analysis in psychological science, having been applied in a wide range of areas including interpersonal communication (Entwistle et al., 2021), discussion of food and diets (Gregson et al., 2023), and climate change debate (Shah et al., 2021). For further discussion, we refer readers to Boyd and Pennebaker (2016) and Markowitz (2021).

Key references are:

  • Chung, C. K., & Pennebaker, J. W. (2008). Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language. Journal of Research in Personality, 42(1), 96–132. https://doi.org/10.1016/j.jrp.2007.04.006
  • Markowitz, D. M. (2021). The meaning extraction method: An approach to evaluate content patterns from large-scale language data. Frontiers in Communication, 6, 588823. https://doi.org/10.3389/fcomm.2021.588823

Below are a number of important R libraries, commands, and functions which will help you process and explore text, as well as fit topic models.

4.1 Pre-Processing

The first step in fitting topic models is to perform a number of basic operations on your text data so that it is processed in a consistent and robust manner.

4.1.1 Case

It can be a good idea to set your text to lower case prior to any analysis. This ensures that text identification, conversions, and mappings are most likely to be robust. Many functions include operators which make them case insensitive, but some do not. It would be unfortunate if, for example, a function you applied correctly identified “let’s go!” but not “Let’s go!”. We can use the tolower() function to easily do this.

# Define a vector with three text strings
text = c("Text analysis in R is interesting.", 
         "We're learning NLP.", 
         "After this workshop, I'll know how to explore text data and fit topic models")

# Convert the texts to lowercase
text_lower <- tolower(text)

# Print the texts to confirm they are now all in lowercase
print(text_lower)
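
To see why lowercasing matters, compare case-sensitive and case-insensitive matching. The short sketch below uses base R’s grepl(), which is case sensitive unless ignore.case = TRUE is supplied.

# Case-sensitive matching misses the capitalized variant (returns TRUE FALSE)
grepl("let's go!", c("let's go!", "Let's go!"))

# Setting ignore.case = TRUE (or lowercasing first) catches both (returns TRUE TRUE)
grepl("let's go!", c("let's go!", "Let's go!"), ignore.case = TRUE)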

4.1.2 Contractions

Contractions should typically be expanded prior to further analysis (e.g., “we’re” → “we are”). We can use the replace_contraction() function from the textclean package to handle this. It contains a large number of ‘hard-coded’ mappings between contractions (e.g., “I’ll”) and expansions (“I will”). It effectively finds all contractions in its vocabulary, in either a single string or vector of strings, and replaces them with the corresponding expansion.

# Load required package
library(textclean)

# Define a vector with three text strings
text = c("text analysis in r is interesting.", 
         "we're learning nlp.", 
         "after this workshop, i'll know how to explore text data and fit topic models")
 
# Expand contractions
text_expanded <- replace_contraction(text)

# Print the texts to confirm all contractions are now expanded
print(text_expanded)

4.1.3 Lemmatizing

Lemmatizing reduces words to their base forms (e.g., “running” → “run”). We can use the lemmatize_words() function from the textstem package to do this. It contains a large number of ‘hard-coded’ mappings between inflected word forms (e.g., “jumps”) and their lemmas (“jump”). It effectively looks up each word in a vector of word tokens and replaces it with the corresponding lemma; for lemmatizing whole sentences, the companion function lemmatize_strings() can be used instead.

# Load required package
library(textstem)

# Define a vector with three words
text <- c("running", "jumps", "doggies")

# Lemmatize words
text_lemmatized <- lemmatize_words(text)

# Confirm words are now lemmatized
print(text_lemmatized)

# Define another vector with four more words
text <- c("girlies", "goods", "better", "joyfully")

# Lemmatize words
text_lemmatized <- lemmatize_words(text)

# Confirm words are now lemmatized. Note how some conversions may not be as expected, so the function must be applied with care
# "girlies" is not lemmatized to "girl" because it is not in the lemma vocabulary
# "goods" is lemmatized to "good", despite "goods" potentially denoting merchandise (and not the plural of good)
# "better" is lemmatized to "good"
# "joyfully" is not lemmatized to "joy" (despite being semantically related)
print(text_lemmatized)

4.1.4 Stopwords

Stopwords are common words (e.g., “the”, “is”) that are typically considered to be devoid of substantive meaning. We can remove them using an established stopword list provided by the tidytext package. You can call this list and examine it by running View(stop_words). We can use the anti_join() function to remove any and all stopwords in our data.

# Load required packages
library(tidyverse)
library(tidytext)

# Examine the stop_words list to understand which words are treated as stopwords and flagged for removal.
# Note that it combines several lexicons (see the lexicon column), which vary in how comprehensive they are
View(stop_words)

# Define a dataframe with words
data <- data.frame(
  word = c("after", "this", "workshop", "i", "will", "know", "how", "to", "explore", "text", "data")
)

# Remove rows with stopwords by using the anti_join() function. 
# This function matches and removes cases in two dataframes. 
# Note, the column names must be identical in both. 
# The function succeeds because both relevant columns are denoted "word".
data_nostopwords <- anti_join(data,
                              stop_words,
                              by = "word")

# Confirm the stop words have been removed
View(data_nostopwords)

It is important to note that some stopwords may be relevant to your analysis. As an example, psychological scientists have expressed an interest in first-person pronoun use as an indicator of self-focus, perhaps being more prevalent in those who are narcissistic (see e.g., Carey et al., 2015). “I” and “me” are typically included in lists of stopwords and so would be removed when applying them, which would preclude any analysis of their use.

Carey, A. L., Brucks, M. S., Küfner, A. C., Holtzman, N. S., Back, M. D., Donnellan, M. B., … & Mehl, M. R. (2015). Narcissism and the use of personal pronouns revisited. Journal of Personality and Social Psychology, 109(3), e1.

If you find yourself in the position of being interested in words which feature in stopword lists (you are encouraged to check stop_words), you have two options. The first is to skip the removal of stopwords entirely. The second is to remove specific words from the stop_words list so that they are retained in your data. This can be achieved as follows.

# Load required package
library(tidytext)

# Define a dataframe with stop words you wish to retain
meaningful_stopwords <- data.frame(
  word = c("i", "me", "myself", "mine")
  )

# Create a new set of stopwords from the established one (stop_words)
# but without the ones you wish to retain, again using the anti_join() function
my_stop_words <- 
  anti_join(stop_words, 
            meaningful_stopwords, 
            by = "word")

# Confirm your new stopword list does not contain the words you wish to retain
# This can be done by scrolling down to confirm that, for example, "i" is no longer in the list
View(my_stop_words)
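
You can then use my_stop_words in place of stop_words when removing stopwords. As a brief sketch, reusing the dataframe of words defined earlier in this section:

# Remove stopwords using the customized list
data_nostopwords_custom <- anti_join(data,
                                     my_stop_words,
                                     by = "word")

# Confirm that "i" is now retained, while other stopwords are still removed
View(data_nostopwords_custom)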

4.2 Unnesting Tokens

An early step you will almost always need to take is to split your data into word tokens. This can be done via the unnest_tokens() function. It identifies all tokens, separated by white space or delimiters (e.g., “.”), in a vector of strings, pulls them out, and places them into individual cells for further analyses. This step enables the many functions which operate on single tokens, as opposed to longer strings.

# Load required packages
library(tidyverse)
library(tidytext)

# Define a dataframe with three text strings
data <- data.frame(
  text = c("Text analysis in R is interesting.", 
           "We are learning NLP.",
           "After this workshop, I will know how to explore text data and fit topic models")
)

# Unnest the tokens from all the strings so that each word is represented 
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Examine the new dataframe where each word is saved on a single row
View(data_tokens)

4.3 Word Frequencies and TF-IDF

Let’s apply all of the previous operations to a dataset to examine relevant word frequencies and TF-IDF scores.

It is recommended that you explore your dataset by examining how often each unique word occurs. This can be done via the functions provided by the tidytext package, and supported by the tidyverse package. It first involves calling the count() function to derive the frequency with which words occur.

TF-IDF helps highlight important words. To remind you, TF-IDF stands for Term Frequency - Inverse Document Frequency. TF captures the frequency with which terms (or words) occur in each document. IDF captures how frequently terms (or words) occur throughout a corpus. Words which occur in many documents are given a low IDF score, whilst those which occur in very few are given a high IDF score. TF-IDF is the product of TF and IDF. The final score therefore captures how frequent and how distinctive words are.
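
Before applying this to a dataset, here is a minimal worked sketch of the arithmetic, using natural logarithms (the convention followed by tidytext’s bind_tf_idf(), which we use below).

# Worked example: a term that occurs 2 times in a 10-word document
# and appears in only 1 of the 5 documents in the corpus
tf     <- 2 / 10      # term frequency within the document (0.20)
idf    <- log(5 / 1)  # inverse document frequency (approx. 1.61)
tf_idf <- tf * idf    # approx. 0.32: frequent in this document, rare elsewhere

tf_idf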

# Load required packages
library(tidyverse)
library(tidytext)
library(textclean)
library(textstem)

# Define a dataframe with five text strings, labelled as documents 1-5
data <- data.frame(
  document = c(1, 2, 3, 4, 5),
  text = c("Text analysis in R is interesting.", 
           "We're learning NLP and text analysis.", 
           "After this workshop, I'll know how to explore text data and fit topic models to text.",
           "This will help me get a sense of what is going in my own data.",
           "And get the best possible grade for my project.")
  )

# Convert the text to lower case
data$text <- tolower(data$text)

# Expand contractions
data$text <- replace_contraction(data$text)

# Unnest tokens
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Lemmatize words
data_tokens$word <- lemmatize_words(data_tokens$word) 

# Remove stop words
data_tokens_nostopwords <- 
  anti_join(data_tokens, 
            stop_words, 
            by = "word")

# Count the number of times each word occurs in each document
word_counts <- data_tokens_nostopwords %>% 
  count(document,
        word, 
        sort = TRUE)

# View the word counts
View(word_counts)

# Add the TF-IDF scores via the bind_tf_idf() function
tf_idf <- word_counts %>% 
  bind_tf_idf(word,
              document, 
              n)

# View the word counts and TF-IDF scores. 
# Notice how "text" has some of the lowest scores becuase it is the most common across documents
# Whilst "grade" has one of the highest because it is the least common.
View(tf_idf)
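
A useful follow-up is to pull out the highest-scoring words in each document, which gives a quick sense of what distinguishes each one. Here is a brief sketch using dplyr’s slice_max():

# Show the three words with the highest TF-IDF score in each document
tf_idf %>% 
  group_by(document) %>% 
  slice_max(tf_idf, n = 3) %>% 
  ungroup() %>% 
  View()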

4.4 Interpretability and Computational Bottlenecks

Dealing with large corpora of text can result in intractably large datasets of term frequencies. Many of these terms may only be used in a single document. Moreover, despite our attempts to clean and lemmatize the data, most approaches to topic modelling will be plagued by large numbers of idiosyncratic terms. This can make the principal components approach computationally heavy and its results difficult to interpret.

Two steps can make this more manageable:

  • Remove terms which do not occur in more than 10% of documents (see Chung & Pennebaker, 2008; Markowitz, 2021)
  • Remove terms which are not among the ~120,000 most common in the English language (assuming you are analyzing English-language texts)

4.4.1 Select terms common to your corpus

Here is a method for selecting only terms which are common to your corpus. It works by counting, for each term, the number of documents in which it appears (equivalent to the number of non-zero values in the corresponding column of a document-term matrix). From this, we can keep all terms which occur in more than a certain number of documents (e.g., 10 or more). The appropriate threshold will depend on the overall size of your corpus.

# Load required package
library(tidyverse)

# Create a vector with words which appear in more than 10% of documents
words_to_keep <- 
  word_counts %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  filter(n > 2000) %>% # set this to 10% of the number of documents in your corpus
  pull(word)

# Keep only the rows of word_counts whose words are in the words_to_keep vector
word_counts_to_keep <- 
  word_counts %>%
  filter(word %in% words_to_keep)
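
The hard-coded value of 2000 above assumes a corpus of roughly 20,000 documents. As a small sketch, the cutoff can instead be derived from the data itself so that it always corresponds to 10% of your corpus:

# Derive the cutoff as 10% of the number of documents in the corpus
threshold <- ceiling(0.10 * n_distinct(word_counts$document))

# The filter step above then becomes filter(n > threshold)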

4.4.2 Selecting words common to the English language

Here is a method for selecting only words common to the English language: remove any term which does not feature in Grady Ward’s English words augmented with proper nouns (U.S. states, countries, Mark Kantrowitz’s names list, and months) and contractions. This list contains 122,806 terms, is provided by the qdapDictionaries package, and can be loaded by calling the GradyAugmented dataset.

# Load required packages
library(tidyverse)
library(qdapDictionaries)

# Load word list
data(GradyAugmented)

# Select only words which appear in Grady Ward's English word list
word_counts_to_keep <- 
  word_counts %>%
  filter(word %in% GradyAugmented)
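
It can be worth checking which terms this step discards, because domain-specific vocabulary (e.g., technical terms, product names, or slang) will not feature in the list. A quick sketch for inspecting them:

# Inspect the words that would be removed because they are not in the Grady list
removed_words <- setdiff(unique(word_counts$word), GradyAugmented)
head(removed_words, 20)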

4.5 Document-Term Matrices (DTM)

We can further explore our data by examining how often each word in our corpus appears in each document (or text). This will allow us to see how words tend to co-occur across documents, and whether there are any systematic correlations between them indicative of latent themes or topics. The first step is to compute a document-term matrix. This can be achieved by applying the cast_dfm() function to a dataframe of word counts, which produces a quanteda document-feature matrix that we then convert to a dataframe.

# Load required packages
library(tidyverse)
library(tidytext)
library(quanteda)

# Define a dataframe with five text strings, labelled as documents 1-5
data <- data.frame(
  document = c(1, 2, 3, 4, 5),
  text = c("Text analysis in R is interesting.", 
           "We're learning NLP and text analysis.", 
           "After this workshop, I'll know how to explore text data and fit topic models to text.",
           "This will help me get a sense of what is going in my own data.",
           "And get the best possible grade for my project.")
  )

# Unnest tokens
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Count the number of times each word occurs in each document
word_counts <- data_tokens %>% 
  count(document, 
        word, 
        sort = TRUE)

# Convert to a document-term matrix via the cast_dfm() function. 
# We then use quanteda's convert() function to turn the output into a data.frame for easier processing
dtm <- word_counts %>%
  cast_dfm(document, 
           word, 
           n) %>%
  convert(to = "data.frame")

# Examine the final document-term matrix
View(dtm)
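
As a brief sketch of the idea motivating the next section, you can inspect the pairwise correlations among term counts directly. Correlations will be unstable with only five toy documents, but on a real corpus this offers a first look at which words tend to co-occur.

# Compute pairwise correlations among term counts across documents
# (any columns with zero variance will produce NA values and a warning)
term_correlations <- dtm %>% 
  select(!doc_id) %>% 
  cor()

View(term_correlations)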

4.6 Principal Component Analysis

Now we can apply a dimension reduction technique to the DTM to examine whether there is systematic variance in the way words tend to co-occur, potentially revealing underlying themes. The fa.parallel() function from the psych package provides scree plots to help judge how substantial each component is. The principal() function conducts the principal component analysis on a DTM.

# Load required packages
library(tidyverse)
library(psych)

# Define an example DTM dataframe with eight numeric word-count variables, and a single id string variable
dtm <- data.frame(
  doc_id = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
  friendly = c(4, 2, 7, 3, 5, 3, 7, 4, 2),
  sociable = c(3, 2, 5, 1, 6, 2, 3, 4, 1),
  outgoing = c(4, 1, 7, 4, 5, 3, 8, 4, 2),
  talkative = c(3, 2, 7, 1, 5, 3, 7, 4, 2),
  diligent = c(1, 6, 7, 4, 2, 5, 5, 4, 9),
  hard_working = c(2, 6, 4, 4, 4, 5, 5, 4, 9),
  responsible = c(2, 4, 7, 6, 2, 6, 5, 4, 9),
  strict = c(1, 6, 7, 6, 4, 5, 5, 4, 6)
)

# Remove the id variable to prepare the DTM dataframe for further analyses. 
# The functions we will call in the next parts require a DTM that contains only the term frequency counts.
# They will not run if the dataframe contains id or document indicators.
dtm_4_pca <- dtm %>% 
  select(!doc_id) # Adding ! before doc_id translates to selecting all columns except for doc_id

# Get the scree plot to get a sense of how many components (themes) there could be in the DTM
# There is a clear sign of two components with large eigenvalues
fa.parallel(dtm_4_pca, 
            main="Scree Plot", 
            fa = "pc",
            sim = F)

# Conduct a principal component analysis whilst extracting two components
pca <- principal(dtm_4_pca,
                 nfactors = 2,
                 residuals = FALSE,
                 rotate = "varimax", 
                 method = "regression")

# Examine the component loadings to gain a sense of which words are most closely associated with each component.
# You can click on the column headings to sort them by size
pca$loadings %>%                 # get the component loadings
  unclass() %>%                  # strip the loadings class so it can be coerced
  as.data.frame() %>%            # convert to a dataframe
  rownames_to_column("word") %>% # add in the word labels to enable interpretation
  View()
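
If you prefer a console summary, the loadings object can also be printed directly with small values suppressed and rows sorted by component, which is often easier to scan; the cutoff and sort arguments belong to the standard print method for loadings objects.

# Print loadings sorted by component, hiding values below |.30|
print(pca$loadings, 
      cutoff = 0.3, 
      sort = TRUE)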

4.7 Topic Model Pipeline

Here is a full pipeline from a raw text dataframe to topic components. It also shows how to create a component table for reporting and how to save component scores for further analyses.

# Load required packages
library(tidyverse)
library(tidytext)
library(textclean)
library(textstem)
library(tm)
library(quanteda)
library(psych)
library(topicmodels)
library(stringi)
library(qdapDictionaries)

# Define the %ni% ("not in") helper function
`%ni%` <- Negate(`%in%`)

# Define a dataframe with five text strings, labelled as documents 1-5
data <- data.frame(
  document = c(1, 2, 3, 4, 5),
  text = c("Text analysis in R is interesting.", 
           "We're learning NLP and text analysis.", 
           "After this workshop, I'll know how to explore text data and fit topic models to text.",
           "This will help me get a sense of what is going in my own data.",
           "And get the best possible grade for my project.")
)

# Convert the texts to lowercase
data$text <- tolower(data$text)

# Expand contractions
data$text <- replace_contraction(data$text)

# Lemmatize the words within each text string
data$text <- lemmatize_strings(data$text)

# Unnest tokens
data_tokens <- data %>% 
  unnest_tokens(input = text,  # Input column from which tokens are unnested
                output = word) # Output column to be created to store unnested tokens

# Count the number of times each word occurs in each document
word_counts <- data_tokens %>% 
  count(document, 
        word, 
        sort = TRUE)

# Remove stopwords
word_counts <- anti_join(word_counts,
                         stop_words,
                         by = "word")

# Get all words which feature in more than a single document and remove those which don't
words_to_keep <- 
  word_counts %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  filter(n > 1) %>%
  pull(word)

word_counts_to_keep <- 
  word_counts %>%
  filter(word %in% words_to_keep)

# Remove words which do not feature in Grady Ward's English word list
word_counts_to_keep <- 
  word_counts_to_keep %>%
  filter(word %in% GradyAugmented)

# Add back in any document which has been fully removed, so that it is still considered with 0 values for all retained words
for(i in unique(data$document)) {
  if(i %ni% word_counts_to_keep$document) {
    temp <- word_counts_to_keep[1, ]
    temp$document <- i
    temp$n <- 0
    word_counts_to_keep <- rbind(word_counts_to_keep, temp)
  }
}

# Convert to a document-term matrix
dtm <- 
  word_counts_to_keep %>%
  cast_dfm(document, 
           word, 
           n) %>%
  convert(to = "data.frame")

# Store the document ids so component scores can later be matched back to the data,
# then remove the doc_id column to prepare for PCA
doc_ids <- dtm$doc_id
dtm <- dtm %>% dplyr::select(!doc_id)

# Check scree plot to get a sense of how many components there are
fa.parallel(dtm, 
            main="Scree Plot", 
            fa = "pc",
            sim = F)

# Extract n components via principal component analysis
n_to_extract <- 2

pca <- principal(dtm, 
                 nfactors = n_to_extract,  
                 residuals = FALSE, 
                 rotate = "varimax", 
                 method = "regression",
                 scores = T)

# Examine word component loadings to interpret the theme captured by each component
pca$loadings %>%
  unclass() %>%
  as.data.frame() %>%
  rownames_to_column("word") %>%
  View()

# Save component scores for further analyses, matching rows back to the original documents by id
data$comp_1 <- pca$scores[match(as.character(data$document), doc_ids), "RC1"]
data$comp_2 <- pca$scores[match(as.character(data$document), doc_ids), "RC2"]

# Create table of component loadings for reporting
component_table <- pca$loadings[, 1:n_to_extract] %>% as.data.frame()

# Suppress loadings smaller than +/- .20 in absolute value and format to two decimal places
for(i in 1:n_to_extract) {
  component_table[, i][abs(component_table[, i]) <= .2] <- NA
  component_table[, i] <- sprintf("%.2f", round(component_table[, i], 2))
  component_table[, i] <- gsub("0\\.", ".", as.character(component_table[, i]))
  component_table[, i][component_table[, i] == "NA"] <- ""
}

# View component table
component_table %>%
  View()