Chapter 3 Fundamental Operations
The purpose of this chapter is to tackle some basic problems of how to read and manage text data. Below are a number of important R libraries, commands, and functions with which you need to be familiar to process text data effectively. Your first task is to familiarize yourself with these. You are encouraged to copy the code and run it on your own machine.
3.1 File Types
This section provides an overview of common file types (.csv, .xlsx, .rds, .json, .txt) and explains how to open them using conventional Windows programs and R.
3.1.1 .csv (Comma-Separated Values)
Description: A plain text file where data is stored in a tabular format, with each row on a new line and columns separated by commas.
How to open in Windows:
- Microsoft Excel: Right-click the file → Open with Excel. Alternatively, open Excel and import the file through the “Data” tab.
- Notepad/Notepad++: Right-click → Open with Notepad or another text editor.
How to open in R:
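A minimal sketch using base R's read.csv() (readr::read_csv() from the tidyverse is a common alternative); the file written first is just a toy example so the snippet is self-contained:

```r
# Create a small example file so the snippet is self-contained
write.csv(data.frame(name = c("Ann", "Ben"), score = c(90, 85)),
          "example.csv", row.names = FALSE)

# Read the .csv file into a data frame
csv_data <- read.csv("example.csv")
print(csv_data)
```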
3.1.2 .xlsx (Excel Spreadsheet)
Description: A Microsoft Excel file format that stores data in worksheets with support for formatting, formulas, and multiple sheets.
How to open in Windows:
- Microsoft Excel: Double-click the file to open in Excel.
- Google Sheets: Upload the file to Google Drive and open it in Google Sheets (web-based).
How to open in R:
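A minimal sketch using the readxl package (one common option; openxlsx is another). The file name your_file.xlsx is a placeholder for a spreadsheet on your machine:

```r
# Install the required package if needed
# install.packages("readxl")
library(readxl)

# Read the first sheet of an Excel file into a data frame
# (replace "your_file.xlsx" with the path to your own file)
if (file.exists("your_file.xlsx")) {
  xlsx_data <- read_excel("your_file.xlsx", sheet = 1)
  head(xlsx_data)
}
```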
3.1.3 .rds (R Data File)
Description: A file format unique to R that stores R objects in a binary format for efficient storage and loading.
How to open in Windows:
- Not Directly Openable: You cannot open
.rds
files in common Windows programs. They are intended for use within R.
How to open in R:
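A minimal sketch: saveRDS() writes any single R object to disk, and readRDS() loads it back unchanged:

```r
# Save an R object to an .rds file
my_object <- data.frame(id = 1:3, value = c("a", "b", "c"))
saveRDS(my_object, "example.rds")

# Read the .rds file back into R
rds_data <- readRDS("example.rds")
print(rds_data)
```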
3.1.4 .json (JavaScript Object Notation)
Description: A lightweight, human-readable format for structuring data, often used in web APIs and configuration files.
How to open in Windows:
- Notepad/Notepad++: Right-click → Open with Notepad or another text editor.
- Web Browser: Drag the file into a browser for a formatted view.
How to open in R:
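A minimal sketch using fromJSON() from the jsonlite package; a small example file is written first so the snippet is self-contained:

```r
# Install the required package if needed
# install.packages("jsonlite")
library(jsonlite)

# Write a small example file so the snippet is self-contained
writeLines('{"name": "John Doe", "age": 30}', "example.json")

# Read the .json file into an R object (here, a named list)
json_data <- fromJSON("example.json")
print(json_data$name)
```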
3.1.5 .txt (Plain Text File)
Description: A basic text file containing unformatted text, often used for simple data storage or documentation.
How to open in Windows:
- Notepad/Notepad++: Double-click to open in Notepad or another text editor.
- Microsoft Word: Open in Word to add formatting or modify the content.
How to open in R:
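A minimal sketch using base R's readLines(), which returns a character vector with one element per line:

```r
# Write a small example file so the snippet is self-contained
writeLines(c("First line of text.", "Second line of text."), "example.txt")

# Read the .txt file into a character vector, one element per line
txt_data <- readLines("example.txt")
print(txt_data)
```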
3.2 Accessing Nested Data
You may find that data formats you encounter will not seamlessly convert into the sorts of R objects you need for further analyses. Often you will find that text data is read into nested lists or dataframes which require further formatting. This is common for .json files.
Let’s emulate such a dataset to see how it operates, and how we can access the nested values within it.
# Install required package
install.packages("tidyverse")
# Load required package
library(tidyverse)
# Create a dataframe which emulates the nested structure of a .json file
data <- tibble(name = "John Doe",
               age = "30",
               address = data.frame(street = "123 Main St",
                                    city = "New York",
                                    state = "NY"),
               blog = data.frame(url = "https://johndoe.com",
                                 description = "John's personal blog about tech and coding."),
               contacts = list(data.frame(type = c("email", "phone"),
                                          value = c("johndoe@example.com", "555-1234"))))
# You will notice that not all variables are fully accessible in this data frame - pay particular attention to the "contacts" variable
View(data)
# Accessing top-level data (name, age)
data$name # Output: "John Doe"
data$age # Output: "30"
# Accessing nested data (address, blog)
data$address$street # Output: "123 Main St"
data$address$city # Output: "New York"
data$address$state # Output: "NY"
data$blog$url # Output: "https://johndoe.com"
data$blog$description # Output: "John's personal blog about tech and coding."
# Accessing further nested arrays (e.g., first contact email)
data$contacts[[1]]$value[1] # Output: "johndoe@example.com"
data$contacts[[1]]$value[2] # Output: "555-1234"
# Pulling out all relevant details and saving them in a new dataframe
dataframe <- data.frame(name = data$name,
                        age = data$age,
                        street_address = data$address$street,
                        city_address = data$address$city,
                        state_address = data$address$state,
                        blog_url = data$blog$url,
                        blog_description = data$blog$description,
                        email = data$contacts[[1]]$value[1],
                        phone = data$contacts[[1]]$value[2])
Explanation
- Emulating JSON: In practice, the fromJSON() function from the jsonlite package reads a .json file into a nested R object; above, we built an equivalent nested structure by hand in the tibble data.
- Accessing Top-Level Data: You can access simple fields (like name and age) directly using the $ operator.
- Accessing Nested Data: For the nested address object, we chain the operator, as in data$address$street.
- Accessing Arrays: For list columns (like contacts), we use [[ ]] indexing to access elements. For example, data$contacts[[1]] gives us the first contact, and we can then extract the type and value, as well as focus in on the first or second element by using [ ].
3.3 Character Encoding
Character encoding is the process of converting characters, like letters and symbols, into a format that computers can understand and store, typically as numbers. This mapping allows computers to represent and manipulate text data, ensuring that characters are displayed correctly across different systems and platforms. Here are some examples you may have heard of:
- ASCII: A basic encoding that represents English characters, numbers, and some symbols. It uses 7 bits per character, allowing for 128 characters.
- UTF-8: A popular encoding that can represent a wide range of characters from different languages, including those outside of the ASCII character set.
- UTF-16: Another Unicode encoding, capable of representing a vast number of characters, using either two or four bytes per character.
Broadly speaking, R tends to use UTF-8. Reading text encoded in other formats will often result in text appearing as garbled characters or question marks. Likewise, many R packages will write special characters as their unicode escape codes - e.g., you may find © written as <U+00A9> in your text file. Similar issues can arise with xml and html codes - e.g., & might come up as &amp;.
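When you know (or can reasonably guess) the source encoding, base R's iconv() can convert text to UTF-8 directly; a minimal sketch, assuming the input is Latin-1:

```r
# A string containing the byte 0xE9, which encodes "é" in Latin-1
latin1_text <- "caf\xe9"

# Convert from Latin-1 to UTF-8 so the character displays correctly
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
print(utf8_text) # Output: "café"
```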
Here are a number of functions which can help to translate garbled, unicode, html, and xml text back to interpretable strings.
# Install the required packages if needed: the functions below use stringi and xml2
# install.packages(c("stringi", "xml2"))
# Load required package (xml2 is called below via the :: operator)
library(stringi)
# Define functions to convert garbled and unicode text back to interpretable strings
unescape_uni <- function(x) {
  x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
  stringi::stri_unescape_unicode(x)
}
unescape_xml <- function(str) {
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
# Define a toy dataset with unicode strings
data <- data.frame(id = c(1, 2),
                   text = c("Just do it!<U+00A9>",
                            "I love chips &amp; curry sauce!"))
# Apply functions to a dataframe of texts
for (i in 1:nrow(data)) {
  data$text[i] <- unescape_uni(data$text[i])
  data$text[i] <- unescape_xml(data$text[i])
  data$text[i] <- unescape_html(data$text[i])
}
# Check converted text
data$text
3.4 Cleaning Data
Many types of text data have characters or strings in them which may be irrelevant or undesirable to the quantitative researcher. Corpora can have metadata and tags which you may wish to remove prior to your analyses (e.g., “Start:” to indicate where a speech begins). Similarly, you may wish to remove certain information from social media texts prior to your analyses, like hashtags or URLs. Many of these operations rely on regular expressions, a formal language for specifying text strings. If you are curious, here is a short video: https://www.youtube.com/watch?v=UbIQxT3bApU
Below are a number of common functions in R which can help you clean your data. Try running them to see how they work.
# Sample string
text <- "Start: Hello world! Check out this link: https://example.com #excited"
# Remove the first 6 characters
text_without_first_6 <- substr(text, 7, nchar(text))
print(text_without_first_6) # Output: " Hello world! Check out this link: https://example.com #excited"
# Remove hashtags
text_no_hashtags <- gsub("#\\w+", "", text_without_first_6)
print(text_no_hashtags) # Output: " Hello world! Check out this link: https://example.com "
# Remove URLs
text_no_hashtags_urls <- gsub("https?://\\S+", "", text_no_hashtags)
print(text_no_hashtags_urls) # Output: " Hello world! Check out this link: "
# Remove leading and trailing spaces
text_cleaned <- trimws(text_no_hashtags_urls)
print(text_cleaned) # Output: "Hello world! Check out this link:"
Explanation
- Removing the first 6 characters: The substr() function extracts part of a string. We pass it the text object to tell it which string to operate on, 7 as the position to start keeping from, and nchar(text), the total number of characters in the string, as the position to stop at. This retains everything from the 7th character to the end, dropping the first 6.
- Removing hashtags: The gsub() function matches and replaces strings. We pass the regular expression #\w+ to match a # symbol followed by any word characters, "" to replace each match with nothing (effectively deleting it), and the text_without_first_6 object to tell it which string to operate on.
- Removing URLs: We use gsub() again, this time with the regular expression https?://\S+ to match URLs that start with http:// or https://. We replace these with "" in the text_no_hashtags object.
- Removing leading and trailing spaces: We use trimws() to remove leading and trailing whitespace.
3.5 Concatenating Strings
An additional step you may have to take in some cases is to concatenate strings of text into larger units. For example, sometimes texts will arrive one word per row, but you may wish to analyze them as full sentences.
Below are two examples using the paste() function to concatenate words.
# Individual words
word1 <- "Data"
word2 <- "Science"
word3 <- "is"
word4 <- "fun!"
# Concatenate words using paste() with a space separator
sentence <- paste(word1, word2, word3, word4)
# Print the sentence
print(sentence)
# Vector of words
words_vector <- c("Data", "Science", "is", "amazing!")
# Concatenate words with spaces using collapse
sentence <- paste(words_vector, collapse = " ")
# Print the sentence
print(sentence)
3.6 Counting Words
In many cases it will be important for you to know how many words each of your texts contains. As it turns out, it is quite difficult to define exactly what constitutes a word. For example, is ? a word? Is never-ending one or two words? If you are curious to know more, this short video may be illuminating: https://www.youtube.com/watch?v=em0ePorcp48
For the purposes of this workshop, we’ll define a word as a string of characters separated from another string by at least 1 space.
Knowing how many words each string contains has many practical benefits, the most obvious being to filter out data which contains little or no content of interest. For example, it is often reasonable to filter out strings which have no words in them.
Here is an example which uses a simple function to count the number of words in a string, relying on the wordcount() function from the ngram library.
# Install the required packages
install.packages("tidyverse")
install.packages("ngram")
# Load required packages
library(tidyverse)
library(ngram)
# Define a text string
text <- "Data Science is fun!"
# Apply the wordcount function to the string
wordcount(text, sep = " ") # sep = " " tells the function to treat a string as a word when it is separated from other strings by at least 1 space
# Define a dataframe with a vector of text strings and an empty vector for wordcounts
data <- data.frame(texts = c("Data Science is fun!", "Learn R Programming.", ":)!", "Analysis with R"),
                   wordcount = NA)
# Apply the wordcount function to each string in the dataframe and save the output to the corresponding vector
data$wordcount[1] <- wordcount(data$texts[1], sep = " ")
data$wordcount[2] <- wordcount(data$texts[2], sep = " ")
data$wordcount[3] <- wordcount(data$texts[3], sep = " ")
data$wordcount[4] <- wordcount(data$texts[4], sep = " ")
# Print the data
print(data)
# Filter out strings which have 2 or fewer words
data_filtered <- data %>%
  filter(wordcount > 2)
# Print the filtered data
print(data_filtered)
Explanation
- After loading the required packages, we define a dataframe with four texts and an empty vector to hold their word counts.
- We apply the wordcount() function to each text in data$texts to obtain the word counts, and assign them to the empty data$wordcount vector.
- We use the pipe operator %>% and the filter() function to retain only cases with more than 2 words, filter(wordcount > 2), which in this case removes the emoticon-only row ":)!".
3.7 Identifying specific instances
For many types of analyses you will want to identify specific types of text data either because they are relevant or irrelevant to your analyses. The most basic way of doing this is via key-word searches.
The grepl() function is used to search for patterns or keywords within a string and returns TRUE or FALSE depending on whether the pattern is found.
# Example text data
data <- data.frame(texts = c("Data Science is fun!", "Learn R Programming.", "Python is great.", "R is powerful."))
# Print the data
print(data)
# Search for the word "python"
result <- grepl("python", data$texts, ignore.case = TRUE)
# Print the result
print(result)
# Select only cases with the word "python"
data_python <- data %>%
  filter(grepl("python", texts, ignore.case = TRUE))
# Print the data only with cases with the word "python"
print(data_python)
Explanation
- The grepl("python", data$texts, ignore.case = TRUE) call checks each string in data$texts to see if it contains the string python.
- Note that we set ignore.case = TRUE in the grepl() function to allow python to match Python. Try setting ignore.case = FALSE and see how this changes the output.
- We can combine grepl() with filter() to remove the rows which do not contain python in data$texts.
3.8 Loops
Now that you have learnt how to read and process individual files and examples, it's time to learn how to automate this process for large amounts of data. Loops are indispensable for this. They are used to repeat a block of code multiple times, which is helpful when you need to perform repetitive tasks, like iterating over many data files or instances of text. The most common type of loop in R is the for loop, which iterates through a sequence of specified values.
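A minimal example of such a loop, printing a message on each pass through the sequence 1:5:

```r
# A basic for loop iterating over the sequence 1:5
for (i in 1:5) {
  print(paste("Iteration number:", i))
}
```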
Explanation
- The for loop will iterate over the sequence 1:5 (numbers 1 through 5).
- For each iteration, it will print “Iteration number:” followed by the current value of i.
Let’s see the value of looping over numerous texts to quickly clean and save them.
# Sample text data with 3 values
text_data <- c(" Hello World! ", " R is great! ", " CLEANING text DATA. ")
# Initialize an empty dataframe with 3 rows to store the cleaned text
cleaned_data <- data.frame(id = 1:3,
                           cleaned_text = NA)
# For loop from 1:3 to clean the text
for (i in 1:3) {
  cleaned_data$cleaned_text[i] <- trimws(text_data[i]) # Remove extra spaces
}
# Print cleaned data
print(cleaned_data)
Explanation
- The for loop iterates over each text in text_data in the sequence 1:3 by substituting the value of i into text_data[i] (e.g., the first iteration pulls " Hello World! " by calling text_data[1]).
- For each iteration, it uses trimws() to remove leading and trailing spaces.
- Each cleaned text is saved in the initialized dataframe cleaned_data by placing it in the cleaned_text vector in the cell indexed by [i] (e.g., the third iteration assigns the cleaned text "CLEANING text DATA." to the third row via cleaned_data$cleaned_text[3]).