Chapter 3 Working with Data Frames
3.1 subset and filter
3.1.1 by variable (columns)
- DF <- filter(DF, VARIABLE == "CONDITION") : keeps the cases (rows) that meet a condition; requires the dplyr package
- DF <- DF[c("VARIABLE1", "VARIABLE2", "…")] : selects variables by name
- DF <- subset(DF, select=c("VARIABLE1", "VARIABLE2", "…")) : is another option to select variables by name; subset() is part of base R, so no extra package is needed
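As a small illustration of the three calls, here is a sketch on a made-up data frame. The column names id, group and score are assumptions chosen only for the example.

```r
# Toy data frame; all names are hypothetical
DF <- data.frame(id = 1:4,
                 group = c("a", "b", "a", "b"),
                 score = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Keep two columns by name (base R)
DF_cols <- DF[c("id", "score")]

# Same selection with subset() (base R)
DF_cols2 <- subset(DF, select = c("id", "score"))

# Keep the rows that meet a condition (dplyr)
DF_rows <- dplyr::filter(DF, group == "a")
```

Calling dplyr::filter() with the package prefix avoids picking up stats::filter() by accident when dplyr is not attached.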
3.1.2 by case (rows)
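- DF <- DF[DF$VARIABLE < 100,] : keeps the cases where VARIABLE is below 100; < 100 is only an example, any other R comparison or logical operator works. Cases where VARIABLE is NA come back as rows filled with NAs; wrapping the condition in which() drops them instead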
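The way missing values behave in a row filter is easy to overlook, so here is a minimal sketch on invented data (the id column and the values are assumptions).

```r
# Toy data with one missing value in VARIABLE
DF <- data.frame(id = 1:4, VARIABLE = c(50, 150, NA, 80))

# A plain logical index keeps the NA case as a row of NAs
DF_raw <- DF[DF$VARIABLE < 100, ]

# which() returns only the positions where the condition is TRUE
DF_clean <- DF[which(DF$VARIABLE < 100), ]
```

Here DF_raw has three rows (one of them all NA), while DF_clean has only the two rows that actually meet the condition.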
3.1.3 by the number of missing values
useful when analyzing sets of items:
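- DF$na_count <- apply(is.na(DF), 1, sum) : creates the variable na_count with the number of NAs in each case (row); rowSums(is.na(DF)) gives the same counts
- DF_noNAs <- DF[DF$na_count < 100,] : creates a new DF_noNAs that keeps only the cases with fewer than 100 NAs (100 is just an example threshold)
- follows code chunk: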
DF$na_count <- apply(is.na(DF), 1, sum) # counts the NAs in each row of the DF
DF <- DF[DF$na_count < 100,] # keeps the rows with fewer than 100 NAs
3.2 merge
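DF_merged <- full_join(DF1, DF2, by = "GROUPING_VARIABLE", suffix = c(".x", ".y")) : merges all cases and variables of DF1 and DF2 (the result needs a valid R name, so DF_merged is used here rather than DF1&2):
- cases present in only one of the data frames are kept, with NA in the cells that would come from the other
- variables other than GROUPING_VARIABLE that have the same name in DF1 and DF2 get the .x (from DF1) and .y (from DF2) suffixes
- the dplyr package is required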
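A short sketch of the merge on two invented data frames may help to see where the NAs and the suffixes appear. The variables id, score and age are assumptions made for the example.

```r
library(dplyr)

# Two toy data frames sharing the key "id" and the variable "score"
DF1 <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
DF2 <- data.frame(id = c(2, 3, 4), score = c(200, 300, 400), age = c(25, 35, 45))

DF_merged <- full_join(DF1, DF2, by = "id", suffix = c(".x", ".y"))

# Resulting columns: id, score.x, score.y, age
# id 1 exists only in DF1, so its score.y and age are NA
# id 4 exists only in DF2, so its score.x is NA
```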
3.3 transpose
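DF <- as.data.frame(t(as.matrix(DF))) : transposes an entire data frame; a matrix holds a single type, so a data frame that mixes numeric and character variables comes back with every value converted to text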
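The following sketch, on made-up data, shows the transposition and the type conversion that happens with mixed columns. Row and column names are assumptions.

```r
# All-numeric data frame with explicit row names
DF <- data.frame(a = c(1, 2), b = c(3, 4), row.names = c("r1", "r2"))
DF_t <- as.data.frame(t(as.matrix(DF)))
# DF_t has rows "a", "b" and columns "r1", "r2"

# Mixed types: as.matrix() converts everything to character first
DF_mixed <- data.frame(a = c(1, 2), b = c("x", "y"))
DF_mixed_t <- as.data.frame(t(as.matrix(DF_mixed)))
# The columns of DF_mixed_t are no longer numeric
```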
3.4 re-name variables
- names(DF)[names(DF) == "OLD_NAME"] <- "NEW_NAME" : for a single variable
- names(DF)[1] <- "NEW_NAME" : where the number corresponds to the column number
for several variables simultaneously:
- DF.labels <- c("OLD_NAME"="NEW_NAME", … ) : creates the labels’ key
- names(DF) <- as.list(DF.labels[match(names(DF), names(DF.labels))]) : applies the labels’ key; every variable in DF must appear in the key, otherwise its name is replaced by NA
- follows code chunk:
DF.labels <- c("OLD_NAME"="NEW_NAME", … ) # creates the labels’ key
names(DF) <- as.list(DF.labels[match(names(DF), names(DF.labels))]) # applies the labels’ key
for several variables simultaneously using names in a data frame (recommended):
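- DF <- DF %>% rename_with(~ DF.names$newname, DF.names$oldname) : with DF.names being the data frame that holds the old and new names in its oldname and newname columns (this data frame can be the study variable dictionary); rename_with() and %>% require the dplyr package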
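To make the dictionary approach concrete, here is a sketch with an invented dictionary that covers only some of the variables. The names q1, q2, q3, age and income are assumptions; the base R variant is an alternative that leaves unlisted variables untouched, and wrapping the old names in all_of() is a suggestion rather than part of the original call.

```r
library(dplyr)

DF <- data.frame(q1 = 1:2, q2 = 3:4, q3 = 5:6)

# Hypothetical variable dictionary: one row per variable to rename
DF.names <- data.frame(oldname = c("q1", "q3"),
                       newname = c("age", "income"),
                       stringsAsFactors = FALSE)

# Base R: rename only the columns found in the dictionary
idx <- match(names(DF), DF.names$oldname)
DF_base <- DF
names(DF_base)[!is.na(idx)] <- DF.names$newname[idx[!is.na(idx)]]

# dplyr: all_of() raises an error if an old name is missing from DF
DF_dplyr <- DF %>% rename_with(~ DF.names$newname, all_of(DF.names$oldname))
```

Both results have the columns age, q2 and income; q2 keeps its name because it is not in the dictionary.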
3.5 re-code variables
3.5.1 categorize a continuous variable
- example with 3 categories
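DF$FACTOR_VAR <- cut(DF$CONTINUOUS_VAR, breaks=c(-Inf, x, y, Inf), labels=c("<=x","x-y",">y")) : with \(x\) and \(y\) referring to the numeric values of the continuous variable used as cut points; by default each interval includes its upper limit, so the three categories are values up to \(x\), values above \(x\) up to \(y\), and values above \(y\)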
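A worked example with concrete cut points shows how the boundary values are assigned. The ages and the cut points 18 and 65 are invented for illustration.

```r
# Toy integer ages, including values that sit exactly on the cut points
ages <- c(10, 18, 19, 40, 65, 66)

# Default right = TRUE: intervals are (-Inf, 18], (18, 65], (65, Inf)
age_group <- cut(ages, breaks = c(-Inf, 18, 65, Inf),
                 labels = c("<=18", "19-65", ">65"))

table(age_group)
```

The values 18 and 65 fall in the lower of their two neighbouring categories. Setting right = FALSE in cut() would instead close the intervals on the left, moving them to the upper category.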
3.5.2 create dummies
- DF <- dummy_cols(DF, select_columns = "VARIABLE") : uses dummy_cols() from the fastDummies package to create the dummies in a DF for a factor VARIABLE
- DF$VARIABLE <- NULL : removes a variable that is no longer needed from the DF, here the original factor
- names(DF)[names(DF) == "VARIABLE_DUMMY1"] <- "VARIABLE" : is useful to rename one of the dummy variables for later use
- follows code chunk:
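DF <- dummy_cols(DF, select_columns = "VARIABLE") # creates the dummy variables
DF$VARIABLE <- NULL # removes the original variable first, so that its name is free
names(DF)[names(DF) == "VARIABLE_DUMMY1"] <- "VARIABLE" # renames one dummy for later use
3.5.3 label factors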
DF$VARIABLE <- factor(DF$VARIABLE, levels=c(1,2,3), labels=c("LABEL_1", "LABEL_2", "LABEL_3")) : turns the variable into a factor with the corresponding levels and labels; values that are not listed in levels become NA
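The mapping from codes to labels can be checked on a small vector. The codes and labels below are made up, and the value 9 is included on purpose to show what happens to a code that is not listed in levels.

```r
# Toy numeric codes; 9 is not one of the declared levels
x <- c(1, 3, 2, 2, 9)

f <- factor(x, levels = c(1, 2, 3), labels = c("low", "medium", "high"))

as.character(f)
# "low" "high" "medium" "medium" NA
```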
3.5.4 replace values
- DF[DF==""] <- NA : replaces empty strings by NA on an entire data frame (same logic applies to any other specific value)
- DF$VARIABLE[DF$VARIABLE==""] <- NA : replaces values on a specific variable
- DF[is.na(DF)] = 0 : replaces NAs by a specific value (in the example, by 0)
- DF$VARIABLE[is.na(DF$VARIABLE)] = 0 : same logic as the previous one but for a specific variable
- replace all missing values by the mean (numeric variables) or by the mode (other variables, where missing means an empty string) using:
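library(dplyr) # provides %>% and mutate(); the rlang package must be installed as well
# getmode() is not defined in this chapter; this is one possible helper (an assumption)
getmode <- function(v) {
  v <- v[!is.na(v) & v != ""] # ignore missing values and empty strings
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))] # most frequent value
}
for (cols in colnames(DF)) {
  if (is.numeric(DF[[cols]])) { # numeric variable: NAs replaced by the mean
    DF <- DF %>% mutate(!!cols := replace(!!rlang::sym(cols),
                                          is.na(!!rlang::sym(cols)),
                                          mean(!!rlang::sym(cols), na.rm = TRUE)))
  } else { # other variables: empty strings replaced by the mode
    DF <- DF %>% mutate(!!cols := replace(!!rlang::sym(cols),
                                          !!rlang::sym(cols) == "",
                                          getmode(!!rlang::sym(cols))))
  }
}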
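For comparison, the same idea can be sketched in base R on a toy data frame. This is an alternative under stated assumptions, not the loop above: empty strings are first turned into NA, and the mode is taken with table(). The columns score and city are invented.

```r
# Toy data: one missing number and one empty string
DF <- data.frame(score = c(1, NA, 3),
                 city = c("Lisbon", "", "Lisbon"),
                 stringsAsFactors = FALSE)

# Step 1: treat empty strings as missing values
DF[DF == ""] <- NA

# Step 2: fill NAs with the mean (numeric) or the most frequent value (other)
for (col in names(DF)) {
  is_missing <- is.na(DF[[col]])
  if (is.numeric(DF[[col]])) {
    DF[[col]][is_missing] <- mean(DF[[col]], na.rm = TRUE)
  } else {
    counts <- table(DF[[col]])  # table() leaves NA out by default
    DF[[col]][is_missing] <- names(counts)[which.max(counts)]
  }
}
```

After the loop, score is 1, 2, 3 and city is "Lisbon" in all three rows. Note that names(counts) is always character, so this sketch suits character variables; factors or other types would need extra care.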