Chapter 4 R programming
R is a block-structured language and blocks are delineated by braces, though braces are optional if the block consists of just a single statement. Variable scope is the key to understand R programming.
4.1 Basic R Mathematical and Boolean Operations
Here is a table of fundamental mathematical and boolean operators in R:
Operation | Symbol_or_Function | Example | Output |
---|---|---|---|
Addition | + | 5 + 3 | 8 |
Subtraction | - | 5 - 3 | 2 |
Multiplication | * | 5 * 3 | 15 |
Division | / | 5 / 3 | 1.6667 |
Exponentiation | ^ or ** | 5^3 or 5**3 | 125 |
Modulo (remainder) | %% | 5 %% 3 | 2 |
Integer Division | %/% | 5 %/% 3 | 1 |
Summation | sum() | sum(c(1, 2, 3)) | 6 |
Product of Elements | prod() | prod(c(1, 2, 3)) | 6 |
Square Root | sqrt() | sqrt(16) | 4 |
Natural Logarithm | log() | log(10) | 2.3026 |
Logarithm (base 10) | log10() | log10(1000) | 3 |
Exponential | exp() | exp(1) | 2.7183 |
Absolute Value | abs() | abs(-5) | 5 |
Rounding | round() | round(3.14159, 2) | 3.14 |
Floor (round down) | floor() | floor(3.7) | 3 |
Ceiling (round up) | ceiling() | ceiling(3.2) | 4 |
Logical AND | & or && | TRUE & FALSE | FALSE |
Logical OR | | or || | TRUE | FALSE | TRUE |
Logical NOT | ! | !TRUE | FALSE |
Equal to | == | 5 == 3 | FALSE |
Not equal to | != | 5 != 3 | TRUE |
Greater than | > | 5 > 3 | TRUE |
Less than | < | 5 < 3 | FALSE |
Greater than or equal to | >= | 5 >= 3 | TRUE |
Less than or equal to | <= | 5 <= 3 | FALSE |
4.2 Control Statements
Control statements in R allow you to control the flow of execution of a program based on conditions and loops. These statements help in decision-making and iterative execution of code. There are three types:
- Conditional Statements (Decision-making)
- if
- if … else
- ifelse()
- switch()
- Looping Statements (Iteration)
- for
- while
- repeat
- Jump Statements (Flow Control)
- break
- next
- return
Lets see example codes.
4.2.1 Conditional Statements
### if (executes the block of code only if the condition is true)
x <- 10
if (x > 5) {
print("x is greater than 5")
}
### if...else
x <- 3
if (x > 5) {
print("x is greater than 5")
} else {
print("x is not greater than 5")
}
### ifelse() (vectorized form of if ... else)
x <- c(2, 7, 5, 8, 1)
result <- ifelse(x > 5, "High", "Low")
print(result) # Output: "Low" "High" "Low" "High" "Low"
### Switch() (multiple conditions checked)
choice <- "b"
result <- switch(choice,
"a" = "You selected A",
"b" = "You selected B",
"c" = "You selected C",
"Invalid choice")
print(result) # Output: "You selected B"
DIY: Implement a function that allows users to choose between different hypothesis tests, t.test(), wilcox.test(), chisq.test().
4.2.2 Looping Statements
### For
x <- 1:10
x*2
for (i in x) cat(i*2,sep='\n') # less conversation than print function
for (i in x) print(i*2)
### while
i <- 1
while (i<=10) i <- i+4 # loops until i=9
i
### repeat (no boolean exit condition so break, the jump statement, is used)
i <- 1
repeat{
i <- i+4
if(i>10) break
}
i
4.2.3 Jump Statement
### break
for (i in 1:10) {
if (i == 5) {
break
}
print(i)
}
### next
# Loop through numbers from 1 to 10
for (i in 1:10) {
# If the number is even, skip this iteration
if (i %% 2 == 0) {
next
}
print(i) # Print the odd numbers only
}
4.2.4 Looping Over Nonvector Sets
R does not directly support interation over nonvector sets, but there are a couple of indirect ways to accomplish it. The get() function takes as an argument a character string representing the name of some object and returns the object of that name. get() is very useful. Let’s look at the example of performing linear regression for two matrices D1 and D2.
set.seed(77177)
### Make simulated datasets for regression
n1 <- 100 #sample size of D1
x <- rnorm(n1)
y <- 1.2*x + rnorm(n1) # Alternative
D1 = data.frame(x=x,y=y)
n2 <- 150
x <- sample(x,n2,replace=T)
y <- rnorm(n2) # Null
D2 = data.frame(x=x,y=y)
### Try
get('D1')
get('D2')
### Perform regressions
out = vector('list',length=2)
names(out) = c('D1','D2')
for (dname in c('D1','D2')) {
DD = get(dname) # getting the data
out[[dname]] = lm(y~x,data=DD)
}
out
summary(out$D1)
summary(out$D2)
However I would do the following…if you’re comfortable with making functions. You will find that the above is useful in other situations.
DD = list(D1=D1,D2=D2)
out = lapply(DD,function(DDD) lm(y~x,data=DDD) )
out
summary(out$D1)
summary(out$D2)
4.2.5 Example: Regression with Permutation test
We’d like to test if the sepal length of iris is different according to species.
attach(iris)
# Check the distribution of sepal length
hist(Sepal.Length) # looking non-normal
# Check distributions by species
boxplot(Sepal.Length~Species) # looking very different
# Let's regress Sepal.Length on Species
fit <- lm(Sepal.Length~Species)
fit
summary(fit)
Do you think the regression is valid? The estimates are valid. It is actually does not use the distributional assumption of the response variable. However the inference procedure through p-value is not valid. Because it is depending on the normality assumption of the response variable.
Let’s think about permutation test that does not use the theoretical distribution (normal in this case). It actually simulate data under the nulll then generate emperical null distribution of the test statistic. Then compare the null distribution with and the observed estimate from the observed data to compute the tail probability (permutation p-value). Because it does not use theoretical null distribution, it is one way to do the non-parametric test.
set.seed(101) # do this to reproduce results when randomness is involved
n <- nrow(iris)
B <- 10000 # number of permutation samples
out <- matrix(nrow=B,ncol=2) # store B number of beta1 and beta2
for (b in 1:B) {
w = sample(1:n,n) # permute index
perm.fit <- lm(Sepal.Length[w]~Species) # destroy the relation for the null situation
out[b,1] = coef(perm.fit)[2]
out[b,2] = coef(perm.fit)[3]
}
hist(out[,1])
abline(v=coef(fit)[2],col='red')
hist(out[,2])
abline(v=coef(fit)[3],col='red')
mean(abs(out[,1])>coef(fit)[2])
mean(abs(out[,2])>coef(fit)[3])
DIY: Interpret the results. Does this answer our question? If not, implement the permutation procedure to answer our question. Also, implement a function that allows users to choose a hypothesis among \(\beta_1=0\), \(\beta_2=0\) and \((\beta_1,\beta_2)^T=\boldsymbol{0}\), and perfom the permutation test.
4.3 Making Custom Functions
4.3.1 Overview
R functions are objects like others.
g <- function(x){
return(x+1)
}
class(g) # function class
formals(g) # extract argument
body(g) # extract function body
g # you can see the object just like anything else
Review well-written functions to improve your programming skills and gain a deeper understanding of how methods work.
abline
page(abline) # to look at it in script file
sum # written in C, the code not viewable
However, most fundamental built-in functions are written directly in C, and thus they’re not viewable.
The following exampes are to show functions are objects.
f1 <- function(x,y) return(x+y)
f2 <- function(x,y) return(x-y)
f <- f1
f(3,2)
f<- f2
f(3,2)
# you can update the body later
g<- function(x) x^2
body(g)
body(g) <- quote(x^2+2)
g
g(8)
# maybe you can do it like this
g <- function(h,x,y) h(x,y)
body(g) <- quote(x^2+2)
g
g(x=8) # h and y were not evaluated
For clarity, a call to return() is included. However it is often omitted.
You may have noticed the … argument in R help files. It allows a function to accept additional arguments that are not explicitly defined. This is why some functions continue to run even when you include incorrect arguments. Many functions support extra arguments by using …. Let’s look at some examples.
### Passing Additional Arguments
f <- function(a,b,...) {
cat('a:', a, '\n')
cat('b:', b, '\n')
cat('additional arguments:', ...,'\n')
}
f(1,2,x=10,y=3)
f(1,2,3,4)
### Forwarding Arguments to Other Functions
my_plot <- function(x,y,...) {
plot(x,y,...)
}
my_plot(1:10,21:30,col='red',pch=10)
In R, a closure is a function along with its environment. Closures allow functions to ‘remember’ the environment in which they were created, even when they’re used outside of that environment. This is a fundamental concept in lexical scoping. A function consists of three components:
- Arguments
- Body
- Environment
4.3.2 Lexical Scoping
Lexical Scoping means that the scope of a variable is determined based on where it is defined in the code. In R, when a function for a variable, it first searches within its own environment. If the variable is not found, it continues searching in the outer(parent) environments.
Search Order: When a function in R is executed and encountered a variable, it looks for that variable in the following order:
- Inside the function itself (local scope)
- In the environment where the function was defined (enclosing environment)
- In its parent’s environment, continuing up to the global environment.
- If not found, it searches in the base environment
- If still not found, an error is thrown. .
Let’s an example of Lexical Scoping
outer_function <- function(x) {
y <- 10 # y is defined inside outer_function
inner_function <- function(){
return(x+y) # x comes from outer_function argument, y from its environment
}
return(inner_function) # Returns a function
}
inst <- outer_function(5) # creates an instance of inner_function
inst() # calls inner_function
- The function inner_function() is created inside outer_function(), so it remembers the environment where it was defined.
- Even after outer_function() has finished running, inner_function() still retains access to y and x, due to lexical scoping.
4.3.3 Closure
Key Aspects of Closures in R:
- Function + Environment: A closure consists of the function argument, body along with the environment in which it was created.
- Lexical Scoping: The function retains assess to variables from the environment in which it was defined.
- Encapsulation: Closures help create private variables and encapsulated states, making them useful for writing modular code.
Examples of a Closures in R:
make_counter <- function() {
count <- 0 # This variable will be remembered
function() {
count <<- count + 1 # Modify the variable in the enclosing environment
return(count)
}
}
typeof(make_counter)
counter1 <- make_counter() # Create a new counter instance
counter2 <- make_counter() # Create another independent counter
counter1() # 1
counter1() # 2
counter2() # 1 (Independent counter)
counter1() # 3
enviroment(counter1)
- make_counter() creates a function that increments count every time it is called.
- The variable count persists inside the function due to lexical scoping
- counter1 and counter2 maintain independent states because each call to make_counter() creates a new closure with its own count variables
###### Compare it with the following
make_counter <- function() {
count <- 0 # This variable will be remembered
function() {
count <- count + 1 # Modify the variable in the enclosing environment
return(count)
}
}
typeof(make_counter)
counter1 <- make_counter() # Create a new counter instance
counter2 <- make_counter() # Create another independent counter
counter1() # 1
counter1() # 1
counter2() # 1
counter1() # 1
Thus, Closures keeps state private by maintaining their own state without modifying global variables.
4.3.4 <- and <<-
What is the difference between ‘<-’ and ‘<<-’:
- <- single arrow
- This is the most commonly used assignment operator in R
- It assigns a value to a variable in the current environment (the local environment within a function)
- If the variable already exists in the local environment it updates the value
- If the variable does not exist, it creates a new one in the local scope
- <<- double arrow
- This operator is used for global assignment or modifying variables in parent environments
- If the variable exists in a parent environment (but not in the global), it updates that variable
- If the variable does not exist in any parent environment, it creates in the global environment
rm(list=ls())
# Single Arrow
foo <- function() {
x<-10
}
foo()
x # x not found
# Double Arrow
foo <- function() {
x<<-10
}
foo()
x # x is available globally
# Modifying a variable in an enclosing environment
outer_func <- function() {
y <- 5 # local to outer_func()
inner_func <- function(){
y<<-10
}
inner_func()
print(y) # Prints 10, because inner_func() modified y in the outer_func()
}
outer_func()
y # not found in the top level
## Also look what happens between the two
f <- function() {
inc <- function() {x <- x+1}
x<-3
inc()
return(x)
}
f()
x
f <- function() {
inc <- function() {x <<- x+1}
x<-3
inc()
return(x)
}
f()
x
Avoid using ‘<<-’ unless necessary because it can lead to unintended side effects, making debugging harder.
4.3.5 <- and =
Ok, So then what is difference between ‘=’ and ‘<-’. Both assign values, but they have different conventions and recommended uses:
# Variable Assignment (both works, but = is less common)
x <- 10
xx = 10
x
xx
# In function Arguments (both works, but = is more common)
rm(x)
mean(x=c(1,2,3))
print(x)
mean(x<-c(1,2,3)) # this is using two steps: assign x in global and then compute mean
print(x)
However, using <- inside function calls is bad practice because:
- It unexpectedly modifies the global variable x, which may not be the intended behavior
- It reduces code clarity- it’s harder to read and understand
- It’s better to separate assignment and function execution.
4.3.6 Writing upper-level variables
two <- function(u) {
assign('u',2*u,pos=.GlobalEnv)
z <- 2*z
}
z<- 14
two(3) # u was updated at global
z
# It is the same as...
two <- function(u) {
u <<- 2*u
z <-2*z
}
z<- 14
two(3) # u was updated at global
<<- defines variable one level up.
4.3.7 RECAP: Environment and Scope Issues
A function – formally referred to as a closure in the R documentation – consists not only of its arguments and body but also of its environment. The environment is made up of the collection of objects present at the time the function is created. An understanding of how environments work in R is essential for writing effective R functions. Consider the example:
rm(list=ls()) # remove all objects in the global environment
f <- function(y) {
d<-8
h <- function() {
return(d*(w+y))
}
return(h())
}
environment(f)
ls()
ls.str()
w <- 12
f(2)
h
d
f() us created at the top level which is referred to as R_GlobalEnv. The variable w is global to f() and h(), while d is local to f() and the function h() is local to f().
If it was not nested.
rm(list=ls())
h <- function() {
return(d*(w+y))
}
f <- function(y) {
d<-8
return(h())
}
w<-12
f(5)
This does not work, as d is no longer in the environment of h(), because h() is defined at the top level. Also R does not complain about the lack of y in the definition of h() because R does not evaluate a variable until it needs it under a policy called lazy evaluation. In this case, R had already encountered an error with d and thus never got to the point where it would try to evaluate y. Let’s fix it.
rm(list=ls())
h <- function(ddd,yyy) {
return(ddd*(w+yyy))
}
f <- function(y) {
d<-8
return(h(d,y))
}
w<-12
f(2)
Let’s look at one last variation of the same thing:
rm(list=ls())
h <- function(ddd,yyy) {
return(ddd*(w+yyy))
}
f <- function(y,ftn) {
d <- 8
print(environment(ftn))
return(ftn(d,y))
}
w<-12
f(3,h)
- When f() executed, the formal argument ftn was matched by the actual argument h.
- Since arguments are treated as locals, you might guess that ftn could have a different environment than top level.
- A closure includes environment, and thus ftn has h’s environment.
4.3.8 DIY
Write an R function called create_adder() that:
- Takes a number x as input.
- Returns a function that adds x to a given number
The results would be:
## [1] 15
## [1] 30
Q1. What environment does the returned function belong to?
Q2. What happens if you create multiple instances of create_adder() with different values?
4.4 More on ls()
Without arguments, a call to ls() within a function returns the names of the current local variables (including arguments). With the envir argument, it will print the names of the locals of any frame in the call chain. Here’s example.
rm(list=ls())
f <- function(y) {
d <- 8
return(h(d,y))
}
h <- function(dd,yy) {
print(ls())
print(ls(envir=parent.frame(n=1)))
return(dd*(w+yy))
}
w<- 12
f(2)
With parent.frame(), the argument n specifies how many frames to go up in the call chain.
Let’s another example.
rm(list=ls())
f<- function(y) {
print(ls())
print(ls(parent.frame(n=1)))
d <- 8
w <- w+1
y <- y-2
print(w)
h <- function() {
return(d*(w+y))
}
return(h())
}
w<-12
t<-4
f(t)
w # global was not changed
t # did not change
y # not working because it was defined local to f
Thus, there is two seperate memories for global and environment of f. w existed in the global but the computation is performed under the environment of f. So the value of w at global has not been touched.
4.5 Access to environment hierarchies
In debugging setting, you often want to know the values of local variables in your current function. You may also want to know the values of the locals in parent function.
f <- function() {
a <- 1
return(g(a)+a)
}
g<-function(aa) {
b <- 2
aab <- h(aa+b)
return(aab)
}
h <- function(aaa) {
c <- 3
return(aaa+c)
}
f()
Here is the code
# Shows the values of the local variables(including arguments)
# of the frame upn=0 for current, 1 for one up, <0 for global
showframe <- function(upn) {
# determine the proper environment
if (upn<0) {
env <- .GlobalEnv
}else {
env <- parent.frame(upn+1)
}
# get the list of variable names, print its value
vars <- ls(envir=env)
# for each variable name, print its value
for (vr in vars) {
vrg <- get(vr,envir=env)
if (!is.function(vrg)){
cat(vr,':\n',sep='')
print(vrg)
}
}
}
Insert this into g().
g <-function(aa) {
b <- 2
aab <- h(aa+b)
cat('current frame','\n')
showframe(0)
cat('One frame Up','\n')
showframe(1)
return(aab)
}
f()
4.6 Recursion
for (i in 1:5) {
print(i)
}
printNumbers <- function(n) {
if (n > 0) {
printNumbers(n - 1)
print(n)
}
}
printNumbers(5)
printNumbers <- function(n) {
if (n > 0) {
print(n)
printNumbers(n - 1)
}
}
printNumbers(5)
Recursion programing이란 함수가 자기 자신을 호출하는 방식을 이용해 문제를 해결하는 프로그래밍 기법. 큰 문제를 같은 종류의 더 작은 문제로 쪼개어 풀어감.
A classic example is Quicksort, an algorithm used to sort a vector of numbers from smallest to largest. Suppose we wish to sort (5,4,12,13,3,8,88).
- We first compare everything to the first element, to form subvectors {(4,3), (5), (12,13,8,88)}.
- Then {(3),(4), (5), (8), (12), (13,88)}
- Then {(3),(4), (5), (8), (12), (13), (88)}
Let’s implement it
QS <- function(x) {
if (length(x)<=1) return(x)
pivot <- x[1]
rest <- x[-1]
x1 <- rest[rest<pivot]
x2 <- rest[rest>=pivot]
x1 <- QS(x1)
x2 <- QS(x2)
c(x1,pivot,x2)
}
There should be ‘termination condition’ to prevent infinite looping.
if (length(x)<=1) return(x)
Recursion is an ‘elegant’ way of programming.
DIY Let’s do factorial!
4.6.1 Example: Operations in Tree-like data structure (Binary Search Tree)
createNode <- function(key) {
list(key=key, left=NULL, right=NULL)
}
insertNode <- function(root, key) {
if (is.null(root)) return(createNode(key))
if (key < root$key) {
root$left <- insertNode(root$left,key)
}else if(key>root$key) {
root$right <-insertNode(root$right,key)
}
root
}
root <- NULL
values <- c(10, 5, 15, 3, 7, 12, 18)
for (v in values) {
root <- insertNode(root, v)
}
############################# Functions for plotting the tree ##########################
# Count depth for spacing
treeDepth <- function(node) {
if (is.null(node)) return(0)
return(1 + max(treeDepth(node$left), treeDepth(node$right)))
}
# Recursive plotting function
plotTree <- function(node, x, y, dx, dy, parentX = NULL, parentY = NULL) {
if (is.null(node)) return()
# Draw line to parent
if (!is.null(parentX)) {
segments(x0 = parentX, y0 = parentY, x1 = x, y1 = y)
}
# Draw node
symbols(x, y, circles = 0.5, inches = FALSE, add = TRUE)
text(x, y, labels = node$key, cex = 1)
# Recursive calls
if (!is.null(node$left)) {
plotTree(node$left, x - dx, y - dy, dx / 1.5, dy, x, y)
}
if (!is.null(node$right)) {
plotTree(node$right, x + dx, y - dy, dx / 1.5, dy, x, y)
}
}
# Initialize plot
plotBST <- function(root) {
depth <- treeDepth(root)
plot(0, 0, type = "n", xlim = c(-10, 10), ylim = c(-depth * 2, 2),
xlab = "", ylab = "", axes = FALSE, main = "Binary Search Tree")
plotTree(root, x = 0, y = 0, dx = 5, dy = 2)
}
Let’s build a binary search tree from a vector:
root <- NULL
values <- sample(1:100)
values
for (v in values) {
root <- insertNode(root, v)
}
plotBST(root)