11.3 Writing functions

Before we go any further with simulating data sets, I want to introduce you to writing your own functions. If you do complex simulations, your code will end up hundreds of lines long, with lots of potential for error, unless you make it more economical by defining some functions. Functions allow you to perform, in a single line, the basic operations that are needed repeatedly. Defining them will allow you to code more economically, more clearly, and with less risk of error.

11.3.1 The definition statement

To define your own function, you need a definition statement, in which you use a function called function() (!) to set up the name, structure and arguments of your function.

Let’s make a simple function called simulate() that is going to make a new simulated data set in the way that the first code chunk of this chapter did. Having the function defined means that we can get a new data set with a single line of code whenever we want (as long as we have run the definition statement once during the session). The {} encloses the body of the function (the steps that will actually be performed internally within the function every time it is called), and the return() statement specifies what the function ‘spits out’ into the environment of your R session.

simulate <- function(n, effect.size) {
  Condition <- c(rep("Control", times = n), rep("Experimental", times = n))
  Outcome <- c(rnorm(n = n, mean = 0, sd = 1), rnorm(n = n, mean = effect.size, sd = 1))
  return(data.frame(Condition, Outcome))
}

Run this. Now, when we want a simulated data set, we just call simulate().

new.dataset <- simulate(n = 50, effect.size = 0.2)

And, lo and behold, you have a new simulated data set called new.dataset in your environment to analyse.
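Just as a quick check, you could peek at the first few rows of the new data frame to reassure yourself it has the structure you expect (your exact numbers will differ, because the outcomes are drawn at random):

head(new.dataset)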

11.3.2 Adding defaults

The definition statement stipulates that any call of the simulate() function must provide the arguments n and effect.size. If you fail to provide either of these, you will get an error, as you will see if you run the code below.

simulate(n = 50)

However, perhaps there is one case that you will want to use your function for most frequently. Perhaps you will usually want data sets with n = 50 and effect.size = 0.2, because this is what is most relevant to the experiment you are planning. In this case, you can set these values as defaults in the definition statement. This means that you won’t need to provide any values for those arguments when you call the function, in which case the defaults will be used; but you can do so if you wish, overriding the defaults.

You specify defaults by putting them in the definition statement as follows:

simulate <- function(n = 50, effect.size = 0.2) {
  Condition <- c(rep("Control", times = n), rep("Experimental", times = n))
  Outcome <- c(rnorm(n = n, mean = 0, sd = 1), rnorm(n = n, mean = effect.size, sd = 1))
  return(data.frame(Condition, Outcome))
}

Now, if you want a data set with n = 50 and effect.size = 0.2, you don’t need to type these values, as they are supplied by default. If, on the other hand, you want any other value for these arguments, you can get it by saying so.

# Dataset1: n of 50 and effect size of 0.2
dataset1 <- simulate()
# Dataset2: n of 100 and effect size of 0.2
dataset2 <- simulate(n = 100)
# Dataset3: n of 50 and effect size of 0.4
dataset3 <- simulate(effect.size = 0.4)
# Dataset4: n of 100 and effect size of 0.4
dataset4 <- simulate(n = 100, effect.size = 0.4)
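As an optional sanity check, you could confirm that overriding n really does change the size of the data set; with n participants per group, each data frame should have 2 × n rows:

# dataset1 used the default n = 50, so it has 100 rows
nrow(dataset1)
# dataset2 overrode this with n = 100, so it has 200 rows
nrow(dataset2)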

11.3.3 Returning to our simulation

Returning to our simulation, it’s interesting to simulate data sets a few times and see what kind of results we get. For example, you could run the code below repeatedly, to get a sense of how often you get a parameter estimate around 0.2, and how often you get a significant p-value. You could also play around with different values for n and effect.size.

m1 <- lm(Outcome ~ Condition, data = simulate())
summary(m1)$coefficients
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)              0.299      0.138   2.170   0.0325
## ConditionExperimental   -0.174      0.195  -0.893   0.3741
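For instance, if you wanted to try a larger sample with a bigger effect, you could change the arguments in the call to simulate(). The name m2 and the particular values of n and effect.size below are just arbitrary choices for illustration, and your estimates will differ every run:

m2 <- lm(Outcome ~ Condition, data = simulate(n = 200, effect.size = 0.4))
summary(m2)$coefficients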

Just running the simulation again and again is not very systematic, however. How about we run the simulation 100 times and ask (a) what parameter estimate do we get each time; and (b) is it significantly different from zero? How would we do this?

The code below will do the job. The comments explain how each line works.

# Set up two empty variables, `Estimate` to contain the parameter estimates, and `Significant` to record whether each one was significant or not
Estimate <- NULL
Significant <- NULL
# Define a for loop: run the code between the {} repeatedly, until the index i reaches 100
for (i in 1:100) {
  m <- lm(Outcome ~ Condition, data = simulate())
  # Write the parameter estimate from the current simulation to the variable `Estimate`
  Estimate[i] <- summary(m)$coefficients[2, 1]
  # Write "Significant" to the variable `Significant` if p < 0.05, "Non-significant" otherwise
  Significant[i] <- ifelse(summary(m)$coefficients[2, 4] < 0.05, "Significant", "Non-significant")
}
# Make a data frame called Output with the results
Output <- data.frame(Estimate, Significant)

Now you should have a data frame called Output with 100 rows, each one giving you a parameter estimate from one simulated dataset, and whether it was significantly different from zero or not. Let’s get a sense of what happened. First, what did our parameter estimates look like?

Output %>% summarise(mean(Estimate), sd(Estimate), min(Estimate), max(Estimate))
##   mean(Estimate) sd(Estimate) min(Estimate) max(Estimate)
## 1          0.203        0.194         -0.19         0.833

Our mean parameter estimate is close to 0.2, which is reassuring, as 0.2 is the true value. But, we have a standard deviation of around 0.2 too, meaning that in a typical individual data set of 50 participants per group, our estimate of the parameter is off the true value by about 0.2 in either direction. And within the hundred data sets, we observe parameter estimates as discrepant as -0.19 and 0.83.
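If you want a slightly more formal summary of that spread, you could look at the quantiles of the estimates (the exact values will differ every time you run the simulations):

quantile(Output$Estimate, probs = c(0.025, 0.5, 0.975))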

What about significance? How many times out of 100 did we observe a significant result?

table(Output$Significant)
## 
## Non-significant     Significant 
##              86              14
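If you prefer a proportion to a raw count, the same information can be obtained as a fraction of the 100 simulations:

mean(Output$Significant == "Significant")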

Not many! Only a bit better than what we would have expected if the true parameter had been zero (i.e. 5). The other thing that we can do is to make a graph showing how our parameter estimate varies from simulation to simulation.

ggplot(Output, aes(x = Estimate, fill = Significant)) +
  geom_histogram(bins = 10) +
  theme_classic() +
  ylab("Number of simulations") +
  # I am going to add a vertical line showing the true parameter value
  geom_vline(xintercept = 0.2) +
  # And another dotted one at zero
  geom_vline(xintercept = 0, linetype = "dotted")

So, a few things should be clear. With 50 participants per group, in any individual data set we can be a long way off in our estimate of the ACE of our manipulation. We can even be wrong about the direction of the effect! Plus, the effect is rarely significant. In fact, the only times we do get a significant result are the times when, by chance, we have wildly overestimated the true effect size. When our observed effect size is close to the true one, our test always returns “non-significant”. So, it should be immediately clear that 50 participants per group is not an adequate sample size for this experiment. We will return to this insight in section 11.4.

11.3.4 Making functions that call your functions

We are going to need to run the code that generated the data frame Output repeatedly, using different assumptions about n and effect.size. So, it would be useful to make a function that runs all of that for us in one line. This means defining another function (which I will call hundred.sims()) that calls our existing function simulate(). We define it using the code that follows. Note that it is set up so that we give the values of n and effect.size to the outer function, hundred.sims(), and they are in turn passed to the inner function simulate().

hundred.sims <- function(n = 50, effect.size = 0.2) {
  Estimate <- NULL
  Significant <- NULL
  for (i in 1:100) {
    m <- lm(Outcome ~ Condition, data = simulate(n = n, effect.size = effect.size))
    Estimate[i] <- summary(m)$coefficients[2, 1]
    Significant[i] <- ifelse(summary(m)$coefficients[2, 4] < 0.05, "Significant", "Non-significant")
  }
  return(data.frame(Estimate, Significant))
}

Now, if we run hundred.sims(), we are going to get the results from a hundred simulations with n = 50 and effect.size = 0.2. If we run hundred.sims(n = 100), we are going to get the results from a hundred simulations with n = 100 and effect.size = 0.2. And so on: the hundred.sims() function allows you to vary n and effect.size as you wish. We will make use of hundred.sims() in the next section.
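For example, to explore what happens with 100 participants per group, you could run something like the following (the object name sims.n100 is just an arbitrary choice, and your exact counts will vary from run to run):

sims.n100 <- hundred.sims(n = 100)
table(sims.n100$Significant)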