2 Examples
2.1 Example 1: Simple linear regression
The admissions committee of a comprehensive state university selected at random the records of 200 second-semester freshmen. The results, first-semester college GPA and SAT scores, are stored in the data frame GRADES. The admissions committee wants to study the linear relationship between first-semester college grade point average (gpa) and scholastic aptitude test (sat) scores.
(a) i. Create a scatterplot of the data to investigate the relationship between gpa and sat scores.
We can use the function plot(). Which variables do we want on the x and y axes?
plot(gpa ~ sat, data = GRADES, xlab = "SAT score", ylab = "GPA")
- Look at your plot. What trends do you see?
(b) Obtain the least squares estimates for \(\beta_0\) and \(\beta_1\), and state the estimated regression function using
- Summation notation
We can store the two variables as x and Y. These can then be used in the formulas for the least squares estimates of \(\beta_0\) and \(\beta_1\) as given in the textbook.
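For reference, the standard least squares formulas for simple linear regression are written out here. The estimates are denoted \(b_0\) and \(b_1\), which is an assumption about notation; the textbook may use different symbols.

\[
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{x}
\]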
Y <- GRADES$gpa
x <- GRADES$sat
b1 <- sum((x - mean(x)) * (Y - mean(Y)))/sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)
c(b0, b1)
[1] -1.19206381 0.00309427
- Using the R function lm() to verify your answer in (b) i.
The lm() function needs a model formula, with the response on the left of the tilde (~) and the predictor on the right, and the data frame that contains those variables.
> model.lm <- lm(gpa ~ sat, data = GRADES)
> coef(model.lm)
(Intercept)         sat
-1.19206381  0.00309427
The estimated regression function is therefore:
\(\hat{Y}_i = -1.1921 + 0.0031x_i\)
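As a quick sanity check, the agreement between the summation formulas and lm() can be reproduced on simulated data. This is only a sketch: it does not use GRADES, and the data frame `sim`, the seed, and the coefficients used to generate the data are made up for illustration.

```r
# Simulated stand-in for GRADES (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(sat = round(runif(200, 800, 1500)))
sim$gpa <- -1.2 + 0.003 * sim$sat + rnorm(200, sd = 0.4)

# Least squares estimates from the summation formulas
x <- sim$sat
Y <- sim$gpa
b1 <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(gpa ~ sat, data = sim)

# TRUE when the two approaches agree up to numerical tolerance
agree <- isTRUE(all.equal(c(b0, b1), unname(coef(fit))))
agree
```

The comparison uses all.equal() rather than == because the two computations can differ in the last few floating-point digits.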
(c) What is the correct interpretation of the regression function?
(d) Use R to calculate the point estimate of the change in the mean GPA when the SAT score increases by 50 points:
> b1*50
[1] 0.1547135
2.2 Example 2: Multiple linear regression
In (b) ii., the function lm() was used to find estimates for \(\beta_0\) and \(\beta_1\) for a simple linear regression model. To use lm() for a multiple linear regression model, list the predictors, separated by +, on the right side of the tilde (~) operator inside the lm() function. The data frame HSWRESTLER contains the body fat measurements of 78 high school wrestlers. Try to create a multiple linear regression model for regressing hwfat (hydrostatic fat, the response variable) onto abs (abdominal fat) and triceps (tricep fat).
The R code below stores the multiple linear regression model for regressing hwfat (hydrostatic fat) onto abs (abdominal fat) and triceps (tricep fat). The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) determine the plane of best fit for the data.
hsw.lm <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
coef(summary(hsw.lm)) # lm coefficients
The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) are
\(\beta_0\):
\(\beta_1 = \beta_{abs}\):
\(\beta_2 = \beta_{triceps}\):
Interpret what each coefficient means:
A one-unit increase in abdominal fat, with tricep fat held fixed, is associated with an increase of roughly ____ in mean hydrostatic fat. Similarly, a one-unit increase in tricep fat, with abdominal fat held fixed, is associated with an increase of roughly ____ in mean hydrostatic fat.
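One way to see what each coefficient measures is to compare fitted values for two cases that differ in a single predictor. The sketch below uses simulated data, not HSWRESTLER; the data frame `sim2`, the seed, and the generating coefficients are hypothetical. It fits a two-predictor model and checks that raising abs by one unit, with triceps unchanged, shifts the fitted value by exactly the abs coefficient.

```r
# Simulated stand-in for HSWRESTLER (hypothetical values, not the real data)
set.seed(2)
sim2 <- data.frame(abs = runif(78, 5, 40), triceps = runif(78, 5, 30))
sim2$hwfat <- 1 + 0.3 * sim2$abs + 0.4 * sim2$triceps + rnorm(78, sd = 2)

# Multiple linear regression with two predictors
fit2 <- lm(hwfat ~ abs + triceps, data = sim2)
b <- coef(fit2)

# Two hypothetical cases: same triceps value, abs differs by one unit
newdat <- data.frame(abs = c(20, 21), triceps = c(15, 15))
pred <- predict(fit2, newdata = newdat)

# The difference in fitted values matches the abs coefficient
diff_abs <- unname(pred[2] - pred[1])
c(diff_abs, unname(b["abs"]))
```

The same comparison with abs held fixed and triceps increased by one unit would recover the triceps coefficient.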