3 Binary Covariates, Outliers, and Influential Observations
Learning objectives
By the end of this week you should be able to:
Formulate, carry out and interpret a simple linear regression with a binary independent variable
Check for outliers and influential observations using measures and diagnostic plots
Understand the key principles of methods to deal with outliers and influential observations
Learning activities
This week’s learning activities include:
Learning Activity      Learning objectives
Video 1                1
Notes & Readings       2, 3
Binary covariates
So far we have limited our analysis to continuous independent variables (e.g. age). This week we are going to explore the situation where the independent variable (the \(x\) variable) is binary.
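To make this concrete, below is a minimal sketch (our own illustration with simulated data; the 0/1 variable group is hypothetical) showing that a regression on a binary covariate reproduces the equal-variance two-sample t-test, a point we return to in the Summary.

R code

set.seed(42)
group <- rep(c(0, 1), each = 20)              # hypothetical binary covariate (0/1)
y <- rnorm(40, mean = 5 + 2 * group, sd = 1)  # simulated outcome
summary(lm(y ~ group))               # slope estimates the difference in group means
t.test(y ~ group, var.equal = TRUE)  # equal-variance t-test: same p-value (t equal up to sign)

The estimated slope equals the difference between the two group means, and the test of the slope matches the equal-variance two-sample t-test.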
Outliers, Leverage points and influential observations
An outlier is a point with a large residual (i.e., a point that lies a long way, in the y-direction, from the fitted line). Sometimes an outlier can have a large impact on the estimates of the regression parameters. If you move some of the points in the scatterplot below so that they become outliers (far from the other points on the y-axis / the regression line), you can see this will affect the estimated regression line. However, not all outliers are the same. Try moving one of the points at the beginning or the end of the \(X\) scale up and down. The impact on the regression line is much stronger than if you do the same with a point in the mid-range of \(X\).
A leverage point is an observation with an unusually high or low value for a predictor variable (i.e. it lies far away from the other values for that predictor variable). Such points may be considered “outliers” (using the lay meaning of “outlier”, rather than our definition above) in terms of the predictor values because they fall well outside the typical range of the other observations for that variable (unusual in the x-direction). These points also have the potential to affect the regression line.
A data point may not have a large residual but still have an important influence on the estimated regression line. Below, you can see that the data point on the left does not appear to have a large residual, but it strongly affects the regression line.
An influential observation/point is one that has a large impact on the regression analysis; e.g., if we were to remove the point, the regression coefficient estimates might change and/or the significance (p-values) of some of the terms in the model might change. Outliers and leverage points have the potential to be influential points, but we usually need to do some investigations to see whether they are influential.
(Note that the terminology for these types of points varies and not everyone agrees on a formal definition (e.g., some use the term “outlier” to describe observations that are unusual in either the \(X\) or the \(Y\) direction). For RM1, we will use the definitions above.)
There are several statistics to measure and explore outliers, leverage and influential points. We will discuss some here:
Leverage: a measure of how far away each value of the independent variable is from the others. Data points with high leverage lie some distance away from the other values of the independent variable. For simple linear regression, the leverage of observation \(i\) is given by \(\frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\) and ranges between 0 and 1.
Consider the simple example below with 10 simulated data points. The 10th value for \(X\) was chosen to be far from the others.
R code
set.seed(1011)
x <- rnorm(9)                   # 9 random values
x[10] <- 5                      # value far from the others
y <- rnorm(10, 0.5 + 2 * x, 1)  # generate y
lmodel <- lm(y ~ x)             # fit the model
plot(x, y)                      # plot the data
abline(lmodel)                  # add the fitted regression line
Stata code
clear
set seed 1011
set obs 10
generate x = rnormal()
replace x = 5 in 10
generate y = rnormal(0.5 + 2*x, 1)
graph twoway (lfit y x) (scatter y x)
regress y x

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(1, 8)         =     52.69
       Model |  86.4227998         1  86.4227998   Prob > F        =    0.0001
    Residual |  13.1226538         8  1.64033173   R-squared       =    0.8682
-------------+----------------------------------   Adj R-squared   =    0.8517
       Total |  99.5454536         9   11.060606   Root MSE        =    1.2808

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |    1.98254   .2731326     7.26   0.000     1.352695    2.612385
       _cons |    .516953    .477044     1.08   0.310    -.5831126    1.617018
------------------------------------------------------------------------------
If we compute the leverage of each point, it is not surprising that the observation with \(X=5\) has a high leverage.
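In R, the same quantities can be obtained from the fitted model (a sketch using lmodel and x from the R code above; hatvalues() is the base R function for leverage):

R code

lev <- hatvalues(lmodel)                                     # leverage from the fitted model
lev_manual <- 1/10 + (x - mean(x))^2 / sum((x - mean(x))^2)  # formula above, with n = 10
cbind(lev, lev_manual)                                       # the 10th point (x = 5) stands out

The equivalent Stata code and output: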
clear
set seed 1011
* generate the data
set obs 10
generate x = rnormal()
replace x = 5 in 10
generate y = rnormal(0.5 + 2*x, 1)
regress y x
* leverage
predict lev, leverage
* leverage computed manually
gen lev_manual = 1/10 + (x - .9228745)^2/(1.563043^2 * 9)
list

     +---------------------------------------------+
     |         x          y        lev   lev_ma~l  |
     |---------------------------------------------|
  1. |  .8897727   2.534658   .1000498   .1000498  |
  2. |  .0675149   1.056028   .1332746   .1332746  |
  3. |  .1199856   1.148383   .1293175   .1293175  |
  4. |  .1817483   1.488953   .1249804   .1249804  |
  5. |  1.257954   3.040046   .1051064   .1051064  |
     |---------------------------------------------|
  6. | -.2278177  -1.908616    .160219   .1602191  |
  7. | -.1390651  -1.624345   .1512879   .1512879  |
  8. |  .3885022   2.991632   .1129868   .1129868  |
  9. |   1.69015   5.084781   .1267743   .1267743  |
 10. |         5   9.654363   .8560032   .8560035  |
     +---------------------------------------------+
DFBETA: We can also compute how the coefficients change when each observation is removed from the data. This produces a vector for \(\beta_0\) and one for \(\beta_1\), corresponding to \(n\) regressions fitted by deleting one observation at a time. The difference between the full-data estimates and the estimates obtained by removing each data point is called DFBETA. For the small simulated dataset above, the DFBETA can also be obtained from the influence() function in R.
# dfbeta(lmodel)  # does the same thing
# computing the DFBETA manually for the 10th observation
coef(lm(y ~ x)) - coef(lm(y[-10] ~ x[-10]))
(Intercept) x
0.003248704 -0.126864091
Note that the DFBETA above is computed on the original scale of the data, so the magnitude of the difference depends on this scale. An alternative is to standardise the DFBETA using the standard errors. This gives the deviation from the original estimate in units of standard errors. A common cut-off for a very strong influence on the results is the value \(2\).
# computes the standardised dfbeta
# note that there is also a dfbeta() function that
# computes the non-standardised dfbeta
dfbetas(lmodel)
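As a quick check against the cut-off of 2, you could flag the influential rows directly (a sketch; dfbetas() and apply() are base R):

R code

db <- dfbetas(lmodel)         # n x 2 matrix: intercept and slope columns
apply(abs(db) > 2, 1, any)    # TRUE flags observations beyond the cut-off;
                              # here only the 10th observation is flagged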
The code below is for computing the standardised DFBETA in Stata (we are not aware of a function in Stata to compute the unstandardised version of DFBETA; you would have to write your own code to do this).
clear
set seed 1011
* generate the data
set obs 10
generate x = rnormal()
replace x = 5 in 10
generate y = rnormal(0.5 + 2*x, 1)
regress y x
dfbeta
list

Generating DFBETA variable ...
_dfbeta_1: DFBETA x

     +-----------------------------------+
     |         x          y   _dfbeta_1  |
     |-----------------------------------|
  1. |  .8897727   2.534658   -.0014574  |
  2. |  .0675149   1.056028   -.0627431  |
  3. |  .1199856   1.148383   -.0569127  |
  4. |  .1817483   1.488953   -.0820419  |
  5. |  1.257954   3.040046    .0017001  |
     |-----------------------------------|
  6. | -.2278177  -1.908616    .5239667  |
  7. | -.1390651  -1.624345    .4384975  |
  8. |  .3885022   2.991632   -.1846272  |
  9. |   1.69015   5.084781    .1784969  |
 10. |         5   9.654363   -4.140455  |
     +-----------------------------------+
In the example above, the 10th observation appears to have a considerable impact on the estimates: its standardised DFBETA of \(-4.14\) is well beyond the cut-off of 2.
Cook’s distance: This is another measure of influence that combines the leverage of the data point and its residual. It summarises how much all the fitted values in the regression model change when each observation is removed. It is computed as
\[ D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2}, \]
where \(\hat{y}_{j(i)}\) denotes the fitted value for observation \(j\) when the model is refitted without observation \(i\), \(p\) is the number of regression parameters (here 2), and \(\hat{\sigma}^2\) is the estimated residual variance. A common rule of thumb is to examine observations with \(D_i > 4/n\).
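In R, Cook’s distance is available directly from the fitted model (a sketch using lmodel from above; cooks.distance() is base R):

R code

cd <- cooks.distance(lmodel)
round(cd, 4)          # one value per observation
which(cd > 4 / 10)    # rule-of-thumb threshold 4/n with n = 10

The corresponding Stata code and output: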
clear
set seed 1011
* generate the data
set obs 10
generate x = rnormal()
replace x = 5 in 10
generate y = rnormal(0.5 + 2*x, 1)
regress y x
predict cook_d, cooksd
list

     +----------------------------------+
     |         x          y     cook_d  |
     |----------------------------------|
  1. |  .8897727   2.534658   .0024235  |
  2. |  .0675149   1.056028     .00888  |
  3. |  .1199856   1.148383   .0080535  |
  4. |  .1817483   1.488953   .0186161  |
  5. |  1.257954   3.040046    .000034  |
     |----------------------------------|
  6. | -.2278177  -1.908616   .2698209  |
  7. | -.1390651  -1.624345   .2228213  |
  8. |  .3885022   2.991632   .1271682  |
  9. |   1.69015   5.084781   .0750629  |
 10. |         5   9.654363   7.563721  |
     +----------------------------------+
With the simulated sample of 10 observations, the rule-of-thumb threshold is \(4/10 = 0.4\). Again, the 10th observation (Cook’s distance of 7.56) is far above this threshold and would require some consideration.
Plots: The above measures are commonly represented graphically. There are many variations of these plots; below are some examples, but many others are available in different packages.
R code
# leverage vs residuals
# a data point with high leverage and a high residual may be problematic
plot(lmodel, 5)

# Cook's distance
plot(lmodel, 4)

# leverage vs Cook's distance
plot(lmodel, 6)
Stata code
clear
set seed 1011
* generate the data
set obs 10
generate x = rnormal()
replace x = 5 in 10
generate y = rnormal(0.5 + 2*x, 1)
regress y x
lvr2plot
Book Chapter 4. Outlying, High Leverage, and Influential Points, Section 4.7.4 (pages 124-128)
This reading supplements the notes above, with emphasis on the DFBETA plots. Note that this section appears in the book after the extension of simple linear regression to multiple independent variables (covariates) in the regression model, which we have not yet covered (Module 4). However, there are only a few references to the multiple linear regression case.
Exercises:
The dataset lowbwt.csv was part of a study aiming to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams).
1 - Fit a linear model for the variable bwt (birth weight) using the covariate age (mother’s age), evaluate the assumptions and interpret the results.
2 - Evaluate potential outliers and influential observations. How would the results change if you excluded this/these observation(s)?
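If you want a starting point, here is a hedged sketch (it assumes lowbwt.csv sits in your working directory and contains variables named bwt and age; adjust names and paths as needed):

R code

lowbwt <- read.csv("lowbwt.csv")          # assumed file and variable names
m <- lm(bwt ~ age, data = lowbwt)
summary(m)                                # question 1: fit and interpret
plot(m, 5)                                # residuals vs leverage diagnostic
cd <- cooks.distance(m)
flagged <- which(cd > 4 / nrow(lowbwt))   # rule-of-thumb threshold 4/n
flagged
# question 2 sensitivity analysis: refit without flagged observations and compare
if (length(flagged) > 0) summary(lm(bwt ~ age, data = lowbwt[-flagged, ]))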
Summary
This week's key concepts are:
Linear regression with a binary covariate is equivalent to a t-test (that assumes equal variance).
Outliers are observations with a very large absolute residual value. That is, we normally refer to “outliers” as observations with extreme values in the outcome variable \(Y\) (i.e. they are “unusual” in the \(Y\) direction).
Leverage points are observations with unusual covariate \(x\) values (have high “leverage”). The precise formula for leverage is less important than understanding how high leverage observations can impact your regression.
A residual versus leverage plot is a very useful diagnostic to see which observations may be highly influential.
Observations that are outliers (in the outcome) but have low leverage may influence the intercept of your regression model.
Observations that are not outliers but have high leverage might artificially inflate the precision of your regression model.
Observations that are outliers AND have high leverage may influence both the intercept and the slope of your regression model.
When potentially highly influential observations are detected, a sensitivity analysis where the results are compared with and without those observations is a useful tool for measuring influence.