Diagnostics and Remedial Measures in \(R\)

Read data from an URL

toluca <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%201%20Data%20Sets/CH01TA01.txt", sep ="" , header = FALSE)


#Look at the first 6 entries
head(toluca)
  V1  V2
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224

Rename columns

colnames(toluca) <- c("lotSize", "hours")

#Look at the first 6 entries
head(toluca)
  lotSize hours
1      80   399
2      30   121
3      50   221
4      90   376
5      70   361
6      60   224

Convert your data into a dataframe

tolucaDF <- data.frame(toluca)
head(tolucaDF)
  lotSize hours
1      80   399
2      30   121
3      50   221
4      90   376
5      70   361
6      60   224

Diagnostic Plots for Predictor Variable-Toluca Company Example

Dot plot

The dot plot for the lot sizes in the Toluca Company example.

Sequence plot

The Sequence plot for the lot sizes in the Toluca Company example.

Stem-and-leaf plot

Stem-and-leaf plot of the lot sizes for the Toluca Company example. By displaying the last digits, this plot also indicates here that all lot sizes in the Toluca Company example were multiples of 10.

stem(tolucaDF$lotSize, scale = 3)

  The decimal point is 1 digit(s) to the right of the |

   2 | 0
   3 | 000
   4 | 00
   5 | 000
   6 | 0
   7 | 000
   8 | 000
   9 | 0000
  10 | 00
  11 | 00
  12 | 0

Box plot

Box plot of the lot sizes for the Toluca Company example.

Diagnostics for Residuals (Section 3.3 in the textbook)

Create the LS model

Plot of residuals against predictor variable

Plot of residuals against fitted values.

Box plot of residuals

Normal probability plot of residuals.

ggplot(toluca_LS_model, aes(sample = .resid)) + geom_qq(color = "blue") + geom_qq_line()+ theme_bw() #No x or y aesthetics. Just use sample =.resid

Nonnormality of Error Terms

\(\mbox{Exp. Value under Normality} = \sqrt(MSE)[ z(\frac{k-0.375}{n+0.25})]\)

Note that MSE can be calculated as the SSE / DF. Here I am using deviance(toluca_LS_model) to get SSE and df.residual(toluca_LS_model) to get the DF

# Expected value under normality comes from equation (3.6)
tab <- cbind(
  "Residual"                   = round(resid(toluca_LS_model), 2), # rounding upresiduals to 2 decimal places
  "Rank"                       = rank(resid(toluca_LS_model)), # this is where we find k value
  "ExpValueunderNormality" = round(sqrt(deviance(toluca_LS_model) / df.residual(toluca_LS_model)) * 
                                       qnorm((rank(resid(toluca_LS_model)) - 0.375) / (nrow(tolucaDF) + 0.25)), 2)) # deviance function finds the SSE and to find MSE we need to divide it by d.f. Then qnorm gives us the z value of (k-.375/n+.25). Again we have use round function to get final results in 2 decimal places.

tab
   Residual Rank ExpValueunderNormality
1     51.02   22                  51.97
2    -48.47    5                 -44.10
3    -19.88   10                 -14.76
4     -7.68   11                  -9.76
5     48.72   21                  44.10
6    -52.58    4                 -51.97
7     55.21   23                  61.48
8      4.02   15                   9.76
9    -66.39    2                 -74.17
10   -83.88    1                 -95.90
11   -45.17    6                 -37.25
12   -60.28    3                 -61.48
13     5.32   16                  14.76
14   -20.77    8                 -25.33
15   -20.09    9                 -19.93
16     0.61   14                   4.85
17    42.53   20                  37.25
18    27.12   18                  25.33
19    -6.68   12                  -4.85
20   -34.09    7                 -31.05
21   103.53   25                  95.90
22    84.32   24                  74.17
23    38.83   19                  31.05
24    -5.98   13                   0.00
25    10.72   17                  19.93

Correlation Test for Normality

res <- resid(toluca_LS_model)
expRes <- sqrt(deviance(toluca_LS_model) / df.residual(toluca_LS_model)) * 
                                       qnorm((rank(resid(toluca_LS_model)) - 0.375) / (nrow(tolucaDF) + 0.25))
cor(res, expRes)
[1] 0.9915055

Breusch-Pagan Test for Constancy of Error Variance

library(lmtest)
bptest(toluca_LS_model, student = FALSE)

    Breusch-Pagan test

data:  toluca_LS_model
BP = 0.82092, df = 1, p-value = 0.3649

Bank example section 3.7

bankdata <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA04.txt", sep ="" , header = FALSE)


#Look at the first 6 entries
#head(bankdata)

#Rename clumns
colnames(bankdata) <- c("minDeposit", "NoNewAccs") # minDeposit - Size of minimum deposit, NoNewAccs - Number of new accounts

bankdata
   minDeposit NoNewAccs
1         125       160
2         100       112
3         200       124
4          75        28
5         150       152
6         175       156
7          75        42
8         175       124
9         125       150
10        200       104
11        100       136
#Model
bankModel <- lm(NoNewAccs~minDeposit, data = bankdata)
bankModel

Call:
lm(formula = NoNewAccs ~ minDeposit, data = bankdata)

Coefficients:
(Intercept)   minDeposit  
    50.7225       0.4867  
#Scatter Plot
library(ggplot2)
library(latex2exp)

ggplot(bankdata, aes(x = minDeposit, y = NoNewAccs)) +
  geom_point() +
  labs(x = "Size of minimum deposit", y = "Number of new accounts", title = "Bank example scatter plot") +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(aes(x=120, label="yhat = 50.7 + 0.49x", y=85), colour="steelblue", angle=0, text=element_text(size=11)) +
  theme_bw()

F test for lack of fit

# F test for lack of fit

Reduced <- lm(NoNewAccs ~ minDeposit, data = bankdata) # This is our usual model
summary(Reduced)
## 
## Call:
## lm(formula = NoNewAccs ~ minDeposit, data = bankdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## minDeposit    0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
Full <-  lm(NoNewAccs ~ 0 + as.factor(minDeposit), data = bankdata) # This is the full model
summary(Full)
## 
## Call:
## lm(formula = NoNewAccs ~ 0 + as.factor(minDeposit), data = bankdata)
## 
## Residuals:
##   1   2   3   4   5   6   7   8   9  10  11 
##   5 -12  10  -7   0  16   7 -16  -5 -10  12 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## as.factor(minDeposit)75     35.00      10.71   3.267 0.022282 *  
## as.factor(minDeposit)100   124.00      10.71  11.573 8.45e-05 ***
## as.factor(minDeposit)125   155.00      10.71  14.466 2.85e-05 ***
## as.factor(minDeposit)150   152.00      15.15  10.031 0.000168 ***
## as.factor(minDeposit)175   140.00      10.71  13.066 4.68e-05 ***
## as.factor(minDeposit)200   114.00      10.71  10.640 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.15 on 5 degrees of freedom
## Multiple R-squared:  0.9933, Adjusted R-squared:  0.9852 
## F-statistic: 123.1 on 6 and 5 DF,  p-value: 2.893e-05

Testing the hypothesis:

\(H_0: \mbox{Reduced model is good}\) \(\quad\) \(H_a: \mbox{Full model is good}\)

anova (Reduced, Full)
## Analysis of Variance Table
## 
## Model 1: NoNewAccs ~ minDeposit
## Model 2: NoNewAccs ~ 0 + as.factor(minDeposit)
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1      9 14742                                
## 2      5  1148  4     13594 14.801 0.005594 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example on page 129: Sales training example

salesTrain <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA07.txt", sep ="" , header = FALSE)

#head(salesTrain)

#Rename columns
colnames(salesTrain) <- c("daysTrain", "performance") 
head(salesTrain)
  daysTrain performance
1       0.5        42.5
2       0.5        50.6
3       1.0        68.5
4       1.0        80.7
5       1.5        89.0
6       1.5        99.6
# Create a scatter plot of these data in R
ggplot(salesTrain, aes(x = daysTrain, y = performance)) +
  geom_point() +
  labs(x = "Days", y = "Performance", title = "Sales Training Example scatter plot") +
  theme_bw()

  • Do you see a linear relationship?
    • Clearly the regression relation appears to be curvilinear, so the simple linear regression model does not seem to be appropriate.
  • What transformation might be appropriate for these data?
    • We shall consider a transformation on X. We shall consider initially the square root transformation (or go with the log 10 transformation of X)
   daysTrain performance daysTrainSqrt
1        0.5        42.5     0.7071068
2        0.5        50.6     0.7071068
3        1.0        68.5     1.0000000
4        1.0        80.7     1.0000000
5        1.5        89.0     1.2247449
6        1.5        99.6     1.2247449
7        2.0       105.3     1.4142136
8        2.0       111.8     1.4142136
9        2.5       112.3     1.5811388
10       2.5       125.7     1.5811388


Call:
lm(formula = performance ~ daysTrainSqrt, data = salesTrainTrans)

Coefficients:
  (Intercept)  daysTrainSqrt  
       -10.33          83.45  

[1] 0.9789286
[1] 10
  • Notice the correlation between residuals and expected residuals is 0.9789286.

    • The critical value when alpha = 0.01, is 0.879. (read from the table in the notes)
    • The correlation > critical value. Therefore, we conclude that the distribution of the error terms does not depart substantially from a normal distribution.
  • Comment on lack of fit or of strong unequal error variances and normality

    • By looking at the residual plot there is no evidence of lack of fit or of strongly unequal error variances.
    • By looking at the NPP and above test, No strong indications of substantial departures from normality.

Example on page 132: Plasma Levels Example

library(ggplot2)
library(dplyr)
plasmaData <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA08.txt", sep ="" , header = FALSE)

plasmaData
   V1    V2     V3
1   0 13.44 1.1284
2   0 12.84 1.1086
3   0 11.91 1.0759
4   0 20.09 1.3030
5   0 15.60 1.1931
6   1 10.11 1.0048
7   1 11.38 1.0561
8   1 10.28 1.0120
9   1  8.96 0.9523
10  1  8.59 0.9340
11  2  9.83 0.9926
12  2  9.00 0.9542
13  2  8.65 0.9370
14  2  7.85 0.8949
15  2  8.88 0.9484
16  3  7.94 0.8998
17  3  6.01 0.7789
18  3  5.14 0.7110
19  3  6.90 0.8388
20  3  6.77 0.8306
21  4  4.86 0.6866
22  4  5.10 0.7076
23  4  5.67 0.7536
24  4  5.75 0.7597
25  4  6.23 0.7945
plasmaData <- plasmaData[,-3] # Removed the last column of the data for this example. I only took X and Y columns.
head(plasmaData)
  V1    V2
1  0 13.44
2  0 12.84
3  0 11.91
4  0 20.09
5  0 15.60
6  1 10.11
#Rename columns
colnames(plasmaData) <- c("age", "plasmaLevel") 
head(plasmaData)
  age plasmaLevel
1   0       13.44
2   0       12.84
3   0       11.91
4   0       20.09
5   0       15.60
6   1       10.11
# Create a scatter plot of these data in R
ggplot(plasmaData, aes(x = age, y = plasmaLevel)) +
  geom_point() +
  labs(x = "Age", y = "Plasma Level", title = "Plasma Levels Example scatter plot") +
  theme_bw()

  • Do you see a linear relationship?
    • Clearly the regression relation appears to be curvilinear, as well as the greater variability for younger children than for older ones. so the simple linear regression model does not seem to be appropriate.
  • What transformation might be appropriate for these data?
    • We shall consider a transformation on Y. We shall consider initially the log(Y) transformation
library(dplyr)
# Find the transformed Y values
    plasmaDataTrans <- plasmaData %>%
      mutate(plasmaLevelLog  = log10(plasmaLevel)) # here I am creating a new column of data with logY values
    
    plasmaDataTrans
   age plasmaLevel plasmaLevelLog
1    0       13.44      1.1283993
2    0       12.84      1.1085650
3    0       11.91      1.0759118
4    0       20.09      1.3029799
5    0       15.60      1.1931246
6    1       10.11      1.0047512
7    1       11.38      1.0561423
8    1       10.28      1.0119931
9    1        8.96      0.9523080
10   1        8.59      0.9339932
11   2        9.83      0.9925535
12   2        9.00      0.9542425
13   2        8.65      0.9370161
14   2        7.85      0.8948697
15   2        8.88      0.9484130
16   3        7.94      0.8998205
17   3        6.01      0.7788745
18   3        5.14      0.7109631
19   3        6.90      0.8388491
20   3        6.77      0.8305887
21   4        4.86      0.6866363
22   4        5.10      0.7075702
23   4        5.67      0.7535831
24   4        5.75      0.7596678
25   4        6.23      0.7944880
# Create a new scatter plot of using transformed data in R
  ggplot(plasmaDataTrans, aes(x = age, y = plasmaLevelLog)) +
  geom_point(color = "blue") + geom_smooth(method = "lm", se = FALSE)

  labs(x = "Age", y = "log(Y)", title = "Plasma Levels Example scatter plot with transformed Y") +
  theme_bw()
NULL
# Fit the linear regression model using transformed data
  modelPlasmaTrans <- lm(plasmaLevelLog ~ age, data = plasmaDataTrans)
  modelPlasmaTrans

Call:
lm(formula = plasmaLevelLog ~ age, data = plasmaDataTrans)

Coefficients:
(Intercept)          age  
     1.1348      -0.1023  
# Make a NPP  plot. 
  ggplot(modelPlasmaTrans, aes(sample = .resid)) + geom_qq(color = "red") + geom_qq_line()+ theme_bw() #No x or y aesthetics. Just use sample =.resid

  #residual plot
  ggplot(modelPlasmaTrans, aes(x = age, y = .resid)) + geom_point(color = "orchid4", dotsize = .5) + theme_bw() #Use the model as the data and use ".fitted" to get the fitted values for x axis

# Perform a correlation test for normality for this data
  
  res <- resid(modelPlasmaTrans)
  expRes <- sqrt(deviance(modelPlasmaTrans) / df.residual(modelPlasmaTrans)) * 
                                       qnorm((rank(resid(modelPlasmaTrans)) - 0.375) / (nrow(plasmaDataTrans) + 0.25))
  cor(res, expRes)
[1] 0.9807112
  nrow(plasmaDataTrans)
[1] 25
  • Notice the correlation between residuals and expected residuals is .9807.

  • The critical value when alpha = 0.01, is .939.

  • Since the correlation > critical value. Therefore, we conclude that the distribution of the error terms does not depart substantially from a normal distribution.

  • Comment on lack of fit or of strong unequal error variances and normality

    • By looking at the residual plot there is no evidence of lack of fit or of strongly unequal error variances.
    • By looking at the NPP and above test, No strong indications of substantial departures from normality are indicated.

Box-Cox transfomations for Plasma Levels Example

The Box-Cox method searches over a range of possible transformations and finds the optimal one. boxcox() produces a plot of the profile log-likelihood values (the definition of this term is beyond the scope of this class) as a function of \(\lambda\)

The plot should have an upside-down U shape, and the optimal \(\lambda\) value is the one at which the log-likelihood is maximized. The object you assign the output to (below, LAMBDA) will contain elements x and y (note: these are NOT our explanatory and response variables) corresponding to the points being plotted to create the figure. These can be used to extract the optimal \(\lambda\) value using which().

# ALSM no longer works!
#library(ALSM)
#obj <- boxcox.sse(plasmaData$age, plasmaData$plasmaLevel, l=seq(-2,2,.1))
#obj
#obj[which.min(obj$SSE),]


library(MASS)
fit <- lm(plasmaLevel ~ age,data = plasmaData)
LAMBDA <- boxcox(fit, lambda = seq(-2, 2, 0.1))

LAMBDA$x[which(LAMBDA$y == max(LAMBDA$y))]
[1] -0.5050505

The optimal value for this example is -0.5050505, which we can round to -0.5. Thus, our Box-Cox transformation is \(1/\sqrt{Y}\).

lowess Method for Toluca example

toluca <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%201%20Data%20Sets/CH01TA01.txt", sep ="" , header = FALSE)


colnames(toluca) <- c("lotSize", "hours")

tolucaDF <- data.frame(toluca)
head(tolucaDF)
  lotSize hours
1      80   399
2      30   121
3      50   221
4      90   376
5      70   361
6      60   224
library(cowplot)
p1 <- ggplot(data = tolucaDF, aes(x = lotSize, y=hours)) + 
  geom_point(color = "steelblue") + 
  geom_smooth(method = "loess", se = FALSE)+
  theme_bw()

p2 <- ggplot(data = tolucaDF, aes(x = lotSize, y=hours)) + 
  geom_point(color = "steelblue") + 
  geom_smooth(method = "lm", se = TRUE, level = 0.99, color = "red")+
    geom_smooth(method = "loess",  se = FALSE)+
  theme_bw()

plot_grid(p1, p2)