Diagnostics and Remedial Measures in \(R\)
Read data from a URL
toluca <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%201%20Data%20Sets/CH01TA01.txt", sep ="" , header = FALSE)
#Look at the first 6 entries
head(toluca)
V1 V2
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
Rename columns
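The renaming chunk itself is not echoed above; a minimal sketch of it, matching the colnames() call used later in these notes:

```r
# Rename the default columns V1, V2 to descriptive names
colnames(toluca) <- c("lotSize", "hours")
head(toluca)
```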
lotSize hours
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
Convert your data into a data frame
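The conversion chunk is likewise not echoed; presumably it was:

```r
# Store the renamed data as a data frame named tolucaDF, used throughout these notes
tolucaDF <- data.frame(toluca)
head(tolucaDF)
```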
lotSize hours
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
Diagnostic Plots for the Predictor Variable - Toluca Company Example
Dot plot
The dot plot for the lot sizes in the Toluca Company example.
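The plotting code is not echoed; one way to draw this dot plot with ggplot2, assuming the tolucaDF data frame from above (the binwidth value is an assumption):

```r
library(ggplot2)
# Dot plot of the predictor; each dot is one production run
ggplot(tolucaDF, aes(x = lotSize)) +
  geom_dotplot(binwidth = 5) +  # binwidth chosen for illustration
  labs(x = "Lot Size", title = "Dot plot of lot sizes") +
  theme_bw()
```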
Sequence plot
The sequence plot for the lot sizes in the Toluca Company example.
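A sketch of a sequence (run-order) plot, plotting lot size against the production run index:

```r
library(ggplot2)
# Sequence plot: predictor values in time (run) order
ggplot(tolucaDF, aes(x = seq_along(lotSize), y = lotSize)) +
  geom_point() +
  geom_line(linetype = "dashed") +
  labs(x = "Run", y = "Lot Size", title = "Sequence plot of lot sizes") +
  theme_bw()
```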
Stem-and-leaf plot
Stem-and-leaf plot of the lot sizes for the Toluca Company example. By displaying the last digits, this plot also shows that all lot sizes in this example were multiples of 10.
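The output below comes from base R's stem(); the call itself is not echoed:

```r
# Stem-and-leaf plot of the predictor
stem(tolucaDF$lotSize)
```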
The decimal point is 1 digit(s) to the right of the |
2 | 0
3 | 000
4 | 00
5 | 000
6 | 0
7 | 000
8 | 000
9 | 0000
10 | 00
11 | 00
12 | 0
Box plot
Box plot of the lot sizes for the Toluca Company example.
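A sketch of the box plot with ggplot2, again assuming tolucaDF from above:

```r
library(ggplot2)
# Box plot of the predictor
ggplot(tolucaDF, aes(y = lotSize)) +
  geom_boxplot() +
  labs(y = "Lot Size", title = "Box plot of lot sizes") +
  theme_bw()
```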
Diagnostics for Residuals (Section 3.3 in the textbook)
Create the LS model
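The fitting code is not echoed, but the model object toluca_LS_model used in the diagnostics below must have been created along these lines:

```r
# Least-squares regression of work hours on lot size
toluca_LS_model <- lm(hours ~ lotSize, data = tolucaDF)
summary(toluca_LS_model)
```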
Plot of residuals against predictor variable
Plot of residuals against fitted values.
Box plot of residuals
Normal probability plot of residuals.
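The four residual plots listed above can be produced along these lines (a sketch; ggplot2 fortifies the lm object so .resid and .fitted are available, and cowplot is only used to arrange the panels):

```r
library(ggplot2)
library(cowplot)
r1 <- ggplot(toluca_LS_model, aes(x = lotSize, y = .resid)) + geom_point() +
  labs(x = "Lot Size", y = "Residual") + theme_bw()      # residuals vs predictor
r2 <- ggplot(toluca_LS_model, aes(x = .fitted, y = .resid)) + geom_point() +
  labs(x = "Fitted value", y = "Residual") + theme_bw()  # residuals vs fitted values
r3 <- ggplot(toluca_LS_model, aes(y = .resid)) + geom_boxplot() +
  labs(y = "Residual") + theme_bw()                      # box plot of residuals
r4 <- ggplot(toluca_LS_model, aes(sample = .resid)) + geom_qq() + geom_qq_line() +
  theme_bw()                                             # normal probability plot
plot_grid(r1, r2, r3, r4)
```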
Nonnormality of Error Terms
\(\mbox{Exp. Value under Normality} = \sqrt{MSE}\left[ z\!\left(\frac{k-0.375}{n+0.25}\right)\right]\)
Note that MSE can be calculated as SSE / DF. Here I am using deviance(toluca_LS_model) to get SSE and df.residual(toluca_LS_model) to get the DF.
# Expected value under normality comes from equation (3.6)
tab <- cbind(
"Residual" = round(resid(toluca_LS_model), 2), # residuals rounded to 2 decimal places
"Rank" = rank(resid(toluca_LS_model)), # this is where we find the k value
"ExpValueunderNormality" = round(sqrt(deviance(toluca_LS_model) / df.residual(toluca_LS_model)) *
qnorm((rank(resid(toluca_LS_model)) - 0.375) / (nrow(tolucaDF) + 0.25)), 2))
# deviance() gives the SSE, and dividing by the residual d.f. gives the MSE. qnorm() then
# returns the z value of (k - 0.375)/(n + 0.25), and round() keeps 2 decimal places.
tab
Residual Rank ExpValueunderNormality
1 51.02 22 51.97
2 -48.47 5 -44.10
3 -19.88 10 -14.76
4 -7.68 11 -9.76
5 48.72 21 44.10
6 -52.58 4 -51.97
7 55.21 23 61.48
8 4.02 15 9.76
9 -66.39 2 -74.17
10 -83.88 1 -95.90
11 -45.17 6 -37.25
12 -60.28 3 -61.48
13 5.32 16 14.76
14 -20.77 8 -25.33
15 -20.09 9 -19.93
16 0.61 14 4.85
17 42.53 20 37.25
18 27.12 18 25.33
19 -6.68 12 -4.85
20 -34.09 7 -31.05
21 103.53 25 95.90
22 84.32 24 74.17
23 38.83 19 31.05
24 -5.98 13 0.00
25 10.72 17 19.93
Correlation Test for Normality
res <- resid(toluca_LS_model)
expRes <- sqrt(deviance(toluca_LS_model) / df.residual(toluca_LS_model)) *
qnorm((rank(resid(toluca_LS_model)) - 0.375) / (nrow(tolucaDF) + 0.25))
cor(res, expRes)
[1] 0.9915055
Breusch-Pagan Test for Constancy of Error Variance
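The output below matches lmtest's bptest() with studentize = FALSE (the textbook's version of the test); the call itself is not echoed:

```r
library(lmtest)
# Breusch-Pagan test for constancy of the error variance
bptest(toluca_LS_model, studentize = FALSE)
```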
Breusch-Pagan test
data: toluca_LS_model
BP = 0.82092, df = 1, p-value = 0.3649
Bank example (Section 3.7)
bankdata <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA04.txt", sep ="" , header = FALSE)
#Look at the first 6 entries
#head(bankdata)
#Rename columns
colnames(bankdata) <- c("minDeposit", "NoNewAccs") # minDeposit - Size of minimum deposit, NoNewAccs - Number of new accounts
bankdata
minDeposit NoNewAccs
1 125 160
2 100 112
3 200 124
4 75 28
5 150 152
6 175 156
7 75 42
8 175 124
9 125 150
10 200 104
11 100 136
Call:
lm(formula = NoNewAccs ~ minDeposit, data = bankdata)
Coefficients:
(Intercept) minDeposit
50.7225 0.4867
#Scatter Plot
library(ggplot2)
library(latex2exp)
ggplot(bankdata, aes(x = minDeposit, y = NoNewAccs)) +
geom_point() +
labs(x = "Size of minimum deposit", y = "Number of new accounts", title = "Bank example scatter plot") +
geom_smooth(method = "lm", se = FALSE) +
  annotate("text", x = 120, y = 85, label = "yhat = 50.7 + 0.49x", colour = "steelblue", size = 4) + # annotate() avoids the invalid text = element_text() argument to geom_text()
theme_bw()
F test for lack of fit
# F test for lack of fit
Reduced <- lm(NoNewAccs ~ minDeposit, data = bankdata) # This is our usual model
summary(Reduced)
##
## Call:
## lm(formula = NoNewAccs ~ minDeposit, data = bankdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.23 -34.06 12.61 32.44 48.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.7225 39.3979 1.287 0.23
## minDeposit 0.4867 0.2747 1.772 0.11
##
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared: 0.2586, Adjusted R-squared: 0.1762
## F-statistic: 3.139 on 1 and 9 DF, p-value: 0.1102
Full <- lm(NoNewAccs ~ 0 + as.factor(minDeposit), data = bankdata) # This is the full model
summary(Full)
##
## Call:
## lm(formula = NoNewAccs ~ 0 + as.factor(minDeposit), data = bankdata)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9 10 11
## 5 -12 10 -7 0 16 7 -16 -5 -10 12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## as.factor(minDeposit)75 35.00 10.71 3.267 0.022282 *
## as.factor(minDeposit)100 124.00 10.71 11.573 8.45e-05 ***
## as.factor(minDeposit)125 155.00 10.71 14.466 2.85e-05 ***
## as.factor(minDeposit)150 152.00 15.15 10.031 0.000168 ***
## as.factor(minDeposit)175 140.00 10.71 13.066 4.68e-05 ***
## as.factor(minDeposit)200 114.00 10.71 10.640 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.15 on 5 degrees of freedom
## Multiple R-squared: 0.9933, Adjusted R-squared: 0.9852
## F-statistic: 123.1 on 6 and 5 DF, p-value: 2.893e-05
Testing the hypothesis:
\(H_0: \mbox{The reduced (simple linear regression) model is adequate}\) \(\quad\) \(H_a: \mbox{The reduced model is not adequate (lack of fit)}\)
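The ANOVA table below comes from comparing the two nested fits; presumably the call was:

```r
# General linear test: compare the reduced and full models (lack-of-fit F test)
anova(Reduced, Full)
```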
## Analysis of Variance Table
##
## Model 1: NoNewAccs ~ minDeposit
## Model 2: NoNewAccs ~ 0 + as.factor(minDeposit)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 9 14742
## 2 5 1148 4 13594 14.801 0.005594 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example on page 129: Sales training example
salesTrain <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA07.txt", sep ="" , header = FALSE)
#head(salesTrain)
#Rename columns
colnames(salesTrain) <- c("daysTrain", "performance")
head(salesTrain)
daysTrain performance
1 0.5 42.5
2 0.5 50.6
3 1.0 68.5
4 1.0 80.7
5 1.5 89.0
6 1.5 99.6
# Create a scatter plot of these data in R
ggplot(salesTrain, aes(x = daysTrain, y = performance)) +
geom_point() +
labs(x = "Days", y = "Performance", title = "Sales Training Example scatter plot") +
theme_bw()
- Do you see a linear relationship?
- Clearly the regression relation appears to be curvilinear, so the simple linear regression model does not seem to be appropriate.
- What transformation might be appropriate for these data?
- We shall consider a transformation on X, initially the square root transformation (a log10 transformation of X would also be reasonable).
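The transformed data below were presumably created with a mutate() call like this (the name salesTrainTrans matches the lm() call that follows):

```r
library(dplyr)
# Add the square-root-transformed predictor as a new column
salesTrainTrans <- salesTrain %>%
  mutate(daysTrainSqrt = sqrt(daysTrain))
salesTrainTrans
```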
daysTrain performance daysTrainSqrt
1 0.5 42.5 0.7071068
2 0.5 50.6 0.7071068
3 1.0 68.5 1.0000000
4 1.0 80.7 1.0000000
5 1.5 89.0 1.2247449
6 1.5 99.6 1.2247449
7 2.0 105.3 1.4142136
8 2.0 111.8 1.4142136
9 2.5 112.3 1.5811388
10 2.5 125.7 1.5811388
Call:
lm(formula = performance ~ daysTrainSqrt, data = salesTrainTrans)
Coefficients:
(Intercept) daysTrainSqrt
-10.33 83.45
[1] 0.9789286
[1] 10
Notice the correlation between residuals and expected residuals is 0.9789286.
- The critical value when alpha = 0.01 is 0.879 (read from the table in the notes).
- Since the correlation exceeds the critical value, we conclude that the distribution of the error terms does not depart substantially from a normal distribution.
Comment on lack of fit or of strong unequal error variances and normality
- The residual plot shows no evidence of lack of fit or of strongly unequal error variances.
- The NPP and the test above give no strong indication of substantial departures from normality.
Example on page 132: Plasma Levels Example
library(ggplot2)
library(dplyr)
plasmaData <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%203%20Data%20Sets/CH03TA08.txt", sep ="" , header = FALSE)
plasmaData
V1 V2 V3
1 0 13.44 1.1284
2 0 12.84 1.1086
3 0 11.91 1.0759
4 0 20.09 1.3030
5 0 15.60 1.1931
6 1 10.11 1.0048
7 1 11.38 1.0561
8 1 10.28 1.0120
9 1 8.96 0.9523
10 1 8.59 0.9340
11 2 9.83 0.9926
12 2 9.00 0.9542
13 2 8.65 0.9370
14 2 7.85 0.8949
15 2 8.88 0.9484
16 3 7.94 0.8998
17 3 6.01 0.7789
18 3 5.14 0.7110
19 3 6.90 0.8388
20 3 6.77 0.8306
21 4 4.86 0.6866
22 4 5.10 0.7076
23 4 5.67 0.7536
24 4 5.75 0.7597
25 4 6.23 0.7945
plasmaData <- plasmaData[,-3] # Removed the last column of the data for this example. I only took X and Y columns.
head(plasmaData)
V1 V2
1 0 13.44
2 0 12.84
3 0 11.91
4 0 20.09
5 0 15.60
6 1 10.11
age plasmaLevel
1 0 13.44
2 0 12.84
3 0 11.91
4 0 20.09
5 0 15.60
6 1 10.11
# Create a scatter plot of these data in R
ggplot(plasmaData, aes(x = age, y = plasmaLevel)) +
geom_point() +
labs(x = "Age", y = "Plasma Level", title = "Plasma Levels Example scatter plot") +
theme_bw()
- Do you see a linear relationship?
- The regression relation appears to be curvilinear, and there is greater variability for younger children than for older ones, so the simple linear regression model does not seem to be appropriate.
- What transformation might be appropriate for these data?
- We shall consider a transformation on Y, initially the log10(Y) transformation.
library(dplyr)
# Find the transformed Y values
plasmaDataTrans <- plasmaData %>%
mutate(plasmaLevelLog = log10(plasmaLevel)) # here I am creating a new column of data with logY values
plasmaDataTrans
age plasmaLevel plasmaLevelLog
1 0 13.44 1.1283993
2 0 12.84 1.1085650
3 0 11.91 1.0759118
4 0 20.09 1.3029799
5 0 15.60 1.1931246
6 1 10.11 1.0047512
7 1 11.38 1.0561423
8 1 10.28 1.0119931
9 1 8.96 0.9523080
10 1 8.59 0.9339932
11 2 9.83 0.9925535
12 2 9.00 0.9542425
13 2 8.65 0.9370161
14 2 7.85 0.8948697
15 2 8.88 0.9484130
16 3 7.94 0.8998205
17 3 6.01 0.7788745
18 3 5.14 0.7109631
19 3 6.90 0.8388491
20 3 6.77 0.8305887
21 4 4.86 0.6866363
22 4 5.10 0.7075702
23 4 5.67 0.7535831
24 4 5.75 0.7596678
25 4 6.23 0.7944880
# Create a new scatter plot of using transformed data in R
ggplot(plasmaDataTrans, aes(x = age, y = plasmaLevelLog)) +
  geom_point(color = "blue") + geom_smooth(method = "lm", se = FALSE) + # note the trailing + so labs() and theme_bw() are added to the plot
  labs(x = "Age", y = "log(Y)", title = "Plasma Levels Example scatter plot with transformed Y") +
  theme_bw()
# Fit the linear regression model using transformed data
modelPlasmaTrans <- lm(plasmaLevelLog ~ age, data = plasmaDataTrans)
modelPlasmaTrans
Call:
lm(formula = plasmaLevelLog ~ age, data = plasmaDataTrans)
Coefficients:
(Intercept) age
1.1348 -0.1023
# Make a NPP plot.
ggplot(modelPlasmaTrans, aes(sample = .resid)) + geom_qq(color = "red") + geom_qq_line() + theme_bw() # No x or y aesthetics; just use sample = .resid
#residual plot
ggplot(modelPlasmaTrans, aes(x = age, y = .resid)) + geom_point(color = "orchid4", size = .5) + theme_bw() # residuals against the predictor age (geom_point uses size, not dotsize; use x = .fitted for a residuals-vs-fitted plot)
# Perform a correlation test for normality for this data
res <- resid(modelPlasmaTrans)
expRes <- sqrt(deviance(modelPlasmaTrans) / df.residual(modelPlasmaTrans)) *
qnorm((rank(resid(modelPlasmaTrans)) - 0.375) / (nrow(plasmaDataTrans) + 0.25))
cor(res, expRes)
[1] 0.9807112
[1] 25
Notice the correlation between residuals and expected residuals is 0.9807.
The critical value when alpha = 0.01 is 0.939.
Since the correlation exceeds the critical value, we conclude that the distribution of the error terms does not depart substantially from a normal distribution.
Comment on lack of fit or of strong unequal error variances and normality
- The residual plot shows no evidence of lack of fit or of strongly unequal error variances.
- The NPP and the test above give no strong indication of substantial departures from normality.
Box-Cox transformations for Plasma Levels Example
The Box-Cox method searches over a range of possible transformations and finds the optimal one. boxcox() produces a plot of the profile log-likelihood values (the definition of this term is beyond the scope of this class) as a function of \(\lambda\). The plot should have an upside-down U shape, and the optimal \(\lambda\) value is the one at which the log-likelihood is maximized. The object you assign the output to (below, LAMBDA) will contain elements x and y (note: these are NOT our explanatory and response variables) corresponding to the points being plotted to create the figure. These can be used to extract the optimal \(\lambda\) value using which().
# ALSM no longer works!
#library(ALSM)
#obj <- boxcox.sse(plasmaData$age, plasmaData$plasmaLevel, l=seq(-2,2,.1))
#obj
#obj[which.min(obj$SSE),]
library(MASS)
fit <- lm(plasmaLevel ~ age,data = plasmaData)
LAMBDA <- boxcox(fit, lambda = seq(-2, 2, 0.1))
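The optimal \(\lambda\) printed below can be extracted from the boxcox() output like this (which.max() is a compact form of the which() idea described above):

```r
# x holds the candidate lambda values, y the profile log-likelihoods;
# pick the lambda at which the log-likelihood is maximized
LAMBDA$x[which.max(LAMBDA$y)]
```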
[1] -0.5050505
The optimal value for this example is -0.5050505, which we can round to -0.5. Thus, our Box-Cox transformation is \(1/\sqrt{Y}\).
Lowess Method for the Toluca Example
toluca <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%201%20Data%20Sets/CH01TA01.txt", sep ="" , header = FALSE)
colnames(toluca) <- c("lotSize", "hours")
tolucaDF <- data.frame(toluca)
head(tolucaDF)
lotSize hours
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
library(cowplot)
p1 <- ggplot(data = tolucaDF, aes(x = lotSize, y=hours)) +
geom_point(color = "steelblue") +
geom_smooth(method = "loess", se = FALSE)+
theme_bw()
p2 <- ggplot(data = tolucaDF, aes(x = lotSize, y=hours)) +
geom_point(color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, level = 0.99, color = "red")+
geom_smooth(method = "loess", se = FALSE)+
theme_bw()
plot_grid(p1, p2)