Multiple Regression in \(R\)
We’ll use the Credit data frame from the ISLR package to demonstrate multiple regression with:
- A numerical outcome variable \(y\): in this case, credit card balance.
- Two explanatory variables:
  - A first numerical explanatory variable \(x_1\): in this case, the cardholder's credit limit.
  - A second numerical explanatory variable \(x_2\): in this case, the cardholder's income (in thousands of dollars).
Note: This dataset is not based on actual individuals; it is a simulated dataset used for educational purposes.
Analogous to simple linear regression, where the regression function \(E\{Y\} = \beta_0 + \beta_1X_1\) is a line, the multiple regression function \(E\{Y\} = \beta_0 + \beta_1X_1 + \beta_2X_2\) is a plane.
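The chunk that loads the packages and prints the preview below did not survive rendering; a minimal reconstruction (assuming the ISLR and plotly packages are installed):
library(ISLR)    # provides the simulated Credit data
library(plotly)  # used for the 3D scatterplot below
head(Credit)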
  ID  Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
1  1  14.891  3606    283     2  34        11   Male      No     Yes Caucasian     333
2  2 106.025  6645    483     3  82        15 Female     Yes     Yes     Asian     903
3  3 104.593  7075    514     4  71        11   Male      No      No     Asian     580
4  4 148.924  9504    681     3  36        11 Female      No      No     Asian     964
5  5  55.882  4897    357     2  68        16   Male      No     Yes Caucasian     331
6  6  80.180  8047    569     4  77        10   Male      No      No Caucasian    1151

# draw 3D scatterplot
p <- plot_ly(data = Credit, z = ~Balance, x = ~Income, y = ~Limit, opacity = 0.6, color = Credit$Balance) %>%
  add_markers() 
p

# Fit a multiple linear regression model
# Syntax: model_name <- lm(y ~ x1 + x2 + ... + xn, data = data_frame_name)
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Balance_model
Call:
lm(formula = Balance ~ Limit + Income, data = Credit)
Coefficients:
(Intercept)        Limit       Income  
  -385.1793       0.2643      -7.6633  

range(Credit$Limit)    # range of the first predictor (reconstructed call)
[1]   855 13913
range(Credit$Income)   # range of the second predictor (reconstructed call)
[1]  10.354 186.634
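The two ranges printed above suggest the original chunk went on to draw the fitted regression plane over a grid of Limit and Income values. A sketch of how that can be done with plotly (the grid size and the names limit_grid, income_grid, z_plane are my choices, not the original's):
# evaluate the fitted plane on a grid spanning the observed predictor ranges
limit_grid  <- seq(855, 13913, length.out = 100)
income_grid <- seq(10.354, 186.634, length.out = 100)
beta <- coef(Balance_model)
# rows of z_plane follow Limit (the y axis); columns follow Income (the x axis)
z_plane <- outer(limit_grid, income_grid,
                 function(limit, income) beta[1] + beta[2] * limit + beta[3] * income)
p %>% add_surface(x = ~income_grid, y = ~limit_grid, z = ~z_plane, showscale = FALSE)
Scatter plot matrix for Dwaine Studios example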
library(ggplot2)
library(plotly)
library(GGally)
DwaineData <- read.table("https://people.stat.sc.edu/Hitchcock/studiodata.txt", sep ="" , header = FALSE)
#DwaineData <- read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%206%20Data%20Sets/CH06FI05.txt", sep ="" , header = FALSE)
#DwaineData <- read.table("https://people.stat.sc.edu/Hitchcock/studiodata.txt", sep ="" , header = FALSE)
DwaineData <- data.frame(DwaineData)
colnames(DwaineData) <- c("TARGTPOP", "DISPOINC", "SALES")
head(DwaineData)
##   TARGTPOP DISPOINC SALES
## 1     68.5     16.7 174.4
## 2     45.2     16.8 164.4
## 3     91.3     18.2 244.2
## 4     47.8     16.3 154.6
## 5     46.9     17.3 181.6
## 6     66.1     18.2 207.5
dim(DwaineData)   # 21 cases, 3 variables (reconstructed call)
## [1] 21  3

#pairs(DwaineData) # Default scatterplot matrix; simple to make
p <- ggpairs(DwaineData, title="Scatterplot matrix")
ggplotly(p)
## Warning: Can only have one: highlight
## Warning: Can only have one: highlight

Correlation matrix for Dwaine Studios example
cor(DwaineData)   # (reconstructed call)
##           TARGTPOP  DISPOINC     SALES
## TARGTPOP 1.0000000 0.7812993 0.9445543
## DISPOINC 0.7812993 1.0000000 0.8358025
## SALES    0.9445543 0.8358025 1.0000000

Correlation Test for Normality
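This section's code was lost in rendering; a minimal sketch of the correlation test for normality (object names are my choices), which correlates the residuals with their expected values under normality and compares the result with the critical values in Kutner's Table B.6:
dwaine_fit <- lm(SALES ~ TARGTPOP + DISPOINC, data = DwaineData)
res <- resid(dwaine_fit)
exp_scores <- qqnorm(res, plot.it = FALSE)$x   # expected normal scores
cor(res, exp_scores)                           # compare with the Table B.6 critical value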
Breusch-Pagan Test for Constancy of Error Variance
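Again only the heading survived; a sketch using the lmtest package, assuming the dwaine_fit object from the sketch above (studentize = FALSE gives the chi-square version of the statistic used in Kutner):
library(lmtest)
bptest(dwaine_fit, studentize = FALSE)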
F test for lack of fit
# F test for lack of fit
#Reduced <- lm(SALES ~ TARGTPOP + DISPOINC, data = DwaineData) # This is our usual model
#summary(Reduced)
#Full <-  lm(SALES ~ factor(TARGTPOP)*factor(DISPOINC), data = DwaineData) # This is the full model
#summary(Full)
#anova(Reduced, Full)

# Problem 5 from exercises
p5 <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%206%20Data%20Sets/CH06PR05.txt", sep ="" , header = FALSE)
p5Data <- data.frame(p5)
colnames(p5Data) <- c("y", "x1", "x2")
(p5Data)
     y x1 x2
1   64  4  2
2   73  4  4
3   61  4  2
4   76  4  4
5   72  6  2
6   80  6  4
7   71  6  2
8   83  6  4
9   83  8  2
10  89  8  4
11  86  8  2
12  93  8  4
13  88 10  2
14  95 10  4
15  94 10  2
16 100 10  4
dim(p5Data)   # (reconstructed call)
[1] 16  3
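The summary below appears without the chunk that produced it; a reconstruction, reusing the Reduced/Full naming from the commented Dwaine code earlier:
Reduced <- lm(y ~ x1 + x2, data = p5Data)   # the usual first-order model
summary(Reduced)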
Call:
lm(formula = y ~ x1 + x2, data = p5Data)
Residuals:
   Min     1Q Median     3Q    Max 
-4.400 -1.762  0.025  1.587  4.200 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
x1            4.4250     0.3011  14.695 1.78e-09 ***
x2            4.3750     0.6733   6.498 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.693 on 13 degrees of freedom
Multiple R-squared:  0.9521,    Adjusted R-squared:  0.9447 
F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09
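The full model for the lack-of-fit test treats each observed (x1, x2) combination as its own cell; the call behind the next summary was also lost, and presumably was:
Full <- lm(y ~ factor(x1) * factor(x2), data = p5Data)   # cell-means (full) model
summary(Full)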
Call:
lm(formula = y ~ factor(x1) * factor(x2), data = p5Data)
Residuals:
   Min     1Q Median     3Q    Max 
  -3.0   -1.5    0.0    1.5    3.0 
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                62.500      1.887  33.113 7.55e-10 ***
factor(x1)6                 9.000      2.669   3.372  0.00976 ** 
factor(x1)8                22.000      2.669   8.242 3.52e-05 ***
factor(x1)10               28.500      2.669  10.677 5.19e-06 ***
factor(x2)4                12.000      2.669   4.496  0.00201 ** 
factor(x1)6:factor(x2)4    -2.000      3.775  -0.530  0.61063    
factor(x1)8:factor(x2)4    -5.500      3.775  -1.457  0.18322    
factor(x1)10:factor(x2)4   -5.500      3.775  -1.457  0.18322    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.669 on 8 degrees of freedom
Multiple R-squared:  0.971, Adjusted R-squared:  0.9457 
F-statistic:  38.3 on 7 and 8 DF,  p-value: 1.56e-05

Testing the hypothesis:
\(H_0: \mbox{the reduced model is adequate} \quad H_a: \mbox{the full model is needed}\)
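The table below presumably comes from comparing the two fits (reconstructed call):
anova(Reduced, Full)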
Analysis of Variance Table
Model 1: y ~ x1 + x2
Model 2: y ~ factor(x1) * factor(x2)
  Res.Df  RSS Df Sum of Sq     F Pr(>F)
1     13 94.3                          
2      8 57.0  5      37.3 1.047  0.453
Since the p-value (0.453) is large, we fail to reject \(H_0\): there is no evidence of lack of fit, so the reduced first-order model is adequate.

F Test for Regression Relation, Problem 6.6 in the exercises
Here we test whether there is a regression relation at all, \(H_0: \beta_1 = \beta_2 = 0\) versus \(H_a:\) not all \(\beta_k\) \((k = 1, 2)\) are zero, by building \(F^* = MSR/MSE\) from the matrix formulation.
library(Matrix)
p5 <- read.table("http://www.cnachtsheim-text.csom.umn.edu/Kutner/Chapter%20%206%20Data%20Sets/CH06PR05.txt", sep ="" , header = FALSE)
p5Data <- data.frame(p5)
colnames(p5Data) <- c("y", "x1", "x2")
#dim(p5Data)
######
n <- nrow(p5Data)
Y <- p5Data$y
Y <- as.matrix(Y) 
Y                       # This is your Y
      [,1]
 [1,]   64
 [2,]   73
 [3,]   61
 [4,]   76
 [5,]   72
 [6,]   80
 [7,]   71
 [8,]   83
 [9,]   83
[10,]   89
[11,]   86
[12,]   93
[13,]   88
[14,]   95
[15,]   94
[16,]  100
X <- as.matrix(p5Data[, c("x1", "x2")])   # predictor columns (reconstructed call)
X
      x1 x2
 [1,]  4  2
 [2,]  4  4
 [3,]  4  2
 [4,]  4  4
 [5,]  6  2
 [6,]  6  4
 [7,]  6  2
 [8,]  6  4
 [9,]  8  2
[10,]  8  4
[11,]  8  2
[12,]  8  4
[13,] 10  2
[14,] 10  4
[15,] 10  2
[16,] 10  4
X <- cbind(1, X)   # prepend a column of ones for the intercept (reconstructed call)
X
        x1 x2
 [1,] 1  4  2
 [2,] 1  4  4
 [3,] 1  4  2
 [4,] 1  4  4
 [5,] 1  6  2
 [6,] 1  6  4
 [7,] 1  6  2
 [8,] 1  6  4
 [9,] 1  8  2
[10,] 1  8  4
[11,] 1  8  2
[12,] 1  8  4
[13,] 1 10  2
[14,] 1 10  4
[15,] 1 10  2
[16,] 1 10  4
       [,1]
[1,] 108896
        [,1]
[1,] 1710864
     [,1]
[1,] 1967
     [,1]
[1,] 94.3
       [,1]
[1,] 1872.7
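The five outputs above are, in order, \(Y'Y\), \(Y'JY = (\sum Y_i)^2\), SSTO, SSE, and SSR; the chunk that computed them was lost. A reconstruction using the usual matrix formulas, which also defines the quantities the F statistic below needs (J is the \(n \times n\) matrix of ones; b holds the least-squares coefficients):
J <- matrix(1, n, n)                          # n x n matrix of ones
b <- solve(t(X) %*% X) %*% t(X) %*% Y         # least-squares coefficients
t(Y) %*% Y                                    # Y'Y = 108896
t(Y) %*% J %*% Y                              # Y'JY = 1710864
SSTO <- t(Y) %*% Y - (t(Y) %*% J %*% Y) / n   # total sum of squares: 1967
SSE <- t(Y) %*% Y - t(b) %*% t(X) %*% Y       # error sum of squares: 94.3
SSR <- SSTO - SSE                             # regression sum of squares: 1872.7

# Calculating F statistic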
p <- 3 # number of regression parameters (intercept included); there are p - 1 = 2 predictors
MSR <- SSR/(p-1)
MSE <- SSE/(n-p)
Fstat <- MSR/MSE
Fstat
         [,1]
[1,] 129.0832
pf(Fstat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)   # p-value (reconstructed call)
             [,1]
[1,] 2.658261e-09

Verifying the above results
summary(Reduced)   # the same F statistic and p-value via lm (reconstructed call)
Call:
lm(formula = y ~ x1 + x2, data = p5Data)
Residuals:
   Min     1Q Median     3Q    Max 
-4.400 -1.762  0.025  1.587  4.200 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
x1            4.4250     0.3011  14.695 1.78e-09 ***
x2            4.3750     0.6733   6.498 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.693 on 13 degrees of freedom
Multiple R-squared:  0.9521,    Adjusted R-squared:  0.9447 
F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

Test for individual \(\beta\)’s
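Each reported \(t\) statistic is \(t^* = b_k/s\{b_k\}\) and tests \(H_0: \beta_k = 0\) against \(H_a: \beta_k \neq 0\) given the other predictor in the model; for \(x_1\), \(t^* = 4.4250/0.3011 = 14.695\), which far exceeds \(t(0.975;\ 13) = 2.160\). The summary is shown again for reference (reconstructed call):
summary(Reduced)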
Call:
lm(formula = y ~ x1 + x2, data = p5Data)
Residuals:
   Min     1Q Median     3Q    Max 
-4.400 -1.762  0.025  1.587  4.200 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
x1            4.4250     0.3011  14.695 1.78e-09 ***
x2            4.3750     0.6733   6.498 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.693 on 13 degrees of freedom
Multiple R-squared:  0.9521,    Adjusted R-squared:  0.9447 
F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

confint(Reduced)   # 95% confidence intervals for the coefficients (reconstructed call)
                2.5 %    97.5 %
(Intercept) 31.177312 44.122688
x1           3.774470  5.075530
x2           2.920372  5.829628

Interval Estimation of \(\beta_k\)
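These intervals have the usual form \(b_k \pm t(1-\alpha/2;\ n-p)\, s\{b_k\}\); for example, for \(\beta_1\), \(4.4250 \pm 2.160 \times 0.3011\), i.e. approximately \((3.775,\ 5.075)\), matching the output below (reconstructed call):
confint(Reduced)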
                2.5 %    97.5 %
(Intercept) 31.177312 44.122688
x1           3.774470  5.075530
x2           2.920372  5.829628

Prediction of New Observation \(Y_{h(new)}\)
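For reference, the predicted value is \(\hat{Y}_h = X_h'b\) and the prediction interval is \(\hat{Y}_h \pm t(1-\alpha/2;\ n-p)\, s\{\mathrm{pred}\}\), where \(s^2\{\mathrm{pred}\} = MSE\,(1 + X_h'(X'X)^{-1}X_h)\).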
p5model <- lm(y~x1+x2, data = p5Data)
#prediction interval for the new observation x1 = 2, x2 = 4
xnew <- t(c(x1 = 2, x2 = 4))
xnew <- data.frame(xnew)
xnew
  x1 x2
1  2  4
predict(p5model, newdata = xnew, interval = "prediction")   # (reconstructed call)
  fit      lwr      upr
1  64 57.02385 70.97615

Prediction of Mean of m New Observations at \(X_h\)
p5model <- lm(y~x1+x2, data = p5Data)
# interval for the mean response at the new point x1 = 2, x2 = 4
xnew <- matrix(c(2,4), nrow = 1, byrow = TRUE)
xnew <- data.frame(xnew)
xnew
  X1 X2
1  2  4
colnames(xnew) <- c("x1", "x2")   # rename to match the model's predictors (reconstructed call)
xnew
  x1 x2
1  2  4
predict(p5model, newdata = xnew, interval = "confidence")   # (reconstructed call)
  fit      lwr      upr
1  64 60.15142 67.84858
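Note that predict() offers only "confidence" and "prediction" intervals, so the interval above is the confidence interval for the mean response at \(X_h\). For the mean of \(m\) new observations, Kutner's formula \(s^2\{\mathrm{predmean}\} = MSE/m + s^2\{\hat{Y}_h\}\) can be coded directly; a sketch (the value of m is my choice):
m <- 3
pr <- predict(p5model, newdata = xnew, se.fit = TRUE)
MSE <- pr$residual.scale^2                                # estimate of the error variance
se_predmean <- sqrt(MSE / m + pr$se.fit^2)                # s{predmean}
pr$fit + c(-1, 1) * qt(0.975, df = pr$df) * se_predmean   # 95% limits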