Consider the mice example in Section 3.1. Let \(\mu_d\) denote the true mean time that a randomly selected mouse that received the drug takes to run through the maze, and let \(\mu_c\) denote the true mean time for a control mouse. Then,
\[H_0:\mu_d - \mu_c = 0.\] That is, on average, there is no difference in the mean times between mice who receive the drug and mice in the control group.
The alternative hypothesis is
\[H_A: \mu_d - \mu_c > 0.\]
That is, on average, mice that receive the drug are slower (take larger times) than mice in the control group.
library(dplyr)
DF <- data.frame(time = c(30, 25, 20, 18, 21, 22), treatment = rep(c("Drug", "Control"), each = 3))
DF <- as_tibble(DF)
DF
# A tibble: 6 x 2
time treatment
<dbl> <fct>
1 30 Drug
2 25 Drug
3 20 Drug
4 18 Control
5 21 Control
6 22 Control
# Difference between the average times...
ANS <- DF %>%
group_by(treatment) %>%
summarize(Average = mean(time))
ANS
# A tibble: 2 x 2
treatment Average
* <fct> <dbl>
1 Control 20.3
2 Drug 25
MD <- ANS[2, 2] - ANS[1, 2]
MD
Average
1 4.666667
observed <- MD$Average
observed
[1] 4.666667
Problem: This difference could be due to random variability rather than a real drug effect.
How do we address this problem: Estimate how easily pure random chance would produce a difference this large.
Definition 1.2 The P-value is the probability that chance alone would produce a test statistic as extreme as the observed statistic if the null hypothesis were true.
For example, if large values of the test statistic support the alternative hypothesis, the P-value is \(P(T \geq t)\).
ANS <- t(combn(6, 3)) # all 20 ways to choose which 3 of the 6 times go to the drug group
ANS
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 5
[4,] 1 2 6
[5,] 1 3 4
[6,] 1 3 5
[7,] 1 3 6
[8,] 1 4 5
[9,] 1 4 6
[10,] 1 5 6
[11,] 2 3 4
[12,] 2 3 5
[13,] 2 3 6
[14,] 2 4 5
[15,] 2 4 6
[16,] 2 5 6
[17,] 3 4 5
[18,] 3 4 6
[19,] 3 5 6
[20,] 4 5 6
dim(ANS)
[1] 20 3
DF$time
[1] 30 25 20 18 21 22
DR <- matrix(NA, nrow = 20, ncol = 3) # times assigned to the "Drug" group in each arrangement
CO <- matrix(NA, nrow = 20, ncol = 3) # times assigned to the "Control" group in each arrangement
for(i in 1:20){
DR[i,] <- DF$time[ANS[i,]]
CO[i,] <- DF$time[-ANS[i, ]]
}
VALUES <- as.data.frame(cbind(DR, CO))
names(VALUES) <- paste(rep(c("D", "C"), each = 3), 1:3, sep = "")
VALUES$DM <- apply(VALUES[, 1:3], 1, mean)
VALUES$CM <- apply(VALUES[, 4:6], 1, mean)
VALUES$DiffMean <- VALUES$DM - VALUES$CM
DT::datatable(round(VALUES, 2))
pvalue <- sum(VALUES$DiffMean >= observed)/choose(6,3)
pvalue
[1] 0.15
MASS::fractions(pvalue)
[1] 3/20
The theoretical p-value is \(0.15\).
The p-value (0.15) is not small enough to reject \(H_0\). Therefore, we conclude that there is not enough evidence of a difference in mean times to complete the maze with or without the drug.
This problem has a total of \(\binom{6}{3} = 20\) arrangements of six items into two sets, each with three observations. While it is possible to enumerate all possible outcomes and compute the theoretical answer, once the group sizes increase this becomes computationally infeasible: consider \(\binom{30}{15} = 155{,}117{,}520\). So, what we typically do is take a “large” number of permutations (generally 10,000) and let this simulation approximate the true distribution.
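As a quick check, these counts can be computed directly with base R's choose():
choose(6, 3)
[1] 20
choose(30, 15)
[1] 155117520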
We follow this algorithm:
Two-Sample Permutation Test:
Pool the \(n_1 + n_2\) (total of \(n\)) values.
**repeat**
Draw a resample of size \(n_1\) without replacement.
Use the remaining \(n_2\) observations for the other sample.
Calculate the difference in means, or another statistic that compares the two samples.
**until** we have enough samples.
Calculate the p-value as the fraction of times the resampled statistics equal or exceed the original statistic.
(Multiply the p-value by 2 for a two-sided test.)
set.seed(123)
sims <- 10^4 - 1 # Number of simulations (iterations)
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 6, size = 3, replace = FALSE)
ans[i] <- mean(DF$time[index]) - mean(DF$time[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1) # Here observed = 4.666667 is the observed difference between the means.
pvalue
[1] 0.1565
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "pink") +
theme_bw() +
labs(x = expression(bar(X)[d]-bar(X)[c])) +
geom_vline(xintercept=observed, linetype="dotted")
Figure 2.1: Permutation Distribution of \(\bar{X}_d - \bar{X}_c\)
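The same resampling loop reappears in every example that follows, so it can be handy to wrap it in a small helper. The sketch below is my own (the name perm_test and its arguments are not part of the text); it assumes that large values of the statistic support the alternative, as in the one-sided tests here.
perm_test <- function(x, n1, observed, stat = mean, sims = 10^4 - 1, ...) {
  # x: pooled values, n1: size of the first group, observed: observed difference
  # stat: summary statistic compared between the two groups (mean by default)
  ans <- numeric(sims)
  for (i in 1:sims) {
    index <- sample.int(n = length(x), size = n1, replace = FALSE)
    ans[i] <- stat(x[index], ...) - stat(x[-index], ...)
  }
  (sum(ans >= observed) + 1) / (sims + 1)
}
# For example, the mice test above could be rerun as
# perm_test(DF$time, n1 = 3, observed = observed)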
Example: Beer and hot wings case study in Section 1.9
Do men consume more hot wings than women?
Define the null and alternative hypotheses:
\(H_0: \mu_M - \mu_F = 0\) vs. \(H_a: \mu_M - \mu_F > 0\)
Step 1) Get data
library(resampledata)
data(Beerwings)
head(Beerwings)
ID Hotwings Beer Gender
1 1 4 24 F
2 2 5 0 F
3 3 5 12 F
4 4 6 12 F
5 5 7 12 F
6 6 7 12 F
str(Beerwings)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hotwings: int 4 5 5 6 7 7 7 8 8 8 ...
$ Beer : int 24 0 12 12 12 12 24 24 0 12 ...
$ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 1 2 2 ...
Step 2) Find the observed difference
# Difference between the average number of hot wings
ANS <- Beerwings %>%
group_by(Gender) %>%
summarize(Average = mean(Hotwings), N = n())
ANS
# A tibble: 2 x 3
Gender Average N
* <fct> <dbl> <int>
1 F 9.33 15
2 M 14.5 15
MD <- ANS[2, 2] - ANS[1, 2]
MD
Average
1 5.2
observed <- MD$Average
observed
[1] 5.2
Step 3) Resample and find the p-value
set.seed(123)
sims <- 10^5 - 1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 30, size = 15, replace = FALSE)
ans[i] <- mean(Beerwings$Hotwings[index]) - mean(Beerwings$Hotwings[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.00115
Step 4) Create a plot
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "orchid4") +
theme_bw() +
labs(x = expression(bar(X)[m]-bar(X)[f])) +
geom_vline(xintercept=observed, linetype="dotted")
Figure 2.2: Permutation Distribution of \(\bar{X}_m - \bar{X}_f\)
Step 5) Make the decision: since the p-value (\(0.00115\)) is less than 0.05, we reject \(H_0\). There is strong evidence that men, on average, consume more hot wings than women.
Exercise 3.3:
In a hypothesis test comparing two population means, \(H_0: \mu_1 = \mu_2\) versus \(H_a: \mu_1 > \mu_2\):
Which \(P\)-value, 0.03 or 0.006, provides stronger evidence for the alternative hypothesis?
Which \(P\)-value, 0.095 or 0.04, provides stronger evidence that chance alone might account for the observed result?
Example: Verizon case study.
Let \(\mu_1\) denote the mean repair time for the ILEC customers and \(\mu_2\) the mean repair time for the CLEC customers.
Form the hypotheses: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 < \mu_2\)
Conduct the test.
#Step 1
data("Verizon")
head(Verizon)
Time Group
1 17.50 ILEC
2 2.40 ILEC
3 0.00 ILEC
4 0.65 ILEC
5 22.23 ILEC
6 1.20 ILEC
dim(Verizon)
[1] 1687 2
#Step 2
Time_group <- Verizon %>%
group_by(Group) %>%
summarize(Mean = mean(Time), NumCus = n())
Time_group
# A tibble: 2 x 3
Group Mean NumCus
* <fct> <dbl> <int>
1 CLEC 16.5 23
2 ILEC 8.41 1664
MD <- Time_group[1,2] - Time_group[2,2]
MD
Mean
1 8.09752
observed <- MD$Mean # Extract a single numeric value; MD is a one-by-one data frame rather than a number.
observed
[1] 8.09752
#Step 3
set.seed(200)
sims <- 10^5 - 1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 1687, size = 23, replace = FALSE)
ans[i] <- mean(Verizon$Time[index]) - mean(Verizon$Time[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1) # Under the alternative mu_1 < mu_2, large values of the CLEC minus ILEC difference support H_a, so we count permuted differences >= observed.
pvalue
[1] 0.01862
#Step 4
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "orchid4") +
theme_bw() +
labs(x = expression(bar(X)[CLEC]-bar(X)[ILEC])) +
geom_vline(xintercept=observed, linetype="dotted")
Step 5: Make the decision
Technical conclusion: \(p\)-value < 0.05, so reject \(H_0\).
English conclusion: There is enough evidence to conclude that Verizon takes significantly longer, on average, to complete repairs for CLEC customers than for ILEC customers.
Example: Verizon case study continued. Do the data have a long tail?
ggplot(data = Verizon, aes(x=factor(Group), y = Time)) + geom_boxplot()
ggplot(data = Verizon, aes(x = Time)) + geom_histogram() + facet_wrap(~Group)
The boxplot and histograms show that the repair times are strongly skewed with a long right tail. In situations like this it is better to use a robust measure of center, such as the trimmed mean or the median, instead of the mean. Here I am using the trimmed mean.
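As a quick illustration of what trimming does (the numbers below are made up), mean(x, trim = 0.25) discards the smallest 25% and the largest 25% of the observations before averaging:
x <- c(1, 2, 3, 4, 100) # made-up values with one extreme observation
mean(x)
[1] 22
mean(x, trim = 0.25)
[1] 3
median(x)
[1] 3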
data("Verizon")
head(Verizon)
Time Group
1 17.50 ILEC
2 2.40 ILEC
3 0.00 ILEC
4 0.65 ILEC
5 22.23 ILEC
6 1.20 ILEC
dim(Verizon)
[1] 1687 2
Time_group <- Verizon %>%
group_by(Group) %>%
summarize(Tr.Mean = mean(Time, trim = 0.25), Count = n())
Time_group
# A tibble: 2 x 3
Group Tr.Mean Count
* <fct> <dbl> <int>
1 CLEC 13.8 23
2 ILEC 3.50 1664
MD <- Time_group[1,2] - Time_group[2,2] # This is the observed value (difference)
MD
Tr.Mean
1 10.336
observed <- MD$Tr.Mean
observed
[1] 10.336
set.seed(123)
sims <- 10^4 - 1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 1687, size = 23, replace = FALSE)
ans[i] <- mean(Verizon$Time[index], trim = 0.25) - mean(Verizon$Time[-index], trim = 0.25)
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.0002
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "tan1") +
theme_bw() +
labs(x = expression(Tr.mean[CLEC]-Tr.mean[ILEC])) +
geom_vline(xintercept=observed, linetype="dotted")
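The median is the other robust choice mentioned above; a sketch of the same permutation test with the median as the statistic follows (this run is not in the text, so its p-value is not reported here):
# Permutation test using the difference in median repair times (a sketch)
med_by_group <- Verizon %>%
  group_by(Group) %>%
  summarize(Med = median(Time))
observed_med <- med_by_group$Med[1] - med_by_group$Med[2] # CLEC minus ILEC
set.seed(123)
sims <- 10^4 - 1
ans <- numeric(sims)
for (i in 1:sims) {
  index <- sample.int(n = 1687, size = 23, replace = FALSE)
  ans[i] <- median(Verizon$Time[index]) - median(Verizon$Time[-index])
}
pvalue <- (sum(ans >= observed_med) + 1)/(sims + 1)
pvalue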
Exercise 3.5:
In the Flight Delays Case Study in Section 1.1,
a. The data contain flight delays for two airlines, American Airlines and United Airlines. Conduct a two-sided permutation test to see if the difference in mean delay times between the two carriers is statistically significant.
b. The flight delays occurred in May and June of 2009. Conduct a two-sided permutation test to see if the difference in mean delay times between the two months is statistically significant.
\(H_0: \mu_{UA} = \mu_{AA}\) vs. \(H_a: \mu_{UA} \neq \mu_{AA}\)
data("FlightDelays")
Del_by_Carrier <- FlightDelays %>%
group_by(Carrier) %>%
summarize(Mean = mean(Delay), N = n())
Del_by_Carrier
# A tibble: 2 x 3
Carrier Mean N
* <fct> <dbl> <int>
1 AA 10.1 2906
2 UA 16.0 1123
MD <- Del_by_Carrier[2,2] - Del_by_Carrier[1,2]
MD
Mean
1 5.885696
observed <- MD$Mean
observed
[1] 5.885696
#
sims <- 10^4 - 1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample(4029, size = 1123, replace = FALSE)
ans[i] <- mean(FlightDelays$Delay[index]) - mean(FlightDelays$Delay[-index])
}
pvalue <- 2*((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.0002
Technical conclusion: p-value < 0.05, so reject \(H_0\).
Non-Technical conclusion: Based on a p-value of \(0.0002\), there is strong evidence to suggest the mean delay time for American Airlines is not the same as the mean delay time for United Airlines.
\(H_0: \mu_{May} = \mu_{June}\) vs. \(H_a: \mu_{May} \neq \mu_{June}\)
data("FlightDelays")
str(FlightDelays)
'data.frame': 4029 obs. of 10 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Carrier : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
$ FlightNo : int 403 405 409 511 667 669 673 677 679 681 ...
$ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
$ DepartTime : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
$ Day : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Month : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
$ FlightLength: int 281 277 279 158 143 150 158 160 160 163 ...
$ Delay : int -1 102 4 -2 -3 0 -5 0 10 60 ...
$ Delayed30 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
del_by_month <- FlightDelays %>%
group_by(Month) %>%
summarize(Average = mean(Delay), N = n())
del_by_month
# A tibble: 2 x 3
Month Average N
* <fct> <dbl> <int>
1 May 8.88 1999
2 June 14.5 2030
MD <- del_by_month[2,2] - del_by_month[1,2]
MD
Average
1 5.663341
observed <- MD$Average
observed
[1] 5.663341
#
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 1999+2030, size = 2030, replace = FALSE)
ans[i] <- mean(FlightDelays$Delay[index]) - mean(FlightDelays$Delay[-index])
}
pvalue <- 2*((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.0002
Technical conclusion: p-value < 0.05, so reject \(H_0\).
Non-Technical conclusion: Based on a p-value of \(0.0002\), there is strong evidence to suggest the mean delay time in June is not the same as the mean delay time in May.
Exercise 3.6:
In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.
a. Compute the proportion of times that each carrier's flights were delayed more than 20 minutes. Conduct a two-sided test to see if the difference in these proportions is statistically significant.
b. Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines is greater than that of American Airlines.
data("FlightDelays")
str(FlightDelays)
'data.frame': 4029 obs. of 10 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Carrier : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
$ FlightNo : int 403 405 409 511 667 669 673 677 679 681 ...
$ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
$ DepartTime : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
$ Day : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Month : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
$ FlightLength: int 281 277 279 158 143 150 158 160 160 163 ...
$ Delay : int -1 102 4 -2 -3 0 -5 0 10 60 ...
$ Delayed30 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
dim(FlightDelays)
[1] 4029 10
delayed20_carrier <- FlightDelays %>%
group_by(Carrier) %>%
summarize(pro = mean(Delay>20), count = n())
delayed20_carrier
# A tibble: 2 x 3
Carrier pro count
* <fct> <dbl> <int>
1 AA 0.169 2906
2 UA 0.213 1123
PD <- delayed20_carrier[2,2] - delayed20_carrier[1,2]
PD
pro
1 0.04351791
observed <- PD$pro
observed
[1] 0.04351791
set.seed(123)
sims <- 10^4 - 1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 2906+1123, size = 1123, replace = FALSE)
ans[i] <- mean(FlightDelays$Delay[index] > 20) - mean(FlightDelays$Delay[-index] > 20)
}
pvalue <- 2*(sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.0012
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "lightblue1") +
theme_bw() +
labs(x = expression(Proportion[UA]-Proportion[AA])) +
geom_vline(xintercept=observed, linetype="dotted")
Technical conclusion: p-value < 0.05, so reject \(H_0\)
Non-Technical conclusion:
Based on a p-value of \(0.0012\), there is strong evidence to suggest the proportion of flights delayed more than 20 minutes is not the same for United and American.
\(H_0: \sigma^2_{AA} = \sigma^2_{UA}\) vs. \(H_a: \sigma^2_{AA} < \sigma^2_{UA}\)
Del_var_carrier <- FlightDelays %>%
group_by(Carrier) %>%
summarize(Var = var(Delay), N = n())
Del_var_carrier
# A tibble: 2 x 3
Carrier Var N
* <fct> <dbl> <int>
1 AA 1606. 2906
2 UA 2038. 1123
VD <- Del_var_carrier[2,2] - Del_var_carrier[1,2]
VD
Var
1 431.0677
observed <- VD$Var
observed
[1] 431.0677
set.seed(123)
sims <- 10^4 -1
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 4029, size = 1123, replace = FALSE)
ans[i] <- var(FlightDelays$Delay[index]) - var(FlightDelays$Delay[-index])
}
pvalue <- ((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.1507
Technical conclusion: p-value > 0.05, so fail to reject \(H_0\).
Non-Technical conclusion:
Based on a p-value of \(0.1507\), there is not sufficient evidence to suggest that the variance in flight delays for United is greater than the variance in flight delays for American.