1 Introduction

Consider the mice example in Section 3.1. Let \(\mu_d\) denote the true mean time that a randomly selected mouse that received the drug takes to run through the maze, and let \(\mu_c\) denote the true mean time for a control mouse. Then,

\[H_0:\mu_d - \mu_c = 0.\] That is, on average, there is no difference in the mean times between mice who receive the drug and mice in the control group.

The alternative hypothesis is

\[H_A: \mu_d - \mu_c > 0.\]

That is, on average, mice who receive the drug are slower (larger values) than the mice in the control group.

1.1 Creating the data

library(dplyr)

DF <- data.frame(time = c(30, 25, 20, 18, 21, 22), treatment = rep(c("Drug", "Control"), each = 3))
DF <- as_tibble(DF)  # tbl_df() is deprecated; as_tibble() is its replacement
DF
# A tibble: 6 x 2
   time treatment
  <dbl> <fct>    
1    30 Drug     
2    25 Drug     
3    20 Drug     
4    18 Control  
5    21 Control  
6    22 Control  
# Difference between the average times...

ANS <- DF %>% 
  group_by(treatment) %>% 
  summarize(Average  = mean(time))
ANS
# A tibble: 2 x 2
  treatment Average
* <fct>       <dbl>
1 Control      20.3
2 Drug         25  
MD <- ANS[2, 2] - ANS[1, 2]
MD
   Average
1 4.666667
observed <- MD$Average
observed
[1] 4.666667

Problem: This difference could be due to random variability rather than a real drug effect.

How do we address this problem? Estimate how easily pure random chance would produce a difference this large.

  • If that probability is small, we conclude that something other than pure random chance is at work.

Definition 1.1 A test statistic is a numerical function of the data whose value determines the result of the test. The function is generally denoted by \(T = T(\mathbf{X})\), where \(\mathbf{X}\) represents the data.

 

Definition 1.2 The P-value is the probability that chance alone would produce a test statistic as extreme as the observed statistic if the null hypothesis were true.

For example, if large values of the test statistic support the alternative hypothesis, the P-value is \(P(T \geq t)\).
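As a toy illustration (the numbers below are invented for this sketch, not from the mice data), once the permutation distribution of \(T\) is in hand, the one-sided P-value is just the fraction of permutation values at or above the observed statistic:

```r
# Hypothetical permutation values of a test statistic T (invented for illustration)
perm_T <- c(-3, -1, 0, 1, 2, 4, 5, 6, -2, 3)
t_obs  <- 4  # hypothetical observed value of the statistic

# P-value = P(T >= t_obs), estimated over the permutation distribution
pvalue <- mean(perm_T >= t_obs)
pvalue
# [1] 0.3
```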

1.2 All Possible Divisions of {30, 25, 20, 18, 21, 22} into Two Sets

ANS <- t(combn(6, 3))
ANS
      [,1] [,2] [,3]
 [1,]    1    2    3
 [2,]    1    2    4
 [3,]    1    2    5
 [4,]    1    2    6
 [5,]    1    3    4
 [6,]    1    3    5
 [7,]    1    3    6
 [8,]    1    4    5
 [9,]    1    4    6
[10,]    1    5    6
[11,]    2    3    4
[12,]    2    3    5
[13,]    2    3    6
[14,]    2    4    5
[15,]    2    4    6
[16,]    2    5    6
[17,]    3    4    5
[18,]    3    4    6
[19,]    3    5    6
[20,]    4    5    6
dim(ANS)
[1] 20  3
DF$time
[1] 30 25 20 18 21 22
DR <- matrix(NA, nrow = 20, ncol = 3)
CO <- matrix(NA, nrow = 20, ncol = 3)

for(i in 1:20){
  DR[i,] <- DF$time[ANS[i,]] 
  CO[i,] <- DF$time[-ANS[i, ]]
}

VALUES <- as.data.frame(cbind(DR, CO))

names(VALUES) <- paste(rep(c("D", "C"), each = 3), 1:3, sep = "")

VALUES$DM <- apply(VALUES[, 1:3], 1, mean)
VALUES$CM <- apply(VALUES[, 4:6], 1, mean)

VALUES$DiffMean <- VALUES$DM - VALUES$CM

DT::datatable(round(VALUES, 2))
pvalue <- sum(VALUES$DiffMean >= observed)/choose(6,3)
pvalue
[1] 0.15
MASS::fractions(pvalue)
[1] 3/20

The theoretical p-value is \(0.15\).

1.3 Statistical Conclusion

The p-value (\(0.15\)) was not small enough to reject \(H_0\). Therefore, we conclude that there is no significant difference between the mean times to complete the maze with or without the drug.

2 PERMUTATION TESTS

This problem has a total of \(\binom{6}{3} = 20\) arrangements of six items into two sets each with three observations. While it is possible to enumerate all possible outcomes and compute the theoretical answer, once the group sizes increase this becomes computationally infeasible. Consider \(\binom{30}{15} = 155117520\)! So, what we typically do is take a “large” number of permutations (generally 10,000) and let this simulation approximate the true distribution.
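The explosion in the number of arrangements is easy to check with `choose()`:

```r
# Number of ways to split the pooled observations into two groups
choose(6, 3)    # 20 -- small enough to enumerate completely
choose(30, 15)  # 155117520 -- full enumeration is already impractical
```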

We follow this algorithm:

Two-Sample Permutation Test:

  Pool the n1 + n2 (total of n) values.
    **repeat**
        Draw a resample of size n1 without replacement.
        Use the remaining n2 observations for the other sample.
        Calculate the difference in means, or another statistic that compares the samples.
    **until** we have enough samples.
    
  Calculate the p-value as the fraction of times the random statistics exceed the original statistic.
  (Multiply the p-value by 2 for a two-sided test.)
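The algorithm can be packaged as a small reusable function (a sketch; the name `perm_test` and its interface are my own, not from the text):

```r
# Two-sample permutation test (sketch): one-sided p-value for stat(x) - stat(y)
perm_test <- function(x, y, stat = mean, sims = 10^4 - 1) {
  pooled <- c(x, y)                 # pool the n1 + n2 values
  n1 <- length(x)
  observed <- stat(x) - stat(y)
  ans <- numeric(sims)
  for (i in 1:sims) {
    index <- sample.int(n = length(pooled), size = n1, replace = FALSE)
    ans[i] <- stat(pooled[index]) - stat(pooled[-index])
  }
  (sum(ans >= observed) + 1) / (sims + 1)  # multiply by 2 for a two-sided test
}

set.seed(123)
perm_test(c(30, 25, 20), c(18, 21, 22))  # mice data; close to the exact 3/20
```

The `(... + 1)/(sims + 1)` adjustment counts the observed arrangement itself as one of the permutations, which keeps the estimated p-value from ever being exactly zero.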
set.seed(123)
sims <- 10^4 - 1 # Number of simulations (iterations)

ans <- numeric(sims)
for(i in 1:sims){
  index <- sample.int(n = 6, size = 3, replace = FALSE)
  ans[i] <- mean(DF$time[index]) - mean(DF$time[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1) # Here observed = 4.666667 is the observed difference between the means.

pvalue
[1] 0.1565
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) + 
  geom_density(fill = "pink") + 
  theme_bw() + 
  labs(x = expression(bar(X)[d]-bar(X)[c])) +
  geom_vline(xintercept=observed, linetype="dotted")

Figure 2.1: Permutation Distribution of \(\bar{X}_d - \bar{X}_c\)

Example: Beer and hot wings case study in Section 1.9

Do men consume more hot wings than women?

Define the null and alternative hypotheses:

\(H_0: \mu_M - \mu_F = 0\) vs. \(H_a: \mu_M - \mu_F > 0\)

Step 1) Get data

library(resampledata)
data(Beerwings)
head(Beerwings)
  ID Hotwings Beer Gender
1  1        4   24      F
2  2        5    0      F
3  3        5   12      F
4  4        6   12      F
5  5        7   12      F
6  6        7   12      F
str(Beerwings)
'data.frame':   30 obs. of  4 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Hotwings: int  4 5 5 6 7 7 7 8 8 8 ...
 $ Beer    : int  24 0 12 12 12 12 24 24 0 12 ...
 $ Gender  : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 1 2 2 ...

Step 2) Find the observed difference

# Difference between the average number of Hotwings

ANS <- Beerwings %>% 
  group_by(Gender) %>% 
  summarize(Average  = mean(Hotwings), N = n())
ANS
# A tibble: 2 x 3
  Gender Average     N
* <fct>    <dbl> <int>
1 F         9.33    15
2 M        14.5     15
MD <- ANS[2, 2] - ANS[1, 2]
MD
  Average
1     5.2
observed <- MD$Average
observed
[1] 5.2

Step 3) Resample and find the p-value

set.seed(123)
sims <- 10^5 - 1 

ans <- numeric(sims)
for(i in 1:sims){
  index <- sample.int(n = 30, size = 15, replace = FALSE)
  ans[i] <- mean(Beerwings$Hotwings[index]) - mean(Beerwings$Hotwings[-index])
}

pvalue <- (sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.00115

Step 4) Create a plot

library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) + 
  geom_density(fill = "orchid4") + 
  theme_bw() + 
  labs(x = expression(bar(X)[m]-bar(X)[f])) +
  geom_vline(xintercept=observed, linetype="dotted")

Figure 2.2: Permutation Distribution of \(\bar{X}_m - \bar{X}_f\)

Step 5) Make the decision

Technical conclusion: \(p\)-value \(= 0.00115 < 0.05\), so reject \(H_0\).

Non-Technical conclusion: There is strong evidence that men, on average, consume more hot wings than women.

Exercise 3.3:

In a hypothesis test comparing two population means, \(H_0: \mu_1 = \mu_2\) versus \(H_a: \mu_1 > \mu_2\):

  1. Which \(P\)-value, 0.03 or 0.006, provides stronger evidence for the alternative hypothesis?

  2. Which \(P\)-value, 0.095 or 0.04, provides stronger evidence that chance alone might account for the observed result?

 

Example: Verizon case study.

Let \(\mu_1\) denote the mean repair time for the ILEC customers and \(\mu_2\) the mean repair time for the CLEC customers.

  1. Form the hypotheses. \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 < \mu_2\)

  2. Conduct the test.

#Step 1

data("Verizon")
head(Verizon)
   Time Group
1 17.50  ILEC
2  2.40  ILEC
3  0.00  ILEC
4  0.65  ILEC
5 22.23  ILEC
6  1.20  ILEC
dim(Verizon)
[1] 1687    2
#Step 2

Time_group <- Verizon %>%
  group_by(Group) %>%
  summarize(Mean = mean(Time), NumCus = n())
Time_group
# A tibble: 2 x 3
  Group  Mean NumCus
* <fct> <dbl>  <int>
1 CLEC  16.5      23
2 ILEC   8.41   1664
MD <- Time_group[1,2] - Time_group[2,2]
MD
     Mean
1 8.09752
observed <- MD$Mean   # MD is a 1 x 1 data frame, so extract the numeric value.
observed
[1] 8.09752
#Step 3

set.seed(200)
sims <- 10^5 - 1 
ans <- numeric(sims)
for(i in 1:sims){
  index <- sample.int(n = 1687, size = 23, replace = FALSE)
  ans[i] <- mean(Verizon$Time[index]) - mean(Verizon$Time[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1)    # observed = CLEC mean - ILEC mean, so large values support H_a: mu_1 < mu_2.
pvalue
[1] 0.01862
#Step 4
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) + 
  geom_density(fill = "orchid4") + 
  theme_bw() + 
  labs(x = expression(bar(X)[ILEC]-bar(X)[CLEC])) +
  geom_vline(xintercept=observed, linetype="dotted")

Step 5: Make the decision –

Technical conclusion: \(p\)-Value < 0.05, so Reject \(H_0\).

English conclusion: There is enough evidence to conclude that Verizon takes significantly longer to complete repairs for CLEC customers than for ILEC customers.

 

Example: Verizon case study ctd… Do the data have a long tail?

  1. Create a good visual to compare the typical times to complete repairs for ILEC and CLEC customers.
ggplot(data = Verizon, aes(x=factor(Group), y = Time)) + geom_boxplot()

  2. From the graph created in 1), it is clear that the data have a long tail. Create a better visual to show the tails of the distributions.
ggplot(data = Verizon, aes(x = Time)) + geom_histogram() + facet_wrap(~Group)

  3. Name two measures that can be used to describe the center in situations like this instead of the mean.

Trimmed mean, Median

  4. Use one of the measures mentioned above to redo the test. State your conclusions.

Here I am using the trimmed mean.
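Before redoing the test, a quick look at what `trim = 0.25` does (the numbers below are invented for illustration):

```r
x <- c(1, 2, 3, 4, 100)   # invented data with one large outlier
mean(x)                   # 22 -- the outlier pulls the mean up
mean(x, trim = 0.25)      # 3  -- drops the lowest and highest 25% before averaging
median(x)                 # 3
```

With heavy right tails like the Verizon repair times, the trimmed mean and median are far less sensitive to the extreme observations than the ordinary mean.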

data("Verizon")

head(Verizon)
   Time Group
1 17.50  ILEC
2  2.40  ILEC
3  0.00  ILEC
4  0.65  ILEC
5 22.23  ILEC
6  1.20  ILEC
dim(Verizon)
[1] 1687    2
Time_group <- Verizon %>%
  group_by(Group) %>%
  summarize(Tr.Mean = mean(Time, trim = 0.25), Count = n())
Time_group
# A tibble: 2 x 3
  Group Tr.Mean Count
* <fct>   <dbl> <int>
1 CLEC    13.8     23
2 ILEC     3.50  1664
MD <- Time_group[1,2] - Time_group[2,2] # This is the observed value (difference)
MD
  Tr.Mean
1  10.336
observed <- MD$Tr.Mean
observed
[1] 10.336
set.seed(123)
sims <- 10^4 - 1 
ans <- numeric(sims)
for(i in 1:sims){
  index <- sample.int(n = 1687, size = 23, replace = FALSE)
  ans[i] <- mean(Verizon$Time[index], trim = 0.25) - mean(Verizon$Time[-index], trim = 0.25)
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.0002
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) + 
  geom_density(fill = "tan1") + 
  theme_bw() + 
  labs(x = expression(Tr.mean[ILEC]-Tr.mean[CLEC])) +
  geom_vline(xintercept=observed, linetype="dotted")

Technical conclusion: \(p\)-value < 0.05, so reject \(H_0\). Even with the trimmed mean, repairs for CLEC customers take significantly longer than for ILEC customers.
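The other measure mentioned above, the median, works with the same resampling scheme. A sketch on skewed toy data (the numbers are invented for illustration, not the actual Verizon times):

```r
set.seed(123)
# Toy skewed repair times (invented; not the actual Verizon data)
ilec <- c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 50, 100)
clec <- c(5, 8, 14, 20, 90)
pooled <- c(ilec, clec)
observed <- median(clec) - median(ilec)   # 14 - 2.75 = 11.25

sims <- 10^4 - 1
ans <- numeric(sims)
for (i in 1:sims) {
  index <- sample.int(n = length(pooled), size = length(clec), replace = FALSE)
  ans[i] <- median(pooled[index]) - median(pooled[-index])
}
pvalue <- (sum(ans >= observed) + 1)/(sims + 1)
pvalue
```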

 

Exercise 3.5:

In the Flight Delays Case Study in Section 1.1,

a. The data contain flight delays for two airlines, American Airlines and United Airlines.  Conduct a two-sided permutation test to see if the difference in mean delay times between the two carriers is statistically significant.

b. The flight delays occurred in May and June of 2009.  Conduct a two-sided permutation test to see if the difference in mean delay times between the two months is statistically significant.
  1. Define \(H_0\) and \(H_a\) first.

\(H_0: \mu_{UA} = \mu_{AA}\) vs. \(H_a: \mu_{UA} \neq \mu_{AA}\)

data("FlightDelays")


Del_by_Carrier <- FlightDelays %>% 
  group_by(Carrier) %>% 
  summarize(Mean = mean(Delay), N = n())
Del_by_Carrier
# A tibble: 2 x 3
  Carrier  Mean     N
* <fct>   <dbl> <int>
1 AA       10.1  2906
2 UA       16.0  1123
MD <- Del_by_Carrier[2,2] - Del_by_Carrier[1,2]
MD
      Mean
1 5.885696
observed <- MD$Mean
observed
[1] 5.885696
#

sims <- 10^4 - 1
ans <- numeric(sims)
for(i in 1:sims){
  index <- sample(4029, size = 1123, replace = FALSE)
  ans[i] <- mean(FlightDelays$Delay[index]) - mean(FlightDelays$Delay[-index])
}

pvalue <- 2*((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.0002

Technical conclusion: p-value < 0.05, so reject \(H_0\)

Non-Technical conclusion: Based on a p-value of \(0.0002\) there is strong evidence to suggest the mean delay time for American Airlines is not the same as the mean delay time for United Airlines.
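Doubling one tail, as above, follows the algorithm stated earlier. An alternative two-sided p-value counts permutation statistics that are at least as extreme in either direction, using absolute values. A sketch on invented toy data (not the FlightDelays data):

```r
set.seed(123)
# Toy delays for two carriers (invented for illustration)
a <- c(0, 5, 12, 3, 40, 7, 2, 15)   # stand-in for one carrier
b <- c(10, 25, 8, 30, 22, 18)       # stand-in for the other
pooled <- c(a, b)
observed <- mean(b) - mean(a)

sims <- 10^4 - 1
ans <- numeric(sims)
for (i in 1:sims) {
  index <- sample.int(n = length(pooled), size = length(b), replace = FALSE)
  ans[i] <- mean(pooled[index]) - mean(pooled[-index])
}
# Count permutation statistics at least as extreme in either direction
pvalue <- (sum(abs(ans) >= abs(observed)) + 1)/(sims + 1)
pvalue
```

The two approaches agree when the permutation distribution is roughly symmetric; the absolute-value version avoids p-values above 1 when it is not.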

  1. Define \(H_0\) and \(H_a\) first.

\(H_0: \mu_{May} = \mu_{June}\) vs. \(H_a: \mu_{May} \neq \mu_{June}\)

data("FlightDelays")
str(FlightDelays)
'data.frame':   4029 obs. of  10 variables:
 $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Carrier     : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
 $ FlightNo    : int  403 405 409 511 667 669 673 677 679 681 ...
 $ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
 $ DepartTime  : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
 $ Day         : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Month       : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
 $ FlightLength: int  281 277 279 158 143 150 158 160 160 163 ...
 $ Delay       : int  -1 102 4 -2 -3 0 -5 0 10 60 ...
 $ Delayed30   : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
del_by_month <- FlightDelays %>% 
  group_by(Month) %>% 
  summarize(Average = mean(Delay), N = n())

del_by_month
# A tibble: 2 x 3
  Month Average     N
* <fct>   <dbl> <int>
1 May      8.88  1999
2 June    14.5   2030
MD <- del_by_month[2,2] - del_by_month[1,2]
MD
   Average
1 5.663341
observed <- MD$Average
observed
[1] 5.663341
#
ans <- numeric(sims)

for(i in 1:sims){
  index <- sample.int(n = 1999+2030, size = 2030, replace = FALSE)
  ans[i] <- mean(FlightDelays$Delay[index]) - mean(FlightDelays$Delay[-index])
}


pvalue <- 2*((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.0002

Technical conclusion: p-value < 0.05, so reject \(H_0\)

Non-Technical conclusion: Based on a p-value of \(0.0002\) there is strong evidence to suggest the mean delay time for June is not the same as the mean delay time for May.

Exercise 3.6:

In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.

a. Compute the proportion of times that each carrier's flights were delayed more than 20 minutes.  Conduct a two-sided test to see if the difference in these proportions is statistically significant.

b. Compute the variance in the flight delay lengths for each carrier.  Conduct a test to see if the variance for United Airlines is greater than that of American Airlines.
  1. Define \(H_0\) and \(H_a\) first.

\(H_0: p_{AA} = p_{UA}\) vs. \(H_a: p_{AA} \neq p_{UA}\), where \(p\) is the proportion of flights delayed more than 20 minutes.
data("FlightDelays")
str(FlightDelays)
'data.frame':   4029 obs. of  10 variables:
 $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Carrier     : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
 $ FlightNo    : int  403 405 409 511 667 669 673 677 679 681 ...
 $ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
 $ DepartTime  : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
 $ Day         : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Month       : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
 $ FlightLength: int  281 277 279 158 143 150 158 160 160 163 ...
 $ Delay       : int  -1 102 4 -2 -3 0 -5 0 10 60 ...
 $ Delayed30   : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
dim(FlightDelays)
[1] 4029   10
delayed20_carrier <- FlightDelays %>%
  group_by(Carrier) %>%
  summarize(pro = mean(Delay>20), count = n())
delayed20_carrier
# A tibble: 2 x 3
  Carrier   pro count
* <fct>   <dbl> <int>
1 AA      0.169  2906
2 UA      0.213  1123
PD <- delayed20_carrier[2,2] - delayed20_carrier[1,2]
PD
         pro
1 0.04351791
observed <- PD$pro
observed
[1] 0.04351791
set.seed(123)
sims <- 10^4 - 1 
ans <- numeric(sims)


for(i in 1:sims){
  index <- sample.int(n = 2906+1123, size = 1123, replace = FALSE)
  ans[i] <- mean(FlightDelays$Delay[index] > 20) - mean(FlightDelays$Delay[-index] > 20)
}

pvalue <- 2*(sum(ans >= observed) + 1)/(sims + 1)
pvalue
[1] 0.0012
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) + 
  geom_density(fill = "lightblue1") + 
  theme_bw() + 
  labs(x = expression(Proportion[UA]-Proportion[AA])) +
  geom_vline(xintercept=observed, linetype="dotted")

Technical conclusion: p-value < 0.05, so reject \(H_0\)

Non-Technical conclusion:

Based on a p-value of \(0.0012\) there is strong evidence to suggest the proportion of flights delayed more than 20 minutes is not the same for United and American.

  1. Define \(H_0\) and \(H_a\) first.

\(H_0: \sigma^2_{AA} = \sigma^2_{UA}\) vs. \(H_a: \sigma^2_{AA} < \sigma^2_{UA}\)

Del_var_carrier <- FlightDelays %>% 
  group_by(Carrier) %>% 
  summarize(Var = var(Delay), N = n())
Del_var_carrier
# A tibble: 2 x 3
  Carrier   Var     N
* <fct>   <dbl> <int>
1 AA      1606.  2906
2 UA      2038.  1123
VD <- Del_var_carrier[2,2] - Del_var_carrier[1,2]
VD
       Var
1 431.0677
observed <- VD$Var
observed
[1] 431.0677
set.seed(123)
sims <- 10^4 -1 
ans <- numeric(sims)

for(i in 1:sims){
  index <- sample.int(n = 4029, size = 1123, replace = FALSE)
  ans[i] <- var(FlightDelays$Delay[index]) - var(FlightDelays$Delay[-index])
}


pvalue <- ((sum(ans >= observed) + 1)/(sims + 1))
pvalue
[1] 0.1507

Technical conclusion: p-value > 0.05, so fail to reject \(H_0\)

Non-Technical conclusion:

Based on a p-value of \(0.1507\) there is not sufficient evidence to suggest the variance in flight delays for United is greater than the variance in flight delays for American.