We have already discussed the basic graphs last week!
Mean
If \(x_1, x_2,...,x_n\) are \(n\) data values then the mean is:
\[\bar{x} = \dfrac{1}{n}\sum_{i=1}^n x_i\]
Example: Find the mean of 2, 15, 3, 8, 12, 5.
# Your code here
#
#
data <- c(2, 15, 3, 8, 12, 5, 100) # the six example values plus one extreme value, 100
Mean <- mean(data)
Mean
## [1] 20.71429
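As a quick check of the formula, the sketch below computes the mean of the six example values directly from the definition (add the values, divide by \(n\)) and compares it with `mean()`. It then appends the value 100 to show how one extreme value pulls the mean upward. The names `x`, `mean_by_hand`, and `x_with_outlier` are illustrative only.

```r
# The six values from the example
x <- c(2, 15, 3, 8, 12, 5)

# Mean from the definition: add the values and divide by n
n <- length(x)
mean_by_hand <- sum(x) / n
mean_by_hand
## [1] 7.5

# The built-in function gives the same answer
mean(x)
## [1] 7.5

# Appending a single extreme value changes the mean a lot
x_with_outlier <- c(x, 100)
mean(x_with_outlier)
## [1] 20.71429
```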
IQR(c(25,26,45,28,78))
## [1] 19
Median
The median is the middle value of a sorted data set (the average of the two middle values when \(n\) is even). It is resistant to outliers and is a good measure of center when the distribution is skewed.
Example: Find the median of 2, 15, 3, 8, 12, 5.
# Your code here
#
#
median(data) # data still holds the seven values defined above, including 100
## [1] 8
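For the six values in the example, \(n\) is even, so the median is the average of the two middle values after sorting. The sketch below does this by hand and compares it with `median()`. It also shows that appending the extreme value 100 moves the median only slightly. The variable names are illustrative.

```r
# The six values from the example
y <- c(2, 15, 3, 8, 12, 5)

# Sort, then average the two middle values because n is even
y_sorted <- sort(y)   # 2 3 5 8 12 15
n <- length(y_sorted)
median_by_hand <- mean(y_sorted[c(n / 2, n / 2 + 1)])
median_by_hand
## [1] 6.5

# The built-in function agrees
median(y)
## [1] 6.5

# One extreme value barely moves the median
median(c(y, 100))
## [1] 8
```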
Trimmed Mean (25% Trimmed Mean)
To compute a 25% trimmed mean, sort the data, omit 25% of the observations on each end, and take the mean of the remaining middle 50% of the observations. Like the median, the trimmed mean is resistant to outliers.
Example: Find the 25% Trimmed Mean of 2, 15, 3, 8, 12, 5.
# Your code here
#
#
d <- c(2, 15, 3, 8, 12, 5)
mean(d, trim = .25)
## [1] 7
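To see what `trim = .25` does, here is a minimal sketch that trims by hand. It assumes the rule used by R's `mean()`: the number of observations dropped from each end is `floor(n * trim)`. The names `vals`, `k`, and `kept` are illustrative.

```r
# The six values from the example
vals <- c(2, 15, 3, 8, 12, 5)
trim <- 0.25

sorted_vals <- sort(vals)   # 2 3 5 8 12 15
n <- length(sorted_vals)

# Number of observations dropped from each end: floor(6 * 0.25) = 1
k <- floor(n * trim)

# Keep the middle observations and average them
kept <- sorted_vals[(k + 1):(n - k)]
kept
## [1]  3  5  8 12
mean(kept)
## [1] 7

# Same result as the built-in trimmed mean
mean(vals, trim = 0.25)
## [1] 7
```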
The following are common measures of spread:
Range
Range is the difference between the largest and the smallest values. Range is sensitive to outliers.
IQR
The interquartile range (IQR) is the difference between the third and the first quartiles (the 75th and the 25th percentiles). IQR is NOT sensitive to outliers.
Standard deviation
If \(x_1, x_2,...,x_n\) are \(n\) data values then the Standard deviation is:
\[s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}\]
Example: Find the Range, IQR and the Standard deviation of 2, 15, 3, 8, 12, 5.
# Your code here
#
#
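One possible way to compute the three measures of spread for this example is sketched below. It assumes R's default quartile method for `IQR()` (other methods can give slightly different quartiles), and it also evaluates the standard deviation formula directly so it can be compared with `sd()`. The variable names are illustrative.

```r
# The six values from the example
w <- c(2, 15, 3, 8, 12, 5)

# Range: largest value minus smallest value
range(w)              # smallest and largest values
## [1]  2 15
data_range <- diff(range(w))
data_range
## [1] 13

# IQR: third quartile minus first quartile (R's default quartile method)
IQR(w)
## [1] 7.5

# Standard deviation: built-in, then directly from the formula
sd(w)
## [1] 5.167204
sd_by_hand <- sqrt(sum((w - mean(w))^2) / (length(w) - 1))
sd_by_hand
## [1] 5.167204
```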
Five-number summary
The five-number summary consists of the minimum, first quartile, median, third quartile, and the maximum.
Example: Consider the 15 numbers 9,10,11,11,12,14,16,17,19,21,25,31,32,41,61. Find the five number summary.
# Your code here
#
#
data <- c(9,10,11,11,12,14,16,17,19,21,25,31,32,41,61)
summary(data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 11.5 17.0 22.0 28.0 61.0
Five number summary is:
Min =
\(Q_1\) =
Median =
\(Q_3\) =
Max =
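Besides `summary()`, base R offers `quantile()` and `fivenum()` for the same five values. The sketch below applies both to the 15 numbers from the example. Note that `fivenum()` uses Tukey's hinges, which can differ slightly from the default quartiles of `quantile()` on other data sets; for these 15 numbers the two agree. The name `nums` is illustrative.

```r
# The 15 numbers from the example
nums <- c(9, 10, 11, 11, 12, 14, 16, 17, 19, 21, 25, 31, 32, 41, 61)

# Minimum, quartiles, and maximum using R's default quantile method
q <- quantile(nums)
q

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max)
fivenum(nums)
## [1]  9.0 11.5 17.0 28.0 61.0
```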
A boxplot is a type of graph that can be used to visualize the five-number summary.
Create a boxplot for the following 21 data values.
5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21, 21, 22, 23, 24, 32, 40, 43, 49.
library(ggplot2)
d <- c(5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21, 21, 22, 23, 24, 32, 40, 43, 49)
d <- data.frame(d)
ggplot(data = d, aes(y = d)) + geom_boxplot()
Notes about a boxplot:
[Figure: Boxplot]
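To connect the boxplot to the five-number summary, the sketch below uses base R's `boxplot.stats()` on the same 21 values. It assumes the common 1.5 × IQR rule (the default `coef = 1.5`): points farther than 1.5 × IQR beyond the box are reported as outliers, and each whisker ends at the most extreme value that is not an outlier. The name `box_values` is illustrative.

```r
# The 21 values used for the boxplot
box_values <- c(5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21,
                21, 22, 23, 24, 32, 40, 43, 49)

bs <- boxplot.stats(box_values)   # default coef = 1.5

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$stats
## [1]  5 11 19 23 40

# Values beyond 1.5 * IQR from the box, drawn as individual points
bs$out
## [1] 43 49
```

Here the box runs from 11 to 23, so the IQR is 12 and the upper cutoff is 23 + 1.5 × 12 = 41; the values 43 and 49 exceed it, so the upper whisker stops at 40.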
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the “bell curve” because the normal distribution is bell shaped and symmetric. We would “like” our data to be normally distributed.
Here is an example: Check whether the data below are normally distributed.
11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22, 7.35, 16.45, 5.37, 10.25, 12.41, 11.43, 8.00, 11.90, 17.86, 7.59, 11.88, 14.54, 14.44, 10.29, 5.47, 8.20, 10.56, 11.18
1. Create a histogram and check if it is “Normal” enough!
# Your code here
#
#
library(ggplot2)
data <- c(11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22, 7.35, 16.45, 5.37, 10.25, 12.41, 11.43, 8.00, 11.90, 17.86, 7.59, 11.88, 14.54, 14.44, 10.29, 5.47, 8.20, 10.56, 11.18)
MyDataFrame <- data.frame(data)
MyDataFrame
## data
## 1 11.62
## 2 15.92
## 3 12.40
## 4 14.40
## 5 10.28
## 6 13.15
## 7 11.22
## 8 7.35
## 9 16.45
## 10 5.37
## 11 10.25
## 12 12.41
## 13 11.43
## 14 8.00
## 15 11.90
## 16 17.86
## 17 7.59
## 18 11.88
## 19 14.54
## 20 14.44
## 21 10.29
## 22 5.47
## 23 8.20
## 24 10.56
## 25 11.18
ggplot(data = MyDataFrame, aes(x = data)) +
geom_histogram(bins = 10)
2. Create a normal Q-Q plot. If the plotted points fall close to the reference line, then your data are approximately Normal.
ggplot(data = MyDataFrame, aes(sample = data)) +
geom_qq() + geom_qq_line()
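For intuition about what the Q-Q plot shows, here is a minimal base R sketch. It pairs the sorted data with theoretical normal quantiles and builds a reference line through the first and third quartiles, which is how `stats::qqline()` constructs its line by default. The names `obs`, `theory_q`, `slope`, and `intercept` are illustrative, and the correlation at the end is only an informal check, not a formal test of normality.

```r
# The 25 data values from the example
obs <- c(11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22, 7.35, 16.45,
         5.37, 10.25, 12.41, 11.43, 8.00, 11.90, 17.86, 7.59, 11.88,
         14.54, 14.44, 10.29, 5.47, 8.20, 10.56, 11.18)
n <- length(obs)

# Points on the Q-Q plot: theoretical normal quantiles vs. sorted data
theory_q <- qnorm(ppoints(n))
sample_q <- sort(obs)

# Reference line through the first and third quartiles
y_q <- quantile(obs, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])

# Informal check: a correlation near 1 means the points lie close to a line
qq_cor <- cor(theory_q, sample_q)
qq_cor
```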
Example: Exercise 2.4 (modified)
Data sets used in the MSWR book are stored in the library resampledata. Import the data set about the flight delays described in the case study in Section 1.1 of MSWR. Also load the libraries ggplot2 and dplyr.
library(resampledata) # dataset for MSWR book is in this library
##
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
##
## Titanic
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(FlightDelays)
Create a bar chart of the departure times (DepartTime).
head(FlightDelays)
## ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1 1 UA 403 DEN 4-8am Fri May 281 -1
## 2 2 UA 405 DEN 8-Noon Fri May 277 102
## 3 3 UA 409 DEN 4-8pm Fri May 279 4
## 4 4 UA 511 ORD 8-Noon Fri May 158 -2
## 5 5 UA 667 ORD 4-8am Fri May 143 -3
## 6 6 UA 669 ORD 4-8am Fri May 150 0
## Delayed30
## 1 No
## 2 Yes
## 3 No
## 4 No
## 5 No
## 6 No
str(FlightDelays)
## 'data.frame': 4029 obs. of 10 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Carrier : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
## $ FlightNo : int 403 405 409 511 667 669 673 677 679 681 ...
## $ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
## $ DepartTime : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
## $ Day : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Month : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
## $ FlightLength: int 281 277 279 158 143 150 158 160 160 163 ...
## $ Delay : int -1 102 4 -2 -3 0 -5 0 10 60 ...
## $ Delayed30 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
ggplot(data = FlightDelays, aes(x = DepartTime)) +
geom_bar(fill = "orchid4")
Create a table of the departure times (DepartTime).
table(FlightDelays$DepartTime)
##
## 4-8am 8-Noon Noon-4pm 4-8pm 8-Mid
## 699 1053 1048 972 257
Create a contingency table of the variables Day and Delayed30. For each day, what is the proportion of flights delayed at least 30 min?
table(FlightDelays$Day, FlightDelays$Delayed30)
##
## No Yes
## Sun 507 44
## Mon 569 61
## Tue 535 93
## Wed 488 76
## Thu 434 132
## Fri 493 144
## Sat 406 47
Note:
The %>% operator lets us chain several steps together. For example, we can:
1. filter our data frame to focus on only a few rows, then
2. group_by another variable to create groups, then
3. summarize this grouped data to calculate the mean (for example) for each level of the group.
library(knitr)
# Harder way!
day_delayed30 <- FlightDelays %>%
group_by(Day) %>%
summarize(pro = sum(Delayed30 == "Yes")/sum(Delayed30 == "Yes"|Delayed30 == "No"))
kable(day_delayed30)
Day | pro |
---|---|
Sun | 0.0798548 |
Mon | 0.0968254 |
Tue | 0.1480892 |
Wed | 0.1347518 |
Thu | 0.2332155 |
Fri | 0.2260597 |
Sat | 0.1037528 |
# Easy and preferred way!
day_delayed30 <- FlightDelays %>%
group_by(Day) %>%
summarize(pro = mean(Delayed30 == "Yes"))
kable(day_delayed30)
Day | pro |
---|---|
Sun | 0.0798548 |
Mon | 0.0968254 |
Tue | 0.1480892 |
Wed | 0.1347518 |
Thu | 0.2332155 |
Fri | 0.2260597 |
Sat | 0.1037528 |
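The filter, group_by, summarize chain described in the Note can be sketched on a tiny made-up data frame. Everything in `flights_toy` is hypothetical (it is not part of FlightDelays); the point is only to show the three steps in order and why `mean()` of a logical condition gives a proportion.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
flights_toy <- data.frame(
  Carrier = c("AA", "AA", "UA", "UA", "UA", "AA"),
  Day     = c("Mon", "Tue", "Mon", "Mon", "Tue", "Mon"),
  Delay   = c(5, 40, -2, 35, 10, 0)
)

toy_summary <- flights_toy %>%
  filter(Carrier == "UA") %>%               # keep only the UA rows
  group_by(Day) %>%                         # one group per day
  summarize(mean_delay = mean(Delay),       # average delay per day
            prop_30 = mean(Delay >= 30))    # TRUE counts as 1, FALSE as 0

toy_summary
```

Because `Delay >= 30` is TRUE or FALSE for each row, its mean is the fraction of rows where the condition holds. For the toy data, Monday has UA delays of -2 and 35, so `mean_delay` is 16.5 and `prop_30` is 0.5.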
ggplot(data = FlightDelays, aes(x = Delayed30, y = FlightLength)) + geom_boxplot()
Exercise 2.5: You try!
library(resampledata)
data(GSS2002)
str(GSS2002)
## 'data.frame': 2765 obs. of 21 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 7 levels "Mid-Atl","Mountain",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 2 1 1 2 1 ...
## $ Race : Factor w/ 3 levels "Black","Other",..: 3 3 3 3 3 3 3 3 3 2 ...
## $ Education : Factor w/ 5 levels "Left HS","HS",..: 2 4 2 1 1 2 4 2 2 2 ...
## $ Marital : Factor w/ 5 levels "Divorced","Married",..: 1 2 4 1 1 1 2 2 1 3 ...
## $ Religion : Factor w/ 13 levels "Buddhism","Catholic",..: 5 13 13 13 13 2 13 13 2 2 ...
## $ Happy : Factor w/ 3 levels "Not too happy",..: 2 2 NA NA NA 2 NA NA 1 NA ...
## $ Income : Factor w/ 24 levels "under 1000","1000-2999",..: 16 21 17 19 18 18 NA NA 20 1 ...
## $ PolParty : Factor w/ 8 levels "Ind","Ind, Near Dem",..: 8 5 8 2 1 3 8 1 8 3 ...
## $ Politics : Factor w/ 7 levels "Conservative",..: 1 1 NA NA NA 1 NA NA 1 NA ...
## $ Marijuana : Factor w/ 2 levels "Legal","Not legal": NA 2 NA NA NA NA NA NA 1 NA ...
## $ DeathPenalty : Factor w/ 2 levels "Favor","Oppose": 1 1 NA NA NA 1 NA NA 1 NA ...
## $ OwnGun : Factor w/ 3 levels "No","Refused",..: 1 3 NA NA NA 3 NA NA 3 NA ...
## $ GunLaw : Factor w/ 2 levels "Favor","Oppose": 1 2 NA NA NA 2 NA NA 2 NA ...
## $ SpendMilitary: Factor w/ 3 levels "About right",..: 2 1 NA 1 NA 2 NA 2 NA 3 ...
## $ SpendEduc : Factor w/ 3 levels "About right",..: 2 2 NA 2 NA 2 NA 2 NA 2 ...
## $ SpendEnv : Factor w/ 3 levels "About right",..: 1 1 NA 2 NA 2 NA 1 NA 2 ...
## $ SpendSci : Factor w/ 3 levels "About right",..: 1 1 NA 2 NA 2 NA 1 NA 2 ...
## $ Pres00 : Factor w/ 5 levels "Bush","Didnt vote",..: 1 1 1 NA NA 1 1 1 1 NA ...
## $ Postlife : Factor w/ 2 levels "No","Yes": 2 2 NA NA NA 2 NA NA 2 NA ...
table(GSS2002$DeathPenalty) # table
##
## Favor Oppose
## 899 409
ggplot(data = GSS2002, aes(x = DeathPenalty)) + geom_bar() # barchart
Use the table command and the summary command in R on the gun ownership variable. What additional information does the summary command give that the table command does not?
table(GSS2002$OwnGun)
##
## No Refused Yes
## 605 9 310
summary(GSS2002$OwnGun)
## No Refused Yes NA's
## 605 9 310 1841
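The difference between the two commands comes down to how missing values are handled. The sketch below uses a small made-up factor (`own_gun_toy`, hypothetical, not the GSS2002 data) to show that `table()` drops `NA` by default, that `summary()` reports them, and that `table()` can be asked to count them with the `useNA` argument.

```r
# Hypothetical toy factor with two missing values
own_gun_toy <- factor(c("No", "Yes", NA, "No", NA, "Refused"))

# table() drops missing values by default
counts_default <- table(own_gun_toy)
counts_default

# summary() on a factor also reports how many values are missing
summary(own_gun_toy)

# table() can include missing values when asked
counts_with_na <- table(own_gun_toy, useNA = "ifany")
counts_with_na

# Direct count of the missing values
n_missing <- sum(is.na(own_gun_toy))
n_missing
## [1] 2
```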
table(GSS2002$DeathPenalty, GSS2002$OwnGun)
##
## No Refused Yes
## Favor 375 7 243
## Oppose 199 2 59
library(knitr)
death_gun <- GSS2002 %>%
group_by(OwnGun) %>%
summarize(ProInFavor = mean(DeathPenalty == "Favor", na.rm = TRUE))
kable(death_gun)
OwnGun | ProInFavor |
---|---|
No | 0.6533101 |
Refused | 0.7777778 |
Yes | 0.8046358 |
NA | 0.6477541 |