2 Numerical Summaries

2.1 Center

Mean

If \(x_1, x_2,...,x_n\) are \(n\) data values then the mean is:

\[\bar{x} = \dfrac{1}{n}\sum_{i=1}^n x_i\]

Example: Find the mean of 2, 15, 3, 8, 12, 5.

# Your code here
#
#

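# Note: this vector contains one extra value (100) beyond the six example values,
# so the mean below is 20.71 rather than 7.5 (the mean of the six values alone).
data <- c(2, 15, 3, 8, 12, 5, 100)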
Mean <- mean(data)
Mean

## [1] 20.71429
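As a quick check of the formula, the mean of the six example values can be computed by hand as the sum divided by the count. The sketch below uses the illustrative names `x` and `manual_mean` (not from the notes) and also shows how appending the single value 100 shifts the mean.

```r
# Six example values (x is an illustrative name)
x <- c(2, 15, 3, 8, 12, 5)

# Mean from the definition: sum of the values divided by n
manual_mean <- sum(x) / length(x)
manual_mean        # 7.5

# The built-in function gives the same answer
mean(x)            # 7.5

# One large value pulls the mean upward
mean(c(x, 100))    # about 20.71
```

The jump from 7.5 to about 20.71 after adding one value illustrates that the mean is sensitive to outliers.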

IQR(c(25,26,45,28,78))

## [1] 19

Median

The median is the middle value of a sorted data set (the average of the two middle values when \(n\) is even). It is resistant to outliers and is a good measure of center when the distribution is skewed.

Example: Find the median of 2, 15, 3, 8, 12, 5.

# Your code here
#
#

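# data still holds the seven values defined earlier (including 100), so this returns 8;
# the median of the six example values alone is 6.5.
median(data)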

## [1] 8
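To see what the median does, the sketch below sorts the six example values and picks out the middle by hand. The names `x`, `sorted_x`, and `manual_median` are illustrative, not from the notes.

```r
# Six example values (illustrative name)
x <- c(2, 15, 3, 8, 12, 5)

sorted_x <- sort(x)          # 2 3 5 8 12 15
n <- length(sorted_x)

if (n %% 2 == 0) {
  # Even n: average the two middle values
  manual_median <- mean(sorted_x[c(n / 2, n / 2 + 1)])
} else {
  # Odd n: take the single middle value
  manual_median <- sorted_x[(n + 1) / 2]
}

manual_median   # 6.5
median(x)       # 6.5
```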

Trimmed Mean (25% Trimmed Mean)

For a 25% trimmed mean, sort the data, omit 25% of the observations from each end, and take the mean of the remaining middle 50% of the observations. Because the extreme values are discarded, the trimmed mean is resistant to outliers.

Example: Find the 25% Trimmed Mean of 2, 15, 3, 8, 12, 5.

# Your code here
#
#

d <- c(2, 15, 3, 8, 12, 5)

mean(d, trim = .25)

## [1] 7
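The trimming step can be reproduced by hand, as in the sketch below (variable names are illustrative). With \(n = 6\), 25% of the observations is 1.5, and R's `mean()` rounds this down, so one value is dropped from each end.

```r
# Six example values (illustrative name)
x <- c(2, 15, 3, 8, 12, 5)

sorted_x <- sort(x)                  # 2 3 5 8 12 15
n <- length(sorted_x)
k <- floor(n * 0.25)                 # number dropped from each end: 1

kept <- sorted_x[(k + 1):(n - k)]    # 3 5 8 12
manual_trimmed <- mean(kept)

manual_trimmed                       # 7
mean(x, trim = 0.25)                 # 7
```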

2.2 Spread

The following are common measures of spread:

Range

Range is the difference between the largest and the smallest values. Range is sensitive to outliers.

IQR

The IQR (interquartile range) is the difference between the third and the first quartiles (the 75th and the 25th percentiles). The IQR is NOT sensitive to outliers.

Standard deviation

If \(x_1, x_2,...,x_n\) are \(n\) data values, then the sample standard deviation is:

\[s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}\]

Example: Find the Range, IQR and the Standard deviation of 2, 15, 3, 8, 12, 5.

# Your code here
#
#
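One possible solution is sketched below, with illustrative variable names. Note that `range()` in R returns the minimum and maximum rather than their difference, and that `IQR()` uses R's default quantile method, so hand calculations based on a different quartile rule may differ slightly.

```r
# Six example values (illustrative name)
x <- c(2, 15, 3, 8, 12, 5)

# Range: largest value minus smallest value
range(x)                        # 2 15 (min and max)
data_range <- max(x) - min(x)
data_range                      # 13

# IQR: third quartile minus first quartile
iqr_x <- IQR(x)
iqr_x                           # 7.5

# Standard deviation from the formula, then with the built-in function
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
manual_sd                       # about 5.17
sd(x)                           # about 5.17
```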

2.3 Shape

2.3.1 Five number summary

Five-number summary: the minimum, first quartile, median, third quartile, and maximum.

Example: Consider the 15 numbers 9,10,11,11,12,14,16,17,19,21,25,31,32,41,61. Find the five number summary.

# Your code here
#
#

data <- c(9,10,11,11,12,14,16,17,19,21,25,31,32,41,61)
summary(data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    11.5    17.0    22.0    28.0    61.0

Five number summary is:

Min =
\(Q_1\) =
Median =
\(Q_3\) =
Max =
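If only the five numbers are wanted, without the mean that `summary()` also reports, `quantile()` or `fivenum()` can be used. The sketch below uses the illustrative name `x`. The two functions use slightly different quartile rules, so they agree for this data set but need not agree in general.

```r
# The 15 example values (illustrative name)
x <- c(9, 10, 11, 11, 12, 14, 16, 17, 19, 21, 25, 31, 32, 41, 61)

# Minimum, Q1, median, Q3, maximum using R's default quantile method
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num

# Tukey's five-number summary (based on hinges)
tukey_five <- fivenum(x)
tukey_five
```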

2.3.2 Boxplots

A boxplot is a type of graph that can be used to visualize the five-number summary.

Create a boxplot for the following 21 data values.

5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21, 21, 22, 23, 24, 32, 40, 43, 49.

library(ggplot2)

d <- c(5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21, 21, 22, 23, 24, 32, 40, 43, 49)

d <- data.frame(d)

ggplot(data = d, aes(y = d)) + geom_boxplot()

Notes about a boxplot:

[Figure: Boxplot]
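The numbers behind a boxplot can be worked out directly. The sketch below assumes the common 1.5 × IQR rule, which is the default in `geom_boxplot()`: points farther than 1.5 × IQR beyond the quartiles are drawn individually as outliers, and the whiskers extend to the most extreme points inside those fences. Variable names are illustrative.

```r
# The 21 example values (illustrative name)
values <- c(5, 6, 6, 8, 9, 11, 11, 14, 17, 17, 19, 20, 21, 21,
            22, 23, 24, 32, 40, 43, 49)

q1 <- quantile(values, 0.25, names = FALSE)   # 11
q3 <- quantile(values, 0.75, names = FALSE)   # 23
iqr <- q3 - q1                                # 12

# Fences for the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr                 # -7
upper_fence <- q3 + 1.5 * iqr                 # 41

# Points outside the fences are plotted as outliers
outliers <- values[values < lower_fence | values > upper_fence]
outliers                                      # 43 49

# Whiskers end at the most extreme points inside the fences
inside <- values[values >= lower_fence & values <= upper_fence]
whiskers <- range(inside)
whiskers                                      # 5 40
```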

2.3.3 Normal Distribution

The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the “bell curve” because the normal distribution is bell shaped and symmetric. We would “like” our data to be normally distributed.

Here is an example: Check whether the data below are normally distributed.

11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22, 7.35, 16.45, 5.37, 10.25, 12.41, 11.43, 8.00, 11.90, 17.86, 7.59, 11.88, 14.54, 14.44, 10.29, 5.47, 8.20, 10.56, 11.18

1. Create a histogram and check if it is “Normal” enough!

# Your code here
#
#

library(ggplot2)


data <- c(11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22,  7.35, 16.45,  5.37, 10.25, 12.41, 11.43,  8.00, 11.90, 17.86,  7.59, 11.88, 14.54, 14.44, 10.29,  5.47,  8.20, 10.56, 11.18)

MyDataFrame <- data.frame(data)
MyDataFrame

##     data
## 1  11.62
## 2  15.92
## 3  12.40
## 4  14.40
## 5  10.28
## 6  13.15
## 7  11.22
## 8   7.35
## 9  16.45
## 10  5.37
## 11 10.25
## 12 12.41
## 13 11.43
## 14  8.00
## 15 11.90
## 16 17.86
## 17  7.59
## 18 11.88
## 19 14.54
## 20 14.44
## 21 10.29
## 22  5.47
## 23  8.20
## 24 10.56
## 25 11.18

ggplot(data = MyDataFrame, aes(x = data)) +
  geom_histogram(bins = 10)

2. Check using a normal quantile plot (QQ plot). This is the preferred way!

If the plotted points fall close to the reference line, then the data are approximately normal.

ggplot(data = MyDataFrame, aes(sample = data)) +
  geom_qq() + geom_qq_line()
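To see what a QQ plot displays, the sketch below builds its coordinates by hand in base R: the sorted data values are paired with the quantiles a standard normal distribution would give at evenly spaced probabilities. The use of `ppoints()` mirrors base R's `qqnorm()`; the variable names are illustrative, and the correlation at the end is only an informal summary of how straight the plot is, not a formal test.

```r
# The 25 example values (illustrative name)
x <- c(11.62, 15.92, 12.40, 14.40, 10.28, 13.15, 11.22, 7.35, 16.45,
       5.37, 10.25, 12.41, 11.43, 8.00, 11.90, 17.86, 7.59, 11.88,
       14.54, 14.44, 10.29, 5.47, 8.20, 10.56, 11.18)

n <- length(x)

# Vertical coordinates: the data, sorted from smallest to largest
sample_q <- sort(x)

# Horizontal coordinates: standard normal quantiles at n evenly spaced probabilities
theoretical_q <- qnorm(ppoints(n))

# A value close to 1 means the points lie nearly on a straight line
qq_cor <- cor(theoretical_q, sample_q)
qq_cor
```

Calling `plot(theoretical_q, sample_q)` on these vectors would draw the same points as the QQ plot.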

Example: Exercise 2.4 (modified)

The data sets used in the MSWR book are stored in the library resampledata. Import the flight delays data set described in the case study in Section 1.1 of MSWR. Also load the libraries ggplot2 and dplyr.

library(resampledata) # dataset for MSWR book is in this library

## 
## Attaching package: 'resampledata'

## The following object is masked from 'package:datasets':
## 
##     Titanic

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(FlightDelays)

Create an appropriate graph of the departure times (DepartTime).

head(FlightDelays)

##   ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
## 1  1      UA      403         DEN      4-8am Fri   May          281    -1
## 2  2      UA      405         DEN     8-Noon Fri   May          277   102
## 3  3      UA      409         DEN      4-8pm Fri   May          279     4
## 4  4      UA      511         ORD     8-Noon Fri   May          158    -2
## 5  5      UA      667         ORD      4-8am Fri   May          143    -3
## 6  6      UA      669         ORD      4-8am Fri   May          150     0
##   Delayed30
## 1        No
## 2       Yes
## 3        No
## 4        No
## 5        No
## 6        No

str(FlightDelays)

## 'data.frame':    4029 obs. of  10 variables:
##  $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Carrier     : Factor w/ 2 levels "AA","UA": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FlightNo    : int  403 405 409 511 667 669 673 677 679 681 ...
##  $ Destination : Factor w/ 7 levels "BNA","DEN","DFW",..: 2 2 2 6 6 6 6 6 6 6 ...
##  $ DepartTime  : Factor w/ 5 levels "4-8am","8-Noon",..: 1 2 4 2 1 1 2 2 3 3 ...
##  $ Day         : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Month       : Factor w/ 2 levels "May","June": 1 1 1 1 1 1 1 1 1 1 ...
##  $ FlightLength: int  281 277 279 158 143 150 158 160 160 163 ...
##  $ Delay       : int  -1 102 4 -2 -3 0 -5 0 10 60 ...
##  $ Delayed30   : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...

ggplot(data = FlightDelays, aes(x = DepartTime)) +
  geom_bar(fill = "orchid4")

Create a table of the departure times (DepartTime).

table(FlightDelays$DepartTime)

## 
##    4-8am   8-Noon Noon-4pm    4-8pm    8-Mid 
##      699     1053     1048      972      257

Create a contingency table of the variables Day and Delayed30. For each day, what is the proportion of flights delayed at least 30 min?

table(FlightDelays$Day, FlightDelays$Delayed30)

##      
##        No Yes
##   Sun 507  44
##   Mon 569  61
##   Tue 535  93
##   Wed 488  76
##   Thu 434 132
##   Fri 493 144
##   Sat 406  47
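Row proportions can also be read straight off a contingency table with `prop.table()`. The sketch below re-enters the counts printed above as a matrix so that it runs on its own; the names `counts` and `row_props` are illustrative.

```r
# Counts copied from the contingency table above (rows: Day, columns: Delayed30)
counts <- matrix(
  c(507, 44, 569, 61, 535, 93, 488, 76, 434, 132, 493, 144, 406, 47),
  ncol = 2, byrow = TRUE,
  dimnames = list(Day = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                  Delayed30 = c("No", "Yes"))
)

# margin = 1 divides each entry by its row total, so every row sums to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 4)

# With the real data loaded, the same idea would be:
# prop.table(table(FlightDelays$Day, FlightDelays$Delayed30), margin = 1)
```

The "Yes" column then gives, for each day, the proportion of flights delayed at least 30 min.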

Note:

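The pipe operator %>% (available once dplyr is loaded) passes the result of one step to the next function as its first argument. This allows you, for example, to:

filter a data frame to focus on only a few rows, then
group_by another variable to create groups, then
summarize the grouped data to calculate the mean (for example) for each level of the group.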
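The three steps can be sketched on a tiny made-up data frame. Here `toy_flights` and its values are hypothetical and only loosely modeled on the flight delays data; the dplyr functions `filter()`, `group_by()`, and `summarize()` are the ones named above.

```r
library(dplyr)

# Hypothetical toy data, not taken from FlightDelays
toy_flights <- data.frame(
  Carrier = c("AA", "AA", "UA", "UA", "UA"),
  Delay   = c(10, -2, 30, 0, 6)
)

result <- toy_flights %>%
  filter(Delay >= 0) %>%                 # keep only flights that did not leave early
  group_by(Carrier) %>%                  # one group per carrier
  summarize(mean_delay = mean(Delay))    # mean delay within each group

result
```

Each line hands its result to the next, so the code reads top to bottom in the same order as the steps.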

library(knitr)

# Harder way!

day_delayed30 <- FlightDelays %>%
  group_by(Day) %>%
  summarize(pro = sum(Delayed30 == "Yes")/sum(Delayed30 == "Yes"|Delayed30 == "No"))

kable(day_delayed30)

Day	pro
Sun	0.0798548
Mon	0.0968254
Tue	0.1480892
Wed	0.1347518
Thu	0.2332155
Fri	0.2260597
Sat	0.1037528

# Easy and preferred way!

day_delayed30 <- FlightDelays %>%
  group_by(Day) %>%
  summarize(pro = mean(Delayed30 == "Yes"))

kable(day_delayed30)

Day	pro
Sun	0.0798548
Mon	0.0968254
Tue	0.1480892
Wed	0.1347518
Thu	0.2332155
Fri	0.2260597
Sat	0.1037528
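The shorter version works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch with a made-up vector (`delayed` is hypothetical):

```r
# Hypothetical responses for five flights
delayed <- c("No", "Yes", "No", "No", "Yes")

# Logical vector: TRUE where the flight was delayed
is_delayed <- delayed == "Yes"
is_delayed                          # FALSE TRUE FALSE FALSE TRUE

# Count of TRUE values divided by the number of values
long_way <- sum(is_delayed) / length(is_delayed)

# Mean of the logical vector gives the same proportion
short_way <- mean(is_delayed)

long_way                            # 0.4
short_way                           # 0.4
```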

Create side by side boxplots of the lengths of the flights, grouped by whether or not the flight was delayed at least 30 min.

ggplot(data = FlightDelays, aes(x = Delayed30, y = FlightLength)) + geom_boxplot()

Do you think that there is a relationship between the length of a flight and whether or not the departure is delayed by at least 30 min?

Exercise 2.5: You try!

library(resampledata)

data(GSS2002)

str(GSS2002)

## 'data.frame':    2765 obs. of  21 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Region       : Factor w/ 7 levels "Mid-Atl","Mountain",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Gender       : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 2 1 1 2 1 ...
##  $ Race         : Factor w/ 3 levels "Black","Other",..: 3 3 3 3 3 3 3 3 3 2 ...
##  $ Education    : Factor w/ 5 levels "Left HS","HS",..: 2 4 2 1 1 2 4 2 2 2 ...
##  $ Marital      : Factor w/ 5 levels "Divorced","Married",..: 1 2 4 1 1 1 2 2 1 3 ...
##  $ Religion     : Factor w/ 13 levels "Buddhism","Catholic",..: 5 13 13 13 13 2 13 13 2 2 ...
##  $ Happy        : Factor w/ 3 levels "Not too happy",..: 2 2 NA NA NA 2 NA NA 1 NA ...
##  $ Income       : Factor w/ 24 levels "under 1000","1000-2999",..: 16 21 17 19 18 18 NA NA 20 1 ...
##  $ PolParty     : Factor w/ 8 levels "Ind","Ind, Near Dem",..: 8 5 8 2 1 3 8 1 8 3 ...
##  $ Politics     : Factor w/ 7 levels "Conservative",..: 1 1 NA NA NA 1 NA NA 1 NA ...
##  $ Marijuana    : Factor w/ 2 levels "Legal","Not legal": NA 2 NA NA NA NA NA NA 1 NA ...
##  $ DeathPenalty : Factor w/ 2 levels "Favor","Oppose": 1 1 NA NA NA 1 NA NA 1 NA ...
##  $ OwnGun       : Factor w/ 3 levels "No","Refused",..: 1 3 NA NA NA 3 NA NA 3 NA ...
##  $ GunLaw       : Factor w/ 2 levels "Favor","Oppose": 1 2 NA NA NA 2 NA NA 2 NA ...
##  $ SpendMilitary: Factor w/ 3 levels "About right",..: 2 1 NA 1 NA 2 NA 2 NA 3 ...
##  $ SpendEduc    : Factor w/ 3 levels "About right",..: 2 2 NA 2 NA 2 NA 2 NA 2 ...
##  $ SpendEnv     : Factor w/ 3 levels "About right",..: 1 1 NA 2 NA 2 NA 1 NA 2 ...
##  $ SpendSci     : Factor w/ 3 levels "About right",..: 1 1 NA 2 NA 2 NA 1 NA 2 ...
##  $ Pres00       : Factor w/ 5 levels "Bush","Didnt vote",..: 1 1 1 NA NA 1 1 1 1 NA ...
##  $ Postlife     : Factor w/ 2 levels "No","Yes": 2 2 NA NA NA 2 NA NA 2 NA ...

Create a table and a bar chart of the responses to the question about the death penalty.

table(GSS2002$DeathPenalty) # table

## 
##  Favor Oppose 
##    899    409

ggplot(data = GSS2002, aes(x = DeathPenalty)) + geom_bar() # barchart

Use the table command and the summary command in R on the gun ownership variable. What additional information does the summary command give that the table command does not?

table(GSS2002$OwnGun)

## 
##      No Refused     Yes 
##     605       9     310

summary(GSS2002$OwnGun)

##      No Refused     Yes    NA's 
##     605       9     310    1841
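The difference between the two outputs comes from how missing values are handled. As a small sketch on a made-up factor (`answers` is hypothetical), `table()` drops NA values by default but can be asked to include them with its `useNA` argument:

```r
# Hypothetical factor with some missing responses
answers <- factor(c("No", "Yes", NA, "No", NA, NA))

# Default: missing values are silently left out
default_counts <- table(answers)
default_counts

# Ask table() to count missing values as well
full_counts <- table(answers, useNA = "ifany")
full_counts

# summary() on a factor reports the NA count automatically
factor_summary <- summary(answers)
factor_summary
```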

Create a contingency table displaying the relationship between opinions about the death penalty and gun ownership.

table(GSS2002$DeathPenalty, GSS2002$OwnGun)

##         
##           No Refused Yes
##   Favor  375       7 243
##   Oppose 199       2  59

What proportion of gun owners favor the death penalty? Does it appear to be different from the proportion among those who do not own guns?

library(knitr)

death_gun <- GSS2002 %>%
  group_by(OwnGun) %>%
  summarize(ProInFavor = mean(DeathPenalty == "Favor", na.rm = TRUE))

kable(death_gun)

OwnGun	ProInFavor
No	0.6533101
Refused	0.7777778
Yes	0.8046358
NA	0.6477541

Lecture Notes for Ch 2 of MSWR: Exploratory Data Analysis

Dr. Hasthika Rupasinghe

Feb 10, 2021 at 01:56:01 PM