1 What is Statistical Learning?

Statistical learning refers to a ________________.

These tools can be classified as:

1.1 Supervised Learning

This involves building a statistical model for ________________, or ________________, an output based on ________________. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.

Examples:

  1. Spam detection: Using supervised classification algorithms, organizations can train a model on messages that are already labeled as spam or not spam, so that it recognizes patterns or anomalies in new messages and sorts spam from non-spam correspondence.

  2. Predicting house or property prices

1.2 Unsupervised Learning

Here, there are ________________ but no supervising ________________; nevertheless we can learn relationships and structure from such data.

Examples:

  1. Data exploration
  2. Customer segmentation: Suppose we work for a company that sells clothes and we have data from previous customers: how much they spent, their ages, and the day they bought the product. Our task is to find patterns or relationships among these variables that give the company useful information, so that it can create marketing strategies, decide which type of client to focus on to maximize profit, or identify customer segments where more effort could expand its market.
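The customer segmentation example can be sketched in R as follows. This is only an illustration under stated assumptions: the customer data are simulated (they are not real data from any company), the purchase day is left out for simplicity, and k-means with two groups is one possible choice of method, not the only one.

```r
# Simulated (made-up) customer data, for illustration only.
set.seed(1)
customers <- data.frame(
  spent = c(rnorm(50, mean = 40, sd = 8), rnorm(50, mean = 150, sd = 20)),
  age   = c(rnorm(50, mean = 25, sd = 4), rnorm(50, mean = 50, sd = 6))
)

# Standardize so that both variables contribute on a comparable scale.
customers_scaled <- scale(customers)

# Partition the customers into K = 2 groups (K = 2 is an assumption).
# No output variable is used: the grouping relies on the inputs alone.
km <- kmeans(customers_scaled, centers = 2, nstart = 20)

# Size of each segment and its average profile on the original scale.
table(km$cluster)
aggregate(customers, by = list(segment = km$cluster), FUN = mean)
```

The segment averages are what a marketing team would typically inspect. In practice the number of groups is not known in advance and has to be chosen by the analyst.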

1.3 Data sets

To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets.

  1. Wage data — Wage
  2. Stock Market data — Smarket
  3. Gene Expression data — NCI60

1.3.1 Wage data — Wage

In this application we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand how an employee’s age and education, as well as the calendar year, are associated with his wage.

library(ISLR)
data(Wage)
head(Wage)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
?Wage
# dim(Wage)  # returns the number of rows and columns in the data set
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

Consider, for example, wage versus age for each of the individuals in the data set.

  1. Create a scatter plot (shown here) for wage versus age

  2. Describe the plot you created in 1)

  3. Create a scatter plot (shown here) for wage versus year

  4. Describe the plot you created in 3)

  5. Create a boxplot (shown here) for wage for each education level.

  6. Describe the plot you created in 5)
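One possible way to produce the three plots with base R graphics is sketched below. The axis labels, colors, and the added trend lines are choices made here for illustration; other plotting approaches work just as well.

```r
library(ISLR)  # provides the Wage data set
data(Wage)

# Wage versus age, with a lowess smooth to show the overall trend.
plot(Wage$age, Wage$wage, xlab = "Age", ylab = "Wage",
     col = "darkgrey", cex = 0.5)
lines(lowess(Wage$age, Wage$wage), col = "blue", lwd = 2)

# Wage versus year, with a least-squares line.
plot(Wage$year, Wage$wage, xlab = "Year", ylab = "Wage",
     col = "darkgrey", cex = 0.5)
abline(lm(wage ~ year, data = Wage), col = "blue", lwd = 2)

# Wage for each education level.
boxplot(wage ~ education, data = Wage,
        xlab = "Education level", ylab = "Wage")
```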

Clearly, the most accurate prediction of a given man’s wage will be obtained by combining his age, his education, and the year.

Note:

The Wage data involves predicting a ________________ output value. 

This is often referred to as a ________________ problem.
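As a minimal sketch of combining age, education, and year to predict wage, the code below fits a linear model with `lm()`. The linear form is an assumption made here for simplicity (the relationship between wage and age need not be linear), and the individual in `new_person` is hypothetical.

```r
library(ISLR)
data(Wage)

# Linear model with all three predictors; education enters as a factor.
fit <- lm(wage ~ age + year + education, data = Wage)
summary(fit)

# Predicted wage for a hypothetical 40-year-old college graduate in 2008.
new_person <- data.frame(
  age       = 40,
  year      = 2008,
  education = factor("4. College Grad", levels = levels(Wage$education))
)
predict(fit, newdata = new_person)
```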

1.3.2 Stock Market data — Smarket

In this case we instead wish to predict a non-numerical value—that is, a ________________ output.

library(ISLR)
data("Smarket")
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
?Smarket

The goal is to predict whether the index will increase or decrease on a given day using the past 5 days’ percentage changes in the index.

Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day’s stock market performance will fall into the ___ bucket or the ____ bucket.
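A minimal sketch of such a prediction is given below, using logistic regression via `glm()`. Logistic regression is one method chosen here for illustration; the notes above do not prescribe it. The 0.5 cutoff is also an assumption, and the accuracy is computed on the same data used for fitting, so it is optimistic.

```r
library(ISLR)
data(Smarket)

# Model the probability that the market goes Up from the five lagged returns.
fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5,
           data = Smarket, family = binomial)

# Fitted probabilities of "Up" ("Down" is the first factor level).
prob_up <- predict(fit, type = "response")

# Convert probabilities to a predicted direction using a 0.5 cutoff.
pred <- ifelse(prob_up > 0.5, "Up", "Down")

# Compare predictions with what actually happened (training data only).
table(predicted = pred, actual = Smarket$Direction)
mean(pred == Smarket$Direction)
```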

Note:

This is known as a _____________ problem.

  1. Create a boxplot (shown here) of yesterday’s percentage change (Lag1) for each value of the Direction variable

  2. Is there any indication that there is an association between the past and present performance of the stock market?
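One way to draw the boxplot, and to compare the two groups numerically, is sketched below. The axis labels are choices made here; `Lag1` is the previous day's percentage return in the `Smarket` data.

```r
library(ISLR)
data(Smarket)

# Previous day's percentage change, split by today's direction.
boxplot(Lag1 ~ Direction, data = Smarket,
        xlab = "Today's direction",
        ylab = "Percentage change in the index, previous day (Lag1)")

# Numerical summaries of Lag1 within each direction.
tapply(Smarket$Lag1, Smarket$Direction, summary)
```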

1.3.3 Gene Expression data — NCI60

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe ___________ variables, with no corresponding ______________.

Example: In a marketing setting, we might have demographic information for a number of current customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics.

Note:

This is known as a _____________ problem.

We consider the NCI60 data set, which consists of 6830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.
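A rough sketch of how one might start exploring these data is shown below. It projects the cell lines onto the first two principal components so they can be plotted, then applies hierarchical clustering. Standardizing the genes, using complete linkage, and cutting the tree into four groups are all assumptions made here for illustration. The cancer type labels in `NCI60$labs` are used only afterwards, to interpret the groups.

```r
library(ISLR)
data(NCI60)

expr <- NCI60$data  # matrix: one row per cell line, one column per gene
labs <- NCI60$labs  # cancer types; not used to form the clusters
dim(expr)

# Reduce thousands of measurements to two summary dimensions for plotting.
pca <- prcomp(expr, scale. = TRUE)
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2", pch = 19)

# Hierarchical clustering of the cell lines on standardized expression values.
hc <- hclust(dist(scale(expr)), method = "complete")
clusters <- cutree(hc, k = 4)  # k = 4 is an arbitrary choice

# How do the discovered groups line up with the known cancer types?
table(clusters, labs)
```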

2 What do we cover in this class?