Machine Learning with R

Ilan Man
Strategy Operations @ Squarespace

Agenda

  1. Machine Learning Overview
  2. Exploring Data
  3. Nearest Neighbors
  4. Naive Bayes
  5. Measuring Performance
  6. Linear Regression

Machine Learning Overview

What is it?

  • Field of study interested in transforming data into intelligent actions
  • Intersection of statistics, available data and computing power
  • It is NOT data mining
  • Data mining is an exploratory exercise, whereas most machine learning tasks have a known target to predict
  • Data mining is a subset of machine learning (unsupervised)

Machine Learning Overview

Uses

  • Predict outcome of elections
  • Email filtering - spam or not
  • Credit fraud prediction
  • Image processing
  • Customer churn
  • Customer subscription rates

Machine Learning Overview

How do machines learn?

  • Data input
    • Provides a factual basis for reasoning
  • Abstraction
  • Generalization

Machine Learning Overview

Abstraction

  • Assign meaning to the data
  • Formulas, graphs, logic, etc...
  • Your model
  • Fitting model is called training

Machine Learning Overview

Generalization

  • Turn abstracted knowledge into something that can be utilized
  • Model uses heuristics since it cannot see every example
    • When heuristics are systematically wrong, the algorithm has a bias
  • Very simple models have high bias
    • Some bias is good - it lets us ignore the noise

Machine Learning Overview

Generalization

  • After training, the model is tested on unseen data
  • Perfect generalization is exceedingly rare
    • Partly due to noise
    • Measurement error
    • Change in user behavior
    • Incorrect data, erroneous values, etc...
  • Fitting too closely to the noise leads to overfitting
    • Complex models have high variance
    • Good on training, bad on testing

Machine Learning Overview

Steps to apply Machine Learning

  1. Collect data
  2. Explore and preprocess data
    • Majority of the time is spent in this stage
  3. Train the model
    • Specific tasks will inform which algorithm is appropriate
  4. Evaluate model performance
    • Performance measures depend on use case
  5. Improve model performance as necessary

Machine Learning Overview

Choosing an algorithm

  • Consider input data
  • An example is one data point from which the machine is intended to learn
  • A feature is a characteristic of the example
    • e.g. Number of times the word "viagra" appears in an email
  • For classification problems, a label is the example's classification
  • Most algorithms require data in matrix format because Math said so
  • Features can be numeric, categorical/nominal or ordinal

Machine Learning Overview

Types of algorithms

  • Supervised
    • Discover relationship between known, target feature and other features
    • Predictive
    • Classification and numeric prediction tasks
  • Unsupervised
    • Unknown answer
    • Descriptive
    • Pattern discovery and clustering into groups
    • Requires human intervention to interpret clusters

Machine Learning Overview

Summary

  1. Generalization and Abstraction
  2. Overfitting vs underfitting
  3. The right algorithm will be informed by the problem to be solved
  4. Terminology

Exploring Data

Exploring and understanding data

  • Load and explore the data
data(iris)

# inspect the structure of the dataset
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Exploring Data

Exploring and understanding data

# summarize the data  - five number summary
summary(iris[,1:4])
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5

Exploring Data

Exploring and understanding data

  • Measures of central tendency: mean and median
    • Mean is sensitive to outliers
    • Trimmed mean: drop a fraction of the most extreme values before averaging (see the snippet below)
    • Median is resistant

plot of chunk skew
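A quick illustration in R, using the iris data loaded earlier (the 10% trim is an arbitrary choice):

mean(iris$Sepal.Length)              # ordinary mean
mean(iris$Sepal.Length, trim = 0.1)  # trimmed mean: drops the top and bottom 10% before averaging
median(iris$Sepal.Length)            # resistant to outliers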

Exploring Data

Exploring and understanding data

  • Measures of dispersion
    • Range is the max() - min()
    • Interquartile range (IQR) is the Q3 - Q1
    • Quantile
quantile(iris$Sepal.Length, probs = c(0.10,0.50,0.99))
10% 50% 99% 
4.8 5.8 7.7 
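The other dispersion measures are also one-liners in base R; a quick sketch on the same column:

range(iris$Sepal.Length)                            # c(min, max)
diff(range(iris$Sepal.Length))                      # the range as a single number: max - min
IQR(iris$Sepal.Length)                              # Q3 - Q1
quantile(iris$Sepal.Length, probs = c(0.25, 0.75))  # the quartiles behind the IQR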

Exploring Data

Visualizing - Boxplots

  • Lets you see the spread in the data

plot of chunk boxplot

Exploring Data

Visualizing - histograms

  • Each bar is a 'bin'
  • Height of bar is the frequency (count of) that bin
  • Some distributions are normally distributed (bell shaped) or skewed (heavy tails)

plot of chunk histogram

Exploring Data

Visualizing - scatterplots

  • Useful for visualizing bivariate relationships (2 variables)

plot of chunk scatterplot

Exploring Data

Summary

  • Measures of central tendency and dispersion
  • Visualizing data using histograms, boxplots, scatterplots
  • Skewed vs normally distributed data

K-Nearest neighbors

Classification using kNN

  • Understanding the algorithm
  • Data Preparation
  • Case study: diagnosing breast cancer
  • Summary

K-Nearest neighbors

The Concept

  • Things that are similar are probably of the same class
  • Good for: when it's difficult to define, but "you know it when you see it"
  • Bad for: when a clear distinction doesn't exist

K-Nearest neighbors

The Algorithm

plot of chunk unnamed-chunk-1

K-Nearest neighbors

The Algorithm

plot of chunk shaded plot

K-Nearest neighbors

The Algorithm

plot of chunk new point

  • Suppose we had a new point with Sepal Length of 7 and Petal Length of 4
  • Which species will it probably belong to?

K-Nearest neighbors

The Algorithm

  • Calculate its nearest neighbor
    • Euclidean distance
    • \(dist(p,q) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+ ... + (p_n-q_n)^2}\)
    • Closest neighbor -> 1-NN
    • 3 closest neighbors -> 3-NN.
    • Winner is the majority class of the \(k\) neighbors (see the sketch below)
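A minimal sketch of this vote for the new point above (Sepal Length 7, Petal Length 4), using only those two iris columns; k = 3 here is just for illustration:

new_pt <- c(7, 4)   # Sepal Length, Petal Length of the new point
feats <- iris[, c("Sepal.Length", "Petal.Length")]

# Euclidean distance from the new point to every flower in iris
d <- sqrt((feats$Sepal.Length - new_pt[1])^2 + (feats$Petal.Length - new_pt[2])^2)

iris$Species[which.min(d)]          # 1-NN: class of the single closest flower
table(iris$Species[order(d)[1:3]])  # 3-NN: majority vote of the 3 closest flowers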

K-Nearest neighbors

The Algorithm

  • Calculate its nearest neighbor
    • Euclidean distance
    • \(dist(p,q) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+ ... + (p_n-q_n)^2}\)
    • Closest neighbor -> 1-NN
    • 3 closest neighbors -> 3-NN.
    • Winner is the majority class of the \(k\) neighbors
  • Why not just fit to all data points?

K-Nearest neighbors

Bias vs. Variance

  • Fitting to every point results in an overfit model
    • High variance problem
  • Fitting to only 1 point results in an underfit model
    • High bias problem
  • Choosing the right \(k\) is a balance between bias and variance
  • Rule of thumb: \(k = \sqrt{N}\)

K-Nearest neighbors

Data preparation

  • Classify houses based on prices and square footage
library(ggplot2) # plotting
library(scales)  # format ggplot() axis
price <- seq(300000,600000,by=10000)
size <- price/1000 + rnorm(length(price),10,50)
houses <- data.frame(price,size)
ex <- ggplot(houses,aes(price,size))+geom_point()+scale_x_continuous(labels = comma)+
  xlab("Price")+ylab("Size")+ggtitle("Square footage vs Price")

K-Nearest neighbors

Data Preparation

plot of chunk house price plot

K-Nearest neighbors

Data Preparation

plot of chunk clusters

K-Nearest neighbors

Data Preparation

plot of chunk new cluster plot

K-Nearest neighbors

Data Preparation

# 1) using loops; new_p holds the new point's price and size
loop_dist <- numeric(nrow(houses))
for(i in 1:nrow(houses)){
  loop_dist[i] <- sqrt(sum((new_p-houses[i,])^2))
}

# 2) vectorized
vec_dist <- sqrt(rowSums(t(new_p-t(houses))^2))
closest <- data.frame(houses[which.min(vec_dist),])
print(closest)
   price  size
11 4e+05 468.6

K-Nearest neighbors

Data Preparation

plot of chunk unnamed-chunk-2

K-Nearest neighbors

Data Preparation

plot of chunk unnamed-chunk-3

K-Nearest neighbors

Data Preparation

  • Feature scaling puts every feature on a comparable scale. Two common approaches (sketched below):
  • min-max normalization
    • \(X_{new} = \frac{X-min(X)}{max(X) - min(X)}\)
  • z-score standardization
    • \(X_{new} = \frac{X-mean(X)}{sd(X)}\)
  • Euclidean distance doesn't discriminate between important and noisy features
    • can add weights
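Both transforms are one-liners in R; a sketch using the houses data from the earlier slide (scale() is the built-in z-score version):

min_max <- function(x) (x - min(x)) / (max(x) - min(x))  # rescale to [0, 1]
z_score <- function(x) (x - mean(x)) / sd(x)             # equivalent to scale(x)

summary(min_max(houses$price))  # now bounded between 0 and 1
summary(z_score(houses$size))   # now mean 0, standard deviation 1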

K-Nearest neighbors

Data Preparation

new_house <- scale(houses)  # z-score both columns
new_new <- c((new_p[1]-mean(houses[,1]))/sd(houses[,1]),(new_p[2]-mean(houses[,2]))/sd(houses[,2]))

vec_dist <- sqrt(rowSums(t(new_new-t(new_house))^2))
which.min(vec_dist)
## [1] 8

K-Nearest neighbors

Data Preparation

plot of chunk scale_closest_plot

K-Nearest neighbors

Lazy learner

  • kNN doesn't actually learn anything!
  • Stores training data and applies it - verbatim - to new examples
  • Known as instance-based learning
  • Non-parametric learning method
  • Harder for us to understand how the classifier is using the data
  • However, kNN finds natural patterns
  • No need to force the data into an arbitrary model

K-Nearest neighbors

Case study

data <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', sep=',', stringsAsFactors=FALSE, header=FALSE)

# first column has the ID which is not useful
data <- data[,-1]
# names taken from the .names file online
n<-c("radius","texture","perimeter","area","smoothness","compactness",
   "concavity","concave_points","symmetry","fractal")
ind<-c("mean","std","worst")

headers <- as.character()
for(i in ind){
  headers <- c(headers, paste(n, i))
}
names(data)<-c("diagnosis",headers)

K-Nearest neighbors

Case study

str(data[,1:10])
'data.frame':   569 obs. of  10 variables:
 $ diagnosis          : chr  "M" "M" "M" "M" ...
 $ radius mean        : num  18 20.6 19.7 11.4 20.3 ...
 $ texture mean       : num  10.4 17.8 21.2 20.4 14.3 ...
 $ perimeter mean     : num  122.8 132.9 130 77.6 135.1 ...
 $ area mean          : num  1001 1326 1203 386 1297 ...
 $ smoothness mean    : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ compactness mean   : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ concavity mean     : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ concave_points mean: num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ symmetry mean      : num  0.242 0.181 0.207 0.26 0.181 ...

K-Nearest neighbors

Case study

# inspect remaining data more closely
prop.table(table(data$diagnosis)); head(data)[2:6]

     B      M 
0.6274 0.3726 
  radius mean texture mean perimeter mean area mean smoothness mean
1       17.99        10.38         122.80    1001.0         0.11840
2       20.57        17.77         132.90    1326.0         0.08474
3       19.69        21.25         130.00    1203.0         0.10960
4       11.42        20.38          77.58     386.1         0.14250
5       20.29        14.34         135.10    1297.0         0.10030
6       12.45        15.70          82.57     477.1         0.12780

K-Nearest neighbors

Case study

# scale each numeric value
scaled_data<-as.data.frame(lapply(data[,-1], scale))
scaled_data<-cbind(diagnosis=data$diagnosis, scaled_data)
head(scaled_data[2:6])
  radius.mean texture.mean perimeter.mean area.mean smoothness.mean
1      1.0961      -2.0715         1.2688    0.9835          1.5671
2      1.8282      -0.3533         1.6845    1.9070         -0.8262
3      1.5785       0.4558         1.5651    1.5575          0.9414
4     -0.7682       0.2535        -0.5922   -0.7638          3.2807
5      1.7488      -1.1508         1.7750    1.8246          0.2801
6     -0.4760      -0.8346        -0.3868   -0.5052          2.2355

K-Nearest neighbors

Case study

library(class)  # get k-NN classifier
predict_1 <- knn(train = scaled_data[,2:31], test = scaled_data[,2:31],
                 cl = scaled_data[,1],
                 k = floor(sqrt(nrow(scaled_data))))                  
table(predict_1)
predict_1
  B   M 
378 191 

K-Nearest neighbors

Case study

pred_B <- which(predict_1=="B")
actual_B <- which(scaled_data[,1]=="B")
pred_M <- which(predict_1=="M")
actual_M <- which(scaled_data[,1]=="M")
true_positive <- sum(pred_B %in% actual_B)
true_negative <- sum(pred_M %in% actual_M)
false_positive <- sum(pred_B %in% actual_M)
false_negative <- sum(pred_M %in% actual_B)

conf_mat <- matrix(c(true_positive,false_positive,false_negative,true_negative),nrow=2,ncol=2)

acc <- sum(diag(conf_mat))/sum(conf_mat)
tpr <- conf_mat[1,1]/sum(conf_mat[1,])  # true positive rate
tn <- conf_mat[2,2]/sum(conf_mat[2,])   # true negative rate

K-Nearest neighbors

Case study

   acc    tpr     tn 
0.9596 0.9972 0.8962 
          pred_B  pred_M
actual_B     356       1
actual_M      22     190
  • Is that right?

K-Nearest neighbors

Case study

# create randomized training and testing sets
total_n <- nrow(scaled_data)

# train on 2/3 of the data
train_ind <- sample(total_n,total_n*2/3)
train_labels <- scaled_data[train_ind,1]
test_labels <- scaled_data[-train_ind,1]
train_set <- scaled_data[train_ind,2:31]
test_set <- scaled_data[-train_ind,2:31]

K-Nearest neighbors

Case study

library(class)
predict_1 <- knn(train = train_set, test = test_set, cl = train_labels,
                 k = floor(sqrt(nrow(train_set))))                  
table(predict_1)
predict_1
  B   M 
137  53 

K-Nearest neighbors

Case study

pred_B <- which(predict_1=="B")
test_B <- which(test_labels=="B")
pred_M <- which(predict_1=="M")
test_M <- which(test_labels=="M")
true_positive <- sum(pred_B %in% test_B)
true_negative <- sum(pred_M %in% test_M)
false_positive <- sum(pred_B %in% test_M)
false_negative <- sum(pred_M %in% test_B)

conf_mat<-matrix(c(true_positive,false_positive,false_negative,true_negative),nrow=2,ncol=2)

acc <- sum(diag(conf_mat))/sum(conf_mat)
tpr <- conf_mat[1,1]/sum(conf_mat[1,])  # true positive rate
tn <- conf_mat[2,2]/sum(conf_mat[2,])   # true negative rate

K-Nearest neighbors

Case study

   acc    tpr     tn 
0.9474 0.9922 0.8525 
          pred_B  pred_M
actual_B     128       1
actual_M       9      52

K-Nearest neighbors

Case study

library(caret)
confusionMatrix(predict_1,test_labels)$table
          Reference
Prediction   B   M
         B 128   9
         M   1  52

K-Nearest neighbors

Case study

# let's try different values for k
k_params <- c(1,3,5,10,15,20,25,30,40)
perf_acc <- NULL
per<-NULL
for(i in k_params){
  predictions <- knn(train = train_set,
                     test = test_set,
                     cl = train_labels,
                     k = i)
  conf <- confusionMatrix(predictions,test_labels)$table
  perf_acc <- sum(diag(conf))/sum(conf)
  per <- rbind(per,c(i, perf_acc,conf[[1]],conf[[3]],conf[[2]],conf[[4]]))
}

K-Nearest neighbors

Case study

 K  Acc  TP FP FN TN
 1 0.95 126  6  3 55
 3 0.96 127  6  2 55
 5 0.95 126  6  3 55
10 0.96 127  6  2 55
15 0.94 127  9  2 52
20 0.95 128  9  1 52
25 0.94 128 10  1 51
30 0.94 128 10  1 51
40 0.95 129 10  0 51

K-Nearest neighbors

Summary

  • kNN is a lazy learning algorithm
  • Assigns the majority class of the k data points closest to the new data
    • Ensure all features are on the same scale
  • Pros
    • Can be applied to data from any distribution
    • Simple and intuitive
  • Cons
    • Choosing k requires trial and error
    • Testing step is computationally expensive (unlike parametric models)
    • Needs a large number of training samples to be useful

Naive Bayes

Probabilistic learning

  • Probability and Bayes Theorem
  • Understanding Naive Bayes
  • Case study: filtering mobile phone spam

Naive Bayes

Probability and Bayes Theorem

  • Terminology:
    • probability
    • event
    • trial - e.g. 1 flip of a coin, 1 toss of a die
  • \(X_{i}\) is an event
  • The set of all events is \(\{X_{1},X_{2},...,X_{n}\}\)
  • The probability of an event is the long-run frequency of its occurrence
    • \(0 \leq P(X) \leq 1\)
    • For mutually exclusive events, \(P(\sum_{i=1}^{n} X_{i}) = \sum_{i=1}^{n} P(X_{i})\)

Naive Bayes

Probability and Bayes Theorem

  • Independent events
    • \(A \cap B\) is "A and B"
    • \(P(A \cap B) = P(A) \times P(B)\)
  • Conditional probability
    • \(A \mid B\) is "A given B"
    • \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)

Naive Bayes

Probability and Bayes Theorem

  • Independent events
    • \(A \cap B\) is "A and B"
    • \(P(A \cap B) = P(A) \times P(B)\)
  • Conditional probability
    • \(A \mid B\) is "A given B"
    • \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
    • \(P(B \mid A) = \frac{P(B \cap A)}{P(A)}\)
    • \(P(B \mid A) \times P(A) = P(B \cap A)\)

Naive Bayes

Probability and Bayes Theorem

  • Independent events
    • \(A \cap B\) is "A and B"
    • \(P(A \cap B) = P(A) \times P(B)\)
  • Conditional probability
    • \(A \mid B\) is "A given B"
    • \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
    • \(P(B \mid A) = \frac{P(B \cap A)}{P(A)}\)
    • \(P(B \mid A) \times P(A) = P(B \cap A)\)
    • but... \(P(B \cap A) = P(A \cap B)\)

Naive Bayes

Probability and Bayes Theorem

  • Independent events
    • \(A \cap B\) is "A and B"
    • \(P(A \cap B) = P(A) \times P(B)\)
  • Conditional probability
    • \(A \mid B\) is "A given B"
    • \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
    • \(P(B \mid A) = \frac{P(B \cap A)}{P(A)}\)
    • \(P(B \mid A) \times P(A) = P(B \cap A)\)
    • but... \(P(B \cap A) = P(A \cap B)\)
    • so.... \(P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\) <-- Bayes Theorem!

Naive Bayes

Bayes Example

  • A decision should be made using all available information
    • As new information enters, the decision might be changed
  • Example: Email filtering
    • spam and non-spam (AKA ham)
    • classify emails depending on what words they contain
    • \(P(spam \mid CASH!)\) = ?

Naive Bayes

Bayes Example

# data frame with frequency of emails with the word "cash"
bayes_ex <- data.frame(cash_yes=c(10,3,13),
                      cash_no=c(20,67,87),
                      total=c(30,70,100),
                      row.names=c('spam','ham','total'))
bayes_ex
      cash_yes cash_no total
spam        10      20    30
ham          3      67    70
total       13      87   100

Naive Bayes

Bayes Example

  • Recall Bayes Theorem:
    • \(P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\)
  • A = event that email is spam
  • B = event that "CASH" exists in the email
    \(P(spam \mid cash=yes) = P(cash=yes \mid spam) \times \frac{P(spam)}{P(cash=yes)}\)

Naive Bayes

Bayes Example

  • Recall Bayes Theorem:
    • \(P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\)
  • A = event that email is spam
  • B = event that "CASH" exists in the email
    \(P(spam \mid cash=yes) = P(cash=yes \mid spam) \times \frac{P(spam)}{P(cash=yes)}\)
    \(P(cash = yes \mid spam) = \frac{10}{30}\)
    \(P(spam) = \frac{30}{100}\)
    \(P(cash = yes) = \frac{13}{100}\)
    = \(\frac{10}{30} \times \frac{\frac{30}{100}}{\frac{13}{100}} = 0.769\)

Naive Bayes

Bayes Example

  • Recall Bayes Theorem:
    • \(P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\)
  • A = event that email is spam
  • B = event that "CASH" exists in the email
    \(P(spam \mid cash=yes) = P(cash=yes \mid spam) \times \frac{P(spam)}{P(cash=yes)}\)
    \(P(cash = yes \mid spam) = \frac{10}{30}\)
    \(P(spam) = \frac{30}{100}\)
    \(P(cash = yes) = \frac{13}{100}\)
    = \(\frac{10}{30} \times \frac{\frac{30}{100}}{\frac{13}{100}} = 0.769\)

Exercise: \(P(ham \mid cash = no)\) = ?
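As a check, the 0.769 result can be recomputed directly from the bayes_ex data frame defined earlier; a minimal sketch:

p_cash_given_spam <- bayes_ex["spam","cash_yes"] / bayes_ex["spam","total"]  # 10/30
p_spam <- bayes_ex["spam","total"] / bayes_ex["total","total"]               # 30/100
p_cash <- bayes_ex["total","cash_yes"] / bayes_ex["total","total"]           # 13/100
p_cash_given_spam * p_spam / p_cash                                          # ~0.769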

Naive Bayes

Why Naive?

  • Assumes all features are independent and equally important
  • NB still performs very well out of the box
      cash_yes cash_no furniture_yes furniture_no total
spam        10      20             6           24    30
ham          3      67            20           50    70
total       13      87            26           74   100

\(P(spam \mid cash=yes \cap furniture=no) = \frac{P(cash=yes \cap furniture=no \mid spam) \times P(spam)}{P(cash=yes \cap furniture=no)}\)

Naive Bayes

Why Naive?

  • As features increase, formula becomes very expensive
  • Solution: assume each feature is independent of any other feature, given they are in the same class
    • Independence formula: \(P(A \cap B) = P(A) \times P(B)\)
    • Called "class conditional independence":
    • \(P(spam \mid cash=yes \cap furniture=no) =\)

Naive Bayes

Why Naive?

  • As features increase, formula becomes very expensive
  • Solution: assume each feature is independent of any other feature, given they are in the same class
    • Independence formula: \(P(A \cap B) = P(A) \times P(B)\)
    • Called "class conditional independence":
    • \(P(spam \mid cash=yes \cap furniture=no) =\)
    • \(\frac{P(cash=yes \mid spam) \times P(furniture=no \mid spam) \times P(spam)}{P(cash=yes) \times P(furniture=no)}\)

Naive Bayes

Why Naive?

  • As features increase, formula becomes very expensive
  • Solution: assume each feature is independent of any other feature, given they are in the same class
    • Independence formula: \(P(A \cap B) = P(A) \times P(B)\)
    • Called "class conditional independence":
    • \(P(spam \mid cash=yes \cap furniture=no) =\)
    • \(\frac{P(cash=yes \mid spam) \times P(furniture=no \mid spam) \times P(spam)}{P(cash=yes) \times P(furniture=no)} =\)
    • \(\frac{\frac{10}{30} \times \frac{24}{30} \times \frac{30}{100}}{\frac{13}{100} \times \frac{74}{100}}\)

Naive Bayes

Why Naive?

  • As features increase, formula becomes very expensive
  • Solution: assume each feature is independent of any other feature, given they are in the same class
    • Independence formula: \(P(A \cap B) = P(A) \times P(B)\)
    • Called "class conditional independence":
    • \(P(spam \mid cash=yes \cap furniture=no) = \frac{P(cash=yes \mid spam) \times P(furniture=no \mid spam) \times P(spam)}{P(cash=yes) \times P(furniture=no)}\) = \(\frac{\frac{10}{30} \times \frac{24}{30} \times \frac{30}{100}}{\frac{13}{100} \times \frac{74}{100}}\)

Exercise: \(P(ham \mid cash=yes \cap furniture=no) = ?\)

Naive Bayes

The Laplace Estimator

      cash_yes cash_no furniture_yes furniture_no party_yes party_no total
spam        10      20             6           24         3       27    30
ham          3      67            20           50         0       70    70
total       13      87            26           74         3       97   100

\(P(ham \mid cash=yes \cap party=yes) = \frac{P(cash=yes \mid ham) \times P(party=yes \mid ham) \times P(ham)}{P(cash=yes) \times P(party=yes)} = ?\)

Naive Bayes

The Laplace Estimator

      cash_yes cash_no furniture_yes furniture_no party_yes party_no total
spam        10      20             6           24         3       27    30
ham          3      67            20           50         0       70    70
total       13      87            26           74         3       97   100

\(P(ham \mid cash=yes \cap party=yes) = \frac{P(cash=yes \mid ham) \times P(party=yes \mid ham) \times P(ham)}{P(cash=yes) \times P(party=yes)} = ?\) \(\frac{\frac{3}{70} \times \frac{0}{70} \times \frac{70}{100}}{\frac{13}{100} \times \frac{3}{100}} = 0\)

  • if party appears in an email, there's no chance it could be ham...really??

Naive Bayes

The Laplace Estimator

\(P(ham \mid cash=yes \cap party=yes) =\) \(\frac{P(cash=yes \mid ham) \times P(party=yes \mid ham) \times P(ham)}{P(cash=yes) \times P(party=yes)} =\) \(\frac{\frac{3}{70} \times \frac{0}{70} \times \frac{70}{100}}{\frac{13}{100} \times \frac{3}{100}} = 0\)

  • if party appears in an email, there's no chance it could be ham...really??
  • To get around 0's, apply Laplace estimator
    • add 1 to every feature
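For instance, with a Laplace estimator of 1 added to each count of the party feature (a sketch; the denominator grows by the number of feature values, here 2):

\(P(party = yes \mid ham) = \frac{0 + 1}{70 + 2} \approx 0.014\) instead of 0, so the overall product becomes small but no longer exactly zero.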

Naive Bayes

Case Study: SMS spam filtering

sms_data  <- read.table('SMSSpamCollection.txt',stringsAsFactors=FALSE,sep='\t',quote="", col.names=c("type","text"))
str(sms_data)
'data.frame':   5574 obs. of  2 variables:
 $ type: chr  "ham" "ham" "spam" "ham" ...
 $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...
sms_data$type <- factor(sms_data$type)

Naive Bayes

Case Study: SMS spam filtering

  • Remove words like and, the, or
library(tm)
# create collection of text documents, a corpus
sms_corpus <- Corpus(VectorSource(sms_data$text))
sms_corpus
A corpus with 5574 text documents

Naive Bayes

Case Study: SMS spam filtering

# look at first few text messages
for (i in 1:5){
  print(sms_corpus[[i]])
}
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Ok lar... Joking wif u oni...
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
U dun say so early hor... U c already then say...
Nah I don't think he goes to usf, he lives around here though

Naive Bayes

Case Study: SMS spam filtering

# clean the data using helpful functions
corpus_clean <- tm_map(sms_corpus,tolower)  # newer tm versions need content_transformer(tolower)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# now make each word in the corpus into its own token
# each row is a message and each column is a word. Cells are frequency counts.
sms_dtm <- DocumentTermMatrix(corpus_clean)

Naive Bayes

Case Study: SMS spam filtering

# create training and testing set
total_n <- nrow(sms_data)
train_ind <- sample(total_n,total_n*2/3)
dtm_train_set <- sms_dtm[train_ind,]
dtm_test_set <- sms_dtm[-train_ind,]
corpus_train_set <- corpus_clean[train_ind]
corpus_test_set <- corpus_clean[-train_ind]
raw_train_set <- sms_data[train_ind,1:2]
raw_test_set <- sms_data[-train_ind,1:2]

# remove infrequent terms - not useful for classification
freq_terms <- c(findFreqTerms(dtm_train_set,7))
corpus_train_set <- DocumentTermMatrix(corpus_train_set, list(dictionary = freq_terms))
corpus_test_set <- DocumentTermMatrix(corpus_test_set, list(dictionary = freq_terms))

Naive Bayes

Case Study: SMS spam filtering

# convert frequency counts to "yes" or "no"
# implicitly weighing each term the same
convert <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0,1), labels=c('No','Yes'))
  return(x)
}

corpus_train_set <- apply(corpus_train_set, MARGIN = 2 , FUN = convert)
corpus_test_set <- apply(corpus_test_set, MARGIN = 2 , FUN = convert)

Naive Bayes

Case Study: SMS spam filtering

# use naiveBayes() function from e1071 package
library(e1071)

naive_model <- naiveBayes(x = corpus_train_set, y = raw_train_set$type)
predict_naive <- predict(naive_model, corpus_test_set)

naive_conf <- confusionMatrix(predict_naive,raw_test_set$type)$table
naive_conf
          Reference
Prediction  ham spam
      ham  1620   37
      spam    3  198

Naive Bayes

Case Study: SMS spam filtering

# use naiveBayes() function from e1071 package
library(e1071)

naive_model <- naiveBayes(x = corpus_train_set, y = raw_train_set$type)
predict_naive <- predict(naive_model, corpus_test_set)

naive_conf <- confusionMatrix(predict_naive,raw_test_set$type)$table
naive_conf
          Reference
Prediction  ham spam
      ham  1620   37
      spam    3  198
Exercise:
1) Calculate the true positive and false positive rate.  
2) Calculate the error rate (hint: error rate = 1 - accuracy)  
3) Set laplace = 1, rerun the model and confusion matrix. Does this improve the model?

Naive Bayes

Summary

  • Probabilistic approach
  • Naive Bayes assumes features are independent, conditioned on being in the same class
  • Useful for text classification
  • Strengths
    • simple, fast
    • Does well with noisy and missing data
    • Doesn't need large training set
  • Weaknesses
    • Assumes all features are independent and equally important
    • Not well suited for numeric data sets

Model Performance

Measuring performance

  • Classification
  • Regression (more on this later)

Model Performance

Classification problems

  • Accuracy is not enough
    • e.g. drug testing
    • class imbalance
  • Best performance measure: is the classifier successful at its intended purpose?

Model Performance

Classification problems

  • 3 types of data used for measuring performance
    • actual values
    • predicted value
    • probability of prediction, i.e. confidence in prediction
  • most R packages have a predict() function
  • confidence in predicted value matters
    • all else equal, choose the model that is more confident in its predictions
    • more confident + accurate = better generalizer
    • set the type parameter in predict() to "probability", "prob", "raw", ... (the name varies by package)

Model Performance

Classification problems

# estimate a probability for each class
confidence <- predict(naive_model, corpus_test_set, type='raw')
as.data.frame(format(head(confidence),digits=2,scientific=FALSE))
              ham            spam
1 0.9999975488976 0.0000024511024
2 0.9999996374299 0.0000003625701
3 0.9997485224034 0.0002514775966
4 0.0000000000465 0.9999999999535
5 0.0000000000043 0.9999999999957
6 0.9999998601364 0.0000001398636

Model Performance

Classification problems

# estimate a probability for each class
spam_conf <- confidence[,2]
comparison <- data.frame(predict=predict_naive,
                         actual=raw_test_set[,1],
                         prob_spam=spam_conf)
comparison[,3] <- format(comparison[,3],digits=2,scientific=FALSE)
head(comparison)
  predict actual         prob_spam
1     ham    ham 0.000002451102442
2     ham    ham 0.000000362570063
3     ham    ham 0.000251477596598
4    spam   spam 0.999999999953453
5    spam   spam 0.999999999995699
6     ham    ham 0.000000139863590

Model Performance

Classification problems

head(comparison[with(comparison,predict!=actual),])
    predict actual         prob_spam
81      ham   spam 0.199053484775318
104     ham   spam 0.075148234687662
144     ham   spam 0.234883419684168
186     ham   spam 0.475632786648641
232     ham   spam 0.115073916786998
237     ham   spam 0.002081284191925
head(comparison[with(comparison,predict==actual),])
  predict actual         prob_spam
1     ham    ham 0.000002451102442
2     ham    ham 0.000000362570063
3     ham    ham 0.000251477596598
4    spam   spam 0.999999999953453
5    spam   spam 0.999999999995699
6     ham    ham 0.000000139863590
mean(as.numeric(comparison[with(comparison,predict!='spam'),]$prob_spam))
[1] 0.005393
mean(as.numeric(comparison[with(comparison,predict=='spam'),]$prob_spam))
[1] 0.9796

Model Performance

Confusion Matrix

  • Categorize predictions on whether they match actual values or not
  • Can be more than two classes
  • Count the number of predictions falling on and off the diagonal
predicted <- sample(c("A","B","C"),1000,TRUE)
actual <- sample(c("A","B","C"),1000,TRUE)
fabricated <- table(predicted,actual)

# for our Naive Bayes classifier
table(comparison$predict,comparison$actual)

        ham spam
  ham  1620   37
  spam    3  198

Model Performance

Confusion Matrix

  • True Positive (TP)
  • False Positive (FP)
  • True Negative (TN)
  • False Negative (FN)
  • \(Accuracy = \frac{TN + TP}{TN + TP + FN + FP}\)
  • \(Error = 1 - Accuracy\)
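For the Naive Bayes confusion matrix above (naive_conf), these are one-liners; a quick sketch:

acc <- sum(diag(naive_conf)) / sum(naive_conf)  # accuracy
err <- 1 - acc                                  # error rate
c(accuracy = acc, error = err)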

Model Performance

Kappa

  • Adjusts the accuracy by the probability of getting a correct prediction by chance alone
  • \(k = \frac{P(A) - P(E)}{1 - P(E)}\)
    • Poor < 0.2
    • Fair < 0.4
    • Moderate < 0.6
    • Good < 0.8
    • Excellent > 0.8

Model Performance

Kappa

  • P(A) is the accuracy
  • P(E) is the probability that actual = predicted by chance alone
    • \(P(E) = P(E = class 1 ) + P(E = class 2)\)
    • \(P(E = class 1) = P(actual = class 1 \cap predicted = class 1)\)
    • actual and predicted are independent so...

Model Performance

Kappa

  • P(A) is the accuracy
  • P(E) is the probability that actual = predicted, i.e. the proportion of each class
    • \(P(E) = P(E = class 1 ) + P(E = class 2)\)
    • \(P(E = class 1) = P(actual = class 1 \cap predicted = class 1)\)
    • actual and predicted are independent so...
    • \(P(E = class 1) = P(actual = class 1 ) \times P(predicted = class 1)\)
    • putting it all together...

Model Performance

Kappa

  • P(A) is the accuracy
  • P(E) is the probability that actual = predicted, i.e. the proportion of each class
    • \(P(E) = P(E = class 1 ) + P(E = class 2)\)
    • \(P(E = class 1) = P(actual = class 1 \cap predicted = class 1)\)
    • actual and predicted are independent so...
    • \(P(E = class 1) = P(actual = class 1 ) \times P(predicted = class 1)\)
    • putting it all together...
    • \(P(E) = P(actual = class 1) \times P(predicted = class 1) + P(actual = class 2) \times P(predicted = class 2)\)
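A sketch of the kappa computation for any confusion matrix (kappa_stat is a helper written here, not a library function):

kappa_stat <- function(conf) {
  p_a <- sum(diag(conf)) / sum(conf)                       # observed agreement (accuracy)
  p_e <- sum(rowSums(conf) * colSums(conf)) / sum(conf)^2  # agreement expected by chance
  (p_a - p_e) / (1 - p_e)
}

kappa_stat(fabricated)  # the random A/B/C table from the confusion matrix slide - kappa should be near 0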

Model Performance

Kappa

  • P(A) is the accuracy
  • P(E) is the probability that actual = predicted, i.e. the proportion of each class
    • \(P(E) = P(E = class 1 ) + P(E = class 2)\)
    • \(P(E = class 1) = P(actual = class 1 \cap predicted = class 1)\)
    • actual and predicted are independent so...
    • \(P(E = class 1) = P(actual = class 1 ) \times P(predicted = class 1)\)
    • putting it all together...
    • \(P(E) = P(actual = class 1) \times P(predicted = class 1) + P(actual = class 2) \times P(predicted = class 2)\)

Exercise: Calculate the kappa statistic for the naive classifier.

Model Performance

Specificity and Sensitivity

  • Sensitivity: proportion of positive examples that were correctly classified (True Positive Rate)
    • \(sensitivity = \frac{TP}{TP + FN}\)
  • Specificity: proportion of negative examples correctly classified (True Negative Rate)
    • \(specificity = \frac{TN}{FP + TN}\)
  • Balance aggressiveness and conservativeness
  • Found in the confusion matrix
  • Values range from 0 to 1
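A sketch for a 2x2 table, assuming rows are predictions, columns are actuals, and the positive class comes first (the layout of the caret confusionMatrix tables shown earlier):

sens_spec <- function(conf) {
  c(sensitivity = conf[1,1] / sum(conf[,1]),  # TP / (TP + FN)
    specificity = conf[2,2] / sum(conf[,2]))  # TN / (FP + TN)
}

# e.g. sens_spec(naive_conf) treats "ham", the first level, as the positive class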

Model Performance

Precision and Recall

  • Used in information retrieval: are the values retrieved useful or clouded by noise?
  • Precision: proportion of positives that are truly positive
    • \(precision = \frac{TP}{TP + FP}\)
    • Precise model only predicts positive when it is sure. Very trustworthy model.
  • Recall: proportion of true positives of all positives
    • \(recall = \frac{TP}{TP + FN}\)
    • High recall model will capture a large proportion of positives. Returns relevant results
  • Easy to have high recall (cast a wide net) or high precision (low hanging fruit) but hard to have both high

Model Performance

Precision and Recall

  • Used in information retrieval: are the values retrieved useful or clouded by noise?
  • Precision: proportion of positives that are truly positive
    • \(precision = \frac{TP}{TP + FP}\)
    • Precise model only predicts positive when it is sure. Very trustworthy model.
  • Recall: proportion of true positives of all positives
    • \(recall = \frac{TP}{TP + FN}\)
    • High recall model will capture a large proportion of positives. Returns relevant results
  • Easy to have high recall (cast a wide net) or high precision (low hanging fruit) but hard to have both high

Exercise: Find the specificity, sensitivity, precision and recall for the Naive classifier.

Model Performance

F-score

  • Also called the F1-score, combines both precision and recall into 1 measure
  • \(F_{1} = \frac{2 \times precision \times recall}{precision + recall}\)
  • Assumes equal weight to precision and recall

Model Performance

F-score

  • Also called the F1-score, combines both precision and recall into 1 measure
  • \(F_{1} = \frac{2 \times precision \times recall}{precision + recall}\)
  • Assumes equal weight to precision and recall

Exercise: Calculate the F-score for the Naive classifier.

Model Performance

Visualizing performance: ROC

  • ROC curves measure how well your classifier can discriminate between the positive and negative class
  • As threshold increases, tradeoff between TPR (sensitivity) and FPR (1 - specificity)
library(ROCR)
# create a prediction object (using the hard class predictions here; the predicted
# probabilities, e.g. spam_conf, would trace out a smoother curve)
pred <- prediction(predictions = as.numeric(comparison$predict), labels= raw_test_set[,1])
# create a performance function
perf <- performance(pred,measure='tpr',x.measure='fpr')
# create the ROC curve
plot(perf, main = "ROC Curve for Naive classifier", col = 'blue', lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)
text(0.75,0.4, labels = "<<<  Classifier with\n no predictive power")

plot of chunk roc

Model Performance

Visualizing performance: ROC

  • The area under the ROC curve is the AUC
  • Ranges from 0.5 (no predictive power) to 1.0 (perfect classifier)
    • 0.9 – 1.0 = outstanding
    • 0.8 – 0.9 = excellent
    • 0.7 – 0.8 = acceptable
    • 0.6 – 0.7 = poor
    • 0.5 – 0.6 = no discrimination
auc <- performance(pred, measure='auc')
str(auc)
Formal class 'performance' [package "ROCR"] with 6 slots
  ..@ x.name      : chr "None"
  ..@ y.name      : chr "Area under the ROC curve"
  ..@ alpha.name  : chr "none"
  ..@ x.values    : list()
  ..@ y.values    :List of 1
  .. ..$ : num 0.92
  ..@ alpha.values: list()
auc@y.values
[[1]]
[1] 0.9204

Model Performance

Holdout method

  • We cheated (kind of) in the kNN example

Model Performance

Holdout method

  • We cheated (kind of) in the kNN example - we tuned k on the same test set we reported performance on
    • Train model - 50% of data
    • Tune parameters on validation set - 25% of data
    • retrain final model on training and validation set (maximize data points)
    • Test final model - 25% of data
library(caret)
new_data <- createDataPartition(sms_data$type,p=0.1, list=FALSE)
table(sms_data[new_data,1])

 ham spam 
 483   75 

Model Performance

Holdout method

  • k-fold Cross Validation
    • Divide the data into k random, equal-sized partitions (k = 10 is commonly used)
    • Train the classifier on k-1 of the partitions
    • Test it on the remaining partition
    • Repeat for each fold
    • Average the performance across all models - this is the cross-validation error
    • All examples are eventually used for both training and testing
folds <- createFolds(sms_data$type, k = 10)
str(folds)
List of 10
 $ Fold01: int [1:558] 4 13 36 40 51 74 77 79 83 91 ...
 $ Fold02: int [1:556] 11 17 53 59 60 87 98 108 109 110 ...
 $ Fold03: int [1:558] 16 20 28 30 33 52 54 56 63 64 ...
 $ Fold04: int [1:558] 9 18 27 45 50 65 66 67 70 73 ...
 $ Fold05: int [1:558] 38 39 46 61 76 81 101 111 130 161 ...
 $ Fold06: int [1:557] 8 21 22 25 29 44 47 55 71 80 ...
 $ Fold07: int [1:558] 10 14 15 19 31 41 48 49 62 86 ...
 $ Fold08: int [1:557] 2 26 32 42 43 57 69 92 94 96 ...
 $ Fold09: int [1:557] 1 3 5 6 12 23 24 34 35 37 ...
 $ Fold10: int [1:557] 7 58 78 82 88 97 104 117 124 128 ...
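A sketch of the full loop, reusing the scaled breast cancer data and kNN from earlier (the choice of 21 neighbors here is arbitrary):

library(caret)  # createFolds()
library(class)  # knn()

folds_bc <- createFolds(scaled_data$diagnosis, k = 10)
cv_acc <- sapply(folds_bc, function(test_idx) {
  preds <- knn(train = scaled_data[-test_idx, 2:31],
               test = scaled_data[test_idx, 2:31],
               cl = scaled_data[-test_idx, 1],
               k = 21)
  mean(preds == scaled_data[test_idx, 1])  # accuracy on the held-out fold
})
mean(cv_acc)  # the cross-validation estimate of accuracy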

Regression

Understanding Regression

  • predicting a continuous value - not classification
  • concerned with the relationship between independent and dependent variables
  • regressions can be linear, non-linear, built from decision trees, etc...
  • linear regression and several of its extensions (e.g. logistic, Poisson) belong to the family of generalized linear models

Regression

Linear regression

  • \(Y = \alpha + \beta X\)
  • the fitted \(\alpha\) and \(\beta\) are just estimates of the true parameters

plot of chunk best_fit

Regression

Linear regression

  • Distance between the line and each point is the error, or residual term
  • Line of best fit: \(Y = \alpha + \beta X + \epsilon\). Assumes:
    • \(\epsilon\) ~ \(N(0, \sigma^{2})\)
    • Each point is IID (independent and identically distributed)
    • \(\alpha\) is the intercept
    • \(\beta\) is the coefficient
    • \(X\) is the predictor (feature)
    • \(\beta\) and \(X\) are usually made up of multiple elements - vectors and matrices

Regression

Linear regression

  • Minimize \(\epsilon\) by minimizing the mean squared error:
    • \(MSE = \frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}^{2} = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}\)
    • \(y_{i}\) is the true/observed value
    • \(\hat{y}_{i}\) is the approximation to/prediction of the true \(y_{i}\)
  • Under these assumptions, minimizing MSE (least squares) yields the best linear unbiased estimator
  • 2 common ways to minimize MSE:
    • analytical solution (e.g. lm() function does this)
    • approximation (e.g. gradient descent)
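The analytical route is a single call to lm(); a quick sketch on the iris data from earlier:

fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
coef(fit)           # estimated alpha (intercept) and beta (slope)
mean(resid(fit)^2)  # the minimized mean squared error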

Regression

Gradient descent

  • In Machine Learning, regression equation is called the hypothesis function
    • Linear hypothesis function \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\)
    • \(\theta\) is \(\beta\)

Regression

Gradient descent

  • In Machine Learning, regression equation is called the hypothesis function
    • Linear hypothesis function \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\)
    • \(\theta\) is \(\beta\)
  • Goal remains the same: minimize MSE
    • define a cost (aka objective) function
    • \(J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{i}) - y_{i})^2\)
    • \(m\) is the number of examples

Regression

Gradient descent

  • In Machine Learning, regression equation is called the hypothesis function
    • Linear hypothesis function \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\)
    • \(\theta\) is \(\beta\)
  • Goal remains the same: minimize MSE
    • define a cost (aka objective) function
    • \(J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{i}) - y_{i})^2\)
    • \(m\) is the number of examples
  • Find a value for theta that minimizes \(J\)
    • can use calculus or...gradient descent

Regression

Gradient descent

  • given a starting value, take a step along the slope
  • continue taking a step until minimum is reached

plot of chunk unnamed-chunk-6

Regression

Gradient descent

  • given a starting value, take a step along the slope
  • continue taking a step until minimum is reached

plot of chunk unnamed-chunk-7

Regression

Gradient descent

  • given a starting value, take a step along the slope
  • continue taking a step until minimum is reached

plot of chunk unnamed-chunk-8

Regression

Gradient descent

  • given a starting value, take a step along the slope
  • continue taking a step until minimum is reached

plot of chunk unnamed-chunk-9

Regression example

Gradient descent

  • Start with a point (guess)
  • Repeat
    • Determine a descent direction
    • Choose a step
    • Update
  • Until stopping criterion is satisfied

Regression example

Gradient descent

  • Start with a point (guess) \(x\)
  • Repeat
    • Determine a descent direction \(-f^\prime\)
    • Choose a step \(\alpha\)
    • Update \(x:=x - \alpha f^\prime\)
  • Until stopping criterion is satisfied \(f^\prime \approx 0\)

Regression example

Gradient descent

  • update the value of \(\theta\) by subtracting the first derivative of the cost function
  • \(\theta_{j}\) := \(\theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1})\)
    • \(j = 1, ..., p\) the number of coefficients, or features
    • \(\alpha\) is the step
    • \(\frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1})\) is the gradient
  • repeat until \(J(\theta)\) is minimized

Regression example

Gradient descent

  • taking the partial derivative of the cost function, it turns out that
  • \(\frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1}) = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{i}) - y^{i})\,x^{i}_{j}\)

Regression example

Gradient descent

  • and the gradient descent update becomes:
  • \(\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{i}) - y^{i})\,x_{j}^{i}\)
  • repeating until the cost function is minimized

Regression example

Gradient descent

  • choose the learning rate, alpha
  • choose the stopping point
  • local vs. global minimum

Regression example

Gradient descent

plot of chunk grad_ex_plot

Regression example

Gradient descent

x <- cbind(1,x)   # add a column of ones to x (intercept term)
theta <- c(0,0)   # initialize theta vector
m <- nrow(x)      # number of observations
grad_cost <- function(X,y,theta) return(sum(((X%*%theta)- y)^2))  # sum of squared errors

Regression example

Gradient descent

gradDescent<-function(X,y,theta,iterations,alpha){
  m <- length(y)
  grad <- rep(0,length(theta))
  cost.df <- data.frame(cost=0,theta=0)

  for (i in 1:iterations){
    h <- X%*%theta
    grad <-  (t(X)%*%(h - y))/m
    theta <- theta - alpha * grad
    cost.df <- rbind(cost.df,c(grad_cost(X,y,theta),theta))    
  }  

  return(list(theta,cost.df))
}

Regression example

Gradient descent

## initialize X, y and theta
X1<-matrix(ncol=1,nrow=nrow(df),cbind(1,df$X))
Y1<-matrix(ncol=1,nrow=nrow(df),df$Y)

init_theta<-as.matrix(c(0))
grad_cost(X1,Y1,init_theta)
[1] 5389
iterations = 10000
alpha = 0.1
results <- gradDescent(X1,Y1,init_theta,iterations,alpha)

Regression example

Gradient descent

plot of chunk grad_curve

Regression example

Gradient descent

grad_cost(X1,Y1,results[[1]])  # cost using the fitted theta
[1] 356.4
## Make some predictions
intercept <- df[df$X==0,]$Y   # observed Y at X = 0, used as a rough intercept
pred <- function (x) return(intercept+c(x)%*%results[[1]])
new_points <- c(0.1,0.5,0.8,1.1)
new_preds <- data.frame(X=new_points,Y=sapply(new_points,pred))

Regression example

Gradient descent

ggplot(data=df,aes(x=X,y=Y))+geom_point(size=2)

plot of chunk new_point

ggplot(data=df,aes(x=X,y=Y))+geom_point()+geom_point(data=new_preds,aes(x=X,y=Y,color='red'),size=3)+scale_colour_discrete(guide = FALSE)

plot of chunk new_point

Regression example

Gradient descent - summary

  • minimization algorithm
  • approximation, non-closed form solution
  • good for large number of examples
  • hard to select the right \(\alpha\)
  • traditional looping is slow - optimization algorithms are used in practice

Learning Curves

How many parameters are too many?

plot of chunk multi_plot

Learning Curves

How many parameters are too many?

  • a simple linear model won't fit

plot of chunk multi_plot_2

[1] 0.04867

Learning Curves

How many parameters are too many?

  • let's add some features
df <- transform(df, X2=X^2, X3=X^3)
summary(lm(Y~X+X2+X3,df))$coef[,1]
(Intercept)           X          X2          X3 
     0.3202      6.2780    -25.7736     21.0211 
summary(lm(Y~X + X2 + X3,df))$adj.r.squared
[1] 0.799

Learning Curves

How many parameters are too many?

  • let's add even more features
(Intercept)           X          X2          X3          X4          X5 
  2.613e-03   1.070e+01  -1.093e+02   1.634e+03  -1.546e+04   8.849e+04 
         X6          X7          X8          X9         X10         X11 
 -3.320e+05   8.393e+05  -1.436e+06   1.646e+06  -1.222e+06   5.415e+05 
        X12         X14 
 -1.142e+05   2.690e+03 
[1] 0.09883

Learning Curves

How many parameters are too many?

  • use orthogonal polynomials to avoid correlated features
  • poly() function
ortho.coefs <- with(df,cor(poly(X,degree=3)))
sum(ortho.coefs[upper.tri(ortho.coefs)]) # polynomials are uncorrelated
[1] -1.415e-16
linear.fit <- lm(Y~poly(X,degree=15),df)
summary(linear.fit)$coef[,1]
           (Intercept)  poly(X, degree = 15)1  poly(X, degree = 15)2 
              0.128085              -2.242253               6.145309 
 poly(X, degree = 15)3  poly(X, degree = 15)4  poly(X, degree = 15)5 
              5.716072              -3.668778              -1.652848 
 poly(X, degree = 15)6  poly(X, degree = 15)7  poly(X, degree = 15)8 
              0.645181               0.206527              -0.059154 
 poly(X, degree = 15)9 poly(X, degree = 15)10 poly(X, degree = 15)11 
              0.105895               0.046219              -0.029200 
poly(X, degree = 15)12 poly(X, degree = 15)13 poly(X, degree = 15)14 
             -0.056042               0.005958              -0.020484 
poly(X, degree = 15)15 
             -0.054331 
summary(linear.fit)$adj.r.squared # R^2 is 98% and no errors
[1] 0.9775
sqrt(mean((predict(linear.fit)-df$Y)^2)) # RMSE on the training data
[1] 0.09874

Learning Curves

How many parameters are too many?

  • when to stop adding orthogonal features?

plot of chunk polt_2

Learning Curves

How many parameters are too many?

  • use cross-validation to determine best degree
x <- seq(0,1,by=0.005)
y <- sin(3*pi*x) + rnorm(length(x),0,0.1)

indices <- sort(sample(1:length(x), round(0.5 * length(x))))
training.x <- x[indices] 
training.y <- y[indices]
test.x <- x[-indices] 
test.y <- y[-indices]
training.df <- data.frame(X = training.x, Y = training.y) 
test.df <- data.frame(X = test.x, Y = test.y)

rmse <- function(y,h) return(sqrt(mean((y-h)^2)))

Learning Curves

How many parameters are too many?

performance <- data.frame()
for (d in 1:20){
  fits <- lm(Y~poly(X,degree=d),data=training.df)
  performance <- rbind(performance, data.frame(Degree = d,
                                               Data = 'Training',
                                               RMSE = rmse(training.y, predict(fits))))
  performance <- rbind(performance, data.frame(Degree = d,
                                             Data = 'Test',
                                             RMSE = rmse(test.y, predict(fits,
                                                                         newdata = test.df))))
}

Learning Curves

How many parameters are too many?

plot of chunk unnamed-chunk-11

Regression

Summary

  • Minimize MSE of target function
  • Analytically vs. approximation
  • Gradient descent is preferable when there are many examples
  • Use learning curves to determine optimal number of parameters (or data points)

Summary

  • Machine learning overview and concepts
  • Exploring data using R
  • kNN algorithm and use case
  • Naive Bayes
    • Probability concepts
    • Mobile Spam case study
  • Model performance measures
  • Regression

Next Time

  • Logistic regression
  • Decision Trees
  • Clustering
  • Dimensionality reduction (PCA, ICA)
  • Regularization

Resources