class: clear .font300[ Introduction to Machine Learning in
<img src="https://emojis.slackmojis.com/emojis/images/1450319453/122/carlton.gif?1450319453" style="height:1em; width:auto; "/>
] .font250.right[ Analytics Connect '18 ] .pull-left.font120[ Brandon M. Greenwell<br>Enable the Science/Data Scientist<br>[
<i class="fab fa-github faa-vertical animated "></i> Fork me on GitHub
](https://github.com/bgreenwell/intro-ml-r)<br>[
<i class="fab fa-twitter faa-float animated "></i> @bgreenwell8
](https://twitter.com/bgreenwell8) ] .pull-right[ <img src="figures/effo-logo.png" width="100%" style="display: block; margin: auto;" /> ] --- # Who am I? .pull-left[ <img src="figures/pizza.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Started at
<img src="figures/effo-logo.png" style="height:1em; width:auto; "/>
in April of 2018 * BS/MS/PhD in Applied Statistics *
<i class="fab fa-r-project faa-pulse animated "></i>
enthusiast
<img src="https://emojis.slackmojis.com/emojis/images/1450694616/220/bananadance.gif?1450694616" style="height:1em; width:auto; "/>
* Matlab dethusiast
<img src="https://emojis.slackmojis.com/emojis/images/1483822352/1572/gross.gif?1483822352" style="height:1em; width:auto; "/>
* Maker of
📦s * [UC Analytics Summit](https://github.com/koalaverse/AnalyticsSummit18) ] --- class: clear, split-two # Overview .column.bg-main1[.content.vmiddle.center[ <center> <img src="figures/isl.jpg" width="70%"> .font150[[Book website](http://www-bcf.usc.edu/~gareth/ISL/)] ]] .column.bg-main2[.content.vmiddle.center[ <center> <img src="figures/esl.jpg" width="70%"> .font150[[Book website](https://web.stanford.edu/~hastie/ElemStatLearn/)] ]] --- class: clear, center, middle .pull-left[ .font300.bold[What is machine learning?] ] .pull-right[ ![]() ] --- # Popular terms .font150[ * .purple[Machine learning]: constructing algorithms that learn from data * .purple[Statistical learning]: emphasizes statistical models and the assessment of uncertainty * .purple[Data science]: applying mathematics, .bold[statistics], machine learning, engineering, etc. to extract knowledge from data - "Data Science is statistics on a Mac"
] .full-width.content-box-red.font150.center[Similar terms with slightly different emphases!] --- # Some take home messages .full-width.content-box-purple.font150.center[This talk is about *supervised learning*: building models from data that predict an **outcome** using a collection of **features**] .font175[ * Many flexible 🛠 for making predictions from data * They are not magic! They require
**good data**
and proper validation! * No one method is uniformly best! ] --- class: clear, middle, center .font300[ Follow along: [
code](https://github.com/bgreenwell/intro-ml-r/blob/master/code/intro-ml-r.R) ] --- class: clear, inverse, center, middle <img src="figures/fundamentals.jpg" width="100%" style="display: block; margin: auto;" /> --- # Classification vs. regression <img src="intro-ml-r_files/figure-html/classification-vs-regression-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Example data sets <!-- .font250.center[[Data set information](https://bgreenwell.github.io/intro-ml-r/datasets)] --> .font175[ .pull-left[ ## Classification: * Mushroom edibility 🍄 * E-mail spam ✉️ * Swiss banknote 💷 ] .pull-right[ ## Regression: * Boston housing data 🏠 * Ames housing data 🏠 ] ] --- # Nature as a black box .font200[ Nature functions to associate `\(\boldsymbol{X}\)` with `\(\boldsymbol{Y}\)`: ] <img src="figures/black-box-nature.svg" width="90%" style="display: block; margin: auto;" /> -- .font175[ * Can we predict what `\(\boldsymbol{Y}\)` will be for future `\(\boldsymbol{X}\)`? * How does nature associate `\(\boldsymbol{Y}\)` with `\(\boldsymbol{X}\)`? ] ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 --- # Data modeling culture .font200[ Assume a stochastic data model: ] <img src="figures/data-modeling-culture.svg" width="90%" style="display: block; margin: auto;" /> ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 Domain experts work for years to learn good features; they bring the statistician a "small" data set --- # Algorithmic modeling culture .font200[ Find a function `\(f\left(\boldsymbol{X}\right)\)` to predict `\(\boldsymbol{Y}\)`: ] <img src="figures/algorithmic-modeling-culture.svg" width="90%" style="display: block; margin: auto;" /> ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 We start with a "large" data set with many features and use machine learning to find the "good" ones --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
.font150[ Aside from predictions, we are often interested in understanding the way that `\(Y\)` is affected as `\(X_1, \dotsc, X_p\)` change (i.e., we cannot treat `\(\widehat{f}\)` as a "black 📦"). ] .font125[ .pull-left[ <!-- In this setting, one may be interested in answering the following questions: --> * Which predictors are .bold.blue[associated] with the response? * What is the nature of the .bold.blue[relationship] between the response and each predictor (or subset thereof)? ] .pull-right[ <img src="figures/vip-pdp-horizontal.png" width="100%" style="display: block; margin: auto;" /> ] ] --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
<br> <img src="figures/black-box-02.png" width="100%" style="display: block; margin: auto;" /> --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
<br> <img src="figures/black-box-03.png" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle, center <img src="intro-ml-r_files/figure-html/chuck-norris-1.gif" style="display: block; margin: auto;" /> [Source](https://bradleyboehmke.github.io/CinDay-RUG-IML-2018/slides-source.html#1) --- # Over fitting .font150[ It is possible to "over tune" your models to maximize performance on the training data. Such models will not generalize well to new data. We call this process [*over fitting*](https://en.wikipedia.org/wiki/Overfitting): ] <img src="intro-ml-r_files/figure-html/overfitting-linear-regression-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Model development .font150.pull-left[ ## Model selection Estimating the *performance* of different models in order to choose the .bold["best"] one * Choosing *hyperparmaters* ] .font150.pull-right[ ## Model assessment Having chosen a final model, estimating its prediction error (i.e., generalization error) on .bold[new data] ] .full-width.content-box-yellow.font175.center[If we have "enough" data, divide it up!] --- # Data splitting .full-width.content-box-yellow.font150.center[How do we divide up the data for each task?] .font135[ * The .bold[training set] is used to fit the models * The .bold[validation set] is used to estimate prediction error for model selection * The .bold[test set] is used for assessment of the generalization error of the final chosen model - Ideally, the test set should be kept under 🔒 and 🔑 until the end of the data analysis (**Why?** 🤔) ] --- # Data splitting .font150[ It is difficult to give a general rule of 👍 on how to choose the number of observations in each of the three parts, as this depends on the *signal-to noise ratio* in the data and the training sample size ] .full-width.content-box-yellow.font150.center[A typical split might be 50% for training, and 25% each for validation and testing] --- # Data splitting strategies <br> <img src="figures/data-splitting.png" width="100%" style="display: block; margin: auto;" /> --- # 5-fold cross-validation <br><br> <img src="figures/cross-validation.png" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle, center .font200[ Generalization error `\(\approx\)` `\(\sigma^2\)` `\(+\)` .red.vold[Bias] `\(+\)` .red.vold[Variance] ] <img src="intro-ml-r_files/figure-html/bias-variance-tradeoff-1.svg" width="80%" style="display: block; margin: auto;" /> --- # Data shape <img src="figures/wide-data-tall-data.png" width="40%" style="display: block; margin: auto;" /> --- # Wide data (lots of features) .pull-left.center[ <img src="figures/wide-data.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Thousands / millions of features * Hundreds of records (i.e., rows) * Prone to over fitting - Need to regularize or otherwise reduce the number features ] .full-width.content-box-yellow.font150.center[Screening with FDR, Elastic net, SVMs, and stepwise procedures] --- # Tall data (lots of observations) .pull-left.center[ <img src="figures/tall-data.png" width="30%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Tens / hundreds of features * Thousands / millions of records (i.e., rows) * Plenty of samples to fit richer models with interactions, etc.! 
] .full-width.content-box-yellow.font150.center[GzLMs, .bold.red[tree-based methods] (like GBMs), and network models (like DNNs)] --- # Data preprocessing and feature engineering .pull-left[ .font100[ Once data are gathered, they must be cleaned and prepped for analysis. Such tasks might include, but are not limited to: * Imputing missing values * Transforming columns * Standardizing/normalizing columns * Creating new features * Formulating interaction effects * Numerically encoding categorical features (e.g., *one-hot encoding* or *dummy encoding*; a minimal sketch appears below) ] ] .pull-right[ <img src="intro-ml-r_files/figure-html/ames-sale-price-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear, middle, center <img src="figures/flow-chart.png" width="100%" style="display: block; margin: auto;" /> .font150[ **Source:** [Scikit-learn](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) ] --- class: clear, middle, center <img src="figures/apm-summary.png" width="100%" style="display: block; margin: auto;" /> .font150[ **Source:** [Applied Predictive Modeling](http://appliedpredictivemodeling.com/) ] --- # A good modeling tool .pull-left.font110[ ## <span style="color:MediumSeaGreen">At a minimum:</span> * Applicable to classification and regression * Competitive accuracy * Capable of handling large data sets * Handles missing values effectively ] -- .pull-right.font110[ ## <span style="color:MediumSeaGreen">Bonus features:</span> * Which predictors are important? ([vip](https://github.com/AFIT-R/vip)) * How do the predictors functionally relate to the response? ([pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html)) * How do the predictors interact? ([pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html), [vip](https://github.com/AFIT-R/vip)) * What is the shape of the data (i.e., how does it cluster?) * Are there novel cases and outliers? ] --- class: clear, middle, center background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Decision trees] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Mushroom classification There is no simple rule for determining the edibility of a [mushroom](https://raw.githubusercontent.com/bgreenwell/MLDay18/master/docs/data/mushroom.csv); no rules like **"leaflets three, let it be"**, **"hairy vine, no friend of mine"**, and **"berries white, run in fright"** for poison ivy.
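As a quick aside before the trees, the *dummy encoding* mentioned on the preprocessing slide can be sketched in a few lines of base R (the tiny `shrooms` data frame below is made up purely for illustration); `caret::dummyVars()` offers a more flexible, formula-based route:

```r
# Minimal dummy-encoding sketch; `shrooms` is a made-up toy data frame,
# not the mushroom data used below
shrooms <- data.frame(
  odor = factor(c("almond", "foul", "none")),  # categorical feature
  cap.diameter = c(5.2, 8.1, 3.3)              # numeric feature
)

# Base R: model.matrix() expands factors into 0/1 indicator columns
model.matrix(~ . - 1, data = shrooms)  # -1 drops the intercept column

# Same idea via caret (one indicator column per factor level)
library(caret)
dv <- dummyVars(~ ., data = shrooms)
predict(dv, newdata = shrooms)
```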
<img src="figures/edible.jpg" width="50%" style="display: block; margin: auto;" /> -- ```r url <- "https://bgreenwell.github.io/MLDay18/data/mushroom.csv" mushroom <- read.csv(url) # load the data from GitHub *mushroom$veil.type <- NULL # only takes on a single value ``` --- class: clear, middle # Mushroom classification .code125[ ```r # Load required packages library(caret) # for data splitting function library(rpart) # for binary recursive partitioning *# Partition the data into train/test sets set.seed(101) trn_id <- createDataPartition( y = mushroom$Edibility, p = 0.5, list = FALSE ) *trn <- mushroom[trn_id, ] # training data *tst <- mushroom[-trn_id, ] # test data # Function to calculate accuracy accuracy <- function(pred, obs) { sum(diag(table(pred, obs))) / length(obs) } ``` ] --- class: clear, middle # Mushroom classification .pull-left[ ```r # Decision stump (test error = 1.53%): cart1 <- rpart( Edibility ~ ., data = trn, * control = rpart.control(maxdepth = 1) ) # Get test set predictions pred1 <- predict( cart1, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred1, obs = tst$Edibility ) ## [1] 0.9847366 ``` ] .pull-right[ ```r # Complex tree (test error = 0%): cart2 <- rpart( Edibility ~ ., data = trn, * control = list(cp = 0, minbucket = 1, minsplit = 1) ) # Get test set predictions pred2 <- predict( cart2, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred2, obs = tst$Edibility ) ## [1] 1 ``` ] --- # Mushroom classification .pull-left[ .font125[Decision stump (test error = 1.53%):] <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-1-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ .font125[Complex tree (test error = 0%):] <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-2-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # A handy rule of 👍 for mushrooms <br><br><br><br> .right.font300.tomato["If it smells bad, don't eat it!"] <br> .right.font200[--- Decision Stump] --- # Predicting email spam .pull-left[ - Data from 4601 email messages collected at Hewlett-Packard Labs - **Goal:** predict whether an email message is <span style="color:Tomato">spam</span> (junk email) or <span style="color:MediumSeaGreen ">ham</span> (good email) - **Features:** relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training the email messages - For this problem, not all errors are equal; misclassifying <span style="color:Tomato">spam</span> is not as bad as misclassifying <span style="color:MediumSeaGreen">ham</span>! 
] .pull-right[ <img src="figures/spam.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # Predicting email spam ```r # Load the data *data(spam, package = "kernlab") # Partition the data into train/test sets set.seed(101) # for reproducibility trn_id <- createDataPartition(spam$type, p = 0.7, list = FALSE) trn <- spam[trn_id, ] # training data tst <- spam[-trn_id, ] # test data xtrn <- subset(trn, select = -type) # training data features xtst <- subset(tst, select = -type) # test data features ytrn <- trn$type # training data response # Fit a classification tree (cp found using k-fold CV) *spam_tree <- rpart(type ~ ., data = trn, cp = 0.001) pred <- predict(spam_tree, newdata = xtst, type = "class") # Compute test set accuracy (spam_tree_acc <- accuracy(pred = pred, obs = tst$type)) ## [1] 0.9260334 ``` --- # Predicting email spam .scrollable[ <img src="intro-ml-r_files/figure-html/spam-tree-diagram-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- # Predicting email spam .pull-left[ ```r # Variable importance scores *vip::vi(spam_tree) %>% as.data.frame() %>% head(10) ## Variable Importance ## 1 charExclamation 507.78427 ## 2 capitalLong 259.80387 ## 3 capitalAve 198.78828 ## 4 free 198.33076 ## 5 charDollar 194.74814 ## 6 your 182.63595 ## 7 remove 176.82921 ## 8 all 126.72083 ## 9 capitalTotal 107.79681 ## 10 hp 74.31123 ``` ] .pull-right[ ```r # Variable importance plot *vip(spam_tree, num_features = 10) ``` <img src="intro-ml-r_files/figure-html/spam-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # How do trees measure up? .pull-left[ .font150.bold[Advantages:] * .MediumSeaGreen[Small trees are easy to interpret] * .MediumSeaGreen[Trees scale well to large *N*] (fast!!) * .MediumSeaGreen[Can handle data of all types] (i.e., requires little, if any, coding) * .MediumSeaGreen[Automatic variable selection] * .MediumSeaGreen[Can handle missing data] (through *surrogate splits*) * .MediumSeaGreen[Completely nonparametric] (great for DM and EDA tasks!) ] -- .pull-right[ .font150.bold[Disadvantages:] * .tomato[Large trees can be difficult to interpret] * .tomato[Trees are step functions] (i.e., binary splits) * .tomato[Greedy splitting algorithms] (i.e., trees are noisy) * .tomato[All splits depend on previous splits] (i.e., high order interactions) * .tomato[Data is essentially taken away after each split] ] --- class: clear, middle, center # How do trees measure up? <img src="figures/esl-tbl.png" width="70%" style="display: block; margin: auto;" /> <br> .center[**Source:** The Elements of Statistical Learning, Second Edition] --- # Creating ensembles of trees * The key to accuracy is **low bias** and **low variance** * **Main idea:** breaking the [*bias-variance trade-off*](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) - `\(Error \approx Bias + Variance\)` * Two popular algorithms (can be applied to any model, **not just trees!**): - Bagging (<span style="color:Magenta">reduce variance by averaging</span>) - Boosting (<span style="color:Magenta">reduce bias by building models sequentially</span>) * Random forest is just a slight modification over bagging applied to decision trees! 
<br> <center> <font size="6"><span style="color:DarkBlue">Boosting</span> >= <span style="color:MediumSeaGreen">Random forest</span> > <span style="color:MediumOrchid ">Bagging</span> > <span style="color:DarkOrange">Single tree</span></font> </center> --- class: clear, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Bagging] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # The bootstrap <img src="figures/bootstrap-sample.png" width="100%" style="display: block; margin: auto;" /> --- # **B**ootstrap **AGG**regat**ING** .pull-left[ * Bagging trees: 1. Sample records **with replacement** (i.e., bootstrap the training data) 2. Fit an **overgrown tree** to the resampled data set 3. Repeat a large number of times ] .pull-right[ <img src="figures/millionaire.png" width="70%" style="display: block; margin: auto;" /> ] * Predictions are combined by popular vote (**classification**) or averaging (**regression**) * Same idea as ["wisdom of the crowd"](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) (<span style="color:purple">Francis Galton's Ox weight survey</span>) * Improves the stability and accuracy of noisy models (e.g., individual trees) --- # Bagging trees <img src="intro-ml-r_files/figure-html/bagging-1-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Trees are step functions <img src="intro-ml-r_files/figure-html/bagging-2-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Trees are noisy <img src="intro-ml-r_files/figure-html/bagging-3-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Bagging reduces variance <img src="intro-ml-r_files/figure-html/bagging-4-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Predicting email spam .pull-left[ ```r # Load required packages *library(randomForest) # Fit a bagger model *xtst <- subset(tst, select = -type) *set.seed(1633) # reproducibility spam_bag <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = ncol(xtrn), xtest = xtst, ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/unnamed-chunk-5-1.gif" width="100%" style="display: block; margin: auto;" /> ] --- class: clear, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Random forest] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Background * **Bagging on steroids!!** * Correlation limits the effect of bagging! * Grow trees just as in bagging, but with a small twist - Before each split in each tree in the forest, select a subset of the predictors **at random** as candidate splitters - This essentially **"decorrelates"** the trees in the ensemble! - The number of randomly selected variables, denoted `\(m_{try}\)`, is a **tuning parameter** (arguably the only tuning parameter in a random forest) * Bagging introduces <span style="color:CornFlowerBlue">randomness into the rows</span> of the data * Random forest introduces <span style="color:CornFlowerBlue">randomness into the rows and columns</span> of the data * [*Extremely randomized trees*](https://link.springer.com/article/10.1007/s10994-006-6226-1) take this randomness one step further to reduce variance! 
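To make the `\(m_{try}\)` twist concrete, a rough sketch (assuming the spam training data `trn` created earlier): with `mtry` set to the number of features, `randomForest()` reduces to bagging, while a small `mtry` decorrelates the trees:

```r
# Bagging vs. random forest differ only in mtry; assumes the spam
# training data `trn` created earlier in this talk
library(randomForest)

p <- ncol(trn) - 1  # number of features (excluding the response)

set.seed(1633)  # for reproducibility
spam_bag2 <- randomForest(type ~ ., data = trn, ntree = 250,
                          mtry = p)              # all columns: bagging

set.seed(1633)
spam_rf2 <- randomForest(type ~ ., data = trn, ntree = 250,
                         mtry = floor(sqrt(p)))  # random subset: RF

# Compare out-of-bag (OOB) error rates after 250 trees
c(bagging = tail(spam_bag2$err.rate[, "OOB"], n = 1),
  rf = tail(spam_rf2$err.rate[, "OOB"], n = 1))
```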
--- # Random forest packages in R .scrollable[ .pull-left[ * [randomForest](https://cran.r-project.org/package=randomForest) - The standard (implements most of the **bells and whistles**; great for EDA!!) - Classification and regression - Does not scale well, but can be parallelized using `foreach`! * [randomForestSRC](https://kogalur.github.io/randomForestSRC/index.html) - Classification, regression, and survival analysis (including ***competing risks***) - MPI/OpenMP support * [ranger](https://github.com/imbs-hl/ranger): **ran**dom forest **ge**ne**r**ator - Classification, regression, and survival analysis - [Fast!](https://www.jstatsoft.org/article/view/v077i01) - Accepts sparse numeric matrices (i.e., `dgCMatrix` objects from the [Matrix](https://cran.r-project.org/package=Matrix) package) - Estimated time to completion!! ] .pull-right[ * [party/partykit](https://cran.r-project.org/web/packages/party/index.html) - Uses *conditional inference trees* as the base learners - Unbiased variable selection - Can be slow on large data sets * [bigrf](https://github.com/aloysius-lim/bigrf) - Classification only - Currently **ORPHANED** on CRAN - Disk caching using [`bigmemory`](https://cran.r-project.org/web/packages/bigmemory/) * [Rborist](https://github.com/suiji/Arborist) - Offers parallel, distributed tree construction * [h2o](https://github.com/h2oai/h2o-3/tree/master/h2o-r) - Distributed random forests via cloud computing - Can be stacked with other [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) models, like GBMs and DNNs!! ] ] --- # Predicting email spam .pull-left[ ```r # Load required packages library(randomForest) # Fit a random forest xtst <- subset(tst, select = -type) set.seed(1633) # reproducibility spam_rf <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = 7, # floor(sqrt(p)) xtest = xtst, ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/unnamed-chunk-7-1.gif" width="100%" style="display: block; margin: auto;" /> ] --- class: clear # Random forest .pull-left[ ## <span style="color:MediumSeaGreen">Advantages:</span> * Competitive accuracy * Supervised **AND** unsupervised learning * *Out-of-bag* data 😎 - Free cross-validation - Novel variable importance * Novel <span style="color:Purple">proximity matrix</span> 😎 - Outlier/novelty detection - Multi-dimensional scaling - Missing value imputation ] .pull-right[ ## <span style="color:Tomato">Disadvantages:</span> * Missing values (**why?**) * Can be slow on large data sets since trees are intentionally overgrown (i.e., deep) - `\(m_{try}\)` mitigates this issue to some extent ] --- # Out-of-bag data For large enough `\(N\)`, on average, `\(1 - e^{-1} \approx 63.21\)`% of the original records end up in any bootstrap sample .code75[ ```r N <- 100000 set.seed(1537) # for reproducibility x <- rnorm(N) mean(x %in% sample(x, replace = TRUE)) # non-OOB proportion ## [1] 0.63232 ``` ] * Roughly `\(e^{-1} \approx 36.79\)`% of the observations are not used in the construction of a particular tree * These observations are considered *out-of-bag* (OOB) and can be used to - Provide an unbiased assessment of model performance (**sort of an unstructured, but free, cross-validation**) - Construct novel variable importance measures based on the **predictive strength** of each feature!! --- class: clear, middle, center # Predicting email spam .font150.bold.left[OOB vs.
test error: e-mail spam example] <img src="intro-ml-r_files/figure-html/unnamed-chunk-8-1.svg" width="100%" style="display: block; margin: auto;" /> --- # OOB-based variable importance * **Traditional approach:** Average variable importance scores (i.e., total goodness of split) over all the trees in the ensemble (this is what is done for GBMs) * **Novel approach:** To estimate the importance of the `\(k\)`-th variable, `\(x_k\)`: 1. Record the OOB performance of the model 2. Randomly permute all the values of `\(x_k\)` in the OOB data 3. Recompute the OOB performance of the model 4. The difference between the two OOB performance measures the strength of the structural importance of `\(x_k\)` * Fundamentally different as these importance scores are **NOT** based on data seen by the individual trees! --- # Boston housing example * Housing data from `\(N = 506\)` census tracts in the city of Boston for the year 1970 * The data violate many classical assumptions like linearity, normality, and constant variance -- * Harrison and Rubinfeld's housing value equation ( `\(R^2 = 0.81\)` ): <img src="figures/boston-eqn.png" width="80%" style="display: block; margin: auto;" /> -- * Nowadays, many supervised learning algorithms can fit the data automatically in seconds (**typically with higher accuracy!**) * The downfall, however, is some loss of interpretation since these algorithms typically do not produce simple prediction formulas (**but there's still hope!** 🙏) --- class: clear ```r # Load required packages *library(ranger) # a much faster implementation of random forest # Load the (corrected) Boston housing data data(boston, package = "pdp") ``` .pull-left[ ```r # Using the randomForest package set.seed(2007) # reproducibility system.time( boston_rf <- randomForest( cmedv ~ ., data = boston, * ntree = 5000, * mtry = 5, * importance = FALSE ) ) ## user system elapsed ## 8.748 0.181 8.910 ``` ``` ## [1] 0.893066 ``` ] .pull-right[ ```r # Using the ranger package set.seed(1652) # reproducibility system.time( boston_ranger <- ranger( cmedv ~ ., data = boston, * num.trees = 5000, * mtry = 5, # :/ * importance = "impurity" ) ) ## user system elapsed ## 6.515 0.304 1.180 ``` ``` ## [1] 0.8928631 ``` ] --- # Variable importance plots .pull-left[ * Most RF packages provide variable importance measures, but not all of them provide variable importance plots (VIPs) * Enter...[vip](https://koalaverse.github.io/vip/index.html) - Provides a consistent framework for extracting and plotting variable importance scores from many types of ML models - `vi()` always returns a [*tibble*](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) - `vip()` uses [ggplot2](http://ggplot2.org/) ] .pull-right[ ```r # Install from CRAN install.packages("vip") ``` <img src="figures/vip-logo.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear, middle ```r # Construct variable importance plots (the old way) par(mfrow = c(1, 2)) # side-by-side plots *varImpPlot(boston_rf, main = "") # randomForest::varImpPlot() ``` <img src="intro-ml-r_files/figure-html/boston-rf-varImpPlot-1.svg" width="80%" style="display: block; margin: auto;" /> --- class: clear ```r # Load required packages library(vip) # for better (and consistent) variable importance plots # Construct variable importance plots p1 <- vip(boston_rf, type = 1) + ggtitle("randomForest") p2 <- vip(boston_rf, type = 2) + ggtitle("randomForest") p3 <- vip(boston_ranger) + ggtitle("ranger") *grid.arrange(p1, p2, p3, ncol = 3) # side-by-side 
plots ``` <img src="intro-ml-r_files/figure-html/boston-rf-vip-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Partial dependence plots .scrollable[ * [Partial dependence plots (PDPs)](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) help visualize the relationship between a subset of the features (typically 1-3) and the response * Let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)` * The partial dependence of the response on `\(x_1\)` can be constructed as follows: 1. For `\(i \in \left\{1, 2, \dots, k\right\}\)`: a. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)` b. Compute the vector of predicted values from the modified copy of the training data c. Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)` 2. Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)` ] --- # Partial dependence plots .pull-left[ * Not all RF implementations provide support for constructing partial dependence plots (PDPs) * Enter...[pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) - A consistent way of constructing PDPs (and more) from many types of ML models (not just RFs) - Allows for multivariate displays (i.e., interactions) and so much more!! - Includes options for **parallel processing** and **progress bars** 😎 ] .pull-right[ ```r # Install from CRAN install.packages("pdp") ``` <img src="figures/pdp-logo.png" width="80%" style="display: block; margin: auto;" /> ] --- class: clear, middle # Partial dependence plots ```r # Load required packages library(pdp) # PDPs for the top two predictors p1 <- partial(boston_ranger, pred.var = "lstat", plot = TRUE) p2 <- partial(boston_ranger, pred.var = "rm", plot = TRUE) *p3 <- partial(boston_ranger, pred.var = c("lstat", "rm"), * chull = TRUE, plot = TRUE) grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/boston-rf-pdps-1.svg" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle # Partial dependence plots .code115[ ```r *# 3-D plots *pd <- attr(p3, "partial.data") # no need to recalculate p1 <- plotPartial(pd, levelplot = FALSE, drape = TRUE, colorkey = FALSE, screen = list(z = -20, x = -60) ) ``` ```r *# Using ggplot2 library(ggplot2) p2 <- autoplot(pd, palette = "magma") ``` ```r *# ICE and c-ICE curves p3 <- boston_ranger %>% # %>% is automatically imported! partial(pred.var = "rm", ice = TRUE, center = TRUE) %>% autoplot(alpha = 0.1) ``` ] --- class: clear, middle # Partial dependence plots .code150[ ```r # Display all three plots side by side grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/unnamed-chunk-12-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Proximity matrix .font115[ * In random forests (and bagging), **trees are intentionally grown deep**! ] --- class: clear, middle, center <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-3-1.svg" width="70%" style="display: block; margin: auto;" /> --- # Proximity matrix .font115[ * In random forests (and bagging), **trees are intentionally grown deep**! 1. Initialize the proximity between cases `\(i\)` and `\(j\)` to zero: `\(prox_{ij} = 0\)` 2. After a tree is grown, put all of the data, both training and OOB, down the tree 3. If cases `\(i\)` and `\(j\)` land in the same terminal node, increase `\(prox_{ij}\)` by one 4. 
At the end, normalize `\(prox_{ij}\)` by dividing by the number of trees in the forest * The pairwise proximities can be used to construct a useful distance matrix * **Requires the storage of an `\(N \times N\)` matrix!!** 😱 * Only available in [randomForest](https://cran.r-project.org/package=randomForest), [party](https://cran.r-project.org/web/packages/party/index.html), and [bigrf](https://github.com/aloysius-lim/bigrf) ] --- # Swiss banknote data ```r # Load the data data(banknote, package = "alr3") # Fit a random forest set.seed(1701) # for reproducibility banknote_rf <- randomForest( as.factor(Y) ~ ., data = banknote, * proximity = TRUE ) # Print the OOB confusion matrix banknote_rf$confusion ## 0 1 class.error ## 0 99 1 0.01 ## 1 0 100 0.00 ``` --- # Proximity matrix ```r # Heatmap of proximity-based distance matrix heatmap(1 - banknote_rf$proximity, col = viridis::plasma(256)) ``` <img src="intro-ml-r_files/figure-html/heatmap-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Outlier scores Outlyingness of case `\(n\)` in class `\(j\)` to all other cases in class `\(j\)`: `\(out\left(n\right) = N / \sum_{cl\left(k\right) = j} prox_{ik} ^ 2\)` .pull-left[ ```r # Dot chart of proximity-based outlier scores outlyingness <- tibble::tibble( * "out" = outlier(banknote_rf), "obs" = seq_along(out), "class" = as.factor(banknote$Y) ) ggplot(outlyingness, aes(x = obs, y = out)) + geom_point(aes(color = class, size = out), alpha = 0.5) + * geom_hline(yintercept = 10, linetype = 2) + labs(x = "Observation", y = "Outlyingness") + theme_light() + theme(legend.position = "none") ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/outlier-plot-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- # Multi-dimensional scaling ```r # Multi-dimensional scaling plot of proximity matrix MDSplot(banknote_rf, fac = as.factor(banknote$Y), k = 2, cex = 1.5) ``` <img src="intro-ml-r_files/figure-html/MDS-1.svg" width="45%" style="display: block; margin: auto;" /> --- # Missing value imputation * Random forest offers a novel approach to imputing missing values (i.e., `NA`s) 1. Start with cheap imputation (e.g., median for continuous features) 2. Run a random forest and obtain the proximity matrix 3. Imputed values are updated using: - The weighted average of the non-missing observations, where the weights are the proximity (**continuous**) - The category with the largest average proximity (**categorical**) 4. Iterate this procedure until convergence (~ 4-6 times seems sufficient) * Highly computational (requires many random forest runs!!) <!-- * Resulting **OOB error estimate tends to be overly optimistic** --> * Only available in [randomForest](https://cran.r-project.org/package=randomForest) --- class: clear, middle, center background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Boosting] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Background .font125[ * Boosting is a general technique for creating an ensemble of models (*meta-algorithm*) * In contrast to bagging, boosting: - Uses a *weak learner* for the base learners - Builds the models sequentially (fix past mistakes) * Most commonly used with **shallow decision trees** (e.g., decision stumps), but other types of models can be boosted * **Boosting requires more tuning**, but is typically faster and more accurate than RFs! 
] --- # Background .font125[ * **Many flavors of boosting:** AdaBoost, Real AdaBoost, gradient boosting, and stochastic gradient boosting (GBM), for example * GBMs are the most **general** and widely used flavor of boosting - Introduces subsampling into the rows and columns (**stochastic**) - Loss function is optimized using *gradient descent* (**gradient**) - With the right combination of parameters, GBMs can emulate RFs! ] --- # GBMs with squared-error loss 1. Given a training sample `\(\boldsymbol{d} = \left(\boldsymbol{X}, \boldsymbol{y}\right)\)`. Fix the number of steps `\(B\)`, the shrinkage factor `\(\epsilon\)`, and the tree depth `\(d\)`. Set the initial fit `\(\widehat{G}_0 \equiv 0\)`, and the residual vector `\(\boldsymbol{r} = \boldsymbol{y}\)`. 2. For `\(b = 1, 2, \dots, B\)` repeat: a) Fit a regression tree `\(\tilde{g}_b\)` to the data `\(\left(\boldsymbol{X}, \boldsymbol{r}\right)\)`, grown best-first to depth `\(d\)` b) Update the fitted model with a shrunken version of `\(\tilde{g}_b\)`: `\(\widehat{G}_b = \widehat{G}_{b - 1} + \epsilon \tilde{g}_b\)` c) Update the residuals accordingly: `\(r_i = r_i - \epsilon \tilde{g}_b\left(x_i\right)\)`, `\(i = 1, 2, \dots, n\)` 3. Return the sequence of fitted functions `\(\widehat{G}_b\)`, `\(b = 1, 2, \dots, B\)`. * Here, `\(B\)`, `\(\epsilon\)`, and `\(d\)` are tuning parameters .font150.center[ [Poor man's GBM code](https://github.com/bgreenwell/intro-ml-r/blob/master/code/rpartBoost.R) ] ??? Source: [Computer Age Statistical Inference, pg. 333](https://web.stanford.edu/~hastie/CASI/) --- class: clear, center, middle .center.font200[Boosting regression stumps] ![]() --- # Some GBM packages in R .scrollable[ * [gbm](https://github.com/gbm-developers/gbm): **g**eneralized **b**oosted **m**odels - The original R implementation of GBMs (by Greg Ridgeway) - Slower than modern implementations (but still pretty fast) - Provides OOB error estimate - Supports the *weighted tree traversal* method for fast construction of PDPs - ~~Currently orphaned on CRAN~~ * [gbm3](https://github.com/gbm-developers/gbm3): **g**eneralized **b**oosted **m**odels - Shiny new version of `gbm` that is not backwards compatible - Faster and supports parallel tree building - Not currently listed on CRAN * [xgboost](https://github.com/dmlc/xgboost): <span style="color:DeepSkyBlue">eXtreme Gradient Boosting</span> - Widely used by data scientists to achieve **state-of-the-art** results across many machine learning challenges (e.g., Kaggle) - Designed to be highly efficient, flexible, and portable - Supports user-defined loss functions - Supports early stopping - Includes fast histogram method for numeric features - Parallel tree boosting and GPU support!! <img src="figures/gpu.jpg" width="50%" style="display: block; margin: auto;" /> * [lightgbm](https://github.com/Microsoft/LightGBM): - Microsoft-supported, and somewhat [faster](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment), alternative to **xgboost** - Uses *leaf-wise* (i.e., best-first) tree growth - Not currently listed on CRAN * [mboost](https://github.com/boost-R): **m**odel-based **boost**ing - Implements tools for fitting and evaluating a variety of regression and classification models (somewhere between GzLMs/GAMs and GBMs) ] --- # Missing values .font140[ * Unlike RFs, GBMs typically support missing values!
* In most GBM implementations, `NA`s are interpreted as containing information (i.e., missing for a reason), rather than missing at random - In [gbm](https://github.com/gbm-developers/gbm) and [gbm3](https://github.com/gbm-developers/gbm3), a separate branch is created at each split for `NA`s (i.e., each decision node contains three splits: left, right, and missing) - In [h2o](https://github.com/h2oai/h2o-3/tree/master/h2o-r) and [xgboost](https://github.com/dmlc/xgboost) each split has a **default direction** that is learned during training (`NA`s take the default route!) ] --- class: clear, middle # Boston housing example .code110[ ```r # Load required packages library(gbm) # Fit a GBM to the Boston housing data set.seed(1053) # for reproducibility boston_gbm <- gbm( cmedv ~ ., data = boston, * var.monotone = NULL, * distribution = "gaussian", # "benoulli", "coxph", etc. * n.trees = 10000, * interaction.depth = 5, * n.minobsinnode = 10, * shrinkage = 0.005, * bag.fraction = 1, * train.fraction = 1, cv.folds = 10 # k-fold CV often gives the best results ) ``` ] --- class: clear, middle # Boston housing example ```r best_iter <- gbm.perf( # how many trees? boston_gbm, * method = "cv" # or "OOB" or "test" ) ``` <img src="intro-ml-r_files/figure-html/boston_gbm_best-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Partial dependence plots .pull-left[ ## Brute force method: ```r system.time( pd1 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = FALSE, chull = TRUE, * n.trees = best_iter ) ) ## user system elapsed ## 112.222 0.839 113.320 ``` ] .pull-right[ ## Recursive method: ```r system.time( pd2 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = TRUE, chull = TRUE, * n.trees = best_iter ) ) ## user system elapsed ## 0.458 0.002 0.460 ``` ] --- # Partial dependence plots ```r # Display plots side by side grid.arrange(autoplot(pd1), autoplot(pd2), ncol = 2) ``` <img src="intro-ml-r_files/figure-html/boston-pdps-1.svg" style="display: block; margin: auto;" /> --- # Ames housing data .pull-left.font120[ * A contemporary alternative to the often cited Boston housing data * Contains the sale of individual residential property in Ames, Iowa from 2006 to 2010 * Contains `\(N = 2930\)` observations and `\(p = 80\)` features involved in assessing home values * Lots of potential for *feature engineering*!! 
] .pull-right[ ```r ggplot(AmesHousing::make_ames(), aes(x = Sale_Price, y = Overall_Qual)) + * ggridges::geom_density_ridges(aes(fill = Overall_Qual)) + scale_x_continuous(labels = scales::dollar) + labs(x = "Sale price", y = "Overall quality") + theme_light() + theme(legend.position = "none") ``` <img src="intro-ml-r_files/figure-html/ames-densities-1.svg" width="100%" style="display: block; margin: auto;" /> .font80[**Source:** [Bradley Boehmke](https://github.com/bradleyboehmke)] ] --- class: clear, middle # Ames housing data .scrollable[ ```r # Load required packages library(xgboost) # Construct data set ames <- AmesHousing::make_ames() *# Feature matrix # or xgb.DMatrix or sparse matrix X <- data.matrix(subset(ames, select = -Sale_Price)) # Fit an XGBoost model set.seed(203) # for reproducibility ames_xgb <- xgboost( # tune using `xgb.cv()` data = X, label = ames$Sale_Price, * objective = "reg:linear", # loss function * nrounds = 2771, # number of trees * max_depth = 5, # interaction depth * eta = 0.01, # learning rate * subsample = 1, * colsample = 1, * num_parallel_tree = 1, * eval_metric = "rmse", verbose = 0, * save_period = NULL ) ``` ] --- class: clear, middle <img src="intro-ml-r_files/figure-html/ames-xgboost-cv-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Ames housing data ```r # Variable importance plots p1 <- vip(ames_xgb, feature_names = colnames(X), type = "Gain") p2 <- vip(ames_xgb, feature_names = colnames(X), type = "Cover") p3 <- vip(ames_xgb, feature_names = colnames(X), type = "Frequency") grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/xgboost-vip-1.svg" style="display: block; margin: auto;" /> --- # Ames housing data .pull-left[ ```r # By default, `vip()` plots the # top 10 features vip( object = ames_xgb, feature_names = colnames(X), type = "Gain", num_features = 20, #nrow(X), bar = FALSE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/xgboost-vip-3-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Ames housing data ```r # Partial dependence plots oq_ice <- partial(ames_xgb, pred.var = "Overall_Qual", ice = TRUE, center = TRUE, train = X) p4 <- autoplot(partial(ames_xgb, pred.var = "Gr_Liv_Area", train = X)) p5 <- autoplot(partial(ames_xgb, pred.var = "Garage_Cars", train = X)) p6 <- autoplot(oq_ice, alpha = 0.1) grid.arrange(p4, p5, p6, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/xgboost-pdp-1.svg" width="90%" style="display: block; margin: auto;" /> --- class: clear, middle, center # Ames housing data .font200.center[Variable effect and variable importance] <img src="intro-ml-r_files/figure-html/xgboost-pdp-vi-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Summary .scrollable[ .center.bold[Check out R's [machine learning task view](https://cran.r-project.org/web/views/MachineLearning.html)!] 
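One practical note on the "requires more tuning" point below: the Ames example above hard-coded `nrounds = 2771`; a rough sketch of choosing that value with cross-validated early stopping (assuming the Ames feature matrix `X` built earlier) might look like:

```r
# Rough sketch of choosing nrounds via cross-validation with early
# stopping; assumes the Ames feature matrix X built earlier in this talk
library(xgboost)

set.seed(203)  # for reproducibility
ames_cv <- xgb.cv(
  data = X,
  label = ames$Sale_Price,
  objective = "reg:linear",      # same loss as the earlier fit
  nrounds = 5000,                # upper bound on the number of trees
  max_depth = 5,
  eta = 0.01,
  nfold = 5,                     # 5-fold cross-validation
  early_stopping_rounds = 50,    # stop once test RMSE stops improving
  verbose = 0
)
ames_cv$best_iteration  # candidate value for nrounds
```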
.pull-left[ ### <span style="color:MediumSeaGreen">Random forests:</span> * Builds an ensemble of fully grown decision trees (**low bias, high variance**) - Correlation between trees is reduced through subsampling the columns - Variance is reduced through averaging * Nearly automatic * Lots of cool bells and whistles * Trees are independently grown (embarrassingly parallel) ] .pull-right[ ### <span style="color:MediumSeaGreen">GBMs:</span> * Builds an ensemble of small decision trees (**high bias, low variance**) - Bias is reduced through sequential learning and fixing past mistakes * Requires more tuning, but can be more accurate * Flexible loss functions * Trees are **NOT** grown independently, but training times are usually pretty fast since trees are not grown too deep ] ] --- class: clear, middle .pull-left.font100.left[ <img src="figures/leo-breiman.jpg" width="60%" style="display: block; margin: auto;" /> "Take the output of random forests not as absolute truth, but as smart computer-generated guesses that may be helpful in leading to a deeper understanding of the problem." --- Leo Breiman ] .pull-right.font100.left[ <img src="figures/george-box.jpg" width="54%" style="display: block; margin: auto;" /> "All models are wrong, but some are useful." --- George E. Box ] --- class: clear, middle, center .font300[Questions?] <img src="figures/questions.png" width="100%" style="display: block; margin: auto;" />