class: center, middle, inverse, title-slide # Random Forests and Gradient Boosting Machines in R ## ↟↟↟↟↟
↟↟↟↟
GitHub:
https://github.com/bgreenwell/MLDay18
### Brandon M. Greenwell ### 02/14/2018 --- ## R package prerequisites .scrollable[ ```r # Install required packages from CRAN install.packages( "AmesHousing", # for Ames housing data "caret", # for data splitting function "gbm", # for generalized boosted models "ggplot2", # for nicer graphics "kernlab", # for email spam data "pdp", # for partial dependence plots "ranger", # for faster random forest "randomForest", # for classic random forest "RColorBrewer", # for nicer looking color palettes "rpart", # for binary recursive partitioning (i.e., CART) "tibble", # for nicer data frames "xaringan", # for building this presentation "xgboost" # for eXtreme Gradient Boosting ) # Install required packages from GitHub devtools::install_github("AFIT-R/vip") # for variable importance plots ``` ] --- ## Data set information .scrollable[ .pull-left[ ### Classification: * [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) - 21 physical characteristics on 8124 mushrooms classified as either <span style="color:Tomato">poisonous</span> or <span style="color:MediumSeaGreen">edible</span> * [spam](http://search.r-project.org/library/kernlab/html/spam.html) - 57 variables on 4601 emails classified as <span style="color:Tomato">spam</span> or <span style="color:MediumSeaGreen">ham</span> * [banknote](http://search.r-project.org/library/alr3/html/banknote.html) - Six measurements made on 100 <span style="color:MediumSeaGreen">genuine</span> Swiss banknotes and 100 <span style="color:Tomato">counterfeit</span> ones ] .pull-right[ ### Regression: * [boston](http://search.r-project.org/library/pdp/html/boston.html) - Data on median housing values from 506 census tracts in the suburbs of Boston from the 1970 census * [ames](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) - Data describing the sale of individual residential property in Ames, Iowa from 2006 to 2010 ] ] --- ## A good modelling tool .pull-left[ ### <span style="color:MediumSeaGreen">At a minimum:</span> * Applicable to classification and regression * Competitive accuracy * Capable of handling large data sets * Handles missing values effectively ] -- .pull-right[ ### <span style="color:MediumSeaGreen">Bonus features:</span> * Which predictors are important? ([`vip`](https://github.com/AFIT-R/vip)) * How do the predictors functionally relate to the response? ([`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html)) * How do the predictors interact? ([`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html), [`vip`](https://github.com/AFIT-R/vip)) * What is the shape of the data (i.e., how does it cluster?) * Are there novel cases and outliers? ] --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Decision trees ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Mushroom classification There is no simple rule for determining the edibility of a [mushroom](https://raw.githubusercontent.com/bgreenwell/MLDay18/master/data/mushroom.csv); no rules like **"leaflets three, let it be"**, **"hairy vine, no friend of mine"** and **"berries white, run in fright"** for poison ivy.
<img src="figures/edible.jpg" width="50%" style="display: block; margin: auto;" /> -- ```r path <- paste0("https://raw.githubusercontent.com/bgreenwell/", "MLDay18/master/data/mushroom.csv") mushroom <- read.csv(path) # load the data from GitHub *mushroom$veil.type <- NULL # only takes on a single value ``` --- ## Mushroom classification ```r # Load required packages library(caret) # for data splitting function library(rpart) # for binary recursive partitioning # Partition the data into train/test sets set.seed(101) trn_id <- createDataPartition( y = mushroom$Edibility, p = 0.5, list = FALSE ) trn <- mushroom[trn_id, ] # training data tst <- mushroom[-trn_id, ] # test data # Function to calculate accuracy accuracy <- function(pred, obs) { sum(diag(table(pred, obs))) / length(obs) } ``` --- ## Mushroom classification .pull-left[ ```r # Decision stump (test error = 1.53%): cart1 <- rpart( Edibility ~ ., data = trn, * control = rpart.control(maxdepth = 1) ) # Get test set predictions pred1 <- predict( cart1, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred1, obs = tst$Edibility ) ``` ``` ## [1] 0.9847366 ``` ] .pull-right[ ```r # Optimal tree (test error = 0%): cart2 <- rpart( Edibility ~ ., data = trn, * control = list(cp = 0, minbucket = 1, minsplit = 1) ) # Get test set predictions pred2 <- predict( cart2, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred2, obs = tst$Edibility ) ``` ``` ## [1] 1 ``` ] --- ## Mushroom classification .pull-left[ Decision stump (test error = 1.53%): <img src="figures/mushroom-tree-diagram-1-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ Optimal tree (test error = 0%): <img src="figures/mushroom-tree-diagram-2-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- ## A handy rule for mushrooms -- <br><br><br> .right[<div class="centered"> <span style="color:Tomato"><font size="25">"If it smells bad, don't eat it!"</font></span> </div>] <br> .right[-Decision Stump] --- ## Predicting email spam .pull-left[ - Data from 4601 email messages collected at Hewlett-Packard Labs - **Goal:** predict whether an email message is <span style="color:Tomato">spam</span> (junk email) or <span style="color:MediumSeaGreen ">ham</span> (good email) - **Features:** relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training the email messages - For this problem, not all errors are equal; misclassifying <span style="color:Tomato">spam</span> is not as bad as misclassifying <span style="color:MediumSeaGreen">ham</span>! 
] .pull-right[ <img src="figures/this-is-spam.jpg" width="100%" style="display: block; margin: auto;" /> ] --- ## Predicting email spam ```r # Load the data *data(spam, package = "kernlab") # Partition the data into train/test sets set.seed(101) # for reproducibility trn_id <- createDataPartition(spam$type, p = 0.7, list = FALSE) trn <- spam[trn_id, ] # training data tst <- spam[-trn_id, ] # test data xtrn <- subset(trn, select = -type) # training data features xtst <- subset(tst, select = -type) # test data features ytrn <- trn$type # training data response # Fit a classification tree (cp found using k-fold CV) *spam_tree <- rpart(type ~ ., data = trn, cp = 0.001) pred <- predict(spam_tree, newdata = xtst, type = "class") # Compute test set accuracy (spam_tree_acc <- accuracy(pred = pred, obs = tst$type)) ``` ``` ## [1] 0.9260334 ``` --- ## Predicting email spam .scrollable[ <img src="figures/spam-tree-diagram-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- ## Predicting email spam .pull-left[ ```r # Extract tibble of variable importance scores *vip::vi(spam_tree) ``` ``` ## # A tibble: 48 x 2 ## Variable Importance ## <chr> <dbl> ## 1 charExclamation 508 ## 2 capitalLong 260 ## 3 capitalAve 199 ## 4 free 198 ## 5 charDollar 195 ## 6 your 183 ## 7 remove 177 ## 8 all 127 ## 9 capitalTotal 108 ## 10 hp 74.3 ## # ... with 38 more rows ``` ] .pull-right[ ```r # Construct ggplot2-based variable importance plot *vip(spam_tree, num_features = 10) ``` <img src="figures/spam-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- ## How do trees measure up? .pull-left[ ### Advantages: * <span style="color:MediumSeaGreen">Small trees are easy to interpret</span> * <span style="color:MediumSeaGreen">Trees scale well to large `\(N\)`</span> (fast!!) * <span style="color:MediumSeaGreen">Can handle data of all types</span> (i.e., requires little, if any, coding) * <span style="color:MediumSeaGreen">Automatic variable selection</span> * <span style="color:MediumSeaGreen">Can handle missing data</span> (through *surrogate splits*) * <span style="color:MediumSeaGreen">Completely nonparametric</span> (great for DM and EDA tasks!) ] -- .pull-right[ ### Disadvantages: * <span style="color:Tomato">Large trees can be difficult to interpret</span> * <span style="color:Tomato">Trees are step functions</span> (i.e., binary splits) * <span style="color:Tomato">Greedy splitting algorithms</span> (i.e., trees are noisy) * <span style="color:Tomato">All splits depend on previous splits</span> (i.e., high order interactions) * <span style="color:Tomato">Data is essentially taken away after each split</span> ] --- ## How do trees measure up? <img src="figures/esl-tbl.png" width="65%" style="display: block; margin: auto;" /> <br> .center[**Source:** The Elements of Statistical Learning, Second Edition] --- ## Creating ensembles of trees * The key to accuracy is **low bias** and **low variance** * **Main idea:** breaking the [*bias-variance trade-off*](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) - `\(Error \approx Bias^2 + Variance\)` * Two popular algorithms (can be applied to any model, **not just trees!**): - Bagging (<span style="color:Magenta">reduce variance by averaging</span>; see the sketch below) - Boosting (<span style="color:Magenta">reduce bias by building models sequentially</span>) * Random forest is just a slight modification of bagging applied to decision trees!
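To make the bagging idea concrete, here is a minimal bagging-by-hand sketch (an illustration only, **not** how `randomForest` is implemented; it assumes the `rpart` and `pdp` packages from the prerequisites, and object names like `bag_preds` are made up for this example): bootstrap the rows, grow an overgrown tree on each resample, and average the predictions.

```r
# Minimal bagging-by-hand sketch (illustration only): bootstrap the rows,
# grow an overgrown tree on each resample, and average the predictions
library(rpart)
data(boston, package = "pdp")  # (corrected) Boston housing data
set.seed(1105)  # for reproducibility
B <- 100  # number of bootstrap replicates
bag_preds <- sapply(seq_len(B), function(b) {
  boot_id <- sample(nrow(boston), replace = TRUE)  # bootstrap the rows
  tree <- rpart(cmedv ~ ., data = boston[boot_id, ],
                control = rpart.control(cp = 0, minsplit = 2))  # overgrown tree
  predict(tree, newdata = boston)  # predictions from this tree
})
bag_pred <- rowMeans(bag_preds)  # aggregate by averaging (regression)
```

Averaging the `\(B\)` noisy trees is the variance-reduction step illustrated over the next few slides; for classification, the aggregation step would be a majority vote instead.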
<br> <center> <font size="8"><span style="color:DarkBlue">Boosting</span> >= <span style="color:MediumSeaGreen">Random forest</span> > <span style="color:MediumOrchid ">Bagging</span> > <span style="color:DarkOrange">Single tree</span></font> </center> --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Bagging ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background .pull-left[ * **B**ootstrap **agg**regat**ing** * Bagging trees: 1. Sample records **with replacement** (i.e., bootstrap the training data) 2. Fit an **overgrown tree** to the resampled data set 3. Repeat a large number of times ( `\(\ge 500\)` , say) ] .pull-right[ <img src="figures/millionaire.png" width="80%" style="display: block; margin: auto;" /> ] * Predictions are combined by popular vote (**classification**) or averaging (**regression**) * Same idea as ["wisdom of the crowd"](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) (<span style="color:purple">Francis Galton's Ox weight survey</span>) * Improves the stability and accuracy of noisy models (e.g., individual trees) --- ## Bagging trees <img src="figures/bagging-1-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Trees are step functions <img src="figures/bagging-2-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Trees are noisy <img src="figures/bagging-3-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Bagging reduces variance <img src="figures/bagging-4-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Predicting email spam .pull-left[ ```r # Load required packages *library(randomForest) # Fit a bagger model set.seed(1633) # for reproducibility spam_bag <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = ncol(xtrn), # use all available features xtest = subset(tst, select = -type), ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="figures/unnamed-chunk-3-1.svg" width="400" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Random forest ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background * **Bagging on steroids!!** * Correlation limits the effect of bagging! * Grow trees just as in bagging, but with a small twist - Before each split in each tree in the forest, select a subset of the predictors **at random** as candidate splitters - This essentially **"decorrelates"** the trees in the ensemble! - The number of randomly selected variables, denoted `\(m_{try}\)`, is a **tuning parameter** (arguably the only tuning parameter in a random forest) * Bagging introduces <span style="color:CornFlowerBlue">randomness into the rows</span> of the data * Random forest introduces <span style="color:CornFlowerBlue">randomness into the rows and columns</span> of the data * [*Extremely randomized trees*](https://link.springer.com/article/10.1007/s10994-006-6226-1) take this randomness one step further to reduce variance! --- ## Random forest packages in R .scrollable[ .pull-left[ * [`randomForest`](https://cran.r-project.org/package=randomForest) - The standard (implements most of the **bells and whistles**; great for EDA!!) 
- Classification and regression - Does not scale well, but can be parallelized using `foreach`! * [`randomForestSRC`](https://kogalur.github.io/randomForestSRC/index.html) - Classification, regression, and survival analysis (including ***competing risks***) - MPI/OpenMP support * [`ranger`](https://github.com/imbs-hl/ranger): **ran**dom forest **ge**ne**r**ator - Classification, regression, and survival analysis - [Fast!](https://www.jstatsoft.org/article/view/v077i01) - Accepts sparse numeric matrices (i.e., `dgCMatrix` objects from the [`Matrix`](https://cran.r-project.org/package=Matrix) package) - Estimated time to completion!! 🤑 ] .pull-right[ * [`party`/`partykit`](https://cran.r-project.org/web/packages/party/index.html) - Uses *conditional inference trees* as the base learners - Unbiased variable selection - Can be slow on large data sets * [`bigrf`](https://github.com/aloysius-lim/bigrf) - Classification only - Currently **ORPHANED** on CRAN - Disk caching using [`bigmemory`](https://cran.r-project.org/web/packages/bigmemory/) * [`Rborist`](https://github.com/suiji/Arborist) - Offers parallel, distributed tree construction * [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) - Distributed random forests via cloud computing - Can be stacked with other [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) models, like GBMs and DNNs!! ] ] --- ## Predicting email spam .pull-left[ ```r # Load required packages library(randomForest) # Fit a random forest set.seed(1633) # for reproducibility spam_rf <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = 7, # floor(sqrt(p)) xtest = subset(tst, select = -type), ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="figures/unnamed-chunk-5-1.svg" width="400" style="display: block; margin: auto;" /> ] --- ## Random forest .pull-left[ ### <span style="color:MediumSeaGreen">Advantages:</span> * Competitive accuracy * Supervised and unsupervised learning * *Out-of-bag* data 😎 - Free cross-validation - Novel variable importance * Novel <span style="color:Purple">proximity matrix</span> 😎 - Outlier/novelty detection - Multi-dimensional scaling - Missing value imputation ] .pull-right[ ### <span style="color:Tomato">Disadvantages:</span> * Missing values (**why?**) * Can be slow on large data sets (deep trees)! - `\(m_{try}\)` mitigates this issue to some extent <img src="figures/bells-and-whistles.jpg" width="80%" style="display: block; margin: auto;" /> ] --- ## Out-of-bag data * For large enough `\(N\)`, on average, `\(1 - e^{-1} \approx 63.21\)`% of the original records end up in any bootstrap sample ```r N <- 100000 set.seed(1537) # for reproducibility x <- rnorm(N) mean(x %in% sample(x, replace = TRUE)) # non-OOB proportion ``` ``` ## [1] 0.63232 ``` * Thus, roughly `\(e^{-1} \approx 36.79\)`% of the original observations are not used in the construction of a particular tree * These observations are considered *out-of-bag* (OOB) and can be used to - Provide an unbiased assessment of model performance (**sort of an unstructured, but free, cross-validation**) - Construct a novel variable importance measure based on the **predictive strength** of each predictor (see the sketch below)!!
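As a rough illustration of the permutation idea behind that importance measure (a hedged sketch only: it uses the held-out test set rather than the per-tree OOB samples that `randomForest` actually uses, and the helper name `perm_drop` is made up for this example), permuting an important predictor should hurt accuracy much more than permuting an unimportant one:

```r
# Sketch of permutation-based importance using spam_rf, tst, and accuracy()
# from the earlier slides (test set used here in place of the OOB samples)
perm_drop <- function(object, newdata, y, var) {
  baseline <- accuracy(pred = predict(object, newdata = newdata), obs = y)
  newdata[[var]] <- sample(newdata[[var]])  # permute one column
  baseline - accuracy(pred = predict(object, newdata = newdata), obs = y)
}
set.seed(1554)  # for reproducibility
perm_drop(spam_rf, newdata = tst, y = tst$type, var = "charExclamation")  # expect a larger drop
perm_drop(spam_rf, newdata = tst, y = tst$type, var = "address")          # expect a smaller drop
```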
--- ## Predicting email spam <img src="figures/unnamed-chunk-6-1.svg" width="60%" style="display: block; margin: auto;" /> --- ## OOB-based variable importance * **Traditional approach:** Average variable importance scores (i.e., total goodness of split) over all the trees in the ensemble (this is what is done for GBMs) * **Novel approach:** To estimate the importance of the `\(k\)`-th variable, `\(x_k\)`: 1. Record the OOB performance of the model 2. Randomly permute all the values of `\(x_k\)` in the OOB data 3. Recompute the OOB performance of the model 4. The difference between the two OOB performance values measures the structural importance of `\(x_k\)` * Fundamentally different as these importance scores are **NOT** based on data seen by the individual trees! --- ## Boston housing example * Housing data from `\(N = 506\)` census tracts in the city of Boston for the year 1970 * The data violate many classical assumptions like linearity, normality, and constant variance -- * Harrison and Rubinfeld's housing value equation ( `\(R^2 = 0.81\)` ): <img src="figures/boston-eqn.png" width="80%" style="display: block; margin: auto;" /> -- * Nowadays, many supervised learning algorithms can fit the data automatically in seconds (**typically with higher accuracy!**) * The downfall, however, is some loss of interpretability since these algorithms typically do not produce simple prediction formulas (**but there's still hope!** 🙏) --- ```r # Load required packages *library(ranger) # a much faster implementation of random forest # Load the (corrected) Boston housing data data(boston, package = "pdp") ``` .pull-left[ ```r # Using the randomForest package set.seed(2007) # for reproducibility system.time( boston_rf <- randomForest( cmedv ~ ., data = boston, ntree = 5000, mtry = 5, importance = FALSE ) ) ``` ``` ## user system elapsed ## 6.560 0.044 6.609 ``` ``` ## [1] 0.8944872 ``` ] .pull-right[ ```r # Using the ranger package set.seed(1652) # for reproducibility system.time( boston_ranger <- ranger( cmedv ~ ., data = boston, num.trees = 5000, * mtry = 5, # :/ importance = "impurity" ) ) ``` ``` ## user system elapsed ## 5.932 0.032 0.881 ``` ``` ## [1] 0.8936254 ``` ] --- ## Variable importance plots .pull-left[ * Most RF packages provide variable importance measures, but not all of them provide variable importance plots (VIPs) * Enter...[`vip`](https://github.com/AFIT-R/vip) - Provides a consistent framework for extracting and plotting variable importance scores from many types of ML models - `vi()` always returns a [*tibble*](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) - `vip()` uses [`ggplot2`](http://ggplot2.org/) ] .pull-right[ ```r # Not yet on CRAN (experimental) devtools::install_github("AFIT-R/vip") ``` <img src="figures/vip-logo.svg" width="80%" style="display: block; margin: auto;" /> ] --- ```r # Construct variable importance plots (the old way) par(mfrow = c(1, 2)) # side-by-side plots *varImpPlot(boston_rf) # randomForest::varImpPlot() ``` <img src="figures/boston-rf-varImpPlot-1.svg" width="100%" style="display: block; margin: auto;" /> --- ```r # Load required packages library(vip) # for better (and consistent) variable importance plots # Construct variable importance plots p1 <- vip(boston_rf, type = 1) + ggtitle("randomForest") p2 <- vip(boston_rf, type = 2) + ggtitle("randomForest") p3 <- vip(boston_ranger) + ggtitle("ranger") *grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/boston-rf-vip-1.svg" width="100%" style="display: block;
margin: auto;" /> --- ## Partial dependence plots .scrollable[ * [Partial dependence plots (PDPs)](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) help visualize the relationship between a subset of the features (typically 1-3) and the response * Let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)` * The partial dependence of the response on `\(x_1\)` can be constructed as follows: 1. For `\(i \in \left\{1, 2, \dots, k\right\}\)`: a. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)` b. Compute the vector of predicted values from the modified copy of the training data c. Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)` 2. Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)` ] --- ## Partial dependence plots .pull-left[ * Not all RF implementations provide support for constructing partial dependence plots (PDPs) * Enter...[`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) - Provides a consistent way of constructing PDPs (and more) from many types of ML models (not just RFs) - Allows for multivariate displays (i.e., interactions) and so much more!! - Includes options for **parallel processing** and **progress bars** 😎 ] .pull-right[ ```r # Install from CRAN install.packages("pdp") # Install from GitHub devtools::install_github("bgreenwell/pdp") ``` <img src="figures/pdp-logo.png" width="65%" style="display: block; margin: auto;" /> ] --- ## Partial dependence plots ```r # Load required packages library(pdp) # PDPs for the top two predictors p1 <- partial(boston_ranger, pred.var = "lstat", plot = TRUE) p2 <- partial(boston_ranger, pred.var = "rm", plot = TRUE) *p3 <- partial(boston_ranger, pred.var = c("lstat", "rm"), * chull = TRUE, plot = TRUE) grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/boston-rf-pdps-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Partial dependence plots ```r # 3-D plots *pd <- attr(p3, "partial.data") # no need to recalculate p1 <- plotPartial(pd, levelplot = FALSE, drape = TRUE, colorkey = FALSE, screen = list(z = -20, x = -60) ) ``` ```r # Using ggplot2 library(ggplot2) p2 <- autoplot(pd) ``` ```r # ICE and c-ICE curves p3 <- boston_ranger %>% # %>% is automatically imported! partial(pred.var = "rm", ice = TRUE, center = TRUE) %>% autoplot(alpha = 0.1) ``` --- ## Partial dependence plots ```r # Display all three plots side-by-side grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/unnamed-chunk-12-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Proximity matrix * In random forests (and bagging), **trees are intentionally grown deep**! --- <img src="figures/mushroom-tree-diagram-3-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## Proximity matrix * In random forests (and bagging), **trees are intentionally grown deep**! 1. Initialize the proximity between cases `\(i\)` and `\(j\)` to zero: `\(prox_{ij} = 0\)` 2. After a tree is grown, put all of the data, both training and OOB, down the tree 3. If cases `\(i\)` and `\(j\)` land in the same terminal node, increase `\(prox_{ij}\)` by one 4. 
At the end, normalize `\(prox_{ij}\)` by dividing by the number of trees in the forest * The pairwise proximities can be used to construct a useful distance matrix * **Requires the storage of an `\(N \times N\)` matrix!!** 😱 * Only available in [`randomForest`](https://cran.r-project.org/package=randomForest), [`party`](https://cran.r-project.org/web/packages/party/index.html), and [`bigrf`](https://github.com/aloysius-lim/bigrf) --- ## Swiss banknote data ```r # Load the data data(banknote, package = "alr3") # Fit a random forest set.seed(1701) # for reproducibility banknote_rf <- randomForest( as.factor(Y) ~ ., data = banknote, * proximity = TRUE ) # Print the OOB confusion matrix banknote_rf$confusion ``` ``` ## 0 1 class.error ## 0 99 1 0.01 ## 1 0 100 0.00 ``` --- ## Proximity matrix ```r # Heatmap of proximity-based distance matrix heatmap(1 - banknote_rf$proximity, col = viridis::plasma(256)) ``` <img src="figures/heatmap-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Outlier scores Outlyingness of case `\(n\)` in class `\(j\)` relative to all other cases in class `\(j\)`: `$$out\left(n\right) = N / \sum_{cl\left(k\right) = j} prox_{nk} ^ 2$$` .pull-left[ ```r # Dot chart of proximity-based outlier scores outlyingness <- tibble::tibble( * "out" = outlier(banknote_rf), "obs" = seq_along(out), "class" = as.factor(banknote$Y) ) ggplot(outlyingness, aes(x = obs, y = out)) + geom_point(aes(color = class, size = out), alpha = 0.5) + * geom_hline(yintercept = 10, linetype = 2) + labs(x = "Observation", y = "Outlyingness") + theme_light() + theme(legend.position = "none") ``` ] .pull-right[ <img src="figures/outlier-plot-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- ## Multi-dimensional scaling ```r # Multi-dimensional scaling plot of proximity matrix MDSplot(banknote_rf, fac = as.factor(banknote$Y), k = 2, cex = 1.5) ``` <img src="figures/MDS-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Missing value imputation * Random forest offers a novel approach to imputing missing values (i.e., `NA`s) 1. Start with cheap imputation (e.g., median for continuous features) 2. Run a random forest and obtain the proximity matrix 3. Imputed values are updated using: - The weighted average of the non-missing observations, where the weights are the proximities (**continuous**) - The category with the largest average proximity (**categorical**) 4. Iterate this procedure until convergence (~ 4-6 times seems sufficient) * Computationally intensive (requires many random forest runs!!) * Resulting **OOB error estimate tends to be overly optimistic** * Only available in [`randomForest`](https://cran.r-project.org/package=randomForest) --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Boosting ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background .scrollable[ * Boosting is a general technique for creating an ensemble of models (*meta-algorithm*) * In contrast to bagging, boosting: - Uses a *weak learner* for the base learners - Builds the models sequentially (fix past mistakes) * Most commonly used with **shallow decision trees** (e.g., decision stumps), but other types of models can be boosted * **Boosting requires more tuning**, but is typically faster and more accurate than RFs!
* **Many flavors of boosting:** AdaBoost, Real AdaBoost, gradient boosting, and stochastic gradient boosting (GBM), for example * GBMs are the most **general** and widely used flavor of boosting - Introduces subsampling into the rows and columns (**stochastic**) - Loss function is optimized using *gradient descent* (**gradient**) - With the right combination of parameters, GBMs can emulate RFs! ] --- ## GBMs with squared-error loss 1. Given a training sample `\(\boldsymbol{d} = \left(\boldsymbol{X}, \boldsymbol{y}\right)\)`. Fix the number of steps `\(B\)`, the shrinkage factor `\(\epsilon\)`, and the tree depth `\(d\)`. Set the initial fit `\(\widehat{G}_0 \equiv 0\)`, and the residual vector `\(\boldsymbol{r} = \boldsymbol{y}\)`. 2. For `\(b = 1, 2, \dots, B\)` repeat: a) Fit a regression tree `\(\tilde{g}_b\)` to the data `\(\left(\boldsymbol{X}, \boldsymbol{r}\right)\)`, grown best-first to depth `\(d\)` b) Update the fitted model with a shrunken version of `\(\tilde{g}_b\)`: `\(\widehat{G}_b = \widehat{G}_{b - 1} + \epsilon \tilde{g}_b\)` c) Update the residuals accordingly: `\(r_i = r_i - \epsilon \tilde{g}_b\left(x_i\right)\)`, `\(i = 1, 2, \dots, n\)` 3. Return the sequence of fitted functions `\(\widehat{G}_b\)`, `\(b = 1, 2, \dots, B\)`. * Here, `\(B\)`, `\(\epsilon\)`, and `\(d\)` are tuning parameters * [R code](https://github.com/bgreenwell/MLDay18/blob/master/code/rpartBoost.R) ??? Source: [Computer Age Statistical Inference, pg. 333](https://web.stanford.edu/~hastie/CASI/) --- class: center, middle ## [Boosting regression stumps](https://github.com/bgreenwell/MLDay18/blob/master/code/rpartBoost.R) ![](gifs/boosted_stumps.gif) --- ## GBM packages in R .scrollable[ * [`gbm`](https://github.com/gbm-developers/gbm): **g**eneralized **b**oosted **m**odels - The original R implementation of GBMs (by Greg Ridgeway) - Slower than modern implementations (but still pretty fast) - Provides OOB error estimate - Supports the *weighted tree traversal* method for fast construction of PDPs - Currently orphaned on CRAN (for now) * [`gbm3`](https://github.com/gbm-developers/gbm3): **g**eneralized **b**oosted **m**odels - Shiny new version of `gbm` that is not backwards compatible - Faster and supports parallel tree building - Not currently listed on CRAN * [`xgboost`](https://github.com/dmlc/xgboost): <span style="color:DeepSkyBlue">eXtreme Gradient Boosting</span> - Widely used by data scientists to achieve **state-of-the-art** results across many machine learning challenges (e.g., Kaggle) - Designed to be highly efficient, flexible, and portable - Supports user-defined loss functions - Supports early stopping - Includes fast histogram method for numeric features - Parallel tree boosting and GPU support!! <img src="figures/gpu.jpg" width="50%" style="display: block; margin: auto;" /> * [`lightgbm`](https://github.com/Microsoft/LightGBM): - Microsoft supported, and somewhat [faster](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment), alternative to `xgboost` - Uses *leaf-wise* (i.e., best-first) tree growth - Not currently listed on CRAN * [`mboost`](https://github.com/boost-R): **m**odel-based **boost**ing - Implements tools for fitting and evaluating a variety of regression and classification models (somewhere between GLMs/GAMs and GBMs) ] --- ## Missing values * Unlike RFs, GBMs typically support missing values!
* In most GBM implementations, `NA`s are interpreted as containing information (i.e., missing for a reason), rather than missing at random - In [`gbm`](https://github.com/gbm-developers/gbm) and [`gbm3`](https://github.com/gbm-developers/gbm3), a separate branch is created at each split for `NA`s (i.e., each decision node has three branches: left, right, and missing) - In [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) and [`xgboost`](https://github.com/dmlc/xgboost) each split has a **default direction** that is learned during training (`NA`s take the default route!) --- ## Boston housing example ```r # Load required packages library(gbm) # Fit a GBM to the Boston housing data set.seed(1053) # for reproducibility boston_gbm <- gbm( cmedv ~ ., data = boston, * var.monotone = NULL, * distribution = "gaussian", # "bernoulli", "coxph", etc. * n.trees = 10000, * interaction.depth = 5, * n.minobsinnode = 10, * shrinkage = 0.005, * bag.fraction = 1, * train.fraction = 1, cv.folds = 10 # k-fold CV often gives the best results ) ``` --- ## Boston housing example ```r best_iter <- gbm.perf( boston_gbm, * method = "cv" # or "OOB" or "test" ) ``` <img src="figures/boston_gbm_best-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Partial dependence plots .pull-left[ ### Brute force method: ```r system.time( pd1 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = FALSE, chull = TRUE, * n.trees = best_iter ) ) ``` ``` ## user system elapsed ## 106.060 0.032 106.084 ``` ] .pull-right[ ### Recursive method: ```r system.time( pd2 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = TRUE, chull = TRUE, * n.trees = best_iter ) ) ``` ``` ## user system elapsed ## 0.424 0.028 0.452 ``` ] --- ## Partial dependence plots ```r # Display plots side-by-side grid.arrange(autoplot(pd1), autoplot(pd2), ncol = 2) ``` <img src="figures/boston-pdps-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data .pull-left[ * A contemporary alternative to the often cited Boston housing data * Describes the sale of individual residential property in Ames, Iowa from 2006 to 2010 * Contains `\(N = 2930\)` observations and `\(p = 80\)` features involved in assessing home values * Lots of potential for *feature engineering*!!
] .pull-right[ ```r ames <- AmesHousing::make_ames() ggplot(ames, aes(x = Sale_Price, y = Overall_Qual)) + * ggridges::geom_density_ridges(aes(fill = Overall_Qual)) + scale_x_continuous(labels = scales::dollar) + labs(x = "Sale price", y = "Overall quality") + theme_light() + theme(legend.position = "none") ``` <img src="figures/ames-densities-1.svg" width="70%" style="display: block; margin: auto;" /> **Source:** [Bradley Boehmke](https://github.com/bradleyboehmke) ] --- ## Ames housing data .scrollable[ .pull-left[ ```r # Load required packages library(xgboost) # Construct data set ames <- AmesHousing::make_ames() *# Feature matrix # or xgb.DMatrix or sparse matrix X <- data.matrix(subset(ames, select = -Sale_Price)) # Fit an XGBoost model set.seed(203) # for reproducibility ames_xgb <- xgboost( # tune using `xgb.cv()` data = X, label = ames$Sale_Price, * objective = "reg:linear", # loss function * nrounds = 2771, # number of trees * max_depth = 5, # interaction depth * eta = 0.01, # learning rate * subsample = 1, * colsample_bytree = 1, # column subsampling fraction * num_parallel_tree = 1, * eval_metric = "rmse", verbose = 0, * save_period = NULL ) ``` ] .pull-right[ <img src="figures/ames-xgboost-cv-1.svg" width="100%" style="display: block; margin: auto;" /> ] ] --- ## Ames housing data ```r # Variable importance plots p1 <- vip(ames_xgb, feature_names = colnames(X), type = "Gain") p2 <- vip(ames_xgb, feature_names = colnames(X), type = "Cover") p3 <- vip(ames_xgb, feature_names = colnames(X), type = "Frequency") grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/xgboost-vip-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data ```r # By default, `vip()` plots the top 10 features (here, all features are shown) vip(ames_xgb, feature_names = colnames(X), type = "Gain", num_features = ncol(X), bar = FALSE) ``` <img src="figures/xgboost-vip-2-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Ames housing data ```r # Partial dependence plots oq_ice <- partial(ames_xgb, pred.var = "Overall_Qual", ice = TRUE, center = TRUE, train = X) p4 <- autoplot(partial(ames_xgb, pred.var = "Gr_Liv_Area", train = X)) p5 <- autoplot(partial(ames_xgb, pred.var = "Garage_Cars", train = X)) p6 <- autoplot(oq_ice, alpha = 0.1) grid.arrange(p4, p5, p6, ncol = 3) ``` <img src="figures/xgboost-pdp-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data <img src="figures/xgboost-pdp-vi-1.svg" style="display: block; margin: auto;" /> --- ## Summary .scrollable[ * Check out the [machine learning task view](https://cran.r-project.org/web/views/MachineLearning.html)! .pull-left[ #### <span style="color:MediumSeaGreen">Random forests:</span> * Builds an ensemble of fully grown decision trees (**low bias, high variance**) - Correlation between trees is reduced through subsampling the columns - Variance is reduced through averaging * Nearly automatic * Lots of cool bells and whistles * Trees are independently grown (embarrassingly parallel) ] .pull-right[ #### <span style="color:MediumSeaGreen">Gradient boosting machines:</span> * Builds an ensemble of small decision trees (**high bias, low variance**) - Bias is reduced through sequential learning and fixing past mistakes * Requires more tuning, but can be more accurate * Flexible loss functions * Trees are **NOT** grown independently, but training times are usually pretty fast since trees are not grown too deep ] ] --- ## Questions? <img src="figures/questions.png" width="100%" style="display: block; margin: auto;" />