class: clear .font300[ Introduction to Machine Learning in
<img src="https://emojis.slackmojis.com/emojis/images/1450319453/122/carlton.gif?1450319453" style="height:1em; width:auto; "/>
] .font250.right[ Analytics Connect '18 ] .pull-left.font120[ Brandon M. Greenwell<br>Enable the Science/Data Scientist<br>[
<i class="fab fa-github faa-vertical animated "></i> Fork me on GitHub
](https://github.com/bgreenwell/intro-ml-r)<br>[
<i class="fab fa-twitter faa-float animated "></i> @bgreenwell8
](https://twitter.com/bgreenwell8) ] .pull-right[ <img src="figures/effo-logo.png" width="100%" style="display: block; margin: auto;" /> ] --- # Who am I? .pull-left[ <img src="figures/pizza.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Started at
<img src="figures/effo-logo.png" style="height:1em; width:auto; "/>
in April of 2018 * BS/MS/PhD in Applied Statistics *
<i class="fab fa-r-project faa-pulse animated "></i>
enthusiast
<img src="https://emojis.slackmojis.com/emojis/images/1450694616/220/bananadance.gif?1450694616" style="height:1em; width:auto; "/>
* Matlab dethusiast
<img src="https://emojis.slackmojis.com/emojis/images/1483822352/1572/gross.gif?1483822352" style="height:1em; width:auto; "/>
* Maker of
📦s * [UC Analytics Summit](https://github.com/koalaverse/AnalyticsSummit18) ] --- class: clear, split-two # Overview .column.bg-main1[.content.vmiddle.center[ <center> <img src="figures/isl.jpg" width="70%"> .font150[[Book website](http://www-bcf.usc.edu/~gareth/ISL/)] ]] .column.bg-main2[.content.vmiddle.center[ <center> <img src="figures/esl.jpg" width="70%"> .font150[[Book website](https://web.stanford.edu/~hastie/ElemStatLearn/)] ]] --- class: clear, center, middle .pull-left[ .font300.bold[What is machine learning?] ] .pull-right[ ![]() ] --- # Popular terms .font150[ * .purple[Machine learning]: constructing algorithms that learn from data * .purple[Statistical learning]: emphasizes statistical models and the assessment of uncertainty * .purple[Data science]: applying mathematics, .bold[statistics], machine learning, engineering, etc. to extract knowledge from data - "Data Science is statistics on a Mac"
] .full-width.content-box-red.font150.center[Similar terms with slightly different emphases!] --- # Some take home messages .full-width.content-box-purple.font150.center[This talk is about *supervised learning*: building models from data that predict an **outcome** using a collection of **features**] .font175[ * Many flexible 🛠 for making predictions from data * They are not magic! They require
**good data**
and proper validation! * No one method is uniformly best! ] --- class: clear, middle, center .font300[ Follow along: [
code](https://github.com/bgreenwell/intro-ml-r/blob/master/code/intro-ml-r.R) ] --- class: clear, inverse, center, middle <img src="figures/fundamentals.jpg" width="100%" style="display: block; margin: auto;" /> --- # Classification vs. regression <img src="intro-ml-r_files/figure-html/classification-vs-regression-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Example data sets <!-- .font250.center[[Data set information](https://bgreenwell.github.io/intro-ml-r/datasets)] --> .font175[ .pull-left[ ## Classification: * Mushroom edibility 🍄 * E-mail spam ✉️ * Swiss banknote 💷 ] .pull-right[ ## Regression: * Boston housing data 🏠 * Ames housing data 🏠 ] ] --- # Nature as a black box .font200[ Nature functions to associate `\(\boldsymbol{X}\)` with `\(\boldsymbol{Y}\)`: ] <img src="figures/black-box-nature.svg" width="90%" style="display: block; margin: auto;" /> -- .font175[ * Can we predict what `\(\boldsymbol{Y}\)` will be for future `\(\boldsymbol{X}\)`? * How does nature associate `\(\boldsymbol{Y}\)` with `\(\boldsymbol{X}\)`? ] ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 --- # Data modeling culture .font200[ Assume a stochastic data model: ] <img src="figures/data-modeling-culture.svg" width="90%" style="display: block; margin: auto;" /> ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 Domain experts work for years to learn good features; they bring the statistician a "small" data set --- # Algorithmic modeling culture .font200[ Find a function `\(f\left(\boldsymbol{X}\right)\)` to predict `\(\boldsymbol{Y}\)`: ] <img src="figures/algorithmic-modeling-culture.svg" width="90%" style="display: block; margin: auto;" /> ??? Source: Statistical Modeling: The Two Cultures Link: https://projecteuclid.org/euclid.ss/1009213726 We start with a "large" data set with many features and use machine learning to find the "good" ones --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
.font150[ Aside from predictions, we are often interested in understanding the way that `\(Y\)` is affected as `\(X_1, \dotsc, X_p\)` change (i.e., we cannot treat `\(\widehat{f}\)` as a "black 📦"). ] .font125[ .pull-left[ <!-- In this setting, one may be interested in answering the following questions: --> * Which predictors are .bold.blue[associated] with the response? * What is the nature of the .bold.blue[relationship] between the response and each predictor (or subset thereof)? ] .pull-right[ <img src="figures/vip-pdp-horizontal.png" width="100%" style="display: block; margin: auto;" /> ] ] --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
<br> <img src="figures/black-box-02.png" width="100%" style="display: block; margin: auto;" /> --- # Algorithmic interpretation
<img src="figures/vip-pdp-horizontal.png" style="height:1em; width:auto; "/>
<br> <img src="figures/black-box-03.png" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle, center <img src="intro-ml-r_files/figure-html/chuck-norris-1.gif" style="display: block; margin: auto;" /> [Source](https://bradleyboehmke.github.io/CinDay-RUG-IML-2018/slides-source.html#1) --- # Over fitting .font150[ It is possible to "over tune" your models to maximize performance on the training data. Such models will not generalize well to new data. We call this process [*over fitting*](https://en.wikipedia.org/wiki/Overfitting): ] <img src="intro-ml-r_files/figure-html/overfitting-linear-regression-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Model development .font150.pull-left[ ## Model selection Estimating the *performance* of different models in order to choose the .bold["best"] one * Choosing *hyperparmaters* ] .font150.pull-right[ ## Model assessment Having chosen a final model, estimating its prediction error (i.e., generalization error) on .bold[new data] ] .full-width.content-box-yellow.font175.center[If we have "enough" data, divide it up!] --- # Data splitting .full-width.content-box-yellow.font150.center[How do we divide up the data for each task?] .font135[ * The .bold[training set] is used to fit the models * The .bold[validation set] is used to estimate prediction error for model selection * The .bold[test set] is used for assessment of the generalization error of the final chosen model - Ideally, the test set should be kept under 🔒 and 🔑 until the end of the data analysis (**Why?** 🤔) ] --- # Data splitting .font150[ It is difficult to give a general rule of 👍 on how to choose the number of observations in each of the three parts, as this depends on the *signal-to noise ratio* in the data and the training sample size ] .full-width.content-box-yellow.font150.center[A typical split might be 50% for training, and 25% each for validation and testing] --- # Data splitting strategies <br> <img src="figures/data-splitting.png" width="100%" style="display: block; margin: auto;" /> --- # 5-fold cross-validation <br><br> <img src="figures/cross-validation.png" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle, center .font200[ Generalization error `\(\approx\)` `\(\sigma^2\)` `\(+\)` .red.vold[Bias] `\(+\)` .red.vold[Variance] ] <img src="intro-ml-r_files/figure-html/bias-variance-tradeoff-1.svg" width="80%" style="display: block; margin: auto;" /> --- # Data shape <img src="figures/wide-data-tall-data.png" width="40%" style="display: block; margin: auto;" /> --- # Wide data (lots of features) .pull-left.center[ <img src="figures/wide-data.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Thousands / millions of features * Hundreds of records (i.e., rows) * Prone to over fitting - Need to regularize or otherwise reduce the number features ] .full-width.content-box-yellow.font150.center[Screening with FDR, Elastic net, SVMs, and stepwise procedures] --- # Tall data (lots of observations) .pull-left.center[ <img src="figures/tall-data.png" width="30%" style="display: block; margin: auto;" /> ] .pull-right.font125[ * Tens / hundreds of features * Thousands / millions of records (i.e., rows) * Plenty of samples to fit richer models with interactions, etc.! 
] .full-width.content-box-yellow.font150.center[GzLMs, .bold.red[tree-based methods] (like GBMs), and network models (like DNNs)] --- # Data preprocessing and feature engineering .pull-left[ .font100[ Once data are gathered, they must be cleaned and prepped for analysis. Such tasks might include, but are not limited to: * Imputing missing values * Transforming columns * Standardizing/normalizing columns * Creating new features * Formulating interaction effects * Numerically encoding categorical features (e.g., *one-hot encoding* or *dummy encoding*; a minimal sketch appears below) ] ] .pull-right[ <img src="intro-ml-r_files/figure-html/ames-sale-price-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear, middle, center <img src="figures/flow-chart.png" width="100%" style="display: block; margin: auto;" /> .font150[ **Source:** [Scikit-learn](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) ] --- class: clear, middle, center <img src="figures/apm-summary.png" width="100%" style="display: block; margin: auto;" /> .font150[ **Source:** [Applied Predictive Modeling](http://appliedpredictivemodeling.com/) ] --- # A good modeling tool .pull-left.font110[ ## <span style="color:MediumSeaGreen">At a minimum:</span> * Applicable to classification and regression * Competitive accuracy * Capable of handling large data sets * Handles missing values effectively ] -- .pull-right.font110[ ## <span style="color:MediumSeaGreen">Bonus features:</span> * Which predictors are important? ([vip](https://github.com/AFIT-R/vip)) * How do the predictors functionally relate to the response? ([pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html)) * How do the predictors interact? ([pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html), [vip](https://github.com/AFIT-R/vip)) * What is the shape of the data (i.e., how does it cluster?) * Are there novel cases and outliers? ] --- class: clear, middle, center background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Decision trees] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Mushroom classification There is no simple rule for determining the edibility of a [mushroom](https://raw.githubusercontent.com/bgreenwell/MLDay18/master/docs/data/mushroom.csv); no rules like **"leaflets three, let it be"**, **"hairy vine, no friend of mine"**, and **"berries white, run in fright"** for poison ivy.
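As a quick aside before the trees, the *dummy encoding* mentioned on the preprocessing slide can be sketched in a few lines of base R (the tiny `shrooms` data frame below is made up purely for illustration); `caret::dummyVars()` offers a more flexible, formula-based route:

```r
# Minimal dummy-encoding sketch; `shrooms` is a made-up toy data frame,
# not the mushroom data used below
shrooms <- data.frame(
  odor = factor(c("almond", "foul", "none")),  # categorical feature
  cap.diameter = c(5.2, 8.1, 3.3)              # numeric feature
)

# Base R: model.matrix() expands factors into 0/1 indicator columns
model.matrix(~ . - 1, data = shrooms)  # -1 drops the intercept column

# Same idea via caret (one indicator column per factor level)
library(caret)
dv <- dummyVars(~ ., data = shrooms)
predict(dv, newdata = shrooms)
```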
<img src="figures/edible.jpg" width="50%" style="display: block; margin: auto;" /> -- ```r url <- "https://bgreenwell.github.io/MLDay18/data/mushroom.csv" mushroom <- read.csv(url) # load the data from GitHub *mushroom$veil.type <- NULL # only takes on a single value ``` --- class: clear, middle # Mushroom classification .code125[ ```r # Load required packages library(caret) # for data splitting function library(rpart) # for binary recursive partitioning *# Partition the data into train/test sets set.seed(101) trn_id <- createDataPartition( y = mushroom$Edibility, p = 0.5, list = FALSE ) *trn <- mushroom[trn_id, ] # training data *tst <- mushroom[-trn_id, ] # test data # Function to calculate accuracy accuracy <- function(pred, obs) { sum(diag(table(pred, obs))) / length(obs) } ``` ] --- class: clear, middle # Mushroom classification .pull-left[ ```r # Decision stump (test error = 1.53%): cart1 <- rpart( Edibility ~ ., data = trn, * control = rpart.control(maxdepth = 1) ) # Get test set predictions pred1 <- predict( cart1, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred1, obs = tst$Edibility ) ## [1] 0.9847366 ``` ] .pull-right[ ```r # Complex tree (test error = 0%): cart2 <- rpart( Edibility ~ ., data = trn, * control = list(cp = 0, minbucket = 1, minsplit = 1) ) # Get test set predictions pred2 <- predict( cart2, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred2, obs = tst$Edibility ) ## [1] 1 ``` ] --- # Mushroom classification .pull-left[ .font125[Decision stump (test error = 1.53%):] <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-1-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ .font125[Complex tree (test error = 0%):] <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-2-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # A handy rule of 👍 for mushrooms <br><br><br><br> .right.font300.tomato["If it smells bad, don't eat it!"] <br> .right.font200[--- Decision Stump] --- # Predicting email spam .pull-left[ - Data from 4601 email messages collected at Hewlett-Packard Labs - **Goal:** predict whether an email message is <span style="color:Tomato">spam</span> (junk email) or <span style="color:MediumSeaGreen ">ham</span> (good email) - **Features:** relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training the email messages - For this problem, not all errors are equal; misclassifying <span style="color:Tomato">spam</span> is not as bad as misclassifying <span style="color:MediumSeaGreen">ham</span>! 
] .pull-right[ <img src="figures/spam.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # Predicting email spam ```r # Load the data *data(spam, package = "kernlab") # Partition the data into train/test sets set.seed(101) # for reproducibility trn_id <- createDataPartition(spam$type, p = 0.7, list = FALSE) trn <- spam[trn_id, ] # training data tst <- spam[-trn_id, ] # test data xtrn <- subset(trn, select = -type) # training data features xtst <- subset(tst, select = -type) # test data features ytrn <- trn$type # training data response # Fit a classification tree (cp found using k-fold CV) *spam_tree <- rpart(type ~ ., data = trn, cp = 0.001) pred <- predict(spam_tree, newdata = xtst, type = "class") # Compute test set accuracy (spam_tree_acc <- accuracy(pred = pred, obs = tst$type)) ## [1] 0.9260334 ``` --- # Predicting email spam .scrollable[ <img src="intro-ml-r_files/figure-html/spam-tree-diagram-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- # Predicting email spam .pull-left[ ```r # Variable importance scores *vip::vi(spam_tree) %>% as.data.frame() %>% head(10) ## Variable Importance ## 1 charExclamation 507.78427 ## 2 capitalLong 259.80387 ## 3 capitalAve 198.78828 ## 4 free 198.33076 ## 5 charDollar 194.74814 ## 6 your 182.63595 ## 7 remove 176.82921 ## 8 all 126.72083 ## 9 capitalTotal 107.79681 ## 10 hp 74.31123 ``` ] .pull-right[ ```r # Variable importance plot *vip(spam_tree, num_features = 10) ``` <img src="intro-ml-r_files/figure-html/spam-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # How do trees measure up? .pull-left[ .font150.bold[Advantages:] * .MediumSeaGreen[Small trees are easy to interpret] * .MediumSeaGreen[Trees scale well to large *N*] (fast!!) * .MediumSeaGreen[Can handle data of all types] (i.e., requires little, if any, coding) * .MediumSeaGreen[Automatic variable selection] * .MediumSeaGreen[Can handle missing data] (through *surrogate splits*) * .MediumSeaGreen[Completely nonparametric] (great for DM and EDA tasks!) ] -- .pull-right[ .font150.bold[Disadvantages:] * .tomato[Large trees can be difficult to interpret] * .tomato[Trees are step functions] (i.e., binary splits) * .tomato[Greedy splitting algorithms] (i.e., trees are noisy) * .tomato[All splits depend on previous splits] (i.e., high order interactions) * .tomato[Data is essentially taken away after each split] ] --- class: clear, middle, center # How do trees measure up? <img src="figures/esl-tbl.png" width="70%" style="display: block; margin: auto;" /> <br> .center[**Source:** The Elements of Statistical Learning, Second Edition] --- # Creating ensembles of trees * The key to accuracy is **low bias** and **low variance** * **Main idea:** breaking the [*bias-variance trade-off*](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) - `\(Error \approx Bias + Variance\)` * Two popular algorithms (can be applied to any model, **not just trees!**): - Bagging (<span style="color:Magenta">reduce variance by averaging</span>) - Boosting (<span style="color:Magenta">reduce bias by building models sequentially</span>) * Random forest is just a slight modification over bagging applied to decision trees! 
<br> <center> <font size="6"><span style="color:DarkBlue">Boosting</span> >= <span style="color:MediumSeaGreen">Random forest</span> > <span style="color:MediumOrchid ">Bagging</span> > <span style="color:DarkOrange">Single tree</span></font> </center> --- class: clear, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Bagging] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # The bootstrap <img src="figures/bootstrap-sample.png" width="100%" style="display: block; margin: auto;" /> --- # **B**ootstrap **AGG**regat**ING** .pull-left[ * Bagging trees: 1. Sample records **with replacement** (i.e., bootstrap the training data) 2. Fit an **overgrown tree** to the resampled data set 3. Repeat a large number of times ] .pull-right[ <img src="figures/millionaire.png" width="70%" style="display: block; margin: auto;" /> ] * Predictions are combined by popular vote (**classification**) or averaging (**regression**) * Same idea as ["wisdom of the crowd"](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) (<span style="color:purple">Francis Galton's Ox weight survey</span>) * Improves the stability and accuracy of noisy models (e.g., individual trees) --- # Bagging trees <img src="intro-ml-r_files/figure-html/bagging-1-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Trees are step functions <img src="intro-ml-r_files/figure-html/bagging-2-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Trees are noisy <img src="intro-ml-r_files/figure-html/bagging-3-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Bagging reduces variance <img src="intro-ml-r_files/figure-html/bagging-4-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Predicting email spam .pull-left[ ```r # Load required packages *library(randomForest) # Fit a bagger model *xtst <- subset(tst, select = -type) *set.seed(1633) # reproducibility spam_bag <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = ncol(xtrn), xtest = xtst, ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/unnamed-chunk-5-1.gif" width="100%" style="display: block; margin: auto;" /> ] --- class: clear, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Random forest] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Background * **Bagging on steroids!!** * Correlation limits the effect of bagging! * Grow trees just as in bagging, but with a small twist - Before each split in each tree in the forest, select a subset of the predictors **at random** as candidate splitters - This essentially **"decorrelates"** the trees in the ensemble! - The number of randomly selected variables, denoted `\(m_{try}\)`, is a **tuning parameter** (arguably the only tuning parameter in a random forest) * Bagging introduces <span style="color:CornFlowerBlue">randomness into the rows</span> of the data * Random forest introduces <span style="color:CornFlowerBlue">randomness into the rows and columns</span> of the data * [*Extremely randomized trees*](https://link.springer.com/article/10.1007/s10994-006-6226-1) take this randomness one step further to reduce variance! 
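To make the `\(m_{try}\)` twist concrete, a rough sketch (assuming the spam training data `trn` created earlier): with `mtry` set to the number of features, `randomForest()` reduces to bagging, while a small `mtry` decorrelates the trees:

```r
# Bagging vs. random forest differ only in mtry; assumes the spam
# training data `trn` created earlier in this talk
library(randomForest)

p <- ncol(trn) - 1  # number of features (excluding the response)

set.seed(1633)  # for reproducibility
spam_bag2 <- randomForest(type ~ ., data = trn, ntree = 250,
                          mtry = p)              # all columns: bagging

set.seed(1633)
spam_rf2 <- randomForest(type ~ ., data = trn, ntree = 250,
                         mtry = floor(sqrt(p)))  # random subset: RF

# Compare out-of-bag (OOB) error rates after 250 trees
c(bagging = tail(spam_bag2$err.rate[, "OOB"], n = 1),
  rf = tail(spam_rf2$err.rate[, "OOB"], n = 1))
```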
--- # Random forest packages in R .scrollable[ .pull-left[ * [randomForest](https://cran.r-project.org/package=randomForest) - The standard (implements most of the **bells and whistles**; great for EDA!!) - Classification and regression - Does not scale well, but can be parallelized using `foreach`! * [randomForestSRC](https://kogalur.github.io/randomForestSRC/index.html) - Classification, regression, and survival analysis (including ***competing risks***) - MPI/OpenMP support * [ranger](https://github.com/imbs-hl/ranger): **ran**dom forest **ge**ne**r**ator - Classification, regression, and survival analysis - [Fast!](https://www.jstatsoft.org/article/view/v077i01) - Accepts sparse numeric matrices (i.e., `dgCMatrix` objects from the [Matrix](https://cran.r-project.org/package=Matrix) package) - Estimated time to completion!! ] .pull-right[ * [party/partykit](https://cran.r-project.org/web/packages/party/index.html) - Uses *conditional inference trees* as the base learners - Unbiased variable selection - Can be slow on large data sets * [bigrf](https://github.com/aloysius-lim/bigrf) - Classification only - Currently **ORPHANED** on CRAN - Disk caching using [`bigmemory`](https://cran.r-project.org/web/packages/bigmemory/) * [Rborist](https://github.com/suiji/Arborist) - Offers parallel, distributed tree construction * [h2o](https://github.com/h2oai/h2o-3/tree/master/h2o-r) - Distributed random forests via cloud computing - Can be stacked with other [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) models, like GBMs and DNNs!! ] ] --- # Predicting email spam .pull-left[ ```r # Load required packages library(randomForest) # Fit a random forest xtst <- subset(tst, select = -type) set.seed(1633) # reproducibility spam_rf <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = 7, # floor(sqrt(p)) xtest = xtst, ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/unnamed-chunk-7-1.gif" width="100%" style="display: block; margin: auto;" /> ] --- class: clear # Random forest .pull-left[ ## <span style="color:MediumSeaGreen">Advantages:</span> * Competitive accuracy * Supervised **AND** unsupervised learning * *Out-of-bag* data 😎 - Free cross-validation - Novel variable importance * Novel <span style="color:Purple">proximity matrix</span> 😎 - Outlier/novelty detection - Multi-dimensional scaling - Missing value imputation ] .pull-right[ ## <span style="color:Tomato">Disadvantages:</span> * Missing values (**why?**) * Can be slow on large data sets since trees are intentionally overgrown (i.e., deep) - `\(m_{try}\)` mitigates this issue to some extent ] --- # Out-of-bag data For large enough `\(N\)`, on average, `\(1 - e^{-1} \approx 63.21\)`% of the original records end up in any bootstrap sample .code75[ ```r N <- 100000 set.seed(1537) # for reproducibility x <- rnorm(N) mean(x %in% sample(x, replace = TRUE)) # non-OOB proportion ## [1] 0.63232 ``` ] * Roughly `\(e^{-1} \approx 36.79\)`% of the observations are not used in the construction of a particular tree * These observations are considered *out-of-bag* (OOB) and can be used to - Provide an unbiased assessment of model performance (**sort of an unstructured, but free, cross-validation**) - Construct novel variable importance measures based on the **predictive strength** of each feature!! --- class: clear, middle, center # Predicting email spam .font150.bold.left[OOB vs.
test error: e-mail spam example] <img src="intro-ml-r_files/figure-html/unnamed-chunk-8-1.svg" width="100%" style="display: block; margin: auto;" /> --- # OOB-based variable importance * **Traditional approach:** Average variable importance scores (i.e., total goodness of split) over all the trees in the ensemble (this is what is done for GBMs) * **Novel approach:** To estimate the importance of the `\(k\)`-th variable, `\(x_k\)`: 1. Record the OOB performance of the model 2. Randomly permute all the values of `\(x_k\)` in the OOB data 3. Recompute the OOB performance of the model 4. The difference between the two OOB performance measures the strength of the structural importance of `\(x_k\)` * Fundamentally different as these importance scores are **NOT** based on data seen by the individual trees! --- # Boston housing example * Housing data from `\(N = 506\)` census tracts in the city of Boston for the year 1970 * The data violate many classical assumptions like linearity, normality, and constant variance -- * Harrison and Rubinfeld's housing value equation ( `\(R^2 = 0.81\)` ): <img src="figures/boston-eqn.png" width="80%" style="display: block; margin: auto;" /> -- * Nowadays, many supervised learning algorithms can fit the data automatically in seconds (**typically with higher accuracy!**) * The downfall, however, is some loss of interpretation since these algorithms typically do not produce simple prediction formulas (**but there's still hope!** 🙏) --- class: clear ```r # Load required packages *library(ranger) # a much faster implementation of random forest # Load the (corrected) Boston housing data data(boston, package = "pdp") ``` .pull-left[ ```r # Using the randomForest package set.seed(2007) # reproducibility system.time( boston_rf <- randomForest( cmedv ~ ., data = boston, * ntree = 5000, * mtry = 5, * importance = FALSE ) ) ## user system elapsed ## 8.748 0.181 8.910 ``` ``` ## [1] 0.893066 ``` ] .pull-right[ ```r # Using the ranger package set.seed(1652) # reproducibility system.time( boston_ranger <- ranger( cmedv ~ ., data = boston, * num.trees = 5000, * mtry = 5, # :/ * importance = "impurity" ) ) ## user system elapsed ## 6.515 0.304 1.180 ``` ``` ## [1] 0.8928631 ``` ] --- # Variable importance plots .pull-left[ * Most RF packages provide variable importance measures, but not all of them provide variable importance plots (VIPs) * Enter...[vip](https://koalaverse.github.io/vip/index.html) - Provides a consistent framework for extracting and plotting variable importance scores from many types of ML models - `vi()` always returns a [*tibble*](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) - `vip()` uses [ggplot2](http://ggplot2.org/) ] .pull-right[ ```r # Install from CRAN install.packages("vip") ``` <img src="figures/vip-logo.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear, middle ```r # Construct variable importance plots (the old way) par(mfrow = c(1, 2)) # side-by-side plots *varImpPlot(boston_rf, main = "") # randomForest::varImpPlot() ``` <img src="intro-ml-r_files/figure-html/boston-rf-varImpPlot-1.svg" width="80%" style="display: block; margin: auto;" /> --- class: clear ```r # Load required packages library(vip) # for better (and consistent) variable importance plots # Construct variable importance plots p1 <- vip(boston_rf, type = 1) + ggtitle("randomForest") p2 <- vip(boston_rf, type = 2) + ggtitle("randomForest") p3 <- vip(boston_ranger) + ggtitle("ranger") *grid.arrange(p1, p2, p3, ncol = 3) # side-by-side 
plots ``` <img src="intro-ml-r_files/figure-html/boston-rf-vip-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Partial dependence plots .scrollable[ * [Partial dependence plots (PDPs)](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) help visualize the relationship between a subset of the features (typically 1-3) and the response * Let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)` * The partial dependence of the response on `\(x_1\)` can be constructed as follows: 1. For `\(i \in \left\{1, 2, \dots, k\right\}\)`: a. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)` b. Compute the vector of predicted values from the modified copy of the training data c. Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)` 2. Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)` ] --- # Partial dependence plots .pull-left[ * Not all RF implementations provide support for constructing partial dependence plots (PDPs) * Enter...[pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) - A consistent way of constructing PDPs (and more) from many types of ML models (not just RFs) - Allows for multivariate displays (i.e., interactions) and so much more!! - Includes options for **parallel processing** and **progress bars** 😎 ] .pull-right[ ```r # Install from CRAN install.packages("pdp") ``` <img src="figures/pdp-logo.png" width="80%" style="display: block; margin: auto;" /> ] --- class: clear, middle # Partial dependence plots ```r # Load required packages library(pdp) # PDPs for the top two predictors p1 <- partial(boston_ranger, pred.var = "lstat", plot = TRUE) p2 <- partial(boston_ranger, pred.var = "rm", plot = TRUE) *p3 <- partial(boston_ranger, pred.var = c("lstat", "rm"), * chull = TRUE, plot = TRUE) grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/boston-rf-pdps-1.svg" width="100%" style="display: block; margin: auto;" /> --- class: clear, middle # Partial dependence plots .code115[ ```r *# 3-D plots *pd <- attr(p3, "partial.data") # no need to recalculate p1 <- plotPartial(pd, levelplot = FALSE, drape = TRUE, colorkey = FALSE, screen = list(z = -20, x = -60) ) ``` ```r *# Using ggplot2 library(ggplot2) p2 <- autoplot(pd, palette = "magma") ``` ```r *# ICE and c-ICE curves p3 <- boston_ranger %>% # %>% is automatically imported! partial(pred.var = "rm", ice = TRUE, center = TRUE) %>% autoplot(alpha = 0.1) ``` ] --- class: clear, middle # Partial dependence plots .code150[ ```r # Display all three plots side by side grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/unnamed-chunk-12-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Proximity matrix .font115[ * In random forests (and bagging), **trees are intentionally grown deep**! ] --- class: clear, middle, center <img src="intro-ml-r_files/figure-html/mushroom-tree-diagram-3-1.svg" width="70%" style="display: block; margin: auto;" /> --- # Proximity matrix .font115[ * In random forests (and bagging), **trees are intentionally grown deep**! 1. Initialize the proximity between cases `\(i\)` and `\(j\)` to zero: `\(prox_{ij} = 0\)` 2. After a tree is grown, put all of the data, both training and OOB, down the tree 3. If cases `\(i\)` and `\(j\)` land in the same terminal node, increase `\(prox_{ij}\)` by one 4. 
At the end, normalize `\(prox_{ij}\)` by dividing by the number of trees in the forest * The pairwise proximities can be used to construct a useful distance matrix * **Requires the storage of an `\(N \times N\)` matrix!!** 😱 * Only available in [randomForest](https://cran.r-project.org/package=randomForest), [party](https://cran.r-project.org/web/packages/party/index.html), and [bigrf](https://github.com/aloysius-lim/bigrf) ] --- # Swiss banknote data ```r # Load the data data(banknote, package = "alr3") # Fit a random forest set.seed(1701) # for reproducibility banknote_rf <- randomForest( as.factor(Y) ~ ., data = banknote, * proximity = TRUE ) # Print the OOB confusion matrix banknote_rf$confusion ## 0 1 class.error ## 0 99 1 0.01 ## 1 0 100 0.00 ``` --- # Proximity matrix ```r # Heatmap of proximity-based distance matrix heatmap(1 - banknote_rf$proximity, col = viridis::plasma(256)) ``` <img src="intro-ml-r_files/figure-html/heatmap-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Outlier scores Outlyingness of case `\(n\)` in class `\(j\)` to all other cases in class `\(j\)`: `\(out\left(n\right) = N / \sum_{cl\left(k\right) = j} prox_{ik} ^ 2\)` .pull-left[ ```r # Dot chart of proximity-based outlier scores outlyingness <- tibble::tibble( * "out" = outlier(banknote_rf), "obs" = seq_along(out), "class" = as.factor(banknote$Y) ) ggplot(outlyingness, aes(x = obs, y = out)) + geom_point(aes(color = class, size = out), alpha = 0.5) + * geom_hline(yintercept = 10, linetype = 2) + labs(x = "Observation", y = "Outlyingness") + theme_light() + theme(legend.position = "none") ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/outlier-plot-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- # Multi-dimensional scaling ```r # Multi-dimensional scaling plot of proximity matrix MDSplot(banknote_rf, fac = as.factor(banknote$Y), k = 2, cex = 1.5) ``` <img src="intro-ml-r_files/figure-html/MDS-1.svg" width="45%" style="display: block; margin: auto;" /> --- # Missing value imputation * Random forest offers a novel approach to imputing missing values (i.e., `NA`s) 1. Start with cheap imputation (e.g., median for continuous features) 2. Run a random forest and obtain the proximity matrix 3. Imputed values are updated using: - The weighted average of the non-missing observations, where the weights are the proximity (**continuous**) - The category with the largest average proximity (**categorical**) 4. Iterate this procedure until convergence (~ 4-6 times seems sufficient) * Highly computational (requires many random forest runs!!) <!-- * Resulting **OOB error estimate tends to be overly optimistic** --> * Only available in [randomForest](https://cran.r-project.org/package=randomForest) --- class: clear, middle, center background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) .font300.white[Boosting] ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- # Background .font125[ * Boosting is a general technique for creating an ensemble of models (*meta-algorithm*) * In contrast to bagging, boosting: - Uses a *weak learner* for the base learners - Builds the models sequentially (fix past mistakes) * Most commonly used with **shallow decision trees** (e.g., decision stumps), but other types of models can be boosted * **Boosting requires more tuning**, but is typically faster and more accurate than RFs! 
] --- # Background .font125[ * **Many flavors of boosting:** AdaBoost, Real AdaBoost, gradient boosting, and stochastic gradient boosting (GBM), for example * GBMs are the most **general** and widely used flavor of boosting - Introduces subsampling into the rows and columns (**stochastic**) - Loss function is optimized using *gradient descent* (**gradient**) - With the right combination of parameters, GBMs can emulate RFs! ] --- # GBMs with squared-error loss 1. Given a training sample `\(\boldsymbol{d} = \left(\boldsymbol{X}, \boldsymbol{y}\right)\)`. Fix the number of steps `\(B\)`, the shrinkage factor `\(\epsilon\)`, and the tree depth `\(d\)`. Set the initial fit `\(\widehat{G}_0 \equiv 0\)`, and the residual vector `\(\boldsymbol{r} = \boldsymbol{y}\)`. 2. For `\(b = 1, 2, \dots, B\)` repeat: a) Fit a regression tree `\(\tilde{g}_b\)` to the data `\(\left(\boldsymbol{X}, \boldsymbol{r}\right)\)`, grown best-first to depth `\(d\)` b) Update the fitted model with a shrunken version of `\(\tilde{g}_b\)`: `\(\widehat{G}_b = \widehat{G}_{b - 1} + \epsilon \tilde{g}_b\)` c) Update the residuals accordingly: `\(r_i = r_i - \epsilon \tilde{g}_b\left(x_i\right)\)`, `\(i = 1, 2, \dots, n\)` 3. Return the sequence of fitted functions `\(\widehat{G}_b\)`, `\(b = 1, 2, \dots, B\)`. * Here, `\(B\)`, `\(\epsilon\)`, and `\(d\)` are tuning parameters .font150.center[ [Poor man's GBM code](https://github.com/bgreenwell/intro-ml-r/blob/master/code/rpartBoost.R) ] ??? Source: [Computer Age Statistical Inference, pg. 333](https://web.stanford.edu/~hastie/CASI/) --- class: clear, center, middle .center.font200[Boosting regression stumps] ![]() --- # Some GBM packages in R .scrollable[ * [gbm](https://github.com/gbm-developers/gbm): **g**eneralized **b**oosted **m**odels - The original R implementation of GBMs (by Greg Ridgeway) - Slower than modern implementations (but still pretty fast) - Provides OOB error estimate - Supports the *weighted tree traversal* method for fast construction of PDPs - ~~Currently orphaned on CRAN~~ * [gbm3](https://github.com/gbm-developers/gbm3): **g**eneralized **b**oosted **m**odels - Shiny new version of `gbm` that is not backwards compatible - Faster and supports parallel tree building - Not currently listed on CRAN * [xgboost](https://github.com/dmlc/xgboost): <span style="color:DeepSkyBlue">eXtreme Gradient Boosting</span> - Widely used by data scientists to achieve **state-of-the-art** results across many machine learning challenges (e.g., Kaggle) - Designed to be highly efficient, flexible, and portable - Supports user-defined loss functions - Supports early stopping - Includes fast histogram method for numeric features - Parallel tree boosting and GPU support!! <img src="figures/gpu.jpg" width="50%" style="display: block; margin: auto;" /> * [lightgbm](https://github.com/Microsoft/LightGBM): - Microsoft-supported, and somewhat [faster](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment), alternative to **xgboost** - Uses *leaf-wise* (i.e., best-first) tree growth - Not currently listed on CRAN * [mboost](https://github.com/boost-R): **m**odel-based **boost**ing - Implements tools for fitting and evaluating a variety of regression and classification models (somewhere between GzLMs/GAMs and GBMs) ] --- # Missing values .font140[ * Unlike RFs, GBMs typically support missing values!
* In most GBM implementations, `NA`s are interpreted as containing information (i.e., missing for a reason), rather than missing at random - In [gbm](https://github.com/gbm-developers/gbm) and [gbm3](https://github.com/gbm-developers/gbm3), a separate branch is created at each split for `NA`s (i.e., each decision node contains three splits: left, right, and missing) - In [h2o](https://github.com/h2oai/h2o-3/tree/master/h2o-r) and [xgboost](https://github.com/dmlc/xgboost) each split has a **default direction** that is learned during training (`NA`s take the default route!) ] --- class: clear, middle # Boston housing example .code110[ ```r # Load required packages library(gbm) # Fit a GBM to the Boston housing data set.seed(1053) # for reproducibility boston_gbm <- gbm( cmedv ~ ., data = boston, * var.monotone = NULL, * distribution = "gaussian", # "benoulli", "coxph", etc. * n.trees = 10000, * interaction.depth = 5, * n.minobsinnode = 10, * shrinkage = 0.005, * bag.fraction = 1, * train.fraction = 1, cv.folds = 10 # k-fold CV often gives the best results ) ``` ] --- class: clear, middle # Boston housing example ```r best_iter <- gbm.perf( # how many trees? boston_gbm, * method = "cv" # or "OOB" or "test" ) ``` <img src="intro-ml-r_files/figure-html/boston_gbm_best-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Partial dependence plots .pull-left[ ## Brute force method: ```r system.time( pd1 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = FALSE, chull = TRUE, * n.trees = best_iter ) ) ## user system elapsed ## 112.222 0.839 113.320 ``` ] .pull-right[ ## Recursive method: ```r system.time( pd2 <- partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = TRUE, chull = TRUE, * n.trees = best_iter ) ) ## user system elapsed ## 0.458 0.002 0.460 ``` ] --- # Partial dependence plots ```r # Display plots side by side grid.arrange(autoplot(pd1), autoplot(pd2), ncol = 2) ``` <img src="intro-ml-r_files/figure-html/boston-pdps-1.svg" style="display: block; margin: auto;" /> --- # Ames housing data .pull-left.font120[ * A contemporary alternative to the often cited Boston housing data * Contains the sale of individual residential property in Ames, Iowa from 2006 to 2010 * Contains `\(N = 2930\)` observations and `\(p = 80\)` features involved in assessing home values * Lots of potential for *feature engineering*!! 
] .pull-right[ ```r ggplot(AmesHousing::make_ames(), aes(x = Sale_Price, y = Overall_Qual)) + * ggridges::geom_density_ridges(aes(fill = Overall_Qual)) + scale_x_continuous(labels = scales::dollar) + labs(x = "Sale price", y = "Overall quality") + theme_light() + theme(legend.position = "none") ``` <img src="intro-ml-r_files/figure-html/ames-densities-1.svg" width="100%" style="display: block; margin: auto;" /> .font80[**Source:** [Bradley Boehmke](https://github.com/bradleyboehmke)] ] --- class: clear, middle # Ames housing data .scrollable[ ```r # Load required packages library(xgboost) # Construct data set ames <- AmesHousing::make_ames() *# Feature matrix # or xgb.DMatrix or sparse matrix X <- data.matrix(subset(ames, select = -Sale_Price)) # Fit an XGBoost model set.seed(203) # for reproducibility ames_xgb <- xgboost( # tune using `xgb.cv()` data = X, label = ames$Sale_Price, * objective = "reg:linear", # loss function * nrounds = 2771, # number of trees * max_depth = 5, # interaction depth * eta = 0.01, # learning rate * subsample = 1, * colsample = 1, * num_parallel_tree = 1, * eval_metric = "rmse", verbose = 0, * save_period = NULL ) ``` ] --- class: clear, middle <img src="intro-ml-r_files/figure-html/ames-xgboost-cv-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Ames housing data ```r # Variable importance plots p1 <- vip(ames_xgb, feature_names = colnames(X), type = "Gain") p2 <- vip(ames_xgb, feature_names = colnames(X), type = "Cover") p3 <- vip(ames_xgb, feature_names = colnames(X), type = "Frequency") grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/xgboost-vip-1.svg" style="display: block; margin: auto;" /> --- # Ames housing data .pull-left[ ```r # By default, `vip()` plots the # top 10 features vip( object = ames_xgb, feature_names = colnames(X), type = "Gain", num_features = 20, #nrow(X), bar = FALSE ) ``` ] .pull-right[ <img src="intro-ml-r_files/figure-html/xgboost-vip-3-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Ames housing data ```r # Partial dependence plots oq_ice <- partial(ames_xgb, pred.var = "Overall_Qual", ice = TRUE, center = TRUE, train = X) p4 <- autoplot(partial(ames_xgb, pred.var = "Gr_Liv_Area", train = X)) p5 <- autoplot(partial(ames_xgb, pred.var = "Garage_Cars", train = X)) p6 <- autoplot(oq_ice, alpha = 0.1) grid.arrange(p4, p5, p6, ncol = 3) ``` <img src="intro-ml-r_files/figure-html/xgboost-pdp-1.svg" width="90%" style="display: block; margin: auto;" /> --- class: clear, middle, center # Ames housing data .font200.center[Variable effect and variable importance] <img src="intro-ml-r_files/figure-html/xgboost-pdp-vi-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Summary .scrollable[ .center.bold[Check out R's [machine learning task view](https://cran.r-project.org/web/views/MachineLearning.html)!] 
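One practical note on the "requires more tuning" point below: the Ames example above hard-coded `nrounds = 2771`; a rough sketch of choosing that value with cross-validated early stopping (assuming the Ames feature matrix `X` built earlier) might look like:

```r
# Rough sketch of choosing nrounds via cross-validation with early
# stopping; assumes the Ames feature matrix X built earlier in this talk
library(xgboost)

set.seed(203)  # for reproducibility
ames_cv <- xgb.cv(
  data = X,
  label = ames$Sale_Price,
  objective = "reg:linear",      # same loss as the earlier fit
  nrounds = 5000,                # upper bound on the number of trees
  max_depth = 5,
  eta = 0.01,
  nfold = 5,                     # 5-fold cross-validation
  early_stopping_rounds = 50,    # stop once test RMSE stops improving
  verbose = 0
)
ames_cv$best_iteration  # candidate value for nrounds
```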
.pull-left[ ### <span style="color:MediumSeaGreen">Random forests:</span> * Builds an ensemble of fully grown decision trees (**low bias, high variance**) - Correlation between trees is reduced through subsampling the columns - Variance is reduced through averaging * Nearly automatic * Lots of cool bells and whistles * Trees are independently grown (embarrassingly parallel) ] .pull-right[ ### <span style="color:MediumSeaGreen">GBMs:</span> * Builds an ensemble of small decision trees (**high bias, low variance**) - Bias is reduced through sequential learning and fixing past mistakes * Requires more tuning, but can be more accurate * Flexible loss functions * Trees are **NOT** grown independently, but training times are usually pretty fast since trees are not grown too deep ] ] --- class: clear, middle .pull-left.font100.left[ <img src="figures/leo-breiman.jpg" width="60%" style="display: block; margin: auto;" /> "Take the output of random forests not as absolute truth, but as smart computer-generated guesses that may be helpful in leading to a deeper understanding of the problem." --- Leo Breiman ] .pull-right.font100.left[ <img src="figures/george-box.jpg" width="54%" style="display: block; margin: auto;" /> "All models are wrong, but some are useful." --- George E. Box ] --- class: clear, middle, center .font300[Questions?] <img src="figures/questions.png" width="100%" style="display: block; margin: auto;" />