class: center, middle, inverse, title-slide .title[ # Peeking Inside the ‘Black Box’ ] .subtitle[ ## Post-Hoc Interpretability ] .author[ ### Brandon M. Greenwell ] .institute[ ### 84.51°/University of Cincinnati ] .date[ ### R-Ladies Utrecht: 2023-03-06 ] --- ## Shameless plug...📦/📚 <img src="images/books.png" width="100%" /> --- ## Good resources * [Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/) - Christoph Molnar is also the creator of the well-known [iml package](https://cran.r-project.org/package=iml) * In-progress [article](https://github.com/bgreenwell/rjournal-shapley) on Shapley explanations for [*The R Journal*](https://journal.r-project.org/) - Consider contributing 😄 * [Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models. With examples in R and Python](https://ema.drwhy.ai/) - Authors associated with the [DALEX](https://github.com/ModelOriented/DALEX) ecosystem for IML --- ## Agenda Post-hoc methods/packages to help comprehend various aspects of any fitted model: * feature importance via [vip](https://journal.r-project.org/archive/2020/RJ-2020-013/index.html) * feature effects via [pdp](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) * feature contributions via [fastshap](https://github.com/bgreenwell/fastshap) Plenty of other R 📦s are available as well! For example, [iml](https://cran.r-project.org/package=iml) and [DALEX](https://cran.r-project.org/package=DALEX) For a somewhat recent overview, see [Landscape of R packages for eXplainable Artificial Intelligence](https://arxiv.org/pdf/2009.13248.pdf) --- ## CoIL data challenge The two goals of the CoIL challenge were: 1. to build a model from the 5,822 training records and use it to find the top 20% of customers in the test set who are most likely to own caravan insurance policies and 2.
to provide insight into why some customers have caravan insurance policies and how they differ from other customers. Source: https://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/ --- ## CoIL data challenge ```r # Load insurance company data from CoIL Challenge 2000 data(ticdata, package = "kernlab") # Split into train/test (same splits used in challenge) tic.trn <- ticdata[1:5822, ] tic.tst <- ticdata[-(1:5822), ] # Class frequencies (tab <- table(tic.trn$CARAVAN)) ``` ``` ## ## noinsurance insurance ## 5474 348 ``` ```r proportions(tab) # similar to test data; ~ 16:1 ratio ``` ``` ## ## noinsurance insurance ## 0.94022673 0.05977327 ``` --- ## Variable importance * For a more in-depth overview, see [Greenwell and Boehmke (2020)](https://journal.r-project.org/archive/2020/RJ-2020-013/index.html) * For our purposes, think of variable importance (VI) as the ".green[...extent to which a feature has a 'meaningful' impact on the predicted outcome.]" * A more formal definition can be found in [van der Laan (2006)](https://www.degruyter.com/document/doi/10.2202/1557-4679.1008/html?lang=en) * We'll discuss several types of VI methods: - model-specific (e.g., decision trees) - variance-based measures; see [Greenwell et al., 2018](https://arxiv.org/abs/1805.04755) - permutation importance - aggregated Shapley values --- class: middle, center # Why the vip 📦?  --- ## Model-specific VI scores Examples of model classes where "natural" measures of variable importance exist: * Decision trees and tree-based ensembles - **One of the best methods, IMO**: [GUIDE](https://pages.stat.wisc.edu/~loh/guide.html) for VI scoring/ranking; check out [Loh and Zhou (2022)](https://jds-online.org/journal/JDS/article/1250/info) for the deets * Works for a wide range of response types * Missing values * Interaction effects * And the list goes on...
* Generalized linear models (e.g., standardized coefficients or test statistics) * Neural networks (e.g., Garson's method and Olden's method) * Multivariate adaptive regression splines (MARS) Check out the [vip paper](https://journal.r-project.org/archive/2020/RJ-2020-013/index.html) for examples in R! --- ## CoIL challenge: random forest ```r library(ranger) # Fit some (default) probability forests with different VI measures set.seed(926) # for reproducibility tic.rfo1 <- ranger(CARAVAN ~ ., probability = TRUE, data = tic.trn, importance = "impurity") (tic.rfo2 <- ranger(CARAVAN ~ ., probability = TRUE, data = tic.trn, importance = "impurity_corrected")) ``` ``` ## Ranger result ## ## Call: ## ranger(CARAVAN ~ ., probability = TRUE, data = tic.trn, importance = "impurity_corrected") ## ## Type: Probability estimation ## Number of trees: 500 ## Sample size: 5822 ## Number of independent variables: 85 ## Mtry: 9 ## Target node size: 10 ## Variable importance mode: impurity_corrected ## Splitrule: gini ## OOB prediction error (Brier s.): 0.05436326 ``` --- ## CoIL challenge: random forest ```r library(patchwork) library(vip) vip(tic.rfo1, include_type = TRUE) + vip(tic.rfo2, include_type = TRUE) ``` <img src="slides_files/figure-html/ticdata-rfos-vips-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## Permutation importance Permutation importance is .tomato[any measure of how much *worse* a model's predictions are after randomly permuting a particular feature column]. <img src="images/permutation-importance-01.png" style="width: 90%" /> -- <img src="images/permutation-importance-02.png" style="width: 90%" /> --- ## Permutation importance .center.medium[**A simple algorithm for constructing permutation VI scores**] Let `\(X_1, X_2, \dots, X_j\)` be the features of interest and let `\(\mathcal{M}_{orig}\)` be the baseline performance metric for the trained model; for brevity, we'll assume smaller is better (e.g., classification error or RMSE).
The permutation-based importance scores can be computed as follows: 1. For `\(i = 1, 2, \dots, j\)`: a. Permute the values of feature `\(X_i\)` in the training data. b. Recompute the performance metric on the permuted data `\(\mathcal{M}_{perm}\)`. c. Record the difference from baseline using `\(vi\left(X_i\right) = \mathcal{M}_{perm} - \mathcal{M}_{orig}\)`. 2. Return the VI scores `\(vi\left(X_1\right), vi\left(X_2\right), \dots, vi\left(X_j\right)\)`. Do this many times for each feature and average the results! --- ## Why permutation-based importance? .font120[ * *Model-agnostic* (.blue[can be applied to any algorithm]) - Makes it easier to compare across models (🍎 vs. 🍎) * Easily parallelized * Readily available ([scikit-learn](https://scikit-learn.org/stable/modules/permutation_importance.html), **Data.dodgerblue[Robot]**, [vip](https://cran.r-project.org/package=vip), etc.) - There are several implementations in R, including [vip](https://cran.r-project.org/package=vip), [iml](https://cran.r-project.org/package=iml), [ingredients](https://cran.r-project.org/package=ingredients), and [mmpf](https://cran.r-project.org/package=mmpf) - The implementations in [scikit-learn](https://scikit-learn.org/stable/modules/permutation_importance.html), [vip](https://cran.r-project.org/package=vip), and [iml](https://cran.r-project.org/package=iml) are parallelized 😎 ] --- class: middle, center ## Why the vip 📦? Based on **100 repeats** of permutation importance using a random forest fit to a training set with **10k rows** and **10 features** <img src="images/benchmark-vip.png" style="width: 90%" /> --- ## Friedman 1 benchmark example Consider the following regression model: `\begin{equation} Y_i = 10 \sin\left(\pi X_{1i} X_{2i}\right) + 20 \left(X_{3i} - 0.5\right) ^ 2 + 10 X_{4i} + 5 X_{5i} + \epsilon_i, \quad i = 1, 2, \dots, n, \end{equation}` where `\(\epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)\)`. 
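Only `\(X_1\)`–`\(X_5\)` enter the mean function; `\(X_6\)`–`\(X_{10}\)` are pure noise, so a good VI method should rank the first five features highest. As a quick sketch, the mean function above can be coded directly (`friedman_mu` is a hypothetical helper, not part of vip):

```r
# Mean function of the Friedman 1 benchmark; `x` is a numeric vector of
# length >= 5 with entries in [0, 1] (elements 6-10 never enter)
friedman_mu <- function(x) {
  10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 + 10 * x[4] + 5 * x[5]
}
friedman_mu(rep(0.5, 10))  # 10 * sin(pi / 4) + 0 + 5 + 2.5 = 14.5711...
```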
```r trn <- vip::gen_friedman(500, sigma = 1, seed = 101) # simulate training data tibble::as_tibble(trn) # inspect output ``` ``` ## # A tibble: 500 × 11 ## y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 14.9 0.372 0.406 0.102 0.322 0.693 0.758 0.518 0.530 0.878 0.763 ## 2 15.3 0.0438 0.602 0.602 0.999 0.776 0.533 0.509 0.487 0.118 0.176 ## 3 15.1 0.710 0.362 0.254 0.548 0.0180 0.765 0.715 0.844 0.334 0.118 ## 4 10.7 0.658 0.291 0.542 0.327 0.230 0.301 0.177 0.346 0.474 0.283 ## 5 17.6 0.250 0.794 0.383 0.947 0.462 0.00487 0.270 0.114 0.489 0.311 ## 6 18.3 0.300 0.701 0.992 0.386 0.666 0.198 0.924 0.775 0.736 0.974 ## 7 14.6 0.585 0.365 0.283 0.488 0.845 0.466 0.715 0.202 0.905 0.640 ## 8 17.0 0.333 0.552 0.858 0.509 0.697 0.388 0.260 0.355 0.517 0.165 ## 9 8.54 0.622 0.118 0.490 0.390 0.468 0.360 0.572 0.891 0.682 0.717 ## 10 15.0 0.546 0.150 0.476 0.706 0.829 0.373 0.192 0.873 0.456 0.694 ## # … with 490 more rows ``` --- ## Friedman 1 benchmark example (PPR) ```r # Projection pursuit regression fit pp <- ppr(y ~ ., data = trn, nterms = 11) # Use 15 Monte Carlo reps set.seed(403) # for reproducibility vis <- vi(pp, method = "permute", target = "y", metric = "rsquared", pred_wrapper = predict, nsim = 15) vip(vis, geom = "boxplot") ``` <img src="slides_files/figure-html/permute-friedman-nn-result-1.svg" width="60%" style="display: block; margin: auto;" /> --- ## Friedman 1 benchmark example (RF) Most IML-related R packages are .purple[**flexible enough to handle ANY fitted model**]!
For example: ```r # Fit a default random forest rfo <- ranger::ranger(y ~ ., data = trn) # Prediction wrapper pfun <- function(object, newdata) { predict(object, data = newdata)$predictions } # Mean absolute error mae <- function(actual, predicted) { mean(abs(actual - predicted)) } # Permutation-based VIP with user-defined MAE metric set.seed(1101) # for reproducibility vip(rfo, method = "permute", target = "y", metric = mae, smaller_is_better = TRUE, pred_wrapper = pfun, nsim = 10, geom = "point", all_permutations = TRUE, jitter = TRUE) + theme_bw() ``` --- ## Friedman 1 benchmark example (RF) <img src="slides_files/figure-html/friedman1-rf-results-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Friedman 1 benchmark example (RF) ```r # FIRM-based VI scores with sparklines vi(rfo, method = "firm", pred_wrapper = pfun) %>% add_sparklines() ```
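---

## Permutation importance by hand

The algorithm sketched earlier is simple enough to hand-roll. Here's a minimal sketch (`perm_vi` is a hypothetical helper, NOT the vip implementation; no parallelism, no plotting) that reuses the same prediction-wrapper idea:

```r
# Minimal permutation importance: average increase in loss after shuffling
# each feature column; `metric` is a loss function where smaller is better
perm_vi <- function(object, pfun, X, y, metric, nsim = 10) {
  base <- metric(y, pfun(object, newdata = X))  # baseline performance
  vapply(names(X), FUN.VALUE = numeric(1), FUN = function(feature) {
    mean(replicate(nsim, {
      X.perm <- X
      X.perm[[feature]] <- sample(X.perm[[feature]])  # permute one column
      metric(y, pfun(object, newdata = X.perm)) - base  # drop from baseline
    }))
  })
}
```

For the random forest above, `perm_vi(rfo, pfun, subset(trn, select = -y), trn$y, mae)` should rank `x1`–`x5` well above the noise features `x6`–`x10`.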
--- ## Permutation importance .pull-left[ ## Drawbacks * Should you use the train or test data set for permuting? * Requires access to the true target values * Results are random (due to random shuffling of columns) * Correlated features lead to *extrapolating* 😱 ] .pull-right[ ## Alternatives * *Leave-one-variable-out* (LOVO) importance * Conditional variable importance 🌲 * Dropped variable importance * Permute-and-relearn importance * Condition-and-relearn importance ] .center.font150[[Please Stop Permuting Features: An Explanation and Alternatives](https://arxiv.org/abs/1905.03151)] --- ## PDPs in a nutshell 🥜 * A plot showing the .tomato[*marginal* (or average) effect] of a small subset of features (usually one or two) .tomato[on the predicted outcome] 📉 - The PDP for the `\(j\)`-th feature `\(x_j\)` .blue[shows how the average prediction changes as a function of] `\(x_j\)` (the average is taken across the training set, or a representative sample thereof) * Can help determine if the modeled relationship is nearly linear, nonlinear, monotonic, etc. * .red[Can be misleading in the presence of strong *interaction effects*] 😱 - .green[*Individual conditional expectation* (ICE) curves], a slight modification to PDPs, don't share this disadvantage - **think of ICE curves as a marginal effect plot for individual observations**, one curve for each row in the training data --- ## How are PDPs constructed (algorithm view 🤮)? Constructing a PDP in practice is rather straightforward. To simplify, let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)`. The partial dependence of the response on `\(x_1\)` can be constructed as follows: * For `\(i \in \left\{1, 2, \dots, k\right\}\)`: 1. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)`. 2. Compute the vector of predicted values from the modified copy of the training data. 3.
Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)`. * Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)`. .font150.center[.tomato[Rather straightforward to implement actually!] 💻] --- ## CoIL challenge: GUIDE-based VI scores ``` ## # A tibble: 8 × 3 ## Type Score Variable ## <chr> <dbl> <chr> ## 1 A 2.41 PPERSAUT ## 2 A 2.17 PBRAND ## 3 A 1.92 PPLEZIER ## 4 A 1.82 APERSAUT ## 5 A 1.60 MOSHOOFD ## 6 A 1.55 APLEZIER ## 7 A 1.46 MKOOPKLA ## 8 A 1.42 STYPE ``` --- ## CoIL challenge: PD plots ```r library(ggplot2) library(pdp) # PD and c-ICE plots p1 <- partial(tic.rfo1, pred.var = "APERSAUT") p2 <- partial(tic.rfo1, pred.var = "APERSAUT", ice = TRUE, center = TRUE, # for c-ICE plots * train = tic.trn[sample.int(500), ]) # DON'T PLOT THEM ALL!! # Display plots (autoplot(p1) + theme_bw() | autoplot(p2, alpha = 0.1) + theme_bw()) / ggplot(tic.trn, aes(x = APERSAUT)) + geom_bar() + theme_bw() ``` --- ## CoIL challenge: PD plots <img src="slides_files/figure-html/coil-rf-pd-results-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## PD plots using simple SQL operations See [pdp issue (#97)](https://github.com/bgreenwell/pdp/issues/97) ```r # Load required packages library(dplyr) library(pdp) library(sparklyr) data(boston, package = "pdp") sc <- spark_connect(master = 'local') boston_sc <- copy_to(sc, boston, overwrite = TRUE) rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto") # Define plotting grid df1 <- data.frame(lstat = quantile(boston$lstat, probs = 1:19/20)) %>% copy_to(sc, df = .) # Remove plotting variable from training data df2 <- boston %>% select(-lstat) %>% copy_to(sc, df = .) ``` --- ## PD plots using simple SQL operations ```r # Perform a cross join, compute predictions, then aggregate! par_dep <- df1 %>% full_join(df2, by = character()) %>% # Cartesian product (i.e., cross join) ml_predict(rfo, dataset = .) 
%>% group_by(lstat) %>% summarize(yhat = mean(prediction)) %>% # average for partial dependence select(lstat, yhat) %>% # select plotting variables arrange(lstat) %>% # for plotting purposes collect() # Plot results plot(par_dep, type = "l") ``` <img src="images/spark-pd-plot.png" width="50%" style="display: block; margin: auto;" /> --- ## PDPs and ICE curves .pull-left[ ### Drawbacks * PDPs for more than one feature (i.e., .blue[visualizing interaction effects]) can be computationally demanding * Correlated features lead to *extrapolating* * [Please Stop Permuting Features: An Explanation and Alternatives](https://arxiv.org/abs/1905.03151) ] .pull-right[ ### Alternatives * ["Poor man's" PDPs](https://github.com/bgreenwell/pdp/issues/91); historically available in package [plotmo](https://cran.r-project.org/package=plotmo) and now available in [pdp](https://cran.r-project.org/package=pdp) (version >= 0.8.0) * [Accumulated local effect (ALE) plots](https://arxiv.org/abs/1612.08468) * [Stratified PDPs](https://arxiv.org/abs/1907.06698) * Shapley-based dependence plots ] --- ## Explaining individual predictions * While discovering which features have the biggest *overall* impact on the model is important, it is often more informative to determine: .center.MediumSeaGreen[Which features impacted a specific set of predictions, and how?] 
* We can think of this as *local* (or *case-wise*) *variable importance* - More generally referred to as *prediction explanations* or .magenta[*feature contributions*] * Many different flavors, but we'll focus on (arguably) the most popular: .dodgerblue[*Shapley explanations*] --- ## Shapley explanations For an arbitrary observation `\(\boldsymbol{x}_0\)`, Shapley values provide a measure of each feature value's contribution to the difference `$$\hat{f}\left(\boldsymbol{x}_0\right) - \frac{1}{N}\sum_{i = 1}^N \hat{f}\left(\boldsymbol{x}_i\right)$$` * Based on [Shapley values](https://en.wikipedia.org/wiki/Shapley_value), an idea from *game theory* 😱 * Can be computed for all training rows and aggregated into useful summaries (e.g., variable importance) * The only prediction explanation method to satisfy several useful properties of .dodgerblue[*fairness*]: 1. Local accuracy (efficiency) 2. Missingness 3. Consistency (monotonicity) --- ## So, what's a Shapley value? -- In .forestgreen[*cooperative game theory*], the Shapley value is the average marginal contribution of a .forestgreen[*player*] across all possible .forestgreen[*coalitions*] in a .forestgreen[*game*] [(Shapley, 1951)](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM670.pdf): `$$\phi_i\left(val\right) = \frac{1}{p!} \sum_{\mathcal{O} \in \pi\left(p\right)} \left[val\left(Pre^i\left(\mathcal{O}\right) \cup \left\{i\right\}\right) - val\left(Pre^i\left(\mathcal{O}\right)\right)\right], \quad i = 1, 2, \dots, p$$` -- .pull-left[ <img src="https://media.giphy.com/media/3o6MbbwX2g2GA4MUus/giphy.gif?cid=ecf05e471n8c85mbtirkm0ra4x4qa8ezo2idws6ag4f2rvtw&rid=giphy.gif&ct=g" style="width: 80%" /> ] .pull-right[ .font90[ In the context of predictive modeling: * .dodgerblue[**Game**] = prediction task for a single observation `\(\boldsymbol{x}_0\)` * .dodgerblue[**Players**] = the feature values of `\(\boldsymbol{x}_0\)` that collaborate to receive the *gain* or *payout* * .dodgerblue[**Payout**] = prediction for
`\(\boldsymbol{x}_0\)` minus the average prediction for all training observations (i.e., baseline) ] ] --- ## Approximating Shapley values .purple[**For the programmers**], implementing approximate Shapley explanations is rather straightforward [(Strumbelj et al., 2014)](https://dl.acm.org/doi/10.1007/s10115-013-0679-x): .center[ <img src="images/shapley-algorithm.png" style="width: 100%" class="center" /> ] --- class: middle A poor-man's implementation in R... ```r sample.shap <- function(f, obj, R, x, feature, X) { phi <- numeric(R) # to store Shapley values N <- nrow(X) # sample size p <- ncol(X) # number of features b1 <- b2 <- x for (m in seq_len(R)) { * w <- X[sample(N, size = 1), ] # randomly drawn instance * ord <- sample(names(w)) # random permutation of features * swap <- ord[seq_len(which(ord == feature) - 1)] * b1[swap] <- w[swap] * b2[c(swap, feature)] <- w[c(swap, feature)] * phi[m] <- f(obj, newdata = b1) - f(obj, newdata = b2) } mean(phi) } ``` --- class: middle ## Enter...**fastshap** * Explaining `\(N\)` instances with `\(p\)` features would require `\(2 \times m \times N \times p\)` calls to `\(\hat{f}\left(\right)\)` * [fastshap](https://cran.r-project.org/package=fastshap) reduces this to `\(2 \times m \times p\)` - Trick here is to generate all the "Frankenstein instances" up front, and score the differences once: `\(\hat{f}\left(\boldsymbol{B}_1\right) - \hat{f}\left(\boldsymbol{B}_2\right)\)` * Logical subsetting! 
(http://adv-r.had.co.nz/Subsetting.html) - It's also parallelized across predictors (not by default) - Supports Tree SHAP implementations in both the [xgboost](https://cran.r-project.org/package=xgboost) and [lightgbm](https://cran.r-project.org/package=lightgbm) packages (.dodgerblue[woot!]) - ~~*Force plots* via [reticulate](https://rstudio.github.io/reticulate/) (works in R markdown): https://bgreenwell.github.io/fastshap/articles/forceplot.html~~ --- class: middle ## Simple benchmark Explaining a single observation from a [ranger](https://cran.r-project.org/web/packages/ranger/index.html)-based random forest fit to the well-known [titanic](https://cran.r-project.org/package=titanic) data set. <img src="slides_files/figure-html/unnamed-chunk-1-1.svg" width="90%" style="display: block; margin: auto;" /> --- class: middle ### Example: understanding survival on the Titanic .scrollable.code70[ ```r library(ggplot2) library(ranger) library(fastshap) # Set ggplot2 theme theme_set(theme_bw()) # Read in the data and clean it up a bit titanic <- titanic::titanic_train features <- c( "Survived", # passenger survival indicator "Pclass", # passenger class "Sex", # gender "Age", # age "SibSp", # number of siblings/spouses aboard "Parch", # number of parents/children aboard "Fare", # passenger fare "Embarked" # port of embarkation ) titanic <- titanic[, features] titanic$Survived <- as.factor(titanic$Survived) *titanic <- na.omit(titanic) # ...umm? 
``` ] --- class: middle ### Example: understanding survival on the Titanic .scrollable.code70[ ```r # Fit a (default) random forest set.seed(1046) # for reproducibility rfo <- ranger(Survived ~ ., data = titanic, probability = TRUE) # Prediction wrapper for `fastshap::explain()`; has to return a # single (atomic) vector of predictions pfun <- function(object, newdata) { # computes prob(Survived=1|x) predict(object, data = newdata)$predictions[, 2] } # Estimate feature contributions for each training row X <- subset(titanic, select = -Survived) # features only! set.seed(1051) # for reproducibility *(ex.all <- explain(rfo, X = X, nsim = 100, adjust = TRUE, pred_wrapper = pfun)) ``` ``` ## # A tibble: 714 × 7 ## Pclass Sex Age SibSp Parch Fare Embarked ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -0.0416 -0.180 -0.00314 0.0135 -0.0110 -0.0405 -0.0196 ## 2 0.165 0.267 -0.0156 0.00123 0.00162 0.123 0.0339 ## 3 -0.118 0.224 0.0144 0.0226 -0.00727 -0.0312 -0.0208 ## 4 0.150 0.297 0.00748 -0.00627 0.00268 0.128 -0.0101 ## 5 -0.0387 -0.161 -0.0328 0.0132 -0.00480 -0.0660 -0.00403 ## 6 0.0483 -0.197 -0.119 -0.00203 -0.00290 0.0724 -0.0120 ## 7 -0.0848 -0.0931 0.225 -0.161 0.0170 -0.0675 -0.00867 ## 8 -0.106 0.267 0.0493 0.0368 0.0616 0.0215 -0.00420 ## 9 0.0781 0.316 0.0434 0.000517 -0.000561 0.0166 0.0507 ## 10 -0.108 0.140 0.249 -0.0110 0.0423 0.0459 -0.00547 ## # … with 704 more rows ``` ] --- class: middle ### Example: understanding survival on the Titanic Plotting functions to be replaced with [shapviz](https://CRAN.R-project.org/package=shapviz)!!
.scrollable.code70[ ```r p1 <- autoplot(ex.all) p2 <- autoplot(ex.all, type = "dependence", feature = "Age", X = X, color_by = "Sex", alpha = 0.5) + theme(legend.position = c(0.8, 0.8)) gridExtra::grid.arrange(p1, p2, nrow = 1) ``` <img src="slides_files/figure-html/titanic-shap-all-plots-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: middle ### Example: understanding survival on the Titanic Explaining an individual row (i.e., passenger); inspiration for this example taken from [here](https://modeloriented.github.io/iBreakDown/articles/vignette_iBreakDown_titanic.html). .pull-left[ Meet Jack: .scrollable.code70[ ```r # Explain an individual passenger jack.dawson <- data.frame( # Survived = factor(0, levels = 0:1), # in case you haven't seen the movie Pclass = 3, Sex = factor("male", levels = c("female", "male")), Age = 20, SibSp = 0, Parch = 0, Fare = 15, # lower end of third-class ticket prices Embarked = factor("S", levels = c("", "C", "Q", "S")) ) ``` ] ] .pull-right[ <img src="images/jack.jpg" width="100%" style="display: block; margin: auto;" /> ] --- class: middle ### Example: understanding survival on the Titanic .scrollable.code70[ ```r (pred.jack <- pfun(rfo, newdata = jack.dawson)) ``` ``` ## 1 ## 0.08406581 ``` ```r (baseline <- mean(pfun(rfo, newdata = X))) ``` ``` ## [1] 0.4068224 ``` ```r # Estimate feature contributions for Jack's predicted probability set.seed(754) # for reproducibility (ex.jack <- explain(rfo, X = X, newdata = jack.dawson, nsim = 1000, adjust = TRUE, pred_wrapper = pfun)) ``` ``` ## # A tibble: 1 × 7 ## Pclass Sex Age SibSp Parch Fare Embarked ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -0.0689 -0.143 -0.0340 0.00847 -0.0152 -0.0497 -0.0206 ``` ] --- class: middle ### Example: understanding survival on the Titanic <img src="slides_files/figure-html/titanic-jack-ex-plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- class: middle ### Example: understanding anomalous credit card 
transactions https://www.kaggle.com/mlg-ulb/creditcardfraud .scrollable.code70[ ```r library(fastshap) library(ggplot2) library(isotree) # Set ggplot2 theme theme_set(theme_bw()) # Read in credit card fraud data ccfraud <- data.table::fread("data/ccfraud.csv") # Randomize the data set.seed(2117) # for reproducibility ccfraud <- ccfraud[sample(nrow(ccfraud)), ] # Split data into train/test sets set.seed(2013) # for reproducibility trn.id <- sample(nrow(ccfraud), size = 10000, replace = FALSE) ccfraud.trn <- ccfraud[trn.id, ] ccfraud.tst <- ccfraud[-trn.id, ] ``` ] --- class: middle ### Example: understanding anomalous credit card transactions Anomaly detection via [isolation forest](https://en.wikipedia.org/wiki/Isolation_forest) .scrollable.code70[ ```r # Fit a default isolation forest (unsupervised) ifo <- isolation.forest(ccfraud.trn[, 1L:30L], seed = 2223, nthreads = 1) # Compute anomaly scores for the test observations head(scores <- predict(ifo, newdata = ccfraud.tst)) ``` ``` ## 1 2 3 4 5 6 ## 0.3202647 0.3402078 0.3208879 0.3231959 0.3413722 0.3254552 ``` ] --- class: middle ### Example: understanding anomalous credit card transactions .scrollable.code70[ ```r # Find the test observation corresponding to the maximum anomaly score max.id <- which.max(scores) # row ID for observation with max score max.x <- ccfraud.tst[max.id, ] max(scores) ``` ``` ## [1] 0.8470209 ``` ```r X <- ccfraud.trn[, 1L:30L] # feature columns only! max.x <- max.x[, 1L:30L] # feature columns only!
pfun <- function(object, newdata) { # prediction wrapper predict(object, newdata = newdata) } # Generate feature contributions set.seed(1351) # for reproducibility ex <- explain(ifo, X = X, newdata = max.x, pred_wrapper = pfun, adjust = TRUE, nsim = 1000) # Should sum to f(x) - baseline whenever `adjust = TRUE` sum(ex) ``` ``` ## [1] 0.5113865 ``` ] --- class: middle ### Example: understanding anomalous credit card transactions <img src="slides_files/figure-html/ccfraud-ifo-ex-plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- class: middle, center ## Thank you <img src="https://media.giphy.com/media/3orifiI9P8Uita7ySs/giphy.gif" style="width: 80%" />