It is possible to use the vip package (Greenwell and Boehmke, n.d.) with any fitted model for which new predictions can be generated. This is possible via
method = "ice",
method = "pdp", and
method = "permute" since these methods construct variable importance (VI) scores based solely off of a model’s predictions—albeit, in different ways. In this vignette, we will demonstrate the construction of permutation-based VI scores (i.e.,
method = "permute") using a TensorFlow model trained to the Boston housing data with the keras package (Allaire and Chollet 2018). This particular example is adapted from Chollet and Allaire (2018). We’ll supplement the variable importance plot (VIP) with feature effect plots using the pdp package (Greenwell 2017)—a general R package for constructing partial dependence plots (PDPs) (Friedman 2001) and individual conditional expectation (ICE) curves (Goldstein et al. 2015).
To illustrate, we’ll fit a TensorFlow model to the Boston housing data (Harrison and Rubinfeld 1978). A corrected version of these data are available in the pdp package. In the code chunk below, we load a corrected version of the original Boston housing data (see
?pdp::boston for details) and separate the training features (
train_x) from the training response values (
# Loading (corrected) Boston housing data data(boston, package = "pdp") # Construct matrix of training data (features only) train_x <- boston %>% select(-cmedv) %>% # remove response mutate(chas = as.numeric(chas)) %>% # convert factor to numeric as.matrix() # convert to numeric matrix # Construct vector of training response values train_y <- boston$cmedv
Since the features are measured on very different scales (e.g., longitude and per capita crime rate by town), we center and scale the columns of
train_x using the
train_x <- scale(train_x, center = TRUE, scale = TRUE) # normalize data apply(train_x, MARGIN = 2, FUN = function(x) c(mean(x), sd(x))) # sanity check #> lon lat crim zn indus #> [1,] -5.363227e-14 -2.272681e-14 -7.202981e-18 2.282481e-17 1.595296e-17 #> [2,] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 #> chas nox rm age dis #> [1,] -1.586632e-16 -2.150022e-16 -1.056462e-16 -1.643357e-16 1.153079e-16 #> [2,] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 #> rad tax ptratio b lstat #> [1,] 4.799652e-17 2.024415e-17 -3.924246e-16 -1.151679e-16 -7.052778e-17 #> [2,] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
Next, we define a function for fitting a Keras model composed of a linear stack of layers. Since the Boston housing data is rather small (\(n =\) 506), we’ll use a very small network with only two hidden layers, each with 64 units. Building small networks like this can help mitigate overfitting to smaller data sets.
Since we don’t have a lot of observations, we used \(k\)-fold cross-validation (CV) (with \(k = 4\)) to evaluate the network and choose the optimal number of epochs1 (\(k\)-fold CV is illustrated in the Figure 1 below).
The performance of the network was evaluated using mean absolute error (MAE), which is the absolute value of the difference between the predicted and observed outcomes. In this example, the cross-validated MAE stopped improving after about 125 epochs. Using this result, we train a final network using 80 epochs2.
Here we’ll look at two methods for model interpretation: variable importance and individual conditional expectation (ICE) curves. The methods are available in the R packages vip and pdp, respectively.
While both packages support a wide range of models, it is rather straightforward to use them for any model for which new predictions can be obtained. To start, we’ll have to define a prediction function wrapper which requires two arguments:
object (the fitted model object) and
newdata. The function needs to return a vector of predictions (one for each observation).
A simple measure of variable importance can be obtained using the permutation approach described in Breiman (2001) for random forests. In essence, we randomly permute the values of each feature and record the drop in training performance. This can be accomplished using the
vip() function with
method = "permute". To use this method we need to supply the original training response values via the
obs argument and specify which performance metric we are interested in (in this case, we’ll use \(R^2\)). The results are, which are displayed in Figure 2, indicate that the average number of rooms per dwelling (
rm) and the percentage of lower status of the population (
lstat) are the most important features in predicting median home value.
set.seed(102) # for reproducibility p1 <- vip( object = model, # fitted model method = "permute", # permutation-based VI scores num_features = ncol(train_x), # default only plots top 10 features pred_fun = pred_wrapper, # user-defined prediction function train = as.data.frame(train_x) , # training data target = train_y, # response values used for training metric = "rsquared", # evaluation metric # progress = "text" # request a text-based progress bar ) #> Warning: Setting `method = "permute"` is experimental, use at your own #> risk! print(p1) # display plot
Next, we’ll construct ICE curves for the top two features:
lstat. To do this we use
partial() function. By default,
partial() constructs partial dependence plots (PDPs); The PDP for a feature of interest can be constructed by averaging together the ICE curves from each observation for that feature. To suppress this averaging and construct ICE curves, set
ice = TRUE in the call to
partial(). Since ICE curves require a prediction for each observations, we can use the same wrapper function we defined earlier. The ICE curves for both
lstat in Figure 3 display a bit of heterogeneity indicating the possible presence of interaction effects. The solid red curve in each plot represents the average of all of the ICE curves (i.e., the PDP for that feature).
p2 <- partial(model, pred.var = "rm", pred.fun = pred_wrapper, train = as.data.frame(train_x)) %>% autoplot(alpha = 0.1) p3 <- partial(model, pred.var = "lstat", pred.fun = pred_wrapper, train = as.data.frame(train_x)) %>% autoplot(alpha = 0.1) grid.arrange(p2, p3, ncol = 2) # display plots side by side
A couple of additional points are worth noting:
The default output from
partial() is a data frame. You can set
plot = TRUE to obtain a plot instead, but since these plots can be expensive to compute, it is better to store the results and plot them manually using, for example,
ggplot2-based plots) or
Before fitting the network we normalized the data by centering and scaling each feature. In order for these plots to be on the original scale, you would need to unscale the corresponding column(s) in the output by multiplying by the original sample standard deviation and adding back the sample mean of that feature.
ICE curves and PDPs can be computationally expensive. Some strategies are discussed in Greenwell (2017). The
partial() function has many useful options to help, for example,
?pdp::partial for details).
To obtain a PDP, we need to supply a prediction function that returns the average prediction across all observations. This can be easily accomplished by adding an extra line to the previously defined wrapper.
Next, we’ll construct the partial dependence of medium home value (
cmedv) on the average number of rooms per dwelling (
rm) and the percentage of lower status of the population (
lstat). To restrict the predictions to the region of joint values of
lstat observed in the training data (i.e., to avoid extrapolating) we set
chull = TRUE in the call to
partial(); this also helps speed up computation time by restricting the grid over which predictions are obtained. The resulting plot displayed in Figure 4 indicates what we would naturally expect: that census tracts with a higher average number of rooms per dwelling and a lower percentage of lower status of the population tend to have a higher median value.
Finally, we can display all the results in a single plot.
Allaire, JJ, and François Chollet. 2018. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Chollet, Francois, and J. J. Allaire. 2018. Deep Learning with R. 1st ed. Greenwich, CT, USA: Manning Publications Co.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29: 1189–1232. https://doi.org/10.1214/aos/1013203451.
Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.
Greenwell, Brandon, and Brad Boehmke. n.d. Vip: Variable Importance Plots. https://koalaverse.github.io/vip/index.html.
Greenwell, Brandon M. 2017. “Pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.
Harrison, David, and Daniel L. Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102. https://doi.org/10.1016/0095-0696(78)90006-2.