It is possible to use the vip package (Greenwell and Boehmke, n.d.) with any fitted model for which new predictions can be generated. This is possible via method = "ice", method = "pdp", and method = "permute" since these methods construct variable importance (VI) scores based solely off of a model’s predictions—albeit, in different ways. In this vignette, we will demonstrate the construction of permutation-based VI scores (i.e., method = "permute") using a TensorFlow model trained to the Boston housing data with the keras package (Allaire and Chollet 2018). This particular example is adapted from Chollet and Allaire (2018). We’ll supplement the variable importance plot (VIP) with feature effect plots using the pdp package (Greenwell 2017)—a general R package for constructing partial dependence plots (PDPs) (Friedman 2001) and individual conditional expectation (ICE) curves (Goldstein et al. 2015).

Prerequisites

# Load required packages
library(dplyr)    # for data wrangling
library(ggplot2)  # for general visualization
library(keras)    # for fitting DNNs
library(pdp)      # for partial depe
library(vip)      # for visualizing feature importance

# For reproducibility
use_session_with_seed(101)

Predicting median home value using TensorFlow

To illustrate, we’ll fit a TensorFlow model to the Boston housing data (Harrison and Rubinfeld 1978). A corrected version of these data are available in the pdp package. In the code chunk below, we load a corrected version of the original Boston housing data (see ?pdp::boston for details) and separate the training features (train_x) from the training response values (train_y).

# Loading (corrected) Boston housing data
data(boston, package = "pdp")

# Construct matrix of training data (features only)
train_x <- boston %>%
select(-cmedv) %>%                   # remove response
mutate(chas = as.numeric(chas)) %>%  # convert factor to numeric
as.matrix()                          # convert to numeric matrix

# Construct vector of training response values
train_y <- boston\$cmedv

Since the features are measured on very different scales (e.g., longitude and per capita crime rate by town), we center and scale the columns of train_x using the scale() function.

train_x <- scale(train_x, center = TRUE, scale = TRUE)  # normalize data
apply(train_x, MARGIN = 2, FUN = function(x) c(mean(x), sd(x)))  # sanity check
#>                lon           lat          crim           zn        indus
#> [1,] -5.363227e-14 -2.272681e-14 -7.202981e-18 2.282481e-17 1.595296e-17
#> [2,]  1.000000e+00  1.000000e+00  1.000000e+00 1.000000e+00 1.000000e+00
#>               chas           nox            rm           age          dis
#> [1,] -1.586632e-16 -2.150022e-16 -1.056462e-16 -1.643357e-16 1.153079e-16
#> [2,]  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00 1.000000e+00
#>               rad          tax       ptratio             b         lstat
#> [1,] 4.799652e-17 2.024415e-17 -3.924246e-16 -1.151679e-16 -7.052778e-17
#> [2,] 1.000000e+00 1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00

Next, we define a function for fitting a Keras model composed of a linear stack of layers. Since the Boston housing data is rather small ($$n =$$ 506), we’ll use a very small network with only two hidden layers, each with 64 units. Building small networks like this can help mitigate overfitting to smaller data sets.

build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = dim(train_x)[[2]]) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
optimizer = "rmsprop",
loss = "mse",
metrics = c("mae")
)
}

Since we don’t have a lot of observations, we used $$k$$-fold cross-validation (CV) (with $$k = 4$$) to evaluate the network and choose the optimal number of epochs1 ($$k$$-fold CV is illustrated in the Figure 1 below).

The performance of the network was evaluated using mean absolute error (MAE), which is the absolute value of the difference between the predicted and observed outcomes. In this example, the cross-validated MAE stopped improving after about 125 epochs. Using this result, we train a final network using 80 epochs2.

model <- build_model()
model %>% fit(train_x, train_y, epochs = 80, batch_size = 16, verbose = 0)

Model interpreation

Here we’ll look at two methods for model interpretation: variable importance and individual conditional expectation (ICE) curves. The methods are available in the R packages vip and pdp, respectively.

While both packages support a wide range of models, it is rather straightforward to use them for any model for which new predictions can be obtained. To start, we’ll have to define a prediction function wrapper which requires two arguments: object (the fitted model object) and newdata. The function needs to return a vector of predictions (one for each observation).

pred_wrapper <- function(object, newdata) {
predict(object, x = as.matrix(newdata)) %>%
as.vector()
}

Variable importance plot

A simple measure of variable importance can be obtained using the permutation approach described in Breiman (2001) for random forests. In essence, we randomly permute the values of each feature and record the drop in training performance. This can be accomplished using the vip() function with method = "permute". To use this method we need to supply the original training response values via the obs argument and specify which performance metric we are interested in (in this case, we’ll use $$R^2$$). The results are, which are displayed in Figure 2, indicate that the average number of rooms per dwelling (rm) and the percentage of lower status of the population (lstat) are the most important features in predicting median home value.

set.seed(102)  # for reproducibility
p1 <- vip(
object = model,                     # fitted model
method = "permute",                 # permutation-based VI scores
num_features = ncol(train_x),       # default only plots top 10 features
pred_fun = pred_wrapper,            # user-defined prediction function
train = as.data.frame(train_x) ,    # training data
target = train_y,                   # response values used for training
metric = "rsquared",                # evaluation metric
# progress = "text"                 # request a text-based progress bar
)
#> Warning: Setting method = "permute" is experimental, use at your own
#> risk!
print(p1)  # display plot

ICE curves

Next, we’ll construct ICE curves for the top two features: rm and lstat. To do this we use pdp’s partial() function. By default, partial() constructs partial dependence plots (PDPs); The PDP for a feature of interest can be constructed by averaging together the ICE curves from each observation for that feature. To suppress this averaging and construct ICE curves, set ice = TRUE in the call to partial(). Since ICE curves require a prediction for each observations, we can use the same wrapper function we defined earlier. The ICE curves for both rm and lstat in Figure 3 display a bit of heterogeneity indicating the possible presence of interaction effects. The solid red curve in each plot represents the average of all of the ICE curves (i.e., the PDP for that feature).

p2 <- partial(model, pred.var = "rm", pred.fun = pred_wrapper,
train = as.data.frame(train_x)) %>%
autoplot(alpha = 0.1)
p3 <- partial(model, pred.var = "lstat", pred.fun = pred_wrapper,
train = as.data.frame(train_x)) %>%
autoplot(alpha = 0.1)
grid.arrange(p2, p3, ncol = 2)  # display plots side by side

A couple of additional points are worth noting:

1. The default output from partial() is a data frame. You can set plot = TRUE to obtain a plot instead, but since these plots can be expensive to compute, it is better to store the results and plot them manually using, for example, autoplot() (for ggplot2-based plots) or plotPartial() (for lattice-based plots).

2. Before fitting the network we normalized the data by centering and scaling each feature. In order for these plots to be on the original scale, you would need to unscale the corresponding column(s) in the output by multiplying by the original sample standard deviation and adding back the sample mean of that feature.

3. ICE curves and PDPs can be computationally expensive. Some strategies are discussed in Greenwell (2017). The partial() function has many useful options to help, for example, progress and parallel (see ?pdp::partial for details).

PDP

To obtain a PDP, we need to supply a prediction function that returns the average prediction across all observations. This can be easily accomplished by adding an extra line to the previously defined wrapper.

pdp_wrapper <- function(object, newdata) {
predict(object, x = as.matrix(newdata)) %>%
as.vector() %>%
mean()  # aggregate ICE curves
}

Next, we’ll construct the partial dependence of medium home value (cmedv) on the average number of rooms per dwelling (rm) and the percentage of lower status of the population (lstat). To restrict the predictions to the region of joint values of rm and lstat observed in the training data (i.e., to avoid extrapolating) we set chull = TRUE in the call to partial(); this also helps speed up computation time by restricting the grid over which predictions are obtained. The resulting plot displayed in Figure 4 indicates what we would naturally expect: that census tracts with a higher average number of rooms per dwelling and a lower percentage of lower status of the population tend to have a higher median value.

p4 <- partial(model, pred.var = c("rm", "lstat"), chull = TRUE,
pred.fun = pdp_wrapper, train = as.data.frame(train_x)) %>%
autoplot()
print(p4)  # display plot

Finally, we can display all the results in a single plot.

grid.arrange(p1, p2, p3, p4, ncol = 2)  # display plots in a grid

References

Allaire, JJ, and François Chollet. 2018. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Chollet, Francois, and J. J. Allaire. 2018. Deep Learning with R. 1st ed. Greenwich, CT, USA: Manning Publications Co.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29: 1189–1232. https://doi.org/10.1214/aos/1013203451.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.

Greenwell, Brandon, and Brad Boehmke. n.d. Vip: Variable Importance Plots. https://koalaverse.github.io/vip/index.html.

Greenwell, Brandon M. 2017. “Pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.

Harrison, David, and Daniel L. Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102. https://doi.org/10.1016/0095-0696(78)90006-2.

1. An epoch refers to a single iteration (both forward and backward) over the entire training data set.

2. In practice, you’ll want to tune various other hyperparameters of the network as well, like the size of the hidden layers.