For illustration, we’ll use a corrected version of the Boston housing data analyzed in Harrison and Rubinfeld (1978); the data are available in the pdp package (see
?pdp::boston for details). We begin by loading the data and fitting a random forest with default tuning parameters and 500 trees. The model fit is reasonable, with an out-of-bag (pseudo) \(R^2\) of 0.89. A variable importance plot is displayed in Figure 1 (using the permutation-based random forest definition of variable importance defined in Breiman (2001)). It appears that
lstat) is important in predicting
cmedv. Next, we’d like to visualize the functional relationship between these two variables using a partial dependence plot (PDP).
data(boston, package = "pdp") # load the (corrected) Boston housing data library(randomForest) # for fitting random forests #> randomForest 4.6-14 #> Type rfNews() to see new features/changes/bug fixes. library(vip) # for variable importance plots #> #> Attaching package: 'vip' #> The following object is masked from 'package:utils': #> #> vi set.seed(101) # for reproducibility boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE) vip(boston.rf, bar = FALSE, horizontal = FALSE, size = 1.5) # Figure 1
Constructing PDPs can be quite computationally expensive1 Several strategies are available to ease the computational burden in larger problems. For example, there is no need to compute partial dependence of
cmedv using each unique value of
rm in the training data (which would require \(k = 446\) passes over the data!). We could get very reasonable results using a reduced number of points. Current options are to use a grid of equally spaced values in the range of the variable of interest; the number of points can be controlled using the
grid.resolution option in the call to
partial. Alternatively, a user-specified grid of values (e.g., containing specific quantiles of interest) can be supplied through the
pred.grid argument. To demonstrate, the following snippet of code computes the partial dependence of
rm using each option;
grid.arrange() is used to display all three PDPs on the same graph, side by side. The results are displayed in Figure 2.
For convenience, the
partial() function includes the
quantiles argument for specifying whether or not to compute the PDP at specific quantiles (controlled by the
probs argument which defaults to
1:9/10—the deciles of the predictor’s distribution). Below, we plot centered individual conditional expectation (c-ICE) curves for
rm using a reduced set of grid points occurring at specific quantiles of
rm; the results are displayed in Figure 3. The heterogeneity in the curves indicates the potential presence of interaction effects between
rm and other features.
partial() function relies on the plyr package (Wickham 2016), rather than R’s built-in
for loops. This makes it easy to request progress bars (e.g.,
progress = "text") or run
partial() in parallel. In fact,
partial() can use any of the parallel backends supported by the
foreach package. To use this functionality, we must first load and register a supported parallel backend (e.g., doMC (Analytics and Weston 2017) or doParallel (Corporation and Weston 2017)).
To illustrate, we will use the Los Angeles ozone pollution data described in Breiman and Friedman (1985). The data contain daily measurements of ozone concentration (
ozone) along with eight meteorological quantities for 330 days in the Los Angeles basin in 1976.2 The following code chunk loads the data into R:
url <- "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/LAozone.data" ozone <- read.csv(url) head(ozone) # print first few observations #> ozone vh wind humidity temp ibh dpg ibt vis doy #> 1 3 5710 4 28 40 2693 -25 87 250 3 #> 2 5 5700 3 37 45 590 -24 128 100 4 #> 3 5 5760 3 51 54 1450 25 139 60 5 #> 4 6 5720 4 69 35 1568 15 121 60 6 #> 5 4 5790 6 19 45 2631 -33 123 100 7 #> 6 4 5790 3 25 55 554 -28 182 250 8
Next, we use the multivariate adaptive regression splines (MARS) algorithm introduced in Friedman (1991) to model ozone concentration as a nonlinear function of the eight meteorological variables plus day of the year; we allow for up to three-way interactions.
library(earth) # for earth() function (i.e., MARS algorithm) #> Loading required package: plotmo #> Loading required package: plotrix #> Loading required package: TeachingDemos ozone.mars <- earth(ozone ~ ., data = ozone, degree = 3) summary(ozone.mars) #> Call: earth(formula=ozone~., data=ozone, degree=3) #> #> coefficients #> (Intercept) 11.7516490 #> h(58-temp) -0.1510377 #> h(temp-58) 0.4883275 #> h(ibh-1105) -0.0009825 #> h(200-vis) 0.0176604 #> h(59-doy) -0.1102233 #> h(doy-59) -0.0161285 #> h(5730-vh) * h(temp-58) -0.0137530 #> h(vh-5730) * h(temp-58) 0.0014756 #> h(55-humidity) * h(temp-58) -0.0216973 #> h(temp-58) * h(dpg-52) -0.0174308 #> h(temp-58) * h(52-dpg) 0.0038824 #> h(temp-69) * h(doy-59) -0.0031742 #> h(1105-ibh) * h(21-dpg) -0.0000928 #> h(wind-6) * h(temp-58) * h(52-dpg) -0.0032786 #> #> Selected 15 of 21 terms, and 8 of 9 predictors #> Termination condition: Reached nk 21 #> Importance: temp, ibh, humidity, dpg, doy, vis, wind, vh, ibt-unused #> Number of terms at each degree of interaction: 1 6 7 1 #> GCV 13.62966 RSS 3569.979 GRSq 0.7882793 RSq 0.8309301
The MARS model produced a generalized \(R^2\) of , similar to what was reported in Breiman and Friedman (1985). A single three-way interaction was found involving the predictors *
wind: wind speed (mph) at Los Angeles International Airport (LAX) *
temp: temperature (\(^oF\)) at Sandburg Air Force Base *
dpg: the pressure gradient (mm Hg) from LAX to Dagget, CA To understand this interaction, we can use a PDP. However, since the partial dependence between three continuous variables can be computationally expensive, we will run
partial() in parallel.
Setting up a parallel backend is rather straightforward. To demonstrate, the following snippet of code sets up the
partial() function to run in parallel on both Windows and Unix-like systems using the doParallel package.
Now, to run
partial() in parallel, all we have to do is invoke the
parallel = TRUE and
paropts options and the rest is taken care of by the internal call to plyr and the parallel backend we loaded3. This is illustrated in the code chunk below which obtains the partial dependence of
dpg in parallel. The last three lines of code add a label to the colorkey. The result is displayed in Figure 4. Note: it is considered good practice to shut down the workers by calling
stopCluster() when finished. Notice how we also used the
chull option to restrict the PDP to the convex hull of
temp which helps in reducing the computation time in multivariate PDPs.
pd <- partial(ozone.mars, pred.var = c("wind", "temp", "dpg"), chull = TRUE, parallel = TRUE, paropts = list(.packages = "earth")) #> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)' #> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)' stopCluster(cl) # good practice
It is important to note that when using more than two predictor variables,
plotPartial() produces a trellis display. The first two variables given to
pred.var are used for the horizontal and vertical axes, and additional variables define the panels. If the panel variables are continuous, then shingles4 are produced first using the equal count algorithm (see, for example,
?lattice::equal.count). Hence, it will be more effective to use categorical variables to define the panels in higher dimensional displays when possible.
Analytics, Revolution, and Steve Weston. 2017. DoMC: Foreach Parallel Adaptor for ’Parallel’. https://CRAN.R-project.org/package=doMC.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, Leo, and Jerome H. Friedman. 1985. “Estimating Optimal Transformations for Multiple Regression and Correlation.” Journal of the American Statistical Association 80 (391): 580–98. https://doi.org/10.1080/01621459.1985.10478157.
Corporation, Microsoft, and Steve Weston. 2017. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.
Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.
Harrison, David, and Daniel L. Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102. https://doi.org/10.1016/0095-0696(78)90006-2.
Wickham, Hadley. 2016. Plyr: Tools for Splitting, Applying and Combining Data. https://CRAN.R-project.org/package=plyr.
The exception is regression trees based on single-variable splits which can make use of the efficient weighted tree traversal method described in
friedman-2001-greedy, however, only the
gbm package seems to make use of this approach; consequently,
pdp can also exploit this strategy when used with
gbm models (see
?partial for details).↩
The data are available from http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ozone.data. Details, including variable information, are available from http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ozone.info.↩
A shingle is a special Trellis data structure that consists of a numeric vector along with intervals that define the “levels” of the shingle. The intervals may be allowed to overlap.↩