`vignettes/pdp-computation.Rmd`

For illustration, we’ll use a corrected version of the Boston housing data analyzed in Harrison and Rubinfeld (1978); the data are available in the **pdp** package (see `?pdp::boston` for details). We begin by loading the data and fitting a random forest with default tuning parameters and 500 trees. The model fit is reasonable, with an *out-of-bag* (pseudo) \(R^2\) of 0.89. A variable importance plot is displayed in **Figure 1** (using the permutation-based definition of variable importance from Breiman (2001)). It appears that `rm` (along with `lstat`) is important in predicting `cmedv`. Next, we’d like to visualize the functional relationship between these two variables using a partial dependence plot (PDP).

```
data(boston, package = "pdp") # load the (corrected) Boston housing data
library(randomForest) # for fitting random forests
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(vip) # for variable importance plots
#>
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#>
#> vi
set.seed(101) # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE)
vip(boston.rf, bar = FALSE, horizontal = FALSE, size = 1.5) # Figure 1
```

Constructing PDPs can be quite computationally expensive.^{1} Several strategies are available to ease the computational burden in larger problems. For example, there is no need to compute the partial dependence of `cmedv` on each unique value of `rm` in the training data (which would require \(k = 446\) passes over the data!). We can get very reasonable results using a reduced number of points. One option is to use a grid of equally spaced values spanning the range of the variable of interest; the number of points can be controlled using the `grid.resolution` option in the call to `partial()`. Alternatively, a user-specified grid of values (e.g., containing specific quantiles of interest) can be supplied through the `pred.grid` argument. To demonstrate, the following snippet of code computes the partial dependence of `cmedv` on `rm` using each option; `grid.arrange()` is used to display all three PDPs side by side. The results are displayed in **Figure 2**.

```
library(pdp) # for partial dependence plots
grid.arrange( # Figure 2
  partial(boston.rf, "rm", plot = TRUE),
  partial(boston.rf, "rm", grid.resolution = 30, plot = TRUE),
  partial(boston.rf, "rm", pred.grid = data.frame(rm = 3:9), plot = TRUE),
  ncol = 3
)
```
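A user-specified quantile grid need not be typed by hand; it can be built with `quantile()` and passed via `pred.grid` (a sketch, assuming the `boston` data and `boston.rf` fit from above):

```r
# Build a grid at the deciles of rm and supply it via pred.grid
# (sketch; assumes the boston data and boston.rf fit from earlier)
rm.grid <- data.frame(rm = quantile(boston$rm, probs = 1:9/10))
partial(boston.rf, pred.var = "rm", pred.grid = rm.grid, plot = TRUE)
```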

For convenience, the `partial()` function includes a `quantiles` argument for specifying whether or not to compute the PDP at specific quantiles of the predictor’s distribution (controlled by the `probs` argument, which defaults to `1:9/10`, the deciles). Below, we plot centered individual conditional expectation (c-ICE) curves for `rm` using a reduced set of grid points occurring at specific quantiles of `rm`; the results are displayed in **Figure 3**. The heterogeneity in the curves indicates the potential presence of interaction effects between `rm` and other features.

```
partial(boston.rf, pred.var = "rm", quantiles = TRUE, probs = 0:20/20,
        ice = TRUE, center = TRUE, plot = TRUE, plot.engine = "ggplot2",
        alpha = 0.1) # Figure 3
#> Warning: Ignoring unknown parameters: smooth, contour, contour.colour,
#> palette
```

The `partial()` function relies on the **plyr** package (Wickham 2016), rather than R’s built-in `for` loops. This makes it easy to request progress bars (e.g., `progress = "text"`) or to run `partial()` in parallel. In fact, `partial()` can use any of the parallel backends supported by the **foreach** package. To use this functionality, we must first load and register a supported parallel backend (e.g., **doMC** (Revolution Analytics and Weston 2017) or **doParallel** (Microsoft Corporation and Weston 2017)).
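As a quick sketch (assuming the `boston.rf` fit from earlier), a text progress bar can be requested directly in the call; the bar prints to the console while the grid points are processed:

```r
# Request a plyr-style text progress bar while computing partial dependence
# (sketch; assumes the boston.rf fit from earlier)
partial(boston.rf, pred.var = "rm", progress = "text", plot = TRUE)
```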

To illustrate, we will use the Los Angeles ozone pollution data described in Breiman and Friedman (1985). The data contain daily measurements of ozone concentration (`ozone`) along with eight meteorological quantities for 330 days in the Los Angeles basin in 1976.^{2} The following code chunk loads the data into R:

```
url <- "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/LAozone.data"
ozone <- read.csv(url)
head(ozone) # print first few observations
#> ozone vh wind humidity temp ibh dpg ibt vis doy
#> 1 3 5710 4 28 40 2693 -25 87 250 3
#> 2 5 5700 3 37 45 590 -24 128 100 4
#> 3 5 5760 3 51 54 1450 25 139 60 5
#> 4 6 5720 4 69 35 1568 15 121 60 6
#> 5 4 5790 6 19 45 2631 -33 123 100 7
#> 6 4 5790 3 25 55 554 -28 182 250 8
```

Next, we use the multivariate adaptive regression splines (MARS) algorithm introduced in Friedman (1991) to model ozone concentration as a nonlinear function of the eight meteorological variables plus day of the year; we allow for up to three-way interactions.

```
library(earth) # for earth() function (i.e., MARS algorithm)
#> Loading required package: plotmo
#> Loading required package: plotrix
#> Loading required package: TeachingDemos
ozone.mars <- earth(ozone ~ ., data = ozone, degree = 3)
summary(ozone.mars)
#> Call: earth(formula=ozone~., data=ozone, degree=3)
#>
#> coefficients
#> (Intercept) 11.7516490
#> h(58-temp) -0.1510377
#> h(temp-58) 0.4883275
#> h(ibh-1105) -0.0009825
#> h(200-vis) 0.0176604
#> h(59-doy) -0.1102233
#> h(doy-59) -0.0161285
#> h(5730-vh) * h(temp-58) -0.0137530
#> h(vh-5730) * h(temp-58) 0.0014756
#> h(55-humidity) * h(temp-58) -0.0216973
#> h(temp-58) * h(dpg-52) -0.0174308
#> h(temp-58) * h(52-dpg) 0.0038824
#> h(temp-69) * h(doy-59) -0.0031742
#> h(1105-ibh) * h(21-dpg) -0.0000928
#> h(wind-6) * h(temp-58) * h(52-dpg) -0.0032786
#>
#> Selected 15 of 21 terms, and 8 of 9 predictors
#> Termination condition: Reached nk 21
#> Importance: temp, ibh, humidity, dpg, doy, vis, wind, vh, ibt-unused
#> Number of terms at each degree of interaction: 1 6 7 1
#> GCV 13.62966 RSS 3569.979 GRSq 0.7882793 RSq 0.8309301
```

The MARS model produced a generalized \(R^2\) of approximately 0.79 (`GRSq` in the output above), similar to what was reported in Breiman and Friedman (1985). A single three-way interaction was found involving the predictors:

* `wind`: wind speed (mph) at Los Angeles International Airport (LAX)
* `temp`: temperature (\(^{\circ}F\)) at Sandburg Air Force Base
* `dpg`: the pressure gradient (mm Hg) from LAX to Dagget, CA

To understand this interaction, we can use a PDP. However, since computing the partial dependence of three continuous variables can be computationally expensive, we will run `partial()` in parallel.

Setting up a parallel backend is rather straightforward. To demonstrate, the following snippet of code sets up `partial()` to run in parallel on both Windows and Unix-like systems using the **doParallel** package.

```
library(doParallel) # load the parallel backend
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
cl <- makeCluster(2) # use 2 workers
registerDoParallel(cl) # register the parallel backend
```

Now, to run `partial()` in parallel, all we have to do is set the `parallel = TRUE` and (if necessary) `paropts` options; the rest is taken care of by the internal call to **plyr** and the parallel backend we registered.^{3} This is illustrated in the code chunk below, which obtains the partial dependence of `ozone` on `wind`, `temp`, and `dpg` in parallel. Notice how we also use the `chull` option to restrict the PDP to the convex hull of `wind` and `temp`, which helps reduce the computation time in multivariate PDPs. The last three lines of the plotting code add a label to the colorkey; the result is displayed in **Figure 4**. **Note:** it is considered good practice to shut down the workers by calling `stopCluster()` when finished.

```
pd <- partial(ozone.mars, pred.var = c("wind", "temp", "dpg"), chull = TRUE,
              parallel = TRUE, paropts = list(.packages = "earth"))
#> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
#> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
stopCluster(cl) # good practice
```

```
plotPartial(pd, palette = "magma") # Figure 4
lattice::trellis.focus( # add a label to the colorkey
  "legend", side = "right", clipp.off = TRUE, highlight = FALSE
)
grid::grid.text("ozone", x = 0.2, y = 1.05, hjust = 0.5, vjust = 1)
lattice::trellis.unfocus()
```

It is important to note that when using more than two predictor variables, `plotPartial()` produces a trellis display. The first two variables given to `pred.var` are used for the horizontal and vertical axes, and any additional variables define the panels. If the panel variables are continuous, then shingles^{4} are produced first using the equal count algorithm (see, for example, `?lattice::equal.count`). Hence, it will be more effective to use categorical variables to define the panels in higher-dimensional displays when possible.
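For instance, the binary factor `chas` from the Boston data could define the panels (a sketch, assuming the `boston` data and `boston.rf` fit from earlier, and that `chas` is stored as a factor in the corrected data):

```r
# Let the categorical variable chas define the trellis panels
# (sketch; assumes the boston data and boston.rf fit from earlier)
pd3 <- partial(boston.rf, pred.var = c("rm", "lstat", "chas"), chull = TRUE)
plotPartial(pd3) # one panel per level of chas
```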

Revolution Analytics, and Steve Weston. 2017. *doMC: Foreach Parallel Adaptor for ’parallel’*. https://CRAN.R-project.org/package=doMC.

Breiman, Leo. 2001. “Random Forests.” *Machine Learning* 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Breiman, Leo, and Jerome H. Friedman. 1985. “Estimating Optimal Transformations for Multiple Regression and Correlation.” *Journal of the American Statistical Association* 80 (391): 580–98. https://doi.org/10.1080/01621459.1985.10478157.

Microsoft Corporation, and Steve Weston. 2017. *doParallel: Foreach Parallel Adaptor for the ’parallel’ Package*. https://CRAN.R-project.org/package=doParallel.

Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” *The Annals of Statistics* 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” *The Annals of Statistics* 29 (5): 1189–1232. https://doi.org/10.1214/aos/1013203451.

Harrison, David, and Daniel L. Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” *Journal of Environmental Economics and Management* 5 (1): 81–102. https://doi.org/10.1016/0095-0696(78)90006-2.

Wickham, Hadley. 2016. *Plyr: Tools for Splitting, Applying and Combining Data*. https://CRAN.R-project.org/package=plyr.

1. The exception is regression trees based on single-variable splits, which can make use of the efficient weighted tree traversal method described in Friedman (2001); however, only the **gbm** package seems to make use of this approach. Consequently, **pdp** can also exploit this strategy when used with `gbm` models (see `?partial` for details).

2. The data are available from http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ozone.data. Details, including variable information, are available from http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ozone.info.

3. Notice we have to pass the names of external packages that the tasks depend on via the `paropts` argument; in this case, `"earth"`. See `?plyr::adply` for details.

4. A shingle is a special Trellis data structure that consists of a numeric vector along with intervals that define the “levels” of the shingle. The intervals may be allowed to overlap.