Data sets used in this book

This page provides information on several of the data sets used throughout the book.

Swiss banknote data

The Swiss banknote data contain measurements from 200 Swiss 1000-franc banknotes: 100 genuine (y = 0) and 100 counterfeit (y = 1). For R users, the data are conveniently available as the banknote data frame in package treemisc, and can be loaded using

head(bn <- treemisc::banknote)
  length  left right bottom  top diagonal y
1  214.8 131.0 131.1    9.0  9.7    141.0 0
2  214.6 129.7 129.7    8.1  9.5    141.7 0
3  214.8 129.7 129.7    8.7  9.6    142.2 0
4  214.8 129.7 129.6    7.5 10.4    142.0 0
5  215.0 129.6 129.7   10.4  7.7    141.8 0
6  215.7 130.8 130.5    9.0 10.1    141.4 0
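
As a quick sanity check on the description above, the class balance can be tabulated directly. This is a minimal sketch that assumes package treemisc is installed; the guard around the call is my own addition so the snippet still runs (and simply prints a note) when the package is missing.

```r
# Assumes the treemisc package is available; skip gracefully otherwise.
if (requireNamespace("treemisc", quietly = TRUE)) {
  bn <- treemisc::banknote
  print(dim(bn))      # number of banknotes and number of columns
  print(table(bn$y))  # counts of genuine (0) and counterfeit (1) notes
} else {
  message("Package 'treemisc' is not installed; skipping the check.")
}
```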

Download: banknote.csv

References

Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. London: Chapman & Hall, Tables 1.1 and 1.2, pp. 5–8.

New York air quality measurements

The New York air quality data contain daily air quality measurements in New York from May through September of 1973 (153 days). The data are conveniently available in R’s built-in datasets package; see ?datasets::airquality for details and the original source. The main variables include:

  • Ozone: the mean ozone (in parts per billion) from 1300 to 1500 hours at Roosevelt Island;

  • Solar.R: the solar radiation (in Langleys) in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park;

  • Wind: the average wind speed (in miles per hour) at 0700 and 1000 hours at LaGuardia Airport;

  • Temp: the maximum daily temperature (in degrees Fahrenheit) at LaGuardia Airport.

The month (1–12) and day of the month (1–31) are also available in the columns Month and Day, respectively. In these data, Ozone is treated as the response variable.

head(aq <- airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
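
The first few rows already show missing values (NA) in Ozone and Solar.R. The sketch below counts the missing values per column and builds a complete-case subset. Dropping incomplete rows is only one possible way to handle missingness, shown here for illustration; the names na_counts and aq_complete are hypothetical.

```r
# airquality ships with R's built-in datasets package, so nothing to install.
aq <- datasets::airquality

# Number of missing values in each column.
na_counts <- colSums(is.na(aq))
print(na_counts)

# Keep only the rows with no missing value in any column.
aq_complete <- aq[stats::complete.cases(aq), ]
print(dim(aq_complete))
```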

Download: airquality.csv

The Friedman 1 benchmark data

The Friedman 1 benchmark problem uses simulated regression data with 10 input features according to:

\[ Y = 10 \sin\left(\pi X_1 X_2\right) + 20 \left(X_3 - 0.5\right) ^ 2 + 10 X_4 + 5 X_5 + \epsilon, \]

where \(\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)\) (that is, Gaussian noise with standard deviation \(\sigma\)) and the input features are all independent uniform random variables on the interval \(\left[0, 1\right]\): \(\left\{X_j\right\}_{j = 1}^{10} \stackrel{iid}{\sim} \mathcal{U}\left(0, 1\right)\). Notice how \(X_6\)–\(X_{10}\) are unrelated to the response \(Y\).
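
To make the formula concrete, here is a small sketch that evaluates the noise-free part of the response at a single point. The helper friedman1_mean() is hypothetical (it is not part of any package) and simply transcribes the equation above. Changing the last five features leaves the result untouched.

```r
# Noise-free part of the Friedman 1 response for one feature vector x.
# Hypothetical helper: only the first five entries of x are used.
friedman1_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 5)
  10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 + 10 * x[4] + 5 * x[5]
}

# Two points that differ only in the irrelevant features x6 through x10.
x_a <- c(0.5, 1, 0.5, 0.2, 0.4, rep(0.1, 5))
x_b <- c(0.5, 1, 0.5, 0.2, 0.4, rep(0.9, 5))

friedman1_mean(x_a)  # 10 * sin(pi / 2) + 20 * 0 + 10 * 0.2 + 5 * 0.4 = 14
friedman1_mean(x_b)  # same value, since x6 through x10 are ignored
```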

These data can be generated in R using the mlbench.friedman1() function from package mlbench. Here, I’ll use the gen_friedman1() function from package treemisc, which allows you to generate any number of features \(\ge 5\), similar to the make_friedman1() function in scikit-learn’s sklearn.datasets module for Python. See ?treemisc::gen_friedman1 for details.

set.seed(943)  # for reproducibility
treemisc::gen_friedman1(5, nx = 7, sigma = 0.1)
         y        x1        x2        x3        x4        x5         x6
1 18.47570 0.3459904 0.8530512 0.6551429 0.8385246 0.2930157 0.34084319
2 13.72402 0.4419899 0.6912439 0.2136197 0.1077115 0.5432351 0.16164747
3 10.84887 0.2225216 0.7893439 0.8068398 0.2515591 0.2566107 0.85946717
4 18.93996 0.8594261 0.5196663 0.8911978 0.1285491 0.9356575 0.03476787
5 14.46567 0.1808317 0.5901250 0.8934141 0.6110807 0.4151859 0.41041196
          x7
1 0.05733414
2 0.50547584
3 0.72484074
4 0.71050552
5 0.26356893

Source code:

treemisc::gen_friedman1
function(n = 100, nx = 10, sigma = 0.1) {
  if (nx < 5) {
    stop("`nsim` must be >= 5.", call. = FALSE)
  }
  x <- matrix(stats::runif(n * nx), ncol = nx)
  colnames(x) <- paste0("x", seq_len(nx))
  y = 10 * sin(pi * x[, 1L] * x[, 2L]) + 20 * (x[, 3L] - 0.5) ^ 2 +
    10 * x[, 4L] + 5 * x[, 5L] + stats::rnorm(n, sd = sigma)
  as.data.frame(cbind(y = y, x))
}
<bytecode: 0x7f9c52a284e0>
<environment: namespace:treemisc>