Tree Ensembles

pyguide provides Random Forest ensemble models that leverage the unbiased splitting power of GUIDE trees. By combining multiple trees trained on bootstrap samples with random feature selection, these ensembles often achieve higher predictive accuracy than single trees while reducing overfitting.

Random Forest

The GuideRandomForestClassifier and GuideRandomForestRegressor classes are drop-in replacements for their scikit-learn counterparts, but use GuideTreeClassifier and GuideTreeRegressor as the base estimators.

Key Benefits

  1. Unbiased Base Learners: Unlike standard Random Forests which use CART trees (biased towards high-cardinality features), GuideRandomForest uses unbiased GUIDE trees. This reduces the bias of the ensemble as a whole.

  2. Interaction Detection: Each tree in the forest can still detect interactions if configured (though typically Random Forests disable interaction search for speed, you can enable it).

  3. Familiar API: Uses scikit-learn’s Bagging infrastructure, so all standard parameters (n_estimators, max_samples, n_jobs) work as expected.

Usage

from pyguide import GuideRandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a Random Forest with 100 trees
rf = GuideRandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    max_features="sqrt", # Standard RF heuristic
    n_jobs=-1,           # Use all cores
    random_state=42
)
rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test):.4f}")

Parameters

  • n_estimators: Number of trees (default=100).

  • max_features: Number of features to consider at each split. Defaults to “sqrt” for classifiers and 1.0 for regressors.

  • bootstrap: Whether to use bootstrap sampling (default=True).

  • significance_threshold: Passed to the underlying GUIDE trees.

  • interaction_depth: Passed to the underlying GUIDE trees (default=1).

Gradient Boosting

The GuideGradientBoostingClassifier and GuideGradientBoostingRegressor classes implement Gradient Boosting machines using GUIDE trees as the base weak learners.

Overview

Gradient Boosting builds an additive model in a forward stage-wise fashion. At each stage, a regression tree is fit on the negative gradient of the loss function (e.g., deviance for classification, least squares for regression).

Why use GUIDE for Boosting? Standard GBM implementations (like XGBoost, LightGBM, scikit-learn) use CART-like greedy splitting. pyguide’s implementation ensures that every split in every tree is unbiased. This is particularly valuable when your dataset contains a mix of categorical variables with varying cardinality or continuous variables, where standard boosting might over-select high-cardinality features.

Usage

from pyguide import GuideGradientBoostingRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a Gradient Boosting Regressor
gbm = GuideGradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,  # Stochastic Gradient Boosting
    random_state=42
)
gbm.fit(X_train, y_train)

print(f"R2 Score: {gbm.score(X_test, y_test):.4f}")

Parameters

  • n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.

  • learning_rate: learning rate shrinks the contribution of each tree by learning_rate.

  • max_depth: maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.

  • subsample: The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

  • significance_threshold, interaction_depth, max_interaction_candidates: Passed to the underlying GUIDE trees.

Current Limitations

  • Binary Classification Only: GuideGradientBoostingClassifier currently supports only binary classification (Log Loss).

  • Speed: While optimized with Rust kernels, the Python-driven boosting loop is slower than highly optimized C++ libraries like XGBoost or LightGBM. It is prioritized for statistical rigor and unbiased selection rather than raw speed.