Binary Classification with GUIDE

This notebook demonstrates GuideGradientBoostingClassifier on the Breast Cancer dataset.

Highlights:

  • Log Loss Optimization: the ensemble minimizes log loss, so it produces calibrated class probabilities rather than bare labels.

  • Unbiased Selection: GUIDE ensures features aren’t selected just because they have many values.
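As a quick refresher on the first point, log loss scores the predicted probabilities directly and penalizes confident mistakes heavily. A minimal sketch with scikit-learn's `log_loss` (toy numbers, not from this notebook's model):

```python
from sklearn.metrics import log_loss

y_true = [1, 0]
# Predicted probabilities for classes [0, 1]
y_prob = [
    [0.1, 0.9],  # confident and correct
    [0.8, 0.2],  # fairly confident and correct
]

# Mean of -log(probability assigned to the true class)
print(f"{log_loss(y_true, y_prob):.4f}")
```

Pushing either correct probability toward 1.0 drives the loss toward 0; flipping one toward the wrong class makes it blow up, which is exactly the gradient signal the booster optimizes.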

[1]:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

from pyguide import GuideGradientBoostingClassifier

# Load Data
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
feature_names = X.columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Class distribution: {np.bincount(y_train)}")
Training samples: 455
Class distribution: [169 286]

Train Gradient Boosting Model

[2]:

clf = GuideGradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=2,
    subsample=0.8,
    random_state=42,
)
clf.fit(X_train, y_train)
[2]:
GuideGradientBoostingClassifier(max_depth=2, n_estimators=50, random_state=42,
                                subsample=0.8)

Evaluation

[3]:

# Predict probabilities and class labels
y_prob = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob[:, 1])
ll = log_loss(y_test, y_prob)

print(f"Accuracy: {acc:.4f}")
print(f"ROC AUC: {auc:.4f}")
print(f"Log Loss: {ll:.4f}")
Accuracy: 0.9561
ROC AUC:  0.9938
Log Loss: 0.2647

Feature Importance

Standard GBMs report impurity-based importance; here we can instead examine the structure of the underlying GUIDE trees. GuideGradientBoostingClassifier doesn't yet aggregate importance scores automatically, but we can inspect individual trees, or apply model-agnostic permutation importance, which needs only predictions from a fitted estimator.
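Permutation importance measures how much the test score drops when one feature's column is shuffled. scikit-learn's `permutation_importance` works with any estimator exposing the usual `fit`/`score` interface; assuming GuideGradientBoostingClassifier follows that API, the fitted `clf` from above could be passed directly. The self-contained sketch below uses a stock GradientBoostingClassifier as a stand-in so it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Stand-in model; substitute the fitted GuideGradientBoostingClassifier here
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 5 times and record the mean drop in test accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)

# Top 5 features by mean importance
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:30s} {result.importances_mean[i]:.4f}")
```

Because the score drop is computed on held-out data, this avoids the high-cardinality bias of impurity importance, which is in the same spirit as GUIDE's unbiased split selection.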