Binary Classification with GUIDE

This notebook demonstrates GuideGradientBoostingClassifier on the Breast Cancer dataset.

Highlights:

  • Log Loss Optimization: the ensemble minimizes log loss, so it produces calibrated class probabilities rather than bare labels.

  • Unbiased Selection: GUIDE ensures features aren’t selected just because they have many values.
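As a quick refresher on the first point, log loss scores the predicted probabilities directly and penalizes confident mistakes heavily. A minimal sketch with scikit-learn's `log_loss` (toy numbers, not from this notebook's model):

```python
from sklearn.metrics import log_loss

y_true = [1, 0]
# Predicted probabilities for classes [0, 1]
y_prob = [
    [0.1, 0.9],  # confident and correct
    [0.8, 0.2],  # fairly confident and correct
]

# Mean of -log(probability assigned to the true class)
print(f"{log_loss(y_true, y_prob):.4f}")
```

Pushing either correct probability toward 1.0 drives the loss toward 0; flipping one toward the wrong class makes it blow up, which is exactly the gradient signal the booster optimizes.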

[1]:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

from pyguide import GuideGradientBoostingClassifier

# Load Data
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
feature_names = X.columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Class distribution: {np.bincount(y_train)}")
Training samples: 455
Class distribution: [169 286]

Train Gradient Boosting Model

[2]:

clf = GuideGradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=2,
    subsample=0.8,
    random_state=42,
)
clf.fit(X_train, y_train)
[2]:
GuideGradientBoostingClassifier(max_depth=2, n_estimators=50, random_state=42,
                                subsample=0.8)

Evaluation

[3]:

# Predict probabilities and class labels
y_prob = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob[:, 1])
ll = log_loss(y_test, y_prob)

print(f"Accuracy: {acc:.4f}")
print(f"ROC AUC: {auc:.4f}")
print(f"Log Loss: {ll:.4f}")
Accuracy: 0.9561
ROC AUC:  0.9938
Log Loss: 0.2647

Feature Importance

Standard GBMs report impurity-based importance; here we can instead examine the structure of the underlying GUIDE trees. GuideGradientBoostingClassifier doesn't yet aggregate importance scores automatically, but we can inspect individual trees, or apply model-agnostic permutation importance, which needs only predictions from a fitted estimator.
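Permutation importance measures how much the test score drops when one feature's column is shuffled. scikit-learn's `permutation_importance` works with any estimator exposing the usual `fit`/`score` interface; assuming GuideGradientBoostingClassifier follows that API, the fitted `clf` from above could be passed directly. The self-contained sketch below uses a stock GradientBoostingClassifier as a stand-in so it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Stand-in model; substitute the fitted GuideGradientBoostingClassifier here
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 5 times and record the mean drop in test accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)

# Top 5 features by mean importance
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:30s} {result.importances_mean[i]:.4f}")
```

Because the score drop is computed on held-out data, this avoids the high-cardinality bias of impurity importance, which is in the same spirit as GUIDE's unbiased split selection.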