{ "cells": [ { "cell_type": "markdown", "id": "eca7512a", "metadata": {}, "source": [ "\n", "# Binary Classification with GUIDE\n", "\n", "This notebook demonstrates **GuideGradientBoostingClassifier** on the **Breast Cancer** dataset.\n", "\n", "Highlights:\n", "- **Log Loss Optimization:** Probabilistic classification.\n", "- **Unbiased Selection:** GUIDE ensures features aren't selected just because they have many values.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "206cb04c", "metadata": {}, "outputs": [], "source": [ "\n", "import numpy as np\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.metrics import accuracy_score, log_loss, roc_auc_score\n", "from sklearn.model_selection import train_test_split\n", "\n", "from pyguide import GuideGradientBoostingClassifier\n", "\n", "# Load Data\n", "data = load_breast_cancer(as_frame=True)\n", "X, y = data.data, data.target\n", "feature_names = X.columns\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "print(f\"Training samples: {len(X_train)}\")\n", "print(f\"Class distribution: {np.bincount(y_train)}\")\n" ] }, { "cell_type": "markdown", "id": "42522be9", "metadata": {}, "source": [ "## Train Gradient Boosting Model" ] }, { "cell_type": "code", "execution_count": null, "id": "bab30470", "metadata": {}, "outputs": [], "source": [ "\n", "clf = GuideGradientBoostingClassifier(\n", " n_estimators=50,\n", " learning_rate=0.1,\n", " max_depth=2,\n", " subsample=0.8,\n", " random_state=42\n", ")\n", "clf.fit(X_train, y_train)\n" ] }, { "cell_type": "markdown", "id": "676d3ad0", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": null, "id": "cdaf387e", "metadata": {}, "outputs": [], "source": [ "\n", "# Predict probabilities\n", "y_prob = clf.predict_proba(X_test)\n", "y_pred = clf.predict(X_test)\n", "\n", "acc = accuracy_score(y_test, y_pred)\n", "auc = roc_auc_score(y_test, 
y_prob[:, 1])\n", "ll = log_loss(y_test, y_prob)\n", "\n", "print(f\"Accuracy: {acc:.4f}\")\n", "print(f\"ROC AUC: {auc:.4f}\")\n", "print(f\"Log Loss: {ll:.4f}\")\n" ] }, { "cell_type": "markdown", "id": "a60e25ca", "metadata": {}, "source": [ "\n", "## Feature Importance\n", "\n", "Standard GBMs typically report impurity-based feature importance. Here, we can instead look at the structure of the underlying GUIDE trees.\n", "`GuideGradientBoostingClassifier` doesn't yet aggregate importance scores automatically, so for now we inspect individual trees. Permutation importance is another option, if it is implemented for ensembles in the future.\n" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }