{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eca7512a",
   "metadata": {},
   "source": [
    "\n",
    "# Binary Classification with GUIDE\n",
    "\n",
    "This notebook demonstrates **GuideGradientBoostingClassifier** on the **Breast Cancer** dataset.\n",
    "\n",
    "Highlights:\n",
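    "- **Log loss optimization:** The ensemble is trained to minimize log loss, so it produces class probabilities rather than only hard labels.\n",
    "- **Unbiased selection:** GUIDE chooses the split variable separately from the split point, so features are not favored simply because they have many distinct values.\n"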
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "206cb04c",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.metrics import accuracy_score, log_loss, roc_auc_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from pyguide import GuideGradientBoostingClassifier\n",
    "\n",
    "# Load Data\n",
    "data = load_breast_cancer(as_frame=True)\n",
    "X, y = data.data, data.target\n",
    "feature_names = X.columns\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(f\"Training samples: {len(X_train)}\")\n",
    "print(f\"Class distribution: {np.bincount(y_train)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42522be9",
   "metadata": {},
   "source": [
    "## Train Gradient Boosting Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bab30470",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "clf = GuideGradientBoostingClassifier(\n",
    "    n_estimators=50,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=2,\n",
    "    subsample=0.8,\n",
    "    random_state=42\n",
    ")\n",
    "clf.fit(X_train, y_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "676d3ad0",
   "metadata": {},
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdaf387e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
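    "# Predict class probabilities (one column per class) and hard labels\n",
    "y_prob = clf.predict_proba(X_test)\n",
    "y_pred = clf.predict(X_test)\n",
    "\n",
    "acc = accuracy_score(y_test, y_pred)\n",
    "# ROC AUC needs the score of the positive class (label 1, benign in this dataset)\n",
    "auc = roc_auc_score(y_test, y_prob[:, 1])\n",
    "ll = log_loss(y_test, y_prob)\n",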
    "\n",
    "print(f\"Accuracy: {acc:.4f}\")\n",
    "print(f\"ROC AUC:  {auc:.4f}\")\n",
    "print(f\"Log Loss: {ll:.4f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a60e25ca",
   "metadata": {},
   "source": [
    "\n",
    "## Feature Importance\n",
    "\n",
    "Standard GBMs usually report impurity-based feature importance. With GUIDE, we can instead look at the structure of the underlying trees, for example which features each tree splits on.\n",
    "`GuideGradientBoostingClassifier` does not yet aggregate importance scores across the ensemble. For now, the options are to inspect individual trees or to use permutation importance, if it is implemented for ensembles in the future.\n"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}