pyguide package¶
Submodules¶
pyguide.classifier module¶
- class pyguide.classifier.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. Even with max_features=n_features, the best found split may vary across runs: if several splits yield an identical criterion improvement, one must be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
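The max_features rules above can be sketched as a small resolver. This is a hypothetical helper mirroring the documented behaviour, not pyguide's internal code:

```python
import math

def resolve_max_features(max_features, n_features_in_):
    # Resolve max_features to a concrete feature count, following the
    # documented rules: None, int, float fraction, "sqrt", or "log2".
    if max_features is None:
        return n_features_in_
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features_in_)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features_in_)))
        raise ValueError(f"Unknown max_features string: {max_features!r}")
    if isinstance(max_features, float):
        # Fraction of features, but never fewer than one.
        return max(1, int(max_features * n_features_in_))
    return int(max_features)
```

For example, `resolve_max_features(0.5, 10)` gives 5 and `resolve_max_features("sqrt", 16)` gives 4.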
Attributes¶
- classes_ndarray of shape (n_classes,)
The class labels.
- n_classes_int
The number of classes.
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction.
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.
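The split-point optimization step described in the Notes can be sketched as an exhaustive threshold scan over midpoints. This is illustrative only; pyguide's actual implementation may differ:

```python
import numpy as np

def best_gini_threshold(x, y):
    """Given the selected variable, scan midpoints between sorted values
    and pick the threshold minimizing weighted Gini impurity (sketch)."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_imp = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2.0:        # candidate midpoints
        left, right = ys[xs <= t], ys[xs > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, imp = best_gini_threshold(x, y)   # clean separation at the gap midpoint
```

Because the labels separate perfectly at the gap between 3.0 and 10.0, the scan returns the midpoint 6.5 with zero impurity.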
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
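The aggregation in the Notes can be sketched directly from per-node statistics. This is an illustrative computation of the formula, not pyguide's internal code:

```python
import numpy as np

def guide_importance(node_sizes, chi2_stats):
    """Aggregate per-node 1-df chi-square statistics into a GUIDE
    importance score: v(X_k) = sum_t sqrt(n_t) * chi2_1(k, t)."""
    node_sizes = np.asarray(node_sizes, dtype=float)
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    return float(np.sum(np.sqrt(node_sizes) * chi2_stats))

# Two intermediate nodes with 100 and 25 samples and chi-square
# statistics 8.0 and 2.0 give sqrt(100)*8 + sqrt(25)*2 = 90.0:
score = guide_importance([100, 25], [8.0, 2.0])
```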
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
pyguide.interactions module¶
- pyguide.interactions.calc_interaction_p_value(X_subset, z, categorical_mask=None)[source]¶
Calculate interaction p-value between features in X_subset on target z.
Parameters¶
- X_subsetarray-like of shape (n_samples, n_vars)
The features to test for interaction.
- zarray-like of shape (n_samples,)
The target values (class labels or residual signs).
- categorical_maskarray-like of shape (n_vars,), optional
Boolean mask indicating which features in X_subset are categorical.
GUIDE Strategy:
- Bin numerical variables into 2 groups (median split).
- Combine all binned/categorical variables into unique groups.
- Perform a Chi-square test of groups vs z.
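A minimal sketch of this strategy for two numeric variables, using scipy's chi-square test (illustrative; not pyguide's internal code):

```python
import numpy as np
from scipy.stats import chi2_contingency

def interaction_p_value_sketch(x1, x2, z):
    """Median-split each numeric variable into two bins, cross the bins
    into up to four combined groups, and chi-square test groups vs z."""
    b1 = (x1 > np.median(x1)).astype(int)
    b2 = (x2 > np.median(x2)).astype(int)
    group = 2 * b1 + b2                      # combined groups 0..3
    classes = np.unique(z)
    table = np.array([[np.sum((group == g) & (z == c)) for c in classes]
                      for g in np.unique(group)])
    _, p, _, _ = chi2_contingency(table)
    return p

# Pure XOR target: neither variable is individually predictive, but the
# combined groups separate the classes perfectly, so p is tiny.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
z = ((x1 > np.median(x1)) ^ (x2 > np.median(x2))).astype(int)
p_int = interaction_p_value_sketch(x1, x2, z)
```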
pyguide.node module¶
- class pyguide.node.GuideNode(depth, is_leaf=False, prediction=None, probabilities=None, split_feature=None, split_threshold=None, missing_go_left=True, left=None, right=None, n_samples=0, impurity=0.0, value_distribution=None, node_id=None, split_type=None, interaction_group=None, curvature_stats=None)[source]¶
Bases: object
A node in the GUIDE tree.
pyguide.regressor module¶
- class pyguide.regressor.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: RegressorMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction (SSE).
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.
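The residual-sign curvature test from the Notes can be sketched as follows (illustrative, not pyguide's implementation): residuals against the node mean are reduced to signs, and the signs are tested against a quartile-binned feature.

```python
import numpy as np
from scipy.stats import chi2_contingency

def residual_sign_p_value(x, y):
    """Chi-square test of residual signs (vs the node mean) against the
    quartile-binned feature x (sketch of the regression curvature test)."""
    z = (y - y.mean() > 0).astype(int)                 # residual signs
    edges = np.unique(np.quantile(x, [0.25, 0.5, 0.75]))
    binned = np.digitize(x, edges)
    table = np.array([[np.sum((binned == b) & (z == s)) for s in (0, 1)]
                      for b in np.unique(binned)])
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = 2.0 * x + rng.normal(scale=0.1, size=400)  # y depends strongly on x
p_signal = residual_sign_p_value(x, y)
p_noise = residual_sign_p_value(rng.normal(size=400), y)
# p_signal is essentially zero; the noise feature scores far worse.
```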
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
pyguide.selection module¶
- pyguide.selection.select_split_variable(X, y, categorical_features=None, feature_indices=None)[source]¶
Select the best variable to split on based on curvature tests (Chi-square).
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values (class labels).
- categorical_featuresarray-like of shape (n_features,), optional
Boolean mask indicating which features are categorical.
- feature_indicesarray-like, optional
Indices of features to consider. If None, consider all features.
Returns¶
- best_feature_idxint
The index of the selected feature.
- best_pfloat
The p-value of the selected feature.
- chi2_statsndarray
The Chi-square statistics for all features.
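A simplified version of this selection, using quartile binning and scipy's chi-square test, can be sketched as follows (a sketch of the documented behaviour, not the actual implementation):

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable_sketch(X, y):
    """Curvature test per feature: quartile-bin each column, chi-square
    test it against y, and return the feature with the smallest p-value."""
    p_values = []
    for j in range(X.shape[1]):
        edges = np.unique(np.quantile(X[:, j], [0.25, 0.5, 0.75]))
        binned = np.digitize(X[:, j], edges)
        classes = np.unique(y)
        table = np.array([[np.sum((binned == b) & (y == c)) for c in classes]
                          for b in np.unique(binned)])
        p_values.append(chi2_contingency(table)[1])
    best = int(np.argmin(p_values))
    return best, p_values[best]

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = (X[:, 2] > 0).astype(int)       # only feature 2 is informative
best_idx, best_p = select_split_variable_sketch(X, y)
```

Note the selection depends only on the per-feature association tests, never on candidate split points, which is what removes the bias toward high-cardinality features.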
pyguide.splitting module¶
pyguide.stats module¶
pyguide.visualization module¶
- class pyguide.visualization.MockTree(children_left, children_right, feature, threshold, value, impurity, n_node_samples)[source]¶
Bases: object
A class that mocks the interface of sklearn.tree._tree.Tree. Used for compatibility with sklearn's visualization tools.
Module contents¶
- class pyguide.GuideGradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
Gradient Boosting for binary classification using GUIDE trees.
This implementation uses the GUIDE algorithm for unbiased variable selection at each boosting stage. Currently supports binary classification (Log Loss).
Parameters¶
- n_estimatorsint, default=100
The number of boosting stages to perform.
- learning_ratefloat, default=0.1
Learning rate shrinks the contribution of each tree by learning_rate.
- max_depthint, default=3
Maximum depth of the individual regression estimators.
- subsamplefloat, default=1.0
The fraction of samples to be used for fitting the individual base learners.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
GUIDE-specific parameters:
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- max_interaction_candidatesint, default=None
Limit on candidates for interaction search.
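One boosting stage for binary log loss can be sketched as follows. Here `fit_regressor` is a hypothetical stand-in for fitting a GUIDE regression tree to the pseudo-residuals; the update rule itself is standard gradient boosting:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def boost_step(F, y, fit_regressor, learning_rate=0.1):
    """One log-loss boosting stage (sketch): fit the base learner to the
    pseudo-residuals y - sigmoid(F), i.e. the negative gradient of the
    log loss, and add its shrunken predictions to the raw scores F."""
    residuals = y - sigmoid(F)
    predict = fit_regressor(residuals)   # per-sample predictions
    return F + learning_rate * predict

# Toy check with an idealized base learner that reproduces the residuals
# exactly: the predicted probabilities converge toward the labels.
y = np.array([0.0, 1.0, 1.0, 0.0])
F = np.zeros(4)
for _ in range(200):
    F = boost_step(F, y, fit_regressor=lambda r: r, learning_rate=0.5)
proba = sigmoid(F)
```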
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideGradientBoostingClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideGradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]¶
Bases: RegressorMixin, BaseEstimator
Gradient Boosting for regression using GUIDE trees as base learners.
This implementation uses the GUIDE algorithm for unbiased variable selection and interaction detection at each boosting stage.
Parameters¶
- n_estimatorsint, default=100
The number of boosting stages to perform.
- learning_ratefloat, default=0.1
Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
- max_depthint, default=3
Maximum depth of the individual regression estimators.
- subsamplefloat, default=1.0
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting.
- random_stateint, RandomState instance or None, default=None
Controls the random seed given to each Tree estimator at each boosting iteration.
GUIDE-specific parameters:
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection in GUIDE trees.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- max_interaction_candidatesint, default=None
Limit on candidates for interaction search.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideGradientBoostingRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideRandomForestClassifier(n_estimators=100, max_depth=None, max_features='sqrt', bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]¶
Bases: ClassifierMixin, BaseEstimator
Random Forest Classifier using GUIDE trees as base estimators.
Parameters¶
- n_estimatorsint, default=100
The number of trees in the forest.
- max_depthint, default=None
The maximum depth of the trees.
- max_featuresint, float, str or None, default=”sqrt”
The number of features to consider when looking for the best split.
- bootstrapbool, default=True
Whether bootstrap samples are used when building trees.
- n_jobsint, default=None
The number of jobs to run in parallel for both fit and predict.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
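The bootstrap-and-vote aggregation can be sketched as follows. Here `fit_predict` is a hypothetical stand-in for training a GUIDE tree on the resample and predicting on X; the ensemble logic is the standard random-forest recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_vote(X, y, fit_predict, n_estimators=25):
    """Fit each base learner on a bootstrap resample of (X, y) and
    majority-vote the per-sample predictions (binary labels)."""
    n = len(y)
    votes = np.zeros((n_estimators, n), dtype=int)
    for i in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # sample with replacement
        votes[i] = fit_predict(X[idx], y[idx], X)
    return (votes.mean(axis=0) > 0.5).astype(int)

# Toy base learner: a threshold stump on feature 0 at the bootstrap median.
def stump(Xb, yb, X):
    t = np.median(Xb[:, 0])
    above = Xb[:, 0] > t
    majority_hi = int(yb[above].mean() > 0.5) if np.any(above) else 1
    return np.where(X[:, 0] > t, majority_hi, 1 - majority_hi)

X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)
pred = bootstrap_vote(X, y, stump)   # near-perfect on this toy problem
```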
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideRandomForestClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideRandomForestRegressor(n_estimators=100, max_depth=None, max_features=1.0, bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]¶
Bases: RegressorMixin, BaseEstimator
Random Forest Regressor using GUIDE trees as base estimators.
Parameters¶
- n_estimatorsint, default=100
The number of trees in the forest.
- max_depthint, default=None
The maximum depth of the trees.
- max_featuresint, float, str or None, default=1.0
The number of features to consider when looking for the best split.
- bootstrapbool, default=True
Whether bootstrap samples are used when building trees.
- n_jobsint, default=None
The number of jobs to run in parallel for both fit and predict.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideRandomForestRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. Even with max_features=n_features, the best found split may vary across runs: if several splits yield an identical criterion improvement, one must be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- classes_ndarray of shape (n_classes,)
The class labels.
- n_classes_int
The number of classes.
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction.
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
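The variable-selection step above can be sketched as follows. This is a minimal numpy/scipy illustration of the curvature test idea, not the package's internal implementation; the quartile discretization, the helper name, and the assumption that class labels are 0..k-1 are all choices made for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable(X, y, n_bins=4):
    """Sketch of step 1: choose the splitting variable via chi-square
    curvature tests, independently of any split point."""
    p_values = []
    for j in range(X.shape[1]):
        # Discretize the numeric feature into quantile bins.
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned = np.digitize(X[:, j], edges)
        # Contingency table of bins vs. class labels (assumed 0..k-1).
        table = np.zeros((n_bins, len(np.unique(y))))
        for b, cls in zip(binned, y):
            table[b, cls] += 1
        # Drop empty rows so the test is well defined.
        table = table[table.sum(axis=1) > 0]
        _, p, _, _ = chi2_contingency(table)
        p_values.append(p)
    # The feature with the smallest p-value wins the split.
    return int(np.argmin(p_values))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)       # only feature 1 carries signal
print(select_split_variable(X, y))  # expected: 1
```

Because the test compares binned feature values against the class labels rather than scanning candidate thresholds, every feature is judged on the same chi-square scale, which is what removes the bias toward high-cardinality variables.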
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a shallow unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t} \, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
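As a numeric illustration of this score, the node sizes and chi-square statistics below are made up for the example, not taken from any fitted tree:

```python
import math

# Hypothetical intermediate nodes t: (n_t, chi-square statistic for X_k at t).
nodes = [(100, 6.0), (40, 2.5), (25, 0.4)]

# v(X_k) = sum over intermediate nodes t of sqrt(n_t) * chi2_1(k, t)
v = sum(math.sqrt(n_t) * chi2 for n_t, chi2 in nodes)
print(round(v, 3))  # 77.811
```

The sqrt(n_t) weight means large nodes near the root dominate the score, while deep nodes with few samples contribute little.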
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
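The p-value-to-chi-square conversion in this score can be illustrated as follows; the node sizes and p-values are hypothetical, not values from a fitted tree:

```python
import math
from scipy.stats import chi2

# Hypothetical intermediate nodes: (n_t, curvature-test p-value for the feature).
nodes = [(100, 1e-6), (40, 0.03)]

# score = sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1 - p),
# where chi2_quantile is the quantile function of the 1-df chi-square law.
score = sum(math.sqrt(n_t) * chi2.ppf(1 - p, df=1) for n_t, p in nodes)
print(score > 0)  # True; smaller p-values map to larger chi-square values
```

Mapping p-values back through the chi-square quantile function puts curvature tests with different effective degrees of freedom on a common 1-df scale before they are summed.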
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
- class pyguide.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: RegressorMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm will select max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction (SSE).
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
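The regression variant of the variable-selection step can be sketched as follows. This is an illustrative numpy/scipy version of the residual-sign curvature test, not the package's code; the helper name and the quartile binning are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def curvature_pvalue(x, y, n_bins=4):
    """Sketch of the regression curvature test: residual signs from the
    node mean are cross-tabulated against quantile bins of the feature."""
    signs = (y - y.mean() >= 0).astype(int)   # sign of residual (0 or 1)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.digitize(x, edges)
    table = np.zeros((n_bins, 2))
    for b, s in zip(binned, signs):
        table[b, s] += 1
    table = table[table.sum(axis=1) > 0]      # drop empty bins
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(0)
x_signal = rng.normal(size=300)
y = x_signal ** 2 + 0.1 * rng.normal(size=300)  # curvature in x_signal
x_noise = rng.normal(size=300)
# The informative feature should yield a far smaller p-value than noise.
print(curvature_pvalue(x_signal, y) < curvature_pvalue(x_noise, y))  # True
```

Reducing the response to residual signs makes the test a contingency-table chi-square, exactly as in the classification case, so the same unbiasedness argument carries over to regression.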
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a shallow unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t} \, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.