pyguide package

Submodules

pyguide.classifier module

class pyguide.classifier.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]

Bases: ClassifierMixin, BaseEstimator

GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.

GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.

Parameters

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int, default=2

The minimum number of samples required to split an internal node.

min_samples_leaf : int, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

significance_threshold : float, default=0.05

The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.

interaction_depth : int, default=1

The maximum order of interactions to search for.

  • 0: No interaction detection.

  • 1: Pairwise interactions.

  • 2: Triplets, etc.

categorical_features : list of int, default=None

Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.

interaction_features : list of int, default=None

Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).

max_interaction_candidates : int, default=None

If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.

max_features : int, float, str or None, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.

  • If “sqrt”, then max_features=sqrt(n_features_in_).

  • If “log2”, then max_features=log2(n_features_in_).

  • If None, then max_features=n_features_in_.
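For illustration, the option list above can be resolved to a concrete feature count as follows. This is a sketch under the stated rules, not pyguide's own code: `resolve_max_features` is a hypothetical helper, and the flooring/clamping behaviour for “sqrt” and “log2” is an assumption.

```python
import math

def resolve_max_features(max_features, n_features_in_):
    """Turn a max_features setting into a concrete feature count
    (hypothetical helper illustrating the documented options)."""
    if max_features is None:
        return n_features_in_
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features_in_)))  # assumed flooring
        if max_features == "log2":
            return max(1, int(math.log2(n_features_in_)))  # assumed flooring
        raise ValueError(f"unknown max_features: {max_features!r}")
    if isinstance(max_features, float):
        # fraction of the features seen during fit
        return max(1, int(max_features * n_features_in_))
    return max_features  # already an int

print(resolve_max_features("sqrt", 100))  # 10
print(resolve_max_features(0.25, 10))     # 2
```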

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. Even when max_features=n_features, the best found split may vary across runs: if several splits yield an identical improvement of the criterion, one must be selected at random. To obtain deterministic behaviour during fitting, fix random_state to an integer.

Attributes

classes_ : ndarray of shape (n_classes,)

The class labels.

n_classes_ : int

The number of classes.

n_features_in_ : int

Number of features seen during fit.

n_nodes_ : int

Total number of nodes in the fitted tree.

n_leaves_ : int

Number of leaf nodes in the fitted tree.

max_depth_ : int

The actual maximum depth of the fitted tree.

feature_importances_ : ndarray of shape (n_features_in_,)

The feature importances based on weighted impurity reduction.

Notes

The algorithm follows a two-step process at each node:

  1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.

  2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
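Step 1 can be sketched in a few lines of plain Python. This is a simplified illustration of chi-square variable selection on median-binned features, not pyguide's actual implementation; `chi2_statistic` and `select_variable` are hypothetical names.

```python
def chi2_statistic(x_binary, y):
    """Pearson chi-square statistic for a binary feature bin vs. class labels."""
    classes = sorted(set(y))
    n = len(y)
    obs = {(b, c): 0 for b in (0, 1) for c in classes}
    for b, c in zip(x_binary, y):
        obs[(b, c)] += 1
    row = {b: sum(obs[(b, c)] for c in classes) for b in (0, 1)}
    col = {c: sum(obs[(b, c)] for b in (0, 1)) for c in classes}
    stat = 0.0
    for b in (0, 1):
        for c in classes:
            exp = row[b] * col[c] / n  # expected count under independence
            if exp > 0:
                stat += (obs[(b, c)] - exp) ** 2 / exp
    return stat

def select_variable(X_cols, y):
    """Pick the feature whose median-split bins associate most with y."""
    best_idx, best_stat = None, -1.0
    for j, col in enumerate(X_cols):
        med = sorted(col)[len(col) // 2]
        bins = [1 if v > med else 0 for v in col]
        s = chi2_statistic(bins, y)
        if s > best_stat:
            best_idx, best_stat = j, s
    return best_idx, best_stat

# x0 tracks the class; x1 is noise, so x0 should be selected
x0 = [1, 2, 3, 10, 11, 12]
x1 = [5, 1, 9, 2, 8, 3]
y  = [0, 0, 0, 1, 1, 1]
idx, stat = select_variable([x0, x1], y)
```

The key property of this step is that the statistic depends only on the binned association with the target, not on how many unique values a feature has, which is what removes the CART-style selection bias.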

References

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.

apply(X)[source]

Return the index of the leaf that each sample is predicted as.

compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]

Calculate GUIDE variable importance scores using an auxiliary shallow tree.

Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
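The bias-correction idea can be illustrated schematically: the raw score is divided by the average score the same procedure assigns to permuted copies of the target, so a value of about 1.0 becomes the noise baseline. The helpers below are hypothetical stand-ins (a simple mean-difference score instead of the tree-based score), not pyguide's code.

```python
import random

def raw_importance(x, y):
    """Stand-in association score: |mean difference| between the two classes."""
    a = [v for v, c in zip(x, y) if c == 0]
    b = [v for v, c in zip(x, y) if c == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def corrected_importance(x, y, n_permutations=300, seed=0):
    """Normalize the raw score by its average under permuted targets."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(n_permutations):
        y_perm = y[:]
        rng.shuffle(y_perm)
        baseline += raw_importance(x, y_perm)
    baseline /= n_permutations
    # After normalization, ~1.0 is what a pure-noise feature would score.
    return raw_importance(x, y) / baseline

x = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
y = [0, 0, 0, 1, 1, 1]
score = corrected_importance(x, y)  # well above 1.0 for this informative feature
```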

Parameters

X : array-like of shape (n_samples, n_features)

The training input samples.

y : array-like of shape (n_samples,)

The target values.

max_depth : int, default=4

The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.

bias_correction : bool, default=True

Whether to perform permutation-based bias correction.

n_permutations : int, default=300

Number of permutations for bias correction. The paper uses 300 for high stability in simulations.

random_state : int, RandomState instance or None, default=None

Controls the randomness of permutations and tree growth.

Returns

importances : ndarray of shape (n_features,)

The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.

Notes

The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
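A toy computation of this score, with assumed (hypothetical) per-node statistics:

```python
import math

# Hypothetical (n_t, chi2_1) pairs for one variable X_k at each
# intermediate node t of the auxiliary tree.
nodes = [(100, 9.0), (40, 4.0), (25, 1.0)]

# v(X_k) = sum over intermediate nodes of sqrt(n_t) * chi2_1(k, t)
v = sum(math.sqrt(n_t) * chi2 for n_t, chi2 in nodes)
```

Larger nodes and stronger associations both raise the score, so the root-level association dominates unless deeper nodes show unusually strong effects.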

References

Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.

cost_complexity_pruning_path(X, y, sample_weight=None)[source]

Compute the pruning path during Minimal Cost-Complexity Pruning.

decision_path(X)[source]

Return the decision path in the tree.

property feature_importances_

Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.

fit(X, y)[source]

Build a GUIDE tree classifier from the training set (X, y).

get_depth()[source]

Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.

get_n_leaves()[source]

Return the number of leaves of the decision tree.

property guide_importances_

Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).

property interaction_importances_

Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.

property max_depth_
property n_leaves_
predict(X)[source]

Predict class for X.

predict_proba(X)[source]

Predict class probabilities for X.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideTreeClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

property tree_

Returns a scikit-learn compatible MockTree.

pyguide.interactions module

pyguide.interactions.calc_interaction_p_value(X_subset, z, categorical_mask=None)[source]

Calculate interaction p-value between features in X_subset on target z.

Parameters

X_subset : array-like of shape (n_samples, n_vars)

The features to test for interaction.

z : array-like of shape (n_samples,)

The target values (class labels or residual signs).

categorical_mask : array-like of shape (n_vars,), optional

Boolean mask indicating which features in X_subset are categorical.

GUIDE strategy:

  • Bin numerical variables into 2 groups (median split).

  • Combine all binned/categorical variables into unique groups.

  • Perform a Chi-square test of the groups against z.
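The first two steps can be sketched for two numeric features. An XOR-style target makes the point: neither feature predicts z alone, but the crossed bins do. `interaction_groups` is a hypothetical helper, and the chi-square test on the resulting table is omitted.

```python
def interaction_groups(x1, x2, z):
    """Median-bin two numeric features, cross the bins into groups,
    and tabulate group-vs-target counts (the chi-square test would
    then run on this contingency table)."""
    def median_bin(x):
        med = sorted(x)[len(x) // 2]
        return [1 if v >= med else 0 for v in x]
    b1, b2 = median_bin(x1), median_bin(x2)
    table = {}
    for g1, g2, t in zip(b1, b2, z):
        # one group per unique combination of bins; counts per class
        table.setdefault((g1, g2), [0, 0])[t] += 1
    return table

# XOR-style interaction: z flips whenever exactly one feature is high.
x1 = [0, 0, 10, 10]
x2 = [0, 10, 0, 10]
z  = [0, 1, 1, 0]
table = interaction_groups(x1, x2, z)  # every crossed group is pure
```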

pyguide.node module

class pyguide.node.GuideNode(depth, is_leaf=False, prediction=None, probabilities=None, split_feature=None, split_threshold=None, missing_go_left=True, left=None, right=None, n_samples=0, impurity=0.0, value_distribution=None, node_id=None, split_type=None, interaction_group=None, curvature_stats=None)[source]

Bases: object

A node in the GUIDE tree.

is_leaf_node()[source]

pyguide.regressor module

class pyguide.regressor.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]

Bases: RegressorMixin, BaseEstimator

GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.

GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.

Parameters

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int, default=2

The minimum number of samples required to split an internal node.

min_samples_leaf : int, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

significance_threshold : float, default=0.05

The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.

interaction_depth : int, default=1

The maximum order of interactions to search for.

  • 0: No interaction detection.

  • 1: Pairwise interactions.

  • 2: Triplets, etc.

categorical_features : list of int, default=None

Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.

interaction_features : list of int, default=None

Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).

max_interaction_candidates : int, default=None

If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.

max_features : int, float, str or None, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.

  • If “sqrt”, then max_features=sqrt(n_features_in_).

  • If “log2”, then max_features=log2(n_features_in_).

  • If None, then max_features=n_features_in_.

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm will select max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, fix random_state to an integer.

Attributes

n_features_in_ : int

Number of features seen during fit.

n_nodes_ : int

Total number of nodes in the fitted tree.

n_leaves_ : int

Number of leaf nodes in the fitted tree.

max_depth_ : int

The actual maximum depth of the fitted tree.

feature_importances_ : ndarray of shape (n_features_in_,)

The feature importances based on weighted impurity reduction (SSE).

Notes

The algorithm follows a two-step process at each node:

  1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.

  2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
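Step 1 for the regressor can be sketched as follows: residual signs relative to the node mean play the role of class labels in a chi-square association test. This is an illustrative simplification (hypothetical helper, median-binned features, 2x2 tables only), not pyguide's implementation.

```python
def residual_sign_selection(X_cols, y):
    """Return the index of the feature whose median-split bins associate
    most strongly with the signs of the residuals y - mean(y)."""
    mean_y = sum(y) / len(y)
    signs = [1 if v >= mean_y else 0 for v in y]  # residual signs

    def chi2(bins):
        n = len(bins)
        obs = {(b, s): 0 for b in (0, 1) for s in (0, 1)}
        for b, s in zip(bins, signs):
            obs[(b, s)] += 1
        stat = 0.0
        for b in (0, 1):
            for s in (0, 1):
                exp = (sum(obs[(b, t)] for t in (0, 1))
                       * sum(obs[(a, s)] for a in (0, 1)) / n)
                if exp > 0:
                    stat += (obs[(b, s)] - exp) ** 2 / exp
        return stat

    stats = []
    for col in X_cols:
        med = sorted(col)[len(col) // 2]
        stats.append(chi2([1 if v >= med else 0 for v in col]))
    return max(range(len(stats)), key=stats.__getitem__)

# x0 drives y; x1 is noise, so x0 should be selected
x0 = [1, 2, 3, 8, 9, 10]
x1 = [3, 9, 1, 2, 10, 8]
y  = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
best = residual_sign_selection([x0, x1], y)
```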

References

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.

apply(X)[source]

Return the index of the leaf that each sample is predicted as.

compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]

Calculate GUIDE variable importance scores using an auxiliary shallow tree.

Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.

Parameters

X : array-like of shape (n_samples, n_features)

The training input samples.

y : array-like of shape (n_samples,)

The target values.

max_depth : int, default=4

The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.

bias_correction : bool, default=True

Whether to perform permutation-based bias correction.

n_permutations : int, default=300

Number of permutations for bias correction. The paper uses 300 for high stability in simulations.

random_state : int, RandomState instance or None, default=None

Controls the randomness of permutations and tree growth.

Returns

importances : ndarray of shape (n_features,)

The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.

Notes

The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.

References

Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.

cost_complexity_pruning_path(X, y, sample_weight=None)[source]

Compute the pruning path during Minimal Cost-Complexity Pruning.

decision_path(X)[source]

Return the decision path in the tree.

property feature_importances_

Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.

fit(X, y)[source]

Build a GUIDE tree regressor from the training set (X, y).

get_depth()[source]

Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.

get_n_leaves()[source]

Return the number of leaves of the decision tree.

property guide_importances_

Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).

property interaction_importances_

Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.

property max_depth_
property n_leaves_
predict(X)[source]

Predict regression target for X.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideTreeRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

property tree_

Returns a scikit-learn compatible MockTree.

pyguide.selection module

pyguide.selection.select_split_variable(X, y, categorical_features=None, feature_indices=None)[source]

Select the best variable to split on based on curvature tests (Chi-square).

Parameters

X : array-like of shape (n_samples, n_features)

The training input samples.

y : array-like of shape (n_samples,)

The target values (class labels).

categorical_features : array-like of shape (n_features,), optional

Boolean mask indicating which features are categorical.

feature_indices : array-like, optional

Indices of features to consider. If None, consider all features.

Returns

best_feature_idx : int

The index of the selected feature.

best_p : float

The p-value of the selected feature.

chi2_stats : ndarray

The Chi-square statistics for all features.

pyguide.splitting module

pyguide.splitting.find_best_split(x, y, is_categorical=False, criterion='gini')[source]

Find the best split for feature x. Returns (threshold/category, missing_go_left, gain).
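For the numeric case, the split search can be sketched as a scan over midpoints that maximizes the weighted Gini reduction. This is an illustration, not pyguide's implementation; categorical splits and missing-value handling (the missing_go_left flag) are omitted, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def find_best_split_numeric(x, y):
    """Scan midpoints between consecutive sorted values of x; return
    (threshold, gain) maximizing the weighted Gini reduction."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    parent = gini([c for _, c in pairs])
    best_thr, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= thr]
        right = [c for v, c in pairs if v > thr]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

thr, gain = find_best_split_numeric([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
# thr = 6.5 separates the classes perfectly; gain equals the parent impurity 0.5
```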

pyguide.stats module

pyguide.stats.calc_curvature_p_value(x, z, **kwargs)[source]
pyguide.stats.calc_curvature_test(x, z, is_categorical=False)[source]

Calculate the Chi-square statistic and p-value for the association between x and z. Returns (chi2_stat, p_value).

pyguide.visualization module

class pyguide.visualization.MockTree(children_left, children_right, feature, threshold, value, impurity, n_node_samples)[source]

Bases: object

A class that mocks the interface of sklearn.tree._tree.Tree. Used for compatibility with sklearn’s visualization tools.

pyguide.visualization.build_mock_tree(root_node, n_classes=1, is_classifier=True)[source]

Recursively builds arrays for MockTree from a GuideNode structure.

pyguide.visualization.plot_tree(decision_tree, **kwargs)[source]

Plot a GUIDE decision tree. This is a wrapper around sklearn.tree.plot_tree.

Module contents

class pyguide.GuideGradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]

Bases: ClassifierMixin, BaseEstimator

Gradient Boosting for binary classification using GUIDE trees.

This implementation uses the GUIDE algorithm for unbiased variable selection at each boosting stage. Currently supports binary classification (Log Loss).

Parameters

n_estimators : int, default=100

The number of boosting stages to perform.

learning_rate : float, default=0.1

Learning rate shrinks the contribution of each tree by learning_rate.

max_depth : int, default=3

Maximum depth of the individual regression estimators.

subsample : float, default=1.0

The fraction of samples to be used for fitting the individual base learners.

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator.

GUIDE-specific parameters:

significance_threshold : float, default=0.05

The p-value threshold for variable selection.

interaction_depth : int, default=1

The maximum order of interactions to search for.

max_interaction_candidates : int, default=None

Limit on candidates for interaction search.

decision_function(X)[source]

Compute the decision function of X (log-odds).

fit(X, y)[source]
predict(X)[source]

Predict class labels for X.

predict_proba(X)[source]

Predict class probabilities for X.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideGradientBoostingClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

The updated object.

class pyguide.GuideGradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]

Bases: RegressorMixin, BaseEstimator

Gradient Boosting for regression using GUIDE trees as base learners.

This implementation uses the GUIDE algorithm for unbiased variable selection and interaction detection at each boosting stage.
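The boosting loop itself is standard: each stage fits a base learner to the current residuals, and the prediction accumulates learning_rate-scaled contributions. A minimal squared-loss sketch with one-split stumps standing in for GUIDE trees (hypothetical helpers, not this class's code):

```python
def fit_stump(x, y):
    """Fit a one-split regression stump; return a predictor function."""
    best = None
    xs = sorted(set(x))
    for a, b in zip(xs, xs[1:]):
        thr = (a + b) / 2
        left = [t for v, t in zip(x, y) if v <= thr]
        right = [t for v, t in zip(x, y) if v > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lm) ** 2 for t in left)
               + sum((t - rm) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda v: lm if v <= thr else rm

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Squared-loss gradient boosting: each stage fits the residuals."""
    f0 = sum(y) / len(y)            # initial constant prediction
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_estimators):
        resid = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [p + learning_rate * stump(v) for p, v in zip(pred, x)]
    return lambda v: f0 + learning_rate * sum(s(v) for s in stumps)

model = boost([1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
```

For the classifier the same loop runs on log-loss pseudo-residuals (y minus the predicted probability), with decision_function returning the accumulated log-odds.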

Parameters

n_estimators : int, default=100

The number of boosting stages to perform.

learning_rate : float, default=0.1

Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

max_depth : int, default=3

Maximum depth of the individual regression estimators.

subsample : float, default=1.0

The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting.

random_state : int, RandomState instance or None, default=None

Controls the random seed given to each Tree estimator at each boosting iteration.

GUIDE-specific parameters:

significance_threshold : float, default=0.05

The p-value threshold for variable selection in GUIDE trees.

interaction_depth : int, default=1

The maximum order of interactions to search for.

max_interaction_candidates : int, default=None

Limit on candidates for interaction search.

fit(X, y)[source]

Fit the gradient boosting model.

predict(X)[source]
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideGradientBoostingRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

The updated object.

class pyguide.GuideRandomForestClassifier(n_estimators=100, max_depth=None, max_features='sqrt', bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]

Bases: ClassifierMixin, BaseEstimator

Random Forest Classifier using GUIDE trees as base estimators.
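The forest's fit/predict loop is plain bootstrap aggregation: each tree is fit on a bootstrap resample and predictions are combined by majority vote. A sketch with a toy base learner standing in for a GUIDE tree (`bagged_predict` and `fit_tree` are hypothetical, and parallelism/feature subsampling are omitted):

```python
import random
from collections import Counter

def bagged_predict(fit_tree, X, y, x_new, n_estimators=25, seed=0):
    """Bootstrap-aggregate any tree-fitting routine; fit_tree(X, y)
    must return a predictor function."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = fit_tree([X[i] for i in idx], [y[i] for i in idx])
        votes.append(tree(x_new))
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Toy base learner: 1-nearest-neighbour on a single feature.
def fit_tree(X, y):
    return lambda v: min(zip(X, y), key=lambda p: abs(p[0] - v))[1]

pred = bagged_predict(fit_tree, [1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], 2.5)
```

For predict_proba, the vote counts would be converted to class frequencies instead of taking the argmax.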

Parameters

n_estimators : int, default=100

The number of trees in the forest.

max_depth : int, default=None

The maximum depth of the trees.

max_features : int, float, str or None, default=”sqrt”

The number of features to consider when looking for the best split.

bootstrap : bool, default=True

Whether bootstrap samples are used when building trees.

n_jobs : int, default=None

The number of jobs to run in parallel for both fit and predict.

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator.

significance_threshold : float, default=0.05

The p-value threshold for variable selection.

interaction_depth : int, default=1

The maximum order of interactions to search for.

fit(X, y)[source]
predict(X)[source]
predict_proba(X)[source]
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideRandomForestClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

The updated object.

class pyguide.GuideRandomForestRegressor(n_estimators=100, max_depth=None, max_features=1.0, bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]

Bases: RegressorMixin, BaseEstimator

Random Forest Regressor using GUIDE trees as base estimators.

Parameters

n_estimators : int, default=100

The number of trees in the forest.

max_depth : int, default=None

The maximum depth of the trees.

max_features : int, float, str or None, default=1.0

The number of features to consider when looking for the best split.

bootstrap : bool, default=True

Whether bootstrap samples are used when building trees.

n_jobs : int, default=None

The number of jobs to run in parallel for both fit and predict.

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator.

significance_threshold : float, default=0.05

The p-value threshold for variable selection.

interaction_depth : int, default=1

The maximum order of interactions to search for.

fit(X, y)[source]
predict(X)[source]
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideRandomForestRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

The updated object.

class pyguide.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]

Bases: ClassifierMixin, BaseEstimator

GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.

GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.

Parameters

max_depthint, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint, default=2

The minimum number of samples required to split an internal node.

min_samples_leafint, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

significance_thresholdfloat, default=0.05

The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.

interaction_depthint, default=1

The maximum order of interactions to search for.

  • 0: No interaction detection.

  • 1: Pairwise interactions.

  • 2: Triplets, etc.

categorical_featureslist of int, default=None

Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).

ccp_alphanon-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.

interaction_featureslist of int, default=None

Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).

max_interaction_candidatesint, default=None

If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.

max_featuresint, float, str or None, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.

  • If “sqrt”, then max_features=sqrt(n_features_in_).

  • If “log2”, then max_features=log2(n_features_in_).

  • If None, then max_features=n_features_in_.

random_stateint, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm will select max_features features at random at each split before finding the best split among them. The best found split may vary across different runs, even if max_features=n_features; that is the case if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
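The max_features options above resolve to a concrete feature count per split. A minimal sketch of that resolution logic (resolve_max_features is a hypothetical helper, not part of the pyguide API, and the rounding choices are illustrative):

```python
import math

def resolve_max_features(max_features, n_features_in_):
    """Resolve the max_features parameter into a concrete count,
    following the rules listed above (rounding is illustrative)."""
    if max_features is None:
        return n_features_in_
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features_in_)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features_in_)))
        raise ValueError(f"Invalid max_features: {max_features!r}")
    if isinstance(max_features, float):
        # Fraction of the total number of features
        return max(1, int(max_features * n_features_in_))
    return max_features  # int: use as given

print(resolve_max_features("sqrt", 100))  # 10
print(resolve_max_features(0.5, 10))      # 5
```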

Attributes

classes_ndarray of shape (n_classes,)

The classes labels.

n_classes_int

The number of classes.

n_features_in_int

Number of features seen during fit.

n_nodes_int

Total number of nodes in the fitted tree.

n_leaves_int

Number of leaf nodes in the fitted tree.

max_depth_int

The actual maximum depth of the fitted tree.

feature_importances_ndarray of shape (n_features_in_,)

The feature importances based on weighted impurity reduction.

Notes

The algorithm follows a two-step process at each node:

  1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.

  2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
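The curvature test in step 1 can be sketched with scipy (a simplified illustration, not the pyguide internals): quantile-bin each numeric feature, build a contingency table against the class labels, and compare chi-square p-values across features.

```python
import numpy as np
from scipy.stats import chi2_contingency

def curvature_pvalue(x, y, n_bins=4):
    """Chi-square curvature test: discretize a numeric feature into
    quantile bins and test its association with the class labels."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    classes = np.unique(y)
    table = np.array([[np.sum((bins == b) & (y == c)) for c in classes]
                      for b in np.unique(bins)])
    table = table[table.sum(axis=1) > 0]  # keep the test well-defined
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
informative = y + rng.normal(0, 0.5, 500)   # associated with y
noise = rng.normal(0, 1, 500)               # independent of y
# The informative feature should get the smaller p-value
print(curvature_pvalue(informative, y) < curvature_pvalue(noise, y))
```

Because the p-value depends only on the binned contingency table, a feature with many unique values gains no advantage over one with few, which is the source of GUIDE's unbiased variable selection.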

References

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

apply(X)[source]

Return the index of the leaf that each sample is predicted as.

compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]

Calculate GUIDE variable importance scores using an auxiliary shallow tree.

Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.

Parameters

Xarray-like of shape (n_samples, n_features)

The training input samples.

yarray-like of shape (n_samples,)

The target values.

max_depthint, default=4

The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.

bias_correctionbool, default=True

Whether to perform permutation-based bias correction.

n_permutationsint, default=300

Number of permutations for bias correction. The paper uses 300 for high stability in simulations.

random_stateint, RandomState instance or None, default=None

Controls the randomness of permutations and tree growth.

Returns

importancesndarray of shape (n_features,)

The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.

Notes

The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.

References

Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
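The scoring formula from the Notes can be worked through in plain numpy, taking the per-node sample sizes and chi-square statistics as given (the values here are illustrative, not pyguide output):

```python
import numpy as np

# Hypothetical intermediate nodes: sample size n_t and the
# 1-df chi-square statistic of feature X_k at each node.
n_t   = np.array([200, 120, 80])
chi2k = np.array([15.2, 6.7, 3.1])

# v(X_k) = sum_t sqrt(n_t) * chi2_1(k, t)
v_k = np.sum(np.sqrt(n_t) * chi2k)
print(round(v_k, 2))  # 316.08
```

With bias_correction=True, compute_guide_importance divides such raw scores by the mean score of permuted (noise) features, so 1.0 becomes the reference level for an uninformative variable.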

cost_complexity_pruning_path(X, y, sample_weight=None)[source]

Compute the pruning path during Minimal Cost-Complexity Pruning.

decision_path(X)[source]

Return the decision path in the tree.

property feature_importances_

Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.

fit(X, y)[source]

Build a GUIDE tree classifier from the training set (X, y).

get_depth()[source]

Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.

get_n_leaves()[source]

Return the number of leaves of the decision tree.

property guide_importances_

Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).

property interaction_importances_

Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.

property max_depth_

property n_leaves_
predict(X)[source]

Predict class for X.

predict_proba(X)[source]

Predict class probabilities for X.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideTreeClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

selfobject

The updated object.

property tree_

Returns a scikit-learn compatible MockTree.

class pyguide.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]

Bases: RegressorMixin, BaseEstimator

GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.

GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.

Parameters

max_depthint, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint, default=2

The minimum number of samples required to split an internal node.

min_samples_leafint, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

significance_thresholdfloat, default=0.05

The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.

interaction_depthint, default=1

The maximum order of interactions to search for.

  • 0: No interaction detection.

  • 1: Pairwise interactions.

  • 2: Triplets, etc.

categorical_featureslist of int, default=None

Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).

ccp_alphanon-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.

interaction_featureslist of int, default=None

Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).

max_interaction_candidatesint, default=None

If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.

max_featuresint, float, str or None, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.

  • If “sqrt”, then max_features=sqrt(n_features_in_).

  • If “log2”, then max_features=log2(n_features_in_).

  • If None, then max_features=n_features_in_.

random_stateint, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm will select max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.

Attributes

n_features_in_int

Number of features seen during fit.

n_nodes_int

Total number of nodes in the fitted tree.

n_leaves_int

Number of leaf nodes in the fitted tree.

max_depth_int

The actual maximum depth of the fitted tree.

feature_importances_ndarray of shape (n_features_in_,)

The feature importances based on weighted impurity reduction (SSE).

Notes

The algorithm follows a two-step process at each node:

  1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.

  2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
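The residual-sign test in step 1 can be sketched with scipy (a simplified illustration, not the pyguide internals): compute residuals from the node mean, code their signs, and chi-square-test the signs against a quantile-binned feature.

```python
import numpy as np
from scipy.stats import chi2_contingency

def residual_sign_pvalue(x, y, n_bins=4):
    """GUIDE-style regression curvature test: association between the
    sign of (y - node mean) and a quantile-binned feature."""
    signs = (y - y.mean()) >= 0  # residual signs at this node
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    table = np.array([[np.sum((bins == b) & (signs == s)) for s in (False, True)]
                      for b in np.unique(bins)])
    table = table[table.sum(axis=1) > 0]  # keep the test well-defined
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(1)
x_good = rng.normal(size=400)
y = 2.0 * x_good + rng.normal(0, 0.5, 400)   # y depends on x_good
x_noise = rng.normal(size=400)               # unrelated feature
# The predictive feature should get the smaller p-value
print(residual_sign_pvalue(x_good, y) < residual_sign_pvalue(x_noise, y))
```

Reducing the regression response to residual signs turns variable selection into the same unbiased categorical association test used by the classifier.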

References

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

apply(X)[source]

Return the index of the leaf that each sample is predicted as.

compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]

Calculate GUIDE variable importance scores using an auxiliary shallow tree.

Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.

Parameters

Xarray-like of shape (n_samples, n_features)

The training input samples.

yarray-like of shape (n_samples,)

The target values.

max_depthint, default=4

The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.

bias_correctionbool, default=True

Whether to perform permutation-based bias correction.

n_permutationsint, default=300

Number of permutations for bias correction. The paper uses 300 for high stability in simulations.

random_stateint, RandomState instance or None, default=None

Controls the randomness of permutations and tree growth.

Returns

importancesndarray of shape (n_features,)

The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.

Notes

The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.

References

Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.

cost_complexity_pruning_path(X, y, sample_weight=None)[source]

Compute the pruning path during Minimal Cost-Complexity Pruning.

decision_path(X)[source]

Return the decision path in the tree.

property feature_importances_

Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.

fit(X, y)[source]

Build a GUIDE tree regressor from the training set (X, y).

get_depth()[source]

Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.

get_n_leaves()[source]

Return the number of leaves of the decision tree.

property guide_importances_

Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).

property interaction_importances_

Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.

property max_depth_

property n_leaves_
predict(X)[source]

Predict regression target for X.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GuideTreeRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

selfobject

The updated object.

property tree_

Returns a scikit-learn compatible MockTree.

pyguide.plot_tree(decision_tree, **kwargs)[source]

Plot a GUIDE decision tree. This is a wrapper around sklearn.tree.plot_tree.