pyguide package¶
Submodules¶
pyguide.classifier module¶
- class pyguide.classifier.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. Even with max_features=n_features, the best found split may vary across runs: if several splits yield an identical criterion improvement, one must be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
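The max_features rules above can be sketched as a small resolver. This is a hypothetical helper mirroring the documented behaviour, not pyguide's internal code:

```python
import math

def resolve_max_features(max_features, n_features_in_):
    # Resolve max_features to a concrete feature count, following the
    # documented rules: None, int, float fraction, "sqrt", or "log2".
    if max_features is None:
        return n_features_in_
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features_in_)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features_in_)))
        raise ValueError(f"Unknown max_features string: {max_features!r}")
    if isinstance(max_features, float):
        # Fraction of features, but never fewer than one.
        return max(1, int(max_features * n_features_in_))
    return int(max_features)
```

For example, `resolve_max_features(0.5, 10)` gives 5 and `resolve_max_features("sqrt", 16)` gives 4.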
Attributes¶
- classes_ndarray of shape (n_classes,)
The class labels.
- n_classes_int
The number of classes.
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction.
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.
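The split-point optimization step described in the Notes can be sketched as an exhaustive threshold scan over midpoints. This is illustrative only; pyguide's actual implementation may differ:

```python
import numpy as np

def best_gini_threshold(x, y):
    """Given the selected variable, scan midpoints between sorted values
    and pick the threshold minimizing weighted Gini impurity (sketch)."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_imp = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2.0:        # candidate midpoints
        left, right = ys[xs <= t], ys[xs > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, imp = best_gini_threshold(x, y)   # clean separation at the gap midpoint
```

Because the labels separate perfectly at the gap between 3.0 and 10.0, the scan returns the midpoint 6.5 with zero impurity.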
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
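The aggregation in the Notes can be sketched directly from per-node statistics. This is an illustrative computation of the formula, not pyguide's internal code:

```python
import numpy as np

def guide_importance(node_sizes, chi2_stats):
    """Aggregate per-node 1-df chi-square statistics into a GUIDE
    importance score: v(X_k) = sum_t sqrt(n_t) * chi2_1(k, t)."""
    node_sizes = np.asarray(node_sizes, dtype=float)
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    return float(np.sum(np.sqrt(node_sizes) * chi2_stats))

# Two intermediate nodes with 100 and 25 samples and chi-square
# statistics 8.0 and 2.0 give sqrt(100)*8 + sqrt(25)*2 = 90.0:
score = guide_importance([100, 25], [8.0, 2.0])
```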
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
pyguide.interactions module¶
- pyguide.interactions.calc_interaction_p_value(X_subset, z, categorical_mask=None)[source]¶
Calculate interaction p-value between features in X_subset on target z.
Parameters¶
- X_subsetarray-like of shape (n_samples, n_vars)
The features to test for interaction.
- zarray-like of shape (n_samples,)
The target values (class labels or residual signs).
- categorical_maskarray-like of shape (n_vars,), optional
Boolean mask indicating which features in X_subset are categorical.
GUIDE Strategy:
- Bin numerical variables into 2 groups (median split).
- Combine all binned/categorical variables into unique groups.
- Perform a Chi-square test of groups vs z.
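A minimal sketch of this strategy for two numeric variables, using scipy's chi-square test (illustrative; not pyguide's internal code):

```python
import numpy as np
from scipy.stats import chi2_contingency

def interaction_p_value_sketch(x1, x2, z):
    """Median-split each numeric variable into two bins, cross the bins
    into up to four combined groups, and chi-square test groups vs z."""
    b1 = (x1 > np.median(x1)).astype(int)
    b2 = (x2 > np.median(x2)).astype(int)
    group = 2 * b1 + b2                      # combined groups 0..3
    classes = np.unique(z)
    table = np.array([[np.sum((group == g) & (z == c)) for c in classes]
                      for g in np.unique(group)])
    _, p, _, _ = chi2_contingency(table)
    return p

# Pure XOR target: neither variable is individually predictive, but the
# combined groups separate the classes perfectly, so p is tiny.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
z = ((x1 > np.median(x1)) ^ (x2 > np.median(x2))).astype(int)
p_int = interaction_p_value_sketch(x1, x2, z)
```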
pyguide.node module¶
- class pyguide.node.GuideNode(depth, is_leaf=False, prediction=None, probabilities=None, split_feature=None, split_threshold=None, missing_go_left=True, left=None, right=None, n_samples=0, impurity=0.0, value_distribution=None, node_id=None, split_type=None, interaction_group=None, curvature_stats=None)[source]¶
Bases: object
A node in the GUIDE tree.
pyguide.regressor module¶
- class pyguide.regressor.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: RegressorMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction (SSE).
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.
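The residual-sign curvature test from the Notes can be sketched as follows (illustrative, not pyguide's implementation): residuals against the node mean are reduced to signs, and the signs are tested against a quartile-binned feature.

```python
import numpy as np
from scipy.stats import chi2_contingency

def residual_sign_p_value(x, y):
    """Chi-square test of residual signs (vs the node mean) against the
    quartile-binned feature x (sketch of the regression curvature test)."""
    z = (y - y.mean() > 0).astype(int)                 # residual signs
    edges = np.unique(np.quantile(x, [0.25, 0.5, 0.75]))
    binned = np.digitize(x, edges)
    table = np.array([[np.sum((binned == b) & (z == s)) for s in (0, 1)]
                      for b in np.unique(binned)])
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = 2.0 * x + rng.normal(scale=0.1, size=400)  # y depends strongly on x
p_signal = residual_sign_p_value(x, y)
p_noise = residual_sign_p_value(rng.normal(size=400), y)
# p_signal is essentially zero; the noise feature scores far worse.
```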
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a short unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t}\, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
pyguide.selection module¶
- pyguide.selection.select_split_variable(X, y, categorical_features=None, feature_indices=None)[source]¶
Select the best variable to split on based on curvature tests (Chi-square).
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values (class labels).
- categorical_featuresarray-like of shape (n_features,), optional
Boolean mask indicating which features are categorical.
- feature_indicesarray-like, optional
Indices of features to consider. If None, consider all features.
Returns¶
- best_feature_idxint
The index of the selected feature.
- best_pfloat
The p-value of the selected feature.
- chi2_statsndarray
The Chi-square statistics for all features.
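A simplified version of this selection, using quartile binning and scipy's chi-square test, can be sketched as follows (a sketch of the documented behaviour, not the actual implementation):

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable_sketch(X, y):
    """Curvature test per feature: quartile-bin each column, chi-square
    test it against y, and return the feature with the smallest p-value."""
    p_values = []
    for j in range(X.shape[1]):
        edges = np.unique(np.quantile(X[:, j], [0.25, 0.5, 0.75]))
        binned = np.digitize(X[:, j], edges)
        classes = np.unique(y)
        table = np.array([[np.sum((binned == b) & (y == c)) for c in classes]
                          for b in np.unique(binned)])
        p_values.append(chi2_contingency(table)[1])
    best = int(np.argmin(p_values))
    return best, p_values[best]

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = (X[:, 2] > 0).astype(int)       # only feature 2 is informative
best_idx, best_p = select_split_variable_sketch(X, y)
```

Note the selection depends only on the per-feature association tests, never on candidate split points, which is what removes the bias toward high-cardinality features.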
pyguide.splitting module¶
pyguide.stats module¶
pyguide.visualization module¶
- class pyguide.visualization.MockTree(children_left, children_right, feature, threshold, value, impurity, n_node_samples)[source]¶
Bases: object
A class that mocks the interface of sklearn.tree._tree.Tree. Used for compatibility with sklearn's visualization tools.
Module contents¶
- class pyguide.GuideGradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
Gradient Boosting for binary classification using GUIDE trees.
This implementation uses the GUIDE algorithm for unbiased variable selection at each boosting stage. Currently supports binary classification (Log Loss).
Parameters¶
- n_estimatorsint, default=100
The number of boosting stages to perform.
- learning_ratefloat, default=0.1
Learning rate shrinks the contribution of each tree by learning_rate.
- max_depthint, default=3
Maximum depth of the individual regression estimators.
- subsamplefloat, default=1.0
The fraction of samples to be used for fitting the individual base learners.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
GUIDE-specific parameters:
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- max_interaction_candidatesint, default=None
Limit on candidates for interaction search.
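One boosting stage for binary log loss can be sketched as follows. Here `fit_regressor` is a hypothetical stand-in for fitting a GUIDE regression tree to the pseudo-residuals; the update rule itself is standard gradient boosting:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def boost_step(F, y, fit_regressor, learning_rate=0.1):
    """One log-loss boosting stage (sketch): fit the base learner to the
    pseudo-residuals y - sigmoid(F), i.e. the negative gradient of the
    log loss, and add its shrunken predictions to the raw scores F."""
    residuals = y - sigmoid(F)
    predict = fit_regressor(residuals)   # per-sample predictions
    return F + learning_rate * predict

# Toy check with an idealized base learner that reproduces the residuals
# exactly: the predicted probabilities converge toward the labels.
y = np.array([0.0, 1.0, 1.0, 0.0])
F = np.zeros(4)
for _ in range(200):
    F = boost_step(F, y, fit_regressor=lambda r: r, learning_rate=0.5)
proba = sigmoid(F)
```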
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideGradientBoostingClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideGradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, random_state=None, significance_threshold=0.05, interaction_depth=1, max_interaction_candidates=None)[source]¶
Bases: RegressorMixin, BaseEstimator
Gradient Boosting for regression using GUIDE trees as base learners.
This implementation uses the GUIDE algorithm for unbiased variable selection and interaction detection at each boosting stage.
Parameters¶
- n_estimatorsint, default=100
The number of boosting stages to perform.
- learning_ratefloat, default=0.1
Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
- max_depthint, default=3
Maximum depth of the individual regression estimators.
- subsamplefloat, default=1.0
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting.
- random_stateint, RandomState instance or None, default=None
Controls the random seed given to each Tree estimator at each boosting iteration.
GUIDE-specific parameters:
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection in GUIDE trees.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- max_interaction_candidatesint, default=None
Limit on candidates for interaction search.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideGradientBoostingRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideRandomForestClassifier(n_estimators=100, max_depth=None, max_features='sqrt', bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]¶
Bases: ClassifierMixin, BaseEstimator
Random Forest Classifier using GUIDE trees as base estimators.
Parameters¶
- n_estimatorsint, default=100
The number of trees in the forest.
- max_depthint, default=None
The maximum depth of the trees.
- max_featuresint, float, str or None, default=”sqrt”
The number of features to consider when looking for the best split.
- bootstrapbool, default=True
Whether bootstrap samples are used when building trees.
- n_jobsint, default=None
The number of jobs to run in parallel for both fit and predict.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
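The bootstrap-and-vote aggregation can be sketched as follows. Here `fit_predict` is a hypothetical stand-in for training a GUIDE tree on the resample and predicting on X; the ensemble logic is the standard random-forest recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_vote(X, y, fit_predict, n_estimators=25):
    """Fit each base learner on a bootstrap resample of (X, y) and
    majority-vote the per-sample predictions (binary labels)."""
    n = len(y)
    votes = np.zeros((n_estimators, n), dtype=int)
    for i in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # sample with replacement
        votes[i] = fit_predict(X[idx], y[idx], X)
    return (votes.mean(axis=0) > 0.5).astype(int)

# Toy base learner: a threshold stump on feature 0 at the bootstrap median.
def stump(Xb, yb, X):
    t = np.median(Xb[:, 0])
    above = Xb[:, 0] > t
    majority_hi = int(yb[above].mean() > 0.5) if np.any(above) else 1
    return np.where(X[:, 0] > t, majority_hi, 1 - majority_hi)

X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)
pred = bootstrap_vote(X, y, stump)   # near-perfect on this toy problem
```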
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideRandomForestClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideRandomForestRegressor(n_estimators=100, max_depth=None, max_features=1.0, bootstrap=True, n_jobs=None, random_state=None, significance_threshold=0.05, interaction_depth=1)[source]¶
Bases: RegressorMixin, BaseEstimator
Random Forest Regressor using GUIDE trees as base estimators.
Parameters¶
- n_estimatorsint, default=100
The number of trees in the forest.
- max_depthint, default=None
The maximum depth of the trees.
- max_featuresint, float, str or None, default=1.0
The number of features to consider when looking for the best split.
- bootstrapbool, default=True
Whether bootstrap samples are used when building trees.
- n_jobsint, default=None
The number of jobs to run in parallel for both fit and predict.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection.
- interaction_depthint, default=1
The maximum order of interactions to search for.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideRandomForestRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- class pyguide.GuideTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: ClassifierMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Classifier.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. Even with max_features=n_features, the best found split may vary across runs: if several splits yield an identical criterion improvement, one must be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- classes_ndarray of shape (n_classes,)
The class labels.
- n_classes_int
The number of classes.
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction.
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Use Chi-square tests (curvature) to identify the best splitting variable independently of the split point.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the impurity (Gini index).
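The variable-selection step above can be sketched as follows. This is a minimal numpy/scipy illustration of the curvature test idea, not the package's internal implementation; the quartile discretization, the helper name, and the assumption that class labels are 0..k-1 are all choices made for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable(X, y, n_bins=4):
    """Sketch of step 1: choose the splitting variable via chi-square
    curvature tests, independently of any split point."""
    p_values = []
    for j in range(X.shape[1]):
        # Discretize the numeric feature into quantile bins.
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned = np.digitize(X[:, j], edges)
        # Contingency table of bins vs. class labels (assumed 0..k-1).
        table = np.zeros((n_bins, len(np.unique(y))))
        for b, cls in zip(binned, y):
            table[b, cls] += 1
        # Drop empty rows so the test is well defined.
        table = table[table.sum(axis=1) > 0]
        _, p, _, _ = chi2_contingency(table)
        p_values.append(p)
    # The feature with the smallest p-value wins the split.
    return int(np.argmin(p_values))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)       # only feature 1 carries signal
print(select_split_variable(X, y))  # expected: 1
```

Because the test compares binned feature values against the class labels rather than scanning candidate thresholds, every feature is judged on the same chi-square scale, which is what removes the bias toward high-cardinality variables.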
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a shallow unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t} \, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
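As a numeric illustration of this score, the node sizes and chi-square statistics below are made up for the example, not taken from any fitted tree:

```python
import math

# Hypothetical intermediate nodes t: (n_t, chi-square statistic for X_k at t).
nodes = [(100, 6.0), (40, 2.5), (25, 0.4)]

# v(X_k) = sum over intermediate nodes t of sqrt(n_t) * chi2_1(k, t)
v = sum(math.sqrt(n_t) * chi2 for n_t, chi2 in nodes)
print(round(v, 3))  # 77.811
```

The sqrt(n_t) weight means large nodes near the root dominate the score, while deep nodes with few samples contribute little.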
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
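The p-value-to-chi-square conversion in this score can be illustrated as follows; the node sizes and p-values are hypothetical, not values from a fitted tree:

```python
import math
from scipy.stats import chi2

# Hypothetical intermediate nodes: (n_t, curvature-test p-value for the feature).
nodes = [(100, 1e-6), (40, 0.03)]

# score = sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1 - p),
# where chi2_quantile is the quantile function of the 1-df chi-square law.
score = sum(math.sqrt(n_t) * chi2.ppf(1 - p, df=1) for n_t, p in nodes)
print(score > 0)  # True; smaller p-values map to larger chi-square values
```

Mapping p-values back through the chi-square quantile function puts curvature tests with different effective degrees of freedom on a common 1-df scale before they are summed.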
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.
- class pyguide.GuideTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, significance_threshold=0.05, interaction_depth=1, categorical_features=None, ccp_alpha=0.0, interaction_features=None, max_interaction_candidates=None, max_features=None, random_state=None)[source]¶
Bases: RegressorMixin, BaseEstimator
GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) Tree Regressor.
GUIDE is a decision tree algorithm that separates variable selection from split point optimization. This approach prevents the variable selection bias common in CART-like algorithms (which favor variables with many unique values) and provides built-in interaction detection.
Parameters¶
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint, default=2
The minimum number of samples required to split an internal node.
- min_samples_leafint, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- significance_thresholdfloat, default=0.05
The p-value threshold for variable selection and interaction detection. If no variable is individually significant at this level, the algorithm searches for interactions. If no interaction is significant either, splitting stops.
- interaction_depthint, default=1
The maximum order of interactions to search for.
0: No interaction detection.
1: Pairwise interactions.
2: Triplets, etc.
- categorical_featureslist of int, default=None
Indices of features to be treated as categorical. If None, the algorithm attempts to identify categorical features automatically based on input types (e.g., pandas object/category columns).
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
- interaction_featureslist of int, default=None
Subset of feature indices to consider for interaction search. If None, all features are considered (subject to candidate filtering).
- max_interaction_candidatesint, default=None
If set, only the top K features (ranked by individual p-values) are considered as candidates for interaction tests. This significantly speeds up training on high-dimensional datasets.
- max_featuresint, float, str or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features_in_).
If “log2”, then max_features=log2(n_features_in_).
If None, then max_features=n_features_in_.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split. When max_features < n_features, the algorithm will select max_features features at random at each split before finding the best split among them. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
Attributes¶
- n_features_in_int
Number of features seen during fit.
- n_nodes_int
Total number of nodes in the fitted tree.
- n_leaves_int
Number of leaf nodes in the fitted tree.
- max_depth_int
The actual maximum depth of the fitted tree.
- feature_importances_ndarray of shape (n_features_in_,)
The feature importances based on weighted impurity reduction (SSE).
Notes¶
The algorithm follows a two-step process at each node:
1. Variable Selection: Calculate residuals from the current node mean and use Chi-square tests on residual signs to identify the best splitting variable.
2. Split Point Optimization: Given the selected variable, find the threshold that minimizes the Sum of Squared Errors (SSE).
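The regression variant of the variable-selection step can be sketched as follows. This is an illustrative numpy/scipy version of the residual-sign curvature test, not the package's code; the helper name and the quartile binning are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def curvature_pvalue(x, y, n_bins=4):
    """Sketch of the regression curvature test: residual signs from the
    node mean are cross-tabulated against quantile bins of the feature."""
    signs = (y - y.mean() >= 0).astype(int)   # sign of residual (0 or 1)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.digitize(x, edges)
    table = np.zeros((n_bins, 2))
    for b, s in zip(binned, signs):
        table[b, s] += 1
    table = table[table.sum(axis=1) > 0]      # drop empty bins
    _, p, _, _ = chi2_contingency(table)
    return p

rng = np.random.default_rng(0)
x_signal = rng.normal(size=300)
y = x_signal ** 2 + 0.1 * rng.normal(size=300)  # curvature in x_signal
x_noise = rng.normal(size=300)
# The informative feature should yield a far smaller p-value than noise.
print(curvature_pvalue(x_signal, y) < curvature_pvalue(x_noise, y))  # True
```

Reducing the response to residual signs makes the test a contingency-table chi-square, exactly as in the classification case, so the same unbiasedness argument carries over to regression.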
References¶
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 361-386.
- compute_guide_importance(X, y, max_depth=4, bias_correction=True, n_permutations=300, random_state=None)[source]¶
Calculate GUIDE variable importance scores using an auxiliary shallow tree.
Following Loh & Zhou (2021), this method grows a shallow unpruned tree to calculate unbiased associative importance scores. It includes permutation-based bias correction and interaction detection.
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The training input samples.
- yarray-like of shape (n_samples,)
The target values.
- max_depthint, default=4
The depth of the auxiliary tree used for scoring. The paper recommends a depth of 4 for stable associative scores.
- bias_correctionbool, default=True
Whether to perform permutation-based bias correction.
- n_permutationsint, default=300
Number of permutations for bias correction. The paper uses 300 for high stability in simulations.
- random_stateint, RandomState instance or None, default=None
Controls the randomness of permutations and tree growth.
Returns¶
- importancesndarray of shape (n_features,)
The calculated importance scores. If bias_correction=True, these are the normalized VI scores, where a score of 1.0 represents the expected importance of a noise variable.
Notes¶
The importance score $v(X_k)$ for variable $X_k$ is defined as $v(X_k) = \sum_{t} \sqrt{n_t} \, \chi^2_1(k, t)$, where the sum is over intermediate nodes $t$, $n_t$ is the sample size at node $t$, and $\chi^2_1(k, t)$ is the 1-degree-of-freedom chi-square statistic for the association between $X_k$ and the response at that node.
References¶
Loh, W.-Y. and Zhou, P. (2021). Variable Importance Scores. Journal of Data Science, 19(4), 569-592.
- cost_complexity_pruning_path(X, y, sample_weight=None)[source]¶
Compute the pruning path during Minimal Cost-Complexity Pruning.
- property feature_importances_¶
Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
- get_depth()[source]¶
Return the depth of the decision tree. The depth of a tree is the maximum distance between the root and any leaf.
- property guide_importances_¶
Return the GUIDE importance scores (Loh & Zhou, 2021). Score is the sum over intermediate nodes of sqrt(n_t) * chi2_quantile(1-p).
- property interaction_importances_¶
Return the interaction-aware feature importances. If a split was chosen via interaction detection, the reduction is distributed equally among all interacting features.
- property max_depth_¶
- property n_leaves_¶
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GuideTreeRegressor¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
Returns¶
- selfobject
The updated object.
- property tree_¶
Returns a scikit-learn compatible MockTree.