API
This part of the documentation covers all the interfaces of Trustee. Where Trustee depends on external libraries, we document the most important parts here and provide links to the canonical documentation.
Trustee
The core module of the Trustee project
- class trustee.main.ClassificationTrustee(expert, logger=None)#
Bases: Trustee
Implements the Trust-oriented Decision Tree Extraction (Trustee) algorithm to train a student DecisionTreeClassifier based on observations from an Expert classification model.
- __init__(expert, logger=None)#
Classification Trustee constructor
- Parameters:
expert (object) – The ML blackbox model to analyze. The expert model must implement a predict method for Trustee to work properly, unless a different method is named through the predict_method_name argument of the fit() method.
logger (Logger object, default=None) – A logger object to log messages to. If None, the print() function is used to log messages.
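As a rough illustration of the predict requirement described above, the sketch below looks up a prediction method by name with Python's built-in getattr. The ToyExpert class, its score_samples method, and the get_predictions helper are hypothetical names introduced only for this example; this is not Trustee's internal code.

```python
class ToyExpert:
    """Hypothetical expert whose prediction method is not called `predict`."""

    def score_samples(self, X):
        # Label a row 1 when its first feature is positive, else 0.
        return [1 if row[0] > 0 else 0 for row in X]


def get_predictions(expert, X, predict_method_name="predict"):
    """Call the method named `predict_method_name` on `expert` (illustrative helper)."""
    method = getattr(expert, predict_method_name, None)
    if not callable(method):
        raise AttributeError(
            f"expert has no callable method named {predict_method_name!r}"
        )
    return method(X)


# The toy expert has no `predict`, so the method name is passed explicitly.
labels = get_predictions(
    ToyExpert(), [[0.5], [-1.0]], predict_method_name="score_samples"
)
print(labels)
```

With the default name, the same helper raises AttributeError for ToyExpert, which mirrors why a non-standard expert needs predict_method_name.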
- explain(top_k=10)#
Returns explainable model that best imitates Expert model, based on highest mean agreement and highest fidelity.
- Returns:
top_student – (dt, pruned_dt, agreement, reward)
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
tuple
- fit(X, y, top_k=10, max_leaf_nodes=None, max_depth=None, ccp_alpha=0.0, train_size=0.7, num_iter=50, num_stability_iter=5, num_samples=2000, samples_size=None, use_features=None, predict_method_name='predict', optimization='fidelity', aggregate=True, verbose=False)#
Trains a student DecisionTreeClassifier to imitate the Expert model.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to a pandas DataFrame.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series.
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure.
ccp_alpha (float, default=0.0) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning here for details: https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
train_size (float or int, default=0.7) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
num_iter (int, default=50) – Number of iterations to repeat Trustee inner-loop for.
num_stability_iter (int, default=5) – Number of iterations to repeat the Trustee stabilization outer-loop for.
num_samples (int, default=2000) – The absolute number of samples to fetch from the training dataset split to train the student decision tree model. If the samples_size argument is provided, this argument is ignored.
samples_size (float, default=None) – The fraction of the training dataset to use to train the student decision tree model. If None, the absolute number given by num_samples is used instead.
use_features (array-like, default=None) – Array-like of integers representing the indexes of features from the X training samples. If not None, only the features indicated by the provided indexes will be used to train the student decision tree model.
predict_method_name (str, default="predict") – The method interface to use to get predictions from the expert model. If no value is passed, the default predict interface is used.
optimization ({"fidelity", "accuracy"}, default="fidelity") – The comparison criteria to optimize the decision tree students in Trustee inner-loop. Used for ablation study only.
aggregate (bool, default=True) – Boolean indicating whether dataset aggregation should be used in Trustee inner-loop. Used for ablation study only.
verbose (bool, default=False) – Boolean indicating whether to log messages.
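To make the two optimization criteria concrete, here is a minimal sketch that contrasts fidelity (student compared with the expert's predictions) and accuracy (student compared with the ground-truth labels). A plain agreement rate stands in for whatever score Trustee actually computes, and the helper names agreement_rate and student_score are assumptions made for this example.

```python
def agreement_rate(y_a, y_b):
    """Fraction of positions where two label sequences match."""
    if len(y_a) != len(y_b):
        raise ValueError("label sequences must have the same length")
    return sum(a == b for a, b in zip(y_a, y_b)) / len(y_a)


def student_score(y_student, y_expert, y_true, optimization="fidelity"):
    """Score a student against the expert or the ground truth (illustrative only)."""
    if optimization == "fidelity":
        return agreement_rate(y_student, y_expert)
    if optimization == "accuracy":
        return agreement_rate(y_student, y_true)
    raise ValueError("optimization must be 'fidelity' or 'accuracy'")


y_true = [0, 1, 1, 0, 1]      # ground-truth labels
y_expert = [0, 1, 0, 0, 1]    # what the blackbox predicts
y_student = [0, 1, 0, 0, 0]   # what the student tree predicts

fidelity = student_score(y_student, y_expert, y_true, optimization="fidelity")
accuracy = student_score(y_student, y_expert, y_true, optimization="accuracy")
print(fidelity, accuracy)
```

In this toy data the student matches the expert more often than it matches the ground truth, which is the expected outcome when students are optimized to imitate the expert.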
- get_all_students()#
Get list of all (student, reward) obtained during the inner-loop process.
- Returns:
students_by_iter – Matrix with all student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter, num_iter) of tuple (dt, reward)
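The nested shape described above can be walked with ordinary list operations. The sketch below uses strings in place of fitted decision trees and made-up reward values, so it only illustrates the (num_stability_iter, num_iter) layout of (dt, reward) tuples and not real Trustee output.

```python
# Hypothetical stand-in for the returned structure: one row per stability
# (outer-loop) iteration, one (student, reward) tuple per inner-loop iteration.
students_by_iter = [
    [("tree_0_0", 0.81), ("tree_0_1", 0.93), ("tree_0_2", 0.88)],
    [("tree_1_0", 0.90), ("tree_1_1", 0.85), ("tree_1_2", 0.91)],
]

# Best (student, reward) tuple of each outer-loop iteration, ranked by reward.
best_per_row = [max(row, key=lambda pair: pair[1]) for row in students_by_iter]

# Flatten the matrix to inspect every reward at once.
all_rewards = [reward for row in students_by_iter for _, reward in row]

print(best_per_row)
print(len(all_rewards), max(all_rewards))
```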
- get_leaves_by_level()#
Returns number of leaves by level of the best student.
- Returns:
leaves_by_level – Dict of leaves by level. {“<level>(int)”: <leaves>(int)}
- Return type:
dict of int
- get_n_classes()#
Returns number of classes used in the top student model.
- Returns:
n_classes – Number of classes outputted in top student model.
- Return type:
int
- get_n_features()#
Returns number of features used in the top student model.
- Returns:
n_features – Number of features used in top student model.
- Return type:
int
- get_samples_by_level()#
Get number of samples by level of the best student.
- Returns:
samples_by_level – Dict of samples by level. {“<level>(int)”: <samples>(int)}
- Return type:
dict of int
- get_samples_sum()#
Get the sum of all samples in all non-leaf nodes in the best student model.
- Returns:
samples_sum – Sum of all samples covered by non-leaf nodes in top student model.
- Return type:
int
- get_stable(top_k=10, threshold=0.9, sort=True)#
Filters out explanations from Trustee stability analysis with less than threshold agreement.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
threshold (float, default=0.9) – Remove any student decision tree explanation if their mean agreement goes below given threshold. To keep all students regardless of mean agreement, pass 0.
sort (bool, default=True) – Boolean indicating whether to sort returned stable student explanation based on mean agreement.
- Returns:
stable_explanations – [(dt, pruned_dt, agreement, reward), …]
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of tuple
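The threshold and sort parameters can be pictured with the following sketch, which filters (dt, pruned_dt, agreement, reward) tuples by mean agreement. The filter_stable helper and the sample values are hypothetical, and sorting from highest to lowest agreement is an assumption about the ordering.

```python
def filter_stable(explanations, threshold=0.9, sort=True):
    """Keep explanations whose mean agreement is at least `threshold`."""
    # Index 2 of each tuple is the mean agreement.
    stable = [exp for exp in explanations if exp[2] >= threshold]
    if sort:
        # Assumption: highest mean agreement first.
        stable.sort(key=lambda exp: exp[2], reverse=True)
    return stable


# Strings stand in for fitted decision trees.
explanations = [
    ("dt_a", "pruned_a", 0.95, 0.97),
    ("dt_b", "pruned_b", 0.72, 0.99),
    ("dt_c", "pruned_c", 0.98, 0.94),
]

stable = filter_stable(explanations, threshold=0.9)
everything = filter_stable(explanations, threshold=0, sort=False)
print([exp[0] for exp in stable])
print(len(everything))
```

Passing a threshold of 0 keeps every explanation, matching the note in the parameter description.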
- get_top_branches(top_k=10)#
Returns list of top-k branches of the best student, sorted by the number of samples the branch classifies.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
- Returns:
top_branches – List of top-k branches from top student model. Each branch is a dict of the following form.
dict: { “level”: int, “path”: array-like of dict, “class”: int, “prob”: float, “samples”: int}
- Return type:
array-like of dict
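As a sketch of the ordering described above, the snippet below ranks branch dicts by their "samples" value and keeps the first top_k. The branch values are invented, the "path" entries are left empty because their inner layout is not specified here, and top_k_branches is a hypothetical helper rather than part of Trustee.

```python
def top_k_branches(branches, top_k=10):
    """Return the `top_k` branches covering the most samples (illustrative)."""
    ranked = sorted(branches, key=lambda branch: branch["samples"], reverse=True)
    return ranked[:top_k]


# Made-up branches using the documented keys.
branches = [
    {"level": 2, "path": [], "class": 0, "prob": 0.98, "samples": 120},
    {"level": 3, "path": [], "class": 1, "prob": 0.91, "samples": 540},
    {"level": 1, "path": [], "class": 1, "prob": 0.75, "samples": 60},
]

top_two = top_k_branches(branches, top_k=2)
print([branch["samples"] for branch in top_two])
```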
- get_top_features(top_k=10)#
Get list of top features of the best student, sorted by the number of samples the feature is used to classify.
- Parameters:
top_k (int, default=10) – Number of top-k features, sorted by number of samples per branch, to return.
- Returns:
top_features – List of top-k features from top student model.
dict: {“<feature>(int)”: {“count”: int, “samples”: int}}
- Return type:
array-like of dict
- get_top_nodes(top_k=10)#
Returns list of top nodes of the best student, sorted by the proportion of samples split by each node.
- The proportion of samples is calculated with the following impurity decrease equation:
n_samples * abs(left_impurity - right_impurity)
- Parameters:
top_k (int, default=10) – Number of top-k nodes, sorted by number of samples per branch, to return.
- Returns:
top_nodes – List of top-k nodes from top student model.
- dict: {“idx”: int, “level”: int, “feature”: int, “threshold”: float, “samples”: int, “values”: tuple of int, “gini_split”: tuple of float, “data_split”: tuple of float, “data_split_by_class”: array-like of tuple of float}
- Return type:
array-like of dict
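The ranking equation n_samples * abs(left_impurity - right_impurity) can be worked through on a few toy nodes. In the sketch below, the node values are invented and "gini_split" is assumed to hold the (left, right) impurities; split_score is a hypothetical helper name.

```python
def split_score(n_samples, left_impurity, right_impurity):
    """Evaluate n_samples * abs(left_impurity - right_impurity)."""
    return n_samples * abs(left_impurity - right_impurity)


# Made-up nodes; "gini_split" is assumed to be (left_impurity, right_impurity).
nodes = [
    {"idx": 0, "samples": 100, "gini_split": (0.5, 0.0)},
    {"idx": 1, "samples": 400, "gini_split": (0.5, 0.25)},
    {"idx": 2, "samples": 1000, "gini_split": (0.375, 0.375)},
]

# Rank nodes from the largest score to the smallest.
ranked = sorted(
    nodes,
    key=lambda node: split_score(node["samples"], *node["gini_split"]),
    reverse=True,
)
print([node["idx"] for node in ranked])
```

Note how the node with the most samples ranks last here: its two children are equally impure, so the equation evaluates to zero.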
- get_top_students()#
Get list of top (students, reward) obtained during the outer-loop process.
- Returns:
top_students – List with top student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter,) of tuple (dt, reward)
- prune(top_k=10, max_impurity=0.1)#
Prunes and returns the best student model explanation from the list of _students_by_iter.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
max_impurity (float, default=0.10) – Maximum impurity allowed in a branch. Will prune anything below that impurity level.
- Returns:
top_k_pruned_student – Top-k pruned best fitted student model.
- Return type:
{DecisionTreeClassifier, DecisionTreeRegressor}
- class trustee.main.RegressionTrustee(expert, logger=None)#
Bases: Trustee
Implements the Trust-oriented Decision Tree Extraction (Trustee) algorithm to train a student DecisionTreeRegressor based on observations from an Expert regression model.
- __init__(expert, logger=None)#
Regression Trustee constructor
- Parameters:
expert (object) – The ML blackbox model to analyze. The expert model must implement a predict method for Trustee to work properly, unless a different method is named through the predict_method_name argument of the fit() method.
logger (Logger object, default=None) – A logger object to log messages to. If None, the print() function is used to log messages.
- explain(top_k=10)#
Returns explainable model that best imitates Expert model, based on highest mean agreement and highest fidelity.
- Returns:
top_student – (dt, pruned_dt, agreement, reward)
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
tuple
- fit(X, y, top_k=10, max_leaf_nodes=None, max_depth=None, ccp_alpha=0.0, train_size=0.7, num_iter=50, num_stability_iter=5, num_samples=2000, samples_size=None, use_features=None, predict_method_name='predict', optimization='fidelity', aggregate=True, verbose=False)#
Trains a student DecisionTreeRegressor to imitate the Expert model.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to a pandas DataFrame.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series.
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure.
ccp_alpha (float, default=0.0) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning here for details: https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
train_size (float or int, default=0.7) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
num_iter (int, default=50) – Number of iterations to repeat Trustee inner-loop for.
num_stability_iter (int, default=5) – Number of iterations to repeat the Trustee stabilization outer-loop for.
num_samples (int, default=2000) – The absolute number of samples to fetch from the training dataset split to train the student decision tree model. If the samples_size argument is provided, this argument is ignored.
samples_size (float, default=None) – The fraction of the training dataset to use to train the student decision tree model. If None, the absolute number given by num_samples is used instead.
use_features (array-like, default=None) – Array-like of integers representing the indexes of features from the X training samples. If not None, only the features indicated by the provided indexes will be used to train the student decision tree model.
predict_method_name (str, default="predict") – The method interface to use to get predictions from the expert model. If no value is passed, the default predict interface is used.
optimization ({"fidelity", "accuracy"}, default="fidelity") – The comparison criteria to optimize the decision tree students in Trustee inner-loop. Used for ablation study only.
aggregate (bool, default=True) – Boolean indicating whether dataset aggregation should be used in Trustee inner-loop. Used for ablation study only.
verbose (bool, default=False) – Boolean indicating whether to log messages.
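The interplay between num_samples and samples_size can be sketched as follows. The resolve_num_samples helper is hypothetical, and both the rounding and the cap at the size of the training split are assumptions rather than documented Trustee behavior.

```python
def resolve_num_samples(n_train, num_samples=2000, samples_size=None):
    """Return how many training rows to sample (illustrative, not Trustee's code)."""
    if samples_size is not None:
        # A fraction was given, so the absolute num_samples is ignored.
        if not 0.0 < samples_size <= 1.0:
            raise ValueError("samples_size must be in (0, 1]")
        # Assumption: round to the nearest whole row.
        return round(n_train * samples_size)
    # Assumption: never request more rows than the training split holds.
    return min(num_samples, n_train)


print(resolve_num_samples(10_000))                    # falls back to num_samples
print(resolve_num_samples(1_000, samples_size=0.5))   # fraction takes precedence
print(resolve_num_samples(500))                       # capped by the split size
```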
- get_all_students()#
Get list of all (student, reward) obtained during the inner-loop process.
- Returns:
students_by_iter – Matrix with all student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter, num_iter) of tuple (dt, reward)
- get_leaves_by_level()#
Returns number of leaves by level of the best student.
- Returns:
leaves_by_level – Dict of leaves by level. {“<level>(int)”: <leaves>(int)}
- Return type:
dict of int
- get_n_classes()#
Returns number of classes used in the top student model.
- Returns:
n_classes – Number of classes outputted in top student model.
- Return type:
int
- get_n_features()#
Returns number of features used in the top student model.
- Returns:
n_features – Number of features used in top student model.
- Return type:
int
- get_samples_by_level()#
Get number of samples by level of the best student.
- Returns:
samples_by_level – Dict of samples by level. {“<level>(int)”: <samples>(int)}
- Return type:
dict of int
- get_samples_sum()#
Get the sum of all samples in all non-leaf nodes in the best student model.
- Returns:
samples_sum – Sum of all samples covered by non-leaf nodes in top student model.
- Return type:
int
- get_stable(top_k=10, threshold=0.9, sort=True)#
Filters out explanations from Trustee stability analysis with less than threshold agreement.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
threshold (float, default=0.9) – Remove any student decision tree explanation if their mean agreement goes below given threshold. To keep all students regardless of mean agreement, pass 0.
sort (bool, default=True) – Boolean indicating whether to sort returned stable student explanation based on mean agreement.
- Returns:
stable_explanations – [(dt, pruned_dt, agreement, reward), …]
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of tuple
- get_top_branches(top_k=10)#
Returns list of top-k branches of the best student, sorted by the number of samples the branch classifies.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
- Returns:
top_branches – List of top-k branches from top student model. Each branch is a dict of the following form.
dict: { “level”: int, “path”: array-like of dict, “class”: int, “prob”: float, “samples”: int}
- Return type:
array-like of dict
- get_top_features(top_k=10)#
Get list of top features of the best student, sorted by the number of samples the feature is used to classify.
- Parameters:
top_k (int, default=10) – Number of top-k features, sorted by number of samples per branch, to return.
- Returns:
top_features – List of top-k features from top student model.
dict: {“<feature>(int)”: {“count”: int, “samples”: int}}
- Return type:
array-like of dict
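To show how the per-feature structure above can be ranked, the sketch below sorts a made-up mapping of feature index to {"count", "samples"} by the "samples" value. The top_k_features helper is hypothetical and returns (feature, stats) pairs for readability, which may differ from the exact container Trustee returns.

```python
def top_k_features(features, top_k=10):
    """Rank features by the number of samples they help classify (illustrative)."""
    ranked = sorted(
        features.items(),
        key=lambda item: item[1]["samples"],
        reverse=True,
    )
    return ranked[:top_k]


# Made-up statistics keyed by feature index.
features = {
    3: {"count": 2, "samples": 900},
    0: {"count": 5, "samples": 1500},
    7: {"count": 1, "samples": 300},
}

top_two = top_k_features(features, top_k=2)
print([feature for feature, _ in top_two])
```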
- get_top_nodes(top_k=10)#
Returns list of top nodes of the best student, sorted by the proportion of samples split by each node.
- The proportion of samples is calculated with the following impurity decrease equation:
n_samples * abs(left_impurity - right_impurity)
- Parameters:
top_k (int, default=10) – Number of top-k nodes, sorted by number of samples per branch, to return.
- Returns:
top_nodes – List of top-k nodes from top student model.
- dict: {“idx”: int, “level”: int, “feature”: int, “threshold”: float, “samples”: int, “values”: tuple of int, “gini_split”: tuple of float, “data_split”: tuple of float, “data_split_by_class”: array-like of tuple of float}
- Return type:
array-like of dict
- get_top_students()#
Get list of top (students, reward) obtained during the outer-loop process.
- Returns:
top_students – List with top student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter,) of tuple (dt, reward)
- prune(top_k=10, max_impurity=0.1)#
Prunes and returns the best student model explanation from the list of _students_by_iter.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
max_impurity (float, default=0.10) – Maximum impurity allowed in a branch. Will prune anything below that impurity level.
- Returns:
top_k_pruned_student – Top-k pruned best fitted student model.
- Return type:
{DecisionTreeClassifier, DecisionTreeRegressor}
- class trustee.main.Trustee(expert, student_class, logger=None)#
Bases: ABC
Base implementation of the Trust-oriented Decision Tree Extraction (Trustee) algorithm to train a student model based on observations from an Expert model.
- __init__(expert, student_class, logger=None)#
Trustee constructor.
- Parameters:
expert (object) – The ML blackbox model to analyze. The expert model must implement a predict method for Trustee to work properly, unless a different method is named through the predict_method_name argument of the fit() method.
student_class (Class) – Class of student to train based on blackbox model predictions. The given class must implement a fit() and a predict() method interface for Trustee to work properly. The current implementation has been tested using the DecisionTreeClassifier and DecisionTreeRegressor from scikit-learn.
logger (Logger object, default=None) – A logger object to log messages to. If None, the print() function is used to log messages.
- explain(top_k=10)#
Returns explainable model that best imitates Expert model, based on highest mean agreement and highest fidelity.
- Returns:
top_student – (dt, pruned_dt, agreement, reward)
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
tuple
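The returned tuple can be unpacked directly into its four parts. The sketch below also shows one way to pick a top explanation "based on highest mean agreement and highest fidelity"; ranking by agreement first and breaking ties by fidelity is an assumption, and the Explanation type and candidate values are invented for this example.

```python
from typing import NamedTuple


class Explanation(NamedTuple):
    """Hypothetical container mirroring (dt, pruned_dt, agreement, reward)."""

    dt: str          # stands in for the unconstrained student model
    pruned_dt: str   # stands in for the top-k pruned student model
    agreement: float
    reward: float


candidates = [
    Explanation("dt_a", "pruned_a", 0.95, 0.90),
    Explanation("dt_b", "pruned_b", 0.95, 0.97),
    Explanation("dt_c", "pruned_c", 0.80, 0.99),
]

# Assumption: rank by mean agreement first, then break ties by fidelity.
top_student = max(candidates, key=lambda exp: (exp.agreement, exp.reward))

# Unpack in the documented order.
dt, pruned_dt, agreement, reward = top_student
print(dt, pruned_dt, agreement, reward)
```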
- fit(X, y, top_k=10, max_leaf_nodes=None, max_depth=None, ccp_alpha=0.0, train_size=0.7, num_iter=50, num_stability_iter=5, num_samples=2000, samples_size=None, use_features=None, predict_method_name='predict', optimization='fidelity', aggregate=True, verbose=False)#
Trains a student decision tree to imitate the Expert model.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to a pandas DataFrame.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series.
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure.
ccp_alpha (float, default=0.0) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning here for details: https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
train_size (float or int, default=0.7) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
num_iter (int, default=50) – Number of iterations to repeat Trustee inner-loop for.
num_stability_iter (int, default=5) – Number of iterations to repeat the Trustee stabilization outer-loop for.
num_samples (int, default=2000) – The absolute number of samples to fetch from the training dataset split to train the student decision tree model. If the samples_size argument is provided, this argument is ignored.
samples_size (float, default=None) – The fraction of the training dataset to use to train the student decision tree model. If None, the absolute number given by num_samples is used instead.
use_features (array-like, default=None) – Array-like of integers representing the indexes of features from the X training samples. If not None, only the features indicated by the provided indexes will be used to train the student decision tree model.
predict_method_name (str, default="predict") – The method interface to use to get predictions from the expert model. If no value is passed, the default predict interface is used.
optimization ({"fidelity", "accuracy"}, default="fidelity") – The comparison criteria to optimize the decision tree students in Trustee inner-loop. Used for ablation study only.
aggregate (bool, default=True) – Boolean indicating whether dataset aggregation should be used in Trustee inner-loop. Used for ablation study only.
verbose (bool, default=False) – Boolean indicating whether to log messages.
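The float-or-int behavior of train_size can be summarized with a small sketch. The resolve_train_count helper is hypothetical, and rounding the fractional case down is an assumption rather than documented Trustee behavior.

```python
def resolve_train_count(n_rows, train_size=0.7):
    """Translate `train_size` into an absolute number of training rows."""
    if isinstance(train_size, bool):
        raise TypeError("train_size must be a float or an int, not a bool")
    if isinstance(train_size, int):
        # An int is taken as the absolute number of training samples.
        if not 0 < train_size < n_rows:
            raise ValueError("int train_size must be between 1 and n_rows - 1")
        return train_size
    # A float is taken as the proportion of the dataset used for training.
    if not 0.0 < train_size < 1.0:
        raise ValueError("float train_size must be strictly between 0.0 and 1.0")
    # Assumption: round down to a whole number of rows.
    return int(n_rows * train_size)


print(resolve_train_count(1_000, train_size=0.5))   # proportion of the dataset
print(resolve_train_count(1_000, train_size=300))   # absolute number of rows
```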
- get_all_students()#
Get list of all (student, reward) obtained during the inner-loop process.
- Returns:
students_by_iter – Matrix with all student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter, num_iter) of tuple (dt, reward)
- get_leaves_by_level()#
Returns number of leaves by level of the best student.
- Returns:
leaves_by_level – Dict of leaves by level. {“<level>(int)”: <leaves>(int)}
- Return type:
dict of int
- get_n_classes()#
Returns number of classes used in the top student model.
- Returns:
n_classes – Number of classes outputted in top student model.
- Return type:
int
- get_n_features()#
Returns number of features used in the top student model.
- Returns:
n_features – Number of features used in top student model.
- Return type:
int
- get_samples_by_level()#
Get number of samples by level of the best student.
- Returns:
samples_by_level – Dict of samples by level. {“<level>(int)”: <samples>(int)}
- Return type:
dict of int
- get_samples_sum()#
Get the sum of all samples in all non-leaf nodes in the best student model.
- Returns:
samples_sum – Sum of all samples covered by non-leaf nodes in top student model.
- Return type:
int
- get_stable(top_k=10, threshold=0.9, sort=True)#
Filters out explanations from Trustee stability analysis with less than threshold agreement.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
threshold (float, default=0.9) – Remove any student decision tree explanation if their mean agreement goes below given threshold. To keep all students regardless of mean agreement, pass 0.
sort (bool, default=True) – Boolean indicating whether to sort returned stable student explanation based on mean agreement.
- Returns:
stable_explanations – [(dt, pruned_dt, agreement, reward), …]
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- pruned_dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Top-k pruned fitted student model.
- agreement: float
Mean agreement of pruned student model with respect to others.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of tuple
- get_top_branches(top_k=10)#
Returns list of top-k branches of the best student, sorted by the number of samples the branch classifies.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
- Returns:
top_branches – List of top-k branches from top student model. Each branch is a dict of the following form.
dict: { “level”: int, “path”: array-like of dict, “class”: int, “prob”: float, “samples”: int}
- Return type:
array-like of dict
- get_top_features(top_k=10)#
Get list of top features of the best student, sorted by the number of samples the feature is used to classify.
- Parameters:
top_k (int, default=10) – Number of top-k features, sorted by number of samples per branch, to return.
- Returns:
top_features – List of top-k features from top student model.
dict: {“<feature>(int)”: {“count”: int, “samples”: int}}
- Return type:
array-like of dict
- get_top_nodes(top_k=10)#
Returns list of top nodes of the best student, sorted by the proportion of samples split by each node.
- The proportion of samples is calculated with the following impurity decrease equation:
n_samples * abs(left_impurity - right_impurity)
- Parameters:
top_k (int, default=10) – Number of top-k nodes, sorted by number of samples per branch, to return.
- Returns:
top_nodes – List of top-k nodes from top student model.
- dict: {“idx”: int, “level”: int, “feature”: int, “threshold”: float, “samples”: int, “values”: tuple of int, “gini_split”: tuple of float, “data_split”: tuple of float, “data_split_by_class”: array-like of tuple of float}
- Return type:
array-like of dict
- get_top_students()#
Get list of top (students, reward) obtained during the outer-loop process.
- Returns:
top_students – List with top student models trained during fit().
- dt: {DecisionTreeClassifier, DecisionTreeRegressor}
Unconstrained fitted student model.
- reward: float
Fidelity of student model to the expert model.
- Return type:
array-like of shape (num_stability_iter,) of tuple (dt, reward)
- prune(top_k=10, max_impurity=0.1)#
Prunes and returns the best student model explanation from the list of _students_by_iter.
- Parameters:
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to return.
max_impurity (float, default=0.10) – Maximum impurity allowed in a branch. Will prune anything below that impurity level.
- Returns:
top_k_pruned_student – Top-k pruned best fitted student model.
- Return type:
{DecisionTreeClassifier, DecisionTreeRegressor}
Trust Report
The module that implements Trust Reports
- class trustee.report.trust.TrustReport(blackbox, X=None, y=None, X_train=None, X_test=None, y_train=None, y_test=None, max_iter=10, num_pruning_iter=10, train_size=0.7, predict_method_name='predict', trustee_num_iter=50, trustee_num_stability_iter=10, trustee_sample_size=0.5, trustee_max_leaf_nodes=None, trustee_max_depth=None, trustee_ccp_alpha=0.0, analyze_branches=False, analyze_stability=False, skip_retrain=False, top_k=10, logger=None, verbose=False, class_names=None, feature_names=None, is_classify=True)#
Bases: object
Class to generate Trust Report.
- __init__(blackbox, X=None, y=None, X_train=None, X_test=None, y_train=None, y_test=None, max_iter=10, num_pruning_iter=10, train_size=0.7, predict_method_name='predict', trustee_num_iter=50, trustee_num_stability_iter=10, trustee_sample_size=0.5, trustee_max_leaf_nodes=None, trustee_max_depth=None, trustee_ccp_alpha=0.0, analyze_branches=False, analyze_stability=False, skip_retrain=False, top_k=10, logger=None, verbose=False, class_names=None, feature_names=None, is_classify=True)#
Builds Trust Report for given blackbox model using the Trustee method to extract whitebox explanations as Decision Trees.
- Parameters:
blackbox (object) – The ML blackbox model to analyze. The blackbox model must implement a predict method for Trustee to work properly, unless a different method is named through the predict_method_name argument.
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to a pandas DataFrame. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
X_train ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to a pandas DataFrame. Use this argument if a fixed train-test split is to be used. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
X_test ({array-like, sparse matrix} of shape (n_samples, n_features)) – The test input samples. Internally, it will be converted to a pandas DataFrame. Use this argument if a fixed train-test split is to be used. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
y_train (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X_train (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series. Use this argument if a fixed train-test split is to be used. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
y_test (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values for X_test (class labels in classification, real numbers in regression). Internally, it will be converted to a pandas Series. Use this argument if a fixed train-test split is to be used. Either (X, y) or (X_train, X_test, y_train, y_test) must be provided.
max_iter (int, default=10) – Number of iterations to repeat several analyses in the Trust Report, including feature removal and branch analysis.
num_pruning_iter (int, default=10) – Number of iterations to repeat the pruning analysis.
train_size (float or int, default=0.7) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
predict_method_name (str, default="predict") – The method interface to use to get predictions from the expert model. If no value is passed, the default predict interface is used.
trustee_num_iter (int, default=50) – Number of iterations to repeat Trustee inner-loop for.
trustee_num_stability_iter (int, default=5) – Number of iterations to repeat the Trustee stabilization outer-loop for.
trustee_samples_size (float, default=None) – The fraction of the training dataset to use to train the student decision tree model. If None, the value is set automatically to the provided num_samples value.
trustee_max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
trustee_max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure.
trustee_ccp_alpha (float, default=0.0) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning here for details: https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
analyze_branches (bool, default=False) – Boolean indicating whether to perform the Trust Report branch analysis of Trustee explanations.
analyze_stability (bool, default=False) – Boolean indicating whether to perform the Trust Report stability analysis of Trustee explanations.
skip_retrain (bool, default=False) – Boolean indicating whether the Trust Report should skip retraining the given blackbox model. Retraining is used to evaluate the impact of each feature in training by iteratively removing top features. It works well for scikit-explain models, but can be troublesome for other libraries (especially AutoGluon).
top_k (int, default=10) – Number of top-k branches, sorted by number of samples per branch, to keep after finding decision tree with highest fidelity.
verbose (bool, default=False) – Boolean indicating whether to log messages.
logger (Logger object, default=None) – A logger object to log messages to. If none is given, the print() method will be used to log messages.
class_names (array-like of str, default=None) – List of class names to use when plotting decision trees and graphs.
feature_names (array-like of str, default=None) – List of feature names to use when plotting decision trees and graphs.
is_classify (bool, default=True) – Whether the given blackbox is a classifier or a regressor. The generated plots change depending on the chosen value.
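The data arguments above follow two rules: either (X, y) or the four-part fixed split must be given, and train_size is read as a proportion when it is a float and as an absolute count when it is an int. The sketch below illustrates those rules only; it is not Trustee's implementation. The helper names resolve_split_mode and resolve_train_count are hypothetical, as are the choice to prefer the fixed split when both forms are passed and the use of floor rounding for float values.

```python
import math


def resolve_split_mode(X=None, y=None, X_train=None, X_test=None,
                       y_train=None, y_test=None):
    """Return "fixed" for a pre-made split, "auto" for (X, y); raise otherwise."""
    fixed_parts = (X_train, X_test, y_train, y_test)
    # Assumption: a complete fixed split takes precedence over (X, y).
    if all(part is not None for part in fixed_parts):
        return "fixed"
    if X is not None and y is not None:
        return "auto"
    raise ValueError(
        "Either (X, y) or (X_train, X_test, y_train, y_test) must be provided."
    )


def resolve_train_count(n_samples, train_size=0.7):
    """Translate train_size into an absolute number of training samples."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(train_size, bool):
        raise TypeError("train_size must be a float or an int")
    if isinstance(train_size, int):
        # int: absolute number of train samples
        if not 0 < train_size < n_samples:
            raise ValueError("int train_size must be in (0, n_samples)")
        return train_size
    # float: proportion of the dataset (floor rounding is an assumption)
    if not 0.0 < train_size < 1.0:
        raise ValueError("float train_size must be between 0.0 and 1.0")
    return math.floor(train_size * n_samples)


if __name__ == "__main__":
    print(resolve_split_mode(X=[[0.1], [0.2]], y=[0, 1]))
    print(resolve_train_count(200, 0.75))
    print(resolve_train_count(200, 50))
```

Passing a fixed split is mainly useful when the expert model was trained on a specific partition and the report should be evaluated on the same held-out samples.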
- classmethod load(path)#
Load the Trust Report from a file.
- Parameters:
path (str) – The path to the file.
- Returns:
report – Loaded Trust Report from file.
- Return type:
- plot(output_dir, aggregate=False)#
Plot the analysis results.
- Parameters:
output_dir (str) – The output directory to save the plots.
aggregate (bool, default=False) – Whether to attempt to aggregate most important features based on the values seen in the data and branches. Does not always work properly, but can be useful to analyze the given dataset.
- save(output_dir, aggregate=False, save_all_dts=False)#
Save the report and plots to the output directory.
- Parameters:
output_dir (str) – The output directory to save the report and plots.
aggregate (bool, default=False) – Whether to attempt to aggregate most important features based on the values seen in the data and branches. Does not always work properly, but can be useful to analyze the given dataset.
save_all_dts (bool, default=False) – Whether to save all generated decision trees or just the main explanation.
- step#
Used for progress bar.
- total_steps#
Total number of steps for the progress bar: _prepare_data (1) + _collect_blackbox (1) + _collect_trustee (1) + _collect_top_k_prunning (1) + _collect_ccp_prunning (num_pruning_iter) + _collect_max_depth_prunning (num_pruning_iter) + _collect_max_leaves_prunning (num_pruning_iter) + _collect_features_iter_removal (max_iter). If analyze_branches is set, total_steps += _collect_branch_analysis (num_leaves = trustee_max_leaf_nodes, or a guess of 100).
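The step count described above is plain arithmetic over the constructor arguments, so it can be reproduced in a few lines to estimate how long a report will run. This is a minimal sketch that follows the documented formula; the function name estimate_total_steps is hypothetical and is not part of Trustee.

```python
def estimate_total_steps(num_pruning_iter=10, max_iter=10,
                         analyze_branches=False, trustee_max_leaf_nodes=None):
    """Estimate the progress-bar total from the documented formula."""
    # One step each: _prepare_data, _collect_blackbox,
    # _collect_trustee, _collect_top_k_prunning.
    single_steps = 4
    # ccp, max-depth and max-leaves pruning each run num_pruning_iter times.
    pruning_steps = 3 * num_pruning_iter
    # Iterative feature removal runs max_iter times.
    removal_steps = max_iter
    total = single_steps + pruning_steps + removal_steps
    if analyze_branches:
        # One step per leaf; falls back to a guess of 100 when unset.
        total += trustee_max_leaf_nodes or 100
    return total


if __name__ == "__main__":
    print(estimate_total_steps())                       # 44
    print(estimate_total_steps(analyze_branches=True))  # 144
```

With the default arguments the pruning analyses account for 30 of the 44 steps, so lowering num_pruning_iter is the most direct way to shorten a run.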