Gradient Boosted Regression Trees (sklearn implementation)
Gradient Boosting Regression model (using sklearn).
Use mlr_model_type: gbr_sklearn to use this MLR model in the recipe.
Classes:
SklearnGBRModel(input_datasets, **kwargs) – Gradient Boosting Regression model (sklearn implementation).
- class esmvaltool.diag_scripts.mlr.models.gbr_sklearn.SklearnGBRModel(input_datasets, **kwargs)
Bases: GBRModel
Gradient Boosting Regression model (sklearn implementation).
Attributes:
Categorical features.
Input data of the MLR model.
Features of the input data.
Features of the input data after preprocessing.
Types of the features.
Units of the features.
Keyword arguments for fit().
Group attributes of the input data.
Label of the input data.
Units of the label.
MLR model type.
Numerical features.
Parameters of the complete MLR model pipeline.
Random state instance.
Methods:
create(mlr_model_type, *args, **kwargs) – Create desired MLR model subclass (factory method).
efecv(**kwargs) – Perform exhaustive feature elimination using cross-validation.
export_prediction_data([filename]) – Export all prediction data contained in self._data.
export_training_data([filename]) – Export all training data contained in self._data.
fit() – Fit MLR model.
get_ancestors([label, features, ...]) – Return ancestor files.
get_data_frame(data_type[, impute_nans]) – Return data frame of specified type.
get_x_array(data_type[, impute_nans]) – Return x data of specific type.
get_y_array(data_type[, impute_nans]) – Return y data of specific type.
grid_search_cv(param_grid, **kwargs) – Perform exhaustive parameter search using cross-validation.
plot_1d_model([filename, n_points]) – Plot lineplot that represents the MLR model.
plot_feature_importance([filename, color_coded]) – Plot feature importance.
plot_partial_dependences([filename]) – Plot partial dependences for every feature.
plot_prediction_errors([filename]) – Plot predicted vs. true values.
plot_residuals([filename]) – Plot residuals of training and test (if available) data.
plot_residuals_distribution([filename]) – Plot distribution of residuals of training and test data (KDE).
plot_residuals_histogram([filename]) – Plot histogram of residuals of training and test data.
plot_scatterplots([filename]) – Plot scatterplots label vs. feature for every feature.
plot_training_progress([filename]) – Plot training progress for training and (if possible) test data.
predict([save_mlr_model_error, ...]) – Perform prediction using the MLR model(s) and write *.nc files.
print_correlation_matrices() – Print correlation matrices for all datasets.
print_regression_metrics([logo]) – Print all available regression metrics for training data.
register_mlr_model(mlr_model_type) – Add MLR model (subclass of this class) (decorator).
reset_pipeline() – Reset regressor pipeline.
rfecv(**kwargs) – Perform recursive feature elimination using cross-validation.
test_normality_of_residuals() – Perform Shapiro-Wilk test for normality of residuals.
update_parameters(**params) – Update parameters of the whole pipeline.
- property categorical_features
Categorical features.
- Type:
- classmethod create(mlr_model_type, *args, **kwargs)
Create desired MLR model subclass (factory method).
- efecv(**kwargs)
Perform exhaustive feature elimination using cross-validation.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().
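Example: a rough illustration of what exhaustive feature elimination means. The sketch scores every non-empty feature subset with cross-validation and keeps the best one. It is not the ESMValTool implementation: sklearn's plain cross_val_score stands in for cross_val_score_weighted(), and the data, the feature names, and the LinearRegression estimator are assumptions made for this example.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): only features "a" and "b" influence y.
rng = np.random.default_rng(0)
feature_names = ["a", "b", "noise"]
x_data = rng.normal(size=(60, 3))
y_data = 2.0 * x_data[:, 0] - 1.0 * x_data[:, 1] + 0.01 * rng.normal(size=60)

# Score every non-empty subset of features with 5-fold cross-validation.
scores = {}
for n_features in range(1, len(feature_names) + 1):
    for subset in combinations(range(len(feature_names)), n_features):
        cv_scores = cross_val_score(
            LinearRegression(), x_data[:, list(subset)], y_data, cv=5
        )
        scores[subset] = cv_scores.mean()

# Keep the subset with the highest mean cross-validation score.
best_subset = max(scores, key=scores.get)
best_features = [feature_names[idx] for idx in best_subset]
print(best_features)
```

The number of subsets grows as 2^n - 1 with the number of features n, so an exhaustive search of this kind is only practical for small feature sets.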
- export_prediction_data(filename=None)
Export all prediction data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.
- export_training_data(filename=None)
Export all training data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.
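Example: the default file names contain placeholders in curly braces. The sketch below shows how such templates could expand, assuming str.format-style substitution. The data type and prediction name values are hypothetical and only serve the illustration.

```python
# Default templates quoted from the parameter descriptions above.
prediction_template = "{data_type}_{pred_name}.csv"
training_template = "{data_type}.csv"

# Hypothetical values; the real data types and prediction names depend on
# the recipe and are not specified here.
prediction_file = prediction_template.format(
    data_type="pred_input", pred_name="my_prediction"
)
training_file = training_template.format(data_type="train")

print(prediction_file)  # pred_input_my_prediction.csv
print(training_file)    # train.csv
```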
- property features
Features of the input data.
- Type:
- property features_after_preprocessing
Features of the input data after preprocessing.
- Type:
- property features_types
Types of the features.
- Type:
- property features_units
Units of the features.
- Type:
- fit()
Fit MLR model.
Note
Specifying keyword arguments for this function is not allowed here since features_after_preprocessing might be altered by that. Use the keyword argument fit_kwargs during class initialization instead.
- get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)
Return ancestor files.
- Parameters:
label (bool, optional (default: True)) – Return label files.
features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.
prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.
prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.
- Returns:
Ancestor files.
- Return type:
- Raises:
ValueError – Invalid feature or prediction_name given.
- get_data_frame(data_type, impute_nans=False)
Return data frame of specified type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_x_array(data_type, impute_nans=False)
Return x data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_y_array(data_type, impute_nans=False)
Return y data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- grid_search_cv(param_grid, **kwargs)
Perform exhaustive parameter search using cross-validation.
- Parameters:
param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. The names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
**kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.
- Raises:
ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.
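Example: a minimal sketch of a param_grid in the s__p form, used directly with sklearn.model_selection.GridSearchCV on a plain sklearn pipeline. The step names scaler and regressor, the parameter ranges, and the synthetic data are assumptions for illustration; the step names of the actual MLR pipeline may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
x_data = rng.normal(size=(80, 2))
y_data = x_data[:, 0] ** 2 + x_data[:, 1]

# Hypothetical two-step pipeline; the step names are chosen for this sketch.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])

# Keys follow the s__p convention: parameter p of step s.
param_grid = {
    "regressor__n_estimators": [10, 30],
    "regressor__max_depth": [1, 3],
}

# Every combination of the listed values is evaluated with 3-fold CV.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(x_data, y_data)
print(search.best_params_)
```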
- property group_attributes
Group attributes of the input data.
- Type:
- property numerical_features
Numerical features.
- Type:
- plot_1d_model(filename=None, n_points=1000)
Plot lineplot that represents the MLR model.
Note
This only works for a model with a single feature.
- Parameters:
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – MLR model is built from more than 1 feature.
- plot_feature_importance(filename=None, color_coded=True)
Plot feature importance.
The importance of a feature is derived from properties of the GBR model: the number of times the feature appears in the regression trees and the improvements made by the individual splits (see Friedman, 2001).
Note
The features plotted here are not necessarily the real input features, but the ones after preprocessing.
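Example: sklearn's GradientBoostingRegressor exposes importances of this kind through its feature_importances_ attribute. The sketch below fits a small model on synthetic data and prints the values. Whether this method plots exactly this attribute is an assumption, and the feature names x0 to x2 are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x_data = rng.normal(size=(200, 3))
# Only the first column drives the target (assumption for the sketch).
y_data = 3.0 * x_data[:, 0] + 0.1 * rng.normal(size=200)

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbr.fit(x_data, y_data)

# Importances are normalized so that they sum to 1.
importances = gbr.feature_importances_
for name, importance in zip(["x0", "x1", "x2"], importances):
    print(f"{name}: {importance:.3f}")
```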
- plot_partial_dependences(filename=None)
Plot partial dependences for every feature.
- Parameters:
filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_prediction_errors(filename=None)
Plot predicted vs. true values.
- Parameters:
filename (str, optional (default: 'prediction_errors')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals(filename=None)
Plot residuals of training and test (if available) data.
- Parameters:
filename (str, optional (default: 'residuals')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_distribution(filename=None)
Plot distribution of residuals of training and test data (KDE).
- Parameters:
filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_histogram(filename=None)
Plot histogram of residuals of training and test data.
- Parameters:
filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_scatterplots(filename=None)
Plot scatterplots label vs. feature for every feature.
- Parameters:
filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_training_progress(filename=None)
Plot training progress for training and (if possible) test data.
- Parameters:
filename (str, optional (default: 'training_progress')) – Name of the plot file.
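Example: one way to obtain a training progress curve for a gradient boosting model is sklearn's staged_predict, which yields the prediction after each boosting iteration. The sketch below computes the mean squared error per iteration for training and test data. The data, the split, and the choice of error measure are assumptions; the quantity actually plotted by this method may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_data = rng.uniform(-3.0, 3.0, size=(200, 1))
y_data = np.sin(x_data[:, 0]) + 0.1 * rng.normal(size=200)
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.25, random_state=0
)

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbr.fit(x_train, y_train)

# staged_predict yields the prediction after each boosting iteration.
train_errors = [
    mean_squared_error(y_train, y_pred)
    for y_pred in gbr.staged_predict(x_train)
]
test_errors = [
    mean_squared_error(y_test, y_pred)
    for y_pred in gbr.staged_predict(x_test)
]
print(train_errors[0], train_errors[-1])
```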
- predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)
Perform prediction using the MLR model(s) and write *.nc files.
- Parameters:
save_mlr_model_error (str or int, optional) – Additionally saves estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (the latter are taken into account by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. Only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. Only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.
save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).
save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.
**kwargs (keyword arguments, optional) – Additional options for the final regressor's predict() function.
- Raises:
RuntimeError – return_var and return_cov are both set to True.
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – An invalid value for save_mlr_model_error is given.
ValueError – save_propagated_errors is True and no prediction_input_error data is available.
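Example: a minimal sketch of estimating a constant model error as RMSEP (root mean squared error of prediction) with n-fold cross-validation, corresponding to an integer value of save_mlr_model_error. It uses sklearn's cross_val_predict on synthetic data. How ESMValTool computes the value internally is not shown in this reference, so the details below are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
x_data = rng.normal(size=(100, 2))
y_data = x_data[:, 0] - 2.0 * x_data[:, 1] + 0.1 * rng.normal(size=100)

n_folds = 5  # plays the role of the integer n in save_mlr_model_error
cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Each sample is predicted by a model that did not see it during training.
y_pred = cross_val_predict(gbr, x_data, y_data, cv=cv)
squared_error = float(np.mean((y_pred - y_data) ** 2))
rmsep = squared_error ** 0.5
print(f"RMSEP: {rmsep:.3f}, squared error: {squared_error:.3f}")
```

The 'test' and 'logo' options differ only in how the held-out samples are chosen: a single hold-out set in the first case, one group at a time in the second.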
- print_correlation_matrices()
Print correlation matrices for all datasets.
- print_regression_metrics(logo=False)
Print all available regression metrics for training data.
- Parameters:
logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.
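Example: a short sketch of leave-one-group-out cross-validation with sklearn.model_selection.LeaveOneGroupOut. Each split holds out all samples of one group. The group labels, the data, and the LinearRegression estimator (used here only to keep the example fast) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
x_data = rng.normal(size=(30, 1))
y_data = 2.0 * x_data[:, 0] + 0.05 * rng.normal(size=30)
# Hypothetical group labels, e.g. one label per contributing dataset.
groups = np.repeat(["model_a", "model_b", "model_c"], 10)

# One split per group: train on two groups, score on the third.
logo = LeaveOneGroupOut()
scores = cross_val_score(
    LinearRegression(), x_data, y_data, groups=groups, cv=logo
)
print(logo.get_n_splits(groups=groups), scores)
```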
- property random_state
Random state instance.
- Type:
- classmethod register_mlr_model(mlr_model_type)
Add MLR model (subclass of this class) (decorator).
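Example: register_mlr_model() and create() together suggest a registry-based factory. The sketch below shows that general pattern with a hypothetical base class. It is not the ESMValTool source code; the class bodies, the _MODELS dictionary, and the error type are assumptions.

```python
class MLRModel:
    """Hypothetical base class that keeps a registry of its subclasses."""

    _MODELS = {}

    def __init__(self, input_datasets, **kwargs):
        self.input_datasets = input_datasets
        self.kwargs = kwargs

    @classmethod
    def register_mlr_model(cls, mlr_model_type):
        """Return a class decorator that registers a subclass under a name."""
        def decorator(subclass):
            cls._MODELS[mlr_model_type] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, mlr_model_type, *args, **kwargs):
        """Instantiate the subclass registered under ``mlr_model_type``."""
        if mlr_model_type not in cls._MODELS:
            # The error type is an arbitrary choice for this sketch.
            raise ValueError(
                f"MLR model type '{mlr_model_type}' is not registered"
            )
        return cls._MODELS[mlr_model_type](*args, **kwargs)


@MLRModel.register_mlr_model("gbr_sklearn")
class SklearnGBRModel(MLRModel):
    """Stand-in for the real model class."""


# The factory looks up the registered name and builds the subclass.
model = MLRModel.create("gbr_sklearn", [], test_size=0.25)
print(type(model).__name__)  # SklearnGBRModel
```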
- reset_pipeline()
Reset regressor pipeline.
- rfecv(**kwargs)
Perform recursive feature elimination using cross-validation.
Note
This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.
- Raises:
RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.
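Example: a minimal sketch of recursive feature elimination with sklearn.feature_selection.RFECV, applied directly to a GradientBoostingRegressor on synthetic data. The data and the settings step=1 and cv=3 are assumptions; in the MLR model the selection runs on the final estimator of the pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
x_data = rng.normal(size=(120, 4))
# Only the first two columns carry signal (assumption for the sketch).
y_data = 3.0 * x_data[:, 0] - 2.0 * x_data[:, 1] + 0.05 * rng.normal(size=120)

# GradientBoostingRegressor provides feature_importances_, which RFECV needs.
estimator = GradientBoostingRegressor(n_estimators=30, random_state=0)
selector = RFECV(estimator, step=1, cv=3)
selector.fit(x_data, y_data)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of kept features
```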
- test_normality_of_residuals()
Perform Shapiro-Wilk test for normality of residuals.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
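Example: a Shapiro-Wilk test can be run with scipy.stats.shapiro. The sketch below applies it to hypothetical residuals drawn from a normal distribution. Whether this method calls SciPy in the same way is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical residuals of some fitted model.
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

statistic, p_value = stats.shapiro(residuals)
print(f"W = {statistic:.3f}, p = {p_value:.3f}")

# A small p-value (e.g. below 0.05) would reject the null hypothesis that
# the residuals are drawn from a normal distribution.
```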
- update_parameters(**params)
Update parameters of the whole pipeline.
Note
Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
- Parameters:
**params (keyword arguments, optional) – Parameters for the pipeline which should be updated.
- Raises:
ValueError – Invalid parameter for pipeline given.
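Example: the s__p naming is the same convention that sklearn pipelines use in set_params(). The sketch below updates one parameter of a plain sklearn pipeline and shows that an unknown name raises a ValueError. The step names scaler and regressor are hypothetical, and update_parameters() may wrap this behavior differently.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline; the step names are chosen for this sketch.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor()),
])

# s__p form: parameter "learning_rate" of step "regressor".
pipeline.set_params(regressor__learning_rate=0.05)
print(pipeline.get_params()["regressor__learning_rate"])  # 0.05

# An unknown parameter name is rejected with a ValueError.
try:
    pipeline.set_params(regressor__no_such_parameter=1)
except ValueError as exc:
    error_message = str(exc)
    print("rejected:", error_message.splitlines()[0])
```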