Base class for Gradient Boosted Regression models

Base class for Gradient Boosting Regression models.

Classes:

GBRModel(input_datasets, **kwargs)

Base class for Gradient Boosting Regression models.

class esmvaltool.diag_scripts.mlr.models.gbr_base.GBRModel(input_datasets, **kwargs)

Bases: MLRModel

Base class for Gradient Boosting Regression models.

Attributes:

categorical_features

Categorical features.

data

Input data of the MLR model.

features

Features of the input data.

features_after_preprocessing

Features of the input data after preprocessing.

features_types

Types of the features.

features_units

Units of the features.

fit_kwargs

Keyword arguments for fit().

group_attributes

Group attributes of the input data.

label

Label of the input data.

label_units

Units of the label.

mlr_model_type

MLR model type.

numerical_features

Numerical features.

parameters

Parameters of the complete MLR model pipeline.

random_state

Random state instance.

Methods:

create(mlr_model_type, *args, **kwargs)

Create desired MLR model subclass (factory method).

efecv(**kwargs)

Perform exhaustive feature elimination using cross-validation.

export_prediction_data([filename])

Export all prediction data contained in self._data.

export_training_data([filename])

Export all training data contained in self._data.

fit()

Fit MLR model.

get_ancestors([label, features, ...])

Return ancestor files.

get_data_frame(data_type[, impute_nans])

Return data frame of specified type.

get_x_array(data_type[, impute_nans])

Return x data of specific type.

get_y_array(data_type[, impute_nans])

Return y data of specific type.

grid_search_cv(param_grid, **kwargs)

Perform exhaustive parameter search using cross-validation.

plot_1d_model([filename, n_points])

Plot lineplot that represents the MLR model.

plot_feature_importance([filename, color_coded])

Plot feature importance.

plot_partial_dependences([filename])

Plot partial dependences for every feature.

plot_prediction_errors([filename])

Plot predicted vs. true values.

plot_residuals([filename])

Plot residuals of training and test (if available) data.

plot_residuals_distribution([filename])

Plot distribution of residuals of training and test data (KDE).

plot_residuals_histogram([filename])

Plot histogram of residuals of training and test data.

plot_scatterplots([filename])

Plot scatterplots label vs. feature for every feature.

predict([save_mlr_model_error, ...])

Perform prediction using the MLR model(s) and write *.nc files.

print_correlation_matrices()

Print correlation matrices for all datasets.

print_regression_metrics([logo])

Print all available regression metrics for training data.

register_mlr_model(mlr_model_type)

Add MLR model (subclass of this class) (decorator).

reset_pipeline()

Reset regressor pipeline.

rfecv(**kwargs)

Perform recursive feature elimination using cross-validation.

test_normality_of_residuals()

Perform Shapiro-Wilk test for normality of residuals.

update_parameters(**params)

Update parameters of the whole pipeline.

property categorical_features

Categorical features.

Type

numpy.ndarray

classmethod create(mlr_model_type, *args, **kwargs)

Create desired MLR model subclass (factory method).

property data

Input data of the MLR model.

Type

dict

efecv(**kwargs)

Perform exhaustive feature elimination using cross-validation.

Parameters

**kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().
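As a rough illustration of what "exhaustive" means here, the sketch below scores every non-empty subset of features and keeps the best one. It is not the ESMValTool implementation: the helper exhaustive_feature_search, the feature names and the toy score function are hypothetical, and a real run would use a cross-validated score instead.

```python
from itertools import combinations


def exhaustive_feature_search(features, score_func):
    """Return the best-scoring non-empty feature subset (hypothetical helper)."""
    best_subset, best_score = None, float("-inf")
    # Try subsets of every size, from single features up to the full set.
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = score_func(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset):
    """Toy score (assumption): reward 'tas' and 'pr', penalize extra features."""
    useful = {"tas": 1.0, "pr": 0.5}
    return sum(useful.get(name, 0.0) for name in subset) - 0.1 * len(subset)


best, score = exhaustive_feature_search(["tas", "pr", "psl"], toy_score)
print(best, round(score, 2))
```

With n features there are 2**n - 1 subsets to score, so this kind of search is only practical for a small number of features.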

export_prediction_data(filename=None)

Export all prediction data contained in self._data.

Parameters

filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.

export_training_data(filename=None)

Export all training data contained in self._data.

Parameters

filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.

property features

Features of the input data.

Type

numpy.ndarray

property features_after_preprocessing

Features of the input data after preprocessing.

Type

numpy.ndarray

property features_types

Types of the features.

Type

pandas.Series

property features_units

Units of the features.

Type

pandas.Series

fit()

Fit MLR model.

Note

Specifying keyword arguments for this function is not allowed, since doing so might alter features_after_preprocessing. Use the keyword argument fit_kwargs during class initialization instead.

property fit_kwargs

Keyword arguments for fit().

Type

dict

get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)

Return ancestor files.

Parameters
  • label (bool, optional (default: True)) – Return label files.

  • features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.

  • prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.

  • prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.

Returns

Ancestor files.

Return type

list of str

Raises

ValueError – Invalid feature or prediction_name given.

get_data_frame(data_type, impute_nans=False)

Return data frame of specified type.

Parameters
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns

Desired data.

Return type

pandas.DataFrame

Raises

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

get_x_array(data_type, impute_nans=False)

Return x data of specific type.

Parameters
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns

Desired data.

Return type

numpy.ndarray

Raises

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

get_y_array(data_type, impute_nans=False)

Return y data of specific type.

Parameters
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns

Desired data.

Return type

numpy.ndarray

Raises

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

grid_search_cv(param_grid, **kwargs)

Perform exhaustive parameter search using cross-validation.

Parameters
  • param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. Parameter names have to be given for each step of the pipeline, with step and parameter separated by two underscores, i.e. s__p is the parameter p for step s.

  • **kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.

Raises

ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.
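To make the s__p naming concrete, the following sketch builds a small param_grid and enumerates the combinations an exhaustive search would evaluate. The step name final and the helper expand_grid are assumptions for illustration; n_estimators and learning_rate are parameters of scikit-learn's GradientBoostingRegressor, but the actual step names depend on the pipeline in use.

```python
from itertools import product

# Keys follow the '<step>__<parameter>' convention; 'final' is an assumed step name.
param_grid = {
    "final__n_estimators": [50, 100],
    "final__learning_rate": [0.01, 0.1, 1.0],
}


def expand_grid(grid):
    """Return every parameter combination in the grid (hypothetical helper)."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
print(len(candidates))  # 2 * 3 = 6 combinations
print(candidates[0])
```

Each combination is fitted and scored once per cross-validation fold, so the total cost is the number of combinations times the number of folds.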

property group_attributes

Group attributes of the input data.

Type

numpy.ndarray

property label

Label of the input data.

Type

str

property label_units

Units of the label.

Type

str

property mlr_model_type

MLR model type.

Type

str

property numerical_features

Numerical features.

Type

numpy.ndarray

property parameters

Parameters of the complete MLR model pipeline.

Type

dict

plot_1d_model(filename=None, n_points=1000)

Plot lineplot that represents the MLR model.

Note

This only works for a model with a single feature.

Parameters
  • filename (str, optional (default: '1d_mlr_model')) – Name of the plot file.

  • n_points (int, optional (default: 1000)) – Number of sampled points for the single feature (using linear spacing between minimum and maximum value).
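The linear spacing mentioned for n_points can be pictured with a small pure-Python helper. This is only a sketch of the sampling idea; the function name is hypothetical and the actual implementation may rely on a library routine instead.

```python
def linear_spacing(minimum, maximum, n_points=1000):
    """Return n_points evenly spaced values from minimum to maximum, inclusive."""
    if n_points < 2:
        return [float(minimum)]
    step = (maximum - minimum) / (n_points - 1)
    return [minimum + i * step for i in range(n_points)]


points = linear_spacing(0.0, 10.0, n_points=5)
print(points)  # [0.0, 2.5, 5.0, 7.5, 10.0]
```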

Raises

plot_feature_importance(filename=None, color_coded=True)

Plot feature importance.

The importance of a feature is derived from properties of the GBR model: the number of appearances of that feature in the regression trees and the improvements made by the individual splits (see Friedman, 2001).

Note

The features plotted here are not necessarily the real input features, but the ones after preprocessing.

Parameters
  • filename (str, optional (default: 'feature_importance')) – Name of the plot file.

  • color_coded (bool, optional (default: True)) – If True, mark positive (linear) correlations with red bars and negative (linear) correlations with blue bars. If False, all bars are blue.
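The idea behind this kind of importance can be sketched as follows: every split in a tree is credited to the feature it uses, weighted by the improvement it brings, and the totals are normalized to sum to one. This is a simplified illustration, not the scikit-learn or ESMValTool code; the split list and the helper name are made up.

```python
from collections import defaultdict


def importance_from_splits(splits):
    """Sum split improvements per feature and normalize (hypothetical helper)."""
    totals = defaultdict(float)
    for feature, improvement in splits:
        totals[feature] += improvement
    norm = sum(totals.values())
    return {feature: total / norm for feature, total in totals.items()}


# Made-up (feature, improvement) pairs collected over all trees.
splits = [("tas", 4.0), ("pr", 1.0), ("tas", 2.0), ("psl", 1.0)]
importance = importance_from_splits(splits)
print(importance)  # {'tas': 0.75, 'pr': 0.125, 'psl': 0.125}
```

A feature that is used often and in splits with large improvements ends up with a high share, which matches the description above.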

plot_partial_dependences(filename=None)

Plot partial dependences for every feature.

Parameters

filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_prediction_errors(filename=None)

Plot predicted vs. true values.

Parameters

filename (str, optional (default: 'prediction_errors')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals(filename=None)

Plot residuals of training and test (if available) data.

Parameters

filename (str, optional (default: 'residuals')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_distribution(filename=None)

Plot distribution of residuals of training and test data (KDE).

Parameters

filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_histogram(filename=None)

Plot histogram of residuals of training and test data.

Parameters

filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_scatterplots(filename=None)

Plot scatterplots label vs. feature for every feature.

Parameters

filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)

Perform prediction using the MLR model(s) and write *.nc files.

Parameters
  • save_mlr_model_error (str or int, optional) – Additionally save the estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself, not by errors in the prediction input data (those are considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set; this is only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation based on the group_attributes; this is only possible if group_datasets_by_attributes is given. If set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.

  • save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).

  • save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.

  • **kwargs (keyword arguments, optional) – Additional options for the final regressor's predict() function.
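The n-fold variant of the error estimate can be sketched in plain Python. The example below estimates RMSEP (root mean squared error of prediction) with a deliberately trivial model that predicts the mean of its training labels. The helper name, the interleaved fold assignment and the data are assumptions; the real estimate is produced from the fitted MLR pipeline.

```python
from math import sqrt


def rmsep_n_fold(y_values, n_folds):
    """Estimate RMSEP by n-fold CV with a mean predictor (hypothetical helper)."""
    squared_errors = []
    for fold in range(n_folds):
        # Every n_folds-th sample (offset by fold) is held out for testing.
        test = y_values[fold::n_folds]
        train = [y for i, y in enumerate(y_values) if i % n_folds != fold]
        prediction = sum(train) / len(train)  # "fit" on the training part only
        squared_errors.extend((y - prediction) ** 2 for y in test)
    return sqrt(sum(squared_errors) / len(squared_errors))


y_values = [1.0, 2.0, 3.0, 4.0]
rmsep = rmsep_n_fold(y_values, n_folds=2)
print(rmsep)
```

Because every sample is predicted by a model that never saw it during fitting, the result reflects the prediction error of the model rather than its training fit.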

Raises

print_correlation_matrices()

Print correlation matrices for all datasets.

print_regression_metrics(logo=False)

Print all available regression metrics for training data.

Parameters

logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.
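Leave-one-group-out cross-validation holds out all samples of one group at a time. The sketch below reproduces that splitting logic in pure Python; the helper name and the group labels are made up, and scikit-learn's LeaveOneGroupOut should be used in practice.

```python
def leave_one_group_out(groups):
    """Yield (train_indices, test_indices), holding out one group at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, group in enumerate(groups) if group != held_out]
        test = [i for i, group in enumerate(groups) if group == held_out]
        yield train, test


# Made-up group labels, e.g. one label per climate model.
groups = ["model_a", "model_a", "model_b", "model_c", "model_c"]
splits = list(leave_one_group_out(groups))
for train, test in splits:
    print(train, test)
```

The number of splits equals the number of distinct groups, which is why this option needs grouping information.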

property random_state

Random state instance.

Type

numpy.random.RandomState

classmethod register_mlr_model(mlr_model_type)

Add MLR model (subclass of this class) (decorator).
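The interplay between a registering decorator and a factory method can be sketched with a minimal registry. All names below (BaseModel, ToyGBRModel, the _MODELS dictionary, the 'toy_gbr' key) are hypothetical, and the error raised for an unknown type is an assumption; only the method names create and register_mlr_model are taken from this page.

```python
class BaseModel:
    """Hypothetical stand-in for a base class with a subclass registry."""

    _MODELS = {}

    @classmethod
    def register_mlr_model(cls, mlr_model_type):
        """Return a decorator that registers a subclass under the given name."""
        def decorator(subclass):
            cls._MODELS[mlr_model_type] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, mlr_model_type, *args, **kwargs):
        """Instantiate the subclass registered under mlr_model_type."""
        if mlr_model_type not in cls._MODELS:
            # The error type used here is an assumption.
            raise NotImplementedError(f"Model type '{mlr_model_type}' not registered")
        return cls._MODELS[mlr_model_type](*args, **kwargs)


@BaseModel.register_mlr_model("toy_gbr")
class ToyGBRModel(BaseModel):
    """Hypothetical subclass; a real one would configure a regressor."""


model = BaseModel.create("toy_gbr")
print(type(model).__name__)  # ToyGBRModel
```

With this pattern, callers only need the model type string; the concrete class is looked up at creation time.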

reset_pipeline()

Reset regressor pipeline.

rfecv(**kwargs)

Perform recursive feature elimination using cross-validation.

Note

This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.

Parameters

**kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.

Raises

RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.
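The requirement on coef_ or feature_importances_ follows from how recursive elimination works: after each fit, the least important feature is dropped. The sketch below shows that loop with a toy estimator. All names are hypothetical, and unlike RFECV, which selects the number of features by cross-validation, this simplified version stops at a fixed n_keep.

```python
def get_importances(estimator):
    """Return absolute feature importances or raise RuntimeError (hypothetical)."""
    if hasattr(estimator, "feature_importances_"):
        return [abs(v) for v in estimator.feature_importances_]
    if hasattr(estimator, "coef_"):
        return [abs(v) for v in estimator.coef_]
    raise RuntimeError("Estimator provides neither coef_ nor feature_importances_")


def eliminate_recursively(features, fit, n_keep=1):
    """Drop the least important feature until n_keep remain (hypothetical)."""
    features = list(features)
    while len(features) > n_keep:
        importances = get_importances(fit(features))
        weakest = importances.index(min(importances))
        del features[weakest]
    return features


class ToyEstimator:
    """Stand-in for a fitted linear model with made-up coefficients."""

    def __init__(self, features):
        weights = {"tas": 2.0, "pr": -0.5, "psl": 0.1}
        self.coef_ = [weights[name] for name in features]


kept = eliminate_recursively(["tas", "pr", "psl"], ToyEstimator, n_keep=2)
print(kept)  # ['tas', 'pr']
```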

test_normality_of_residuals()

Perform Shapiro-Wilk test for normality of residuals.

Raises

sklearn.exceptions.NotFittedError – MLR model is not fitted.

update_parameters(**params)

Update parameters of the whole pipeline.

Note

Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.

Parameters

**params (keyword arguments, optional) – Parameters for the pipeline which should be updated.

Raises

ValueError – Invalid parameter for pipeline given.
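A minimal sketch of how s__p names map onto pipeline steps is given below. It uses a plain nested dictionary instead of a real pipeline; the helper name, the step names and the dictionary layout are assumptions made for illustration.

```python
def update_step_parameters(pipeline_params, **params):
    """Apply 's__p'-style updates to a nested dict (hypothetical helper)."""
    for name, value in params.items():
        # Split only at the first double underscore: step name, then parameter.
        step, sep, param = name.partition("__")
        if not sep or step not in pipeline_params or param not in pipeline_params[step]:
            raise ValueError(f"Invalid parameter '{name}' for pipeline")
        pipeline_params[step][param] = value
    return pipeline_params


# Assumed layout: one dict of parameters per pipeline step.
pipeline_params = {
    "preprocessor": {"with_mean": True},
    "final": {"n_estimators": 100, "learning_rate": 0.1},
}
update_step_parameters(pipeline_params, final__n_estimators=200)
print(pipeline_params["final"]["n_estimators"])  # 200
```

Unknown steps or parameters are rejected with a ValueError, mirroring the behavior documented above.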