MLRModel base class#

Base class for MLR models.

Example recipe#

The MLR main diagnostic script provides an interface for using MLR models in recipes. The following recipe shows a typical example of how to set up MLR recipes/diagnostics with the following properties:

Set up an MLR model with target variable y (using the tag Y) and three predictors x1, x2 and latitude (with tags X1, X2 and latitude, respectively). The target variable needs the attribute var_type: label; the predictors x1 and x2 need the attribute var_type: feature. The coordinate feature latitude is added via the option coords_as_features: [latitude].
Suppose y and x1 are 3D fields (pressure, latitude, longitude) and x2 is a 2D field (latitude, longitude). Thus, it is necessary to add the attribute broadcast_from: [1, 2] to x2 (see dim_map parameter in iris.util.broadcast_to_shape() for details). In order to consider multiple climate models (A, B and C) at once, the option group_datasets_by_attributes: [dataset] is necessary. Otherwise the diagnostic will complain about duplicate data.

For the prediction, data from dataset D is used (with var_type: prediction_input). For the feature X1 additional input error (with var_type: prediction_input_error) is used.

diag_feature_x1:
  variables:
    feature:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: x1
      var_type: feature
      tag: X1
      additional_datasets:
        - {dataset: A, ...}
        - {dataset: B, ...}
        - {dataset: C, ...}
    prediction_input:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: x1
      var_type: prediction_input
      tag: X1
      additional_datasets:
        - {dataset: D, ...}
    prediction_input_error:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: x1Stderr
      var_type: prediction_input_error
      tag: X1
      additional_datasets:
        - {dataset: D, ...}
  scripts:
    null

diag_feature_x2:
  variables:
    feature:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: x2
      var_type: feature
      broadcast_from: [1, 2]
      tag: X2
      additional_datasets:
        - {dataset: A, ...}
        - {dataset: B, ...}
        - {dataset: C, ...}
    prediction_input:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: x2
      var_type: prediction_input
      broadcast_from: [1, 2]
      tag: X2
      additional_datasets:
        - {dataset: D, ...}
  scripts:
    null

diag_label:
  variables:
    label:
      ... # specify project, mip, start_year, end_year, etc.
      short_name: y
      var_type: label
      tag: Y
      additional_datasets:
        - {dataset: A, ...}
        - {dataset: B, ...}
        - {dataset: C, ...}
  scripts:
    null

In this example, a GBRT model (with mlr_model_type: gbr_sklearn) is used. Parameters for this are specified via parameters_final_regressor. Apart from the best-estimate prediction, the estimated MLR model error (save_mlr_model_error: test) and the propagated prediction input error (save_propagated_errors: true) are returned.

With postprocess.py, the global mean of the best-estimate prediction and the corresponding errors (MLR model + propagated input error) are calculated.

diag_mlr_gbrt:
  scripts:
    mlr:
      script: mlr/main.py
      ancestors: [
         'diag_label/y',
         'diag_feature_*/*',
      ]
      coords_as_features: [latitude]
      group_datasets_by_attributes: [dataset]
      mlr_model_name: GBRT
      mlr_model_type: gbr_sklearn
      parameters_final_regressor:
        learning_rate: 0.1
        n_estimators: 100
      save_mlr_model_error: test
      save_propagated_errors: true
    postprocess:
      script: mlr/postprocess.py
      ancestors: ['diag_mlr_gbrt/mlr']
      ignore:
        - {var_type: null}
      mean: [pressure, latitude, longitude]

Plots of the global distribution (latitude, longitude) are created with plot.py after calculating the mean over the pressure coordinate using preprocess.py.

diag_plot:
  scripts:
    preprocess:
      script: mlr/preprocess.py
      ancestors: ['diag_mlr_gbrt/mlr']
      collapse: [pressure]
      ignore:
        - {var_type: null}
    plot:
      script: mlr/plot.py
      ancestors: ['diag_plot/preprocess']
      plot_map:
         plot_kwargs:
           cbar_label: 'Y'
           cbar_ticks: [0, 1, 2, 3]
           vmin: 0
           vmax: 3

All datasets must have the attribute var_type which specifies the type of the dataset. Possible values are feature (independent variables used for training/testing), label (dependent variables, y-axis), prediction_input (independent variables used for prediction of dependent variables, usually observational data), prediction_input_error (standard error of the prediction_input data, optional) or prediction_reference (true values for the prediction_input data, optional). In addition, all datasets must have the attribute tag, which specifies the name of the variable/diagnostic. All datasets can be converted to new units in the loading step by specifying the key convert_units_to in the respective dataset(s).
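The attribute requirements above, together with the grouping option used in the example recipe, can be illustrated with a small stand-alone sketch. It is not the ESMValTool implementation: the helper `group_and_check` and the metadata dictionaries are hypothetical, and only the attribute names (`var_type`, `tag`, `dataset`) come from the text.

```python
from collections import defaultdict

# Allowed values for var_type, as listed in the text above
VALID_VAR_TYPES = {
    "feature",
    "label",
    "prediction_input",
    "prediction_input_error",
    "prediction_reference",
}


def group_and_check(datasets, group_attributes=()):
    """Group dataset metadata and check uniqueness of (var_type, tag).

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for dataset in datasets:
        if dataset.get("var_type") not in VALID_VAR_TYPES:
            raise ValueError(f"Invalid or missing var_type in {dataset}")
        if "tag" not in dataset:
            raise ValueError(f"Missing tag in {dataset}")
        # Datasets sharing the same values for all group attributes
        # end up in the same group
        key = tuple(dataset.get(attr) for attr in group_attributes)
        groups[key].append(dataset)

    # Within each group, every (var_type, tag) pair may appear only once
    for key, members in groups.items():
        seen = set()
        for dataset in members:
            ident = (dataset["var_type"], dataset["tag"])
            if ident in seen:
                raise ValueError(f"Duplicate data {ident} in group {key}")
            seen.add(ident)
    return dict(groups)


metadata = [
    {"dataset": "A", "var_type": "feature", "tag": "X1"},
    {"dataset": "B", "var_type": "feature", "tag": "X1"},
    {"dataset": "A", "var_type": "label", "tag": "Y"},
    {"dataset": "B", "var_type": "label", "tag": "Y"},
]
grouped = group_and_check(metadata, group_attributes=["dataset"])
```

Calling `group_and_check(metadata)` without group attributes raises a `ValueError` in this sketch, which mirrors the "duplicate data" complaint mentioned for recipes that omit group_datasets_by_attributes.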

Training data#

All groups (specified in group_datasets_by_attributes, if desired) given for label datasets must also be given for the feature datasets. Within these groups, all feature and label datasets must have the same shape, unless the attribute broadcast_from is set to a list of suitable coordinate indices to map this dataset to regular datasets (see parameter dim_map in iris.util.broadcast_to_shape()).
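What a list of coordinate indices such as broadcast_from: [1, 2] means can be sketched with plain NumPy. This only mimics the idea of a dimension mapping; it is not the iris implementation. The shapes and array contents are made up for illustration, and the sketch assumes that the indices are given in ascending order.

```python
import numpy as np

# Hypothetical shapes: regular datasets are (pressure, latitude, longitude),
# the dataset to be broadcast is (latitude, longitude)
target_shape = (3, 4, 5)
x2 = np.arange(4 * 5, dtype=float).reshape(4, 5)

# Dimensions of the target shape that the dimensions of x2 map onto
broadcast_from = [1, 2]

# Insert a length-1 axis for every target dimension that is not listed
# in broadcast_from, then let NumPy repeat the data along those axes
index = tuple(
    slice(None) if dim in broadcast_from else np.newaxis
    for dim in range(len(target_shape))
)
x2_3d = np.broadcast_to(x2[index], target_shape)
```

After this step, `x2_3d` has the same shape as the 3D fields and contains an identical copy of the 2D field on every pressure level.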

Prediction data#

All tags specified for prediction_input datasets must also be given for the feature datasets (unless allow_missing_features is set to True). Multiple predictions can be specified by prediction_name. Within each prediction, all prediction_input datasets must have the same shape, unless the attribute broadcast_from is given. Errors in the prediction input data can be specified by prediction_input_error. If given, these errors are used to calculate errors in the final prediction using linear error propagation given by LIME. Additionally, true values for prediction_input can be specified with prediction_reference datasets (together with the respective prediction_name). This allows an evaluation of the performance of the MLR model by calculating residuals (true minus predicted values).
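The two calculations mentioned above can be sketched in a few lines. The propagation formula below is the textbook form for a linear model with independent input errors; whether ESMValTool treats covariances or derives the local coefficients in exactly this way is an assumption. The function names and the numbers are hypothetical.

```python
import math


def propagate_linear_error(coefficients, input_errors):
    """Propagate independent standard errors through a linear model.

    For y = sum(c_i * x_i) with independent errors s_i on x_i, the
    standard error of y is sqrt(sum((c_i * s_i) ** 2)).
    """
    return math.sqrt(
        sum((c * s) ** 2 for c, s in zip(coefficients, input_errors))
    )


def residuals(true_values, predicted_values):
    """Return residuals defined as true minus predicted values."""
    return [t - p for t, p in zip(true_values, predicted_values)]


# Hypothetical coefficients of a local linear surrogate model and
# hypothetical standard errors of the two prediction inputs
sigma = propagate_linear_error([3.0, 4.0], [1.0, 1.0])

# Hypothetical reference values and predictions
res = residuals([1.0, 2.0, 3.0], [0.5, 2.5, 3.0])
```

With these numbers, `sigma` is 5.0 and `res` is `[0.5, -0.5, 0.0]`.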

Available MLR models#

MLR models are subclasses of this base class. A list of all available MLR models can be found here. To add a new MLR model, create a new file in esmvaltool/diag_scripts/mlr/models/ with a child class of esmvaltool.diag_scripts.mlr.models.MLRModel decorated with esmvaltool.diag_scripts.mlr.models.MLRModel.register_mlr_model().
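The combination of a registering decorator and a factory method can be sketched generically as follows. This is a minimal illustration of the pattern, not the ESMValTool source: the class `BaseModel`, its methods `register` and `create`, and the model type `my_model` are all hypothetical stand-ins.

```python
class BaseModel:
    """Hypothetical stand-in for a base class with a model registry."""

    _MODELS = {}

    def __init__(self, input_datasets, **kwargs):
        self.input_datasets = input_datasets
        self.options = kwargs

    @classmethod
    def register(cls, model_type):
        """Return a decorator that registers a subclass under model_type."""

        def decorator(subclass):
            if not issubclass(subclass, cls):
                raise TypeError(
                    f"{subclass.__name__} must inherit from {cls.__name__}"
                )
            cls._MODELS[model_type] = subclass
            return subclass

        return decorator

    @classmethod
    def create(cls, model_type, *args, **kwargs):
        """Factory method: instantiate the subclass registered as model_type."""
        if model_type not in cls._MODELS:
            raise NotImplementedError(f"Unknown model type '{model_type}'")
        return cls._MODELS[model_type](*args, **kwargs)


@BaseModel.register("my_model")
class MyModel(BaseModel):
    """Hypothetical model type defined in its own module."""


model = BaseModel.create("my_model", [], mlr_model_name="demo")
```

The design keeps the base class unaware of concrete subclasses: a new model type becomes available to the factory as soon as the module defining it is imported.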

Optional parameters for class initialization#

accept_only_scalar_data: bool (default: False): If set to True, only accept scalar input data. Should be used together with the option group_datasets_by_attributes.
allow_missing_features: bool (default: False): Allow missing features in the training data.
cache_intermediate_results: bool (default: True): Cache the intermediate results of the pipeline’s transformers.
categorical_features: list of str: Names of features which are interpreted as categorical features (in contrast to numerical features).
coords_as_features: list of str: If given, specify a list of coordinates which should be used as features.
dtype: str (default: ‘float64’): Internal data type which is used for all calculations, see https://docs.scipy.org/doc/numpy/user/basics.types.html for a list of allowed values.
fit_kwargs: dict: Optional keyword arguments for the pipeline’s fit() function. These arguments have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
group_datasets_by_attributes: list of str: List of dataset attributes which are used to group input data for features and labels. For example, this is necessary if the MLR model should consider multiple climate models in the training phase. If this option is not given, specifying multiple datasets with identical var_type and tag entries results in an error. If given, all the input data is first grouped by the given attributes and then checked for uniqueness within this group. After that, all groups are stacked to form a single set of training data.
imputation_strategy: str (default: ‘remove’): Strategy for the imputation of missing values in the features. Must be one of 'remove', 'mean', 'median', 'most_frequent' or 'constant'.
log_level: str (default: ‘info’): Verbosity for the logger. Must be one of 'debug', 'info', 'warning' or 'error'.
mlr_model_name: str: Human-readable name of the MLR model instance (e.g., used for labels).
n_jobs: int (default: 1): Maximum number of jobs spawned by this class. Use -1 to use all processors. More details are given here.
output_file_type: str (default: ‘png’): File type for the plots.
parameters: dict: Parameters used for the whole pipeline. Have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s. random_state parameters are explicitly allowed here (in contrast to parameters_final_regressor).
parameters_final_regressor: dict: Parameters used for the final regressor. If these parameters are updated using the function update_parameters(), the new names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s. Note: to pass an argument for random_state, use the option random_state of this class.
pca: bool (default: False): Preprocess numerical input features using PCA. Parameters for this pipeline step can be given via the parameters argument.
plot_dir: str (default: ~/plots): Root directory to save plots.
plot_units: dict: Replace specific units (keys) with other text (values) in plots.
random_state: int or None (default: None): Random seed for numpy.random.RandomState that is used by all functionalities of this class that require randomness (e.g., probabilistic ML algorithms like Gradient Boosting Regression models, random train test splits, etc.). If None, use a random seed. Use an int to get reproducible results. See https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness for more details.
savefig_kwargs: dict: Keyword arguments for matplotlib.pyplot.savefig().
seaborn_settings: dict: Options for seaborn.set_theme() (affects all plots).
standardize_data: bool (default: True): Linearly standardize numerical input data by removing mean and scaling to unit variance.
sub_dir: str: Create additional subdirectory for output in work_dir and plot_dir.
test_size: float (default: 0.25): If given, randomly exclude the desired fraction of input data from training and use it as test data.
weighted_samples: dict: If specified, use weighted samples in the loss function used for the training of the MLR model. The given keyword arguments are directly passed to esmvaltool.diag_scripts.mlr.get_all_weights() to calculate the sample weights. By default, no weights are used. Raises errors if the desired weights cannot be calculated for the data, e.g., when time_weighted=True is used but the data does not contain a dimension time.
work_dir: str (default: ~/work): Root directory to save all other files (mainly *.nc files).
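Several of the options above (fit_kwargs, parameters) use the naming convention s__p for parameter p of pipeline step s. The following sketch shows how such keys can be read as (step, parameter) pairs. The helper `split_step_parameters` and the step names `pca` and `final` are hypothetical and chosen only for illustration.

```python
def split_step_parameters(parameters):
    """Group keys of the form '<step>__<parameter>' by pipeline step.

    Hypothetical helper for illustration only.
    """
    per_step = {}
    for name, value in parameters.items():
        # Split at the first double underscore only, so that nested
        # names like 'a__b__c' keep 'b__c' as the parameter part
        step, sep, param = name.partition("__")
        if not sep or not param:
            raise ValueError(
                f"Expected '<step>__<parameter>', got '{name}'"
            )
        per_step.setdefault(step, {})[param] = value
    return per_step


params = {
    "pca__n_components": 2,
    "final__learning_rate": 0.1,
    "final__n_estimators": 100,
}
per_step = split_step_parameters(params)
```

Here `per_step` maps `pca` to `{"n_components": 2}` and `final` to `{"learning_rate": 0.1, "n_estimators": 100}`.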

Classes:

MLRModel(input_datasets, **kwargs)

Base class for MLR models.

class esmvaltool.diag_scripts.mlr.models.MLRModel(input_datasets, **kwargs)[source]#

Bases: object

Base class for MLR models.

Attributes:

`categorical_features`	Categorical features.
`data`	Input data of the MLR model.
`features`	Features of the input data.
`features_after_preprocessing`	Features of the input data after preprocessing.
`features_types`	Types of the features.
`features_units`	Units of the features.
`fit_kwargs`	Keyword arguments for `fit()`.
`group_attributes`	Group attributes of the input data.
`label`	Label of the input data.
`label_units`	Units of the label.
`mlr_model_type`	MLR model type.
`numerical_features`	Numerical features.
`parameters`	Parameters of the complete MLR model pipeline.
`random_state`	Random state instance.

Methods:

`create`(mlr_model_type, *args, **kwargs)	Create desired MLR model subclass (factory method).
`efecv`(**kwargs)	Perform exhaustive feature elimination using cross-validation.
`export_prediction_data`([filename])	Export all prediction data contained in self._data.
`export_training_data`([filename])	Export all training data contained in self._data.
`fit`()	Fit MLR model.
`get_ancestors`([label, features, ...])	Return ancestor files.
`get_data_frame`(data_type[, impute_nans])	Return data frame of specified type.
`get_x_array`(data_type[, impute_nans])	Return x data of specific type.
`get_y_array`(data_type[, impute_nans])	Return y data of specific type.
`grid_search_cv`(param_grid, **kwargs)	Perform exhaustive parameter search using cross-validation.
`plot_1d_model`([filename, n_points])	Plot lineplot that represents the MLR model.
`plot_partial_dependences`([filename])	Plot partial dependences for every feature.
`plot_prediction_errors`([filename])	Plot predicted vs. true values.
`plot_residuals`([filename])	Plot residuals of training and test (if available) data.
`plot_residuals_distribution`([filename])	Plot distribution of residuals of training and test data (KDE).
`plot_residuals_histogram`([filename])	Plot histogram of residuals of training and test data.
`plot_scatterplots`([filename])	Plot scatterplots label vs. feature for every feature.
`predict`([save_mlr_model_error, ...])	Perform prediction using the MLR model(s) and write `*.nc` files.
`print_correlation_matrices`()	Print correlation matrices for all datasets.
`print_regression_metrics`([logo])	Print all available regression metrics for training data.
`register_mlr_model`(mlr_model_type)	Add MLR model (subclass of this class) (decorator).
`reset_pipeline`()	Reset regressor pipeline.
`rfecv`(**kwargs)	Perform recursive feature elimination using cross-validation.
`test_normality_of_residuals`()	Perform Shapiro-Wilk test for normality of residuals.
`update_parameters`(**params)	Update parameters of the whole pipeline.

property categorical_features#

Categorical features.

Type:: numpy.ndarray

classmethod create(mlr_model_type, *args, **kwargs)[source]#: Create desired MLR model subclass (factory method).

property data#

Input data of the MLR model.

Type:: dict

efecv(**kwargs)[source]#

Perform exhaustive feature elimination using cross-validation.

Parameters:: **kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().

export_prediction_data(filename=None)[source]#

Export all prediction data contained in self._data.

Parameters:: filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.

export_training_data(filename=None)[source]#

Export all training data contained in self._data.

Parameters:: filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.

property features#

Features of the input data.

Type:: numpy.ndarray

property features_after_preprocessing#

Features of the input data after preprocessing.

Type:: numpy.ndarray

property features_types#

Types of the features.

Type:: pandas.Series

property features_units#

Units of the features.

Type:: pandas.Series

fit()[source]#: Fit MLR model.

Note

Specifying keyword arguments for this function is not allowed here since features_after_preprocessing might be altered by that. Use the keyword argument fit_kwargs during class initialization instead.

property fit_kwargs#

Keyword arguments for fit().

Type:: dict

get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)[source]#

Return ancestor files.

Parameters:

label (bool, optional (default: True)) – Return label files.
features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.
prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.
prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.

Returns:

Ancestor files.

Return type:

list of str

Raises:

ValueError – Invalid feature or prediction_name given.

get_data_frame(data_type, impute_nans=False)[source]#

Return data frame of specified type.

Parameters:

data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.
impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

pandas.DataFrame

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

get_x_array(data_type, impute_nans=False)[source]#

Return x data of specific type.

Parameters:

data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.
impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

numpy.ndarray

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

get_y_array(data_type, impute_nans=False)[source]#

Return y data of specific type.

Parameters:

data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.
impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

numpy.ndarray

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

grid_search_cv(param_grid, **kwargs)[source]#

Perform exhaustive parameter search using cross-validation.

Parameters:

param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. Have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
**kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.

Raises:

ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.

property group_attributes#

Group attributes of the input data.

Type:: numpy.ndarray

property label#

Label of the input data.

Type:: str

property label_units#

Units of the label.

Type:: str

property mlr_model_type#

MLR model type.

Type:: str

property numerical_features#

Numerical features.

Type:: numpy.ndarray

property parameters#

Parameters of the complete MLR model pipeline.

Type:: dict

plot_1d_model(filename=None, n_points=1000)[source]#

Plot lineplot that represents the MLR model.

Note

This only works for a model with a single feature.

Parameters:

filename (str, optional (default: '1d_mlr_model')) – Name of the plot file.
n_points (int, optional (default: 1000)) – Number of sampled points for the single feature (using linear spacing between minimum and maximum value).

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – MLR model is built from more than 1 feature.

plot_partial_dependences(filename=None)[source]#

Plot partial dependences for every feature.

Parameters:: filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_prediction_errors(filename=None)[source]#

Plot predicted vs. true values.

Parameters:: filename (str, optional (default: 'prediction_errors')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals(filename=None)[source]#

Plot residuals of training and test (if available) data.

Parameters:: filename (str, optional (default: 'residuals')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_distribution(filename=None)[source]#

Plot distribution of residuals of training and test data (KDE).

Parameters:: filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_histogram(filename=None)[source]#

Plot histogram of residuals of training and test data.

Parameters:: filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_scatterplots(filename=None)[source]#

Plot scatterplots label vs. feature for every feature.

Parameters:: filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.
Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)[source]#

Perform prediction using the MLR model(s) and write *.nc files.

Parameters:

save_mlr_model_error (str or int, optional) – Additionally saves estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (errors in that will be considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. Only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. Only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.
save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).
save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.
**kwargs (keyword arguments, optional) – Additional options for the final regressor's predict() function.

Raises:

RuntimeError – return_var and return_cov are both set to True.
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – An invalid value for save_mlr_model_error is given.
ValueError – save_propagated_errors is True and no prediction_input_error data is available.
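The constant model error described for save_mlr_model_error: 'test' is an RMSEP (root mean squared error of prediction) estimated from hold-out data. A minimal sketch of that quantity, assuming the usual definition and using made-up numbers, looks like this; the function `rmsep` is hypothetical and not part of ESMValTool.

```python
import math


def rmsep(true_values, predicted_values):
    """Root mean squared error of prediction for paired values."""
    squared_errors = [
        (t - p) ** 2 for t, p in zip(true_values, predicted_values)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hold-out test labels and the corresponding predictions
y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
error = rmsep(y_test, y_pred)
```

For these values the single squared error of 4.0 is averaged over four samples, so `error` is 1.0. The same scalar would then be attached to every predicted point, which is why the text calls the error "constant".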

print_correlation_matrices()[source]#: Print correlation matrices for all datasets.

print_regression_metrics(logo=False)[source]#

Print all available regression metrics for training data.

Parameters:: logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.
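Leave-one-group-out cross-validation holds out all samples of one group at a time and trains on the rest. The following pure-Python sketch mimics how such splits are formed; it is not the scikit-learn code, and the function `leave_one_group_out` and the group labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (train_indices, test_indices), holding out one group at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, group in enumerate(groups) if group != held_out]
        test = [i for i, group in enumerate(groups) if group == held_out]
        yield train, test


# Hypothetical group label of each training sample, e.g. the climate
# model the sample comes from when grouping by the dataset attribute
groups = ["A", "A", "B", "C", "C"]
splits = list(leave_one_group_out(groups))
```

With three distinct groups this gives three splits, and each sample is used for testing exactly once.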

property random_state#

Random state instance.

Type:: numpy.random.RandomState

classmethod register_mlr_model(mlr_model_type)[source]#: Add MLR model (subclass of this class) (decorator).

reset_pipeline()[source]#: Reset regressor pipeline.

rfecv(**kwargs)[source]#

Perform recursive feature elimination using cross-validation.

Note

This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.

Parameters:: **kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.
Raises:: RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.

test_normality_of_residuals()[source]#

Perform Shapiro-Wilk test for normality of residuals.

Raises:: sklearn.exceptions.NotFittedError – MLR model is not fitted.

update_parameters(**params)[source]#

Update parameters of the whole pipeline.

Note

Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.

Parameters:: **params (keyword arguments, optional) – Parameters for the pipeline which should be updated.
Raises:: ValueError – Invalid parameter for pipeline given.