MLR main diagnostic

Main diagnostic script to create MLR models.

Description

This diagnostic script creates Machine Learning Regression (MLR) models which use inter-model relations between process-based predictors (usually from the past/present climate) and a target variable (usually a projection of the future climate) to get a constrained prediction of the target variable. It provides an interface for using MLR models (subclasses of esmvaltool.diag_scripts.mlr.models.MLRModel).

Author

Manuel Schlund (DLR, Germany)

Project

CRESCENDO

Configuration options in recipe

efecv_kwargs: dict, optional

If specified, use these additional keyword arguments to perform an exhaustive feature elimination using cross-validation. May not be used together with grid_search_cv_param_grid or rfecv_kwargs.

grid_search_cv_kwargs: dict, optional

Keyword arguments for the grid search cross-validation, see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

grid_search_cv_param_grid: dict or list of dict, optional

If specified, perform an exhaustive parameter search using cross-validation instead of simply calling esmvaltool.diag_scripts.mlr.models.MLRModel.fit(). Contains parameters (keys) and ranges (values) for the exhaustive parameter search. Parameters have to be given for each step of the pipeline, with the step name and the parameter name separated by two underscores, i.e. s__p is the parameter p for step s. May not be used together with efecv_kwargs or rfecv_kwargs.
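
As a rough illustration of the s__p naming convention and of what an exhaustive search enumerates, the following sketch expands a parameter grid into all candidate settings using only the Python standard library. The helpers expand_param_grid and split_step_and_parameter, the step name final and the parameters alpha and fit_intercept are hypothetical placeholders; they are not taken from ESMValTool or scikit-learn.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Return every parameter combination described by ``param_grid``.

    ``param_grid`` may be a dict or a list of dicts, mirroring the type of
    the recipe option. Keys follow the ``step__parameter`` convention.
    """
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of all value ranges within one grid.
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


def split_step_and_parameter(name):
    """Split ``'s__p'`` into the step name ``'s'`` and the parameter ``'p'``."""
    step, _, parameter = name.partition('__')
    return step, parameter


# Hypothetical grid: step name and parameters are placeholders.
param_grid = {
    'final__alpha': [0.1, 1.0, 10.0],
    'final__fit_intercept': [True, False],
}
candidates = expand_param_grid(param_grid)
print(len(candidates))
print(split_step_and_parameter('final__alpha'))
```

Every candidate is evaluated with cross-validation in an exhaustive search, so the cost grows with the product of the lengths of all value ranges.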

group_metadata: str, optional

Group input data by an attribute. For every group element (set of datasets), an individual MLR model is calculated. Only affects feature and label datasets. May be used together with the option pseudo_reality.
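
The grouping described above can be sketched in plain Python as follows. This is only a simplified illustration, assuming that each dataset is represented by a dict of metadata; the function group_datasets and the dataset names MODEL-A, MODEL-B and OBS are hypothetical.

```python
from collections import defaultdict

# Only these variable types are affected by the grouping.
GROUPED_VAR_TYPES = ('feature', 'label')


def group_datasets(datasets, attribute):
    """Group feature and label datasets by the value of ``attribute``."""
    groups = defaultdict(list)
    for dataset in datasets:
        if dataset['var_type'] in GROUPED_VAR_TYPES:
            groups[dataset[attribute]].append(dataset)
    return dict(groups)


# Hypothetical metadata; names are placeholders.
datasets = [
    {'dataset': 'MODEL-A', 'var_type': 'feature'},
    {'dataset': 'MODEL-A', 'var_type': 'label'},
    {'dataset': 'MODEL-B', 'var_type': 'feature'},
    {'dataset': 'MODEL-B', 'var_type': 'label'},
    {'dataset': 'OBS', 'var_type': 'prediction_input'},
]
groups = group_datasets(datasets, 'dataset')
for name, members in groups.items():
    # One MLR model would be created per group element.
    print(name, len(members))
```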

ignore: list of dict, optional

Ignore specific datasets by specifying multiple dicts of metadata.
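
A minimal sketch of such a filter is given below. It assumes that a dataset is ignored as soon as it matches all (key, value) pairs of at least one of the given dicts, and that values are compared for exact equality; both are assumptions of this example, and the helper names and metadata are hypothetical.

```python
def matches(dataset, criteria):
    """Return True if ``dataset`` has all (key, value) pairs in ``criteria``."""
    return all(dataset.get(key) == value for key, value in criteria.items())


def drop_ignored(datasets, ignore):
    """Remove every dataset that matches at least one dict in ``ignore``."""
    return [
        dataset for dataset in datasets
        if not any(matches(dataset, criteria) for criteria in ignore)
    ]


# Hypothetical metadata; names are placeholders.
datasets = [
    {'dataset': 'MODEL-A', 'var_type': 'feature'},
    {'dataset': 'MODEL-A', 'var_type': 'label'},
    {'dataset': 'MODEL-B', 'var_type': 'feature'},
]
ignore = [
    {'dataset': 'MODEL-A', 'var_type': 'label'},
    {'dataset': 'MODEL-B'},
]
kept = drop_ignored(datasets, ignore)
print(kept)
```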

mlr_model_type: str

MLR model type. The given model has to be defined in esmvaltool.diag_scripts.mlr.models.

only_predict: bool, optional (default: False)

If True, only use esmvaltool.diag_scripts.mlr.models.MLRModel.predict() and do not create any other output (CSV files, plots, etc.).

pattern: str, optional

Pattern matched against ancestor file names.

plot_partial_dependences: bool, optional (default: False)

Plot partial dependence of every feature in MLR model (computationally expensive).

predict_kwargs: dict, optional

Optional keyword arguments for the final regressor’s predict() function.

pseudo_reality: list of str, optional

List of dataset attributes which are used to group input data for a pseudo-reality test (also known as model-as-truth or perfect-model setup). For every element of the group, a single MLR model is fitted on all data except for that of the specified group element. This group element is then used as additional prediction_input and prediction_reference. This allows a direct assessment of the predictive power of the MLR model by comparing the MLR prediction output with the true labels. The procedure is similar to splitting the input data into a training and a test set, except that the data is not divided randomly but by specific datasets, e.g. the different climate models. May be used together with the option group_metadata.
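
The leave-one-out structure of a pseudo-reality test can be sketched as follows. This is an illustrative example in plain Python, not the code used by the diagnostic; the generator pseudo_reality_splits and the dataset names are hypothetical.

```python
def pseudo_reality_splits(datasets, attributes):
    """Yield (group_key, training, held_out) for every group element.

    Datasets are grouped by the tuple of values of ``attributes``. For each
    group element, the training data contains all other datasets and the
    held-out data contains only the datasets of that element.
    """
    def key(dataset):
        return tuple(dataset[attr] for attr in attributes)

    # dict.fromkeys keeps the first-seen order of the group keys.
    for group_key in dict.fromkeys(key(d) for d in datasets):
        training = [d for d in datasets if key(d) != group_key]
        held_out = [d for d in datasets if key(d) == group_key]
        yield group_key, training, held_out


# Hypothetical metadata; names are placeholders.
datasets = [
    {'dataset': 'MODEL-A', 'var_type': 'feature'},
    {'dataset': 'MODEL-A', 'var_type': 'label'},
    {'dataset': 'MODEL-B', 'var_type': 'feature'},
    {'dataset': 'MODEL-B', 'var_type': 'label'},
    {'dataset': 'MODEL-C', 'var_type': 'feature'},
    {'dataset': 'MODEL-C', 'var_type': 'label'},
]
splits = list(pseudo_reality_splits(datasets, ['dataset']))
for group_key, training, held_out in splits:
    # The held-out element plays the role of the "truth" for this model.
    print(group_key, len(training), len(held_out))
```

In this sketch, three models would be fitted, each on the data of two climate models and evaluated against the third.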

rfecv_kwargs: dict, optional

If specified, use these additional keyword arguments to perform a recursive feature elimination using cross-validation, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html. May not be used together with efecv_kwargs or grid_search_cv_param_grid.

save_mlr_model_error: str or int, optional

Additionally save the estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (the latter are considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. This is only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. This is only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.
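
To make the n-fold variant more concrete, the sketch below estimates an RMSEP (root mean squared error of prediction) with n-fold cross-validation in plain Python. Several assumptions are made purely for illustration: the "model" is a stand-in that always predicts the mean of its training labels, folds are formed by interleaving without shuffling, and the errors of all folds are pooled before taking the root. None of this is claimed to match the implementation of the diagnostic.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def n_fold_rmsep(labels, n_folds):
    """Estimate the RMSEP of a mean predictor with n-fold cross-validation."""
    if not 2 <= n_folds <= len(labels):
        raise ValueError("n_folds must be between 2 and the number of labels")
    y_true, y_pred = [], []
    for fold in range(n_folds):
        # Interleaved folds: sample i belongs to fold i % n_folds.
        train = [y for i, y in enumerate(labels) if i % n_folds != fold]
        test = [y for i, y in enumerate(labels) if i % n_folds == fold]
        # Stand-in model: predict the mean of the training labels.
        prediction = sum(train) / len(train)
        y_true.extend(test)
        y_pred.extend([prediction] * len(test))
    return rmsep(y_true, y_pred)


labels = [1.0, 2.0, 3.0, 4.0]
error = n_fold_rmsep(labels, n_folds=2)
squared_error = error ** 2  # constant squared error estimate
print(error, squared_error)
```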

save_lime_importance: bool, optional (default: False)

Additionally save local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).

save_propagated_errors: bool, optional (default: False)

Additionally save propagated errors from prediction_input_error datasets.

select_metadata: dict, optional

Pre-select input data by specifying (key, value) pairs. Affects all datasets regardless of var_type.

Further options given here are passed on as optional parameters to esmvaltool.diag_scripts.mlr.models.MLRModel, or to esmvaltool.diag_scripts.mlr.mmm if mlr_model_type='mmm'.