MLRModel base class#
Base class for MLR models.
Example recipe#
The MLR main diagnostic script provides an interface for using MLR models in recipes. The following recipe shows a typical example of how to set up MLR recipes/diagnostics with the following properties:
- Set up an MLR model with target variable y (using the tag Y) and three predictors x1, x2 and latitude (with tags X1, X2 and latitude, respectively). The target variable needs the attribute var_type: label; the predictors x1 and x2 need the attribute var_type: feature. The coordinate feature latitude is added via the option coords_as_features: [latitude].
- Suppose y and x1 are 3D fields (pressure, latitude, longitude); x2 is a 2D field (latitude, longitude). Thus, it is necessary to add the attribute broadcast_from: [1, 2] to it (see dim_map parameter in iris.util.broadcast_to_shape() for details). In order to consider multiple climate models (A, B and C) at once, the option group_datasets_by_attributes: [dataset] is necessary. Otherwise the diagnostic will complain about duplicate data.
- For the prediction, data from dataset D is used (with var_type: prediction_input). For the feature X1, additional input error (with var_type: prediction_input_error) is used.

    diag_feature_x1:
      variables:
        feature:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: x1
          var_type: feature
          tag: X1
          additional_datasets:
            - {dataset: A, ...}
            - {dataset: B, ...}
            - {dataset: C, ...}
        prediction_input:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: x1
          var_type: prediction_input
          tag: X1
          additional_datasets:
            - {dataset: D, ...}
        prediction_input_error:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: x1Stderr
          var_type: prediction_input_error
          tag: X1
          additional_datasets:
            - {dataset: D, ...}
      scripts: null

    diag_feature_x2:
      variables:
        feature:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: x2
          var_type: feature
          broadcast_from: [1, 2]
          tag: X2
          additional_datasets:
            - {dataset: A, ...}
            - {dataset: B, ...}
            - {dataset: C, ...}
        prediction_input:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: x2
          var_type: prediction_input
          broadcast_from: [1, 2]
          tag: X2
          additional_datasets:
            - {dataset: D, ...}
      scripts: null

    diag_label:
      variables:
        label:
          ... # specify project, mip, start_year, end_year, etc.
          short_name: y
          var_type: label
          tag: Y
          additional_datasets:
            - {dataset: A, ...}
            - {dataset: B, ...}
            - {dataset: C, ...}
      scripts: null
- In this example, a GBRT model (with mlr_model_type: gbr_sklearn) is used. Parameters for this are specified via parameters_final_regressor. Apart from the best-estimate prediction, the estimated MLR model error (save_mlr_model_error: test) and the propagated prediction input error (save_propagated_errors: true) are returned.
- With postprocess.py, the global mean of the best-estimate prediction and the corresponding errors (MLR model + propagated input error) are calculated.

    diag_mlr_gbrt:
      scripts:
        mlr:
          script: mlr/main.py
          ancestors: [
            'diag_label/y',
            'diag_feature_*/*',
          ]
          coords_as_features: [latitude]
          group_datasets_by_attributes: [dataset]
          mlr_model_name: GBRT
          mlr_model_type: gbr_sklearn
          parameters_final_regressor:
            learning_rate: 0.1
            n_estimators: 100
          save_mlr_model_error: test
          save_propagated_errors: true
        postprocess:
          script: mlr/postprocess.py
          ancestors: ['diag_mlr_gbrt/mlr']
          ignore:
            - {var_type: null}
          mean: [pressure, latitude, longitude]
- Plots of the global distribution (latitude, longitude) are created with plot.py after calculating the mean over the pressure coordinate using preprocess.py.

    diag_plot:
      scripts:
        preprocess:
          script: mlr/preprocess.py
          ancestors: ['diag_mlr_gbrt/mlr']
          collapse: [pressure]
          ignore:
            - {var_type: null}
        plot:
          script: mlr/plot.py
          ancestors: ['diag_plot/preprocess']
          plot_map:
            plot_kwargs:
              cbar_label: 'Y'
              cbar_ticks: [0, 1, 2, 3]
              vmin: 0
              vmax: 3
All datasets must have the attribute var_type, which specifies the type of the dataset. Possible values are feature (independent variables used for training/testing), label (dependent variables, y-axis), prediction_input (independent variables used for prediction of dependent variables, usually observational data), prediction_input_error (standard error of the prediction_input data, optional) or prediction_reference (true values for the prediction_input data, optional). In addition, all datasets must have the attribute tag, which specifies the name of the variable/diagnostic. All datasets can be converted to new units in the loading step by specifying the key convert_units_to in the respective dataset(s).
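The attribute requirements above can be checked with a few lines of Python. The following is a minimal sketch and not the ESMValTool implementation: the function name check_dataset_metadata and the plain-dict representation of a dataset are assumptions made for illustration; only the attribute names and the allowed var_type values are taken from the description above.

```python
# Allowed values for ``var_type`` as listed in the text above.
ALLOWED_VAR_TYPES = {
    "feature",
    "label",
    "prediction_input",
    "prediction_input_error",
    "prediction_reference",
}


def check_dataset_metadata(dataset):
    """Return a list of problems in one dataset's metadata (hypothetical helper)."""
    problems = []
    var_type = dataset.get("var_type")
    if var_type is None:
        problems.append("missing attribute 'var_type'")
    elif var_type not in ALLOWED_VAR_TYPES:
        problems.append(f"invalid var_type '{var_type}'")
    if "tag" not in dataset:
        problems.append("missing attribute 'tag'")
    return problems


datasets = [
    {"short_name": "x1", "var_type": "feature", "tag": "X1"},
    {"short_name": "y", "var_type": "target", "tag": "Y"},  # wrong var_type
    {"short_name": "x2", "var_type": "feature"},  # tag is missing
]
for dataset in datasets:
    print(dataset["short_name"], check_dataset_metadata(dataset))
```

An empty list means the dataset carries both required attributes with a valid var_type.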
Training data#
All groups (specified in group_datasets_by_attributes, if desired) given for label datasets must also be given for the feature datasets. Within these groups, all feature and label datasets must have the same shape, unless the attribute broadcast_from is set to a list of suitable coordinate indices that map this dataset to regular datasets (see parameter dim_map in iris.util.broadcast_to_shape()).
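To illustrate what broadcast_from: [1, 2] means for a 2D (latitude, longitude) field that has to match a 3D (pressure, latitude, longitude) field, the sketch below computes the intermediate shape of the 2D field before broadcasting. It uses plain Python only. The helper name expand_shape is hypothetical, and the logic is a simplified reading of the dim_map idea (each source dimension is assigned to one target dimension, all remaining target dimensions get length 1), not the iris implementation.

```python
def expand_shape(source_shape, target_shape, dim_map):
    """Return the source shape padded with length-1 axes (hypothetical helper).

    ``dim_map[i]`` is the index of the target dimension that source
    dimension ``i`` corresponds to.
    """
    if len(dim_map) != len(source_shape):
        raise ValueError("dim_map needs one entry per source dimension")
    expanded = [1] * len(target_shape)
    for source_dim, target_dim in enumerate(dim_map):
        if source_shape[source_dim] != target_shape[target_dim]:
            raise ValueError(
                f"source dimension {source_dim} (length "
                f"{source_shape[source_dim]}) does not match target "
                f"dimension {target_dim} (length {target_shape[target_dim]})"
            )
        expanded[target_dim] = source_shape[source_dim]
    return tuple(expanded)


# x2 is (latitude, longitude) = (3, 4); y is (pressure, latitude, longitude).
print(expand_shape((3, 4), (2, 3, 4), dim_map=[1, 2]))  # (1, 3, 4)
```

The length-1 pressure axis is the one along which the 2D field is repeated to reach the full 3D shape.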
Prediction data#
All tags specified for prediction_input datasets must also be given for the feature datasets (unless allow_missing_features is set to True). Multiple predictions can be specified by prediction_name. Within these predictions, all prediction_input datasets must have the same shape, unless the attribute broadcast_from is given. Errors in the prediction input data can be specified by prediction_input_error. If given, these errors are used to calculate errors in the final prediction using linear error propagation given by LIME.
Additionally, true values for prediction_input can be specified with prediction_reference datasets (together with the respective prediction_name). This allows an evaluation of the performance of the MLR model by calculating residuals (true minus predicted values).
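As a rough illustration of linear error propagation, the sketch below combines local linear coefficients (such as those a LIME explanation provides for one prediction point) with the standard errors of the prediction input. It assumes uncorrelated input errors, which is a simplification chosen for this sketch and not something the text above states; the function name and all numbers are made up.

```python
import math


def propagate_linear_error(coefficients, input_errors):
    """Propagate input standard errors through a local linear model.

    Assumes uncorrelated input errors (an assumption of this sketch):
    sigma_y = sqrt(sum_i (c_i * sigma_i)**2).
    """
    if len(coefficients) != len(input_errors):
        raise ValueError("need one error per coefficient")
    return math.sqrt(
        sum((coef * err) ** 2 for coef, err in zip(coefficients, input_errors))
    )


# Hypothetical local slopes for the features X1 and X2 at one prediction point.
local_coefficients = [3.0, -4.0]
# Standard errors of the prediction input; X2 has no error dataset, hence 0.
standard_errors = [1.0, 0.0]
print(propagate_linear_error(local_coefficients, standard_errors))  # 3.0
```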
Available MLR models#
MLR models are subclasses of this base class. A list of all available MLR models can be found here. To add a new MLR model, create a new file in esmvaltool/diag_scripts/mlr/models/ with a child class of esmvaltool.diag_scripts.mlr.models.MLRModel decorated with esmvaltool.diag_scripts.mlr.models.MLRModel.register_mlr_model().
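The registration mechanism described above follows a common decorator-based registry pattern. The sketch below shows that pattern with stand-in classes so that it runs without ESMValTool installed. BaseModel, its _MODELS dictionary and the model type 'my_model' are hypothetical and are not the actual ESMValTool code.

```python
class BaseModel:
    """Stand-in for a base class with a model registry (hypothetical)."""

    _MODELS = {}

    @classmethod
    def register_mlr_model(cls, mlr_model_type):
        """Return a class decorator that stores the subclass under a name."""

        def decorator(subclass):
            cls._MODELS[mlr_model_type] = subclass
            return subclass

        return decorator

    @classmethod
    def create(cls, mlr_model_type, *args, **kwargs):
        """Factory method: instantiate the registered subclass."""
        if mlr_model_type not in cls._MODELS:
            raise NotImplementedError(
                f"MLR model type '{mlr_model_type}' is not registered"
            )
        return cls._MODELS[mlr_model_type](*args, **kwargs)


@BaseModel.register_mlr_model("my_model")
class MyModel(BaseModel):
    """Example child class."""


model = BaseModel.create("my_model")
print(type(model).__name__)  # MyModel
```

With this pattern, a recipe only needs to name the model type as a string, and the factory method looks up the matching class.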
Optional parameters for class initialization#
- accept_only_scalar_data: bool (default: False)
If set to True, only accept scalar input data. Should be used together with the option group_datasets_by_attributes.
- allow_missing_features: bool (default: False)
Allow missing features in the training data.
- cache_intermediate_results: bool (default: True)
Cache the intermediate results of the pipeline’s transformers.
- categorical_features: list of str
Names of features which are interpreted as categorical features (in contrast to numerical features).
- coords_as_features: list of str
If given, specify a list of coordinates which should be used as features.
- dtype: str (default: ‘float64’)
Internal data type which is used for all calculations, see https://docs.scipy.org/doc/numpy/user/basics.types.html for a list of allowed values.
- fit_kwargs: dict
Optional keyword arguments for the pipeline’s fit() function. These arguments have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
- group_datasets_by_attributes: list of str
List of dataset attributes which are used to group input data for features and labels. For example, this is necessary if the MLR model should consider multiple climate models in the training phase. If this option is not given, specifying multiple datasets with identical var_type and tag entries results in an error. If given, all the input data is first grouped by the given attributes and then checked for uniqueness within this group. After that, all groups are stacked to form a single set of training data.
- imputation_strategy: str (default: ‘remove’)
Strategy for the imputation of missing values in the features. Must be one of 'remove', 'mean', 'median', 'most_frequent' or 'constant'.
- log_level: str (default: ‘info’)
Verbosity for the logger. Must be one of 'debug', 'info', 'warning' or 'error'.
- mlr_model_name: str
Human-readable name of the MLR model instance (e.g., used for labels).
- n_jobs: int (default: 1)
Maximum number of jobs spawned by this class. Use -1 to use all processors. More details are given here.
- output_file_type: str (default: ‘png’)
File type for the plots.
- parameters: dict
Parameters used for the whole pipeline. Have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s. random_state parameters are explicitly allowed here (in contrast to parameters_final_regressor).
- parameters_final_regressor: dict
Parameters used for the final regressor. If these parameters are updated using the function update_parameters(), the new names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s. Note: to pass an argument for random_state, use the option random_state of this class.
- pca: bool (default: False)
Preprocess numerical input features using PCA. Parameters for this pipeline step can be given via the parameters argument.
- plot_dir: str (default: ~/plots)
Root directory to save plots.
- plot_units: dict
Replace specific units (keys) with other text (values) in plots.
- random_state: int or None (default: None)
Random seed for numpy.random.RandomState that is used by all functionalities of this class that require randomness (e.g., probabilistic ML algorithms like Gradient Boosting Regression models, random train test splits, etc.). If None, use a random seed. Use an int to get reproducible results. See https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness for more details.
- savefig_kwargs: dict
Keyword arguments for matplotlib.pyplot.savefig().
- seaborn_settings: dict
Options for seaborn.set_theme() (affects all plots).
- standardize_data: bool (default: True)
Linearly standardize numerical input data by removing mean and scaling to unit variance.
- sub_dir: str
Create additional subdirectory for output in work_dir and plot_dir.
- test_size: float (default: 0.25)
If given, randomly exclude the desired fraction of input data from training and use it as test data.
- weighted_samples: dict
If specified, use weighted samples in the loss function used for the training of the MLR model. The given keyword arguments are directly passed to esmvaltool.diag_scripts.mlr.get_all_weights() to calculate the sample weights. By default, no weights are used. Raises errors if the desired weights cannot be calculated for the data, e.g., when time_weighted=True is used but the data does not contain a dimension time.
- work_dir: str (default: ~/work)
Root directory to save all other files (mainly *.nc files).
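Several of the options above (fit_kwargs, parameters, parameters_final_regressor) rely on the s__p naming convention. The sketch below splits such names into step and parameter with plain Python. The step names transformer and final and the parameter values are hypothetical examples, and the helper exists only for illustration.

```python
def group_by_step(parameters):
    """Group ``step__parameter`` keys by pipeline step (hypothetical helper)."""
    grouped = {}
    for name, value in parameters.items():
        # Split at the first double underscore only, so that nested
        # names such as ``a__b__c`` keep ``b__c`` as the parameter part.
        step, separator, parameter = name.partition("__")
        if not separator or not step or not parameter:
            raise ValueError(f"'{name}' is not of the form 'step__parameter'")
        grouped.setdefault(step, {})[parameter] = value
    return grouped


parameters = {
    "final__n_estimators": 100,
    "final__learning_rate": 0.1,
    "transformer__with_mean": True,
}
print(group_by_step(parameters))
```

A name without a double underscore is rejected, which mirrors the requirement that every parameter is addressed through its pipeline step.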
Classes:
MLRModel(input_datasets, **kwargs): Base class for MLR models.
- class esmvaltool.diag_scripts.mlr.models.MLRModel(input_datasets, **kwargs)[source]#
Bases:
object
Base class for MLR models.
Attributes:
Categorical features.
Input data of the MLR model.
Features of the input data.
Features of the input data after preprocessing.
Types of the features.
Units of the features.
Keyword arguments for fit().
Group attributes of the input data.
Label of the input data.
Units of the label.
MLR model type.
Numerical features.
Parameters of the complete MLR model pipeline.
Random state instance.
Methods:
create(mlr_model_type, *args, **kwargs): Create desired MLR model subclass (factory method).
efecv(**kwargs): Perform exhaustive feature elimination using cross-validation.
export_prediction_data([filename]): Export all prediction data contained in self._data.
export_training_data([filename]): Export all training data contained in self._data.
fit(): Fit MLR model.
get_ancestors([label, features, ...]): Return ancestor files.
get_data_frame(data_type[, impute_nans]): Return data frame of specified type.
get_x_array(data_type[, impute_nans]): Return x data of specific type.
get_y_array(data_type[, impute_nans]): Return y data of specific type.
grid_search_cv(param_grid, **kwargs): Perform exhaustive parameter search using cross-validation.
plot_1d_model([filename, n_points]): Plot lineplot that represents the MLR model.
plot_partial_dependences([filename]): Plot partial dependences for every feature.
plot_prediction_errors([filename]): Plot predicted vs. true values.
plot_residuals([filename]): Plot residuals of training and test (if available) data.
plot_residuals_distribution([filename]): Plot distribution of residuals of training and test data (KDE).
plot_residuals_histogram([filename]): Plot histogram of residuals of training and test data.
plot_scatterplots([filename]): Plot scatterplots label vs. feature for every feature.
predict([save_mlr_model_error, ...]): Perform prediction using the MLR model(s) and write *.nc files.
Print correlation matrices for all datasets.
print_regression_metrics([logo]): Print all available regression metrics for training data.
register_mlr_model(mlr_model_type): Add MLR model (subclass of this class) (decorator).
Reset regressor pipeline.
rfecv(**kwargs): Perform recursive feature elimination using cross-validation.
test_normality_of_residuals(): Perform Shapiro-Wilk test for normality of residuals.
update_parameters(**params): Update parameters of the whole pipeline.
- property categorical_features#
Categorical features.
- Type:
- classmethod create(mlr_model_type, *args, **kwargs)[source]#
Create desired MLR model subclass (factory method).
- efecv(**kwargs)[source]#
Perform exhaustive feature elimination using cross-validation.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().
- export_prediction_data(filename=None)[source]#
Export all prediction data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.
- export_training_data(filename=None)[source]#
Export all training data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.
- property features#
Features of the input data.
- Type:
- property features_after_preprocessing#
Features of the input data after preprocessing.
- Type:
- property features_types#
Types of the features.
- Type:
- property features_units#
Units of the features.
- Type:
- fit()[source]#
Fit MLR model.
Note
Specifying keyword arguments for this function is not allowed here since features_after_preprocessing might be altered by that. Use the keyword argument fit_kwargs during class initialization instead.
- get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)[source]#
Return ancestor files.
- Parameters:
label (bool, optional (default: True)) – Return label files.
features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.
prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.
prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.
- Returns:
Ancestor files.
- Return type:
- Raises:
ValueError – Invalid feature or prediction_name given.
- get_data_frame(data_type, impute_nans=False)[source]#
Return data frame of specified type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_x_array(data_type, impute_nans=False)[source]#
Return x data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_y_array(data_type, impute_nans=False)[source]#
Return y data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- grid_search_cv(param_grid, **kwargs)[source]#
Perform exhaustive parameter search using cross-validation.
- Parameters:
param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. Have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
**kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.
- Raises:
ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.
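An exhaustive search evaluates every combination of the values given in param_grid. The sketch below enumerates these combinations with itertools.product to show how quickly the grid grows; it does not fit any model. The step name final and the chosen values are hypothetical.

```python
import itertools


def expand_param_grid(param_grid):
    """Return every parameter combination of a single grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [
        dict(zip(names, values)) for values in itertools.product(*value_lists)
    ]


param_grid = {
    "final__learning_rate": [0.01, 0.1],
    "final__n_estimators": [50, 100, 200],
}
candidates = expand_param_grid(param_grid)
print(len(candidates))  # 6
for candidate in candidates:
    print(candidate)
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds.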
- property group_attributes#
Group attributes of the input data.
- Type:
- property numerical_features#
Numerical features.
- Type:
- plot_1d_model(filename=None, n_points=1000)[source]#
Plot lineplot that represents the MLR model.
Note
This only works for a model with a single feature.
- Parameters:
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – MLR model is built from more than 1 feature.
- plot_partial_dependences(filename=None)[source]#
Plot partial dependences for every feature.
- Parameters:
filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_prediction_errors(filename=None)[source]#
Plot predicted vs. true values.
- Parameters:
filename (str, optional (default: 'prediction_errors')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals(filename=None)[source]#
Plot residuals of training and test (if available) data.
- Parameters:
filename (str, optional (default: 'residuals')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_distribution(filename=None)[source]#
Plot distribution of residuals of training and test data (KDE).
- Parameters:
filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_histogram(filename=None)[source]#
Plot histogram of residuals of training and test data.
- Parameters:
filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_scatterplots(filename=None)[source]#
Plot scatterplots label vs. feature for every feature.
- Parameters:
filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)[source]#
Perform prediction using the MLR model(s) and write *.nc files.
- Parameters:
save_mlr_model_error (str or int, optional) – Additionally saves estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (errors in that will be considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. Only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. Only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.
save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).
save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.
**kwargs (keyword arguments, optional) – Additional options for the final regressor's predict() function.
- Raises:
RuntimeError – return_var and return_cov are both set to True.
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – An invalid value for save_mlr_model_error is given.
ValueError – save_propagated_errors is True and no prediction_input_error data is available.
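For the 'test' option of save_mlr_model_error, the constant model error is the root mean squared error of prediction (RMSEP) on the hold-out data. A minimal sketch of that quantity in plain Python follows. The function name and the numbers are illustrative; the diagnostic itself works on the arrays held by the MLR model.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction for two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hold-out labels and the corresponding predictions.
y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
error = rmsep(y_test, y_pred)
print(error)       # 1.0
print(error ** 2)  # squared error: 1.0
```

Because the estimate is a single number, the same error value is assigned to every point of the prediction.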
- print_regression_metrics(logo=False)[source]#
Print all available regression metrics for training data.
- Parameters:
logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.
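Leave-one-group-out cross-validation holds out all samples of one group per split. The sketch below generates such splits in plain Python to show the idea; the class itself refers to sklearn.model_selection.LeaveOneGroupOut for this. The group labels A, B and C stand for the climate models of the example recipe, and the helper name is hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for every group."""
    for held_out in sorted(set(groups)):
        train = [i for i, group in enumerate(groups) if group != held_out]
        test = [i for i, group in enumerate(groups) if group == held_out]
        yield held_out, train, test


# One group label per sample, e.g. the ``dataset`` attribute.
groups = ["A", "A", "B", "C", "C"]
for held_out, train, test in leave_one_group_out(groups):
    print(held_out, train, test)
# A [2, 3, 4] [0, 1]
# B [0, 1, 3, 4] [2]
# C [0, 1, 2] [3, 4]
```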
- property random_state#
Random state instance.
- Type:
- classmethod register_mlr_model(mlr_model_type)[source]#
Add MLR model (subclass of this class) (decorator).
- rfecv(**kwargs)[source]#
Perform recursive feature elimination using cross-validation.
Note
This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.
- Raises:
RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.
- test_normality_of_residuals()[source]#
Perform Shapiro-Wilk test for normality of residuals.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- update_parameters(**params)[source]#
Update parameters of the whole pipeline.
Note
Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
- Parameters:
**params (keyword arguments, optional) – Parameters for the pipeline which should be updated.
- Raises:
ValueError – Invalid parameter for pipeline given.