Gaussian Process Regression (sklearn implementation)#
Gaussian Process Regression model (using sklearn).
Use mlr_model_type: gpr_sklearn to use this MLR model in the recipe.
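Classes:
AdvancedGaussianProcessRegressor([kernel, ...]) – Expand sklearn.gaussian_process.GaussianProcessRegressor.
SklearnGPRModel(input_datasets, **kwargs) – Gaussian Process Regression model (sklearn implementation).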
- class esmvaltool.diag_scripts.mlr.models.gpr_sklearn.AdvancedGaussianProcessRegressor(kernel=None, *, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, n_targets=None, random_state=None)#
Bases: GaussianProcessRegressor
Expand sklearn.gaussian_process.GaussianProcessRegressor.
Methods:
fit(X, y) – Fit Gaussian process regression model.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
log_marginal_likelihood([theta, ...]) – Return log-marginal likelihood of theta for training data.
predict(x_data[, return_var, return_cov]) – Expand predict() to accept return_var.
sample_y(X[, n_samples, random_state]) – Draw samples from Gaussian process and evaluate at X.
score(X, y[, sample_weight]) – Return the coefficient of determination of the prediction.
set_params(**params) – Set the parameters of this estimator.
set_predict_request(*[, return_cov, ...]) – Request metadata passed to the predict method.
set_score_request(*[, sample_weight]) – Request metadata passed to the score method.
- fit(X, y)#
Fit Gaussian process regression model.
- get_metadata_routing()#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)#
Get parameters for this estimator.
- log_marginal_likelihood(theta=None, eval_gradient=False, clone_kernel=True)#
Return log-marginal likelihood of theta for training data.
- Parameters:
theta (array-like of shape (n_kernel_params,), default=None) – Kernel hyperparameters for which the log-marginal likelihood is evaluated. If None, the precomputed log_marginal_likelihood of self.kernel_.theta is returned.
eval_gradient (bool, default=False) – If True, the gradient of the log-marginal likelihood with respect to the kernel hyperparameters at position theta is returned additionally. If True, theta must not be None.
clone_kernel (bool, default=True) – If True, the kernel attribute is copied. If False, the kernel attribute is modified, but may result in a performance improvement.
- Returns:
log_likelihood (float) – Log-marginal likelihood of theta for training data.
log_likelihood_gradient (ndarray of shape (n_kernel_params,), optional) – Gradient of the log-marginal likelihood with respect to the kernel hyperparameters at position theta. Only returned when eval_gradient is True.
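The two calling modes described above can be sketched as follows. This is a minimal illustration that assumes the base class sklearn.gaussian_process.GaussianProcessRegressor behaves the same way as the subclass for this method; the toy data, the RBF kernel and the alpha value are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical toy data: a noisy sine curve with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=20)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01,
                               random_state=0)
gpr.fit(X, y)

# theta=None: the precomputed value for the fitted kernel is returned.
lml_fitted = gpr.log_marginal_likelihood()

# theta given and eval_gradient=True: value and gradient at that position.
theta = gpr.kernel_.theta  # log-transformed kernel hyperparameters
lml, lml_gradient = gpr.log_marginal_likelihood(theta, eval_gradient=True)
```

Evaluating at the fitted theta should reproduce the precomputed value, and the gradient has one entry per kernel hyperparameter.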
- sample_y(X, n_samples=1, random_state=0)#
Draw samples from Gaussian process and evaluate at X.
- Parameters:
X (array-like of shape (n_samples_X, n_features) or list of object) – Query points where the GP is evaluated.
n_samples (int, default=1) – Number of samples drawn from the Gaussian process per query point.
random_state (int, RandomState instance or None, default=0) – Determines random number generation to randomly draw samples. Pass an int for reproducible results across multiple function calls. See Glossary.
- Returns:
y_samples – Values of n_samples samples drawn from Gaussian process and evaluated at query points.
- Return type:
ndarray of shape (n_samples_X, n_samples), or (n_samples_X, n_targets, n_samples)
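As a rough illustration of the return shape, the sketch below draws samples at a few query points. It assumes the base class sklearn.gaussian_process.GaussianProcessRegressor with a single target; the data and kernel are made up for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical training data with one feature and one target.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=20)

gpr = GaussianProcessRegressor(kernel=RBF(), alpha=0.01).fit(X, y)

# Seven query points, three samples per query point.
X_query = np.linspace(0.0, 5.0, 7).reshape(-1, 1)
y_samples = gpr.sample_y(X_query, n_samples=3, random_state=0)

# Passing the same integer seed again reproduces the same draws.
y_samples_again = gpr.sample_y(X_query, n_samples=3, random_state=0)
```

For a single target the result has shape (n_samples_X, n_samples), here (7, 3).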
- score(X, y, sample_weight=None)#
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – \(R^2\) of self.predict(X) w.r.t. y.
- Return type:
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
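The definition of \(R^2\) given for score can be checked by hand. The sketch below computes \(u\) and \(v\) directly and compares the result with sklearn.metrics.r2_score; the four data points are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1.0 - u / v

# A constant model that always predicts the mean of y_true scores 0.0.
constant_pred = np.full_like(y_true, y_true.mean())
r2_constant = 1.0 - ((y_true - constant_pred) ** 2).sum() / v
```

Here u is 1.5 and v is 29.1875, so the manual score is roughly 0.949 and agrees with r2_score(y_true, y_pred).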
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
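The <component>__<parameter> form can be illustrated with a small nested object. The sketch uses a plain sklearn Pipeline; the step names "scaler" and "gpr" are hypothetical and are not taken from the ESMValTool pipeline.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline; the step names are chosen for this example.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("gpr", GaussianProcessRegressor()),
])

# Update one parameter of each component in a single call.
pipeline.set_params(gpr__alpha=1e-3, scaler__with_std=False)

updated_alpha = pipeline.named_steps["gpr"].alpha
updated_with_std = pipeline.named_steps["scaler"].with_std
```

set_params returns the estimator itself, so calls can be chained.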
- set_predict_request(*, return_cov: bool | None | str = '$UNCHANGED$', return_var: bool | None | str = '$UNCHANGED$', x_data: bool | None | str = '$UNCHANGED$') AdvancedGaussianProcessRegressor #
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
return_cov (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_cov parameter in predict.
return_var (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_var parameter in predict.
x_data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for x_data parameter in predict.
- Returns:
self – The updated object.
- Return type:
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') AdvancedGaussianProcessRegressor #
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- class esmvaltool.diag_scripts.mlr.models.gpr_sklearn.SklearnGPRModel(input_datasets, **kwargs)#
Bases: MLRModel
Gaussian Process Regression model (sklearn implementation).
Attributes:
categorical_features – Categorical features.
data – Input data of the MLR model.
features – Features of the input data.
features_after_preprocessing – Features of the input data after preprocessing.
features_types – Types of the features.
features_units – Units of the features.
fit_kwargs – Keyword arguments for fit().
group_attributes – Group attributes of the input data.
label – Label of the input data.
label_units – Units of the label.
mlr_model_type – MLR model type.
numerical_features – Numerical features.
parameters – Parameters of the complete MLR model pipeline.
random_state – Random state instance.
Methods:
create(mlr_model_type, *args, **kwargs) – Create desired MLR model subclass (factory method).
efecv(**kwargs) – Perform exhaustive feature elimination using cross-validation.
export_prediction_data([filename]) – Export all prediction data contained in self._data.
export_training_data([filename]) – Export all training data contained in self._data.
fit() – Fit MLR model.
get_ancestors([label, features, ...]) – Return ancestor files.
get_data_frame(data_type[, impute_nans]) – Return data frame of specified type.
get_x_array(data_type[, impute_nans]) – Return x data of specific type.
get_y_array(data_type[, impute_nans]) – Return y data of specific type.
grid_search_cv(param_grid, **kwargs) – Perform exhaustive parameter search using cross-validation.
plot_1d_model([filename, n_points]) – Plot lineplot that represents the MLR model.
plot_partial_dependences([filename]) – Plot partial dependences for every feature.
plot_prediction_errors([filename]) – Plot predicted vs. true values.
plot_residuals([filename]) – Plot residuals of training and test (if available) data.
plot_residuals_distribution([filename]) – Plot distribution of residuals of training and test data (KDE).
plot_residuals_histogram([filename]) – Plot histogram of residuals of training and test data.
plot_scatterplots([filename]) – Plot scatterplots label vs. feature for every feature.
predict([save_mlr_model_error, ...]) – Perform prediction using the MLR model(s) and write *.nc files.
print_correlation_matrices() – Print correlation matrices for all datasets.
print_kernel_info() – Print information of the fitted kernel of the GPR model.
print_regression_metrics([logo]) – Print all available regression metrics for training data.
register_mlr_model(mlr_model_type) – Add MLR model (subclass of this class) (decorator).
reset_pipeline() – Reset regressor pipeline.
rfecv(**kwargs) – Perform recursive feature elimination using cross-validation.
test_normality_of_residuals() – Perform Shapiro-Wilk test for normality of residuals.
update_parameters(**params) – Update parameters of the whole pipeline.
- property categorical_features#
Categorical features.
- Type:
- classmethod create(mlr_model_type, *args, **kwargs)#
Create desired MLR model subclass (factory method).
- efecv(**kwargs)#
Perform exhaustive feature elimination using cross-validation.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().
- export_prediction_data(filename=None)#
Export all prediction data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.
- export_training_data(filename=None)#
Export all training data contained in self._data.
- Parameters:
filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.
- property features#
Features of the input data.
- Type:
- property features_after_preprocessing#
Features of the input data after preprocessing.
- Type:
- property features_types#
Types of the features.
- Type:
- property features_units#
Units of the features.
- Type:
- fit()#
Fit MLR model.
Note
Specifying keyword arguments for this function is not allowed here since features_after_preprocessing might be altered by that. Use the keyword argument fit_kwargs during class initialization instead.
- get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)#
Return ancestor files.
- Parameters:
label (bool, optional (default: True)) – Return label files.
features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.
prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.
prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.
- Returns:
Ancestor files.
- Return type:
- Raises:
ValueError – Invalid feature or prediction_name given.
- get_data_frame(data_type, impute_nans=False)#
Return data frame of specified type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_x_array(data_type, impute_nans=False)#
Return x data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- get_y_array(data_type, impute_nans=False)#
Return y data of specific type.
- Parameters:
- Returns:
Desired data.
- Return type:
- Raises:
TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
- grid_search_cv(param_grid, **kwargs)#
Perform exhaustive parameter search using cross-validation.
- Parameters:
param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. They have to be given for each step of the pipeline, separated by two underscores, i.e. s__p is the parameter p for step s.
**kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.
- Raises:
ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.
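The s__p naming of param_grid keys follows the sklearn convention for pipelines. The sketch below shows that convention with plain sklearn objects rather than with grid_search_cv itself; the step names "scaler" and "gpr", the toy data and the grid values are all assumptions made for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# Hypothetical pipeline; "gpr" is the step s, "alpha" the parameter p.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("gpr", GaussianProcessRegressor(random_state=0)),
])
param_grid = {"gpr__alpha": [1e-3, 1e-2, 1e-1]}

# Exhaustive search over the grid with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

best_alpha = search.best_params_["gpr__alpha"]
```

After fitting, the search object exposes best_estimator_ and best_params_, the two attributes named in the Raises entry above.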
- property group_attributes#
Group attributes of the input data.
- Type:
- property numerical_features#
Numerical features.
- Type:
- plot_1d_model(filename=None, n_points=1000)#
Plot lineplot that represents the MLR model.
Note
This only works for a model with a single feature.
- Parameters:
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – MLR model is built from more than 1 feature.
- plot_partial_dependences(filename=None)#
Plot partial dependences for every feature.
- Parameters:
filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_prediction_errors(filename=None)#
Plot predicted vs. true values.
- Parameters:
filename (str, optional (default: 'prediction_errors')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals(filename=None)#
Plot residuals of training and test (if available) data.
- Parameters:
filename (str, optional (default: 'residuals')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_distribution(filename=None)#
Plot distribution of residuals of training and test data (KDE).
- Parameters:
filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_residuals_histogram(filename=None)#
Plot histogram of residuals of training and test data.
- Parameters:
filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- plot_scatterplots(filename=None)#
Plot scatterplots label vs. feature for every feature.
- Parameters:
filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
- predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)#
Perform prediction using the MLR model(s) and write *.nc files.
- Parameters:
save_mlr_model_error (str or int, optional) – Additionally saves estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (errors in the input data are considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. Only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. Only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.
save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).
save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.
**kwargs (keyword arguments, optional) – Additional options for the final regressor's predict() function.
- Raises:
RuntimeError – return_var and return_cov are both set to True.
sklearn.exceptions.NotFittedError – MLR model is not fitted.
ValueError – An invalid value for save_mlr_model_error is given.
ValueError – save_propagated_errors is True and no prediction_input_error data is available.
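The idea behind estimating a constant model error as RMSEP from n-fold cross-validation can be sketched with plain sklearn tools. This is an illustration under stated assumptions, not the ESMValTool implementation: the data, the fold count and the use of cross_val_predict are choices made for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical toy data with one feature.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

gpr = GaussianProcessRegressor(alpha=0.01)

# n-fold cross-validation: every sample is predicted by a model that was
# fitted without it.
n_folds = 5
cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = cross_val_predict(gpr, X, y, cv=cv)

# Root mean squared error of prediction (RMSEP) and its square.
rmsep = np.sqrt(np.mean((y - y_pred) ** 2))
squared_model_error = rmsep ** 2
```

The squared value corresponds to the kind of constant squared error that save_mlr_model_error describes; how ESMValTool weights or groups samples internally may differ.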
- print_correlation_matrices()#
Print correlation matrices for all datasets.
- print_regression_metrics(logo=False)#
Print all available regression metrics for training data.
- Parameters:
logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.
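Leave-one-group-out cross-validation holds out all samples of one group per split. The sketch below shows the mechanism with sklearn.model_selection.LeaveOneGroupOut; the group labels stand in for whatever group_datasets_by_attributes would produce and are purely hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical toy data: 30 samples from three groups (e.g. three datasets).
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)
groups = np.repeat(["group_a", "group_b", "group_c"], 10)

gpr = GaussianProcessRegressor(alpha=0.01)

# One split per group; sklearn reports the negated RMSE as the score.
scores = cross_val_score(
    gpr, X, y, groups=groups, cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
rmse_per_group = -scores
```

Each entry of rmse_per_group is the error obtained when the corresponding group was left out of training.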
- property random_state#
Random state instance.
- Type:
- classmethod register_mlr_model(mlr_model_type)#
Add MLR model (subclass of this class) (decorator).
- reset_pipeline()#
Reset regressor pipeline.
- rfecv(**kwargs)#
Perform recursive feature elimination using cross-validation.
Note
This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.
- Parameters:
**kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.
- Raises:
RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.
- test_normality_of_residuals()#
Perform Shapiro-Wilk test for normality of residuals.
- Raises:
sklearn.exceptions.NotFittedError – MLR model is not fitted.
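A Shapiro-Wilk test on residuals can be run with scipy.stats.shapiro, as sketched below. The residuals here are synthetic stand-ins drawn from a normal distribution; in practice they would be the differences between true and predicted values.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals (y_true - y_pred), drawn from a normal distribution.
rng = np.random.RandomState(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Null hypothesis: the residuals come from a normal distribution.
statistic, p_value = stats.shapiro(residuals)

# A small p-value (e.g. below 0.05) would speak against normality.
reject_normality = p_value < 0.05
```

The test statistic lies between 0 and 1, with values close to 1 indicating good agreement with a normal distribution.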
- update_parameters(**params)#
Update parameters of the whole pipeline.
Note
Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.
- Parameters:
**params (keyword arguments, optional) – Parameters for the pipeline which should be updated.
- Raises:
ValueError – Invalid parameter for pipeline given.