Preprocessor¶

In this section, each of the preprocessor modules is described, roughly following the default order in which preprocessor functions are applied.

See Preprocessor functions for implementation details and the exact default order.

Overview¶

The ESMValTool preprocessor can be used to perform a broad range of operations on the input data before diagnostics or metrics are applied. The preprocessor performs these operations in a centralized, documented and efficient way, thus reducing the data processing load on the diagnostics side. For an overview of the preprocessor structure see the Recipe section: preprocessors.

Each of the preprocessor operations is written in a dedicated Python module, and all of them receive and return an Iris cube, working sequentially on the data with no interactions between them. The order in which the preprocessor operations are applied is set by default to minimize the loss of information due to, for example, temporal and spatial subsetting or multi-model averaging. Nevertheless, the user is free to change this order to address specific scientific requirements, keeping in mind that some operations must be performed in a specific order. This is the case, for instance, for multi-model statistics, which require the models to be on a common grid and therefore have to be called after the regridding module.

Variable derivation¶

The variable derivation module allows deriving variables that are not in the CMIP standard data request, using standard variables as input. The typical use case of this operation is the evaluation of a variable which is only available in an observational dataset but not in the models. In this case a derivation function is provided by the ESMValTool in order to calculate the variable and perform the comparison. For example, several observational datasets deliver total column ozone as observed variable (toz), but CMIP models only provide the ozone 3D field. In this case, a derivation function is provided to vertically integrate the ozone and obtain total column ozone for direct comparison with the observations.

To contribute a new derived variable, it is also necessary to define a name for it and to provide the corresponding CMOR table. This is to guarantee the proper metadata definition is attached to the derived data. Such custom CMOR tables are collected as part of the ESMValCore package. By default, the variable derivation will be applied only if the variable is not already available in the input data, but the derivation can be forced by setting the appropriate flag.

variables:
  toz:
    derive: true
    force_derivation: false


The required arguments for this module are two boolean switches:

• derive: activate variable derivation

• force_derivation: force variable derivation even if the variable is directly available in the input data.

See also esmvalcore.preprocessor.derive(). To get an overview on derivation scripts and how to implement new ones, please go to Variable derivation.

CMORization and dataset-specific fixes¶

Data checking¶

Data preprocessed by ESMValTool is automatically checked against its CMOR definition. To reduce the impact of this check while keeping it as reliable as possible, it is split into two parts: one checks the metadata and is applied just after loading and concatenating the data, and the other checks the data itself and is applied after all extraction operations, to reduce the amount of data to process.

Checks include, but are not limited to:

• Requested coordinates are present and comply with their definition.

• Correctness of variable names, units and other metadata.

• Compliance with the valid minimum and maximum values allowed if defined.

The most relevant issues (e.g. a missing coordinate) will raise an error, while others (e.g. an incorrect long name) will be reported as a warning.

Some of those issues will be fixed automatically by the tool, including the following:

• Incorrect standard or long names.

• Incorrect units, if they can be converted to the correct ones.

• Direction of coordinates.

• Automatic clipping of longitude to 0 - 360 interval.

Dataset specific fixes¶

Sometimes the checker will detect errors that it cannot fix by itself. ESMValTool deals with these issues by applying specific fixes for those datasets that require them. Fixes are applied at three different preprocessor steps: fix_file, fix_metadata and fix_data.

To get an overview on data fixes and how to implement new ones, please go to Dataset fixes.

Vertical interpolation¶

Vertical level selection is an important aspect of data preprocessing since it allows the scientist to perform a number of metrics specific to certain levels (whether it be air pressure or depth, e.g. the Quasi-Biennial-Oscillation (QBO) u30 is computed at 30 hPa). Dataset native vertical grids may not come with the desired set of levels, so an interpolation operation will be needed to regrid the data vertically. ESMValTool can perform this vertical interpolation via the extract_levels preprocessor. Level extraction may be done in a number of ways.

Level extraction can be done at specific values by passing a list of levels to extract_levels via the levels argument (note that the units follow the CMOR standard, i.e. pascals (Pa) for pressure levels):

preprocessors:
  preproc_select_levels_from_list:
    extract_levels:
      levels: [100000., 50000., 3000., 1000.]
      scheme: linear


It is also possible to extract the CMIP-specific, CMOR levels as they appear in the CMOR table, e.g. plev10 or plev17 or plev19 etc:

preprocessors:
  preproc_select_levels_from_cmip_table:
    extract_levels:
      levels: {cmor_table: CMIP6, coordinate: plev10}
      scheme: nearest


Level extraction can also be performed with values specific to a certain dataset, without the user having to query the dataset of interest to find out its specific levels: e.g. in the example below we offer two alternatives to extract the levels and vertically regrid onto the vertical levels of ERA-Interim:

preprocessors:
  preproc_select_levels_from_dataset:
    extract_levels:
      levels: ERA-Interim
      # This also works, but allows specifying the pressure coordinate name
      # levels: {dataset: ERA-Interim, coordinate: air_pressure}
      scheme: linear_horizontal_extrapolate_vertical


By default, vertical interpolation is performed along the dimension coordinate of the z axis. If you want to explicitly declare the z-axis coordinate to use (for example, air_pressure in variables that are provided on model levels rather than pressure levels), you can override that automatic choice by providing the name of the desired coordinate:

preprocessors:
  preproc_select_levels_from_dataset:
    extract_levels:
      levels: ERA-Interim
      scheme: linear_horizontal_extrapolate_vertical
      coordinate: air_pressure

• See also esmvalcore.preprocessor.get_cmor_levels().

Note

For both vertical and horizontal regridding one can control the extrapolation mode when defining the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations where extrapolating values makes little physical sense (e.g. extrapolating beyond the last data point). The extrapolation mode is controlled by the extrapolation_mode keyword. For the interpolation schemes available in Iris, the extrapolation_mode keyword must be one of:

• extrapolate: the extrapolation points will be calculated by extending the gradient of the closest two points;

• error: a ValueError exception will be raised, notifying an attempt to extrapolate;

• nan: the extrapolation points will be set to NaN;

• mask: the extrapolation points will always be masked, even if the source data is not a MaskedArray; or

• nanmask: if the source data is a MaskedArray the extrapolation points will be masked, otherwise they will be set to NaN.

Weighting¶

Land/sea fraction weighting¶

This preprocessor allows weighting of data by land or sea fractions. In other words, this function multiplies the given input field by a fraction in the range 0-1 to account for the fact that not all grid points are completely land- or sea-covered.

The application of this preprocessor is very important for most carbon cycle variables (and other land surface outputs), which are e.g. reported in units of kgC m⁻². Here, the surface unit actually refers to ‘square meter of land/sea’ and NOT ‘square meter of gridbox’. In order to integrate these globally or regionally one has to weight by both the surface quantity and the land/sea fraction.

For example, to weight an input field with the land fraction, the following preprocessor can be used:

preprocessors:
  preproc_weighting:
    weighting_landsea_fraction:
      area_type: land
      exclude: ['CanESM2', 'reference_dataset']


Allowed arguments for the keyword area_type are land (fraction is 1 for grid cells with only land surface, 0 for grid cells with only sea surface and values between 0 and 1 for coastal regions) and sea (1 for sea, 0 for land, in between for coastal regions). The optional argument exclude allows excluding specific datasets from this preprocessor, which is for example useful for climate models that do not offer land/sea fraction files. This argument also accepts the special dataset specifiers reference_dataset and alternative_dataset.

Optionally you can specify your own custom fx variable to be used in cases when e.g. a certain experiment is preferred for fx data retrieval:

preprocessors:
  preproc_weighting:
    weighting_landsea_fraction:
      area_type: land
      exclude: ['CanESM2', 'reference_dataset']
      fx_variables: [{'short_name': 'sftlf', 'exp': 'piControl'}, {'short_name': 'sftof', 'exp': 'piControl'}]


Masking¶

Certain metrics and diagnostics need to be computed and performed on specific domains of the globe. The ESMValTool preprocessor supports filtering the input data on continents, oceans/seas and ice. This is achieved by masking the model data and keeping only the values associated with grid points that correspond to, e.g., land, ocean or ice surfaces, as specified by the user. Where possible, the masking is realized using the standard mask files provided together with the model data as part of the CMIP data request (the so-called fx variables). In the absence of these files, the Natural Earth masks are used: although these are not model-specific, they represent a good approximation since they have a much higher resolution than most of the models and they are regularly updated with changing geographical features.

In ESMValTool, land-sea-ice masking can be done in two places: in the preprocessor, to apply a mask on the data before any subsequent preprocessing step and before running the diagnostic, or in the diagnostic scripts themselves. We present both these implementations below.

To mask out a certain domain (e.g., sea) in the preprocessor, mask_landsea can be used:

preprocessors:
  preproc_mask:
    mask_landsea:
      mask_out: sea


and requires only one argument: mask_out: either land or sea.

The preprocessor automatically retrieves the corresponding mask (fx: sftof in this case) and applies it so that sea-covered grid cells are set to missing. Conversely, it retrieves the fx: sftlf mask when land needs to be masked out.

Optionally you can specify your own custom fx variable to be used in cases when e.g. a certain experiment is preferred for fx data retrieval:

preprocessors:
  preproc_mask:
    mask_landsea:
      mask_out: sea
      fx_variables: [{'short_name': 'sftlf', 'exp': 'piControl'}, {'short_name': 'sftof', 'exp': 'piControl'}]


If the corresponding fx file is not found (which is the case for some models and almost all observational datasets), the preprocessor attempts to mask the data using Natural Earth mask files (which are vectorized rasters). As mentioned above, the spatial resolution of the Natural Earth masks is much higher than that of any typical global model (10 m for land and glaciated areas and 50 m for ocean masks).

Note that for masking out ice sheets, the preprocessor uses a different function, to ensure that both land and sea or ice can be masked out without losing generality. To mask ice out, mask_landseaice can be used:

preprocessors:
  preproc_mask:
    mask_landseaice:
      mask_out: ice


and requires only one argument: mask_out: either landsea or ice.

As in the case of mask_landsea, the preprocessor automatically retrieves the sftgif fx mask.

Optionally you can specify your own custom fx variable to be used in cases when e.g. a certain experiment is preferred for fx data retrieval:

preprocessors:
  preproc_mask:
    mask_landseaice:
      mask_out: ice
      fx_variables: [{'short_name': 'sftgif', 'exp': 'piControl'}]


For masking out glaciated areas a Natural Earth shapefile is used. To mask glaciated areas out, mask_glaciated can be used:

preprocessors:
  preproc_mask:
    mask_glaciated:
      mask_out: glaciated


and it requires only one argument: mask_out, whose only valid value is glaciated.

Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles and having to compute multimodel statistics; different numbers of missing data from dataset to dataset may introduce biases and artificially assign more weight to the datasets that have less missing data. This is handled in ESMValTool via the missing values masks: two types of such masks are available, one for the multimodel case and another for the single model case.

The multimodel missing values mask (mask_fillvalues) is a preprocessor step that usually comes after all the single-model steps (regridding, area selection etc) have been performed; in a nutshell, it combines missing values masks from individual models into a multimodel missing values mask; the individual model masks are built according to common criteria: the user chooses a time window in which missing data points are counted, and if the number of missing data points relative to the number of total data points in a window is less than a chosen fractional threshold, the window is discarded i.e. all the points in the window are masked (set to missing).

preprocessors:
  missing_values_preprocessor:
    mask_fillvalues:
      threshold_fraction: 0.95
      min_value: 19.0
      time_window: 10.0


In the example above, the fractional threshold for missing data vs. total data is set to 95% and the time window is set to 10.0 (in the units of the time coordinate). Optionally, a minimum value threshold can be applied, in this case set to 19.0 (in the units of the variable).

It is possible to use mask_fillvalues to create a combined multimodel mask (all the masks from all the analyzed models combined into a single mask); for that purpose setting the threshold_fraction to 0 will not discard any time windows, essentially keeping the original model masks and combining them into a single mask; here is an example:

preprocessors:
  missing_values_preprocessor:
    mask_fillvalues:
      threshold_fraction: 0.0     # keep all missing values
      min_value: -1e20            # small enough not to alter the data
      # time_window: 10.0         # this will not matter anymore


Thresholding on minimum and maximum accepted data values can also be performed: masks are constructed based on the results of thresholding; inside and outside interval thresholding and masking can also be performed. These functions are mask_above_threshold, mask_below_threshold, mask_inside_range, and mask_outside_range.

These functions always take a cube as first argument and either threshold for threshold masking or the pair minimum, maximum for interval masking.

See also esmvalcore.preprocessor.mask_above_threshold() and related functions.
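
As a minimal sketch (the preprocessor name and the threshold values are purely hypothetical), such masks could be requested in a recipe as follows:

preprocessors:
  preproc_threshold_mask:
    mask_above_threshold:
      threshold: 280.      # mask out all values above 280 (variable units)
    mask_outside_range:
      minimum: 230.        # additionally mask out values outside 230-320
      maximum: 320.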

Horizontal regridding¶

Regridding is necessary when various datasets are available on a variety of lat-lon grids and they need to be brought together on a common grid (for various statistical operations, e.g. multi-model statistics, or for direct inter-comparison or comparison with observational datasets). Regridding is conceptually a very similar process to interpolation (in fact, the regridder engine uses interpolation and extrapolation, with various schemes). The primary difference is that interpolation is based on sample data points, while regridding is based on the horizontal grid of another cube (the reference grid).

The underlying regridding mechanism in ESMValTool uses cube.regrid() from Iris.

The use of the horizontal regridding functionality is flexible depending on what type of reference grid and what interpolation scheme is preferred. Below we show a few examples.

Regridding on a reference dataset grid¶

The example below shows how to regrid on the reference dataset ERA-Interim (observational data, but just as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the scheme is linear.

preprocessors:
  regrid_preprocessor:
    regrid:
      target_grid: ERA-Interim
      scheme: linear


Regridding on an MxN grid specification¶

The example below shows how to regrid on a reference grid with a cell specification of 2.5x2.5 degrees. This is similar to regridding on reference datasets, but in the previous case the reference dataset grid cell specifications are not necessarily known a priori. Regridding on an MxN cell specification is oftentimes used when operating on localized data.

preprocessors:
  regrid_preprocessor:
    regrid:
      target_grid: 2.5x2.5
      scheme: nearest


In this case the NearestNeighbour interpolation scheme is used (see below for scheme definitions).

When using a MxN type of grid it is possible to offset the grid cell centrepoints using the lat_offset and lon_offset arguments:

• lat_offset: offsets the grid centers of the latitude coordinate w.r.t. the pole by half a grid step;

• lon_offset: offsets the grid centers of the longitude coordinate w.r.t. Greenwich meridian by half a grid step.

preprocessors:
  regrid_preprocessor:
    regrid:
      target_grid: 2.5x2.5
      lon_offset: True
      lat_offset: True
      scheme: nearest


Regridding (interpolation, extrapolation) schemes¶

The schemes used for the interpolation and extrapolation operations needed by the horizontal regridding functionality directly map to their corresponding implementations in Iris:

Note

For both vertical and horizontal regridding one can control the extrapolation mode when defining the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations where extrapolating values makes little physical sense (e.g. extrapolating beyond the last data point). The extrapolation mode is controlled by the extrapolation_mode keyword. For the interpolation schemes available in Iris, the extrapolation_mode keyword must be one of:

• extrapolate – the extrapolation points will be calculated by extending the gradient of the closest two points;

• error – a ValueError exception will be raised, notifying an attempt to extrapolate;

• nan – the extrapolation points will be set to NaN;

• mask – the extrapolation points will always be masked, even if the source data is not a MaskedArray; or

• nanmask – if the source data is a MaskedArray the extrapolation points will be masked, otherwise they will be set to NaN.

Note

The regridding mechanism is (at the moment) done with fully realized data in memory, so depending on how fine the target grid is, it may use a rather large amount of memory. Empirically, target grids of up to 0.5x0.5 degrees should not produce any memory-related issues, but be advised that for resolutions finer than 0.5 degrees the regridding becomes very slow and will use a lot of memory.

Multi-model statistics¶

Computing multi-model statistics is an integral part of model analysis and evaluation: individual models display a variety of biases depending on model set-up, initial conditions, forcings and implementation; when comparing model data to observational data, these biases have a significantly lower statistical impact when using a multi-model ensemble. ESMValTool has the capability of computing a number of multi-model statistical measures: using the preprocessor module multi_model_statistics enables the user to compute a multi-model mean, median, max, min, std, and/or pXX.YY, with a set of arguments passed to multi_model_statistics. Percentiles can be specified like p1.5 or p95. The decimal point will be replaced by a dash in the output file.

Note that the current multimodel statistics in ESMValTool are local (not global) and are computed along the time axis. As such, they can be computed across a common overlap in time (by specifying the span: overlap argument) or across the full length in time of each model (by specifying the span: full argument).

Restrictive computation is also available by excluding any set of models that the user will not want to include in the statistics (by setting exclude: [excluded models list] argument). The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, and from a statistical point of view, this is needed since weights are not yet implemented; also higher dimensional data is not supported (i.e. anything with dimensionality higher than four: time, vertical axis, two horizontal axes).

preprocessors:
  multimodel_preprocessor:
    multi_model_statistics:
      span: overlap
      statistics: [mean, median]
      exclude: [NCEP]


When calling the module inside diagnostic scripts, the input must be given as a list of cubes. The output will be saved in a dictionary where each entry contains the resulting cube with the requested statistic operations.

from esmvalcore.preprocessor import multi_model_statistics
statistics = multi_model_statistics([cube1,...,cubeN], 'overlap', ['mean', 'median'])
mean_cube = statistics['mean']
median_cube = statistics['median']


Note

Note that the multimodel array operations, albeit performed in per-time/per-horizontal-level loops to save memory, could nevertheless be rather memory-intensive (since they are not yet performed lazily). The section Information on maximum memory required details the memory intake for different run scenarios, but as a rule of thumb, for the multimodel preprocessor, the expected maximum memory intake can be approximated as the number of datasets multiplied by the average size in memory of one dataset.

Time manipulation¶

The _time.py module contains the following preprocessor functions:

Statistics functions are applied by default in the order they appear in the list. For example, when applied to hourly data, the example below retrieves, for each season, the minimum over the full period of the monthly mean of the daily maximum of a given variable.

daily_statistics:
  operator: max

monthly_statistics:
  operator: mean

climate_statistics:
  operator: min
  period: season


extract_time¶

This function subsets a dataset between two points in time. It removes all times in the dataset before the first time point and after the last time point. The required arguments are relatively self-explanatory:

• start_year

• start_month

• start_day

• end_year

• end_month

• end_day

These start and end points are set using the dataset's native calendar. All six arguments should be given as integers; named month strings will not be accepted.
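
For example, a hypothetical time selection covering the years 1980 through 2005 could look like this (the values are purely illustrative):

extract_time:
  start_year: 1980
  start_month: 1
  start_day: 1
  end_year: 2005
  end_month: 12
  end_day: 31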

extract_season¶

Extract only the times that occur within a specific season.

This function only has one argument: season. This is the named season to extract, i.e. DJF, MAM, JJA or SON.

Note that this function does not change the time resolution. If your original data is in monthly time resolution, then this function will return three monthly datapoints per year.

If you want the seasonal average, then this function needs to be combined with the seasonal_mean function, below.
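
A minimal sketch extracting the boreal summer months:

extract_season:
  season: JJA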

extract_month¶

The function extracts the times that occur within a specific month. This function only has one argument: month. This value should be an integer between 1 and 12 as the named month string will not be accepted.
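
For example, extracting all Julys from the data could look like:

extract_month:
  month: 7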

daily_statistics¶

This function produces statistics for each day in the dataset.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

monthly_statistics¶

This function produces statistics for each month in the dataset.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

seasonal_statistics¶

This function produces statistics for each season (DJF, MAM, JJA, SON) in the dataset. Note that this function will not check for missing time points. For instance, if you are looking at the DJF field but your dataset starts on January 1st, the first DJF field will only contain data from January and February.

We recommend using extract_time to start the dataset from the following December, thereby removing such biased initial data points.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

See also esmvalcore.preprocessor.seasonal_mean().
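
Following the recommendation above, a sketch (with purely hypothetical years) that starts the dataset in December before computing seasonal means could look like:

extract_time:
  start_year: 1979
  start_month: 12
  start_day: 1
  end_year: 2000
  end_month: 11
  end_day: 30

seasonal_statistics:
  operator: mean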

annual_statistics¶

This function produces statistics for each year.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

decadal_statistics¶

This function produces statistics for each decade.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

climate_statistics¶

This function produces statistics for the whole dataset. It can produce scalars (if the full period is chosen) or daily, monthly or seasonal statistics.

Parameters:
• operator: operation to apply. Accepted values are ‘mean’, ‘median’, ‘std_dev’, ‘min’, ‘max’ and ‘sum’. Default is ‘mean’

• period: define the granularity of the statistics: get values for the full period, for each month or day of year. Available periods: ‘full’, ‘season’, ‘seasonal’, ‘monthly’, ‘month’, ‘mon’, ‘daily’, ‘day’. Default is ‘full’

Examples:
• Monthly climatology:

climate_statistics:
  operator: mean
  period: month

• Daily maximum for the full period:

climate_statistics:
  operator: max
  period: day

• Minimum value in the period:

climate_statistics:
  operator: min
  period: full


anomalies¶

This function computes the anomalies for the whole dataset. It can compute anomalies from the full, seasonal, monthly and daily climatologies. Optionally standardized anomalies can be calculated.

Parameters:
• period: define the granularity of the climatology to use: full period, seasonal, monthly or daily. Available periods: ‘full’, ‘season’, ‘seasonal’, ‘monthly’, ‘month’, ‘mon’, ‘daily’, ‘day’. Default is ‘full’

• reference: Time slice to use as the reference to compute the climatology on. Can be ‘null’ to use the full cube or a dictionary with the parameters from extract_time. Default is null

• standardize: if true calculate standardized anomalies (default: false)

Examples:
• Anomalies from the full period climatology:

anomalies:

• Anomalies from the full period monthly climatology:

anomalies:
  period: month

• Standardized anomalies from the full period climatology:

anomalies:
  standardize: true

• Standardized anomalies from the 1979-2000 monthly climatology:

anomalies:
  period: month
  reference:
    start_year: 1979
    start_month: 1
    start_day: 1
    end_year: 2000
    end_month: 12
    end_day: 31
  standardize: true


regrid_time¶

This function aligns the time points of each component dataset so that the Iris cubes from different datasets can be subtracted. The operation makes the time points of the datasets common; it also resets the time bounds and auxiliary coordinates to reflect the artificially shifted time points. The current implementation covers monthly and daily data; the frequency is set automatically from the variable's CMOR table unless a custom frequency is set manually by the user in the recipe.
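
A minimal sketch, assuming the custom frequency is passed via a frequency argument that accepts values such as mon or day:

regrid_time:
  frequency: mon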

timeseries_filter¶

This function allows the user to apply a filter to the timeseries data. This filter may be of the user’s choice (currently only the low-pass Lanczos filter is implemented); the implementation is inspired by this iris example and uses aggregation via a rolling window.

Parameters:
• window: the length of the filter window (in units of cube time coordinate).

• span: period (number of months/days, depending on data frequency) on which weights should be computed e.g. for 2-yearly: span = 24 (2 x 12 months). Make sure span has the same units as the data cube time coordinate.

• filter_type: the type of filter to be applied; default ‘lowpass’. Available types: ‘lowpass’.

• filter_stats: the type of statistic to aggregate on the rolling window; default ‘sum’. Available operators: ‘mean’, ‘median’, ‘std_dev’, ‘sum’, ‘min’, ‘max’.

Examples:
• Lowpass filter with a monthly mean as operator:

timeseries_filter:
  window: 3             # 3-monthly filter window
  span: 12              # weights computed on the first year
  filter_type: lowpass  # low-pass filter
  filter_stats: mean    # 3-monthly mean lowpass filter


Area manipulation¶

The area manipulation module contains the following preprocessor functions:

extract_region¶

This function masks data outside the requested rectangular region. The boundaries of the region are provided as latitude and longitude coordinates in the arguments:

• start_longitude

• end_longitude

• start_latitude

• end_latitude

Note that this function can only be used to extract a rectangular region. Use extract_shape to extract any other shaped region from a shapefile.
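
For example, a rectangular region roughly covering the Niño 3.4 area (values purely illustrative) could be requested as:

extract_region:
  start_longitude: 190.
  end_longitude: 240.
  start_latitude: -5.
  end_latitude: 5.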

extract_named_regions¶

This function extracts a specific named region from the data. This function takes the following argument: regions which is either a string or a list of strings of named regions. Note that the dataset must have a region coordinate which includes a list of strings as values. This function then matches the named regions against the requested string.
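
A minimal sketch, assuming the input data carries a region coordinate containing labels such as the (hypothetical) ones below:

extract_named_regions:
  regions: ['atlantic_arctic_ocean', 'indian_pacific_ocean']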

extract_shape¶

Extract a shape or a representative point for this shape from the data.

Parameters:
• shapefile: path to the shapefile containing the geometry of the region to be extracted. If the file contains multiple shapes behaviour depends on the decomposed parameter. This path can be relative to auxiliary_data_dir defined in the User configuration file.

• method: the method to select the region: either all points contained by the shape or a single representative point. Choose either ‘contains’ or ‘representative’. If not a single grid point is contained in the shape, a representative point will be selected.

• crop: by default, extract_region will be used to crop the data to a minimal rectangular region containing the shape. Set to false to only mask data outside the shape. Data on irregular grids will not be cropped.

• decomposed: by default false; in this case the union of all the regions in the shapefile is masked out. If true, the regions in the shapefile are masked out separately, generating an auxiliary dimension in the cube.

Examples:
• Extract the shape of the river Elbe from a shapefile:

extract_shape:
  shapefile: Elbe.shp
  method: contains


extract_point¶

Extract a single point from the data. This is done using either nearest or linear interpolation.

Returns a cube with the extracted point(s), and with adjusted latitude and longitude coordinates (see below).

Multiple points can also be extracted, by supplying an array of latitude and/or longitude coordinates. The resulting point cube will match the respective latitude and longitude coordinate to those of the input coordinates. If the input coordinate is a scalar, the dimension will be missing in the output cube (that is, it will be a scalar).

Parameters:
• cube: the input dataset cube.

• latitude, longitude: coordinates (as floating point values) of the point to be extracted. Either (or both) can also be an array of floating point values.

• scheme: interpolation scheme: either 'linear' or 'nearest'. There is no default.
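
For example, a hypothetical recipe entry extracting the grid point nearest to 52.4°N, 13.1°E could read:

extract_point:
  latitude: 52.4
  longitude: 13.1
  scheme: nearest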

zonal_statistics¶

The function calculates the zonal statistics by applying an operator along the longitude coordinate. This function takes one argument:

• operator: Which operation to apply: mean, std_dev, median, min, max or sum

See also esmvalcore.preprocessor.zonal_means().

meridional_statistics¶

The function calculates the meridional statistics by applying an operator along the latitude coordinate. This function takes one argument:

• operator: Which operation to apply: mean, std_dev, median, min, max or sum

See also esmvalcore.preprocessor.meridional_means().

area_statistics¶

This function calculates the average value over a region, weighted by the cell areas of the region. This function takes the argument operator: the name of the operation to apply.

This function can be used to apply several different operations in the horizontal plane: mean, standard deviation, median, variance, minimum and maximum.

Note that this function is applied over the entire dataset. If only a specific region, depth layer or time period is required, then those regions need to be removed using other preprocessor operations in advance.
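
For example, an area-weighted mean could be requested as:

area_statistics:
  operator: mean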

The fx_variables argument specifies the fx variables that the user wishes to input to the function; the user may specify it as a list of variables e.g.

fx_variables: ['areacello', 'volcello']


or as a list of dictionaries, with specific variable parameters (the key-value pairs may be as specific as a CMOR variable can permit):

fx_variables: [{'short_name': 'areacello', 'mip': 'Omon'}, {'short_name': 'volcello', 'mip': 'fx'}]


The recipe parser will automatically find the data files that are associated with these variables and pass them to the function for loading and processing.

Volume manipulation¶

The _volume.py module contains the following preprocessor functions:

• extract_volume: Extract a specific depth range from a cube.

• volume_statistics: Calculate the volume-weighted average.

• depth_integration: Integrate over the depth dimension.

• extract_transect: Extract data along a line of constant latitude or longitude.

• extract_trajectory: Extract data along a specified trajectory.

extract_volume¶

Extract a specific range in the z-direction from a cube. This function takes two arguments, a minimum and a maximum (z_min and z_max, respectively) in the z-direction.

Note that this requires the requested z-coordinate range to have the same sign as the cube's z-coordinate. That is, if the cube's z-coordinate is negative, then z_min and z_max need to be negative numbers.
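
A sketch extracting the top 100 m of the water column (hypothetical values, assuming a positive-downwards depth coordinate):

extract_volume:
  z_min: 0.
  z_max: 100.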

volume_statistics¶

This function calculates the volume-weighted average across three dimensions, but maintains the time dimension.

This function takes the argument: operator, which defines the operation to apply over the volume.

No depth coordinate is required as this is determined by Iris. This function works best when the fx_variables provide the cell volume.
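
A minimal sketch computing the volume-weighted mean:

volume_statistics:
  operator: mean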

The fx_variables argument specifies the fx variables that the user wishes to input to the function; the user may specify it as a list of variables e.g.

fx_variables: ['areacello', 'volcello']


or as a list of dictionaries, with specific variable parameters (the key-value pairs may be as specific as a CMOR variable can permit):

fx_variables: [{'short_name': 'areacello', 'mip': 'Omon'}, {'short_name': 'volcello', 'mip': 'fx'}]


The recipe parser will automatically find the data files that are associated with these variables and pass them to the function for loading and processing.

depth_integration¶

This function integrates over the depth dimension. This function does a weighted sum along the z-coordinate, and removes the z direction of the output cube. This preprocessor takes no arguments.

extract_transect¶

This function extracts data along a line of constant latitude or longitude. This function takes two arguments, although only one is strictly required. The two arguments are latitude and longitude. One of these arguments needs to be set to a float, and the other can then be either ignored or set to a minimum or maximum value.

For example, if we set latitude to 0 N and leave longitude blank, it would produce a cube along the Equator. On the other hand, if we set latitude to 0 and then set longitude to [40., 100.] this will produce a transect of the Equator in the Indian Ocean.
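
The Indian Ocean example from the paragraph above would read:

extract_transect:
  latitude: 0.
  longitude: [40., 100.]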

extract_trajectory¶

This function extracts data along a specified trajectory. The three arguments are latitudes, longitudes and the number of points needed for extrapolation, number_points.

If two points are provided, the number_points argument is used to set the number of places to extract between the two end points.

If more than two points are provided, then extract_trajectory will produce a cube which has extrapolated the data of the cube to those points, and number_points is not needed.

Note that this function uses the expensive interpolate method from Iris.analysis.trajectory, but it may be necessary for irregular grids.
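
A sketch with two hypothetical end points and four points extracted between them:

extract_trajectory:
  latitudes: [0., 30.]
  longitudes: [30., 60.]
  number_points: 4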

Cycles¶

The _cycles.py module contains the following preprocessor functions:

• amplitude: Extract the peak-to-peak amplitude of a cycle aggregated over specified coordinates.

amplitude¶

This function extracts the peak-to-peak amplitude (maximum value minus minimum value) of a field aggregated over specified coordinates. Its only argument is coords, which can either be a single coordinate (given as str) or multiple coordinates (given as list of str). Usually, these coordinates refer to temporal categorised coordinates like year, month, day of year, etc. For example, to extract the amplitude of the annual cycle for every single year in the data, use coords: year; to extract the amplitude of the diurnal cycle for every single day in the data, use coords: [year, day_of_year].
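
For example, the amplitude of the annual cycle for every single year in the data could be requested as:

amplitude:
  coords: year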

Detrend¶

ESMValTool also supports detrending along any dimension using the preprocessor function ‘detrend’. This function has two parameters:

• dimension: dimension to apply detrend on. Default: “time”

• method: It can be linear or constant. Default: linear

If method is linear, detrend will calculate the linear trend along the selected axis and subtract it from the data. For example, this can be used to remove the linear trend caused by climate change on some variables if the selected dimension is time.

If method is constant, detrend will compute the mean along that dimension and subtract it from the data.
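
A minimal sketch removing the linear trend along the time dimension:

detrend:
  dimension: time
  method: linear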

Unit conversion¶

Converting units is also supported. This is particularly useful in cases where different datasets might have different units, for example when comparing CMIP5 and CMIP6 variables where the units have changed or in case of observational datasets that are delivered in different units.

In these cases, having a unit conversion at the end of the processing will guarantee homogeneous input for the diagnostics.
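
As a minimal sketch (assuming the preprocessor function is named convert_units and takes a units argument), converting temperature to degrees Celsius could look like:

convert_units:
  units: degC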

Note

Conversion is only supported between compatible units! In other words, converting temperature units from degC to Kelvin works fine, while changing precipitation units from a rate-based unit to an amount-based unit is not supported at the moment.

Information on maximum memory required¶

In the most general case, we can set upper limits on the maximum memory the analysis will require:

Ms = (R + N) x F_eff - F_eff (when no multimodel analysis is performed);

Mm = (2R + N) x F_eff - 2F_eff (when multimodel analysis is performed);

where

• Ms: maximum memory for non-multimodel module

• Mm: maximum memory for multimodel module

• R: computational efficiency of module; R is typically 2-3

• N: number of datasets

• F_eff: average size of data per dataset, where F_eff = e x f x F; here e is the factor that describes how lazy the data is (e = 1 for fully realized data) and f describes how much the data was shrunk by the immediately preceding module, e.g. time extraction, area selection or level extraction. Note that for fix_data, f relates only to the time extraction: if the data is exact in time (no time selection), f = 1 for fix_data.

For cases in which we deal with a lot of datasets, R + N ≈ N. Assuming fully realized data and an average size of 1.5 GB for 10 years of 3D netCDF data, N datasets will then require approximately:

Ms = 1.5 x (N - 1) GB

Mm = 1.5 x (N - 2) GB

For example, by this estimate a multimodel analysis of 12 such datasets would require roughly Mm = 1.5 x 10 = 15 GB of memory.

As a rule of thumb, the maximum memory required at a given time for multimodel analysis can be estimated by multiplying the number of datasets by the average file size of all the datasets; this memory intake is high but also assumes that all data is fully realized in memory; this aspect will gradually change and the amount of realized data will decrease with the increasing use of Dask.

Other¶

Miscellaneous functions that do not belong to any of the other categories.

Clip¶

This function clips data values to a certain minimum, maximum or range. The function takes two arguments:

• minimum: Lower bound of range. Default: None

• maximum: Upper bound of range. Default: None

The example below shows how to set all values below zero to zero.

preprocessors:
  clip_preprocessor:
    clip:
      minimum: 0
      maximum: null