.. _preprocessor_function:

Preprocessor function
*********************

Preprocessor functions are located in :py:mod:`esmvalcore.preprocessor`.
To add a new preprocessor function, start by finding a likely looking file to
add your function to in ``esmvalcore/preprocessor``.
Create a new file in that directory if you cannot find a suitable place.

The function should look like this:

.. code-block:: python

    import iris.analysis


    def example_preprocessor_function(
        cube,
        example_argument,
        example_optional_argument=5,
    ):
        """Compute an example quantity.

        A more extensive explanation of the computation can be added here. Add
        references to scientific literature if available.

        Parameters
        ----------
        cube: iris.cube.Cube
           Input cube.

        example_argument: str
           Example argument, the value of this argument can be provided in the
           recipe. Describe what valid values are here. In this case, a valid
           argument is the name of a dimension of the input cube.

        example_optional_argument: int, optional
           Another example argument, the value of this argument can optionally
           be provided in the recipe. Describe what valid values are here.

        Returns
        -------
        iris.cube.Cube
          The result of the example computation.
        """

        # Replace this with your own computation
        cube = cube.collapsed(example_argument, iris.analysis.MEAN)

        return cube

The above function needs to be imported in the file
``esmvalcore/preprocessor/__init__.py``:

.. code-block:: python

    from ._example_module import example_preprocessor_function

    __all__ = [
        ...
        'example_preprocessor_function',
        ...
    ]

The location in the ``__all__`` list above determines the default order in
which preprocessor functions are applied, so carefully consider where you put
it and ask for advice if needed.

The preprocessor function above can then be used from the :ref:`preprocessors`
like this:

.. code-block:: yaml

    preprocessors:
      example_preprocessor:
        example_preprocessor_function:
          example_argument: time
          example_optional_argument: 6

The optional argument (in this example: ``example_optional_argument``) can be
omitted in the recipe.

Lazy and real data
==================

Preprocessor functions should support both real and lazy data.
This is vital for supporting the large datasets that are typically used with
the ESMValCore.
If the data of the incoming cube has been realized (i.e.
``cube.has_lazy_data()`` returns ``False`` so ``cube.core_data()`` is a
NumPy array), the returned cube should also have realized data.
Conversely, if the incoming cube has lazy data (i.e. ``cube.has_lazy_data()``
returns ``True`` so ``cube.core_data()`` is a Dask array), the returned cube
should also have lazy data.
Note that NumPy functions will often call their Dask equivalent if it exists
and if their input array is a Dask array, and vice versa.

Note that preprocessor functions should preferably be small and just call the
relevant iris code.
Code that is more involved (e.g. lots of work with NumPy and Dask arrays) and
more broadly applicable should be implemented in iris instead.

Metadata
========

Preprocessor functions may change the metadata of datasets.
An obvious example is :func:`~esmvalcore.preprocessor.convert_units`, which
changes units.
If cube metadata is changed in a preprocessor function, the ``metadata.yml``
file is automatically updated with this information.
The following attributes are taken into account:

+------------------------------------+--------------------------------------------+
| Attribute in ``metadata.yml`` file | Updated from                               |
+====================================+============================================+
| ``standard_name``                  | :attr:`iris.cube.Cube.standard_name`       |
+------------------------------------+--------------------------------------------+
| ``long_name``                      | :attr:`iris.cube.Cube.long_name`           |
+------------------------------------+--------------------------------------------+
| ``short_name``                     | :attr:`iris.cube.Cube.var_name`            |
+------------------------------------+--------------------------------------------+
| ``units``                          | :attr:`iris.cube.Cube.units`               |
+------------------------------------+--------------------------------------------+
| ``frequency``                      | ``iris.cube.Cube.attributes['frequency']`` |
+------------------------------------+--------------------------------------------+

If a given cube property is ``None``, the corresponding attribute is updated
with an empty string (``''``).
If a cube property is not given, the corresponding attribute is not updated.

Documentation
=============

The documentation in the function docstring will be shown in the
:ref:`preprocessor_functions` chapter.
In addition, you should add documentation on how to use the new preprocessor
function from the recipe in ``doc/recipe/preprocessor.rst`` so it is shown in
the :ref:`preprocessor` chapter.
See the introduction to :ref:`documentation` for more information on how to
best write documentation.

Tests
=====

Tests should be implemented for new or modified preprocessor functions.
For an introduction to the topic, see :ref:`tests`.

Unit tests
----------

To add a unit test for the preprocessor function from the example above, create
a file called
``tests/unit/preprocessor/_example_module/test_example_preprocessor_function.py``
and add the following content:

.. code-block:: python

    """Test function `esmvalcore.preprocessor.example_preprocessor_function`."""
    import cf_units
    import dask.array as da
    import iris
    import numpy as np
    import pytest

    from esmvalcore.preprocessor import example_preprocessor_function


    @pytest.mark.parametrize('lazy', [True, False])
    def test_example_preprocessor_function(lazy):
        """Test that the computed result is as expected."""

        # Construct the input cube
        data = np.array([1, 2], dtype=np.float32)
        if lazy:
            data = da.asarray(data, chunks=(1, ))
        cube = iris.cube.Cube(
            data,
            var_name='tas',
            units='K',
        )
        cube.add_dim_coord(
            iris.coords.DimCoord(
                np.array([0.5, 1.5], dtype=np.float64),
                bounds=np.array([[0, 1], [1, 2]], dtype=np.float64),
                standard_name='time',
                units=cf_units.Unit('days since 1950-01-01 00:00:00',
                                    calendar='gregorian'),
            ),
            0,
        )

        # Compute the result
        result = example_preprocessor_function(cube, example_argument='time')

        # Check that lazy data is returned if and only if the input is lazy
        assert result.has_lazy_data() is lazy

        # Construct the expected result cube
        expected = iris.cube.Cube(
            np.array(1.5, dtype=np.float32),
            var_name='tas',
            units='K',
        )
        expected.add_aux_coord(
            iris.coords.AuxCoord(
                np.array([1], dtype=np.float64),
                bounds=np.array([[0, 2]], dtype=np.float64),
                standard_name='time',
                units=cf_units.Unit('days since 1950-01-01 00:00:00',
                                    calendar='gregorian'),
            ))
        expected.add_cell_method(
            iris.coords.CellMethod(method='mean', coords=('time', )))

        # Compare the result of the computation with the expected result
        print('result:', result)
        print('expected result:', expected)
        assert result == expected

In this test we used the decorator ``pytest.mark.parametrize`` to test two
scenarios, with both lazy and realized data, with a single test.

Sample data tests
-----------------

The idea of adding sample data tests is to check that preprocessor functions
work with realistic data.
This also provides an easy way to add regression tests, though these should
preferably be implemented as unit tests instead, because using the sample data
for this purpose is slow.
To add a test using the sample data, create a file
``tests/sample_data/preprocessor/example_preprocessor_function/test_example_preprocessor_function.py``
and add the following content:

.. code-block:: python

    """Test function `esmvalcore.preprocessor.example_preprocessor_function`."""
    from pathlib import Path

    import esmvaltool_sample_data
    import iris
    import pytest

    from esmvalcore.preprocessor import example_preprocessor_function


    @pytest.mark.use_sample_data
    def test_example_preprocessor_function():
        """Regression test to check that the computed result is as expected."""
        # Load an example input cube
        cube = esmvaltool_sample_data.load_timeseries_cubes(mip_table='Amon')[0]

        # Compute the result
        result = example_preprocessor_function(cube, example_argument='time')

        filename = Path(__file__).with_name('example_preprocessor_function.nc')
        if not filename.exists():
            # Create the file with the expected result if it doesn't exist
            iris.save(result, target=str(filename))
            raise FileNotFoundError(
                f'Reference data was missing, wrote new copy to {filename}')

        # Load the expected result cube
        expected = iris.load_cube(str(filename))

        # Compare the result of the computation with the expected result
        print('result:', result)
        print('expected result:', expected)
        assert result == expected

This will use a file from the sample data repository as input.
The first time you run the test, the computed result will be stored in the file
``tests/sample_data/preprocessor/example_preprocessor_function/example_preprocessor_function.nc``.
Any subsequent runs will re-load the data from file and check that it did not
change.
Make sure the stored results are small, i.e. smaller than 100 kilobytes, to
keep the size of the ESMValCore repository small.

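
One way to guard against oversized reference files is a simple size check
before committing them.
The sketch below is not part of ESMValCore: the helper name
``is_small_enough`` and the reading of "100 kilobytes" as 100 000 bytes are
assumptions made purely for illustration.

.. code-block:: python

    from pathlib import Path

    # Assumption: "100 kilobytes" is interpreted as 100 * 1000 bytes.
    MAX_REFERENCE_SIZE = 100 * 1000


    def is_small_enough(filename, max_size=MAX_REFERENCE_SIZE):
        """Return True if the file is smaller than ``max_size`` bytes."""
        return Path(filename).stat().st_size < max_size

A check like this could be run by hand on the generated ``.nc`` file before
adding it to the repository.
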

Using multiple datasets as input
================================

The name of the first argument of the preprocessor function should in almost
all cases be ``cube``.
Only a preprocessor function that uses all datasets as input should name its
first argument ``products``.
If you would like to implement this type of preprocessor function, start by
having a look at the existing functions, e.g.
:py:func:`esmvalcore.preprocessor.multi_model_statistics` or
:py:func:`esmvalcore.preprocessor.mask_fillvalues`.
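
To illustrate why the name of the first argument matters, the sketch below
shows how the two kinds of function can be told apart by inspecting their
signatures.
It uses only the Python standard library; the helper
``first_argument_name`` and both example functions are hypothetical and are
not claimed to mirror how ESMValCore does this internally.

.. code-block:: python

    import inspect


    def first_argument_name(function):
        """Return the name of the first parameter of ``function``."""
        parameters = list(inspect.signature(function).parameters)
        if not parameters:
            raise TypeError(f"{function.__name__} takes no arguments")
        return parameters[0]


    def single_dataset_function(cube, example_argument):
        """Hypothetical function that processes one cube at a time."""
        return cube


    def multi_dataset_function(products, example_argument):
        """Hypothetical function that needs all datasets at once."""
        return products


    # Map each function name to the kind of input its signature announces
    input_types = {
        function.__name__: first_argument_name(function)
        for function in (single_dataset_function, multi_dataset_function)
    }

Here ``input_types`` maps ``single_dataset_function`` to ``'cube'`` and
``multi_dataset_function`` to ``'products'``, so following the naming
convention makes the intended input of a function explicit.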