Configuration#
Overview#
Similar to Dask, ESMValCore provides a single configuration object that consists of one nested dictionary.
Note
In v2.12.0, a redesign process of ESMValTool/Core’s configuration started. Its main aim is to simplify the configuration by moving from many different configuration files for individual components to one configuration object that consists of a single nested dictionary (similar to Dask’s configuration). This change will not be implemented in one large pull request but rather in a step-by-step procedure. Thus, the configuration might appear inconsistent until this redesign is finished. A detailed plan for this new configuration is outlined in Issue #2371.
Specify configuration for esmvaltool command line tool#
When running recipes via the command line, configuration options can be specified via YAML files and command line arguments.
YAML files#
Configuration options can be specified via YAML files (i.e., *.yaml and *.yml).
A file could look like this (for example, located at ~/.config/esmvaltool/config.yml):
output_dir: ~/esmvaltool_output
search_esgf: when_missing
download_dir: ~/downloaded_data
These files can live in any of the following locations:

- The directory specified via the --config_dir command line argument.
- The user configuration directory: by default ~/.config/esmvaltool, but this can be changed with the ESMVALTOOL_CONFIG_DIR environment variable. If ~/.config/esmvaltool does not exist, it will be silently ignored.
ESMValCore searches for all YAML files within each of these directories and merges them together using dask.config.collect().
This properly considers nested objects; see dask.config.update() for details.
Preference follows the order in the list above (i.e., the directory specified via the command line argument is preferred over the user configuration directory).
Within a directory, files are sorted alphabetically, and later files (e.g., z.yml) take precedence over earlier files (e.g., a.yml).
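The nested-merge behavior can be illustrated with a small pure-Python sketch (the actual merging is done by dask.config.collect(); the file contents below are hypothetical examples):

```python
from copy import deepcopy

def merge(base, new):
    """Recursively merge ``new`` into ``base``: later values win on
    conflicts, but nested dictionaries are combined key by key."""
    result = deepcopy(base)
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Two hypothetical configuration files from the same directory:
a_yml = {"output_dir": "~/esmvaltool_output", "logging": {"log_level": "info"}}
z_yml = {"logging": {"log_progress_interval": "10s"}}

# z.yml sorts after a.yml, so its values take precedence, while the
# nested "logging" sections are merged rather than replaced.
config = merge(a_yml, z_yml)
print(config["logging"])  # {'log_level': 'info', 'log_progress_interval': '10s'}
```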
Warning
ESMValCore will read all YAML files in these configuration directories.
Thus, other YAML files in these directories that are not valid configuration files (like the old config-developer.yml files) will lead to errors.
Make sure to move such files to a different directory.
To get a copy of the default configuration file, you can run
esmvaltool config get_config_user --path=/target/file.yml
If the option --path is omitted, the file will be copied to ~/.config/esmvaltool/config-user.yml.
Command line arguments#
All configuration options can also be given as command line arguments to the esmvaltool executable.
Example:
esmvaltool run --search_esgf=when_missing --max_parallel_tasks=2 /path/to/recipe.yml
Options given via command line arguments will always take precedence over options specified via YAML files.
Specify/access configuration for Python API#
When running recipes with the experimental Python API, configuration options can be specified and accessed via the CFG object.
For example:
>>> from esmvalcore.config import CFG
>>> CFG['output_dir'] = '~/esmvaltool_output'
>>> CFG['output_dir']
PosixPath('/home/user/esmvaltool_output')
This will also consider YAML configuration files in the user configuration directory (by default ~/.config/esmvaltool, but this can be changed with the ESMVALTOOL_CONFIG_DIR environment variable).
More information about this can be found here.
Top level configuration options#
Note: the following entries use Python syntax.
For example, Python's None is YAML's null, Python's True is YAML's true, and Python's False is YAML's false.
| Option | Description | Type | Default value |
|---|---|---|---|
| `auxiliary_data_dir` | Directory where auxiliary data is stored. [1] | str | `~/auxiliary_data` |
| `check_level` | Sensitivity of the CMOR check (`debug`, `strict`, `default`, `relaxed`, `ignore`). | str | `default` |
| `compress_netcdf` | Use netCDF compression. | bool | `False` |
| `config_developer_file` | Path to custom Developer configuration file. | str | `None` |
| `dask` | Dask configuration, see Dask configuration. | dict | See Default options |
| `diagnostics` | Only run the selected diagnostics from the recipe, see Running. | list | `None` |
| `download_dir` | Directory where downloaded data will be stored. [4] | str | `~/climate_data` |
| `drs` | Directory structure for input data. [2] | dict | `{CMIP3: ESGF, CMIP5: ESGF, CMIP6: ESGF, CORDEX: ESGF, obs4MIPs: ESGF}` |
| `exit_on_warning` | Exit on warning (only used in NCL diagnostic scripts). | bool | `False` |
| `extra_facets_dir` | Additional custom directory for Extra Facets. | str or list of str | `[]` |
| `log_level` | Log level of the console (`debug`, `info`, `warning`, `error`). | str | `info` |
| `logging` | Logging configuration, see Logging configuration. | dict | See Logging configuration |
| `max_datasets` | Maximum number of datasets to use, see Running. | int | `None` |
| `max_parallel_tasks` | Maximum number of parallel processes, see Task priority. [5] | int | `None` |
| `max_years` | Maximum number of years to use, see Running. | int | `None` |
| `output_dir` | Directory where all output will be written, see Output. | str | `~/esmvaltool_output` |
| `output_file_type` | Plot file type. | str | `png` |
| `profile_diagnostic` | Use a profiling tool for the diagnostic run. [3] | bool | `False` |
| `remove_preproc_dir` | Remove the `preproc` directory if the run was successful. | bool | `True` |
| `resume_from` | Resume previous run(s) by using preprocessor output files from these output directories, see Running. | list | `[]` |
| `rootpath` | Rootpaths to the data from different projects. [2] | dict | `{default: ~/climate_data}` |
| `run_diagnostic` | Run diagnostic scripts, see Running. | bool | `True` |
| `save_intermediary_cubes` | Save intermediary cubes from the preprocessor, see also Preprocessed datasets. | bool | `False` |
| `search_esgf` | Automatic data download from ESGF (`never`, `when_missing`, `always`). | str | `never` |
| `skip_nonexistent` | Skip non-existent datasets, see Running. | bool | `False` |
Dask configuration#
Configure Dask in the dask section.
The preprocessor functions and many of the Python diagnostics in ESMValTool make use of the Iris library to work with the data.
In Iris, data can be either real or lazy.
Lazy data is represented by Dask arrays.
Dask arrays consist of many small numpy arrays (called chunks) and, if possible, computations are run on those small arrays in parallel.
In order to figure out what needs to be computed when, Dask makes use of a 'scheduler'.
The default (thread-based) scheduler in Dask is rather basic: it can only run on a single computer and it may not always find the optimal task scheduling solution, resulting in excessive memory use when using, e.g., the esmvalcore.preprocessor.multi_model_statistics() preprocessor function.
Therefore it is recommended that you take a moment to configure the Dask distributed scheduler.
A Dask scheduler and the 'workers' running the actual computations are collectively called a 'Dask cluster'.
Dask profiles#
Because some recipes require more computational resources than others, ESMValCore provides the option to define “Dask profiles”. These profiles can be used to update the Dask user configuration per recipe run. The Dask profile can be selected in a YAML configuration file via
dask:
use: <NAME_OF_PROFILE>
or alternatively in the command line via
esmvaltool run --dask='{"use": "<NAME_OF_PROFILE>"}' recipe_example.yml
Available predefined Dask profiles:

- local_threaded (selected by default): use the threaded scheduler without any further options.
- local_distributed: use a local distributed scheduler without any further options.
- debug: use the synchronous Dask scheduler for debugging purposes. Best used with max_parallel_tasks: 1.
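For instance, a minimal configuration selecting the predefined debug profile could look like this (combining it with max_parallel_tasks: 1 follows the advice above):

```yaml
dask:
  use: debug  # synchronous scheduler: everything runs in the main process
max_parallel_tasks: 1
```

With this setup, breakpoints and tracebacks in preprocessor code stay readable because no work is dispatched to threads or worker processes.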
Dask distributed scheduler configuration#
Here, some examples are provided of how to use a custom Dask distributed scheduler. Extensive documentation on setting up Dask clusters is available here.
Note
If not all preprocessor functions support lazy data, computational performance may be best with the threaded scheduler. See Issue #674 for progress on making all preprocessor functions lazy.
Personal computer
Create a distributed.LocalCluster on the computer running ESMValCore using all available resources:
dask:
use: local_cluster # use "local_cluster" defined below
profiles:
local_cluster:
cluster:
type: distributed.LocalCluster
This should work well for most personal computers.
Note
If running this configuration on a shared node of an HPC cluster, Dask will try to use as many resources as it can find available, and this may lead to overcrowding the node by a single user (you)!
Shared computer
Create a distributed.LocalCluster on the computer running ESMValCore, with 2 workers with 2 threads / 4 GiB of memory each (8 GiB in total):
dask:
use: local_cluster # use "local_cluster" defined below
profiles:
local_cluster:
cluster:
type: distributed.LocalCluster
n_workers: 2
threads_per_worker: 2
memory_limit: 4GiB
This should work well for shared computers.
Computer cluster
Create a Dask distributed cluster on the Levante supercomputer using the Dask-Jobqueue package:
dask:
use: slurm_cluster # use "slurm_cluster" defined below
profiles:
slurm_cluster:
cluster:
type: dask_jobqueue.SLURMCluster
queue: shared
account: <YOUR_SLURM_ACCOUNT>
cores: 8
memory: 7680MiB
processes: 2
interface: ib0
local_directory: "/scratch/b/<YOUR_DKRZ_ACCOUNT>/dask-tmp"
n_workers: 24
This will start 24 workers with cores / processes = 4 threads each, resulting in n_workers / processes = 12 Slurm jobs, where each Slurm job will request 8 CPU cores and 7680 MiB of memory and start processes = 2 workers.
This example will use the fast Infiniband network connection (called ib0 on Levante) for communication between workers running on different nodes.
It is important to set the right location for temporary storage; in this case the /scratch space is used.
It is also possible to use environment variables to configure the temporary storage location, if your cluster provides these.
A configuration like this should work well for larger computations where it is advantageous to use multiple nodes in a compute cluster. See Deploying Dask Clusters on High Performance Computers for more information.
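The arithmetic behind the example above can be made explicit (the values are copied from the configuration; the variable names below are local to this sketch, not ESMValCore options):

```python
# Values from the SLURMCluster example above.
cores = 8       # CPU cores requested per Slurm job
processes = 2   # Dask workers started per Slurm job
n_workers = 24  # total number of Dask workers

threads_per_worker = cores // processes  # threads available to each worker
slurm_jobs = n_workers // processes      # Slurm jobs needed for all workers
total_cores = slurm_jobs * cores         # CPU cores requested in total

print(threads_per_worker, slurm_jobs, total_cores)  # 4 12 96
```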
Externally managed Dask cluster
To use an externally managed cluster, specify a scheduler_address for the selected profile.
Such a cluster can, e.g., be started using the Dask JupyterLab extension:
dask:
use: external # Use the `external` profile defined below
profiles:
external:
scheduler_address: "tcp://127.0.0.1:43605"
See here for an example of how to configure this on a remote system.
For debugging purposes, it can be useful to start the cluster outside of ESMValCore, because then the Dask dashboard remains available after ESMValCore has finished running.
Advice on choosing performant configurations
The threads within a single worker can access the same memory locations, so they may freely pass chunks around, whereas communicating a chunk between workers requires copying it, which is (a bit) slower. It is therefore beneficial for performance to have multiple threads per worker. However, due to a limitation of the CPython implementation (known as the Global Interpreter Lock or GIL), only a single thread in a worker can execute Python code at a time (this limitation does not apply to compiled code called by Python code, e.g. numpy), so the best performing configurations will typically not use much more than 10 threads per worker.
Due to limitations of the NetCDF library (it is not thread-safe), only one of the threads in a worker can read or write to a NetCDF file at a time. Therefore, it may be beneficial to use fewer threads per worker if the computation is very simple and the runtime is determined by the speed with which the data can be read from and/or written to disk.
Custom Dask threaded scheduler configuration#
The Dask threaded scheduler can be a good choice for recipes using a small amount of data or when running a recipe where not all preprocessor functions are lazy yet (see Issue #674 for the current status).
To avoid running out of memory, it is important to set the number of workers (threads) used by Dask to run its computations to a reasonable number. By default, the number of CPU cores in the machine will be used, but this may be too many on shared machines or laptops with a large number of CPU cores compared to the amount of memory they have available.
Typically, Dask requires about 2 GiB of RAM per worker, but this may be more depending on the computation.
To set the number of workers used by the Dask threaded scheduler, use the following configuration:
dask:
use: local_threaded # This can be omitted
profiles:
local_threaded:
num_workers: 4
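A reasonable number of workers can be estimated from the available memory, following the ~2 GiB per worker rule of thumb above (this helper is purely illustrative, not part of ESMValCore):

```python
import os

def suggest_num_workers(available_gib, per_worker_gib=2.0):
    """Suggest a worker (thread) count for the Dask threaded scheduler:
    the CPU count, capped by the ~2 GiB of RAM each worker may need."""
    by_memory = max(1, int(available_gib // per_worker_gib))
    return max(1, min(os.cpu_count() or 1, by_memory))

# A laptop with many CPU cores but only 8 GiB of RAM: memory is the limit,
# so at most 4 workers are suggested regardless of the core count.
print(suggest_num_workers(available_gib=8))
```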
Default options#
By default, the following Dask configuration is used:
dask:
use: local_threaded # use the `local_threaded` profile defined below
profiles:
local_threaded:
scheduler: threads
local_distributed:
cluster:
type: distributed.LocalCluster
debug:
scheduler: synchronous
All available options#
Option |
Description |
Type |
Default value |
---|---|---|---|
|
Different Dask profiles that can be
selected via the |
See Default options |
|
|
Dask profile that is used; must be
defined in the option |
|
Options for Dask profiles#
Option |
Description |
Type |
Default value |
---|---|---|---|
|
Keyword arguments to initialize a Dask
distributed cluster. Needs the option
|
If omitted, use externally managed
cluster if |
|
|
Scheduler address of an externally
managed cluster. Will be passed to
|
If omitted, use a Dask distributed
cluster if |
|
All other options |
Passed as keyword arguments to
|
Any |
No defaults. |
Logging configuration#
Configure what information is logged and how it is presented in the logging section.
Note
Not all logging configuration is available here yet, see Issue #2596.
Configuration file example:
logging:
log_progress_interval: 10s
will log progress of Dask computations every 10 seconds instead of showing a progress bar.
Command line example:
esmvaltool run --logging='{"log_progress_interval": "1m"}' recipe_example.yml
will log progress of Dask computations every minute instead of showing a progress bar.
Available options:

| Option | Description | Type | Default value |
|---|---|---|---|
| `log_progress_interval` | When running computations with Dask, log progress every `log_progress_interval` instead of showing a progress bar. | str (e.g., `10s`, `1m`) or number of seconds | 0 |
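Interval strings like "10s" or "1m" boil down to a number of seconds; a minimal parser sketch (the exact parsing ESMValCore performs may differ, and the accepted suffixes here are an assumption for illustration):

```python
import re

_UNITS = {"": 1, "s": 1, "m": 60, "h": 3600}

def interval_to_seconds(value):
    """Convert an interval like "10s", "1m", or a plain number to seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.fullmatch(r"\s*([0-9.]+)\s*([smh]?)\s*", value)
    if not match:
        raise ValueError(f"Cannot parse interval: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]

print(interval_to_seconds("10s"), interval_to_seconds("1m"))  # 10.0 60.0
```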
ESGF configuration#
The esmvaltool run command can automatically download the files required to run a recipe from ESGF for the projects CMIP3, CMIP5, CMIP6, CORDEX, and obs4MIPs.
The downloaded files will be stored in the directory specified via the configuration option download_dir.
To enable automatic downloads from ESGF, use the configuration option search_esgf: when_missing or search_esgf: always.
Note
When running a recipe that uses many or large datasets on a machine that does not have any data available locally, the amount of data that will be downloaded can be in the range of a few hundred gigabytes to a few terabytes. See Obtaining input data for advice on getting access to machines with large datasets already available.
A log message will be displayed with the total amount of data that will be downloaded before starting the download.
If you see that this is more than you would like to download, stop the tool by pressing the Ctrl and C keys on your keyboard simultaneously several times, edit the recipe so it contains fewer datasets, and try again.
Configuration file#
An optional configuration file can be created for configuring how the tool uses esgf-pyclient to find and download data.
The name of this file is ~/.esmvaltool/esgf-pyclient.yml.
Search#
Any arguments to pyesgf.search.connection.SearchConnection can be provided in the section search_connection, for example:
search_connection:
expire_after: 2592000 # the number of seconds in a month
to keep cached search results for a month.
The default settings are:
urls:
- 'https://esgf.ceda.ac.uk/esg-search'
- 'https://esgf-node.llnl.gov/esg-search'
- 'https://esgf-data.dkrz.de/esg-search'
- 'https://esgf-node.ipsl.upmc.fr/esg-search'
- 'https://esg-dn1.nsc.liu.se/esg-search'
- 'https://esgf.nci.org.au/esg-search'
- 'https://esgf.nccs.nasa.gov/esg-search'
- 'https://esgdata.gfdl.noaa.gov/esg-search'
distrib: true
timeout: 120 # seconds
cache: '~/.esmvaltool/cache/pyesgf-search-results'
expire_after: 86400 # cache expires after 1 day
Note that by default the tool will try the ESGF index nodes in the order provided in the configuration file and use the first one that is online. Some ESGF index nodes may return search results faster than others, so you may be able to speed up the search for files by experimenting with placing different index nodes at the top of the list.
If you experience errors while searching, it sometimes helps to delete the cached results.
Download statistics#
The tool will maintain statistics of how fast data can be downloaded from what host in the file ~/.esmvaltool/cache/esgf-hosts.yml and automatically select hosts that are faster. There is no need to manually edit this file, though it can be useful to delete it if you move your computer to a location that is very different from the place where you previously downloaded data. An entry in the file might look like this:
esgf2.dkrz.de:
duration (s): 8
error: false
size (bytes): 69067460
speed (MB/s): 7.9
The tool only uses the duration and size to determine the download speed; the speed shown in the file is not used.
If error is set to true, the most recent download request to that host failed, and the tool will automatically try this host only as a last resort.
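The relationship between the recorded fields and the ranking behavior can be sketched as follows (the field names mirror the esgf-hosts.yml example above; the extra hostnames and the exact ranking logic are illustrative assumptions, not ESMValCore internals):

```python
# Recorded statistics per host, as in esgf-hosts.yml. The second and third
# hosts are hypothetical examples.
hosts = {
    "esgf2.dkrz.de": {"duration (s)": 8, "size (bytes)": 69067460, "error": False},
    "slow.example.org": {"duration (s)": 20, "size (bytes)": 10000000, "error": False},
    "broken.example.org": {"duration (s)": 5, "size (bytes)": 50000000, "error": True},
}

def speed_mb_s(stats):
    """Download speed in MB/s, derived from size and duration."""
    return stats["size (bytes)"] / 1e6 / stats["duration (s)"]

# Prefer fast hosts; hosts whose last download failed come last.
ranked = sorted(hosts, key=lambda h: (hosts[h]["error"], -speed_mb_s(hosts[h])))
print(ranked)
```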
Developer configuration file#
Most users and diagnostic developers will not need to change this file, but it may be useful to understand its content. It will be installed along with ESMValCore and can also be viewed on GitHub: esmvalcore/config-developer.yml. This configuration file describes the file system structure and CMOR tables for several key projects (CMIP6, CMIP5, obs4MIPs, OBS6, OBS) on several key machines (e.g. BADC, CP4CDS, DKRZ, ETHZ, SMHI, BSC), and for native output data for some models (ICON, IPSL, … see Configuring datasets in native format). CMIP data is stored as part of the Earth System Grid Federation (ESGF) and the standards for file naming and paths to files are set out by CMOR and DRS. For a detailed description of these standards and their adoption in ESMValCore, we refer the user to CMIP data section where we relate these standards to the data retrieval mechanism of the ESMValCore.
Users can get a copy of this file with default values by running
esmvaltool config get_config_developer --path=${TARGET_FOLDER}
If the option --path is omitted, the file will be created in ~/.esmvaltool.
Note
Remember to change the configuration option config_developer_file if you want to use a custom config developer file.
Warning
For now, make sure that the custom config-developer.yml is not saved in the ESMValTool/Core configuration directories (see YAML files for details).
This will change in the future due to the redesign of ESMValTool/Core's configuration.
Example of the CMIP6 project configuration:
CMIP6:
input_dir:
default: '/'
BADC: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
DKRZ: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
ETHZ: '{exp}/{mip}/{short_name}/{dataset}/{ensemble}/{grid}/'
input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}'
cmor_type: 'CMIP6'
cmor_strict: true
Input file paths#
When looking for input files, the esmvaltool command provided by ESMValCore replaces the placeholders {item} in input_dir and input_file with the values supplied in the recipe.
ESMValCore will try to automatically fill in the values for institute, frequency, and modeling_realm based on the information provided in the CMOR tables and/or extra facets when reading the recipe.
If this fails for some reason, these values can be provided in the recipe too.
The data directory structure of the CMIP projects is set up differently at each site. As an example, the CMIP6 directory path on BADC would be:
'{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
The resulting directory path would look something like this:
CMIP/MOHC/HadGEM3-GC31-LL/historical/r1i1p1f3/Omon/tos/gn/latest
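The placeholder substitution boils down to string formatting with the facets from the recipe; a minimal illustration using the BADC template and the facets behind the path above:

```python
# The BADC input_dir template from the CMIP6 example above.
input_dir = ("{activity}/{institute}/{dataset}/{exp}/{ensemble}/"
             "{mip}/{short_name}/{grid}/{version}")

# Facets as they would be resolved from a recipe entry.
facets = {
    "activity": "CMIP", "institute": "MOHC", "dataset": "HadGEM3-GC31-LL",
    "exp": "historical", "ensemble": "r1i1p1f3", "mip": "Omon",
    "short_name": "tos", "grid": "gn", "version": "latest",
}

path = input_dir.format(**facets)
print(path)  # CMIP/MOHC/HadGEM3-GC31-LL/historical/r1i1p1f3/Omon/tos/gn/latest
```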
Please bear in mind that input_dir can also be a list for those cases in which it may be needed:
- '{exp}/{ensemble}/original/{mip}/{short_name}/{grid}/{version}'
- '{exp}/{ensemble}/computed/{mip}/{short_name}/{grid}/{version}'
In that case, the resulting directories will be:
historical/r1i1p1f3/original/Omon/tos/gn/latest
historical/r1i1p1f3/computed/Omon/tos/gn/latest
For a more in-depth description of how to configure ESMValCore so it can find your data please see CMIP data.
Preprocessor output files#
The filename to use for preprocessed data is configured in a similar manner using output_file.
Note that the extension .nc (and, if applicable, a start and end time) will automatically be appended to the filename.
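As a sketch, the final filename is the formatted output_file template plus the appended parts (the exact time-range formatting used here is an assumption for illustration):

```python
# The output_file template from the CMIP6 example above.
output_file = "{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}"

facets = {"project": "CMIP6", "dataset": "HadGEM3-GC31-LL", "mip": "Omon",
          "exp": "historical", "ensemble": "r1i1p1f3", "short_name": "tos"}
start, end = "2000", "2014"

# Time range and ".nc" extension are appended automatically by ESMValCore.
filename = output_file.format(**facets) + f"_{start}-{end}.nc"
print(filename)  # CMIP6_HadGEM3-GC31-LL_Omon_historical_r1i1p1f3_tos_2000-2014.nc
```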
Project CMOR table configuration#
ESMValCore comes bundled with several CMOR tables, which are stored in the directory esmvalcore/cmor/tables. These are copies of the tables available from PCMDI.
For every project that can be used in the recipe, there are four settings related to CMOR tables available:

- cmor_type: can be CMIP5 if the CMOR table is in the same format as the CMIP5 table or CMIP6 if the table is in the same format as the CMIP6 table.
- cmor_strict: if this is set to false, the CMOR table will be extended with variables from the Custom CMOR tables (by default loaded from the esmvalcore/cmor/tables/custom directory) and it is possible to use variables with a mip which is different from the MIP table in which they are defined. Note that this option is always enabled for derived variables.
- cmor_path: path to the CMOR table. Relative paths are with respect to esmvalcore/cmor/tables. Defaults to the value provided in cmor_type written in lower case.
- cmor_default_table_prefix: prefix that needs to be added to the mip to get the name of the file containing the mip table. Defaults to the value provided in cmor_type.
Custom CMOR tables#
As mentioned in the previous section, the CMOR tables of projects that use cmor_strict: false will be extended with custom CMOR tables.
For derived variables (the ones with derive: true in the recipe), the custom CMOR tables will always be considered.
By default, these custom tables are loaded from esmvalcore/cmor/tables/custom.
However, by using the special project custom in the config-developer.yml file with the option cmor_path, a custom location for these custom CMOR tables can be specified.
In this case, the default custom tables are extended with those entries from the custom location (in case of duplication, the tables from the custom location take precedence).
Example:
custom:
cmor_path: ~/my/own/custom_tables
This path can be given as a relative path (relative to esmvalcore/cmor/tables) or as an absolute path. Other options given for this special table will be ignored.
Custom tables in this directory need to follow the naming convention CMOR_{short_name}.dat and need to be given in CMIP5 format.
Example for the file CMOR_asr.dat:
SOURCE: CMIP5
!============
variable_entry: asr
!============
modeling_realm: atmos
!----------------------------------
! Variable attributes:
!----------------------------------
standard_name:
units: W m-2
cell_methods: time: mean
cell_measures: area: areacella
long_name: Absorbed shortwave radiation
!----------------------------------
! Additional variable information:
!----------------------------------
dimensions: longitude latitude time
type: real
positive: down
!----------------------------------
!
It is also possible to use a special coordinates file CMOR_coordinates.dat, which will extend the entries from the default one (esmvalcore/cmor/tables/custom/CMOR_coordinates.dat).
Filter preprocessor warnings#
It is possible to ignore specific warnings of the preprocessor for a given project.
This is particularly useful for native datasets, which do not follow the CMOR standard by default and consequently produce a lot of warnings when handled by Iris.
This can be configured in the config-developer.yml file for some steps of the preprocessing chain.
Currently, the only supported preprocessor step is load.
Here is an example of how to ignore specific warnings during the preprocessor step load for all datasets of project EMAC (taken from the default config-developer.yml file):
ignore_warnings:
load:
- {message: 'Missing CF-netCDF formula term variable .*, referenced by netCDF variable .*', module: iris}
- {message: 'Ignored formula of unrecognised type: .*', module: iris}
The keyword arguments specified in the list items are passed directly to warnings.filterwarnings() in addition to action=ignore (which may be overwritten in config-developer.yml).
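The effect of such a filter list can be demonstrated with the standard library alone (the filter entries mirror the YAML list above; how ESMValCore applies them internally may differ in detail):

```python
import warnings

# Filter entries as they appear in the config-developer.yml example above.
ignore_warnings = [
    {"message": "Missing CF-netCDF formula term variable .*, "
                "referenced by netCDF variable .*",
     "module": "iris"},
    {"message": "Ignored formula of unrecognised type: .*", "module": "iris"},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Apply each entry with action="ignore", as described in the text.
    for kwargs in ignore_warnings:
        warnings.filterwarnings("ignore", **kwargs)
    # A warning matching the first filter is suppressed...
    warnings.warn_explicit(
        "Missing CF-netCDF formula term variable ps, "
        "referenced by netCDF variable ta",
        UserWarning, filename="netcdf.py", lineno=1,
        module="iris.fileformats.netcdf")
    # ...while an unrelated warning still gets through.
    warnings.warn("unrelated", UserWarning)

print([str(w.message) for w in caught])  # ['unrelated']
```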
Configuring datasets in native format#
ESMValCore can be configured to handle native model output formats and specific reanalysis/observation datasets without preliminary reformatting.
These datasets can be hosted either under the native6 project (mostly native reanalysis/observational datasets) or under a dedicated project, e.g., ICON (mostly native models).
Example:
native6:
cmor_strict: false
input_dir:
default: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
input_file:
default: '*.nc'
output_file: '{project}_{dataset}_{type}_{version}_{mip}_{short_name}'
cmor_type: 'CMIP6'
cmor_default_table_prefix: 'CMIP6_'
ICON:
cmor_strict: false
input_dir:
default:
- '{exp}'
- '{exp}/outdata'
- '{exp}/output'
input_file:
default: '{exp}_{var_type}*.nc'
output_file: '{project}_{dataset}_{exp}_{var_type}_{mip}_{short_name}'
cmor_type: 'CMIP6'
cmor_default_table_prefix: 'CMIP6_'
A detailed description on how to add support for further native datasets is given here.
Hint
When using native datasets, it might be helpful to specify a custom location for the Custom CMOR tables.
This allows reading arbitrary variables from native datasets.
Note that this requires the option cmor_strict: false in the project configuration used for the native model output.
References configuration file#
The esmvaltool/config-references.yml file contains the list of ESMValTool diagnostic and recipe authors, references and projects. Each author, project and reference referred to in the documentation section of a recipe needs to be in this file in the relevant section.
For instance, the recipe recipe_ocean_example.yml contains the following documentation section:
documentation:
authors:
- demo_le
maintainer:
- demo_le
references:
- demora2018gmd
projects:
- ukesm
The four items here refer to the people, references, and projects listed in the config-references.yml file.
Extra Facets#
It can be useful to automatically add extra key-value pairs to variables or datasets in the recipe. These key-value pairs can be used for finding data or for providing extra information to the functions that fix data before passing it on to the preprocessor.
To support this, we provide the extra facets facilities. Facets are the key-value pairs described in Recipe section: datasets. Extra facets allow for the addition of more details per project, dataset, mip table, and variable name.
More precisely, one can provide this information in an extra YAML file, named {project}-something.yml, where {project} corresponds to the project as used by ESMValCore in Recipe section: datasets and "something" is arbitrary.
Format of the extra facets files#
The extra facets are given in a YAML file whose file name identifies the project. Inside the file there is a hierarchy of nested dictionaries with the following levels: at the top the dataset facet, followed by the mip table, and finally the short_name. The leaf dictionary placed here gives the extra facets that will be made available to the data finder and the fix infrastructure. The following example illustrates the concept.
ERA5:
Amon:
tas: {source_var_name: "t2m", cds_var_name: "2m_temperature"}
The three levels of keys in this mapping can contain Unix shell-style wildcards. The special characters used in shell-style wildcards are:
| Pattern | Meaning |
|---|---|
| `*` | matches everything |
| `?` | matches any single character |
| `[seq]` | matches any character in seq |
| `[!seq]` | matches any character not in seq |

where seq can either be a range of characters or just a bunch of characters, for example [A-C] matches the characters A, B, and C, while [AC] matches the characters A and C.
For example, this is used to automatically add product: output1 to any variable of any CMIP5 dataset that does not yet have a product key:
'*':
'*':
'*': {product: output1}
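These are the same Unix shell-style wildcards implemented by Python's fnmatch module, so their behavior can be checked directly (the facet values below are examples, not special to ESMValCore):

```python
from fnmatch import fnmatchcase

# '*' matches any dataset name.
print(fnmatchcase("HadGEM3-GC31-LL", "*"))  # True
# '?' matches exactly one character.
print(fnmatchcase("tas", "ta?"))            # True
# '[AC]mon' matches 'Amon' and 'Cmon', but not 'Lmon'.
print(fnmatchcase("Amon", "[AC]mon"))       # True
print(fnmatchcase("Lmon", "[AC]mon"))       # False
# '[!AC]mon' matches any mip table NOT starting with 'A' or 'C'.
print(fnmatchcase("Omon", "[!AC]mon"))      # True
```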
Location of the extra facets files#
Extra facets files can be placed in several different places. When we use them to support a particular use case within the ESMValCore project, they are provided in the sub-folder extra_facets inside the package esmvalcore.config. If they are used from the user side, they can be placed either in ~/.esmvaltool/extra_facets or in any other directory of the user's choosing. In that case, the configuration option extra_facets_dir must be set, which can take a single directory or a list of directories.

The order in which the directories are searched is:

1. The internal directory esmvalcore.config/extra_facets
2. The default user directory ~/.esmvaltool/extra_facets
3. The custom user directories given by the configuration option extra_facets_dir

The extra facets files within each of these directories are processed in lexicographical order according to their file name.
In all cases it is allowed to supersede information from earlier files in later files. This makes it possible for the user to effectively override even internal default facets, for example to deal with local particularities in the data handling.
Use of extra facets#
For extra facets to be useful, the information that they provide must be applied. There are fundamentally two places where this comes into play: one is the data finder, the other is the fixes.