Beginner’s Guide to ESGF

This guide is targeted at users who are new to obtaining CMIP data from ESGF. While many people work hard to make access intuitive for the community, ESGF remains a data source for researchers who already have some understanding of the data they wish to find and how it is organized. This tutorial gently introduces the key concepts and steps you through your first searches using intake-esgf.

Which Variable Do We Need?

At the highest level, ESGF stores data in projects such as CMIP5 and CMIP6. While there are some similarities between projects, the control vocabulary, that is, the metadata used to identify unique datasets, varies. In this tutorial we will explain some of the vocabulary of CMIP6, which is the default project for intake-esgf.

Perhaps the most important search criterion to determine is the name of the variable you wish to use. intake-esgf has some functionality to help with this. First, import and instantiate the catalog.

from intake_esgf import ESGFCatalog
cat = ESGFCatalog()

Then you can use the catalog to perform a free text search for any word that may be related to the variable for which you are searching. In this case, we will search for air temperature surface.

cat.variable_info("air temperature surface")

This function returns a pandas dataframe which lists the names of several variables along with their units and standard names. From a perusal of this list, it appears that tas is the variable we want for this search. The dataframe index also shows us that the corresponding facet in the control vocabulary is named variable_id.

Control Vocabulary

While we could now perform a search for variable_id=tas, this search will take quite some time. intake-esgf currently works better if we give it a better idea of what we wish to find. Simply put, we recommend constraining the search.

One of the more useful search facets is the experiment_id, a unique identifier corresponding to the experiment. As part of the planning phase of the CMIP process, groups of researchers write papers detailing the protocol under which a model must be run to be included in an experiment. Modeling centers follow that protocol if they wish to be part of the experiment. You can browse the experiments to see the identifiers and some basic information.

One commonly used experiment is historical, where models are run using reconstructions of the historical Earth state from 1850 through the end of 2014. We will use this in our example search.

cat.search(variable_id="tas", experiment_id="historical")
Summary information for 1908 results:
mip_era           [CMIP6]
activity_drs      [CMIP]
institution_id    [AS-RCEC, AWI, BCC, CAMS, CAS, CCCR-IITM, CCCm...
source_id         [TaiESM1, AWI-CM-1-1-MR, AWI-ESM-1-1-LR, BCC-C...
experiment_id     [historical]
member_id         [r1i1p1f1, r2i1p1f1, r3i1p1f1, r4i1p1f1, r5i1p...
table_id          [3hr, Amon, day, 6hrPlev, 6hrPlevPt, AERhr, CF...
variable_id       [tas]
grid_label        [gn, gr, gr1, gra, grg, gr2]
dtype: object

This will populate an underlying pandas dataframe with the search results. The columns of that dataframe and their unique values are presented above. This exposes more of the control vocabulary for CMIP6. We have already explored variable_id and experiment_id. Now we explain more of the control vocabulary, emphasizing what we find to be the more useful facets.
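
One facet in the summary above, member_id, packs four indices into a single label such as r1i1p1f1: realization, initialization, physics, and forcing. The sketch below splits such a label into its parts. The names `Variant` and `parse_variant` are hypothetical helpers written for this guide and are not part of intake-esgf. It also assumes the plain `r<N>i<N>p<N>f<N>` form, so labels that carry a sub-experiment prefix would need extra handling.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    """The four indices encoded in a CMIP6 variant label (hypothetical helper)."""

    realization: int
    initialization: int
    physics: int
    forcing: int


# Assumes the plain r<N>i<N>p<N>f<N> form with no sub-experiment prefix.
_VARIANT_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_variant(label: str) -> Variant:
    """Split a label such as 'r3i1p1f1' into its integer indices."""
    match = _VARIANT_RE.match(label)
    if match is None:
        raise ValueError(f"not a plain variant label: {label!r}")
    return Variant(*(int(group) for group in match.groups()))


print(parse_variant("r3i1p1f1"))
```

Members that differ only in the realization index are typically runs of the same model configuration started from different initial states, which is why many searches begin with r1i1p1f1.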

Downloading Data

We will refine our search to select a single model CanESM5, variant r1i1p1f1, and table Amon.

cat.search(
    variable_id="tas",
    experiment_id="historical",
    source_id="CanESM5",
    member_id="r1i1p1f1",
    table_id="Amon"
)
Summary information for 1 results:
mip_era           [CMIP6]
activity_drs      [CMIP]
institution_id    [CCCma]
source_id         [CanESM5]
experiment_id     [historical]
member_id         [r1i1p1f1]
table_id          [Amon]
variable_id       [tas]
grid_label        [gn]
dtype: object
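
Going from 1908 results to 1 can be pictured as filtering the rows of a table: each extra facet removes the rows that do not match it. The sketch below illustrates that idea only. The function `filter_by_facets` and the rows are hypothetical stand-ins, and this is not a description of how intake-esgf queries the ESGF indices.

```python
def filter_by_facets(rows, **facets):
    """Keep only the rows whose facet values match every keyword given."""
    return [
        row
        for row in rows
        if all(row.get(name) == value for name, value in facets.items())
    ]


# Hypothetical rows standing in for search results; not real ESGF output.
rows = [
    {"source_id": "CanESM5", "member_id": "r1i1p1f1", "table_id": "Amon"},
    {"source_id": "CanESM5", "member_id": "r2i1p1f1", "table_id": "Amon"},
    {"source_id": "CanESM5", "member_id": "r1i1p1f1", "table_id": "day"},
    {"source_id": "TaiESM1", "member_id": "r1i1p1f1", "table_id": "Amon"},
]

# One facet leaves several candidates; three facets leave a single row.
broad = filter_by_facets(rows, source_id="CanESM5")
narrow = filter_by_facets(
    rows, source_id="CanESM5", member_id="r1i1p1f1", table_id="Amon"
)
print(len(broad), len(narrow))
```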

Once your search has been sufficiently narrowed, you may download into a dictionary of xarray datasets.

dsd = cat.to_dataset_dict()
{'tas': <xarray.Dataset> Size: 65MB
 Dimensions:    (time: 1980, bnds: 2, lat: 64, lon: 128)
 Coordinates:
   * time       (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
   * lat        (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
   * lon        (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2
     height     float64 8B ...
 Dimensions without coordinates: bnds
 Data variables:
     time_bnds  (time, bnds) object 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
     lat_bnds   (lat, bnds) float64 1kB dask.array<chunksize=(64, 2), meta=np.ndarray>
     lon_bnds   (lon, bnds) float64 2kB dask.array<chunksize=(128, 2), meta=np.ndarray>
     tas        (time, lat, lon) float32 65MB dask.array<chunksize=(1980, 64, 128), meta=np.ndarray>
     areacella  (lat, lon) float32 33kB dask.array<chunksize=(64, 128), meta=np.ndarray>
 Attributes: (12/55)
     CCCma_model_hash:           3dedf95315d603326fde4f5340dc0519d80d10c0
     CCCma_parent_runid:         rc3-pictrl
     CCCma_pycmor_hash:          33c30511acc319a98240633965a04ca99c26427e
     CCCma_runid:                rc3.1-his01
     Conventions:                CF-1.7 CMIP-6.2
     YMDH_branch_time_in_child:  1850:01:01:00
     ...                         ...
     variant_label:              r1i1p1f1
     version:                    v20190429
     license:                    CMIP6 model data produced by The Government ...
     cmor_version:               3.4.0
     activity_drs:               CMIP
     member_id:                  r1i1p1f1}
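
The sizes reported in the dataset representation above can be checked by hand, which is a useful habit before downloading larger selections. The sketch below uses only the dimension sizes and dtypes printed in that representation; the variable names are ours.

```python
# Dimension sizes copied from the dataset representation above.
n_time, n_lat, n_lon = 1980, 64, 128
bytes_per_float32 = 4

# Monthly means from January 1850 through December 2014.
n_years = 2014 - 1850 + 1
assert n_years * 12 == n_time  # 165 years * 12 months

# tas is float32 on (time, lat, lon); areacella is float32 on (lat, lon).
tas_bytes = n_time * n_lat * n_lon * bytes_per_float32
areacella_bytes = n_lat * n_lon * bytes_per_float32

print(f"tas: {tas_bytes / 1e6:.1f} MB")              # tas: 64.9 MB
print(f"areacella: {areacella_bytes / 1e3:.1f} kB")  # areacella: 32.8 kB
```

Both figures agree with the rounded 65MB and 33kB shown by xarray.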

Note that you do not need to explicitly search for cell measures such as areacella. These will be included automatically. The files are downloaded locally to a cache directory which mirrors the directory structure of the remote storage. So while the above code is how you download data, it is also how you load it into memory for your analysis scripts. There is no need to handle files in your working directory or write complicated code to load them into memory.
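
To give a feel for what a mirrored directory structure can look like, the sketch below assembles a dataset directory from the facet values shown earlier. Two things here are assumptions rather than documented behavior: the cache root `esgf_cache` is a hypothetical placeholder, and the nesting order follows the usual CMIP6 directory convention, which a given data node may not follow exactly. Check your own cache directory for the real layout.

```python
from pathlib import PurePosixPath

# Facet values taken from the search summary and dataset attributes above.
facets = {
    "mip_era": "CMIP6",
    "activity_drs": "CMIP",
    "institution_id": "CCCma",
    "source_id": "CanESM5",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "version": "v20190429",
}

# ASSUMPTION: nesting order follows the usual CMIP6 directory convention.
drs_order = [
    "mip_era", "activity_drs", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]

# ASSUMPTION: hypothetical cache root, not the intake-esgf default.
cache_root = PurePosixPath("esgf_cache")

relative_dir = PurePosixPath(*(facets[name] for name in drs_order))
dataset_dir = cache_root / relative_dir
print(dataset_dir)
```

Because the path is built entirely from facets, two scripts that request the same dataset resolve to the same files, so the second request can reuse what the first one downloaded.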

Plotting

In this example, we will just take a temporal mean and plot the result using matplotlib.

import matplotlib.pyplot as plt

# Average over time and convert from kelvin to degrees Celsius.
tas_mean = dsd["tas"]["tas"].mean(dim="time") - 273.15
fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
tas_mean.plot(ax=ax, cmap="bwr", vmin=-40, vmax=40, cbar_kwargs={"label": "tas [C]"});
<Figure size 600x400 with 2 Axes>
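
A map like this is often followed by a single global number, and that is where the automatically included areacella becomes useful: grid cells near the poles cover less area than cells near the equator, so a plain average over-weights high latitudes. The sketch below shows the arithmetic in plain Python with three made-up temperatures. It uses the cosine of latitude as a stand-in for cell area, which is an assumption that only suits regular latitude-longitude grids.

```python
import math


def area_weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up illustrative temperatures in degrees Celsius, one per latitude band.
lats = [-60.0, 0.0, 60.0]
tas_c = [-10.0, 26.0, 2.0]

# ASSUMPTION: cos(latitude) approximates relative cell area on a regular grid.
weights = [math.cos(math.radians(lat)) for lat in lats]

simple = sum(tas_c) / len(tas_c)
weighted = area_weighted_mean(tas_c, weights)
print(f"simple: {simple:.1f} C, weighted: {weighted:.1f} C")
```

With xarray, the same idea is usually written with `DataArray.weighted`, for example `dsd["tas"]["tas"].weighted(dsd["tas"]["areacella"]).mean(dim=["lat", "lon"])`, though that line is not run here.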