Automatic Cell Measures

If you have worked with CMIP data before, you know that cell measure information like areacella is needed to compute proper area-weighted means and sums. Yet model centers often have not uploaded this information consistently across their submissions, which can be frustrating for the user.

In intake-esgf, when you call to_dataset_dict(), we perform a search for each dataset being placed in the dataset dictionary, progressively dropping facets to find, if possible, the cell measures that are closest to the dataset being downloaded. Sometimes they are simply in another variant_label, but other times they could be in a different activity_id. Wherever they turn up, we add them to your dataset by default (disable with add_measures=False).
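
The idea of progressively dropping facets can be pictured with a small, self-contained sketch. This is not the intake-esgf implementation: the in-memory CATALOG, the search and find_measure helpers, and the order in which facets are relaxed (variant_label, then experiment_id, then activity_id) are all assumptions made for illustration.

```python
# Illustrative only: a toy in-memory catalog, not a real ESGF index.
CATALOG = [
    {
        "source_id": "UKESM1-0-LL",
        "activity_id": "CMIP",
        "experiment_id": "historical",
        "variant_label": "r2i1p1f2",
        "variable_id": "gpp",
    },
    {
        "source_id": "UKESM1-0-LL",
        "activity_id": "CMIP",
        "experiment_id": "piControl",
        "variant_label": "r1i1p1f2",
        "variable_id": "areacella",
    },
]

# Assumed relaxation order, most specific facet first.
RELAX_ORDER = ["variant_label", "experiment_id", "activity_id"]


def search(catalog, **facets):
    """Return every record that matches all of the given facet values."""
    return [
        rec for rec in catalog
        if all(rec.get(key) == value for key, value in facets.items())
    ]


def find_measure(catalog, dataset_facets, measure):
    """Search for `measure`, dropping one more facet after each empty result."""
    query = dict(dataset_facets, variable_id=measure)
    attempts = [dict(query)]
    for facet in RELAX_ORDER:
        query.pop(facet, None)
        attempts.append(dict(query))
    for attempt in attempts:
        results = search(catalog, **attempt)
        if results:
            return results[0], attempt
    return None, None


gpp_facets = {
    "source_id": "UKESM1-0-LL",
    "activity_id": "CMIP",
    "experiment_id": "historical",
    "variant_label": "r2i1p1f2",
}
record, used_query = find_measure(CATALOG, gpp_facets, "areacella")
print(record["experiment_id"], record["variant_label"])
print(sorted(used_query))
```

In this toy setup the first two attempts return nothing, and the match only appears once both variant_label and experiment_id have been dropped from the query.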

Consider the following search for data with UKESM1-0-LL. We are looking for the land variable gpp, the gross primary productivity.

from intake_esgf import ESGFCatalog
cat = ESGFCatalog().search(
    variable_id="gpp",
    source_id="UKESM1-0-LL",
    variant_label="r2i1p1f2",
    frequency="mon",
    experiment_id="historical",
)
dsd = cat.to_dataset_dict()

The progress bar will let you know that we are searching for cell measure information. We determine which measures need to be downloaded by looking in the dataset attributes. Since gpp is a land variable, we see that its cell_measures = 'area: areacella', which indicates that areacella should also be downloaded. However, you will also find 'where land' in the cell_methods, meaning that we also need sftlf, the land fractions. If you look at the resulting dataset, you will find that both have been associated.

dsd["gpp"]
<xarray.Dataset> Size: 230MB
Dimensions:    (time: 1980, bnds: 2, lat: 144, lon: 192)
Coordinates:
  * time       (time) object 16kB 1850-01-16 00:00:00 ... 2014-12-16 00:00:00
  * lat        (lat) float64 1kB -89.38 -88.12 -86.88 ... 86.88 88.12 89.38
  * lon        (lon) float64 2kB 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
    type       |S4 4B ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object 32kB dask.array<chunksize=(1, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 5MB dask.array<chunksize=(1200, 144, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 6MB dask.array<chunksize=(1200, 192, 2), meta=np.ndarray>
    gpp        (time, lat, lon) float32 219MB dask.array<chunksize=(1, 144, 192), meta=np.ndarray>
    sftlf      (lat, lon) float32 111kB ...
    areacella  (lat, lon) float32 111kB ...
Attributes: (12/46)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  113400.0
    creation_date:          2019-07-04T10:57:56Z
    ...                     ...
    title:                  UKESM1-0-LL output prepared for CMIP6
    variable_id:            gpp
    variant_label:          r2i1p1f2
    license:                CMIP6 model data produced by the Met Office Hadle...
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/8a19464e-4fff-4ccd-b45c-6c0c79f7e70a
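
The attribute check described above can be sketched in a few lines. This is a simplified illustration rather than the intake-esgf code: the required_measures helper is hypothetical, and the sample attribute values are modeled on what a land variable such as gpp typically carries.

```python
import re


def required_measures(attrs):
    """Return the names of the measures implied by CF-style variable attributes."""
    needed = []
    # cell_measures holds "kind: name" pairs, e.g. "area: areacella".
    for _, name in re.findall(r"(\w+):\s*(\w+)", attrs.get("cell_measures", "")):
        needed.append(name)
    # "where land" in cell_methods means the values apply to the land
    # portion of each cell, so the land fraction is needed as well.
    if "where land" in attrs.get("cell_methods", ""):
        needed.append("sftlf")
    return needed


# Sample attributes (assumed values for illustration).
gpp_attrs = {
    "cell_measures": "area: areacella",
    "cell_methods": "area: mean where land time: mean",
}
print(required_measures(gpp_attrs))
```

For the sample attributes this lists areacella first and sftlf second, matching the two extra variables seen in the dataset above.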

What makes this particular example difficult is that the cell measures for this model are only found in the piControl experiment, for the r1i1p1f2 variant. Our method finds the right measures, which you can see by printing out the session log and looking for which areacella files are downloaded or accessed.

print(cat.session_log())
2024-05-02 17:59:43 search begin variable_id=['gpp'], source_id=['UKESM1-0-LL'], variant_label=['r2i1p1f2'], frequency=['mon'], experiment_id=['historical'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 17:59:44 combine_time=0.01
2024-05-02 17:59:44 search end total_time=0.94
2024-05-02 17:59:44 file info begin
2024-05-02 17:59:45 file info end total_time=1.26
2024-05-02 17:59:45 begin move_data
2024-05-02 17:59:49 transfer_time=3.53 [s] at 6.71 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Lmon/gpp/gn/v20190708/gpp_Lmon_UKESM1-0-LL_historical_r2i1p1f2_gn_195001-201412.nc
2024-05-02 17:59:53 transfer_time=7.80 [s] at 5.22 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Lmon/gpp/gn/v20190708/gpp_Lmon_UKESM1-0-LL_historical_r2i1p1f2_gn_185001-194912.nc
2024-05-02 17:59:53 end move_data
2024-05-02 17:59:53 search begin variant_label=['r2i1p1f2'], source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], experiment_id=['historical'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['sftlf'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 17:59:54 search end no results
2024-05-02 17:59:54 search begin source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], experiment_id=['historical'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['sftlf'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 17:59:56 search end no results
2024-05-02 17:59:56 search begin source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['sftlf'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 17:59:57 combine_time=0.00
2024-05-02 17:59:57 search end total_time=1.23
2024-05-02 17:59:57 file info begin
2024-05-02 17:59:58 file info end total_time=1.03
2024-05-02 17:59:58 begin move_data
2024-05-02 17:59:58 transfer_time=0.07 [s] at 1.23 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/piControl/r1i1p1f2/fx/sftlf/gn/v20190705/sftlf_fx_UKESM1-0-LL_piControl_r1i1p1f2_gn.nc
2024-05-02 17:59:58 end move_data
2024-05-02 17:59:58 search begin variant_label=['r2i1p1f2'], source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], experiment_id=['historical'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 18:00:00 search end no results
2024-05-02 18:00:00 search begin source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], experiment_id=['historical'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 18:00:02 search end no results
2024-05-02 18:00:02 search begin source_id=['UKESM1-0-LL'], mip_era=['CMIP6'], activity_id=['CMIP'], grid_label=['gn'], table_id=['fx', 'Ofx'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2024-05-02 18:00:05 combine_time=0.00
2024-05-02 18:00:05 search end total_time=2.73
2024-05-02 18:00:05 file info begin
2024-05-02 18:00:06 file info end total_time=1.12
2024-05-02 18:00:06 begin move_data
2024-05-02 18:00:06 transfer_time=0.06 [s] at 1.01 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/piControl/r1i1p1f2/fx/areacella/gn/v20190705/areacella_fx_UKESM1-0-LL_piControl_r1i1p1f2_gn.nc
2024-05-02 18:00:06 end move_data
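
Because the log is long, it can help to filter it for the transfers of interest. The sketch below assumes only that the log is newline-separated text in the format shown above, with the URL as the last field of each transfer line. The transferred_files helper is hypothetical, and sample_log copies a few lines from the log above so that the example runs without network access.

```python
def transferred_files(log_text, variable_id):
    """Return URLs of transferred files whose file name starts with `variable_id`."""
    urls = []
    for line in log_text.splitlines():
        if "transfer_time=" not in line:
            continue
        url = line.split()[-1]  # the URL is the last whitespace-separated field
        filename = url.rsplit("/", 1)[-1]
        if filename.startswith(variable_id + "_"):
            urls.append(url)
    return urls


sample_log = """\
2024-05-02 17:59:49 transfer_time=3.53 [s] at 6.71 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Lmon/gpp/gn/v20190708/gpp_Lmon_UKESM1-0-LL_historical_r2i1p1f2_gn_195001-201412.nc
2024-05-02 17:59:58 transfer_time=0.07 [s] at 1.23 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/piControl/r1i1p1f2/fx/sftlf/gn/v20190705/sftlf_fx_UKESM1-0-LL_piControl_r1i1p1f2_gn.nc
2024-05-02 18:00:02 search end no results
2024-05-02 18:00:06 transfer_time=0.06 [s] at 1.01 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MOHC/UKESM1-0-LL/piControl/r1i1p1f2/fx/areacella/gn/v20190705/areacella_fx_UKESM1-0-LL_piControl_r1i1p1f2_gn.nc
"""

for url in transferred_files(sample_log, "areacella"):
    print(url)
```

In a live session the same helper could be applied to the text returned by cat.session_log() in place of sample_log.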