Reproducibility

If you use ESGF data in a published analysis, the journal to which you are submitting may require data citations or a data availability statement. While we are working on improving this in ESGF, we also want to highlight the current functionality. Consider the following query, which we assume was used in some analysis. For later comparison, we print the underlying dataframe to show the results of the search.

from intake_esgf import ESGFCatalog

cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id=["gpp", "tas", "nbp"],
    variant_label=["r1i1p1f1"],
    frequency="mon",
)
cat.df
   table_id  experiment_id  institution_id  variable_id  datetime_start        activity_drs  version   member_id  source_id  datetime_stop         grid_label  mip_era  project  id
0  Amon      historical     CCCma           tas          1850-01-16T12:00:00Z  CMIP          20190429  r1i1p1f1   CanESM5    2014-12-16T12:00:00Z  gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
1  Lmon      historical     CCCma           nbp          1850-01-16T12:00:00Z  CMIP          20190429  r1i1p1f1   CanESM5    2014-12-16T12:00:00Z  gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
2  Lmon      historical     CCCma           gpp          1850-01-16T12:00:00Z  CMIP          20190429  r1i1p1f1   CanESM5    2014-12-16T12:00:00Z  gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....

In the course of the analysis, you would download the datasets into a dictionary.

dsd = cat.to_dataset_dict(add_measures=False)

Then you may loop through the datasets and pull out the tracking_id from the global attributes of each dataset.

# Collect the tracking_id global attribute from each dataset
tracking_ids = [ds.attrs["tracking_id"] for ds in dsd.values()]
for tracking_id in tracking_ids:
    print(tracking_id)
hdl:21.14100/387658c8-f085-4ab8-995c-def848e7d856
hdl:21.14100/872062df-acae-499b-aa0f-9eaca7681abc
hdl:21.14100/52656bcc-3758-463b-964f-ef8863a6424a

The tracking_id is similar to a digital object identifier (DOI): you can include it in your paper or supplemental material to state precisely which ESGF data you used. If you have a list of tracking_ids, you can pass them into from_tracking_ids() to reproduce the catalog.

new_cat = ESGFCatalog().from_tracking_ids(tracking_ids)
new_cat.df
   table_id  experiment_id  institution_id  variable_id  activity_drs  version  member_id  source_id  grid_label  mip_era  project  id
0  Lmon      historical     CCCma           gpp          CMIP          1        r1i1p1f1   CanESM5    gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
1  Amon      historical     CCCma           tas          CMIP          1        r1i1p1f1   CanESM5    gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
2  Lmon      historical     CCCma           nbp          CMIP          1        r1i1p1f1   CanESM5    gn          CMIP6    CMIP6    [CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
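
Because the rows of the two dataframes come back in a different order, a programmatic check can be easier than reading the tables side by side. The following is a minimal sketch of such a check, not part of the package: the helper names facet_set and same_datasets are hypothetical, and the rows are hand-written stand-ins for the outputs shown on this page. It compares only facet columns that appear in both outputs, leaving out version (displayed differently in the two tables) and id (which holds lists).

```python
# Facet columns that appear in both dataframe outputs on this page.
# version is left out because it is displayed differently in the two tables,
# and id is left out because it holds lists, which cannot be placed in a set.
FACETS = (
    "table_id", "experiment_id", "institution_id", "variable_id",
    "activity_drs", "member_id", "source_id", "grid_label", "mip_era", "project",
)


def facet_set(records, facets=FACETS):
    """Reduce row dictionaries to an order-independent set of facet tuples."""
    return {tuple(str(row[name]) for name in facets) for row in records}


def same_datasets(records_a, records_b, facets=FACETS):
    """Return True if both collections of rows describe the same datasets."""
    return facet_set(records_a, facets) == facet_set(records_b, facets)


# Hand-written stand-in rows (hypothetical, reduced to two facets for brevity).
original = [
    {"table_id": "Amon", "variable_id": "tas"},
    {"table_id": "Lmon", "variable_id": "nbp"},
    {"table_id": "Lmon", "variable_id": "gpp"},
]
reproduced = [
    {"table_id": "Lmon", "variable_id": "gpp"},
    {"table_id": "Amon", "variable_id": "tas"},
    {"table_id": "Lmon", "variable_id": "nbp"},
]

match = same_datasets(original, reproduced, facets=("table_id", "variable_id"))
print(match)
```

With real catalogs, the row dictionaries could be obtained from the dataframes with pandas, for example cat.df.to_dict("records") and new_cat.df.to_dict("records"), and compared using the default FACETS.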

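If you visually compare cat with new_cat, you will see that they contain the same three datasets, although the rows appear in a different order, new_cat.df omits the datetime_start and datetime_stop columns, and the version column is displayed differently. From here you may interact with the new catalog and recover the data you used if needed. This can also be used to quickly communicate to colleagues which data should be used.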
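
One simple way to pass the identifiers to a colleague, or to include them in supplemental material, is a plain-text file with one tracking_id per line. The sketch below is only an illustration under that assumption: the helper names and the file name tracking_ids.txt are hypothetical, and a temporary directory is used so that the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_tracking_ids(tracking_ids, path):
    """Write one tracking_id per line to a text file."""
    Path(path).write_text("\n".join(tracking_ids) + "\n", encoding="utf-8")


def read_tracking_ids(path):
    """Read tracking_ids back from a text file, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# The identifiers printed earlier on this page.
tracking_ids = [
    "hdl:21.14100/387658c8-f085-4ab8-995c-def848e7d856",
    "hdl:21.14100/872062df-acae-499b-aa0f-9eaca7681abc",
    "hdl:21.14100/52656bcc-3758-463b-964f-ef8863a6424a",
]

# Round trip through a file (hypothetical file name) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tracking_ids.txt"
    write_tracking_ids(tracking_ids, path)
    restored = read_tracking_ids(path)

print(restored == tracking_ids)
```

The restored list could then be passed to from_tracking_ids() in the same way as the in-memory list above.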