Reproducibility

If you use ESGF data in a published analysis, the journal you submit to may require data citations or a statement of data availability. While we are working on improving this in ESGF, we want to highlight what is currently possible. Consider the following query, which we assume was used in some analysis. For later comparison, we print the underlying dataframe to show the results of the search.

from intake_esgf import ESGFCatalog

cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id=["gpp", "tas", "nbp"],
    variant_label=["r1i1p1f1"],
    frequency="mon",
)
cat.df

	table_id	experiment_id	institution_id	variable_id	datetime_start	activity_drs	version	member_id	source_id	datetime_stop	grid_label	mip_era	project	id
0	Amon	historical	CCCma	tas	1850-01-16T12:00:00Z	CMIP	20190429	r1i1p1f1	CanESM5	2014-12-16T12:00:00Z	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
1	Lmon	historical	CCCma	nbp	1850-01-16T12:00:00Z	CMIP	20190429	r1i1p1f1	CanESM5	2014-12-16T12:00:00Z	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
2	Lmon	historical	CCCma	gpp	1850-01-16T12:00:00Z	CMIP	20190429	r1i1p1f1	CanESM5	2014-12-16T12:00:00Z	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....

In the course of the analysis, you would download the datasets into a dictionary.

dsd = cat.to_dataset_dict(add_measures=False)

Then you may loop through the datasets and pull out the tracking_id from the global attributes of each dataset.

tracking_ids = [ds.attrs["tracking_id"] for ds in dsd.values()]
for tracking_id in tracking_ids:
    print(tracking_id)

hdl:21.14100/387658c8-f085-4ab8-995c-def848e7d856
hdl:21.14100/872062df-acae-499b-aa0f-9eaca7681abc
hdl:21.14100/52656bcc-3758-463b-964f-ef8863a6424a
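To keep these identifiers alongside an analysis, one option is to write them to a plain-text file, one per line. The following is a minimal sketch using only the Python standard library. The helper names `write_tracking_ids` and `read_tracking_ids` are hypothetical and are not part of intake-esgf.

```python
from pathlib import Path


def write_tracking_ids(tracking_ids, path):
    """Write one tracking_id per line, sorted and de-duplicated."""
    unique_ids = sorted(set(tracking_ids))
    Path(path).write_text("\n".join(unique_ids) + "\n", encoding="utf-8")


def read_tracking_ids(path):
    """Read tracking_ids back from a file, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# The three identifiers printed above, used here as example input.
example_ids = [
    "hdl:21.14100/387658c8-f085-4ab8-995c-def848e7d856",
    "hdl:21.14100/872062df-acae-499b-aa0f-9eaca7681abc",
    "hdl:21.14100/52656bcc-3758-463b-964f-ef8863a6424a",
]
```

In a real session you would pass your own `tracking_ids` list and a file name of your choosing, for example `write_tracking_ids(tracking_ids, "tracking_ids.txt")`.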

The tracking_id is similar to a digital object identifier (DOI). Listing the tracking_ids in your paper or supplemental material states precisely which ESGF data you used. Given a list of tracking_ids, you can pass them to from_tracking_ids() to reproduce the catalog.

new_cat = ESGFCatalog().from_tracking_ids(tracking_ids)
new_cat.df

	table_id	experiment_id	institution_id	variable_id	activity_drs	version	member_id	source_id	grid_label	mip_era	project	id
0	Lmon	historical	CCCma	gpp	CMIP	1	r1i1p1f1	CanESM5	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
1	Amon	historical	CCCma	tas	CMIP	1	r1i1p1f1	CanESM5	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....
2	Lmon	historical	CCCma	nbp	CMIP	1	r1i1p1f1	CanESM5	gn	CMIP6	CMIP6	[CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1....

If you visually compare cat with new_cat you will see that they list the same three datasets, although the row order, the version values, and the columns shown differ. From here you may interact with the new catalog and recover the data you used if needed. This can also be used to quickly communicate to colleagues which data should be used.
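For more than a handful of rows, a programmatic check may be easier than reading the two tables. The sketch below compares two lists of row dictionaries on a set of identifying columns, ignoring row order. The choice of `KEY_COLUMNS` and the helper names are assumptions made for this illustration, not part of intake-esgf, and the example rows are copied from the tables above with only the key columns kept.

```python
# Columns assumed to identify a dataset in this sketch (a hypothetical choice).
KEY_COLUMNS = (
    "table_id",
    "experiment_id",
    "institution_id",
    "variable_id",
    "member_id",
    "source_id",
    "grid_label",
)


def dataset_keys(records, columns=KEY_COLUMNS):
    """Return the set of identifying tuples from a list of row dicts."""
    return {tuple(row[col] for col in columns) for row in records}


def same_datasets(records_a, records_b, columns=KEY_COLUMNS):
    """True if both record lists describe the same datasets, in any order."""
    return dataset_keys(records_a, columns) == dataset_keys(records_b, columns)


def make_row(table_id, variable_id):
    """Build one example row using the values shown in the tables above."""
    return {
        "table_id": table_id,
        "experiment_id": "historical",
        "institution_id": "CCCma",
        "variable_id": variable_id,
        "member_id": "r1i1p1f1",
        "source_id": "CanESM5",
        "grid_label": "gn",
    }


# Row order differs between the two catalogs, as in the printed output.
original_rows = [make_row("Amon", "tas"), make_row("Lmon", "nbp"), make_row("Lmon", "gpp")]
reproduced_rows = [make_row("Lmon", "gpp"), make_row("Amon", "tas"), make_row("Lmon", "nbp")]

print(same_datasets(original_rows, reproduced_rows))
```

Assuming `cat.df` and `new_cat.df` are pandas dataframes, the same check could be applied to the real catalogs with `same_datasets(cat.df.to_dict("records"), new_cat.df.to_dict("records"))`.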