Simplifying Search with Model Groups

At a simple level, you can think of intake-esgf as analogous to the ESGF web interface, except that results are presented to you as a pandas dataframe instead of pages of web results. However, we believe that the user does not want to wade through either of these. Many times you want to see model results organized by unique combinations of source_id, member_id, and grid_label. That is to say, when you are going to perform an analysis, you would like your model outputs to be self-consistent and from the same run and grid, even across experiments. To assist you in homing in on the sets of results that may be useful to your analysis, we introduce the notion of model groups.

Consider the following search, motivated by a desire to study controls (temperature, precipitation) on the carbon cycle (gross primary productivity) across a number of historical and future scenarios.

from intake_esgf import ESGFCatalog
cat = ESGFCatalog().search(
    experiment_id=["historical", "ssp585", "ssp370", "ssp245"],
    variable_id=["gpp", "tas", "pr"],
    table_id=["Amon", "Lmon"],
)
Summary information for 6196 results:
mip_era         [CMIP6]
activity_drs    [AerChemMIP, CMIP, ScenarioMIP]
institution_id  [BCC, AS-RCEC, AWI, CAMS, CAS, CCCR-IITM, CCCm...
source_id       [BCC-ESM1, TaiESM1, AWI-CM-1-1-MR, AWI-ESM-1-1...
experiment_id   [ssp370, historical, ssp245, ssp585]
member_id       [r1i1p1f1, r2i1p1f1, r3i1p1f1, r4i1p1f1, r5i1p...
table_id        [Amon, Lmon]
variable_id     [pr, tas, gpp]
grid_label      [gn, gr, gr1]
dtype: object

Even if this exact application does not resonate with you, the situation is a familiar one. We have several thousand results with many different models and variants to sort through. To help guide you to which groups of models might be useful to you, we provide the following function.

cat.model_groups()
source_id    member_id  grid_label
ACCESS-CM2   r1i1p1f1   gn            8
             r2i1p1f1   gn            8
             r3i1p1f1   gn            8
             r4i1p1f1   gn            8
             r5i1p1f1   gn            8
                                     ..
UKESM1-0-LL  r16i1p1f2  gn            9
             r17i1p1f2  gn            8
             r18i1p1f2  gn            9
             r19i1p1f2  gn            9
UKESM1-1-LL  r1i1p1f2   gn            6
Name: project, Length: 978, dtype: int64

This returns a pandas series in which the results have been grouped and sorted by source_id, member_id, and grid_label, with the count of datasets found in each group. Pandas will probably truncate this series when printing it. If you want to see the whole series, you can call print(cat.model_groups().to_string()) instead. However, as there are still several hundred possible model groups, we will not show that here.
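To make the grouping concrete, here is a minimal pandas-only sketch of the same idea on a hypothetical toy dataframe. The model names, the rows, and the use of groupby(...).size() are assumptions chosen for illustration. This is not the intake-esgf implementation, and it does not contact ESGF.

```python
import pandas as pd

# Hypothetical toy search results; a real catalog dataframe has more columns.
rows = [
    ("ModelA", "r1i1p1f1", "gn", "historical", "tas"),
    ("ModelA", "r1i1p1f1", "gn", "historical", "pr"),
    ("ModelA", "r1i1p1f1", "gn", "ssp585", "tas"),
    ("ModelA", "r2i1p1f1", "gn", "historical", "tas"),
    ("ModelB", "r1i1p1f1", "gr", "historical", "tas"),
    ("ModelB", "r1i1p1f1", "gr", "historical", "pr"),
]
columns = ["source_id", "member_id", "grid_label", "experiment_id", "variable_id"]
toy_df = pd.DataFrame(rows, columns=columns)

# Count the datasets in each (source_id, member_id, grid_label) combination.
toy_groups = toy_df.groupby(["source_id", "member_id", "grid_label"]).size()

# to_string() prints every row, so pandas does not truncate long series.
print(toy_groups.to_string())
```

Each row of the resulting series is one model group, and its value is the number of datasets that the search returned for that group.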

Removing Incomplete Groups

If you glance through the model groups, you will see that, relative to our search, many are incomplete. By this we mean that many model groups do not have all the variables in all the experiments that we wish to include in our analysis. Since we are looking for 4 experiments and 3 variables, a complete model group has 4 × 3 = 12 dataset results. We can check which groups satisfy this condition by operating on the model group pandas series.

mgs = cat.model_groups()
print(mgs[mgs==12])
source_id      member_id  grid_label
ACCESS-ESM1-5  r1i1p1f1   gn            12
               r2i1p1f1   gn            12
               r3i1p1f1   gn            12
               r4i1p1f1   gn            12
               r5i1p1f1   gn            12
                                        ..
UKESM1-0-LL    r1i1p1f2   gn            12
               r2i1p1f2   gn            12
               r3i1p1f2   gn            12
               r4i1p1f2   gn            12
               r8i1p1f2   gn            12
Name: project, Length: 227, dtype: int64

The rest are incomplete, and we would like a fast way to remove them from the search results. In practice, however, our completeness criterion is often more complicated than a single number. In the above example, we may want all the variables for all the experiments, but if a model does not have a submission for, say, ssp245, that is acceptable.

intake-esgf provides an interface which uses a user-provided function to remove incomplete entries. Internally, we will loop over all model groups in the results and pass your function the portion of the dataframe that corresponds to the current model group. Your function then needs to return a boolean based on the contents of that sub-dataframe.

def should_i_keep_it(sub_df):
    # this model group has all experiments/variables
    if len(sub_df) == 12:
        return True
    # if any of these experiments is missing a variable, remove this
    for exp in ["historical", "ssp585", "ssp370"]:
        if len(sub_df[sub_df["experiment_id"] == exp]) != 3:
            return False
    # if the check makes it here, keep it
    return True
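To see how a function like this behaves before handing it to the catalog, you can try the same pattern on a small dataframe. The sketch below is pandas-only and rests on assumptions: the toy rows, the reduced criterion in keep_group (2 experiments with 2 variables each), and the explicit loop over groups are all hypothetical and only mimic the behavior described above. They are not the intake-esgf internals.

```python
import pandas as pd

GROUP_COLS = ["source_id", "member_id", "grid_label"]


def keep_group(sub_df):
    # Toy criterion: each required experiment must have both variables.
    for exp in ["historical", "ssp585"]:
        if len(sub_df[sub_df["experiment_id"] == exp]) != 2:
            return False
    return True


rows = [
    # ModelA is complete: 2 experiments x 2 variables.
    ("ModelA", "r1i1p1f1", "gn", "historical", "tas"),
    ("ModelA", "r1i1p1f1", "gn", "historical", "pr"),
    ("ModelA", "r1i1p1f1", "gn", "ssp585", "tas"),
    ("ModelA", "r1i1p1f1", "gn", "ssp585", "pr"),
    # ModelB is missing pr in ssp585.
    ("ModelB", "r1i1p1f1", "gr", "historical", "tas"),
    ("ModelB", "r1i1p1f1", "gr", "historical", "pr"),
    ("ModelB", "r1i1p1f1", "gr", "ssp585", "tas"),
]
toy_df = pd.DataFrame(rows, columns=GROUP_COLS + ["experiment_id", "variable_id"])

# Pass each group's sub-dataframe to the function; keep the groups that return True.
kept = pd.concat(
    [sub_df for _, sub_df in toy_df.groupby(GROUP_COLS) if keep_group(sub_df)]
)
print(kept.groupby(GROUP_COLS).size())
```

Only the ModelA group survives here, because the ModelB group has a single ssp585 dataset and fails the check.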

Then we pass this function to the catalog via the remove_incomplete() function and observe how it has affected the search results.

cat.remove_incomplete(should_i_keep_it)
print(cat.model_groups())
source_id      member_id  grid_label
ACCESS-ESM1-5  r1i1p1f1   gn            12
               r2i1p1f1   gn            12
               r3i1p1f1   gn            12
               r4i1p1f1   gn            12
               r5i1p1f1   gn            12
                                        ..
UKESM1-0-LL    r1i1p1f2   gn            12
               r2i1p1f2   gn            12
               r3i1p1f2   gn            12
               r4i1p1f2   gn            12
               r8i1p1f2   gn            12
Name: project, Length: 227, dtype: int64

Removing Ensembles

Depending on the goals and scope of your analysis, you may want to use only a single variant per model. Such a variant can be challenging to locate as not all variants have all the experiments and models. However, now that we have removed the incomplete results, we can call the remove_ensembles() function, which keeps only the smallest member_id for each model group. By smallest, we mean the first entry after a hierarchical sort using the integer index values of each label in the member_id.

cat.remove_ensembles()
print(cat.model_groups())
source_id         member_id  grid_label
ACCESS-ESM1-5     r1i1p1f1   gn            12
BCC-CSM2-MR       r1i1p1f1   gn            12
CanESM5           r1i1p1f1   gn            12
CanESM5-1         r1i1p1f1   gn            12
CanESM5-CanOE     r1i1p2f1   gn            12
CAS-ESM2-0        r1i1p1f1   gn            12
CESM2             r4i1p1f1   gn            12
CESM2-WACCM       r1i1p1f1   gn            12
CMCC-CM2-SR5      r1i1p1f1   gn            12
CMCC-ESM2         r1i1p1f1   gn            12
CNRM-CM6-1        r1i1p1f2   gr            12
CNRM-CM6-1-HR     r1i1p1f2   gr            12
CNRM-ESM2-1       r1i1p1f2   gr            12
EC-Earth3-Veg     r1i1p1f1   gr            12
EC-Earth3-Veg-LR  r1i1p1f1   gr            12
GISS-E2-1-G       r1i1p1f2   gn            12
GISS-E2-1-H       r1i1p1f2   gn            12
GISS-E2-2-G       r1i1p3f1   gn            12
INM-CM4-8         r1i1p1f1   gr1           12
INM-CM5-0         r1i1p1f1   gr1           12
IPSL-CM6A-LR      r1i1p1f1   gr            12
MIROC-ES2H        r1i1p4f2   gn            12
MIROC-ES2L        r1i1p1f2   gn            12
MPI-ESM1-2-HR     r1i1p1f1   gn            12
MPI-ESM1-2-LR     r1i1p1f1   gn            12
NorESM2-LM        r1i1p1f1   gn            12
NorESM2-MM        r1i1p1f1   gn            12
TaiESM1           r1i1p1f1   gn            12
UKESM1-0-LL       r1i1p1f2   gn            12
Name: project, dtype: int64
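The notion of the smallest member_id can be sketched in plain Python. This is an illustration under assumptions rather than the library's code: member_key is a hypothetical helper, the priority order r, i, p, f is assumed, and the regular expression only handles labels of the form r<N>i<N>p<N>f<N> (it does not handle a sub-experiment prefix).

```python
import re


def member_key(member_id):
    """Return the integer indices of a member_id such as 'r4i1p1f1' as a tuple."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", member_id)
    if match is None:
        raise ValueError(f"unexpected member_id format: {member_id}")
    return tuple(int(group) for group in match.groups())


# Hypothetical variants available for one model.
members = ["r10i1p1f1", "r2i1p1f1", "r1i1p2f1", "r4i1p1f1"]

# Tuples compare element by element, which gives the hierarchical integer sort.
smallest = min(members, key=member_key)
print(smallest)  # r1i1p2f1

# A plain string comparison would pick a different variant.
print(min(members))  # r10i1p1f1
```

Comparing integers rather than strings matters because, as strings, r10i1p1f1 sorts ahead of r2i1p1f1. It also shows why the kept variant is not always r1i1p1f1: when that variant is missing or incomplete, the next tuple in the ordering is chosen.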

Now the results are much more manageable and ready to be downloaded for use in your analysis.

Feedback

What do you think of this interface? We have found that it saves our students days of work, but we are interested in critical feedback. Can you think of a simpler interface? Are there other analysis tasks that are painful and time-consuming that we could automate?