Globus Transfers

Warning

This page describes a feature that we are currently testing. The interface is not yet well defined and may change. Some cells will not render as they depend on user authentication and personal information. If you have input on how to improve this interface, please reach out.

Setting Up

Chances are that if you are reading this, you are already familiar with Globus. In case you are not, Globus is research cyberinfrastructure developed so you can easily, reliably, and securely move, share, and discover data no matter where it lives, whether on a supercomputer, lab cluster, tape archive, public cloud, or laptop.

A portion of the ESGF data archive is stored in public Guest Collections, and the access information is included in some ESGF index nodes. This means that part of the ESGF archive can be accessed using Globus Transfer. These transfers can be triggered seamlessly in intake-esgf if you satisfy a few requirements. You will need:

  1. A Globus login. In order to manage permissions, Globus requires an identity. Try to log in at https://www.globus.org/, first looking through the list of supported institutions to see if you can use your institutional credentials. If your institution is not listed, you may log in with one of the other options listed below the institution pulldown.

  2. A place to send data. Globus transfer uses custom software to both send and receive data. In their parlance, you need write access to another collection, which represents where the data will be sent. It is possible to download to your personal computer, but you will need to install Globus Connect Personal and have it running and connected when you initiate the transfer.

  3. The UUID of the destination collection. The UUID can be found by navigating to the collection in the Globus web app, clicking the collection to show its properties, and copying the UUID value listed. You can also look it up programmatically with the Globus SDK, as sketched below.
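If you prefer not to click through the web app, here is a minimal sketch using the Globus SDK (the separate globus_sdk package, not part of intake-esgf) that searches collections by display name and prints their UUIDs. The CLIENT_ID and collection name are placeholders you would replace with your own.

import globus_sdk

# Placeholder: register your own Globus native app and use its client ID here.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"

# Native-app login flow to obtain a transfer token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Please log in here:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Paste the code: ").strip())
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Search collections by display name and print their UUIDs.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
for collection in tc.endpoint_search("My Collection Name"):
    print(collection["display_name"], collection["id"])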

Initiating the Transfer

In this case we will use the configuration options to query only the anl-dev Globus-based index. This is just to keep the example simple; several of the indices contain Globus transfer information. For demonstration, we will search for a few files from a model whose file sizes are relatively small.

import intake_esgf
from intake_esgf import ESGFCatalog

# Disable the ornl-dev index so that only the anl-dev Globus-based index is queried.
with intake_esgf.conf.set(indices={"ornl-dev": False}):
    cat = ESGFCatalog()
    cat.search(
        experiment_id="historical",
        source_id="CanESM5",
        frequency="mon",
        variable_id=[
            "pr",
            "tas",
            "gpp",
        ],
        member_id="r1i1p1f1",
    )
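Optionally, you can take a quick look at what the search matched before initiating any transfers; printing the catalog summarizes the results, and we assume here that the underlying table is also exposed as a pandas DataFrame in cat.df.

# Optional sanity check on the search results before any transfers begin.
print(cat)            # summary of the matching datasets
print(cat.df.head())  # underlying results as a pandas DataFrame (assumed attribute)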

This portion of the process is what you would do normally. To use Globus transfers where possible, you need to include additional arguments to to_dataset_dict(). The first is globus_endpoint, the UUID of the destination collection to which you will transfer the data. The second is globus_path, any additional path you wish to append to the root path of the destination collection.

# COLLECTION_UUID is a placeholder; use the UUID of your destination collection.
dsd = cat.to_dataset_dict(
    globus_endpoint=COLLECTION_UUID,  # <-- the UUID of your destination collection
    globus_path="data/ESGF-Data",  # <-- additional path relative to the collection root
)

Internally this will do several things:

  1. We use the datasets present in your catalog to query the indices again for file information. This information is partitioned into files that have an associated Globus collection and files that we will need to download over HTTPS. The files that have a Globus collection are further partitioned, preferring the collection with the fastest transfer times for you.

  2. We remove the entries for files which we detect are already present in the local cache.

  3. We submit the Globus transfer(s) and log the task_id to the intake-esgf logfile. You will not see anything on the screen to indicate that the transfer is ongoing, but you can monitor its progress on your activity page at globus.org (or programmatically, as sketched after this list).

  4. Once the Globus transfer(s) are underway, we download the remaining files using HTTPS.

  5. After the HTTPS downloads have completed, we block further progress until the Globus transfers report that they have succeeded.
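If you would rather check on a transfer from code than from the activity page, here is a minimal sketch assuming you have copied the task_id from the intake-esgf logfile and obtained a transfer token as in the earlier sketch.

import globus_sdk

# `transfer_token` is obtained as in the earlier sketch; `task_id` is the value
# logged by intake-esgf (both are placeholders here).
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
task = tc.get_task(task_id)
print(task["status"], f"{task['files_transferred']}/{task['files']} files transferred")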

Then the code progresses as usual, looking for cell measures and loading files into xarray Datasets.
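The returned dsd behaves like a dictionary of xarray Datasets keyed by facet-based strings, so you can inspect the results in the usual way (the exact keys depend on the search above).

# List the dataset keys and peek at one of the returned xarray Datasets.
print(list(dsd.keys()))
key = next(iter(dsd))
print(dsd[key])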

What Can Go Wrong?

While this interface maintains the intake-esgf paradigm and makes using Globus transfers very simple, it also poses a few difficulties.

  1. Your local_cache may not include the location to which you transferred data. Technically you can use this approach to transfer data to any collection, not necessarily to the resource on which you are working.

  2. Furthermore, even if you execute this script on the resource which corresponds to the globus_endpoint you provided to to_dataset_dict(), the local cache directory may not point to where Globus transferred the data. For example, if the collection root is /home/username/ and you set your local cache to /home/username/data/ESGF-Data, then you should have given to_dataset_dict() the option globus_path="data/ESGF-Data" (see the sketch below). However, the collection root is not something we can always query, and thus we have no way to verify it.
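As a sketch of how you might keep the two consistent, assuming local_cache (referenced above) is set through the same intake_esgf.conf interface used earlier, and using the hypothetical paths from the example:

import intake_esgf

# Hypothetical paths: the destination collection's root is /home/username/ and the
# transfer is sent to data/ESGF-Data beneath it, so point the local cache there too.
intake_esgf.conf.set(local_cache="/home/username/data/ESGF-Data")

# ...then run the search and call to_dataset_dict() as above, passing
# globus_path="data/ESGF-Data" so Globus writes into the directory the cache expects.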

Please tell us what you think of this interface.