🌍 Analyzing Sea Level Rise Using Earth Data in the Cloud¶
This notebook is entirely based on Jinbo Wang's tutorial¶
We are going to reproduce the plot from this infographic (source: NASA-led Study Reveals the Causes of Sea Level Rise Since 1900).
import earthaccess
print(f"using earthaccess v{earthaccess.__version__}")
auth = earthaccess.login()
# are we authenticated?
if not auth.authenticated:
    # ask for credentials and persist them in a .netrc file
    auth.login(strategy="interactive", persist=True)
using earthaccess v0.9.0
Why earthaccess?¶
earthaccess is a Python library that simplifies data discovery and access to NASA Earth science data by providing an abstraction layer over NASA’s Common Metadata Repository (CMR) Search API, so that searching for data can be done with a simpler notation instead of low-level HTTP queries.
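For comparison, the equivalent low-level request against the CMR search API looks roughly like this; a sketch using requests, with parameter names taken from the public CMR documentation:

import requests

# a raw CMR collections search: this is the kind of query earthaccess abstracts away
resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/collections.json",
    params={"keyword": "SEA SURFACE HEIGHT", "cloud_hosted": "true", "page_size": 4},
)
for entry in resp.json()["feed"]["entry"]:
    print(entry["id"], entry.get("short_name"))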
Authentication in the cloud¶
If the collection is cloud-hosted, we can print its summary() to get the S3 credentials endpoint. These S3 credentials are temporary, and we can use them with third-party libraries such as s3fs, boto3, or the awscli.
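For example, the temporary credentials for PO.DAAC can be fetched and handed to s3fs; a minimal sketch, assuming earthaccess.get_s3_credentials and the key names returned by NASA's s3credentials endpoints (direct S3 access only works from AWS us-west-2):

import s3fs

# temporary S3 credentials for the PO.DAAC buckets (valid for about an hour)
creds = earthaccess.get_s3_credentials(daac="PODAAC")
fs = s3fs.S3FileSystem(
    key=creds["accessKeyId"],
    secret=creds["secretAccessKey"],
    token=creds["sessionToken"],
)
# list a few objects in the protected bucket (only works when running in-region)
print(fs.ls("podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/")[:3])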
from pprint import pprint

# We'll get 4 collections that match our keywords
collections = earthaccess.search_datasets(
    keyword="SEA SURFACE HEIGHT",
    cloud_hosted=True,
    count=4,
)

# Let's print 2 of the collections
for collection in collections[0:2]:
    print(pprint(collection.summary()), collection.abstract(), "\n", collection["umm"]["DOI"], "\n\n")
{'cloud-info': {'Region': 'us-west-2',
'S3BucketAndObjectPrefixNames': ['podaac-ops-cumulus-public/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/',
'podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/'],
'S3CredentialsAPIDocumentationURL': 'https://archive.podaac.earthdata.nasa.gov/s3credentialsREADME',
'S3CredentialsAPIEndpoint': 'https://archive.podaac.earthdata.nasa.gov/s3credentials'},
'concept-id': 'C2270392799-POCLOUD',
'file-type': "[{'Format': 'netCDF-4', 'FormatType': 'Native', "
"'AverageFileSize': 9.7, 'AverageFileSizeUnit': 'MB'}]",
'get-data': ['https://cmr.earthdata.nasa.gov/virtual-directory/collections/C2270392799-POCLOUD',
'https://search.earthdata.nasa.gov/search/granules?p=C2270392799-POCLOUD'],
'short-name': 'SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205',
'version': '2205'}
None This dataset provides gridded Sea Surface Height Anomalies (SSHA) above a mean sea surface, on a 1/6th degree grid every 5 days. It contains the fully corrected heights, with a delay of up to 3 months. The gridded data are derived from the along-track SSHA data of TOPEX/Poseidon, Jason-1, Jason-2, Jason-3 and Jason-CS (Sentinel-6) as reference data from the level 2 along-track data found at https://podaac.jpl.nasa.gov/dataset/MERGED_TP_J1_OSTM_OST_CYCLES_V51, plus ERS-1, ERS-2, Envisat, SARAL-AltiKa, CryoSat-2, Sentinel-3A, Sentinel-3B, depending on the date, from the RADS database. The date given in the grid files is the center of the 5-day window. The grids were produced from altimeter data using Kriging interpolation, which gives best linear prediction based upon prior knowledge of covariance.
{'DOI': '10.5067/SLREF-CDRV3', 'Authority': 'https://doi.org'}
{'cloud-info': {'Region': 'us-west-2',
'S3BucketAndObjectPrefixNames': ['podaac-swot-ops-cumulus-protected/SWOT_L2_LR_SSH_2.0/',
'podaac-swot-ops-cumulus-public/SWOT_L2_LR_SSH_2.0/'],
'S3CredentialsAPIDocumentationURL': 'https://archive.swot.podaac.earthdata.nasa.gov/s3credentialsREADME',
'S3CredentialsAPIEndpoint': 'https://archive.swot.podaac.earthdata.nasa.gov/s3credentials'},
'concept-id': 'C2799438306-POCLOUD',
'file-type': "[{'FormatType': 'Native', 'Format': 'netCDF-4'}]",
'get-data': ['https://cmr.earthdata.nasa.gov/virtual-directory/collections/C2799438306-POCLOUD',
'https://search.earthdata.nasa.gov/search/granules?p=C2799438306-POCLOUD',
'https://search.earthdata.nasa.gov/search/granules?p=C2799438306-POCLOUD&pg[0][id]=*PIC0*',
'https://search.earthdata.nasa.gov/search/granules?p=C2799438306-POCLOUD&pg[0][id]=*PGC0*'],
'short-name': 'SWOT_L2_LR_SSH_2.0',
'version': '2.0'}
None The SWOT Level 2 KaRIn Low Rate Sea Surface Height Data Product from the Surface Water Ocean Topography (SWOT) mission provides global sea surface height and significant wave height observations derived from low rate (LR) measurements from the Ka-band Radar Interferometer (KaRIn). SWOT launched on December 16, 2022 from Vandenberg Air Force Base in California into a 1-day repeat orbit for the "calibration" or "fast-sampling" phase of the mission, which completed in early July 2023. After the calibration phase, SWOT entered a 21-day repeat orbit in August 2023 to start the "science" phase of the mission, which is expected to continue through 2025. <br>
The L2 sea surface height data product is distributed in one netCDF-4 file per pass (half-orbit) covering the full KaRIn swath width, which spans 10-60km on each side of the nadir track. Sea surface height, sea surface height anomaly, wind speed, significant waveheight, and related parameters are provided on a geographically fixed, swath-aligned 2x2 km2 grid (Basic, Expert, Windwave). The sea surface height data are also provided on a finer 250x250 m2 "native" grid with minimal smoothing applied (Unsmoothed). <br>
This dataset is the parent collection to the following sub-collections: <br>
https://podaac.jpl.nasa.gov/dataset/SWOT_L2_LR_SSH_Basic_2.0 <br>
https://podaac.jpl.nasa.gov/dataset/SWOT_L2_LR_SSH_WindWave_2.0 <br>
https://podaac.jpl.nasa.gov/dataset/SWOT_L2_LR_SSH_Expert_2.0 <br>
https://podaac.jpl.nasa.gov/dataset/SWOT_L2_LR_SSH_Unsmoothed_2.0 <br>
{'DOI': '10.5067/SWOT-SSH-2.0'}
A year of data¶
Things to keep in mind:
- this particular dataset has data up until 2019
- this is a global dataset; each granule represents the whole world
- temporal coverage is one granule every 5 days
granules = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205",
    count=1,
)
Working with the URLs directly¶
If we choose to, we can use earthaccess to grab the files' URLs and then download them with another library (as long as we have a .netrc file configured with NASA's EDL credentials).
Getting the links to our data is quite simple with the data_links() method on each of the results:
# the collection is cloud hosted, but we can access it out of AWS with the regular HTTPS URL
granules[0].data_links(access="out_of_region")
['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/ssh_grids_v2205_1992101012.nc']
granules[0].data_links(access="direct")
['s3://podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/ssh_grids_v2205_1992101012.nc']
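If we would rather download the files than stream them, the same results (or URLs) can be handed to earthaccess.download; a minimal sketch, assuming the local_path keyword and that our credentials were already set up by earthaccess.login():

# download the granule(s) over HTTPS into a local folder
files = earthaccess.download(granules, local_path="./data")
print(files)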
POC: streaming into xarray¶
We can use earthaccess to stream files directly into xarray; this will work well for cases where:
- Data is in NetCDF/HDF5/Zarr format
  - xarray can read bytes directly for remote datasets only with the h5netcdf and scipy backends; if we deal with a format that these two backends won't recognize, xarray will raise an exception.
- Data is not big data (multi-TB)
  - not fully tested with Dask distributed
- Data is gridded
  - xarray works better with homogeneous coordinates; working with swath data will be cumbersome.
- Data is chunked using reasonably large sizes (1 MB or more)
  - If our files are chunked in small pieces, the access time will be orders of magnitude longer than just downloading the data and accessing it locally.
Opening a year of SSH (SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL1812) data (approximately 1.1 GB) can take up to 5 minutes when streaming the data out of region (not in AWS). The reason is not that the data transfer itself is orders of magnitude slower, but that the client libraries do not fetch data concurrently and the metadata of HDF files is usually not consolidated like it is in Zarr, so h5netcdf has to issue many requests to get the information it needs.
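To check whether a given file meets the chunking condition above, we can open a single granule and look at the encoding xarray reports; a sketch, assuming the SLA variable name from this dataset and that the backend populates the chunksizes key:

import xarray as xr

# open one granule and inspect its on-disk chunk sizes
sample = xr.open_dataset(earthaccess.open(granules[:1])[0])
print(sample["SLA"].encoding.get("chunksizes"))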
Note: we are looping through each year and getting the metadata for the first granule in May
# storing the resulting granule metadata
granules = []

# we just grab 1 granule from May for each year of the dataset
for year in range(1999, 2019):
    print(f"Retrieving data for year: {year}")
    results = earthaccess.search_data(
        doi="10.5067/SLREF-CDRV3",
        temporal=(f"{year}-05", f"{year}-05"),
        count=1,
    )
    granules.append(results[0])
print(f"Total granules: {len(granules)}")
Retrieving data for year: 1999
Retrieving data for year: 2000
Retrieving data for year: 2001
Retrieving data for year: 2002
Retrieving data for year: 2003
Retrieving data for year: 2004
Retrieving data for year: 2005
Retrieving data for year: 2006
Retrieving data for year: 2007
Retrieving data for year: 2008
Retrieving data for year: 2009
Retrieving data for year: 2010
Retrieving data for year: 2011
Retrieving data for year: 2012
Retrieving data for year: 2013
Retrieving data for year: 2014
Retrieving data for year: 2015
Retrieving data for year: 2016
Retrieving data for year: 2017
Retrieving data for year: 2018
Total granules: 20
What does earthaccess.open() do?¶
earthaccess.open() takes a list of results from earthaccess.search_data() (or a list of URLs) and creates a list of Python file-like objects that can be used in our code as if the remote files were local. When executed in AWS, the underlying filesystem is S3FS; when we open files outside of AWS, we get a regular HTTPS file session.
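As a quick sanity check that these objects behave like local files, we can read a few raw bytes from one of them; a sketch (assuming the returned objects expose the usual read/seek interface, and that netCDF-4/HDF5 files start with the HDF5 signature):

# open the first search result and read the first 8 bytes, just like a local file
f = earthaccess.open(granules[:1])[0]
print(f.read(8))  # expected: b'\x89HDF\r\n\x1a\n' for a netCDF-4/HDF5 file
f.seek(0)  # rewind so the file can still be handed to xarray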
%%time
import xarray as xr
fileset = earthaccess.open(granules)
print(f" Using {type(fileset[0])} filesystem")
ds = xr.open_mfdataset(fileset, chunks={})
ds
Using <class 'earthaccess.store.EarthAccessFile'> filesystem
Plotting¶
ds.SLA.where((ds.SLA>=0) & (ds.SLA < 10)).std('Time').plot(figsize=(14,6), x='Longitude', y='Latitude')
from pyproj import Geod
import numpy as np

def ssl_area(lats):
    """
    Calculate the area associated with a 1/6 by 1/6 degree box at the latitudes given in 'lats'.

    Parameters
    ==========
    lats: a list or numpy array of size N with the latitudes of interest.

    Returns
    =======
    out: array of size N with the area values (unit: m^2)
    """
    # Use the WGS84 ellipsoid for the geodesic area calculation
    geod = Geod(ellps='WGS84')
    # half the grid spacing: the grid is 1/6 degree, so each box extends 1/12 degree
    # on either side of its center
    dx = 1 / 12.0
    # area of the 1/6 x 1/6 degree box centered at latitude `lat`:
    # geod.polygon_area_perimeter() returns (signed area, perimeter) for the polygon
    # whose corners are given by the longitude and latitude arrays below
    c_area = lambda lat: geod.polygon_area_perimeter(
        np.r_[-dx, dx, dx, -dx], lat + np.r_[-dx, -dx, dx, dx]
    )[0]
    out = []
    for lat in lats:
        out.append(c_area(lat))
    return np.array(out)
# note: they rotated the data in the last release, this operation used to be (1,-1)
ssh_area = ssl_area(ds.Latitude.data).reshape(-1,1)
print(ssh_area.shape)
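As a quick sanity check of ssl_area (a sketch, not part of the original notebook): a 1/6° by 1/6° cell spans roughly 18.5 km per side at the equator, so the returned area there should be on the order of 3.4e8 m², shrinking toward the poles:

print(ssl_area(np.array([0.0, 45.0, 80.0])))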
# This dataset was moved from OPeNDAP; we take the first data link of its first granule
granule = earthaccess.search_data(concept_id="C2491724765-POCLOUD")[0].data_links()[0]
gmsl = earthaccess.open([granule])[0]
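What we plot next is an area-weighted global mean of the SLA field. With $A_i$ the cell areas computed above and the sums running over grid cells, the quantity computed by global_mean below is (the SLA grids are in meters, so the factor of 1000 converts to millimeters):

$$\overline{\mathrm{SLA}} = 1000 \times \frac{\sum_i \mathrm{SLA}_i \, A_i}{\sum_{i\,:\,\mathrm{SLA}_i\ \mathrm{valid}} A_i}$$

The denominator only counts cells with valid (non-NaN) data, which is what the SLA/SLA term in the code implements.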
%%time
import numpy as np
from matplotlib import pyplot as plt

plt.rcParams["figure.figsize"] = (16, 4)
img = plt.imread("underwater.jpeg")

fig, axs = plt.subplots()
plt.grid(True)

def global_mean(SLA, **kwargs):
    # area-weighted mean of the SLA field; SLA/SLA is 1 for valid cells and NaN
    # elsewhere, so the denominator only sums the area of cells with data.
    # The factor of 1000 converts the SLA values (meters) to millimeters.
    dout = ((SLA * ssh_area).sum() / (SLA / SLA * ssh_area).sum()) * 1000
    return dout

result = ds.SLA.groupby('Time').apply(global_mean)

plt.xlabel('Time (year)', fontsize=16)
plt.ylabel('Global Mean SLA (mm)', fontsize=12)
plt.grid(True)

historic_ts = xr.open_dataset(gmsl)

result.plot(ax=axs, color="orange", marker="o", label='satellite record')
historic_ts['global_average_sea_level_change'].plot(ax=axs, label='Historical in-situ record', color="lightblue")

# stretch the background image to the plot's extent
x0, x1 = axs.get_xlim()
y0, y1 = axs.get_ylim()
axs.imshow(img, extent=[x0, x1, y0, y1], aspect='auto')

plt.legend()
plt.show()