Week 26 · Space GIS Architect · ~7 min · 684 words

Cloud-native: COG, Zarr, STAC catalogs

The old way: download a 10 GB GeoTIFF. The new way: range-request just the bytes you need from S3. COG, Zarr, and STAC make this possible.

If you have to download an entire Sentinel-2 scene to use it, fetching takes minutes. What if you could just grab the bytes you need?

Cloud-optimized formats (COG, Zarr) and catalogs (STAC) make that possible. This week, you'll fetch just the pixels you want — for any place on Earth — in seconds.

Learning objectives

Hawaiian Islands. Every Sentinel-2 scene since 2015 is in the STAC catalog, range-requestable from any browser.

Primer

The traditional way to use satellite imagery: download a 5 GB scene, unzip it, load it into desktop GIS. The cloud-native way: range-request just the bytes you need from a file living on S3, and never download the whole thing. This week covers the two formats (COG and Zarr) and the one spec (STAC) that make that possible.
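A range request is an ordinary HTTP GET with a Range header. The sketch below only builds the request and parses the header a server sends back; it does not touch the network. The URL and both helper names are hypothetical placeholders, not part of any library.

```python
import re
from urllib.request import Request


def build_range_request(url, start, length):
    """Build (but do not send) a GET request for `length` bytes starting at `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


def parse_content_range(value):
    """Parse a Content-Range value such as 'bytes 0-65535/1073741824'."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value)
    if match is None:
        raise ValueError(f"unexpected Content-Range: {value!r}")
    first, last = int(match[1]), int(match[2])
    total = None if match[3] == "*" else int(match[3])  # '*' means size unknown
    return first, last, total


# Hypothetical URL, used only to show the header that would be sent
req = build_range_request("https://example.com/scene.tif", 0, 65536)
print(req.get_header("Range"))
```

Passing `req` to `urllib.request.urlopen` would send it; a server that honors the range replies with status 206 (Partial Content) and a Content-Range header describing what it returned.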

Cloud-Optimized GeoTIFF (COG)

COG isn't a new file format. It's a particular way of writing a regular GeoTIFF so that HTTP-range-request access is efficient. Three requirements:

  1. Internal tiling — the image is divided into ~256×256 or 512×512 pixel tiles, stored in row-major order. The file header has an index of where each tile begins.
  2. Internal overviews — the file also contains downsampled versions of the image (typically reduced by factors of 2, 4, 8, ...) for fast low-zoom rendering.
  3. Header at the beginning — TIFF allows the IFD (image file directory) to live anywhere; COG mandates the beginning so a small range request can read the structure first.

With those properties, a client can: (1) range-request the first ~64 KB to read the header and tile index, (2) compute which tiles cover the area of interest at the right zoom level, (3) range-request only those tiles. Total bytes transferred: kilobytes, not gigabytes.

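from rio_tiler.io import Reader  # rio-tiler >= 4.0; older releases name this class COGReader

# Read just a small window from a COG on S3 (no full download)
with Reader('https://noaa-goes18.s3.amazonaws.com/.../foo.tif') as cog:
    # bbox is (west, south, east, north) in WGS84; max_size caps the output dimensions
    img = cog.part(bbox=(-100, 30, -80, 40), max_size=512)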
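rio-tiler hides the tile arithmetic. Step (2) above, working out which tiles cover an area of interest and which overview to read, can be sketched in plain Python. This is a simplified illustration, not rio-tiler's or GDAL's actual logic; the function names and the default overview factors are assumptions.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) indices of every tile a pixel window touches."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (row, col)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


def pick_overview(native_width, out_width, factors=(2, 4, 8, 16)):
    """Pick the largest decimation factor that still supplies out_width pixels.

    Returns 1 when only the full-resolution image is detailed enough.
    `factors` must be sorted in ascending order.
    """
    best = 1
    for factor in factors:
        if native_width // factor >= out_width:
            best = factor
    return best


# A 600x600 pixel window at offset (1000, 1000) in a file with 512-pixel tiles
tiles = tiles_for_window(1000, 1000, 600, 600, tile_size=512)
print(len(tiles), "tiles to request")
```

Each index in `tiles` maps to one entry in the tile offset table from the header, so the client issues one range request per tile (or merges adjacent ranges into fewer requests).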

Zarr

Zarr is a format for chunked, compressed, multi-dimensional arrays. Where COG is for 2D rasters, Zarr is for the (time × lat × lon × band × ...) hypercubes that modern Earth-observation analysis often needs. The data is stored as a directory tree on S3, with each chunk a separate object — so parallel reads of different chunks can fan out across many concurrent workers.

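import xarray as xr

# Anonymous S3 access requires the s3fs package to be installed
ds = xr.open_zarr('s3://my-bucket/era5-temperature.zarr',
                  storage_options={'anon': True})
# Now ds is a lazy xarray Dataset; reading a slice triggers parallel chunk fetches.
# The result is named `subset` so it does not shadow the built-in slice().
# If lat is stored north-to-south, use slice(40, 30) instead.
subset = ds.air_temperature.sel(time='2024-01-15', lat=slice(30, 40), lon=slice(-100, -80))
subset.load()  # actually fetches the chunks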
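To see why a small selection only touches a few objects, here is a sketch of how index ranges map to chunk keys. It assumes the Zarr v2 default layout, where a chunk's key is its grid indices joined by dots; other layouts name chunks differently. The function name and the example chunk shape are made up for illustration.

```python
from itertools import product


def chunk_keys(selection, chunks, sep="."):
    """List the chunk keys touched by a selection.

    selection: one (start, stop) index pair per dimension, stop exclusive
    chunks:    the chunk shape, one size per dimension
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunks):
        if stop <= start:
            return []  # empty selection touches nothing
        per_dim.append(range(start // size, (stop - 1) // size + 1))
    return [sep.join(str(i) for i in index) for index in product(*per_dim)]


# Hypothetical (time, lat, lon) array stored in chunks of 24 x 100 x 100
keys = chunk_keys([(24, 48), (480, 520), (300, 400)], chunks=(24, 100, 100))
print(keys)
```

Each key is a separate object under the array's directory, which is why workers can fetch them concurrently.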

Zarr is the standard for cloud-native climate, reanalysis, and time-series gridded data. The Pangeo community runs a free public collection of huge Zarr datasets at catalog.pangeo.io.

STAC: SpatioTemporal Asset Catalog

You have a COG or a Zarr. How do you tell users about it? How do they discover that you have a frame over Florida on January 15? Enter STAC, the SpatioTemporal Asset Catalog spec.

STAC defines a small set of JSON schemas:

  • Item — one scene or observation (e.g. one Landsat scene). Has geometry, a datetime or time range, properties, and URLs for its assets (the actual COG / Zarr / etc. files).
  • Collection — a homogeneous group of items (e.g. "Landsat 9 Level-2 surface reflectance").
  • Catalog — a hierarchy of collections.
  • STAC API — a standardized REST interface for searching across catalogs. Endpoints: /search, /collections, and items at /collections/{collectionId}/items.
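An Item is a GeoJSON Feature with a few extra fields. The sketch below shows a pared-down Item as a Python dict, plus the kind of bbox and time test a search performs. The id, collection name, and asset URL are invented for illustration, and real Items usually carry more properties and links.

```python
from datetime import datetime, timezone

# A minimal, hypothetical STAC Item (identifiers and href are placeholders)
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "collection": "example-collection",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-81, 28], [-80, 28], [-80, 29], [-81, 29], [-81, 28]]],
    },
    "bbox": [-81.0, 28.0, -80.0, 29.0],
    "properties": {"datetime": "2024-01-15T16:00:00Z"},
    "links": [],
    "assets": {
        "visual": {
            "href": "https://example.com/example-scene-001.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}


def bbox_intersects(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def item_matches(item, bbox, start, end):
    """True when the item overlaps bbox and its datetime lies in [start, end]."""
    stamp = item["properties"]["datetime"].replace("Z", "+00:00")
    when = datetime.fromisoformat(stamp)
    return bbox_intersects(item["bbox"], bbox) and start <= when <= end


january = (
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
print(item_matches(item, [-80.8, 28.3, -80.5, 28.7], *january))
```

A STAC API does this filtering on the server and returns only the matching Items, so the client never sees the rest of the catalog.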

Major STAC catalogs (all free to query):

  • Microsoft Planetary Computer — Landsat, Sentinel-1, Sentinel-2, NAIP, ESA WorldCover, and dozens more.
  • AWS Earth Search — Sentinel-2, Landsat, NAIP via Element 84.
  • Radiant Earth MLHub — labeled training datasets for ML.

from pystac_client import Client

cat = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1/')
search = cat.search(collections=['sentinel-2-l2a'],
                    bbox=[-81, 28, -80, 29],
                    datetime='2024-01-01/2024-02-01')
items = list(search.items())
print(f"{len(items)} matching scenes")

The lab

You'll identify a COG-formatted GOES product on AWS Open Data, use rio-tiler to fetch just a single map tile from it via HTTP range request, and time it against downloading the whole file. The speedup is typically 50–500×. Then you'll query the Microsoft Planetary Computer STAC API for Sentinel-2 scenes over Cape Canaveral in 2024: a one-line search that returns dozens of items, each with COG asset URLs you can range-request. To keep only low-cloud scenes, filter on the eo:cloud_cover property. Planetary Computer asset URLs also need to be signed with a short-lived token before they can be read.
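The search shown earlier does not filter on cloud cover. One way to add that, assuming the server implements the STAC API Query extension, is a `query` entry on the eo:cloud_cover property. The sketch below only builds the JSON body for a POST to /search; the helper name and the 10 percent threshold are illustrative choices.

```python
import json


def build_search_body(collections, bbox, datetime_range, max_cloud=None, limit=100):
    """Assemble a STAC API /search request body as a plain dict."""
    body = {
        "collections": list(collections),
        "bbox": list(bbox),           # west, south, east, north
        "datetime": datetime_range,   # 'start/end' interval string
        "limit": limit,               # page size, not a cap on total results
    }
    if max_cloud is not None:
        # Query extension: keep items whose cloud cover is below the threshold
        body["query"] = {"eo:cloud_cover": {"lt": max_cloud}}
    return body


# Sentinel-2 over Cape Canaveral for 2024, under 10 percent cloud cover
body = build_search_body(
    ["sentinel-2-l2a"], [-81, 28, -80, 29], "2024-01-01/2024-12-31", max_cloud=10
)
print(json.dumps(body, indent=2))
```

With pystac-client, the same filter can be passed through the `query` argument of `Client.search` instead of building the body by hand.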

This is the architecture many modern production geospatial pipelines use, including LaunchDetect's. You no longer download data; you query catalogs and range-request the bytes you need.

Connecting to Hawaiʻi: STAC catalogs and Hawaiian data

Microsoft Planetary Computer's STAC catalog includes Sentinel-2 over Hawaiʻi (every scene back to 2015), Landsat (back to the 1970s), VIIRS night-lights, and many others — all queryable with one line of code, all free, all served as COGs you can range-request. The Hawaiʻi Statewide GIS Program has begun publishing some of its own datasets in STAC-compatible formats. This is the future: open standards, open access, partial downloads.

Try Planetary Computer's STAC search box at planetarycomputer.microsoft.com — type 'Hawaiʻi' and see what's available.

Hands-on lab: Pull a single tile from a COG without downloading the file

Identify a COG-formatted GOES product on AWS Open Data. Use rio-tiler to fetch just a single tile via HTTP range request. Time it vs downloading the whole file.
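For the timing comparison, a small harness like the one below may help. The two stub functions are stand-ins that just sleep; in the lab you would replace them with your own tile fetch and full download. All names here are hypothetical.

```python
import time


def time_call(fn, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - started)
    return best


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast path was."""
    return slow_seconds / fast_seconds


def download_whole_file_stub():
    time.sleep(0.02)   # placeholder for the full download


def fetch_single_tile_stub():
    time.sleep(0.001)  # placeholder for the range-request tile read


slow = time_call(download_whole_file_stub)
fast = time_call(fetch_single_tile_stub)
print(f"tile read was about {speedup(slow, fast):.0f}x faster")
```

Repeated runs can be flattered by caching somewhere along the path, so it is worth recording the first-run time as well as the best one.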

Quiz

No grade, no shame. The point is to notice what you already know and what's still settling.

Q1. COG is:
  1. A GeoTIFF with internal tiling + overviews + a header-first layout for HTTP range reads
  2. A new format separate from GeoTIFF
  3. A vector format
  4. A compression scheme only
Q2. Zarr is best for:
  1. Multi-dimensional gridded data (e.g. time × lat × lon × band), chunkable, parallelizable
  2. Vector data
  3. 1D time series only
  4. Single static rasters
Q3. STAC is:
  1. SpatioTemporal Asset Catalog — a spec for cataloging geospatial assets
  2. A file format
  3. A query language
  4. A database
Q4. HTTP range request lets you:
  1. Fetch a byte range of a file rather than the whole file
  2. Run faster
  3. Authenticate
  4. Compress
Q5. STAC API standard endpoints include:
  1. /search, /collections, /items
  2. /users, /posts only
  3. /login, /logout
  4. GraphQL only

Reflection

Take five minutes with this. Write your answer somewhere. Carry it into next week.

If anyone can fetch any byte of Earth observation data they want, what shifts? Who benefits from democratized data? Who loses gatekeeping power? Whose responsibility is it to use the access wisely?