Cloud-native: COG, Zarr, STAC catalogs
The old way: download a 10 GB GeoTIFF. The new way: range-request just the bytes you need from S3. COG, Zarr, and STAC make this possible.
If you have to download an entire Sentinel-2 scene to use it, fetching takes minutes. What if you could just grab the bytes you need?
Cloud-optimized formats (COG, Zarr) and catalogs (STAC) make that possible. This week, you'll fetch just the pixels you want — for any place on Earth — in seconds.
Learning objectives
- Understand COG (Cloud-Optimized GeoTIFF) structure
- Use Zarr for multi-dimensional gridded data
- Build and query a STAC catalog
- Range-request a tile out of a COG without downloading the file
Primer
The traditional way to use satellite imagery: download a 5 GB scene, unzip it, load it into desktop GIS. The cloud-native way: range-request just the bytes you need from a file living on S3 and never download the whole thing. This week covers the two formats (COG and Zarr) and the one spec (STAC) that make that possible.
Cloud-Optimized GeoTIFF (COG)
COG isn't a new file format. It's a particular way of writing a regular GeoTIFF so that HTTP-range-request access is efficient. Three requirements:
- Internal tiling — the image is divided into ~256×256 or 512×512 pixel tiles, stored in row-major order. The file header has an index of where each tile begins.
- Internal overviews — the file also contains downsampled copies of the image (typically reduced by factors of 2, 4, 8, ...) for fast low-zoom rendering.
- Header at the beginning — TIFF allows the IFD (image file directory) to live anywhere; COG mandates the beginning so a small range request can read the structure first.
With those properties, a client can: (1) range-request the first ~64 KB to read the header and tile index, (2) compute which tiles cover the area of interest at the right zoom level, (3) range-request only those tiles. Total bytes transferred: kilobytes, not gigabytes.
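Steps (2) and (3) above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular COG reader is implemented: the 512-pixel tile size, the `tile_index` offsets and lengths, and the helper names `tiles_for_window` and `range_header` are all assumptions made up for the example.

```python
from typing import List, Tuple


def tiles_for_window(col_off: int, row_off: int, width: int, height: int,
                     tile_size: int = 512) -> List[Tuple[int, int]]:
    """Return (tile_row, tile_col) for every tile overlapping a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value; byte ranges are inclusive at both ends."""
    return f"bytes={offset}-{offset + length - 1}"


# Hypothetical tile index, as a reader might build it from the file header:
# (tile_row, tile_col) -> (byte offset, byte length)
tile_index = {
    (0, 0): (65536, 120000), (0, 1): (185536, 118000),
    (1, 0): (303536, 121500), (1, 1): (425036, 119250),
}

# A 300x200 pixel window starting at column 400, row 100
needed = tiles_for_window(col_off=400, row_off=100, width=300, height=200)
headers = [range_header(*tile_index[t]) for t in needed]
print(needed)   # the tiles that must be fetched
print(headers)  # one Range header per tile
```

Only two of the four tiles are requested here, which is the whole point: the window never touches the second tile row, so those bytes are never transferred.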
from rio_tiler.io import COGReader

# Read just a small window from a COG on S3 (no full download)
with COGReader('https://noaa-goes18.s3.amazonaws.com/.../foo.tif') as cog:
    # bbox is (west, south, east, north), in lon/lat by default
    img = cog.part(bbox=(-100, 30, -80, 40), max_size=512)
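Capping the output at 512 pixels is what lets a reader use an overview instead of the full-resolution tiles. The sketch below shows one plausible way to pick an overview; it is an illustration of the idea only, not rio-tiler's or GDAL's actual selection logic, and the function name `pick_overview` and the list of decimation factors are assumptions.

```python
def pick_overview(window_px: int, max_size: int,
                  factors=(2, 4, 8, 16)) -> int:
    """Return the largest decimation factor that still yields at least
    max_size pixels on the long side. 1 means read full resolution."""
    best = 1
    for f in sorted(factors):
        if window_px / f >= max_size:
            best = f
    return best


# A window 6000 px wide, displayed at no more than 512 px:
# the 8x overview (750 px) is enough, while the 16x one (375 px) is too coarse.
print(pick_overview(6000, 512))
# A window already smaller than the output size needs full resolution.
print(pick_overview(400, 512))
```

Reading the 8× overview instead of full resolution cuts the pixels fetched by a factor of 64 for the same area.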
Zarr
Zarr is a format for chunked, compressed, multi-dimensional arrays. Where COG is for 2D rasters, Zarr is for the (time × lat × lon × band × ...) hypercubes that modern Earth-observation analysis often needs. The data is stored as a directory tree on S3, with each chunk a separate object — so parallel reads of different chunks can fan out across many concurrent workers.
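import xarray as xr

ds = xr.open_zarr('s3://my-bucket/era5-temperature.zarr',
                  storage_options={'anon': True})
# ds is now a lazy xarray Dataset; nothing is fetched until values are needed.
# The result is named `subset` so it does not shadow Python's built-in slice().
# If the latitude coordinate is stored north-to-south, use slice(40, 30) instead.
subset = ds.air_temperature.sel(time='2024-01-15', lat=slice(30, 40), lon=slice(-100, -80))
subset.load()  # actually fetches the chunks that overlap the selection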
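To see why a small selection only touches a few objects, here is a sketch of the index arithmetic that maps a selection onto chunk keys. It assumes the Zarr v2 default key layout, where chunks are named by dot-separated grid indices such as `14.2.3` (other layouts exist); the chunk shape, the index ranges, and the helper name `chunk_keys` are made up for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple


def chunk_keys(selection: Sequence[Tuple[int, int]],
               chunks: Sequence[int]) -> List[str]:
    """Map a selection to the chunk keys it overlaps.

    selection: one half-open (start, stop) index range per dimension.
    chunks:    chunk length per dimension.
    """
    per_dim = [range(start // size, (stop - 1) // size + 1)
               for (start, stop), size in zip(selection, chunks)]
    return [".".join(map(str, idx)) for idx in product(*per_dim)]


# Hypothetical (time, lat, lon) array chunked as (24, 100, 100):
# one day of hourly data, 40 latitude rows, 120 longitude columns.
keys = chunk_keys([(336, 360), (200, 240), (280, 400)], chunks=(24, 100, 100))
print(keys)  # each key is a separate object that can be fetched in parallel
```

Two objects cover this selection, no matter how large the full array is.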
Zarr has become a de facto standard for cloud-native climate, reanalysis, and time-series gridded data. The Pangeo community runs a free public collection of huge Zarr datasets at catalog.pangeo.io.
STAC: SpatioTemporal Asset Catalog
You have a COG or a Zarr. How do you tell users about it? How do they discover that you have a frame over Florida on January 15? Enter STAC, the SpatioTemporal Asset Catalog spec.
STAC defines a small set of JSON schemas:
- Item — one spatiotemporal unit of data (e.g. one Landsat scene). Has a geometry, a datetime or time range, properties, and one or more assets: URLs pointing to the actual COG / Zarr / etc.
- Collection — a homogeneous group of items (e.g. "Landsat 9 Level-2 surface reflectance").
- Catalog — a hierarchy of collections.
- STAC API — a standardized REST interface for searching across catalogs. Endpoints: /search, /collections, and /items (nested under each collection as /collections/{collectionId}/items).
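To make the Item schema concrete, here is a minimal sketch of one Item written as a Python dict, plus the kind of spatial and temporal test a search performs against it. The field names follow the STAC Item spec; the `id`, the `example.com` asset URL, and the helper names `bbox_intersects` and `matches` are hypothetical.

```python
import json
from datetime import datetime, timezone

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20240115",  # hypothetical identifier
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-81, 28], [-80, 28], [-80, 29], [-81, 29], [-81, 28]]],
    },
    "bbox": [-81.0, 28.0, -80.0, 29.0],  # west, south, east, north
    "properties": {"datetime": "2024-01-15T15:30:00Z"},
    "links": [],
    "assets": {
        "visual": {
            "href": "https://example.com/example-scene-20240115/visual.tif",  # hypothetical
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data"],
        }
    },
}


def bbox_intersects(a, b) -> bool:
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(item, bbox, start: datetime, end: datetime) -> bool:
    """True if the item overlaps bbox and its datetime lies in [start, end]."""
    when = datetime.fromisoformat(item["properties"]["datetime"].replace("Z", "+00:00"))
    return bbox_intersects(item["bbox"], bbox) and start <= when <= end


jan = (datetime(2024, 1, 1, tzinfo=timezone.utc), datetime(2024, 2, 1, tzinfo=timezone.utc))
print(matches(item, [-81, 28, -80, 29], *jan))    # Cape Canaveral area
print(matches(item, [-160, 18, -154, 23], *jan))  # Hawaiian Islands
print(json.dumps(item)[:60])                      # an Item is plain JSON
```

Because an Item is just GeoJSON with a few extra fields, any tool that reads GeoJSON can at least display its footprint.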
Major STAC catalogs (all free to query):
- Microsoft Planetary Computer — Landsat, Sentinel-1, Sentinel-2, NAIP, ESA WorldCover, and dozens more.
- Earth Search (run by Element 84, hosted on AWS) — Sentinel-2, Landsat, NAIP.
- Radiant Earth MLHub — labeled training datasets for ML.
from pystac_client import Client

cat = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1/')
search = cat.search(collections=['sentinel-2-l2a'],
                    bbox=[-81, 28, -80, 29],  # west, south, east, north
                    datetime='2024-01-01/2024-02-01')
items = list(search.items())
print(f"{len(items)} matching scenes")
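Under the hood, a search like this is a JSON document sent to the API's /search endpoint. The sketch below builds such a body without sending it. The `collections`, `bbox`, `datetime`, and `limit` fields come from the STAC API item-search spec; the cloud-cover filter uses the `query` extension, which not every server supports, and the helper name `build_search_body` is an assumption for this example.

```python
import json


def build_search_body(collections, bbox, datetime_range,
                      max_cloud=None, limit=100) -> dict:
    """Assemble a STAC API /search request body (nothing is sent)."""
    body = {
        "collections": list(collections),
        "bbox": list(bbox),            # west, south, east, north
        "datetime": datetime_range,    # "start/end" interval
        "limit": limit,                # page size, not total result count
    }
    if max_cloud is not None:
        # Assumes the server implements the query extension
        body["query"] = {"eo:cloud_cover": {"lt": max_cloud}}
    return body


body = build_search_body(["sentinel-2-l2a"], [-81, 28, -80, 29],
                         "2024-01-01/2024-02-01", max_cloud=10)
print(json.dumps(body, indent=2))
```

With pystac-client, the same filter can be passed as `query={"eo:cloud_cover": {"lt": 10}}` to `cat.search(...)`.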
The lab
You'll identify a COG-formatted GOES product on AWS Open Data, use rio-tiler to fetch just a single map tile from it via HTTP range request, and time it against downloading the whole file. The speedup is typically 50–500×, depending on the file size and your connection. Then you'll query the Microsoft Planetary Computer STAC API for Sentinel-2 scenes over Cape Canaveral in 2024: a short search that, with a cloud-cover filter added, returns dozens of low-cloud items, each with COG asset URLs you can range-request once they are signed (the planetary-computer package does this for you).
This is the architecture many modern production geospatial pipelines use, including LaunchDetect's. You no longer download data; you query catalogs and range-request the bytes you need.
Connecting to Hawaiʻi: STAC catalogs and Hawaiian data
Microsoft Planetary Computer's STAC catalog includes Sentinel-2 over Hawaiʻi (every scene back to 2015), Landsat (back to the 1970s), VIIRS night-lights, and many others — all queryable with one line of code, all free, all served as COGs you can range-request. The Hawaiʻi Statewide GIS Program has begun publishing some of its own datasets in STAC-compatible formats. This is the future: open standards, open access, partial downloads.
Hands-on lab: Pull a single tile from a COG without downloading the file
Identify a COG-formatted GOES product on AWS Open Data. Use rio-tiler to fetch just a single tile via HTTP range request. Time it against downloading the whole file.
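One possible shape for the timing harness, using only the standard library, is sketched below. It is a starting point rather than the lab solution: the helper names `range_request`, `fetch`, `timed`, and `speedup` are made up here, and you would substitute the URL of the GOES file you picked.

```python
import time
import urllib.request


def range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for bytes start..end inclusive."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch(request, timeout: float = 60):
    """Send a request and return (status, body).

    A server that honors the Range header answers 206 Partial Content;
    a 200 means it ignored the header and sent the whole file.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


def speedup(full_seconds: float, partial_seconds: float) -> float:
    """How many times faster the partial read was."""
    return full_seconds / partial_seconds


# Usage (not run here, since it needs network access and your chosen URL):
#   url = "<your COG URL>"
#   (_, _), t_part = timed(fetch, range_request(url, 0, 65535))
#   (_, _), t_full = timed(fetch, urllib.request.Request(url))
#   print(f"{speedup(t_full, t_part):.0f}x faster")
```

Run each measurement a few times and compare medians, since a single request can be skewed by connection setup or caching.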
Quiz — click an answer to check it
No grade, no shame. Tap any option; you'll see if it's right plus the answer if not. The point is to notice what you already know and what's still settling.
What is a Cloud-Optimized GeoTIFF (COG)?
- A GeoTIFF with internal tiling + overviews + a header-first byte layout for HTTP range reads
- A new format separate from GeoTIFF
- A vector format
- A compression scheme only
What kind of data is Zarr designed for?
- Multi-dimensional gridded data (e.g. time × lat × lon × band), chunkable, parallelizable
- Vector data
- 1D time series only
- Single static rasters
What is STAC?
- SpatioTemporal Asset Catalog — a spec for cataloging geospatial assets
- A file format
- A query language
- A database
What does an HTTP range request do?
- Fetch a byte range of a file rather than the whole file
- Run faster
- Authenticate
- Compress
Which endpoints does the STAC API define?
- /search, /collections, /items
- /users, /posts only
- /login, /logout
- GraphQL only
Reflection
Take five minutes with this. Write your answer somewhere. Carry it into next week.