
Reader reconciliation

One metadata surface, two read interfaces, two readers

UNEP / IMEO / MARS

Status: revised 2026-05-09 — slimmed substantially. The earlier draft introduced a parallel _ReaderMeta / SyncReader / AsyncReader Protocol taxonomy, a custom ByteStore Protocol, and a from-scratch _cog_helpers.py async COG reader. On review, today’s GeoData / GeoDataBase already covers the sync metadata + read surface; obspec already plays the role of ByteStore; and developmentseed/async-geotiff already ships the async COG reader. So the design collapsed to: add AsyncGeoData only, defer ByteStore to obspec, write AsyncGeoTIFFReader as a thin adapter over async-geotiff.

Scope: the long-term shape of the reader layer in georeader. Adds an AsyncGeoData Protocol alongside today’s GeoData / GeoDataBase; adds one new reader (AsyncGeoTIFFReader) as a thin adapter over async-geotiff; documents an additive widening of RasterioReader’s bytes-path knobs.

Audience: anyone touching georeader/abstract_reader.py, georeader/rasterio_reader.py, or building downstream pipelines that need to swap readers without rewriting call sites.


Summary

Today, georeader ships one reader (RasterioReader) with a sync, GDAL-backed interface that has worked well for years. As the package’s audience grows into cloud-native and async-first workloads, the reader layer needs to grow with them — without breaking the call sites that already use it.

This design adds a single new Protocol (AsyncGeoData) alongside today’s GeoData / GeoDataBase so async-shaped readers slot into the existing surface. One concrete async reader is added (AsyncGeoTIFFReader), implemented as a thin adapter over developmentseed/async-geotiff. Cloud byte access is delegated to obspec — the upstream Protocol that async-geotiff already consumes — rather than wrapped in a parallel ByteStore Protocol of our own. Downstream code branches only on sync-vs-async, never on which concrete reader class is in use.

The work splits into two small issues that can be reviewed independently.


Motivation

Three pressures make this worth doing now:

  1. Cloud is the default substrate, not an exotic one. New RS workflows assume reads from S3 / GCS / Azure; today’s RasterioReader routes through GDAL VSI, which is excellent for the common case but offers no way to opt into competing transports — obstore (Rust core, HTTP/2, native parallel ranges) for hot-path throughput, or fsspec for niche backends and custom auth. The existing reader lacks the seam to plug them in.

  2. Async I/O is now first-class. Tile servers, web maps, ML inference services, and any code that fans out reads concurrently are increasingly written async-first. RasterioReader is sync-only. Users wanting an async reader either roll their own or pull in an external library with a different API shape — there is no shared interface to compose against.

  3. COG-only readers can be substantially faster than full GDAL. A pure-Rust COG reader (via async-tiff) can skip per-call GDAL state and PROJ initialisation, batch parallel range requests directly via obstore, and coalesce close-by ranges. For tile-server fan-out across thousands of small windows the overhead difference is meaningful. A reader specialised to COG (the dominant cloud-native format) deserves a place alongside the general-purpose RasterioReader, not as a separate ecosystem with an incompatible API. We don’t have to build such a reader — developmentseed/async-geotiff exists, is actively maintained, and is the right thing to depend on. Our job is to expose it behind the same Protocol-shaped surface as RasterioReader.

The status quo can absorb each of these one at a time, but the shapes start to drift apart and downstream code accumulates branches. A reconciliation pass — AsyncGeoData Protocol + thin async-geotiff adapter — pays for itself the first time a user wants to swap GDAL VSI for obstore in a hot loop.


Primer for newcomers

A handful of advanced concepts run through this design. Quick primers below; deeper specs in the per-issue sub-designs.

ELI5. Reading a satellite image from the cloud is like ordering one slice of pizza from a giant pie that lives in another city. You don’t want the whole pie shipped — just your slice. This design is about how to ask for slices, who actually fetches them, and how to wait efficiently when you want a thousand at once.

What “reader” means in this package

What it is. A reader is a Python class that turns a file path or URL (local disk, S3, GCS, Azure, HTTP) into a GeoTensor — a numpy array with georeferencing attached. Today’s package has one (RasterioReader); this design adds one more (AsyncGeoTIFFReader).

How it works. A reader has two phases. Open (cheap) reads only the file’s header — enough to know the CRS, transform, shape, dtype. Read (expensive) actually fetches pixel bytes for a window and decodes them. The split lets you pass readers around as cheap handles and only pay I/O when you ask for data.

What this means for us. Code that takes a “reader” as input doesn’t need the bytes — just the metadata. That’s why georeader’s existing Protocols split into two layers (GeoDataBase for metadata-only, GeoData for read-capable). Many georeader functions (window math, bounds queries, catalog construction) only need metadata and run instantly even on cloud-hosted files.
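For instance, a window-math helper can be typed against the metadata-only layer. A minimal sketch, assuming a rasterio-style affine transform attribute (names illustrative, not the verbatim georeader Protocol surface):

# A metadata-only helper typed against GeoDataBase (attribute names
# assume rasterio/affine conventions; treat as illustrative).
def pixel_area(reader: GeoDataBase) -> float:
    a = reader.transform               # affine.Affine, read from the header only
    return abs(a.a * a.e - a.b * a.d)  # |det| = area of one pixel in CRS units
# No pixel bytes are fetched, so this is cheap even for a cloud-hosted scene.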

Sync vs async I/O

What it is. Sync code blocks the calling thread until I/O completes (the standard Python flow). Async code uses async def / await so the thread can do other work while waiting. Two different control-flow models for the same fundamental operation.

How it works. Sync I/O is what you’ve used your whole life: open(path).read(). Async I/O uses asyncio (or trio); the runtime juggles many in-flight reads concurrently on one thread, which is dramatically more efficient for workloads where you’d otherwise spawn a thread-per-request (tile servers, 1000-window batch reads).

What this means for us. RasterioReader is sync — fine for batch jobs, scripts, notebooks. AsyncGeoTIFFReader is async — needed when you want to fan out 1000 reads concurrently from one process. The Protocol surface (GeoData / AsyncGeoData) isolates the difference so user code only branches on await vs not, never on which concrete reader class is in use.
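Concretely, the fan-out pattern looks like this (a sketch assuming the AsyncGeoData surface proposed below; GeoSlice and GeoTensor as defined in this design):

import asyncio

async def read_many(reader: AsyncGeoData, slices: list[GeoSlice]) -> list[GeoTensor]:
    async with reader as r:
        # All reads are in flight at once on a single thread;
        # the event loop multiplexes the waiting.
        return await asyncio.gather(*(r.read_geoslice(s) for s in slices))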

The “bytes path”

What it is. When a reader fetches data from cloud storage (S3, GCS, Azure), something has to translate “give me bytes 0–4096 of s3://bucket/scene.tif” into actual HTTP traffic. The library that does this is the bytes path.

How it works. Three libraries can play this role: GDAL VSI (libcurl in C, the default for RasterioReader), obstore (Rust core, fast for parallel ranges), and fsspec (Python, broadest backend coverage). They differ in throughput, async support, and which clouds they speak.

What this means for us. A single reader class can run on different bytes paths. RasterioReader defaults to VSI but the optional widening in Issue 1 lets you swap to fsspec via fs= or to a custom callback via opener=. The new reader (AsyncGeoTIFFReader) skips GDAL entirely and accepts any obspec.AsyncStore (obstore.S3Store / GCSStore / AzureStore / etc.). Your call which trade-off matches the workload — see geostack.md §“obstore vs fsspec compared” for the comparison.
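Side by side, the three paths look like this. obstore.store.S3Store is real obstore API; the fs= knob is the Issue 1 proposal (not yet shipped), and the store= kwarg name on open is an assumption:

import fsspec
import obstore.store
# RasterioReader / AsyncGeoTIFFReader assumed importable from georeader.

# 1. GDAL VSI: today's default, nothing to configure.
r_vsi = RasterioReader("s3://bucket/scene.tif")

# 2. fsspec: via the proposed fs= knob (Issue 1).
r_fsspec = RasterioReader("s3://bucket/scene.tif", fs=fsspec.filesystem("s3"))

# 3. obstore: GDAL-free async path (store= kwarg name illustrative).
async def open_cog():
    store = obstore.store.S3Store("bucket", region="us-west-2")
    return await AsyncGeoTIFFReader.open("scene.tif", store=store)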

Python Protocols

What it is. A typing.Protocol is a “structural type” — a class declaration that says what methods/attributes a type must have without requiring inheritance. Like duck typing with type-checker support.

How it works. Define a Protocol with the surface you want; any class that has the right attributes satisfies it automatically (no class MyReader(GeoData) declaration required). With @runtime_checkable, isinstance(x, Protocol) works at runtime too.

What this means for us. The reader Protocols (GeoDataBase, GeoData, AsyncGeoData) let RasterioReader and AsyncGeoTIFFReader (and any future sensor-specific or raw-byte reader) be passed to the same function with no shared base class — they just satisfy the Protocol structurally. Same shape; independent implementations; no inheritance hierarchy.
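Mechanically (a simplified sketch, not the real georeader definitions):

from typing import Protocol, runtime_checkable

@runtime_checkable
class GeoDataBase(Protocol):
    @property
    def crs(self): ...
    @property
    def transform(self): ...
    @property
    def shape(self) -> tuple[int, ...]: ...

class AsyncGeoData(GeoDataBase, Protocol):
    async def read_geoslice(self, slice_): ...

# Any class with matching attributes satisfies these automatically, and
# @runtime_checkable makes isinstance(reader, GeoDataBase) work at runtime.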


Goals


Non-goals


Constraints


High-level shape

Two readers, one shared metadata surface, two read interfaces:

| Reader | Lives in | Sync / async | Transport | Driver coverage |
| --- | --- | --- | --- | --- |
| RasterioReader | georeader | sync | GDAL / VSI | every GDAL driver |
| AsyncGeoTIFFReader | georeader | async | obstore / fsspec | TIFF / COG only |

The metadata properties and the read_window / read_bounds / read_geoslice / load method names are identical across both. The only divergence is whether reads are sync or async.

# Sync path — RasterioReader satisfies GeoData
def apply_to_chip(reader: GeoData, slice_: GeoSlice, op: Operator) -> GeoTensor:
    with reader as r:
        gt = r.read_geoslice(slice_)
        return op(gt)

# Async path — AsyncGeoTIFFReader satisfies AsyncGeoData
async def apply_to_chip_async(reader: AsyncGeoData, slice_: GeoSlice, op: Operator) -> GeoTensor:
    async with reader as r:
        gt = await r.read_geoslice(slice_)
        return op(gt)                                   # op itself stays sync


# In geotoolz, the pipeline picks which world it lives in:
geotoolz.catalog_ops.CatalogPipeline(
    catalog,
    op,
    reader_class=georeader.RasterioReader,         # sync default
    # reader_class=georeader.AsyncGeoTIFFReader,    # async, fan-out
)

Same metadata surface, same read_* method names, two different bytes paths underneath. The only tax on swapping is await — which is unavoidable as long as the cloud HTTP world is fundamentally async. For the side-by-side strategy comparison (open cost, read cost, concurrency, driver coverage), see the stack-level overview in geostack.md.


Sub-designs

The work splits into two independently reviewable issues:

| # | Sub-design | Owns |
| --- | --- | --- |
| 1 | reader_protocol.md | AsyncGeoData Protocol (single new Protocol); GeoTensor Protocol-conformance check; tutorial chapter updates (02). Optional bundle: RasterioReader constructor widening with opener= / fs= / rio_open_kwargs= knobs + three-bytes-paths writeup in tutorial Ch. 3. |
| 2 | reader_async_geotiff.md | AsyncGeoTIFFReader class — thin (~80 LOC) adapter over developmentseed/async-geotiff; async open(...) classmethod; Window/RasterArray translators; passthrough of obspec.AsyncStore to GeoTIFF.open. |
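For orientation, the adapter shape is roughly as follows. A sketch only: GeoTIFF.open and overview.read(window=...) are the upstream calls this design names, but the store= kwarg, the overviews attribute, and the _to_* translators are placeholders, not confirmed async-geotiff API.

class AsyncGeoTIFFReader:
    def __init__(self, tiff):
        self._tiff = tiff  # upstream GeoTIFF handle, header already parsed

    @classmethod
    async def open(cls, path: str, store) -> "AsyncGeoTIFFReader":
        # store: any obspec.AsyncStore; upstream does the IFD walk.
        tiff = await GeoTIFF.open(path, store=store)
        return cls(tiff)

    async def read_window(self, window) -> GeoTensor:
        overview = self._tiff.overviews[0]  # placeholder: full-resolution IFD
        arr = await overview.read(window=_to_upstream_window(window))
        return _to_geotensor(arr)           # RasterArray -> GeoTensor translator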

Cloud byte transport is delegated to obspec (see types/bytestore.md); we ship a small geotoolz.io.open_store(url) factory and nothing else. There is no ByteStore Protocol of our own.
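A plausible shape for that factory, assuming obstore’s from_url helper (real obstore API) handles the scheme dispatch:

import obstore.store

def open_store(url: str):
    # obstore parses s3:// gs:// az:// http(s):// URLs itself,
    # so the factory stays a thin one-liner.
    return obstore.store.from_url(url)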

Each sub-design is sized to be a single PR with a focused review.


Sequencing

Issue 1 (AsyncGeoData Protocol; optional RasterioReader widening)
   │
   ▼
types/bytestore.md (one-page obspec passthrough note + open_store helper)
   │
   ▼
Issue 2 (AsyncGeoTIFFReader thin adapter over async-geotiff)

Open questions

These are unresolved and should be discussed before Issue 1 starts.

1. RasterioReader file-handle caching

The current RasterioReader opens the file fresh on every read() call — see Ch. 3 §1 of the tutorial. That behaviour is deliberate: it makes the reader pickleable for multiprocessing / joblib / Dask workers, because a cached rasterio.DatasetReader cannot cross a process boundary.

The proposal in this design implies caching the open handle for the lifetime of the reader (with explicit __enter__ / __exit__ and close()). That’s a behaviour change and the trade-off is real: a cached handle makes repeated reads cheaper, but the reader is no longer trivially pickleable for multiprocessing / joblib / Dask workers.
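The shape under discussion, sketched (not current behaviour; names illustrative):

import rasterio

class _CachedHandleSketch:
    def __init__(self, path: str):
        self.path = path
        self._ds = None  # no handle yet, so the object is still pickleable

    def __enter__(self):
        self._ds = rasterio.open(self.path)  # one open, reused across reads
        return self

    def __exit__(self, *exc):
        self._ds.close()  # a cached DatasetReader cannot cross a process boundary
        self._ds = None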

Decision needed before Issue 1.

2. Where COG IFD parsing + tile math + decompression lives

In developmentseed/async-geotiff (and its Rust dep async-tiff). We don’t host these primitives ourselves — AsyncGeoTIFFReader is a thin adapter over GeoTIFF.open and overview.read(window=...). The earlier draft of this plan proposed a private _cog_helpers.py module; that scope was removed when the review showed async-geotiff already covers IFD walk, tile-fetch math, decompression dispatch, request coalescing, and decoding off the event loop. See Issue 2 §“Why the rewrite”.

If a future reader needs the same primitives (sync facade, sensor-specific COG variant), the right path is to call async-geotiff from sync code via asyncio.run(...) — not to fork the helpers.

3. A sync GDAL-free GeoTensor reader (deferred)

Earlier drafts of this design proposed a LazyCOGReader — a sync, GDAL-free, COG-only GeoTensor reader. It was originally pitched as a wrapper around the developmentseed/lazycogs library, which turned out to return xarray.DataArray (not GeoTensor) and to be properly part of the xrtoolz / dense-cube stack — see the geostack_notes.md discussion for that re-routing.

The sync GDAL-free GeoTensor workload itself is plausible (notebooks, FastAPI sync handlers, batch scripts), but doesn’t yet have a clear customer that RasterioReader (sync, GDAL) and AsyncGeoTIFFReader (async, GDAL-free) don’t already cover between them. If a real workload emerges, the cheapest path is a sync facade that wraps AsyncGeoTIFFReader with asyncio.run(...) for one-call use cases; the more expensive path is a from-scratch sync IFD reader. Decide if and when.
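That cheaper path, sketched (the function name and the url-only open signature are assumptions):

import asyncio

def read_geoslice_sync(url: str, slice_: GeoSlice) -> GeoTensor:
    async def _go():
        reader = await AsyncGeoTIFFReader.open(url)
        async with reader as r:
            return await r.read_geoslice(slice_)
    # One event loop per call: fine for notebooks and batch scripts,
    # wasteful in a hot loop (use the native async path there instead).
    return asyncio.run(_go())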

4. Async warp / resample / overview-pick (deferred)

async-geotiff explicitly disclaims warping, resampling, and automatic overview selection. Their guidance is “load with async-geotiff, then warp via rasterio.MemoryFile if needed”. Our v1 plan adopts the same boundary: AsyncGeoTIFFReader.read_bounds(target_crs=...) raises NotImplementedError and points users at the same load-then-warp path.
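That fallback, sketched with real rasterio API (MemoryFile, WarpedVRT); the gt.values / gt.crs / gt.transform attribute names assume the GeoTensor shape used elsewhere in this document:

from rasterio.io import MemoryFile
from rasterio.vrt import WarpedVRT

def warp_geotensor(gt, dst_crs: str):
    count, height, width = gt.values.shape
    profile = dict(driver="GTiff", count=count, height=height, width=width,
                   dtype=str(gt.values.dtype), crs=gt.crs, transform=gt.transform)
    with MemoryFile() as mem:
        with mem.open(**profile) as dst:
            dst.write(gt.values)              # stage an in-memory GeoTIFF
        with mem.open() as src, WarpedVRT(src, crs=dst_crs) as vrt:
            return vrt.read(), vrt.transform  # warped pixels + new affine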

This is fine for the workloads we know about (tile servers serving native-CRS overviews, fan-out batch reads in a single CRS). It will not fit a future tile server that needs Web-Mercator output from a UTM source without GDAL anywhere in the loop. When that customer materialises, the boundary has to be revisited; until then, the load-then-warp fallback above is the escape hatch.

Same logic applies to overview auto-selection (request_resolution-style helper) and to in-CRS resampling. Deferred for a later discussion — flagging here so we don’t accidentally bake a no-warp assumption deep into downstream code that would later be hard to lift.


Alternatives considered


Tutorial alignment

Once these designs are implemented, the existing tutorial chapters need updates: the Protocol surface in Ch. 2 (Issue 1) and the bytes-path writeup in Ch. 3 (the optional RasterioReader widening).

The tutorial today describes the current package state; updates land alongside each issue’s implementation, not before.


Open questions, gotchas, and warnings

The reconciliation is mostly low-risk — pieces exist, the work is plumbing. A few things to manage actively: