RasterioReader

The lazy file-backed reader

UNEP
IMEO
MARS

Module: georeader/rasterio_reader.py (1630 LOC)

Role: the canonical GeoData implementation. Wraps rasterio to give you a GeoTensor-shaped interface over a file (local, S3, GCS, Azure, HTTP) without reading the bytes until you ask.


1. Why a lazy reader exists

Three concrete reasons rasterio alone isn’t enough:

  1. Process-safety. RasterioReader opens the file fresh on every read() call. That’s the unlock for multiprocessing / joblib / Dask workers — you can pickle the reader, send it to workers, and each worker opens its own dataset handle. A cached rasterio.DatasetReader cannot be pickled safely.

  2. A GeoTensor-shaped surface without the bytes. reader.shape, reader.transform, reader.bounds, reader.dtype, reader.isel(...) all work without reading data. Only read() / load() / read_from_window().load() materialise.

  3. Multi-file stacks as one object. Pass a list of paths, get a (T, C, H, W) reader. isel({"time": 0}) returns a single-time-step reader. The time dimension is a structural feature, not a wrapper.

This is the class your operators should accept whenever they don’t strictly need the data in memory.
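
The open-on-every-read idea behind the process-safety point can be illustrated with a minimal sketch. LazyFile is a hypothetical stand-in, not the package's class: it stores only a path, opens a fresh handle per read, and therefore survives a pickle round trip, whereas an open handle does not.

```python
import os
import pickle
import tempfile


class LazyFile:
    """Hypothetical stand-in for a lazy reader: it stores only the path."""

    def __init__(self, path):
        self.path = path  # a plain string, so the instance pickles trivially

    def read(self, offset=0, size=-1):
        # Open a fresh handle on every call and close it before returning;
        # no open handle is ever stored on the instance.
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)


# Create a small file to read from.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"0123456789")

reader = LazyFile(tmp.name)

# Round-trip through pickle, as multiprocessing / joblib do when dispatching work.
clone = pickle.loads(pickle.dumps(reader))
chunk = clone.read(offset=2, size=4)

# An open handle, by contrast, refuses to pickle.
with open(tmp.name, "rb") as handle:
    try:
        pickle.dumps(handle)
        handle_pickles = True
    except TypeError:
        handle_pickles = False

os.remove(tmp.name)
```

The cost of this design is one open per read, which is usually acceptable when each read fetches a whole tile.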


2. RasterioReader vs GeoTensor

┌─────────────────────────────────────────────────────────────────────────┐
│                 RASTERIOREADER vs GEOTENSOR                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  RasterioReader (Lazy)              GeoTensor (In-Memory)               │
│  ─────────────────────              ────────────────────                │
│                                                                         │
│  • Data on disk/cloud               • Data in RAM                       │
│  • Read on demand                   • Instant access                    │
│  • Memory efficient                 • Full numpy API                    │
│  • Parallel-safe                    • Arithmetic operations             │
│  • Overview/pyramid support         • Broadcasting                      │
│                                                                         │
│  Use for:                           Use for:                            │
│  • Large files                      • Processing pipelines              │
│  • Cloud data                       • CNN inference                     │
│  • Tiled processing                 • Index calculations                │
│  • Quick previews                   • Visualizations                    │
│                                                                         │
│  Convert: reader.load() ────────────────────────────────► GeoTensor     │
└─────────────────────────────────────────────────────────────────────────┘

The mental model: RasterioReader is the address book, GeoTensor is the delivered package. reader.load() is the postman.

The arithmetic ops (+, *, etc.) only exist on GeoTensor. If you write reader * 2 you get an error — and that’s deliberate. Arithmetic on a lazy reader implies you’ve decided to materialise; the explicit .load() makes that decision visible at the call site.


3. Multi-file reading: time series as structure

┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-FILE READING                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Input: List of paths                Output array shape                 │
│  ────────────────────                ──────────────────                 │
│                                                                         │
│  paths = [                                                              │
│    "2023-01.tif",   ─────┐                                              │
│    "2023-02.tif",   ─────┼──────► stack=True:  (T, C, H, W)             │
│    "2023-03.tif"    ─────┘                      (3, 4, 1000, 1000)      │
│  ]                                                                      │
│                                                                         │
│  Each file: (4, 1000, 1000)        stack=False: (T×C, H, W)             │
│  4 bands, 1000×1000 pixels                       (12, 1000, 1000)       │
│                                                                         │
│  Requirements for multi-file:                                           │
│  • Same CRS                                                             │
│  • Same transform (resolution, origin)                                  │
│  • Same shape (unless allow_different_shape=True)                       │
└─────────────────────────────────────────────────────────────────────────┘

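Two modes worth distinguishing: stack=True keeps time as its own leading axis and returns (T, C, H, W), while stack=False concatenates the bands of every file along the band axis and returns (T×C, H, W).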

Validation is strict: __init__ checks CRS / transform / shape match across files unless you opt out via allow_different_shape=True. The check only relaxes shape — CRS and transform mismatches always raise. (See rasterio_reader.py:301.)
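
The shape rules in the diagram can be written down as a small sketch. stacked_shape is a hypothetical helper that mirrors the diagram only; it is not the package's validation code and it ignores the allow_different_shape escape hatch.

```python
def stacked_shape(file_shapes, stack=True):
    """Return the array shape that a list of (C, H, W) files would produce.

    Hypothetical helper: all files must share one shape, as in the strict
    default described above.
    """
    if not file_shapes:
        raise ValueError("need at least one file shape")
    first = file_shapes[0]
    for i, shape in enumerate(file_shapes[1:], start=1):
        if shape != first:
            raise ValueError(f"file {i} has shape {shape}, expected {first}")
    n_files = len(file_shapes)
    bands, height, width = first
    if stack:
        return (n_files, bands, height, width)   # time is its own axis
    return (n_files * bands, height, width)      # bands of all files concatenated


shapes = [(4, 1000, 1000)] * 3
stacked = stacked_shape(shapes, stack=True)
flat = stacked_shape(shapes, stack=False)
```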


4. Window focus — “view” semantics

┌─────────────────────────────────────────────────────────────────────────┐
│                    WINDOW FOCUS CONCEPT                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Full raster (10000 × 10000)                                            │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                                                                │     │
│  │                                                                │     │
│  │        ┌─────────────────────┐                                 │     │
│  │        │    window_focus     │  ← reader.set_window(...)       │     │
│  │        │    (2000 × 2000)    │                                 │     │
│  │        │                     │  After set_window:              │     │
│  │        │  ┌───────────┐      │  • reader.shape → (C, 2000, 2000)│    │
│  │        │  │ read()    │      │  • reader.bounds → window bounds│     │
│  │        │  │ window    │      │  • read(window=...) is relative │     │
│  │        │  └───────────┘      │    to window_focus              │     │
│  │        └─────────────────────┘                                 │     │
│  │                                                                │     │
│  └────────────────────────────────────────────────────────────────┘     │
│                                                                         │
│  Benefits: • Work with large files efficiently                          │
│            • Coordinates/bounds reflect the focused region              │
│            • Tiled processing with consistent interface                 │
└─────────────────────────────────────────────────────────────────────────┘

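set_window(window_focus) is the single most useful trick in this class for tiled processing. Once the focus is set, reader.shape and reader.bounds describe the focused region only, and any window passed to read(window=...) is interpreted relative to that focus rather than to the full raster.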

The non-mutating sister is read_from_window(window, boundless=True) (rasterio_reader.py:654) which returns a new reader with the focus applied — preferred for parallel pipelines where mutating shared state is awkward.
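
A minimal sketch of the window arithmetic behind tiled processing follows. Both helpers are hypothetical and work on plain tuples in the (col_off, row_off, width, height) order that rasterio.windows.Window uses; the focus offsets are made-up values for illustration.

```python
def tile_windows(height, width, tile):
    """Yield (col_off, row_off, win_width, win_height) tuples covering a raster.

    Edge tiles are clipped to the raster extent.
    """
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            yield (col_off, row_off,
                   min(tile, width - col_off),
                   min(tile, height - row_off))


def to_absolute(window, focus):
    """Shift a window expressed relative to `focus` into full-raster pixels."""
    col_off, row_off, win_width, win_height = window
    return (focus[0] + col_off, focus[1] + row_off, win_width, win_height)


# Tile a 2000 x 2000 focus region into 512-pixel chips.
windows = list(tile_windows(height=2000, width=2000, tile=512))

# Hypothetical window_focus inside a 10000 x 10000 raster.
focus = (3000, 2500, 2000, 2000)
first_abs = to_absolute(windows[0], focus)
```

Each tuple can be unpacked into rasterio.windows.Window(*w) and handed to read_from_window, so every worker gets its own focused reader instead of mutating a shared one.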


5. Constructor parameters

RasterioReader(
    paths,                          # str | list[str]
    allow_different_shape=False,
    window_focus=None,              # rasterio.windows.Window
    fill_value_default=None,        # falls back to file's nodata, then 0
    stack=True,                     # only matters for list paths
    indexes=None,                   # 1-based band indices (rasterio convention)
    overview_level=None,            # 0 = first overview, None = full res
    check=True,                     # validate CRS/transform/shape across paths
    rio_env_options=None,           # GDAL options dict (vsi creds etc.)
)

A few non-obvious ones:
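
One of these, the fill_value_default fallback noted in the signature comment ("falls back to file's nodata, then 0"), can be sketched as below. resolve_fill_value is a hypothetical helper, not the package's code; the point it makes is that the check should be against None so that an explicit 0 is respected.

```python
def resolve_fill_value(fill_value_default=None, file_nodata=None):
    """Pick the padding value: explicit argument, then file nodata, then 0."""
    if fill_value_default is not None:
        return fill_value_default   # an explicit 0 is a valid choice
    if file_nodata is not None:
        return file_nodata          # fall back to the nodata stored in the file
    return 0                        # last resort
```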


6. The four read paths

Each method emphasises a different ergonomic. They all ultimately call into rasterio.DatasetReader.read.

| Method | Returns | Use for |
| --- | --- | --- |
| load(boundless=True) | GeoTensor of full focus | “give me everything in memory now” |
| read(**kwargs) | np.ndarray (no metadata) | rasterio-compatible; passes kwargs through |
| read_from_window(window, boundless=True) | RasterioReader (focused) | tiled inference; chain with .load() to materialise |
| read_from_tile(x, y, z, ...) | np.ndarray | XYZ web-tile schema (used by tileserver reader) |

The boundless=True default on read_from_window/load matches GeoTensor.read_from_window — you get the requested shape no matter where the window lands, padded with fill_value_default. This is critical for batched CNN inference: every chip has the same shape, batches stack cleanly.
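
What "boundless" means can be shown on a plain NumPy array. This is a sketch of the padding behaviour only; boundless_read is a hypothetical function and says nothing about how the package or rasterio implement it.

```python
import numpy as np


def boundless_read(array, row_off, col_off, height, width, fill_value=0):
    """Return a (height, width) chip from a 2-D array, padding outside it."""
    out = np.full((height, width), fill_value, dtype=array.dtype)
    # Intersection of the requested window with the array, in array coordinates.
    r0, r1 = max(row_off, 0), min(row_off + height, array.shape[0])
    c0, c1 = max(col_off, 0), min(col_off + width, array.shape[1])
    if r0 < r1 and c0 < c1:
        # Copy the overlapping part to its position inside the output chip.
        out[r0 - row_off:r1 - row_off, c0 - col_off:c1 - col_off] = array[r0:r1, c0:c1]
    return out


data = np.arange(16, dtype=np.int32).reshape(4, 4)
# The window starts inside the array and runs two pixels past its edge.
chip = boundless_read(data, row_off=2, col_off=2, height=4, width=4, fill_value=-1)
```

The chip keeps the requested 4 × 4 shape even though only its top-left 2 × 2 corner overlaps the data, which is what lets chips be stacked into a batch.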


7. xarray-style isel over time + band

reader.isel({"time": [0, 2], "band": [1, 2, 3]})

Same dim-name vocabulary as GeoTensor.isel (Chapter 1 §6) but with "time" admitted for stacked multi-file readers. Returns a new reader — still lazy. Spatial dims ("x", "y") accept slices and rewrite the focus window.

The internal mechanism: isel returns a copied reader with indexes adjusted (band selection), paths filtered (time selection), and window_focus updated (spatial slice). Nothing reads. Source: rasterio_reader.py:739.
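
The copy-and-adjust mechanism can be sketched with a small immutable record. LazyStack is hypothetical and covers only the time and band selections; spatial slices are left out, and treating band positions as 0-based into the current selection is an assumption borrowed from xarray's isel convention.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LazyStack:
    """Hypothetical lazy reader state: nothing here touches a file."""
    paths: Tuple[str, ...]
    indexes: Tuple[int, ...]  # 1-based band numbers, rasterio style

    def isel(self, sel):
        paths, indexes = self.paths, self.indexes
        if "time" in sel:
            # Time selection filters the list of files.
            paths = tuple(paths[i] for i in sel["time"])
        if "band" in sel:
            # Assumption: positions are 0-based into the current band selection.
            indexes = tuple(indexes[i] for i in sel["band"])
        # Return a modified copy; the original stays untouched.
        return replace(self, paths=paths, indexes=indexes)


reader = LazyStack(
    paths=("2023-01.tif", "2023-02.tif", "2023-03.tif"),
    indexes=(1, 2, 3, 4),
)
subset = reader.isel({"time": [0, 2], "band": [1, 2, 3]})
```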


8. Overviews — the “free preview” path

reader = RasterioReader("s3://bucket/big.tif", overview_level=2)
preview = reader.load()  # ~64× fewer pixels than full res, assuming each overview level halves the resolution

Two methods to know:

This is also the right tool for “should I load this whole scene to decide if it’s cloudy?” — read overview level 3, run your cloud check on the small image, then go full-res only if it’s worth it.
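
The arithmetic behind the savings is short enough to sketch. Both functions are hypothetical and assume a power-of-two pyramid; the real factors are stored in the file and can be inspected with rasterio's overviews(bidx) method on an open dataset.

```python
def overview_decimation(overview_level=None, factor=2):
    """Per-axis decimation for an overview level (None means full resolution)."""
    if overview_level is None:
        return 1
    # Level 0 is the first overview, so it is already decimated once.
    return factor ** (overview_level + 1)


def pixel_reduction(overview_level=None, factor=2):
    """How many times fewer pixels are read compared with full resolution."""
    return overview_decimation(overview_level, factor) ** 2


level_2 = pixel_reduction(2)   # 8x per axis
level_3 = pixel_reduction(3)   # 16x per axis
```

Under that assumption, level 2 reads 64 times fewer pixels and level 3 reads 256 times fewer, which is why a cloud check on a level-3 preview is cheap.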


9. The cloud story — VSI paths and credentials

rasterio (via GDAL) handles cloud paths transparently if you pass them in the right form:

| URI form | Backend |
| --- | --- |
| s3://bucket/key.tif | GDAL VSI /vsis3/ |
| gs://bucket/key.tif | /vsigs/ |
| https://...cog.tif | /vsicurl/ (range requests) |
| az://account/key.tif | /vsiaz/ |

The internal helper _get_rio_options_path(path), together with the module-level _vsi_path in geotensor.py, translates user-friendly URIs to VSI form. Credentials come from rio_env_options or from the environment (AWS_*, GOOGLE_APPLICATION_CREDENTIALS, etc.), the same as with plain rasterio.
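
The URI-to-VSI mapping can be sketched as a prefix rewrite. to_vsi_path is a hypothetical function written for illustration; it is not the package's _vsi_path and may differ from it in edge cases.

```python
_SCHEME_TO_VSI = {
    "s3": "/vsis3/",
    "gs": "/vsigs/",
    "az": "/vsiaz/",
}


def to_vsi_path(uri):
    """Translate a URI into the GDAL VSI form for its scheme."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        return uri  # no scheme: treat as a local path and leave it untouched
    if scheme in ("http", "https"):
        return "/vsicurl/" + uri  # /vsicurl/ keeps the full URL
    if scheme in _SCHEME_TO_VSI:
        return _SCHEME_TO_VSI[scheme] + rest
    raise ValueError(f"unsupported scheme: {scheme!r}")


s3_path = to_vsi_path("s3://bucket/key.tif")
http_path = to_vsi_path("https://example.com/cog.tif")
```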

GDAL options: RIO_ENV_OPTIONS_DEFAULT

The package ships a sensible-default GDAL configuration applied to every read:

# georeader/geotensor.py:140-150
RIO_ENV_OPTIONS_DEFAULT = dict(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
    GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
    GDAL_CACHEMAX=2_000_000_000,
    GDAL_HTTP_MULTIPLEX="YES",
)

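What each does:

| Option | Effect |
| --- | --- |
| GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR" | Skips listing the containing directory when a file is opened, so GDAL does not probe for sidecar files; on object stores this avoids extra list requests. |
| GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES" | Merges adjacent byte-range requests into a single HTTP request. |
| GDAL_CACHEMAX=2_000_000_000 | Sizes the raster block cache; a value this large is interpreted as bytes, so roughly 2 GB. |
| GDAL_HTTP_MULTIPLEX="YES" | Allows HTTP/2 multiplexing of range requests over one connection when libcurl and the server support it. |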

Override via the rio_env_options= kwarg on the constructor when defaults aren’t right (rare) or when you need to add specific options like AWS_REQUEST_PAYER="requester".
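
A minimal sketch of layering extra options on top of the defaults follows. Whether the constructor merges rio_env_options with the defaults or replaces them is not stated here, so this hypothetical build_env_options helper builds the full dictionary explicitly; the smaller cache size is an arbitrary example value.

```python
RIO_ENV_OPTIONS_DEFAULT = dict(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
    GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
    GDAL_CACHEMAX=2_000_000_000,
    GDAL_HTTP_MULTIPLEX="YES",
)


def build_env_options(overrides=None):
    """Return a new dict: defaults with user overrides applied on top."""
    return {**RIO_ENV_OPTIONS_DEFAULT, **(overrides or {})}


options = build_env_options({
    "AWS_REQUEST_PAYER": "requester",   # added option
    "GDAL_CACHEMAX": 512_000_000,       # overridden default
})
```

The result would then be passed as RasterioReader(path, rio_env_options=options).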

How RasterioReader applies them

Every open goes through a rasterio.Env(...) context wrapping the configured options:

# georeader/rasterio_reader.py:301-326
with rasterio.Env(**self._get_rio_options_path(paths[0])):
    with rasterio.open(paths[0], "r", overview_level=overview_level) as src:
        ...

GDAL is configured once per rasterio.open call via the env context manager, and credentials are picked up from os.environ at the moment the context is entered. This is the seam that makes the next subsection’s pattern work.

Credentials: env-var-first

The mental model:

GDAL reads credentials from process environment variables. The pattern is to set the env vars once at app startup, then construct RasterioReader instances anywhere with no per-call credential threading. The reader’s rasterio.Env(...) wrap inherits whatever’s in os.environ at the moment of open.

Per-cloud env vars GDAL recognises:

| Cloud | Required | Optional |
| --- | --- | --- |
| AWS | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | AWS_SESSION_TOKEN, AWS_REGION, AWS_REQUEST_PAYER=requester |
| GCS | GOOGLE_APPLICATION_CREDENTIALS (path to JSON) | |
| Azure | AZURE_STORAGE_ACCOUNT + one of: AZURE_STORAGE_SAS_TOKEN, AZURE_STORAGE_CONNECTION_STRING, AZURE_STORAGE_ACCESS_TOKEN | |
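
Because credentials are read from the environment at open time, a startup check can fail fast before any reader is built. The sketch below encodes only the "Required" column of the table; missing_cloud_env is a hypothetical helper, and it ignores other credential sources GDAL may support (config files, instance roles).

```python
import os

_AZURE_ONE_OF = (
    "AZURE_STORAGE_SAS_TOKEN",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_STORAGE_ACCESS_TOKEN",
)


def missing_cloud_env(cloud, env=None):
    """List the required env vars from the table that are unset for `cloud`."""
    env = os.environ if env is None else env
    if cloud == "aws":
        required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
        return [name for name in required if not env.get(name)]
    if cloud == "gcs":
        name = "GOOGLE_APPLICATION_CREDENTIALS"
        return [] if env.get(name) else [name]
    if cloud == "azure":
        missing = [] if env.get("AZURE_STORAGE_ACCOUNT") else ["AZURE_STORAGE_ACCOUNT"]
        if not any(env.get(name) for name in _AZURE_ONE_OF):
            missing.append("one of: " + ", ".join(_AZURE_ONE_OF))
        return missing
    raise ValueError(f"unknown cloud: {cloud!r}")
```

Calling missing_cloud_env("azure") with no second argument checks the real process environment; passing a dict makes the function easy to test.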

The rest of this section walks through the three Azure modes the package’s downstream users (marsml, mars_data_ops) actually use.

Azure auth modes

Mode 1 — SAS token / connection string / account name

The simplest case: the credentials are static strings, and we set them as env vars before any reader is constructed.

# mars_data_ops/utils/filesystem.py:800-818
if set_env_variables:
    if connection_string is not None:
        os.environ['AZURE_STORAGE_CONNECTION_STRING'] = connection_string
    if sas_token is not None:
        os.environ['AZURE_STORAGE_SAS_TOKEN'] = sas_token
    if account_name is not None:
        os.environ['AZURE_STORAGE_ACCOUNT'] = account_name

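These env vars can be set in any combination. Per the GDAL /vsiaz/ documentation, a connection string is tried first; otherwise AZURE_STORAGE_ACCOUNT supplies the account name and is combined with a credential such as the SAS token. Setting AZURE_STORAGE_ACCOUNT alone does not by itself request anonymous access: for public containers GDAL also expects AZURE_NO_SIGN_REQUEST=YES, so check the documentation for your GDAL version.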

Mode 2 — Managed identity

When running inside Azure compute (VMs, AKS pods, Functions, etc.), there’s no static credential — the platform mints a short-lived bearer token via the IMDS endpoint. We fetch the token via azure.identity.DefaultAzureCredential and hand it to GDAL as an env var:

# mars_data_ops/utils/filesystem.py:765-789
credential = (
    DefaultAzureCredential(managed_identity_client_id=client_id)
    if client_id else DefaultAzureCredential()
)
token = credential.get_token('https://storage.azure.com/.default').token
os.environ['AZURE_STORAGE_ACCOUNT'] = account_name
os.environ['AZURE_STORAGE_ACCESS_TOKEN'] = token

Sharp edge: the token typically expires in ~1 hour. This snippet calls get_token(...) once at startup. If a long-running process tries to read after expiry, GDAL gets a 401 with no refresh path. For pipelines that run longer than the token TTL, refresh logic is the user’s responsibility today — see the reader_rasterio.md proposal for what an opinionated solution would look like.
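
One possible shape for user-side refresh logic is sketched below. AzureTokenRefresher is hypothetical and not part of either package. It takes an injected fetch_token callable that returns (token, expires_on) with expires_on in epoch seconds, and it relies on the behaviour described above, that credentials are read from os.environ when a reader opens a file; GDAL-side credential caching could still interfere.

```python
import os
import time


class AzureTokenRefresher:
    """Keep AZURE_STORAGE_ACCESS_TOKEN fresh (hypothetical helper)."""

    def __init__(self, fetch_token, margin_s=300, clock=time.time):
        self._fetch = fetch_token   # callable returning (token, expires_on_epoch_s)
        self._margin = margin_s     # refresh this many seconds before expiry
        self._clock = clock         # injectable for testing
        self._expires_on = 0.0      # forces a fetch on first use

    def ensure_fresh(self):
        """Refresh the env var if the token is missing or close to expiry.

        Returns True when a new token was fetched.
        """
        if self._clock() >= self._expires_on - self._margin:
            token, self._expires_on = self._fetch()
            os.environ["AZURE_STORAGE_ACCESS_TOKEN"] = token
            return True
        return False


# With azure.identity, fetch_token could be written as:
#   def fetch_token():
#       t = credential.get_token("https://storage.azure.com/.default")
#       return t.token, t.expires_on
```

Calling ensure_fresh() before each batch of reads keeps the cost to one comparison per call while the token is still valid.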

Mode 3 — HTTPS with embedded SAS fallback

GDAL’s AZURE_STORAGE_SAS_TOKEN env var doesn’t always kick in for paths that don’t go through the canonical az:// form. The fallback is to rewrite the path as an HTTPS URL with the SAS token embedded as a query string:

# mars_data_ops/utils/filesystem.py:336-358
def pathasroothttps(self, path: str) -> str:
    path_https = path.replace(self.root, self.root_https())
    if self.sas_token is not None:
        sep = '&' if '?' in path_https else '?'
        path_https += f"{sep}{self.sas_token.lstrip('?')}"
    return path_https

Now RasterioReader(pathasroothttps(p)) reads https://account.blob.core.windows.net/container/blob?sv=...&sig=... directly — GDAL treats it as a vanilla /vsicurl/ URL and the embedded SAS is the auth.

Use this when env-var auth misbehaves (the most common case is non-canonical paths that GDAL doesn’t recognise as Azure).

Config-file entry point

Wrapping the three modes is a config-file entry point that app code calls once at startup:

# mars_data_ops/utils/filesystem.py:539-614
def fs_access_from_config(config, use_managed_identity=False, configdet='filesystem'):
    account_name = config.get('azure.storage', 'AZURE_STORAGE_ACCOUNT')
    sas_token = config.get('azure.storage', 'AZURE_STORAGE_SAS_TOKEN', fallback=None)
    connection_string = config.get(
        'azure.storage', 'AZURE_STORAGE_CONNECTION_STRING', fallback=None,
    )
    # `root` is assumed to be set earlier in the full function (not shown in this excerpt).
    return config_storage_access(
        account_name, root=root,
        use_managed_identity=use_managed_identity,
        sas_token=sas_token,
        connection_string=connection_string,
    )

The implementation (filesystem.py:617-703) walks an explicit priority order: managed identity → connection string → SAS — first one set wins.
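
The stated priority order can be sketched as a short decision function. pick_auth_mode is hypothetical, and returning None when nothing is configured is an assumption; the real implementation may raise or fall back to something else.

```python
def pick_auth_mode(use_managed_identity=False, connection_string=None, sas_token=None):
    """Return the first auth mode that is configured, in priority order."""
    if use_managed_identity:
        return "managed_identity"
    if connection_string is not None:
        return "connection_string"
    if sas_token is not None:
        return "sas_token"
    return None  # assumption: nothing configured, the caller decides what to do
```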

The canonical flow (TL;DR)

End-to-end, the production pattern looks like this:

  1. App calls fs_access_from_config(config) → reads the [azure.storage] section.

  2. config_storage_access(...) sets AZURE_STORAGE_* env vars (or fetches a bearer token via DefaultAzureCredential for managed identity and sets AZURE_STORAGE_ACCESS_TOKEN).

  3. Code reads rasters via RasterioReader(...) from anywhere in the codebase. The reader wraps rasterio.open in rasterio.Env(**RIO_ENV_OPTIONS_DEFAULT) per call.

  4. GDAL picks credentials up from the process env — no per-call credential threading needed.

  5. Fallback when env-var auth misbehaves: pathasroothttps(path) builds an HTTPS URL with the SAS token embedded as a query string and RasterioReader reads that directly.

The Reader reconciliation design is about widening this seam: AsyncGeoTIFFReader plugs in here as an alternative implementation of the same interface, swapping GDAL VSI for direct HTTP-range / obstore reads. The credential pattern stays env-var-first for the GDAL-VSI default; the new path has its own credential locus — see reader_protocol.md §“Credential handling”. For a proposal that would reduce the env-var-soup ergonomics with a typed Credential Protocol, see plans/types/credentials.md.


10. Method reference (the public surface)

Lifecycle

Inspection (no I/O after __init__)

Configuration (mutating)

Read

Overviews


11. Sharp edges


12. Connection to the rest of the package

Where RasterioReader shows up downstream:

For curvilinear sensors (PRISMA, EnMAP, MODIS) RasterioReader is the wrong tool — those use a custom reader pattern routed through griddata (Chapter 7).

Next chapter: Window utils — the window/coordinate-system math that all of the above relies on.