
The Modern GeoStack

Motivation and ecosystem map for `geotoolz`, `xrtoolz`, and `GeoCatalog`

UNEP · IMEO · MARS

Status: living motivation document — companion to the per-design specs in plans/. Scope: the modern cloud-native geospatial ecosystem and the gaps that motivate geotoolz, xrtoolz, and GeoCatalog. Audience: anyone trying to understand why these libraries exist and where they sit in the broader stack.


Primer for newcomers

ELI5. Imagine a planetary shipping network. Warehouses all over the world hold the cargo. Trucks and ships move it between ports. Dock workers crack open containers and stack the contents on standard pallets, which feed factories that build finished products. Customers find what they want through a shipping registry, and they see the result in storefronts. Modern geospatial software is exactly this network — but the cargo is data (satellite pixels, climate cubes, building footprints) instead of widgets, and the warehouses span continents and petabytes. The libraries I’m building are the missing pieces in this network: better factories for raster and cube data, and a better registry to tie everything together.

The five-layer view

A flattened picture of the whole stack, before the detailed walk-through:

- Storefront — the USER: notebook · web map · tile server
- Factories — the LOGIC: pixel math · joins · regridding; compose, transform, reduce, render
- Shipping registry — DISCOVERY: "what exists, where it lives, what's inside"; a queryable metadata index
- Forklifts — the TRANSPORT: HTTP range requests · concurrent IO; only fetch the bytes you actually need
- Warehouses — the STORAGE: cloud object stores · S3 / GCS / Azure; raster files · cube stores · tabular files

Read it from the bottom: bytes sit in warehouses, forklifts haul them out only when asked, the registry tells the forklift which warehouse to go to, factories turn raw cargo into finished products, and the storefront shows it to the human. Every layer above is useless without the one below — but the registry is the layer most people skip and then regret.

Walking down the stack

The stack is eight conceptual layers — same anatomy whether the cargo is a satellite scene, a climate cube, or a building footprint. For each: what it does, how we used to do it, what changed, and what’s still awkward.

1. The warehouses (Storage). Where the data physically lives. Today these are object stores in the cloud — buckets full of files, stacked in racks the size of city blocks, billed by the byte. Old world: tape archives, FTP servers, lab NAS appliances. Data lived where the people who collected it happened to work, and you got a copy by emailing someone or scp-ing it overnight. What changed: one address — s3://…, gs://… — for everyone, with effectively infinite capacity, and fault-tolerance baked in. Still awkward: the bytes alone are dumb; nothing useful happens until something fetches them, and “where exactly is my pallet inside this warehouse?” is a real problem (see layer 7).

2. The forklifts (Transport). Getting bytes out of the warehouse and onto your laptop. Modern transport speaks HTTP with range requests — “give me bytes 4096–8191 of this 10 GB file” — and dispatches many in parallel. Old world: full-file FTP downloads, mounted network shares, vendor-specific transfer protocols. If you wanted one tile out of a satellite scene, you downloaded the whole scene. What changed: concurrency hides latency; you only pay for what you read; a single laptop can saturate a 10 Gbps link. Still awkward: range reads only help if the file format lets you compute which bytes you want without reading the whole thing first — most pre-2015 formats don’t.
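To make the range-request idea concrete, here is the whole trick in a few lines of Python: a minimal sketch using the `requests` library against a placeholder URL.

```python
import requests

# Ask the server for bytes 4096-8191 of a (hypothetical) 10 GB file.
url = "https://example.com/big_scene.tif"
resp = requests.get(url, headers={"Range": "bytes=4096-8191"})

assert resp.status_code == 206   # 206 Partial Content: the server honoured the range
chunk = resp.content             # 4096 bytes on the wire, not 10 GB
```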

3. The retrofit team (Virtualisation). A clever hack for legacy cargo. The world has petabytes of data sitting in monolithic file formats (HDF4, HDF5, NetCDF-3) that aren’t range-readable. Rather than rewrite all of it, the retrofit team scans each old crate once, writes a tiny “where everything is inside” index, and exposes the index so modern forklifts can pretend the old crate is a new one. Old world: migrate everything — years of effort, double storage costs while in flight, and political fights over who owns the canonical copy. What changed: zero-migration access to legacy archives — the old data didn’t move, but it now behaves cloud-native. Still awkward: one more layer of indirection to maintain, and the indices go stale when the underlying files change.

4. The dock workers (Readers). The code that actually opens a file and turns its bytes into something you can compute on. Modern readers are asynchronous and concurrent — many small range reads dispatched in parallel — and lazy, deferring work until you ask for the result. Old world: synchronous, one-file-at-a-time, blocking; or worse, click-through GUI dialogs in a vendor application (ArcToolbox, ENVI menus). What changed: a single Python process can drive thousands of concurrent reads. Still awkward: async code is harder to reason about, and each data geometry (raster vs cube vs vector) needs its own specialised reader — there’s no one universal “open this geo-thing” function.

5. The standard pallet (Substrate). Once cargo is unloaded, what shape does it take in memory? The modern answer is a typed carrier — an array (or table) that remembers its coordinates: which axis is latitude, which row is “Paris”, which column is “geometry”, what the units are, what projection the world is in. Old world: bare arrays plus sidecar files (.prj, .tfw, .hdr, .aux.xml) you had to read by hand and keep in sync. Lose the sidecar, lose the meaning. What changed: coordinate-aware operations don’t need manual bookkeeping; downstream code stays geographically honest. Still awkward: the three data geometries (raster / cube / vector) each have a different idea of “the right pallet” — there isn’t yet one universal carrier.

6. The factories (Compute / engines). Where the science actually happens — pixel math, regridding, climatologies, spatial joins, ensemble statistics. Modern factories are composable: small typed operators chained into pipelines, like Lego blocks for analysis. Old world: monolithic GIS suites (ArcGIS, ENVI, GRASS, IDL) where every workflow was a vendor-specific click-trail, plus a long tail of one-off scripts copy-pasted between projects. What changed: reusable, testable units that work the same in a notebook, a script, and a server. Still awkward: the factory ecosystem is fragmented — each data geometry has its own factory, and operators don’t always compose cleanly across them. (This is one of the gaps the libraries below try to close.)

7. The shipping registry (Discovery / Catalog). The single most underrated layer. A registry knows what exists, where each pallet is stored, and what’s inside each container — without having to physically open them. You query the registry first, get back a short list of warehouse addresses, then send the forklifts. Old world: directory-of-files plus a README, hand-maintained CSV indices, FGDC metadata XML no one updated, or full-blown enterprise spatial databases for the lucky few. At petabyte scale there are heavyweight standardised registries (you’ll meet them later); at lab scale most teams still wing it. What changed: the same columnar file format used for tabular data can hold the metadata index, queryable with the same SQL engine you already use for analytics — the registry becomes “just another file”. Still awkward: no universal agreement on what metadata to record, and registries decay if no one keeps the crawls running.

8. The storefront (Viz / UX). Where humans actually see the result. Modern storefronts are GPU-accelerated and in-browser, reading straight from cloud storage — interactive maps that fetch tiles on demand and render millions of features at 60 fps. Old world: render to PNG, save to disk, open in a desktop GIS (QGIS, ArcMap), or maintain a heavyweight tile-server farm pre-rendering images for the web. What changed: the same notebook that ran the analysis can also render the map; the same cloud bytes feed both. Still awkward: GPU memory is finite, browsers have limits, and “ship a million polygons to a tablet” is still a real engineering problem.

Two organising ideas

Borrowing the networking metaphor: the modern stack splits cleanly into a control plane (the registry: pure metadata about what exists and where it lives) and a data plane (transport and readers: the bytes themselves).

You need both. A registry with no readers is a phone book with no phones; readers with no registry means every workflow re-implements path-globbing and metadata parsing.

The three stacks

Geospatial data has three fundamentally different geometries, and each gets its own optimised stack — same eight layers, different specialisations in each row:

| Stack | Geometry | Cargo (Storage) | Pallet (Substrate) |
| --- | --- | --- | --- |
| Imagery | 2D / 3D pixel grids | tiled-and-overviewed raster files | typed georeferenced array |
| Dense | N-D coordinate cubes | chunked array stores | labelled-coordinate dataset |
| Vector | Discrete features | columnar tabular files | typed spatial table |

The registry sits above all three and uses the vector stack to govern the other two — i.e., it stores raster and cube metadata as rows in a columnar tabular file and queries them with SQL.

The rest of this document names the actual tools, walks each of the three stacks layer by layer, shows the byte-by-byte lifecycle of a query in each, and maps the gaps onto the libraries I’m building.


Why this exists

The cloud-native geospatial ecosystem is mature at the edges (storage formats, byte transport, viz) and fragmented in the middle (slicing, indexing, composition). Three concrete gaps motivate this work:

1. No unified compute layer for imagery. rio-tiler does web tiles, rasterio does GDAL-backed reads, eo-learn does workflow DAGs — but nothing composes pixel-level operators (band math, masking, regridding, sensor calibration) over a typed GeoTensor carrier with the same ergonomics as PyTorch’s nn.Sequential. → geotoolz.

2. No first-class operator algebra for dense cubes. xarray is a brilliant substrate (labelled arrays, coordinates, chunking) but operator composition still happens as ad-hoc function chains. Climatologies, regridding, EOF decompositions, and ensemble statistics all want a shared shape. → xrtoolz.

3. No lightweight catalog for personal/lab-scale work. STAC is the gold standard at petabyte scale, but spinning up a STAC API for a few hundred TB of project data is overkill. A GeoParquet-backed catalog queryable with DuckDB closes the gap between “directory of files” and “full STAC deployment”. → GeoCatalog (see plans/geodatabase/).


The unified picture

The Grand Unified GeoStack

The full stack, with GeoCatalog as the discovery layer that ties the three data-plane stacks together:

- Viz / UX — lonboard · titiler · Felt
- Logic / Engines — geotoolz (imagery) · xrtoolz (dense) · DuckDB (vector)
- Discovery / The Glue — ★ GeoCatalog ★ (GeoParquet / GeoJSON / STAC)
- Substrate / Readers — georeader · xarray · GeoPandas · async-geotiff
- Transport / IO — obstore · kerchunk · GDAL / VSI
- Storage / Cloud — COG · Zarr · GeoParquet

The catalog is the glue

The key move across all three stacks: the vector stack (GeoParquet metadata, DuckDB query) is what makes the imagery and dense stacks navigable. Without a catalog, every workflow re-implements `glob()` + STAC parsing, and the boilerplate compounds across projects. Each stack below ends with a "lifecycle of a query" sequence diagram showing exactly how the catalog hand-off works for that data geometry.


The three stacks

Each stack is a specialised silo optimised for its data geometry, but they share a common transport foundation (obstore) and a common discovery layer (GeoCatalog).

Shared anatomy

The same six layers repeat across all three stacks; only the row contents change.

| Layer | Imagery | Dense | Vector |
| --- | --- | --- | --- |
| Storage | COG | Zarr · HDF5 · NetCDF | GeoParquet · PostgreSQL · FlatGeobuf |
| Transport | obstore · GDAL/VSI | obstore · kerchunk | obstore · GDAL/VSI |
| Readers | RasterioReader · async-geotiff · rio-tiler | zarr-python · icechunk · kerchunk · lazycogs | pyogrio · PostGIS · SedonaDB |
| Substrate | GeoTensor · GeoSlice (georeader) | DataArray · Dataset · chunks (xarray) | GeoArrow · GeoDataFrame |
| Compute | geotoolz | xrtoolz | DuckDB · PostGIS · SedonaDB |
| Viz | lonboard · titiler · terracotta | lonboard · Holoviz · Datashader | lonboard · Felt |

Stack A — Imagery (Rasterio territory)

Focus. 2D/3D satellite and aerial imagery. Philosophy. Window-based processing — crop spatial slices from massive COGs via HTTP range requests.

- Viz / UX — lonboard (Jupyter / GPU) · titiler (flexible API)
- Compute — geotoolz (Operator · Sequential · Graph; GeoTensor in / out)
- Substrate — georeader (GeoTensor · GeoSlice · GeoCatalog)
- Readers — RasterioReader · async-geotiff · rio-tiler
- Transport — GDAL / VSI (legacy IO) · obstore (Rust byte mover)
- Storage — COG (Cloud Optimized GeoTIFF)

Layers

Storage — COG (Cloud Optimized GeoTIFF). A standard GeoTIFF with two structural choices: pixels are organised into internal tiles (typically 256×256 or 512×512), and decimated overviews (½, ¼, ⅛ resolution) are appended to the same file. The IFD (Image File Directory) at the start of the file lists the byte offset and length of every tile and every overview. Because clients can read the IFD with a single small HEAD/GET, then fetch only the tiles overlapping their AOI via HTTP range requests, a 10 GB Sentinel-2 scene becomes a few-MB read for a small region. Advantage: arbitrary spatial subsetting over HTTP without downloading the whole file.
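As a sketch of what this buys you in practice (URL and bounds are placeholders), rasterio's windowed reads follow exactly this pattern: one small header read, then range requests for only the tiles that overlap the AOI.

```python
import rasterio
from rasterio.windows import from_bounds

# /vsicurl/ tells GDAL to read the remote file via HTTP range requests.
url = "/vsicurl/https://example.com/sentinel2_scene.tif"

with rasterio.open(url) as src:                      # one small header read (the IFD)
    window = from_bounds(2.25, 48.81, 2.42, 48.90,   # AOI in the file's CRS
                         transform=src.transform)
    chip = src.read(1, window=window)                # fetches only the overlapping tiles
```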

Transport — obstore and GDAL/VSI. Two parallel paths into cloud storage. obstore is a Python wrapper around the Rust object_store crate — it speaks S3, GCS, Azure Blob, and HTTP natively, dispatches concurrent range requests, and benchmarks 2–5× faster than fsspec-based alternatives for byte-range workloads. GDAL/VSI is the C++ “Virtual File System” baked into GDAL (vsis3://, vsigs://, vsicurl:// etc.), which is what rasterio uses by default — mature, ubiquitous, but synchronous, and configured through environment variables (AWS_*, GDAL_HTTP_*). Advantage: obstore for new async pipelines, GDAL/VSI for the long tail of formats GDAL already understands.
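A sketch of the obstore path, assuming its current Python surface (bucket name and object key are placeholders, and keyword names may differ between releases):

```python
import obstore
from obstore.store import S3Store

# Public bucket, unsigned requests (placeholder names).
store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)

# Dispatch a batch of range reads concurrently, e.g. the tiles overlapping an AOI.
buffers = obstore.get_ranges(
    store,
    "tiles/B04.tif",
    starts=[4_096, 131_072],
    ends=[8_192, 262_144],
)
```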

Readers — RasterioReader, async-geotiff, rio-tiler. This is where bytes become arrays.

Advantage: pick the reader that matches your concurrency model — sync for scripts, async for high-concurrency servers and batch fan-out.

Substrate — georeader. A small object model that lets every reader hand back the same shape of object, so downstream code doesn’t care which reader produced it.

Advantage: a typed carrier means coordinate metadata propagates automatically through pipelines, no crs= arguments threaded through every call.

Compute — geotoolz. Operator algebra over GeoTensor. A single Operator is a typed function GeoTensor → GeoTensor; Sequential composes a linear chain (e.g. mask → calibrate → NDVI); Graph handles multi-input DAGs (e.g. fusing radar + optical inputs). Mirrors the ergonomics of PyTorch’s nn.Sequential/nn.Module but for georeferenced rasters. Advantage: composable, testable units with carrier-aware semantics. See plans/geotoolz/.
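Since geotoolz is still at the design stage, here is a self-contained sketch of the pattern the spec describes; the class names follow the text above, but everything else is illustrative rather than the settled API.

```python
from typing import Sequence

class Operator:
    """A typed function: carrier in, carrier out (GeoTensor -> GeoTensor in the spec)."""
    def __call__(self, x):
        raise NotImplementedError

class Sequential(Operator):
    """Linear chain of operators, mirroring torch.nn.Sequential."""
    def __init__(self, ops: Sequence[Operator]):
        self.ops = list(ops)
    def __call__(self, x):
        for op in self.ops:
            x = op(x)
        return x

class NDVI(Operator):
    """Band math: (nir - red) / (nir + red), assuming a leading band axis."""
    def __init__(self, red: int = 0, nir: int = 1):
        self.red, self.nir = red, nir
    def __call__(self, x):
        return (x[self.nir] - x[self.red]) / (x[self.nir] + x[self.red])

pipeline = Sequential([NDVI()])   # mask -> calibrate -> NDVI would chain the same way
```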

Viz / UX — titiler, terracotta, lonboard.

Advantage: dynamic vs pre-baked tile serving for production; in-notebook GPU rendering for exploration.

Lifecycle of a query

A user asks for a low-cloud Sentinel-2 mosaic of Paris in 2024:

Participants: User (U) · GeoCatalog (C) · Reader (R, georeader) · obstore (O) · S3 (S).

1. U → C: SQL query `bbox ∩ Paris ∧ date ∈ 2024 ∧ cloud < 10`
2. C → U: list of COG URLs + metadata
3. U → R: `open(urls, slice=GeoSlice(bbox))`
4. R → O: HEAD + small range read; O → S: HTTP Range (header); S → O: COG IFD bytes
5. O → R: tile offsets + nodata + CRS; R computes the overlapping tile indices
6. Parallel tile fetch, per tile k: R → O: range read (tile k); O → S: HTTP Range (tile k); compressed tile bytes flow back
7. R: decompress + assemble window
8. R → U: GeoTensor (CRS + affine + array)

The trick. The reader makes two round trips — first a cheap header read to learn tile layout, then a fan-out of concurrent range reads for just the tiles that overlap the AOI. That’s how you turn a 10 GB scene into a 3 MB transfer.

What’s specific. GDAL/VSI is the legacy transport for synchronous readers; obstore is the modern transport for async readers. geotoolz carries GeoTensor (CRS + affine + array) end-to-end so operators stay coordinate-aware.

Stack B — Dense (Xarray territory)

Focus. Multi-dimensional scientific grids — climate, weather, oceanography. Philosophy. Coordinate-based DataCubes. Time, altitude, and variable are physical dimensions of a single labelled array.

- Viz / UX — lonboard (GPU render) · Holoviz (aggregation)
- Compute — xrtoolz (N-D algebra · regridding · means)
- Substrate — xarray (coordinates · dimensions · chunks)
- Readers — Zarr (standard) · Icechunk (versioned) · stackstac · odc-stac · lazycogs
- Virtual — kerchunk (HDF5 / NetCDF virtualisation)
- Transport — obstore
- Storage — Zarr · HDF5 · NetCDF

Layers

Storage — Zarr, HDF5, NetCDF. All three store N-D arrays plus metadata, but with very different physical layouts.

Advantage of Zarr: parallel reads/writes scale linearly with object-store concurrency. Advantage of HDF5/NetCDF: ubiquitous in climate science and decades of tooling.

Transport — obstore. Same Rust-backed mover as Stack A. Zarr’s “many small files” workload is exactly what object_store is tuned for — concurrent GETs without the per-request overhead of s3fs/gcsfs. Advantage: makes a Zarr that lives in S3 feel like a Zarr that lives on a fast local disk.

Virtualisation — kerchunk / VirtualiZarr. A clever trick for the legacy formats: scan an HDF5/NetCDF file once, record the byte offsets of every chunk into a small JSON or Parquet “reference file”, then expose that reference as a virtual Zarr store. Readers see a Zarr; under the hood obstore is doing range reads into the original HDF5. Advantage: no rewriting of petabytes of legacy data — you get cloud-native access patterns for free.
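The pattern in code, a sketch using kerchunk's documented single-file path (paths are placeholders):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://bucket/legacy/era5_1990.nc"          # placeholder legacy NetCDF-4/HDF5

# Scan once: record the byte offset + length of every internal chunk.
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Readers now see a virtual Zarr; range reads hit the original HDF5 bytes.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": refs,
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```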

Readers — zarr-python, Icechunk, stackstac / odc-stac, lazycogs.

Advantage: zarr-python for vanilla cubes, Icechunk when you need ACID-like guarantees on a writable cube, stackstac / odc-stac for STAC-driven cube assembly via GDAL, lazycogs for the same pattern with a Rust-native I/O stack and STAC-geoparquet predicate pushdown.

Substrate — xarray. The de-facto Python model for labelled N-D arrays.

Advantage: coordinate-aware indexing (ds.sel(time="2024-06", lat=slice(45, 50))) means you write the science in the units of the science, not array indices.

Compute — xrtoolz. Operator algebra over xarray.Dataset/DataArray, mirroring geotoolz but for cubes. Targets the recurring patterns: climatologies (rolling means, seasonal anomalies), regridding (between map projections), EOF / spectral decompositions, and ensemble statistics. Advantage: composable units that respect xarray’s labelled-coordinate semantics, instead of ad-hoc groupby().mean() chains scattered across notebooks.
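For a concrete instance of the recurring patterns named above, here is the monthly-climatology-plus-anomaly chain that xrtoolz would wrap as a named, composable unit (synthetic data stands in for a real SST cube):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for an SST cube.
time = pd.date_range("1990-01-01", "1994-12-01", freq="MS")
sst = xr.DataArray(
    np.random.rand(len(time), 4, 4),
    coords={"time": time,
            "lat": np.linspace(20, 60, 4),
            "lon": np.linspace(-80, 0, 4)},
    dims=("time", "lat", "lon"),
    name="sst",
)

clim = sst.groupby("time.month").mean("time")   # (12, lat, lon) climatology
anom = sst.groupby("time.month") - clim         # seasonal anomaly, same shape as sst
```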

Viz / UX — lonboard, HoloViz (Datashader, hvPlot).

Advantage: GPU rendering for interactive exploration; server-side aggregation for browser-friendly delivery of huge cubes.

Lifecycle of a query

A climate scientist asks for monthly mean SST over the North Atlantic for 1990–2024:

Participants: User (U) · GeoCatalog (C) · xarray (X) · obstore (O) · S3 (S).

1. U → C: query `collection=ERA5, var=sst`
2. C → U: Zarr URL
3. U → X: `xr.open_zarr(url)`
4. X → O: GET `.zmetadata` (consolidated); O → S: HTTP GET; S → O: schema + chunk grid + coords
5. X → U: lazy Dataset (no chunks read yet)
6. U → X: `ds.sst.sel(lat=slice(20, 60), lon=slice(-80, 0), time=slice("1990", "2024"))`
7. X: map coord ranges → chunk indices
8. Parallel chunk fetch, per chunk (ti, latj, lonk): X → O: GET chunk; O → S: HTTP GET; compressed chunk bytes flow back
9. X: decompress + assemble + (optional) Dask graph
10. U: `.groupby("time.month").mean()` → DataArray (12, n_lat, n_lon)

The trick. Coordinate selection (.sel) is translated into chunk indices before any bytes are read; only the chunks intersecting the requested coordinate hyperrectangle are fetched. Lazy evaluation through Dask means downstream reductions (.mean()) collapse the chunk-fetch + compute graph into a single optimised pass.
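The same lifecycle as code; the URL is a placeholder, and everything up to `.compute()` moves metadata only:

```python
import xarray as xr

# Metadata-only open: one GET for the consolidated .zmetadata.
ds = xr.open_zarr("s3://bucket/era5.zarr", storage_options={"anon": True})

subset = ds.sst.sel(                             # still lazy: coords -> chunk indices
    lat=slice(20, 60), lon=slice(-80, 0), time=slice("1990", "2024")
)
monthly = subset.groupby("time.month").mean()    # reduction fused into the graph
result = monthly.compute()                       # only now are chunks fetched
```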

What’s specific. A virtualisation layer (kerchunk) sits between readers and transport — it makes legacy HDF5/NetCDF behave like Zarr without rewriting the underlying files. xrtoolz provides operator composition over Dataset/DataArray (climatologies, regridding, EOFs, ensemble statistics).

Stack C — Vector (GeoPandas territory)

Focus. Discrete features — points, lines, polygons, and their attributes. Philosophy. Tabular and relational. Geography is just another column in a high-performance database.

- Viz / UX — lonboard (GeoArrow / GPU) · Felt (collaborative)
- Engines — DuckDB SQL · GeoPandas (spatial joins · buffers · filters)
- Substrate — GeoArrow (zero-copy tabular spatial streams)
- Readers — pyogrio (fast OGR) · PostGIS (database) · SedonaDB (big data)
- Transport — obstore
- Storage — GeoParquet · PostgreSQL · FlatGeobuf

Layers

Storage — GeoParquet, PostgreSQL/PostGIS, FlatGeobuf.

Advantage: GeoParquet for analytical workloads on cloud storage; PostGIS for transactional / multi-user; FlatGeobuf for lightweight HTTP-only access patterns.

Transport — obstore, GDAL/VSI. obstore pulls Parquet row groups from cloud storage; GDAL/VSI is what pyogrio uses to interface with the long tail of vector formats (GPKG, SHP, GeoJSON, KML, …). Advantage: same dual-path story as Stack A — modern async path for Parquet, mature GDAL path for legacy formats.

Readers — pyogrio, PostGIS, Sedona.

Advantage: pyogrio for single-node analytics, PostGIS for transactional, Sedona for distributed.

Substrate — GeoArrow, GeoPandas.

Advantage: GeoArrow’s zero-copy hand-off eliminates the “serialise to GeoJSON, parse, re-serialise” tax that used to dominate vector pipelines.

Compute / Engines — DuckDB (with spatial), PostGIS, Sedona.

Advantage: DuckDB collapses a whole category of “load shapefile → filter in pandas” workflows into a single SQL statement against cloud storage. This is the engine that powers GeoCatalog.
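A sketch of that single SQL statement from Python; the path, column names, and polygon are placeholders:

```python
import duckdb

con = duckdb.connect()
for ext in ("spatial", "httpfs"):        # geometry functions + S3 range reads
    con.install_extension(ext)
    con.load_extension(ext)

flood_zone = "POLYGON((2.2 48.8, 2.5 48.8, 2.5 49.0, 2.2 49.0, 2.2 48.8))"
buildings = con.sql(f"""
    SELECT *
    FROM read_parquet('s3://bucket/buildings.parquet')
    WHERE ST_Intersects(geom, ST_GeomFromText('{flood_zone}'))
""").arrow()   # Arrow batches: zero-copy hand-off to lonboard / GeoPandas
```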

Viz / UX — lonboard, Felt.

Advantage: GPU-rendered interactive exploration in notebooks; collaborative web sharing for handoff.

Lifecycle of a query

An analyst asks for all building footprints in a flood zone:

Participants: User (U) · DuckDB (D, spatial) · obstore (O) · S3 (S).

1. U → D: `SELECT * FROM 's3://buildings.parquet' WHERE ST_Intersects(geom, :flood_bbox)`
2. D → O: GET Parquet footer (small range); O → S: HTTP Range; S → O: row-group stats (bbox per group)
3. D: footer parsed; predicate pushdown drops the row groups that cannot intersect
4. Parallel row-group fetch, per surviving group k: D → O: GET row-group k (geom + attr columns only); O → S: HTTP Range; column chunks flow back
5. D: Arrow record batch; ST_Intersects on candidates (refine)
6. D → U: GeoArrow result table; zero-copy hand-off to lonboard / GeoPandas

The trick. Two stages of filtering: (1) row-group pruning using the bbox statistics in the Parquet footer skips entire row groups that can’t possibly intersect, and only the surviving groups are fetched. (2) column pruning means non-needed attribute columns are never read off the wire. The result lands as Arrow buffers, so the hand-off to lonboard or GeoPandas is zero-copy.

What’s specific. GeoArrow enables zero-copy hand-off — DuckDB can stream query results straight into lonboard without re-serialisation. This is also the stack that powers GeoCatalog: the catalog is a GeoParquet file queried with DuckDB.


Where my libraries fit

Most of the modern stack is already there. Two of the three libraries below (georeader modernisation, GeoCatalog) are reconciliation work — taking pieces that already work and giving them a coherent surface. The third (geotoolz + xrtoolz) is the genuinely missing piece: the factory layer that doesn’t yet have a good general-purpose shape.

- User — notebook · API · map
- Logic — ★ geotoolz · ★ xrtoolz (future: geoarrax, the JAX bridge)
- Discovery — ★ GeoCatalog (GeoParquet + DuckDB)
- Substrate — ★ georeader (unified Reader Protocol)
- Transport — obstore · GDAL / VSI
- Storage — COG · Zarr · GeoParquet

Stars mark where my libraries plug in. Read it from the top: a user wants something → engines compose carriers → the registry tells them which files to open → readers translate cloud bytes into typed carriers → transport hauls only the bytes needed → cloud storage is the source of truth.


1. Modernising georeader (reconciliation, not invention)

georeader already gives us a clean object model: GeoTensor (georeferenced array), GeoSlice (declarative crop), and a small reader surface anchored on RasterioReader. What it doesn’t yet have is a protocol — a shared shape every reader can satisfy — which means async / GDAL-free readers reinvent the substrate every time.

What plugs in.

What falls out for free:

- Today's `GeoData` / `GeoDataBase` Protocols stay as the sync surface; we add a single new `AsyncGeoData` Protocol mirror (~30 LOC) for the async reader.
- Cross-cutting types (`GeoSlice`, `Credential`) live in a shared home in plans/types/.
- The existing `RasterioReader` keeps doing what it does well.
- The new async reader is a thin adapter rather than a from-scratch COG reimplementation.
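A sketch of what that ~30-line mirror might look like; the method names here are assumptions for illustration, not the settled design:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class AsyncGeoData(Protocol):
    """Async mirror of the sync GeoData Protocol (illustrative names only)."""

    crs: str   # a pyproj.CRS in practice

    async def load(self, window: "GeoSlice") -> "GeoTensor":
        """Fetch the pixels overlapping `window`, return a typed carrier."""
        ...
```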

What about lazycogs? developmentseed/lazycogs is excellent — but it returns xarray.DataArray, not GeoTensor, so it belongs in Stack B (Dense / xarray) above as a STAC-driven peer of stackstac / odc-stac. A sync GDAL-free GeoTensor reader for geotoolz was floated and deferred to v0.5+ (no clear customer that RasterioReader + AsyncGeoTIFFReader don’t already cover); see plans/georeader/README.md open question §3.

Sensor-specific readers built on the same Protocol. Once the Reader Protocol stabilises, per-sensor readers slot in without re-inventing the substrate — each handles its own quirks (+proj=geos affines, bowtie distortion, irregular file formats) but emits the same GeoTensor carrier:

See plans/readers/ for the per-sensor designs.

Honest framing. This isn’t a new library. It’s a refactor that turns several already-working tools into a coherent stack with a single Reader Protocol surface. It’s the prerequisite for everything else belowgeotoolz operators don’t compose if every reader hands back a different shape of object.

→ Designs: plans/georeader/, plans/types/, plans/readers/.


2. GeoCatalog — the registry as a file

The pitch in one line: store metadata as GeoParquet, query it with the tool that fits the scale — GeoPandas for most cases, DuckDB when the catalog gets big.

Why now. STAC is the gold-standard registry at petabyte scale where you have a team running an API. It’s overkill for the lab-scale problem most teams actually have: a few TB of project data on object storage that you want to slice by bbox + date + attribute without spinning up infrastructure. Meanwhile GeoParquet, GeoPandas, and DuckDB Spatial have matured to the point where the metadata index can be just another file.

Two backends, one API.

The point: GeoParquet is the artifact, GeoPandas is the default tool, DuckDB is the scale-up. You can do 90% of analysis with just the first two — gpd.read_parquet(...) plus the geopandas API you already know — and the third joins in transparently when row counts demand it.
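A sketch of that 90% path (file name and column names are placeholders; the `bbox` filter assumes the catalog carries a GeoParquet 1.1 bbox covering column):

```python
import geopandas as gpd

# Spatial pushdown: only row groups whose bbox intersects the query box are read.
cat = gpd.read_parquet("catalog.parquet", bbox=(2.2, 48.8, 2.5, 49.0))

hits = cat[
    (cat["datetime"] >= "2024-01-01")
    & (cat["datetime"] < "2025-01-01")
    & (cat["cloud_cover"] < 10)
]
urls = hits["href"].tolist()   # hand these straight to the readers
```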

What’s in the API.

Honest framing. Nothing here is new technology. The contribution is the convention — the canonical schema, the IntervalIndex+R-tree pairing, the GeoParquet 1.1 bbox column convention — plus a thin Python API that hides the schema and connection plumbing. STAC-compatible read/write keeps the door open when you scale past lab size.

→ Designs: plans/geodatabase/geocatalog.md (Phase 1, GeoPandas), plans/geodatabase/geoduckdb.md (Phase 2, DuckDB).


3. geotoolz + xrtoolz — the factory layer that's missing

This is the part of the stack where I keep hitting the same gap, regardless of which institution or project I’m working at. The storage / transport / discovery layers have gotten brilliant. The research-notebook → operational-pipeline leap is still mostly hand-rolled glue.

What shape is actually missing?

There are operator-flavoured libraries in the geospatial ecosystem. Each is excellent at what it does. None of them hits all three of typed carrier + composable algebra + backend-agnostic execution:

What’s missing is a general-purpose operator algebra over typed georeferenced carriers — the nn.Module/Sequential/Graph shape, but for GeoTensor (raster) and Dataset/DataArray/coordax.Array (cube). The libraries above each fill one or two of those properties; none combines all three. coordax gets us closest, and is treated below as a likely foundation for the cube side rather than competition.

The two-layer ladder (Keras-style progression of complexity)

One of the best lessons from Keras is that complexity should be opt-in. Beginners reach for model = Sequential([Dense(10), Dense(1)]); experts reach for the functional API or Module subclasses. The same library scales with the user’s expertise. geotoolz and xrtoolz apply the same lesson:

The same operation lives at both layers — NDVI()(tensor) and ndvi(np.asarray(tensor)) produce the same result. The choice is ergonomic, not semantic. This is the on-ramp that lets a researcher write a one-liner today, lift it into a pipeline tomorrow, and ship it as a serving endpoint next month — without rewriting.
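In code, the equivalence claim looks like this; a self-contained sketch in which plain ndarrays stand in for GeoTensor, which would additionally carry CRS + affine through the class tier:

```python
import numpy as np

def ndvi(x: np.ndarray) -> np.ndarray:
    """Functional tier: plain array in, plain array out (band 0=red, 1=nir)."""
    return (x[1] - x[0]) / (x[1] + x[0])

class NDVI:
    """Operator tier: same math behind a composable, carrier-aware surface."""
    def __call__(self, tensor):
        return ndvi(tensor)   # ufunc-pure, so a GeoTensor subclass keeps its metadata

x = np.random.rand(2, 64, 64)                      # (band, y, x)
assert np.allclose(NDVI()(x), ndvi(np.asarray(x)))  # same result, both tiers
```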

(geotoolz deliberately ships only these two tiers, where its sibling xrtoolz ships three. The asymmetry is architectural: GeoTensor is an np.ndarray subclass with __array_ufunc__, so metadata round-trips through ufunc-pure primitives for free, and non-ufunc primitives get wrapped at the Operator boundary in a single line. xarray.DataArray is composition over ndarray, not a subclass — its middle “tensor function” tier earns its keep. Substrate dictates tier count.)

The pattern

Why this matters — the use-case spectrum

The same operator graph supports:

The throughline: the gap I keep running into, across affiliations and projects, is the research-to-production translation step. The modern stack has gotten brilliant at storage and transport; the factory layer that lets a research idea graduate into operational infrastructure without a rewrite is still mostly hand-rolled. That’s what these libraries aim at.

The scale spectrum

The same composition pattern carries across data sizes — the carrier determines the surface, the backend determines the scale:

| Regime | Data size | Backend | Typical workflow |
| --- | --- | --- | --- |
| Small | fits in RAM | numpy | interactive notebook, single-scene analysis — operators run as-is |
| Medium | fits on one box, doesn't fit in RAM | Dask, async batch, Ray | per-tile parallel processing, daily-scale pipelines — same operators dispatched per chunk/tile |
| Big | needs a cluster | distributed JAX (via coordax) | continental-scale ensemble analyses — same composition pattern, JAX-backed carriers |

The architectural commitment: operators don’t bake in a backend. They run on whatever the carrier wraps.

Honest scope. “Same operator everywhere” holds for single-machine numpy, Dask-orchestrated batch, and distributed JAX — same composition pattern, sometimes with backend-specific implementations under one signature. Sedona / Spark and other distributed-SQL engines are not covered by the same code path; if you need them, the operator graph would have to emit Sedona SQL, which is a separate library and not currently planned. We’d rather deliver the 80% case cleanly than over-promise the 100%.

Honest research-to-prod scope

“Production” is a slippery word. The marquee pitch — the same operator graph runs in a notebook and a serving endpoint — holds for these targets:

What is not directly covered:

The 80% case (notebook → API / pipeline / orchestrator / batch) is the deliverable target. The 20% (streaming, distributed SQL, edge) is acknowledged out-of-scope rather than promised and missed.

Specialisations


4. Supporting infrastructure

The cross-cutting types that flow between layers each get their own design — small, but load-bearing.


Future work — JAX bridge (via coordax)

GeoTensor is a numpy.ndarray subclass — fine for 99% of cases, but it walls off differentiability. The future-work direction is a thin bridge that translates GeoTensor ↔ jax.Array (most likely via coordax’s coordinate-aware array) while preserving CRS / transform metadata, opening the door to:

The bridge does not need to be invented from scratch — coordax is JAX-native, coordinate-aware, and probably the right backbone. The real work is the GeoTensor ↔ coordax.Array adapter and the CRS-preserving operator implementations, not a new array library. (An earlier working name for this glue was geoarrax; if coordax keeps maturing, even the glue may shrink to a handful of converter functions.)
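If the glue really does shrink to converter functions, it might look no bigger than this (a sketch; the function names and the metadata tuple are assumptions, and the real adapter would target coordax's coordinate-aware array):

```python
import jax.numpy as jnp
import numpy as np

def to_jax(array: np.ndarray, crs, transform):
    """Strip geo metadata; hand the raw buffer to JAX for differentiable work."""
    return jnp.asarray(array), (crs, transform)

def from_jax(jax_array, meta):
    """Re-attach CRS + affine after the JAX computation."""
    crs, transform = meta
    return np.asarray(jax_array), crs, transform
```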

Out of scope for the initial library push; flagged here so the architecture leaves room for it — operators avoid numpy-only assumptions, carriers stay duck-typed where possible.


Reference: library × layer × design

| Library / module | Layer | Status | Design doc |
| --- | --- | --- | --- |
| georeader (modernised) | Substrate / Readers | Reconciliation refactor | plans/georeader/ |
| Sensor-specific readers (ABI, SEVIRI, MODIS, VIIRS, …) | Substrate / Readers | New, on shared Reader Protocol | plans/readers/ |
| geotoolz.catalog — Phase 1 InMemoryGeoCatalog (GeoPandas + R-tree + IntervalIndex) | Discovery | New — geotoolz submodule (incubates → standalone geocatalog at maturity) | plans/geodatabase/geocatalog.md |
| geotoolz.catalog — Phase 2 DuckDBGeoCatalog (GeoParquet + DuckDB spatial) | Discovery | Same submodule, second backend | plans/geodatabase/geoduckdb.md |
| geotoolz.ops (Operator, Sequential, Graph, ModelOp, sensor presets) | Logic (imagery) | New — geotoolz library, stable core | plans/geotoolz/ |
| geotoolz.patch (Patcher, PatchGeometry, Sampler, Window, Aggregation; streaming + hierarchical) | Logic (locality) | New — geotoolz submodule (incubates → standalone geopatcher at maturity) | plans/geopatcher/ |
| xrtoolz (operator algebra over xarray / coordax) | Logic (cubes) | External library | github.com/jejjohnson/xr_toolz |
| geotoolz.types.GeoSlice (cross-cutting wire format; samplers live in geotoolz.patch) | Cross-cutting types | geotoolz submodule | plans/types/geoslice.md |
| Credential + per-cloud subclasses | Cross-cutting types | New | plans/types/credentials.md |
| geotoolz.io.open_store(url) (over upstream obspec.AsyncStore) | Cross-cutting types | New (~30 LOC factory only) | plans/types/bytestore.md |
| JAX bridge (via coordax) | Logic ↔ JAX | Future work | TBD |

See also

The success-story anchor

The motivation in this doc is anchored on a concrete, projected end-to-end pipeline: MARS-style methane attribution rebuilt on the unified stack across three instruments (TROPOMI + GHGSat + EMIT) and three orchestration phases (research notebook → batch pipeline → FastAPI alert service). The full plan lives in:

It’s the single best demonstration of the “same operator graph runs in research and production” claim — and the place where the geotoolz / xrtoolz / GeoCatalog substrate, the per-sensor readers, and the plumax science modules all compose into one operational artifact.