Report 6 — geocatalog: standalone spatiotemporal-index package¶
| Status | Scoping proposal — committed |
| Reading time | ~15 min |
| Decisions locked in | GeoSlice lives in geocatalog (it’s the catalog’s wire format). The whole ecosystem develops in one monorepo (uv/hatch workspaces), publishes as separate packages. |
| Audience | Anyone reviewing the catalog split before code moves |
| Companion reports | Reports 1–5 (pipekit stack), Report 7 (geopatcher) |
| Inputs | jejjohnson/geotoolz @ main (rescraped 2026-05-16), specifically geotoolz.catalog (~3,600 LOC) and geotoolz.types.GeoSlice (~200 LOC) |
STAC background. For the standards
geocatalogbuilds on, see the tutorial group: STAC primer, REST API primer, Storing STAC catalogs (GeoJSON / GeoParquet / PostGIS), pystac and rustac clients, and a curated catalog survey.
Part 1 — Where geocatalog sits in the stack¶
The full ecosystem after all proposed splits:
┌──────────────────────────────────────────────────────────┐
│ georeader (GeoTensor, GeoData Protocol, RasterioReader, │
│ AsyncGeoTIFFReader — the I/O substrate) │
└──────────────────────────────────────────────────────────┘
▲
┌─────────────────┼─────────────────┐
│ │ │
┌──────────┴───────┐ ┌──────┴──────┐ ┌──────┴──────────┐
│ geocatalog │ │ pipekit │ │ geopatcher │
│ (this report) │ │ (framework)│ │ (sibling report)│
│ — GeoSlice │ │ │ │ — Field / │
│ — GeoCatalog │ │ │ │ Domain │
│ — backends │ │ │ │ — Patchers │
│ — builders │ │ │ │ — Aggregations │
│ — loaders │ │ │ │ │
└──────────────────┘ └─────────────┘ └─────────────────┘
▲ ▲ ▲
└─────────────────┼─────────────────┘
│
┌─────────────────┴─────────────────┐
│ geotoolz │
│ (GeoTensor domain operators: │
│ radiometry, indices, cloud, │
│ spectral, viz, plume, readers, │
│ presets, …) │
│ │
│ Also: xr_toolz on the │
│ xarray-flavoured side │
└───────────────────────────────────┘Three observations from this picture:
geocatalogsits at the same level aspipekitandgeopatcher. None depend on each other. All three sit abovegeoreader(catalog usesGeoTensoras one of several loader return types; patcher usesGeoDataProtocol viaRasterField). All three are consumed bygeotoolzandxr_toolz.geocatalogis a peer library, not a framework subdivision. It owns the spatiotemporal-index domain end-to-end: Protocol, backends, builders, loaders, set algebra, GeoParquet round-trip.georeaderis the substrate. Catalog’s loaders returnGeoTensor(raster path); patcher’sRasterFieldwraps anyGeoData.geocatalogandgeopatcherare both georeader consumers, not georeader replacements.
Part 2 — What’s in geocatalog¶
From the rescrape: 11 files, ~3,600 LOC, fully implemented Phase 1 + part of Phase 2.
2.1 Source layout¶
| File | LOC | Purpose |
|---|---|---|
_src/base.py | 236 | CatalogRow, GeoCatalog Protocol |
_src/slice.py | ~200 | GeoSlice (moves here from geotoolz.types) |
_src/memory.py | 378 | InMemoryGeoCatalog — GeoDataFrame + R-tree + IntervalIndex |
_src/duckdb_backend.py | 803 | DuckDBGeoCatalog — lazy SQL over GeoParquet 1.1 |
_src/raster.py | 444 | build_raster_catalog, load_raster, load_raster_timeseries |
_src/xarray_backend.py | 377 | build_xarray_catalog, load_xarray |
_src/vector.py | 414 | build_vector_catalog, load_vector |
_src/streaming.py | 626 | Streaming-write infrastructure for large catalogs |
_src/parquet.py | 112 | to_geoparquet / from_geoparquet round-trip |
_src/ops.py | 103 | query, intersect, union set algebra |
_src/domain.py | 109 | CatalogDomain adapter (bridges to geopatcher) |
2.2 Public surface¶
# Core
from geocatalog import (
GeoCatalog, # Protocol
CatalogRow, # backend-neutral row view
GeoSlice, # bounded request for data
)
# Backends
from geocatalog import (
InMemoryGeoCatalog, # GeoDataFrame backend, Phase 1
DuckDBGeoCatalog, # SQL backend, Phase 2 (extras-gated)
)
# Factory
from geocatalog import open_catalog # picks a backend for a GeoParquet artifact
# Builders (extras-gated by carrier)
from geocatalog import (
build_raster_catalog,
build_xarray_catalog,
build_vector_catalog,
)
# Loaders (extras-gated by carrier)
from geocatalog import (
load_raster,
load_raster_timeseries,
load_xarray,
load_vector,
)
# Set algebra
from geocatalog import query, intersect, union
# Artifact round-trip
from geocatalog import to_geoparquet, from_geoparquet
# Patcher bridge
from geocatalog import CatalogDomain # adapter for geopatcher.SpatialPatcher2.3 Why GeoSlice lives here¶
GeoSlice is the catalog’s wire format — the bounded request for data that catalogs produce and loaders consume. The dataclass header in the current code is explicit: “the unit of work between catalog, sampler, loader, operator.”
Three reasons it belongs in geocatalog:
Catalog is the primary producer.
catalog.iter_slices()returnsIterator[GeoSlice]. The shape and semantics ofGeoSliceare determined by what the catalog can answer about.Patcher doesn’t use
GeoSlicedirectly. The patcher’s wire format isPatch(data + anchor + neighborhood), notGeoSlice(bounded request). Different abstraction, different package.Splitting it out would over-engineer. A 200-LOC frozen dataclass with five accessor properties doesn’t justify its own package.
Consumers of GeoSlice:
| Consumer | Imports as |
|---|---|
geocatalog (self) | from geocatalog._src.slice import GeoSlice |
geotoolz | from geocatalog import GeoSlice (used in catalog-aware operators, samplers, the CatalogPipeline ETL driver) |
xr_toolz | from geocatalog import GeoSlice (used in catalog-aware xarray loaders) |
geopatcher | does NOT import GeoSlice — uses Patch |
| User code | from geocatalog import GeoSlice |
2.4 Why the loaders live here, not in georeader or geotoolz¶
The loaders (load_raster, load_xarray, load_vector) consume a GeoSlice and return a backend-specific carrier (GeoTensor, xr.Dataset, gpd.GeoDataFrame). They’re catalog-shaped functions that happen to call out to the I/O substrate.
Two reasons they live in geocatalog:
They’re catalog API. The natural pattern is
catalog.iter_slices(...)→load_raster(slice, reader_cls=...). The catalog produces the slices; the catalog ships the loaders.They keep the I/O substrate (
georeader) independent.georeaderis a reader / writer /GeoTensorlibrary. It doesn’t know about catalogs. Putting catalog-aware loaders ingeocatalogkeeps the substrate clean.
The loaders are extras-gated so users who only want the index don’t pull in heavy I/O dependencies. See §3.
Part 3 — Dependencies and optional extras¶
3.1 Base install¶
[project]
name = "geocatalog"
version = "0.1.0"
dependencies = [
"pandas>=2.0",
"geopandas>=1.0",
"pyproj>=3.6",
"shapely>=2.0",
"rasterio>=1.4", # GeoSlice uses rasterio.Affine + rasterio.windows
]Base install gives:
GeoCatalogProtocol,CatalogRow,GeoSliceInMemoryGeoCatalog(GeoDataFrame-backed)Set algebra (
query,intersect,union)to_geoparquet/from_geoparquetCatalogDomain(the patcher bridge — Protocol-shaped, no patcher import)
No loaders, no DuckDB backend. The index works; you bring your own reader.
3.2 Optional extras¶
[project.optional-dependencies]
duckdb = ["duckdb>=1.0"] # DuckDBGeoCatalog
raster = ["georeader>=0.4"] # build_raster_catalog, load_raster, load_raster_timeseries
xarray = ["xarray>=2024.1", "rioxarray>=0.15"] # build_xarray_catalog, load_xarray
vector = ["pyogrio>=0.8"] # build_vector_catalog, load_vector
streaming = ["pyarrow>=17.0"] # streaming-write infrastructure
all = ["geocatalog[duckdb,raster,xarray,vector,streaming]"]The patterns this enables:
Someone who only needs the index over their own files:
pip install geocatalog. ~5 MB install. No GDAL, no rasterio overhead beyond what shapely already needs.Operational MARS ETL:
pip install geocatalog[duckdb,raster,streaming]. Big catalogs over remote GeoParquet, raster loaders, streaming writes.Xarray climate workflows:
pip install geocatalog[xarray]. Catalogs over reanalysis NetCDFs.Vector / GIS workflows:
pip install geocatalog[vector]. Catalogs over GeoJSON, FlatGeobuf, layered GeoPackages.
3.3 No pipekit dependency¶
geocatalog does not import from pipekit. The catalog isn’t an Operator; it’s a queryable index. The framework-vs-data-source distinction is clean:
pipekitsays how to compose operators that transform a carriergeocatalogsays how to find which files match a spatiotemporal query
These are orthogonal concerns. A user can from geocatalog import open_catalog without ever using pipekit. A pipekit pipeline can read from a non-catalog source. The two compose at the application layer (in geotoolz and xr_toolz), not at the library layer.
3.4 The CatalogDomain patcher bridge — important detail¶
CatalogDomain (109 LOC) is an adapter that lets a geopatcher.SpatialPatcher iterate over a catalog’s rows. It satisfies geopatcher.Domain Protocol (crs, bounds) — but does not import geopatcher. It implements the Protocol structurally; geopatcher only knows about it via Protocol satisfaction at runtime.
This is the right factoring. geocatalog produces something that quacks like a Domain; geopatcher accepts anything that quacks like a Domain. No import either direction.
Part 4 — Migration from geotoolz.catalog¶
4.1 What moves¶
| Currently | Becomes |
|---|---|
geotoolz.catalog.* | geocatalog.* |
geotoolz.catalog._src.* | geocatalog._src.* |
geotoolz.types.GeoSlice | geocatalog.GeoSlice (canonical) |
4.2 Backwards compatibility in geotoolz¶
geotoolz re-exports for transitional compatibility:
# geotoolz/__init__.py
from geocatalog import (
GeoCatalog,
CatalogRow,
InMemoryGeoCatalog,
DuckDBGeoCatalog,
GeoSlice,
open_catalog,
build_raster_catalog,
build_raster_timeseries,
load_raster,
load_raster_timeseries,
query,
intersect,
union,
to_geoparquet,
from_geoparquet,
CatalogDomain,
)
import geocatalog as catalog # alias so `gz.catalog.*` keeps working
# geotoolz/types/__init__.py
# Keep GeoSlice re-export here too so old `from geotoolz.types import GeoSlice` works
from geocatalog import GeoSliceExisting user code works unchanged:
import geotoolz as gz
cat = gz.open_catalog("s3://bucket/scenes.parquet")
for sl in cat.iter_slices(...):
gt = gz.load_raster(sl)Plan: ship the re-exports for two minor versions (v1.1, v1.2), then deprecate with a DeprecationWarning, remove in v2.0. Users who follow the deprecation cleanly migrate from geotoolz.catalog import ... → from geocatalog import ....
4.3 What stays in geotoolz¶
After the move, geotoolz.catalog keeps only the catalog-aware operators that compose into pipekit pipelines — CatalogPipeline for ETL, CatalogReader for inline reads, etc. These depend on both geocatalog and pipekit, so they belong in the consumer (geotoolz), not the substrate (geocatalog).
Sketch:
# geotoolz/catalog_ops.py (NEW, ~200 LOC)
from pipekit import Operator
from geocatalog import GeoCatalog, GeoSlice
import georeader
class CatalogPipeline(Operator):
"""ETL driver: iterate a catalog, apply an op to each scene, optionally parallel."""
def __init__(self, catalog: GeoCatalog, op: Operator, *, n_workers: int = 1, ...):
...
class CatalogReader(Operator):
"""Read a single GeoSlice into a GeoTensor inline in a Sequential."""
def __init__(self, reader_cls=None):
...The Operator-shaped wrappers stay in geotoolz because they require both pipekit and geocatalog. Putting them in geocatalog would force a pipekit dep on every catalog user — wrong shape.
Part 5 — Cross-library composition this enables¶
5.1 Same catalog, multiple consumer libraries¶
# Build once
import geocatalog as gc
cat = gc.build_raster_catalog("s3://emit/2024/*.nc")
# Use from geotoolz
import geotoolz as gz
for sl in cat.iter_slices(bounds=aoi, time=(t0, t1)):
gt = gc.load_raster(sl)
result = gz.methane_pipeline(gt)
# ...
# Use from xr_toolz
import xr_toolz as xrt
for sl in cat.iter_slices(bounds=aoi, time=(t0, t1)):
ds = gc.load_xarray(sl)
result = xrt.skill_score_pipeline(ds)
# ...Currently impossible — catalog is locked inside geotoolz, so xr_toolz can’t reach it without weird imports. After the split, the same catalog object feeds either domain library.
5.2 Catalog → patcher chains¶
import geocatalog as gc
import geopatcher as gp
# Catalog drives the iteration; patcher drives the per-scene slicing
for sl in cat.iter_slices(...):
gt = gc.load_raster(sl)
field = gp.RasterField(gt)
for patch in gp.SpatialPatcher(
geometry=gp.SpatialRectangular((256, 256)),
sampler=gp.SpatialRegularStride((224, 224)),
).split(field):
# process each 256×256 patch
...Both libraries are infrastructure-level; geotoolz is the consumer that ties them together.
5.3 Catalog over xarray data, processed with pipekit¶
import geocatalog as gc
import pipekit as pk
import xr_toolz as xrt
cat = gc.build_xarray_catalog("s3://reanalysis/era5/")
pipeline = pk.Sequential([
xrt.validation.ValidateCoords(),
xrt.detrend.RemoveClimatology(clim),
xrt.metrics.RMSE(reference=ref),
])
for sl in cat.iter_slices(bounds=aoi, time=(t0, t1)):
ds = gc.load_xarray(sl)
score = pipeline(ds)The lego pieces compose because they all sit at peer-library levels, not inside one monolith.
Part 6 — Monorepo development model¶
Decision: develop in a single pipekit-ecosystem monorepo with uv / hatch workspaces; publish as separate packages.
6.1 Why this works for geocatalog specifically¶
Cross-package changes during early development are common. Adding a new
CatalogRowfield affects all backends, all loaders, all geotoolz consumers. One PR is much easier than three coordinated PRs.GeoSliceevolution is rare but cross-cutting. WhenGeoSlicegrows (e.g., addtarget_dtype), it ripples through every loader and every consumer. Monorepo PR > three-way release dance.Shared test infrastructure. Catalog tests want a small sample of raster fixtures, xarray fixtures, vector fixtures. Sharing fixtures across packages (catalog + patcher + geotoolz) avoids triplication.
CI runs the integration matrix in one place. A monorepo CI can verify “geocatalog 0.2 + geotoolz 0.5 + xr_toolz 0.3 still works together” automatically.
6.2 Layout¶
pipekit-ecosystem/ # monorepo root
├── pyproject.toml # workspace config
├── packages/
│ ├── pipekit/
│ │ ├── pyproject.toml
│ │ └── src/pipekit/...
│ ├── pipekit-array/
│ │ ├── pyproject.toml
│ │ └── src/pipekit_array/...
│ ├── geocatalog/ # ← this report
│ │ ├── pyproject.toml
│ │ └── src/geocatalog/...
│ ├── geopatcher/ # ← Report 7
│ │ ├── pyproject.toml
│ │ └── src/geopatcher/...
│ ├── geotoolz/
│ │ ├── pyproject.toml
│ │ └── src/geotoolz/...
│ └── xr_toolz/
│ ├── pyproject.toml
│ └── src/xr_toolz/...
├── shared/
│ ├── fixtures/ # cross-package test fixtures
│ └── ci/ # reusable GH Actions workflows
└── docs/ # one Sphinx site spanning everything6.3 Operational pattern¶
Develop locally with
uv syncresolving cross-package deps to the local workspace paths. Cross-package edits work withoutpip install -edance.CI matrix: each package gets its own job + an integration job that pins all packages to their HEAD and runs end-to-end tests.
Release per-package independently. Each
packages/*/pyproject.tomlcarries its own version. Tags likegeocatalog-v0.2.0trigger publish for that package only.Pinned cross-package versions in published metadata:
geotoolz==0.5.xdeclaresgeocatalog>=0.2,<0.3,geopatcher>=0.1,<0.2. Users who install onlygeotoolzget the right pinned versions automatically.
This is the same pattern as huggingface_hub / transformers / datasets, xarray / xarray-spatial / xvec, jax / flax / optax / equinox, and many others. Well-trodden.
Part 7 — The CatalogDomain ↔ geopatcher.Domain bridge in detail¶
Worth dwelling on because it’s the cleanest example of the Protocol-based decoupling that makes the whole split work.
7.1 What CatalogDomain is¶
A small adapter (109 LOC) that wraps a GeoCatalog and exposes the geopatcher.Domain Protocol (crs, bounds, plus a method for getting the relevant rows).
# In geocatalog._src.domain
from typing import Protocol
class _DomainProtocol(Protocol):
"""Structural type matching geopatcher.Domain, redeclared locally."""
@property
def crs(self) -> Any: ...
@property
def bounds(self) -> Any: ...
class CatalogDomain:
"""A catalog presented as a Domain for the patcher to walk."""
def __init__(self, catalog: GeoCatalog):
self.catalog = catalog
@property
def crs(self):
return self.catalog.crs
@property
def bounds(self):
return self.catalog.bounds
def rows_for(self, anchor) -> Iterator[CatalogRow]:
return self.catalog.query(...)7.2 Why this works without imports¶
geopatcher.Domain is a runtime_checkable Protocol. CatalogDomain doesn’t inherit from it — it just happens to have the right shape. geopatcher accepts any Domain-shaped thing at runtime via isinstance(x, Domain).
Result: geocatalog doesn’t import geopatcher; geopatcher doesn’t import geocatalog. The Protocol is the contract.
7.3 The user-facing pattern¶
import geocatalog as gc
import geopatcher as gp
cat = gc.open_catalog("...")
dom = gc.CatalogDomain(cat)
# The patcher walks the catalog via the Domain Protocol
patcher = gp.SpatialPatcher(
geometry=gp.SpatialPolygonIntersection(...),
sampler=gp.SpatialExplicit(...),
)
for patch in patcher.split(dom):
# patch.data is loaded on-demand via the catalog's loader
...Catalog-driven iteration meets patch-driven iteration. The two libraries co-design via Protocol, not via shared imports.
Part 8 — Effort and timing¶
8.1 Effort¶
Day 1: Set up
packages/geocatalog/in the monorepo. Move source files. Setpyproject.tomlwith current Phase 1 deps.Day 2: Move
GeoSlicefromgeotoolz.types→geocatalog._src.slice. Update intra-catalog imports. Set up extras-gated re-exports.Day 3: Set
geotoolzto depend ongeocatalog; ship backwards-compat re-exports fromgeotoolz.catalog.*andgeotoolz.types.GeoSlice. Run existing geotoolz test suite — it should pass unchanged.Day 4: Set
xr_toolzto depend ongeocatalog; exposexr_toolz.catalog_ops.*for xarray-flavoured catalog ops if not already there.Day 5: Set up CI per-package + integration. Write migration guide. Cross-link docs.
Total: ~1 week of focused work.
8.2 Timing¶
Do this alongside the pipekit extraction, not before or after.
Reasoning: both refactors touch geotoolz/__init__.py. Doing them in one combined PR (pipekit extraction + geocatalog extraction + geopatcher extraction) means one breaking change for users to absorb, not three. The deprecation banner says “v1.0 reshaped the ecosystem; here’s the migration guide” once.
8.3 What this unblocks¶
After geocatalog is a separate package:
xr_toolzcan adopt catalog natively without weird cross-imports from geotoolz.pipekit-jax(future) can build JAX-traceable catalog drivers by depending on geocatalog directly.The patcher’s
CatalogDomainbridge becomes meaningful — currently it’s angeotoolz.catalog._src.domainartifact; after the split it’s a proper cross-package Protocol contract.Catalog releases independently. Phase 2 DuckDB work, GeoParquet 1.1 spec updates, streaming improvements all ship without coordinating with geotoolz’s release cycle.
Part 9 — Honest tradeoffs (catalog-specific)¶
9.1 What gets better¶
Catalog discoverability. Right now someone searching PyPI for “spatiotemporal catalog python” doesn’t find
geotoolz.catalog. They’d findgeocatalog. Real outreach win.Minimal install for narrow use cases. A user who wants only the index doesn’t pull in georeader / rasterio / GDAL.
GeoSlicebecomes a meaningful cross-library type. Currently it’s “that thing ingeotoolz.types”; after the split it’sgeocatalog.GeoSlice, the catalog’s wire format, importable in any consumer.Phase 2 (DuckDB / GeoParquet 1.1) work can land in
geocatalogwithout bumping geotoolz.
9.2 What gets worse (and the mitigations)¶
One more package to release. Mitigation: monorepo + per-package release tags; the operational overhead is one
git tag geocatalog-v0.2.0-style command.Version pinning between geocatalog and geotoolz. Mitigation: geotoolz declares
geocatalog>=0.2,<0.3ranges; users get correct combos automatically.Documentation has to live somewhere. Mitigation: one Sphinx site at the monorepo level with sections for each package, OR mkdocs per-package with cross-links. Either works.
The
CatalogDomainProtocol bridge is implicit. Mitigation: a one-paragraph note in geocatalog’s docs pointing at geopatcher’s Domain Protocol with examples.
Recommendation¶
Ship geocatalog as a standalone package. It satisfies every signal for warranting one:
Self-contained scope (own Phase 1 / Phase 2 design at
notes/geotoolz/plans/geodatabase/)Real size (~3,600 LOC)
Already carrier-neutral (works with raster, xarray, vector loaders)
No pipekit dependency (it’s not an Operator-shaped library; it’s a data-source-shaped library)
Cross-library value (xr_toolz, future pipekit-jax both benefit)
Independent backend evolution (InMemory + DuckDB + future Postgres-PostGIS?)
GeoSlice lives in geocatalog — it’s the catalog’s wire format, full stop.
Monorepo development is fine — it’s the standard pattern, well-supported by uv / hatch.
The whole thing is a one-week extraction job, done alongside the pipekit + geopatcher extractions as one coordinated v1.0 release.