Status: design proposal — full report in
geotoolz.md. Scope: a composable Operator library for remote sensing, sitting on top ofgeoreader.GeoTensor. Sibling toxr_toolz, targeted at a different community and a different substrate. Audience: anyone building RS pipelines that compose preprocessing, inference, and evaluation steps; anyone consuming ageoreaderGeoCatalogand wanting to apply per-tile operators at scale.
Summary¶
geotoolz is the operator-composition layer above georeader.
Where georeader reads bytes and produces GeoTensors, geotoolz lets users compose those reads with classification heads, spectral-index calculations, masking, mosaicking, sampling, inference, and writing — declared as a Sequential of operators or a Graph of named ops, executed eagerly or driven by a Hydra config.
It’s the RS-shaped sibling to xr_toolz (the climate-side composition library): same architectural patterns, separate codebase, different substrate.
GeoTensor is the shared currency at the bottom; the operator surface is built fresh for each library’s audience.
Two libraries, multiple submodules¶
georeader and geotoolz are the two shipping libraries. Inside geotoolz two incubation submodules host plan trees that may graduate to their own packages once their APIs settle:
geotoolz.ops— the stable core:Operator,Sequential,Graph,ModelOp, sensor presets. The public surface most users touch.geotoolz.patch— the four-axis Patcher framework specified inplans/geopatcher/. Incubates toward a future standalonegeopatcher.geotoolz.catalog— theGeoCatalogProtocol + backends specified inplans/geodatabase/. Incubates toward a future standalonegeocatalog.geotoolz.types— cross-cutting value types (GeoSlice); graduates with whichever sibling extracts first.
The point of incubation: API can churn freely during v0.1–v0.3 against real operator-graph users, then graduate via from geotoolz.patch import * / from geotoolz.catalog import * re-export shims with DeprecationWarning. Same pattern Keras used for layer/optimizer subpackages.
What’s in the design¶
The full design report is geotoolz.md.
It runs through the architecture in 10 numbered sections; the high-level outline:
| § | Topic |
|---|---|
| 0 | Inputs the design draws on (xr_toolz patterns, georeader substrate, jej_vc_snippets empirical vocabulary) |
| 1 | Inventory — two-tier model (jaxtyped Array primitives → Operator), 12 module surface (radiometry, indices, cloud, compositing, sampling, inference, ...). The collapse from xr_toolz’s three tiers is possible because GeoTensor is an np.ndarray subclass with __array_ufunc__. |
| 2 | User story — five personas, the goal arc from “load S2 → mask clouds → NDVI → save COG” to “Hydra-YAML pipeline” |
| 3 | Motivation — why a separate library (not extending xr_toolz), why re-implement the composition core, comparison with TorchGeo / xarray-spatial / stackstac |
| 4 | Mathematics — Operator delegation chain, dual-mode __call__ (eager vs Graph), split-object stateful pattern, sampler stride, compositing reductions, matched filter, pansharpening, Lee speckle |
| 5 | Coupling with the ecosystem — georeader, scipy/scikit-image, xrpatcher, georeader.catalog, ML frameworks, Hydra-zen, xr_toolz |
| 6 | Proposed API surface — module layout, Operator base class, Sequential / Graph, ModelOp, signature conventions, Hydra YAML compatibility, dependencies, versioning |
| 7 | Sharp edges — twelve flagged-pitfalls-with-mitigations |
| 8 | End-to-end examples — RGB viz, NDVI, cloud-free composites, atmospheric correction, tiled inference, hyperspectral retrievals, catalog-driven processing, Hydra YAML, Graph branching, sensor presets |
| 9 | Verdict — concrete v0.1 deliverables, two-month roadmap, what to defer |
| 10 | xr_toolz coexistence — substrate-split rule, what cross-pollinates and what doesn’t |
Examples & idioms¶
Two companion docs sit alongside the main report, both MyST-rendered and listed in the site TOC:
| File | What’s in it |
|---|---|
examples/usecases.md | A 13-case gallery of where the Operator algebra fits in a deployment — notebook exploration, ETL, ML training/inference, LitServe API, FastAPI multi-pipeline service, user-uploaded YAML viz, tile server, QC as operators, regulatory artifacts, active learning, cross-sensor fusion Graph, workflow orchestrator units, A/B regression. Each case names the driver (notebook user vs. HTTP request vs. scheduler), the calling code, an implementation sketch, and an honest tradeoffs note. |
examples/tips_n_tricks.md | A reference of small composable Operators that punch above their weight — Tap/Snapshot/Profile/Histogram/ShapeTrace/Diff (observers), Branch/Switch/Try/Coalesce/Retry (control flow), Fanout/ApplyToBands/Cache (composition), context-managed Mode and Spy (stateful), AssertX/Quarantine (QC), and small building blocks like Identity/Const/Lambda/Sink. Includes the scoped-by-default API design for Mode (a with pipe.mode("train"): context manager) and Spy (a with Spy.scoped() as s: ... block) — the explicit-scoping alternative to PyTorch’s sticky train()/eval() footgun. |
Both files are deliberate companions, not duplicates: usecases.md is “who drives the pipeline” (a tour of deployment shapes); tips_n_tricks.md is “what primitives let those shapes work” (the v0.1 stdlib of idioms).
New readers can start with either — use cases first if you want to know whether the abstraction fits your problem, tricks first if you already know it does and want the building blocks.
Ecosystem bridges¶
Two sub-designs spec how geotoolz reaches into the long-tail of the SciPy stack without re-implementing it.
Each sub-design ships as a (design, examples) pair under its own subdirectory and is listed under the site TOC’s Plans → geotoolz entry.
| Bridge | Design | Examples | What it does |
|---|---|---|---|
| scikit-image | skimage/design_skimage.md | skimage/examples_skimage.md | A small family of wrapping Operators (PerBand, MultiBand, Func) that adapt skimage’s calling conventions to GeoTensor — axis-aware HWC/CHW adapters, dtype-range safety (float [0,1] ↔ uint8 [0,255]), and NaN masking/restoration. Brings filtering, morphology, segmentation, restoration, edge detection, feature extraction, and registration into the operator graph with no per-function glue. |
| scikit-learn | sklearn/design_sklearn.md | sklearn/examples_sklearn.md | One bridging type (PixelTable) for sklearn’s (n_samples, n_features) shape, plus shape adapters (spatial-pixelwise, spatiotemporal-pixelwise, chipwise) and four estimator wrappers (Classifier / Regressor / Transformer / Clusterer). First-class NaN strategy taxonomy. Fitting stays out of the operator graph; pre-fitted estimators are the artifact that flows into pipelines. Works for any BaseEstimator-protocol library (sklearn proper, xgboost, lightgbm, catboost, imbalanced-learn, custom estimators). |
Both bridges share the same shape: the design doc specs the wrapping Operators, type adapters, and gotchas; the examples doc walks through 10+ end-to-end recipes (SAR speckle reduction, cloud-mask post-processing, CLAHE, edge detection, superpixel segmentation, inpainting, GLCM textures, blob detection, image registration on the skimage side; crop classification, chipwise inference, biomass regression, pixel-wise PCA, full sklearn Pipeline composition, per-tile classifier, time-series classification, spatial CV, unsupervised clustering, incremental fitting on the sklearn side).
Connections to other designs¶
geotoolz sits on top of the rest of the plan tree.
The cross-design touchpoints:
| Design | What geotoolz consumes / produces |
|---|---|
| Reader reconciliation | geotoolz.catalog_ops.CatalogPipeline accepts a reader_class=... kwarg pulling from any GeoData (sync) or AsyncGeoData (async) reader. The strategy injection is the central swappability seam. |
| Geodatabase | CatalogPipeline(catalog, op).run() consumes a GeoCatalog. The pipeline iterates the catalog, applies the operator per row, writes outputs. |
Core types — GeoSlice | geotoolz.sampling.GridSampler wraps grid_sampler; geotoolz.inference.ApplyToChips consumes the iterator and uses stitch for the inverse step. The Stitch operator in geotoolz is a direct re-export of the primitive specified there. |
geopatcher | The canonical home for the patching/sampling algebra. geotoolz.sampling.GridSampler, geotoolz.inference.ApplyToChips, and geotoolz.catalog_ops.CatalogPipeline are thin operator wrappers around geopatcher.Patcher — GridSampler = Patcher(Rectangular × RegularStride × Boxcar × OverlapAdd) driven by catalog.iter_slices(); ApplyToChips = for p in patcher.split(field): yield op(p) + patcher.merge(...). geotoolz does not implement samplers or stitching of its own; it consumes them from geopatcher. Streaming reconstruction (lazy Field.select, disk-backed Aggregation, hierarchical Patcher-of-Patchers) plugs into CatalogPipeline for petabyte-scale inference. |
| Sensor readers | Sensor-preset operators (presets.s2.S2_L2A_RGB, presets.emit.EMIT_METHANE_MF, etc.) wrap the per-sensor readers from those designs. |
Open questions¶
The full design report flags twelve sharp edges in §7, architectural questions in §3 and §10, and a comprehensive risks/gotchas section in §11. The ones most worth surfacing here:
Operatorbase class — shared withxr_toolzor re-implemented? The current decision (§0) is to re-implement (~300 LOC, freedom to specialise). Worth revisiting if a third sibling library appears.Hydra-zen as a hard dep or extra? §6 currently scopes it to an optional
[hydra]extra. Some users want it as a baseline.presets/legacy/for SPOT VGT and Proba-V? §1.2 doesn’t list them. Tutorial Ch. 17 suggests they’d live there if added.Async operator support. The current
Operatorbase class is sync. AnAsyncOperatorfor use withAsyncGeoTIFFReaderis plausible but not designed. Pick a design before v0.1 — see §11.2.coordaxstability for the JAX path. The future-work JAX bridge depends oncoordax(NeuralGCM, research-grade). Spike before committing the v0.5 work; have a fallback (equinox + jaxtyping directly).georeader 2.0(feature/geotensor_npapi) merge timing. The two-tier model assumes the ndarray-subclass GeoTensor lands. If it stalls, geotoolz blocks._srcprivacy decision. Primitives are currently “semi-public” — likely a footgun for non-ufunc primitives passedGeoTensorinputs. Decide truly-private vs deliberately-public namespace (§11.3 recommends truly-private for v0.1).Sensor preset scope reduction for v0.1. 80 operators in 4 months is tight. Recommendation: cut Himawari-AHI HSD, MTG-FCI, SEVIRI-HRIT to v0.5+; ship MODIS + ABI as v0.1 sensor proofs.
Array-API compliance from day one or as a retrofit? If JAX path matters, factoring primitives through the array-API standard at v0.1 is cheap; v0.5 retrofit is expensive. See §11.3.
The full risks/gotchas list lives in geotoolz.md §11 — covers strategic risks, implementation gotchas to test in CI, and scope honesty.
Why this README is a thin wrapper¶
The detailed design lives in geotoolz.md — 800+ lines covering everything from architecture decisions to per-operator math to a 15-example use-case gallery.
That file is the source of truth; this README is a navigation entry point that other designs link to.
If you’re cross-referencing geotoolz from another plan, prefer linking to a specific section of geotoolz.md where possible (#1-2-module-surface, #5-2-numpy-scipy-scikit-image-scikit-learn-compute, etc.) rather than to this README — readers want to land on the section that’s relevant to their question, not navigate two levels down.