Report 3 — Sister libraries on top of pipekit

UNEP
IMEO
MARS
Status: Surface proposal
Reading time: ~20 min
Audience: Anyone scoping the layered library ecosystem on top of pipekit
Companion reports: Report 1 (background), Report 2 (pipekit core), Report 4 (use-case revisit)

This report describes the three sister libraries that sit on pipekit and together cover the practical carrier surface: duck-typed arrays via the Array API (pipekit-array), GeoTensor (geotoolz), and xarray DataArrays/Datasets (xr_toolz). Each library is a thin layer over pipekit with carrier-specific operators.

The diagram:

┌────────────────────────────────────────────────────┐
│                  Domain libraries                  │
│                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────┐  │
│  │  geotoolz    │  │   xr_toolz   │  │  others  │  │
│  │ (GeoTensor)  │  │   (xarray)   │  │   ...    │  │
│  └──────────────┘  └──────────────┘  └──────────┘  │
│         │                  │              │        │
│         └────────┬─────────┴──────────────┘        │
│                  ▼                                 │
│         ┌─────────────────┐                        │
│         │  pipekit-array  │  ← duck-typed arrays   │
│         │   (Array API)   │     (numpy, JAX, etc.) │
│         └─────────────────┘                        │
│                  │                                 │
│                  ▼                                 │
│         ┌─────────────────┐                        │
│         │     pipekit     │  ← carrier-agnostic    │
│         │     (core)      │     framework          │
│         └─────────────────┘                        │
└────────────────────────────────────────────────────┘

The three layers above pipekit

Layer 1 — pipekit-array: duck-typed array operators

A sister package on top of pipekit’s framework, implementing array-shaped operators against the Python Array API standard. The Array API is the modern, formal answer to “duck arrays” — it gives you a single array_namespace(x) dispatch that returns a numpy-shaped namespace regardless of whether x is numpy, JAX, CuPy, PyTorch, or dask.

Layer 2 — geotoolz: GeoTensor operators

Thin domain layer on top of pipekit, with operators that consume and return GeoTensor (the numpy-subclass with geographic metadata). Most array math delegates to pipekit-array; what geotoolz adds is the geo-specific work (sensor presets, CRS-aware operators, etc).

Layer 3 — xr_toolz: xarray operators

Thin domain layer on top of pipekit, with operators that consume and return xr.DataArray / xr.Dataset / xr.DataTree. Array math that’s not xarray-specific can also delegate to pipekit-array.

Three modules, three primitives, each in its own sweet spot. Let me cover each in detail.

Part 1 — pipekit-array: duck-typed arrays via Array API

1.1 What the Array API gives us

The Python Array API standard is a specification that’s been adopted by:

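| Library | Conformance |
| --- | --- |
| numpy ≥ 2.0 | Full (main numpy namespace; the separate numpy.array_api module was removed in 2.0) |
| jax.numpy | Full (since JAX 0.4.20) |
| cupy | Full |
| pytorch ≥ 2.0 | Via the array-api-compat wrapper namespace (PyTorch ships no torch.array_api module) |
| dask.array | Partial |
| sparse | Partial |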

The mechanism: every conforming array implements __array_namespace__(), which returns a module-like object with mean, sum, where, reshape, etc. as functions. The user’s pattern:

from array_api_compat import array_namespace  # assumed source of the helper

def my_operator(x):
    xp = array_namespace(x)  # returns numpy / jax / cupy / etc.
    return xp.mean(x, axis=-1, keepdims=True)

One function works on five backends. This is exactly the “duck-array” idea formalised.
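To make the dispatch mechanism concrete, here is a dependency-free sketch. `ToyArray` and `toy_namespace` are hypothetical stand-ins for a real backend, and the local `array_namespace` is a simplified stand-in for the array-api-compat helper of the same name. It shows only the protocol, not a conforming implementation.

```python
import types


class ToyArray:
    """Hypothetical 1-D array, just enough to demonstrate the protocol."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __array_namespace__(self, api_version=None):
        # A conforming array returns the namespace that knows how to
        # operate on it; here that is the toy namespace defined below.
        return toy_namespace


def _toy_mean(x):
    return sum(x.values) / len(x.values)


# Module-like object exposing functions, as the standard describes.
toy_namespace = types.SimpleNamespace(mean=_toy_mean)


def array_namespace(x):
    """Simplified stand-in: ask the array for its own namespace."""
    return x.__array_namespace__()


def my_operator(x):
    xp = array_namespace(x)  # resolved from the input, never hard-coded
    return xp.mean(x)


print(my_operator(ToyArray([1, 2, 3])))  # 2.0
```

The operator never names a backend. Swapping `ToyArray` for an array from a conforming library changes which namespace is returned, not the operator body.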

1.2 What ships in pipekit-array

The carrier-specific operators that v1 / v2 scoping flagged as “numpy-flavoured” — rewritten against the Array API namespace so they work cross-backend.

| Operator | Purpose | Module |
| --- | --- | --- |
| ApplyToBands(inner, axis=0) | Split-apply-stack over an axis | pipekit_array.combinators |
| Subsample(stride=10) | Stride-decimate the last two axes | pipekit_array.geom |
| Histogram(bins=10) (controller + .at(key)) | Capture distributions per tap site | pipekit_array.observe |
| Diff(reference, atol=1e-6) | Compare against a stored reference; raise on drift | pipekit_array.qc |
| AssertValueRange(min, max, on_fail="raise" \| "warn") | Pass-through; raise / warn on out-of-range | pipekit_array.qc |
| AssertNoNaN() | Pass-through; raise on any NaN | pipekit_array.qc |
| AssertValidFraction(min_valid=0.5) | Pass-through; raise if the non-NaN fraction is below min_valid (default 50%) | pipekit_array.qc |
| ModelOp(model, method="__call__", batch_size=None) | Framework-agnostic inference; numpy / JAX / torch model | pipekit_array.inference |
| BatchedMap(op, batch_size=8) | Split along axis 0, apply, concatenate | pipekit_array.parallel |
| MeanScalar(field=None) | Reduce to scalar via xp.mean | pipekit_array.reduce |
| StackAlong(axis) | Stack list-of-arrays along an axis | pipekit_array.combinators |
| ConcatenateAlong(axis) | Concatenate along an axis | pipekit_array.combinators |

12 operators, ~400 LOC. All implemented against array_namespace(x).
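As an illustration of the pass-through QC operators in the table, the following is a minimal sketch of the `AssertValueRange` behaviour (return the input unchanged, or raise / warn). It is a stand-in under stated assumptions: it works on a flat Python sequence rather than an Array API array, it does not subclass the real pipekit `Operator`, and the `min_val` / `max_val` parameter names are borrowed from the geotoolz snippet in section 1.5.

```python
import warnings


class AssertValueRange:
    """Pass-through QC sketch: returns the input unchanged, or raises / warns.

    Stand-in only: operates on a flat sequence of numbers and is not tied
    to the real pipekit Operator base class.
    """

    def __init__(self, min_val, max_val, on_fail="raise"):
        if on_fail not in ("raise", "warn"):
            raise ValueError("on_fail must be 'raise' or 'warn'")
        self.min_val = min_val
        self.max_val = max_val
        self.on_fail = on_fail

    def __call__(self, x):
        lo, hi = min(x), max(x)
        if lo < self.min_val or hi > self.max_val:
            msg = (f"values span [{lo}, {hi}], outside "
                   f"[{self.min_val}, {self.max_val}]")
            if self.on_fail == "raise":
                raise ValueError(msg)
            warnings.warn(msg)
        return x  # pass-through: downstream operators see the same object


reflectance = [0.02, 0.4, 0.9]
checked = AssertValueRange(0.0, 1.0)(reflectance)
assert checked is reflectance
```

A real implementation would presumably compute the extrema with `xp.min` / `xp.max` from the input's namespace, so the same check runs on any backend.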

1.3 The trade-offs and honest constraints

What works.

What doesn’t.

[project]
name = "pipekit-array"
dependencies = ["pipekit>=0.1"]

[project.optional-dependencies]
numpy = ["numpy>=2.0"]
jax   = ["jax>=0.4.20"]
torch = ["torch>=2.0"]
cupy  = ["cupy>=13"]
dask  = ["dask[array]>=2024"]

The minimum install is dep-free (pipekit only) but operators raise ImportError if no Array-API-conforming backend is available. The recommended user install is pipekit-array[numpy].
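One way the "raise ImportError if no backend is available" behaviour could be implemented is sketched below. The helper names `available_backends` and `require_backend` are hypothetical, not part of any published pipekit-array API, and the module names simply mirror the extras listed above.

```python
import importlib.util

# Top-level module names corresponding to the optional extras above.
BACKEND_MODULES = ("numpy", "jax", "torch", "cupy", "dask")


def available_backends(candidates=BACKEND_MODULES):
    """Return the candidates that are importable, without importing them."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


def require_backend(candidates=BACKEND_MODULES):
    """Return the first available backend name, or raise ImportError."""
    found = available_backends(candidates)
    if not found:
        raise ImportError(
            "No Array API backend found; install one, e.g. "
            'pip install "pipekit-array[numpy]"'
        )
    return found[0]


print(available_backends())
```

Using `find_spec` keeps the check cheap: heavy backends such as JAX or PyTorch are located but not imported until an operator actually needs them.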

1.5 Migration of existing geotoolz / xr_toolz numpy operators

The eight or so carrier-specific operators currently sitting in geotoolz.pipeline_idioms BYO and in xr_toolz array-flavoured code migrate to pipekit-array. Both geotoolz and xr_toolz then re-export them from pipekit-array with carrier-specific defaults baked in:

# geotoolz/qc.py
from pipekit_array.qc import AssertValueRange as _AssertValueRange

class AssertValueRange(_AssertValueRange):
    """GeoTensor-aware value-range assertion. Same logic, but reads
    GeoTensor's fill_value as default min/max bounds if not provided."""
    def __init__(self, min_val=None, max_val=None, on_fail="raise"):
        super().__init__(min_val=min_val, max_val=max_val, on_fail=on_fail)

One implementation; three places it’s used.

Part 2 — geotoolz: GeoTensor operators

2.1 Scope after pipekit extraction

geotoolz becomes a thinner library focused on its actual domain value: remote sensing on top of georeader.GeoTensor. The framework code that used to live in geotoolz.core becomes a compatibility shim that re-exports from pipekit.

geotoolz/
  __init__.py             # re-exports pipekit + array + geo ops
  core/                   # compatibility shim → re-exports from pipekit
  io/                     # GeoTensor-specific I/O: ReadBounds, WriteCOG
  geom/                   # CRS-aware geometric ops: BowtieCorrection, GeostationaryParallaxCorrect, ...
  radiometry/             # TOAToBOA, DarkObjectSubtraction, RadianceToReflectance, BTFromRadiance
  indices/                # NDVI, NDWI, EVI, ... (sensor-aware band-index lookups)
  spectral/               # MatchedFilter, ACE, LinearUnmixing
  cloud/                  # MaskFromQABits, MaskFromSCL, ApplyMask
  qa/                     # AssertCRSEquals, AssertResolutionWithin, AssertSchema (carrier-specific)
  mask/                   # PolygonMask, AOI clipping
  patch/                  # ExtractPatches, StitchPatches (with per-patch metadata)
  catalog/                # GeoCatalog, CatalogPipeline (the domain-specific iteration)
  readers/                # gz.readers.<sensor> per the 8 sensor design docs
  viz/                    # Colormap, TrueColor, FalseColor, StretchToUint8
  compositing/            # MedianComposite, MaxNDVIComposite, CloudFreeComposite
  augment/                # RandomFlip, RandomRotate90 (training-only)
  presets/                # Sensor presets bundling reader + ops

2.2 What geotoolz uniquely owns

| Concern | Why it stays in geotoolz |
| --- | --- |
| GeoTensor reading / writing | Tightly coupled to georeader and rasterio |
| CRS-aware geometric ops | crs, transform, bounds are GeoTensor properties |
| Sensor presets (readers + ops bundles) | All 8 sensor design docs from earlier |
| GeoCatalog / CatalogPipeline | Multi-scene iteration over a parquet catalog |
| ExtractPatches with per-patch metadata | Each patch carries its own transform |
| Geo-specific QC (AssertCRSEquals, AssertResolutionWithin) | Need GeoTensor attributes |
| Sensor-specific calibration tables | Per-sensor data/ directories |

2.3 What geotoolz delegates upward

| Now in geotoolz | Migrates to |
| --- | --- |
| geotoolz.core.* (Operator, Sequential, etc.) | pipekit |
| geotoolz.qc.AssertValueRange / AssertNoNaN (numpy-shaped) | pipekit-array (geotoolz re-exports) |
| geotoolz.spectral.MatchedFilter array math | pipekit-array (geotoolz wraps with GeoTensor-aware default targets) |
| geotoolz.augment.* array transforms | pipekit-array (geotoolz wraps with GeoTensor-aware coordinate updates) |

The result: geotoolz is now ~30% smaller (the framework code is gone) and clearly focused on geo-specific semantics.

2.4 What’s new in geotoolz from the sensor design docs

From earlier work, geotoolz.readers.<sensor> ships the 8 sensor reader modules:

| Module | Sensor |
| --- | --- |
| geotoolz.readers.modis | MODIS L1B + L2 (stretch; pyhdf gating) |
| geotoolz.readers.viirs | VIIRS SDR + EDR |
| geotoolz.readers.goes | GOES-R ABI |
| geotoolz.readers.seviri | MSG SEVIRI (NAT + xRIT) |
| geotoolz.readers.mtg | MTG-FCI |
| geotoolz.readers.himawari | Himawari AHI (HSD) |
| geotoolz.readers.tropomi | TROPOMI L2 |
| geotoolz.readers.s3 | Sentinel-3 OLCI / SLSTR |

Each module bundles reader + sensor-specific operators + zero-arg presets. Reader plans + per-sensor design docs already exist in the sensor-integration outputs.

2.5 Cross-cutting modules geotoolz adds on top of pipekit + pipekit-array

| Module | What’s new vs pipekit / pipekit-array |
| --- | --- |
| geotoolz.compositing | BAPComposite, MaxNDVIComposite, MedianComposite with QA-aware behaviour |
| geotoolz.normalize | Scene-statistics-based normalisation (stateful) |
| geotoolz.restore | Despeckle, gap-fill, super-resolution wrappers |
| geotoolz.plume | PlumeMask, PlumeFootprint, point-source attribution |
| geotoolz.matched_filter | CH4 / NH3 / N2O matched filtering for hyperspectral plume retrieval |

Part 3 — xr_toolz: xarray operators

3.1 Scope after pipekit extraction

xr_toolz becomes a focused xarray-domain library. The framework code in xr_toolz.core becomes a compatibility shim. What stays:

xr_toolz/
  __init__.py             # re-exports pipekit + array + xr ops
  core/                   # compatibility shim → re-exports from pipekit; PLUS Augment, ApplyToEach
  validation/             # ValidateCoords, RenameCoords, SortCoords (xarray coord/attr manip)
  crs/                    # AssignCRS, Reproject, GetCRS (rioxarray-backed)
  subset/                 # SubsetBBox, SubsetTime, SubsetWhere
  masks/                  # AddLandMask, AddOceanMask, AddCountryMask (via regionmask)
  detrend/                # CalculateClimatology, RemoveClimatology, ComputeAnomaly
  interpolate/            # Regrid, GapFill, Smooth, Resample (D12)
  transforms/             # Encoders (one-hot, cyclical, ...) and decompositions (PSD, wavelet)
  metrics/                # RMSE, PSDScore, NashSutcliffe (D7)
  kinematics/             # OkuboWeiss, RelativeVorticity, KineticEnergy (D9)
  ocn/                    # Domain-specific: oceanography quantities
  atm/                    # Domain-specific: atmospheric quantities
  rs/                     # Remote-sensing-flavoured xarray ops (DataArray-on-disk)
  viz/                    # Matplotlib-based plot operators (D10)
  inference/              # SklearnModelOp, JaxModelOp (sample_dim aware)
  data/                   # CMEMS, CDS, AEMET data-source presets

3.2 What xr_toolz uniquely owns

| Concern | Why it stays in xr_toolz |
| --- | --- |
| Coordinate validation / harmonization | xr.Dataset coord + attr manipulation |
| CRS embedding via rioxarray | xarray-specific accessor |
| Climatology / anomaly / detrend | Time-axis aware, xarray-native via groupby |
| Regridding, gap-fill, smoothing | xarray’s interp + scipy backends |
| Augment / ApplyToEach combinators | Use xr.merge; fundamentally xarray-specific |
| Skill-score metrics (RMSE, PSD, NSE) | Compute over named dims, return xarray |
| Domain-specific quantities (ocean kinematics, atmospheric chemistry) | Lives on xr.DataArray natively |
| Visualisation operators returning matplotlib.Figure | xarray-flavoured plotting |
| Data-source presets (CMEMS, CDS, AEMET, …) | Opening + standardising specific provider datasets |

3.3 What xr_toolz delegates upward

| Now in xr_toolz | Migrates to |
| --- | --- |
| xr_toolz.core.* (Operator, Sequential, Graph, …) | pipekit |
| xr_toolz.core.combinators.Augment / ApplyToEach | Stays: uses xr.merge; not a pipekit fit |
| xr_toolz.core.combinators.Tap | pipekit (unified with geotoolz’s Tap) |
| Array-shaped pieces of xr_toolz operators | pipekit-array (e.g., the reduce inside RMSE) |

3.4 The xarray-specific combinators worth preserving

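Three combinators from xr_toolz.core.combinators are too valuable to remove: Augment, ApplyToEach, and Tap (the three listed under xr_toolz.core.combinators in 3.3).

The first two stay in xr_toolz because they use xr.merge; Tap moves to pipekit, as noted in 3.3. The docstrings of Augment and ApplyToEach should state clearly that their merge semantics differ from those of pipekit’s framework-level combinators.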

Part 4 — Where each library’s Tap, Sequential, etc. lives after migration

A quick reference table because this is the most common confusion point:

| Symbol | Lives in | Re-exported from |
| --- | --- | --- |
| Operator | pipekit._base.operator | pipekit, geotoolz.core, xr_toolz.core |
| Sequential | pipekit._base.sequential | pipekit, geotoolz.core, xr_toolz.core |
| Graph, Input, Node | pipekit._base.graph | pipekit, geotoolz.core, xr_toolz.core |
| Fanout | pipekit.combine | pipekit, geotoolz.core, xr_toolz.core |
| Identity, Const, Lambda, Sink | pipekit.blocks | pipekit, both libraries |
| Tap | pipekit.observe | both libraries |
| Branch, Switch, Try, Coalesce, Retry | pipekit.control | both libraries |
| Snapshot, ShapeTrace, Profile | pipekit.observe | both libraries |
| Cache | pipekit.cache | both libraries |
| Quarantine, AssertShape, AssertDType | pipekit.qc | both libraries |
| Signature, compute_output_signature | pipekit.signature | both libraries |
| ThreadMap, ProcessMap, AsyncMap, BatchedMap | pipekit.parallel | both libraries |
| pipe, compose, juxt, complement | pipekit.compose | both libraries |
| ApplyToBands, Subsample, Diff | pipekit_array.* | geotoolz + xr_toolz |
| Histogram controller | pipekit_array.observe | both libraries |
| AssertValueRange, AssertNoNaN, AssertValidFraction | pipekit_array.qc | both libraries (with sensor-aware defaults) |
| ModelOp | pipekit_array.inference | both libraries |
| Augment, ApplyToEach | xr_toolz.core.combinators | xr_toolz only |
| GeoCatalog, CatalogPipeline | geotoolz.catalog | geotoolz only |
| ExtractPatches, StitchPatches (with metadata) | geotoolz.patch | geotoolz only |
| Sensor readers / presets | geotoolz.readers.<sensor> | geotoolz only |
| Reproject, CRS ops | xr_toolz.crs | xr_toolz only |
| Climatology / detrend | xr_toolz.detrend | xr_toolz only |

Part 5 — Other libraries that could fit on top

You asked whether there should be libraries for numpy / JAX / numba / duck-array. Here’s my honest answer for each:

5.1 pipekit-array — yes, the answer for numpy / JAX / CuPy / PyTorch

Covered above. Array API is the right abstraction. One library covers all four backends.

5.2 pipekit-numba — probably not a separate library

Numba is a JIT compiler, not a separate carrier type. Numba-jitted operators are still numpy-flavoured at the carrier level; what’s different is that their inner kernels are @njit. The right pattern: pipekit-array operators can have numba-jitted inner kernels, decided per-operator. Adding a separate pipekit-numba library duplicates surface for no real abstraction benefit.

If you want a fast-path: pipekit-array[fast] extra pulls in numba and a couple of operators have @njit-decorated inner kernels. The Operator class itself is unchanged.

5.3 pipekit-jax-traceable — separate library, deferred

JAX-specific compatibility (jax.jit, jax.vmap, jax.grad working through a pipekit pipeline) is genuinely a different problem from “JAX as one of the Array API backends.” It requires:

This is a separate library, probably pipekit-jax or jax_geotoolz. Defer until a concrete project (differentiable retrievals, learnable corrections) demands it.

5.4 pipekit-dask — out of scope

Distributed parallelism via dask requires every operator to be pickleable + scheduler-aware. That’s an orchestrator concern, not a framework concern. Pipekit operators are pickleable (Group J discipline); how they get distributed is downstream tooling. dask users can compose pipekit Operators inside dask.bag.map or dask.delayed themselves.

5.5 pipekit-cuda — covered by pipekit-array with CuPy

CuPy is an Array API conformant backend. pipekit-array[cupy] is the answer; CUDA-specific operators don’t need their own library.

5.6 What about specific carriers — pandas DataFrames, polars, dicts?

pipekit is Carrier = Any. You can write Operator subclasses that consume DataFrames — there’s nothing in the framework that prevents it. Whether a pipekit-pandas library is worth its own existence depends on whether your community has DataFrame-shaped pipelines. My honest read: probably not for the methane / MARS / atmospheric-chemistry use cases that drive your work. Defer.
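To show what "Carrier = Any" means in practice, here is a small sketch with plain dicts as the carrier. The `Operator` and `Sequential` classes below are hypothetical minimal stand-ins written for this example; the real pipekit classes may differ in interface. `AddField` and `DropField` are likewise invented for illustration.

```python
class Operator:
    """Hypothetical minimal stand-in for pipekit's Operator base class."""

    def __call__(self, carrier):
        raise NotImplementedError


class Sequential(Operator):
    """Hypothetical stand-in: apply operators left to right."""

    def __init__(self, *ops):
        self.ops = ops

    def __call__(self, carrier):
        for op in self.ops:
            carrier = op(carrier)
        return carrier


class AddField(Operator):
    """Return a copy of the dict carrier with one derived key added."""

    def __init__(self, key, fn):
        self.key = key
        self.fn = fn

    def __call__(self, record):
        return {**record, self.key: self.fn(record)}


class DropField(Operator):
    """Return a copy of the dict carrier without the given key."""

    def __init__(self, key):
        self.key = key

    def __call__(self, record):
        return {k: v for k, v in record.items() if k != self.key}


pipeline = Sequential(
    AddField("flux_kg_h", lambda r: r["flux_t_h"] * 1000),
    DropField("flux_t_h"),
)
result = pipeline({"site": "A", "flux_t_h": 1.5})
print(result)  # {'site': 'A', 'flux_kg_h': 1500.0}
```

Nothing in the composition logic inspects the carrier type, which is why a dedicated library is only justified when enough carrier-specific operators exist to fill it.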

Part 6 — How to think about adding a new sister library

Three questions to ask:

  1. Is there a clearly-typed carrier? If yes → potential sister library. If no → it belongs in an existing library or in pipekit core.

  2. Are there ≥ 8 operators that would consume this carrier natively? If yes → worth its own library. If no → contribute the operators to the closest existing library.

  3. Does the carrier already have a wide-enough standard? Array API has this for arrays. xarray.Dataset is its own de facto standard. Pandas / polars don’t fit cleanly. If the carrier has a coherent standard, sister-library is easy; if not, you’re inventing the standard along with the operators.
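The three questions can be read as a short decision procedure. The sketch below encodes them in order; the function name and return strings are invented for illustration, and only the threshold of 8 operators comes from the text above.

```python
def sister_library_verdict(has_typed_carrier, n_native_operators,
                           has_carrier_standard, min_operators=8):
    """Walk the three questions in order and return a short recommendation."""
    # Question 1: without a clearly typed carrier there is nothing to wrap.
    if not has_typed_carrier:
        return "keep in an existing library or pipekit core"
    # Question 2: too few native operators do not justify a package.
    if n_native_operators < min_operators:
        return "contribute operators to the closest existing library"
    # Question 3: a coherent standard decides how hard the work will be.
    if has_carrier_standard:
        return "sister library"
    return "sister library, but the carrier standard must be invented too"


# Illustrative inputs only; the operator counts are made up for the example.
print(sister_library_verdict(True, 12, True))
print(sister_library_verdict(True, 3, True))
```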

By those criteria:

Summary

| Layer | Library | What it owns | Status |
| --- | --- | --- | --- |
| Core | pipekit | Framework: Operator, Sequential, Graph, observe, control, qc, parallel | New |
| Array | pipekit-array | Array API operators: ApplyToBands, Subsample, ModelOp, Diff, AssertValueRange, … | New |
| Geo | geotoolz | GeoTensor operators: io, geom, radiometry, indices, cloud, presets, readers, … | Existing (refactored) |
| xarray | xr_toolz | xarray operators: validation, crs, subset, detrend, interpolate, metrics, ocn, atm, viz | Existing (refactored) |
| Future | pipekit-jax | JAX-traceable operators with static metadata | Deferred |
| Out of scope | pipekit-dask, pipekit-numba, pipekit-pandas | Different problems; not framework concerns | Not planned |

The honest takeaway: two new packages (pipekit, pipekit-array) plus refactoring two existing ones (geotoolz, xr_toolz). Everything else either doesn’t justify a separate library, doesn’t have a clear carrier standard, or is solving a different problem entirely.