Report 14 — pipekit-evaluate: multidimensional evaluation as first-class infrastructure
| Status | Scoping proposal — committed (two-part: framework + package) |
| Reading time | ~30 min |
| Decisions locked in | A dedicated evaluation package, not a submodule. Three orthogonal axes (Unit × Lens × Stage), not the source’s flat five-axis list. Lagrangian and event-based evaluation need supporting infrastructure (xr_toolz.lagrangian, xr_toolz.events, EventCatalog) which are scoped as siblings. Monorepo development. |
| Audience | Anyone reviewing how the GeoStack assesses ML / hybrid models against physical-fidelity requirements |
| Companion reports | Reports 1–13 (existing stack), ../supporting_info/geodata_lifecycle.md (data lifecycle companion), ../supporting_info/geostack_vision.md |
| Source material | External scientific writing on multi-dimensional ocean ML evaluation: scales, data representation, physical representation, process consistency, phenomena-based verification |
What this report does
Four parts, with a two-part core, because the framework framing is generalisable while the package proposal is one specific implementation:
Part 1 — The evaluation framework. A taxonomy with three orthogonal axes (Unit × Lens × Stage), a matrix view, and a mapping onto existing and missing pieces of the stack. Generalisable beyond ocean.
Part 2 — The pipekit-evaluate package proposal. Concrete scoping in the style of Reports 6, 7, 10, 11, 12, 13.
Part 3 — Supporting infrastructure gaps. The package alone isn’t enough; three other pieces (xr_toolz.lagrangian, xr_toolz.events, EventCatalog) need to ship in parallel.
Part 4 — Recommendations and tradeoffs.
The framework framing (Part 1) is what should be adopted into v2; the package proposal (Part 2) is one structurally honest way to implement it.
Part 1 — The Evaluation Framework
1.1 — Why evaluation is multidimensional
A single scalar metric on a held-out set is structurally insufficient for geophysical ML. Three concrete failure modes motivate the framework:
The double-penalty trap. An L2 loss penalises a correctly shaped but spatially displaced feature twice: once for being missing where it should be, once for being present where it shouldn’t be. Gradient descent therefore learns to predict a flat mean state rather than risk spatial phase errors, so the model can achieve low loss while producing physically meaningless output.
Scale-blind aggregation. Standard losses preferentially weight large-amplitude, low-wavenumber structures because that’s where the variance lives. Submesoscale and high-frequency features — often the scientifically interesting bits — get smoothed out, the spectral slope steepens artificially, and the model’s “skill” is dominated by features that were always going to be easy.
Eulerian-only blindness. A model can have excellent agreement on gridded fields (the Eulerian frame) while completely misrepresenting how fluid parcels move through those fields (the Lagrangian frame). Small phase errors in velocity integrate into large trajectory divergence. Transport-relevant skill is invisible in standard scores.
Three failure modes that good aggregate metrics actively hide. Honest evaluation has to decompose along the dimensions where these failures live.
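The double-penalty trap can be reproduced in a few lines. The following is a minimal sketch on a synthetic 1D field (the Gaussian bump, its width, and the 10-cell displacement are illustrative assumptions, not values from the source): a prediction with the right feature in the wrong place scores worse under MSE than a prediction with no feature at all.

```python
import numpy as np

def gaussian_bump(x, centre, width=2.0):
    """A smooth, localised feature (stand-in for an eddy or front)."""
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

def mse(pred, ref):
    return float(np.mean((pred - ref) ** 2))

x = np.arange(100, dtype=float)
truth = gaussian_bump(x, centre=40.0)
displaced = gaussian_bump(x, centre=50.0)  # correct shape, wrong location
flat = np.zeros_like(truth)                # no feature predicted at all

mse_displaced = mse(displaced, truth)  # pays for the miss AND the false alarm
mse_flat = mse(flat, truth)            # pays only for the miss

print(f"displaced: {mse_displaced:.4f}  flat: {mse_flat:.4f}")
```

Because the two bumps barely overlap, the displaced prediction is penalised roughly twice as much as the featureless one, which is the incentive towards blurred output described above.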
1.2 — Three orthogonal axes, not five flat categories
The source presents five categories — Scales, Data Representation, Physical Representation, Process Evaluation, Phenomena-Based — as if they were coequal. They aren’t. Some are about where you evaluate; some are about what mathematical form the metric takes; some are about frame of reference; some are about what’s being scored. Mixing these creates apparent overlap (the Lagrangian content appears in two of the five) and obscures the real structure.
The clean decomposition is three orthogonal axes:
UNIT LENS STAGE
(what's being scored) (what kind of critique) (when in lifecycle)
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ Field │ │ Point-wise │ │ Training │
│ Statistic │ │ Probabilistic │ │ (differentiable, │
│ Trajectory │ × │ Spectral │ × │ fast) │
│ Event │ │ Structural │ │ Validation │
│ Budget │ │ Detection │ │ (cheap, frequent) │
│ │ │ Physical-constraint │ │ Final eval │
│ │ │ │ │ (comprehensive) │
│ │ │ │ │ Monitoring │
│ │ │ │ │ (continuous, light) │
└──────────────────────┘ └──────────────────────┘ └──────────────────────┘

Why this matters. Three axes cross to give a 3D space of evaluation operations, not five competing lists. Concretely:
“Eulerian field × spectral × at submesoscale” → PSD comparison (the source’s Scales decomposition applied to Data Representation)
“Lagrangian trajectory × probabilistic × at climate scale” → ensemble dispersion CRPS
“Event × detection × by region” → marine-heatwave POD/FAR by basin
“Budget × physical-constraint × continuously during long rollout” → mass conservation drift over time
This reframes the source’s five categories as combinations of axis choices rather than competing taxonomies. Cleaner; also matches how evaluation actually composes in practice.
1.3 — The Unit axis: what gets scored
Five things you can score. Each one demands its own carrier type.
| Unit | What it is | Existing stack support |
|---|---|---|
| Field | Continuous gridded prediction vs. reference. The default ML case. | xr.Dataset / GeoTensor — fully supported |
| Statistic | Derived summary: PSD, histogram, structure function, moments | Partial — xr_toolz.transforms has FFT/wavelets but not comparison ops |
| Trajectory | Particle path or drifter track in a velocity field | Missing — needs xr_toolz.lagrangian (Part 3) |
| Event | Discrete phenomenon (eddy, marine heatwave, atmospheric river) | Missing — needs xr_toolz.events + EventCatalog (Part 3) |
| Budget | Conservation closure: mass, energy, tracer, PV over a control volume | Partial — xr_toolz.budgets exists; not evaluation-shaped |
The point: the operator shape for scoring a Field is (prediction, reference) → scalar. The shape for scoring an Event is (predicted_events, reference_events) → ContingencyTable + AttributeErrors. The shape for scoring a Trajectory is (predicted_velocity_field, reference_drifters) → dispersion_statistics. These are different operator signatures and they need different supporting infrastructure.
1.4 — The Lens axis: what kind of critique
Six lenses, each targeting a specific failure mode. The Lens axis is what makes a metric a critique, not just a distance.
| Lens | Targets | Examples | Differentiable? |
|---|---|---|---|
| Point-wise | Magnitude bias | RMSE, MAE, NSE, IoA | Yes |
| Probabilistic | Calibration of uncertainty | CRPS, Energy Score, Brier, Rank Histogram | Yes (CRPS, Brier) |
| Spectral | Scale-dependent variance | PSD comparison, spectral slope fitting, KE spectrum | Yes (in spectral space) |
| Structural | Spatial coherence | SSIM, perceptual loss, FSS | Yes |
| Detection | Event presence / absence | POD, FAR, CSI, IoU, contingency table | No (counts) |
| Physical-constraint | Conservation / balance | Geostrophic balance, stratification, PV invariance, mass closure | Sometimes |
Six lenses of differing critique strength. Honest practice picks a lens that can detect the failure modes that matter for the use case, not just whichever lens is cheapest. In the terms of Section 1.1: the double-penalty trap is the point-wise lens hiding what a structural or spectral lens would catch; scale-blind aggregation (spectral blurring) is the point-wise lens hiding what the spectral lens catches; Eulerian-only blindness is the Field unit hiding what the Trajectory unit catches.
1.5 — The Stage axis: when in the lifecycle
The third axis the source underdevelops. A metric’s operational role determines what’s acceptable for cost and differentiability:
| Stage | Constraints | Examples |
|---|---|---|
| Training | Must be differentiable; fast (per batch) | Point-wise, CRPS, spectral-band MSE, simple FSS |
| Validation | Fast; computed every epoch or every N steps | Same as training + cheap structural / detection |
| Final eval | Can be expensive; computed once on test set | All lenses; full LCS computation; full event matching |
| Monitoring | Continuous in production; lightweight | Drift detection on a small selection of metrics |
Why this matters operationally. Perceptual losses with deep-feature comparison cost ~100ms per batch — fine for validation, too expensive for training. LCS computation requires advecting millions of particles — final-eval only. Conservation budget closure needs full forecast trajectories — final-eval only. Without the Stage axis, the catalogue of metrics is a flat list with no operational guidance.
1.6 — The Lens × Unit matrix
The two main axes cross. Not every cell is populated; the populated cells are the operationally useful metrics:
Rows are lenses; columns are units (what's being scored).

| Lens \ Unit | Field | Statistic | Trajectory | Event | Budget |
|---|---|---|---|---|---|
| Point-wise | RMSE, MAE, NSE | — | TrajRMSE, Endpoint separation | CentDisp, Area error | — |
| Probabilistic | CRPS, EnsRMSE | EnergyScr, RankHist | Dispersion CRPS | Brier, Reliab | — |
| Spectral | PSDcmp, KE spec | Slope fit, Struct fn | — | — | — |
| Structural | SSIM, FSS, Percept | — | LCS overlap, Filament stats | IoU, Hausdorff | — |
| Detection | — | — | — | POD, FAR, CSI | — |
| Physical-constraint | Geostr balance, Strat | — | PV conserv, FTLE | — | Mass, Energy, Tracer |

Reading the matrix:
Empty cells are honest gaps where the (unit, lens) combination doesn’t yield a useful metric
The Field × Point-wise corner is where 90% of current ML evaluation happens; the rest of the matrix is what good practice opens up
Each populated cell is a concrete operator class to ship in pipekit-evaluate
1.7 — The Reference Frame is a sub-decomposition of Unit
The source treats Eulerian / Lagrangian as a top-level axis. I’d argue it’s actually a sub-decomposition of the Unit axis: Eulerian Fields and Lagrangian Trajectories are different units. Conflating frame and unit is what produces the source’s Section 6.3 / 6.4 duplication, where Lagrangian content appears in both Physical Representation and Process Evaluation.
Cleaner restatement: the Field unit is implicitly Eulerian (state at fixed coordinates); the Trajectory unit is implicitly Lagrangian (state following parcels). Both are needed; both are first-class.
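To make the Field-versus-Trajectory distinction concrete, here is a toy sketch (not from the source) using a 1D linear strain flow whose stagnation point is slightly misplaced in the "prediction". The strain rate, offset, and integration window are illustrative assumptions. The Eulerian velocity error is small and constant everywhere, while a parcel released near the stagnation point ends up far from its reference position.

```python
import numpy as np

def rk4_advect(velocity, x0, t_end, n_steps):
    """Integrate dx/dt = velocity(x) with the classical RK4 scheme."""
    x = float(x0)
    dt = t_end / n_steps
    for _ in range(n_steps):
        k1 = velocity(x)
        k2 = velocity(x + 0.5 * dt * k1)
        k3 = velocity(x + 0.5 * dt * k2)
        k4 = velocity(x + dt * k3)
        x += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x

strain = 1.0   # strain rate (assumed)
offset = 0.01  # stagnation-point displacement in the prediction (assumed)

def u_ref(x):
    return strain * x

def u_pred(x):
    return strain * (x - offset)

# Eulerian view: RMSE of the velocity field on a fixed grid.
grid = np.linspace(-1.0, 1.0, 201)
eulerian_rmse = float(np.sqrt(np.mean((u_pred(grid) - u_ref(grid)) ** 2)))

# Lagrangian view: advect the same parcel through both fields.
x_start = 0.5 * offset
end_ref = rk4_advect(u_ref, x_start, t_end=5.0, n_steps=500)
end_pred = rk4_advect(u_pred, x_start, t_end=5.0, n_steps=500)
separation = abs(end_ref - end_pred)

print(f"Eulerian RMSE: {eulerian_rmse:.4f}  endpoint separation: {separation:.4f}")
```

For this flow the separation has the closed form offset × (exp(strain × t) − 1), so it grows exponentially while the field error stays fixed. The two numbers carry different units, so the comparison is qualitative, but it shows why a Field-unit score alone cannot certify transport skill.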
1.8 — Decomposition is a separate operation: the Scale axis revisited
The source’s “Scales of Evaluation” mixes two different operations:
Regional partitioning — evaluate on coastal vs. open ocean vs. polar separately. The data is decomposed; the metric is unchanged.
Scale partitioning — evaluate at submesoscale vs. mesoscale separately. The metric is computed in scale-space; the input is unchanged.
These are different operations on different axes. In the package design they become different operator families:
ByRegion(regions: dict[str, mask]) — applies a metric separately per region; aggregates results
ByScale(decomposition: SpectralDecomp | WaveletDecomp) — computes metric in scale-decomposed space
ByLeadTime(...) — applies a metric separately per forecast lead
ByEvent(detector) — restricts evaluation to event-occupied regions
All of these are evaluation lenses applied on top of metrics. They don’t replace metrics; they compose with them. This is what the source’s “scale partitioning” should have been: not a coequal category, but a wrapping operation.
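A wrapping operation of this kind can be sketched in plain NumPy. The class below is a hypothetical stand-in for ByRegion, not the package's implementation: the constructor signature, the use of boolean masks, and the convention that a None mask means "whole domain" are assumptions chosen to mirror the usage shown later in this report.

```python
import numpy as np

def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

class ByRegion:
    """Hypothetical wrapper: apply one metric separately per region mask."""

    def __init__(self, regions, metric):
        self.regions = regions  # name -> boolean mask, or None for the whole domain
        self.metric = metric

    def __call__(self, pred, ref):
        scores = {}
        for name, mask in self.regions.items():
            if mask is None:
                scores[name] = self.metric(pred, ref)
            else:
                scores[name] = self.metric(pred[mask], ref[mask])
        return scores

# Toy data: the prediction is wrong only in the northern half of the grid.
ref = np.zeros((4, 4))
pred = np.zeros((4, 4))
pred[:2, :] = 1.0

north = np.zeros((4, 4), dtype=bool)
north[:2, :] = True

lens = ByRegion({"north": north, "south": ~north, "global": None}, rmse)
scores = lens(pred, ref)
print(scores)
```

The metric itself is untouched; only the data it sees is partitioned. The global score averages away the fact that all of the error sits in one region, which is the information the wrapper recovers.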
1.9 — How this maps to the existing stack
| Framework piece | Current stack |
|---|---|
| Field-unit, point-wise metrics | xr_toolz.metrics (good coverage); pipekit-array.metrics |
| Statistic-unit derivation | xr_toolz.transforms (FFT, wavelets) — derivation only, no comparison ops |
| Trajectory-unit | Missing entirely — no particle advection in the stack |
| Event-unit | Missing entirely — no event detection or matching |
| Budget-unit | xr_toolz.budgets — exists but isn’t evaluation-shaped |
| Probabilistic lens | Partial — xskillscore integration in xr_toolz.metrics |
| Spectral lens (compare PSDs) | Missing — building blocks in transforms but no comparison ops |
| Structural lens (SSIM, FSS) | Missing |
| Detection lens (POD, FAR, CSI) | Missing — likely belongs in pipekit-array.metrics.classification |
| Physical-constraint lens | Partial — xr_toolz.calc has gradient ops but no balance-check evaluation |
| Lens composition (ByRegion, ByScale) | Missing — no decomposition wrappers |
| Stage discipline | Missing — no annotation of which metrics suit which stage |
This is the gap analysis. The current stack has the top-left corner of the matrix (Field × Point-wise) covered well, and very little of the rest. The framework’s value is making the gaps visible.
Part 2 — The pipekit-evaluate Package
2.1 — Where it sits in the stack
Domain libraries        geotoolz │ xr_toolz
                             ▲
                             │
Infrastructure      ┌─ pipekit-cycle ─────┐
                    │  pipekit-train      │
                    │  pipekit-evaluate   ◄── (this report)
                    │  pipekit-experiment │
                    │  statecatalog       │
                    └─────────────────────┘
                             ▲
                             │
Framework           pipekit ◄── pipekit-array

pipekit-evaluate is an infrastructure-layer sibling of pipekit-cycle, pipekit-train, pipekit-experiment. Three observations:
Carrier-agnostic core, carrier-specific operators. The Protocols don’t know what’s flowing through; concrete metrics for Field are domain-shaped (xarray vs. GeoTensor) and live with their carrier in xr_toolz / geotoolz.
Cross-package dependency profile. It reads from observation catalogs (geocatalog) and state catalogs (statecatalog); produces artifacts that flow to pipekit-experiment.ModelRegistry for tracking; integrates with pipekit-train so the same metric can be a loss in training and an evaluator in validation.
Composable with pipekit-cycle. Evaluating a forecast pipeline isn’t “run the forecast then evaluate”; it can be “instrument the forecast cycle with evaluation hooks at each step.” The same lens machinery that decomposes a final-eval report can be used to log per-step diagnostics during a long rollout.
2.2 — Why a dedicated package, not spread across existing ones
The honest alternative is “put metrics in each domain library.” That’s where they live today (xr_toolz.metrics, etc.). Three reasons for consolidation:
The Protocols are framework-level. Metric, EvaluationLens, EvaluationReport are abstractions that all domain libraries need to satisfy. Putting them in pipekit core bloats core; putting them in one domain library inverts dependencies.
The matrix view is structurally coherent. A user wanting “probabilistic spectral evaluation of a forecast” is touching the Probabilistic lens, the Spectral lens composition, and the Field unit. These spread across three packages today. A dedicated home makes the coherent surface discoverable.
EvaluationReport is a registry artifact. Like pipekit-experiment.TrainingArtifact, it’s a content-addressed serialisable thing. It belongs alongside other artifact types in a peer infrastructure package.
The alternative (spread across existing packages) is acceptable for v0.1 — much of pipekit-evaluate’s content is moving symbols around — but the Protocols + Report + Lens composition are new framework code that needs a home. Pragmatic recommendation: start the package, even if v0.1 is mostly re-exports from xr_toolz.metrics + a few new operators.
2.3 — Source layout
pipekit-evaluate/
    __init__.py                # public re-exports
    _src/
        protocols.py           # Metric, EvaluationLens, EvaluationUnit Protocols
        units.py               # Field, Statistic, Trajectory, Event, Budget carrier types
        report.py              # EvaluationReport, ReportEntry, content-addressed serialisation
        metrics/
            pointwise.py       # RMSE, MAE, NSE, IoA, NormalisedBias
            probabilistic.py   # CRPS, EnergyScore, BrierScore, RankHistogram
            spectral.py        # PSDCompare, SpectralSlopeFit, KineticEnergySpectrum, StructureFunction
            structural.py      # SSIM, FSS, PerceptualLoss
            detection.py       # POD, FAR, CSI, IoU, Hausdorff, ContingencyTable
            physical.py        # GeostrophicBalance, StaticStability, MassClosure, PVConservation
        lenses/
            regional.py        # ByRegion
            scale.py           # ByScale (spectral, wavelet)
            temporal.py        # ByLeadTime, ByMonth, BySeason
            event.py           # ByEvent (restrict to event-occupied)
            ensemble.py        # ByMember (per-member stats), Aggregated (ensemble-mean)
        aggregations.py        # Mean, StratifiedMean, ReportTable
        adapters/
            xr_toolz.py        # Bridges to xr_toolz.metrics (extras-gated)
            geotoolz.py        # Bridges to geotoolz.metrics
            xskillscore.py     # xskillscore integration
            properscoring.py   # properscoring integration
            train.py           # pipekit-train integration: any Metric as a Loss
            experiment.py      # pipekit-experiment integration: log Reports

Total estimate: ~1500 LOC of framework + ~800 LOC of metric implementations (much re-using existing code).
2.4 — The Protocols
from typing import Any, Literal, Protocol, runtime_checkable


@runtime_checkable
class EvaluationUnit(Protocol):
    """Type marker for things that can be scored.

    Concrete implementations: Field (Dataset / GeoTensor), Statistic
    (1D distribution / spectrum), Trajectory, Event, Budget.
    """

    @property
    def unit_kind(self) -> Literal["field", "statistic", "trajectory", "event", "budget"]: ...


@runtime_checkable
class Metric(Protocol):
    """The evaluation operator. Takes prediction + reference, returns score(s)."""

    @property
    def lens(self) -> Literal["pointwise", "probabilistic", "spectral",
                              "structural", "detection", "physical_constraint"]: ...

    @property
    def differentiable(self) -> bool: ...

    @property
    def stage_compatibility(self) -> set[Literal["training", "validation",
                                                 "final", "monitoring"]]: ...

    def __call__(self, prediction: EvaluationUnit, reference: EvaluationUnit) -> Any: ...


@runtime_checkable
class EvaluationLens(Protocol):
    """A decomposition wrapper. Applies a metric in a structured way.

    ByRegion, ByScale, ByLeadTime, ByEvent, ByMember are concretes.
    """

    def __call__(self, metric: Metric, prediction: EvaluationUnit,
                 reference: EvaluationUnit) -> "ReportEntry": ...

Three Protocols. Runtime-checkable. Other packages (xr_toolz, geotoolz) satisfy them structurally.
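Structural satisfaction means a metric class never has to inherit from the Protocol. The sketch below re-declares a trimmed copy of Metric so that it runs standalone; the RMSE and HitCount classes and the metrics_for_stage helper are illustrative assumptions, not part of the proposed API. It also shows how stage_compatibility could drive stage discipline.

```python
from typing import Any, Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class Metric(Protocol):
    """Trimmed copy of the Metric Protocol, for a standalone example."""
    @property
    def lens(self) -> str: ...
    @property
    def differentiable(self) -> bool: ...
    @property
    def stage_compatibility(self) -> set: ...
    def __call__(self, prediction: Any, reference: Any) -> Any: ...

class RMSE:
    """No inheritance: plain class attributes satisfy the Protocol members."""
    lens = "pointwise"
    differentiable = True
    stage_compatibility = {"training", "validation", "final", "monitoring"}

    def __call__(self, prediction, reference):
        diff = np.asarray(prediction, dtype=float) - np.asarray(reference, dtype=float)
        return float(np.sqrt(np.mean(diff ** 2)))

class HitCount:
    """A count-based detection metric: not differentiable, not for training."""
    lens = "detection"
    differentiable = False
    stage_compatibility = {"validation", "final"}

    def __call__(self, prediction, reference):
        return int(np.sum(np.asarray(prediction, bool) & np.asarray(reference, bool)))

class NotAMetric:
    """Callable, but missing the descriptive attributes."""
    def __call__(self, prediction, reference):
        return 0.0

def metrics_for_stage(metrics, stage):
    """Hypothetical helper: keep only metrics declared usable at `stage`."""
    return [m for m in metrics if stage in m.stage_compatibility]

rmse, hits = RMSE(), HitCount()
print(isinstance(rmse, Metric), isinstance(NotAMetric(), Metric))
print([type(m).__name__ for m in metrics_for_stage([rmse, hits], "training")])
```

Note that a runtime isinstance check against a Protocol only verifies that the members exist; it does not validate their types or the call signature, so static type checking remains the stronger guarantee.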
2.5 — The EvaluationReport artifact
The output of an evaluation campaign. Sibling of TrainingArtifact from Report 12.
from dataclasses import dataclass
from datetime import datetime

import numpy as np
import pandas as pd


@dataclass
class ReportEntry:
    """One row in an evaluation report."""

    metric_name: str
    lens_path: tuple[str, ...]   # e.g., ("ByRegion", "atlantic", "ByLeadTime", "24h")
    value: float | np.ndarray | dict
    unit: str                    # the EvaluationUnit kind
    stage: str                   # at which lifecycle stage this was computed


@dataclass
class EvaluationReport:
    """Composable evaluation artifact.

    Aggregates results across lenses, metrics, units, stages.
    Content-addressed: hash(entries, model_hash, dataset_hash, config).
    Serializable, diffable, registrable in pipekit-experiment.
    """

    entries: list[ReportEntry]
    model_ref: str               # hash from ModelRegistry
    dataset_ref: str             # content_hash of evaluation dataset
    pipeline_config: dict        # the evaluation pipeline YAML
    timestamp: datetime

    def to_pandas(self) -> pd.DataFrame: ...
    def to_dict(self) -> dict: ...
    def content_hash(self) -> str: ...
    def diff(self, other: "EvaluationReport") -> "ReportDiff": ...
    def filter(self, **selectors) -> "EvaluationReport": ...

Two operations that matter:
.diff(other) — compare two reports (e.g., v3 vs. v4 of a model) across all dimensions
.filter(lens="ByRegion", region="atlantic") — slice the report along the matrix
EvaluationReport is content-addressable. Registered in pipekit-experiment.ModelRegistry alongside the model it evaluated. Provenance closes: model → training artifact → trained-model hash → evaluation artifact → score, all traceable.
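One way content addressing could work is to hash a canonical serialisation of the report. The sketch below is an assumption about the mechanism, not the package's implementation: it uses JSON with sorted keys and SHA-256, treats entries as plain dicts with JSON-serialisable values (NumPy arrays would need converting first), and leaves the timestamp out of the hash so that re-running an identical evaluation yields the same address.

```python
import hashlib
import json

def content_hash(entries, model_ref, dataset_ref, pipeline_config):
    """Hash a canonical JSON form of the report's identifying content."""
    payload = {
        "entries": entries,
        "model_ref": model_ref,
        "dataset_ref": dataset_ref,
        "pipeline_config": pipeline_config,
    }
    # sort_keys and fixed separators make the byte string independent of dict order
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

entries = [
    {"metric_name": "RMSE", "lens_path": ["ByRegion", "arctic"], "value": 0.42,
     "unit": "field", "stage": "final"},
]
same_entries_reordered = [
    {"stage": "final", "unit": "field", "value": 0.42,
     "lens_path": ["ByRegion", "arctic"], "metric_name": "RMSE"},
]
changed_entries = [dict(entries[0], value=0.43)]

h1 = content_hash(entries, "model-abc", "data-xyz", {"metrics": ["RMSE"]})
h2 = content_hash(same_entries_reordered, "model-abc", "data-xyz", {"metrics": ["RMSE"]})
h3 = content_hash(changed_entries, "model-abc", "data-xyz", {"metrics": ["RMSE"]})
print(h1 == h2, h1 == h3)
```

Key order does not affect the hash, while any change to a value, the model reference, the dataset reference, or the configuration does.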
2.6 — Worked example: comprehensive forecast evaluation
import pipekit_evaluate as pe
import pipekit_experiment as px
import statecatalog as sc  # assumed alias; the original snippet uses `sc` without importing it

# t0, H, arctic_mask and tropics_mask are assumed to be defined by the caller.

# Pull a forecast and its verifying analyses from statecatalog
state_cat = sc.DuckDBStateCatalog.open("s3://reanalysis/methane_v3/states.parquet")
forecasts = state_cat.queries.forecast_chain(run_time=t0, model_config_hash=H)
truth_chain = [state_cat.queries.latest_analysis(before=f.valid_time) for f in forecasts]

# Define the evaluation pipeline as composable operators
evaluation = pe.Pipeline([
    # Point-wise on the Field unit, decomposed by lead time and region
    pe.ByLeadTime(
        pe.ByRegion(
            regions={"arctic": arctic_mask, "tropics": tropics_mask, "global": None},
            metric=pe.metrics.pointwise.RMSE(),
        ),
    ),
    # Spectral evaluation: does the model preserve the kinetic-energy spectrum?
    pe.metrics.spectral.PSDCompare(
        spatial_dims=("lat", "lon"),
        expected_slope=-3.0,
        tolerance=0.5,
    ),
    # Probabilistic: if the forecast is ensemble, score ensemble calibration
    pe.metrics.probabilistic.CRPS(ensemble_dim="member"),
    # Physical constraint: did mass conservation close over the forecast window?
    pe.metrics.physical.MassClosure(tolerance=0.01),
    # Phenomena: did predicted marine heatwaves match observed?
    pe.ByEvent(
        detector=pe.events.MarineHeatwaveDetector(
            percentile=90, min_duration=5,
        ),
        metric=pe.metrics.detection.CSI(),
    ),
])

# Run the evaluation
report = evaluation.run(predictions=forecasts, reference=truth_chain)

# The report aggregates everything
print(report.to_pandas())
# metric       lens_path                              value                          unit
# ─────────────────────────────────────────────────────────────────
# RMSE         ByLeadTime=24h, ByRegion=arctic        0.42                           field
# RMSE         ByLeadTime=24h, ByRegion=tropics       0.18                           field
# RMSE         ByLeadTime=48h, ByRegion=arctic        0.61                           field
# ...
# PSDCompare   (none)                                 {slope: -2.7, error: 0.30}     field
# CRPS         (none)                                 0.082                          field
# MassClosure  (none)                                 {drift: 1.2e-5}                budget
# CSI          ByEvent=MarineHeatwave                 0.71                           event

# Register the report, content-addressed alongside the model
registry = px.S3ModelRegistry(...)
registry.attach_evaluation(model_hash=H, report=report)

The same pipeline serves validation during training (evaluation.run(...) on a smaller subset) and final-eval (with the full test catalog). The Stage axis is what distinguishes them: differentiable metrics get exposed as training losses; expensive metrics run only at final-eval.
2.7 — Cross-package integration
2.7.1 — With pipekit-train
Any Metric can be a training loss:
import pipekit_evaluate as pe
import pipekit_train as pt  # assumed alias; the original snippet uses `pt` without importing it

# Use the same RMSE as both training loss AND validation metric
metric = pe.metrics.pointwise.RMSE()
loop = pt.TrainingLoop(
    model_op=...,
    dataset=...,
    loss=pe.adapters.train.MetricAsLoss(metric),  # the same metric
    val_dataset=...,
    callbacks=[
        pt.LogToExperiment(...),
        pe.adapters.train.EvaluationCallback(     # runs evaluation pipeline
            pipeline=evaluation_pipeline,
            every_n_epochs=5,
        ),
    ],
)

MetricAsLoss checks metric.differentiable and refuses if not. Stage discipline enforced.
2.7.2 — With pipekit-cycle
Per-step evaluation during long rollouts:
import pipekit as pk         # assumed alias
import pipekit_cycle as pc   # assumed alias
import pipekit_evaluate as pe

forecast_with_eval = pc.Cycle(
    step_op=pk.Sequential([
        forward_model,
        pe.MetricObserver(metric=pe.metrics.physical.MassClosure(), report_to=report),
    ]),
    n_steps=72,
)

MetricObserver is a pipekit.Operator that’s a no-op on the carrier but appends a ReportEntry to a side channel. Lets you monitor conservation drift, spectral degradation, etc., per-step during long rollouts.
2.7.3 — With catalogs
Evaluation is fundamentally cross-catalog: predictions live in statecatalog, references live in either statecatalog (model-vs-model) or geocatalog (model-vs-obs). The matchup pattern from ../supporting_info/geodata_lifecycle.md is exactly the right substrate:
from datetime import timedelta

import geocatalog as gc  # assumed alias; the original snippet uses `gc` without importing it

matched = gc.queries.matchup(
    primary=state_cat.queries.forecast_chain(...),
    secondary=obs_cat,
    time_tolerance=timedelta(hours=3),
    space_tolerance_km=25.0,
)
report = evaluation.run_on_matched(matched)

This is the cleanest version of “verify forecasts against in-situ observations”: the matchup produces co-located pairs; the evaluation pipeline scores each pair.
2.8 — Dependencies and optional extras
[project]
name = "pipekit-evaluate"
version = "0.1.0"
dependencies = [
"pipekit>=0.1",
"numpy>=2.0",
"scipy>=1.10",
]
[project.optional-dependencies]
# Carrier-specific integrations
xarray = ["xr-toolz>=0.1"]
geo = ["geotoolz>=0.1"]
# External scoring libraries
xskill = ["xskillscore>=0.0.26"]
proper = ["properscoring>=0.1"]
# Lagrangian evaluation (Part 3)
lagrangian = ["xr-toolz[lagrangian]>=0.1"]
# Event detection (Part 3)
events = ["xr-toolz[events]>=0.1"]
# Cross-package integration
catalog = ["geocatalog>=0.1", "statecatalog>=0.1"]
train = ["pipekit-train>=0.1"]
experiment = ["pipekit-experiment>=0.1"]
# Deep-feature perceptual losses
perceptual = ["torch>=2.0"] # for pretrained feature extractors
all = ["pipekit-evaluate[xarray,geo,xskill,proper,lagrangian,events,catalog,train,experiment]"]

Part 3 — Supporting Infrastructure Gaps
pipekit-evaluate alone isn’t enough. Three pieces of supporting infrastructure need to land in parallel — they’re prerequisites for several of the matrix cells.
3.1 — xr_toolz.lagrangian (particle tracking)
What’s missing: Particle advection through a velocity field. Required for the Trajectory unit and most Lagrangian-frame evaluation.
What it ships:
xr_toolz.lagrangian/
    particles.py    # Particle, ParticleSet — carrier types
    advection.py    # AdvectParticles (operator wrapping diffrax integration)
    dispersion.py   # SingleParticleDispersion, PairDispersion, RelativeDispersion
    ftle.py         # FiniteTimeLyapunovExponent operator
    lcs.py          # LagrangianCoherentStructures (variational, hyperbolic)
    filaments.py    # TracerFilamentStatistics

Dependencies: xarray, diffrax (for ODE integration), scipy (for sparse linalg in FTLE).
Why xr_toolz, not somax: Particle tracking is evaluation infrastructure, not a forward model. It consumes a velocity field (xarray) and produces trajectory statistics; somax is for ocean dynamics simulation. Worth being clear about this — there’s a temptation to put particle tracking in somax because it’s “ocean-y,” but it’s used at evaluation time on any velocity field including ones the somax forward model didn’t produce.
Effort: ~2 weeks. The hardest part is FTLE/LCS implementation; the rest is wrapping diffrax.
3.2 — xr_toolz.events (event detection)
What’s missing: Detect discrete phenomena in a Dataset (marine heatwaves, mesoscale eddies, atmospheric rivers, plume detections) and produce a structured Event artifact.
What it ships:
xr_toolz.events/
    protocols.py            # Event, EventDetector, EventCatalog
    detectors/
        threshold.py        # ThresholdDetector (climatological percentile)
        closed_contour.py   # ClosedContourDetector (e.g., for eddies via SSH)
        object.py           # ObjectDetector (connected components in masks)
    matching.py             # match_events(predicted, reference, tolerances)
    attributes.py           # Centroid, Area, Duration, Intensity, Trajectory

Effort: ~2 weeks. Detectors are bespoke per phenomenon; the framework is small.
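As an illustration of the threshold-plus-connected-components approach named in the layout, here is a minimal sketch using scipy.ndimage.label. The function name, the dict-based event record, and the fixed scalar threshold are assumptions for the example; a real marine-heatwave detector would use a climatological percentile and a minimum duration.

```python
import numpy as np
from scipy import ndimage

def detect_events(field, threshold):
    """Label connected regions above `threshold` and attach simple attributes."""
    mask = field > threshold
    labels, n_events = ndimage.label(mask)  # default: edge-connected neighbours in 2D
    events = []
    for event_id in range(1, n_events + 1):
        region = labels == event_id
        cy, cx = np.argwhere(region).mean(axis=0)  # centroid in grid-index units
        events.append({
            "id": event_id,
            "area": int(region.sum()),             # number of grid cells
            "centroid": (float(cy), float(cx)),
            "peak": float(field[region].max()),
        })
    return events

# Toy field with two separate warm anomalies.
field = np.zeros((10, 10))
field[1:3, 1:3] = 2.0
field[6:9, 5:8] = 3.0

events = detect_events(field, threshold=1.0)
for event in events:
    print(event)
```

The resulting records are the kind of input that match_events and the detection-lens metrics would consume: discrete objects with attributes, rather than gridded values.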
3.3 — EventCatalog — a new catalog type
What’s missing: Events need their own catalog. Once you detect marine heatwaves over 30 years, you have a catalog of events — each with bounds, duration, attributes, lineage to the source field. This is structurally a third catalog type alongside geocatalog (observations) and statecatalog (model states).
# Lives in xr_toolz.events.catalog or as a sibling package eventcatalog
class EventCatalog:
    """Indexed catalog of detected events.

    Wire format: EventSlice with bbox + time_window + event_id + attributes.
    Backends: InMemoryEventCatalog (pandas), DuckDBEventCatalog (extras-gated).
    """
    ...

This is analogous to statecatalog (Report 13): different wire format from observations, different query semantics, parallel Protocol design. Three-catalog model (geocatalog for observations, statecatalog for model states, EventCatalog for events) becomes the full data substrate.
Lean: ship initially as a submodule of xr_toolz.events, promote to sibling package only if it grows. Avoids creating a third catalog package speculatively.
3.4 — pipekit-array.metrics.classification
What’s missing: Detection lens metrics (POD, FAR, CSI, IoU, Brier) are classification metrics, not regression. They’re array-shaped, not domain-specific. They belong in pipekit-array.metrics.classification alongside the existing regression metrics.
What it ships: A small module, ~150 LOC, with the canonical detection metrics implemented multi-backend (numpy / JAX / etc.) via the Array API.
Effort: ~3 days.
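The canonical detection scores all derive from three counts. The sketch below uses NumPy only (the multi-backend Array API version described above is not attempted here), and the function names are illustrative. FAR is taken to be the false alarm ratio, false alarms divided by all predicted events, which is the usual meaning alongside POD and CSI.

```python
import numpy as np

def contingency(pred, ref):
    """Return (hits, misses, false_alarms) for binary event masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    hits = int(np.sum(pred & ref))
    misses = int(np.sum(~pred & ref))
    false_alarms = int(np.sum(pred & ~ref))
    return hits, misses, false_alarms

def pod(hits, misses, false_alarms):
    """Probability of detection: fraction of observed events that were predicted."""
    return hits / (hits + misses) if hits + misses else float("nan")

def far(hits, misses, false_alarms):
    """False alarm ratio: fraction of predicted events that did not occur."""
    return false_alarms / (hits + false_alarms) if hits + false_alarms else float("nan")

def csi(hits, misses, false_alarms):
    """Critical success index: hits over all predicted-or-observed events."""
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")

ref = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 1, 0, 0, 1, 0]

counts = contingency(pred, ref)
print(counts, pod(*counts), far(*counts), csi(*counts))
```

None of these scores use correct negatives, which is why they suit rare-event verification, where "nothing happened and nothing was predicted" dominates the grid.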
3.5 — Updates to xr_toolz.transforms for spectral comparison
What’s missing: xr_toolz.transforms has FFT and wavelet derivation operators. It doesn’t have comparison operators (compare two PSDs, fit a spectral slope and test against expected exponent, compute KE spectrum at multiple latitudes and stratify).
What gets added: ~5 new operators in xr_toolz.transforms.spectral_compare or — better — pipekit-evaluate.metrics.spectral (since these are evaluation operators, not transforms).
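A spectral-slope comparison operator might look roughly like the following sketch. It is 1D, periodic, and unwindowed for simplicity; the function names, the fitting band, and the Gaussian damping used to mimic a blurred prediction are all assumptions for illustration. The reference signal is synthesised with power proportional to k^-3, so its fitted slope should recover -3, while the damped copy should come out steeper.

```python
import numpy as np

def power_spectrum(signal):
    """Raw periodogram of a periodic 1D signal, mean (k = 0) dropped."""
    coeffs = np.fft.rfft(signal)
    k = np.arange(coeffs.size, dtype=float)
    return k[1:], np.abs(coeffs[1:]) ** 2

def fit_spectral_slope(k, power, k_min, k_max):
    """Least-squares slope of log(power) against log(k) within a band."""
    band = (k >= k_min) & (k <= k_max)
    slope, _intercept = np.polyfit(np.log(k[band]), np.log(power[band]), 1)
    return float(slope)

# Synthesise a reference with amplitude ~ k^-1.5, i.e. power ~ k^-3.
rng = np.random.default_rng(0)
n = 1024
k_all = np.arange(n // 2 + 1, dtype=float)
amplitude = np.zeros_like(k_all)
amplitude[1:] = k_all[1:] ** -1.5
coeffs = amplitude * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, k_all.size))
coeffs[-1] = amplitude[-1]  # the Nyquist coefficient of a real signal must be real
reference = np.fft.irfft(coeffs, n=n)

# A "blurred" prediction: same phases, high wavenumbers damped.
prediction = np.fft.irfft(coeffs * np.exp(-(k_all / 100.0) ** 2), n=n)

slope_ref = fit_spectral_slope(*power_spectrum(reference), k_min=10, k_max=100)
slope_pred = fit_spectral_slope(*power_spectrum(prediction), k_min=10, k_max=100)

expected_slope, tolerance = -3.0, 0.5
print(f"reference slope {slope_ref:.2f}, prediction slope {slope_pred:.2f}")
print("prediction within tolerance:", abs(slope_pred - expected_slope) <= tolerance)
```

On real fields the result depends on windowing, detrending, and the choice of fitting band, so an operator of this kind would need those choices exposed and documented rather than hard-coded as here.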
Part 4 — Recommendations & Tradeoffs
4.1 — v2 vision document edits
Three additions to make v2 fully integrate the evaluation framework:
Add the three-axis framework (Unit × Lens × Stage) to the v2 Geo-Task Taxonomy section. The current taxonomy has Time / Space / Variables / Representation / Tier — these are about data. The evaluation framework adds three about assessing models.
Add a section on Multi-dimensional Evaluation between “The L0–L4 Pipeline” and the design principles, citing the three failure modes (double-penalty, scale-blind aggregation, Eulerian blindness). This is the operational counterpart to the data-tier framing.
Strengthen the “ML at every level” principle by noting that evaluation is also at every level — different metrics suit L0–L2 retrievals (point-wise field) vs. L3–L4 forecasts (probabilistic + physical-constraint + phenomena).
4.2 — Build order
Realistic sequencing:
v0.1 of pipekit-evaluate (3-4 weeks)
• Protocols (Metric, EvaluationLens, EvaluationUnit)
• EvaluationReport artifact + content-addressing
• Point-wise + probabilistic metrics (re-export from xr_toolz + xskillscore)
• ByRegion, ByLeadTime lenses
• Basic adapters to pipekit-train and pipekit-experiment
v0.2 of pipekit-evaluate + xr_toolz.lagrangian (4-6 weeks)
• Lagrangian operators in xr_toolz
• Trajectory unit + dispersion metrics in pipekit-evaluate
• Spectral comparison operators
• Structural metrics (SSIM, FSS)
v0.3 of pipekit-evaluate + xr_toolz.events (4-6 weeks)
• Event detection in xr_toolz
• Event unit + detection metrics in pipekit-evaluate
• EventCatalog (initially as submodule)
• Physical-constraint metrics (conservation budgets, balance checks)
v0.4 of pipekit-evaluate (2-3 weeks)
• Per-step evaluation in pipekit-cycle (MetricObserver)
• Perceptual losses (deep-feature-based)
• Report diff and visualization helpers

Total: ~3-4 months of focused work to land the full framework. v0.1 is enough to be useful on its own (covers ~70% of practical use cases); the later versions close the matrix completely.
4.3 — Honest tradeoffs
What gets better:
Evaluation becomes a first-class concern, not a postscript. The framework forces honest assessment along multiple dimensions.
Cross-package coherence. Same metric in training (as loss), validation (as score), final eval (as report entry). No duplicated implementations.
Provenance closes. Model → training artifact → trained model → evaluation report. All content-addressed, all traceable.
Phenomena and Lagrangian skill become routine instead of bespoke research code per evaluation.
The matrix view exposes gaps. Hard to ignore “we never evaluate the spectral slope of our predictions” when the matrix has an empty cell labeled “Spectral × Field” staring at you.
What gets harder:
More moving pieces. Five new packages or submodules worth of effort in this report (the main one plus supporting infrastructure). Mitigation: phased build order; v0.1 ships only the framework + cheap metrics.
The matrix is daunting. A user faced with 30+ metrics across 6 lenses across 5 units may be paralysed by choice. Mitigation: ship sensible defaults (e.g., pipekit_evaluate.presets.weatherbench(), pipekit_evaluate.presets.oceanbench()) that pre-compose appropriate metrics for common evaluation cases.
Lagrangian evaluation is computationally expensive. Advecting 100K particles for 72 hours is real compute. Mitigation: clear documentation that this is final-eval-only; provide downsampled variants.
Event detection is domain-specific. Marine heatwaves, mesoscale eddies, atmospheric rivers each have their own detection algorithm. The framework provides the structure (Event, EventDetector, matching, attributes); concrete detectors live in domain modules. Mitigation: ship 2-3 reference detectors as exemplars; treat the rest as user-contributed.
Spectral metrics have edge cases. PSD comparison is sensitive to windowing, detrending, sampling. Get this wrong and the metric is misleading. Mitigation: thorough documentation; sensible defaults; cite the relevant references in operator docstrings.
4.4 — What this doesn’t try to be
Three things explicitly not in scope:
A general-purpose model evaluation framework (like evaluate or lm-eval). Pipekit-evaluate is geophysics-shaped: it understands gridded fields, particle trajectories, conservation budgets. NLP / image-classification evaluation has its own ecosystem.
An interpretability framework. “Why does the model fail?” is a different question from “how does the model fail?” Interpretability (SHAP, attention rollout, etc.) is downstream.
A benchmarking framework. WeatherBench, OceanBench, etc. define which metrics on which datasets constitute the canonical benchmark. Pipekit-evaluate provides the operators; the benchmark choice is per-project.
4.5 — Recommendation
Ship pipekit-evaluate as a separate sister package with two supporting infrastructure additions (xr_toolz.lagrangian, xr_toolz.events). Signals:
Distinct conceptual surface (the three-axis framework) deserves its own home
Cross-package integration concerns argue against putting it in any one domain library
The EvaluationReport artifact type is structurally analogous to TrainingArtifact — peer infrastructure
Lagrangian and event-detection prerequisites are substantial enough to be their own work
The framework framing (Part 1) should be adopted into v2 regardless of whether the package gets built. The package proposal (Part 2) is one structurally honest implementation; alternative implementations are possible. Don’t conflate adopting the framework with committing to the package.
The headline win for the GeoStack: evaluation goes from “RMSE on validation” to a structured multidimensional assessment with first-class artifacts, traceable provenance, and the same operator-graph machinery as training and inference. This is what the L4 / forecast / DA story needs. Without it, “ML at every level” remains a research claim; with it, it becomes operational discipline.
Summary
The source’s five-axis evaluation framework reorganises cleanly into three orthogonal axes — Unit (what’s scored: Field, Statistic, Trajectory, Event, Budget) × Lens (what kind of critique: point-wise, probabilistic, spectral, structural, detection, physical-constraint) × Stage (when in the lifecycle: training, validation, final, monitoring). The Lens × Unit matrix exposes both the metrics worth implementing and the gaps in the current stack.
pipekit-evaluate is a sibling package that owns this surface: Protocols (Metric, EvaluationLens, EvaluationUnit), concrete operators across the matrix cells, the EvaluationReport artifact, and integration with the rest of the GeoStack (pipekit-cycle, pipekit-train, pipekit-experiment, catalogs). Three supporting infrastructure additions — xr_toolz.lagrangian for particle tracking, xr_toolz.events for event detection, and pipekit-array.metrics.classification for detection metrics — are scoped as parallel work.
Realistic timeline: 3-4 weeks for v0.1 (framework + cheap metrics), 3-4 months for the full v0.4 surface across all matrix cells. The framework framing should be adopted into v2 regardless of package timing — multidimensional evaluation isn’t optional for honest geophysical ML, and the source’s argument for that is the strongest single point in the document.