Report 13 — statecatalog: catalog for model states¶
| Status | Scoping proposal — committed |
| Reading time | ~20 min |
| Decisions locked in | Sibling of geocatalog, not subclass of it. StateSlice (the wire format) lives in statecatalog. Ensemble + lineage support is in v0.1, not bolted on later. Monorepo development. |
| Audience | Anyone reviewing how model outputs (forecasts, analyses, reanalysis) get indexed |
| Companion reports | Report 6 (geocatalog), Report 10 (pipekit-cycle); the other Reports 1–9 for ecosystem context |
Part 1 — The problem statecatalog solves¶
geocatalog indexes observations: “this S3 URI is a scene covering this bbox at this time.” That’s correct for L0–L2 where data has one timestamp and one place.
Model outputs are different. A forecast has:
valid_time— the time the forecast predicts about (“what does the atmosphere look like at noon UTC?”)run_time(a.k.a. initialisation time) — when the forecast was launchedlead_time = valid_time - run_time— how far ahead the forecast looksensemble_member— which realization, orNonefor deterministic runsmodel_config_hash— which model + config produced itparent_state_uri— the analysis (or earlier forecast) it was launched from
The “what time is this data” question has two answers (valid_time, run_time), plus an ensemble dimension, plus lineage. None of this fits geocatalog.GeoSlice cleanly. Forcing it into geocatalog would corrupt the abstraction; a GeoSlice extended with model-state semantics stops being “a bounded request for observation data.”
The honest answer: a parallel catalog for model states, with its own wire format. statecatalog parallels geocatalog in shape, but with the model-state dimensions baked in from day one.
Part 2 — Where statecatalog sits in the stack¶
Domain libraries geotoolz │ xr_toolz
▲
│
Infrastructure ┌─ pipekit-cycle ─┐ Reads/writes
│ (Report 10) │ ──▶ statecatalog ◄── (this report)
└────────────────┘
▲ ▲
│ │
Framework pipekit ┌─ georeader ─┐
│ (reads │
│ state data) │
└──────────────┘Three observations:
Sibling of
geocatalog, not a subclass. They share the Protocol-shaped design pattern but their wire formats and query semantics differ enough that subclassing would be confusing.Primary writer is
pipekit-cycle. When aDACycleorCyclewrites its history of analysed/forecast states, it writes throughstatecatalog. Primary reader is alsopipekit-cycle(when loading initial conditions for cycles) and ML training (pipekit-train.SimulationDatasetmay consume state trajectories as training data).No pipekit dependency in core.
statecatalogis a queryable index. It does not import pipekit; pipekit-cycle imports statecatalog when needed.
Part 3 — What’s in statecatalog¶
Five conceptual pieces, ~700 LOC plus an extras-gated DuckDB backend at ~400 LOC.
3.1 Source layout¶
statecatalog/
__init__.py # public re-exports
_src/
slice.py # StateSlice — the wire format (the headline)
base.py # StateRow, StateCatalog Protocol
memory.py # InMemoryStateCatalog (in-memory pandas-backed)
duckdb_backend.py # DuckDBStateCatalog (extras-gated)
queries.py # Common query patterns: at_time, lead_window, lineage, ensemble
streaming.py # Streaming-write infrastructure (parallel to geocatalog's)
parquet.py # GeoParquet round-trip (with state-extension fields)
integrations/
pipekit.py # CatalogReader / CatalogWriter Operators (extras-gated)
cycle.py # Direct integration with pipekit-cycle (extras-gated)3.2 StateSlice — the wire format (slice.py)¶
The headline class. Frozen dataclass. Carries everything needed to identify and load a model state.
@dataclass(frozen=True)
class StateSlice:
"""A bounded model-state request.
Parallel to geocatalog.GeoSlice but for model outputs rather than
observations. Carries the additional dimensions that model state
requires.
"""
# Spatial
bbox: tuple[float, float, float, float]
crs: str
# Temporal — the key asymmetry vs GeoSlice
valid_time: datetime # what time the state is "about"
run_time: datetime # when the model run was launched
# Model identity
model_config_hash: str # identifies the producing model + config
model_name: str | None = None # human-readable convenience tag
# Ensemble
ensemble_member: int | None = None # None for deterministic runs
ensemble_total: int | None = None # total members in this ensemble
# Lineage — the second key asymmetry
parent_state_uri: str | None = None # the state this run was launched from
# Storage
state_uri: str # where the actual state lives
# Free-form metadata
metadata: dict = field(default_factory=dict)
@property
def lead_time(self) -> timedelta:
return self.valid_time - self.run_time
@property
def is_analysis(self) -> bool:
"""Analysis = valid_time == run_time (zero lead)."""
return self.lead_time == timedelta(0)
@property
def is_forecast(self) -> bool:
return self.lead_time > timedelta(0)
@property
def is_ensemble(self) -> bool:
return self.ensemble_member is not None
def to_dict(self) -> dict: ...
@classmethod
def from_dict(cls, d) -> "StateSlice": ...The properties (lead_time, is_analysis, is_forecast, is_ensemble) make common queries readable. The structure makes ensemble + lineage support first-class, not bolted on later (per the “scoping cautions” from earlier).
3.3 The catalog Protocol (base.py)¶
@dataclass(frozen=True)
class StateRow:
"""Backend-neutral row view. Carries enough to construct a StateSlice."""
bbox: tuple
crs: str
valid_time: datetime
run_time: datetime
model_config_hash: str
state_uri: str
ensemble_member: int | None
parent_state_uri: str | None
metadata: dict
@runtime_checkable
class StateCatalog(Protocol):
"""Queryable index over model states.
Implementations: InMemoryStateCatalog, DuckDBStateCatalog.
Parallel to geocatalog.GeoCatalog Protocol.
"""
@property
def crs(self) -> str: ...
@property
def n_states(self) -> int: ...
def iter_slices(self, **filters) -> Iterator[StateSlice]: ...
def query(self, **filters) -> "StateCatalog": ... # returns a subset catalog
def to_pandas(self) -> pd.DataFrame: ...3.4 The common query patterns (queries.py)¶
This is where state-catalog-specific querying lives. Each is a function that takes a StateCatalog and filter args, returning either a filtered StateCatalog or an iterator of StateSlice.
def at_time(cat: StateCatalog, valid_time: datetime, *,
model_config_hash: str | None = None,
ensemble_member: int | None = None) -> StateCatalog:
"""All states valid at this time. Picks across forecast lead times."""
def latest_analysis(cat: StateCatalog, *,
before: datetime | None = None) -> StateSlice | None:
"""The most recent analysis (lead_time=0) before a given time."""
def forecast_chain(cat: StateCatalog, run_time: datetime, *,
model_config_hash: str,
ensemble_member: int | None = None) -> list[StateSlice]:
"""All forecast states from a given run, ordered by lead time."""
def lead_window(cat: StateCatalog, run_time: datetime,
lead_range: tuple[timedelta, timedelta], **filters) -> StateCatalog:
"""Forecasts with lead in [lead_min, lead_max]."""
def ensemble_at(cat: StateCatalog, valid_time: datetime, run_time: datetime,
model_config_hash: str) -> list[StateSlice]:
"""All ensemble members for a given (valid_time, run_time) pair."""
def lineage(cat: StateCatalog, state_uri: str, *,
depth: int = -1) -> list[StateSlice]:
"""Trace the ancestral chain back through parent_state_uri.
depth=-1 means trace to the root.
Used to answer 'what observations / earlier states contributed to this?'
"""
def assimilation_window(cat: StateCatalog, run_time: datetime,
obs_catalog: "GeoCatalog",
window: timedelta) -> "GeoCatalog":
"""Given a model run, return the observation catalog filtered to the
obs that were assimilated. Bridges statecatalog ↔ geocatalog.
"""The last one (assimilation_window) is the catalog-to-catalog bridge that lets DA cycles record their provenance: “this analysis at this time pulled from this obs catalog over this window.”
3.5 Backends¶
Two backends, parallel to geocatalog:
class InMemoryStateCatalog:
"""pandas-backed. Suitable for small-to-medium catalogs (< 1M rows).
Used for testing, dev, and small research workflows.
"""
df: pd.DataFrame
def __init__(self, df: pd.DataFrame): ...
@classmethod
def empty(cls) -> "InMemoryStateCatalog": ...
def append(self, slice: StateSlice): ...
def append_many(self, slices: list[StateSlice]): ...
class DuckDBStateCatalog:
"""SQL-backed via DuckDB + GeoParquet. Suitable for very large catalogs
(10M+ rows). Extras-gated: pip install statecatalog[duckdb].
The schema includes a state_extension namespace in the GeoParquet
metadata, parallel to geocatalog's catalog-extension.
"""
parquet_uri: str
def __init__(self, parquet_uri: str): ...
@classmethod
def open(cls, parquet_uri: str) -> "DuckDBStateCatalog": ...3.6 Integration with pipekit-cycle (integrations/cycle.py)¶
The bridge module. Extras-gated behind statecatalog[pipekit] since it imports pipekit-cycle.
class CycleStateWriter(StatefulOperator):
"""Write each step of a Cycle's history to a StateCatalog.
Used as a callback inside DACycle / Cycle to persist state outputs.
"""
state_catalog: StateCatalog
model_config_hash: str
class CycleStateReader(Operator):
"""Read initial state for a cycle from a StateCatalog.
Common pattern: 'start my cycle from the latest analysis before 06Z.'
"""
state_catalog: StateCatalog
query: Callable[[StateCatalog], StateSlice]Part 4 — Worked examples¶
4.1 Writing forecast outputs from a cycle¶
import pipekit as pk
import statecatalog as sc
import pipekit_cycle as pc
cat = sc.DuckDBStateCatalog.open("s3://reanalysis/methane_v3/states.parquet")
writer = sc.integrations.cycle.CycleStateWriter(
state_catalog=cat,
model_config_hash=pk.config_hash(forecast_op),
)
# A 72-hour ensemble forecast — each step writes to the catalog
forecast = pc.EnsembleCycle(
step_op=pc.Sequential([
chemistry_forward,
writer, # ← writes each step's state
]),
n_steps=72,
n_members=40,
)
trajectory = forecast(initial_state, initial_carry_state)
# state catalog now has 72 × 40 = 2880 new StateSlice rows4.2 Querying for the latest analysis¶
import statecatalog.queries as q
from datetime import datetime
# What's the most recent analysis state available before 06Z today?
slice = q.latest_analysis(
cat,
before=datetime(2026, 5, 17, 6),
)
# Load the actual state through whatever reader is appropriate
state = georeader.read_from_url(slice.state_uri)4.3 Spinning up a new cycle from the latest analysis¶
reader = sc.integrations.cycle.CycleStateReader(
state_catalog=cat,
query=lambda c: q.latest_analysis(c, before=datetime.now()),
)
forecast = pc.Sequential([
reader, # load initial state from catalog
chemistry_forward,
pc.Cycle(step_op=chemistry_forward, n_steps=72),
writer, # write forecast states back
])
forecast(carrier=None)4.4 Computing forecast verification metrics¶
# For each forecast step in a run, find the verifying analysis and compute RMSE
forecasts = q.forecast_chain(cat, run_time=datetime(2026, 5, 17, 0),
model_config_hash=H)
for fcst in forecasts:
# Find the analysis valid at the forecast's valid_time
truth = q.latest_analysis(cat, before=fcst.valid_time)
if truth is None: continue
fcst_state = georeader.read_from_url(fcst.state_uri)
truth_state = georeader.read_from_url(truth.state_uri)
rmse = compute_rmse(fcst_state, truth_state)
print(f"Lead {fcst.lead_time}: RMSE = {rmse}")4.5 Tracing the lineage of a state¶
# What contributed to this analysis?
ancestors = q.lineage(cat, state_uri="s3://.../analysis_2026-05-17_06Z.nc",
depth=-1)
# 'ancestors' is a chain back through previous forecasts and analyses
for s in ancestors:
print(f"{s.run_time}+{s.lead_time}: {s.state_uri}")
# Also: what obs went into the analysis step?
obs_used = q.assimilation_window(
cat=cat,
run_time=datetime(2026, 5, 17, 6),
obs_catalog=obs_cat,
window=timedelta(hours=3),
)
# 'obs_used' is a filtered geocatalog.GeoCatalog4.6 Training an emulator on simulated trajectories¶
import pipekit_train as pt
# Pull historical forecasts (a year's worth of methane forecasts) for emulator training
training_slices = list(q.lead_window(
cat,
run_time=datetime(2025, 1, 1),
lead_range=(timedelta(0), timedelta(hours=72)),
model_config_hash=H,
).iter_slices())
# Build a TrainingDataset over them
dataset = pt.SimulationDataset(
forward_model=NullForward(), # we're reading pre-computed trajectories
prior=None,
n_samples=len(training_slices),
cycle=None,
# custom mode: iterate state catalog rather than re-simulate
)
# (Or: add a `pt.StateCatalogDataset(state_catalog=cat, ...)` to pipekit-train)Part 5 — Cross-catalog bridges¶
The two-catalog model (geocatalog + statecatalog) needs explicit bridges. Three matter:
5.1 state ← geocatalog¶
DA cycles need observations. The assimilation_window query (in queries.py) lets a state catalog refer back to the observation catalog that produced its analyses.
obs_assimilated = q.assimilation_window(state_cat, run_time, obs_cat, window)This is implemented by joining state_cat.metadata["obs_catalog_uri"] with obs_cat’s slice domain. Each state slice records which obs catalog and which window of observations were assimilated.
5.2 geocatalog → state¶
Going the other way: “find me the analysis that assimilated this specific observation.” Less common, but useful for forensics (“did this obs cause an analysis to drift?”). Implemented via metadata indexing in the state catalog:
def states_that_used(state_cat: StateCatalog, obs_slice: "GeoSlice") -> list[StateSlice]:
"""Find analyses whose assimilation window contained this obs."""5.3 state → state (lineage)¶
Already covered: lineage(cat, state_uri) traces the ancestral chain.
Part 6 — Dependencies and optional extras¶
[project]
name = "statecatalog"
version = "0.1.0"
dependencies = [
"pandas>=2.0",
"pyarrow>=17.0",
"pyproj>=3.6",
]
[project.optional-dependencies]
duckdb = ["duckdb>=1.0"] # DuckDBStateCatalog
pipekit = ["pipekit>=0.1", "pipekit-cycle>=0.1"] # cycle integration
geocatalog = ["geocatalog>=0.1"] # cross-catalog bridges
streaming = ["pyarrow>=17.0"]
all = ["statecatalog[duckdb,pipekit,geocatalog]"]Minimum install: in-memory catalog, parquet round-trip, common queries. No pipekit dependency. No DuckDB dependency.
Part 7 — Honest tradeoffs¶
7.1 What gets better¶
L4 outputs (forecasts, analyses, reanalysis) are catalogable. Until now your only catalog was for observations. Adding a state catalog makes model outputs first-class indexable artifacts.
Ensemble support is structural.
ensemble_memberis a top-level dimension; queries likeensemble_atare first-class.Lineage is first-class.
parent_state_urichains analyses → forecasts → reanalyses, traceable.DA cycles record their own provenance. Each state slice knows what obs catalog and what window it used.
Forecast verification becomes a query, not a hand-rolled join.
forecast_chain+latest_analysis+ a metric op is the full verification pipeline.Emulator training data is a query. Pull a year’s worth of states from the catalog; train.
7.2 What gets harder¶
Two catalogs instead of one is more moving parts. Mitigation: the two are parallel in design; learning one prepares you for the other.
State storage is more diverse than scene storage. A forecast state might be a Zarr cube, a NetCDF file, a Parquet of point samples, or a raw float32 binary dump. Mitigation:
state_uriis opaque to the catalog; readers are pluggable.Ensemble storage is expensive. A 40-member ensemble at 0.25° hourly is hundreds of GB per run. Mitigation: ensembles get aggressive storage policies (zstd compression, chunk-aligned writes); not a catalog concern but worth knowing.
The lineage chain can be deep. Tracing 6-hourly analyses over a year is ~1500 parent links. Mitigation: depth-limited traversal; cache derived chains.
7.3 What doesn’t fit and isn’t tried¶
Continuous (streaming) DA outputs. statecatalog assumes discrete writes; continuous-DA needs a different model.
Probabilistic state representations beyond ensembles. GP-shaped state with full posterior covariance doesn’t fit cleanly. Mitigation: stored as opaque blobs with rich metadata.
State catalog as a queryable backend for visualisation UIs. statecatalog is for catalog queries, not for serving rendered state to a UI. The state-rendering use case is downstream.
Cross-model alignment. When two different models (
config_hash_A,config_hash_B) produce states for the same time, the catalog stores both; aligning their state spaces is a separate concern (and almost always domain-specific).
Part 8 — Effort and timing¶
8.1 Effort¶
Day 1-2:
slice.py,base.py—StateSlice,StateRow,StateCatalogProtocol. The data class needs careful design since changing it later is painful.Day 3-4:
memory.py—InMemoryStateCatalog. Includes pandas-shaped append + query operations.Day 5-6:
queries.py— the common query patterns. Test each on a synthetic catalog.Day 7-8:
parquet.py— GeoParquet round-trip with the state-extension namespace.Day 9-10:
duckdb_backend.py—DuckDBStateCatalog. Extras-gated. Hardest piece; modeled on geocatalog’s DuckDB backend.Day 11:
integrations/pipekit.py,integrations/cycle.py— Operator wrappers.Day 12-13: Documentation, smoke tests, integration with pipekit-cycle.
Total: ~2.5-3 weeks of focused work. The DuckDB backend is the largest single piece.
8.2 Timing¶
Ship alongside or shortly after pipekit-cycle (Report 10). The two are mutually useful but neither blocks the other: pipekit-cycle can write state to filesystem without statecatalog; statecatalog can index pre-existing state outputs without pipekit-cycle.
Realistic timeline: v0.3-v0.4 of the ecosystem.
8.3 What this unblocks¶
L4 becomes a queryable, first-class data tier. Without statecatalog, model outputs are scattered files. With it, they’re indexed artifacts.
DA cycles get a proper output destination.
CycleStateWritermakes the “DA cycle that records its history” pattern trivial.Forecast verification becomes routine. Query forecasts, query analyses, compute metrics; no bespoke pipeline per evaluation.
Emulator training data is queryable. SimulationDataset for the (params, simulated_state) case is one query.
Reanalysis becomes a buildable artifact. A reanalysis is just a state catalog with a coherent set of analyses; statecatalog gives you the substrate.
Part 9 — Recommendation¶
Ship statecatalog as a separate sister package. Signals:
Distinct data model from
geocatalog(model outputs vs observations)Sufficient size (~700-1100 LOC depending on extras) to warrant its own package
Different optional extras (DuckDB-state, ensemble streaming, pipekit-cycle integration) than geocatalog has
Cross-domain reusability: oceanography reanalysis, atmospheric chemistry forecasts, ML emulator outputs all benefit
The package lives in the same monorepo: packages/statecatalog/. Sibling of geocatalog, geopatcher, pipekit-cycle, pipekit-train, pipekit-experiment.
This is the L4 data tier that observations alone can’t reach. Without it, model outputs are second-class citizens in your stack. With it, the full L0-L4 chain has consistent indexing semantics: geocatalog for L0-L2 observations, statecatalog for L3-L4 model states, both queryable, both catalogable, both composable into the same operator-graph pipelines.
The headline win: forecasts, analyses, and reanalyses become routinely queryable artifacts, not bespoke per-project file conventions. That’s the structural shift the L4 story needs.