
Operational attribution

A forward plan for MARS-style methane attribution rebuilt on `georeader` + `GeoCatalog` + `geotoolz` + `plumax`


Status. Forward-looking plan: a projected pipeline assuming the unified stack lands. Everything except the current georeader package is treated as future work.

Scope. MARS-style methane source attribution from satellite observations across three instruments (TROPOMI + GHGSat + EMIT), covering exploration, preprocessing, inference, and serving — explicitly not model training.

Audience. Three layers in one doc: an executive summary for funders and talks, an architectural plan for engineers, and a validation + migration story for paper/blog framing.


TL;DR

Methane attribution from satellite imagery — “this scene shows a plume; what is the source rate, where is the source, with what uncertainty?” — is what UNEP-IMEO’s Methane Alert and Response System (MARS) does operationally today. It works. It’s also rebuilt as bespoke glue code in nearly every research project that touches the same problem. Across the institutions I’ve worked at — FIT, RIT, Universidad de Valencia, CNRS, CSIC, UNEP-IMEO, MARS — the same research-to-production gap appears every time, with the storage and transport layers getting steadily better while the factory layer (the science composition) stays hand-rolled.

This doc projects what an operational attribution pipeline looks like once the unified stack lands: georeader 2.0 (modernised reader Protocol), GeoCatalog (multi-satellite metadata index), geotoolz/xrtoolz (operator algebra), plumax (forward models). The same operator graph runs in:

  1. A research notebook — exploration, single-event analysis, MAP via NumPyro.

  2. A batch pipeline — last-week’s-overpasses fanned out across detections, posteriors written to GeoParquet.

  3. An alert service — a FastAPI handler that takes a detection event and returns a posterior in seconds.

One graph, three orchestration modes, zero rewrites between research and production. That’s the substrate gap this stack is meant to close.


1. Why this doc exists

MARS-style attribution is a real, deployed operational system at UNEP-IMEO. This is not a proposal to rebuild MARS. The question this doc answers is different: what would the right substrate look like for this entire class of operational attribution work — not just MARS, but every research-to-production translation of a satellite-driven Bayesian inversion?

Two motivations sit underneath:

  1. The same pipeline shape recurs everywhere. Methane plume attribution, NO₂ inversion, CO₂ flux estimation, oil-spill backtracking, wildfire-emission attribution — all are: (satellite L1/L2 data) → (forward model with met forcing) → (Bayesian inversion) → (posterior on source parameters). The science differs; the plumbing is identical. Today every project rebuilds the plumbing.

  2. MARS is a proof point that the operational form exists. The constraint isn’t “can this be done”; it’s “can this be done with a unified, composable, reproducible substrate that other groups can pick up and extend without re-implementing the IO layer”. Plumax + geotoolz is that substrate.

This doc projects how the v1 plumax target — Tier I + AK + multi-satellite L2 fusion across {TROPOMI, GHGSat, EMIT} — would be implemented end-to-end on the unified stack, with three orchestration phases demonstrating the research-to-prod arc.


2. What MARS-style attribution does today (and where the seams are)

Operational attribution today is the composition of half a dozen well-understood pieces, each typically implemented as project-specific glue:

                ┌──────────────────────────────────────────────────────┐
                │  Detection trigger                                   │
                │  (operator alert · matched-filter scan · TROPOMI Q4) │
                └──────────────────────┬───────────────────────────────┘
                                       │ event (lat, lon, time, instrument)
                                       ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  Multi-instrument data assembly                             │
        │  • Pull EMIT scene for the event time + AOI                 │
        │  • Pull TROPOMI overpass(es) within the event window        │
        │  • Pull GHGSat target acquisition if available              │
        │  • Pull met forcing (WRF / ERA5) for the AOI + window       │
        └─────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  Per-instrument preprocessing                               │
        │  • EMIT: matched-filter detection on radiance cube          │
        │  • TROPOMI: L2 XCH₄ + averaging-kernel + retrieval-prior    │
        │  • GHGSat: L2 XCH₄ + footprint                              │
        │  • Cloud + QA masks, native-resolution preserved            │
        └─────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  Forward model + likelihood + inference                     │
        │  • Tier I Gaussian plume forward (Q, x₀, ū, θ_wind, …)      │
        │  • Per-instrument AK operator + bias correction             │
        │  • Joint multi-instrument likelihood at native resolution   │
        │  • MAP / MCMC / variational posterior                       │
        └─────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  Reporting                                                  │
        │  • Posterior mean + uncertainty                             │
        │  • Provenance (which instruments contributed, met source)   │
        │  • Alert payload to operational responders                  │
        └─────────────────────────────────────────────────────────────┘

Where the seams show up today:

| Seam | Today’s reality | Cost |
| --- | --- | --- |
| Satellite ingest | Per-instrument scripts, often re-implemented per project; auth via env-var soup; no shared cache / metadata index | Multi-week ramp-up per new project + non-reproducible IO |
| Multi-instrument catalog | Per-project CSV / pandas notebooks; bespoke schema; no spatiotemporal index | Cross-project discovery is impossible; same data is re-curated 10× |
| Forward model | Per-paper code; matched filter often a research script; no operator surface | No reuse across projects; small bug fixes don’t propagate |
| Inference loop | NumPyro / TensorFlow Probability / Stan, hand-wired to the forward; switching requires rewriting | Slow methods iteration |
| Operational delivery | Inference moved into a FastAPI / Cloud Function via cut-and-paste from the research notebook; metadata round-trips lost | Production code drifts from research code; can’t validate prod against research |

The unified stack closes each of these seams with a typed, composable surface. The next sections show how.


3. The three-instrument scope

The v1 target is TROPOMI + GHGSat + EMIT — the same set the Tier IV plumax design targets for v1 multi-instrument fusion. Each instrument contributes a different role; the demo’s value is precisely in the fusion, not in any single instrument.

| Instrument | Role | Resolution | Cadence | Native product | What it constrains |
| --- | --- | --- | --- | --- | --- |
| TROPOMI (S5P) | Wide-swath screening | ~5.5 × 7 km nadir | Daily global | L2 XCH₄ NetCDF | Regional context; rough source localisation; persistent emitters |
| GHGSat (commercial constellation) | Targeted high-resolution | ~25 m | Tasked | L2 XCH₄ HDF5 | Single-source rate; tight localisation; intermittent leaks |
| EMIT (NASA / ISS) | Hyperspectral imaging spectrometer | ~60 m | ISS overpass cadence | L1B radiance NetCDF | Scene-context plume detection via matched filter; fingerprint validation |

Why these three, why this fusion: each constrains what the others cannot. TROPOMI screens wide and flags persistent emitters; GHGSat pins single-source rate and location; EMIT adds scene-context detection and fingerprint validation of the methane signature.

What’s explicitly out of scope for v1: Sentinel-2 / Landsat (cloud-shadow vegetation-index methane proxies — useful but different physics); PRISMA / Tanager (similar to EMIT, additive once the EMIT pipeline works); CarbonMapper (commercial, schema TBD).


4. Mapping the unified stack to the attribution pipeline

Each layer of the geostack maps onto a concrete piece of the attribution pipeline. The point of the table below is that none of this is novel science — the science modules already have well-understood mathematical structure; the substrate work is making them compose cleanly.

        ┌─────────────────────────────────────────────────────────────┐
   USER │  notebook · alert API · operational dashboard               │
        └─────────────────────────▲───────────────────────────────────┘
                                  │
        ┌─────────────────────────┴───────────────────────────────────┐
  LOGIC │  ★ geotoolz operators (preprocessing, AK, masking, fusion)  │
        │  ★ plumax operators (transport, RTM, likelihood, inversion) │
        │  ★ xrtoolz operators (met-field reductions, climatology)    │
        └─────────────────────────▲───────────────────────────────────┘
                                  │
        ┌─────────────────────────┴───────────────────────────────────┐
   DISC │  ★ GeoCatalog: multi-satellite overpass + footprint index   │
        │  Phase 1: GeoPandas + IntervalIndex (≤10⁵ overpasses/basin) │
        │  Phase 2: DuckDB + GeoParquet (global, multi-year)          │
        └─────────────────────────▲───────────────────────────────────┘
                                  │
        ┌─────────────────────────┴───────────────────────────────────┐
   SUBS │  ★ georeader 2.0: GeoTensor carrier + per-sensor readers    │
        │  EMIT_L1B · TROPOMI_L2 · GHGSat_L2 · WRF · ERA5             │
        └─────────────────────────▲───────────────────────────────────┘
                                  │
        ┌─────────────────────────┴───────────────────────────────────┐
  TRANS │  obstore · GDAL/VSI                                         │
        └─────────────────────────▲───────────────────────────────────┘
                                  │
        ┌─────────────────────────┴───────────────────────────────────┐
   STOR │  cloud buckets · COG · NetCDF · GeoParquet · Zarr (met)     │
        └─────────────────────────────────────────────────────────────┘

Layer by layer, what each contributes to the attribution pipeline:

4.1 Storage

Already exists. EMIT lives on AWS / DAAC; TROPOMI on Copernicus / GES DISC; GHGSat on the vendor’s portal; ERA5 on Copernicus CDS; WRF outputs on local compute. No new storage — the unified stack is about access patterns, not new repositories.

4.2 Transport — obstore

Single byte mover for all cloud buckets, replacing the current per-project mix of s3fs / boto3 / gcsfs with one credential surface and one async path (both per the cross-cutting notes in §4.3).
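A sketch of what the single byte mover buys in practice, using obstore’s one-store-per-bucket model (bucket name and key here are illustrative, not real product paths):

import obstore
from obstore.store import S3Store

# One store object per bucket; the identical calls work against GCS and Azure
# stores, so ingest code stops caring which cloud an instrument lives on.
store = S3Store("sentinel-5p-l2", region="eu-central-1", skip_signature=True)

# Sync fetch for notebook work...
payload = obstore.get(store, "2026/05/S5P_OFFL_L2__CH4_example.nc").bytes()

# ...and the same byte mover on the async path for the alert service:
#   result = await obstore.get_async(store, key)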

4.3 Substrate — modernised georeader + per-sensor readers

The most concrete user-visible payoff. Today’s per-project ingest is replaced with five typed readers conforming to one Protocol:

| Reader | Status today | What lands in the unified stack |
| --- | --- | --- |
| EMIT_L1B (NetCDF radiance) | Per-project hand-rolled; matched filter often a fork of the JPL reference code | Sensor-specific reader emitting GeoTensor[*batch, bands, H, W] with a wavelength attribute; the matched-filter operator consumes it |
| TROPOMI_L2 (NetCDF L2 XCH₄) | Per-project, often via harp or hand-coded xarray | Sensor reader emitting GeoTensor[*time, H, W] with averaging-kernel sidecar |
| GHGSat_L2 (HDF5 L2 XCH₄) | Per-project, schema differs by product version | Sensor reader emitting GeoTensor with footprint + retrieval-prior metadata |
| WRF / ERA5 (NetCDF / GRIB met) | xarray.open_dataset(...) per project | xrtoolz reader emitting MetField as a coordax.Dataset (per plumax/00_prerequisites.md) |
| inventory (EDGAR / GFEI / EPA) | Per-project, often a single CSV import | Vector reader emitting a GeoDataFrame-typed prior on q_a |
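For shape, the Protocol those five readers conform to might look like the following. This is a minimal sketch of the projected georeader 2.0 surface, not a shipped API; GeoSlice and GeoTensor are the carrier types referenced throughout this doc:

from typing import Protocol, runtime_checkable

@runtime_checkable
class GeoData(Protocol):
    """What every per-sensor reader yields (sketch)."""

    @property
    def footprint(self) -> "BaseGeometry":
        """Native scene footprint; this is what GeoCatalog indexes on."""
        ...

    def read(self, window: "GeoSlice | None" = None) -> "GeoTensor":
        """Materialise (a window of) the scene as a CRS-aware GeoTensor."""
        ...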

Cross-cutting: Credential (typed Protocol per plans/types/credentials.md) handles auth; cloud byte transport for the async path is delegated to upstream obspec — not a Protocol of our own — see plans/types/bytestore.md.

4.4 Discovery — GeoCatalog

The single biggest leap from “hand-rolled per project” to “shared substrate”. The catalog is a multi-satellite spatiotemporal index keyed on (instrument, time, footprint, quality, native_url).

Phase 1 (in-memory, per-basin / per-event): a geopandas.GeoDataFrame plus pd.IntervalIndex, built once per session by crawling instrument-specific URL conventions. Sufficient for ≤10⁵ overpasses (a basin × multi-year window). Persisted as GeoParquet via to_geoparquet() so the same catalog feeds research and batch.

Phase 2 (DuckDB on GeoParquet): when the catalog goes operational and global, the same API but DuckDB-backed with predicate pushdown via the GeoParquet 1.1 bbox column. Most of MARS’s catalog needs probably stay in Phase 1 — operational attribution is event-driven, with each event filtering to a small AOI × time window where Phase 1 is more than fast enough.

The query that anchors everything:

# "All overpasses within 2 hours of this detection, 50 km bbox around the source,
#  any of the three instruments, quality flag >= 'usable'."
catalog = GeoCatalog.from_geoparquet("s3://mars/catalog/overpasses_2026.parquet")
candidates = catalog.query(
    geometry=event.bbox(50_000),
    interval=event.window(hours=2),
    instruments=["EMIT", "TROPOMI", "GHGSat"],
    quality_min="usable",
)

That single query replaces the per-project glob+filter+join logic that today every attribution paper re-implements.
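Under the hood, Phase 1 needs nothing more exotic than a GeoPandas spatial predicate plus an interval overlap. A sketch, assuming the catalog frame carries instrument, t_start / t_end, quality, native_url, and footprint-geometry columns (column names illustrative):

import geopandas as gpd

class GeoCatalog:
    def __init__(self, frame: gpd.GeoDataFrame):
        self.frame = frame  # one row per overpass

    def query(self, geometry, interval, instruments, quality_min=None):
        f = self.frame
        f = f[f.geometry.intersects(geometry)]                             # footprint hit
        f = f[(f.t_start <= interval.right) & (f.t_end >= interval.left)]  # time overlap
        f = f[f.instrument.isin(instruments)]
        if quality_min is not None:
            f = f[f.quality >= quality_min]  # assumes an ordered quality dtype
        return f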

4.5 Compute — geotoolz × plumax × xrtoolz

The factory layer is where the science composition lives. Three libraries with a clean separation of concerns:

| Library | What it owns in the attribution pipeline |
| --- | --- |
| geotoolz | Generic raster operators: MatchedFilter (EMIT methane fingerprint), ApplyAK (averaging-kernel application), Mask (cloud + QA), Crop (AOI extraction), Reproject, Stitch, Sequential/Graph composition. |
| plumax | Domain-specific science operators: GaussianPlume (Tier I forward), MetField loader, BiasCorrect per instrument, JointLikelihood over instruments, MAPInverter / NUTSInverter wrappers. Operators in the geotoolz sense — pickleable, Hydra-friendly, carrier-aware. |
| xrtoolz | Met-field operators: TimeInterpolate (snapshot / piecewise / advected), PBLDiagnostic, StabilityClassifier, RegridToAnalysisGrid. Operates on the coordax.Dataset carrier (per plumax adoption). |

The pipeline as one operator graph:

# Build once, call from notebook / batch / API
attribution = Graph(
    inputs={
        "emit": Input(GeoTensor),       # EMIT L1B radiance cube
        "tropomi": Input(GeoTensor),    # TROPOMI L2 XCH₄
        "ghgsat": Input(GeoTensor),     # GHGSat L2 XCH₄
        "met": Input(MetField),         # WRF or ERA5 met
        "prior": Input(GeoDataFrame),   # EDGAR / GFEI prior
    },

    # Per-instrument preprocessing (geotoolz operators)
    detect_emit       = MatchedFilter(target=ch4_signature) >> "emit",
    masked_tropomi    = Mask(qa="qa_value", min=0.5) >> "tropomi",
    masked_ghgsat     = Mask(qa="quality_flag", min=2) >> "ghgsat",

    # Met-field preparation (xrtoolz operators)
    pbl               = PBLDiagnostic() >> "met",
    stability         = StabilityClassifier() >> "met",
    met_interp        = TimeInterpolate(policy="piecewise") >> "met",

    # Forward model (plumax operators)
    forward_emit      = GaussianPlume(instrument="EMIT") >> ("met_interp", "stability"),
    forward_tropomi   = GaussianPlume(instrument="TROPOMI") >> ("met_interp", "stability"),
    forward_ghgsat    = GaussianPlume(instrument="GHGSat") >> ("met_interp", "stability"),

    # Per-instrument observation operator (geotoolz × plumax)
    sim_emit_radiance = ApplyRTM() >> ("forward_emit", "stability"),
    sim_tropomi       = ApplyAK(instrument="TROPOMI") >> "forward_tropomi",
    sim_ghgsat        = ApplyAK(instrument="GHGSat") >> "forward_ghgsat",

    # Bias correction (plumax)
    sim_emit_bc       = BiasCorrect(instrument="EMIT") >> "sim_emit_radiance",
    sim_tropomi_bc    = BiasCorrect(instrument="TROPOMI") >> "sim_tropomi",
    sim_ghgsat_bc     = BiasCorrect(instrument="GHGSat") >> "sim_ghgsat",

    # Joint multi-instrument likelihood (plumax)
    posterior         = JointLikelihood(instruments=["EMIT","TROPOMI","GHGSat"]) >> (
        ("detect_emit",     "sim_emit_bc"),
        ("masked_tropomi",  "sim_tropomi_bc"),
        ("masked_ghgsat",   "sim_ghgsat_bc"),
        "prior",
    ),
)

That’s the graph. Notebook, batch worker, and FastAPI handler all instantiate this same Graph and call it. The orchestration changes around it; the graph itself is identical.

4.6 Serving

Three orchestration modes share the graph above. See §5 for each.


5. The three-phase demo plan

Each phase uses the identical operator graph from §4.5. What changes is the orchestration around it: how inputs are resolved, where outputs go, what the latency budget is.

5.1 Phase 1 — Research notebook

Goal. Validate that the unified-stack operator graph runs and produces a posterior consistent with hand-rolled MARS-style outputs on a known event.

Anchor event. A single Permian basin methane release with all three instruments observing within a 24-hour window. (Pick a published event from the literature so we have a published posterior to bench against.)

Notebook flow:

import geotoolz as gz
import plumax as px
import xrtoolz as xr_t
from georeader.catalog import GeoCatalog  # per `plans/geodatabase/` — module lives under georeader
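# NB: the per-sensor readers below (read_emit_l1b, read_tropomi_l2, read_ghgsat_l2,
# read_wrf_metfield, read_edgar) stand in for the projected georeader 2.0 / xrtoolz
# reader surface; the names are illustrative, not a shipped import path.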

# 1. Build / load a small catalog around the event
catalog = GeoCatalog.from_basin_crawl(basin="permian", year=2024)
overpasses = catalog.query(
    geometry=event.bbox(50_000),
    interval=event.window(hours=12),
    instruments=["EMIT", "TROPOMI", "GHGSat"],
)

# 2. Resolve inputs via georeader 2.0
emit_gt    = read_emit_l1b(overpasses["EMIT"][0].url, slice=event.geoslice(50_000))
tropomi_gt = read_tropomi_l2(overpasses["TROPOMI"][0].url, slice=event.geoslice(50_000))
ghgsat_gt  = read_ghgsat_l2(overpasses["GHGSat"][0].url, slice=event.geoslice(50_000))
met        = read_wrf_metfield(event.window(hours=12), event.bbox(100_000))
prior      = read_edgar(year=2024, basin="permian")

# 3. Run the same graph as §4.5
attribution = build_attribution_graph()  # the §4.5 Graph
posterior_callable = attribution.compile()  # MAP via NumPyro under the hood

# 4. Inference
posterior = posterior_callable(
    emit=emit_gt, tropomi=tropomi_gt, ghgsat=ghgsat_gt,
    met=met, prior=prior,
)

# 5. Visualisation (matplotlib + lonboard for the geo bits)
plot_posterior(posterior, event)
plot_observations_vs_simulated(posterior, emit_gt, tropomi_gt, ghgsat_gt)

What this validates: that the operator graph composes end-to-end on real three-instrument data, and that its posterior is consistent with the published result for the anchor event.

Estimated effort. ~2–3 weeks of focused work after georeader 2.0 + the v0.1 geotoolz core land. Most of the work is in plumax operator wrappers (GaussianPlume, JointLikelihood, BiasCorrect) — the plumax tier-I math already exists in the roadmap; the work is wrapping it as Operators with get_config for Hydra serialisation.
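What “wrapping as Operators” amounts to in practice, sketched under the geotoolz convention of a pickleable callable that round-trips through get_config (the exact base-class surface lands in plans/geotoolz/; tier1_forward stands in for the existing plumax math):

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class GaussianPlume:
    """Tier I forward wrapped as an operator (sketch)."""
    instrument: str
    stability_scheme: str = "pasquill"

    def __call__(self, met, stability):
        # delegate to the tier-I math that already exists in the plumax roadmap
        return tier1_forward(met, stability, scheme=self.stability_scheme)

    def get_config(self) -> dict:
        # Hydra-instantiable config: _target_ plus constructor kwargs
        return {"_target_": f"{type(self).__module__}.{type(self).__name__}",
                **asdict(self)}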

5.2 Phase 2 — Batch pipeline

Goal. Demonstrate that the same operator graph runs at orchestrated scale: last week’s detections fanned out across compute, posteriors written to a queryable artifact.

Pipeline shape:

GeoCatalog (phase 1) ── query(week_window) ──▶ list of detection events
        │
        ▼
  for each event:                          (parallelism via Dask / Ray / Modal)
    ┌──────────────────────────┐
    │  resolve inputs (§5.1)   │
    │  compile attribution     │
    │  run posterior           │
    │  write to GeoParquet     │
    └──────────────────────────┘
        │
        ▼
posteriors.parquet  (per-event posterior + provenance + alert score)

Concretely:

import dask.distributed as dd
from plumax.batch import attribute_event

client = dd.Client()  # or Ray / Modal

events = catalog.query(
    interval=last_week,
    quality_min="usable",
    detection_channel=["matched_filter", "tropomi_q4"],
).to_events()

# attribute_event is a thin wrapper around the §5.1 logic — same operator graph
futures = [client.submit(attribute_event, event) for event in events]
posteriors = client.gather(futures)

# Persist
write_posteriors_geoparquet(posteriors, "s3://mars/posteriors/2026-W19.parquet")

What this validates: that the identical graph fans out across events with no per-event code changes, and that posteriors plus provenance land in a single queryable artifact.

Estimated effort. ~1 week after Phase 1. Most of the work is the attribute_event adapter and the GeoParquet posterior schema; the operator graph is unchanged.
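A strawman for that posterior schema; the column names are a first pass, not a settled contract:

import geopandas as gpd

def write_posteriors_geoparquet(posteriors, url):
    rows = [{
        "event_id":    p.event_id,
        "geometry":    p.source_location,  # posterior-mean source point
        "q_mean_kg_h": p.q_mean,           # posterior mean source rate
        "q_std_kg_h":  p.q_std,            # posterior spread
        "instruments": p.instruments,      # provenance: which instruments contributed
        "met_source":  p.met_source,       # provenance: WRF vs ERA5
        "alert_score": p.alert_score,
        "graph_hash":  p.graph_hash,       # operator-graph hash, per §9
    } for p in posteriors]
    gpd.GeoDataFrame(rows, crs="EPSG:4326").to_parquet(url)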

5.3 Phase 3 — Alert service (FastAPI)

Goal. Demonstrate research-to-prod with zero rewrite: the same graph, served as a request handler.

Service shape:

from fastapi import FastAPI
from plumax.attribution import build_attribution_graph, AttributionRequest, AttributionResponse

app = FastAPI()
attribution_graph = build_attribution_graph().compile()  # same Graph as Phase 1 + 2

@app.post("/attribute", response_model=AttributionResponse)
async def attribute(req: AttributionRequest) -> AttributionResponse:
    # 1. Catalog query for overpasses near the event
    overpasses = catalog.query(
        geometry=req.bbox(50_000),
        interval=req.window(hours=12),
        instruments=req.instruments,
    )

    # 2. Resolve inputs (async via the AsyncGeoData Protocol)
    inputs = await resolve_inputs_async(overpasses, req)

    # 3. Same graph, called as the request handler
    posterior = attribution_graph(**inputs)

    # 4. Build alert payload
    return AttributionResponse(
        event_id=req.event_id,
        posterior_summary=summarize_posterior(posterior),
        provenance=overpasses.provenance(),
        alert_score=compute_alert_score(posterior),
        latency_ms=...,
    )
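resolve_inputs_async is the only genuinely new plumbing in this phase. A sketch, assuming per-sensor async readers behind the AsyncGeoData Protocol (the *_async names are hypothetical):

import asyncio

async def resolve_inputs_async(overpasses, req):
    # fan the four heavy reads out concurrently; attribution waits on the slowest
    emit, tropomi, ghgsat, met = await asyncio.gather(
        read_emit_l1b_async(overpasses["EMIT"][0].url, slice=req.geoslice(50_000)),
        read_tropomi_l2_async(overpasses["TROPOMI"][0].url, slice=req.geoslice(50_000)),
        read_ghgsat_l2_async(overpasses["GHGSat"][0].url, slice=req.geoslice(50_000)),
        read_wrf_metfield_async(req.window(hours=12), req.bbox(100_000)),
    )
    prior = read_edgar(year=req.year, basin=req.basin)  # small static table; sync is fine
    return {"emit": emit, "tropomi": tropomi, "ghgsat": ghgsat, "met": met, "prior": prior}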

Operational expectations: event-driven, small-AOI requests that return a posterior in seconds (the TL;DR budget).

What this validates: research-to-prod with zero rewrite; the compiled graph from Phases 1 and 2 serves request traffic unchanged, with provenance in every response.

Estimated effort. ~1 week after Phase 2. Most of the work is the FastAPI scaffolding and async-reader plumbing; the graph and inference are unchanged.


6. Operator catalog — what the Sequential actually contains

For engineers: the concrete operators each library would contribute. This is the first-pass surface — exact signatures land in each library’s plans.

6.1 geotoolz operators (existing per plans/geotoolz/, with one addition)

6.2 plumax operators (projected — see plumax roadmap)
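For concreteness, the heart of the Tier I forward: vertically integrating a Gaussian plume gives the column enhancement at downwind distance x and crosswind offset y as ΔΩ(x, y) = Q / (√(2π) σ_y(x) ū) · exp(−y² / 2σ_y(x)²), with σ_y(x) set by the stability class. A sketch (the dispersion coefficients are placeholders that StabilityClassifier would supply):

import jax.numpy as jnp

def gaussian_plume_column(q, u, x, y, a=0.08, b=0.92):
    """Vertically integrated Gaussian plume enhancement [kg m⁻²] (sketch).

    q: source rate [kg/s]; u: mean wind [m/s]; x, y: downwind / crosswind
    coordinates [m] in the plume-aligned frame. sigma_y follows a power-law
    dispersion fit; (a, b) are placeholder coefficients.
    """
    sigma_y = a * jnp.maximum(x, 1.0) ** b
    peak = q / (jnp.sqrt(2 * jnp.pi) * sigma_y * u)
    return jnp.where(x > 0, peak * jnp.exp(-(y ** 2) / (2 * sigma_y ** 2)), 0.0)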

6.3 xrtoolz operators (projected — see github.com/jejjohnson/xr_toolz)

6.4 What lives in plumax, not geotoolz

The substrate-split rule from geotoolz.md §10.1: operators that are domain-specific (Gaussian plume math, AK application, methane likelihoods) live in plumax. Operators that are generic (matched filter, masking, cropping, reprojection) live in geotoolz. The split is enforced by import direction — plumax imports geotoolz, never the reverse.
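One cheap way to keep the import direction honest is a guard test on the geotoolz side (a sketch; the source layout is assumed):

import pathlib

def test_geotoolz_never_imports_plumax():
    # the generic substrate must not know about the domain layer
    for module in pathlib.Path("src/geotoolz").rglob("*.py"):
        assert "plumax" not in module.read_text(), f"{module} references plumax"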


7. Out of scope (explicit)

For all three demo phases: no model training (per the scope note up top), and none of the instruments §3 lists as out of scope for v1 (Sentinel-2 / Landsat, PRISMA / Tanager, CarbonMapper).

These exclusions are deliberate, per the scope-honesty discussion in geostack_notes.md — keeping the in-scope list deliverable rather than the out-of-scope list aspirational.


8. Validation plan

Validation is layered to match the plumax architectural principle 2: each step validates the next.

8.1 Per-operator (geotoolz / plumax / xrtoolz)
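As one example at this layer, the Tier I forward can be pinned to its own analytic mass budget: the crosswind integral of the column enhancement at any downwind x must equal Q/ū. A property-test sketch against the §6.2 forward:

import numpy as np

def test_plume_conserves_crosswind_mass():
    q, u, x = 100.0, 5.0, 500.0                # kg/s, m/s, m downwind
    y = np.linspace(-2000.0, 2000.0, 4001)     # 1 m crosswind grid
    omega = gaussian_plume_column(q, u, x, y)  # the §6.2 sketch
    assert np.isclose(np.trapz(np.asarray(omega), y), q / u, rtol=1e-3)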

8.2 Per-graph (the §4.5 attribution Graph)

8.3 End-to-end (against published events)

8.4 Operational continuity


9. Migration path from current MARS-style work

For audiences asking “what changes for the people doing this work today?” — a concrete mapping.

| Today’s piece | Maps to | What changes |
| --- | --- | --- |
| Per-project EMIT / TROPOMI / GHGSat ingest scripts | georeader 2.0 sensor readers | One library replaces N scripts; auth + transport unified |
| Project-specific overpass CSVs | GeoCatalog Phase 1 | One queryable object replaces ad-hoc CSVs; round-trips to GeoParquet |
| matched_filter.py per project | geotoolz.hyperspectral.MatchedFilter | Library-quality, tested, version-pinnable |
| Per-project gauss_plume.py | plumax.tier1.GaussianPlume (Operator-wrapped) | Shared math; bug fixes propagate; Hydra-config-friendly |
| Per-project NumPyro model | plumax.JointLikelihood + MAPInverter | Likelihood structure becomes a typed, reusable operator |
| Hand-rolled FastAPI handler | app.post("/attribute", ...) calling the same graph | Zero rewrite from research to prod; full provenance for free |
| Per-project repository / archival | GeoParquet posterior artifact + operator-graph hash | Each attribution is reproducible from the artifact alone |

What does not change for current MARS workflows: the science itself. The Tier I math, the retrievals, and the archives they live on (§4.1) stay exactly where they are; only the substrate they compose on changes.


10. Risks and open questions

The full risks tracker for the underlying libraries lives in their own design docs (plans/geotoolz/geotoolz.md §11, plans/geodatabase/, plans/georeader/). The risks specific to this attribution pipeline:

10.1 Critical-path

10.2 Per-instrument

10.3 Inference

10.4 Operational

10.5 Open questions to settle before v1

