GeoSlice

Unit of work between catalog, sampler, loader, operator


Parent: Core types — Core types. Status: design proposal. Promoted from the Geodatabase Phase 1 writeup, where it was treated as a footnote despite being the unit of work flowing between three layers. Scope: the GeoSlice dataclass and the three primitives that produce or consume it (random_sampler, grid_sampler, stitch).


Summary

A GeoSlice is a bounded request for data — a bbox, a time interval, a target resolution, and a CRS. It is the unit of work that catalogs produce, that loaders consume, and that operators (in geotoolz) compose against. The three sampler/stitch primitives — random_sampler, grid_sampler, and stitch — are its canonical producers and consumer.

Today, all four live in jej_vc_snippets/sampler.py (~1400 LOC). Ownership update: the three primitives above belong in geotoolz.patch today (with geopatcher as the plan/future extraction) — they are concrete instances of Patcher(Rectangular × RegularStride × Boxcar × OverlapAdd) (grid sampler), Patcher(Rectangular × Random × Boxcar × …) (random sampler), and OverlapAdd.merge (stitch). georeader.samplers is not the right home: georeader owns the substrate (Protocols, carriers, byte paths); the patching/sampling algebra belongs in geotoolz.patch. The dataclass itself stays here in types/ — it’s the cross-cutting wire format between the catalog (producer), geotoolz.patch (consumer), and georeader’s loaders (consumer). This document gives the dataclass its own attention so the invariants, sampler math, and stitch reductions are specified in one place rather than scattered across other designs.


Motivation

Three reasons this deserves its own design doc rather than a section inside the catalog plan:

  1. It’s the inter-layer contract. Catalogs produce GeoSlices. Loaders consume them. geotoolz.sampling.GridSampler wraps grid_sampler; geotoolz.inference.ApplyToChips consumes the iterator and stitches predictions. Six designs reference GeoSlice; none was the right home to fully specify it.

  2. The math is non-trivial and easy to get wrong. Random-sampler weighting has a documented bias when tile sizes are heterogeneous. Grid-sampler stride math has subtle edge-case behaviour at the trailing row/column. Stitch’s four reduction modes (average / max / first / last) each have different ordering semantics. Putting it all in one place makes it auditable.

  3. GeoSlice overlaps with two other concepts: rasterio.windows.Window (pixel-space rectangle) and slices.create_windows (chunking generator). Reconciling the three is a design decision in its own right, not a footnote in another doc.


Primer for newcomers

ELI5. A GeoSlice is a delivery slip: it says where (bbox), when (time), at what zoom (resolution), and in what coordinate system (CRS). The catalog writes the slip; the loader reads it; whoever’s in between never has to ask “wait, which file did this come from?” — the slip has everything they need.

Frozen dataclasses (immutability)

What it is. A @dataclass(frozen=True) is a Python class with auto-generated __init__ / __repr__ / __eq__ and a guarantee that its attributes can’t be mutated after construction. Trying to assign a new value to a field raises FrozenInstanceError.

How it works. The @dataclass decorator inspects type annotations and synthesises the constructor. frozen=True adds __setattr__ and __delattr__ overrides that raise. The instance is also hashable by default (because nothing can change), which lets you use it as a dict key or set member.

What this means for us. GeoSlice is frozen so that once constructed, it can be passed across function boundaries, stored in caches, used as a dict key for “slices I’ve already processed,” etc. Code that wants to change a slice creates a new one (dataclasses.replace(slice_, bounds=new_bounds)) — explicit by design. Mutable units of work flowing between layers are a recipe for bugs.
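A minimal illustration of the pattern with a toy dataclass (not GeoSlice itself — just the frozen/replace mechanics described above):

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class Box:
    xmin: float
    xmax: float

b = Box(0.0, 10.0)
try:
    b.xmin = 5.0                      # mutation is blocked after construction
except FrozenInstanceError:
    pass

wider = replace(b, xmax=20.0)         # "changing" a field means building a new instance
assert b == Box(0.0, 10.0)            # the original is untouched
assert {b: "seen"}[Box(0.0, 10.0)] == "seen"   # hashable → usable as a dict key
```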

pd.Interval and IntervalIndex

What it is. A pd.Interval(start, end, closed='both') is pandas’s typed representation of a time range. Combined into an IntervalIndex, you get fast “find all rows whose interval overlaps this query interval” queries — the temporal counterpart to a spatial R-tree.

How it works. Each interval is an immutable object with a closed policy ('both', 'left', 'right', 'neither'). Standard set ops (overlaps, contains) are vectorised at the IntervalIndex level. Scales cleanly: 10k intervals → microseconds per query. Same primitive used inside catalogs to filter by time.

What this means for us. GeoSlice.interval is a pd.Interval, not a (tmin, tmax) tuple, because the catalog stores intervals the same way and “does this slice’s interval overlap any catalog row?” is a single function call (no manual min/max gymnastics). Convention is closed='both' everywhere — both endpoints inclusive.
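A small sketch of the overlap query described above (hypothetical dates; the catalog-side index is a plain IntervalIndex):

```python
import pandas as pd

# A slice's time-of-interest, both endpoints inclusive by convention.
toi = pd.Interval(pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-31"), closed="both")

# Catalog side: one interval per row, queried in bulk.
rows = pd.IntervalIndex.from_tuples(
    [(pd.Timestamp("2023-12-20"), pd.Timestamp("2024-01-05")),
     (pd.Timestamp("2024-02-01"), pd.Timestamp("2024-02-10"))],
    closed="both",
)
hits = rows.overlaps(toi)   # vectorised boolean mask — no manual min/max comparisons
```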

pyproj.CRS (vs string EPSG codes)

What it is. pyproj.CRS is a typed CRS object that knows its EPSG code, WKT representation, axis order, datum shifts, and reprojection rules. Compared to a bare "EPSG:4326" string, it carries semantic information.

How it works. Constructed from "EPSG:32630", a WKT string, or a pyproj.CRS.from_epsg(32630) call. Carries methods for axis-order checking, transformation parameter lookup, and “is this CRS equivalent to that one” comparisons. Reprojection to/from another CRS goes through pyproj.Transformer.

What this means for us. GeoSlice.crs is a pyproj.CRS, not a string, so two slices in slightly different CRS representations (e.g., EPSG:32630 vs full WKT for the same UTM zone) can be compared sensibly. The package’s compare_crs(a, b) helper does the right thing across the various forms.
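The equivalence point can be shown with pyproj directly (compare_crs is the package's own helper; this is the underlying comparison it builds on):

```python
from pyproj import CRS

utm = CRS.from_epsg(32630)
utm_wkt = CRS.from_wkt(utm.to_wkt())   # same CRS, different textual representation

assert utm == utm_wkt                  # pyproj compares semantically, not by string
assert utm_wkt.to_epsg() == 32630      # the EPSG code is recoverable from WKT
assert utm != CRS.from_epsg(4326)
```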

“Unit of work” between layers

What it is. A small, immutable, self-contained value that flows from a producer (catalog, sampler) to a consumer (loader, operator) across a function boundary, carrying everything the consumer needs to do its job — no further reference to the producer required.

How it works. GeoSlice has four fields: bounds (where in space), interval (when in time), resolution (at what pixel size), crs (in which coordinate system). That’s enough for a loader to fetch a chip without consulting the catalog or the sampler again. The producer hands you a slice; you do whatever; you hand the result downstream. No hidden state.

What this means for us. The pattern decouples layers cleanly. A random_sampler doesn’t know which loader will consume the slice; a read_geoslice(slice) method doesn’t know whether the slice came from a sampler, a manual user query, or a JSON config. Same GeoSlice shape; many producers and consumers; no coupling beyond the type.


Goals


Non-goals


Constraints


The GeoSlice dataclass

import pandas as pd
import pyproj
from dataclasses import dataclass

@dataclass(frozen=True)
class GeoSlice:
    """A bounded request for data — produced by samplers, consumed by loaders.

    Carries everything a loader needs to fetch a chip without consulting
    the catalog: bbox in CRS units, time interval, target resolution, CRS.
    """
    bounds: tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
    interval: pd.Interval                       # closed='both'
    resolution: tuple[float, float]             # (x_res, y_res) in CRS units
    crs: pyproj.CRS

    @property
    def shape(self) -> tuple[int, int]:
        """Output grid shape (height, width) implied by bounds + resolution."""
        x_res, y_res = self.resolution
        xmin, ymin, xmax, ymax = self.bounds
        return (
            round((ymax - ymin) / y_res),
            round((xmax - xmin) / x_res),
        )

    @property
    def transform(self) -> "rasterio.Affine":
        """North-up affine from bounds + resolution."""
        ...

    @property
    def width(self) -> int: ...
    @property
    def height(self) -> int: ...
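A quick sanity check of the shape math, using a simplified stand-in so the example does not depend on pyproj or pandas (the real GeoSlice also carries interval and crs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniSlice:
    """Stand-in for GeoSlice carrying only the fields shape() needs."""
    bounds: tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
    resolution: tuple[float, float]             # (x_res, y_res) in CRS units

    @property
    def shape(self) -> tuple[int, int]:
        x_res, y_res = self.resolution
        xmin, ymin, xmax, ymax = self.bounds
        return (round((ymax - ymin) / y_res), round((xmax - xmin) / x_res))

# A 2560 m × 2560 m extent at 10 m pixels implies a 256 × 256 chip.
s = MiniSlice(bounds=(500_000.0, 4_100_000.0, 502_560.0, 4_102_560.0),
              resolution=(10.0, 10.0))
assert s.shape == (256, 256)
```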

Invariants

- bounds is ordered: xmin < xmax and ymin < ymax; antimeridian-crossing AOIs are rejected at the slice level (see Open questions).
- interval is closed='both', matching the catalog convention.
- resolution components are positive, expressed in CRS units.

Why GeoSlice and not Window or slice?

|                    | rasterio.windows.Window | slice (Python) | GeoSlice             |
|--------------------|-------------------------|----------------|----------------------|
| Coordinate system  | pixel                   | array index    | CRS units            |
| Carries CRS        | no                      | no             | yes                  |
| Carries time       | no                      | no             | yes                  |
| Carries resolution | no                      | no             | yes                  |
| Use for            | within-file extraction  | array slicing  | inter-layer contract |

GeoSlice exists because the inter-layer contract needs all four. A Window is what a loader produces from a GeoSlice once it knows the source file’s transform. A Python slice is what slices.create_windows (Tutorial Ch. 6) emits inside a single file, with no geographic meaning.

Conversion to/from Window

import pandas as pd
import rasterio.windows
from affine import Affine
from pyproj import CRS
from rasterio.windows import Window

def slice_to_window(slice_: GeoSlice, transform: Affine) -> Window:
    """Convert a CRS-unit GeoSlice to a pixel-space Window in the given file."""
    return rasterio.windows.from_bounds(*slice_.bounds, transform=transform)

def window_to_slice(window: Window, transform: Affine, crs: CRS,
                    interval: pd.Interval, resolution: tuple[float, float]) -> GeoSlice:
    """Inverse — usually only useful in tests / debugging."""
    bounds = rasterio.windows.bounds(window, transform)
    return GeoSlice(bounds=bounds, interval=interval, resolution=resolution, crs=crs)

The asymmetry is deliberate: a GeoSlice carries strictly more information than a Window, so going slice → window is lossy in the time/CRS axes (you have to drop them) and going window → slice requires extra arguments to fill them back in.


random_sampler

Signature

def random_sampler(
    catalog: GeoCatalog,
    chip_size: tuple[int, int],
    *,
    length: int | None = None,
    roi: shapely.Polygon | None = None,
    toi: pd.Interval | None = None,
    units: Literal["pixels", "crs"] = "pixels",
    seed: int | None = None,
    weight: Literal["area", "uniform"] = "area",
) -> Iterator[GeoSlice]: ...

Area-weighted sampling math

Default weight="area". Tiles big enough to fit a chip are kept; each surviving tile gets weight proportional to its area:

w_i = \frac{(x_{\max,i} - x_{\min,i}) \cdot (y_{\max,i} - y_{\min,i})}{\sum_j A_j}

A tile is drawn with numpy.random.Generator.choice(p=w), then a chip is placed uniformly inside the tile:

x_offset = uniform(0, tile_width  - chip_width)
y_offset = uniform(0, tile_height - chip_height)

The timestamp is sampled uniformly (in seconds) inside the tile’s interval and assigned to both tmin and tmax.
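The weighting and placement steps above can be sketched as follows (a simplified sketch: tiles reduced to plain bounds tuples, and sample_offsets is a hypothetical helper name — the real implementation iterates catalog rows):

```python
import numpy as np

def sample_offsets(tiles, chip, n, seed=0):
    """tiles: list of (xmin, ymin, xmax, ymax); chip: (w, h) in the same units."""
    w, h = chip
    # Keep only tiles big enough to fit a chip.
    ok = [(x0, y0, x1, y1) for x0, y0, x1, y1 in tiles if x1 - x0 >= w and y1 - y0 >= h]
    areas = np.array([(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in ok])
    p = areas / areas.sum()                      # w_i = A_i / sum_j A_j
    rng = np.random.default_rng(seed)
    out = []
    for i in rng.choice(len(ok), size=n, p=p):   # tile drawn by area weight
        x0, y0, x1, y1 = ok[i]
        x = rng.uniform(x0, x1 - w)              # chip placed uniformly inside tile
        y = rng.uniform(y0, y1 - h)
        out.append((x, y, x + w, y + h))
    return out

# One large tile and one small tile: most draws land in the large one.
chips = sample_offsets([(0, 0, 100, 100), (0, 0, 10, 10)], chip=(8, 8), n=5)
```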

The weighting bias (open question)

Area-weighted sampling pushes samples toward large tiles. For uniform-density imagery this is what you want — every pixel has equal probability of being sampled. For tiled mosaics with one giant tile and many small ones, it over-samples the giant tile.

Three options: keep area weighting only, expose a weight={"area", "uniform"} switch, or additionally accept a per-row callable for stratified sampling.

The proposal is to expose weight={"area", "uniform"} and document the bias. See parent README §Open questions for the unresolved decision on whether callable is worth adding in v1.


grid_sampler

Signature

def grid_sampler(
    catalog: GeoCatalog,
    chip_size: tuple[int, int],
    *,
    stride: tuple[int, int] | None = None,
    roi: shapely.Polygon | None = None,
    toi: pd.Interval | None = None,
    units: Literal["pixels", "crs"] = "pixels",
) -> Iterator[GeoSlice]: ...

Stride math

For each tile of size (W, H) and chip (w, h) with stride (s_x, s_y):

n_{\text{cols}} = \left\lceil \frac{W - w}{s_x} \right\rceil + 1, \qquad n_{\text{rows}} = \left\lceil \frac{H - h}{s_y} \right\rceil + 1

Final-row/column chips are shifted (not truncated) so chip size stays exact across the grid. Total chip count = sum over tiles. Matches the standard sliding-window convention used by xrpatcher / xbatcher.
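The shift-not-truncate rule can be sketched in a few lines of pure Python (1-D case; chip_starts is a hypothetical helper name):

```python
import math

def chip_starts(extent: int, chip: int, stride: int) -> list[int]:
    """Start offsets along one axis; the final chip is shifted so it ends at the edge."""
    n = math.ceil((extent - chip) / stride) + 1        # ceil((W - w) / s) + 1
    return [min(i * stride, extent - chip) for i in range(n)]

# W=10, w=4, s=3 → ceil(6/3)+1 = 3 chips; the last starts exactly at 6.
assert chip_starts(10, 4, 3) == [0, 3, 6]
# W=10, w=4, s=4 → ceil(6/4)+1 = 3 chips; the last is shifted from 8 back to 6,
# so it ends at the edge (10) with full size rather than being truncated.
assert chip_starts(10, 4, 4) == [0, 4, 6]
```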

Reconciliation with slices.create_windows

slices.create_windows does the same stride math but at the pixel level inside a single file. The relationship:

| Layer                   | Function               | Input                             | Output               |
|-------------------------|------------------------|-----------------------------------|----------------------|
| Inter-file (catalog)    | grid_sampler           | catalog + chip size in CRS units  | iterator of GeoSlice |
| Intra-file (one raster) | slices.create_windows  | (H, W) shape + chip size in pixels| iterator of Window   |

grid_sampler calls into the catalog to find tiles; for each tile it can either reimplement stride math itself or convert the tile’s bounds into a (H, W) shape and delegate to create_windows, then convert each Window back to a GeoSlice. The implementation detail is open; the two functions stay separately useful at their respective scopes.


stitch

Signature

def stitch(
    predictions: Sequence[np.ndarray],
    slices: Sequence[GeoSlice],
    *,
    method: Literal["average", "max", "first", "last"] = "average",
    roi: shapely.Polygon | None = None,
) -> GeoTensor: ...

Reduction methods

| Method  | Update rule                                      |
|---------|--------------------------------------------------|
| average | out += pred; counts += 1; out /= max(counts, 1)  |
| max     | out = np.maximum(out, pred)                      |
| first   | write only where counts == 0, then increment counts |
| last    | unconditional overwrite                          |

Pixel placement uses the output transform implicit in (xmin, ymax, x_res, y_res):

col = int((slice_.bounds[0] - xmin) / x_res)
row = int((ymax - slice_.bounds[3]) / y_res)   # y-flip: rows count from north downward

Order-dependence

first and last are order-dependent by definition — they reduce based on the iteration order of predictions. average and max are order-independent.

For deterministic pipelines, sort slices by (slice_.bounds, slice_.interval.left) before passing. The samplers don’t guarantee a stable order — random_sampler is intentionally random; grid_sampler iterates tiles in catalog order, which depends on the backend.
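A minimal average-mode sketch in NumPy, assuming same-resolution single-band chips and the pixel-placement rule above (stitch_average is a hypothetical helper, not the jej_vc_snippets implementation):

```python
import numpy as np

def stitch_average(preds, bounds_list, out_bounds, res):
    """preds: list of (h, w) arrays; bounds_list: matching (xmin, ymin, xmax, ymax)."""
    x_res, y_res = res
    xmin, ymin, xmax, ymax = out_bounds
    H = round((ymax - ymin) / y_res)
    W = round((xmax - xmin) / x_res)
    out = np.zeros((H, W))
    counts = np.zeros((H, W))
    for pred, b in zip(preds, bounds_list):
        col = int(round((b[0] - xmin) / x_res))
        row = int(round((ymax - b[3]) / y_res))    # y-flip: rows count from north down
        h, w = pred.shape
        out[row:row + h, col:col + w] += pred      # accumulate overlapping predictions
        counts[row:row + h, col:col + w] += 1
    return out / np.maximum(counts, 1)             # average where covered, 0 elsewhere

# Two overlapping 2×2 chips on a 2×3 canvas: the shared column averages to 2.
a = np.ones((2, 2))
b = np.full((2, 2), 3.0)
merged = stitch_average([a, b], [(0, 0, 2, 2), (1, 0, 3, 2)], (0, 0, 3, 2), (1, 1))
```

Because addition is commutative, this reduction is order-independent, unlike first/last.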


Connections to other designs

| Design                      | How it touches GeoSlice |
|-----------------------------|-------------------------|
| Geodatabase / geocatalog.md | GeoCatalog.query(slice_: GeoSlice) consumes one. Catalog rows + iter_rows() are what the samplers iterate. |
| Geodatabase / geoduckdb.md  | DuckDB cursor iteration feeds the samplers; the catalog produces a lazy row stream, which the samplers reservoir-sample. |
| Reader reconciliation       | GeoData.read_geoslice(slice) (sync) and AsyncGeoData.read_geoslice(slice) (async) are the canonical loader entry points. |
| geotoolz.md                 | geotoolz.sampling.GridSampler wraps grid_sampler. geotoolz.inference.ApplyToChips consumes the iterator and uses stitch for the inverse step. The Stitch operator in geotoolz is a direct re-export. |

The flow of a GeoSlice through a typical inference pipeline:

GeoCatalog  ──grid_sampler──►  GeoSlice  ──reader.read_geoslice──►  GeoTensor
                                  │                                       │
                                  └──────────────────────┐              model
                                                          ▼                │
                                                   stitch(slices,         │
                                                          predictions) ◄──┘
                                                            │
                                                            ▼
                                                       GeoTensor (stitched)

Every arrow is sync in this proposal. The async equivalent (using AsyncGeoData.read_geoslice and asyncio.gather) lands as a follow-up if needed.


Open questions

1. weight=callable in random_sampler

Section above proposes weight={"area", "uniform"} for v1. A callable variant (weight=lambda row: ...) is requested by some users for stratified sampling but adds API surface. Defer to v2 or include now? Tentative pick: defer.

2. GeoSlice antimeridian policy

The invariants require xmin < xmax. Antimeridian-crossing AOIs are forbidden at the slice level. Where does the split happen?

window_utils.bounds_to_windows currently handles this in the loader path by returning a list of windows. Whether GeoSlice should mirror that or refuse outright is a judgement call.

3. GeoSlice.dims for higher-rank slices

Today’s shape returns (h, w). For 4D (t, b, h, w) chips, do we want a dims property? The number of channels can be inferred from the catalog row + loader, not from the slice itself. Tentative pick: keep slices 2D-shaped; let downstream code assemble multi-dim chunks.

4. stitch default for method

average is the proposal default. For probability-output models this can be wrong — averaging post-softmax probabilities is not equivalent to averaging logits, so the blended posterior may be skewed. In practice users want either last (faster; no counts array to allocate) or average (smoother seams). Tentative pick: keep average as the default and document the gotcha.

5. Time semantics in stitch

stitch produces a single GeoTensor from many slices. If those slices have different interval fields (likely, when stitching predictions across time), what does the output’s interval become?

Unresolved. The implementation in jej_vc_snippets is silent on this — the stitched output is just an ndarray with no time metadata.


Alternatives considered