
geotoolz.sklearn examples

A gallery of sklearn Operator patterns over GeoTensor

UNEP · IMEO · MARS

Status: companion to design_sklearn.md — worked examples across the three fit/predict mental models below. Audience: anyone wiring scikit-learn (or an xgboost / lightgbm / catboost / imbalanced-learn estimator) into a geotoolz pipeline. Examples cover classification, regression, transformations, sklearn Pipeline composition, unsupervised clustering, temporal models, spatial cross-validation, and incremental fitting on raster data.

Mental models

Three common fit/predict relationships in geospatial ML, each with different implications for the operator graph:

Fit globally + predict globally. Sample training pixels across the full catalog (multiple scenes, multiple regions, multiple seasons). Fit one estimator. Apply that estimator to anything in the catalog at inference. The standard supervised-learning case — most appropriate when the underlying relationship between features and labels is stationary across the catalog.

Fit globally + predict locally (patches). Same global fit, but inference runs patch-by-patch via ApplyToChips. The patches are an implementation detail of inference (memory, tile-server latency, parallelism), not a modelling choice. The estimator doesn’t know it’s being run on patches — the predictions are stitched back into the full output.

Fit locally + predict locally. Train a separate estimator per region (per MGRS tile, per season, per ecoregion, per scene). Each estimator only predicts on its corresponding region. Use when spatial heterogeneity is real — different atmospheric conditions, different crop calendars, different sensor calibrations — and a global model would underperform. Implementation: a registry of fitted estimators, dispatched via Switch on metadata.

A fourth pattern — fit locally + predict globally (transfer learning, domain adaptation) — exists but isn’t covered here; it’s an active research area, not a stable pattern.
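The third model reduces to a registry of fitted estimators keyed on metadata, with a hard failure for unknown keys. A minimal, library-agnostic sketch with plain scikit-learn; the tile IDs and feature distributions here are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# One estimator per region key, each fitted on that region's pixels only.
registry = {}
for tile_id, shift in [("29SND", 0.0), ("30TUK", 3.0)]:
    X = rng.normal(loc=shift, size=(200, 4))      # region-specific distribution
    y = (X[:, 0] > shift).astype(int)
    registry[tile_id] = LogisticRegression().fit(X, y)

def predict_for_region(tile_id, X):
    # Dispatch on metadata; fail loudly when no model exists for the region.
    if tile_id not in registry:
        raise KeyError(f"no model for tile: {tile_id}")
    return registry[tile_id].predict(X)

preds = predict_for_region("29SND", rng.normal(size=(10, 4)))
```

The Switch-with-Raise pattern shown in Example 6 is the geotoolz equivalent of the `predict_for_region` dispatch above.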

Patch-generation strategies

When the data comes in patches — for fitting or inference — there are several ways to produce them. The right choice depends on data scale, where the workflow lives, and whether fit and predict need the same patches.

1. GeoCatalog + sampler (geotoolz-native). Lazy, catalog-driven. Each patch is read independently from cloud storage via a GeoSlice. Scales to planetary catalogs because no scene is ever fully materialised. The default choice for training over a catalog and for inference over a catalog.

sampler = gz.sampling.GridSampler(chip_size=(256, 256))
for sl in sampler(catalog):
    with georeader.RasterioReader.open(sl.source_uri) as r:
        gt = r.read_geoslice(sl)
    process(gt)

2. Read whole + xrpatcher (eager). Read a full scene into xarray, then chop it into patches. One read, many patches — better I/O when patches are densely sampled from one scene. The right choice for single-scene workflows (experimentation, viz, quick prototypes).

import xrpatcher
ds = xr.open_dataset(scene_uri)
patches = xrpatcher.XRDAPatcher(ds, patches=dict(x=256, y=256), strides=dict(x=224, y=224))
for patch_ds in patches:
    gt = GeoTensor.from_xarray(patch_ds)
    process(gt)

3. ApplyToChips (geotoolz, inference-only). Wraps an in-memory GeoTensor as chips for inference. The right choice when you already have a full scene loaded and want to run a chipwise model over it. Not for fitting (no label coupling).

infer = gz.inference.ApplyToChips(
    sampler=gz.sampling.GridSampler(chip_size=(256, 256), stride=(224, 224)),
    chip_op=pipeline,
    stitcher=gz.sampling.Stitch(method="average"),
)
pred = infer(scene_gt)
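The window and stitching arithmetic these strategies share can be illustrated with plain numpy. This is an independent sketch, not geotoolz internals: grid windows with a stride, edge-snapped so the raster is fully covered, then per-pixel averaging over overlaps:

```python
import numpy as np

def grid_windows(height, width, chip, stride):
    """Top-left (row, col) corners of chips covering an H x W raster."""
    rows = list(range(0, max(height - chip, 0) + 1, stride))
    cols = list(range(0, max(width - chip, 0) + 1, stride))
    # Snap a final window to each edge so no pixels are left uncovered.
    if rows[-1] + chip < height:
        rows.append(height - chip)
    if cols[-1] + chip < width:
        cols.append(width - chip)
    return [(r, c) for r in rows for c in cols]

def stitch_average(height, width, chip, chip_preds):
    """Average overlapping chip predictions back into one raster."""
    acc = np.zeros((height, width))
    cnt = np.zeros((height, width))
    for (r, c), p in chip_preds:
        acc[r:r + chip, c:c + chip] += p
        cnt[r:r + chip, c:c + chip] += 1
    return acc / cnt

H, W, CHIP, STRIDE = 100, 100, 40, 30
windows = grid_windows(H, W, CHIP, STRIDE)
chip_preds = [((r, c), np.ones((CHIP, CHIP))) for r, c in windows]
out = stitch_average(H, W, CHIP, chip_preds)
```

Overlap (stride < chip size) plus averaging or voting is what suppresses edge artifacts at chip boundaries.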

Heuristics:

Task                                      | Strategy
Training over many scenes                 | GeoCatalog + RandomSampler
Validation on a held-out scene catalog    | GeoCatalog + GridSampler
Inference on a single scene (full output) | ApplyToChips
Inference across a catalog                | CatalogPipeline
Single-scene experimentation              | Read whole + xrpatcher

The examples below cite which strategy they use, with the rationale.


1. Crop classification (global fit, global predict)

Setting. Multi-class crop classification from Sentinel-2 monthly composites across an agricultural region (multiple MGRS tiles, one growing season). One Random Forest fit on pixels sampled from across the training catalog; applied to held-out scenes at inference.

Patch strategy. GeoCatalog + RandomSampler for training (lazy, catalog-scale). Per-scene prediction at inference (each scene read whole in this example; chipwise inference shown in the next example).

import geotoolz as gz
import georeader
import joblib
from sklearn.ensemble import RandomForestClassifier

# === Training catalog: S2 chips with co-located crop labels ===
train_catalog = georeader.catalog.open_catalog("s3://crop/train_2024.parquet")

feature_pipeline = gz.Sequential([
    gz.cloud.MaskClouds(qa_band=-1, bits=[10, 11]),
    gz.correction.TOAToBOA(sun_zenith_band=-2),
    gz.indices.AppendIndex(gz.indices.NDVI(red_idx=2, nir_idx=3), name="ndvi"),
    gz.indices.AppendIndex(gz.indices.NDWI(green_idx=1, nir_idx=3), name="ndwi"),
])

label_pipeline = gz.Sequential([
    gz.catalog_ops.ReadLabel(label_uri_field="label_uri", band="crop_class"),
])

# === Fit globally ===
clf = gz.sklearn.fit_pixelwise(
    estimator=RandomForestClassifier(n_estimators=300, max_depth=20, n_jobs=-1),
    X_pipeline=feature_pipeline,
    y_pipeline=label_pipeline,
    catalog=train_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=500),
    n_pixels=1_000_000,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop", on_label="drop"),
    random_state=42,
)
joblib.dump(clf, "/models/crop_rf_2024.joblib")

# === Predict globally ===
predict_pipeline = gz.Sequential([
    feature_pipeline,
    gz.sklearn.PixelwiseClassifier(
        estimator=joblib.load("/models/crop_rf_2024.joblib"),
        nan="mask",                       # NaN pixels → fill_value
        output_dtype="uint8",
    ),
])

test_catalog = georeader.catalog.open_catalog("s3://crop/test_2024.parquet")
gz.catalog_ops.CatalogPipeline(
    test_catalog,
    gz.Sequential([
        predict_pipeline,
        gz.catalog_ops.WriteCOG(path_template="s3://out/crop_2024/{tile}.tif"),
    ]),
    n_workers=8,
).run()
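Underneath fit_pixelwise and PixelwiseClassifier is ordinary pixel-major reshaping plus a NaN policy. A self-contained sketch with plain numpy and scikit-learn; the real helpers may differ in detail:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
C, H, W = 4, 16, 16

# Training chip: (C, H, W) features plus a co-located label raster.
X_chip = rng.normal(size=(C, H, W))
y_chip = (X_chip[0] > 0).astype(np.uint8)
X_chip[:, 0, 0] = np.nan                      # simulate a cloud-masked pixel

# Pixel-major flatten: (C, H, W) -> (H*W, C), one row per pixel.
X = X_chip.reshape(C, -1).T
y = y_chip.ravel()

# Fit-time NaN policy "drop": remove incomplete rows before fitting.
valid = ~np.isnan(X).any(axis=1)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X[valid], y[valid])

# Predict-time NaN policy "mask": predict valid rows, fill the rest,
# reshape back to raster shape.
pred = np.full(H * W, 255, dtype=np.uint8)    # fill_value for masked pixels
pred[valid] = clf.predict(X[valid])
pred_raster = pred.reshape(H, W)
```

The estimator never sees geography: it gets an (n_pixels, n_features) matrix, and the geospatial layer owns the reshaping and masking on either side.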

Notes on this example.


2. Chipwise inference for large scenes (global fit, local predict)

Setting. Same crop classifier as Example 1, but inference now runs on full Sentinel-2 tiles (~10,980 × 10,980 pixels, ~1 GB in memory). Whole-scene inference would OOM on a workstation; chipwise inference holds memory constant.

Patch strategy. ApplyToChips over a full-scene GeoTensor. The same model from Example 1 — no retraining.

clf = joblib.load("/models/crop_rf_2024.joblib")

# Chip operator: features + classifier. Same as Example 1's predict_pipeline.
chip_op = gz.Sequential([
    feature_pipeline,
    gz.sklearn.PixelwiseClassifier(estimator=clf, nan="mask", output_dtype="uint8"),
])

# Wrap as a chipwise inference Operator.
infer = gz.inference.ApplyToChips(
    sampler=gz.sampling.GridSampler(chip_size=(512, 512), stride=(480, 480)),
    chip_op=chip_op,
    stitcher=gz.sampling.Stitch(method="majority"),  # vote per pixel across overlap
)

# Apply to full-scene GeoTensor.
# scene_slice: a GeoSlice covering the full tile extent (constructed upstream)
with georeader.RasterioReader.open("s3://s2/tile_29SND_2024-08-12.tif") as r:
    scene_gt = r.read_geoslice(scene_slice)

prediction = infer(scene_gt)
georeader.save_cog(prediction, "/out/29SND_2024-08-12_crop.tif")

Notes.


3. Forest biomass regression

Setting. Aboveground biomass (continuous, Mg/ha) from Sentinel-1 + Sentinel-2 fused features. Regression rather than classification — different default NaN policy, different output dtype, different metrics.

Patch strategy. GeoCatalog + RandomSampler for training. Per-scene prediction at inference (same as Example 1).

import geotoolz as gz
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# === Fused feature pipeline: optical + radar ===
feature_pipeline = gz.Graph(
    inputs={"optical": gz.core.Input("optical"), "sar": gz.core.Input("sar")},
    outputs={"features": gz.fusion.AlignAndStack(target_grid="optical")(
        gz.cloud.MaskClouds(qa_band=-1, bits=[10, 11])(gz.core.Input("optical")),
        gz.sar.LinearToDB()(gz.sar.LeeSpeckle(window=7)(gz.core.Input("sar"))),
    )},
)

# === Fit ===
reg = gz.sklearn.fit_pixelwise(
    estimator=GradientBoostingRegressor(n_estimators=500, max_depth=5, learning_rate=0.05),
    X_pipeline=feature_pipeline,
    y_pipeline=gz.catalog_ops.ReadLabel(label_uri_field="agb_uri", band="agb_mg_ha"),
    catalog=biomass_train_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=300),
    n_pixels=500_000,
    nan_policy=gz.sklearn.NanPolicy(
        on_input="drop",
        on_label="drop",
    ),
)
joblib.dump(reg, "/models/agb_gbr.joblib")

# === Predict ===
predict = gz.Sequential([
    feature_pipeline,
    gz.sklearn.PixelwiseRegressor(
        estimator=joblib.load("/models/agb_gbr.joblib"),
        nan="mask",
        fill_value=np.nan,                # ← regression default; classification used -1
        output_dtype="float32",
    ),
])

Notes.


4. Pixel-wise PCA dimensionality reduction

Setting. Hyperspectral data (~200 bands). Reduce to 10 principal components for downstream classification or visualisation. The transformer case — output is multi-band, not single-band.

Patch strategy. GeoCatalog + RandomSampler over a representative subset of scenes for fitting PCA. Apply per-scene at inference.

import geotoolz as gz
import joblib
from sklearn.decomposition import PCA

# === Fit PCA on a representative sample ===
pca = gz.sklearn.fit_pixelwise(
    estimator=PCA(n_components=10, whiten=True),
    X_pipeline=hyperspectral_pipeline,
    y_pipeline=None,                       # unsupervised
    catalog=hyperspectral_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(64, 64), length=100),
    n_pixels=200_000,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop"),
)
joblib.dump(pca, "/models/pca_10.joblib")

# === Apply as preprocessing in any pipeline ===
preprocess = gz.Sequential([
    hyperspectral_pipeline,
    gz.sklearn.PixelwiseTransformer(
        estimator=joblib.load("/models/pca_10.joblib"),
        nan="mask",
    ),
    # Output: (10, H, W) GeoTensor — 10 PCA components
])

# Compose with anything downstream
classify = gz.Sequential([
    preprocess,
    gz.sklearn.PixelwiseClassifier(estimator=clf_on_pca, nan="mask"),
])

Notes.


5. Full sklearn Pipeline composition

Setting. Soil moisture regression from Sentinel-1 backscatter + DEM features. The estimator is a multi-stage sklearn Pipeline: imputation → scaling → polynomial feature expansion → Ridge regression. The killer interop feature — the whole thing is one Operator on the geotoolz side.

Patch strategy. GeoCatalog + RandomSampler for training. The Pipeline is one estimator from geotoolz’s perspective.

import geotoolz as gz
import joblib
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

sklearn_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
    ("poly",    PolynomialFeatures(degree=2, interaction_only=True)),
    ("reg",     Ridge(alpha=1.0)),
])

# Fit the whole Pipeline as one estimator
fitted = gz.sklearn.fit_pixelwise(
    estimator=sklearn_pipe,
    X_pipeline=sm_feature_pipeline,
    y_pipeline=gz.catalog_ops.ReadLabel(label_uri_field="sm_uri", band="vwc"),
    catalog=sm_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=400),
    n_pixels=800_000,
    nan_policy=gz.sklearn.NanPolicy(
        on_input="propagate",          # let SimpleImputer handle NaN
        on_label="drop",
    ),
)
joblib.dump(fitted, "/models/sm_pipeline.joblib")

# Inference — the whole sklearn Pipeline is one Operator
predict = gz.Sequential([
    sm_feature_pipeline,
    gz.sklearn.PixelwiseRegressor(
        estimator=joblib.load("/models/sm_pipeline.joblib"),
        nan="propagate",                # SimpleImputer at the start handles it
    ),
])

Notes.


6. Per-tile classifier (local fit, local predict)

Setting. Crop classification across an MGRS grid spanning multiple ecoregions and climate zones. A global model underperforms because growing seasons, atmospheric profiles, and dominant crops vary by region. Train one classifier per MGRS tile; dispatch on tile ID at inference.

Patch strategy. GeoCatalog + RandomSampler for fitting each tile’s model separately. Switch for dispatch at inference.

import geotoolz as gz
import joblib
from sklearn.ensemble import RandomForestClassifier

# === Fit one model per tile ===
tile_models = {}
for tile_id in train_catalog.tiles():
    tile_catalog = train_catalog.filter(tile=tile_id)
    clf = gz.sklearn.fit_pixelwise(
        estimator=RandomForestClassifier(n_estimators=200, n_jobs=-1),
        X_pipeline=feature_pipeline,
        y_pipeline=label_pipeline,
        catalog=tile_catalog,
        sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=100),
        n_pixels=300_000,
        nan_policy=gz.sklearn.NanPolicy(on_input="drop", on_label="drop"),
    )
    joblib.dump(clf, f"/models/crop_rf_{tile_id}.joblib")
    tile_models[tile_id] = clf

# === Dispatch via Switch at inference ===
classifier_dispatch = gz.core.Switch(
    key=lambda gt: gt.metadata["mgrs_tile"],
    cases={
        tile_id: gz.sklearn.PixelwiseClassifier(
            estimator=clf, nan="mask", output_dtype="uint8",
        )
        for tile_id, clf in tile_models.items()
    },
    default=gz.core.Raise("no model for tile: {key}"),
)

predict = gz.Sequential([feature_pipeline, classifier_dispatch])

# Catalog-driven inference — each scene gets routed to its tile's classifier
gz.catalog_ops.CatalogPipeline(test_catalog, predict, n_workers=8).run()

Notes.


7. Time-series crop classification

Setting. Crop classification from Sentinel-2 monthly composites with shape (C=10, T=12, H, W) — twelve months of features per pixel. The estimator sees each pixel’s full year as a 120-dim feature vector and predicts one crop class per pixel.

Patch strategy. GeoCatalog + RandomSampler, but the catalog rows reference temporal stacks (not individual scenes). Readers materialise (C, T, H, W) GeoTensors.

import geotoolz as gz
import joblib
from sklearn.ensemble import RandomForestClassifier

# === Temporal feature pipeline ===
temporal_features = gz.Sequential([
    gz.cloud.MaskClouds(qa_band=-1, bits=[10, 11]),    # acts per-timestep
    gz.correction.TOAToBOA(sun_zenith_band=-2),
    gz.temporal.LinearGapFill(),                       # interpolate cloud-masked gaps
])

# === Fit with temporal adapter ===
clf = gz.sklearn.fit_pixelwise(
    estimator=RandomForestClassifier(n_estimators=400, max_depth=15, n_jobs=-1),
    X_pipeline=temporal_features,
    y_pipeline=label_pipeline,
    catalog=temporal_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=300),
    adapter=gz.sklearn.ToTemporalPixelMajor(time_handling="features"),  # (C,T,H,W) → (H*W, T*C)
    n_pixels=500_000,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop", on_label="drop"),
)
joblib.dump(clf, "/models/crop_ts_rf.joblib")

# === Predict ===
predict = gz.Sequential([
    temporal_features,
    gz.sklearn.PixelwiseClassifier(
        estimator=joblib.load("/models/crop_ts_rf.joblib"),
        adapter=gz.sklearn.ToTemporalPixelMajor(time_handling="features"),
        nan="mask",
        output_dtype="uint8",
    ),
])

Notes.


8. Spatial cross-validation with GridSearchCV

Setting. Hyperparameter tuning for the crop classifier. Standard k-fold CV leaks information geographically — nearby pixels are spatially correlated, so a pixel in fold 1 and a pixel in fold 2 from the same field are nearly identical, inflating CV scores. Need spatial block CV: group pixels by a coarse spatial block, then GroupKFold ensures all pixels in one block stay in one fold.

Patch strategy. GeoCatalog + RandomSampler for training. CV machinery lives inside the sklearn estimator passed to fit_pixelwise.

import geotoolz as gz
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# === Fit_pixelwise extracts a spatial-block group ID per pixel ===
# `block_band` is a band in the y_pipeline output carrying the spatial block ID.
block_label_pipeline = gz.Sequential([
    label_pipeline,
    gz.spatial.BlockID(block_size_m=10_000),     # 10 km blocks → block ID band
])

# === Wrap classifier in GridSearchCV with spatial folds ===
grid = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1),
    param_grid={
        "n_estimators": [200, 400, 600],
        "max_depth":    [10, 20, None],
        "min_samples_leaf": [1, 5, 20],
    },
    cv=GroupKFold(n_splits=5),
    scoring="f1_macro",
    n_jobs=1,                    # n_jobs handled by RF, avoid nested parallelism
    verbose=2,
)

# === Fit — fit_pixelwise passes the block ID as `groups=` ===
fitted_grid = gz.sklearn.fit_pixelwise(
    estimator=grid,
    X_pipeline=feature_pipeline,
    y_pipeline=block_label_pipeline,
    catalog=train_catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=300),
    n_pixels=500_000,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop", on_label="drop"),
    group_band=1,                # band 1 of y_pipeline is the block ID
)

print(f"Best params: {fitted_grid.best_params_}")
print(f"Best CV score: {fitted_grid.best_score_:.3f}")

# Best estimator is itself fitted on the full training set
joblib.dump(fitted_grid.best_estimator_, "/models/crop_rf_tuned.joblib")
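The spatial-block grouping that BlockID and group_band presumably implement reduces to integer division of coordinates, and plain scikit-learn consumes the resulting IDs directly via GroupKFold. A runnable sketch with synthetic coordinates; the block-ID encoding is an assumption, not geotoolz internals:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Pixel coordinates in metres (e.g. UTM easting/northing) over a 50 km AOI.
x = rng.uniform(0, 50_000, size=1_000)
y = rng.uniform(0, 50_000, size=1_000)

# 10 km spatial blocks: one integer block ID per pixel.
block_size_m = 10_000
bx = (x // block_size_m).astype(int)
by = (y // block_size_m).astype(int)
groups = bx * 1_000 + by                  # unique ID per (bx, by) block

# GroupKFold keeps every pixel of a block inside a single fold, so spatially
# correlated neighbours can never straddle the train/test boundary.
cv = GroupKFold(n_splits=5)
splits = list(cv.split(x.reshape(-1, 1), groups=groups))
```

The block size is a tuning knob in its own right: it should exceed the spatial autocorrelation range of the features, or leakage persists across block edges.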

Notes.


9. Unsupervised land-cover clustering

Setting. No labels available. Want a coarse land-cover segmentation from Sentinel-2 features as an exploratory product or pre-labelling step for active learning. Use MiniBatchKMeans for scalability.

Patch strategy. GeoCatalog + RandomSampler for fitting on a sample. CatalogPipeline for inference over the whole catalog.

import geotoolz as gz
import joblib
from sklearn.cluster import MiniBatchKMeans

# === Unsupervised fit ===
kmeans = gz.sklearn.fit_pixelwise(
    estimator=MiniBatchKMeans(n_clusters=12, batch_size=10_000, random_state=42),
    X_pipeline=feature_pipeline,
    y_pipeline=None,                       # unsupervised — no labels needed
    catalog=catalog,
    sampler=gz.sampling.RandomSampler(chip_size=(256, 256), length=200),
    n_pixels=500_000,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop"),
)
joblib.dump(kmeans, "/models/landcover_kmeans_12.joblib")

# === Predict cluster IDs per pixel ===
predict = gz.Sequential([
    feature_pipeline,
    gz.sklearn.PixelwiseClusterer(
        estimator=joblib.load("/models/landcover_kmeans_12.joblib"),
        nan="mask",
        output_dtype="uint8",
    ),
])

# === Bonus: cluster centroids for inspection ===
centroids = kmeans.cluster_centers_           # (12, n_features)
# plot_centroid_spectra is a user-supplied plotting helper, not part of geotoolz
plot_centroid_spectra(centroids, feature_names=["B02", "B03", "B04", "B08", "NDVI", "NDWI"])

Notes.


10. Incremental fitting for large training sets

Setting. Training data is genuinely larger than memory — billions of labeled pixels across a continental catalog. Batch-fitting any single estimator on all of it is infeasible. Use partial_fit-capable estimators streamed through the catalog.

Patch strategy. GeoCatalog + GridSampler (deterministic coverage, multiple epochs) for fitting. Standard per-scene prediction at inference.

import geotoolz as gz
import joblib
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = gz.sklearn.fit_pixelwise_incremental(
    estimator=SGDClassifier(
        loss="log_loss",
        alpha=1e-4,
        learning_rate="optimal",
        random_state=42,
    ),
    X_pipeline=feature_pipeline,
    y_pipeline=label_pipeline,
    catalog=continental_catalog,
    sampler=gz.sampling.GridSampler(chip_size=(256, 256)),
    chunk_size=10_000,                     # pixels per partial_fit call
    classes=np.arange(N_CLASSES),          # required on the first partial_fit call
    epochs=3,
    nan_policy=gz.sklearn.NanPolicy(on_input="drop", on_label="drop"),
    shuffle=True,                          # shuffle chips within each epoch
)
joblib.dump(clf, "/models/crop_sgd.joblib")

Implementation sketch (helper internals):

import random
import numpy as np

def fit_pixelwise_incremental(*, estimator, X_pipeline, y_pipeline,
                              catalog, sampler, chunk_size, classes,
                              epochs, nan_policy, shuffle=True, **_):
    for epoch in range(epochs):
        slices = list(sampler(catalog))
        if shuffle:
            random.shuffle(slices)
        buf_X, buf_y = [], []
        for sl in slices:
            chip = sl.load()                          # read once, feed both pipelines
            X = _to_pixel_major(X_pipeline(chip))
            y = _to_pixel_major(y_pipeline(chip))
            X, y = nan_policy.apply(X, y)
            buf_X.append(X); buf_y.append(y)
            while sum(len(b) for b in buf_X) >= chunk_size:
                Xa, ya = np.concatenate(buf_X), np.concatenate(buf_y)
                estimator.partial_fit(Xa[:chunk_size], ya[:chunk_size], classes=classes)
                buf_X, buf_y = [Xa[chunk_size:]], [ya[chunk_size:]]   # keep remainder
        # flush the final partial chunk
        if buf_X and sum(len(b) for b in buf_X):
            estimator.partial_fit(np.concatenate(buf_X), np.concatenate(buf_y), classes=classes)
    return estimator

Notes.


Cross-cutting observations

A few patterns recur across these examples worth naming explicitly:

Adapter parity. The fit-time adapter and the inference-time adapter must match. The most common silent bug — using ToTemporalPixelMajor(...="features") to fit and ToPixelMajor() to predict, or different time_handling modes between fit and serve — produces shape-mismatched feature vectors that sklearn either rejects (best case) or silently predicts garbage on. Always pass the adapter as a named variable shared between fit and predict pipelines.
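The shape contract behind adapter parity can be pinned down with plain numpy. The helper below is a sketch of what a ToTemporalPixelMajor(time_handling="features") adapter plausibly does; the exact column ordering of the real adapter is an assumption:

```python
import numpy as np

C, T, H, W = 3, 4, 8, 8
x = np.arange(C * T * H * W, dtype=float).reshape(C, T, H, W)

def to_temporal_pixel_major(x):
    """(C, T, H, W) -> (H*W, T*C): one row per pixel, time folded into features."""
    c, t, h, w = x.shape
    return x.transpose(2, 3, 1, 0).reshape(h * w, t * c)

X = to_temporal_pixel_major(x)

# Parity check: fit-time and predict-time adapters must produce this same
# layout, otherwise feature columns are permuted between training and serving.
assert X.shape == (H * W, T * C)
assert X[0, 1] == x[1, 0, 0, 0]   # pixel (0,0), timestep 0, band 1 -> column 1
```

A mismatched adapter pair that still happens to produce (H*W, T*C) matrices will pass every shape check and fail only in prediction quality, which is why sharing one adapter object is the safer convention.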

NaN policy asymmetry between fit and predict. Fit-time policies usually drop NaN rows because you can’t learn from incomplete labels. Predict-time policies usually mask NaN rows because the output must be a complete raster with NaN written at masked positions. Defaults reflect this; verify when changing them.

Pre-fitted estimator as the artifact. Across all ten examples, the estimator pickle is the durable artifact, not the operator graph. The graph is the instructions for how to apply the estimator; the estimator is the content. Both need to land in the regulatory-artifact bundle for full reproducibility, but their lifecycles differ — the graph YAML may change across deployments while the estimator stays pinned.

Feature pipeline shared across fit and predict. The same feature_pipeline Operator object appears in both paths in every example. This is the train/serve skew prevention from the use-cases doc — when it is violated, models silently degrade in production.

The right patch strategy is decided once per workflow. Mixing strategies (GeoCatalog for training, xrpatcher for inference, ApplyToChips for tile serving) is fine as long as the chip operator is identical. The patch strategy determines I/O performance and scalability; the operator graph determines correctness. Keep them orthogonal.