Scope: adding a DuckDB-backed catalog backend, GeoParquet-as-artifact, and SQL-native cross-catalog operations to the `jejjohnson/georeader` catalog module proposed in Phase 1. Status: design proposal, no code changed yet. Assumes Phase 1 (the in-memory `GeoCatalog`, builders, loaders, samplers, and `to_geoparquet`/`from_geoparquet` round-trip) has shipped.
## 0. What’s in place from Phase 1
The Phase 1 report proposed and (assume) shipped:
| Component | Provides |
|---|---|
| `GeoCatalog` protocol | one shape across raster / xarray / vector |
| `InMemoryGeoCatalog` | Phase 1 implementation: `gpd.GeoDataFrame` with `IntervalIndex` + geometry |
| `build_{raster,xarray,vector}_catalog` | per-backend builders |
| `load_{raster,xarray,vector}` | loaders that return `GeoTensor` / `xr.Dataset` |
| `random_sampler`, `grid_sampler`, `stitch` | ML glue |
| `GeoSlice` | unit of work passed between samplers and loaders |
| `query`, `intersect`, `union` | set algebra over catalogs |
| `to_geoparquet` / `from_geoparquet` | round-trip the catalog as a portable artifact |
What Phase 1 cannot do well:
- Catalogs much beyond ~10⁵ rows: `gpd.overlay` becomes painful, the gdf eats RAM.
- Sharing without re-running the build step: a pickled gdf is fragile across versions.
- Querying a catalog without loading the whole thing into Python.
- Catalog analytics (“monthly file count by tile, weighted by useful area inside this AOI”).
- Streaming a catalog during construction so 10M files don’t sit in RAM.
Phase 2 fills these gaps by adding a second backend behind the same GeoCatalog protocol.
## 1. Inventory: what DuckDB brings
DuckDB is an in-process columnar OLAP engine with a spatial extension that gives us everything we need at the index layer.
### 1.1 Core engine
- Single-binary, in-process (no server). ~30 MB.
- Native Parquet read/write with predicate and projection pushdown driven by row-group min/max statistics.
- Reads transparently from local filesystem, S3, GCS, Azure, HTTPS, and HuggingFace via the `httpfs` extension.
- Arrow-native: zero-copy to `pyarrow`, then to `pandas`/`polars`/`geopandas`.
- Multi-threaded execution; vectorized columnar engine.
- A persistent file format (`.duckdb`) for storing tables + indexes when needed.
### 1.2 Spatial extension
GEOS-backed, GeoParquet 1.x compatible. The functions Phase 2 leans on:
| Category | Functions used |
|---|---|
| Constructors | ST_MakeEnvelope, ST_GeomFromText, ST_GeomFromWKB, ST_AsWKB |
| Predicates | ST_Intersects, ST_Contains, ST_Within, ST_Overlaps |
| Set ops | ST_Intersection, ST_Union, ST_Difference |
| Measures | ST_Area, ST_Centroid, ST_Distance |
| CRS | ST_Transform (requires the spatial extension’s bundled PROJ data) |
| Indexing | CREATE INDEX ... USING RTREE (on .duckdb table files only) |
Spatial joins build an in-memory R-tree on the smaller side per query. Persistent on-disk R-trees are limited to materialized DuckDB tables, not Parquet — see §7.
### 1.3 GeoParquet support
- Reads any GeoParquet 1.0/1.1 file produced by `geopandas.to_parquet` (or by DuckDB itself).
- Writes via `COPY ... TO 'file.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')` — the schema includes WKB geometry + GeoParquet-style column metadata when the spatial extension is loaded.
- GeoParquet 1.1 introduces a per-row `bbox` covering struct that lets the engine push spatial filters down to the row-group level without parsing WKB. For 10⁷-row catalogs this is roughly an order-of-magnitude speedup on selective queries.
⚠️ The `spatial` extension’s GeoParquet write path is more recent than the read path. Pin a known-good DuckDB version (≥ 1.1) and add a smoke test that round-trips a small catalog.
## 2. User story
Persona shift. Phase 1’s user has hundreds to thousands of files in one project. Phase 2’s user has too many to load eagerly, or wants to publish a catalog so a collaborator (or a CI job, or a notebook on a different machine) can query it without rebuilding.
Goal arc:

- “I built a 1M-row catalog yesterday. Reload it instantly.” → `open_catalog("cat.parquet")` (no `build_*_catalog` needed).
- “My collaborator wants to query my catalog from her laptop.” → I push to `s3://bucket/cat.parquet`; she runs `open_catalog("s3://bucket/cat.parquet")`. Only the row groups her query touches get downloaded.
- “Intersect 200k imagery rows with 500k label rows.” → SQL spatial join, parallel, minutes instead of an OOM.
- “How does my coverage break down by month, by tile, by cloud %?” → one SQL query against the catalog, no loop.
- “Stream candidate slices into the random sampler without materializing the join.” → DuckDB cursor → `iter_rows()` → sampler.
- “Append today’s new acquisitions to yesterday’s catalog without rebuilding.” → drop a new shard into a directory; `open_catalog("cat_dir/")` reads them as one virtual table.
- “Build the catalog directly into DuckDB during ingest so I never hold 10M rows in RAM.” → `build_raster_catalog(..., backend="duckdb", out_path="cat.parquet")` streams rows to disk.
The framing: DuckDB doesn’t replace Phase 1, it extends the working envelope by 2–3 orders of magnitude, and turns the catalog into a first-class shareable artifact.
## 3. Motivation
### 3.1 The walls Phase 1 hits
| Dimension | Phase 1 ceiling | Why |
|---|---|---|
| Catalog size in RAM | ~10⁵–10⁶ rows | each gdf row is ~1 KB; a 10M-row gdf is ~10 GB |
| `intersect` (overlay) | ~10⁴ × 10⁴ comfortable, 10⁵ × 10⁵ painful | `gpd.overlay` is O(n × m) worst case, single-threaded |
| Build cost | re-run every Python session | nothing is cached; rasterio metadata extraction is 1–10 ms/file |
| Sharing | pickled gdfs are fragile across versions | no canonical interchange format |
| Analytics | Python loops over the gdf | aggregations at scale need a query engine |
| Remote | “rsync the whole archive” | no lazy/partial access |
### 3.2 What DuckDB unlocks
| Need | DuckDB mechanism |
|---|---|
| Scale past memory | columnar parquet + row-group pushdown; nothing materializes until you ask |
| Persistence | GeoParquet is the artifact; the catalog is now a file you can version, hash, sign |
| Sharing | httpfs extension reads S3/GCS/HTTPS GeoParquet; predicate pushdown means low egress |
| Fast joins | spatial-join operator with parallel R-tree probe; band-join on temporal intervals |
| Analytics | full SQL surface |
| Streaming build | INSERT ... SELECT from a Python iterator into a DuckDB-managed Parquet writer |
### 3.3 Compared with alternatives
- **STAC + pgSTAC / stac-fastapi.** Heavier: needs Postgres, a schema, an HTTP service. Wins on standardization and federated discovery. Loses on dev velocity for single-user / single-team workflows.
- **PostGIS + GeoAlchemy.** Real database. More features (transactions, advanced indexes, triggers). Far more setup. Best when you already run Postgres for other reasons.
- **xstac + intake-stac.** Library-only, no server. But STAC items are JSON-per-file, so a 1M-item catalog is 1M JSON files or one giant blob — both worse than one Parquet file for the queries we care about.
- **Plain `gpd.read_parquet` + Python.** What Phase 1 already gives via `from_geoparquet`. Fine until you need joins or analytics at scale.
The unique slot DuckDB occupies: zero-server, single-file, parallel, SQL, with native Parquet + spatial. Nothing else hits all five.
### 3.4 Primer for newcomers
ELI5. GeoParquet is like a phone book for spatial data — alphabetised columns of data with bbox info on every page. DuckDB is the librarian who can read the bbox info to skip 99% of pages when you ask “who’s in Madagascar?” — without ever opening those pages. The combination scales to millions of files because it never reads what it doesn’t need.
#### DuckDB: an embedded SQL database
What it is. DuckDB is a SQL database that runs in-process — no server, no daemon.
Think SQLite, but column-oriented and tuned for analytics (OLAP) rather than transactional workloads.
Open-source, fast, with a spatial extension that adds GIS functions.
How it works. `import duckdb; con = duckdb.connect()` gives you a SQL engine in your Python process.
Queries run against tables (in-memory), Parquet files (read directly without import), or a persistent on-disk database.
The spatial extension adds `ST_Intersects`, `ST_Contains`, `GEOMETRY` types — basically PostGIS in an embedded engine.
What this means for us. Phase 2 stores catalogs as GeoParquet on disk (or in cloud storage); DuckDB reads them lazily.
Queries are vanilla SQL: `SELECT path FROM catalog WHERE ST_Intersects(geometry, AOI) AND date BETWEEN '2023-01-01' AND '2023-12-31'`.
No daemon, no schema migrations, no driver dance — it’s a Python import.
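For concreteness, a minimal sketch of the moving parts (file name and AOI are placeholders; assumes a recent DuckDB ≥ 1.1 with the spatial extension, which reads the GeoParquet geometry column as `GEOMETRY`):

```python
import duckdb

con = duckdb.connect()  # in-process engine; no server, no daemon
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("INSTALL httpfs; LOAD httpfs;")  # only needed for s3:// / https:// sources

# Query a GeoParquet file directly -- no import step, no schema migration
df = con.execute(
    """
    SELECT filepath
    FROM 'cat.parquet'
    WHERE ST_Intersects(geometry, ST_GeomFromText($aoi))
      AND tmin BETWEEN '2023-01-01' AND '2023-12-31'
    """,
    {"aoi": "POLYGON((-3.8 40.3, -3.6 40.3, -3.6 40.5, -3.8 40.5, -3.8 40.3))"},
).df()
```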
#### GeoParquet 1.1 + bbox-column predicate pushdown
What it is. GeoParquet is a spec for storing geometry columns inside Apache Parquet files.
Predicate pushdown is the database-engine optimisation that lets a query skip reading rows that obviously can’t match. GeoParquet 1.1 adds an optional bbox column that makes spatial predicate pushdown work for cloud-hosted Parquet.
How it works. Parquet stores data in row-groups (typically ~10⁵ rows each), each with min/max statistics per column.
A query like `WHERE date > '2023-06-01'` skips row-groups whose `date` max is < 2023-06-01 without reading them.
Adding a per-row bbox column (four floats: xmin, ymin, xmax, ymax) extends this to spatial predicates: `ST_Intersects(geometry, AOI)` translates to `bbox.xmin < AOI.xmax AND bbox.xmax > AOI.xmin AND ...`, which uses Parquet’s column statistics.
Now a 1M-row catalog answers a small-AOI query by reading ~10⁵ rows of bbox data, not all 1M geometries.
What this means for us. A Phase 2 catalog hosted on S3 can be queried with sub-second latency for the typical AOI-bounded workload, even at multi-million row scales — without downloading the full catalog. The catalog itself becomes a queryable cloud artifact, not a thing you download to use.
#### Lazy queries / cursors
What it is. A cursor is a query handle that yields rows on demand, rather than materialising the whole result set up front.
Standard database concept; the difference between `SELECT * FROM big_table` returning 10M rows in memory vs streaming them.
How it works. DuckDB defers execution by default — `con.sql("SELECT ...")` returns a relation object that hasn’t fetched anything yet.
`.fetchall()` materialises; `.fetchmany(N)` streams N at a time; `.arrow()` returns an Arrow stream that downstream consumers (sampler, loader) can iterate without ever holding the full result.
What this means for us. A random_sampler over a 10M-row catalog doesn’t need to load 10M rows into Python — it iterates the cursor and reservoir-samples without ever materialising.
Same for grid_sampler walking tiles in a continent-scale catalog.
The streaming cursor is what makes Phase 2 the right tool above 10⁵ rows.
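A sketch of that pattern using DuckDB’s Arrow record-batch reader (`fetch_record_batch` is the real DuckDB Python API; the filter is illustrative):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

result = con.execute("SELECT * FROM 'cat.parquet' WHERE cloud_pct < 10")
reader = result.fetch_record_batch(10_000)  # pyarrow.RecordBatchReader

seen = 0
for batch in reader:        # one Arrow batch in memory at a time, never the full result
    seen += batch.num_rows  # a sampler would consume rows here
```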
#### Streaming construction
What it is. Building a catalog one shard at a time without ever holding the full catalog in memory — needed when the catalog is too big to fit (e.g., the full Sentinel-2 archive is ~10⁷ scenes).
How it works. Two patterns. Append-to-Parquet: write rows in chunks to a growing Parquet file; the writer holds only the current chunk. DuckDB INSERT INTO: create a table, insert rows in chunks, export to Parquet at the end; DuckDB handles its own paging. The Phase 2 builders use the first pattern (append-to-Parquet) because it doesn’t require DuckDB during construction — just a Parquet writer.
What this means for us. Catalog builders that scan millions of files don’t OOM partway through. Each shard is a small append; the final Parquet file is what users query against. Without streaming, Phase 2’s “10⁶+ row catalog” claim wouldn’t survive contact with reality at the build step.
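A minimal sketch of the append-to-Parquet pattern with `pyarrow` (schema abbreviated; attaching GeoParquet column metadata to it is omitted here):

```python
import pyarrow as pa
import pyarrow.parquet as pq

SCHEMA = pa.schema([
    ("filepath", pa.string()),
    ("tmin", pa.timestamp("us")),
    ("tmax", pa.timestamp("us")),
    ("geometry", pa.binary()),  # WKB
])

def write_streaming(chunks, out_path: str = "cat.parquet") -> None:
    """chunks yields lists of row dicts; only one chunk is in memory at a time."""
    with pq.ParquetWriter(out_path, SCHEMA, compression="zstd") as writer:
        for chunk in chunks:
            writer.write_table(pa.Table.from_pylist(chunk, schema=SCHEMA))
```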
## 4. Mathematics
### 4.1 Predicate pushdown via Parquet zone maps
A Parquet file is a sequence of row groups; each row group has, per column, min/max statistics. GeoParquet adds a per-row-group geometry bbox (and, in 1.1, an optional per-row bbox column).
For a query

```sql
WHERE tmax >= $T_lo AND tmin <= $T_hi
  AND ST_Intersects(geometry, ST_MakeEnvelope($x0, $y0, $x1, $y1))
```

a row group is pruned without reading any of its data when any of these holds:

- the group’s max(`tmax`) < `$T_lo` (every row ends before the query window opens),
- the group’s min(`tmin`) > `$T_hi` (every row starts after the window closes),
- the group’s geometry bbox does not overlap the query envelope.
If the catalog is sorted by tmin (or, even better, written via a Hilbert/R-curve sort on geometry centroids), the fraction of row groups read for a narrow query approaches the fraction of rows that match. For a 10⁷-row catalog with 10k rows per group and a 0.1%-selective query, expect to read ~10 row groups, not 1000.
Practical recipe: sort the catalog by `(tmin, ST_Hilbert(centroid))` before writing.
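That recipe in DuckDB SQL might look like the following (`ST_Hilbert` exists in the spatial extension; the exact bounds plumbing is an assumption to verify against your extension version):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
# Rewrite the catalog sorted by (tmin, hilbert(centroid)) so spatially and
# temporally adjacent rows land in the same row groups.
con.execute("""
    COPY (
        SELECT *
        FROM 'raw_catalog.parquet'
        ORDER BY tmin,
                 ST_Hilbert(ST_Centroid(geometry),
                            ST_Extent(ST_MakeEnvelope(-180, -90, 180, 90)))
    ) TO 'cat.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')
""")
```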
### 4.2 Spatial-join algorithm
DuckDB’s spatial join builds an R-tree on the smaller relation (m rows) and probes it with the larger one (n rows), for a cost of roughly

O(m log m + n log m + k)

where k is the candidate match count after bbox filtering, refined by the exact `ST_Intersects` predicate.
Compare with `gpd.overlay`’s effective O(n + m + n × m′) (R-tree filter then per-pair intersection check), which dominates for dense overlap.
Empirically, on the same machine: a 200 k × 500 k spatial join takes minutes in DuckDB, hours in geopandas, and OOMs at 1M × 1M in geopandas while completing in DuckDB.
### 4.3 Temporal band-join
The interval-overlap predicate `a.tmax >= b.tmin AND a.tmin <= b.tmax` is recognized by DuckDB’s optimizer as a range join (a.k.a. band join).
The algorithm sorts both sides by `tmin`, then sweeps:

```
sort A by tmin, B by tmin
maintain a sliding window W of B-rows whose tmax >= current A.tmin
for each a in A:
    advance W: drop b where b.tmax < a.tmin; add b where b.tmin <= a.tmax
    emit (a, b) for b in W
```

Cost: O(n log n + m log m + k), where k is the number of emitted pairs. No nested-loop blow-up.
### 4.4 Combined spatiotemporal join cost
For a query that joins on both space and time, DuckDB’s planner picks the more selective predicate first based on column statistics. Typical plan:

1. Spatial join via R-tree (filters out 99%+ of pairs in most workloads).
2. Apply the temporal predicate as a post-filter.
3. Compute `ST_Intersection` only for surviving pairs.
You can verify the chosen plan with `EXPLAIN`.
The win versus Phase 1 is 2–3 orders of magnitude on real-world catalog sizes.
### 4.5 GeoParquet bbox column (1.1)
GeoParquet 1.1 introduces an optional per-row covering struct:

```
bbox: STRUCT<xmin DOUBLE, ymin DOUBLE, xmax DOUBLE, ymax DOUBLE>
```

For a spatial filter, DuckDB can evaluate

```sql
WHERE bbox.xmax >= $x0 AND bbox.xmin <= $x1
  AND bbox.ymax >= $y0 AND bbox.ymin <= $y1
  AND ST_Intersects(geometry, ST_MakeEnvelope($x0, $y0, $x1, $y1))
```

The first four conjuncts are pure double comparisons — no WKB parsing, fully vectorized.
The exact ST_Intersects runs only on rows that pass the bbox filter.
For very selective queries this is roughly 10× faster than calling ST_Intersects directly on every row.
The catalog writer should always emit the bbox column.
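On the geopandas side, recent versions (1.0+) can emit this directly; a sketch, assuming the `schema_version` and `write_covering_bbox` arguments of `GeoDataFrame.to_parquet`:

```python
import geopandas as gpd

gdf = gpd.read_parquet("cat_no_bbox.parquet")
gdf.to_parquet(
    "cat.parquet",
    schema_version="1.1.0",    # GeoParquet 1.1 metadata
    write_covering_bbox=True,  # per-row bbox struct -> row-group pushdown
    compression="zstd",
)
```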
### 4.6 Streaming build cost model
Phase 1’s `build_raster_catalog` materializes the whole gdf.
For 10M files at ~1 KB/row, that’s 10 GB plus Python overhead.
The streaming variant batches B rows into Arrow record batches and appends to a Parquet writer:

peak RAM ≈ B × (bytes per row), independent of total file count.

For B = 10 000 and ~1 KB/row, peak is ~10 MB plus rasterio’s per-file overhead (tens of MB).
Build time is dominated by `rasterio.open` calls, which can be parallelized with a `ThreadPoolExecutor` (GIL-friendly because rasterio releases the GIL during I/O).
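A sketch of that shape: parallel extraction feeding a single bounded-memory writer (`extract_meta` is a hypothetical stand-in for the rasterio metadata probe):

```python
from concurrent.futures import ThreadPoolExecutor
import pyarrow as pa
import pyarrow.parquet as pq

def extract_meta(path: str) -> dict:
    """Hypothetical: open with rasterio, return bounds/dates/dtype as a flat dict."""
    ...

def build_streaming(paths, schema: pa.Schema, out_path: str,
                    batch_size: int = 10_000, n_workers: int = 8) -> None:
    with pq.ParquetWriter(out_path, schema, compression="zstd") as writer, \
         ThreadPoolExecutor(max_workers=n_workers) as pool:
        batch = []
        for meta in pool.map(extract_meta, paths):  # parallel, I/O-bound
            batch.append(meta)
            if len(batch) >= batch_size:  # single writer; peak RAM ~ one batch
                writer.write_table(pa.Table.from_pylist(batch, schema=schema))
                batch = []
        if batch:
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
```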
## 5. Coupling with Phase 1
The whole point of the Phase 1 protocol design was to make Phase 2 transparent. Here’s the seam:
| Operation | InMemoryGeoCatalog (Phase 1) | DuckDBGeoCatalog (Phase 2) |
|---|---|---|
| `query(slice)` | `gdf.cx[...]` + `IntervalIndex.overlaps` | SQL `WHERE` with predicate pushdown |
| `intersect(other)` | `gpd.overlay` | SQL `JOIN ... ON ST_Intersects(...) AND <interval-overlap>` |
| `union(other)` | `pd.concat` | SQL `UNION ALL` |
| `iter_rows()` | `gdf.itertuples()` | Arrow batch iterator over the relation |
| `to_geoparquet(path)` | `gdf.to_parquet(...)` | `COPY ... TO ... (FORMAT 'parquet')` |
| `materialize()` | identity | execute the relation, return `InMemoryGeoCatalog` |
| `__len__` | `len(gdf)` | `SELECT COUNT(*)` (cached) |
| Backend-specific escape hatch | `.gdf` (full geopandas surface) | `.relation` (full SQL surface) |
Loaders unchanged. `load_raster(catalog, slice, ...)` only needs `catalog.iter_rows()` (or a single `query` + `iter_rows`) — it never touches the underlying gdf or relation.
Samplers need the smallest tweak: replace `catalog.gdf.iloc[i]` patterns with `catalog.iter_rows()` or `catalog.query(slice).iter_rows()`.
Once that’s done, samplers work identically against either backend.
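The tweak, sketched (names follow the protocol surface described in §6):

```python
# Phase 1 pattern: backend-specific, reaches into the gdf
row = catalog.gdf.iloc[i]

# Phase 2-compatible pattern: identical on both backends
for row in catalog.query(geoslice).iter_rows():
    ...  # row is a CatalogRow either way
```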
`stitch` unchanged. It only consumes `GeoSlice` + `np.ndarray`, no catalog.
The catalog is the only seam. Get the protocol right and Phase 2 is purely additive.
## 6. Proposed API surface
### 6.1 Module layout
```
georeader/
├── catalog/
│   ├── __init__.py
│   ├── base.py            # GeoCatalog protocol, CatalogRow, GeoSlice (re-export)
│   ├── memory.py          # InMemoryGeoCatalog (Phase 1)
│   ├── duckdb_backend.py  # DuckDBGeoCatalog (Phase 2) [extra: duckdb]
│   ├── parquet.py         # GeoParquet read/write helpers
│   ├── streaming.py       # StreamingParquetWriter for direct-to-disk builds
│   └── ops.py             # query / intersect / union dispatch
└── cli/
    └── catalog.py         # `georeader catalog ...` subcommands
```

### 6.2 Protocol additions
The Phase 1 GeoCatalog protocol stays mostly intact.
Phase 2 adds:
```python
class GeoCatalog(Protocol):
    # ... Phase 1 surface ...

    # NEW: explicit construction from a parquet artifact (works for both backends)
    @classmethod
    def open(
        cls,
        source: str | Path | list[str | Path],
        *,
        backend: Literal["memory", "duckdb"] = "duckdb",
        table: str = "files",
    ) -> "GeoCatalog": ...

    # NEW: lazy / eager bridge (no-op on InMemory; executes on DuckDB)
    def materialize(self) -> "InMemoryGeoCatalog": ...

    # NEW: structured row iteration that both backends implement
    def iter_rows(self, *, batch_size: int = 1024) -> Iterator["CatalogRow"]: ...
```

Note `open` is a factory: it returns whichever backend the caller asked for, defaulting to DuckDB once Phase 2 ships because the lazy backend is almost always the right choice for an artifact-on-disk.
### 6.3 CatalogRow
```python
@dataclass(frozen=True)
class CatalogRow:
    filepath: str
    geometry: shapely.geometry.base.BaseGeometry  # decoded from WKB
    interval: pd.Interval                         # built from (tmin, tmax)
    crs: pyproj.CRS
    extras: dict[str, Any]  # backend-specific extras (data_vars, layer, ...)
```

Loaders consume `CatalogRow`.
Whether the row came from a gdf or a DuckDB cursor is invisible.
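For illustration, a hedged sketch of the decode step (the helper name and the fixed core-column set are assumptions, not part of the proposed surface):

```python
import pandas as pd
import pyproj
import shapely

_CORE = {"filepath", "geometry", "tmin", "tmax", "crs", "bbox"}

def row_from_record(rec: dict) -> CatalogRow:
    """Decode one DuckDB/Arrow record (as a plain dict) into a CatalogRow."""
    return CatalogRow(
        filepath=rec["filepath"],
        geometry=shapely.from_wkb(rec["geometry"]),
        interval=pd.Interval(pd.Timestamp(rec["tmin"]),
                             pd.Timestamp(rec["tmax"]), closed="both"),
        crs=pyproj.CRS.from_user_input(rec["crs"]),
        extras={k: v for k, v in rec.items() if k not in _CORE},
    )
```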
### 6.4 DuckDBGeoCatalog
```python
class DuckDBGeoCatalog(GeoCatalog):
    """Lazy, SQL-backed catalog. Operations return new relations; nothing
    runs until iter_rows() / materialize() / aggregate() is called."""

    relation: duckdb.DuckDBPyRelation
    con: duckdb.DuckDBPyConnection
    backend: Literal["raster", "xarray", "vector"]
    crs: pyproj.CRS

    @classmethod
    def open(cls, source: str | Path | list[str | Path],
             *, table: str = "files") -> "DuckDBGeoCatalog":
        """Open a parquet file, a list of parquet files, a directory of
        parquet shards, or a .duckdb file. Local or remote URIs."""

    @classmethod
    def from_memory(cls, mem: InMemoryGeoCatalog) -> "DuckDBGeoCatalog":
        """Register an in-memory catalog as a DuckDB view."""

    # Set algebra — all return DuckDBGeoCatalog (still lazy)
    def query(self, slice: GeoSlice, *, t_step: int | None = None) -> "DuckDBGeoCatalog": ...
    def intersect(self, other: "GeoCatalog", *, spatial_only: bool = False) -> "DuckDBGeoCatalog": ...
    def union(self, other: "GeoCatalog") -> "DuckDBGeoCatalog": ...

    # Persistence
    def to_geoparquet(self, path: str | Path, *,
                      partition_by: list[str] | None = None,
                      sort_by: list[str] | None = None,
                      compression: str = "zstd") -> None: ...
    def to_duckdb_file(self, path: str | Path, *,
                       with_rtree: bool = True) -> None: ...

    # Analytics escape hatch — return a DataFrame (eager)
    def aggregate(self, sql: str, **params) -> pd.DataFrame: ...
    def explain(self) -> str: ...  # debug

    # Underlying surface
    def sql(self, sql: str, **params) -> "DuckDBGeoCatalog": ...  # arbitrary SQL filter
```

### 6.5 Builders gain a `backend=` parameter
```python
def build_raster_catalog(
    filepaths: Sequence[str | Path],
    *,
    filename_regex: str,
    date_format: str = "%Y%m%d",
    target_crs: CRS | str | None = None,
    target_resolution: tuple[float, float] | None = None,
    # NEW
    backend: Literal["memory", "duckdb"] = "memory",
    out_path: str | Path | None = None,  # required if backend="duckdb"
    write_bbox: bool = True,             # GeoParquet 1.1 bbox column
    sort_by: tuple[str, ...] = ("tmin", "geometry_hilbert"),
    n_workers: int = 1,
) -> GeoCatalog: ...
```

When `backend="duckdb"`, the builder streams metadata into an Arrow record-batch writer and lands directly in `out_path` as GeoParquet.
The returned object is a DuckDBGeoCatalog opened on that file.
Memory stays bounded regardless of file count.
Same surface for build_xarray_catalog and build_vector_catalog.
### 6.6 GeoParquet schema (canonical)
```sql
filepath        VARCHAR NOT NULL
tmin            TIMESTAMP NULL       -- NULL means "no date known"
tmax            TIMESTAMP NULL
geometry        BLOB (WKB) NOT NULL  -- GeoParquet column metadata = CRS + encoding
bbox            STRUCT NOT NULL      -- xmin, ymin, xmax, ymax (GeoParquet 1.1)
crs             VARCHAR NOT NULL     -- usually one value per file; per-row allowed
backend         VARCHAR NOT NULL     -- 'raster' | 'xarray' | 'vector'
-- raster-specific (NULL for other backends)
n_bands         INTEGER NULL
dtype           VARCHAR NULL
nodata          DOUBLE NULL
-- xarray-specific
data_vars       VARCHAR[] NULL
time_var        VARCHAR NULL
n_timesteps     INTEGER NULL
-- vector-specific
layer           VARCHAR NULL
-- optional analytics columns (recommended)
area_m2         DOUBLE NULL
cloud_pct       DOUBLE NULL
sensor          VARCHAR NULL
processing_ver  VARCHAR NULL
```

Per-file extras land in additional columns (typed when the schema knows them, JSON-encoded otherwise).
### 6.7 CLI
```console
$ georeader catalog inspect cat.parquet
$ georeader catalog stats cat.parquet --by month --by sensor
$ georeader catalog query s3://bucket/cat.parquet \
    --aoi AOI.geojson --tmin 2023-06-01 --tmax 2023-06-30 \
    --out subset.parquet
$ georeader catalog intersect imagery.parquet labels.parquet --out joint.parquet
$ georeader catalog union l7.parquet l8.parquet --out landsat.parquet
$ georeader catalog optimize cat.parquet --sort tmin,hilbert --bbox --rewrite
```

The CLI is a thin wrapper around `DuckDBGeoCatalog`; everything it does is callable from Python too.
### 6.8 What I’d not add in Phase 2
- A query DSL that abstracts over pandas vs SQL. Just expose `.gdf` and `.relation` as escape hatches.
- A persistent server (HTTP API). That’s STAC’s job.
- Auto-conversion between CRSs at query time. Catalogs should be written in one CRS; `ST_Transform` is available but slow and a footgun.
- Materialized R-tree indexes on Parquet. They don’t exist; if you need persistent indexes, write to `.duckdb` (see §7).
## 7. Sharp edges
- **GeoParquet 1.0 vs 1.1.** 1.1 adds the per-row bbox column — an order-of-magnitude speedup for spatial filters at scale. Default to writing 1.1; provide a `--geoparquet-version=1.0` opt-out for compatibility with older readers.
- **DuckDB spatial extension version pinning.** APIs are stable but evolving. Pin a minimum DuckDB version (e.g. ≥ 1.1) and document the install (`INSTALL spatial; LOAD spatial; INSTALL httpfs; LOAD httpfs;`).
- **No persistent R-tree on Parquet.** Persistent R-tree indexes only exist in `.duckdb` table files. If a catalog will be queried thousands of times with the same predicate shapes, materialize to `.duckdb`: `CREATE TABLE files AS SELECT * FROM 'cat.parquet'; CREATE INDEX idx_geom ON files USING RTREE (geometry);` Otherwise, rely on bbox + zone-map pushdown — usually fast enough.
- **Multi-CRS catalogs.** Same problem as Phase 1, more visible because SQL doesn’t auto-reproject. Strong recommendation: write all GeoParquet in EPSG:4326, store the native CRS in a column, reproject on load. `ST_Transform` works but requires PROJ data and is slow per call.
- **NULL temporal sentinels.** Don’t write `pd.Timestamp.min`/`.max` — they overflow `TIMESTAMP`. Write `NULL` and require an explicit `OR tmin IS NULL` for queries that should include date-less files.
- **Cursor lifecycle on `iter_rows()`.** The DuckDB connection must outlive the iterator. Use a context manager: `with catalog.cursor() as rows: for row in rows: ...` Document that storing the iterator beyond the `with` block is unsupported.
- **WKB → shapely deserialization cost.** Per-row `shapely.from_wkb` is fine at 10⁴ rows/s, painful at 10⁶. For tight inner loops, batch with `shapely.from_wkb(arr)` (vectorized in shapely 2.x).
- **Sampler weighting on lazy catalogs.** Phase 1’s area-weighted random sampler (§4.5 of the Phase 1 report) needs all candidate areas. With DuckDB, either (a) maintain a precomputed `area_m2` column at build time (recommended), (b) use `USING SAMPLE 10%` to estimate, or (c) stream the area column once and cache it.
- **Parquet schema evolution.** Parquet is immutable: add a column → rewrite the file. Mitigate by storing the catalog as a directory of dated shards with a consistent schema; DuckDB reads them as one virtual table via `read_parquet('shards/*.parquet')`. Document the canonical schema and bump a version column.
- **Authentication for remote.** `httpfs` picks up env vars, `~/.aws/credentials`, and GCS service accounts automatically. Document this and provide one-line recipes per provider. Surface clear errors when auth fails (DuckDB’s default messages are cryptic).
- **Streaming builder + multiprocessing.** When `n_workers > 1`, the rasterio metadata extraction parallelizes well, but Arrow record-batch writes must serialize. Use a single writer thread fed by a queue. Don’t try to parallelize the Parquet writer itself.
- **Connection-per-thread.** DuckDB connections are not thread-safe; use `con.cursor()` in worker threads or open a fresh connection per process. `DuckDBGeoCatalog` should expose a `.cursor()` method, not pass `con` around.
## 8. End-to-end examples
A varied gallery, organized by what’s new in Phase 2 — persistence, analytics, scale, federation, and lazy ML pipelines. All examples assume the Phase 1 surface exists.
| § | Example | Pattern |
|---|---|---|
| 8.1 | A. Save and reload a catalog locally | round-trip via GeoParquet |
| 8.1 | B. Publish a catalog to S3, query remotely | httpfs + predicate pushdown |
| 8.2 | C. Coverage histogram by month | aggregate() SQL |
| 8.2 | D. Per-tile statistics | grouped SQL |
| 8.2 | E. Custom SQL filter (low cloud, recent) | .sql() escape hatch |
| 8.3 | F. Million-row spatial join | DuckDB intersect at scale |
| 8.3 | G. Streaming build directly into GeoParquet | backend="duckdb" builder |
| 8.4 | H. Federated query across monthly shards | one virtual table from many files |
| 8.4 | I. Mix raster + vector catalogs in one SQL view | union via SQL |
| 8.4 | J. Cross-CRS catalog join with ST_Transform | when harmonization isn’t possible |
| 8.5 | K. Lazy iterator from filtered catalog → sampler | DuckDB cursor → random_sampler |
| 8.5 | L. Stratified sampling via SQL window functions | balanced training set |
| 8.5 | M. Active learning with persistent uncertainty column | round-trip uncertainty back into the catalog |
### 8.1 Persistence and sharing
#### A. Save and reload a catalog locally
```python
from glob import glob

from georeader.catalog import build_raster_catalog, open_catalog

catalog = build_raster_catalog(
    filepaths=glob("/data/s2/T29SND_*.tif"),
    filename_regex=r"T29SND_(?P<date>\d{8})_.*\.tif",
    target_crs="EPSG:4326",
    target_resolution=(0.0001, 0.0001),
)
catalog.to_geoparquet("/cat/s2_T29SND.parquet", sort_by=["tmin", "geometry_hilbert"])

# Tomorrow, in a fresh process:
catalog = open_catalog("/cat/s2_T29SND.parquet")  # DuckDBGeoCatalog (lazy)
print(len(catalog), catalog.total_bounds, catalog.temporal_extent)
```

#### B. Publish a catalog to S3, query remotely
```python
import pandas as pd

from georeader.catalog import open_catalog, GeoSlice

# Producer
catalog.to_geoparquet("s3://my-bucket/catalogs/s2_eu_2023.parquet")

# Consumer, on a different machine:
remote = open_catalog("s3://my-bucket/catalogs/s2_eu_2023.parquet")
aoi = GeoSlice(
    bounds=(-3.8, 40.3, -3.6, 40.5),
    interval=pd.Interval(pd.Timestamp("2023-06-01"), pd.Timestamp("2023-06-30"), closed="both"),
    resolution=(0.0001, 0.0001),
    crs="EPSG:4326",
)
sub = remote.query(aoi).materialize()  # only relevant row groups downloaded
print(f"Matched {len(sub)} files; downloaded ~{sub.bytes_read_estimate} bytes from S3.")
```

### 8.2 Catalog analytics
#### C. Coverage histogram by month
```python
from georeader.catalog import open_catalog

catalog = open_catalog("/cat/s2_T29SND.parquet")
df = catalog.aggregate("""
    SELECT
        date_trunc('month', tmin) AS month,
        COUNT(*)                  AS n_files,
        AVG(cloud_pct)            AS mean_cloud,
        SUM(area_m2) / 1e6        AS total_area_km2
    FROM files
    GROUP BY 1
    ORDER BY 1
""")
df.plot.bar(x="month", y="n_files")
```

#### D. Per-tile statistics, restricted to an AOI
```python
df = catalog.aggregate("""
    SELECT
        substr(filepath, 12, 6) AS tile_id,
        COUNT(*)                AS n_acquisitions,
        MIN(tmin)               AS first_seen,
        MAX(tmax)               AS last_seen,
        AVG(cloud_pct)          AS mean_cloud
    FROM files
    WHERE ST_Intersects(geometry, ST_GeomFromText($aoi_wkt))
    GROUP BY 1
    ORDER BY n_acquisitions DESC
""", aoi_wkt=aoi_polygon.wkt)
```

#### E. Custom SQL filter (recent, low-cloud, large enough)
```python
fresh = catalog.sql("""
    SELECT * FROM files
    WHERE tmin >= '2023-01-01'
      AND cloud_pct < 10
      AND area_m2 > 1e8
""")  # returns DuckDBGeoCatalog (still lazy)
print(len(fresh))
fresh.to_geoparquet("/cat/s2_T29SND_clean_2023.parquet")
```

### 8.3 Scale operations
#### F. Million-row spatial join (intersect at scale)
```python
imagery = open_catalog("s3://my-bucket/imagery_eu.parquet")  # 1.2M rows
labels  = open_catalog("s3://my-bucket/labels_eu.parquet")   # 0.5M rows

joint = imagery.intersect(labels)  # lazy, no execution yet
joint.to_geoparquet("/cat/eu_imagery_x_labels.parquet",
                    partition_by=["sensor"], sort_by=["tmin"])

# DuckDB plan (printed for transparency):
print(joint.explain())
# │ ► HASH_JOIN ON ST_Intersects(...) AND a.tmax >= b.tmin AND a.tmin <= b.tmax
# │ ├── PARQUET_SCAN imagery_eu.parquet [bbox + tmin/tmax pushdown]
# │ └── PARQUET_SCAN labels_eu.parquet  [bbox + tmin/tmax pushdown]
```

#### G. Streaming build directly into GeoParquet
```python
from georeader.catalog import build_raster_catalog

# 8.5M files; would never fit in a gdf.
catalog = build_raster_catalog(
    filepaths=iter_files_from("/very_large_archive/"),  # any iterable
    filename_regex=FILENAME_REGEX,
    target_crs="EPSG:4326",
    target_resolution=(0.0001, 0.0001),
    backend="duckdb",
    out_path="/cat/global_archive.parquet",
    write_bbox=True,
    sort_by=("tmin", "geometry_hilbert"),
    n_workers=8,  # parallel rasterio metadata extraction
)
print(len(catalog))  # 8_532_104, peak RAM stayed under 1 GB
```

### 8.4 Federation and heterogeneity
#### H. Federated query across monthly shards
```python
# /cat/s2_2023/ contains 12 monthly parquet shards with the same schema
catalog = open_catalog("/cat/s2_2023/")  # DuckDB reads all shards as one table
aoi = GeoSlice(
    bounds=AOI,
    interval=pd.Interval(pd.Timestamp("2023-04-01"), pd.Timestamp("2023-09-30"), closed="both"),
    resolution=(0.0001, 0.0001),
    crs="EPSG:4326",
)
sub = catalog.query(aoi)  # touches only 6 shards (Apr–Sep)
```

Same pattern works for `s3://bucket/s2_2023/*.parquet` or HuggingFace datasets.
#### I. Mix raster and vector catalogs in one SQL view
```python
imagery_cat = open_catalog("/cat/s2.parquet")
labels_cat  = open_catalog("/cat/labels.parquet")

con = imagery_cat.con
con.register("img", imagery_cat.relation)
con.register("lab", labels_cat.relation)

paired = con.sql("""
    SELECT img.filepath AS img_fp,
           lab.filepath AS lab_fp,
           ST_Intersection(img.geometry, lab.geometry) AS geometry,
           GREATEST(img.tmin, lab.tmin) AS tmin,
           LEAST   (img.tmax, lab.tmax) AS tmax
    FROM img
    JOIN lab
      ON ST_Intersects(img.geometry, lab.geometry)
     AND img.tmax >= lab.tmin AND img.tmin <= lab.tmax
    WHERE img.cloud_pct < 5
""")
paired_cat = DuckDBGeoCatalog.from_relation(paired, backend="raster", crs="EPSG:4326")
```

#### J. Cross-CRS catalog join (last resort)
If you really can’t harmonize CRSs at write time, push `ST_Transform` into the join:
```python
joint = catalog.sql("""
    SELECT a.*, b.filepath AS dem_path
    FROM files a
    JOIN 'dem_utm.parquet' b
      ON ST_Intersects(
             a.geometry,
             ST_Transform(b.geometry, 'EPSG:32629', 'EPSG:4326')
         )
""")
```
""")Slower and PROJ-dependent. Document the cost.
### 8.5 Sampler / inference patterns at scale
#### K. Lazy iterator from a filtered catalog into the random sampler
The sampler doesn’t care that the catalog is DuckDB-backed; it just calls `iter_rows()`.
```python
from georeader.catalog import open_catalog
from georeader.samplers import random_sampler

catalog = open_catalog("s3://bucket/imagery_eu.parquet")
clean = catalog.sql("WHERE cloud_pct < 10 AND area_m2 > 1e8")  # still lazy

# random_sampler internally does clean.iter_rows() + reservoir sampling
for sl in random_sampler(clean, chip_size=(256, 256), length=100_000, seed=42):
    x = load_raster(catalog, sl, band_indexes=[2, 3, 4, 8]).values
    train_step(x)
```
train_step(x)DuckDB streams candidate rows; the sampler reservoir-samples without ever materializing the full filtered set.
#### L. Stratified sampling via SQL window functions
Balance a training set across, say, agro-ecological zones:
catalog = open_catalog("/cat/s2_with_zones.parquet") # has a `zone` column
stratified = catalog.sql("""
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY zone ORDER BY random()) AS rn
FROM files
WHERE cloud_pct < 5
)
WHERE rn <= 1000 -- 1000 chips per zone
""")
for sl in random_sampler(stratified, chip_size=(256, 256)):
...M. Active learning with a persistent uncertainty column¶
Round-trip the model’s uncertainty back into the catalog to drive the next labeling round.
catalog = open_catalog("/cat/al_round_3.parquet")
# Score current candidates
scores = []
for row in catalog.iter_rows():
sl = GeoSlice(bounds=row.geometry.bounds, interval=row.interval,
resolution=(10.0, 10.0), crs=row.crs)
x = load_raster(catalog, sl).values
_, sigma = model_with_uncertainty(x)
scores.append((row.filepath, float(sigma.mean())))
# Persist back as a derived catalog
con = catalog.con
con.execute("""
CREATE OR REPLACE TABLE al_round_4 AS
SELECT files.*, scores.uncertainty
FROM files
JOIN (SELECT * FROM (VALUES ($values)) AS t(filepath, uncertainty)) scores
USING (filepath)
""", values=scores)
con.execute("COPY al_round_4 TO '/cat/al_round_4.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')")
# The next round: pick the top-N most uncertain
next_round = open_catalog("/cat/al_round_4.parquet").sql(
"SELECT * FROM files ORDER BY uncertainty DESC LIMIT 500"
)8.6 Phase 2 API extensions implied by these examples¶
The new examples lean on a small set of additions that aren’t strictly in §6:
- `DuckDBGeoCatalog.from_relation(...)` — wrap an arbitrary DuckDB relation as a catalog (used in I).
- `GeoCatalog.bytes_read_estimate` — telemetry surface from DuckDB’s stats (used in B).
- `geometry_hilbert` virtual sort key — computed at write time from the geometry centroid (used in A, G).
- `catalog.cursor()` context manager around `iter_rows()` (mentioned in §7).
None require new dependencies; all are thin wrappers over DuckDB or geopandas.
## 9. Verdict
Phase 2 is a clean, additive upgrade that costs almost nothing for users on the in-memory path and unlocks 2–3 orders of magnitude more catalog scale, plus a real sharing story, for users who need it. The Phase 1 protocol design pays off here: loaders, samplers, and stitchers are unchanged.
What needs to happen to ship Phase 2:
1. Implement `DuckDBGeoCatalog` behind a `[duckdb]` extra, satisfying the `GeoCatalog` protocol from Phase 1.
2. Canonicalize the GeoParquet schema (§6.6) — types, NULL conventions, bbox column, sort order. Bump a `schema_version` column so future migrations are tractable.
3. Add the streaming-write builders (`backend="duckdb"`) so multi-million-file ingest doesn’t materialize a gdf.
4. Round-trip tests on every backend at three scales: 10², 10⁴, 10⁶ rows.
5. Cross-CRS smoke test (mixed-UTM-zone catalog harmonized to EPSG:4326 at write time).
6. A `georeader catalog` CLI surface (§6.7) with the half-dozen subcommands above.
7. One worked example per category in the docs, mirroring §8.
What I’d defer to Phase 3:
- A federated discovery layer (STAC bridge: produce a STAC catalog from a `GeoCatalog`, or open a STAC endpoint as a `GeoCatalog`).
- Persistent on-disk R-trees with automatic materialization based on query patterns.
- Distributed query (DuckDB has nothing here; this is Iceberg / Delta Lake territory).
- A web UI for catalog inspection.
If we get §6 + §7 right, Phase 2 turns the catalog from “a Python object you build at the start of every script” into “a file you build once, version, share, and query lazily from anywhere.” That’s the upgrade.
## Open questions, gotchas, and warnings
The DuckDB-on-GeoParquet path is sound; the concerns below need active management.
### DuckDB spatial extension version pinning
The spatial extension is pre-1.0 in DuckDB; behaviours change between versions.
GeoParquet 1.1 bbox-column predicate pushdown specifically lands at DuckDB ≥ 1.1.
Mitigation: pin `duckdb >= 1.1` (and a known-good spatial-extension version) in package metadata; add a CI matrix that tests on the latest DuckDB once the spatial extension crosses 1.0 final. Don’t lower the pin even if a user asks — silent query-plan regressions are the worst kind of bug.
### httpfs extension and credentials
Reading S3 / GCS / Azure GeoParquet from cloud needs the httpfs extension configured with the right credentials.
DuckDB’s secret API differs from obstore / fsspec patterns.
Mitigation: define the Credential ↔ DuckDB-secret bridge in the Credential design (cross-link), so users don’t configure auth twice (once for georeader readers, once for DuckDB).
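On the DuckDB side, that bridge would target the secrets API (syntax per DuckDB ≥ 0.10; key values are placeholders):

```python
con.execute("""
    CREATE SECRET s3_catalogs (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'eu-west-1'
    )
""")
# afterwards, open_catalog("s3://bucket/cat.parquet") queries authenticate transparently
```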
### Concurrent reads/writes from multiple processes
DuckDB on a single GeoParquet file is read-safe across processes. Write-safe only under specific patterns:
- Append-only (new row groups in new files, then a manifest concat) — safe.
- In-place mutation (rewrite the file under existing reads) — not safe across concurrent writers; you need an external lock or a Delta-Lake-style manifest.
Phase 2 of GeoCatalog should default to the append-only pattern; document the trade-off if/when a concurrent-writer use case shows up.
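The append-only pattern, sketched (paths and the shard-catalog object are placeholders):

```python
# Writer: today's rows land in a brand-new shard file; no rewrite, no lock,
# and concurrent readers of existing shards are unaffected.
todays_catalog.to_geoparquet("cat_dir/2024-06-01.parquet")

# Readers: the directory is one virtual table (read_parquet('cat_dir/*.parquet')).
catalog = open_catalog("cat_dir/")
```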
### GeoParquet 1.1 writer support
DuckDB’s GeoParquet write path is newer than the read path.
Run a round-trip test (write with DuckDB, read with `geopandas.read_parquet`, check the bbox column survives and CRS metadata is preserved) before promising the writer in v0.1. Cross-link geocatalog.md §10.2.
### Bit-rot risk
Most users will stay on Phase 1 (GeoPandas in-memory) for the 90% case; Phase 2 won’t see organic exercise until someone hits 10⁶+ rows. Risk: the Phase 2 code paths regress without being noticed. Mitigation: maintain a multi-million-row CI fixture from v0.1; gate Phase 2 release behind one real Phase 2 user.
### ST_Transform is slow per row
DuckDB’s `ST_Transform` requires PROJ data and is expensive when called per row.
Hence the design choice above (§7): write all GeoParquet in EPSG:4326 and store the native CRS as a column for round-trip — reproject at write time, not at query time.
Footgun for users who skip this convention: a multi-CRS catalog with `ST_Transform` in the `WHERE` clause will be ~100× slower than a 4326-canonical catalog.
Document loudly.