Catalog — query / intersect / union set algebra
Catalog set algebra: query / intersect / union¶
Catalogs support set-algebra operations that return new catalogs — composable filters and joins over file collections. This notebook shows the three patterns on real data:
queryfilters by space + time.intersectpairs two catalogs by spatial + temporal overlap.unionconcatenates two catalogs (with auto-reproject).
Two real catalogs participate:
- Imagery — eight cloud-free Sentinel-2 L2A scenes over Lake Tahoe (MGRS 10SGJ, June–July 2024) pulled from MPC.
- Labels — Natural Earth’s admin-1 (states/provinces) polygons, restricted to California, Nevada, and Oregon. Plays the role of any vector overlay you might want to pair with imagery (CORINE land-cover, GHCN station footprints, MODIS land classification, …).
import geocatalog as gc
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import shapely.geometry
from geostack import (
LAKE_TAHOE_BBOX,
LAKE_TAHOE_TILE,
load_natural_earth_admin1,
load_stac_items,
)1. Build the imagery catalog (eight S2 scenes)¶
items = load_stac_items(
"sentinel-2-l2a",
LAKE_TAHOE_BBOX,
"2024-06-01/2024-07-31",
tile=LAKE_TAHOE_TILE,
max_cloud_cover=15,
)
imagery = gc.build_raster_catalog(
[it.assets["B04"].href for it in items],
filename_regex=r".*_(?P<date>\d{8})T\d{6}_.*\.tif",
target_crs="EPSG:32610",
)
imagery.gdf["mission"] = [
"S2A" if "S2A" in it.id else "S2B" for it in items
]
print(f"imagery: {len(imagery)} rows bounds={tuple(round(x, 1) for x in imagery.total_bounds)}")
print(f"missions: {imagery.gdf['mission'].value_counts().to_dict()}")imagery: 22 rows bounds=(699960.0, 4290240.0, 809760.0, 4400040.0)
missions: {'S2A': 12, 'S2B': 10}
2. Build the labels catalog (admin-1 polygons)¶
Natural Earth admin-1 polygons restricted to the three states whose
admin boundary intersects the Lake Tahoe MGRS tile. We hand the
geopandas GeoDataFrame to InMemoryGeoCatalog directly with the
vector backend so the catalog set-algebra story works against a
vector overlay just like it would against another raster archive.
ne_admin1 = load_natural_earth_admin1()
target_states = ["California", "Nevada", "Oregon"]
admin_subset = ne_admin1[ne_admin1["name"].isin(target_states)].copy()
admin_subset["filepath"] = [f"admin_{s.lower()}.gpkg" for s in admin_subset["name"]]
admin_subset["start_time"] = pd.Timestamp("2024-01-01")
admin_subset["end_time"] = pd.Timestamp("2024-12-31")
admin_subset = admin_subset[
["filepath", "geometry", "start_time", "end_time", "name"]
]
labels = gc.InMemoryGeoCatalog(
gpd.GeoDataFrame(admin_subset, geometry="geometry", crs="EPSG:4326"),
backend="vector",
)
print(f"labels: {len(labels)} admin-1 rows ({', '.join(target_states)})")labels: 3 admin-1 rows (California, Nevada, Oregon)
3. Plot the two catalogs together¶
Imagery footprints are 110 × 110 km tiles in UTM 10N; the labels are state-sized polygons in EPSG:4326. We reproject the imagery to 4326 for the overlay so both fit in one map.
imagery_4326 = imagery.gdf.to_crs("EPSG:4326")
fig, ax = plt.subplots(figsize=(8, 7))
labels.gdf.plot(ax=ax, edgecolor="C3", facecolor="C3", alpha=0.15, linewidth=1.5)
for _, row in labels.gdf.iterrows():
cx, cy = row.geometry.centroid.coords[0]
ax.text(cx, cy, row["name"], fontsize=9, ha="center", color="C3")
imagery_4326.plot(ax=ax, edgecolor="C0", facecolor="none", linewidth=1.2)
ax.set_xlim(-125, -114)
ax.set_ylim(33, 47)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Imagery footprints (blue) vs admin-1 labels (red)")
plt.show()
4. query — spatial AND temporal filter on imagery¶
Ask for scenes overlapping the western shore of Lake Tahoe in the first week of June.
hits = imagery.query(
bounds=(750000, 4318000, 758000, 4332000),
crs="EPSG:32610",
time=("2024-06-01", "2024-06-10"),
)
print(f"first-week western-shore hits: {len(hits)}")
print(hits.gdf[["mission", "start_time", "filepath"]].head().to_string())first-week western-shore hits: 4
mission start_time filepath
datetime
[2024-06-04 00:00:00, 2024-06-04 23:59:59.999999] S2B 2024-06-04 https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/04/S2B_MSIL2A_20240604T183919_N0510_R070_T10SGJ_20240604T233744.SAFE/GRANULE/L2A_T10SGJ_A037848_20240604T185042/IMG_DATA/R10m/T10SGJ_20240604T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-07 00:00:00, 2024-06-07 23:59:59.999999] S2B 2024-06-07 https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/07/S2B_MSIL2A_20240607T184919_N0510_R113_T10SGJ_20240608T003436.SAFE/GRANULE/L2A_T10SGJ_A037891_20240607T185839/IMG_DATA/R10m/T10SGJ_20240607T184919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-09 00:00:00, 2024-06-09 23:59:59.999999] S2A 2024-06-09 https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/09/S2A_MSIL2A_20240609T183921_N0510_R070_T10SGJ_20240610T024930.SAFE/GRANULE/L2A_T10SGJ_A046828_20240609T185104/IMG_DATA/R10m/T10SGJ_20240609T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-02 00:00:00, 2024-06-02 23:59:59.999999] S2A 2024-06-02 https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/02/S2A_MSIL2A_20240602T184921_N0510_R113_T10SGJ_20240603T032329.SAFE/GRANULE/L2A_T10SGJ_A046728_20240602T185554/IMG_DATA/R10m/T10SGJ_20240602T184921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
5. intersect — cross-catalog AND¶
intersect(imagery, labels) returns the rows whose footprints
AND time intervals overlap. Each surviving row’s geometry is
clipped to the intersection, and the time interval is the per-row
temporal intersection. Since the admin polygons span all of 2024,
every imagery row in the AOI will pair with whichever state
polygon it sits inside.
paired = gc.intersect(imagery, labels, spatial_only=True)
print(f"len(paired): {len(paired)} (every imagery row that hits CA/NV/OR)")
paired_4326 = paired.gdf.to_crs("EPSG:4326")
print(paired_4326[["filepath", "geometry"]].head().to_string())len(paired): 44 (every imagery row that hits CA/NV/OR)
filepath geometry
datetime
[2024-07-29 00:00:00, 2024-07-29 23:59:59.999999] https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/29/S2A_MSIL2A_20240729T183921_N0511_R070_T10SGJ_20240730T023050.SAFE/GRANULE/L2A_T10SGJ_A047543_20240729T184934/IMG_DATA/R10m/T10SGJ_20240729T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))
[2024-07-29 00:00:00, 2024-07-29 23:59:59.999999] https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/29/S2A_MSIL2A_20240729T183921_N0511_R070_T10SGJ_20240730T023050.SAFE/GRANULE/L2A_T10SGJ_A047543_20240729T184934/IMG_DATA/R10m/T10SGJ_20240729T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D POLYGON ((-119.38764 39.69403, -119.43793 38.70656, -119.59433 38.71123, -120.00003 38.9954, -120.00053 39.7115, -119.38764 39.69403))
[2024-06-14 00:00:00, 2024-06-14 23:59:59.999999] https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/14/S2B_MSIL2A_20240614T183919_N0510_R070_T10SGJ_20240615T003207.SAFE/GRANULE/L2A_T10SGJ_A037991_20240614T185209/IMG_DATA/R10m/T10SGJ_20240614T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))
[2024-06-14 00:00:00, 2024-06-14 23:59:59.999999] https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/14/S2B_MSIL2A_20240614T183919_N0510_R070_T10SGJ_20240615T003207.SAFE/GRANULE/L2A_T10SGJ_A037991_20240614T185209/IMG_DATA/R10m/T10SGJ_20240614T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D POLYGON ((-119.38764 39.69403, -119.43793 38.70656, -119.59433 38.71123, -120.00003 38.9954, -120.00053 39.7115, -119.38764 39.69403))
[2024-07-09 00:00:00, 2024-07-09 23:59:59.999999] https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/09/S2A_MSIL2A_20240709T183921_N0510_R070_T10SGJ_20240710T024627.SAFE/GRANULE/L2A_T10SGJ_A047257_20240709T185326/IMG_DATA/R10m/T10SGJ_20240709T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))
Plot the clipped intersection footprints — they’re now state-sliced imagery tiles.
fig, ax = plt.subplots(figsize=(8, 7))
labels.gdf.plot(ax=ax, edgecolor="lightgray", facecolor="none", linewidth=1)
imagery_4326.plot(ax=ax, edgecolor="C0", facecolor="none", linewidth=1)
paired_4326.plot(ax=ax, edgecolor="C2", facecolor="C2", alpha=0.4)
ax.set_xlim(-125, -114)
ax.set_ylim(33, 47)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Green = clipped intersection footprints (imagery ∩ states)")
plt.show()
6. union — cross-catalog OR¶
Useful when you have multiple sensors (Landsat 7 + Landsat 8, or Sentinel-2 + Sentinel-3) you want to treat as a single virtual dataset. We fabricate a tiny “Sentinel-3 OLCI” catalog row at low resolution that overlaps the same AOI to show the merge.
s3_olci = gc.InMemoryGeoCatalog(
gpd.GeoDataFrame(
[
{
"filepath": "S3_OLCI_20240615.tif",
"geometry": shapely.geometry.box(749000, 4316000, 770000, 4340000),
"start_time": pd.Timestamp("2024-06-15"),
"end_time": pd.Timestamp("2024-06-15 23:59:59"),
"mission": "S3A",
}
],
geometry="geometry",
crs="EPSG:32610",
),
backend="raster",
)
combined = gc.union(imagery, s3_olci)
print(f"len(combined): {len(combined)} ({len(imagery)} S2 + {len(s3_olci)} S3)")
print(combined.gdf["mission"].value_counts().to_dict())len(combined): 23 (22 S2 + 1 S3)
{'S2A': 12, 'S2B': 10, 'S3A': 1}
Cross-CRS union¶
union silently reprojects the second catalog if its CRS does not
match — self.crs always wins.
s3_4326 = gc.InMemoryGeoCatalog(
gpd.GeoDataFrame(
[
{
"filepath": "S3_OLCI_4326.tif",
"geometry": shapely.geometry.box(*LAKE_TAHOE_BBOX),
"start_time": pd.Timestamp("2024-06-15"),
"end_time": pd.Timestamp("2024-06-15 23:59:59"),
"mission": "S3A",
}
],
geometry="geometry",
crs="EPSG:4326",
),
backend="raster",
)
reprojected = gc.union(imagery, s3_4326)
print(f"len(reprojected): {len(reprojected)}")
print(f"reprojected.gdf.crs: {reprojected.gdf.crs}")
print("⇒ second catalog's footprints reprojected into UTM 10N silently")len(reprojected): 23
reprojected.gdf.crs: EPSG:32610
⇒ second catalog's footprints reprojected into UTM 10N silently
7. where — pandas-.query() passthrough¶
Non-geometric filter (mission, cloud %, processing level) — the pandas escape hatch.
imagery_a = imagery.where("mission == 'S2A'")
print(f"S2A-only: {len(imagery_a)} rows out of {len(imagery)}")S2A-only: 12 rows out of 22
Recap¶
The four operations compose into a small algebra over file collections:
| Op | Returns | Use it for |
|---|---|---|
catalog.query(bounds, crs, time) | sub-catalog | Spatial+temporal filter (the most common case). |
catalog.where("sql_like") | sub-catalog | Non-geometric filters (mission, cloud %, sensor). |
gc.intersect(a, b, spatial_only=...) | row-aligned pair | Join imagery with labels / co-located sensors. |
gc.union(a, b) | concatenated catalog | Multi-sensor virtual dataset; auto-reproject. |
Combine them freely — gc.intersect(imagery.query(bbox, time), labels.where("class == 'forest'")) is one production-ready pattern.