Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Catalog — query / intersect / union set algebra

Catalog set algebra: query / intersect / union

Catalogs support set-algebra operations that return new catalogs — composable filters and joins over file collections. This notebook shows the three patterns on real data:

Two real catalogs participate:

  1. Imagery — eight cloud-free Sentinel-2 L2A scenes over Lake Tahoe (MGRS 10SGJ, June–July 2024) pulled from MPC.
  2. Labels — Natural Earth’s admin-1 (states/provinces) polygons, restricted to California, Nevada, and Oregon. Plays the role of any vector overlay you might want to pair with imagery (CORINE land-cover, GHCN station footprints, MODIS land classification, …).
import geocatalog as gc
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import shapely.geometry

from geostack import (
    LAKE_TAHOE_BBOX,
    LAKE_TAHOE_TILE,
    load_natural_earth_admin1,
    load_stac_items,
)

1. Build the imagery catalog (eight S2 scenes)

items = load_stac_items(
    "sentinel-2-l2a",
    LAKE_TAHOE_BBOX,
    "2024-06-01/2024-07-31",
    tile=LAKE_TAHOE_TILE,
    max_cloud_cover=15,
)
imagery = gc.build_raster_catalog(
    [it.assets["B04"].href for it in items],
    filename_regex=r".*_(?P<date>\d{8})T\d{6}_.*\.tif",
    target_crs="EPSG:32610",
)
imagery.gdf["mission"] = [
    "S2A" if "S2A" in it.id else "S2B" for it in items
]
print(f"imagery: {len(imagery)} rows  bounds={tuple(round(x, 1) for x in imagery.total_bounds)}")
print(f"missions: {imagery.gdf['mission'].value_counts().to_dict()}")
imagery: 22 rows  bounds=(699960.0, 4290240.0, 809760.0, 4400040.0)
missions: {'S2A': 12, 'S2B': 10}

2. Build the labels catalog (admin-1 polygons)

Natural Earth admin-1 polygons restricted to the three states whose admin boundary intersects the Lake Tahoe MGRS tile. We hand the geopandas GeoDataFrame to InMemoryGeoCatalog directly with the vector backend so the catalog set-algebra story works against a vector overlay just like it would against another raster archive.

ne_admin1 = load_natural_earth_admin1()
target_states = ["California", "Nevada", "Oregon"]
admin_subset = ne_admin1[ne_admin1["name"].isin(target_states)].copy()
admin_subset["filepath"] = [f"admin_{s.lower()}.gpkg" for s in admin_subset["name"]]
admin_subset["start_time"] = pd.Timestamp("2024-01-01")
admin_subset["end_time"] = pd.Timestamp("2024-12-31")
admin_subset = admin_subset[
    ["filepath", "geometry", "start_time", "end_time", "name"]
]
labels = gc.InMemoryGeoCatalog(
    gpd.GeoDataFrame(admin_subset, geometry="geometry", crs="EPSG:4326"),
    backend="vector",
)
print(f"labels: {len(labels)} admin-1 rows ({', '.join(target_states)})")
labels: 3 admin-1 rows (California, Nevada, Oregon)

3. Plot the two catalogs together

Imagery footprints are 110 × 110 km tiles in UTM 10N; the labels are state-sized polygons in EPSG:4326. We reproject the imagery to 4326 for the overlay so both fit in one map.

imagery_4326 = imagery.gdf.to_crs("EPSG:4326")
fig, ax = plt.subplots(figsize=(8, 7))
labels.gdf.plot(ax=ax, edgecolor="C3", facecolor="C3", alpha=0.15, linewidth=1.5)
for _, row in labels.gdf.iterrows():
    cx, cy = row.geometry.centroid.coords[0]
    ax.text(cx, cy, row["name"], fontsize=9, ha="center", color="C3")
imagery_4326.plot(ax=ax, edgecolor="C0", facecolor="none", linewidth=1.2)
ax.set_xlim(-125, -114)
ax.set_ylim(33, 47)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Imagery footprints (blue) vs admin-1 labels (red)")
plt.show()
<Figure size 800x700 with 1 Axes>

4. query — spatial AND temporal filter on imagery

Ask for scenes overlapping the western shore of Lake Tahoe in the first week of June.

hits = imagery.query(
    bounds=(750000, 4318000, 758000, 4332000),
    crs="EPSG:32610",
    time=("2024-06-01", "2024-06-10"),
)
print(f"first-week western-shore hits: {len(hits)}")
print(hits.gdf[["mission", "start_time", "filepath"]].head().to_string())
first-week western-shore hits: 4
                                                  mission start_time                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               filepath
datetime                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
[2024-06-04 00:00:00, 2024-06-04 23:59:59.999999]     S2B 2024-06-04  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/04/S2B_MSIL2A_20240604T183919_N0510_R070_T10SGJ_20240604T233744.SAFE/GRANULE/L2A_T10SGJ_A037848_20240604T185042/IMG_DATA/R10m/T10SGJ_20240604T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-07 00:00:00, 2024-06-07 23:59:59.999999]     S2B 2024-06-07  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/07/S2B_MSIL2A_20240607T184919_N0510_R113_T10SGJ_20240608T003436.SAFE/GRANULE/L2A_T10SGJ_A037891_20240607T185839/IMG_DATA/R10m/T10SGJ_20240607T184919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-09 00:00:00, 2024-06-09 23:59:59.999999]     S2A 2024-06-09  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/09/S2A_MSIL2A_20240609T183921_N0510_R070_T10SGJ_20240610T024930.SAFE/GRANULE/L2A_T10SGJ_A046828_20240609T185104/IMG_DATA/R10m/T10SGJ_20240609T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D
[2024-06-02 00:00:00, 2024-06-02 23:59:59.999999]     S2A 2024-06-02  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/02/S2A_MSIL2A_20240602T184921_N0510_R113_T10SGJ_20240603T032329.SAFE/GRANULE/L2A_T10SGJ_A046728_20240602T185554/IMG_DATA/R10m/T10SGJ_20240602T184921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D

5. intersect — cross-catalog AND

intersect(imagery, labels) returns the rows whose footprints AND time intervals overlap. Each surviving row’s geometry is clipped to the intersection, and the time interval is the per-row temporal intersection. Since the admin polygons span all of 2024, every imagery row in the AOI will pair with whichever state polygon it sits inside.

paired = gc.intersect(imagery, labels, spatial_only=True)
print(f"len(paired): {len(paired)}  (every imagery row that hits CA/NV/OR)")
paired_4326 = paired.gdf.to_crs("EPSG:4326")
print(paired_4326[["filepath", "geometry"]].head().to_string())
len(paired): 44  (every imagery row that hits CA/NV/OR)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                filepath                                                                                                                                geometry
datetime                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[2024-07-29 00:00:00, 2024-07-29 23:59:59.999999]  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/29/S2A_MSIL2A_20240729T183921_N0511_R070_T10SGJ_20240730T023050.SAFE/GRANULE/L2A_T10SGJ_A047543_20240729T184934/IMG_DATA/R10m/T10SGJ_20240729T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D  POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))
[2024-07-29 00:00:00, 2024-07-29 23:59:59.999999]  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/29/S2A_MSIL2A_20240729T183921_N0511_R070_T10SGJ_20240730T023050.SAFE/GRANULE/L2A_T10SGJ_A047543_20240729T184934/IMG_DATA/R10m/T10SGJ_20240729T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D  POLYGON ((-119.38764 39.69403, -119.43793 38.70656, -119.59433 38.71123, -120.00003 38.9954, -120.00053 39.7115, -119.38764 39.69403))
[2024-06-14 00:00:00, 2024-06-14 23:59:59.999999]  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/14/S2B_MSIL2A_20240614T183919_N0510_R070_T10SGJ_20240615T003207.SAFE/GRANULE/L2A_T10SGJ_A037991_20240614T185209/IMG_DATA/R10m/T10SGJ_20240614T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D  POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))
[2024-06-14 00:00:00, 2024-06-14 23:59:59.999999]  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/06/14/S2B_MSIL2A_20240614T183919_N0510_R070_T10SGJ_20240615T003207.SAFE/GRANULE/L2A_T10SGJ_A037991_20240614T185209/IMG_DATA/R10m/T10SGJ_20240614T183919_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D  POLYGON ((-119.38764 39.69403, -119.43793 38.70656, -119.59433 38.71123, -120.00003 38.9954, -120.00053 39.7115, -119.38764 39.69403))
[2024-07-09 00:00:00, 2024-07-09 23:59:59.999999]  https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/10/S/GJ/2024/07/09/S2A_MSIL2A_20240709T183921_N0510_R070_T10SGJ_20240710T024627.SAFE/GRANULE/L2A_T10SGJ_A047257_20240709T185326/IMG_DATA/R10m/T10SGJ_20240709T183921_B04_10m.tif?st=2026-05-25T15%3A21%3A25Z&se=2026-05-26T16%3A06%3A25Z&sp=rl&sv=2025-07-05&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2026-05-26T10%3A35%3A49Z&ske=2026-06-02T10%3A35%3A49Z&sks=b&skv=2025-07-05&sig=ojINaC51tIQhJ9hV2K/PjYY%2BCEHGnM8OeyVptEjXbuk%3D  POLYGON ((-120.69936 38.73822, -120.66683 39.72682, -120.00053 39.7115, -120.00003 38.9954, -119.59433 38.71123, -120.69936 38.73822))

Plot the clipped intersection footprints — they’re now state-sliced imagery tiles.

fig, ax = plt.subplots(figsize=(8, 7))
labels.gdf.plot(ax=ax, edgecolor="lightgray", facecolor="none", linewidth=1)
imagery_4326.plot(ax=ax, edgecolor="C0", facecolor="none", linewidth=1)
paired_4326.plot(ax=ax, edgecolor="C2", facecolor="C2", alpha=0.4)
ax.set_xlim(-125, -114)
ax.set_ylim(33, 47)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Green = clipped intersection footprints (imagery ∩ states)")
plt.show()
<Figure size 800x700 with 1 Axes>

6. union — cross-catalog OR

Useful when you have multiple sensors (Landsat 7 + Landsat 8, or Sentinel-2 + Sentinel-3) you want to treat as a single virtual dataset. We fabricate a tiny “Sentinel-3 OLCI” catalog row at low resolution that overlaps the same AOI to show the merge.

s3_olci = gc.InMemoryGeoCatalog(
    gpd.GeoDataFrame(
        [
            {
                "filepath": "S3_OLCI_20240615.tif",
                "geometry": shapely.geometry.box(749000, 4316000, 770000, 4340000),
                "start_time": pd.Timestamp("2024-06-15"),
                "end_time": pd.Timestamp("2024-06-15 23:59:59"),
                "mission": "S3A",
            }
        ],
        geometry="geometry",
        crs="EPSG:32610",
    ),
    backend="raster",
)
combined = gc.union(imagery, s3_olci)
print(f"len(combined): {len(combined)}  ({len(imagery)} S2 + {len(s3_olci)} S3)")
print(combined.gdf["mission"].value_counts().to_dict())
len(combined): 23  (22 S2 + 1 S3)
{'S2A': 12, 'S2B': 10, 'S3A': 1}

Cross-CRS union

union silently reprojects the second catalog if its CRS does not match — self.crs always wins.

s3_4326 = gc.InMemoryGeoCatalog(
    gpd.GeoDataFrame(
        [
            {
                "filepath": "S3_OLCI_4326.tif",
                "geometry": shapely.geometry.box(*LAKE_TAHOE_BBOX),
                "start_time": pd.Timestamp("2024-06-15"),
                "end_time": pd.Timestamp("2024-06-15 23:59:59"),
                "mission": "S3A",
            }
        ],
        geometry="geometry",
        crs="EPSG:4326",
    ),
    backend="raster",
)
reprojected = gc.union(imagery, s3_4326)
print(f"len(reprojected): {len(reprojected)}")
print(f"reprojected.gdf.crs: {reprojected.gdf.crs}")
print("⇒ second catalog's footprints reprojected into UTM 10N silently")
len(reprojected): 23
reprojected.gdf.crs: EPSG:32610
⇒ second catalog's footprints reprojected into UTM 10N silently

7. where — pandas-.query() passthrough

Non-geometric filter (mission, cloud %, processing level) — the pandas escape hatch.

imagery_a = imagery.where("mission == 'S2A'")
print(f"S2A-only: {len(imagery_a)} rows out of {len(imagery)}")
S2A-only: 12 rows out of 22

Recap

The four operations compose into a small algebra over file collections:

OpReturnsUse it for
catalog.query(bounds, crs, time)sub-catalogSpatial+temporal filter (the most common case).
catalog.where("sql_like")sub-catalogNon-geometric filters (mission, cloud %, sensor).
gc.intersect(a, b, spatial_only=...)row-aligned pairJoin imagery with labels / co-located sensors.
gc.union(a, b)concatenated catalogMulti-sensor virtual dataset; auto-reproject.

Combine them freely — gc.intersect(imagery.query(bbox, time), labels.where("class == 'forest'")) is one production-ready pattern.