Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Vectorize

Rasters → vectors

UNEP
IMEO
MARS

Module: georeader/vectorize.py (370 LOC) Role: the inverse of Chapter 9. Extract polygon geometries from binary raster masks. Standard tool for converting segmentation outputs and classification rasters back to GIS-friendly vector formats.


1. The job

You have a binary (or thresholded) raster — typically the output of a CNN segmentation, a cloud detector, or a classified scene. You want a list of Shapely Polygons in CRS coordinates, suitable for:

Underneath this is rasterio.features.shapes (which delegates to GDAL’s polygonization). This module wraps it with three on-by-default ergonomics: small-polygon filtering, vertex simplification, and polygon buffering.


2. The vectorization process

┌─────────────────────────────────────────────────────────────────────────┐
│                    VECTORIZATION PROCESS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Raster (Binary Mask)                Vector (Polygons)                  │
│  ────────────────────                ─────────────────                  │
│                                                                          │
│  ┌─┬─┬─┬─┬─┬─┬─┬─┐                       ╔═══════════╗                  │
│  │0│0│0│1│1│1│0│0│                      ╔╝           ╚╗                 │
│  ├─┼─┼─┼─┼─┼─┼─┼─┤                     ╔╝             ╚╗                │
│  │0│0│1│1│1│1│1│0│   ═══════════►     ╔╝               ╚╗               │
│  ├─┼─┼─┼─┼─┼─┼─┼─┤   Vectorize        ║    Polygon 1    ║               │
│  │0│1│1│1│1│1│1│0│                    ╚╗               ╔╝               │
│  ├─┼─┼─┼─┼─┼─┼─┼─┤                     ╚╗             ╔╝                │
│  │0│0│1│1│1│1│0│0│                      ╚╗           ╔╝                 │
│  └─┴─┴─┴─┴─┴─┴─┴─┘                       ╚═══════════╝                  │
│                                                                          │
│  1 = foreground (vectorized)                                            │
│  0 = background (ignored)                                               │
└─────────────────────────────────────────────────────────────────────────┘

Every connected component of 1-valued pixels becomes one polygon. The polygon traces the outside edge of the foreground pixels — so a 3×3 patch of 1s becomes a polygon enclosing 9 pixel squares (perimeter ≈ 12 pixels), not a single point.

For multi-class rasters, threshold or mask first: vectorize is fundamentally a binary operation. To extract per-class polygons from a label raster, loop:

polygons_per_class = {
    c: get_polygons(labels == c, ...) for c in np.unique(labels)
}

3. Polygon simplification

┌─────────────────────────────────────────────────────────────────────────┐
│              POLYGON SIMPLIFICATION (tolerance parameter)                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Raw (pixelated)              Simplified (tolerance=1)                  │
│  ────────────────              ──────────────────────                   │
│                                                                          │
│  ┌─┐                                ╭───────╮                            │
│  │ └─┐                             ╱         ╲                           │
│  │   └─┐                          ╱           ╲                          │
│  │     └─┐   ────────────►       │             │    Fewer vertices,     │
│  │       │   simplify            │             │    smoother edges      │
│  │     ┌─┘                        ╲           ╱                          │
│  │   ┌─┘                           ╲         ╱                           │
│  └───┘                              ╰───────╯                            │
│                                                                          │
│  tolerance=0: Keep all vertices (staircase pattern)                     │
│  tolerance=1: Simplify ~1 pixel tolerance (DEFAULT)                     │
│  tolerance>1: More aggressive simplification                            │
└─────────────────────────────────────────────────────────────────────────┘

Without simplification, a polygon traces every pixel boundary — a 100×100 region produces ~400 vertices forming a staircase pattern. That’s both visually ugly and storage-expensive.

tolerance is the Douglas-Peucker simplification distance, in pixels. Default 1.0 collapses single-pixel jaggies into smooth edges while preserving features ≥ 1 pixel. Larger values smooth more aggressively at the cost of accuracy:

toleranceWhat you get
0Exact pixel boundaries (staircase). Use when downstream consumers expect rasterise round-trip equality.
1.0 (default)One-pixel smoothing. Visually clean while preserving ≥ 1 pixel features.
2.0–5.0Aggressive smoothing for visualisation. Loses fine-grained shape detail.
> 10Bounding-box-like simplification. Suitable only for index/preview purposes.

The simplification happens after vectorization, in pixel coordinates, before applying the affine transform. So tolerance=1 always means “1 pixel” regardless of CRS or resolution.


4. Polygon filtering

┌─────────────────────────────────────────────────────────────────────────┐
│                    POLYGON FILTERING                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Parameters:                                                             │
│  ───────────                                                             │
│                                                                          │
│  min_area=25.5 (default)    Remove polygons smaller than ~5x5 pixels    │
│                             Helps filter noise and artifacts             │
│                                                                          │
│  polygon_buffer=0           Buffer/erode polygons by N pixels           │
│                             Positive: expand                             │
│                             Negative: shrink (erode)                     │
│                                                                          │
│  Before (min_area=0):                After (min_area=25):               │
│  ┌────────────────────┐              ┌────────────────────┐             │
│  │  ■   ┌───────┐     │              │      ┌───────┐     │             │
│  │ ■ ■  │       │  ■  │   ═══════►   │      │       │     │             │
│  │      │       │     │   Filter     │      │       │     │             │
│  │ ■    └───────┘     │              │      └───────┘     │             │
│  └────────────────────┘              └────────────────────┘             │
│     ↑ small polygons removed                                            │
└─────────────────────────────────────────────────────────────────────────┘

Two filters layered onto the basic extraction:

Order of operations: vectorize → buffer → simplify → filter by min_area. So the area filter applies to the final simplified geometry, not the raw extraction.


5. The two functions

FunctionReturnsWhat it does
get_polygons(binary_mask, min_area=25.5, polygon_buffer=0, tolerance=1.0, transform=None)list[Polygon]the workhorse — see below
transform_polygon(polygon, transform=None, src_crs=None, dst_crs=None)Polygon / MultiPolygonapply an affine transform and/or reproject between CRSs

get_polygons(binary_mask, ...)

Accepts either a numpy array or a GeoData (GeoTensor / RasterioReader).

This duck-typing on input is the killer convenience: pipe a CNN’s (H, W) numpy output and a transform together, get georeferenced polygons. No manual polygon_to_crs round-trip.

transform_polygon(polygon, ...)

The post-processing helper for when you already have a polygon and want to:

Source: vectorize.py:271.


6. The standard CNN-output-to-vectors recipe

import numpy as np
from georeader import vectorize

# 1. CNN prediction on a GeoTensor input — you get back logits or probs
probs = model(s2_gt.values)               # (H, W) float32, ndarray
mask = probs > 0.5                        # (H, W) bool

# 2. Wrap in a GeoTensor so transform tags along
mask_gt = s2_gt.array_as_geotensor(mask)  # (H, W) bool GeoTensor

# 3. Vectorize with sensible defaults
polygons = vectorize.get_polygons(
    mask_gt,
    min_area=100,            # 10×10 pixel min — drops noise
    tolerance=1.0,           # one-pixel smoothing
    polygon_buffer=0,
)

# 4. Persist
import geopandas as gpd
gdf = gpd.GeoDataFrame(geometry=polygons, crs=s2_gt.crs)
gdf.to_file("flood_polygons.geojson", driver="GeoJSON")

Five lines for the whole “CNN prediction → georeferenced vector layer” pipeline. The array_as_geotensor step (Chapter 1 §10) is what makes the final polygons georeferenced — without it you’d be in pixel coords.


7. The rasterize ↔ vectorize round trip

rasterize and vectorize are conceptually inverses, but they are not exactly invertible. Three sources of asymmetry:

  1. Rasterization quantises — every pixel either is or isn’t in the polygon. After rasterize → vectorize, you get a polygon traced along pixel boundaries (a staircase) rather than the original smooth shape.

  2. tolerance > 0 smooths. Round-tripping through vectorize(tolerance=1) then rasterize will not exactly match the original raster.

  3. min_area drops content. Small features below the threshold are gone for good after vectorize → rasterize.

For applications that need exact round-trips (e.g., test fixtures), pass tolerance=0, min_area=0, polygon_buffer=0. The output will have full vertex counts (large) and exact pixel boundaries.


8. Sharp edges


9. Connection to geotoolz

Two operators in geotoolz.md wrap this module:

These complete the “raster ML pipeline → GIS-ready output” loop without users touching rasterio or geopandas directly.

Next chapter: Reflectance — the radiometry / spectral-response-function module (971 LOC, 97 box-drawing chars; the third densest in the package).