SciAgent-Skills geopandas-geospatial

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/geopandas-geospatial" ~/.claude/skills/jaechang-hits-sciagent-skills-geopandas-geospatial && rm -rf "$T"
manifest: skills/scientific-computing/geopandas-geospatial/SKILL.md
source content

GeoPandas Geospatial Analysis

Overview

GeoPandas extends pandas with spatial operations on geometric types, combining pandas DataFrames with Shapely geometries and Fiona for file I/O. It enables reading, writing, manipulating, and visualizing geospatial vector data (points, lines, polygons) with a familiar pandas-like API.

When to Use

  • Reading and writing spatial file formats (Shapefile, GeoJSON, GeoPackage, Parquet)
  • Performing spatial joins between geographic datasets (points in polygons, nearest neighbors)
  • Running overlay operations (intersection, union, difference, clipping)
  • Computing geometric properties (area, distance, buffer, centroid)
  • Creating choropleth maps and interactive web maps
  • Reprojecting data between coordinate reference systems
  • Aggregating spatial features by attribute (dissolve)
  • For raster data analysis, use rasterio/xarray instead
  • For large-scale distributed geospatial, consider Dask-GeoPandas or Apache Sedona

Prerequisites

pip install geopandas matplotlib
# Optional:
# pip install folium       — interactive maps
# pip install mapclassify  — classification schemes for choropleth
# pip install contextily   — basemaps
# pip install pyarrow      — faster I/O (2-4x speedup)
# pip install psycopg2 geoalchemy2  — PostGIS support

Quick Start

import geopandas as gpd

# Read spatial data
gdf = gpd.read_file("data.geojson")
print(f"Shape: {gdf.shape}, CRS: {gdf.crs}")
print(f"Geometry types: {gdf.geometry.geom_type.unique()}")

# Reproject, compute area, save
gdf_proj = gdf.to_crs("EPSG:3857")
gdf_proj['area_m2'] = gdf_proj.geometry.area
gdf_proj.to_file("output.gpkg")

# Quick map
gdf.plot(column='population', legend=True, figsize=(10, 8))

Core API

1. Data I/O

import geopandas as gpd

# Read various formats
gdf = gpd.read_file("data.shp")           # Shapefile
gdf = gpd.read_file("data.geojson")       # GeoJSON
gdf = gpd.read_file("data.gpkg")          # GeoPackage
gdf = gpd.read_file("data.gpkg", layer="roads")  # Specific layer

# Filtered reading (load only needed data)
gdf = gpd.read_file("data.gpkg", bbox=(xmin, ymin, xmax, ymax))
gdf = gpd.read_file("data.gpkg", columns=["name", "geometry"])
gdf = gpd.read_file("data.gpkg", where="population > 10000")

# Arrow acceleration (2-4x faster)
gdf = gpd.read_file("data.gpkg", use_arrow=True)

# Parquet/Feather (columnar, fast, preserves CRS)
gdf = gpd.read_parquet("data.parquet")
gdf.to_parquet("output.parquet")

# PostGIS database
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/db")
gdf = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom')
gdf.to_postgis("output_table", con=engine)

# Write
gdf.to_file("output.gpkg")             # GeoPackage (recommended)
gdf.to_file("output.shp")              # Shapefile
gdf.to_file("output.geojson", driver="GeoJSON")

2. CRS Management

# Check current CRS
print(gdf.crs)                  # e.g., EPSG:4326
print(gdf.crs.is_geographic)    # True for lat/lon
print(gdf.crs.is_projected)     # True for meters

# Reproject (transforms coordinates)
gdf_proj = gdf.to_crs("EPSG:3857")       # Web Mercator
gdf_proj = gdf.to_crs(epsg=32633)        # UTM zone 33N

# Set CRS (only when metadata missing, does NOT transform coordinates)
gdf = gdf.set_crs("EPSG:4326")

# Estimate appropriate UTM zone
utm_crs = gdf.estimate_utm_crs()
gdf_utm = gdf.to_crs(utm_crs)

Common EPSG codes:

CodeNameUse
4326WGS 84GPS coordinates, web data
3857Web MercatorWeb mapping (Google/OSM tiles)
326xxUTM zones (N)Area/distance calculations
5070Albers Equal Area (US)Area-preserving US maps

3. Geometric Operations

# Buffer (expand/erode geometry by distance)
buffered = gdf.geometry.buffer(100)      # 100 units (meters if projected)
eroded = gdf.geometry.buffer(-50)        # Negative = erosion

# Simplify (reduce complexity)
simplified = gdf.geometry.simplify(tolerance=10, preserve_topology=True)

# Centroid, convex hull, envelope
centroids = gdf.geometry.centroid
hulls = gdf.geometry.convex_hull
bounds = gdf.geometry.envelope

# Union all geometries
unified = gdf.geometry.union_all()

# Affine transformations
rotated = gdf.geometry.rotate(angle=45, origin='center')
scaled = gdf.geometry.scale(xfact=2.0, yfact=2.0)
translated = gdf.geometry.translate(xoff=100, yoff=50)

# Geometric properties
areas = gdf.geometry.area           # Use projected CRS for accuracy
lengths = gdf.geometry.length       # Perimeter for polygons
is_valid = gdf.geometry.is_valid    # Validate geometry
total = gdf.geometry.total_bounds   # [minx, miny, maxx, maxy]

4. Spatial Analysis

# Spatial join (combine datasets by spatial relationship)
joined = gpd.sjoin(points_gdf, polygons_gdf, predicate='intersects')
joined = gpd.sjoin(gdf1, gdf2, predicate='within')
joined = gpd.sjoin(gdf1, gdf2, predicate='contains', how='left')

# Nearest neighbor join
nearest = gpd.sjoin_nearest(gdf1, gdf2, max_distance=1000, distance_col='dist')

# Overlay operations (set-theoretic)
intersection = gpd.overlay(gdf1, gdf2, how='intersection')
union = gpd.overlay(gdf1, gdf2, how='union')
difference = gpd.overlay(gdf1, gdf2, how='difference')
sym_diff = gpd.overlay(gdf1, gdf2, how='symmetric_difference')

# Dissolve (aggregate by attribute)
dissolved = gdf.dissolve(by='region', aggfunc='sum')
dissolved = gdf.dissolve(by='region', aggfunc={'population': 'sum', 'area': 'mean'})

# Clip to boundary
clipped = gpd.clip(gdf, boundary_gdf)

# Distance calculations (use projected CRS)
distances = gdf.geometry.distance(single_point)

# Spatial predicates
within_mask = gdf1.geometry.within(gdf2.geometry)
intersects_mask = gdf1.geometry.intersects(gdf2.geometry)

5. Visualization

import matplotlib.pyplot as plt

# Basic plot
gdf.plot(figsize=(10, 8))

# Choropleth map
gdf.plot(column='population', cmap='YlOrRd', legend=True, figsize=(12, 8))

# Classification schemes (requires mapclassify)
gdf.plot(column='population', scheme='quantiles', k=5, legend=True)
gdf.plot(column='population', scheme='fisher_jenks', k=5, legend=True)

# Multi-layer map
fig, ax = plt.subplots(figsize=(12, 10))
polygons_gdf.plot(ax=ax, color='lightblue', edgecolor='black')
points_gdf.plot(ax=ax, color='red', markersize=10)
roads_gdf.plot(ax=ax, color='gray', linewidth=0.5)
ax.set_title('Multi-layer Map')
ax.set_axis_off()

# Interactive map (requires folium)
m = gdf.explore(column='population', cmap='YlOrRd', legend=True,
                tooltip=['name', 'population'])
m.save('map.html')

# Multi-layer interactive
m = gdf1.explore(color='blue', name='Layer 1')
gdf2.explore(m=m, color='red', name='Layer 2')
import folium
folium.LayerControl().add_to(m)

# Basemap (requires contextily)
import contextily as ctx
gdf_wm = gdf.to_crs(epsg=3857)
ax = gdf_wm.plot(alpha=0.5, figsize=(10, 10))
ctx.add_basemap(ax)

Key Concepts

Data Structures

  • GeoSeries: Pandas Series of Shapely geometries with spatial methods (area, distance, buffer, etc.)
  • GeoDataFrame: Pandas DataFrame with one or more geometry columns. One column is the "active geometry" used by spatial methods
from shapely.geometry import Point

# Create from coordinates
gdf = gpd.GeoDataFrame(
    {'name': ['A', 'B'], 'value': [10, 20]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326"
)

# Multiple geometry columns
gdf['centroid'] = gdf.geometry.centroid
gdf = gdf.set_geometry('centroid')  # Switch active geometry

CRS Rules for Spatial Operations

  • Always check CRS before any spatial operation:
    print(gdf.crs)
  • Match CRS before spatial joins, overlays, or distance calculations
  • Use projected CRS (meters) for area/distance — geographic CRS (degrees) gives wrong results
  • set_crs() only adds metadata; to_crs() transforms coordinates

Spatial Indexing

GeoPandas automatically creates spatial indexes (R-tree) for sjoin, overlay, and other spatial operations. For manual queries:

sindex = gdf.sindex
possible_idx = list(sindex.intersection((xmin, ymin, xmax, ymax)))

Common Workflows

Workflow 1: Load → Transform → Analyze → Export

import geopandas as gpd

# Load
gdf = gpd.read_file("parcels.shp")
print(f"CRS: {gdf.crs}, Rows: {len(gdf)}")

# Transform to projected CRS for measurements
gdf = gdf.to_crs(gdf.estimate_utm_crs())

# Analyze
gdf['area_ha'] = gdf.geometry.area / 10000  # hectares
gdf['perimeter_m'] = gdf.geometry.length
print(f"Total area: {gdf['area_ha'].sum():.1f} ha")

# Export
gdf.to_file("parcels_analyzed.gpkg")

Workflow 2: Spatial Join and Aggregate

# Count points per polygon
points_in_poly = gpd.sjoin(points_gdf, polygons_gdf, predicate='within')
counts = points_in_poly.groupby('index_right').agg(
    point_count=('geometry', 'size'),
    total_value=('value', 'sum')
)
result = polygons_gdf.merge(counts, left_index=True, right_index=True, how='left')
result['point_count'] = result['point_count'].fillna(0)
print(f"Polygons with points: {(result['point_count'] > 0).sum()}/{len(result)}")

Workflow 3: Multi-Source Integration

# Read from different sources, ensure matching CRS
roads = gpd.read_file("roads.shp")
buildings = gpd.read_file("buildings.geojson")
parcels = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom')

target_crs = roads.crs
buildings = buildings.to_crs(target_crs)
parcels = parcels.to_crs(target_crs)

# Find buildings within 50m of roads
buildings_near_roads = gpd.sjoin_nearest(
    buildings, roads, max_distance=50, distance_col='road_dist'
)
print(f"Buildings near roads: {len(buildings_near_roads)}/{len(buildings)}")

Key Parameters

ParameterFunctionDefaultEffect
predicate
sjoin
'intersects'
Spatial relationship: intersects, within, contains, touches, crosses
how
sjoin, overlay
'inner'
Join type: inner, left, right
max_distance
sjoin_nearestNoneSearch radius limit (improves performance)
k
sjoin_nearest1Number of nearest neighbors to find
tolerance
simplifyRequiredDouglas-Peucker tolerance (in CRS units)
preserve_topology
simplifyTruePrevents self-intersections
resolution
buffer16Number of segments for buffer curves
aggfunc
dissolve'first'Aggregation function for non-geometry columns
use_arrow
read_fileFalseEnable Arrow acceleration (2-4x faster)
scheme
plotNoneClassification: quantiles, equal_interval, fisher_jenks
k
plot (with scheme)5Number of classification bins

Best Practices

  1. Always check CRS before spatial operations — mismatched CRS gives wrong or empty results
  2. Use projected CRS for area and distance calculations — geographic CRS (degrees) is meaningless for measurements
  3. Match CRS before joins
    gdf2 = gdf2.to_crs(gdf1.crs)
    before
    gpd.sjoin(gdf1, gdf2)
  4. Validate geometries with
    .is_valid
    before complex operations — invalid geometries cause silent errors
  5. Use GeoPackage over Shapefile — no 10-char column name limit, supports multiple layers, better performance
  6. Filter during read — use
    bbox
    ,
    columns
    ,
    where
    to load only needed data for large files
  7. Set max_distance in sjoin_nearest — unbounded nearest-neighbor search is slow on large datasets
  8. Use
    .copy()
    when modifying geometry
    — avoid unintended side effects on original GeoDataFrame

Common Recipes

Recipe: Count Points in Polygons

When to use: Aggregate point observations (sample sites, observations) by region (county, watershed).

import geopandas as gpd

regions = gpd.read_file("regions.geojson")
points = gpd.read_file("observations.geojson").to_crs(regions.crs)

joined = gpd.sjoin(points, regions, how="inner", predicate="within")
counts = joined.groupby("index_right").size().rename("point_count")
regions_with_counts = regions.join(counts).fillna(0)
regions_with_counts["point_count"] = regions_with_counts["point_count"].astype(int)
print(regions_with_counts[["name", "point_count"]].sort_values("point_count", ascending=False).head())

Recipe: Buffer Around Points and Dissolve Overlaps

When to use: Create service areas or catchment zones from point locations.

import geopandas as gpd

facilities = gpd.read_file("facilities.geojson")
utm_crs = facilities.estimate_utm_crs()   # Project to meters
facilities_m = facilities.to_crs(utm_crs)

# 1 km buffer
buffers = facilities_m.copy()
buffers["geometry"] = facilities_m.geometry.buffer(1000)

# Dissolve overlapping buffers into single polygon
service_area = buffers.dissolve()
service_area_wgs84 = service_area.to_crs("EPSG:4326")
service_area_wgs84.to_file("service_area.geojson", driver="GeoJSON")
print(f"Service area: {service_area_wgs84.geometry.area.sum():.0f} sq degrees")

Troubleshooting

ProblemCauseSolution
Empty spatial join resultCRS mismatch between GeoDataFramesEnsure matching CRS:
gdf2 = gdf2.to_crs(gdf1.crs)
Wrong area/distance valuesUsing geographic CRS (degrees)Reproject to projected CRS:
gdf.to_crs(gdf.estimate_utm_crs())
DriverError
on read
Missing driver or corrupt fileCheck file exists; try
driver="GeoJSON"
explicitly
Geometry column lost after mergeCalled
df.merge(gdf)
instead of
gdf.merge(df)
Always call merge ON the GeoDataFrame
TopologicalError
on overlay
Invalid geometriesFix with
gdf.geometry = gdf.geometry.buffer(0)
Slow sjoin_nearestNo distance limit on large datasetSet
max_distance
parameter
Folium map blankGeometries not in EPSG:4326Reproject to WGS84:
gdf.to_crs(epsg=4326).explore()
Shapefile column names truncatedShapefile 10-char limitUse GeoPackage instead:
gdf.to_file("out.gpkg")

References

Related Skills

  • matplotlib-scientific-plotting — advanced map styling and figure export
  • polars-dataframes — high-performance tabular analysis before/after spatial operations
  • folium — advanced interactive web mapping beyond geopandas.explore()