SciAgent-Skills geopandas-geospatial
install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/geopandas-geospatial" ~/.claude/skills/jaechang-hits-sciagent-skills-geopandas-geospatial && rm -rf "$T"
manifest:
skills/scientific-computing/geopandas-geospatial/SKILL.mdsource content
GeoPandas Geospatial Analysis
Overview
GeoPandas extends pandas with spatial operations on geometric types, combining pandas DataFrames with Shapely geometries and Fiona for file I/O. It enables reading, writing, manipulating, and visualizing geospatial vector data (points, lines, polygons) with a familiar pandas-like API.
When to Use
- Reading and writing spatial file formats (Shapefile, GeoJSON, GeoPackage, Parquet)
- Performing spatial joins between geographic datasets (points in polygons, nearest neighbors)
- Running overlay operations (intersection, union, difference, clipping)
- Computing geometric properties (area, distance, buffer, centroid)
- Creating choropleth maps and interactive web maps
- Reprojecting data between coordinate reference systems
- Aggregating spatial features by attribute (dissolve)
- For raster data analysis, use rasterio/xarray instead
- For large-scale distributed geospatial, consider Dask-GeoPandas or Apache Sedona
Prerequisites
pip install geopandas matplotlib # Optional: # pip install folium — interactive maps # pip install mapclassify — classification schemes for choropleth # pip install contextily — basemaps # pip install pyarrow — faster I/O (2-4x speedup) # pip install psycopg2 geoalchemy2 — PostGIS support
Quick Start
import geopandas as gpd # Read spatial data gdf = gpd.read_file("data.geojson") print(f"Shape: {gdf.shape}, CRS: {gdf.crs}") print(f"Geometry types: {gdf.geometry.geom_type.unique()}") # Reproject, compute area, save gdf_proj = gdf.to_crs("EPSG:3857") gdf_proj['area_m2'] = gdf_proj.geometry.area gdf_proj.to_file("output.gpkg") # Quick map gdf.plot(column='population', legend=True, figsize=(10, 8))
Core API
1. Data I/O
import geopandas as gpd # Read various formats gdf = gpd.read_file("data.shp") # Shapefile gdf = gpd.read_file("data.geojson") # GeoJSON gdf = gpd.read_file("data.gpkg") # GeoPackage gdf = gpd.read_file("data.gpkg", layer="roads") # Specific layer # Filtered reading (load only needed data) gdf = gpd.read_file("data.gpkg", bbox=(xmin, ymin, xmax, ymax)) gdf = gpd.read_file("data.gpkg", columns=["name", "geometry"]) gdf = gpd.read_file("data.gpkg", where="population > 10000") # Arrow acceleration (2-4x faster) gdf = gpd.read_file("data.gpkg", use_arrow=True) # Parquet/Feather (columnar, fast, preserves CRS) gdf = gpd.read_parquet("data.parquet") gdf.to_parquet("output.parquet") # PostGIS database from sqlalchemy import create_engine engine = create_engine("postgresql://user:pass@host/db") gdf = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom') gdf.to_postgis("output_table", con=engine) # Write gdf.to_file("output.gpkg") # GeoPackage (recommended) gdf.to_file("output.shp") # Shapefile gdf.to_file("output.geojson", driver="GeoJSON")
2. CRS Management
# Check current CRS print(gdf.crs) # e.g., EPSG:4326 print(gdf.crs.is_geographic) # True for lat/lon print(gdf.crs.is_projected) # True for meters # Reproject (transforms coordinates) gdf_proj = gdf.to_crs("EPSG:3857") # Web Mercator gdf_proj = gdf.to_crs(epsg=32633) # UTM zone 33N # Set CRS (only when metadata missing, does NOT transform coordinates) gdf = gdf.set_crs("EPSG:4326") # Estimate appropriate UTM zone utm_crs = gdf.estimate_utm_crs() gdf_utm = gdf.to_crs(utm_crs)
Common EPSG codes:
| Code | Name | Use |
|---|---|---|
| 4326 | WGS 84 | GPS coordinates, web data |
| 3857 | Web Mercator | Web mapping (Google/OSM tiles) |
| 326xx | UTM zones (N) | Area/distance calculations |
| 5070 | Albers Equal Area (US) | Area-preserving US maps |
3. Geometric Operations
# Buffer (expand/erode geometry by distance) buffered = gdf.geometry.buffer(100) # 100 units (meters if projected) eroded = gdf.geometry.buffer(-50) # Negative = erosion # Simplify (reduce complexity) simplified = gdf.geometry.simplify(tolerance=10, preserve_topology=True) # Centroid, convex hull, envelope centroids = gdf.geometry.centroid hulls = gdf.geometry.convex_hull bounds = gdf.geometry.envelope # Union all geometries unified = gdf.geometry.union_all() # Affine transformations rotated = gdf.geometry.rotate(angle=45, origin='center') scaled = gdf.geometry.scale(xfact=2.0, yfact=2.0) translated = gdf.geometry.translate(xoff=100, yoff=50) # Geometric properties areas = gdf.geometry.area # Use projected CRS for accuracy lengths = gdf.geometry.length # Perimeter for polygons is_valid = gdf.geometry.is_valid # Validate geometry total = gdf.geometry.total_bounds # [minx, miny, maxx, maxy]
4. Spatial Analysis
# Spatial join (combine datasets by spatial relationship) joined = gpd.sjoin(points_gdf, polygons_gdf, predicate='intersects') joined = gpd.sjoin(gdf1, gdf2, predicate='within') joined = gpd.sjoin(gdf1, gdf2, predicate='contains', how='left') # Nearest neighbor join nearest = gpd.sjoin_nearest(gdf1, gdf2, max_distance=1000, distance_col='dist') # Overlay operations (set-theoretic) intersection = gpd.overlay(gdf1, gdf2, how='intersection') union = gpd.overlay(gdf1, gdf2, how='union') difference = gpd.overlay(gdf1, gdf2, how='difference') sym_diff = gpd.overlay(gdf1, gdf2, how='symmetric_difference') # Dissolve (aggregate by attribute) dissolved = gdf.dissolve(by='region', aggfunc='sum') dissolved = gdf.dissolve(by='region', aggfunc={'population': 'sum', 'area': 'mean'}) # Clip to boundary clipped = gpd.clip(gdf, boundary_gdf) # Distance calculations (use projected CRS) distances = gdf.geometry.distance(single_point) # Spatial predicates within_mask = gdf1.geometry.within(gdf2.geometry) intersects_mask = gdf1.geometry.intersects(gdf2.geometry)
5. Visualization
import matplotlib.pyplot as plt # Basic plot gdf.plot(figsize=(10, 8)) # Choropleth map gdf.plot(column='population', cmap='YlOrRd', legend=True, figsize=(12, 8)) # Classification schemes (requires mapclassify) gdf.plot(column='population', scheme='quantiles', k=5, legend=True) gdf.plot(column='population', scheme='fisher_jenks', k=5, legend=True) # Multi-layer map fig, ax = plt.subplots(figsize=(12, 10)) polygons_gdf.plot(ax=ax, color='lightblue', edgecolor='black') points_gdf.plot(ax=ax, color='red', markersize=10) roads_gdf.plot(ax=ax, color='gray', linewidth=0.5) ax.set_title('Multi-layer Map') ax.set_axis_off() # Interactive map (requires folium) m = gdf.explore(column='population', cmap='YlOrRd', legend=True, tooltip=['name', 'population']) m.save('map.html') # Multi-layer interactive m = gdf1.explore(color='blue', name='Layer 1') gdf2.explore(m=m, color='red', name='Layer 2') import folium folium.LayerControl().add_to(m) # Basemap (requires contextily) import contextily as ctx gdf_wm = gdf.to_crs(epsg=3857) ax = gdf_wm.plot(alpha=0.5, figsize=(10, 10)) ctx.add_basemap(ax)
Key Concepts
Data Structures
- GeoSeries: Pandas Series of Shapely geometries with spatial methods (area, distance, buffer, etc.)
- GeoDataFrame: Pandas DataFrame with one or more geometry columns. One column is the "active geometry" used by spatial methods
from shapely.geometry import Point # Create from coordinates gdf = gpd.GeoDataFrame( {'name': ['A', 'B'], 'value': [10, 20]}, geometry=[Point(0, 0), Point(1, 1)], crs="EPSG:4326" ) # Multiple geometry columns gdf['centroid'] = gdf.geometry.centroid gdf = gdf.set_geometry('centroid') # Switch active geometry
CRS Rules for Spatial Operations
- Always check CRS before any spatial operation:
print(gdf.crs) - Match CRS before spatial joins, overlays, or distance calculations
- Use projected CRS (meters) for area/distance — geographic CRS (degrees) gives wrong results
- set_crs() only adds metadata; to_crs() transforms coordinates
Spatial Indexing
GeoPandas automatically creates spatial indexes (R-tree) for sjoin, overlay, and other spatial operations. For manual queries:
sindex = gdf.sindex possible_idx = list(sindex.intersection((xmin, ymin, xmax, ymax)))
Common Workflows
Workflow 1: Load → Transform → Analyze → Export
import geopandas as gpd # Load gdf = gpd.read_file("parcels.shp") print(f"CRS: {gdf.crs}, Rows: {len(gdf)}") # Transform to projected CRS for measurements gdf = gdf.to_crs(gdf.estimate_utm_crs()) # Analyze gdf['area_ha'] = gdf.geometry.area / 10000 # hectares gdf['perimeter_m'] = gdf.geometry.length print(f"Total area: {gdf['area_ha'].sum():.1f} ha") # Export gdf.to_file("parcels_analyzed.gpkg")
Workflow 2: Spatial Join and Aggregate
# Count points per polygon points_in_poly = gpd.sjoin(points_gdf, polygons_gdf, predicate='within') counts = points_in_poly.groupby('index_right').agg( point_count=('geometry', 'size'), total_value=('value', 'sum') ) result = polygons_gdf.merge(counts, left_index=True, right_index=True, how='left') result['point_count'] = result['point_count'].fillna(0) print(f"Polygons with points: {(result['point_count'] > 0).sum()}/{len(result)}")
Workflow 3: Multi-Source Integration
# Read from different sources, ensure matching CRS roads = gpd.read_file("roads.shp") buildings = gpd.read_file("buildings.geojson") parcels = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom') target_crs = roads.crs buildings = buildings.to_crs(target_crs) parcels = parcels.to_crs(target_crs) # Find buildings within 50m of roads buildings_near_roads = gpd.sjoin_nearest( buildings, roads, max_distance=50, distance_col='road_dist' ) print(f"Buildings near roads: {len(buildings_near_roads)}/{len(buildings)}")
Key Parameters
| Parameter | Function | Default | Effect |
|---|---|---|---|
| sjoin | | Spatial relationship: intersects, within, contains, touches, crosses |
| sjoin, overlay | | Join type: inner, left, right |
| sjoin_nearest | None | Search radius limit (improves performance) |
| sjoin_nearest | 1 | Number of nearest neighbors to find |
| simplify | Required | Douglas-Peucker tolerance (in CRS units) |
| simplify | True | Prevents self-intersections |
| buffer | 16 | Number of segments for buffer curves |
| dissolve | 'first' | Aggregation function for non-geometry columns |
| read_file | False | Enable Arrow acceleration (2-4x faster) |
| plot | None | Classification: quantiles, equal_interval, fisher_jenks |
| plot (with scheme) | 5 | Number of classification bins |
Best Practices
- Always check CRS before spatial operations — mismatched CRS gives wrong or empty results
- Use projected CRS for area and distance calculations — geographic CRS (degrees) is meaningless for measurements
- Match CRS before joins —
beforegdf2 = gdf2.to_crs(gdf1.crs)gpd.sjoin(gdf1, gdf2) - Validate geometries with
before complex operations — invalid geometries cause silent errors.is_valid - Use GeoPackage over Shapefile — no 10-char column name limit, supports multiple layers, better performance
- Filter during read — use
,bbox
,columns
to load only needed data for large fileswhere - Set max_distance in sjoin_nearest — unbounded nearest-neighbor search is slow on large datasets
- Use
when modifying geometry — avoid unintended side effects on original GeoDataFrame.copy()
Common Recipes
Recipe: Count Points in Polygons
When to use: Aggregate point observations (sample sites, observations) by region (county, watershed).
import geopandas as gpd regions = gpd.read_file("regions.geojson") points = gpd.read_file("observations.geojson").to_crs(regions.crs) joined = gpd.sjoin(points, regions, how="inner", predicate="within") counts = joined.groupby("index_right").size().rename("point_count") regions_with_counts = regions.join(counts).fillna(0) regions_with_counts["point_count"] = regions_with_counts["point_count"].astype(int) print(regions_with_counts[["name", "point_count"]].sort_values("point_count", ascending=False).head())
Recipe: Buffer Around Points and Dissolve Overlaps
When to use: Create service areas or catchment zones from point locations.
import geopandas as gpd facilities = gpd.read_file("facilities.geojson") utm_crs = facilities.estimate_utm_crs() # Project to meters facilities_m = facilities.to_crs(utm_crs) # 1 km buffer buffers = facilities_m.copy() buffers["geometry"] = facilities_m.geometry.buffer(1000) # Dissolve overlapping buffers into single polygon service_area = buffers.dissolve() service_area_wgs84 = service_area.to_crs("EPSG:4326") service_area_wgs84.to_file("service_area.geojson", driver="GeoJSON") print(f"Service area: {service_area_wgs84.geometry.area.sum():.0f} sq degrees")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty spatial join result | CRS mismatch between GeoDataFrames | Ensure matching CRS: |
| Wrong area/distance values | Using geographic CRS (degrees) | Reproject to projected CRS: |
on read | Missing driver or corrupt file | Check file exists; try explicitly |
| Geometry column lost after merge | Called instead of | Always call merge ON the GeoDataFrame |
on overlay | Invalid geometries | Fix with |
| Slow sjoin_nearest | No distance limit on large dataset | Set parameter |
| Folium map blank | Geometries not in EPSG:4326 | Reproject to WGS84: |
| Shapefile column names truncated | Shapefile 10-char limit | Use GeoPackage instead: |
References
- GeoPandas documentation: https://geopandas.org/
- GeoPandas GitHub: https://github.com/geopandas/geopandas
- Shapely documentation: https://shapely.readthedocs.io/
- EPSG registry: https://epsg.io/
Related Skills
- matplotlib-scientific-plotting — advanced map styling and figure export
- polars-dataframes — high-performance tabular analysis before/after spatial operations
- folium — advanced interactive web mapping beyond geopandas.explore()