Awesome-Agent-Skills-for-Empirical-Research education-data-source-nhgis

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/education-data-source-nhgis" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-education-data-so-b8b06d && rm -rf "$T"
manifest: skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/education-data-source-nhgis/SKILL.md
source content

NHGIS Data Source Reference

IPUMS NHGIS: census geography crosswalks and demographic data for education research. The Education Data Portal carries only the geographic crosswalk tables, which link K-12 schools (ncessch) and colleges (unitid) to census tracts, block groups, CBSAs, and regions for the 1990-2020 censuses. Census demographic variables (income, poverty, race, educational attainment) are NOT in the Portal; access them directly from NHGIS via free IPUMS registration. Use when linking school or institutional data to census geography for contextual analysis.

Census geography and demographic data source for education research. NHGIS provides the foundation for linking schools to community characteristics via census tracts, block groups, and school district boundaries.

CRITICAL: Value Encoding

When accessing NHGIS data through the Education Data Portal (not NHGIS directly), categorical variables use integer encodings, not string labels. Always verify the exact codes in the mirror codebook.

| Variable | Integer Code | Meaning |
| --- | --- | --- |
| census_region | 1 | Northeast |
| census_region | 2 | Midwest |
| census_region | 3 | South |
| census_region | 4 | West |
| cbsa_type | 1 | Metropolitan |
| cbsa_type | 2 | Micropolitan |
| geocode_accuracy | 4 | Did not geocode |

See ./references/variable-catalog.md for complete encoding tables.
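As a minimal sketch of working with these integer encodings, the snippet below attaches readable labels to the codes listed in the table. The `decode` helper, the label dictionaries, and the sample rows are illustrative assumptions, not part of the Portal or this skill; the mappings should still be checked against the live codebook.

```python
# Code-to-label mappings copied from the table above (assumption: they match
# the current codebook; verify before relying on them).
CENSUS_REGION_LABELS = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}
CBSA_TYPE_LABELS = {1: "Metropolitan", 2: "Micropolitan"}


def decode(code, labels):
    """Return the label for an integer code, or None if missing or undocumented."""
    if code is None:
        return None
    return labels.get(int(code))


# Hypothetical rows shaped like Portal crosswalk records.
rows = [
    {"ncessch": 10000201704, "census_region": 3, "cbsa_type": 1},
    {"ncessch": 10000201705, "census_region": 9, "cbsa_type": None},
]

decoded = [
    {
        **row,
        "census_region_label": decode(row["census_region"], CENSUS_REGION_LABELS),
        "cbsa_type_label": decode(row["cbsa_type"], CBSA_TYPE_LABELS),
    }
    for row in rows
]
```

Codes that are not in a mapping come back as None rather than raising, so undocumented values surface as gaps to investigate instead of being silently mislabeled.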

CRITICAL: Portal Data Scope

The Education Data Portal provides ONLY geographic crosswalk tables that link schools and colleges to census geography (tracts, block groups, regions, CBSAs). These contain geographic identifiers and assignment columns — approximately 35-47 columns per file.

The Portal does NOT provide census demographic data (population, income, poverty, race, education attainment, housing, language, etc.). For demographic variables, you must access NHGIS directly via IPUMS (free registration required). See ./references/data-access.md for direct access methods.

This skill documents both contexts: Portal crosswalk data (with integer encodings above) and direct NHGIS census variables (in ./references/variable-catalog.md, clearly marked as requiring direct NHGIS access).

What is NHGIS?

NHGIS (from IPUMS, University of Minnesota) provides free access to census geography and demographic data.

  • Collector: IPUMS, University of Minnesota
  • Coverage: US census data from 1790-present (decennial census + ACS)
  • Content: Summary tables, GIS boundary files, time series tables, geographic crosswalks
  • Frequency: Decennial census (every 10 years) + ACS (annual, 5-year rolling)
  • Available years: 1790-2020 (decennial), 2005-2023 (ACS 5-year)
  • Primary identifiers: GISJOIN (NHGIS internal), GEOID (Census Bureau standard)
  • Education relevance: Links school locations to community demographics via census tracts, block groups, and school district boundaries
  • Available through Education Data Portal: Geographic crosswalk tables only (school-to-census and college-to-census links for census 1990, 2000, 2010, 2020). Census demographic data requires direct NHGIS access.

Reference File Structure

| File | Purpose | When to Read |
| --- | --- | --- |
| geographic-units.md | Census geography hierarchy (tracts, blocks, districts) | Understanding census geography |
| school-geography-links.md | Linking schools to census areas | Connecting school data to demographics |
| time-series.md | Historical data, harmonization methods | Longitudinal analysis |
| variable-catalog.md | Key demographic variables, codes, special values | Selecting census variables or interpreting encodings |
| boundary-changes.md | How boundaries change between censuses | Handling geographic inconsistencies |
| data-access.md | Direct NHGIS access methods (registration, Data Finder, ipumspy) | Custom census analysis beyond Portal |

Decision Trees

What geographic level should I use?

Research question about...
├─ Individual schools
│   ├─ School's immediate neighborhood → Census tract or block group
│   ├─ School attendance zone → SABINS (limited years) or block-to-school crosswalk
│   └─ School district overall → School district boundaries
├─ School districts
│   ├─ District-level demographics → School district geographic level
│   ├─ Within-district variation → Census tracts within district
│   └─ District poverty estimates → SAIPE (via Education Data Portal)
├─ Regional patterns
│   ├─ County-level → County boundaries
│   ├─ Metro area → CBSA (Core Based Statistical Area)
│   └─ State-level → State boundaries
└─ Historical analysis
    ├─ Consistent boundaries needed → Geographically standardized tables
    └─ Original boundaries OK → Nominally integrated tables

How do I link schools to census data?

Linking schools to census demographics?
├─ Have school coordinates (lat/lon)
│   ├─ Point-in-polygon → Spatial join to tract/block group boundaries
│   └─ Need tract ID only → Geocoding service or FCC API
├─ Have school NCES ID only
│   ├─ Use NCES EDGE files → School District Geographic Relationship Files
│   └─ Use Education Data Portal → NHGIS source provides tract links
├─ Need school attendance zones
│   ├─ 2009-2012 data → SABINS school areas
│   └─ Current data → Contact school district (no national source)
└─ See ./references/school-geography-links.md for details
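
The point-in-polygon branch of the tree above can be sketched with a plain ray-casting test. This is only an illustration: the rectangles below are hypothetical stand-ins for tract polygons, the second tract ID is invented, and real work would normally use a GIS library with NHGIS boundary files and a properly projected coordinate system.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) falls inside the polygon.

    polygon is a list of (lon, lat) vertices; the ring is closed implicitly.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_tract(lon, lat, tracts):
    """Return the ID of the first tract polygon containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Hypothetical rectangles, NOT real tract boundaries.
TRACTS = {
    "06001402100": [(-122.30, 37.80), (-122.25, 37.80), (-122.25, 37.85), (-122.30, 37.85)],
    "06001402200": [(-122.25, 37.80), (-122.20, 37.80), (-122.20, 37.85), (-122.25, 37.85)],
}

school_tract = assign_tract(-122.27, 37.82, TRACTS)
```

Points that fall outside every polygon return None, which mirrors the case where a school cannot be placed in any tract and needs manual review.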

What time period data do I need?

Time period?
├─ Single recent year
│   ├─ Tract/block group level → ACS 5-year (most recent)
│   ├─ Larger areas (65K+ pop) → ACS 1-year
│   └─ Full census count → 2020 Decennial Census
├─ Historical comparison
│   ├─ Same boundaries across time → Geographically standardized tables (to 2010)
│   ├─ Original boundaries → Nominally integrated time series
│   └─ Custom standardization → Use geographic crosswalks
├─ Long time series (1970+)
│   └─ See ./references/time-series.md
└─ Pre-1970
    └─ Limited tract coverage; county/state more complete
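
The "single recent year" branch of this tree can be written as a small helper. The function name, argument names, and the fallback to ACS 5-year for areas below the 65,000 threshold are assumptions made for illustration; the thresholds and source names come from the tree above.

```python
ACS_1YR_MIN_POPULATION = 65_000  # population threshold from the tree above


def suggest_source(geo_level, population=None, need_full_count=False):
    """Suggest a data product for a single recent year.

    geo_level: e.g. "tract", "block_group", "county" (labels are illustrative).
    population: approximate population of the area, if known.
    need_full_count: True when a 100% count is required rather than a sample.
    """
    if need_full_count:
        return "2020 Decennial Census"
    if geo_level in {"tract", "block_group"}:
        return "ACS 5-year"
    if population is not None and population >= ACS_1YR_MIN_POPULATION:
        return "ACS 1-year"
    # Assumed fallback: smaller areas are covered by the 5-year product.
    return "ACS 5-year"
```

For example, `suggest_source("county", population=250_000)` points to the 1-year product, while the same call for a county of 20,000 people falls back to the 5-year product.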

Quick Reference: Geographic Levels and Variables

Geographic Levels

| Level | Typical Size | Education Use | NHGIS Coverage |
| --- | --- | --- | --- |
| Block | ~40 people | Point locations | 1990-2020 |
| Block Group | ~1,500 people | School neighborhoods | 1990-2020 |
| Census Tract | ~4,000 people | Community context | 1910-2020 |
| County Subdivision | Varies | Rural areas | 1980-2020 |
| Place | City/town | Urban context | 1980-2020 |
| School District | Varies | District analysis | 2000-2020 |
| County | ~100,000 people | Regional patterns | 1790-2020 |
| State | Varies | Policy analysis | 1790-2020 |

Key Identifiers

| ID | Format | Level | Example | Notes |
| --- | --- | --- | --- | --- |
| ncessch | Int64 | School | 10000201704 | NCES school ID (schools Portal data) |
| unitid | Int64 | College | 100654 | IPEDS institution ID (colleges Portal data) |
| GISJOIN | String with prefix | Any | G0600010 | NHGIS internal ID; use for direct NHGIS joins (not in Portal data) |
| GEOID | Numeric string | Any | 06001402100 | Census Bureau standard; use for non-NHGIS joins (not in Portal data) |
| tract | Int64 | Tract | 402100 | Census tract number (in Portal data) |
| block_group | Int64 | Block Group | 1 | Block group within tract (1-9; 0=unassigned) |
| geoid_block | Int64 | Block | 60014021001001 | Full block FIPS code (in Portal data; stored as Int64, not String) |
| cbsa | Int64 | Metro area | 41860 | Core Based Statistical Area code (2000+ census files only) |
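
Because geoid_block is stored as an integer, the leading zero of the state FIPS code is lost. The sketch below restores the 15-digit string, splits it into its components, and builds a tract-level GISJOIN. The helper names are hypothetical, and the GISJOIN pattern used here (G, state FIPS, 0, county FIPS, 0, tract) is an assumption that is consistent with the county example G0600010 above; confirm it against the NHGIS documentation before joining.

```python
def split_block_geoid(geoid_block):
    """Zero-pad an integer block GEOID to 15 digits and split it into parts."""
    text = str(int(geoid_block)).zfill(15)
    if len(text) != 15:
        raise ValueError(f"expected at most 15 digits, got {text!r}")
    return {
        "state": text[0:2],         # state FIPS
        "county": text[2:5],        # county FIPS
        "tract": text[5:11],        # 6-digit tract code
        "block": text[11:15],       # 4-digit block code
        "block_group": text[11],    # first digit of the block code
        "tract_geoid": text[:11],   # Census Bureau tract GEOID
        "block_geoid": text,        # full 15-character block GEOID
    }


def tract_geoid_to_gisjoin(tract_geoid):
    """Build a tract GISJOIN, assuming the pattern G + SS + 0 + CCC + 0 + TTTTTT."""
    if len(tract_geoid) != 11 or not tract_geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {tract_geoid!r}")
    return "G" + tract_geoid[0:2] + "0" + tract_geoid[2:5] + "0" + tract_geoid[5:11]


parts = split_block_geoid(60014021001001)  # example value from the table above
gisjoin = tract_geoid_to_gisjoin(parts["tract_geoid"])
```

Keeping identifiers as zero-padded strings after this step avoids the most common GISJOIN vs GEOID join failures, since both sides then compare character by character.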

Key Education Variables

| Topic | Example Variables | Source |
| --- | --- | --- |
| Child population | Under 18, 5-17 school-age | Decennial, ACS |
| Race/ethnicity | Hispanic, White, Black, Asian, etc. | Decennial, ACS |
| Poverty | Persons below poverty, SNAP receipt | ACS (sample) |
| Education attainment | HS diploma, BA+ (adults) | ACS (sample) |
| Language | English proficiency, language at home | ACS (sample) |
| Housing | Owner/renter, median value, crowding | Decennial, ACS |
| Family structure | Single-parent, grandparent households | ACS (sample) |
| Immigration | Foreign-born, recent immigrants | ACS (sample) |

Data Sources by Type

| Source | Years | Geographic Detail | Content |
| --- | --- | --- | --- |
| Decennial Census | 1790-2020 | Block (1990+) | 100% count: age, sex, race, housing |
| ACS 5-Year | 2005-2023 | Block group | Sample: income, education, language |
| ACS 1-Year | 2010-2023 | Areas 65K+ pop | Sample: same as 5-year |
| Time Series | 1790-2020 | Varies | Harmonized across years |
| Geographic Crosswalks | 1990-2020 | Block+ | Interpolation weights |
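
To illustrate how interpolation weights from a geographic crosswalk are typically applied, the sketch below reallocates counts from source zones to target zones as a weighted sum. The zone IDs, weights, and the `reallocate` helper are hypothetical; actual NHGIS crosswalk files have their own column names and weight definitions, documented by NHGIS.

```python
from collections import defaultdict


def reallocate(counts, crosswalk):
    """Apply crosswalk weights to move counts from source to target zones.

    counts: {source_id: value}
    crosswalk: iterable of (source_id, target_id, weight), where the weights
    for each source_id are assumed to sum to 1.
    """
    out = defaultdict(float)
    for source_id, target_id, weight in crosswalk:
        if source_id in counts:
            out[target_id] += counts[source_id] * weight
    return dict(out)


# Hypothetical example: source tract A is split between targets X and Y,
# while source tract B maps entirely to Y.
counts_source = {"A": 1000, "B": 400}
crosswalk = [("A", "X", 0.75), ("A", "Y", 0.25), ("B", "Y", 1.0)]

counts_target = reallocate(counts_source, crosswalk)
```

When the weights for each source zone sum to 1, the total count is preserved, which is a quick sanity check after any reallocation.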

Portal Variables (Schools NHGIS)

Key geographic and identifying columns in the schools NHGIS datasets. Census 2020 files have 47 columns; earlier census years have fewer (e.g., 1990 has 35 columns — no CBSA or legislative district fields).

| Variable | Description | Type |
| --- | --- | --- |
| ncessch | NCES school ID | Int64 |
| leaid | NCES district ID | Int64 |
| tract | Census tract number | Int64 |
| block_group | Block group number (1-9; 0 = unassigned) | Int64 |
| geoid_block | Full block FIPS identifier | Int64 |
| census_region | Census Bureau region (1-4, 9) | Int64 |
| census_division | Census Bureau division (1-9) | Int64 |
| cbsa | CBSA code (2000+ census files only) | Int64 |
| cbsa_type | Metropolitan (1) or Micropolitan (2) | Int64 |
| cbsa_city | Principal city indicator (0=No, 1=Yes; 2000+ only). See note below. | Int64 |
| geocode_accuracy | Geocode confidence (1=High, 2=Medium, 3=Low, 4=Did not geocode, -2=N/A) | Float64 |
| geocode_accuracy_detailed | Geocode match type (1-12) | Int64 |
| class_code | FIPS place class code | Int64 |
| lower_chamber_type | State legislative district lower chamber type (1-8; census 2010 only). See variable-catalog.md for code mapping. | Int64 |
| geo_latitude / geo_longitude | Geocoded coordinates | Float64 |
| latitude / longitude | CCD-reported coordinates (many nulls in early years) | Float64 |
| fips | State FIPS code | Int64 |
| puma | Public Use Microdata Area (2000+ census files only) | Int64 |

Portal Variables (Colleges NHGIS)

Colleges NHGIS datasets have 38 columns (2020 census). Different identifier set from schools.

| Variable | Description | Type |
| --- | --- | --- |
| unitid | IPEDS institution ID | Int64 |
| opeid | Office of Postsecondary Education ID | String |
| tract | Census tract number | Int64 |
| block_group | Block group number (1-9) | Int64 |
| geoid_block | Full block FIPS identifier | Int64 |
| census_region | Census Bureau region (1-4, 9) | Int64 |
| census_division | Census Bureau division (1-9) | Int64 |
| cbsa | CBSA code | Int64 |
| cbsa_type | Metropolitan (1) or Micropolitan (2) | Int64 |
| cbsa_city | Principal city indicator (0=No, 1=Yes; 2000+ only) | Int64 |
| geocode_accuracy | Geocode match score (Int64 in colleges, Float64 in schools) | Int64 |
| county_fips | County FIPS code | Int64 |
| county_name | County name | String |
| state_abbr | State abbreviation | String |

Missing Data Codes

| Code | Meaning | When Used |
| --- | --- | --- |
| -2 | Not applicable (N/A); distinct from code 4, "Did not geocode" | geocode_accuracy field in Portal data |
| -1 | Missing/not reported | General missing data indicator (e.g., latitude, county_code) |
| 0 | Unassigned | block_group (rare, ~4 rows in schools) |
| null | Not available | Variable not applicable to this record; many columns heavily null in early years |
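
A minimal sketch of normalizing these sentinel codes before analysis is shown below. The `clean_value` helper and its column-specific rules are assumptions for illustration; in particular, treating block_group 0 as missing and leaving geocode_accuracy 4 as a valid category are choices that should be checked against the codebook and the observed data.

```python
# Sentinel codes from the table above that are treated as missing in any column.
GENERAL_MISSING = {-1, -2}


def clean_value(column, value):
    """Map documented sentinel codes to None; pass other values through."""
    if value is None:
        return None
    if value in GENERAL_MISSING:  # also matches -1.0 / -2.0 in Float64 columns
        return None
    if column == "block_group" and value == 0:
        return None  # 0 = unassigned block group
    return value


# Hypothetical record with a mix of sentinel and real values.
record = {
    "geocode_accuracy": -2.0,
    "block_group": 0,
    "latitude": -1,
    "geo_longitude": -122.27,
    "tract": 402100,
}
cleaned = {column: clean_value(column, value) for column, value in record.items()}
```

Only exact sentinel values are replaced, so legitimate negative coordinates such as western-hemisphere longitudes pass through unchanged.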

Schema Difference: Schools NHGIS 2020 files (47 columns) have a different schema than colleges NHGIS 2020 files (38 columns). Schools data includes school-specific identifiers (ncessch, leaid, school_name, mailing/location address fields) while colleges data includes institution-specific identifiers (unitid, opeid, inst_name, county_name). Both entity types have block-level geographic precision. Earlier census years have fewer columns (e.g., Schools 1990 has 35 columns, with no CBSA or legislative district fields). Do not assume identical column structures when working across entities or census years.
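
One way to guard against these schema differences is to check column availability before running any analysis. The sketch below is illustrative only: the helper name is hypothetical and the column lists are abbreviated examples, not the full schemas. With a Polars DataFrame, the `available` argument could be `df.columns`.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order requested."""
    have = set(available)
    return [name for name in required if name not in have]


# Hypothetical, abbreviated column lists for two files.
schools_1990_cols = ["ncessch", "leaid", "tract", "block_group", "geoid_block", "census_region"]
colleges_2020_cols = ["unitid", "opeid", "tract", "block_group", "geoid_block", "cbsa", "cbsa_type"]

needed = ["tract", "block_group", "cbsa", "cbsa_type"]
missing_1990 = missing_columns(schools_1990_cols, needed)

# Columns safe to use when stacking or comparing the two files.
colleges_set = set(colleges_2020_cols)
shared = [name for name in schools_1990_cols if name in colleges_set]
```

Failing early on a non-empty `missing_1990` list is usually clearer than a column-not-found error raised deep inside a later filter or join.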

Data Access

Datasets for NHGIS are available via the mirror system. See datasets-reference.md for canonical paths, mirrors.yaml for mirror configuration, and fetch-patterns.md for fetch code patterns.

| Dataset | Type | Years | Path | Codebook |
| --- | --- | --- | --- | --- |
| Schools Census 1990 | Single | 1986-2023 | nhgis/schools_nhgis_geog_1990 | nhgis/codebook_schools_nhgis_census1990 |
| Schools Census 2000 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2000 | nhgis/codebook_schools_nhgis_census2000 |
| Schools Census 2010 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2010 | nhgis/codebook_schools_nhgis_census2010 |
| Schools Census 2020 | Single | 1986-2023 | nhgis/schools_nhgis_geog_2020 | nhgis/codebook_schools_nhgis_census2020 |
| Colleges Census 1990 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_1990 | nhgis/codebook_colleges_nhgis_census1990 |
| Colleges Census 2000 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2000 | nhgis/codebook_colleges_nhgis_census2000 |
| Colleges Census 2010 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2010 | nhgis/codebook_colleges_nhgis_census2010 |
| Colleges Census 2020 | Single | 1980-2023 | nhgis/colleges_nhgis_geog_2020 | nhgis/codebook_colleges_nhgis_census2020 |

Codebooks are .xls files co-located with data in all mirrors. Use get_codebook_url() from fetch-patterns.md to construct download URLs.

Truth Hierarchy: When interpreting variable values, apply this priority:

  1. Actual data file (what you observe in the parquet/CSV) — this IS the truth
  2. Live codebook (.xls in mirror) — authoritative documentation, may lag
  3. This skill documentation — convenient summary, may drift from codebook

If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
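
The truth hierarchy can be applied mechanically by comparing the codes observed in a data file with the codes this documentation lists. The sketch below assumes a hypothetical helper and hypothetical observed values; anything it returns is a prompt to consult the live codebook, not an error in itself.

```python
def undocumented_codes(observed_values, documented_codes):
    """Return observed non-null codes that the documentation does not list."""
    observed = {value for value in observed_values if value is not None}
    return sorted(observed - set(documented_codes))


# Codes listed for census_region in this skill's encoding table.
documented_region = {1, 2, 3, 4}

# Hypothetical values read from a data file.
observed_region = [1, 3, 3, 9, None, 2, 4]

extra = undocumented_codes(observed_region, documented_region)
```

Here the observed code 9 is not in the encoding table, so the next step would be to look it up in the codebook rather than to drop or relabel those rows.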

Filtering

import polars as pl

# df is assumed to be a Polars DataFrame already loaded from one of the
# Portal NHGIS datasets listed under Data Access.

# Filter to a specific school
school_census = df.filter(pl.col("ncessch") == 10000201704)

# Filter to metropolitan areas only (cbsa_type only in 2000+ census files)
metro = df.filter(pl.col("cbsa_type") == 1)

# Filter to a specific census region (South)
south = df.filter(pl.col("census_region") == 3)

# Filter to a specific year
recent = df.filter(pl.col("year") == 2023)

Note: The Portal provides pre-processed school/college-to-census-geography links. For custom census analysis (tract-level demographics, time series, boundary files), use NHGIS directly via methods in ./references/data-access.md (requires free IPUMS registration).

Common Pitfalls

| Pitfall | Issue | Solution |
| --- | --- | --- |
| Boundary changes | Tracts split/merged between censuses break longitudinal analysis | Use crosswalks or geographically standardized tables |
| ACS margins of error | Small-area estimates have high uncertainty | Check MOE; aggregate areas if needed |
| Block data limitations | Only 100% count variables available (no income/poverty) | Use block groups for sample data (ACS) |
| GISJOIN vs GEOID | Different ID formats cause join failures | Use GISJOIN for NHGIS joins, GEOID for Census Bureau joins |
| 2020 Census noise | Differential privacy added noise to small-area counts | Check for negative values; prefer ACS for detailed characteristics |
| Schools vs colleges schema | Different column counts (47 vs 38 for 2020) and identifier sets | Check schema before joining; do not assume identical structures |
| Census year schema drift | Earlier census files have fewer columns (e.g., 1990 lacks CBSA/legislative fields) | Check available columns per census year before relying on them |
| geocode_accuracy type | Float64 in schools, Int64 in colleges | Cast to consistent type before cross-entity comparison |
| Using string codes | Portal data uses integer encodings, not string labels | Always verify codes against codebook (see encoding warning above) |
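
For the ACS margin-of-error pitfall, the sketch below shows the standard approximation for the MOE of a sum of estimates (root sum of squares) and a coefficient of variation as a rough reliability check. The tract estimates and MOEs are hypothetical, the helper names are illustrative, and the approximation ignores covariance between the estimates.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90% confidence level


def aggregate_moe(moes):
    """Approximate MOE of a sum of estimates: square root of the summed squares."""
    return math.sqrt(sum(moe * moe for moe in moes))


def coefficient_of_variation(estimate, moe):
    """Standard error (MOE / 1.645) divided by the estimate; None if estimate is 0."""
    if estimate == 0:
        return None
    return (moe / Z_90) / estimate


# Hypothetical tract-level estimates and MOEs (e.g., children in poverty).
estimates = [120, 80, 200]
moes = [30, 40, 120]

total = sum(estimates)
total_moe = aggregate_moe(moes)
total_cv = coefficient_of_variation(total, total_moe)
```

Aggregating neighboring areas usually lowers the coefficient of variation relative to the least reliable component, which is the reason the table suggests aggregation for small-area estimates.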

Related Data Sources

| Source | Relationship | When to Use |
| --- | --- | --- |
| education-data-source-ccd | School identifiers for linking | Join school data to census geography via ncessch |
| education-data-source-saipe | District-level poverty | Use SAIPE for district poverty; NHGIS for tract/block group poverty |
| education-data-source-meps | School-level poverty | MEPS provides school-level poverty estimates; NHGIS provides community context |
| education-data-source-ipeds | College identifiers for linking | Join college data to census geography via unitid |
| education-data-explorer | Parent discovery skill | Finding available endpoints |
| education-data-query | Data fetching | Downloading parquet/CSV files |

Topic Index

| Topic | Reference File |
| --- | --- |
| Census tract definition | ./references/geographic-units.md |
| Block group definition | ./references/geographic-units.md |
| School district boundaries | ./references/geographic-units.md |
| School-to-tract linking | ./references/school-geography-links.md |
| SABINS attendance areas | ./references/school-geography-links.md |
| NCES EDGE files | ./references/school-geography-links.md |
| Time series tables | ./references/time-series.md |
| Geographic standardization | ./references/time-series.md |
| Geographic crosswalks | ./references/time-series.md |
| Population variables | ./references/variable-catalog.md |
| Income/poverty variables | ./references/variable-catalog.md |
| Education variables | ./references/variable-catalog.md |
| Tract boundary changes | ./references/boundary-changes.md |
| 2022 Connecticut changes | ./references/boundary-changes.md |
| TIGER/Line versions | ./references/boundary-changes.md |
| Direct NHGIS access | ./references/data-access.md |
| ipumspy Python package | ./references/data-access.md |
| Data Finder workflow | ./references/data-access.md |