Awesome-Agent-Skills-for-Empirical-Research education-data-source-ccd


CCD Data Source Reference

Common Core of Data (CCD) — the federal complete-universe database of all U.S. public K-12 schools and districts (~100,000 schools, ~18,000 districts), collecting enrollment, staffing, finance, and directory data annually (1986-present). Use when analyzing public school enrollment by grade/race/sex, district finances, school staffing, or directory attributes. Public schools and districts only; excludes private schools and postsecondary. Note significant variable encoding and race/ethnicity definition changes over time.

The CCD is the Department of Education's comprehensive, annual, national database of all public elementary and secondary schools and school districts in the United States. It is the only federal dataset that provides a complete universe census (not a sample) of U.S. public K-12 education.

CRITICAL: Value Encoding

The Education Data Portal uses integer codes for categorical variables that differ from NCES's original string codes. Always verify codes against codebooks.

| Context | `school_type` | `charter` | `urban_centric_locale` |
|---|---|---|---|
| Portal (integers) | `1` (Regular) | `0` (No) / `1` (Yes) | `11` (City-Large) |
| NCES original | `1-Regular school` | `Yes` / `No` | `11-City: Large` |

Note: `charter` and `magnet` use `0/1` encoding, NOT `1=Yes / 2=No` as some NCES documentation shows.

See `./references/variable-definitions.md` for complete encoding tables.
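
One way to guard against the 0/1 versus 1/2 mix-up is to normalize charter flags through a single function that rejects anything outside the Portal's documented codes. This is a minimal sketch: `charter_to_portal` and both lookup tables are hypothetical helpers for illustration, not part of the Portal or of this skill's tooling.

```python
# Portal integer codes for the charter flag, as documented above.
PORTAL_CHARTER = {0: "No", 1: "Yes"}

# Hypothetical mapping from NCES-style strings to the Portal's integers.
NCES_CHARTER_TO_PORTAL = {"No": 0, "Yes": 1}


def charter_to_portal(value):
    """Return the Portal 0/1 code for a charter flag given as an int or an NCES string."""
    if isinstance(value, str):
        return NCES_CHARTER_TO_PORTAL[value.strip().title()]
    if value in PORTAL_CHARTER:
        return value
    # A value of 2 here usually means the source used the 1=Yes / 2=No scheme.
    raise ValueError(f"unexpected charter code {value!r}; check the codebook")


print(charter_to_portal("Yes"), charter_to_portal(0))
```

Raising on unexpected values, including negative missing-data codes, is a deliberate choice in this sketch: it forces a look at the codebook instead of silently treating `2` as a valid flag.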

What is CCD?

  • Primary K-12 database: the Department of Education's authoritative source for public elementary/secondary education statistics
  • Universe survey: Covers ALL public schools and districts, not a sample
  • Annual collection: Data submitted by State Education Agencies (SEAs) each year
  • Six major components: Directory, Membership, Staffing, Finance (state and district), Dropout/Completers
  • Coverage: ~100,000 public schools and ~18,000 school districts nationwide
  • Historical depth: Data available from 1986 to present (varies by component)
  • Collector: National Center for Education Statistics (NCES) via EDFacts
  • Available through: Education Data Portal mirrors (5 of 6 survey components; see Data Access section for details)

Reference File Structure

| File | Purpose | When to Read |
|---|---|---|
| `survey-components.md` | Detailed coverage of each CCD survey component | Understanding what data is collected |
| `data-collection.md` | How data flows from schools to NCES, timelines, respondent universe | Understanding data provenance and timing |
| `variable-definitions.md` | Key variables, coding schemes, special values | Interpreting specific data elements |
| `data-quality.md` | Missing data patterns, suppression, state variations | Assessing data reliability |
| `historical-changes.md` | Definition changes, code revisions over time | Longitudinal analysis |

Decision Trees

What CCD component do I need?

What information do you need?
├─ School/district names, addresses, contacts → Directory
│   └─ See ./references/survey-components.md#directory
├─ Student enrollment counts → Membership
│   ├─ By grade → Membership (grade disaggregation)
│   ├─ By race/ethnicity → Membership (race disaggregation)
│   ├─ By sex → Membership (sex disaggregation)
│   └─ See ./references/survey-components.md#membership
├─ Staff/teacher counts → Staffing
│   └─ See ./references/survey-components.md#staffing
├─ Revenue and expenditure → Finance
│   ├─ State-level totals → National Public Education Financial Survey
│   ├─ District-level detail → School District Finance Survey (F-33)
│   └─ See ./references/survey-components.md#finance
├─ Graduation/dropout rates → Dropout and Completers
│   └─ See ./references/survey-components.md#dropout-completers
└─ School type, charter status, locale → Directory
    └─ See ./references/survey-components.md#directory

Is this a data quality issue?

Unexpected data values?
├─ Negative numbers (-1, -2, -3, -9) → Missing data codes
│   └─ See ./references/variable-definitions.md#missing-data-codes
├─ Very different from prior year → Check for definition changes
│   └─ See ./references/historical-changes.md
├─ State appears as outlier → Check state-specific reporting
│   └─ See ./references/data-quality.md#state-variations
├─ Large number of zeros → Check suppression rules
│   └─ See ./references/data-quality.md#suppression
└─ Locale codes don't match → Pre/post 2006 locale system change
    └─ See ./references/historical-changes.md#locale-codes

Can I compare across time?

Building a time series?
├─ Race/ethnicity categories → Major change in 2010
│   └─ See ./references/historical-changes.md#race-ethnicity
├─ Locale codes → Completely revised in 2006
│   └─ See ./references/historical-changes.md#locale-codes
├─ School/district IDs → Check for ID changes
│   └─ See ./references/variable-definitions.md#identifiers
├─ Free/reduced lunch → CEP and direct certification changes
│   └─ See ./references/data-quality.md#frpl
└─ Finance data → Definition changes and inflation
    └─ See ./references/historical-changes.md#finance

Quick Reference: CCD Components

| Component | Level | Key Variables | Years | Update Cycle |
|---|---|---|---|---|
| Directory | School, LEA, State | Name, address, type, status, locale, charter | 1986+ | Annual |
| Membership | School, LEA, State | Enrollment by grade, race, sex | 1986+ | Annual |
| Staffing | School, LEA, State | FTE teachers, staff by category | 1987+ | Annual |
| Finance (State) | State | Revenue, expenditure by source/function | 1989+ | Annual (1-2 yr lag) |
| Finance (District) | LEA | Revenue, expenditure, per-pupil | 1989+ | Annual (2 yr lag) |
| Dropout/Completers | LEA, State | Dropout counts, diploma recipients | 1991+ | Annual |

Note: Not all components listed above are available through the Portal mirrors. See the Data Access section for which datasets are mirrored.

Key Identifiers

| Portal Column | Format | Level | Example | Notes |
|---|---|---|---|---|
| `ncessch` | 12 characters | School | `010000100100` | State FIPS (2) + LEA suffix (5) + School (5) |
| `leaid` | 7 characters | District | `0100001` | State FIPS (2) + State-assigned (5) |
| `fips` | 2 digits | State | `01` | Federal Information Processing Standard |
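
Because the identifiers nest (the first 2 characters of `ncessch` are the state FIPS code and the first 7 are the `leaid`), a school ID can be split by position. The sketch below assumes the layout listed above; `split_ncessch` is a hypothetical helper name.

```python
def split_ncessch(ncessch):
    """Split a 12-character school ID into state FIPS, district ID, and school suffix."""
    # Restore leading zeros that are lost when the ID is stored as an integer.
    text = str(ncessch).zfill(12)
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f"expected a 12-digit school ID, got {ncessch!r}")
    return {"fips": text[:2], "leaid": text[:7], "school": text[7:]}


# Integer form of the example ID "010000100100" (leading zero dropped).
print(split_ncessch(10000100100))
```

Deriving `leaid` from `ncessch` this way can serve as a cross-check against the `leaid` column that ships with the data.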

ID Type Warning: `ncessch` and `leaid` may be String or Int64 depending on the dataset. In the Schools Directory, `ncessch` is String (preserving leading zeros); in enrollment data, `ncessch` is Int64. In the Districts Directory, `leaid` is Int64; in Finance data, `leaid` is String. Always check the actual dtype and cast as needed when joining across datasets.

Missing Data Codes

The Portal uses both `null` and negative integer codes to represent missing/special values. The specific pattern varies by dataset:

| Code | Meaning | When Used |
|---|---|---|
| `null` | Not available | Common in Directory fields that don't apply to all years |
| `-1` | Missing/not reported | Data not reported by state |
| `-2` | Not applicable | Item doesn't apply to this entity |
| `-3` | Suppressed | Data suppressed for privacy |
| `-9` | Not reported | State did not report this item |

Check actual data. Some datasets use `null` where others use `-1` for effectively the same condition. Always check the observed values in the data before applying a blanket missing-value filter.

School Types (`school_type`)

| Code | Type | Description |
|---|---|---|
| 1 | Regular | Standard public school |
| 2 | Special Education | Focuses on students with disabilities |
| 3 | Vocational | Career/technical education focus |
| 4 | Alternative | Non-traditional programs |
| 5 | Reportable Program | Program within another school (2007-08+) |

LEA Types (`agency_type`)

| Code | Type | Description |
|---|---|---|
| 1 | Regular | Locally governed school district |
| 2 | Component | District sharing superintendent with others |
| 3 | Supervisory Union | Admin services for multiple districts |
| 4 | Regional Agency | Education service agency |
| 5 | State-operated | State-run schools (deaf, blind, correctional) |
| 6 | Federal-operated | Federal schools (BIE, DoDEA) |
| 7 | Charter Agency | All schools are charters (2007-08+) |
| 8 | Other | Doesn't fit other categories (2007-08+) |
| 9 | Specialized Agency | Specialized public agency (observed in data) |
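
For readable output, the two code lists above can be kept as lookup tables. This is a sketch that transcribes the codes from this section; `label_for` is a hypothetical helper, and the live codebook remains the authority if the two disagree.

```python
SCHOOL_TYPE_LABELS = {
    1: "Regular",
    2: "Special Education",
    3: "Vocational",
    4: "Alternative",
    5: "Reportable Program",
}

AGENCY_TYPE_LABELS = {
    1: "Regular",
    2: "Component",
    3: "Supervisory Union",
    4: "Regional Agency",
    5: "State-operated",
    6: "Federal-operated",
    7: "Charter Agency",
    8: "Other",
    9: "Specialized Agency",
}


def label_for(code, labels):
    """Return a readable label, flagging missing/special and undocumented codes."""
    if code is None or code < 0:
        return "Missing/special"
    return labels.get(code, f"Unknown code {code}")


print(label_for(3, SCHOOL_TYPE_LABELS), "|", label_for(7, AGENCY_TYPE_LABELS))
```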

Grade -1 Encoding

In CCD enrollment data:

  • `grade = -1` means Pre-Kindergarten, NOT missing data
  • `grade = 99` means Total across all grades

Do NOT filter `grade >= 0`: this removes all Pre-K students!

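import polars as pl

# WRONG - removes Pre-K students (and still keeps the grade 99 total rows)!
no_pre_k = df.filter(pl.col("grade") >= 0)

# CORRECT - select each group explicitly from the unfiltered frame
pre_k = df.filter(pl.col("grade") == -1)  # Pre-K only
k12 = df.filter(pl.col("grade").is_between(0, 12))  # K-12
total = df.filter(pl.col("grade") == 99)  # All grades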

Portal Column Name Mapping

Variable Name Mapping: The Portal column `urban_centric_locale` contains locale codes. Some documentation may refer to this as simply `locale`. Use `urban_centric_locale` when filtering or selecting columns in Portal data.

Dataset-to-Component Mapping

| Mirror Dataset | CCD Component | Path |
|---|---|---|
| Schools CCD Directory | School Directory | `ccd/schools_ccd_directory` |
| Schools CCD Enrollment | School Membership | `ccd/schools_ccd_enrollment_{year}` |
| Districts LEA Directory | LEA Directory | `ccd/school-districts_lea_directory` |
| Districts CCD Enrollment | LEA Membership | `ccd/schools_ccd_lea_enrollment_{year}` |
| Districts CCD Finance | F-33 District Finance | `ccd/districts_ccd_finance` |

Data Collection Flow

Schools → Local Education Agencies (LEAs)
                ↓
    State Education Agencies (SEAs)
                ↓
        EDFacts Submission System
                ↓
    NCES Quality Review & Editing
                ↓
        CCD Public Data Files

Timeline: Data for school year 20XX-YY typically submitted spring 20YY, released fall 20YY (preliminary) to spring 20YY+1 (provisional/final).
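
The timeline arithmetic can be made concrete with a small function. This sketch encodes only the typical schedule stated above; actual release dates vary, and `collection_timeline` is a hypothetical helper.

```python
def collection_timeline(start_year):
    """Return the typical submission and release windows for a school year.

    start_year is the fall calendar year, e.g. 2021 for school year 2021-22.
    """
    end_year = start_year + 1
    return {
        "school_year": f"{start_year}-{str(end_year)[-2:]}",
        "submitted": f"spring {end_year}",
        "preliminary_release": f"fall {end_year}",
        "provisional_or_final_release": f"spring {end_year + 1}",
    }


print(collection_timeline(2021))
```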

Data Access

Datasets for CCD are available via the mirror system. See `datasets-reference.md` for canonical paths, `mirrors.yaml` for mirror configuration, and `fetch-patterns.md` for fetch code patterns.

Key datasets (5 datasets; see `datasets-reference.md` for the authoritative list):

| Dataset | Type | Path | Codebook |
|---|---|---|---|
| School Directory | Single | `ccd/schools_ccd_directory` | `ccd/codebook_schools_ccd_directory` |
| School Enrollment | Yearly (1986-2023) | `ccd/schools_ccd_enrollment_{year}` | `ccd/codebook_schools_ccd_enrollment` |
| District Directory | Single | `ccd/school-districts_lea_directory` | `ccd/codebook_districts_ccd_directory` |
| District Enrollment | Yearly (1986-2023) | `ccd/schools_ccd_lea_enrollment_{year}` | `ccd/codebook_districts_ccd_enrollment` |
| District Finance | Single | `ccd/districts_ccd_finance` | `ccd/codebook_districts_ccd_finance` |
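
The yearly datasets use a `{year}` placeholder in their paths, so a multi-year pull starts by expanding that template. A minimal sketch, assuming the 1986-2023 range listed above still holds in `datasets-reference.md`; `enrollment_paths` is a hypothetical helper and does not fetch anything.

```python
ENROLLMENT_YEARS = range(1986, 2024)  # 1986-2023 inclusive, per the dataset list
SCHOOL_ENROLLMENT_TEMPLATE = "ccd/schools_ccd_enrollment_{year}"


def enrollment_paths(first_year, last_year, template=SCHOOL_ENROLLMENT_TEMPLATE):
    """Expand a yearly path template into one canonical path per requested year."""
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    years = list(range(first_year, last_year + 1))
    unavailable = [y for y in years if y not in ENROLLMENT_YEARS]
    if unavailable:
        raise ValueError(f"years outside the documented range: {unavailable}")
    return [template.format(year=y) for y in years]


print(enrollment_paths(2020, 2022))
```

Passing `"ccd/schools_ccd_lea_enrollment_{year}"` as `template` yields the district-level paths in the same way.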

Not in Portal mirrors: The following CCD components are documented in this skill for reference but are not available through the Education Data Portal mirrors:

  • Dropout/Completers — completion and dropout data by demographics
  • State Finance (NPEFS) — state-level education revenue and expenditure

For these components, access NCES directly at https://nces.ed.gov/ccd/.

Codebooks are `.xls` files co-located with data in all mirrors. Use `get_codebook_url()` from `fetch-patterns.md` to construct download URLs:

url = get_codebook_url("ccd/codebook_schools_ccd_directory")

Truth Hierarchy: When interpreting variable values, apply this priority:

  1. Actual data file (what you observe in the parquet/CSV) -- this IS the truth
  2. Live codebook (.xls in mirror) -- authoritative documentation, may lag
  3. This skill documentation -- convenient summary, may drift from codebook

If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
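
In practice, applying the hierarchy means comparing the values observed in a column against the documented codes and investigating anything left over. The sketch below shows that comparison with plain Python sets; `undocumented_values` is a hypothetical helper and the observed list is toy data.

```python
def undocumented_values(observed, documented, special_codes=(-1, -2, -3, -9)):
    """Return observed values that appear in neither the code list nor the special codes."""
    seen = {v for v in observed if v is not None}  # nulls are handled separately
    return sorted(seen - set(documented) - set(special_codes))


# Toy example: a code list that stops at 8, checked against observed agency_type values.
documented_agency_types = range(1, 9)
observed_agency_types = [1, 2, 9, -1, None, 7]
print(undocumented_values(observed_agency_types, documented_agency_types))
```

A non-empty result is a prompt to investigate, not an error: the data is the top of the hierarchy, so the documentation is what gets updated.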

Filtering

All filtering is done locally with Polars after download:

import polars as pl

# Filter by state (California)
df = df.filter(pl.col("fips") == 6)

# Filter by year
df = df.filter(pl.col("year").is_in([2020, 2021, 2022]))

# Get totals only (enrollment)
df = df.filter(pl.col("grade") == 99)

# Get specific grades (K-12)
df = df.filter(pl.col("grade").is_between(0, 12))

Finance Data Notes

  • Finance data lag: The latest available year in the mirror is 2020 (empirically verified). Finance data typically lags 2+ years behind current school year.
  • Finance dataset has 163 columns -- by far the most complex CCD dataset
  • Some finance columns use `_total` suffix (e.g., `exp_current_instruction_total`)
  • `leaid` is String type in Finance data (unlike the Districts Directory where it is Int64)

Common Pitfalls

| Pitfall | Issue | Solution |
|---|---|---|
| Summing grades | Misses ungraded students | Use `grade=99` (total) instead |
| Assuming `-1` is missing | In grade data, `-1` = Pre-K | Check variable format in codebook |
| Cross-state comparison | Different state definitions | Check state methodology first |
| Using FRPL as poverty measure | CEP schools show 100% | Supplement with MEPS or SAIPE data |
| Locale time series | 2006 code system change | Analyze pre/post-2006 separately |
| Charter school counts | Early years incomplete | Verify against state records pre-2010 |
| Dropout rate comparison | State definitions vary | Within-state comparisons only |
| Using NCES string codes | Portal uses integers | See variable-definitions.md for mappings |
| Assuming `charter=1/2` | Portal uses `0=No, 1=Yes` | Empirically verified; not NCES `1=Yes, 2=No` |
| ID type across datasets | `leaid` / `ncessch` may be String or Int64 | Always check dtype before joining |
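
The "summing grades" pitfall can be checked directly by comparing the sum of per-grade rows with the reported `grade=99` total for the same school and year. This is a sketch on toy numbers, assuming one row per grade; `ungraded_gap` is a hypothetical helper.

```python
def ungraded_gap(rows):
    """Return reported total minus the sum of per-grade counts.

    rows is a list of (grade, enrollment) pairs for one school and year.
    Negative or null enrollment values are treated as missing and skipped.
    """
    totals = [n for g, n in rows if g == 99 and n is not None and n >= 0]
    if len(totals) != 1:
        raise ValueError("expected exactly one usable grade=99 total row")
    graded = sum(n for g, n in rows if g != 99 and n is not None and n >= 0)
    return totals[0] - graded


# Toy rows: Pre-K (-1), K (0), grade 1, and the reported total (99).
rows = [(-1, 20), (0, 50), (1, 55), (99, 130)]
print(ungraded_gap(rows))
```

A positive gap suggests students who are counted in the total but not assigned to any listed grade, which is why the total row is the safer figure.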

Coverage Notes

What CCD Includes

  • All public schools (traditional, charter, magnet, alternative)
  • All public school districts and LEAs
  • Bureau of Indian Education (BIE) schools
  • Department of Defense Education Activity (DoDEA) schools
  • State-operated schools (deaf, blind, correctional)

What CCD Excludes

  • Private schools (use Private School Universe Survey - PSS)
  • Homeschool students
  • Postsecondary institutions (use IPEDS)
  • Detailed student-level data (CCD is aggregate only)

Related Data Sources

| Source | Relationship | When to Use |
|---|---|---|
| `education-data-source-edfacts` | CCD nonfiscal data flows through EDFacts | Same underlying data |
| `education-data-source-crdc` | Biennial; uses CCD school IDs | Need discipline, course access, equity data |
| `education-data-source-saipe` | Uses CCD district IDs | Need poverty estimates (better than FRPL) |
| `education-data-source-meps` | School-level poverty estimates | Need school-level poverty (better than FRPL) |
| `education-data-source-ipeds` | Separate system for postsecondary | Need college/university data |
| PSS | Private school equivalent | Need private school data |
| `education-data-source-nhgis` | Census geography crosswalks | Need school-Census links |
| `education-data-explorer` | Parent discovery skill | Finding available datasets |
| `education-data-query` | Data fetching (mirror system) | Downloading parquet/CSV files via `fetch_from_mirrors()` |

Topic Index

| Topic | Reference File |
|---|---|
| Directory survey | `./references/survey-components.md` |
| Membership survey | `./references/survey-components.md` |
| Staffing survey | `./references/survey-components.md` |
| Finance surveys | `./references/survey-components.md` |
| Dropout/completers | `./references/survey-components.md` |
| Data collection process | `./references/data-collection.md` |
| EDFacts submission | `./references/data-collection.md` |
| Respondent universe | `./references/data-collection.md` |
| NCES identifiers | `./references/variable-definitions.md` |
| Missing data codes | `./references/variable-definitions.md` |
| Grade codes | `./references/variable-definitions.md` |
| Race/ethnicity codes | `./references/variable-definitions.md` |
| Locale codes | `./references/variable-definitions.md` |
| State-level variations | `./references/data-quality.md` |
| Missing data patterns | `./references/data-quality.md` |
| FRPL limitations | `./references/data-quality.md` |
| Data suppression | `./references/data-quality.md` |
| Locale code changes (2006) | `./references/historical-changes.md` |
| Race/ethnicity changes (2010) | `./references/historical-changes.md` |
| LEA type changes (2007) | `./references/historical-changes.md` |
| ID changes over time | `./references/historical-changes.md` |