git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/education-data-source-edfacts" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-education-data-so-d5e4f0 && rm -rf "$T"
skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/education-data-source-edfacts/SKILL.md
EDFacts Data Source Reference
EDFacts — federal K-12 outcome data from State Education Agencies, covering state assessment proficiency rates, ACGR graduation rates, and ESSA accountability indicators at school and district level (assessments 2009-2020, graduation rates 2010-2019). Use when analyzing within-state achievement trends, subgroup proficiency gaps, or adjusted cohort graduation rates. Complements CCD (school characteristics) with outcome data. State assessment scores CANNOT be compared across states; use NAEP for cross-state comparisons.
EDFacts is the U.S. Department of Education's centralized data collection system for pre-K through grade 12 education data from State Education Agencies (SEAs). It provides state assessment proficiency rates, graduation rates, and accountability indicators — the authoritative federal source for state-level K-12 outcome data.
CRITICAL: Value Encoding
The Urban Institute Education Data Portal converts NCES string codes (e.g., `ALL`, `CWD`, `LEP`) to integer codes. Always verify actual data values before filtering; do not rely on documentation labels alone.

| Context | Subgroup "All" | English Learner | Sex "Male" |
|---|---|---|---|
| Portal integer | `99` | `1` | `1` |
| NCES string | `ALL` | `LEP` | `M` |

See `./references/variable-definitions.md` for complete encoding tables.
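The failure mode behind this warning is that a filter on an NCES string silently matches nothing once the column holds integers. A minimal sketch of a guard follows, in plain Python so it runs without dependencies. The rows are toy values and `observed_values` and `filter_rows` are hypothetical helpers, not part of the Portal or this skill.

```python
# Toy rows standing in for Portal records; the values are illustrative, not real data.
rows = [
    {"sex": 99, "lep": 99, "pct": 41},
    {"sex": 1, "lep": 99, "pct": 38},
    {"sex": 99, "lep": 1, "pct": 17},
]


def observed_values(rows, column):
    """Return the sorted distinct values found in one column."""
    return sorted({row[column] for row in rows})


def filter_rows(rows, column, value):
    """Keep rows where column == value, refusing values never observed."""
    seen = observed_values(rows, column)
    if value not in seen:
        # A string such as "LEP" lands here instead of returning zero rows.
        raise ValueError(f"{value!r} not found in {column!r}; observed: {seen}")
    return [row for row in rows if row[column] == value]


# Integer code 1 selects the English learner rows.
english_learners = filter_rows(rows, "lep", 1)
```

Calling `filter_rows(rows, "lep", "LEP")` raises instead of returning an empty result, which makes the encoding mismatch visible early.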
What is EDFacts?
- Collector: U.S. Department of Education, via State Education Agencies (SEAs)
- Coverage: All public schools and districts in 50 states + DC
- Content: State assessment proficiency rates, ACGR graduation rates, participation rates, accountability indicators
- Frequency: Annual collection
- Available years: Assessments 2009-10 to present; Graduation rates 2010-11 to present
- Primary identifiers: `ncessch` (school ID, Int64), `leaid` (district ID, Int64), `fips` (state FIPS code, Int64)
- Key limitation: State assessment scores CANNOT be compared across states (different tests, different cut scores)
- Available through: Education Data Portal mirrors
Reference File Structure
| File | Purpose | When to Read |
|---|---|---|
| | ESSA, NCLB history, accountability systems | Understanding policy context |
| | Proficiency levels, test scores, limitations | Working with assessment data |
| `./references/graduation-rates.md` | ACGR methodology, cohort definitions | Analyzing graduation data |
| `./references/variable-definitions.md` | Key variables, suppression codes, special values | Interpreting specific variables |
| `./references/data-quality.md` | Known issues, state variations, COVID impacts | Data cleaning, limitations |
| `./references/subgroup-reporting.md` | Special populations, disaggregation | Analyzing by student groups |
Decision Trees
What type of analysis?

    What EDFacts data do you need?
    ├─ Assessment/proficiency data
    │  ├─ Within-state trends → Valid analysis
    │  ├─ Cross-state comparison → INVALID - use NAEP instead
    │  └─ Subgroup gaps → See ./references/subgroup-reporting.md
    ├─ Graduation rates (ACGR)
    │  ├─ Understand methodology → See ./references/graduation-rates.md
    │  ├─ Extended rates (5-year, 6-year) → See ./references/graduation-rates.md
    │  └─ Subgroup rates → See ./references/subgroup-reporting.md
    ├─ Understanding variables
    │  ├─ Missing/suppressed values → See ./references/variable-definitions.md
    │  ├─ Range vs. exact values → See ./references/variable-definitions.md
    │  └─ Subgroup codes → See ./references/subgroup-reporting.md
    └─ Data quality concerns
       ├─ COVID-19 impacts (2019-20) → See ./references/data-quality.md
       ├─ State reporting changes → See ./references/data-quality.md
       └─ Suppression rates → See ./references/data-quality.md

Is my comparison valid?

    What are you comparing?
    ├─ Same state, different years
    │  ├─ Same assessment system? → Valid
    │  └─ Different tests? → Break in time series
    ├─ Schools within same state → Valid
    ├─ Districts within same state → Valid
    ├─ Subgroups within same school → Valid (check suppression)
    ├─ Different states
    │  ├─ Proficiency rates → INVALID
    │  ├─ Graduation rates (ACGR) → More comparable
    │  └─ Use NAEP instead → Valid
    └─ National ranking by proficiency → INVALID

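The comparison tree can be encoded as a small guard that an analysis script calls before aggregating. This is a sketch only. The function name, the `scope` and `measure` labels, and the returned strings are hypothetical and simply mirror the branches of the tree.

```python
def comparison_validity(scope, measure="proficiency", same_assessment=True):
    """Classify a planned comparison using the branches of the decision tree.

    scope: "within_state_years", "within_state_units" (schools, districts,
           or subgroups inside one state), "cross_state", "national_ranking"
    measure: "proficiency", "acgr", or "naep"
    """
    if scope == "within_state_years":
        # A change of test within the state breaks the time series.
        return "valid" if same_assessment else "break in time series"
    if scope == "within_state_units":
        return "valid"
    if scope == "cross_state":
        if measure == "proficiency":
            return "invalid"
        if measure == "acgr":
            return "more comparable"
        if measure == "naep":
            return "valid"
    if scope == "national_ranking" and measure == "proficiency":
        return "invalid"
    raise ValueError(f"no branch for scope={scope!r}, measure={measure!r}")


# A cross-state proficiency comparison is rejected before any data is touched.
verdict = comparison_validity("cross_state", measure="proficiency")
```

Raising on unknown combinations is a deliberate choice here: a comparison the tree does not cover should be reviewed by a person, not waved through.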
Quick Reference: EDFacts Data Elements
Assessment Data
| Data Element | Description | Available Years |
|---|---|---|
| Proficiency rates | % meeting state standards in reading/math | 2009-10 to present |
| Participation rates | % of students assessed | 2012-13 to present |
| Achievement levels | Below Basic, Basic, Proficient, Advanced | Varies by state |
| Grade levels | Grades 3-8, high school (varies) | 2009-10 to present |
Graduation Data
| Data Element | Description | Available Years |
|---|---|---|
| 4-year ACGR | Adjusted Cohort Graduation Rate | 2010-11 to present |
| 5-year ACGR | Extended graduation rate | 2011-12 to present |
| 6-year ACGR | Further extended rate | 2012-13 to present |
| Diploma types | Regular diploma only in ACGR | All years |
Key Identifiers
Portal Data Types: All identifiers are Int64 in the Portal parquet files. The NCES source format (zero-padded strings) is shown for reference only. When joining with other Portal datasets, join on the integer columns directly.
| ID | Portal Type | NCES Source Format | Level | Example (Int64) |
|---|---|---|---|---|
| `ncessch` | Int64 | 12-char zero-padded | School | |
| | Int64 | Same as ncessch | School | |
| `leaid` | Int64 | 7-char zero-padded | District/LEA | |
| | Int64 | Same as leaid | District/LEA | |
| `fips` | Int64 | 2-digit | State | `1` (Alabama) |
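When a non-Portal file carries the zero-padded NCES strings, the conversion to and from the Portal integers is mechanical. The sketch below uses the widths from the table above. The helper names and the example IDs are made up for illustration.

```python
# Widths from the "NCES Source Format" column of the identifier table.
NCES_WIDTHS = {"ncessch": 12, "leaid": 7, "fips": 2}


def to_portal_int(nces_string):
    """Convert a zero-padded NCES string ID to the integer form used for joins."""
    return int(nces_string)


def to_nces_string(column, value):
    """Re-pad a Portal integer ID to the NCES width for the given column."""
    return str(value).zfill(NCES_WIDTHS[column])


# Made-up IDs, used only to show the round trip.
school_int = to_portal_int("010000500870")
school_str = to_nces_string("ncessch", school_int)
state_str = to_nces_string("fips", 1)  # Alabama
```

Joining on the integer columns avoids the padding question entirely; the string form is only needed when matching against files that kept the NCES format.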
Data Levels
| Level | Identifier | Dataset Path Pattern |
|---|---|---|
| School | `ncessch` (Int64) | |
| District/LEA | `leaid` (Int64) | |
| State | `fips` (Int64) | Aggregate from lower levels |
Subgroups Reported
Note: Not all subgroup columns are present in every dataset. Grad rates data does NOT have `sex`, `migrant`, or `military_connected` columns.
| Subgroup | NCES Code | Portal Integer | Column | Available In |
|---|---|---|---|---|
| All students | | | race, sex, lep, disability | Assessments, Grad Rates |
| Economically disadvantaged | | | econ_disadvantaged | Assessments, Grad Rates |
| Students with disabilities | | | disability | Assessments, Grad Rates |
| English learners | | | lep | Assessments, Grad Rates |
| Homeless | | | homeless | Assessments, Grad Rates |
| Foster care | | | foster_care | Assessments, Grad Rates |
| Migrant | | | migrant | Assessments only |
| Military connected | | | military_connected | Assessments only |
| Race/ethnicity | Multiple | | race | Assessments, Grad Rates |
| Sex | | | sex | Assessments only |
EDFacts Filter Column Pattern:
- Special population columns (lep, disability, homeless, etc.) use `1` = subgroup, `99` = total
- Race column uses integer codes (1=White, 2=Black, etc.)
- Sex column uses `1` = Male, `2` = Female, `99` = Total (assessments only)
Grade Codes (grade_edfacts)
| Code | Grade Level |
|---|---|
| `3`-`8` | Grades 3-8 (individual) |
| `9` | Grades 9-12 combined |
| `99` | Total (all grades) |
Race Codes
Empirically verified from 2018 school assessment data. Only these values appear in the `race` column:
| Code | Category |
|---|---|
| `1` | White |
| `2` | Black |
| `3` | Hispanic |
| `4` | Asian |
| `5` | American Indian/Alaska Native |
| `7` | Two or More Races |
| `99` | Total |
Note: Code `6` (Native Hawaiian/Pacific Islander) is NOT observed in the data. Codes `8` (Nonresident alien), `9` (Unknown), `20` (Other), `-1`, `-2`, `-3` are also not observed in the race column. These codes may exist in other Portal sources but are absent from EDFacts.
Sex Codes
| Code | Category |
|---|---|
| `1` | Male |
| `2` | Female |
| | Unknown |
| `99` | Total |
Disability Codes
Empirically verified from 2018 school assessment and 2019 grad rate data. Only `1` and `99` are observed in the `disability` column. The expanded codes (0-4) documented in other Portal sources are NOT present in EDFacts datasets.
| Code | Category |
|---|---|
| `1` | Students with disabilities (IDEA-eligible) |
| `99` | Total (all students) |
LEP Codes
| Code | Category |
|---|---|
| `1` | Students who are limited English proficient |
| `99` | All students (total) |
Special Population Columns
For `homeless`, `migrant`, `econ_disadvantaged`, `foster_care`, `military_connected`:
| Code | Category |
|---|---|
| `1` | Yes (in subgroup) |
| `99` | Total (all students) |
Missing Data Codes
| Code | Meaning | When Used |
|---|---|---|
| | Missing/not applicable | Data not reported |
| | Not reported | Item doesn't apply to this entity |
| | Suppressed for privacy | Data suppressed for small N-size |
| | Rounds to zero | Value rounds to zero |
| Range values | Exact value suppressed | Range provided instead of exact value |
| `_midpt` suffix | Calculated midpoint of suppressed range | Use for analysis when exact values are suppressed |

Always use `_midpt` variables for analysis when exact values are suppressed.
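Special codes must be dropped before averaging a `_midpt` column, and the share dropped is worth reporting alongside the result. The sketch below assumes special values are negative integers (the race-code note mentions `-1`, `-2`, `-3`); confirm that against the actual file first. Both helpers are hypothetical and the input values are invented.

```python
def mean_excluding_special(values):
    """Average the usable values; return None if nothing usable remains.

    Assumption: special codes are negative and real percentages are >= 0.
    """
    usable = [v for v in values if v is not None and v >= 0]
    return sum(usable) / len(usable) if usable else None


def special_value_rate(values):
    """Share of entries that are missing (None) or carry a negative code."""
    if not values:
        return None
    dropped = sum(1 for v in values if v is None or v < 0)
    return dropped / len(values)


# Invented midpoint values: three usable, one coded, one missing.
midpts = [45, -3, 62, None, 30]
avg = mean_excluding_special(midpts)
rate = special_value_rate(midpts)
```

Averaging the raw column without this step would pull the mean down with the negative codes, which is easy to miss because the result still looks like a plausible percentage.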
Data Access
All EDFacts data is fetched via the Education Data Portal mirror system. There is no API access.
Key references:
- `mirrors.yaml` -- Mirror definitions, URL templates, read strategies
- `datasets-reference.md` -- Canonical dataset paths (one path works for all mirrors)
- `fetch-patterns.md` -- `fetch_from_mirrors()` and `fetch_yearly_from_mirrors()` patterns
Truth Hierarchy: When interpreting variable values, apply this priority:
- Actual data file (what you observe in the parquet/CSV) — this IS the truth
- Live codebook (.xls in mirror) — authoritative documentation, may lag
- This skill documentation — convenient summary, may drift from codebook
If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
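One way to apply the hierarchy is to diff the codes a document lists against the codes a file actually contains. The sketch below is a hypothetical helper, shown with the disability codes described in this skill: other Portal sources document 0-4, while only 1 and 99 are observed in EDFacts.

```python
def audit_codes(observed, documented):
    """Compare codes seen in the data with codes listed in documentation."""
    observed, documented = set(observed), set(documented)
    return {
        # Present in the data but missing from the documentation: trust the data.
        "undocumented": sorted(observed - documented),
        # Listed in the documentation but absent from the data: docs may have drifted.
        "never_observed": sorted(documented - observed),
    }


# Disability column: observed values vs. the expanded code list plus the total code.
report = audit_codes(observed=[1, 99], documented=[0, 1, 2, 3, 4, 99])
```

A non-empty `undocumented` list is the case to investigate first, since it means the data holds values that neither the codebook nor this summary explains.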
Key Datasets
| Dataset | Path | Type | Columns |
|---|---|---|---|
| School Assessments | | Yearly (2009-2018, 2020) | 26 cols |
| School Grad Rates | | Yearly (2010-2019) | 18 cols |
| District Assessments | | Yearly (2009-2018, 2020) | 23 cols |
| District Grad Rates | | Yearly (2010-2019) | 15 cols |
Note: 2019 assessment data is NOT available (at any level) due to COVID testing waivers.
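Because of the 2019 gap, a plain `range()` over assessment years requests a file that does not exist. A small sketch of building the year lists from the Key Datasets table follows. The constant and function names are hypothetical.

```python
# Year coverage as listed in the Key Datasets table.
ASSESSMENT_YEARS = [y for y in range(2009, 2021) if y != 2019]  # 2009-2018, 2020
GRAD_RATE_YEARS = list(range(2010, 2020))  # 2010-2019


def available_years(kind, start, end):
    """Return the years in [start, end] that exist for the given dataset kind."""
    if kind == "assessments":
        years = ASSESSMENT_YEARS
    elif kind == "grad_rates":
        years = GRAD_RATE_YEARS
    else:
        raise ValueError(f"unknown dataset kind: {kind!r}")
    return [y for y in years if start <= y <= end]


# A 2017-2020 assessment request skips the missing 2019 file.
recent = available_years("assessments", 2017, 2020)
```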
Codebooks
Codebook `.xls` files are available for both assessment and graduation rate datasets. Use `get_codebook_url()` from `fetch-patterns.md`:

    # Assessment codebooks:
    url = get_codebook_url("edfacts/codebook_schools_edfacts_assessments")
    url = get_codebook_url("edfacts/codebook_districts_edfacts_assessments")

    # Graduation rate codebooks:
    url = get_codebook_url("edfacts/codebook_schools_edfacts_graduation")
    url = get_codebook_url("edfacts/codebook_districts_edfacts_graduation")

Codebook naming note: Graduation rate codebooks use `_graduation` (not `_grad_rates`), while the data files use `_grad_rates`. This follows the same pattern as other Portal sources where codebook names differ from data file names. See `datasets-reference.md` for the authoritative path mapping.
Dataset Column Differences
Assessment and graduation rate datasets have different column sets:
| Column | Assessments | Grad Rates |
|---|---|---|
| `sex` | Yes (1, 2, 99) | No |
| `migrant` | Yes (1, 99) | No |
| `military_connected` | Yes (1, 99) | No |
| `grade_edfacts` | Yes (3-9, 99) | No |
| / | Yes | No |
| | No | Yes |
| | No | Yes |
| / | Yes | Yes |
Filtering

    import polars as pl

    # df: an EDFacts assessments DataFrame loaded elsewhere

    # Grade filtering: grade_edfacts uses integer codes
    df = df.filter(pl.col("grade_edfacts") == 4)   # Grade 4
    df = df.filter(pl.col("grade_edfacts") == 99)  # All grades combined

    # Subgroup filtering: special population columns use 1/99 pattern
    df_total = df.filter(pl.col("sex") == 99)              # All students (total)
    df_econ = df.filter(pl.col("econ_disadvantaged") == 1)  # Economically disadvantaged only

    # Race filtering: integer codes
    df_black = df.filter(pl.col("race") == 2)  # Black students

Common Pitfalls
| Pitfall | Issue | Solution |
|---|---|---|
| Ranking states by proficiency | Different tests, different cut scores make comparisons meaningless | Use NAEP for cross-state comparisons |
| Comparing 2019-20 to other years | COVID testing waivers created data gaps | Note data gap, exclude year |
| Ignoring suppression | Results biased toward larger schools/subgroups | Document suppression rates, use `_midpt` variables |
| Assuming proficiency = same thing | State definitions of "proficient" vary widely | Clarify each state's definition |
| Pre/post ESSA comparison | Different accountability systems (NCLB vs ESSA) | Note policy change at 2015 boundary |
| Using string codes for filtering | Portal uses integer encoding, not NCES strings | Always check actual data values; see encoding tables above |
Key Policy Context
| Law | Years | Key Features |
|---|---|---|
| NCLB | 2002-2015 | AYP, 100% proficiency goal, HQT |
| ESSA | 2015-present | State flexibility, multiple indicators |
- AYP (Adequate Yearly Progress): NCLB requirement eliminated by ESSA
- ESSA Accountability: States design own systems with federal guardrails
- N-size: Minimum students required for reporting (varies by state, typically 10-30)
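The N-size threshold determines which subgroups drop out of reporting, so the same school can show different subgroup coverage under different state rules. A minimal sketch with invented counts follows. The function name and the thresholds of 10 and 30 are illustrative, taken from the typical range quoted above, not from any specific state.

```python
def suppressed_subgroups(counts, min_n):
    """Return the subgroups whose student count falls below the N-size."""
    return sorted(name for name, n in counts.items() if n < min_n)


# Invented subgroup counts for one school.
counts = {"all": 412, "lep": 24, "homeless": 8, "foster_care": 3}

low_threshold = suppressed_subgroups(counts, min_n=10)
high_threshold = suppressed_subgroups(counts, min_n=30)
```

Under the higher threshold the English learner subgroup also disappears, which is one reason subgroup coverage differs across states even before any proficiency comparison is attempted.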
CRITICAL WARNING: Cross-State Comparisons
State assessment proficiency rates CANNOT be compared across states.
| Factor | Why It Varies |
|---|---|
| Assessment content | Each state creates its own tests |
| Proficiency cut scores | Each state sets own thresholds |
| Standards alignment | State academic standards differ |
| Test difficulty | Not calibrated nationally |
A student "proficient" in one state may score "below basic" in another state taking a harder test with higher cut scores. Rankings of states by proficiency rates are meaningless.
Use NAEP (National Assessment of Educational Progress) for valid cross-state comparisons.
Valid vs. Invalid Analysis Examples
Valid Analysis:

    # Within-state trend analysis
    state_df = df.filter(pl.col("fips") == 6)  # California only
    trend = state_df.group_by("year").agg(
        pl.col("read_test_pct_prof_midpt").mean()
    )
    # Valid: Same state, same test system

INVALID Analysis:

    # DO NOT DO THIS - Cross-state comparison
    # This comparison is MEANINGLESS
    state_comparison = df.group_by("fips").agg(
        pl.col("read_test_pct_prof_midpt").mean()
    ).sort("read_test_pct_prof_midpt", descending=True)
    # INVALID: Different tests, different standards

Related Data Sources
| Source | Relationship | When to Use |
|---|---|---|
| | CCD provides school/district demographics | Combining outcome data with school characteristics |
| | CRDC has discipline, AP, school climate data | Analyzing school equity alongside achievement |
| | SAIPE provides district poverty estimates | Linking poverty to achievement |
| | MEPS provides school poverty estimates | School-level poverty and assessment analysis |
| | Parent discovery skill | Finding available endpoints |
| | Data fetching | Downloading via mirrors |
Topic Index
| Topic | Reference File |
|---|---|
| NCLB to ESSA transition | |
| State accountability systems | |
| Federal reporting requirements | |
| Proficiency levels | |
| Why states can't be compared | |
| NAEP comparison | |
| Assessment system changes | |
| ACGR calculation | |
| Cohort adjustments | |
| Extended graduation rates | |
| Diploma types | |
| Suppression codes | |
| Missing data values | |
| Range/midpoint variables | |
| Participation rates | |
| COVID-19 data gaps | |
| State reporting variations | |
| Known data issues | |
| Time series breaks | |
| Students with disabilities | |
| English learners | |
| Economically disadvantaged | |
| Race/ethnicity reporting | |
| Homeless/foster/migrant | |
| N-size requirements | |