Scorecard Data Source Reference
College Scorecard — the primary institutional-level source for post-enrollment labor market outcomes, linking NSLDS financial aid records to IRS/Treasury earnings data. Use when comparing institutions on post-graduation earnings, loan repayment, or student debt, or when actual tax-record-based earnings are required rather than survey estimates. Covers six sub-datasets accessed via Portal mirrors. Critical limitation: tracks only Title IV federal aid recipients, not all students.
Federal data on post-college outcomes including earnings, debt, and repayment for students who received Title IV financial aid. Links education records to IRS tax data for actual earnings, making it the primary source for post-college labor market outcomes.
CRITICAL: Value Encoding and Missing Data
The Education Data Portal uses integer encodings for all categorical variables and lowercase, restructured variable names that differ from the original Scorecard column names. Suppression encoding differs by dataset:
- Earnings/counts: `-3` integer code is the primary suppression indicator
- Yes/No flags (institutional characteristics): `null` for missing, `0`/`1` for valid
- Rates (repayment, default): `null` for missing
- The original Scorecard string `"PrivacySuppressed"` does NOT appear in Portal data
| Context | `pred_degree_awarded_ipeds` | HBCU / tribal flags | `religious_affiliation` |
|---|---|---|---|
| Portal (integer) | `0`-`4` | `0` / `1` | Integer codes 22-200 |
| Original Scorecard | String labels | String labels | String labels |
See `./references/variable-definitions.md` for complete encoding tables.
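The encoding rules above can be sketched as a small classifier. This is a minimal illustration in plain Python, not code from the skill itself: the function name `classify_value` and the `kind` labels (`"earnings"`, `"flag"`, `"rate"`) are hypothetical, chosen only to mirror the three cases listed above.

```python
from typing import Optional, Union

Number = Union[int, float]

SUPPRESSED = -3  # Portal suppression code for earnings and count columns


def classify_value(value: Optional[Number], kind: str) -> str:
    """Label a Portal value as 'valid', 'suppressed', or 'missing'.

    `kind` is a hypothetical label for the column family:
    'earnings' (earnings/count columns), 'flag' (0/1 yes/no flags),
    or 'rate' (repayment/default rates).
    """
    if value is None:
        # null means missing in every dataset
        return "missing"
    if kind == "earnings" and value == SUPPRESSED:
        # -3 only carries the suppression meaning in earnings/count columns
        return "suppressed"
    if kind == "flag" and value not in (0, 1):
        raise ValueError(f"unexpected flag value: {value!r}")
    return "valid"
```

Keeping the `-3` check scoped to earnings and count columns avoids silently discarding a legitimate negative value in some other column family.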
What is College Scorecard?
- Publisher: U.S. Department of Education
- Primary value: Post-college labor market outcomes (earnings) and debt/repayment metrics
- Data sources: NSLDS (loans/aid), IRS/Treasury (earnings), IPEDS (institutional characteristics)
- Coverage: Title IV federal aid recipients only — not all students
- Unique feature: Links education to IRS tax records for actual earnings data
- Access: Education Data Portal mirrors (parquet/CSV); see `datasets-reference.md` for paths, `mirrors.yaml` for mirror config, `fetch-patterns.md` for fetch code
- Primary identifier: `unitid` (IPEDS institution ID)
Reference File Structure
| File | Purpose | When to Read |
|---|---|---|
| `./references/earnings-data.md` | Post-college earnings methodology, cohorts, time horizons | Analyzing earnings outcomes |
| `./references/debt-repayment.md` | Student debt, repayment rates, default rates | Analyzing debt or loan outcomes |
| `./references/completion-rates.md` | Completion metrics vs IPEDS | Comparing graduation rates |
| `./references/population-coverage.md` | Title IV limitation details, who is included/excluded | Understanding data representativeness |
| `./references/variable-definitions.md` | Key variables, naming conventions, special values | Building queries or interpreting results |
| `./references/data-quality.md` | Suppression rules, selection bias, known limitations | Assessing data reliability |
| `./references/field-of-study.md` | Program-level earnings and debt data | Analyzing outcomes by major/CIP code |
Decision Trees
What outcome am I researching?
    Outcome type?
    ├─ Post-college earnings
    │  ├─ Institution-level → ./references/earnings-data.md
    │  └─ By field of study → ./references/field-of-study.md
    ├─ Student debt levels
    │  ├─ Cumulative borrowing → ./references/debt-repayment.md
    │  └─ Debt by field → ./references/field-of-study.md
    ├─ Loan repayment/default
    │  └─ Repayment rates → ./references/debt-repayment.md
    ├─ Completion rates
    │  └─ Scorecard completion → ./references/completion-rates.md
    └─ Understanding limitations
       ├─ Who is included → ./references/population-coverage.md
       └─ Data quality issues → ./references/data-quality.md
How do I interpret this data?
    Interpretation question?
    ├─ Why are earnings suppressed?
    │  └─ Privacy thresholds → ./references/data-quality.md
    ├─ What does "6-year earnings" mean?
    │  └─ Cohort timing → ./references/earnings-data.md
    ├─ Why don't Scorecard rates match IPEDS?
    │  └─ Different cohorts → ./references/completion-rates.md
    ├─ What loans are included in debt?
    │  └─ Federal only → ./references/debt-repayment.md
    └─ How representative is this data?
       └─ Title IV coverage → ./references/population-coverage.md
Building a query?
    Query construction?
    ├─ Variable names and codes → ./references/variable-definitions.md
    ├─ Suppression flags to handle → ./references/data-quality.md
    ├─ Understanding cohort years → ./references/earnings-data.md
    └─ Field-level queries → ./references/field-of-study.md
Quick Reference: Scorecard Variables
Portal Data Structure (CRITICAL)
The Portal uses LONG format with time horizon as a column, NOT the WIDE format from original Scorecard bulk download files. Portal column names are all lowercase and differ significantly from original Scorecard names.
| Original Scorecard (WIDE) | Portal Column (LONG) | How to Get |
|---|---|---|
| | Filter: |
| | Filter: |
| | Filter: |
| | Filter: |
, | NOT IN EARNINGS | Join to IPEDS directory or dataset |
| | In dataset; filter: |
| | In dataset; filter: |
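To make the LONG layout concrete, here is a minimal sketch in plain Python that regroups long-format rows into one record per institution, keyed by time horizon. The column names `unitid`, `years_after_entry`, and `earnings_med` are the Portal names used elsewhere in this document; the helper `to_wide` and all row values are hypothetical and made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the Portal's LONG layout (values are made up).
rows = [
    {"unitid": 100001, "years_after_entry": 6, "earnings_med": 31000},
    {"unitid": 100001, "years_after_entry": 10, "earnings_med": 42000},
    {"unitid": 100002, "years_after_entry": 6, "earnings_med": -3},
]


def to_wide(rows, value_col="earnings_med"):
    """Group long rows into {unitid: {years_after_entry: value}}.

    Null and -3 (suppressed) values are skipped, so an institution
    with no usable values does not appear in the result.
    """
    wide = defaultdict(dict)
    for row in rows:
        value = row[value_col]
        if value is None or value == -3:
            continue
        wide[row["unitid"]][row["years_after_entry"]] = value
    return dict(wide)


wide = to_wide(rows)
```

Each horizon is selected by value rather than by a horizon-specific column name, which is the practical difference from the wide bulk-download files.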
Earnings Dataset Columns (Actual Portal Names)
Source dataset: `scorecard/colleges_scorecard_earnings` (203,066 rows x 33 columns)
| Portal Column | Type | Description | Original Scorecard |
|---|---|---|---|
| Int64 | IPEDS institution ID | |
| String | OPE ID (8-digit, zero-padded) | |
| Int64 | Data year (2003-2014, 2018) | File year |
| Int64 | Years since first enrollment (6-10) | Encoded in variable name |
| Int64 | Entry cohort year | Encoded in variable name |
| Int64 | Median earnings (W-2) | |
| Int64 | Mean earnings | |
| Int64 | Standard deviation of earnings | |
| Int64 | 10th percentile earnings | |
| Int64 | 25th percentile earnings | |
| Int64 | 75th percentile earnings | |
| Int64 | 90th percentile earnings | |
| Int64 | Count working and not enrolled | |
| Int64 | Count not working and not enrolled | |
| Float64 | Share earning > $25K | |
| Int64 | Mean earnings, low-income | |
| Int64 | Mean earnings, mid-income | |
| Int64 | Mean earnings, high-income | |
| Int64 | Mean earnings, dependent students | — |
| Int64 | Mean earnings, dependent low-income | — |
| Int64 | Mean earnings, independent students | — |
| Int64 | Mean earnings, female | — |
| Int64 | Mean earnings, male | — |
| Int64 | Count working by subgroup | — |
Key Identifiers
| ID | Format | Level | Example | Notes |
|---|---|---|---|---|
| 6-digit integer | Institution | | Same as IPEDS unitid; primary join key |
| 8-digit string | OPE (Title IV) | | Zero-padded; present in all datasets |
| Integer | 6-digit OPE | | Numeric, no zero-padding |
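When joining on OPE identifiers, the zero-padding difference in the table above is easy to trip over. The sketch below shows one way to normalize both forms in plain Python. The function names are hypothetical, the example ID is made up, and the code assumes the 6-digit OPE ID corresponds to the first six digits of the 8-digit ID; verify that against the codebook before relying on it.

```python
def normalize_opeid(raw) -> str:
    """Return an OPE ID as an 8-character, zero-padded string."""
    digits = str(raw).strip()
    if not digits.isdigit() or len(digits) > 8:
        raise ValueError(f"not a valid OPE ID: {raw!r}")
    return digits.zfill(8)


def opeid6_from_opeid(opeid8) -> int:
    """Derive the numeric 6-digit OPE ID (no zero-padding).

    Assumption: the 6-digit ID is the first six digits of the 8-digit ID.
    """
    return int(normalize_opeid(opeid8)[:6])


padded = normalize_opeid(102100)      # integer input that lost its leading zeros
short = opeid6_from_opeid("00102100")
```

Comparing a padded string against an unpadded integer matches nothing, so normalizing both sides before a join is the safer default.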
Data Timing
| Metric | Dimension Column | Values | Typical Lag |
|---|---|---|---|
| Earnings | | 6, 7, 8, 9, 10 | Data from 7+ years ago |
| Default | | 2, 3 | Varies |
| Repayment | | 1, 3, 5, 7 | Varies |
"After entry" means after first enrollment, not after graduation.
Categorical Value Encodings (Institutional Characteristics Dataset)
| Variable | Values |
|---|---|
| `pred_degree_awarded_ipeds` | 0=Not classified, 1=Certificate, 2=Associate's, 3=Bachelor's, 4=Graduate |
| Yes/No flags (HBCU, tribal, etc.) | 0=No, 1=Yes, null=Missing |
| `religious_affiliation` | 76 integer codes 22-200 (see variable-definitions.md for complete mapping), null=None/Missing |
Missing Data Codes
| Code | Meaning | Which Datasets |
|---|---|---|
| `-3` | Suppressed for privacy | Earnings dataset (earnings and count columns); primary suppression indicator |
| `null` | Missing/not applicable | Institutional characteristics (yes/no flags), repayment/default (rates) |
| Positive numeric | Actual value | Earnings, debt, counts, rates |
    import polars as pl

    # df: the earnings dataset, already loaded as a Polars DataFrame

    # Filter for valid earnings (handle -3 suppression code)
    valid = df.filter(
        (pl.col("earnings_med").is_not_null())
        & (pl.col("earnings_med") != -3)
    )

    # Filter for 6-year earnings specifically
    six_yr_valid = valid.filter(pl.col("years_after_entry") == 6)
Data Access
Datasets for Scorecard are available via the mirror system. See `datasets-reference.md` for canonical paths, `mirrors.yaml` for mirror configuration, and `fetch-patterns.md` for fetch code patterns.
Codebooks are `.xls` files co-located with data in all mirrors. Use `get_codebook_url()` from `fetch-patterns.md` to construct download URLs.
Truth Hierarchy: When interpreting variable values, apply this priority:
- Actual data file (what you observe in the parquet/CSV) — this IS the truth
- Live codebook (.xls in mirror) — authoritative documentation, may lag
- This skill documentation — convenient summary, may drift from codebook
If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.
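One way to apply this hierarchy in practice is to compare the codes actually observed in a column against the documented set and flag anything unexpected for investigation. The following is a minimal sketch in plain Python; the helper name `undocumented_codes` and the observed values are hypothetical, and the documented set is the `pred_degree_awarded_ipeds` encoding summarized in this document.

```python
def undocumented_codes(observed, documented):
    """Return sorted non-null observed values that the documentation does not list."""
    seen = {value for value in observed if value is not None}
    return sorted(seen - set(documented))


# Documented codes for pred_degree_awarded_ipeds per this skill's summary.
documented = {0, 1, 2, 3, 4}

# Hypothetical values read from a data file (made up for illustration).
observed = [3, 2, None, 4, 5, 3]

extra = undocumented_codes(observed, documented)
```

A non-empty result does not mean the data is wrong. Under the hierarchy above it means the documentation may have drifted, so check the live codebook next.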
All Scorecard Datasets (6 total)
| Dataset | Path | Codebook | Type | Years |
|---|---|---|---|---|
| Earnings | | | Single | varies |
| Default | | | Single | 1996-2020 |
| Institutional Characteristics | | | Single | 1996-2020 |
| Repayment | | | Single | 2007-2016 |
| Student Characteristics (Aid) | | | Single | 1997-2016 |
| Student Characteristics (Home) | | | Single | 1997-2016 |
Scorecard naming note: Data file paths differ significantly from codebook paths. Notable mismatches: data `repayment_fsa` vs codebook `default`; data `inst_characteristics` vs codebook `institutional-characteristics`; data `repayment_nslds` vs codebook `repayment`; data `student_body_nslds` vs codebook `student-characteristics_aid-applicants`; data `student_body_treasury` vs codebook `student-characteristics_home-neighborhood`. Always use the exact paths shown above.
Fetching Data
    import polars as pl
    from fetch_utils import fetch_from_mirrors  # See fetch-patterns.md

    # Fetch earnings data
    earnings = fetch_from_mirrors("scorecard/colleges_scorecard_earnings")

    # Filter by time horizon (LONG format: filter, don't use wide column names)
    six_yr = earnings.filter(pl.col("years_after_entry") == 6)

    # Filter for valid earnings (exclude -3 suppression code)
    valid = six_yr.filter(
        (pl.col("earnings_med").is_not_null())
        & (pl.col("earnings_med") != -3)
    )

    # Institution names/control are NOT in the earnings dataset.
    # Join to inst_characteristics or IPEDS directory:
    inst = fetch_from_mirrors(
        "scorecard/colleges_scorecard_inst_characteristics", years=[2020]
    )
    valid = valid.join(
        inst.select("unitid", "inst_name", "pred_degree_awarded_ipeds"),
        on="unitid",
        how="left",
    )
Common Pitfalls
| Pitfall | Issue | Solution |
|---|---|---|
| "All graduates" claims | Scorecard covers Title IV recipients only, not all students | Note Title IV limitation prominently in any analysis |
| Wage comparison | Comparing to BLS wages or Census income uses different populations | Use for relative comparisons, not absolute claims; document population differences |
| Ignoring suppression | Many programs have no data due to privacy thresholds | Check suppression rates before analyzing; document coverage |
| Time lag ignored | Earnings reflect old cohorts (6-year = data from 7+ years ago) | Document data vintage and cohort years explicitly |
| Total borrowing assumption | Scorecard debt includes only federal loans, not private | State "federal loans only" when reporting debt figures |
| String codes from docs | Original Scorecard uses string labels; Portal uses integers | Verify actual data types in Portal parquet files; use integer codes |
| Wide-format variable names | Using original wide-format column names on Portal data | Portal uses LONG format; filter on `years_after_entry` instead |
| Assuming null = suppressed | Earnings dataset uses `-3` for suppression, not null | Filter out both null and `-3` values |
| Using uppercase names | Original Scorecard uses uppercase column names; Portal uses lowercase names | Always use lowercase Portal names from actual data |
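For the "Ignoring suppression" pitfall, a quick coverage check before analysis might look like the sketch below. It is plain Python over a list of values; the function name `suppression_summary` and the sample values are hypothetical, and only the `-3` and null conventions come from this document.

```python
def suppression_summary(values, suppressed_code=-3):
    """Count null, suppressed, and valid entries in one earnings column."""
    total = len(values)
    n_null = sum(1 for value in values if value is None)
    n_suppressed = sum(1 for value in values if value == suppressed_code)
    n_valid = total - n_null - n_suppressed
    return {
        "total": total,
        "null": n_null,
        "suppressed": n_suppressed,
        "valid": n_valid,
        "valid_share": (n_valid / total) if total else 0.0,
    }


# Made-up column values for illustration only.
summary = suppression_summary([31000, -3, None, 42000, -3])
```

Reporting `valid_share` alongside any result documents how much of the population the analysis actually covers.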
Critical Limitation: Title IV Recipients Only
The single most important caveat for all Scorecard analysis:
Scorecard tracks ONLY students who received federal financial aid (Title IV):
- Pell Grants
- Federal student loans (Direct, Perkins, PLUS)
- Federal work-study
| Excluded Group | Impact |
|---|---|
| Full-pay students | Often higher-income; different outcomes |
| Students with only state/institutional aid | Missing from data |
| International students | Not eligible for federal aid |
| Some graduate students | If they received no federal aid |
Coverage varies dramatically by institution type:
| Institution Type | Typical Title IV Coverage |
|---|---|
| For-profit colleges | 80-90%+ |
| Community colleges | 60-80% |
| Public flagships | 50-70% |
| Selective private colleges | 30-50% |
Data systematically overrepresents lower-income students who are more likely to need federal aid.
What Scorecard Data Does NOT Include
| Excluded | Why It Matters |
|---|---|
| Non-Title IV students | Often higher-income; different outcomes |
| Self-employment income | 1099 income excluded from earnings |
| Students still in school | Not working = not in earnings data |
| Private student loans | Only federal loans tracked |
| Students who left the country | Lost to follow-up |
Comparison: Scorecard vs IPEDS
| Aspect | College Scorecard | IPEDS |
|---|---|---|
| Who's tracked | Title IV aid recipients | First-time, full-time students |
| Includes part-time | Yes | No (for grad rates) |
| Includes transfers-in | Yes | No (tracked at origin) |
| Outcome focus | Earnings, debt, repayment | Completion, retention |
| Data source | NSLDS + IRS | Institution-reported |
Related Data Sources
| Source | Relationship | When to Use |
|---|---|---|
| Institutional characteristics, enrollment, finance | Join on `unitid` for institution names, control type, enrollment context |
| Alternative post-college earnings (Census LEHD) | When broader population coverage needed (not limited to Title IV) |
| Federal student aid details | Deeper analysis of aid types and disbursements |
| Parent discovery skill | Finding available endpoints |
| Data fetching | Downloading parquet/CSV files |
Topic Index
| Topic | Reference File |
|---|---|
| Earnings methodology | |
| Cohort definitions | |
| IRS data matching | |
| Earnings suppression | |
| Debt metrics | |
| Repayment rates | |
| Default rates | |
| NSLDS data | |
| Completion methodology | |
| IPEDS comparison | |
| Title IV coverage | |
| Who is excluded | |
| Selection bias | |
| Variable names | |
| Special values | |
| Privacy suppression | |
| Data limitations | |
| Program-level data | |
| CIP codes | |