Awesome-Agent-Skills-for-Empirical-Research financial-data-analysis

Methods for acquiring, cleaning, and analyzing financial datasets for research

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/finance/financial-data-analysis" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-financial-data-an && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/domains/finance/financial-data-analysis/SKILL.md
source content

Financial Data Analysis

A practical skill for sourcing, processing, and analyzing financial data in academic research contexts. Covers data acquisition from public APIs, cleaning workflows, and standard analytical techniques used in empirical finance research.

Data Acquisition

Public Financial Data Sources

SourceData TypeAccessPython Package
Yahoo FinancePrices, fundamentalsFree
yfinance
FRED (St. Louis Fed)Macroeconomic indicatorsFree (API key)
fredapi
SEC EDGARCompany filings (10-K, 10-Q)Free
sec-edgar-downloader
WRDS (Wharton)CRSP, Compustat, IBESUniversity subscription
wrds
Alpha VantageReal-time and historical pricesFree tier
alpha_vantage

Fetching Price Data

import yfinance as yf
import pandas as pd

def fetch_stock_data(tickers: list[str], start: str, end: str) -> pd.DataFrame:
    """
    Fetch adjusted close prices for a list of tickers.

    Args:
        tickers: List of ticker symbols (e.g., ['AAPL', 'MSFT'])
        start: Start date (YYYY-MM-DD)
        end: End date (YYYY-MM-DD)
    Returns:
        DataFrame with adjusted close prices
    """
    data = yf.download(tickers, start=start, end=end, auto_adjust=True)
    prices = data['Close'] if len(tickers) > 1 else data[['Close']]
    prices.columns = tickers if len(tickers) > 1 else tickers
    return prices

# Fetch 5 years of data
prices = fetch_stock_data(['AAPL', 'MSFT', 'GOOGL'], '2020-01-01', '2025-01-01')
print(prices.head())

Macroeconomic Data from FRED

from fredapi import Fred

fred = Fred(api_key=os.environ["FRED_API_KEY"])

# Common series for finance research
series_ids = {
    'GDP': 'GDP',
    'CPI': 'CPIAUCSL',
    'Fed_Funds_Rate': 'FEDFUNDS',
    'Unemployment': 'UNRATE',
    '10Y_Treasury': 'DGS10',
    'VIX': 'VIXCLS'
}

macro_data = pd.DataFrame()
for name, sid in series_ids.items():
    macro_data[name] = fred.get_series(sid, observation_start='2000-01-01')

Data Cleaning Pipeline

Financial data requires careful cleaning before analysis:

def clean_financial_data(df: pd.DataFrame) -> pd.DataFrame:
    """Standard cleaning pipeline for financial time series."""
    cleaned = df.copy()

    # 1. Handle missing values
    missing_pct = cleaned.isnull().sum() / len(cleaned) * 100
    print(f"Missing data:\n{missing_pct}")

    # 2. Forward-fill for market holidays (max 5 days)
    cleaned = cleaned.ffill(limit=5)

    # 3. Remove remaining NaN rows
    cleaned = cleaned.dropna()

    # 4. Detect and flag outliers (>5 sigma daily returns)
    returns = cleaned.pct_change()
    z_scores = (returns - returns.mean()) / returns.std()
    outliers = (z_scores.abs() > 5).any(axis=1)
    print(f"Outlier days flagged: {outliers.sum()}")

    # 5. Verify data integrity
    assert cleaned.index.is_monotonic_increasing, "Index must be sorted"
    assert not cleaned.duplicated().any(), "No duplicate rows allowed"

    return cleaned

Standard Financial Metrics

Return Calculations

def compute_returns(prices: pd.DataFrame) -> dict:
    """Compute standard return metrics."""
    simple_returns = prices.pct_change().dropna()
    log_returns = np.log(prices / prices.shift(1)).dropna()

    annualized_return = simple_returns.mean() * 252
    annualized_vol = simple_returns.std() * np.sqrt(252)
    sharpe_ratio = annualized_return / annualized_vol

    # Maximum drawdown
    cumulative = (1 + simple_returns).cumprod()
    rolling_max = cumulative.cummax()
    drawdown = (cumulative - rolling_max) / rolling_max
    max_drawdown = drawdown.min()

    return {
        'annualized_return': annualized_return,
        'annualized_volatility': annualized_vol,
        'sharpe_ratio': sharpe_ratio,
        'max_drawdown': max_drawdown
    }

Event Studies

A common methodology in empirical finance research:

  1. Define the event window (e.g., [-5, +5] trading days around earnings announcement)
  2. Estimate normal returns using the market model over the estimation window (e.g., [-250, -30])
  3. Compute abnormal returns: AR = R_actual - R_expected
  4. Aggregate cumulative abnormal returns (CAR) across firms
  5. Test statistical significance using parametric (Patell test) and non-parametric (sign test) methods

Always report both raw and risk-adjusted results, and perform robustness checks with different estimation windows and benchmark models.

Reproducibility

Store all data processing steps in version-controlled scripts. Use

pandas.DataFrame.to_parquet()
for efficient storage of intermediate datasets, and document data provenance including download dates, API versions, and any filters applied.