Awesome-Agent-Skills-for-Empirical-Research r-python-translation
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/r-python-translation" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-r-python-translat && rm -rf "$T"
skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/r-python-translation/SKILL.mdR-to-Python Translation Skill
R-to-Python translation reference for quantitative social science data analysis. Maps R ecosystem packages (tidyverse/dplyr, ggplot2, fixest, survey, sf, plm, lme4, marginaleffects, rdrobust) to DAAF Python equivalents (polars, plotnine, pyfixest, statsmodels, linearmodels, svy, geopandas). Use when user mentions R/RStudio background, requests R-equivalent code comments, needs to understand Python analysis code from an R perspective, or wants to translate R data analysis concepts to Python. Covers paradigm differences, verb-by-verb operation translations, regression modeling, causal inference, visualization, and workflow adaptation.
Cross-language translation reference for researchers moving between the R and Python data analysis ecosystems. This skill maps R packages, idioms, and workflows to their DAAF Python equivalents so that R-background users can audit, understand, and learn from DAAF-produced code, and so that code-producing agents can annotate their output with R equivalents when directed.
This skill is a routing hub — it provides overview tables, decision trees, and directs readers to the detailed reference files listed below. The reference files contain the exhaustive verb-by-verb mappings, code examples, and edge-case documentation.
What This Skill Does
- Maps the R data analysis ecosystem to DAAF's Python stack across data wrangling, modeling, visualization, causal inference, surveys, spatial analysis, and workflow tooling
- Provides a structured annotation protocol for agents to add inline R-equivalent comments to Python code
- Identifies paradigm gaps where R and Python diverge fundamentally, so users know where to expect friction
Use cases:
- R user auditing DAAF Python code and needing to understand what operations are being performed
- Agent annotating code with R-equivalent comments for an R-background researcher
- R user learning Python for data analysis and needing a conceptual bridge
- Translating a specific R operation or idiom to its Python equivalent
- Understanding where R tools have no direct Python equivalent (and what the workaround is)
How to Use This Skill
Reference File Structure
Each topic in
./references/ contains focused documentation:
| File | Purpose | When to Read |
|---|---|---|
| Core language and paradigm differences | Encountering fundamental R-vs-Python confusion |
| Core dplyr/tidyr to polars verb mapping (select, filter, mutate, joins, reshaping, window functions, lazy eval) | Reading or writing data manipulation code |
| String, date/time, and factor operations (stringr, lubridate, forcats to polars) | Working with string/date/categorical columns |
| fixest/stats/plm to pyfixest/statsmodels/linearmodels | Reading or writing regression code |
| ggplot2/plotly R to plotnine/plotly Python | Reading or writing visualization code |
| R causal inference ecosystem to Python equivalents | Working with DiD, RDD, IV, event studies |
| survey/sf/tidymodels to svy/geopandas/scikit-learn | Working with surveys, spatial data, or ML |
| RStudio/Quarto workflow to DAAF/marimo workflow | Adapting to DAAF's execution model |
| Curated guides and tutorials with provenance | Seeking additional learning materials |
| Common R-user mistakes in Python | Debugging or reviewing code from R perspective |
Reading Order
- R user auditing DAAF code:
then the relevant domain file (e.g.,paradigm-differences.md
for data wrangling,polars-dplyr.md
for models) thenregression-modeling.mdgotchas.md - Agent annotating code with R equivalents: Agent Code Annotation Protocol section below, then the relevant domain file for the code being annotated
- Learning Python from R background:
thenparadigm-differences.md
thenpolars-dplyr.md
thenworkflow-environment.mdexternal-resources.md - Looking up a specific translation: Quick Decision Trees below, then the relevant reference file
Quick Decision Trees
"How do I do X from R in Python?"
What kind of R operation? ├─ Data wrangling (filter, mutate, join, pivot, summarise) │ └─ ./references/polars-dplyr.md ├─ Regression / statistical modeling │ └─ ./references/regression-modeling.md ├─ Plotting / visualization │ └─ ./references/visualization.md ├─ Causal inference (DiD, RDD, IV, event studies) │ └─ ./references/causal-inference.md ├─ Surveys / spatial / machine learning │ └─ ./references/survey-spatial-ml.md └─ Fundamental language differences (types, syntax, environment) └─ ./references/paradigm-differences.md
"Why does this Python code look different from R?"
What looks unfamiliar? ├─ Expression syntax (pl.col().method().alias()) │ └─ ./references/paradigm-differences.md ├─ Missing values (None vs NaN vs null vs NA) │ └─ ./references/paradigm-differences.md ├─ Formula interface (~) behaves differently │ └─ ./references/regression-modeling.md ├─ Import patterns and namespacing │ └─ ./references/gotchas.md └─ No interactive REPL / console workflow └─ ./references/workflow-environment.md
"I want to translate an R script to Python"
What does the R script do? ├─ Loads and wrangles data (read_csv, dplyr verbs) │ └─ ./references/polars-dplyr.md ├─ Runs regressions (lm, feols, plm) │ └─ ./references/regression-modeling.md ├─ Creates plots (ggplot, plotly) │ └─ ./references/visualization.md ├─ Uses survey weights (svydesign, svymean) │ └─ ./references/survey-spatial-ml.md ├─ Spatial operations (sf, st_join) │ └─ ./references/survey-spatial-ml.md ├─ Multiple of the above │ └─ Start with ./references/paradigm-differences.md, then each relevant file └─ Uses a package not listed above └─ ./references/external-resources.md for broader ecosystem guidance
"Something isn't working and I think it's an R habit"
What went wrong? ├─ 1-indexed access gave wrong element │ └─ ./references/gotchas.md ├─ Factor/categorical behaves differently │ └─ ./references/gotchas.md ├─ NA handling surprised me │ └─ ./references/paradigm-differences.md ├─ Pipe operator (|> or %>%) not available │ └─ ./references/paradigm-differences.md ├─ library() vs import confusion │ └─ ./references/gotchas.md └─ Model output structure is different └─ ./references/regression-modeling.md
"Which Python package replaces my R package?"
Which R package? ├─ dplyr / tidyr / readr / tibble → polars │ └─ ./references/polars-dplyr.md ├─ ggplot2 → plotnine │ └─ ./references/visualization.md ├─ plotly (R) → plotly (Python) │ └─ ./references/visualization.md ├─ fixest → pyfixest │ └─ ./references/regression-modeling.md ├─ stats (lm, glm) → statsmodels │ └─ ./references/regression-modeling.md ├─ plm / lme4 / estimatr → linearmodels │ └─ ./references/regression-modeling.md ├─ survey → svy │ └─ ./references/survey-spatial-ml.md ├─ sf / terra → geopandas │ └─ ./references/survey-spatial-ml.md ├─ tidymodels / caret → scikit-learn │ └─ ./references/survey-spatial-ml.md ├─ marginaleffects → marginaleffects (Python) │ └─ ./references/regression-modeling.md ├─ rdrobust / did / synthdid → rdrobust / pyfixest DiD │ └─ ./references/causal-inference.md └─ Quarto / RMarkdown → marimo └─ ./references/workflow-environment.md
Package Mapping Overview
| Python Package | R Equivalent | Fidelity | Key Difference |
|---|---|---|---|
| polars | dplyr + tidyr + data.table | Low | Expression system vs verb grammar; method chaining vs pipe |
| pyfixest | fixest | High | Near-identical formula syntax; minor SE default differences |
| plotnine | ggplot2 | High | Same grammar of graphics; Python string quoting for aes |
| plotly | plotly (R) | High | vs ; similar output |
| statsmodels | base R stats + lmtest + sandwich | Medium | Three formula dialects; manual vcov specification |
| linearmodels | plm + lme4 + estimatr | Medium | Requires pandas MultiIndex for panel structure |
| scikit-learn | tidymodels / caret | Medium | Imperative fit/predict vs declarative recipe pipeline |
| geopandas | sf + terra | Medium | shapely geometries vs sfc; different CRS handling |
| svy | survey (Lumley) | Medium | Limited GLM family coverage (gaussian/binomial/Poisson only) |
| marimo | Quarto / RMarkdown | Medium | Reactive cells vs knit-based linear execution |
Fidelity key: High = near-direct translation, same mental model. Medium = same capability, different API patterns. Low = fundamentally different paradigm requiring conceptual remapping.
Library Versions
Translations in this skill reference specific library versions. Python versions are pinned in DAAF's Docker environment (Python 3.12). R versions reference CRAN releases as of March 2026. When syntax or behavior has changed between versions, the reference files note the change.
| Python Package | DAAF Version | R Equivalent | R Version (CRAN) |
|---|---|---|---|
| polars | 1.38.1 | dplyr + tidyr + data.table | dplyr 1.2.0, tidyr 1.3.2, data.table 1.18.2 |
| pyfixest | 0.40.0 | fixest | 0.14.0 |
| plotnine | 0.15.3 | ggplot2 | 4.0.2 |
| plotly | 6.5.2 | plotly (R) | 4.12.0 |
| statsmodels | 0.14.6 | base R stats + lmtest + sandwich | lmtest 0.9-40, sandwich 3.1-1 |
| linearmodels | unpinned | plm + lme4 + estimatr | plm 2.6-7, lme4 2.0-1 |
| scikit-learn | 1.8.0 | tidymodels / caret | tidymodels 1.4.1, caret 7.0-1 |
| geopandas | 1.1.3 | sf + terra | sf 1.1-0, terra 1.9-11 |
| svy | 0.13.0 | survey | survey 4.5 |
| marginaleffects | unpinned | marginaleffects (R) | 0.32.0 |
| rdrobust | unpinned | rdrobust (R) | 3.0.0 |
| marimo | 0.19.11 | Quarto / RMarkdown | Quarto 1.6.x |
Unpinned packages: linearmodels, marginaleffects, and rdrobust install the latest version at Docker build time. Translations for these packages reference their documented API as of March 2026.
R version note: R package versions are from CRAN as of March 2026 (R 4.5.3). Check
packageVersion("pkg") in your R installation to verify your local version matches.
Top 10 Paradigm Differences
These are the friction points R users encounter most frequently when reading or writing DAAF Python code. Each is covered in depth in the referenced file.
| # | Friction Point | R Way | Python Way | Reference |
|---|---|---|---|---|
| 1 | Expression system | | | |
| 2 | Formula fragmentation | One universal syntax | Three dialects (pyfixest, statsmodels, linearmodels) | |
| 3 | Missing values | Single type | , , and (context-dependent) | |
| 4 | mutate equivalent | | | |
| 5 | No row index | Tibbles have row numbers | Polars has no row index; use | |
| 6 | Polars-to-pandas bridge | Data frames go directly into models | Must call before statsmodels/pyfixest | |
| 7 | Factor vs Categorical | with ordered levels | / (different semantics) | |
| 8 | Package fragmentation | One package per domain (fixest does it all) | Multiple packages per domain (statsmodels + linearmodels + pyfixest) | |
| 9 | 1-indexed vs 0-indexed | is first element | is first element | |
| 10 | Namespace model | exports all names | requires explicit namespacing | |
Agent Code Annotation Protocol
This section defines when and how code-producing agents add inline R-equivalent comments to DAAF Python scripts.
When to Annotate
Annotations are added only when the orchestrator explicitly passes an R-background directive to the agent. This is not a default behavior.
Trigger conditions (orchestrator activates this when any apply):
- User states they have an R / RStudio background
- User requests R-equivalent comments in code
- User asks to understand Python code from an R perspective
How the orchestrator passes the directive: The orchestrator adds the following to the agent prompt:
"User has R background. Load r-python-translation skill. Add inline R-equivalent comments for non-trivial data operations."
Comment Format
# R: df %>% filter(year == 2020) filtered = df.filter(pl.col("year") == 2020) # R: df %>% mutate(pct = count / sum(count)) result = df.with_columns( (pl.col("count") / pl.col("count").sum()).alias("pct") ) # R: feols(y ~ x1 + x2 | state + year, data = df, cluster = ~state) fit = pf.feols("y ~ x1 + x2 | state + year", data=pdf, vcov={"CRV1": "state"})
What to Annotate
- Annotate: Data wrangling (polars operations), modeling calls (pyfixest, statsmodels, linearmodels), visualization layer construction (plotnine, plotly), causal inference method calls
- Do NOT annotate: Import statements,
/print()
validation lines, file I/O boilerplate (assert
,pl.read_parquet
), config sections, section separator commentsdf.write_parquet
Rules
- One
comment per logical operation, placed on the line immediately above the Python code# R: - Keep annotations to a single line; abbreviate complex R pipelines if needed
- R annotations are in addition to standard IAT comments (
,# INTENT:
,# REASONING:
), not a replacement# ASSUMES: - Consumer agents: research-executor, code-reviewer, debugger, data-ingest
Related Skills
| Skill | Relationship |
|---|---|
| Python-side data wrangling — detailed API reference for the dplyr/tidyr equivalent |
| Python-side fixed effects regression — detailed API for the fixest equivalent |
| Python-side static visualization — detailed API for the ggplot2 equivalent |
| Python-side interactive visualization — detailed API for plotly R equivalent |
| Python-side general modeling — covers base R stats, lmtest, sandwich equivalents |
| Python-side panel/IV models — covers plm, lme4, estimatr equivalents |
| Python-side ML — covers tidymodels/caret equivalents |
| Python-side spatial data — covers sf/terra equivalents |
| Python-side survey analysis — covers survey (Lumley) equivalents |
| Python-side notebooks — covers Quarto/RMarkdown workflow equivalents |
| Parallel skill for Stata-background users — shares the same Python target stack |
Note: Individual tool skills contain library-specific usage guidance (syntax, gotchas, performance). This skill provides the R-to-Python conceptual bridge — use both together when an R-background user is working with a specific library.
Topic Index
| Topic | Reference File |
|---|---|
Pipe operator ( / ` | >`) equivalents |
| Expression system (pl.col, .alias) | |
| Missing value semantics (NA vs None/NaN/null) | |
| Type system differences | |
| Package/namespace model | |
| 0-indexing vs 1-indexing | |
| Polars-to-pandas conversion for modeling | |
| Row index differences | |
| dplyr verb mapping (filter, select, mutate, arrange) | |
| summarise / group_by equivalents | |
| tidyr verbs (pivot_longer, pivot_wider, separate, unite) | |
| Join operations (left_join, inner_join, anti_join) | |
| String operations (stringr vs polars .str) | |
| Date operations (lubridate vs polars .dt) | |
| across() / where() equivalents | |
| case_when equivalent | |
| readr I/O equivalents | |
| fixest formula syntax in pyfixest | |
| lm() / glm() in statsmodels | |
| Formula interface comparison (three Python dialects) | |
| Standard error specification differences | |
| plm panel models in linearmodels | |
| lme4 mixed effects equivalents | |
| marginaleffects (R to Python) | |
| Model summary / tidy output | |
| Sandwich / robust SE equivalents | |
| ggplot2 layer mapping to plotnine | |
| aes() string quoting in plotnine | |
| Theme customization | |
| Scale functions | |
| Faceting (facet_wrap, facet_grid) | |
| plotly R vs plotly Python | |
| ggsave equivalent | |
| Difference-in-differences (did, did2s) | |
| Regression discontinuity (rdrobust) | |
| Instrumental variables (ivreg vs pyfixest IV) | |
| Event study designs | |
| Synthetic control | |
| Matching / propensity scores | |
| survey package to svy | |
| svydesign / svymean / svyglm equivalents | |
| sf spatial operations to geopandas | |
| CRS / projection handling | |
| Spatial joins (st_join vs sjoin) | |
| tidymodels pipeline to scikit-learn | |
| RStudio vs DAAF workflow | |
| Quarto / RMarkdown vs marimo | |
| Interactive console vs file-first execution | |
| Package management (renv vs pip/uv) | |
| Project structure conventions | |
| Curated R-to-Python migration guides | |
| Package documentation links | |
| Tutorial recommendations with provenance | |
| 1-indexed list/vector access | |
| Factor vs Categorical pitfalls | |
| library() vs import habits | |
| T/F vs True/False | |
| Assignment operator (<- vs =) | |
| Vectorized operations expectations | |
| NULL vs None differences | |
| apply family vs map/list comprehension | |
| Copying semantics (R copy-on-modify vs Python references) | |
| Logical operators (& / | vs and / or) |
| String interpolation (glue vs f-strings) | |
| data.table vs polars | |
| Lazy evaluation (polars LazyFrame vs R lazy tibble) | |
| nest/unnest equivalents | |
| Window functions (over vs mutate + group_by) | |
| Coordinate systems (coord_flip, coord_polar) | |
| Stat layers (stat_smooth, stat_summary) | |
| Color palette mapping (viridis, brewer) | |
| Multi-panel layouts (patchwork vs subplot) | |
| Staggered DiD estimators | |
| Parallel trends testing | |
| BRR / jackknife replication weights | |
| Raster data handling (terra vs rasterio) | |
| Feature engineering (recipes vs sklearn Pipeline) | |
| Cross-validation (rsample vs sklearn) | |
| Environment/workspace differences (.RData vs nothing) | |
| Debugging workflow (browser() vs breakpoint()) | |
| R help system (?func) vs Python help(func) | |
| Cheat sheet and quick-reference links | |
| Community resources (Stack Overflow tags, forums) | |