★ Furious Five 01 02 03 04 05 06

Python, Stata & R

Code Cheatsheet

Quick-reference code for every causal inference method in the book. Python, Stata, and R side by side.

6 chapters · 5 methods · Python + Stata + R

⚠ Code without understanding is a dangerous shortcut

These cheatsheets are a reference tool, not a substitute for learning. Every method rests on assumptions that must be justified by the context in which the data was collected and the relationships that economic theory suggests.

A valid instrument in one setting may be invalid in another. Parallel trends may hold for one policy comparison but fail for the next. Running code without understanding when and why each method works can produce results that are not just wrong, but dangerously misleading.

Before using any of this code, you should:

  • 📖 Read the book — understand the theory and assumptions behind each method
  • Complete the exercises — test whether you truly understand the concepts
  • 🤖 Use the AI tutors — deepen your understanding through guided conversation
  • 🎧 Listen to the podcasts — revise key ideas through a different medium

The Furious Five

All 5 causal inference methods applied to one question — the ultimate code summary

The Furious Five: All Methods at a Glance

Five methods, one question: Does education cause higher earnings? Chapter 6 applies all five to the same problem — and they converge on a true return of about 7–10% per year.

Method Key Assumption Estimates Main Threat
1. RCT
Random assignment breaks the link between treatment and confounders. A simple comparison of means is causal.
group_means = df.groupby("D")["Y"].mean()
Random assignmentATENon-compliance
2. Regression
Controls for observable differences, but any unobserved confounder biases the estimate. Useful as a baseline, never as the final word.
pf.feols("Y ~ D + X", data=df, vcov="hetero")
All confounders observedConditional avgOmitted variable bias
3. IV / 2SLS
Uses an external source of exogenous variation (instrument) that shifts treatment but is unrelated to the outcome. Recovers the effect for compliers.
pf.feols("Y ~ 1 | D ~ Z", data=df, vcov="hetero")
Valid instrumentLATE (compliers)Weak/invalid instrument
4. RD
Compares observations just above vs. just below a cutoff. Gives a credible causal estimate at the cutoff, but may not generalize far from the threshold.
pf.feols("Y ~ D * score", data=df, vcov="hetero")
No manipulation of running varLocal effect at cutoffLimited external validity
5. DD
Compares changes over time between treated and control groups. Identifies the causal effect if both groups would have followed the same trend absent treatment.
pf.feols("Y ~ treat_post | group + time", data=df)
Parallel trendsATTPre-trend violation
import pandas as pd import numpy as np import pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # ================================================================ # METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE) # ================================================================ # Question: Does education cause higher earnings? # Equation: ln(W) = α + ρ·S + ε (First Stage: S = γ_0 + γ_1·Z + u) # Logic: Z (scholarship offer) is randomly assigned. We use it to # isolate exogenous variation in S (actual schooling). # Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S) # Bias: None, assuming the offer only affects earnings through schooling # ---------------------------------------------------------------- rct = pd.read_csv(DATA + "ch6/synthetic_rct.csv") means = rct.groupby("scholarship")[["schooling", "earnings"]].mean() wald_rct = (means.loc[1, "earnings"] - means.loc[0, "earnings"]) / \ (means.loc[1, "schooling"] - means.loc[0, "schooling"]) print(f"1. RCT (Wald): {wald_rct:.4f}") # ================================================================ # METHOD 2: REGRESSION (OLS) # ================================================================ # Question: What is the raw return to each year of schooling? # Equation: ln(W) = α + ρ·S + ε # Logic: Control for observables, hoping all confounders are captured. # Estimate: Conditional average association (not necessarily causal). # Bias: Omitted Variable Bias (e.g., Ability Bias - upward). Smarter # individuals may get more schooling AND earn more regardless. # ---------------------------------------------------------------- qob = pd.read_csv(DATA + "ch6/qob_clean.csv") ols = pf.feols("lnw ~ s", data=qob, vcov="hetero") print(f"2. OLS: {ols.coef()['s']:.4f}") # ================================================================ # METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS) # ================================================================ # Question: What is the causal return to schooling, using exogenous variation? # Equation: Wald = Cov(Y, Z) / Cov(D, Z) # Logic: Quarter of birth (Z) shifts schooling laws but is unrelated to ability. # Estimate: LATE — Local Average Treatment Effect for "compliers" # (students kept in school solely due to compulsory schooling laws). # Bias: None, provided Z is relevant and the exclusion restriction holds. # ---------------------------------------------------------------- iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero") print(f"3. IV (QOB): {iv.coef()['s']:.4f}") # ================================================================ # METHOD 4: REGRESSION DISCONTINUITY (RD) # ================================================================ # Question: Does the diploma credential itself boost earnings? # Equation: Y = α + ρ·D + β_1·(Score) + β_2·(D × Score) + ε # Logic: Compare students just above vs. just below the passing cutoff. # Estimate: LATE at the cutoff (the "sheepskin" or credential effect). # Bias: None, assuming continuity of potential outcomes at the cutoff # (students cannot precisely manipulate their scores). # ---------------------------------------------------------------- sheep = pd.read_csv(DATA + "ch6/sheepskin_clean.csv") # Center the running variable if not already centered, and apply a bandwidth (e.g., +/- 30) bandwidth_data = sheep[abs(sheep["minscore"]) <= 30] # Use robust local linear regression allowing varying slopes on either side of cutoff rd = pf.feols("avgearnings ~ pass_exam * minscore", data=bandwidth_data, weights="person_years", vcov="hetero") print(f"4. RD: ${rd.coef()['pass_exam']:.0f} (Credential effect)") # ================================================================ # METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD) # ================================================================ # Question: Do compulsory schooling reforms (T) raise causal earnings? # Equation: Wald-DiD = DiD_Earnings / DiD_Schooling # Logic: Compare changes over time in reform vs. non-reform states, using # the reform as an instrument for actual years of schooling. # Estimate: LATE of schooling on earnings driven by the policy change. # Bias: Biased if parallel trends assumption fails (states would have # had different trajectories absent the reform). # ---------------------------------------------------------------- did = pd.read_csv(DATA + "ch6/synthetic_did.csv") did["treat_post"] = did["treated"] * did["post"] # First stage: Effect of policy on schooling dd_s = pf.feols("avg_schooling ~ treat_post | state + year", data=did, vcov={"CRV1": "state"}) # Reduced form: Effect of policy on earnings dd_e = pf.feols("avg_earnings ~ treat_post | state + year", data=did, vcov={"CRV1": "state"}) wald_did = dd_e.coef()['treat_post'] / dd_s.coef()['treat_post'] print(f"5. Wald-DiD: {wald_did:.4f}")
🎲

01 Randomized Trials

Selection bias · Potential outcomes · RAND HIE

Read Chapter Watch Video Listen to Podcast
import pandas as pd import pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # --- Step 1: Compare health by insurance status (observational) --- nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv") print(nhis.groupby("insurance")["health"].mean().round(2)) result = pf.feols("health ~ insurance", data=nhis, vcov="hetero") result.summary() # --- Step 2: Balance check (RAND HIE — did randomization work?) --- rand = pd.read_csv(DATA + "ch1/rand_balance.csv") d = rand[["age","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna() result = pf.feols("age ~ plan_free+plan_deductible+plan_coinsurance", data=d, vcov={"CRV1": "family_id"}) result.summary() # --- Step 3: Causal effect on spending (free insurance → more spending) --- hie = pd.read_csv(DATA + "ch1/rand_utilization.csv") d = hie[["total_expenses","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna() result = pf.feols("total_expenses ~ plan_free+plan_deductible+plan_coinsurance", data=d, vcov={"CRV1": "family_id"}) result.summary() # --- Step 4: Causal effect on health (the RAND paradox: no effect!) --- health = pd.read_csv(DATA + "ch1/rand_health_outcomes.csv") d = health[["health_index","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna() result = pf.feols("health_index ~ plan_free+plan_deductible+plan_coinsurance", data=d, vcov={"CRV1": "family_id"}) result.summary()
📈

02 Regression

OLS · Omitted variable bias · Bad controls

Read Chapter Watch Video Listen to Podcast
import numpy as np, pandas as pd, pyfixest as pf # --- Step 1: Simulate data where we KNOW the true causal effect --- np.random.seed(42) n = 1000 ability = np.random.normal(50, 10, n) prob_private = 1 / (1 + np.exp(-(ability - 50) / 5)) private = np.random.binomial(1, prob_private) true_effect = 5000 earnings = 30000 + true_effect*private + 800*ability + np.random.normal(0, 5000, n) students = pd.DataFrame({"earnings": earnings, "private": private, "ability": ability}) # --- Step 2: Short regression (omitting ability → biased) --- short = pf.feols("earnings ~ private", data=students) print(f"Short (biased): ${short.coef()['private']:,.0f} (true = $5,000)") # --- Step 3: Long regression (including ability → unbiased) --- long = pf.feols("earnings ~ private + ability", data=students) print(f"Long (unbiased): ${long.coef()['private']:,.0f}") # --- Step 4: Verify the OVB formula --- aux = pf.feols("ability ~ private", data=students) pi_1 = aux.coef()["private"] gamma = long.coef()["ability"] ovb = pi_1 * gamma print(f"OVB = pi1 x gamma = {pi_1:.2f} x {gamma:.0f} = ${ovb:,.0f}")
🔧

03 Instrumental Variables

LATE · Compliers · Minneapolis DV Experiment

Read Chapter Watch Video Watch Video 2 Listen to Podcast
import pandas as pd, pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # --- Step 1: Load Minneapolis Domestic Violence Experiment --- mdve = pd.read_csv(DATA + "ch3/mdve_clean.csv") # --- Step 2: Check compliance (assigned vs. delivered) --- print(pd.crosstab(mdve["assigned"], mdve["delivered"], margins=True)) # --- Step 3: Create binary treatment variables --- mdve["z_coddle"] = (mdve["assigned"] != "Arrest").astype(int) # instrument mdve["d_coddle"] = (mdve["delivered"] != "Arrest").astype(int) # treatment # --- Step 4: First stage --- fs = pf.feols("d_coddle ~ z_coddle", data=mdve, vcov="hetero") print(f"First stage: {fs.coef()['z_coddle']:.3f}") # --- Step 5: Reduced form (from published data) --- reduced_form = 0.211 - 0.097 print(f"Reduced form: {reduced_form:.3f}") # --- Step 6: LATE = reduced form / first stage --- late = reduced_form / fs.coef()["z_coddle"] print(f"LATE = {reduced_form:.3f} / {fs.coef()['z_coddle']:.3f} = {late:.3f}") print("Coddling increases recidivism by ~15 pp among compliers")
📉

04 Regression Discontinuity

Sharp RD · Bandwidth · MLDA & mortality

Read Chapter Watch Video Listen to Podcast
import pandas as pd, matplotlib.pyplot as plt, pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # --- Step 1: Load MLDA mortality data --- mlda = pd.read_csv(DATA + "ch4/mlda_clean.csv") # --- Step 2: Plot the discontinuity --- fig, ax = plt.subplots(figsize=(8, 5)) ax.scatter(mlda["agecell"], mlda["all"], color="gray", alpha=0.7, s=40) ax.axvline(x=21, color="red", linestyle="--", label="MLDA cutoff") ax.set_xlabel("Age"); ax.set_ylabel("Deaths per 100,000") ax.legend(); plt.show() # --- Step 3: Sharp RD (linear) --- result = pf.feols("all ~ over21 + age", data=mlda, vcov="hetero") print(f"Linear RD jump: {result.coef()['over21']:.2f} deaths per 100k") # --- Step 4: Quadratic RD (robustness) --- result = pf.feols("all ~ over21+age+age2+over_age+over_age2", data=mlda, vcov="hetero") print(f"Quadratic RD jump: {result.coef()['over21']:.2f}") # --- Step 5: Placebo (internal causes should NOT jump) --- result = pf.feols("internal ~ over21 + age", data=mlda, vcov="hetero") print(f"Placebo (internal): {result.coef()['over21']:.2f} (expect ~0)")
📊

05 Differences-in-Differences

Parallel trends · Two-way FE · MLDA panel data

Read Chapter Watch Video Listen to Podcast
import pandas as pd, pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # --- Step 1: Manual DD (Great Depression banking data) --- banks = pd.read_csv(DATA + "ch5/banks_clean.csv") pre_6 = banks.loc[banks["year"]==1930, "bib6"].values[0] post_6 = banks.loc[banks["year"]==1931, "bib6"].values[0] pre_8 = banks.loc[banks["year"]==1930, "bib8"].values[0] post_8 = banks.loc[banks["year"]==1931, "bib8"].values[0] dd = (post_6 - pre_6) - (post_8 - pre_8) print(f"Manual DD: {dd} banks saved by Atlanta Fed") # --- Step 2: Regression DD with fixed effects --- deaths = pd.read_csv(DATA + "ch5/deaths_clean.csv") allcause = deaths[deaths["dtype"] == "all"] result = pf.feols("mrate ~ legal | state + year", data=allcause, vcov={"CRV1": "state"}) print(f"DD (all-cause): {result.coef()['legal']:.2f} (SE: {result.se()['legal']:.2f})") # --- Step 3: Population-weighted DD --- result = pf.feols("mrate ~ legal | state + year", data=allcause, weights="pop", vcov={"CRV1": "state"}) print(f"Weighted DD: {result.coef()['legal']:.2f}") # --- Step 4: Placebo (suicide should NOT respond) --- suicide = deaths[deaths["dtype"] == "suicide"] result = pf.feols("mrate ~ legal | state + year", data=suicide, vcov={"CRV1": "state"}) print(f"Placebo (suicide): {result.coef()['legal']:.2f} (expect ~0)")
🎓

06 The Wages of Schooling

Twins · Quarter of birth · Child labor laws · Sheepskin effects

Read Chapter Watch Video Listen to Podcast
import pandas as pd, pyfixest as pf DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/" # --- Step 1: Simple bivariate OLS --- qob = pd.read_csv(DATA + "ch6/qob_clean.csv") ols = pf.feols("lnw ~ s", data=qob, vcov="hetero") print(f"Simple OLS: {ols.coef()['s']:.3f} (~7%/year, biased by ability)") # --- Step 2: OLS with controls (twins data) --- twins = pd.read_csv(DATA + "ch6/twins_clean.csv") ols_c = pf.feols("lwage ~ educ+age+age2+female+white", data=twins, vcov="hetero") print(f"OLS + controls: {ols_c.coef()['educ']:.3f} (~11%)") # --- Step 3: Twin fixed effects --- first = twins[twins["first"] == 1] fe = pf.feols("dlwage ~ deduc - 1", data=first, vcov="hetero") print(f"Twin FE: {fe.coef()['deduc']:.3f} (~6%, attenuated)") # --- Step 4: Twin IV (corrects measurement error) --- iv_twin = pf.feols("lwage ~ age+age2+female+white | educ ~ educt_t", data=twins, vcov="hetero") print(f"Twin IV: {iv_twin.coef()['educ']:.3f} (~11%)") # --- Step 5: Quarter-of-birth IV --- iv_qob = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero") print(f"QOB IV: {iv_qob.coef()['s']:.3f} (~7%)") # --- Step 6: Child labor law IV --- cl = pd.read_csv(DATA + "ch6/childlabor_clean.csv") ols_cl = pf.feols("lnwkwage ~ indEduc | sob + yob + year", data=cl, weights="weight", vcov={"CRV1": "sob"}) print(f"OLS (child labor): {ols_cl.coef()['indEduc']:.4f}") # --- Step 7: First-stage F-statistic --- fs = pf.feols("s ~ q4", data=qob) f_stat = fs.tstat()['q4'] ** 2 # F = t-squared for a single restriction print(f"First-stage F (QOB): {f_stat:.1f} (should be > 10)")