Quick-reference code for every causal inference method in the book. Python, Stata, and R side by side.
6 chapters·5 methods·Python + Stata + R
⚠ Code without understanding is a dangerous shortcut
These cheatsheets are a reference tool, not a substitute for learning. Every method rests on assumptions that must be justified by the context in which the data was collected and the relationships that economic theory suggests.
A valid instrument in one setting may be invalid in another. Parallel trends may hold for one policy comparison but fail for the next. Running code without understanding when and why each method works can produce results that are not just wrong, but dangerously misleading.
Before using any of this code, you should:
📖 Read the book — understand the theory and assumptions behind each method
✍ Complete the exercises — test whether you truly understand the concepts
🤖 Use the AI tutors — deepen your understanding through guided conversation
🎧 Listen to the podcasts — revise key ideas through a different medium
Five methods, one question: Does education cause higher earnings? Chapter 6 applies all five to the same problem — and they converge on a true return of about 7–10% per year.
Method
Key Assumption
Estimates
Main Threat
1. RCT Random assignment breaks the link between treatment and confounders. A simple comparison of means is causal. group_means = df.groupby("D")["Y"].mean()
Random assignment
ATE
Non-compliance
2. Regression Controls for observable differences, but any unobserved confounder biases the estimate. Useful as a baseline, never as the final word. pf.feols("Y ~ D + X", data=df, vcov="hetero")
All confounders observed
Conditional avg
Omitted variable bias
3. IV / 2SLS Uses an external source of exogenous variation (instrument) that shifts treatment but is unrelated to the outcome. Recovers the effect for compliers. pf.feols("Y ~ 1 | D ~ Z", data=df, vcov="hetero")
Valid instrument
LATE (compliers)
Weak/invalid instrument
4. RD Compares observations just above vs. just below a cutoff. Gives a credible causal estimate at the cutoff, but may not generalize far from the threshold. pf.feols("Y ~ D * score", data=df, vcov="hetero")
No manipulation of running var
Local effect at cutoff
Limited external validity
5. DD Compares changes over time between treated and control groups. Identifies the causal effect if both groups would have followed the same trend absent treatment. pf.feols("Y ~ treat_post | group + time", data=df)
Parallel trends
ATT
Pre-trend violation
import pandas as pd
import numpy as np
import pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# ================================================================
# METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
# ================================================================
# Question: Does education cause higher earnings?
# Equation: ln(W) = α + ρ·S + ε (First Stage: S = γ_0 + γ_1·Z + u)
# Logic: Z (scholarship offer) is randomly assigned. We use it to
# isolate exogenous variation in S (actual schooling).
# Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
# Bias: None, assuming the offer only affects earnings through schooling
# ----------------------------------------------------------------
rct = pd.read_csv(DATA + "ch6/synthetic_rct.csv")
means = rct.groupby("scholarship")[["schooling", "earnings"]].mean()
wald_rct = (means.loc[1, "earnings"] - means.loc[0, "earnings"]) / \
(means.loc[1, "schooling"] - means.loc[0, "schooling"])
print(f"1. RCT (Wald): {wald_rct:.4f}")
# ================================================================
# METHOD 2: REGRESSION (OLS)
# ================================================================
# Question: What is the raw return to each year of schooling?
# Equation: ln(W) = α + ρ·S + ε
# Logic: Control for observables, hoping all confounders are captured.
# Estimate: Conditional average association (not necessarily causal).
# Bias: Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
# individuals may get more schooling AND earn more regardless.
# ----------------------------------------------------------------
qob = pd.read_csv(DATA + "ch6/qob_clean.csv")
ols = pf.feols("lnw ~ s", data=qob, vcov="hetero")
print(f"2. OLS: {ols.coef()['s']:.4f}")
# ================================================================
# METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
# ================================================================
# Question: What is the causal return to schooling, using exogenous variation?
# Equation: Wald = Cov(Y, Z) / Cov(D, Z)
# Logic: Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
# Estimate: LATE — Local Average Treatment Effect for "compliers"
# (students kept in school solely due to compulsory schooling laws).
# Bias: None, provided Z is relevant and the exclusion restriction holds.
# ----------------------------------------------------------------
iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
print(f"3. IV (QOB): {iv.coef()['s']:.4f}")
# ================================================================
# METHOD 4: REGRESSION DISCONTINUITY (RD)
# ================================================================
# Question: Does the diploma credential itself boost earnings?
# Equation: Y = α + ρ·D + β_1·(Score) + β_2·(D × Score) + ε
# Logic: Compare students just above vs. just below the passing cutoff.
# Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
# Bias: None, assuming continuity of potential outcomes at the cutoff
# (students cannot precisely manipulate their scores).
# ----------------------------------------------------------------
sheep = pd.read_csv(DATA + "ch6/sheepskin_clean.csv")
# Center the running variable if not already centered, and apply a bandwidth (e.g., +/- 30)
bandwidth_data = sheep[abs(sheep["minscore"]) <= 30]
# Use robust local linear regression allowing varying slopes on either side of cutoff
rd = pf.feols("avgearnings ~ pass_exam * minscore",
data=bandwidth_data, weights="person_years", vcov="hetero")
print(f"4. RD: ${rd.coef()['pass_exam']:.0f} (Credential effect)")
# ================================================================
# METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
# ================================================================
# Question: Do compulsory schooling reforms (T) raise causal earnings?
# Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
# Logic: Compare changes over time in reform vs. non-reform states, using
# the reform as an instrument for actual years of schooling.
# Estimate: LATE of schooling on earnings driven by the policy change.
# Bias: Biased if parallel trends assumption fails (states would have
# had different trajectories absent the reform).
# ----------------------------------------------------------------
did = pd.read_csv(DATA + "ch6/synthetic_did.csv")
did["treat_post"] = did["treated"] * did["post"]
# First stage: Effect of policy on schooling
dd_s = pf.feols("avg_schooling ~ treat_post | state + year",
data=did, vcov={"CRV1": "state"})
# Reduced form: Effect of policy on earnings
dd_e = pf.feols("avg_earnings ~ treat_post | state + year",
data=did, vcov={"CRV1": "state"})
wald_did = dd_e.coef()['treat_post'] / dd_s.coef()['treat_post']
print(f"5. Wald-DiD: {wald_did:.4f}")
library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# ================================================================
# METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
# ================================================================
# Question: Does education cause higher earnings?
# Equation: ln(W) = a + rho*S + e (First Stage: S = g0 + g1*Z + u)
# Logic: Z (scholarship offer) is randomly assigned. We use it to
# isolate exogenous variation in S (actual schooling).
# Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
# Bias: None, assuming the offer only affects earnings through schooling
# ----------------------------------------------------------------
rct <- read.csv(paste0(DATA, "ch6/synthetic_rct.csv"))
means <- aggregate(cbind(schooling, earnings) ~ scholarship, data = rct, FUN = mean)
wald_rct <- (means$earnings[2] - means$earnings[1]) /
(means$schooling[2] - means$schooling[1])
cat(sprintf("1. RCT (Wald): %.4f\n", wald_rct))
# ================================================================
# METHOD 2: REGRESSION (OLS)
# ================================================================
# Question: What is the raw return to each year of schooling?
# Equation: ln(W) = a + rho*S + e
# Logic: Control for observables, hoping all confounders are captured.
# Estimate: Conditional average association (not necessarily causal).
# Bias: Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
# individuals may get more schooling AND earn more regardless.
# ----------------------------------------------------------------
qob <- read.csv(paste0(DATA, "ch6/qob_clean.csv"))
ols <- feols(lnw ~ s, data = qob, vcov = "hetero")
cat(sprintf("2. OLS: %.4f\n", coef(ols)["s"]))
# ================================================================
# METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
# ================================================================
# Question: What is the causal return to schooling, using exogenous variation?
# Equation: Wald = Cov(Y, Z) / Cov(D, Z)
# Logic: Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
# Estimate: LATE -- Local Average Treatment Effect for "compliers"
# (students kept in school solely due to compulsory schooling laws).
# Bias: None, provided Z is relevant and the exclusion restriction holds.
# ----------------------------------------------------------------
iv <- feols(lnw ~ 1 | s ~ q4, data = qob, vcov = "hetero")
cat(sprintf("3. IV (QOB): %.4f\n", coef(iv)["fit_s"]))
# ================================================================
# METHOD 4: REGRESSION DISCONTINUITY (RD)
# ================================================================
# Question: Does the diploma credential itself boost earnings?
# Equation: Y = a + rho*D + b1*(Score) + b2*(D x Score) + e
# Logic: Compare students just above vs. just below the passing cutoff.
# Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
# Bias: None, assuming continuity of potential outcomes at the cutoff
# (students cannot precisely manipulate their scores).
# ----------------------------------------------------------------
sheep <- read.csv(paste0(DATA, "ch6/sheepskin_clean.csv"))
bw_data <- sheep[abs(sheep$minscore) <= 30, ]
rd <- feols(avgearnings ~ pass_exam * minscore, data = bw_data,
weights = ~person_years, vcov = "hetero")
cat(sprintf("4. RD: $%.0f (Credential effect)\n", coef(rd)["pass_exam"]))
# ================================================================
# METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
# ================================================================
# Question: Do compulsory schooling reforms (T) raise causal earnings?
# Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
# Logic: Compare changes over time in reform vs. non-reform states, using
# the reform as an instrument for actual years of schooling.
# Estimate: LATE of schooling on earnings driven by the policy change.
# Bias: Biased if parallel trends assumption fails (states would have
# had different trajectories absent the reform).
# ----------------------------------------------------------------
did <- read.csv(paste0(DATA, "ch6/synthetic_did.csv"))
did$treat_post <- did$treated * did$post
# First stage: Effect of policy on schooling
dd_s <- feols(avg_schooling ~ treat_post | state + year, data = did, vcov = ~state)
# Reduced form: Effect of policy on earnings
dd_e <- feols(avg_earnings ~ treat_post | state + year, data = did, vcov = ~state)
wald_did <- coef(dd_e)["treat_post"] / coef(dd_s)["treat_post"]
cat(sprintf("5. Wald-DiD: %.4f\n", wald_did))
* =================================================================
* METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
* =================================================================
* Question: Does education cause higher earnings?
* Equation: ln(W) = a + rho*S + e (First Stage: S = g0 + g1*Z + u)
* Logic: Z (scholarship offer) is randomly assigned. We use it to
* isolate exogenous variation in S (actual schooling).
* Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
* Bias: None, assuming the offer only affects earnings through schooling
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_rct.csv", clear
* Manual Wald calculation for pedagogical clarity
quietly reg earnings scholarship, robust
scalar rf = _b[scholarship]
quietly reg schooling scholarship, robust
scalar fs = _b[scholarship]
display "1. RCT (Wald): " round(rf / fs, 0.0001)
* Note: In practice, we estimate this directly via 2SLS to get correct standard errors:
* ivregress 2sls earnings (schooling = scholarship), robust
* =================================================================
* METHOD 2: REGRESSION (OLS)
* =================================================================
* Question: What is the raw return to each year of schooling?
* Equation: ln(W) = a + rho*S + e
* Logic: Control for observables, hoping all confounders are captured.
* Estimate: Conditional average association (not necessarily causal).
* Bias: Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
* individuals may get more schooling AND earn more regardless.
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
reg lnw s, robust
* =================================================================
* METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
* =================================================================
* Question: What is the causal return to schooling, using exogenous variation?
* Equation: Wald = Cov(Y, Z) / Cov(D, Z)
* Logic: Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
* Estimate: LATE -- Local Average Treatment Effect for "compliers"
* (students kept in school solely due to compulsory schooling laws).
* Bias: None, provided Z is relevant and the exclusion restriction holds.
* -----------------------------------------------------------------
ivregress 2sls lnw (s = q4), robust
* =================================================================
* METHOD 4: REGRESSION DISCONTINUITY (RD)
* =================================================================
* Question: Does the diploma credential itself boost earnings?
* Equation: Y = a + rho*D + b1*(Score) + b2*(D x Score) + e
* Logic: Compare students just above vs. just below the passing cutoff.
* Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
* Bias: None, assuming continuity of potential outcomes at the cutoff
* (students cannot precisely manipulate their scores).
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/sheepskin_clean.csv", clear
* Interact treatment with the running variable to allow different slopes
gen pass_X_score = pass_exam * minscore
* Local linear regression using a rectangular kernel (bandwidth of +/- 30)
reg avgearnings pass_exam minscore pass_X_score ///
if abs(minscore) <= 30 [aw = person_years], robust
* =================================================================
* METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
* =================================================================
* Question: Do compulsory schooling reforms (T) raise causal earnings?
* Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
* Logic: Compare changes over time in reform vs. non-reform states, using
* the reform as an instrument for actual years of schooling.
* Estimate: LATE of schooling on earnings driven by the policy change.
* Bias: Biased if parallel trends assumption fails (states would have
* had different trajectories absent the reform).
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_did.csv", clear
gen treat_post = treated * post
* First stage: Effect of policy on schooling
quietly reg avg_schooling treat_post i.state i.year, cluster(state)
scalar dd_fs = _b[treat_post]
* Reduced form: Effect of policy on earnings
quietly reg avg_earnings treat_post i.state i.year, cluster(state)
display "5. Wald-DD: " round(_b[treat_post] / dd_fs, 0.0001)
🎲
01 Randomized Trials
Selection bias · Potential outcomes · RAND HIE
import pandas as pd
import pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Compare health by insurance status (observational) ---
nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv")
print(nhis.groupby("insurance")["health"].mean().round(2))
result = pf.feols("health ~ insurance", data=nhis, vcov="hetero")
result.summary()
# --- Step 2: Balance check (RAND HIE — did randomization work?) ---
rand = pd.read_csv(DATA + "ch1/rand_balance.csv")
d = rand[["age","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("age ~ plan_free+plan_deductible+plan_coinsurance",
data=d, vcov={"CRV1": "family_id"})
result.summary()
# --- Step 3: Causal effect on spending (free insurance → more spending) ---
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
d = hie[["total_expenses","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("total_expenses ~ plan_free+plan_deductible+plan_coinsurance",
data=d, vcov={"CRV1": "family_id"})
result.summary()
# --- Step 4: Causal effect on health (the RAND paradox: no effect!) ---
health = pd.read_csv(DATA + "ch1/rand_health_outcomes.csv")
d = health[["health_index","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("health_index ~ plan_free+plan_deductible+plan_coinsurance",
data=d, vcov={"CRV1": "family_id"})
result.summary()
library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Compare health by insurance status (observational) ---
nhis <- read.csv(paste0(DATA, "ch1/nhis_clean.csv"))
print(aggregate(health ~ insurance, data = nhis, FUN = mean))
result <- feols(health ~ insurance, data = nhis, vcov = "hetero")
summary(result)
# --- Step 2: Balance check (RAND HIE -- did randomization work?) ---
rand <- read.csv(paste0(DATA, "ch1/rand_balance.csv"))
d <- na.omit(rand[, c("age","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(age ~ plan_free + plan_deductible + plan_coinsurance,
data = d, vcov = ~family_id)
summary(result)
# --- Step 3: Causal effect on spending (free insurance -> more spending) ---
hie <- read.csv(paste0(DATA, "ch1/rand_utilization.csv"))
d <- na.omit(hie[, c("total_expenses","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(total_expenses ~ plan_free + plan_deductible + plan_coinsurance,
data = d, vcov = ~family_id)
summary(result)
# --- Step 4: Causal effect on health (the RAND paradox: no effect!) ---
health <- read.csv(paste0(DATA, "ch1/rand_health_outcomes.csv"))
d <- na.omit(health[, c("health_index","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(health_index ~ plan_free + plan_deductible + plan_coinsurance,
data = d, vcov = ~family_id)
summary(result)
clear all
set more off
* --- Step 1: Compare health by insurance status ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/nhis_clean.csv", clear
tabstat health, by(insurance)
reg health insurance, robust
* --- Step 2: Balance check (RAND HIE) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_balance.csv", clear
reg age plan_free plan_deductible plan_coinsurance, cluster(family_id)
* --- Step 3: Causal effect on spending ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
reg total_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)
* --- Step 4: Causal effect on health (no effect!) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_health_outcomes.csv", clear
reg health_index plan_free plan_deductible plan_coinsurance, cluster(family_id)
📈
02 Regression
OLS · Omitted variable bias · Bad controls
import numpy as np, pandas as pd, pyfixest as pf
# --- Step 1: Simulate data where we KNOW the true causal effect ---
np.random.seed(42)
n = 1000
ability = np.random.normal(50, 10, n)
prob_private = 1 / (1 + np.exp(-(ability - 50) / 5))
private = np.random.binomial(1, prob_private)
true_effect = 5000
earnings = 30000 + true_effect*private + 800*ability + np.random.normal(0, 5000, n)
students = pd.DataFrame({"earnings": earnings, "private": private, "ability": ability})
# --- Step 2: Short regression (omitting ability → biased) ---
short = pf.feols("earnings ~ private", data=students)
print(f"Short (biased): ${short.coef()['private']:,.0f} (true = $5,000)")
# --- Step 3: Long regression (including ability → unbiased) ---
long = pf.feols("earnings ~ private + ability", data=students)
print(f"Long (unbiased): ${long.coef()['private']:,.0f}")
# --- Step 4: Verify the OVB formula ---
aux = pf.feols("ability ~ private", data=students)
pi_1 = aux.coef()["private"]
gamma = long.coef()["ability"]
ovb = pi_1 * gamma
print(f"OVB = pi1 x gamma = {pi_1:.2f} x {gamma:.0f} = ${ovb:,.0f}")
library(fixest)
# --- Step 1: Simulate data where we KNOW the true causal effect ---
set.seed(42)
n <- 1000
ability <- rnorm(n, 50, 10)
prob_private <- 1 / (1 + exp(-(ability - 50) / 5))
private <- rbinom(n, 1, prob_private)
true_effect <- 5000
earnings <- 30000 + true_effect * private + 800 * ability + rnorm(n, 0, 5000)
students <- data.frame(earnings = earnings, private = private, ability = ability)
# --- Step 2: Short regression (omitting ability -> biased) ---
short <- feols(earnings ~ private, data = students)
cat(sprintf("Short (biased): $%s (true = $5,000)\n",
formatC(coef(short)["private"], format = "f", digits = 0, big.mark = ",")))
# --- Step 3: Long regression (including ability -> unbiased) ---
long <- feols(earnings ~ private + ability, data = students)
cat(sprintf("Long (unbiased): $%s\n",
formatC(coef(long)["private"], format = "f", digits = 0, big.mark = ",")))
# --- Step 4: Verify the OVB formula ---
aux <- feols(ability ~ private, data = students)
pi_1 <- coef(aux)["private"]
gamma <- coef(long)["ability"]
ovb <- pi_1 * gamma
cat(sprintf("OVB = pi1 x gamma = %.2f x %.0f = $%s\n",
pi_1, gamma, formatC(ovb, format = "f", digits = 0, big.mark = ",")))
clear all
set more off
set seed 42
set obs 1000
* --- Step 1: Simulate data ---
gen ability = rnormal(50, 10)
gen prob_private = 1 / (1 + exp(-(ability - 50) / 5))
gen private = rbinomial(1, prob_private)
gen earnings = 30000 + 5000*private + 800*ability + rnormal(0, 5000)
* --- Step 2: Short regression (biased) ---
reg earnings private
* --- Step 3: Long regression (unbiased) ---
reg earnings private ability
* --- Step 4: OVB formula ---
scalar long_b = _b[private]
scalar gamma = _b[ability]
quietly reg earnings private
scalar ovb_direct = _b[private] - long_b
quietly reg ability private
display "OVB = " _b[private] " x " gamma " = " _b[private]*gamma
🔧
03 Instrumental Variables
LATE · Compliers · Minneapolis DV Experiment
import pandas as pd, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Load Minneapolis Domestic Violence Experiment ---
mdve = pd.read_csv(DATA + "ch3/mdve_clean.csv")
# --- Step 2: Check compliance (assigned vs. delivered) ---
print(pd.crosstab(mdve["assigned"], mdve["delivered"], margins=True))
# --- Step 3: Create binary treatment variables ---
mdve["z_coddle"] = (mdve["assigned"] != "Arrest").astype(int) # instrument
mdve["d_coddle"] = (mdve["delivered"] != "Arrest").astype(int) # treatment
# --- Step 4: First stage ---
fs = pf.feols("d_coddle ~ z_coddle", data=mdve, vcov="hetero")
print(f"First stage: {fs.coef()['z_coddle']:.3f}")
# --- Step 5: Reduced form (from published data) ---
reduced_form = 0.211 - 0.097
print(f"Reduced form: {reduced_form:.3f}")
# --- Step 6: LATE = reduced form / first stage ---
late = reduced_form / fs.coef()["z_coddle"]
print(f"LATE = {reduced_form:.3f} / {fs.coef()['z_coddle']:.3f} = {late:.3f}")
print("Coddling increases recidivism by ~15 pp among compliers")
clear all
set more off
* --- Step 1: Load MDVE data ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch3/mdve_clean.csv", clear
* --- Step 2: Check compliance ---
tabulate assigned delivered
* --- Step 3: Create binary variables ---
gen z_coddle = (assigned != "Arrest")
gen d_coddle = (delivered != "Arrest")
* --- Step 4: First stage ---
reg d_coddle z_coddle, robust
scalar first_stage = _b[z_coddle]
* --- Step 5: Reduced form (published data) ---
scalar reduced_form = 0.211 - 0.097
* --- Step 6: LATE ---
display "LATE = " reduced_form " / " first_stage " = " reduced_form/first_stage
📉
04 Regression Discontinuity
Sharp RD · Bandwidth · MLDA & mortality
import pandas as pd, matplotlib.pyplot as plt, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Load MLDA mortality data ---
mlda = pd.read_csv(DATA + "ch4/mlda_clean.csv")
# --- Step 2: Plot the discontinuity ---
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(mlda["agecell"], mlda["all"], color="gray", alpha=0.7, s=40)
ax.axvline(x=21, color="red", linestyle="--", label="MLDA cutoff")
ax.set_xlabel("Age"); ax.set_ylabel("Deaths per 100,000")
ax.legend(); plt.show()
# --- Step 3: Sharp RD (linear) ---
result = pf.feols("all ~ over21 + age", data=mlda, vcov="hetero")
print(f"Linear RD jump: {result.coef()['over21']:.2f} deaths per 100k")
# --- Step 4: Quadratic RD (robustness) ---
result = pf.feols("all ~ over21+age+age2+over_age+over_age2", data=mlda, vcov="hetero")
print(f"Quadratic RD jump: {result.coef()['over21']:.2f}")
# --- Step 5: Placebo (internal causes should NOT jump) ---
result = pf.feols("internal ~ over21 + age", data=mlda, vcov="hetero")
print(f"Placebo (internal): {result.coef()['over21']:.2f} (expect ~0)")
library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Load MLDA mortality data ---
mlda <- read.csv(paste0(DATA, "ch4/mlda_clean.csv"))
# --- Step 2: Plot the discontinuity ---
plot(mlda$agecell, mlda$all, pch = 16, col = "gray",
xlab = "Age", ylab = "Deaths per 100,000", main = "MLDA RD")
abline(v = 21, col = "red", lty = 2)
legend("topleft", legend = "MLDA cutoff", col = "red", lty = 2)
# --- Step 3: Sharp RD (linear) ---
result <- feols(all ~ over21 + age, data = mlda, vcov = "hetero")
cat(sprintf("Linear RD jump: %.2f deaths per 100k\n", coef(result)["over21"]))
# --- Step 4: Quadratic RD (robustness) ---
result <- feols(all ~ over21 + age + age2 + over_age + over_age2,
data = mlda, vcov = "hetero")
cat(sprintf("Quadratic RD jump: %.2f\n", coef(result)["over21"]))
# --- Step 5: Placebo (internal causes should NOT jump) ---
result <- feols(internal ~ over21 + age, data = mlda, vcov = "hetero")
cat(sprintf("Placebo (internal): %.2f (expect ~0)\n", coef(result)["over21"]))
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch4/mlda_clean.csv", clear
* --- Sharp RD (linear) ---
reg all over21 age, robust
* --- Quadratic RD (robustness) ---
reg all over21 age age2 over_age over_age2, robust
* --- Placebo (internal causes should NOT jump) ---
reg internal over21 age, robust