Code Cheatsheet

★ The Furious Five: All Methods at a Glance

Five methods, one question: Does education cause higher earnings? Chapter 6 applies all five to the same problem — and they converge on a true return of about 7–10% per year.

Method	Key Assumption	Estimates	Main Threat
1. RCT Random assignment breaks the link between treatment and confounders. A simple comparison of means is causal. `group_means = df.groupby("D")["Y"].mean()`	Random assignment	ATE	Non-compliance
2. Regression Controls for observable differences, but any unobserved confounder biases the estimate. Useful as a baseline, never as the final word. `pf.feols("Y ~ D + X", data=df, vcov="hetero")`	All confounders observed	Conditional avg	Omitted variable bias
3. IV / 2SLS Uses an external source of exogenous variation (instrument) that shifts treatment but is unrelated to the outcome. Recovers the effect for compliers. `pf.feols("Y ~ 1 \| D ~ Z", data=df, vcov="hetero")`	Valid instrument	LATE (compliers)	Weak/invalid instrument
4. RD Compares observations just above vs. just below a cutoff. Gives a credible causal estimate at the cutoff, but may not generalize far from the threshold. `pf.feols("Y ~ D * score", data=df, vcov="hetero")`	No manipulation of running var	Local effect at cutoff	Limited external validity
5. DD Compares changes over time between treated and control groups. Identifies the causal effect if both groups would have followed the same trend absent treatment. `pf.feols("Y ~ treat_post \| group + time", data=df)`	Parallel trends	ATT	Pre-trend violation

import pandas as pd
import numpy as np
import pyfixest as pf

DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# ================================================================
# METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
# ================================================================
# Question: Does education cause higher earnings?
# Equation: ln(W) = α + ρ·S + ε  (First Stage: S = γ_0 + γ_1·Z + u)
# Logic:    Z (scholarship offer) is randomly assigned. We use it to
#           isolate exogenous variation in S (actual schooling).
# Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
# Bias:     None, assuming the offer only affects earnings through schooling
# ----------------------------------------------------------------
rct = pd.read_csv(DATA + "ch6/synthetic_rct.csv")
means = rct.groupby("scholarship")[["schooling", "earnings"]].mean()
wald_rct = (means.loc[1, "earnings"] - means.loc[0, "earnings"]) / \
           (means.loc[1, "schooling"] - means.loc[0, "schooling"])
print(f"1. RCT (Wald):  {wald_rct:.4f}")

# ================================================================
# METHOD 2: REGRESSION (OLS)
# ================================================================
# Question: What is the raw return to each year of schooling?
# Equation: ln(W) = α + ρ·S + ε
# Logic:    Control for observables, hoping all confounders are captured.
# Estimate: Conditional average association (not necessarily causal).
# Bias:     Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
#           individuals may get more schooling AND earn more regardless.
# ----------------------------------------------------------------
qob = pd.read_csv(DATA + "ch6/qob_clean.csv")
ols = pf.feols("lnw ~ s", data=qob, vcov="hetero")
print(f"2. OLS:         {ols.coef()['s']:.4f}")

# ================================================================
# METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
# ================================================================
# Question: What is the causal return to schooling, using exogenous variation?
# Equation: Wald = Cov(Y, Z) / Cov(D, Z)
# Logic:    Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
# Estimate: LATE — Local Average Treatment Effect for "compliers"
#           (students kept in school solely due to compulsory schooling laws).
# Bias:     None, provided Z is relevant and the exclusion restriction holds.
# ----------------------------------------------------------------
iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
print(f"3. IV (QOB):    {iv.coef()['s']:.4f}")

# ================================================================
# METHOD 4: REGRESSION DISCONTINUITY (RD)
# ================================================================
# Question: Does the diploma credential itself boost earnings?
# Equation: Y = α + ρ·D + β_1·(Score) + β_2·(D × Score) + ε
# Logic:    Compare students just above vs. just below the passing cutoff.
# Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
# Bias:     None, assuming continuity of potential outcomes at the cutoff
#           (students cannot precisely manipulate their scores).
# ----------------------------------------------------------------
sheep = pd.read_csv(DATA + "ch6/sheepskin_clean.csv")
# Center the running variable if not already centered, and apply a bandwidth (e.g., +/- 30)
bandwidth_data = sheep[abs(sheep["minscore"]) <= 30]
# Use robust local linear regression allowing varying slopes on either side of cutoff
rd = pf.feols("avgearnings ~ pass_exam * minscore",
              data=bandwidth_data, weights="person_years", vcov="hetero")
print(f"4. RD:          ${rd.coef()['pass_exam']:.0f} (Credential effect)")

# ================================================================
# METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
# ================================================================
# Question: Do compulsory schooling reforms (T) raise causal earnings?
# Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
# Logic:    Compare changes over time in reform vs. non-reform states, using
#           the reform as an instrument for actual years of schooling.
# Estimate: LATE of schooling on earnings driven by the policy change.
# Bias:     Biased if parallel trends assumption fails (states would have
#           had different trajectories absent the reform).
# ----------------------------------------------------------------
did = pd.read_csv(DATA + "ch6/synthetic_did.csv")
did["treat_post"] = did["treated"] * did["post"]

# First stage: Effect of policy on schooling
dd_s = pf.feols("avg_schooling ~ treat_post | state + year",
                data=did, vcov={"CRV1": "state"})

# Reduced form: Effect of policy on earnings
dd_e = pf.feols("avg_earnings ~ treat_post | state + year",
                data=did, vcov={"CRV1": "state"})

wald_did = dd_e.coef()['treat_post'] / dd_s.coef()['treat_post']
print(f"5. Wald-DiD:    {wald_did:.4f}")

library(fixest)

DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# ================================================================
# METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
# ================================================================
# Question: Does education cause higher earnings?
# Equation: ln(W) = a + rho*S + e  (First Stage: S = g0 + g1*Z + u)
# Logic:    Z (scholarship offer) is randomly assigned. We use it to
#           isolate exogenous variation in S (actual schooling).
# Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
# Bias:     None, assuming the offer only affects earnings through schooling
# ----------------------------------------------------------------
rct <- read.csv(paste0(DATA, "ch6/synthetic_rct.csv"))
means <- aggregate(cbind(schooling, earnings) ~ scholarship, data = rct, FUN = mean)
wald_rct <- (means$earnings[2] - means$earnings[1]) /
            (means$schooling[2] - means$schooling[1])
cat(sprintf("1. RCT (Wald):  %.4f\n", wald_rct))

# ================================================================
# METHOD 2: REGRESSION (OLS)
# ================================================================
# Question: What is the raw return to each year of schooling?
# Equation: ln(W) = a + rho*S + e
# Logic:    Control for observables, hoping all confounders are captured.
# Estimate: Conditional average association (not necessarily causal).
# Bias:     Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
#           individuals may get more schooling AND earn more regardless.
# ----------------------------------------------------------------
qob <- read.csv(paste0(DATA, "ch6/qob_clean.csv"))
ols <- feols(lnw ~ s, data = qob, vcov = "hetero")
cat(sprintf("2. OLS:         %.4f\n", coef(ols)["s"]))

# ================================================================
# METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
# ================================================================
# Question: What is the causal return to schooling, using exogenous variation?
# Equation: Wald = Cov(Y, Z) / Cov(D, Z)
# Logic:    Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
# Estimate: LATE -- Local Average Treatment Effect for "compliers"
#           (students kept in school solely due to compulsory schooling laws).
# Bias:     None, provided Z is relevant and the exclusion restriction holds.
# ----------------------------------------------------------------
iv <- feols(lnw ~ 1 | s ~ q4, data = qob, vcov = "hetero")
cat(sprintf("3. IV (QOB):    %.4f\n", coef(iv)["fit_s"]))

# ================================================================
# METHOD 4: REGRESSION DISCONTINUITY (RD)
# ================================================================
# Question: Does the diploma credential itself boost earnings?
# Equation: Y = a + rho*D + b1*(Score) + b2*(D x Score) + e
# Logic:    Compare students just above vs. just below the passing cutoff.
# Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
# Bias:     None, assuming continuity of potential outcomes at the cutoff
#           (students cannot precisely manipulate their scores).
# ----------------------------------------------------------------
sheep <- read.csv(paste0(DATA, "ch6/sheepskin_clean.csv"))
bw_data <- sheep[abs(sheep$minscore) <= 30, ]
rd <- feols(avgearnings ~ pass_exam * minscore, data = bw_data,
            weights = ~person_years, vcov = "hetero")
cat(sprintf("4. RD:          $%.0f (Credential effect)\n", coef(rd)["pass_exam"]))

# ================================================================
# METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
# ================================================================
# Question: Do compulsory schooling reforms (T) raise causal earnings?
# Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
# Logic:    Compare changes over time in reform vs. non-reform states, using
#           the reform as an instrument for actual years of schooling.
# Estimate: LATE of schooling on earnings driven by the policy change.
# Bias:     Biased if parallel trends assumption fails (states would have
#           had different trajectories absent the reform).
# ----------------------------------------------------------------
did <- read.csv(paste0(DATA, "ch6/synthetic_did.csv"))
did$treat_post <- did$treated * did$post

# First stage: Effect of policy on schooling
dd_s <- feols(avg_schooling ~ treat_post | state + year, data = did, vcov = ~state)

# Reduced form: Effect of policy on earnings
dd_e <- feols(avg_earnings ~ treat_post | state + year, data = did, vcov = ~state)

wald_did <- coef(dd_e)["treat_post"] / coef(dd_s)["treat_post"]
cat(sprintf("5. Wald-DiD:    %.4f\n", wald_did))

* =================================================================
* METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
* =================================================================
* Question: Does education cause higher earnings?
* Equation: ln(W) = a + rho*S + e  (First Stage: S = g0 + g1*Z + u)
* Logic:    Z (scholarship offer) is randomly assigned. We use it to
* isolate exogenous variation in S (actual schooling).
* Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
* Bias:     None, assuming the offer only affects earnings through schooling
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_rct.csv", clear

* Manual Wald calculation for pedagogical clarity
quietly reg earnings scholarship, robust
scalar rf = _b[scholarship]
quietly reg schooling scholarship, robust
scalar fs = _b[scholarship]
display "1. RCT (Wald): " round(rf / fs, 0.0001)

* Note: In practice, we estimate this directly via 2SLS to get correct standard errors:
* ivregress 2sls earnings (schooling = scholarship), robust

* =================================================================
* METHOD 2: REGRESSION (OLS)
* =================================================================
* Question: What is the raw return to each year of schooling?
* Equation: ln(W) = a + rho*S + e
* Logic:    Control for observables, hoping all confounders are captured.
* Estimate: Conditional average association (not necessarily causal).
* Bias:     Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
* individuals may get more schooling AND earn more regardless.
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear

reg lnw s, robust

* =================================================================
* METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
* =================================================================
* Question: What is the causal return to schooling, using exogenous variation?
* Equation: Wald = Cov(Y, Z) / Cov(D, Z)
* Logic:    Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
* Estimate: LATE -- Local Average Treatment Effect for "compliers"
* (students kept in school solely due to compulsory schooling laws).
* Bias:     None, provided Z is relevant and the exclusion restriction holds.
* -----------------------------------------------------------------
ivregress 2sls lnw (s = q4), robust

* =================================================================
* METHOD 4: REGRESSION DISCONTINUITY (RD)
* =================================================================
* Question: Does the diploma credential itself boost earnings?
* Equation: Y = a + rho*D + b1*(Score) + b2*(D x Score) + e
* Logic:    Compare students just above vs. just below the passing cutoff.
* Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
* Bias:     None, assuming continuity of potential outcomes at the cutoff
* (students cannot precisely manipulate their scores).
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/sheepskin_clean.csv", clear

* Interact treatment with the running variable to allow different slopes
gen pass_X_score = pass_exam * minscore

* Local linear regression using a rectangular kernel (bandwidth of +/- 30)
reg avgearnings pass_exam minscore pass_X_score ///
    if abs(minscore) <= 30 [aw = person_years], robust

* =================================================================
* METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
* =================================================================
* Question: Do compulsory schooling reforms (T) raise causal earnings?
* Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
* Logic:    Compare changes over time in reform vs. non-reform states, using
* the reform as an instrument for actual years of schooling.
* Estimate: LATE of schooling on earnings driven by the policy change.
* Bias:     Biased if parallel trends assumption fails (states would have
* had different trajectories absent the reform).
* -----------------------------------------------------------------
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_did.csv", clear

gen treat_post = treated * post

* First stage: Effect of policy on schooling
quietly reg avg_schooling treat_post i.state i.year, cluster(state)
scalar dd_fs = _b[treat_post]

* Reduced form: Effect of policy on earnings
quietly reg avg_earnings treat_post i.state i.year, cluster(state)

display "5. Wald-DD: " round(_b[treat_post] / dd_fs, 0.0001)

🎲

01 Randomized Trials

Selection bias · Potential outcomes · RAND HIE

import pandas as pd
import pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Compare health by insurance status (observational) ---
nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv")
print(nhis.groupby("insurance")["health"].mean().round(2))
result = pf.feols("health ~ insurance", data=nhis, vcov="hetero")
result.summary()

# --- Step 2: Balance check (RAND HIE — did randomization work?) ---
rand = pd.read_csv(DATA + "ch1/rand_balance.csv")
d = rand[["age","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("age ~ plan_free+plan_deductible+plan_coinsurance",
                  data=d, vcov={"CRV1": "family_id"})
result.summary()

# --- Step 3: Causal effect on spending (free insurance → more spending) ---
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
d = hie[["total_expenses","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("total_expenses ~ plan_free+plan_deductible+plan_coinsurance",
                  data=d, vcov={"CRV1": "family_id"})
result.summary()

# --- Step 4: Causal effect on health (the RAND paradox: no effect!) ---
health = pd.read_csv(DATA + "ch1/rand_health_outcomes.csv")
d = health[["health_index","plan_free","plan_deductible","plan_coinsurance","family_id"]].dropna()
result = pf.feols("health_index ~ plan_free+plan_deductible+plan_coinsurance",
                  data=d, vcov={"CRV1": "family_id"})
result.summary()

library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Compare health by insurance status (observational) ---
nhis <- read.csv(paste0(DATA, "ch1/nhis_clean.csv"))
print(aggregate(health ~ insurance, data = nhis, FUN = mean))
result <- feols(health ~ insurance, data = nhis, vcov = "hetero")
summary(result)

# --- Step 2: Balance check (RAND HIE -- did randomization work?) ---
rand <- read.csv(paste0(DATA, "ch1/rand_balance.csv"))
d <- na.omit(rand[, c("age","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(age ~ plan_free + plan_deductible + plan_coinsurance,
                data = d, vcov = ~family_id)
summary(result)

# --- Step 3: Causal effect on spending (free insurance -> more spending) ---
hie <- read.csv(paste0(DATA, "ch1/rand_utilization.csv"))
d <- na.omit(hie[, c("total_expenses","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(total_expenses ~ plan_free + plan_deductible + plan_coinsurance,
                data = d, vcov = ~family_id)
summary(result)

# --- Step 4: Causal effect on health (the RAND paradox: no effect!) ---
health <- read.csv(paste0(DATA, "ch1/rand_health_outcomes.csv"))
d <- na.omit(health[, c("health_index","plan_free","plan_deductible","plan_coinsurance","family_id")])
result <- feols(health_index ~ plan_free + plan_deductible + plan_coinsurance,
                data = d, vcov = ~family_id)
summary(result)

clear all
set more off

* --- Step 1: Compare health by insurance status ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/nhis_clean.csv", clear
tabstat health, by(insurance)
reg health insurance, robust

* --- Step 2: Balance check (RAND HIE) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_balance.csv", clear
reg age plan_free plan_deductible plan_coinsurance, cluster(family_id)

* --- Step 3: Causal effect on spending ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
reg total_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)

* --- Step 4: Causal effect on health (no effect!) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_health_outcomes.csv", clear
reg health_index plan_free plan_deductible plan_coinsurance, cluster(family_id)

📈

02 Regression

OLS · Omitted variable bias · Bad controls

import numpy as np, pandas as pd, pyfixest as pf

# --- Step 1: Simulate data where we KNOW the true causal effect ---
np.random.seed(42)
n = 1000
ability = np.random.normal(50, 10, n)
prob_private = 1 / (1 + np.exp(-(ability - 50) / 5))
private = np.random.binomial(1, prob_private)
true_effect = 5000
earnings = 30000 + true_effect*private + 800*ability + np.random.normal(0, 5000, n)
students = pd.DataFrame({"earnings": earnings, "private": private, "ability": ability})

# --- Step 2: Short regression (omitting ability → biased) ---
short = pf.feols("earnings ~ private", data=students)
print(f"Short (biased):  ${short.coef()['private']:,.0f}  (true = $5,000)")

# --- Step 3: Long regression (including ability → unbiased) ---
long = pf.feols("earnings ~ private + ability", data=students)
print(f"Long (unbiased): ${long.coef()['private']:,.0f}")

# --- Step 4: Verify the OVB formula ---
aux = pf.feols("ability ~ private", data=students)
pi_1 = aux.coef()["private"]
gamma = long.coef()["ability"]
ovb = pi_1 * gamma
print(f"OVB = pi1 x gamma = {pi_1:.2f} x {gamma:.0f} = ${ovb:,.0f}")

library(fixest)

# --- Step 1: Simulate data where we KNOW the true causal effect ---
set.seed(42)
n <- 1000
ability <- rnorm(n, 50, 10)
prob_private <- 1 / (1 + exp(-(ability - 50) / 5))
private <- rbinom(n, 1, prob_private)
true_effect <- 5000
earnings <- 30000 + true_effect * private + 800 * ability + rnorm(n, 0, 5000)
students <- data.frame(earnings = earnings, private = private, ability = ability)

# --- Step 2: Short regression (omitting ability -> biased) ---
short <- feols(earnings ~ private, data = students)
cat(sprintf("Short (biased):  $%s  (true = $5,000)\n",
    formatC(coef(short)["private"], format = "f", digits = 0, big.mark = ",")))

# --- Step 3: Long regression (including ability -> unbiased) ---
long <- feols(earnings ~ private + ability, data = students)
cat(sprintf("Long (unbiased): $%s\n",
    formatC(coef(long)["private"], format = "f", digits = 0, big.mark = ",")))

# --- Step 4: Verify the OVB formula ---
aux <- feols(ability ~ private, data = students)
pi_1 <- coef(aux)["private"]
gamma <- coef(long)["ability"]
ovb <- pi_1 * gamma
cat(sprintf("OVB = pi1 x gamma = %.2f x %.0f = $%s\n",
    pi_1, gamma, formatC(ovb, format = "f", digits = 0, big.mark = ",")))

clear all
set more off
set seed 42
set obs 1000

* --- Step 1: Simulate data ---
gen ability = rnormal(50, 10)
gen prob_private = 1 / (1 + exp(-(ability - 50) / 5))
gen private = rbinomial(1, prob_private)
gen earnings = 30000 + 5000*private + 800*ability + rnormal(0, 5000)

* --- Step 2: Short regression (biased) ---
reg earnings private

* --- Step 3: Long regression (unbiased) ---
reg earnings private ability

* --- Step 4: OVB formula ---
scalar long_b = _b[private]
scalar gamma = _b[ability]
quietly reg earnings private
scalar ovb_direct = _b[private] - long_b
quietly reg ability private
display "OVB = " _b[private] " x " gamma " = " _b[private]*gamma

🔧

03 Instrumental Variables

LATE · Compliers · Minneapolis DV Experiment

import pandas as pd, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Load Minneapolis Domestic Violence Experiment ---
mdve = pd.read_csv(DATA + "ch3/mdve_clean.csv")

# --- Step 2: Check compliance (assigned vs. delivered) ---
print(pd.crosstab(mdve["assigned"], mdve["delivered"], margins=True))

# --- Step 3: Create binary treatment variables ---
mdve["z_coddle"] = (mdve["assigned"] != "Arrest").astype(int)   # instrument
mdve["d_coddle"] = (mdve["delivered"] != "Arrest").astype(int)   # treatment

# --- Step 4: First stage ---
fs = pf.feols("d_coddle ~ z_coddle", data=mdve, vcov="hetero")
print(f"First stage: {fs.coef()['z_coddle']:.3f}")

# --- Step 5: Reduced form (from published data) ---
reduced_form = 0.211 - 0.097
print(f"Reduced form: {reduced_form:.3f}")

# --- Step 6: LATE = reduced form / first stage ---
late = reduced_form / fs.coef()["z_coddle"]
print(f"LATE = {reduced_form:.3f} / {fs.coef()['z_coddle']:.3f} = {late:.3f}")
print("Coddling increases recidivism by ~15 pp among compliers")

library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Load Minneapolis Domestic Violence Experiment ---
mdve <- read.csv(paste0(DATA, "ch3/mdve_clean.csv"))

# --- Step 2: Check compliance (assigned vs. delivered) ---
print(table(mdve$assigned, mdve$delivered))

# --- Step 3: Create binary treatment variables ---
mdve$z_coddle <- as.integer(mdve$assigned != "Arrest")   # instrument
mdve$d_coddle <- as.integer(mdve$delivered != "Arrest")   # treatment

# --- Step 4: First stage ---
fs <- feols(d_coddle ~ z_coddle, data = mdve, vcov = "hetero")
cat(sprintf("First stage: %.3f\n", coef(fs)["z_coddle"]))

# --- Step 5: Reduced form (from published data) ---
reduced_form <- 0.211 - 0.097
cat(sprintf("Reduced form: %.3f\n", reduced_form))

# --- Step 6: LATE = reduced form / first stage ---
late <- reduced_form / coef(fs)["z_coddle"]
cat(sprintf("LATE = %.3f / %.3f = %.3f\n",
    reduced_form, coef(fs)["z_coddle"], late))
cat("Coddling increases recidivism by ~15 pp among compliers\n")

clear all
set more off

* --- Step 1: Load MDVE data ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch3/mdve_clean.csv", clear

* --- Step 2: Check compliance ---
tabulate assigned delivered

* --- Step 3: Create binary variables ---
gen z_coddle = (assigned != "Arrest")
gen d_coddle = (delivered != "Arrest")

* --- Step 4: First stage ---
reg d_coddle z_coddle, robust
scalar first_stage = _b[z_coddle]

* --- Step 5: Reduced form (published data) ---
scalar reduced_form = 0.211 - 0.097

* --- Step 6: LATE ---
display "LATE = " reduced_form " / " first_stage " = " reduced_form/first_stage

📉

04 Regression Discontinuity

Sharp RD · Bandwidth · MLDA & mortality

import pandas as pd, matplotlib.pyplot as plt, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Load MLDA mortality data ---
mlda = pd.read_csv(DATA + "ch4/mlda_clean.csv")

# --- Step 2: Plot the discontinuity ---
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(mlda["agecell"], mlda["all"], color="gray", alpha=0.7, s=40)
ax.axvline(x=21, color="red", linestyle="--", label="MLDA cutoff")
ax.set_xlabel("Age"); ax.set_ylabel("Deaths per 100,000")
ax.legend(); plt.show()

# --- Step 3: Sharp RD (linear) ---
result = pf.feols("all ~ over21 + age", data=mlda, vcov="hetero")
print(f"Linear RD jump: {result.coef()['over21']:.2f} deaths per 100k")

# --- Step 4: Quadratic RD (robustness) ---
result = pf.feols("all ~ over21+age+age2+over_age+over_age2", data=mlda, vcov="hetero")
print(f"Quadratic RD jump: {result.coef()['over21']:.2f}")

# --- Step 5: Placebo (internal causes should NOT jump) ---
result = pf.feols("internal ~ over21 + age", data=mlda, vcov="hetero")
print(f"Placebo (internal): {result.coef()['over21']:.2f} (expect ~0)")

library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Load MLDA mortality data ---
mlda <- read.csv(paste0(DATA, "ch4/mlda_clean.csv"))

# --- Step 2: Plot the discontinuity ---
plot(mlda$agecell, mlda$all, pch = 16, col = "gray",
     xlab = "Age", ylab = "Deaths per 100,000", main = "MLDA RD")
abline(v = 21, col = "red", lty = 2)
legend("topleft", legend = "MLDA cutoff", col = "red", lty = 2)

# --- Step 3: Sharp RD (linear) ---
result <- feols(all ~ over21 + age, data = mlda, vcov = "hetero")
cat(sprintf("Linear RD jump: %.2f deaths per 100k\n", coef(result)["over21"]))

# --- Step 4: Quadratic RD (robustness) ---
result <- feols(all ~ over21 + age + age2 + over_age + over_age2,
                data = mlda, vcov = "hetero")
cat(sprintf("Quadratic RD jump: %.2f\n", coef(result)["over21"]))

# --- Step 5: Placebo (internal causes should NOT jump) ---
result <- feols(internal ~ over21 + age, data = mlda, vcov = "hetero")
cat(sprintf("Placebo (internal): %.2f (expect ~0)\n", coef(result)["over21"]))

clear all
set more off

import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch4/mlda_clean.csv", clear

* --- Sharp RD (linear) ---
reg all over21 age, robust

* --- Quadratic RD (robustness) ---
reg all over21 age age2 over_age over_age2, robust

* --- Placebo (internal causes should NOT jump) ---
reg internal over21 age, robust

📊

05 Differences-in-Differences

Parallel trends · Two-way FE · MLDA panel data

import pandas as pd, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Manual DD (Great Depression banking data) ---
banks = pd.read_csv(DATA + "ch5/banks_clean.csv")
pre_6  = banks.loc[banks["year"]==1930, "bib6"].values[0]
post_6 = banks.loc[banks["year"]==1931, "bib6"].values[0]
pre_8  = banks.loc[banks["year"]==1930, "bib8"].values[0]
post_8 = banks.loc[banks["year"]==1931, "bib8"].values[0]
dd = (post_6 - pre_6) - (post_8 - pre_8)
print(f"Manual DD: {dd} banks saved by Atlanta Fed")

# --- Step 2: Regression DD with fixed effects ---
deaths = pd.read_csv(DATA + "ch5/deaths_clean.csv")
allcause = deaths[deaths["dtype"] == "all"]
result = pf.feols("mrate ~ legal | state + year",
                  data=allcause, vcov={"CRV1": "state"})
print(f"DD (all-cause): {result.coef()['legal']:.2f} (SE: {result.se()['legal']:.2f})")

# --- Step 3: Population-weighted DD ---
result = pf.feols("mrate ~ legal | state + year",
                  data=allcause, weights="pop", vcov={"CRV1": "state"})
print(f"Weighted DD: {result.coef()['legal']:.2f}")

# --- Step 4: Placebo (suicide should NOT respond) ---
suicide = deaths[deaths["dtype"] == "suicide"]
result = pf.feols("mrate ~ legal | state + year",
                  data=suicide, vcov={"CRV1": "state"})
print(f"Placebo (suicide): {result.coef()['legal']:.2f} (expect ~0)")

library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Manual DD (Great Depression banking data) ---
banks <- read.csv(paste0(DATA, "ch5/banks_clean.csv"))
pre_6  <- banks$bib6[banks$year == 1930]
post_6 <- banks$bib6[banks$year == 1931]
pre_8  <- banks$bib8[banks$year == 1930]
post_8 <- banks$bib8[banks$year == 1931]
dd <- (post_6 - pre_6) - (post_8 - pre_8)
cat(sprintf("Manual DD: %d banks saved by Atlanta Fed\n", dd))

# --- Step 2: Regression DD with fixed effects ---
deaths <- read.csv(paste0(DATA, "ch5/deaths_clean.csv"))
allcause <- deaths[deaths$dtype == "all", ]
result <- feols(mrate ~ legal | state + year, data = allcause, vcov = ~state)
cat(sprintf("DD (all-cause): %.2f (SE: %.2f)\n",
    coef(result)["legal"], se(result)["legal"]))

# --- Step 3: Population-weighted DD ---
result <- feols(mrate ~ legal | state + year, data = allcause,
                weights = ~pop, vcov = ~state)
cat(sprintf("Weighted DD: %.2f\n", coef(result)["legal"]))

# --- Step 4: Placebo (suicide should NOT respond) ---
suicide <- deaths[deaths$dtype == "suicide", ]
result <- feols(mrate ~ legal | state + year, data = suicide, vcov = ~state)
cat(sprintf("Placebo (suicide): %.2f (expect ~0)\n", coef(result)["legal"]))

clear all
set more off

* --- Step 1: Manual DD (banking data) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch5/banks_clean.csv", clear
list

* --- Step 2: Regression DD with fixed effects ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch5/deaths_clean.csv", clear
keep if dtype == "all"
reg mrate legal i.state i.year, cluster(state)

* --- Step 3: Population-weighted DD ---
reg mrate legal i.state i.year [aw=pop], cluster(state)

* --- Step 4: Placebo (suicide) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch5/deaths_clean.csv", clear
keep if dtype == "suicide"
reg mrate legal i.state i.year, cluster(state)

🎓

06 The Wages of Schooling

Twins · Quarter of birth · Child labor laws · Sheepskin effects

import pandas as pd, pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Simple bivariate OLS ---
qob = pd.read_csv(DATA + "ch6/qob_clean.csv")
ols = pf.feols("lnw ~ s", data=qob, vcov="hetero")
print(f"Simple OLS: {ols.coef()['s']:.3f} (~7%/year, biased by ability)")

# --- Step 2: OLS with controls (twins data) ---
twins = pd.read_csv(DATA + "ch6/twins_clean.csv")
ols_c = pf.feols("lwage ~ educ+age+age2+female+white", data=twins, vcov="hetero")
print(f"OLS + controls: {ols_c.coef()['educ']:.3f} (~11%)")

# --- Step 3: Twin fixed effects ---
first = twins[twins["first"] == 1]
fe = pf.feols("dlwage ~ deduc - 1", data=first, vcov="hetero")
print(f"Twin FE: {fe.coef()['deduc']:.3f} (~6%, attenuated)")

# --- Step 4: Twin IV (corrects measurement error) ---
iv_twin = pf.feols("lwage ~ age+age2+female+white | educ ~ educt_t",
                   data=twins, vcov="hetero")
print(f"Twin IV: {iv_twin.coef()['educ']:.3f} (~11%)")

# --- Step 5: Quarter-of-birth IV ---
iv_qob = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
print(f"QOB IV:  {iv_qob.coef()['s']:.3f} (~7%)")

# --- Step 6: Child labor law IV ---
cl = pd.read_csv(DATA + "ch6/childlabor_clean.csv")
ols_cl = pf.feols("lnwkwage ~ indEduc | sob + yob + year",
                  data=cl, weights="weight", vcov={"CRV1": "sob"})
print(f"OLS (child labor): {ols_cl.coef()['indEduc']:.4f}")

# --- Step 7: First-stage F-statistic ---
fs = pf.feols("s ~ q4", data=qob)
f_stat = fs.tstat()['q4'] ** 2  # F = t-squared for a single restriction
print(f"First-stage F (QOB): {f_stat:.1f} (should be > 10)")

library(fixest)
DATA <- "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Simple bivariate OLS ---
qob <- read.csv(paste0(DATA, "ch6/qob_clean.csv"))
ols <- feols(lnw ~ s, data = qob, vcov = "hetero")
cat(sprintf("Simple OLS: %.3f (~7%%/year, biased by ability)\n", coef(ols)["s"]))

# --- Step 2: OLS with controls (twins data) ---
twins <- read.csv(paste0(DATA, "ch6/twins_clean.csv"))
ols_c <- feols(lwage ~ educ + age + age2 + female + white,
               data = twins, vcov = "hetero")
cat(sprintf("OLS + controls: %.3f (~11%%)\n", coef(ols_c)["educ"]))

# --- Step 3: Twin fixed effects ---
first <- twins[which(twins$first == 1), ]
fe <- feols(dlwage ~ deduc - 1, data = first, vcov = "hetero")
cat(sprintf("Twin FE: %.3f (~6%%, attenuated)\n", coef(fe)["deduc"]))

# --- Step 4: Twin IV (corrects measurement error) ---
iv_twin <- feols(lwage ~ age + age2 + female + white | educ ~ educt_t,
                 data = twins, vcov = "hetero")
cat(sprintf("Twin IV: %.3f (~11%%)\n", coef(iv_twin)["fit_educ"]))

# --- Step 5: Quarter-of-birth IV ---
iv_qob <- feols(lnw ~ 1 | s ~ q4, data = qob, vcov = "hetero")
cat(sprintf("QOB IV:  %.3f (~7%%)\n", coef(iv_qob)["fit_s"]))

# --- Step 6: Child labor law IV ---
cl <- read.csv(paste0(DATA, "ch6/childlabor_clean.csv"))
ols_cl <- feols(lnwkwage ~ indEduc | sob + yob + year, data = cl,
                weights = ~weight, vcov = ~sob)
cat(sprintf("OLS (child labor): %.4f\n", coef(ols_cl)["indEduc"]))

# --- Step 7: First-stage F-statistic ---
fs <- feols(s ~ q4, data = qob)
cat(sprintf("First-stage F (QOB): %.1f (should be > 10)\n",
    fitstat(fs, "f")$f$stat))

clear all
set more off

* --- Step 1: Simple bivariate OLS ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
reg lnw s, robust

* --- Step 2: OLS with controls (twins) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/twins_clean.csv", clear
reg lwage educ age age2 female white, robust

* --- Step 3: Twin fixed effects ---
reg dlwage deduc if first == 1, noconstant robust

* --- Step 4: Twin IV ---
ivregress 2sls lwage age age2 female white (educ = educt_t), robust

* --- Step 5: Quarter-of-birth IV ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
ivregress 2sls lnw (s = q4), robust

* --- Step 6: Child labor law IV ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/childlabor_clean.csv", clear
reg indeduc cl7 cl8 cl9 i.sob i.yob i.year [aw=weight], cluster(sob)
testparm cl7 cl8 cl9
ivregress 2sls lnwkwage i.sob i.yob i.year (indeduc = cl7 cl8 cl9) [aw=weight], cluster(sob)

⚠ Code without understanding is a dangerous shortcut

The Furious Five

Randomized Trials

Regression

Instrumental Variables

Regression Discontinuity

Differences-in-Differences