graph TD
A["THE QUESTION: Does insurance improve health?"]
B["NAIVE EVIDENCE: Insured are healthier, but is it causal?"]
C["THE PROBLEM: Selection bias contaminates the comparison"]
D["THE SOLUTION: Random assignment eliminates selection bias"]
E["THE EVIDENCE: Two landmark experiments — RAND and Oregon"]
A --> B --> C --> D --> E
style A fill:#3498db,color:#fff
style B fill:#e67e22,color:#fff
style C fill:#c0392b,color:#fff
style D fill:#8e44ad,color:#fff
style E fill:#2d8659,color:#fff
linkStyle default stroke:#64748b,stroke-width:2px
1. Randomized Trials
By the end of this chapter, you will be able to:
- Explain why simple comparisons between treated and untreated groups often fail to reveal causal effects
- Define potential outcomes, selection bias, and average treatment effects
- Describe how random assignment eliminates selection bias
- Use regression on a dummy variable as a tool to compare group means
- Interpret results from two landmark health insurance experiments
- Understand standard errors and statistical significance
This chapter follows a clear arc: we start with a real-world question, discover why naive data comparisons are misleading, learn the theoretical framework that explains the problem, and then see how randomized experiments provide a solution.
Key Concepts and Definitions
Potential Outcomes (\(Y_{1i}\), \(Y_{0i}\)): The two hypothetical outcomes for each individual — one if treated, one if not. The causal effect is the difference between them, but we can only ever observe one.
A patient’s health if she receives a new drug (\(Y_{1i}\)) versus her health if she takes a placebo (\(Y_{0i}\)). We observe one; the other remains forever unknown.
Like choosing between two routes to work. You take Route A and arrive in 20 minutes, but you will never know how long Route B would have taken that same morning.
Causal Effect: The difference between what happens to an individual with treatment and what would have happened without it (\(Y_{1i} - Y_{0i}\)). It answers the question “what did the treatment actually do?”
If a student scores 85 on a test after tutoring but would have scored 75 without it, the causal effect of tutoring is +10 points.
Like measuring how much faster you run with new shoes by comparing your time to what you would have run in your old shoes on the same day — not to someone else’s time.
Fundamental Problem of Causal Inference: We can never observe both potential outcomes for the same individual at the same time, so individual causal effects are inherently unobservable.
We cannot simultaneously see how a city’s economy performs both with and without a new minimum wage law. We must choose one policy and live with it.
Like watching a movie — you cannot experience the same movie for the first time twice to compare your reactions.
Selection Bias: A systematic difference in baseline characteristics between the treated and untreated groups that contaminates the observed comparison, making it impossible to attribute the difference to the treatment alone.
People who voluntarily buy gym memberships are already more health-conscious, so comparing gym members to non-members overstates the health benefits of exercise.
Like comparing test scores of students who choose to attend after-school study hall versus those who skip it. The attendees were probably more motivated to begin with.
Confounder: A variable that influences both the treatment and the outcome, creating a spurious association between them.
Family income affects both whether a child attends private school (treatment) and the child’s test scores (outcome), making it look like private school boosts scores even if it does not.
Like blaming an umbrella for rain. People carry umbrellas on rainy days, but the umbrella did not cause the rain — the weather (the confounder) caused both.
Randomized Controlled Trial (RCT): An experiment in which treatment is assigned randomly (like a coin flip), so that in a large enough sample the treatment and control groups are comparable, on average, on all characteristics, both observed and unobserved.
The Oregon Health Plan lottery randomly selected applicants to receive Medicaid, creating two groups that differed only by insurance status.
Like shuffling a deck of cards and dealing two hands. Neither hand is systematically better — any differences are pure luck.
Random Assignment: The process of using a random mechanism (lottery, coin flip, random number generator) to determine who receives treatment, breaking any link between treatment and pre-existing characteristics.
In the RAND HIE, families were randomly assigned to insurance plans of different generosity, so high-income families were equally likely to end up in any plan group.
Like a teacher assigning lab partners by drawing names from a hat rather than letting students choose. The hat does not care who is popular or smart.
Law of Large Numbers: A statistical theorem guaranteeing that, as the sample size grows, the sample average converges to the population average. This is why large randomized experiments produce balanced groups.
Roll a die 10 times and the average may be far from 3.5. Roll it 100,000 times and the average will be almost exactly 3.5.
Like a casino’s edge. Any single bet is unpredictable, but over thousands of games, the house reliably wins because averages stabilize.
Balance Check: A test performed after randomization to verify that treatment and control groups look similar on observable baseline characteristics. If balance holds, we trust that randomization worked.
In the RAND HIE, researchers verified that age, income, education, and health were similar across plan groups before looking at outcomes.
Like a referee checking that both teams have the right number of players before the game starts. It does not guarantee a fair game, but failure would be a red flag.
Standard Error (SE): A measure of how much a sample estimate would vary across different random samples. Smaller standard errors mean more precise estimates.
A treatment effect of 5.0 with SE = 1.0 is precisely estimated; the same effect with SE = 10.0 is very uncertain.
Like the wobble of a bathroom scale. A high-quality scale gives consistent readings (small SE); a cheap scale gives different numbers each time (large SE).
t-Statistic: The ratio of an estimated coefficient to its standard error (coefficient / SE). It measures how many standard errors the estimate is from zero.
A coefficient of 8.0 with SE of 2.0 gives a t-statistic of 4.0, meaning the estimate is 4 standard errors away from zero — strong evidence of a real effect.
Like a signal-to-noise ratio on a radio. A t-statistic of 4 means the signal is much louder than the static; a t-statistic of 0.5 means the static drowns out the signal.
Statistical Significance: A result is statistically significant (at the 5% level) when its t-statistic exceeds roughly 2 in absolute value (1.96 in large samples), meaning it is unlikely to have arisen by chance alone.
A study finds that a job training program increases earnings by $2,000 with a t-statistic of 3.1. This is statistically significant — we can be confident the program had a real effect.
Like a fire alarm. It goes off only when the evidence of fire (smoke) is strong enough. A significant result says “this is probably real, not just random noise.”
Moral Hazard: The tendency for people to change their behavior when they are insulated from the consequences of that behavior, often used when insurance reduces the cost of risky choices.
In the RAND HIE, people with free insurance spent about 45% more on health care than those who paid most of their own costs.
Like an all-you-can-eat buffet. When each additional plate costs nothing, people eat more than they would at a restaurant where they pay per dish.
Dummy Variable Regression: A regression where the key explanatory variable is binary (0 or 1). The intercept gives the average for the reference group, and the coefficient on the dummy gives the difference in means between the two groups.
Regressing health on an insurance dummy (0 = uninsured, 1 = insured). The intercept is the average health of the uninsured; the coefficient is the insured-minus-uninsured gap.
Like a light switch. The variable is either “on” or “off,” and we measure how the outcome changes when we flip it.
Difference in Means: The simplest estimator of a treatment effect: the average outcome of the treated group minus the average outcome of the control group. In a randomized experiment, this is an unbiased estimate of the average causal effect.
Average test score for tutored students is 82; for untutored students it is 76. The difference in means is 82 - 76 = 6 points.
Like comparing the average height of a basketball team to that of a chess club. Simple subtraction tells you the gap, but only randomization tells you it is causal.
Intent-to-Treat (ITT): The effect of being assigned to treatment, regardless of whether the individual actually received it. It captures the overall policy impact including non-compliance.
In the Oregon lottery, the ITT is the effect of winning the lottery on health outcomes, even though only 25% of winners actually enrolled in Medicaid.
Like measuring the effect of receiving an invitation to a party, whether or not you actually attend. The invitation changed your options, even if you stayed home.
Clustering (of Standard Errors): Adjusting standard errors to account for the fact that observations within the same group (family, school, state) are correlated, preventing falsely precise estimates.
In the RAND HIE, family members share the same insurance plan, so their outcomes are correlated. Clustering SEs by family corrects for this.
Like counting votes by household rather than by individual. If everyone in a household votes the same way, counting each person separately would overstate how many independent opinions you have.
Robust Standard Errors: Standard errors adjusted for heteroskedasticity — the possibility that the variance of the error term differs across observations. They provide valid inference even when the standard OLS assumption of constant variance fails.
Earnings regressions often have more variable residuals for high-income individuals. Robust SEs account for this, preventing overconfident conclusions.
Like adjusting your confidence interval when measuring an uneven road. Some stretches are smooth (low variability) and others are bumpy (high variability) — you need wider margins of error for the bumpy parts.
Weighted Least Squares (WLS): A variant of OLS that gives more weight to observations that are more precisely measured or more representative, producing more efficient estimates.
When analyzing state-level death rates, states with larger populations have more reliable rates and receive more weight in WLS.
Like averaging restaurant reviews but trusting a reviewer who has eaten there 50 times more than one who visited once. More informative observations get a louder voice.
Does Health Insurance Improve Health?
The United States spends more on health care than any other developed country, yet millions of Americans remain uninsured. A natural question arises: does having health insurance actually make people healthier?
Imagine standing at a fork in a road. One path leads through a world where you have health insurance; the other through a world where you don’t. You can only walk one path — you’ll never know what would have happened on the other. This is the fundamental problem of causal inference: we observe one outcome per person, but the causal effect requires comparing two.
At first glance, the answer seems obvious. We can look at survey data and compare the health of insured and uninsured people. Let’s do exactly that using the National Health Interview Survey (NHIS), an annual survey of the U.S. population.
Code
import pandas as pd
import pyfixest as pf
# Data URL — all datasets are hosted on GitHub
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# Load pre-cleaned NHIS 2009 data (married couples aged 26-59)
nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv")
nhis.head(3)
| | health | insurance | nonwhite | age | education | family_size | employed | family_income | gender | weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 0 | 0.0 | 29.0 | 14.0 | 4.0 | 0.0 | 19282.932 | wife | 8938.0 |
| 1 | 4.0 | 0 | 0.0 | 35.0 | 11.0 | 4.0 | 1.0 | 19282.932 | husband | 8967.0 |
| 2 | 3.0 | 1 | 0.0 | 32.0 | 12.0 | 4.0 | 1.0 | 167844.530 | husband | 8905.0 |
The dataset contains a health index (1 = poor, 5 = excellent), insurance status (1 = insured, 0 = uninsured), and demographic characteristics for married couples.
A First Look: Insured vs. Uninsured
Let’s start with the simplest possible comparison. What is the average health of insured people versus uninsured people?
Code
# Average health by insurance status
means = nhis.groupby("insurance")["health"].mean()
pd.DataFrame({
    "Insurance Status": ["Uninsured", "Insured"],
    "Average Health (1-5)": [round(means[0], 2), round(means[1], 2)]
})
| | Insurance Status | Average Health (1-5) |
|---|---|---|
| 0 | Uninsured | 3.66 |
| 1 | Insured | 3.98 |
Insured people are healthier. But can we conclude that insurance caused this difference?
The Problem: Other Differences Between Groups
Before drawing causal conclusions, let’s check whether insured and uninsured people differ in other ways too.
A simple but powerful trick: if you regress an outcome \(Y\) on a dummy variable \(D\) (where \(D = 1\) for treated, \(D = 0\) for untreated), the regression gives you:
- Intercept = average of \(Y\) in the untreated group (the control mean)
- Coefficient on \(D\) = difference in means between treated and untreated
- Standard error = a measure of how precisely the difference is estimated
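With unweighted data, this is exactly the same as computing group means and their difference; with survey weights, the regression compares weighted group means instead. Either way, regression also gives us a standard error, which tells us whether the difference is statistically meaningful.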
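To make the equivalence concrete, here is a minimal sketch in plain Python. The six data points are invented for illustration (they are not NHIS values), and the regression uses the textbook closed-form formula for a single regressor rather than `pyfixest`.

```python
# Minimal sketch: OLS of y on a 0/1 dummy reproduces the difference in group means.
# The numbers below are made up for illustration; they are not NHIS data.
y = [3.0, 4.0, 5.0, 4.0, 2.0, 3.0]   # outcome (e.g., a health index)
d = [1, 1, 1, 0, 0, 0]               # dummy: 1 = treated, 0 = untreated

def mean(values):
    return sum(values) / len(values)

# Group means computed directly
mean_treated = mean([yi for yi, di in zip(y, d) if di == 1])
mean_control = mean([yi for yi, di in zip(y, d) if di == 0])

# Closed-form OLS for one regressor: slope = cov(d, y) / var(d)
d_bar, y_bar = mean(d), mean(y)
slope = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
         / sum((di - d_bar) ** 2 for di in d))
intercept = y_bar - slope * d_bar

print("control mean:", mean_control, "| OLS intercept:", intercept)
print("difference in means:", mean_treated - mean_control, "| OLS slope:", slope)
```

Both lines print matching pairs: the intercept equals the control mean, and the slope equals the treated-minus-control gap.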
Before we dive into the numbers, let’s clarify how to read the regression output.
Throughout this study guide, we report regression results with standard errors (SE) in parentheses.
- The SE measures how precisely a coefficient is estimated
- Rule of thumb: if |coefficient / SE| > 2, the result is statistically significant at the 5% level
- For balance checks, we want insignificant results (confirming groups are similar)
- For treatment effects, significant results provide evidence of a causal effect
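The rule of thumb can be written as a short helper. This is only a sketch: the function names and the `threshold` parameter are hypothetical, and the inputs are the illustrative numbers from the Standard Error and t-Statistic definitions earlier in the chapter.

```python
def t_statistic(coefficient, standard_error):
    """Return the ratio of an estimate to its standard error."""
    return coefficient / standard_error

def is_significant(coefficient, standard_error, threshold=2.0):
    """Rule of thumb from the text: |t| > 2 means significant at the 5% level."""
    return abs(t_statistic(coefficient, standard_error)) > threshold

# Examples taken from the definitions earlier in the chapter
print(t_statistic(8.0, 2.0))        # 4.0
print(is_significant(5.0, 1.0))     # True  (t = 5.0)
print(is_significant(5.0, 10.0))    # False (t = 0.5)
```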
Let’s apply this to compare insured and uninsured people across multiple characteristics:
Code
# Variables to compare across insurance groups
outcomes = ["health", "nonwhite", "age", "education",
            "family_size", "employed", "family_income"]
# Run a separate regression for each variable and collect results
rows = []
for var in outcomes:
    # Regress each variable on insurance dummy (with survey weights and robust SEs)
    result = pf.feols(f"{var} ~ insurance", data=nhis, weights="weight", vcov="hetero")
    # Intercept = uninsured mean; insurance coefficient = difference
    rows.append({
        "Variable": var,
        "Uninsured mean": round(result.coef()["Intercept"], 2),
        "Insured − Uninsured": round(result.coef()["insurance"], 2),
        "Std. Error": round(result.se()["insurance"], 2),
    })
pd.DataFrame(rows)
| | Variable | Uninsured mean | Insured − Uninsured | Std. Error |
|---|---|---|---|---|
| 0 | health | 3.66 | 0.35 | 0.02 |
| 1 | nonwhite | 0.17 | -0.02 | 0.01 |
| 2 | age | 40.51 | 2.61 | 0.21 |
| 3 | education | 11.67 | 2.70 | 0.07 |
| 4 | family_size | 3.95 | -0.45 | 0.04 |
| 5 | employed | 0.72 | 0.13 | 0.01 |
| 6 | family_income | 45989.09 | 60352.25 | 976.23 |
The insured are healthier — but they are also:
- ~3 years more educated
- $60,000 richer in family income
- More likely to be employed
These are enormous differences. People who choose insurance are fundamentally different from those who don’t. The health gap we observed almost certainly reflects these pre-existing advantages, not (just) the causal effect of insurance.
Why Naive Comparisons Fail: Selection Bias
The NHIS comparison illustrates a deep problem in causal inference. To understand it precisely, we need a framework for thinking about what would have happened under different circumstances.
The Potential Outcomes Framework
Imagine person \(i\) stands at a fork in the road. One path leads to having insurance; the other doesn’t. Each path leads to a health outcome:
- \(Y_{1i}\) = health with insurance (what happens on the insurance road)
- \(Y_{0i}\) = health without insurance (what happens on the other road)
The causal effect of insurance for person \(i\) is \(Y_{1i} - Y_{0i}\) — the difference between the two roads. But here’s the catch: each person takes only one road. We observe \(Y_{1i}\) or \(Y_{0i}\), never both.
Seeing It Through an Example
| Anika | Ben | |
|---|---|---|
| Health without insurance (\(Y_{0i}\)) | 3 | 5 |
| Health with insurance (\(Y_{1i}\)) | 4 | 5 |
| Choice: buys insurance? (\(D_i\)) | Yes (1) | No (0) |
| Observed health | 4 | 5 |
| True causal effect | +1 | 0 |
Anika, who is prone to illness, buys insurance — it improves her health by 1 point. Ben, naturally robust, skips it — insurance wouldn’t have helped him anyway.
What do we observe? Anika’s health is 4; Ben’s is 5. The naive comparison (\(4 - 5 = -1\)) suggests insurance is harmful! The true effect on Anika is +1, but the comparison is polluted by the fact that Ben was healthier to begin with.
A tempting line of reasoning is: “Insured people are healthier, so insurance must work.” This confuses correlation with causation. The Anika/Ben example shows that even when the treated group looks worse, the true treatment effect can be positive. The observed comparison reflects both the causal effect and the pre-existing differences between people who choose treatment and those who don’t. When people select their own treatment, you cannot read causation off a simple comparison.
The Decomposition
This leads to a fundamental equation. Any observed comparison can be split into two pieces:
\[\underbrace{\text{Observed difference}}_{\text{What we see}} = \underbrace{\kappa}_{\text{Causal effect}} + \underbrace{\text{Avg}[Y_{0i} | D_i\!=\!1] - \text{Avg}[Y_{0i} | D_i\!=\!0]}_{\text{Selection bias}}\]
graph LR
A["Observed Difference<br/>(Insured vs. Uninsured)"] --> B["Causal Effect (κ)<br/>What insurance<br/>actually does"]
A --> C["Selection Bias<br/>Pre-existing differences<br/>between the groups"]
style B fill:#2d8659,color:#fff
style C fill:#c0392b,color:#fff
style A fill:#475569,color:#fff
linkStyle default stroke:#64748b,stroke-width:2px
Selection bias is the difference in health that would exist even without insurance — it reflects the fact that healthier, wealthier, more educated people are more likely to be insured. The NHIS data above showed exactly this pattern.
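The decomposition can be checked by hand with the Anika/Ben numbers. The sketch below is only an illustration: with one person per group, each "average" is a single value, and the causal effect is Anika's because she is the only insured person.

```python
# Potential outcomes from the Anika/Ben table
people = {
    "Anika": {"y0": 3, "y1": 4, "d": 1},  # buys insurance
    "Ben":   {"y0": 5, "y1": 5, "d": 0},  # does not
}

def avg(values):
    values = list(values)
    return sum(values) / len(values)

treated = [p for p in people.values() if p["d"] == 1]
untreated = [p for p in people.values() if p["d"] == 0]

# What we actually observe: y1 for the treated, y0 for the untreated
observed_difference = avg(p["y1"] for p in treated) - avg(p["y0"] for p in untreated)
# Causal effect for those who are treated
causal_effect = avg(p["y1"] - p["y0"] for p in treated)
# Selection bias: gap in untreated health between the two groups
selection_bias = avg(p["y0"] for p in treated) - avg(p["y0"] for p in untreated)

print(observed_difference, causal_effect, selection_bias)  # -1.0 1.0 -2.0
```

The observed gap of -1 is the sum of a causal effect of +1 and a selection bias of -2: Ben would have been two points healthier than Anika even if neither had insurance.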
We can visualize this problem as a causal diagram. Confounders like education, income, and employment create a “backdoor path” between insurance status and health outcomes. Because these factors influence both who gets insured and how healthy they are, the naive comparison captures their influence along with any true causal effect of insurance.
graph TD
C["Confounders<br/>(Education, Income,<br/>Employment, etc.)"] -->|"affects"| I["Insurance<br/>Status"]
C -->|"affects"| H["Health<br/>Outcomes"]
I -.->|"causal effect?"| H
style C fill:#e67e22,color:#fff
style I fill:#3498db,color:#fff
style H fill:#2d8659,color:#fff
linkStyle default stroke:#64748b,stroke-width:2px
We want \(\kappa\) (the causal effect), but what we observe is \(\kappa\) plus selection bias. We cannot separate the two without a strategy that eliminates the bias.
The Solution: Random Assignment
The Core Idea
What if, instead of letting people choose insurance, we assigned it randomly — like a coin flip? This is the insight behind randomized controlled trials (RCTs).
When treatment is randomly assigned:
- The insured and uninsured groups are drawn from the same population
- They have similar education, income, health habits, and every other characteristic
- This includes characteristics we cannot observe or measure
The Law of Large Numbers guarantees this: in large random samples, group averages converge to the population average. So both groups end up looking alike.
Roll a fair die once — you might get 1 or 6, far from the expected value of 3.5. Roll it 10 times — the average gets closer. Roll it 10,000 times — the average is almost exactly 3.5. This is why casinos always win in the long run: any single bet is a toss-up, but over thousands of plays, the house edge reliably prevails. Random assignment works the same way: with enough people, the treatment and control groups converge to being identical on every characteristic — even ones we can’t see.
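The die-rolling intuition is easy to simulate. The following sketch uses only Python's standard `random` module; the seed and the sample sizes are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def average_roll(n_rolls):
    """Average of n_rolls rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# The average drifts toward 3.5 as the number of rolls grows
for n in (10, 1_000, 100_000):
    print(n, round(average_roll(n), 3))
```

With only 10 rolls the average can land well away from 3.5; by 100,000 rolls it should sit within a few hundredths of it.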
graph TD
P["Target Population"] --> R{"Random<br/>Assignment"}
R -->|"Coin = Heads"| T["Treatment Group<br/>(Receives insurance)"]
R -->|"Coin = Tails"| C["Control Group<br/>(No insurance)"]
T --> OT["Measure Health"]
C --> OC["Measure Health"]
OT --> D["Difference in Means<br/>= Causal Effect (κ)"]
OC --> D
style P fill:#3498db,color:#fff
style R fill:#8e44ad,color:#fff
style T fill:#2d8659,color:#fff
style C fill:#c0392b,color:#fff
style OT fill:#475569,color:#fff
style OC fill:#475569,color:#fff
style D fill:#2d8659,color:#fff
linkStyle default stroke:#64748b,stroke-width:2px
Why It Works Mathematically
With random assignment, the expected baseline health is the same in both groups:
\[E[Y_{0i} \mid D_i = 1] = E[Y_{0i} \mid D_i = 0]\]
This makes the selection bias term zero, so the observed difference equals the causal effect:
\[E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \kappa\]
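A small simulation can illustrate the same point. Everything in this sketch is an assumption made for illustration: a constant causal effect `KAPPA = 1.0`, normally distributed baseline health, and a self-selection rule under which people with better baseline health are more likely to insure.

```python
import random

random.seed(1)
KAPPA = 1.0      # assumed constant causal effect (hypothetical)
N = 20_000       # simulated population size (hypothetical)

def diff_in_means(y, d):
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Baseline health Y0 for each simulated person
y0 = [random.gauss(0, 1) for _ in range(N)]

# Self-selection: people with better baseline health are more likely to insure
d_self = [1 if h + random.gauss(0, 1) > 0 else 0 for h in y0]
# Random assignment: a fair coin flip, unrelated to baseline health
d_rand = [1 if random.random() < 0.5 else 0 for _ in range(N)]

def observed(d):
    """Observed outcome: Y0 for the untreated, Y0 + KAPPA for the treated."""
    return [h + KAPPA * di for h, di in zip(y0, d)]

naive_estimate = diff_in_means(observed(d_self), d_self)
randomized_estimate = diff_in_means(observed(d_rand), d_rand)
print("self-selected:", round(naive_estimate, 2))
print("randomized:   ", round(randomized_estimate, 2))
```

Under self-selection the difference in means overshoots the true effect because it also picks up the gap in baseline health. Under the coin flip, baseline health is balanced across groups and the difference in means lands close to `KAPPA`.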
Checking for Balance
Even in a randomized experiment, good practice requires us to check for balance: verify that baseline characteristics look similar across treatment groups. If they do, we can be confident that randomization worked and that the comparison is credible.
Case Study 1: The RAND Health Insurance Experiment
Background
The RAND Health Insurance Experiment (HIE), running from 1974 to 1982, remains one of the most influential social experiments ever conducted. Nearly 4,000 people from six U.S. sites were randomly assigned to insurance plans with varying levels of generosity:
| Plan Type | What Participants Pay | Role in the Experiment |
|---|---|---|
| Catastrophic (3 plans) | 95% of costs (capped) | Control group (≈ no insurance) |
| Deductible (1 plan) | 95% outpatient only (lower cap) | Moderate treatment |
| Coinsurance (9 plans) | 25–50% of costs (capped) | Moderate treatment |
| Free (1 plan) | Nothing — all care is free | Most generous treatment |
The experiment asked two questions:
- When health care is cheaper, do people use more of it?
- Does using more health care improve health?
Step 1: Verify Randomization (Balance Check)
First, we check whether randomization created comparable groups. We regress each baseline characteristic on plan-type dummies. The catastrophic plan is the omitted reference group, so each coefficient represents the difference between that plan group and the catastrophic group.
Code
# Load pre-cleaned RAND HIE baseline data
rand = pd.read_csv(DATA + "ch1/rand_balance.csv")
rand.head(3)
| | female | nonwhite | age | education | family_income | health_index | cholesterol | blood_pressure | mental_health | plan_type | plan_free | plan_deductible | plan_coinsurance | any_insurance | family_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| 1 | 0.0 | 1.0 | 42.0 | 12.0 | 67486.484 | NaN | NaN | NaN | 95.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0 | 100082.0 |
| 2 | 0.0 | NaN | 16.0 | NaN | 67486.484 | NaN | NaN | NaN | 93.8 | 4.0 | 0.0 | 0.0 | 0.0 | 0 | 100082.0 |
Before running the full table, let’s see what a single balance check looks like. Is the average age different across plan groups?
Code
# Prepare data (drop rows with missing values)
d = rand[["age", "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
# Regress age on plan-type dummies (catastrophic = omitted reference group)
result = pf.feols("age ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
# Extract key regression results into a clear table
pd.DataFrame({
    "Variable": result.coef().index,
    "Coefficient": result.coef().round(4).values,
    "Std. Error": result.se().round(4).values,
    "t-statistic": result.tstat().round(2).values,
    "p-value": result.pvalue().round(3).values,
})
| | Variable | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|---|
| 0 | Intercept | 32.3610 | 0.4849 | 66.73 | 0.000 |
| 1 | plan_free | 0.4350 | 0.6140 | 0.71 | 0.479 |
| 2 | plan_deductible | 0.5607 | 0.6759 | 0.83 | 0.407 |
| 3 | plan_coinsurance | 0.9658 | 0.6547 | 1.48 | 0.140 |
The Intercept (32.4) is the average age in the catastrophic group. The coefficients on the plan dummies (0.43 to 0.97) are the age differences — all small and statistically insignificant. Age is balanced.
In the RAND HIE, all members of a family were assigned to the same insurance plan. This means observations within a family are not independent — knowing one family member’s plan tells you the other’s. Clustering standard errors at the family level corrects for this correlation, preventing us from overstating the precision of our estimates.
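To see why ignoring within-family correlation overstates precision, consider the simulated sketch below. It is not the RAND data: the family count, family size, and variances are hypothetical. It also computes a simplified cluster-robust standard error for a sample mean, without the small-sample correction that a `CRV1` estimator applies.

```python
import math
import random

random.seed(2)
N_FAMILIES, FAMILY_SIZE = 500, 4   # hypothetical sizes, not the RAND sample

# Each person's outcome = shared family component + individual noise
family_ids, outcomes = [], []
for fam in range(N_FAMILIES):
    family_effect = random.gauss(0, 1)
    for _ in range(FAMILY_SIZE):
        family_ids.append(fam)
        outcomes.append(family_effect + random.gauss(0, 1))

n = len(outcomes)
mean = sum(outcomes) / n
residuals = [y - mean for y in outcomes]

# Conventional SE of the mean: treats all n observations as independent
naive_se = math.sqrt(sum(r * r for r in residuals) / (n - 1) / n)

# Cluster-robust SE: sum residuals within each family first, then square
cluster_sums = {}
for fam, r in zip(family_ids, residuals):
    cluster_sums[fam] = cluster_sums.get(fam, 0.0) + r
cluster_se = math.sqrt(sum(s * s for s in cluster_sums.values())) / n

print("naive SE:    ", round(naive_se, 4))
print("clustered SE:", round(cluster_se, 4))
```

Because family members share a common component, the clustered standard error comes out noticeably larger than the naive one: 2,000 correlated observations carry less independent information than 2,000 independent ones.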
Now let’s run the full balance check across all baseline variables:
Code
# List of baseline variables to check
balance_vars = ["female", "nonwhite", "age", "education", "family_income",
                "health_index", "cholesterol", "blood_pressure", "mental_health"]
# Run a separate regression for each variable and collect results
rows = []
for var in balance_vars:
    # Drop missing values for this variable
    d = rand[[var, "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
    # Regress baseline variable on plan dummies
    r = pf.feols(f"{var} ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
    # Extract coefficients and standard errors for each plan comparison
    coef_free = round(r.coef()["plan_free"], 2)
    se_free = round(r.se()["plan_free"], 2)
    coef_ded = round(r.coef()["plan_deductible"], 2)
    se_ded = round(r.se()["plan_deductible"], 2)
    coef_coin = round(r.coef()["plan_coinsurance"], 2)
    se_coin = round(r.se()["plan_coinsurance"], 2)
    rows.append({
        "Variable": var,
        "Catastrophic mean": round(r.coef()["Intercept"], 1),
        "Free − Catastrophic": format(coef_free, ".2f") + " (" + format(se_free, ".2f") + ")",
        "Deductible − Catastrophic": format(coef_ded, ".2f") + " (" + format(se_ded, ".2f") + ")",
        "Coinsurance − Catastrophic": format(coef_coin, ".2f") + " (" + format(se_coin, ".2f") + ")",
    })
pd.DataFrame(rows)
| | Variable | Catastrophic mean | Free − Catastrophic | Deductible − Catastrophic | Coinsurance − Catastrophic |
|---|---|---|---|---|---|
| 0 | female | 0.6 | -0.04 (0.01) | -0.02 (0.02) | -0.02 (0.02) |
| 1 | nonwhite | 0.2 | -0.03 (0.02) | -0.02 (0.03) | -0.03 (0.02) |
| 2 | age | 32.4 | 0.43 (0.61) | 0.56 (0.68) | 0.97 (0.65) |
| 3 | education | 12.1 | -0.26 (0.18) | -0.16 (0.19) | -0.06 (0.19) |
| 4 | family_income | 31603.2 | -976.18 (1344.55) | -2104.39 (1383.69) | 969.76 (1389.01) |
| 5 | health_index | 70.9 | -1.31 (0.87) | -1.44 (0.95) | 0.21 (0.92) |
| 6 | cholesterol | 207.3 | -5.25 (2.70) | -1.42 (2.98) | -1.93 (2.76) |
| 7 | blood_pressure | 122.3 | 1.12 (1.01) | 2.32 (1.15) | 0.91 (1.08) |
| 8 | mental_health | 73.8 | 0.89 (0.77) | -0.12 (0.82) | 1.19 (0.81) |
Verdict: Differences are small, go in both directions, and almost none are statistically significant. Randomization worked. Compare this to the NHIS table earlier, where insured and uninsured groups differed dramatically on nearly every characteristic, including education, employment, and family income.
Step 2: Estimate Causal Effects on Health-Care Use
Now we turn to outcomes. Because treatment was randomly assigned, the same regression approach that checked balance now gives us causal effects. The coefficient on each plan dummy tells us how much that plan changed health-care use relative to the catastrophic plan, the experiment’s stand-in for having no insurance.
Code
# Load pre-cleaned RAND HIE utilization data (person-year panel)
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
hie.head(3)
| | visits | outpatient_expenses | admissions | inpatient_expenses | total_expenses | plan_type | plan_free | plan_deductible | plan_coinsurance | any_insurance | family_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 36.305501 | 0.0 | 0.0 | 36.305501 | 4.0 | 0 | 0 | 0 | 0 | 100082 |
| 1 | 4.0 | 275.208504 | 0.0 | 0.0 | 275.208504 | 4.0 | 0 | 0 | 0 | 0 | 100082 |
| 2 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 4.0 | 0 | 0 | 0 | 0 | 100082 |
Code
# Outcome variables measuring health-care utilization
use_vars = ["visits", "outpatient_expenses", "admissions",
            "inpatient_expenses", "total_expenses"]
# Run a separate regression for each variable and collect results
rows = []
for var in use_vars:
    # Drop missing values for this outcome
    d = hie[[var, "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
    # Regress outcome on plan dummies; this gives causal effects (because of randomization!)
    r = pf.feols(f"{var} ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
    # Intercept = control group (catastrophic plan) mean
    # Coefficients = causal effect of each plan relative to catastrophic
    coef_free = int(round(r.coef()["plan_free"]))
    se_free = int(round(r.se()["plan_free"]))
    coef_ded = int(round(r.coef()["plan_deductible"]))
    se_ded = int(round(r.se()["plan_deductible"]))
    coef_coin = int(round(r.coef()["plan_coinsurance"]))
    se_coin = int(round(r.se()["plan_coinsurance"]))
    rows.append({
        "Outcome": var,
        "Catastrophic mean": int(round(r.coef()["Intercept"])),
        "Free effect": str(coef_free) + " (" + str(se_free) + ")",
        "Deductible effect": str(coef_ded) + " (" + str(se_ded) + ")",
        "Coinsurance effect": str(coef_coin) + " (" + str(se_coin) + ")",
    })
pd.DataFrame(rows)
| | Outcome | Catastrophic mean | Free effect | Deductible effect | Coinsurance effect |
|---|---|---|---|---|---|
| 0 | visits | 3 | 2 (0) | 0 (0) | 0 (0) |
| 1 | outpatient_expenses | 248 | 169 (20) | 42 (21) | 60 (21) |
| 2 | admissions | 0 | 0 (0) | 0 (0) | 0 (0) |
| 3 | inpatient_expenses | 388 | 116 (60) | 72 (68) | 93 (73) |
| 4 | total_expenses | 636 | 285 (72) | 114 (79) | 152 (84) |
The free plan caused large increases in utilization:
- +1.7 doctor visits per year (shown as 2 in the table because the code rounds to whole numbers)
- +$169 in outpatient spending (a 68% increase over the catastrophic group’s $248)
- +$285 in total spending (a 45% increase)
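The percentage increases quoted above follow directly from the table: divide the free-plan effect by the catastrophic-group mean. A quick arithmetic check, with the numbers copied from the table (the dictionary names are only illustrative):

```python
# Values copied from the utilization table (dollars per person per year)
catastrophic_mean = {"outpatient_expenses": 248, "total_expenses": 636}
free_plan_effect = {"outpatient_expenses": 169, "total_expenses": 285}

# Percent increase = effect relative to the control-group mean
percent_increase = {
    outcome: 100 * free_plan_effect[outcome] / catastrophic_mean[outcome]
    for outcome in catastrophic_mean
}
for outcome, pct in percent_increase.items():
    print(f"{outcome}: +{pct:.0f}%")
# outpatient_expenses: +68%
# total_expenses: +45%
```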
This is the demand curve at work: when insurance lowers the out-of-pocket price of care to zero, people use substantially more of it. Economists call this moral hazard — not a moral judgment, but simply the observation that people respond to incentives.
Step 3: Estimate Causal Effects on Health
Here is the crucial test. All that extra spending bought more health care — but did it buy better health? These outcomes were measured 3–5 years after random assignment.
Code
# Load pre-cleaned RAND HIE exit health measures
health = pd.read_csv(DATA + "ch1/rand_health_outcomes.csv")
health.head(3)
| | health_index | cholesterol | blood_pressure | mental_health | plan_type | plan_free | plan_deductible | plan_coinsurance | any_insurance | family_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| 1 | 71.6 | 245.0 | 128.0 | 94.7 | 4.0 | 0.0 | 0.0 | 0.0 | 0 | 100082.0 |
| 2 | 69.3 | 207.0 | 100.0 | 76.1 | 4.0 | 0.0 | 0.0 | 0.0 | 0 | 100082.0 |
Code
# Health outcome variables (measured at the end of the experiment)
health_vars = ["health_index", "cholesterol", "blood_pressure", "mental_health"]
# Run a separate regression for each variable and collect results
rows = []
for var in health_vars:
    # Drop missing values
    d = health[[var, "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
    # Regress health outcome on plan dummies
    r = pf.feols(f"{var} ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
    # Extract coefficients and standard errors
    coef_free = round(r.coef()["plan_free"], 2)
    se_free = round(r.se()["plan_free"], 2)
    coef_ded = round(r.coef()["plan_deductible"], 2)
    se_ded = round(r.se()["plan_deductible"], 2)
    coef_coin = round(r.coef()["plan_coinsurance"], 2)
    se_coin = round(r.se()["plan_coinsurance"], 2)
    rows.append({
        "Health Measure": var,
        "Catastrophic mean": round(r.coef()["Intercept"], 1),
        "Free effect": format(coef_free, ".2f") + " (" + format(se_free, ".2f") + ")",
        "Deductible effect": format(coef_ded, ".2f") + " (" + format(se_ded, ".2f") + ")",
        "Coinsurance effect": format(coef_coin, ".2f") + " (" + format(se_coin, ".2f") + ")",
    })
pd.DataFrame(rows)
| | Health Measure | Catastrophic mean | Free effect | Deductible effect | Coinsurance effect |
|---|---|---|---|---|---|
| 0 | health_index | 68.5 | -0.78 (0.87) | -0.87 (0.96) | 0.61 (0.90) |
| 1 | cholesterol | 203.2 | -1.83 (2.39) | 0.69 (2.57) | -2.31 (2.47) |
| 2 | blood_pressure | 121.9 | -0.52 (0.93) | 1.17 (1.06) | -1.39 (0.98) |
| 3 | mental_health | 75.5 | 0.43 (0.82) | 0.45 (0.91) | 1.07 (0.87) |
The results are striking. Across all four health measures — general health, cholesterol, blood pressure, and mental health — the differences between plan groups are small and statistically insignificant.
Despite consuming 45% more health care, participants in the free plan showed no measurable improvement in health compared to those with minimal coverage.
This is a precisely estimated null: the standard errors are small relative to the group means, so the data rule out large health benefits. The experiment was not too small to detect a meaningful effect; the data simply show none.
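One way to see what "precisely estimated" means is to turn each free-plan estimate into an approximate 95% confidence interval (estimate ± 2 × SE) and compare it with the catastrophic-group mean. The sketch below does this with the rounded numbers copied by hand from the table above; the variable names are illustrative only.

```python
# (catastrophic mean, free-plan effect, standard error), copied from the table above
results = {
    "health_index":   (68.5, -0.78, 0.87),
    "cholesterol":    (203.2, -1.83, 2.39),
    "blood_pressure": (121.9, -0.52, 0.93),
    "mental_health":  (75.5, 0.43, 0.82),
}

intervals = {}
max_pct = {}
for measure, (mean, coef, se) in results.items():
    # Approximate 95% confidence interval: estimate +/- 2 standard errors
    low, high = coef - 2 * se, coef + 2 * se
    intervals[measure] = (low, high)
    # Largest change (in either direction) inside the interval, as a share of the baseline
    max_pct[measure] = 100 * max(abs(low), abs(high)) / mean
    print(f"{measure}: [{low:.2f}, {high:.2f}], at most {max_pct[measure]:.1f}% of the baseline mean")
```

Every interval contains zero, and even the endpoints amount to only a few percent of the catastrophic-group mean, which is the sense in which large effects are ruled out.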
What Did We Learn from the RAND HIE?
The RAND experiment delivered three key lessons:
- People respond to prices. Cheaper health care leads to more consumption (moral hazard is real).
- More care does not automatically mean better health. The marginal medical care consumed when it’s free may not be very valuable.
- Randomization reveals the truth. The naive NHIS comparison suggested a large health benefit of insurance. The randomized experiment showed this was mostly selection bias.
These findings directly shaped the policy debate around the Affordable Care Act (2010). Proponents argued for universal coverage to improve health; skeptics cited RAND to argue that subsidized insurance mainly increases spending. The truth, as we’ll see from Oregon, is more nuanced.
The RAND experiment studied middle-class families who already had at least catastrophic coverage. But what about the people most affected by insurance policy debates: low-income adults with no coverage at all? A randomized lottery in Oregon addressed exactly this gap.
Case Study 2: The Oregon Health Plan
Why a Second Experiment?
The RAND HIE was groundbreaking, but it studied middle-class families who all had at least catastrophic coverage. Today’s uninsured Americans are different: younger, poorer, less educated. Would insurance help them more?
In 2008, the state of Oregon ran a health insurance lottery. About 75,000 low-income adults applied for Medicaid expansion; roughly 30,000 were randomly selected to apply for coverage. Economist Amy Finkelstein and colleagues studied the results.
In the Oregon lottery, winning raised Medicaid enrollment by only about 25 percentage points: many winners failed to complete the paperwork or turned out to be ineligible. This means the simple winner/loser comparison understates the effect on those who actually gained insurance. Adjusting for this non-compliance requires instrumental variables (Chapter 3): divide the winner/loser difference in the outcome by the winner/loser difference in enrollment. This is a preview of the IV method.
Results at a Glance
| Outcome | Effect of Winning the Lottery |
|---|---|
| Medicaid enrollment | +25.6 percentage points |
| Hospital admissions | Small increase |
| Emergency dept. visits | +10% (policymakers expected a decrease) |
| Self-reported health | Modest improvement (+3.9 pp) |
| Physical health (cholesterol, BP) | No significant change |
| Mental health | Improved |
| Catastrophic medical expenses | Decreased |
| Medical debt | Decreased |
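The non-compliance adjustment described above can be previewed with the numbers in this table. The sketch below divides the winner/loser difference in self-reported health by the winner/loser difference in Medicaid enrollment. It is a back-of-the-envelope illustration using the rounded table values, not the estimate reported by the Oregon study, and the variable names are illustrative.

```python
# Effects of winning the lottery, copied from the table above (percentage points)
itt_health = 3.9      # self-reported health
first_stage = 25.6    # Medicaid enrollment

# Scale the effect of winning by the share of winners whose coverage actually changed
effect_of_enrolling = itt_health / (first_stage / 100)
print(f"Implied effect for those who gained Medicaid: {effect_of_enrolling:.1f} pp")
```

The implied effect is roughly four times the raw winner/loser gap, because only about a quarter of the winner/loser contrast reflects an actual change in coverage.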
Comparing the Two Experiments
| | RAND HIE (1974–1982) | Oregon OHP (2008) |
|---|---|---|
| Population | Middle-class families | Low-income adults |
| More care used? | Yes | Yes |
| Better physical health? | No | No |
| Better mental health? | No | Yes |
| Less financial hardship? | Not measured | Yes |
The two experiments, conducted decades apart on very different populations, reached remarkably similar conclusions about physical health. The Oregon study added two important insights: insurance provides financial protection (less medical debt) and mental health benefits — which may be its primary value for low-income populations.
Historical Perspective: Pioneers of Randomization
The idea of using controlled comparisons did not appear overnight. Key milestones in the development of experimental methods:
timeline
title From Ancient Wisdom to Modern Trials
section Ancient
~600 BCE : Daniel's dietary trial
: First recorded use of a control group
section 18th Century
1747 : James Lind's scurvy experiment
: Tested citrus fruits on sailors
: His theory was wrong, but his data were right
section 19th Century
1885 : Peirce & Jastrow
: First use of random assignment
section 20th Century
1925 : R.A. Fisher formalizes RCTs
: Statistical Methods for Research Workers
1974 : RAND HIE launches
: Largest social experiment of its era
- Daniel (~600 BCE) proposed a 10-day vegetarian diet trial with a control group eating the king’s rich food — perhaps the first controlled experiment
- James Lind (1747) tested citrus fruits against other scurvy remedies. His theory (acids cure scurvy) was wrong, but his empirical finding was correct — a lesson about letting data speak
- R.A. Fisher (1920s–30s) formalized the theory of random assignment and experimental design, launching the modern era of RCTs
Throughout this chapter, we have relied on standard errors and t-statistics to judge whether differences are real or due to chance. The following toolkit formalizes these concepts.
Statistical Inference Toolkit
Here is a brief guide to interpreting the numbers we have been using.
The Core Problem: Sampling Variability
Any estimate from a sample could differ if we drew a different sample from the same population. Statistical inference quantifies this uncertainty.
Key Concepts
| Concept | Symbol | Plain English |
|---|---|---|
| Sample mean | \(\bar{Y}\) | The average in our data |
| Standard error | \(SE(\bar{Y})\) | How much \(\bar{Y}\) would vary across different samples |
| t-statistic | coefficient / SE | How many SEs away from zero is our estimate? |
| 95% Confidence interval | estimate \(\pm\) 2 \(\times\) SE | The range of values consistent with our data |
The Rule of Thumb
If the t-statistic (coefficient divided by its standard error) exceeds 2 in absolute value, the result is statistically significant at the 5% level. This means it is unlikely to have arisen by chance alone.
For balance checks: we want insignificant results (small t-stats), confirming groups are comparable.
For treatment effects: significant results provide evidence of a real causal effect.
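The rule of thumb can be written as a small helper. This is only a sketch: the function name `summarize` is hypothetical, and the inputs are the rounded coefficients and standard errors copied from the RAND tables earlier in the chapter.

```python
def summarize(coef, se):
    """Return the t-statistic and whether |t| exceeds 2 (the 5% rule of thumb)."""
    t = coef / se
    return t, abs(t) > 2

# Free-plan effect on total spending: 285 (72)
t_spend, sig_spend = summarize(285, 72)
# Free-plan effect on the health index: -0.78 (0.87)
t_health, sig_health = summarize(-0.78, 0.87)

print(f"spending: t = {t_spend:.2f}, significant = {sig_spend}")
print(f"health:   t = {t_health:.2f}, significant = {sig_health}")
```

The spending effect clears the threshold comfortably, while the health effect is less than one standard error from zero.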
A Crucial Caveat
Statistical significance measures precision, not importance:
- A large t-statistic can come from a huge sample (very precise), not necessarily a large effect
- A small t-statistic can mean the effect is small or that our sample is too small to detect it
- Lack of significance ≠ lack of effect — it may just mean insufficient data
Always consider both the size of a coefficient and its statistical precision.
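To see why significance tracks precision, hold the effect fixed and vary only the sample size. The sketch below assumes a hypothetical outcome with standard deviation 10, a true difference of 0.5 between two equal-sized groups, and the textbook standard error of a difference in means, \(\sigma\sqrt{2/n}\) with \(n\) observations per group. None of these numbers come from the chapter's data.

```python
import math

effect = 0.5    # hypothetical difference in means
sigma = 10.0    # hypothetical standard deviation of the outcome

t_stats = {}
for n_per_group in (100, 10_000, 1_000_000):
    # Standard error of a difference between two independent group means
    se = sigma * math.sqrt(2 / n_per_group)
    t_stats[n_per_group] = effect / se
    print(f"n = {n_per_group:>9,}: SE = {se:.3f}, t = {t_stats[n_per_group]:.2f}")
```

The same small effect is insignificant with 100 observations per group and overwhelmingly significant with a million, even though its practical size never changes.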
Key Takeaways
The following concept map shows how the key ideas in this chapter connect — from the initial causal question, through the problem of selection bias, to the solution of random assignment and the evidence from two landmark experiments.
graph TD
Q["Causal Question"] --> NC["Naive Comparison"]
NC --> SB["Selection Bias discovered"]
SB --> PO["Potential Outcomes Framework explains why"]
PO --> RA["Random Assignment as the solution"]
RA --> BC["Balance Check to verify"]
BC --> TE["Estimate Causal Effect"]
TE --> R["RAND HIE: more care does not improve health"]
TE --> O["Oregon OHP: insurance helps finances and mental health"]
style Q fill:#475569,color:#fff
style SB fill:#c0392b,color:#fff
style PO fill:#e67e22,color:#fff
style RA fill:#8e44ad,color:#fff
style BC fill:#3498db,color:#fff
style TE fill:#2d8659,color:#fff
style R fill:#2d8659,color:#fff
style O fill:#2d8659,color:#fff
linkStyle default stroke:#64748b,stroke-width:2px
Correlation is not causation. Observed differences between groups reflect causal effects plus selection bias.
The potential outcomes framework (\(Y_{1i}\), \(Y_{0i}\)) gives precise language for causal questions.
Selection bias arises because people who choose treatment differ from those who don’t.
Random assignment eliminates selection bias by making groups comparable.
Always check for balance to verify that randomization worked.
Regression on a dummy variable is the primary tool for comparing group means and testing for differences.
The RAND HIE found that free insurance increased spending by 45% but did not improve health.
The Oregon OHP confirmed these findings and showed that insurance helps with financial protection and mental health.
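The equivalence between regression on a dummy and a difference in means can be checked by hand. The sketch below uses a tiny made-up data set (not the chapter's data) and the one-regressor OLS formula, slope = cov(D, Y) / var(D); the helper name `ols_on_dummy` is hypothetical.

```python
from statistics import mean

def ols_on_dummy(y, d):
    """OLS of y on a constant and a 0/1 dummy d; returns (intercept, slope)."""
    y_bar, d_bar = mean(y), mean(d)
    cov = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    var = sum((di - d_bar) ** 2 for di in d)
    slope = cov / var
    return y_bar - slope * d_bar, slope

# Made-up outcomes for four untreated (d = 0) and four treated (d = 1) people
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [3.0, 4.0, 3.5, 4.5, 4.0, 5.0, 4.5, 5.5]

intercept, slope = ols_on_dummy(y, d)
mean_control = mean([yi for yi, di in zip(y, d) if di == 0])
mean_treated = mean([yi for yi, di in zip(y, d) if di == 1])

print("intercept:", intercept, "control mean:", mean_control)
print("slope:", slope, "difference in means:", mean_treated - mean_control)
```

The intercept reproduces the control-group mean and the slope reproduces the treated-minus-control difference, which is why the chapter reads regression tables as comparisons of group means.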
Learn by Coding
Copy this code into a Python notebook to reproduce the key results from this chapter.
# ============================================================
# Chapter 1: Randomized Trials — Code Cheatsheet
# ============================================================
import pandas as pd
import pyfixest as pf
DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"
# --- Step 1: Load NHIS data and compare health by insurance status ---
nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv")
print("Average health by insurance status:")
print(nhis.groupby("insurance")["health"].mean().round(2))
# --- Step 2: Regression on a dummy (difference in means + standard error) ---
result = pf.feols("health ~ insurance", data=nhis, vcov="hetero")
print("\nHealth ~ Insurance:")
print(result.summary())
# --- Step 3: Balance check (RAND HIE — did randomization work?) ---
rand = pd.read_csv(DATA + "ch1/rand_balance.csv")
d = rand[["age", "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
result = pf.feols("age ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
print("\nBalance check — Age across plan groups:")
print(result.summary())
# --- Step 4: Causal effect of free insurance on spending ---
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
d = hie[["total_expenses", "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
result = pf.feols("total_expenses ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
print("\nCausal effect on total spending:")
print(result.summary())
# --- Step 5: Causal effect on health (the RAND paradox: no effect!) ---
health = pd.read_csv(DATA + "ch1/rand_health_outcomes.csv")
d = health[["health_index", "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
result = pf.feols("health_index ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
print("\nCausal effect on health (expect: no significant effect):")
print(result.summary())
Copy the code above and paste it into this Google Colab scratchpad to run it interactively. Modify the variables, change the specifications, and see how results change!
Below is the same cheatsheet in Stata syntax.
* ============================================================
* Chapter 1: Randomized Trials — Stata Cheatsheet
* ============================================================
clear all
set more off
* --- Step 1: Load NHIS data and compare health by insurance status ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/nhis_clean.csv", clear
tabstat health, by(insurance)
* --- Step 2: Regression on a dummy (difference in means + standard error) ---
reg health insurance, robust
* --- Step 3: Balance check (RAND HIE — did randomization work?) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_balance.csv", clear
reg age plan_free plan_deductible plan_coinsurance, cluster(family_id)
* --- Step 4: Causal effect of free insurance on spending ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
reg total_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)
* --- Step 5: Causal effect on health (the RAND paradox: no effect!) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_health_outcomes.csv", clear
reg health_index plan_free plan_deductible plan_coinsurance, cluster(family_id)
Copy the code above into a .do file and run it in Stata 14 or later (which supports loading data from URLs). If your Stata cannot access the internet, download the CSV files from the data/ folder on GitHub and replace each URL with a local file path.
Exercises
Multiple Choice Questions
- What is the fundamental problem of causal inference?
- We cannot measure outcomes accurately
- We can only observe one potential outcome per person
- Random assignment is impossible in practice
- Sample sizes are always too small
(b) We can never observe the same person in both the treated and untreated state at the same time — this is the fundamental problem of causal inference. Each person has two potential outcomes (\(Y_{1i}\) and \(Y_{0i}\)), but we only observe one. (a) is wrong because measurement accuracy is a separate issue from the missing counterfactual. (c) is wrong because random assignment is feasible and widely used (as the RAND HIE shows). (d) is wrong because even with millions of observations, we still cannot see both potential outcomes for any single individual.
- In the RAND Health Insurance Experiment, what happened to physical health when people received free insurance?
- It improved dramatically
- It worsened due to overuse of care
- It showed no significant improvement despite higher spending
- It improved only for high-income participants
(c) The RAND HIE’s most surprising finding was that free insurance increased health care spending by about 45% but produced no statistically significant improvement in physical health for the average person. (a) is wrong because despite higher utilization, the extra care did not translate into measurably better health outcomes. (b) is wrong because health did not worsen — it simply did not improve. (d) is wrong because the null result on physical health applied across income groups, though the study did find benefits for the sickest and poorest subgroups.
- Selection bias occurs when:
- The sample size is too small for reliable estimates
- The treatment and control groups differ in ways related to the outcome
- Researchers choose which results to report
- Survey respondents lie about their behavior
(b) Selection bias arises when people who receive the treatment differ systematically from those who do not, in ways that also affect the outcome. In the causal framework, this means \(E[Y_{0i}|D_i=1] \neq E[Y_{0i}|D_i=0]\). (a) is wrong because small samples increase variance (noise) but do not cause systematic bias. (c) is wrong because that describes publication bias or p-hacking, a different problem. (d) is wrong because that describes response bias, not the selection into treatment that the chapter focuses on.
- Why is random assignment considered the gold standard for causal inference?
- It guarantees a large sample size
- It eliminates measurement error
- It makes treatment and control groups comparable on all characteristics, even unobserved ones
- It ensures perfect compliance with assigned treatment
(c) Random assignment ensures that, in expectation, the treatment and control groups are identical on all characteristics — observed and unobserved — making the selection bias term equal to zero. By the Law of Large Numbers, randomization balances everything, including variables the researcher cannot measure. (a) is wrong because randomization works regardless of sample size (though larger samples increase precision). (b) is wrong because measurement error is unrelated to how subjects are assigned. (d) is wrong because non-compliance is common even in randomized experiments (as the RAND HIE and Oregon experiments both show).
- A regression coefficient has a t-statistic of 3.5. This means:
- The effect is large in practical terms
- The result is unlikely to have arisen by chance alone
- The regression model fits the data well
- The sample is representative of the population
(b) A t-statistic of 3.5 means the estimated coefficient is 3.5 standard errors away from zero. Under the null hypothesis of no effect, this would be very unlikely to occur by chance (p < 0.001), so we reject the null. (a) is wrong because the t-statistic measures statistical significance, not practical importance — a tiny effect can be statistically significant with a large sample. (c) is wrong because model fit is measured by R-squared, not t-statistics. (d) is wrong because representativeness depends on sampling design, not on the t-statistic of a coefficient.
- A “balance check” in a randomized experiment tests whether:
- The sample size is equal in both groups
- Pre-treatment characteristics are similar across treatment and control groups
- The treatment was delivered correctly
- The outcome variable is normally distributed
(b) A balance check verifies that randomization worked by comparing baseline (pre-treatment) characteristics across groups. If randomization succeeded, variables like age, income, and prior health should be statistically similar across treatment arms. (a) is wrong because groups do not need equal size — unequal allocation is common and acceptable. (c) is wrong because balance checks examine pre-treatment variables, not treatment delivery (which is a compliance issue). (d) is wrong because normality of the outcome is a distributional assumption, not related to whether randomization produced comparable groups.
- In the Oregon Health Insurance Experiment, Medicaid was found to improve:
- Physical health outcomes such as blood pressure and cholesterol
- Financial security and mental health
- Employment rates and earned income
- All of the above equally
(b) The Oregon experiment found that Medicaid significantly reduced financial hardship (fewer medical debts, less borrowing) and improved mental health (lower rates of depression). (a) is wrong because the study found no statistically significant improvements in measured physical health indicators like blood pressure, cholesterol, or glycated hemoglobin. (c) is wrong because Medicaid had no significant effect on employment. (d) is wrong because the benefits were concentrated in financial protection and mental health, not spread equally across all domains.
- The selection bias decomposition shows that the observed difference in outcomes equals:
- The treatment effect only
- The average treatment effect plus selection bias
- The sample mean minus the population mean
- The R-squared of the regression
(b) The decomposition equation shows: observed difference = average treatment effect on the treated + selection bias. The selection bias term captures pre-existing differences between the treatment and control groups (\(E[Y_{0i}|D_i=1] - E[Y_{0i}|D_i=0]\)). Only when selection bias is zero (as with randomization) does the observed difference equal the causal effect. (a) is wrong because the observed difference also includes selection bias unless we have a randomized experiment. (c) is wrong because that describes sampling error, not the causal inference decomposition. (d) is wrong because R-squared measures explained variance, not the treatment-selection decomposition.
- Why do NHIS data show that insured people are healthier than uninsured people, even though insurance may not improve health?
- The NHIS uses a biased sampling method
- People who choose insurance tend to be healthier, wealthier, and more educated to begin with
- Insurance companies only accept healthy applicants
- The NHIS measures health inaccurately
(b) The NHIS comparison reflects selection bias: people who obtain insurance tend to be employed, higher-income, and more educated — all factors independently associated with better health. The observed health gap between insured and uninsured reflects these pre-existing differences, not a causal effect of insurance. (a) is wrong because the NHIS is a well-designed national survey; the bias is in the treatment (insurance) selection, not the sampling. (c) is wrong because while some underwriting exists, the main issue is self-selection into coverage. (d) is wrong because measurement quality is not the source of the misleading comparison.
- Non-compliance in a randomized experiment means that:
- Participants drop out of the study
- Some participants do not follow their assigned treatment
- The randomization device malfunctions
- The control group is contaminated by the treatment group
(b) Non-compliance occurs when participants do not follow their assigned treatment — for example, people assigned to a free insurance plan who do not enroll, or people assigned to the control group who obtain insurance elsewhere. (a) is wrong because attrition (dropping out) is a separate problem from non-compliance — non-compliers stay in the study but don’t follow their assignment. (c) is wrong because non-compliance is about participant behavior, not technical failure. (d) is wrong because contamination is one specific form of non-compliance (control group receiving treatment), but non-compliance also includes treated subjects not taking the treatment.
Conceptual Questions
- Spotting selection bias: A study reports that people who eat organic food live 3 years longer. List three reasons why this comparison might reflect selection bias rather than a causal effect of organic food.
Organic food buyers differ systematically from non-buyers, making any health comparison suspect. Three sources of selection bias:
- Income: People who buy organic food tend to have higher incomes, and wealthier people have better access to health care and live longer regardless of diet.
- Health behavior: Organic food buyers are likely more health-conscious overall — they exercise more, smoke less, and manage stress better. This is a classic case of bundled lifestyle choices acting as confounders.
- Education: Education is correlated with both organic food consumption and longevity; more-educated people make healthier choices across many domains.
All three sources violate the comparability assumption from the selection bias decomposition: \(E[Y_{0i} | D_i = 1] \neq E[Y_{0i} | D_i = 0]\), so the observed difference overstates any true causal effect of organic food.
- Reading a regression: In the balance check above, the coefficient on plan_free for family_income is approximately −976 with SE ≈ 1,345. (a) What is the t-statistic? (b) Is this difference statistically significant? (c) What does your answer tell us about whether randomization worked for this variable?
A small t-statistic confirms that randomization successfully balanced family income across plan groups.
- Compute: The t-statistic is −976 / 1,345 ≈ −0.73.
- Evaluate: Since |−0.73| < 2, this difference is NOT statistically significant at conventional levels.
- Interpret: The difference in family income between the free plan and catastrophic plan groups is small enough to be attributable to chance. Randomization worked for this variable — the groups are comparable on family income. This is exactly what the balance check in the chapter’s Table “Balance of baseline characteristics” is designed to verify: if \(D_i\) is randomly assigned, baseline covariates should look similar across groups.
- The RAND paradox: Your friend says “The RAND experiment proves health insurance is worthless.” Write a short paragraph explaining why this is an oversimplification. What did the Oregon experiment show that insurance is good for?
No effect on physical health does not mean insurance is useless — it means health is a narrow outcome that misses other benefits.
- Financial protection: The Oregon experiment showed that lottery winners had less medical debt and fewer catastrophic medical expenses. Insurance smooths financial risk, which is valuable even without health gains.
- Mental health: Oregon lottery winners reported better mental health scores, an outcome dimension the RAND study did not emphasize.
- Access to care: Insurance increases access to care, which may matter more for acute conditions or preventive services not captured by the RAND outcome measures.
The correct conclusion connects both experiments from the chapter: more generous insurance increases spending without improving measurable physical health (RAND), but it provides valuable financial security and mental health benefits (Oregon). Different outcomes can tell different causal stories from the same intervention.
- Random assignment and selection bias: Using the decomposition equation from this chapter, explain step by step why random assignment makes the selection bias term equal to zero. What role does the Law of Large Numbers play?
Random assignment eliminates selection bias by making the treatment and control groups statistically identical at baseline.
- Start from the decomposition: Observed difference = \(\kappa\) + Selection bias, where selection bias = \(E[Y_{0i} | D_i = 1] - E[Y_{0i} | D_i = 0]\).
- Apply randomization: When \(D_i\) is randomly assigned, the treatment and control groups are drawn from the same population, so treatment status is independent of potential outcomes. Formally, \(E[Y_{0i} | D_i = 1] = E[Y_{0i} | D_i = 0]\), so the selection bias term equals zero.
- Invoke the Law of Large Numbers: With a large enough sample, the sample average of \(Y_{0i}\) in each group is close to its expectation, so the baseline averages we actually observe will be nearly identical in both groups.
- Conclude: The observed difference then equals \(\kappa\), the true causal effect. This is the core logic behind every balance check in the chapter — if randomization works, baseline variables should be balanced.
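The argument above can be illustrated with a small simulation. Everything in it is an assumption chosen for illustration: the distribution of \(Y_{0i}\), a constant effect \(\kappa = 5\), the self-selection rule, and the sample size. Selection bias can be computed here only because a simulation lets us see \(Y_{0i}\) for everyone.

```python
import random
from statistics import fmean

random.seed(0)
n = 100_000
kappa = 5.0  # assumed constant causal effect

# Hypothetical untreated outcomes Y0 (visible for everyone only in a simulation)
y0 = [random.gauss(50, 10) for _ in range(n)]

def decompose(d):
    """Return (observed difference in means, selection bias) for assignment d."""
    y0_treated = [y for y, di in zip(y0, d) if di == 1]
    y0_control = [y for y, di in zip(y0, d) if di == 0]
    # Treated units are observed at Y1 = Y0 + kappa, controls at Y0
    observed = fmean([y + kappa for y in y0_treated]) - fmean(y0_control)
    bias = fmean(y0_treated) - fmean(y0_control)
    return observed, bias

# Self-selection: people with higher Y0 are more likely to take the treatment
d_self = [1 if y + random.gauss(0, 10) > 50 else 0 for y in y0]
# Random assignment: a coin flip that ignores Y0
d_rand = [random.randint(0, 1) for _ in range(n)]

obs_self, bias_self = decompose(d_self)
obs_rand, bias_rand = decompose(d_rand)

print(f"self-selected: observed = {obs_self:.2f}, selection bias = {bias_self:.2f}")
print(f"randomized:    observed = {obs_rand:.2f}, selection bias = {bias_rand:.2f}")
```

Under self-selection the observed difference is far above \(\kappa\); under the coin flip the selection bias term is close to zero and the observed difference is close to \(\kappa\).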
- Designing an RCT: You want to test whether free school lunches improve student test scores. (a) How would you randomly assign treatment? (b) What outcome would you measure? (c) What balance check would you run? (d) Why might some students assigned to “free lunch” not actually eat it, and what problem does this create?
Designing an experiment requires specifying randomization, outcomes, balance checks, and anticipating non-compliance.
- Randomization: Randomly select classrooms or schools to receive the program (cluster randomization), or randomly assign individual students within each school. Cluster randomization avoids contamination across students in the same classroom.
- Outcome: Measure standardized test scores at the end of the semester/year. This gives a clear, quantifiable dependent variable \(Y_i\).
- Balance check: Compare baseline characteristics (prior test scores, demographics, family income) between treatment and control groups to verify balance — just as the RAND experiment checked age, education, and income in the chapter.
- Non-compliance threat: Some students may refuse the lunch, share it, or already receive food from other sources. This is a non-compliance problem: the intent-to-treat effect (being offered lunch) may differ from the effect of actually eating it. This foreshadows the instrumental variables approach in Chapter 3, where random assignment serves as an instrument for actual treatment.
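The randomization step in part (a) can be sketched in a few lines. The example below assumes a hypothetical roster of 20 schools with 30 students each and randomizes at the school level, so that every student in a school shares the same assignment. The school IDs, sizes, and seed are made up for illustration.

```python
import random

random.seed(42)

# Hypothetical roster: (school_id, student_id) pairs
schools = list(range(1, 21))
roster = [(school, f"{school}-{i}") for school in schools for i in range(30)]

# Cluster randomization: shuffle the schools and treat the first half
shuffled = random.sample(schools, k=len(schools))
treated_schools = set(shuffled[: len(schools) // 2])

# Every student inherits the assignment of his or her school
assignment = {student: int(school in treated_schools) for school, student in roster}

n_treated = sum(assignment.values())
print(f"{len(treated_schools)} of {len(schools)} schools treated; "
      f"{n_treated} of {len(roster)} students offered free lunch")
```

Because assignment varies only across schools, standard errors in the follow-up regression would be clustered by school, in the same way the chapter clusters by family_id in the RAND regressions.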
Research Tasks
- Binary balance check: Using rand_balance.csv, run a balance check using the single dummy any_insurance (instead of the three plan dummies). Regress age, education, and health_index on any_insurance with family-clustered SEs. Do you reach the same conclusion about balance as the three-dummy specification?
Code
# --- Load data ---
import pandas as pd
import pyfixest as pf
rand = pd.read_csv(DATA + "ch1/rand_balance.csv")
# --- Run balance regressions ---
# Use a single binary dummy (any_insurance) instead of three plan dummies
rows = []
for var in ["age", "education", "health_index"]:
    d = rand[[var, "any_insurance", "family_id"]].dropna()
    # OLS with clustered SEs at the family level
    r = pf.feols(f"{var} ~ any_insurance", data=d, vcov={"CRV1": "family_id"})
    rows.append({
        "Variable": var,
        "Catastrophic mean": round(r.coef()["Intercept"], 1),  # control group mean
        "Any ins. difference": round(r.coef()["any_insurance"], 2),  # treatment-control gap
        "SE": round(r.se()["any_insurance"], 2),
        "t-stat": round(r.tstat()["any_insurance"], 2),  # difference / SE
    })
pd.DataFrame(rows)
| | Variable | Catastrophic mean | Any ins. difference | SE | t-stat |
|---|---|---|---|---|---|
| 0 | age | 32.4 | 0.64 | 0.54 | 1.18 |
| 1 | education | 12.1 | -0.17 | 0.16 | -1.07 |
| 2 | health_index | 70.9 | -0.93 | 0.77 | -1.20 |
Stata equivalent:
* --- Binary balance check ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_balance.csv", clear
* Run balance regressions with clustered SEs
foreach var in age education health_index {
    reg `var' any_insurance, cluster(family_id)
}
What the numbers show: All t-statistics are small (well below 2), so none of the baseline differences are statistically significant. The catastrophic and any-insurance groups look comparable on age, education, and health.
Why: Randomization ensures that treatment assignment is independent of pre-existing characteristics. The Law of Large Numbers makes the group means converge, as discussed in Q4.
What it teaches: Balance holds regardless of whether we use three plan dummies or a single binary indicator. The binary specification pools all non-catastrophic plans together, which is simpler but loses information about differences across plan types. This illustrates a general point: the choice of treatment variable definition can affect granularity but should not affect the core balance result if randomization worked.
- Relative utilization increases: Using rand_utilization.csv, compute the percentage increase in each utilization outcome for the free plan relative to the catastrophic group mean. Which outcome shows the largest relative increase: visits, outpatient expenses, admissions, or total expenses?
Code
# --- Load data ---
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
# --- Run regressions and compute percentage effects ---
rows = []
for var in ["visits", "outpatient_expenses", "admissions", "inpatient_expenses", "total_expenses"]:
    d = hie[[var, "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
    # OLS with plan dummies; clustered SEs at the family level
    r = pf.feols(f"{var} ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
    cat_mean = r.coef()["Intercept"]  # intercept = catastrophic plan mean (reference group)
    free_effect = r.coef()["plan_free"]  # coefficient = absolute increase from free plan
    pct_increase = (free_effect / cat_mean) * 100  # express as percentage of baseline
    rows.append({
        "Outcome": var,
        "Catastrophic mean": round(cat_mean),
        "Free plan effect": round(free_effect),
        "% increase": round(pct_increase, 1),
    })
pd.DataFrame(rows)
| | Outcome | Catastrophic mean | Free plan effect | % increase |
|---|---|---|---|---|
| 0 | visits | 3 | 2 | 59.8 |
| 1 | outpatient_expenses | 248 | 169 | 68.2 |
| 2 | admissions | 0 | 0 | 29.1 |
| 3 | inpatient_expenses | 388 | 116 | 30.0 |
| 4 | total_expenses | 636 | 285 | 44.9 |
Stata equivalent:
* --- Percentage increase in utilization for the free plan ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
* Run regressions for each utilization outcome
foreach var in visits outpatient_expenses admissions inpatient_expenses total_expenses {
    reg `var' plan_free plan_deductible plan_coinsurance, cluster(family_id)
    * Compute percentage increase: free plan effect / catastrophic mean * 100
    scalar cat_mean = _b[_cons]
    scalar free_effect = _b[plan_free]
    scalar pct_increase = (free_effect / cat_mean) * 100
    display "`var': catastrophic mean = " cat_mean ", free effect = " free_effect ", % increase = " pct_increase
}

What the numbers show: Outpatient expenses show the largest relative increase (~68%), followed by face-to-face visits (~60%). Hospital admissions (~29%) and inpatient expenses (~30%) show smaller relative increases. Total expenses rose ~45%. The percentages are computed from unrounded coefficients, so they cannot be reproduced exactly from the rounded means and effects in the table.
Why: Inpatient decisions are made primarily by doctors rather than patients, so reducing cost-sharing has less effect on admissions. Outpatient care, where patients have more discretion over whether to seek treatment, responds most strongly to price changes. This is what we would expect if demand for discretionary care is more price-elastic than demand for medically necessary care.
What it teaches: The same experiment can reveal heterogeneous causal effects across different outcomes. The RAND results show that moral hazard (the tendency to use more care when insured) is concentrated in outpatient services, not hospital stays. This pattern is key to understanding the policy implications of insurance design discussed in the chapter.
- Husbands vs. wives: Using nhis_clean.csv, run the insurance-health comparison separately for husbands and wives. Is the selection bias (the gap in education and income between insured and uninsured) larger for one gender? What might explain any differences?
Code
# --- Load data ---
nhis = pd.read_csv(DATA + "ch1/nhis_clean.csv")
# --- Run WLS regressions by gender ---
rows = []
for gender in ["husband", "wife"]:
    subset = nhis[nhis["gender"] == gender]  # split sample by gender
    for var in ["health", "education", "family_income"]:
        # WLS with survey weights; HC1 robust standard errors
        r = pf.feols(f"{var} ~ insurance", data=subset, weights="weight", vcov="hetero")
        rows.append({
            "Gender": gender,
            "Variable": var,
            "Difference (Ins - Unins)": round(r.coef()["insurance"], 2),  # coefficient = gap
            "SE": round(r.se()["insurance"], 2),
        })
pd.DataFrame(rows)

| | Gender | Variable | Difference (Ins - Unins) | SE |
|---|---|---|---|---|
| 0 | husband | health | 0.31 | 0.03 |
| 1 | husband | education | 2.74 | 0.10 |
| 2 | husband | family_income | 60810.44 | 1355.79 |
| 3 | wife | health | 0.39 | 0.04 |
| 4 | wife | education | 2.64 | 0.11 |
| 5 | wife | family_income | 59827.50 | 1406.08 |
Stata equivalent:
* --- Selection bias by gender ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/nhis_clean.csv", clear
* Run WLS regressions by gender
foreach g in husband wife {
    display "=== Gender: `g' ==="
    foreach var in health education family_income {
        reg `var' insurance [aw=weight] if gender == "`g'", robust
    }
}

What the numbers show: The education and income gaps between insured and uninsured are nearly identical for husbands and wives (about 2.74 vs. 2.64 for education and roughly 60,800 vs. 59,800 for family income). The health gap is somewhat larger for wives (0.39) than for husbands (0.31), a difference that is modest relative to the standard errors (0.04 and 0.03).
Why: Selection into insurance is driven by socioeconomic factors (education, income) that operate similarly for both spouses in a household. Any gender-specific differences in the health gap likely reflect gender-specific health patterns rather than differences in the selection mechanism.
What it teaches: Both groups show substantial selection bias, reinforcing the chapter’s central lesson: observational comparisons between insured and uninsured people confound the causal effect of insurance with pre-existing differences. This is precisely why the RAND and Oregon experiments — which use randomization to eliminate selection bias — provide more credible evidence.
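The comparisons in this exercise rely on a property worth checking directly: in a regression of an outcome on a constant and a single dummy, the dummy's coefficient equals the difference in group means, and with weights it equals the difference in weighted means. The sketch below illustrates this with synthetic data (not the NHIS sample). The variable names and the simulated effect size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (NOT the NHIS sample): a binary "insurance" flag,
# an outcome, and positive survey-style weights. All values are made up.
n = 500
insurance = rng.integers(0, 2, size=n)
outcome = 3.5 + 0.3 * insurance + rng.normal(0, 1, size=n)
weight = rng.uniform(0.5, 2.0, size=n)

# Weighted difference in group means
mean_ins = np.average(outcome[insurance == 1], weights=weight[insurance == 1])
mean_unins = np.average(outcome[insurance == 0], weights=weight[insurance == 0])
diff_means = mean_ins - mean_unins

# WLS of the outcome on a constant and the dummy: solve (X'WX) b = X'Wy
X = np.column_stack([np.ones(n), insurance])
XtW = X.T * weight  # scales each observation's column by its weight
beta = np.linalg.solve(XtW @ X, XtW @ outcome)

print(f"weighted difference in means: {diff_means:.6f}")
print(f"WLS coefficient on dummy:     {beta[1]:.6f}")
```

The two printed numbers agree up to floating-point error, and the intercept equals the weighted mean of the uninsured group. The regression adds nothing to the point estimate; its practical value is that it delivers standard errors in the same step.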
- Dose-response across plan generosity: Using rand_utilization.csv, extract the three plan-dummy coefficients for total_expenses and rank them by plan generosity (free > coinsurance > deductible). Is there a monotonic relationship between plan generosity and spending? Test whether the free and coinsurance coefficients are statistically different.
Code
# --- Load data ---
import pandas as pd
import pyfixest as pf
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
# --- Regression with three plan dummies ---
d = hie[["total_expenses", "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
r = pf.feols("total_expenses ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
# --- Extract and rank coefficients by plan generosity ---
pd.DataFrame({
    "Plan": ["Free (most generous)", "Coinsurance (medium)", "Deductible (least generous)"],
    "Effect vs. catastrophic": [round(r.coef()["plan_free"]),
                                round(r.coef()["plan_coinsurance"]),
                                round(r.coef()["plan_deductible"])],
    "SE": [round(r.se()["plan_free"]),
           round(r.se()["plan_coinsurance"]),
           round(r.se()["plan_deductible"])],
    "t-stat": [round(r.tstat()["plan_free"], 2),
               round(r.tstat()["plan_coinsurance"], 2),
               round(r.tstat()["plan_deductible"], 2)],
})

| | Plan | Effect vs. catastrophic | SE | t-stat |
|---|---|---|---|---|
| 0 | Free (most generous) | 285 | 72 | 3.94 |
| 1 | Coinsurance (medium) | 152 | 84 | 1.80 |
| 2 | Deductible (least generous) | 114 | 79 | 1.44 |
Stata equivalent:
* --- Dose-response: plan generosity and total expenses ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
* Regression with three plan dummies and clustered SEs
reg total_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)
* Test whether free and coinsurance effects are equal
test plan_free = plan_coinsurance

What the numbers show: The free plan produces the largest increase in total expenses (285), followed by the coinsurance plan (152), then the deductible plan (114), so the point estimates rise with plan generosity. Only the free plan effect is statistically distinguishable from zero at the 5% level (t = 3.94); the coinsurance (t = 1.80) and deductible (t = 1.44) effects are not. The gap between those two (38) is small relative to their standard errors (84 and 79), so their ordering is not precisely estimated.
Why: More generous plans reduce out-of-pocket costs more, lowering the price of care to patients. Basic demand theory predicts that lower prices increase quantity demanded. The free plan eliminates cost-sharing entirely, producing the strongest response. The coinsurance and deductible plans still require some out-of-pocket payment, partially restraining demand.
What it teaches: The dose-response pattern strengthens the causal interpretation of the RAND experiment. If insurance generosity had no real effect on spending, all three coefficients would be close to zero with no systematic ordering. Instead, we see a gradient that matches the economic logic of moral hazard: more generous coverage leads to more spending. Such a pattern is harder to attribute to chance or confounding.
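The exercise also asks whether the free and coinsurance coefficients differ, which the Stata `test` command handles in one line. A minimal Python sketch of the same idea follows, written with NumPy only so it does not depend on any particular regression library's testing API. It uses synthetic data (not the RAND sample), the effect sizes are made up, and the cluster-robust covariance uses a CRV1-style small-sample adjustment that is assumed here to mirror the family-level clustering used above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (NOT the RAND sample): 300 hypothetical families of 3 people,
# each family assigned to one of four plans (0 = catastrophic reference group).
n_fam, fam_size = 300, 3
plan_by_family = rng.integers(0, 4, size=n_fam)
family_id = np.repeat(np.arange(n_fam), fam_size)
plan = plan_by_family[family_id]
family_shock = rng.normal(0, 200, size=n_fam)[family_id]  # within-family correlation
true_effects = np.array([0.0, 300.0, 150.0, 100.0])       # made-up effects
y = 600 + true_effects[plan] + family_shock + rng.normal(0, 400, size=plan.size)

# Design matrix: intercept, plan_free (1), plan_coinsurance (2), plan_deductible (3)
X = np.column_stack([np.ones(plan.size)] + [(plan == j).astype(float) for j in (1, 2, 3)])
n, k = X.shape

# OLS point estimates and residuals
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# Cluster-robust covariance: sum the scores x_i * u_i within each family
scores = X * resid[:, None]
cluster_scores = np.zeros((n_fam, k))
np.add.at(cluster_scores, family_id, scores)
meat = cluster_scores.T @ cluster_scores
bread = np.linalg.inv(X.T @ X)
adj = (n_fam / (n_fam - 1)) * ((n - 1) / (n - k))  # CRV1-style adjustment
V = adj * bread @ meat @ bread

# H0: beta_free - beta_coinsurance = 0, written as c'beta = 0
c = np.array([0.0, 1.0, -1.0, 0.0])
diff = c @ beta
se_diff = np.sqrt(c @ V @ c)
t_stat = diff / se_diff
print(f"free - coinsurance = {diff:.1f}, SE = {se_diff:.1f}, t = {t_stat:.2f}")
```

The key point is that the standard error of the difference depends on the covariance between the two coefficients, not just on their individual standard errors: Var(b1 - b2) = Var(b1) + Var(b2) - 2 Cov(b1, b2). Because both coefficients share the same reference group, that covariance is generally positive, so the test cannot be read off the table of individual t-statistics. For a single restriction, the F-statistic that Stata reports equals the square of this t-statistic.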
- Inpatient vs. outpatient elasticity: Using rand_utilization.csv, compute the implied price elasticity of demand for inpatient vs. outpatient care. Use the free plan coefficient divided by the catastrophic group mean as the numerator (percentage change in quantity), and note that catastrophic plans cover ~5% of costs while free plans cover 100%, so the patient's share of the price falls by 95 percentage points. Which type of care is more price-sensitive?
Code
# --- Load data ---
hie = pd.read_csv(DATA + "ch1/rand_utilization.csv")
# --- Compute elasticities for inpatient and outpatient care ---
# Price change: catastrophic plan covers ~5% (price = 0.95), free covers 100% (price = 0.00)
# Price drop = 0.95 (from 0.95 to 0.00)
price_drop = 0.95
rows = []
for var, label in [("outpatient_expenses", "Outpatient"), ("inpatient_expenses", "Inpatient")]:
    d = hie[[var, "plan_free", "plan_deductible", "plan_coinsurance", "family_id"]].dropna()
    r = pf.feols(f"{var} ~ plan_free + plan_deductible + plan_coinsurance", data=d, vcov={"CRV1": "family_id"})
    cat_mean = r.coef()["Intercept"]  # catastrophic group mean (baseline spending)
    free_effect = r.coef()["plan_free"]  # absolute increase from free plan
    pct_change_q = free_effect / cat_mean  # proportional change in spending (proxy for quantity)
    # Response per unit drop in the patient's price share, reported as a positive number.
    # This is a simple ratio, not a midpoint (arc) elasticity.
    elasticity = pct_change_q / price_drop
    rows.append({
        "Care type": label,
        "Catastrophic mean": round(cat_mean),
        "Free plan effect": round(free_effect),
        "% change in quantity": round(pct_change_q * 100, 1),
        "Implied elasticity": round(elasticity, 2),
    })
pd.DataFrame(rows)

| | Care type | Catastrophic mean | Free plan effect | % change in quantity | Implied elasticity |
|---|---|---|---|---|---|
| 0 | Outpatient | 248 | 169 | 68.2 | 0.72 |
| 1 | Inpatient | 388 | 116 | 30.0 | 0.32 |
Stata equivalent:
* --- Implied price elasticity: inpatient vs. outpatient ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch1/rand_utilization.csv", clear
* Price drop from catastrophic (95% cost-sharing) to free (0%)
scalar price_drop = 0.95
* Outpatient elasticity
reg outpatient_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)
scalar cat_mean_out = _b[_cons]
scalar free_effect_out = _b[plan_free]
scalar elast_out = (free_effect_out / cat_mean_out) / price_drop
display "Outpatient elasticity = " elast_out
* Inpatient elasticity
reg inpatient_expenses plan_free plan_deductible plan_coinsurance, cluster(family_id)
scalar cat_mean_in = _b[_cons]
scalar free_effect_in = _b[plan_free]
scalar elast_in = (free_effect_in / cat_mean_in) / price_drop
display "Inpatient elasticity = " elast_in

What the numbers show: Outpatient care has a substantially higher implied elasticity (0.72) than inpatient care (0.32), both reported in absolute value. Patients increase their outpatient spending by a larger percentage (~68%) than their inpatient spending (~30%) when insurance becomes more generous.
Why: Outpatient visits are largely discretionary — patients decide whether to schedule a check-up, seek a second opinion, or visit a specialist. Inpatient care (hospitalizations, surgeries) is typically driven by medical necessity and physician decisions, not patient choice. When the price drops to zero, patients exercise their discretion mainly in the outpatient domain.
What it teaches: This elasticity comparison reveals the mechanism behind moral hazard. The RAND experiment does not just show that free insurance increases spending — it shows where the spending increase concentrates. Policy implications follow directly: if most of the moral hazard comes from discretionary outpatient care, cost-sharing designs that target outpatient visits (like copays for doctor visits) may be more effective at controlling costs than deductibles that apply equally to all services.
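The implied elasticity in this exercise divides the proportional change in spending by the 0.95 drop in the patient's price share. Because the price falls all the way to zero, a common alternative is the midpoint (arc) formula, which measures both changes relative to the average of the two endpoints. The sketch below applies that formula to the rounded values from the output table, so its results are approximate. The helper `arc_elasticity` is a hypothetical name introduced here, and the 0.95 and 0.00 price shares are the exercise's own simplifying assumption.

```python
def arc_elasticity(q0, q1, p0, p1):
    """Midpoint (arc) elasticity: % change in quantity over % change in price,
    each measured relative to the average of the two endpoints."""
    pct_q = (q1 - q0) / ((q0 + q1) / 2)
    pct_p = (p1 - p0) / ((p0 + p1) / 2)
    return pct_q / pct_p


# Rounded values from the output table: (catastrophic mean, free plan effect)
table = {"Outpatient": (248, 169), "Inpatient": (388, 116)}

# Patient's share of the price: 0.95 under catastrophic, 0.00 under free
# (the exercise's simplifying assumption)
p_cat, p_free = 0.95, 0.0

results = {}
for label, (cat_mean, free_effect) in table.items():
    results[label] = arc_elasticity(cat_mean, cat_mean + free_effect, p_cat, p_free)
    print(f"{label}: arc elasticity = {results[label]:.2f}")
```

With these inputs the midpoint formula gives roughly -0.25 for outpatient care and -0.13 for inpatient care. The magnitudes are smaller than the simple ratios because the midpoint formula treats a fall in price from 0.95 to zero as a 200% change, but the ranking is unchanged: outpatient spending remains about twice as price-sensitive as inpatient spending.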
