6. The Wages of Schooling

Open In Colab Return to Book Website

TipLearning Objectives

By the end of this chapter, you will be able to:

  • Estimate the simple OLS return to schooling and explain why it may overstate the causal effect
  • Apply the omitted variables bias (OVB) formula to predict the direction of bias from unobserved ability
  • Explain why randomized experiments are the gold standard but infeasible for schooling
  • Use twin fixed effects to control for shared family and genetic factors
  • Understand how measurement error creates attenuation bias, especially in differenced data
  • Apply instrumental variables (quarter of birth, twin’s report, compulsory schooling laws) to education
  • Use regression discontinuity to test for sheepskin (diploma) effects
  • Illustrate how differences-in-differences exploits policy changes to estimate causal returns
  • Compare estimates across all five methods and assess what the true return to schooling is

This chapter is unique. It applies all five methods from the book — regression, RCTs, IV, RD, and DD — to a single question: does education really cause higher earnings? When different methods agree, we gain confidence. When they disagree, we learn what each method can and cannot do. The chapter builds a methods ladder, starting from the simplest approach and climbing to the most sophisticated, with each step motivated by a limitation of the previous one.

graph TD
    A["THE QUESTION: Does education cause higher earnings?"]
    B["SIMPLE OLS: Naive regression shows about 7% return"]
    C["THE PROBLEM: Ability bias and omitted variables inflate the estimate"]
    D["MULTIPLE REGRESSION: Adding controls helps but cannot fix unobservables"]
    E["QUASI-EXPERIMENTS: IV with twins, QOB, and child labor laws"]
    F["RD: Sheepskin effect tests diploma vs. learning"]
    G["DD: Policy changes provide before-and-after comparisons"]

    A --> B --> C --> D --> E --> F --> G

    style A fill:#3498db,color:#fff
    style B fill:#2c3e50,color:#fff
    style C fill:#c0392b,color:#fff
    style D fill:#e67e22,color:#fff
    style E fill:#8e44ad,color:#fff
    style F fill:#2d8659,color:#fff
    style G fill:#2d8659,color:#fff

    linkStyle 0,1,2,3,4,5 stroke:#888,stroke-width:2px
Figure 6.1: Roadmap: The Methods Ladder for Chapter 6

Key Concepts and Definitions

Ability Bias: The upward bias in OLS estimates of the return to schooling caused by the omission of innate ability. More able people get more education AND earn more, inflating the apparent effect of schooling.

A simple regression shows each year of schooling raises earnings by 7%, but part of this reflects the fact that high-IQ individuals stay in school longer and would earn more regardless.

Like attributing a swimmer’s speed entirely to their swimsuit. Faster swimmers tend to buy better suits, so the suit gets credit for speed that was really due to talent.

Twin Fixed Effects: A strategy that compares outcomes within pairs of identical twins who differ in their education levels. Because twins share genes and family background, differencing eliminates these shared confounders.

If one twin has 16 years of schooling and earns $60,000 while the other twin has 14 years and earns $54,000, the within-pair return is ($6,000 / 2 years) = $3,000 per year.

Like comparing two identical seeds planted in the same soil and climate, but one gets extra fertilizer. Any difference in growth must be due to the fertilizer, since everything else is shared.

Within-Pair Differences: The technique of subtracting one twin’s outcome from the other’s to eliminate all shared characteristics. This transforms the data from levels (how much each twin earns) to differences (how much MORE one twin earns than the other).

\(\Delta Y_f = Y_{twin1} - Y_{twin2}\) and \(\Delta S_f = S_{twin1} - S_{twin2}\). Regressing the wage difference on the schooling difference gives the within-pair return.

Like measuring the height difference between two siblings rather than their individual heights. The difference removes the family’s genetic baseline and isolates the effect of what differed between them.

Measurement Error: Imprecision in the recording of a variable, where reported values differ from true values due to misreporting, rounding, or recall mistakes. In regression, measurement error in the explanatory variable biases the coefficient toward zero.

Twins asked to report their years of education may misremember by a year. This noise dilutes the true variation in schooling and biases the return estimate downward.

Like trying to read a ruler through foggy glasses. The markings are there, but the fog (measurement error) makes it hard to read them precisely, leading you to underestimate the true length.

Signal-to-Noise Ratio: The proportion of the total variation in a variable that reflects true variation (signal) versus measurement error (noise). A low signal-to-noise ratio causes severe attenuation bias.

If true within-twin schooling variation is 1 year but measurement error adds 2 years of noise, the signal-to-noise ratio is low and the twin FE estimate is badly attenuated.

Like trying to hear a whisper in a noisy stadium. The whisper (signal) is real, but the crowd noise overwhelms it. In a quiet room, the same whisper is perfectly clear.

Reliability Ratio: The fraction of the total variance of a variable that is true variance (as opposed to error variance). A reliability ratio of 0.5 means that attenuation bias cuts the coefficient in half.

If self-reported education has a reliability ratio of 0.85, the OLS coefficient is biased toward zero by about 15%. In differenced twin data, the reliability ratio can drop to 0.5, doubling the bias.

Like a scale that is accurate 85% of the time and gives random readings 15% of the time. The more unreliable the scale, the less you can trust its average reading.

Sheepskin Effect: The additional earnings boost associated with completing a degree (earning the diploma) beyond the year-by-year return to education. Named after the sheepskin diplomas were once printed on.

If each year of college raises earnings by 7%, but graduating (year 4 specifically) adds an extra 15% jump, the 15% jump is the sheepskin effect — the value of the credential itself.

Like a loyalty card that gives a free coffee after every 10 purchases. The first 9 stamps (years of schooling) are valuable, but the 10th stamp (the degree) unlocks a bonus reward.

Human Capital Theory: The view that education raises earnings by building productive skills, knowledge, and abilities that make workers more valuable to employers.

An engineering student learns calculus, physics, and design — skills that directly increase her productivity and justify higher pay.

Like sharpening a knife. More education makes the worker a better tool, and employers pay more for a sharper blade.

Signaling Theory: The view that education raises earnings not by increasing skills but by revealing pre-existing ability to employers. The degree serves as a signal that the holder is talented and hardworking.

An employer who cannot directly observe a job candidate’s ability uses a college degree as evidence that the candidate is smart and disciplined enough to complete four years of coursework.

Like a peacock’s tail. The tail does not make the peacock a better flyer — it signals genetic fitness to potential mates. Similarly, a degree may not make you more productive; it signals that you were productive to begin with.

Credential Effect: The earnings premium attributable specifically to holding a diploma or credential, as opposed to the knowledge gained year by year. It is the empirical counterpart of the sheepskin effect.

Clark and Martorell’s RD study found that the Texas high school diploma had almost no credential effect — students just above and below the exam cutoff had similar earnings.

Like a name-brand label on a generic product. The credential effect asks: does the label itself add value, or is it what is inside the box that matters?

Heterogeneous Treatment Effects: The idea that the causal effect of treatment varies across individuals or subgroups, rather than being a single number that applies to everyone.

The return to schooling may be 12% for low-income students but only 5% for high-income students, because education opens doors that were already open for the wealthy.

Like the effect of an umbrella on staying dry. In a light drizzle, the umbrella is barely needed. In a downpour, it is essential. The same treatment has different effects depending on the circumstances.

Convergence of Evidence: The principle that when multiple methods — each with different assumptions, data, and potential biases — all point to similar conclusions, we gain much stronger confidence in the finding than any single method can provide.

OLS, twin FE, twin IV, quarter-of-birth IV, child labor law IV, and DD all estimate the return to schooling at roughly 7–10% per year. No single estimate is definitive, but their agreement is powerful.

Like multiple witnesses to an event all telling the same story from different vantage points. One witness might be mistaken, but if five independent witnesses agree, the story is probably true.

Return to Schooling: The percentage increase in earnings caused by one additional year of education. It is the central parameter this chapter seeks to estimate using multiple methods.

A return of 8% means that one extra year of school causes earnings to increase by 8%, holding all else equal. Over a career, this compounds to a substantial difference.

Like the interest rate on a savings account. Each “deposit” (year of school) earns a return that accumulates over time. An 8% return per year of schooling is a very high-yield investment.

Quarter-of-Birth Instrument: An instrument for years of schooling based on the quarter of the year in which a person was born. Because school entry and compulsory attendance laws interact with birth timing, children born in different quarters accumulate different amounts of schooling.

Children born in Q4 start school slightly younger and must stay longer before reaching the legal dropout age, accumulating about 0.1 extra years of schooling on average.

Like a relay race where some runners start a few steps ahead because of where they line up. The starting position (birth quarter) is essentially random but determines how far they run (years of schooling) before they can step off the track (drop out).

Compulsory Schooling Laws: Government regulations that require children to attend school until a specified minimum age (e.g., 16). These laws create exogenous variation in schooling by forcing some students to stay in school longer than they otherwise would.

A state that raises its minimum school-leaving age from 14 to 16 compels students who would have dropped out at 14 to stay two more years — generating variation that is independent of ability.

Like a mandatory seatbelt law. Some people would buckle up anyway (always-takers), but the law forces additional compliance from people who would not have done it voluntarily — and it is these marginal compliers whose outcomes we can study.

Endogenous Variable: A variable in a regression whose value is determined inside the system being studied, meaning it is correlated with the error term. OLS estimates involving endogenous variables are biased because the variable is not “as good as randomly assigned.”

Years of schooling is endogenous in an earnings regression because unobserved ability affects both schooling and earnings simultaneously.

Like the chicken-and-egg problem. Did the variable cause the outcome, or did they both arise from a common underlying factor? When cause and effect are tangled together, you cannot simply read off the causal relationship.

The Earnings-Education Gradient

College graduates earn roughly twice as much as high school graduates. But how much of that gap reflects the causal effect of education, and how much reflects the fact that people who go to college were going to earn more anyway?

Let’s start with the simplest possible approach: regress earnings on schooling with no controls. This is the bivariate regression:

\[\ln W_i = \alpha + \rho \, S_i + e_i\]

where:

  • \(\ln W_i\) = log weekly earnings for individual \(i\) (lnw)
  • \(S_i\) = years of schooling (s)
  • \(\rho\) = the return to schooling — the percentage increase in earnings per additional year of education
  • \(e_i\) = residual (includes ability, motivation, family background, and everything else we cannot observe)

Because \(\ln W\) is the outcome, the coefficient \(\rho\) has a convenient interpretation: a value of 0.07 means each year of schooling is associated with approximately 7% higher earnings.

Code
# Load clean quarter-of-birth data (Angrist & Krueger 1991, 329k men born 1930-1939)
import pandas as pd
import numpy as np
import pyfixest as pf
# (IV handled by pf.feols with pipe syntax)

GITHUB_DATA_URL = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

qob = pd.read_csv(GITHUB_DATA_URL + "ch6/qob_clean.csv")
qob.head(3)
lnw s qob yob q1 q2 q3 q4 age
0 5.790019 12.0 1 30 1 0 0 0 50.0
1 5.952494 11.0 1 30 1 0 0 0 50.0
2 5.315949 12.0 1 30 1 0 0 0 50.0
Code
# Bivariate OLS: the simplest possible regression
bivariate = pf.feols("lnw ~ s", data=qob, vcov="hetero")

# Display results
pd.DataFrame({
    "Variable": bivariate.coef().index,
    "Coefficient": bivariate.coef().round(4).values,
    "Std. Error": bivariate.se().round(4).values,
    "t-statistic": bivariate.tstat().round(2).values,
    "p-value": bivariate.pvalue().round(3).values,
})
Table 6.1: Simple bivariate regression of log weekly earnings on years of schooling. No controls — just the raw correlation.
Variable Coefficient Std. Error t-statistic p-value
0 Intercept 4.9952 0.0051 984.49 0.0
1 s 0.0709 0.0004 185.95 0.0

Each additional year of schooling is associated with about 7% higher weekly earnings. With 329,000 observations, the estimate is extremely precise.

Code
import matplotlib.pyplot as plt

# Compute mean earnings by schooling level
binned = qob.groupby("s")["lnw"].mean().reset_index()

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(binned["s"], binned["lnw"], color="black", s=40, zorder=5)
ax.plot(binned["s"], binned["lnw"], "k-", alpha=0.4)
ax.set_xlabel("Years of Schooling")
ax.set_ylabel("Mean Log Weekly Earnings")
ax.set_title("The Earnings-Education Gradient")
plt.tight_layout()
plt.show()
Figure 6.2: Average log weekly earnings by years of schooling. The positive gradient is clear, but does it reflect causation or selection?

But is this causal? This is the ability bias problem. Smarter people stay in school longer AND earn more — both independently. If we don’t account for ability, OLS overstates the true causal return.

\[\hat{\rho}_{OLS} = \rho + \underbrace{\text{Ability bias}}_{\text{likely positive}}\]

where \(\rho\) is the true causal return to schooling and \(\hat{\rho}_{OLS}\) is the OLS estimate. If more able people get more education and earn more (for reasons unrelated to school), the OLS coefficient captures both effects.

But is ability bias necessarily upward? The answer is not obvious:

  • Arguments for upward bias (the standard view): Higher IQ → stay in school longer AND earn more. Schools select on test scores. More-educated parents invest more in their children’s education. All of these create positive correlation between ability and schooling.
  • Arguments for downward bias (the contrarian view): Some highly talented people leave school early to pursue lucrative opportunities. Bill Gates, Mark Zuckerberg, and Steve Jobs dropped out of college; Mick Jagger left the London School of Economics to form the Rolling Stones. If such exceptional ability is negatively correlated with schooling, OLS could actually understate the true return.
  • For most people, the standard view probably holds: the college-dropout billionaires are rare exceptions. But the ambiguity is important because it means we cannot assume the direction of OLS bias without evidence.

To understand how much of this 7% estimate is causal, we need to think carefully about omitted variables.

Multiple Regression and the OVB Problem

Adding Controls

Chapter 2 taught us that adding control variables can reduce omitted variables bias — as long as the controls are not “bad controls” (caused by the treatment). The long regression adds observable controls to the bivariate equation:

\[\ln W_i = \alpha + \rho \, S_i + \gamma_1 \, \text{Age}_i + \gamma_2 \, \text{Age}_i^2 + \gamma_3 \, \text{Female}_i + \gamma_4 \, \text{White}_i + e_i\]

where \(S_i\) is years of education (educ), and the \(\gamma\) coefficients capture the effects of age (age, age2), gender (female), and race (white). If ability remains in \(e_i\), then \(\hat{\rho}\) is still biased — adding observable controls only helps if they capture the omitted confounders. Let’s see this using the Twinsburg twins data.

Code
# Load clean twins data (340 twin pairs from Twinsburg, Ohio)
twins = pd.read_csv(GITHUB_DATA_URL + "ch6/twins_clean.csv")

# Key variables:
#   lwage  = log weekly wage; educ = own years of education
#   educt_t = twin's report of respondent's education (instrument)
#   first  = 1 for the first twin in each pair (use to avoid double-counting)
#   dlwage = within-pair difference in log wages; deduc = difference in own-reported education
#   deduct = difference in twin's report of education (instrument for deduc)
twins.head(3)
lwage educ educt_t age age2 female white first dlwage deduc deduct
0 2.479523 16.0 16.0 33.251190 11.056416 1.0 1.0 1.0 0.259346 0.0 0.0
1 2.220177 16.0 16.0 33.251190 11.056416 1.0 1.0 NaN -0.259346 0.0 0.0
2 2.228209 12.0 12.0 43.570145 18.983576 1.0 1.0 NaN -0.721318 -6.0 -4.0
Code
# OLS: regress log wages on education with demographic controls
ols = pf.feols("lwage ~ educ + age + age2 + female + white", data=twins, vcov="hetero")

# Extract key regression results into a clear table
pd.DataFrame({
    "Variable": ols.coef().index,
    "Coefficient": ols.coef().round(4).values,
    "Std. Error": ols.se().round(4).values,
    "t-statistic": ols.tstat().round(2).values,
    "p-value": ols.pvalue().round(3).values,
})
Table 6.2: OLS return to schooling using the Twinsburg twins data. Controls include age, age-squared, gender, and race.
Variable Coefficient Std. Error t-statistic p-value
0 Intercept -1.0949 0.2924 -3.74 0.000
1 educ 0.1100 0.0105 10.50 0.000
2 age 0.1039 0.0120 8.67 0.000
3 age2 -0.1063 0.0147 -7.23 0.000
4 female -0.3180 0.0399 -7.97 0.000
5 white -0.1001 0.0682 -1.47 0.143

The OLS return with controls is about 11% per year of schooling. Note that this estimate uses a different dataset (the Twinsburg twins) than the bivariate regression above (the 1980 Census). The higher estimate partly reflects the different sample. Even within this dataset, adding demographic controls does not substantially reduce the schooling coefficient — because these observables explain little of the ability-education correlation.

The OVB Formula Applied to Schooling

From Chapter 2, the omitted variables bias formula is:

\[\text{OVB} = \underbrace{\gamma}_{\text{effect of ability on earnings}} \times \underbrace{\pi_1}_{\text{correlation of ability with schooling}}\]

For schooling:

  • \(\gamma > 0\): More able people earn more, holding schooling constant
  • \(\pi_1 > 0\): More able people get more education

Therefore \(\text{OVB} > 0\), and OLS overstates the true return. The short regression (without ability) gives a coefficient that is too large.

Seeing OVB in Action

To see this concretely, we use a synthetic dataset where we know the true causal return (about 0.09 per year) because we designed the data-generating process.

NoteSynthetic data

This dataset was designed to illustrate the OVB concept. It contains 2,000 simulated individuals with schooling, earnings, unobserved ability, and occupation. The true total causal return to schooling is about 0.09 (9%) per year — combining a direct effect on earnings and an indirect effect through occupation.

Code
# Load synthetic OVB data (2000 simulated individuals)
ovb = pd.read_csv(GITHUB_DATA_URL + "ch6/synthetic_ovb.csv")

# Short regression: omit ability (like real life — we can't observe ability)
short_reg = pf.feols("earnings ~ schooling", data=ovb, vcov="hetero")

# Long regression: include ability (the "oracle" regression we can't run with real data)
long_reg = pf.feols("earnings ~ schooling + ability", data=ovb, vcov="hetero")

# Compare coefficients
pd.DataFrame({
    "Specification": ["Short (omit ability)", "Long (include ability)", "True causal return"],
    "Schooling coefficient": [
        f"{short_reg.coef()['schooling']:.4f} ({short_reg.se()['schooling']:.4f})",
        f"{long_reg.coef()['schooling']:.4f} ({long_reg.se()['schooling']:.4f})",
        "0.0900",
    ],
})
Table 6.3: OVB demonstration: the short regression (without ability) overstates the true return. The long regression (with ability) recovers it.
Specification Schooling coefficient
0 Short (omit ability) 0.1229 (0.0024)
1 Long (include ability) 0.0853 (0.0036)
2 True causal return 0.0900

The short regression gives about 0.12 — overstating the true return by roughly 40%. Adding the unobservable ability recovers the true total causal return (about 0.09). In real life, we cannot observe ability, so we need a different strategy.

Bad Controls: Do Not Control for Occupation

WarningBad controls

Occupation is caused by education — it is a “bad control” (a mediator or post-treatment variable). Controlling for it absorbs part of the causal effect of education on earnings, biasing the estimate downward. If education raises earnings partly by giving access to higher-paying occupations, then holding occupation constant removes that channel.

Code
# Bad control: add occupation (which is caused by schooling)
bad_control = pf.feols("earnings ~ schooling + occupation", data=ovb, vcov="hetero")

pd.DataFrame({
    "Specification": ["Without occupation", "With occupation (bad control)", "True causal return"],
    "Schooling coefficient": [
        f"{short_reg.coef()['schooling']:.4f} ({short_reg.se()['schooling']:.4f})",
        f"{bad_control.coef()['schooling']:.4f} ({bad_control.se()['schooling']:.4f})",
        "0.0900",
    ],
})
Table 6.4: Adding occupation (a bad control) biases the schooling return downward by absorbing part of the causal pathway.
Specification Schooling coefficient
0 Without occupation 0.1229 (0.0024)
1 With occupation (bad control) 0.1002 (0.0032)
2 True causal return 0.0900

Adding occupation shrinks the schooling coefficient — not because it reduces bias, but because it removes a real causal channel. Rule of thumb: never control for variables that are consequences of the treatment.

Even with good controls, we can never be sure we have accounted for all confounders. What we really want is a research design — like the randomized experiments in Chapter 1 — that eliminates ability bias by construction.

Why Not an RCT?

The gold standard for causal inference is the randomized controlled trial: randomly assign some people to get more education, and compare their earnings to a control group. Random assignment breaks the link between ability and education, so the simple difference in means is causal.

But we cannot randomize education. It would be unethical and impractical to force some people to drop out and others to stay in school for 20 years.

To see why random assignment works, consider the regression of earnings on the randomly assigned scholarship:

\[\ln W_i = \alpha + \beta \, P_i + e_i\]

where \(P_i\) = 1 if individual \(i\) received a scholarship (scholarship). Because \(P_i\) is randomly assigned, it is uncorrelated with ability in \(e_i\), so \(\hat{\beta}\) is unbiased. The per-year causal return is then:

\[\hat{\rho} = \frac{\hat{\beta}}{\text{Schooling difference}} = \frac{\text{Reduced form}}{\text{First stage}}\]

This is the Wald estimator from Chapter 3, applied to the scholarship “instrument.” We use a synthetic dataset to demonstrate this.

NoteSynthetic data

This dataset simulates a hypothetical scholarship experiment. 2,000 individuals were randomly assigned to receive a scholarship (or not). The scholarship increases schooling by about 2 years. The true causal return to schooling is 0.08 per year, so the scholarship should increase earnings by about 0.16.

Code
# Load synthetic RCT data
rct = pd.read_csv(GITHUB_DATA_URL + "ch6/synthetic_rct.csv")

# Simple comparison of means by scholarship status
means = rct.groupby("scholarship")[["schooling", "earnings"]].mean()
diff_school = means.loc[1, "schooling"] - means.loc[0, "schooling"]
diff_earn = means.loc[1, "earnings"] - means.loc[0, "earnings"]

# Regression of earnings on scholarship (= difference in means)
rct_reg = pf.feols("earnings ~ scholarship", data=rct, vcov="hetero")

# Wald estimate: earnings effect / schooling effect = per-year return
wald_rct = diff_earn / diff_school

pd.DataFrame({
    "Quantity": [
        "Scholarship → schooling (first stage)",
        "Scholarship → earnings (reduced form)",
        "Per-year return (reduced form / first stage)",
    ],
    "Estimate": [
        f"{diff_school:.3f} years",
        f"{diff_earn:.4f} log points",
        f"{wald_rct:.4f}",
    ],
})
Table 6.5: Synthetic scholarship RCT: random assignment eliminates ability bias, so a simple comparison of means recovers the causal effect.
Quantity Estimate
0 Scholarship → schooling (first stage) 1.958 years
1 Scholarship → earnings (reduced form) 0.1605 log points
2 Per-year return (reduced form / first stage) 0.0820

With random assignment, the simple difference in earnings between scholarship and non-scholarship groups gives an unbiased estimate of the causal effect. Dividing by the schooling difference gives the per-year return — close to the true 0.08.

ImportantThe lesson

The RCT recovers the right answer because random assignment makes the scholarship independent of ability. Since the scholarship affects earnings only through schooling, the Wald ratio (earnings effect / schooling effect) gives an unbiased per-year return. But since we cannot run a real schooling RCT, we need quasi-experimental methods that approximate random assignment. The rest of this chapter explores four such strategies.

Strategy 1: Twin Comparisons

The Logic

Identical twins share genes and family upbringing — the very factors we suspect drive ability bias. If one twin gets more education than the other, the earnings difference within the pair reflects the causal return, not ability.

Within-Twin Differences

By taking the difference within each twin pair, we eliminate everything shared between them:

\[\Delta Y_f = \rho \cdot \Delta S_f + \Delta e_f\]

where \(\Delta Y_f\) is the difference in log wages (dlwage) and \(\Delta S_f\) is the difference in years of education (deduc) within twin pair \(f\). Shared ability cancels out because both twins have the same value.

Code
# Use only the first twin in each pair (to avoid double-counting)
first = twins[twins["first"] == 1]

# Regress wage difference on education difference
# The "- 1" removes the intercept: when both twins have the same education,
# we expect zero wage difference, so there's no constant term needed
twin_fe = pf.feols("dlwage ~ deduc - 1", data=first, vcov="hetero")

# Extract key regression results into a clear table
pd.DataFrame({
    "Variable": twin_fe.coef().index,
    "Coefficient": twin_fe.coef().round(4).values,
    "Std. Error": twin_fe.se().round(4).values,
    "t-statistic": twin_fe.tstat().round(2).values,
    "p-value": twin_fe.pvalue().round(3).values,
})
Table 6.6: Within-twin estimate of the return to schooling. Differencing eliminates shared genetic and family factors.
Variable Coefficient Std. Error t-statistic p-value
0 deduc 0.0617 0.0198 3.12 0.002

The twin estimate drops to about 6% — nearly half the OLS estimate. This suggests ability bias pushes OLS upward.

WarningCommon Misconception: A lower estimate is not necessarily a better estimate

Twin FE gives 0.06. OLS gives 0.11. Students often assume the lower number must be “more correct.”

This is wrong. Twin FE has its own bias: measurement error amplification.

Here’s why: twins report their own education. Small errors (misremembering a year) get amplified by differencing. The true within-pair variation in schooling is tiny. So even small errors dominate the signal.

Result: This attenuation bias pushes the twin estimate below the true return.

Formally, measurement error biases the twin FE coefficient by the reliability ratio:

\[\hat{\rho}_{FE} \approx \rho \times \underbrace{\frac{\text{Var}(\Delta S^*_f)}{\text{Var}(\Delta S^*_f) + \text{Var}(\Delta m_f)}}_{\text{reliability ratio}}\]

where \(\Delta S^*_f\) is the true within-twin difference in schooling and \(\Delta m_f\) is the measurement error in the differenced data. When twins are very similar, \(\text{Var}(\Delta S^*)\) is small but \(\text{Var}(\Delta m)\) stays the same size, so the reliability ratio drops well below 1 and \(\hat{\rho}_{FE}\) is attenuated toward zero. A reliability ratio of 0.5 would cut the estimate in half.

IV: Using the Twin’s Report as an Instrument

The twin estimate may be biased downward by measurement error in self-reported education. If twins misremember their schooling, the differenced data amplifies noise relative to signal.

The fix: use each twin’s report of the other’s education as an instrument. This report is correlated with true education but has independent measurement error, so it satisfies the IV requirements. (This assumes twins do not simply agree on inaccurate reports. If twins discuss their education and reach consensus, their measurement errors may be correlated, weakening the IV correction.)

NoteReading the pyfixest IV formula syntax

In pyfixest, the IV formula uses a pipe (|) to specify the endogenous variable and its instrument:

  • | educ ~ educt_t means: educ is the endogenous variable, instrumented by educt_t
  • Controls go before the first |; fixed effects (if any) go between the first and second |
  • vcov="hetero" gives heteroskedasticity-robust standard errors (HC1)

The twin IV corrects measurement error using two stages. In the within-pair (differenced) version:

First stage: Predict own-reported schooling difference using the twin’s report

\[\Delta S_f = \pi_0 + \pi_1 \, \Delta S^{twin}_f + v_f\]

Second stage: Regress wage difference on the predicted schooling difference

\[\Delta Y_f = \rho_{IV} \, \widehat{\Delta S}_f + u_f\]

where \(\Delta S^{twin}_f\) is the difference in the twin’s report of the other’s education (deduct), and \(\widehat{\Delta S}_f\) is the fitted value from the first stage. Because the twin’s report has independent measurement error, it filters out the noise in own-reported education, correcting the attenuation bias.

Code
# IV in levels: instrument own education (educ) with twin's report (educt_t)
iv_levels = pf.feols("lwage ~ 1 + age + age2 + female + white | educ ~ educt_t", data=twins, vcov="hetero")

# IV in differences: instrument own-reported difference (deduc) with twin's report diff (deduct)
first_iv = first[["dlwage", "deduc", "deduct"]].dropna()
iv_diff = pf.feols("dlwage ~ 0 | deduc ~ deduct", data=first_iv, vcov="hetero")

# Combine all four estimates into one table
ols_coef = round(ols.coef()["educ"], 3)
ols_se = round(ols.se()["educ"], 3)
fe_coef = round(twin_fe.coef()["deduc"], 3)
fe_se = round(twin_fe.se()["deduc"], 3)
iv_lev_coef = round(iv_levels.coef()["educ"], 3)
iv_lev_se = round(iv_levels.se()["educ"], 3)
iv_dif_coef = round(iv_diff.coef()["deduc"], 3)
iv_dif_se = round(iv_diff.se()["deduc"], 3)

pd.DataFrame({
    "Method": ["OLS (levels)", "Twin FE (differences)", "IV (levels)", "IV (differences)"],
    "Return to schooling": [
        format(ols_coef, ".3f") + " (" + format(ols_se, ".3f") + ")",
        format(fe_coef, ".3f") + " (" + format(fe_se, ".3f") + ")",
        format(iv_lev_coef, ".3f") + " (" + format(iv_lev_se, ".3f") + ")",
        format(iv_dif_coef, ".3f") + " (" + format(iv_dif_se, ".3f") + ")",
    ],
})
Table 6.7: IV estimates using twin’s report of education as instrument. Corrects for measurement error in self-reported schooling.
Method Return to schooling
0 OLS (levels) 0.110 (0.010)
1 Twin FE (differences) 0.062 (0.020)
2 IV (levels) 0.116 (0.011)
3 IV (differences) 0.108 (0.034)
ImportantWhat the twin results tell us
Method Estimate Interpretation
OLS ~0.11 Likely biased UP by ability
Twin FE ~0.06 Biased DOWN by measurement error
IV (levels) ~0.12 Corrects measurement error in levels
IV (differences) ~0.11 Corrects measurement error in differences

The true return is probably 8–11% per year, with OLS slightly overstating and twin FE understating due to different biases.

NoteIntuition Builder: The Bathroom Scale Analogy

Imagine weighing yourself on a bathroom scale that randomly adds or subtracts 5 pounds. On average, the scale is right — but any single reading is noisy. Now suppose you weigh yourself in the morning and evening to measure how much weight you gained during the day. The true gain might be 0.5 lbs, but the scale’s error (±5 lbs in each reading) means the difference between readings is dominated by noise. This is exactly what happens with twin differences in education: the true within-pair variation is small (twins are similar), but measurement error stays the same size, so noise overwhelms the signal.

Lessons from the twins strategy:

  • Twin FE controls for shared ability but amplifies measurement error — two biases push in opposite directions
  • IV using the twin’s report corrects measurement error, recovering a return near 11%
  • Limitation: The Twinsburg twins are a self-selected sample (twins who attend an annual twin festival in Ohio). They may not represent the general population

The twins approach offered a first crack at ability bias but raised a new concern: measurement error. Our next strategy sidesteps both problems by finding a source of schooling variation that is entirely independent of ability — and precisely measured in census data.

Strategy 2: Quarter-of-Birth IV

Research question: What is the causal return to an additional year of schooling, using a source of variation that is independent of ability?

The data: Angrist and Krueger (1991) used the 1980 U.S. Census, extracting 329,509 men born between 1930 and 1939. The outcome is log weekly earnings (lnw). Schooling is measured in years (s). Quarter of birth (qob, 1–4) serves as the instrument.

The Idea

Compulsory schooling laws allow students to drop out at age 16. Because school-entry rules are based on birth date cutoffs, children born later in the year start school younger and accumulate more schooling before reaching the dropout age.

This creates an instrument: quarter of birth affects schooling (through compulsory attendance rules) but should not directly affect earnings.

The IV strategy has three equations:

First stage (instrument predicts schooling):

\[S_i = \alpha_1 + \phi \, Q4_i + e_{1i}\]

Reduced form (instrument predicts earnings directly):

\[\ln W_i = \alpha_2 + \rho_{RF} \, Q4_i + e_{2i}\]

Wald estimator (ratio gives the causal return):

\[\hat{\rho}_{IV} = \frac{\hat{\rho}_{RF}}{\hat{\phi}} = \frac{\text{Effect of } Q4 \text{ on earnings}}{\text{Effect of } Q4 \text{ on schooling}}\]

where \(Q4_i\) = 1 if individual \(i\) was born in the fourth quarter (q4), \(S_i\) is years of schooling (s), and \(\ln W_i\) is log weekly earnings (lnw). This estimate is a LATE (Local Average Treatment Effect): it applies only to compliers whose schooling was changed by compulsory attendance interacting with their birth quarter.

The IV Recipe: Step by Step

Code
# Step 1: Reduced form — does Q4 birth predict higher earnings?
rf = pf.feols("lnw ~ q4", data=qob, vcov="hetero")

# Step 2: First stage — does Q4 birth predict more schooling?
fs = pf.feols("s ~ q4", data=qob, vcov="hetero")

# Step 3: Wald estimate = reduced form / first stage
wald = rf.coef()["q4"] / fs.coef()["q4"]

# Step 4: Verify with 2SLS
iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")

# Extract coefficients and standard errors
rf_coef = round(rf.coef()["q4"], 4)
rf_se = round(rf.se()["q4"], 4)
fs_coef = round(fs.coef()["q4"], 4)
fs_se = round(fs.se()["q4"], 4)
wald_rounded = round(wald, 4)
iv_coef = round(iv.coef()["s"], 4)
iv_se = round(iv.se()["s"], 4)

pd.DataFrame({
    "Step": ["Reduced form (Q4 → earnings)", "First stage (Q4 → schooling)",
             "Wald estimate (RF / FS)", "2SLS verification"],
    "Estimate": [
        format(rf_coef, ".4f") + " (" + format(rf_se, ".4f") + ")",
        format(fs_coef, ".4f") + " (" + format(fs_se, ".4f") + ")",
        format(wald_rounded, ".4f"),
        format(iv_coef, ".4f") + " (" + format(iv_se, ".4f") + ")",
    ],
})
Table 6.8: The IV recipe (Wald estimate): reduced form divided by first stage gives the causal return to schooling.
Step Estimate
0 Reduced form (Q4 → earnings) 0.0068 (0.0027)
1 First stage (Q4 → schooling) 0.0921 (0.0132)
2 Wald estimate (RF / FS) 0.0740
3 2SLS verification 0.0740 (0.0280)

The reduced form shows that Q4 births earn slightly more. The first stage shows they get about 0.09 more years of schooling. Dividing gives the Wald estimate of about 7% per year — which the 2SLS verification confirms.

Visualizing the First Stage and Reduced Form

Code
# Collapse to cell means by age (= birth cohort)
cell = qob.groupby("age").agg(s=("s","mean"), lnw=("lnw","mean"),
                                q4=("q4","mean"), q1=("q1","mean")).reset_index()
cell["yob"] = 80 - cell["age"]
cell["is_q4"] = cell["q4"] > 0.5
cell["is_q1"] = cell["q1"] > 0.5

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
ax.plot(cell["yob"], cell["s"], "k-", alpha=0.4)
ax.scatter(cell.loc[cell["is_q4"], "yob"], cell.loc[cell["is_q4"], "s"],
           color="black", s=50, zorder=5, label="Quarter 4")
ax.scatter(cell.loc[cell["is_q1"], "yob"], cell.loc[cell["is_q1"], "s"],
           facecolors="none", edgecolors="black", s=50, zorder=5, label="Quarter 1")
ax.set_xlabel("Year of Birth")
ax.set_ylabel("Years of Education")
ax.set_title("First Stage")
ax.legend()

ax = axes[1]
ax.plot(cell["yob"], cell["lnw"], "k-", alpha=0.4)
ax.scatter(cell.loc[cell["is_q4"], "yob"], cell.loc[cell["is_q4"], "lnw"],
           color="black", s=50, zorder=5, label="Quarter 4")
ax.scatter(cell.loc[cell["is_q1"], "yob"], cell.loc[cell["is_q1"], "lnw"],
           facecolors="none", edgecolors="black", s=50, zorder=5, label="Quarter 1")
ax.set_xlabel("Year of Birth")
ax.set_ylabel("Log Weekly Earnings")
ax.set_title("Reduced Form")
ax.legend()

plt.tight_layout()
plt.show()
Figure 6.3: First stage (left) and reduced form (right). Q4 births get slightly more schooling and slightly higher earnings.
NoteWho are the compliers?

The QOB IV estimate is a LATE (Local Average Treatment Effect) — it applies only to compliers, people whose schooling was actually changed by their quarter of birth interacting with compulsory schooling laws. Compliers are students at the dropout threshold. Students who would have attended college regardless (always-takers) or those who drop out very early (never-takers) are not affected by the instrument.

The quarter-of-birth IV uses a clever natural experiment, but it relies on a single source of variation. Our next strategy uses a different set of instruments — compulsory schooling laws that vary across states.

Strategy 3: Child Labor Law IV

Research question: Do compulsory schooling laws that forced children to enter school by certain ages provide another valid instrument for estimating the return to education?

The data: Acemoglu and Angrist used data on compulsory schooling laws that varied across U.S. states. Three instruments capture whether a state required children to enter school by age 7 (cl7), 8 (cl8), or 9 (cl9). The data has been collapsed to state-of-birth × year-of-birth × census-year cell means.

With multiple instruments and fixed effects, the 2SLS framework is:

First stage: Predict schooling using the three compulsory schooling instruments

\[S_{scy} = \alpha_1 + \phi_1 \, CL7_{sc} + \phi_2 \, CL8_{sc} + \phi_3 \, CL9_{sc} + \beta_s + \gamma_c + \delta_y + v_{scy}\]

Second stage: Regress earnings on the predicted schooling

\[\ln W_{scy} = \alpha_2 + \rho_{IV} \, \hat{S}_{scy} + \beta_s + \gamma_c + \delta_y + u_{scy}\]

where:

  • \(CL7\), \(CL8\), \(CL9\) = indicators for compulsory school entry by age 7, 8, or 9 (cl7, cl8, cl9)
  • \(\beta_s\) = state-of-birth fixed effects (| sob)
  • \(\gamma_c\) = year-of-birth cohort effects (| yob)
  • \(\delta_y\) = census-year effects (| year)
  • \(\hat{S}_{scy}\) = the fitted value of schooling from the first stage
  • \(\rho_{IV}\) = the causal return to schooling, identified by the three instruments jointly
Code
# Load child labor law data (collapsed cell means, ~2400 observations)
cl = pd.read_csv(GITHUB_DATA_URL + "ch6/childlabor_clean.csv")
cl.head(3)
sob yob year lnwkwage indEduc cl7 cl8 cl9 weight n
0 1 1900 1950 3.772292 7.428592 0 0 0 6862.0 21
1 1 1901 1950 3.752198 8.464196 0 0 0 9147.0 28
2 1 1902 1950 4.129656 7.740741 0 0 0 8829.0 27
Code
# First stage: do child labor laws predict education?
fs_cl = pf.feols("indEduc ~ cl7 + cl8 + cl9 | sob + yob + year",
                 data=cl, weights="weight", vcov={"CRV1": "sob"})

# Joint F-test on instruments (Wald test)
import numpy as np
coefs = np.array([fs_cl.coef()['cl7'], fs_cl.coef()['cl8'], fs_cl.coef()['cl9']])
idx = [list(fs_cl.coef().index).index(v) for v in ['cl7', 'cl8', 'cl9']]
V_sub = fs_cl._vcov[np.ix_(idx, idx)]
f_stat = float(coefs @ np.linalg.inv(V_sub) @ coefs / 3)

pd.DataFrame({
    "Instrument": ["cl7 (enter by age 7)", "cl8 (enter by age 8)", "cl9 (enter by age 9)", "Joint F-statistic"],
    "Coefficient": [
        f"{fs_cl.coef()['cl7']:.4f} ({fs_cl.se()['cl7']:.4f})",
        f"{fs_cl.coef()['cl8']:.4f} ({fs_cl.se()['cl8']:.4f})",
        f"{fs_cl.coef()['cl9']:.4f} ({fs_cl.se()['cl9']:.4f})",
        f"{f_stat:.2f}",
    ],
})
Table 6.9: First stage: compulsory schooling laws jointly predict education. The F-statistic tests instrument relevance.
Instrument Coefficient
0 cl7 (enter by age 7) 0.1690 (0.0674)
1 cl8 (enter by age 8) 0.1872 (0.0621)
2 cl9 (enter by age 9) 0.3973 (0.0984)
3 Joint F-statistic 6.06
Code
# OLS with fixed effects
ols_cl = pf.feols("lnwkwage ~ indEduc | sob + yob + year",
                  data=cl, weights="weight", vcov={"CRV1": "sob"})

# IV/2SLS with fixed effects and multiple instruments
iv_result = pf.feols("lnwkwage ~ 1 | sob + yob + year | indEduc ~ cl7 + cl8 + cl9",
                     data=cl, weights="weight", vcov={"CRV1": "sob"})

pd.DataFrame({
    "Method": ["OLS (with state, YOB, year FE)", "IV/2SLS (child labor law instruments)"],
    "Return to schooling": [
        f"{ols_cl.coef()['indEduc']:.4f} ({ols_cl.se()['indEduc']:.4f})",
        f"{iv_result.coef()['indEduc']:.4f} ({iv_result.se()['indEduc']:.4f})",
    ],
})
Table 6.10: OLS vs. IV estimates of the return to schooling using compulsory schooling laws as instruments.
Method Return to schooling
0 OLS (with state, YOB, year FE) 0.0698 (0.0066)
1 IV/2SLS (child labor law instruments) 0.1288 (0.0371)
ImportantWhat the child labor law results tell us

The OLS estimate with fixed effects gives about 7%, while the IV estimate is larger at about 13%. The IV estimate is less precise than the QOB results, in part because the first-stage F-statistic is below 10 — a sign of weak instruments that can inflate IV estimates. Despite the imprecision, the results are broadly consistent with the other IV strategies in pointing to a causal return that is at least as large as the OLS estimate.

Both the twins and IV strategies estimate the overall return to education. But they leave open a deeper question: does education raise earnings because of the skills you learn, or because employers value the diploma?

Strategy 4: Sheepskin Effects via RD

Research question: Does the diploma credential itself boost earnings (the signaling view), or is it the skills learned in school that matter (the human capital view)?

The data: Clark and Martorell (2014) studied the Texas high school exit exam. The data consists of 46 score bins around the passing cutoff. The running variable is the test score relative to the passing threshold (minscore, where 0 = cutoff).

Why RD works here: Students who scored just above vs. just below the cutoff have nearly identical skills but very different diploma rates. Any jump in earnings at the cutoff reflects the value of the diploma credential itself.

The RD regression estimates the jump at the passing threshold:

\[Y_i = \alpha + \rho \, D_i + f(\text{Score}_i) + e_i\]

where:

  • \(Y_i\) = average annual earnings (avgearnings) or diploma receipt (receivehsd)
  • \(D_i\) = 1 if the student passed the last-chance exam (pass_exam)
  • \(f(\text{Score}_i)\) = a flexible polynomial in the test score relative to the cutoff (minscore), fitted separately on each side of the threshold
  • \(\rho\) = the sheepskin effect — the jump in the outcome at the passing threshold

If \(\rho\) is large for earnings, the diploma itself has value (signaling). If \(\rho \approx 0\), the diploma credential adds little beyond the skills already reflected in the score.

Code
# Load clean sheepskin RD data (Texas last-chance exam)
sheep = pd.read_csv(GITHUB_DATA_URL + "ch6/sheepskin_clean.csv")
sheep.head(3)
minscore pass_exam receivehsd avgearnings n person_years left_1 right_1 left_2 right_2 left_3 right_3 left_4 right_4
0 -30.0 0 0.416667 11845.086 12 24.0 -30.0 -0.0 900.0 0.0 -27000.0 -0.0 810000.0 0.0
1 -29.0 0 0.387097 9205.679 31 104.0 -29.0 -0.0 841.0 0.0 -24389.0 -0.0 707281.0 0.0
2 -28.0 0 0.318182 8407.745 44 146.0 -28.0 -0.0 784.0 0.0 -21952.0 -0.0 614656.0 0.0
Code
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# --- Panel 1: Diploma receipt ---
ax = axes[0]
ax.scatter(sheep["minscore"], sheep["receivehsd"], color="black", s=20, alpha=0.6)

left = sheep[sheep["minscore"] < 0]
right = sheep[sheep["minscore"] >= 0]

fit_l = pf.feols("receivehsd ~ pass_exam + left_1 + left_2 + left_3 + left_4", data=left, weights="n")
fit_r = pf.feols("receivehsd ~ pass_exam + right_1 + right_2 + right_3 + right_4", data=right, weights="n")

left_plot = sheep[sheep["minscore"] <= 0].copy()
left_plot["fit"] = fit_l.predict(newdata=left_plot)
right_plot = sheep[sheep["minscore"] >= 0].copy()
right_plot["fit"] = fit_r.predict(newdata=right_plot)

ax.plot(left_plot["minscore"], left_plot["fit"], "k-", linewidth=2)
ax.plot(right_plot["minscore"], right_plot["fit"], "k-", linewidth=2)
ax.axvline(x=0, color="red", linestyle="--", alpha=0.5)
ax.set_xlabel("Test score relative to cutoff")
ax.set_ylabel("Fraction receiving diploma")
ax.set_title("Diploma Receipt (First Stage)")
ax.set_xlim(-30, 15)
ax.set_ylim(0, 1)

# --- Panel 2: Earnings ---
ax = axes[1]
ax.scatter(sheep["minscore"], sheep["avgearnings"], color="black", s=20, alpha=0.6)

earn_l = pf.feols("avgearnings ~ pass_exam + left_1 + left_2 + left_3 + left_4",
                  data=left[left["minscore"] >= -30], weights="person_years")
earn_r = pf.feols("avgearnings ~ pass_exam + right_1 + right_2 + right_3 + right_4", data=right, weights="person_years")

left_earn = sheep[sheep["minscore"] <= 0].copy()
left_earn["fit"] = earn_l.predict(newdata=left_earn)
right_earn = sheep[sheep["minscore"] >= 0].copy()
right_earn["fit"] = earn_r.predict(newdata=right_earn)

ax.plot(left_earn["minscore"], left_earn["fit"], "k-", linewidth=2)
ax.plot(right_earn["minscore"], right_earn["fit"], "k-", linewidth=2)
ax.axvline(x=0, color="red", linestyle="--", alpha=0.5)
ax.set_xlabel("Test score relative to cutoff")
ax.set_ylabel("Annual Earnings ($)")
ax.set_title("Earnings (Reduced Form)")
ax.set_xlim(-30, 15)

plt.tight_layout()
plt.show()
Figure 6.4: RD at the Texas exam cutoff. Left: diploma receipt jumps sharply. Right: earnings barely change.
ImportantThe sheepskin verdict
  • Diploma receipt jumps by about 40 percentage points at the cutoff (a strong first stage). Since the jump is 40 points rather than 100, this is technically a fuzzy RD — passing the exam increases but does not guarantee diploma receipt
  • Earnings show almost no jump — the RD effect is near zero. Even scaling by the diploma receipt jump (the fuzzy RD Wald estimate), the credential effect remains negligible
  • Most of the education premium reflects actual learning (human capital), not just the piece of paper (signaling)

We have now applied regression, IV, and RD to the schooling question. The final method — differences-in-differences — exploits policy changes over time.

Strategy 5: Differences-in-Differences

Research question: Can we estimate the return to schooling by comparing states that changed their compulsory schooling laws to states that did not?

NoteSynthetic data

This dataset simulates a compulsory schooling reform adopted by 10 out of 20 states in 2005. It was designed to illustrate the DiD concept with clear parallel pre-trends and a visible treatment effect.

The DD estimator compares changes over time across treated and control groups:

\[\delta_{DD} = \underbrace{(\bar{Y}_{treat,after} - \bar{Y}_{treat,before})}_{\text{Change in treated states}} - \underbrace{(\bar{Y}_{control,after} - \bar{Y}_{control,before})}_{\text{Change in control states}}\]

In regression form with state and year fixed effects:

\[Y_{st} = \alpha + \delta \, (\text{Treated}_s \times \text{Post}_t) + \beta_s + \gamma_t + e_{st}\]

where:

  • \(Y_{st}\) = average earnings (avg_earnings) or average schooling (avg_schooling) in state \(s\) at time \(t\)
  • \(\text{Treated}_s \times \text{Post}_t\) = the interaction term (treat_post), equal to 1 for treated states after the reform
  • \(\beta_s\) = state fixed effects (| state) — absorb permanent differences between states
  • \(\gamma_t\) = year fixed effects (| year) — absorb common time trends
  • \(\delta\) = the DD estimate of the reform’s causal effect

The parallel trends assumption requires that treated and control states would have followed the same trajectory absent the reform: \(E[Y_{st}(0) \mid \text{Treated}=1] - E[Y_{st}(0) \mid \text{Treated}=0]\) is constant over time.

Code
# Load synthetic DiD data (20 states × 20 years)
did = pd.read_csv(GITHUB_DATA_URL + "ch6/synthetic_did.csv")
did.head(3)
state year treated post avg_schooling avg_earnings
0 1 1995 1 0 11.4749 9.4303
1 1 1996 1 0 11.4804 9.3323
2 1 1997 1 0 11.4371 9.3873
Code
# Compute group means by year
group_means = did.groupby(["year", "treated"]).agg(
    schooling=("avg_schooling", "mean"),
    earnings=("avg_earnings", "mean"),
).reset_index()

treated_g = group_means[group_means["treated"] == 1]
control_g = group_means[group_means["treated"] == 0]

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

ax = axes[0]
ax.plot(treated_g["year"], treated_g["schooling"], "k-", linewidth=2, label="Treated states")
ax.plot(control_g["year"], control_g["schooling"], "k--", linewidth=2, label="Control states")
ax.axvline(x=2005, color="red", linestyle="--", alpha=0.5, label="Reform year")
ax.set_xlabel("Year")
ax.set_ylabel("Average Years of Schooling")
ax.set_title("Schooling (First Stage)")
ax.legend(loc="upper left")

ax = axes[1]
ax.plot(treated_g["year"], treated_g["earnings"], "k-", linewidth=2, label="Treated states")
ax.plot(control_g["year"], control_g["earnings"], "k--", linewidth=2, label="Control states")
ax.axvline(x=2005, color="red", linestyle="--", alpha=0.5, label="Reform year")
ax.set_xlabel("Year")
ax.set_ylabel("Log Average Earnings")
ax.set_title("Earnings (Reduced Form)")
ax.legend(loc="upper left")

plt.tight_layout()
plt.show()
Figure 6.5: Parallel trends in schooling (left) and earnings (right). Treated states (solid) and control states (dashed) follow the same trend before the reform, then diverge after 2005.
Code
# DD regression for schooling (first stage)
did["treat_post"] = did["treated"] * did["post"]
dd_school = pf.feols("avg_schooling ~ treat_post | state + year", data=did, vcov={"CRV1": "state"})

# DD regression for earnings (reduced form)
dd_earn = pf.feols("avg_earnings ~ treat_post | state + year", data=did, vcov={"CRV1": "state"})

# Implied return: earnings effect / schooling effect
dd_return = dd_earn.coef()["treat_post"] / dd_school.coef()["treat_post"]

pd.DataFrame({
    "Quantity": [
        "DD effect on schooling (first stage)",
        "DD effect on earnings (reduced form)",
        "Implied return per year (RF / FS)",
    ],
    "Estimate": [
        f"{dd_school.coef()['treat_post']:.4f} ({dd_school.se()['treat_post']:.4f})",
        f"{dd_earn.coef()['treat_post']:.4f} ({dd_earn.se()['treat_post']:.4f})",
        f"{dd_return:.4f}",
    ],
})
Table 6.11: Differences-in-differences estimate of the reform effect on schooling and earnings, with state and year fixed effects.
Quantity Estimate
0 DD effect on schooling (first stage) 0.5039 (0.0165)
1 DD effect on earnings (reduced form) 0.0393 (0.0055)
2 Implied return per year (RF / FS) 0.0779
ImportantConnection to the child labor law IV

We borrow IV terminology here: the DD on schooling plays the role of a “first stage” (how much did the reform increase schooling?) and the DD on earnings plays the role of a “reduced form” (how much did the reform increase earnings?). The ratio gives the implied per-year return, just as the Wald estimator does in IV. This is valid as long as the reform affects earnings only through schooling.

The child labor law IV (Strategy 3) and the DiD approach exploit the same underlying variation — policy changes in compulsory schooling laws across states and time. Both give similar estimates, reinforcing the causal interpretation.

The Furious Five: A Grand Synthesis

This chapter has applied all five methods from the book to a single question. Each method can be summarized by its key equation, all targeting the same parameter \(\rho\) — the causal return to schooling:

The equation behind each method
Method Key Equation
Bivariate OLS \(\ln W_i = \alpha + \rho \, S_i + e_i\)
OLS with controls \(\ln W_i = \alpha + \rho \, S_i + \gamma' X_i + e_i\)
Twin FE \(\Delta Y_f = \rho \, \Delta S_f + \Delta e_f\)
IV (Wald) \(\hat{\rho}_{IV} = \hat{\rho}_{RF} \, / \, \hat{\phi}\)
2SLS Stage 1: \(S_i = \pi_0 + \pi_1 Z_i + v_i\); Stage 2: \(\ln W_i = \alpha + \rho_{IV} \hat{S}_i + u_i\)
RD \(Y_i = \alpha + \rho \, D_i + f(\text{Score}_i) + e_i\)
DD \(Y_{st} = \alpha + \delta \, (\text{Treated}_s \times \text{Post}_t) + \beta_s + \gamma_t + e_{st}\)
The Furious Five methods and their role in estimating returns to schooling
Method Key Assumption What It Estimates Main Threat Used Here?
1. RCT Random assignment ATE Non-compliance Synthetic demo
2. Regression Observable confounders only Conditional average OVB (ability bias) Yes (OLS baseline)
3. IV / 2SLS Valid instrument LATE (compliers) Weak/invalid instrument Yes (twins, QOB, child labor)
4. RD Smooth running variable Local effect at cutoff Limited generalizability Yes (sheepskin)
5. DD Parallel trends ATT Pre-trend violation Synthetic demo

What Is the True Return to Schooling?

Comparing returns to schooling across all methods
Method Estimate Main Bias Direction
Simple OLS (no controls) ~0.07 Ability bias (OVB) Likely upward
OLS with controls ~0.11 Unobserved ability Upward
Twin FE ~0.06 Measurement error Downward
Twin IV ~0.11 Corrects measurement error
Quarter-of-birth IV ~0.07–0.08 LATE for compliers only
Child labor law IV ~0.07–0.13 Weak instruments, imprecise
Sheepskin RD ~0 Diploma effect specifically
DD (synthetic) ~0.06 Parallel trends required
NoteThe big picture

The true causal return to schooling is probably 7–10% per year. OLS slightly overstates it (ability bias), while twin FE understates it (measurement error). The IV estimates cluster around 7–10%. The near-zero sheepskin effect suggests that the return comes from actual learning, not credential signaling.

No single method is perfect. The power of this chapter lies in seeing how multiple imperfect strategies converge on a similar answer.

Why this matters for policy. Multiple methods converge: twins, quarter of birth, and compulsory schooling laws all point to a genuine causal return of 7–10% per year. Education is one of the best investments individuals and governments can make — and the return comes from actual learning, not just the diploma.

Key Takeaways

graph TD
    Q["Does education cause higher earnings?"]
    AB["Ability bias inflates simple OLS"]
    OVB["OVB formula: bias equals gamma times pi"]
    RCT["RCT is ideal but infeasible"]
    TW["Twin FE removes shared ability"]
    ME["Measurement error biases twins down"]
    IV["IV corrects both biases"]
    RD["Sheepskin RD: diploma effect is small"]
    DD["DD: policy changes confirm returns"]
    SYN["Synthesis: true return is about seven to ten percent"]

    Q --> AB
    AB --> OVB
    OVB --> RCT
    RCT --> TW
    TW --> ME
    ME --> IV
    Q --> RD
    Q --> DD
    TW --> SYN
    IV --> SYN
    RD --> SYN
    DD --> SYN

    style Q fill:#2c3e50,color:#fff
    style AB fill:#c0392b,color:#fff
    style OVB fill:#e67e22,color:#fff
    style RCT fill:#3498db,color:#fff
    style TW fill:#8e44ad,color:#fff
    style ME fill:#c0392b,color:#fff
    style IV fill:#3498db,color:#fff
    style RD fill:#2d8659,color:#fff
    style DD fill:#2d8659,color:#fff
    style SYN fill:#2d8659,color:#fff

    linkStyle 0,1,2,3,4,5,6,7,8,9,10 stroke:#888,stroke-width:2px
Figure 6.6: How the key concepts of Chapter 6 connect: from the question through five methods to the synthesis
  1. Simple OLS returns to schooling (~7%) reflect both the causal effect and selection bias (ability bias), so they overstate the true causal return.

  2. The OVB formula predicts upward bias: ability raises both schooling and earnings.

  3. RCTs are the gold standard but infeasible for schooling — motivating quasi-experimental methods.

  4. OLS with controls (~11%) on the twins data shows that demographic controls do not substantially reduce the schooling coefficient.

  5. Twin fixed effects (~6%) control for shared ability but suffer from measurement error amplification.

  6. IV using twin’s report (~11%) corrects measurement error, recovering a higher estimate.

  7. Quarter-of-birth IV (~7%) uses compulsory schooling as exogenous variation, estimating a LATE for dropout-margin students.

  8. Child labor law IV (~7–10%) uses a different set of instruments, confirming the QOB results.

  9. Sheepskin RD (~0%) shows the diploma itself has little earnings value — learning matters more.

  10. DD exploiting policy changes (~8%) provides yet another perspective using before/after comparisons.

  11. Multiple methods converge on a true return of about 7–10% per year.

  12. No single method is perfect. The lesson is to use multiple approaches and look for convergence.

Learn by Coding

Copy this code into a Python notebook to reproduce the key results from this chapter.

# ============================================================
# Chapter 6: The Wages of Schooling — Code Cheatsheet
# ============================================================
import pandas as pd
import numpy as np
import pyfixest as pf
# (IV handled by pf.feols with pipe syntax)

DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# --- Step 1: Simple bivariate OLS (no controls) ---
qob = pd.read_csv(DATA + "ch6/qob_clean.csv")
bivariate = pf.feols("lnw ~ s", data=qob, vcov="hetero")
print(f"Simple OLS return: {round(bivariate.coef()['s'], 3)} ({round(bivariate.se()['s'], 3)})")
print("  (~7% per year — raw correlation, likely biased by ability)\n")

# --- Step 2: OLS with controls (twins data) ---
twins = pd.read_csv(DATA + "ch6/twins_clean.csv")
ols_result = pf.feols("lwage ~ educ + age + age2 + female + white", data=twins, vcov="hetero")
print(f"OLS with controls: {round(ols_result.coef()['educ'], 3)} ({round(ols_result.se()['educ'], 3)})")
print("  (~11% per year — controls barely change the estimate)\n")

# --- Step 3: Twin fixed effects (within-pair differences) ---
first = twins[twins["first"] == 1]
fe = pf.feols("dlwage ~ deduc - 1", data=first, vcov="hetero")
print(f"Twin FE return: {round(fe.coef()['deduc'], 3)} ({round(fe.se()['deduc'], 3)})")
print("  (~6% — lower because shared ability is removed, but measurement error amplified)\n")

# --- Step 4: IV with twin's report (corrects measurement error) ---
iv_lev = pf.feols("lwage ~ 1 + age + age2 + female + white | educ ~ educt_t", data=twins, vcov="hetero")
print(f"Twin IV (levels): {round(iv_lev.coef()['educ'], 3)} ({round(iv_lev.se()['educ'], 3)})")
print("  (~11% — measurement error corrected)\n")

# --- Step 5: Quarter-of-birth IV (Angrist & Krueger) ---
fs = pf.feols("s ~ q4", data=qob, vcov="hetero")
rf = pf.feols("lnw ~ q4", data=qob, vcov="hetero")
wald = rf.coef()["q4"] / fs.coef()["q4"]
print(f"Wald IV estimate: {round(wald, 3)}")
iv_qob = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
print(f"2SLS estimate:    {round(iv_qob.coef()['s'], 3)} ({round(iv_qob.se()['s'], 3)})")
print("  (~7% per year via quarter-of-birth instrument)\n")

# --- Step 6: Child labor law IV ---
cl = pd.read_csv(DATA + "ch6/childlabor_clean.csv")
ols_cl = pf.feols("lnwkwage ~ indEduc | sob + yob + year",
                  data=cl, weights="weight", vcov={"CRV1": "sob"})
print(f"OLS (child labor data): {round(ols_cl.coef()['indEduc'], 4)}")

# --- Step 7: First-stage F-statistic ---
f_stat = fs.tstat()['q4'] ** 2  # F = t² for a single restriction
print(f"First-stage F-stat (QOB): {round(f_stat, 1)} (should be > 10)")
TipTry it yourself!

Copy the code above and paste it into this Google Colab scratchpad to run it interactively. Modify the variables, change the specifications, and see how results change!

Below is the same cheatsheet in Stata syntax.

* ============================================================
* Chapter 6: The Wages of Schooling — Stata Cheatsheet
* ============================================================
clear all
set more off

* --- Step 1: Simple bivariate OLS (no controls) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
reg lnw s, robust
* ~7% per year — raw correlation, likely biased by ability

* --- Step 2: OLS with controls (twins data) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/twins_clean.csv", clear
reg lwage educ age age2 female white, robust
* ~11% per year — controls barely change the estimate

* --- Step 3: Twin fixed effects (within-pair differences) ---
reg dlwage deduc if first == 1, noconstant robust
* ~6% — lower because shared ability is removed

* --- Step 4: IV with twin's report ---
ivregress 2sls lwage age age2 female white (educ = educt_t), robust
* ~11% — measurement error corrected

* --- Step 5: Quarter-of-birth IV (Angrist & Krueger) ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear

* First stage
reg s q4, robust
scalar fs_coef = _b[q4]

* Reduced form
reg lnw q4, robust
scalar rf_coef = _b[q4]

* Wald IV estimate
scalar wald = rf_coef / fs_coef
display "Wald IV estimate: " round(wald, 0.001)

* 2SLS
ivregress 2sls lnw (s = q4), robust
* ~7% per year via quarter-of-birth instrument

* --- Step 6: Child labor law IV ---
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/childlabor_clean.csv", clear
reg indeduc cl7 cl8 cl9 i.sob i.yob i.year [aw=weight], cluster(sob)
testparm cl7 cl8 cl9
ivregress 2sls lnwkwage i.sob i.yob i.year (indeduc = cl7 cl8 cl9) [aw=weight], cluster(sob)
TipTry it in Stata!

Copy the code above into a .do file and run it in Stata 14 or later (which supports loading data from URLs). If your Stata cannot access the internet, download the CSV files from the data/ folder on GitHub and replace each URL with a local file path.

The Furious Five: A Code Summary

The Furious Five are the five core methods of causal inference covered in Mastering ’Metrics. Each method tackles the same fundamental problem — separating cause from correlation — but relies on a different source of identifying variation and a different key assumption.

The Furious Five: one question, five strategies
Method Key Assumption What It Estimates Main Threat
1. RCT Random assignment of treatment ATE (Average Treatment Effect) Non-compliance, spillovers
2. Regression (OLS) All confounders observed and controlled Conditional average effect Omitted variable bias
3. IV / 2SLS Valid instrument (relevance + exclusion) LATE (effect for compliers) Weak or invalid instrument
4. RD No manipulation of running variable Local effect at the cutoff Limited external validity
5. DD Parallel trends absent treatment ATT (effect on the treated) Pre-trend differences
Method pyfixest Syntax Intuitive Analogy
RCT group_means = df.groupby("D")["Y"].mean() A coin flip decides who gets the medicine — any difference in health must be caused by it
OLS pf.feols("Y ~ D + X", data=df, vcov="hetero") Adjusting marathon times for weather and elevation — helpful if you measure everything, but miss one factor and the comparison is off
IV pf.feols("Y ~ 1 \| D ~ Z", data=df, vcov="hetero") Using rainfall to study how carrying an umbrella affects your mood — rain shifts umbrella use but does not directly change mood
RD pf.feols("Y ~ D * score", data=df, vcov="hetero") Comparing students who scored 59 vs. 61 on a pass/fail exam — nearly identical except for the credential
DD pf.feols("Y ~ treat_post \| group + time", data=df, vcov={"CRV1": "group"}) Two cities track the same crime trend until one adopts a new policy — the divergence is the causal effect

When the Furious Five agree, we gain confidence. No single method is bulletproof, but when multiple methods — each with different data, assumptions, and potential biases — all point to a similar answer, the finding becomes much more credible. This chapter showed exactly that: the true causal return to schooling is about 7–10% per year, confirmed across all five methods.

WarningCode is not a substitute for understanding

The code below is a stylized overview of the Furious Five — a compact reference to remind you what each method does. But applying these methods should never be mechanical. Each method rests on assumptions that must be justified by the context in which the data was collected and the relationships that economic theory suggests. A valid instrument in one setting may be invalid in another. Parallel trends may hold in one policy comparison but fail in the next. Before running any of these methods, carefully study the relevant chapter of the book to understand when and why each method works — not just how to code it.

Below is a self-contained code summary that applies each of the Furious Five to the returns-to-schooling question using the datasets from this chapter.

# ================================================================
# THE FURIOUS FIVE — Complete Code Summary
# ================================================================
# Five methods, one question: Does education cause higher earnings?
# Each block is self-contained with its own data, equation, and
# interpretation. Run them all to see convergence in action.
# ================================================================
import pandas as pd
import numpy as np
import pyfixest as pf
# (IV handled by pf.feols with pipe syntax)

DATA = "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/"

# ================================================================
# METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
# ================================================================
# Question: Does education cause higher earnings?
# Equation: ln(W) = α + ρ·S + ε  (First Stage: S = γ_0 + γ_1·Z + u)
# Logic:    Z (scholarship offer) is randomly assigned. We use it to
#           isolate exogenous variation in S (actual schooling).
# Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
# Bias:     None, assuming the offer only affects earnings through schooling
# ----------------------------------------------------------------
rct = pd.read_csv(DATA + "ch6/synthetic_rct.csv")
means = rct.groupby("scholarship")[["schooling", "earnings"]].mean()
wald_rct = (means.loc[1, "earnings"] - means.loc[0, "earnings"]) / \
           (means.loc[1, "schooling"] - means.loc[0, "schooling"])
print(f"1. RCT (Wald):  {wald_rct:.4f}")

# ================================================================
# METHOD 2: REGRESSION (OLS)
# ================================================================
# Question: What is the raw return to each year of schooling?
# Equation: ln(W) = α + ρ·S + ε
# Logic:    Control for observables, hoping all confounders are captured.
# Estimate: Conditional average association (not necessarily causal).
# Bias:     Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
#           individuals may get more schooling AND earn more regardless.
# ----------------------------------------------------------------
qob = pd.read_csv(DATA + "ch6/qob_clean.csv")
ols = pf.feols("lnw ~ s", data=qob, vcov="hetero")
print(f"2. OLS:         {ols.coef()['s']:.4f}")

# ================================================================
# METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
# ================================================================
# Question: What is the causal return to schooling, using exogenous variation?
# Equation: Wald = Cov(Y, Z) / Cov(D, Z)
# Logic:    Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
# Estimate: LATE — Local Average Treatment Effect for "compliers"
#           (students kept in school solely due to compulsory schooling laws).
# Bias:     None, provided Z is relevant and the exclusion restriction holds.
# ----------------------------------------------------------------
iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
print(f"3. IV (QOB):    {iv.coef()['s']:.4f}")

# ================================================================
# METHOD 4: REGRESSION DISCONTINUITY (RD)
# ================================================================
# Question: Does the diploma credential itself boost earnings?
# Equation: Y = α + ρ·D + β_1·(Score) + β_2·(D × Score) + ε
# Logic:    Compare students just above vs. just below the passing cutoff.
# Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
# Bias:     None, assuming continuity of potential outcomes at the cutoff
#           (students cannot precisely manipulate their scores).
# ----------------------------------------------------------------
sheep = pd.read_csv(DATA + "ch6/sheepskin_clean.csv")
# Center the running variable if not already centered, and apply a bandwidth (e.g., +/- 30)
bandwidth_data = sheep[abs(sheep["minscore"]) <= 30]
# Use robust local linear regression allowing varying slopes on either side of cutoff
rd = pf.feols("avgearnings ~ pass_exam * minscore", data=bandwidth_data, weights="person_years", vcov="hetero")
print(f"4. RD:          ${rd.coef()['pass_exam']:.0f} (Credential effect)")

# ================================================================
# METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
# ================================================================
# Question: Do compulsory schooling reforms (T) raise causal earnings?
# Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
# Logic:    Compare changes over time in reform vs. non-reform states, using
#           the reform as an instrument for actual years of schooling.
# Estimate: LATE of schooling on earnings driven by the policy change.
# Bias:     Biased if parallel trends assumption fails (states would have
#           had different trajectories absent the reform).
# ----------------------------------------------------------------
did = pd.read_csv(DATA + "ch6/synthetic_did.csv")
did["treat_post"] = did["treated"] * did["post"]

# First stage: Effect of policy on schooling
dd_s = pf.feols("avg_schooling ~ treat_post | state + year", data=did, vcov={"CRV1": "state"})

# Reduced form: Effect of policy on earnings
dd_e = pf.feols("avg_earnings ~ treat_post | state + year", data=did, vcov={"CRV1": "state"})

wald_did = dd_e.coef()['treat_post'] / dd_s.coef()['treat_post']
print(f"5. Wald-DiD:    {wald_did:.4f}")

Below is the same Furious Five summary in Stata syntax.

* =================================================================
* THE FURIOUS FIVE — Complete Code Summary (Stata)
* =================================================================
* Five methods, one question: Does education cause higher earnings?
* =================================================================
clear all
set more off

* =================================================================
* METHOD 1: RANDOMIZED CONTROLLED TRIAL (WITH NON-COMPLIANCE)
* =================================================================
* Question: Does education cause higher earnings?
* Equation: ln(W) = a + rho*S + e  (First Stage: S = g0 + g1*Z + u)
* Logic:    Z (scholarship offer) is randomly assigned. We use it to
* isolate exogenous variation in S (actual schooling).
* Estimate: Wald ratio = (Reduced Form effect on W) / (First Stage effect on S)
* Bias:     None, assuming the offer only affects earnings through schooling
* -----------------------------------------------------------------
import delimited using ///
  "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_rct.csv", clear

* Manual Wald calculation for pedagogical clarity
quietly reg earnings scholarship, robust
scalar rf = _b[scholarship]
quietly reg schooling scholarship, robust
scalar fs = _b[scholarship]
display "1. RCT (Wald): " round(rf / fs, 0.0001)

* Note: In practice, we estimate this directly via 2SLS to get correct standard errors:
* ivregress 2sls earnings (schooling = scholarship), robust

* =================================================================
* METHOD 2: REGRESSION (OLS)
* =================================================================
* Question: What is the raw return to each year of schooling?
* Equation: ln(W) = a + rho*S + e
* Logic:    Control for observables, hoping all confounders are captured.
* Estimate: Conditional average association (not necessarily causal).
* Bias:     Omitted Variable Bias (e.g., Ability Bias - upward). Smarter
* individuals may get more schooling AND earn more regardless.
* -----------------------------------------------------------------
import delimited using ///
  "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear

reg lnw s, robust

* =================================================================
* METHOD 3: INSTRUMENTAL VARIABLES (IV / 2SLS)
* =================================================================
* Question: What is the causal return to schooling, using exogenous variation?
* Equation: Wald = Cov(Y, Z) / Cov(D, Z)
* Logic:    Quarter of birth (Z) shifts schooling laws but is unrelated to ability.
* Estimate: LATE -- Local Average Treatment Effect for "compliers"
* (students kept in school solely due to compulsory schooling laws).
* Bias:     None, provided Z is relevant and the exclusion restriction holds.
* -----------------------------------------------------------------
ivregress 2sls lnw (s = q4), robust

* =================================================================
* METHOD 4: REGRESSION DISCONTINUITY (RD)
* =================================================================
* Question: Does the diploma credential itself boost earnings?
* Equation: Y = a + rho*D + b1*(Score) + b2*(D x Score) + e
* Logic:    Compare students just above vs. just below the passing cutoff.
* Estimate: LATE at the cutoff (the "sheepskin" or credential effect).
* Bias:     None, assuming continuity of potential outcomes at the cutoff
* (students cannot precisely manipulate their scores).
* -----------------------------------------------------------------
import delimited using ///
  "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/sheepskin_clean.csv", clear

* Interact treatment with the running variable to allow different slopes
gen pass_X_score = pass_exam * minscore

* Local linear regression using a rectangular kernel (bandwidth of +/- 30)
reg avgearnings pass_exam minscore pass_X_score ///
    if abs(minscore) <= 30 [aw = person_years], robust

* =================================================================
* METHOD 5: DIFFERENCES-IN-DIFFERENCES (WALD-DiD)
* =================================================================
* Question: Do compulsory schooling reforms (T) raise causal earnings?
* Equation: Wald-DiD = DiD_Earnings / DiD_Schooling
* Logic:    Compare changes over time in reform vs. non-reform states, using
* the reform as an instrument for actual years of schooling.
* Estimate: LATE of schooling on earnings driven by the policy change.
* Bias:     Biased if parallel trends assumption fails (states would have
* had different trajectories absent the reform).
* -----------------------------------------------------------------
import delimited using ///
  "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/synthetic_did.csv", clear

gen treat_post = treated * post

* First stage: Effect of policy on schooling
quietly reg avg_schooling treat_post i.state i.year, cluster(state)
scalar dd_fs = _b[treat_post]

* Reduced form: Effect of policy on earnings
quietly reg avg_earnings treat_post i.state i.year, cluster(state)

display "5. Wald-DD: " round(_b[treat_post] / dd_fs, 0.0001)

Exercises

Multiple Choice Questions

  1. In the simple bivariate regression of earnings on schooling, the estimated return is about 7%. This estimate is likely:
    1. Too low because it ignores the earnings premium for higher-status jobs
    2. Too high because unobserved ability inflates the correlation between schooling and earnings
    3. Exactly right because the sample is very large
    4. Too low because measurement error biases the coefficient downward

(b) The simple bivariate OLS captures both the causal return to schooling and the spurious correlation driven by ability. Since more able people tend to get more schooling AND earn more, OLS conflates these effects, overstating the causal return.

  1. The OVB formula predicts that omitting ability from the schooling regression will bias the coefficient upward because:
    1. Ability is negatively correlated with schooling
    2. Ability has a negative effect on earnings
    3. Both the correlation of ability with schooling and the effect of ability on earnings are positive
    4. Ability is uncorrelated with education in the population

(c) The OVB formula says bias = \(\gamma \times \pi_1\). Both are positive: ability raises earnings (\(\gamma > 0\)) and ability raises schooling (\(\pi_1 > 0\)). So the product is positive, meaning OLS overstates the causal return.

  1. The main econometric challenge in estimating the return to schooling is:
    1. Education data is poorly measured
    2. Unobserved ability may be correlated with both education and earnings (ability bias)
    3. The return to schooling varies across countries
    4. Schooling has no effect on earnings

(b) The core challenge is ability bias: people with higher unobserved ability tend to get more education AND earn more, regardless of schooling. This makes the OLS estimate overstate the causal effect.

  1. The twin fixed-effects strategy controls for ability bias by:
    1. Including IQ test scores as a control variable
    2. Comparing earnings differences within identical twin pairs who share the same genes and family background
    3. Using random assignment to determine who attends college
    4. Restricting the sample to people with above-average ability

(b) Identical twins share genetics and family environment — the key sources of ability differences. By comparing earnings within twin pairs, the strategy differences out shared ability.

  1. The twin FE estimate (6%) is lower than OLS (11%). This is MOST likely because:
    1. The twin FE definitively proves the true return is 6%
    2. Measurement error in self-reported education is amplified by differencing, biasing the estimate downward
    3. Twins are not representative of the general population
    4. OLS is always biased upward

(b) When twins report their own education, small errors are amplified by differencing because the true within-pair variation is small relative to the noise. This attenuation bias pushes the twin FE estimate below the true return.

  1. Angrist and Krueger’s quarter-of-birth instrument works because:
    1. People born in different quarters have different levels of ability
    2. Compulsory schooling laws interact with birth quarter to create exogenous variation in years of education
    3. Employers pay more to workers born in certain quarters
    4. Season of birth directly affects earnings through health channels

(b) Students born in Q4 enter school at a younger age, so by the time they reach the legal dropout age (e.g., 16), they have completed more years of schooling than those born in Q1. Compulsory schooling laws create exogenous variation in education unrelated to ability.

  1. A “sheepskin effect” refers to:
    1. The return to each additional year of schooling
    2. A discrete jump in earnings at diploma/degree completion, beyond the effect of the extra year
    3. The effect of attending a prestigious university
    4. The bias introduced by omitting ability from the regression

(b) A sheepskin effect is the extra earnings bump from completing a degree. The Clark and Martorell RD showed the diploma itself had almost no effect on earnings, suggesting most of the education premium comes from actual learning.

  1. Controlling for occupation when estimating the return to schooling is problematic because:
    1. Occupation data is measured with too much error
    2. Occupation is caused by education (a post-treatment variable), so controlling for it absorbs part of the causal effect
    3. Occupation is uncorrelated with earnings
    4. There are too many occupation categories for the regression to handle

(b) Occupation is caused by education — it is a “bad control” (post-treatment variable). Controlling for it absorbs part of the causal effect: if education raises earnings partly by giving access to better jobs, holding occupation constant removes that channel.

  1. The child labor law IV estimates are useful because they:
    1. Use random assignment to determine schooling
    2. Provide an independent source of exogenous variation from a different set of instruments than quarter of birth
    3. Eliminate measurement error in education
    4. Estimate the average treatment effect for the entire population

(b) The child labor law instruments provide variation that is independent of QOB, using different policy mechanisms. This independent confirmation strengthens causal claims.

  1. In a differences-in-differences analysis of compulsory schooling reforms, the key identifying assumption is:
    1. States that adopted the reform are identical to those that did not
    2. In the absence of the reform, treated and control states would have followed the same trend in outcomes
    3. The reform was randomly assigned to states
    4. All individuals in treated states complied with the new law

(b) The key DD assumption is that in the absence of the reform, treated and control states would have followed the same trajectory. This is the parallel trends assumption.

Conceptual Questions

  1. Ability bias direction: A friend argues that ability bias could go downward (smart people drop out to start businesses). Give one example supporting this view and one supporting the standard upward-bias view. Which do you find more convincing for the general population?

Ability bias is upward for most people because the education system selects on ability, but rare high-ability dropouts illustrate the theoretical possibility of downward bias.

  1. Downward bias example: Mark Zuckerberg dropped out of Harvard to build Facebook and became a billionaire. His high ability generated high earnings without completing his degree.
  2. Upward bias example: A student with high IQ and supportive parents completes a PhD and earns a high salary. OLS conflates the ability effect with the schooling effect.
  3. For the general population, upward bias is far more convincing: the education system systematically selects on ability through grades, test scores, and admissions.
  1. Measurement error: Explain why measurement error in education is more problematic in the twin-differences specification than in the levels OLS. (Hint: think about what differencing does to the signal-to-noise ratio.)

Differencing within twin pairs amplifies measurement error because it shrinks the true signal while leaving the noise unchanged.

  1. Measurement error adds noise (\(m_i\)) to observed schooling: \(S_i = S_i^* + m_i\).
  2. In levels OLS, the reliability ratio \(r = \text{Var}(S^*) / [\text{Var}(S^*) + \text{Var}(m)]\) is close to 1 because schooling varies widely.
  3. Differencing shrinks \(\text{Var}(\Delta S^*)\) dramatically but \(\text{Var}(\Delta m)\) stays the same. The reliability ratio \(r\) falls sharply, and attenuation bias worsens.
  4. This is why the twin FE estimate (~0.06) is lower than OLS (~0.11). The IV correction using the co-twin’s report recovers the true return.
  1. The Wald estimate: Using the QOB data, the reduced form (Q4 → earnings) is 0.0068 and the first stage (Q4 → schooling) is 0.0921. (a) Compute the Wald/IV estimate. (b) Why is this estimate valid only for “compliers”? Who are the compliers in this context?

The Wald/IV estimate identifies a LATE of about 7.4% per year, but it applies only to the marginal students whose schooling was changed by compulsory schooling laws.

    1. Wald estimate = 0.0068 / 0.0921 = 0.074, or about 7.4% per year.
    1. This is a LATE — it applies only to compliers whose schooling was changed by their quarter of birth interacting with compulsory schooling laws.
  1. Compliers are students at the dropout threshold. Always-takers (college-bound) and never-takers (very early dropouts) are not affected.
  1. OLS vs. IV similarity: In the QOB analysis, OLS and IV give similar estimates (~0.07). Does this mean ability bias is small? Or could there be offsetting biases (one pushing up, one pushing down) that happen to cancel? Explain.

The similarity of OLS and IV estimates is a coincidence of two offsetting biases, not evidence that OLS is unbiased.

  1. OLS is biased upward by ability bias and downward by measurement error.
  2. IV corrects both simultaneously. The net result happens to land close to OLS.
  3. The twin evidence confirms: IV with the co-twin’s report raises the estimate above twin FE; twin FE without IV falls below OLS. Each bias is real.
  1. Sheepskin interpretation: The Texas RD shows a ~40 percentage point jump in diploma receipt but near-zero earnings effect. A skeptic says “this proves education doesn’t matter.” Explain why this conclusion is wrong. What does the sheepskin RD actually tell us about the mechanism through which education raises earnings?

The sheepskin RD proves the diploma itself has little independent value, implying that the large education premium comes from actual learning.

  1. The Texas RD compares students who barely passed vs. barely failed. The near-zero earnings difference shows credentialism is not the main driver.
  2. This separates human capital theory from signaling theory. The small sheepskin effect supports human capital.
  3. Education clearly raises earnings — the finding tells us why (skills, not signals) rather than whether.
  1. OVB formula: Using the synthetic OVB dataset, the short regression gives ~0.12 and the long regression gives ~0.09. Apply the OVB formula: what are the signs of \(\gamma\) (ability → earnings) and \(\pi_1\) (ability → schooling)? Is the bias direction consistent with what you observe?

Both \(\gamma > 0\) (ability raises earnings) and \(\pi_1 > 0\) (ability raises schooling). OVB = \(\gamma \times \pi_1 > 0\), matching the pattern: short regression (~0.12) overstates the true return (~0.09) by about 0.03.

  1. Bad controls vs. good controls: Explain the difference between controlling for age (a good control) and controlling for occupation (a bad control) when estimating returns to schooling. What determines whether a control variable is “good” or “bad”?

Age is determined before schooling decisions — it is a pre-treatment variable. Occupation is caused by education — it is post-treatment. Controlling for occupation removes part of the causal pathway (education → better job → higher earnings). The rule: a control is “good” if determined before the treatment, “bad” if it is a consequence.

Research Tasks

  1. Returns for men only: Using twins_clean.csv, restrict the sample to male twins (female == 0). Re-run the OLS and within-twin FE regressions. Do the returns to schooling differ for men compared to the full sample?
Code
# --- Full Sample: OLS and Twin FE ---
ols_all = pf.feols("lwage ~ educ + age + age2 + female + white", data=twins, vcov="hetero")
first_all = twins[twins["first"] == 1]
fe_all = pf.feols("dlwage ~ deduc - 1", data=first_all, vcov="hetero")

# --- Men Only: OLS and Twin FE ---
men = twins[twins["female"] == 0]
ols_men = pf.feols("lwage ~ educ + age + age2 + white", data=men, vcov="hetero")
first_men = men[men["first"] == 1]
fe_men = pf.feols("dlwage ~ deduc - 1", data=first_men, vcov="hetero")

pd.DataFrame({
    "Method": ["OLS (full sample)", "OLS (men only)", "Twin FE (full sample)", "Twin FE (men only)"],
    "Coefficient": [
        round(ols_all.coef()["educ"], 4), round(ols_men.coef()["educ"], 4),
        round(fe_all.coef()["deduc"], 4), round(fe_men.coef()["deduc"], 4),
    ],
    "SE": [
        round(ols_all.se()["educ"], 4), round(ols_men.se()["educ"], 4),
        round(fe_all.se()["deduc"], 4), round(fe_men.se()["deduc"], 4),
    ],
    "N": [int(ols_all._N), int(ols_men._N), int(fe_all._N), int(fe_men._N)],
})
Table 6.12: Returns to schooling: full sample vs. men only
Method Coefficient SE N
0 OLS (full sample) 0.1100 0.0105 680
1 OLS (men only) 0.1017 0.0186 274
2 Twin FE (full sample) 0.0617 0.0198 340
3 Twin FE (men only) 0.0630 0.0330 136

Stata equivalent:

* --- Returns to schooling: full sample vs. men only ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/twins_clean.csv", clear
reg lwage educ age age2 female white, robust
reg dlwage deduc if first == 1, robust noconstant
reg lwage educ age age2 white if female == 0, robust
reg dlwage deduc if first == 1 & female == 0, robust noconstant

The key pattern — OLS exceeding twin FE — should persist in both samples.

  1. Multiple instruments: Using qob_clean.csv, run the 2SLS regression using all three quarter dummies (q2, q3, q4) as instruments for schooling instead of just q4. Does the IV estimate change? How does the first-stage F-statistic compare?
Code
iv_q4 = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")
fs_q4 = pf.feols("s ~ q4", data=qob)
f_q4 = fs_q4.tstat()['q4'] ** 2  # F = t² for single restriction

iv_multi = pf.feols("lnw ~ 1 | yob | s ~ q2 + q3 + q4", data=qob, vcov="hetero")

fs_multi = pf.feols("s ~ q2 + q3 + q4 | yob", data=qob)
t_q2 = fs_multi.tstat()["q2"]
t_q3 = fs_multi.tstat()["q3"]
t_q4_m = fs_multi.tstat()["q4"]
f_multi = (t_q2**2 + t_q3**2 + t_q4_m**2) / 3  # approximate joint F

pd.DataFrame({
    "Specification": ["IV (Q4 only, no controls)", "IV (Q2+Q3+Q4, with YOB FE)"],
    "Return to schooling": [round(iv_q4.coef()["s"], 4), round(iv_multi.coef()["s"], 4)],
    "SE": [round(iv_q4.se()["s"], 4), round(iv_multi.se()["s"], 4)],
    "First-stage F": [round(f_q4, 1), round(f_multi, 1)],
})
Table 6.13: IV estimates with single vs. multiple quarter-of-birth instruments
Specification Return to schooling SE First-stage F
0 IV (Q4 only, no controls) 0.0740 0.0280 48.1
1 IV (Q2+Q3+Q4, with YOB FE) 0.1053 0.0201 48.5

Stata equivalent:

* --- IV: single vs. multiple instruments ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
ivregress 2sls lnw (s = q4), robust
reg s q4, robust
test q4
tab yob, gen(yob_)
ivregress 2sls lnw yob_* (s = q2 q3 q4), robust
reg s q2 q3 q4 yob_*, robust
test q2 q3 q4

Multiple instruments gain precision but require all to satisfy the exclusion restriction.

  1. White vs. non-white twins: Using twins_clean.csv, split the sample by white status and run the OLS regression for each group. Is the return to schooling different for white vs. non-white twins?
Code
rows = []
for race, label in [(1, "White"), (0, "Non-white")]:
    subset = twins[twins["white"] == race]
    if len(subset) > 10:
        r = pf.feols("lwage ~ educ + age + age2 + female", data=subset, vcov="hetero")
        rows.append({
            "Group": label,
            "OLS return": round(r.coef()["educ"], 4),
            "SE": round(r.se()["educ"], 4),
            "N": int(r._N),
        })

pd.DataFrame(rows)
Table 6.14: Returns to schooling by race (OLS, twins data)
Group OLS return SE N
0 White 0.1108 0.0109 623
1 Non-white 0.0985 0.0360 53

Stata equivalent:

* --- Returns to schooling by race ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/twins_clean.csv", clear
reg lwage educ age age2 female if white == 1, robust
reg lwage educ age age2 female if white == 0, robust

Several factors could drive racial differences: labor market discrimination, school quality differences, and occupational sorting.

  1. OLS vs. IV comparison: Using qob_clean.csv, compare the OLS and IV (2SLS with Q4 as instrument) estimates of the return to schooling. Are the estimates similar? What does the comparison tell you about offsetting biases?
Code
# --- OLS: naive regression ---
ols = pf.feols("lnw ~ s", data=qob, vcov="hetero")

# --- IV: 2SLS with Q4 as instrument ---
iv = pf.feols("lnw ~ 1 | s ~ q4", data=qob, vcov="hetero")

pd.DataFrame({
    "Method": ["OLS", "IV (2SLS, Q4 instrument)"],
    "Return to schooling": [round(ols.coef()["s"], 4), round(iv.coef()["s"], 4)],
    "SE": [round(ols.se()["s"], 4), round(iv.se()["s"], 4)],
})
Table 6.15: OLS vs. IV estimates of the return to schooling
Method Return to schooling SE
0 OLS 0.0709 0.0004
1 IV (2SLS, Q4 instrument) 0.0740 0.0280

Stata equivalent:

* --- OLS vs. IV for returns to schooling ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
reg lnw s, robust
ivregress 2sls lnw (s = q4), robust
  1. What the numbers show: The OLS and IV estimates are surprisingly close. OLS gives approximately 7% per year; IV gives a similar or slightly higher estimate. The IV standard error is considerably larger.

  2. Why: Two offsetting biases: ability bias pushes OLS upward, measurement error pushes OLS downward. IV corrects both simultaneously, and the net correction is small because the two biases approximately cancel.

  3. What it teaches: Similar point estimates can mask very different sources of variation. The IV estimate has a cleaner causal interpretation because it isolates exogenous variation in schooling.

  1. First-stage F-statistics: Using qob_clean.csv, compute the first-stage F-statistic for Q4 alone and for all three quarter dummies (Q2, Q3, Q4) together. Do both exceed the “rule of 10” threshold? What does this tell you about instrument relevance?
Code
# --- Single Instrument: Q4 only ---
fs_q4_solo = pf.feols("s ~ q4", data=qob)
f_q4_solo = fs_q4_solo.tstat()['q4'] ** 2  # F = t² for single restriction

# --- Multiple Instruments: Q2 + Q3 + Q4 ---
fs_multi_solo = pf.feols("s ~ q2 + q3 + q4", data=qob)
# Joint F-test for multiple instruments (Wald test)
coefs_m = np.array([fs_multi_solo.coef()['q2'], fs_multi_solo.coef()['q3'], fs_multi_solo.coef()['q4']])
idx_m = [list(fs_multi_solo.coef().index).index(v) for v in ['q2', 'q3', 'q4']]
V_m = fs_multi_solo._vcov[np.ix_(idx_m, idx_m)]
f_multi_solo = float(coefs_m @ np.linalg.inv(V_m) @ coefs_m / 3)

pd.DataFrame({
    "First stage": ["s ~ q4 (single)", "s ~ q2 + q3 + q4 (multiple)"],
    "F-statistic": [round(f_q4_solo, 1), round(f_multi_solo, 1)],
    "Exceeds rule of 10?": ["Yes" if f_q4_solo > 10 else "No",
                             "Yes" if f_multi_solo > 10 else "No"],
    "N instruments": [1, 3],
})
Table 6.16: First-stage F-statistics: single vs. multiple instruments
First stage F-statistic Exceeds rule of 10? N instruments
0 s ~ q4 (single) 48.1 Yes 1
1 s ~ q2 + q3 + q4 (multiple) 34.0 Yes 3

Stata equivalent:

* --- First-stage F-statistics ---
clear all
set more off
import delimited using "https://raw.githubusercontent.com/cmg777/intro2causal/main/data/ch6/qob_clean.csv", clear
reg s q4, robust
test q4
reg s q2 q3 q4, robust
test q2 q3 q4
  1. What the numbers show: Both F-statistics should exceed the “rule of 10” threshold, confirming that quarter of birth is a relevant instrument for schooling.

  2. Why: Students born in Q4 enter school at a younger age, so by the time they reach the legal dropout age (e.g., 16), they have completed more years of schooling than those born in Q1, creating a systematic relationship between quarter of birth and years of schooling.

  3. What it teaches: The first-stage F-statistic is the primary diagnostic for weak instruments. When F < 10, the IV estimate is biased toward OLS and standard errors are unreliable.

  1. Pre-trend test: Using synthetic_did.csv, run the DD regression on only the pre-reform period (years before 2005). Is the treat_post interaction significant? What does this tell you about the parallel trends assumption?
Code
did = pd.read_csv(GITHUB_DATA_URL + "ch6/synthetic_did.csv")
pre = did[did["year"] < 2005].copy()
pre["placebo_post"] = (pre["year"] >= 2000).astype(int)
pre["placebo_treat_post"] = pre["treated"] * pre["placebo_post"]

placebo_dd = pf.feols("avg_earnings ~ placebo_treat_post | state + year", data=pre, vcov={"CRV1": "state"})

pd.DataFrame({
    "Test": ["Placebo DD (pre-reform only, fake reform at 2000)"],
    "Coefficient": [f"{placebo_dd.coef()['placebo_treat_post']:.4f}"],
    "SE": [f"{placebo_dd.se()['placebo_treat_post']:.4f}"],
    "p-value": [f"{placebo_dd.pvalue()['placebo_treat_post']:.3f}"],
})
Table 6.17: Pre-trend test: DD regression on pre-reform period only
Test Coefficient SE p-value
0 Placebo DD (pre-reform only, fake reform at 2000) 0.0089 0.0111 0.433

If the placebo coefficient is small and insignificant, the parallel trends assumption is supported.