Learn panel data

The Learn module is the teaching layer behind the rest of the library. The Explore and Analyze case studies leaned on a handful of ideas: within vs between variation, fixed effects, clustered standard errors, convergence, the Kuznets wave. This page explains why those moves work, in two complementary ways:

A plain-language layer on every result. Each explore_* / analyze_* result can explain itself with .interpret() (an associational reading) and .explain() (a concept explainer).
Simulated sandboxes. Each learn_* function generates its own data from known parameters, so you can see an estimator hit — or miss — a truth you control, and turn the knobs.

Read it top to bottom: we start from the real Kuznets model you fit in Analyze, browse the concept index, then isolate each idea in a sandbox — the mechanics of fixed effects, two inference classics, convergence, and the Kuznets wave itself.

Note

The sandboxes simulate data to make a teaching point; the .interpret() text is always associational (never a causal claim). The correlation_vs_causation explainer spells out what a causal reading would additionally require.

import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = ex.set_labels(load_kuznets(), load_kuznets_data_def(), set_panel=True)

Stage 1 — Read a real result in plain language

Fit a model anywhere in Analyze, then ask it to explain itself. Here is the two-way fixed-effects cubic Kuznets curve from the Analyze case study. .interpret() gives an associational reading of what was estimated:

res = ex.analyze_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
)
print(res.interpret())

This OLS regression relates **gini_regional** to its regressors. Fixed effects for *country + year* absorb time-invariant differences, so coefficients reflect variation **within** each group. Standard errors are clustered by *country*.

- **log_gdp_pc**: each one-unit increase is associated with gini_regional that is 6.41 higher (statistically significant at the 1% level).
- **log_gdp_pc_sq**: each one-unit increase is associated with gini_regional that is 0.715 lower (statistically significant at the 1% level).
- **log_gdp_pc_cu**: each one-unit increase is associated with gini_regional that is 0.0261 higher (statistically significant at the 1% level).

Model fit: N = 880, R² = 0.874, within-R² = 0.521.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._

.explain() returns the concept explainer for the method — what it is, when to use it, and its caveats — the same content as ex.explain("fixed_effects"):

print(res.explain().to_markdown())

### Fixed effects

**What it is.** Fixed effects add a separate intercept for every level of a grouping variable (e.g. each country, each year). This absorbs all *time-invariant* differences across groups, so the remaining coefficients are identified from variation *within* a group over time rather than from differences *between* groups.

**When to use it.** In panel data, to control for stable, unobserved characteristics of units (country institutions, firm culture) and for common shocks in a period (year effects). Two-way (unit + time) fixed effects are the standard panel specification.

**Watch out for.**
- Fixed effects cannot estimate the effect of anything constant within the group (a country's region, a person's sex) — that variation is absorbed.
- They control only for *unobserved* confounders that are constant within the group; time-varying confounders remain a threat.
- Many groups with few observations each can leave little within-variation, inflating standard errors.

*See also:* ols, clustered_se, fwl

*References:* Wooldridge, Introductory Econometrics, ch. 13-14

Every explore_* and analyze_* result carries these two methods. The rest of this page is the ideas they describe, isolated one at a time.

Stage 2 — The browsable concept index

ex.list_topics() returns every registered concept explainer (currently 27); pass any of them (or an alias) to ex.explain(...):

ex.list_topics()

['beta_convergence',
 'clustered_se',
 'convergence_clubs',
 'correlated_random_effects',
 'correlation_vs_causation',
 'descriptive_stats',
 'dummy_variables',
 'event_study',
 'first_differences',
 'fixed_effects',
 'fwl',
 'hausman',
 'kuznets_waves',
 'ols',
 'omitted_variable_bias',
 'panel_structure',
 'parallel_trends',
 'pearson',
 'random_effects',
 'sigma_convergence',
 'spearman',
 'time_trends',
 'transition_matrix',
 'truncate',
 'winsorize',
 'within_between_variation',
 'within_transformation']

ex.explain("fixed_effects")

Fixed effects

What it is. Fixed effects add a separate intercept for every level of a grouping variable (e.g. each country, each year). This absorbs all time-invariant differences across groups, so the remaining coefficients are identified from variation within a group over time rather than from differences between groups.

When to use it. In panel data, to control for stable, unobserved characteristics of units (country institutions, firm culture) and for common shocks in a period (year effects). Two-way (unit + time) fixed effects are the standard panel specification.

Watch out for.

- Fixed effects cannot estimate the effect of anything constant within the group (a country’s region, a person’s sex); that variation is absorbed.
- They control only for unobserved confounders that are constant within the group; time-varying confounders remain a threat.
- Many groups with few observations each can leave little within-variation, inflating standard errors.

See also: ols, clustered_se, fwl

References: Wooldridge, Introductory Econometrics, ch. 13-14

The full catalog, grouped by theme — every entry is a key you can pass to ex.explain(...):

Theme	Explainer topics
OLS & regression	`ols`, `fwl`, `omitted_variable_bias`, `descriptive_stats`
Panel structure & variation	`panel_structure`, `within_between_variation`, `time_trends`, `transition_matrix`
Fixed effects & the within transform	`fixed_effects`, `within_transformation`, `dummy_variables`, `first_differences`
Random effects, CRE & Hausman	`random_effects`, `correlated_random_effects`, `hausman`
Standard errors & inference	`clustered_se`
Convergence	`beta_convergence`, `sigma_convergence`, `convergence_clubs`, `kuznets_waves`
Causal designs / DiD	`event_study`, `parallel_trends`, `correlation_vs_causation`
Correlation	`pearson`, `spearman`
Outlier treatment	`winsorize`, `truncate`

Stage 3 — The core identity: first differences ≈ demeaning ≈ dummy variables

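A unit fixed effect is anything constant within a unit over time. Three transformations all remove it and target the same within-unit slope. Demeaning and unit dummies give numerically identical estimates; first differences match them exactly only on a two-period panel: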

First differences subtract each unit’s previous-period value (Δy on Δx).
The within transformation (demeaning) subtracts each unit’s time-average.
Least-squares dummy variables (LSDV) add one dummy per unit to an OLS regression.

learn_first_differences simulates a panel where the regressor is correlated with the unit effect (so pooled OLS is biased) and recovers the slope by first differences and by demeaning. On a two-period panel the two estimates coincide exactly, and both land close to the true slope:

fd = ex.learn_first_differences(n_periods=2)
print(fd.interpret())
fd.fig

Pooled OLS estimates 2.72, biased by the unit effects. First differencing gives 1.93 and the within (demeaning) estimator 1.93 — both recover the true 2, because differencing and demeaning each cancel the unit effect. On this two-period panel they coincide (gap 0).
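The same comparison can be sketched from scratch with the standard library. This is a minimal illustration, not the internals of learn_first_differences: the data-generating process, the seed and the names (TRUE_BETA, N_UNITS, rows) are assumptions made for the example. First differences are fitted without an intercept so that they match a within model with no period effects.

```python
import random

rng = random.Random(0)
TRUE_BETA = 2.0   # assumed true slope for this sketch
N_UNITS = 500     # assumed number of units

# Two-period panel: each unit has an effect alpha that also shifts x.
rows = []  # (x_t1, x_t2, y_t1, y_t2) per unit
for _ in range(N_UNITS):
    alpha = rng.gauss(0, 1)
    x1 = alpha + rng.gauss(0, 1)
    x2 = alpha + rng.gauss(0, 1)
    y1 = TRUE_BETA * x1 + alpha + rng.gauss(0, 1)
    y2 = TRUE_BETA * x2 + alpha + rng.gauss(0, 1)
    rows.append((x1, x2, y1, y2))


def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Pooled OLS stacks both periods and ignores the unit effect.
pooled = slope([r[0] for r in rows] + [r[1] for r in rows],
               [r[2] for r in rows] + [r[3] for r in rows])

# First differences: regress (y2 - y1) on (x2 - x1), no intercept.
dx = [r[1] - r[0] for r in rows]
dy = [r[3] - r[2] for r in rows]
first_diff = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

# Within: subtract each unit's own two-period mean, then regress.
xw, yw = [], []
for x1, x2, y1, y2 in rows:
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    xw += [x1 - mx, x2 - mx]
    yw += [y1 - my, y2 - my]
within = sum(a * b for a, b in zip(xw, yw)) / sum(a * a for a in xw)

print(f"pooled={pooled:.3f}  first_diff={first_diff:.3f}  within={within:.3f}")
```

With two periods, each demeaned value is plus or minus half the first difference, which is why the two slopes agree to floating-point precision.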

learn_within_vs_lsdv shows that demeaning and unit dummies give the identical slope — the Frisch–Waugh–Lovell theorem at work — for any number of periods:

ex.learn_within_vs_lsdv(n_periods=6).fig
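A small stdlib-only sketch of that equivalence follows. It is an illustration under assumed toy dimensions (5 units, 6 periods), not the code behind learn_within_vs_lsdv; the helper ols is a hypothetical name for a tiny normal-equations solver written for this example.

```python
import random

rng = random.Random(1)
N_UNITS, N_PERIODS, TRUE_BETA = 5, 6, 1.5  # assumed toy dimensions

unit, x, y = [], [], []
for i in range(N_UNITS):
    alpha = rng.gauss(0, 2)  # unit effect, correlated with x below
    for _ in range(N_PERIODS):
        xi = alpha + rng.gauss(0, 1)
        unit.append(i)
        x.append(xi)
        y.append(TRUE_BETA * xi + alpha + rng.gauss(0, 1))


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# LSDV: regress y on x plus one dummy per unit (the dummies replace the intercept).
X = [[xi] + [1.0 if u == j else 0.0 for j in range(N_UNITS)]
     for xi, u in zip(x, unit)]
lsdv_slope = ols(X, y)[0]


def demean(values):
    """Subtract each unit's time-average from its observations."""
    means = [sum(v for v, u in zip(values, unit) if u == j) / N_PERIODS
             for j in range(N_UNITS)]
    return [v - means[u] for v, u in zip(values, unit)]


# Within: demean x and y by unit, then regress without any dummies.
xd, yd = demean(x), demean(y)
within_slope = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

print(f"LSDV={lsdv_slope:.6f}  within={within_slope:.6f}")
```

The two numbers agree up to rounding because partialling the dummies out of x and y is exactly the demeaning step.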

Stage 4 — Why fixed effects matter

learn_pooled_vs_fixed_effects makes the bias concrete: when the unit effect is correlated with the regressor, pooled OLS is biased, and using only within-unit variation (fixed effects) recovers the true slope. This is exactly the move the Analyze case study made on the Kuznets curve.

pf = ex.learn_pooled_vs_fixed_effects(unit_effect_corr=0.8)
print(pf.interpret())
pf.fig

Pooled OLS estimates 1.78 for the slope, biased by the correlation between the regressor and the unit effects. Adding unit fixed effects recovers 1.04, close to the true 1 — the within estimator removes the bias.
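One way to see where the pooled bias comes from is the textbook decomposition of the pooled slope into a weighted average of the within and between slopes. The sketch below checks that identity on simulated data. The design and the knob names (CORR_LOADING, TRUE_BETA) are assumptions for illustration and are not taken from learn_pooled_vs_fixed_effects.

```python
import random

rng = random.Random(2)
N_UNITS, N_PERIODS = 400, 5
TRUE_BETA, CORR_LOADING = 1.0, 0.8  # hypothetical knobs for this sketch

panel = []  # one (xs, ys) pair per unit
for _ in range(N_UNITS):
    alpha = rng.gauss(0, 1)
    xs = [CORR_LOADING * alpha + rng.gauss(0, 1) for _ in range(N_PERIODS)]
    ys = [TRUE_BETA * x + alpha + rng.gauss(0, 1) for x in xs]
    panel.append((xs, ys))


def mean(values):
    return sum(values) / len(values)


all_x = [x for xs, _ in panel for x in xs]
all_y = [y for _, ys in panel for y in ys]
gx, gy = mean(all_x), mean(all_y)

# Pooled OLS slope computed directly from the stacked data.
pooled = (sum((x - gx) * (y - gy) for x, y in zip(all_x, all_y))
          / sum((x - gx) ** 2 for x in all_x))

# Split the cross-products into a within-unit and a between-unit part.
sxx_w = sxy_w = sxx_b = sxy_b = 0.0
for xs, ys in panel:
    mx, my = mean(xs), mean(ys)
    sxx_w += sum((x - mx) ** 2 for x in xs)
    sxy_w += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx_b += len(xs) * (mx - gx) ** 2
    sxy_b += len(xs) * (mx - gx) * (my - gy)

within = sxy_w / sxx_w
between = sxy_b / sxx_b
weight_between = sxx_b / (sxx_w + sxx_b)  # share of x variation across units
blend = weight_between * between + (1 - weight_between) * within

print(f"within={within:.3f}  between={between:.3f}  pooled={pooled:.3f}")
print(f"weighted average={blend:.3f}  (between weight {weight_between:.2f})")
```

The unit effect contaminates only the between slope, so the more of the regressor's variation lies across units, the further pooled OLS is pulled from the within estimate.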

The correlated random effects (Mundlak) estimator bridges fixed and random effects — see its explainer, and analyze_cre_table in Analyze:

ex.explain("correlated_random_effects")

Correlated Random Effects (Mundlak device)

What it is. The Mundlak device augments a random-effects model with the unit-level mean of each time-varying regressor. The coefficient on the original regressor then equals the fixed-effects (within) estimate, while the coefficient on the mean measures how much the between-unit association differs from the within-unit one. A joint test that the mean coefficients are zero is algebraically the Hausman test, so it makes the fixed-vs-random decision in ordinary, testable terms.

When to use it. When you want the robustness of fixed effects but also a single, interpretable model that includes time-invariant regressors, or when you want the fixed-vs-random-effects decision expressed as testable coefficients rather than a separate test statistic.

Watch out for.

- It recovers the within estimate only for the time-varying regressors; truly time-invariant variables still cannot be separated from the unit effect.
- Like random effects, it assumes the idiosyncratic error is uncorrelated with the regressors; the device only relaxes correlation that runs through the unit mean.
- The mean-coefficient test is the Hausman test, so the same small-sample and power caveats apply.

See also: fixed_effects, random_effects, hausman

References: Mundlak (1978), On the Pooling of Time Series and Cross Section Data; Wooldridge, Econometric Analysis of Cross Section and Panel Data, ch. 10
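The central claim of the explainer, that the coefficient on the original regressor equals the within estimate once the unit mean is added, can be checked numerically. The sketch below is a simplification: it uses plain pooled OLS on a balanced simulated panel rather than random-effects GLS, and it does not call analyze_cre_table. The design, the seed and the helper ols are assumptions made for the example.

```python
import random

rng = random.Random(3)
N_UNITS, N_PERIODS, TRUE_BETA = 300, 4, 1.0  # assumed toy design

panel = []  # one (xs, ys) pair per unit
for _ in range(N_UNITS):
    alpha = rng.gauss(0, 1)
    xs = [0.8 * alpha + rng.gauss(0, 1) for _ in range(N_PERIODS)]
    ys = [TRUE_BETA * x + alpha + rng.gauss(0, 1) for x in xs]
    panel.append((xs, ys))


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Mundlak regression: y on a constant, x, and the unit mean of x.
X, y_all = [], []
sxy_w = sxx_w = 0.0
for xs, ys in panel:
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    for x, y in zip(xs, ys):
        X.append([1.0, x, x_bar])
        y_all.append(y)
        # Accumulate the within (demeaned) cross-products alongside.
        sxy_w += (x - x_bar) * (y - y_bar)
        sxx_w += (x - x_bar) ** 2

const, coef_x, coef_mean = ols(X, y_all)
within = sxy_w / sxx_w

print(f"Mundlak coef on x={coef_x:.6f}  within={within:.6f}")
print(f"coef on unit mean={coef_mean:.3f}")
```

A clearly nonzero coefficient on the unit mean signals that the between-unit association differs from the within-unit one, which is the situation where random effects would be misleading.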

Stage 5 — Two inference classics

Omitted-variable bias — what happens to a coefficient when a correlated confounder is left out. The short regression is biased; controlling for the confounder recovers the truth:

ex.learn_omitted_variable_bias(corr_xz=0.6).fig
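The size of the bias follows the classic formula: short-regression slope = long-regression slope + (coefficient on the confounder) x (slope of the confounder on the regressor). The sketch below verifies that identity in-sample. It is an assumed data-generating process, not the one inside learn_omitted_variable_bias, and the names BETA, GAMMA and CORR_XZ are chosen only for this example.

```python
import random

rng = random.Random(4)
N = 2000
BETA, GAMMA, CORR_XZ = 1.0, 1.0, 0.6  # assumed true values for this sketch

z = [rng.gauss(0, 1) for _ in range(N)]  # the confounder
x = [CORR_XZ * zi + (1 - CORR_XZ ** 2) ** 0.5 * rng.gauss(0, 1) for zi in z]
y = [BETA * xi + GAMMA * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]


def mean(values):
    return sum(values) / len(values)


def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = mean(xs), mean(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


def residual(ys, xs):
    """Residuals of ys after an OLS fit on xs (with an intercept)."""
    b, mx, my = slope(xs, ys), mean(xs), mean(ys)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(xs, ys)]


short = slope(x, y)                      # z left out
long_beta = slope(residual(x, z), y)     # coefficient on x, controlling for z
long_gamma = slope(residual(z, x), y)    # coefficient on z, controlling for x
delta = slope(x, z)                      # slope of z on x

print(f"short={short:.3f}  long={long_beta:.3f}")
print(f"long + gamma*delta={long_beta + long_gamma * delta:.3f}")
```

The long-regression coefficients are obtained by residualizing (the Frisch-Waugh-Lovell route), which keeps the example free of matrix code.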

Clustered standard errors — clustering changes the standard error, not the point estimate. Ignoring within-cluster correlation understates uncertainty (the bars shrink too far):

ex.learn_clustering_se(icc=0.3).fig
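A bare-bones version of the comparison, for a single regressor, is sketched below. It assumes a regressor that varies only across clusters and an error with a shared cluster shock; the cluster-robust variance is the plain sandwich formula with no small-sample correction, so it will not match a library's output digit for digit. All names and parameter values are illustrative.

```python
import random

rng = random.Random(5)
N_CLUSTERS, CLUSTER_SIZE, ICC = 100, 20, 0.3  # assumed design

cluster, x, y = [], [], []
for g in range(N_CLUSTERS):
    xg = rng.gauss(0, 1)  # regressor varies only across clusters
    ug = rng.gauss(0, 1)  # shock shared by everyone in the cluster
    for _ in range(CLUSTER_SIZE):
        err = ICC ** 0.5 * ug + (1 - ICC) ** 0.5 * rng.gauss(0, 1)
        cluster.append(g)
        x.append(xg)
        y.append(1.0 + 0.5 * xg + err)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
xt = [xi - mx for xi in x]  # centred regressor
sxx = sum(a * a for a in xt)
beta = sum(a * (yi - my) for a, yi in zip(xt, y)) / sxx
resid = [(yi - my) - beta * a for a, yi in zip(xt, y)]

# Conventional standard error: treats all n residuals as independent.
se_iid = (sum(e * e for e in resid) / (n - 2) / sxx) ** 0.5

# Cluster-robust standard error: sum x*residual within each cluster first.
score = [0.0] * N_CLUSTERS
for g, a, e in zip(cluster, xt, resid):
    score[g] += a * e
se_cluster = sum(s * s for s in score) ** 0.5 / sxx

print(f"slope={beta:.3f}  se_iid={se_iid:.4f}  se_cluster={se_cluster:.4f}")
```

Both standard errors belong to the same slope; only the assumed dependence structure of the residuals differs.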

Stage 6 — Convergence, simulated

The Analyze case study asked whether incomes converge. These sandboxes plant a known answer so you can watch each tool recover it.

learn_beta_convergence — absolute vs conditional convergence on a known-parameter AR(1) panel: omitting a steady-state determinant biases the unconditional slope; conditioning on it recovers the true convergence speed. (See analyze_beta_convergence on real data.)

bc = ex.learn_beta_convergence(rho=0.9, gamma=0.6, corr=0.7)
print(bc.interpret())
bc.fig

Omitting the steady-state determinant, the unconditional slope is 0.0359 — biased away from the true convergence slope -0.0439, so the units look like they barely converge. Conditioning on the determinant recovers -0.044, matching the truth — that is conditional β-convergence. The recovered speed is 0.106 per period (true 0.105), a half-life of 6.51 periods.
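The speed and half-life figures follow from simple arithmetic on rho. The sketch below shows the conversions, assuming the usual AR(1) relations (speed = -ln(rho), half-life = ln(2)/speed) and the standard growth-regression form slope = -(1 - exp(-speed x horizon))/horizon. The horizon of 20 periods is an inference: it reproduces the true slope of -0.0439 quoted above, but the horizon the sandbox actually uses is not stated on this page. The function names are hypothetical.

```python
import math


def speed_from_rho(rho):
    """Per-period convergence speed implied by an AR(1) persistence rho."""
    return -math.log(rho)


def half_life(speed):
    """Periods needed to close half of the gap to the steady state."""
    return math.log(2) / speed


def speed_from_slope(slope, horizon):
    """Invert slope = -(1 - exp(-speed * horizon)) / horizon (assumed form)."""
    return -math.log(1 + slope * horizon) / horizon


rho = 0.9
true_speed = speed_from_rho(rho)
print(f"speed={true_speed:.3f}  half-life={half_life(true_speed):.2f}")

# Hypothetical horizon: inferred from the quoted slope, not documented here.
horizon = 20
implied_slope = -(1 - rho ** horizon) / horizon
recovered_speed = speed_from_slope(implied_slope, horizon)
print(f"implied slope={implied_slope:.4f}  recovered speed={recovered_speed:.3f}")
```

Because the half-life is ln(2) divided by the speed, a small difference between an estimated and a true speed shows up as a small difference in the half-life as well.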

learn_sigma_convergence — a panel whose cross-sectional dispersion narrows at a known rate: the standard deviation, the Gini index and the coefficient of variation all shrink at the log-rate ln(rho), and the function recovers it. (See analyze_sigma_convergence on real data.)

ex.learn_sigma_convergence(rho=0.93).fig
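A stripped-down, noise-free version of the idea is sketched below: if every unit's gap to a common mean shrinks by the factor rho each period, the log of the cross-sectional standard deviation falls by exactly ln(rho) per period, and so does the log of the coefficient of variation. This is an assumed deterministic design for illustration, not the simulation inside learn_sigma_convergence.

```python
import math
import random
import statistics

rng = random.Random(6)
RHO, N_UNITS, N_PERIODS, COMMON_MEAN = 0.93, 200, 15, 10.0  # assumed values

raw = [rng.gauss(0, 1) for _ in range(N_UNITS)]
centre = statistics.fmean(raw)
gap0 = [g - centre for g in raw]  # initial gaps, centred so the mean stays put

periods = list(range(N_PERIODS))
log_sd, log_cv = [], []
for t in periods:
    y = [COMMON_MEAN + RHO ** t * g for g in gap0]  # gaps shrink by RHO each period
    sd = statistics.pstdev(y)
    log_sd.append(math.log(sd))
    log_cv.append(math.log(sd / statistics.fmean(y)))


def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


rate_sd = slope(periods, log_sd)
rate_cv = slope(periods, log_cv)
print(f"ln(rho)={math.log(RHO):.5f}  rate_sd={rate_sd:.5f}  rate_cv={rate_cv:.5f}")
```

With idiosyncratic noise added, the fitted rate would only approximate ln(rho); the deterministic version makes the target of the estimate easy to see.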

learn_convergence_clubs — a panel with a planted club structure: each unit belongs to one of several clubs converging to distinct levels, so the panel does not converge globally, yet the Phillips–Sul clustering algorithm recovers the clubs without being told they exist. (See analyze_convergence_clubs on real data.)

ex.learn_convergence_clubs().fig

Stage 7 — The Kuznets wave, simulated

learn_kuznets_waves is the teaching counterpart to the flagship analyze_kuznets_waves. It plants a known polynomial wave into a panel and fits it under three estimators. The within and pooled estimators recover the true top-order coefficient; the between estimator differs — because the average of a nonlinear function is not the function of the average. That gap is exactly why the Analyze case study asked whether the N-shape was a within- or a between-country pattern.

kw = ex.learn_kuznets_waves()
print(kw.interpret())
kw.fig

The panel was built with a known degree-4 Kuznets wave whose top-order coefficient is 0.04. The within (two-way fixed-effects) estimator recovers 0.0399 and pooled OLS 0.0388 — both close to the truth, because the wave is a within-unit relationship. The between estimator gives 0.0296, which differs: it compares unit averages, and the average of a nonlinear curve is not the curve of the average.
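The last sentence of that reading can be made concrete with a few numbers. The sketch below uses a simple quadratic as a stand-in for the wave (an assumption; the sandbox plants a higher-degree polynomial) and one hypothetical unit observed at five values of the regressor.

```python
import statistics


def curve(x):
    """A deliberately simple nonlinear curve standing in for the wave."""
    return x ** 2


x_within_unit = [1.0, 2.0, 3.0, 4.0, 5.0]  # one unit's regressor over time

# What the between estimator sees on the left-hand side: the unit average of y.
mean_of_curve = statistics.fmean(curve(x) for x in x_within_unit)

# What the curve would predict at the unit's average x.
curve_of_mean = curve(statistics.fmean(x_within_unit))

gap = mean_of_curve - curve_of_mean
spread = statistics.pvariance(x_within_unit)
print(f"mean of curve={mean_of_curve}  curve of mean={curve_of_mean}")
print(f"gap={gap}  within-unit variance of x={spread}")
```

For a quadratic the gap equals the within-unit variance of x, so units whose regressor moves more over time sit further from the curve when only their averages are compared.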

Where to go next

Explore panel data — describe your data before you model it.
Analyze panel data — the estimators these ideas underpin, on the real Kuznets panel.
Browse every concept explainer in the API reference, or pass any list_topics() key to ex.explain(...).