analyze_kuznets_waves

This page is two things at once: an extended user guide for analyze_kuznets_waves — what it does, every argument, and everything it returns — and a testing environment that generates synthetic data with known coefficients and checks that the function recovers them. If a cell’s assert ever fails, the function is broken.

What are Kuznets waves?

Kuznets (1955) conjectured an inverted-U between inequality and development: inequality first rises as an economy industrializes, then falls. The classic test regresses a Gini on log GDP per capita and its square. The Kuznets waves hypothesis allows the relationship to be far more nonlinear by taking the development term up to the fourth power:

\[ \text{gini} \;=\; b_1 g + b_2 g^2 + b_3 g^3 + b_4 g^4 + \varepsilon, \qquad g = \log \text{GDP per capita}. \]

A cubic admits an N-shape; a quartic admits a full wave with up to three turning points. The same polynomial is estimated under three panel estimators — pooled OLS (all variation), the between estimator (country averages, the cross-country curve) and the within estimator (two-way country + year fixed effects) — and laid out side by side. The development variable is used as you supply it (pass log GDP per capita); the powers are formed internally.
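To make the turning-point count concrete, here is a minimal NumPy sketch that finds where a polynomial's slope is zero inside a given range. The helper name `turning_points` and the example coefficients are illustrative assumptions, not part of expdpy, and the function's own turning-point logic may differ.

```python
import numpy as np


def turning_points(betas, g_min, g_max):
    """Real turning points of y = sum_k betas[k-1] * g**k that lie in [g_min, g_max]."""
    # Ascending coefficients with a zero intercept: 0 + b1*g + b2*g^2 + ...
    poly = np.polynomial.Polynomial([0.0, *betas])
    roots = poly.deriv().roots()                    # zeros of dy/dg
    real = roots[np.isclose(roots.imag, 0.0)].real  # drop complex roots
    inside = real[(real >= g_min) & (real <= g_max)]
    return np.sort(inside)


# Hypothetical cubic y = 3g - 2g^2 + g^3/3, so dy/dg = (g - 1)(g - 3).
tp = turning_points([3.0, -2.0, 1.0 / 3.0], 0.0, 5.0)
print(tp)
```

A polynomial of degree d has a derivative of degree d - 1, so it has at most d - 1 turning points: one for a quadratic, two for a cubic, three for a quartic. Restricting the roots to the observed range matters because turning points outside the data are extrapolation.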

import numpy as np
import pandas as pd

import expdpy as ex

1. The method in one cell

On the bundled kuznets panel the defaults already point at gini_regional and log_gdp_pc, so the call is a one-liner once the panel ids are declared:

from expdpy.data import load_kuznets

df = ex.set_panel(load_kuznets(), entity="country", time="year")
res = ex.analyze_kuznets_waves(df)
res.fig

The figure overlays the pooled quartic on the raw scatter; the annotation reports the turning points, N and R². The within comparison table shows how the curvature emerges as each power is added:

res.gt_within

	gini_regional
	(1)	(2)	(3)	(4)
coef
log_gdp_pc	0.024 (0.024)	0.024 (0.209)	6.411*** (0.210)	4.850** (1.792)
log_gdp_pc_p2		0.000 (0.011)	-0.715*** (0.023)	-0.452 (0.298)
log_gdp_pc_p3			0.026*** (0.001)	0.007 (0.022)
log_gdp_pc_p4				0.001 (0.001)
fe
year	x	x	x	x
country	x	x	x	x
stats
Observations	880	880	880	880
R²	0.74	0.74	0.874	0.874
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

2. How the function works

Arguments

argument	what it does	when to change it
`inequality`	the outcome (a Gini), used as-is	any inequality measure
`development`	the regressor (typically `log` GDP per capita); powers `g^2..g^degree` are formed internally	pass it in logs for the income case
`controls`	covariate name(s) partialled out of the between and within figures (FWL); they do not enter the tables	to net confounders out of the wave
`entity`, `time`	the panel ids	omit if declared once via `set_panel`
`degree`	polynomial order in `[2, 6]` (default 4)	`degree=2` is the classic inverted-U; raise it only when justified
`vcov`	`"hetero"` (HC1, default) or `"iid"` for the pooled/between tables; the within table always clusters by entity	use `"iid"` for classical SEs; the choice never changes a point estimate
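The `vcov` row says the covariance choice changes standard errors but never a point estimate. The sketch below illustrates that with plain NumPy on hypothetical heteroskedastic data. It uses the textbook classical and HC1 formulas and is not expdpy's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
# Hypothetical heteroskedastic noise: the error spread grows with |x|.
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y            # point estimates, the same under either vcov
resid = y - X @ beta

# Classical ("iid") covariance: s^2 (X'X)^-1.
vcov_iid = (resid @ resid) / (n - k) * XtX_inv

# HC1 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k).
meat = (X * resid[:, None] ** 2).T @ X
vcov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv

se_iid = np.sqrt(np.diag(vcov_iid))
se_hc1 = np.sqrt(np.diag(vcov_hc1))
print("beta  :", beta.round(3))
print("se iid:", se_iid.round(3))
print("se hc1:", se_hc1.round(3))
```

Because `beta` is computed before either covariance matrix, only the standard errors, and therefore the significance stars, can differ between the two options.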

The three figures and three tables

The figures tell a pooled → between → within story. The between and within plots are partial-residual (component) plots: the fitted wave drawn over inequality once the controls (and, for the within view, the two-way fixed effects) are partialled out by the Frisch–Waugh–Lovell theorem.

res.fig_between

res.fig_within
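The Frisch–Waugh–Lovell step behind the partial-residual plots can be checked by hand. The sketch below uses hypothetical simulated data and plain NumPy least squares. It shows that the slope from regressing residualized `y` on residualized `x` equals the coefficient on `x` in the full regression. The variable names are illustrative and this is not the function's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                     # hypothetical control
x = 0.6 * z + rng.normal(size=n)           # regressor correlated with the control
y = 2.0 * x - 1.0 * z + rng.normal(size=n)


def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]


def residualize(v, Z):
    """Part of v that is orthogonal to the columns of Z."""
    return v - Z @ ols(Z, v)


Z = np.column_stack([np.ones(n), z])          # intercept + control
full = ols(np.column_stack([Z, x]), y)[-1]    # coefficient on x, full regression

x_res = residualize(x, Z)
y_res = residualize(y, Z)
partial = (x_res @ y_res) / (x_res @ x_res)   # slope in the partial-residual plot

print(full, partial)
```

The two numbers agree up to floating-point error, which is why a curve drawn through the residualized scatter has the same slope coefficients as the multivariate fit.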

gt_pooled, gt_between and gt_within are the three comparison tables; summary is the per-estimator curvature frame, and .interpret() reads it in plain language:

print("figures :", ["fig", "fig_between", "fig_within"])
print("tables  :", ["gt_pooled", "gt_between", "gt_within"])
res.summary

figures : ['fig', 'fig_between', 'fig_within']
tables  : ['gt_pooled', 'gt_between', 'gt_within']

	estimator	n_obs	r2	n_turning_points	peak_g	top_term	top_estimate	top_pvalue
0	pooled	880	0.744400	2	7.930502	log_gdp_pc_p4	0.000050	0.877707
1	between	80	0.829918	2	7.942788	log_gdp_pc_p4	-0.000131	0.864605
2	within	880	0.521024	2	7.944620	log_gdp_pc_p4	0.000523	0.372140

print(res.interpret())

Across 880 observations, a degree-4 polynomial relates **gini_regional** to **log_gdp_pc** (the extended Kuznets-waves specification). The three estimators read the association at different levels of variation:
- **Pooled OLS** (the raw cross-sectional pattern): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.93; its highest-order term is not statistically significant at conventional levels (R² = 0.744).
- **Between (cross-country)** (comparing country averages): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.83).
- **Within (two-way FE)** (within-country variation net of common year effects): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.521).
All three estimators agree on the curvature, so the shape is not an artefact of cross-country versus within-country variation.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._

3. Does it recover the truth?

Because the wave is a within-unit relationship, the cleanest check plants a known polynomial \(y = \sum_k b_k g^k\) on top of unit and year effects drawn independently of \(g\). The within (two-way fixed-effects) and pooled estimators should recover the planted coefficients.

BETAS = (0.5, -0.3, 0.05, 0.04)  # the planted (b1, b2, b3, b4)


def wave_panel(betas=BETAS, *, n_units=80, n_years=15, seed=0):
    """Panel y = sum_k b_k g^k + a_i + d_t + e with g varying within and between units."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 0.4, n_units)            # unit effects, independent of g
    d = rng.normal(0.0, 0.4, n_years)            # year effects, independent of g
    xbar = rng.normal(0.0, 1.0, n_units)         # cross-unit development level
    rows = []
    for i in range(n_units):
        for t in range(n_years):
            g = float(xbar[i] + rng.normal(0.0, 1.0))
            y = (sum(b * g ** (k + 1) for k, b in enumerate(betas))
                 + a[i] + d[t] + rng.normal(0.0, 0.03))
            rows.append((f"u{i:03d}", t, g, y))
    return pd.DataFrame(rows, columns=["unit", "year", "g", "y"])


panel = wave_panel(seed=1)
fit = ex.analyze_kuznets_waves(panel, "y", "g", entity="unit", time="year", degree=4)

terms = ["g", "g_p2", "g_p3", "g_p4"]
within_hat = np.array([fit.models["within"][-1].coef()[t] for t in terms])
check = pd.DataFrame({"term": terms, "true": BETAS, "within_recovered": within_hat})
check["abs_error"] = (check["within_recovered"] - check["true"]).abs()
check.round(4)

	term	true	within_recovered	abs_error
0	g	0.50	0.5017	0.0017
1	g_p2	-0.30	-0.2998	0.0002
2	g_p3	0.05	0.0499	0.0001
3	g_p4	0.04	0.0400	0.0000

# The within (two-way FE) estimator recovers the planted wave to within a tight tolerance.
np.testing.assert_allclose(within_hat, BETAS, atol=0.03)
print("✅ within (two-way FE) recovered the planted quartic")

✅ within (two-way FE) recovered the planted quartic
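For intuition about what the within estimator does, the sketch below applies the two-way demeaning transform by hand on a small hypothetical balanced panel with a linear planted slope. The transform shown, subtracting the unit mean and the year mean and adding back the grand mean, is exact only for balanced panels. The helper `two_way_demean` is an illustrative assumption and not expdpy's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_years = 30, 10
unit = np.repeat(np.arange(n_units), n_years)
year = np.tile(np.arange(n_years), n_units)
a = rng.normal(size=n_units)[unit]        # unit effects
d = rng.normal(size=n_years)[year]        # year effects
g = rng.normal(size=unit.size)
y = 0.7 * g + a + d + rng.normal(scale=0.05, size=unit.size)
toy = pd.DataFrame({"unit": unit, "year": year, "g": g, "y": y})


def two_way_demean(frame, col):
    """x_it - mean_i(x) - mean_t(x) + mean(x); exact for balanced panels only."""
    s = frame[col]
    return (s
            - s.groupby(frame["unit"]).transform("mean")
            - s.groupby(frame["year"]).transform("mean")
            + s.mean())


g_w = two_way_demean(toy, "g").to_numpy()
y_w = two_way_demean(toy, "y").to_numpy()
slope = float(g_w @ y_w / (g_w @ g_w))    # OLS slope on the demeaned data
print(round(slope, 3))
```

The demeaning removes `a` and `d` exactly in a balanced panel, so the slope on the transformed data should land close to the planted 0.7.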

Between is not the same estimator

The between estimator compares unit averages, and the average of a nonlinear curve is not the curve of the average — so it need not match the planted within-unit wave. That divergence is exactly what the three-estimator layout is for:

pooled_hat = np.array([fit.models["pooled"][-1].coef()[t] for t in terms])
between_hat = np.array([fit.models["between"][-1].coef()[t] for t in terms])
pd.DataFrame(
    {"term": terms, "true": BETAS, "pooled": pooled_hat,
     "between": between_hat, "within": within_hat}
).round(4)

	term	true	pooled	between	within
0	g	0.50	0.4776	0.5744	0.5017
1	g_p2	-0.30	-0.3275	-0.0970	-0.2998
2	g_p3	0.05	0.0519	0.0512	0.0499
3	g_p4	0.04	0.0417	0.0419	0.0400

# Pooled also recovers the within-unit wave (the unit and year effects are drawn independently of g)...
np.testing.assert_allclose(pooled_hat, BETAS, atol=0.05)
# ...while the between (cross-unit) curve is a genuinely different object.
print("✅ pooled recovers the wave; between is the cross-country curve")

✅ pooled recovers the wave; between is the cross-country curve
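The remark that the average of a nonlinear curve is not the curve of the average can be seen in the simplest case, a square. The sketch below uses a hypothetical panel: for each unit, the mean of g² exceeds the square of the mean of g by exactly the within-unit variance. This gap is what pulls a curve fitted to unit averages away from the within-unit curve.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_years = 5, 200
unit = np.repeat(np.arange(n_units), n_years)
# Each unit has its own development level plus within-unit movement.
g = rng.normal(size=n_units)[unit] + rng.normal(size=unit.size)
toy = pd.DataFrame({"unit": unit, "g": g, "g2": g ** 2})

by_unit = toy.groupby("unit")
mean_of_square = by_unit["g2"].mean()       # average of the curve
square_of_mean = by_unit["g"].mean() ** 2   # curve of the average
within_var = by_unit["g"].var(ddof=0)       # population variance within each unit

gap = mean_of_square - square_of_mean
print(pd.DataFrame({"gap": gap, "within_var": within_var}).round(4))
```

The two columns match row by row. For higher powers the gap also depends on the unit's mean, so averaging can shift lower-order coefficients, not only the intercept.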

4. The Kuznets curve on the bundled panel

The bundled kuznets panel was designed to show an N-shaped regional Kuznets curve. Read the three estimators together — the shape and where it lives:

res.summary

	estimator	n_obs	r2	n_turning_points	peak_g	top_term	top_estimate	top_pvalue
0	pooled	880	0.744400	2	7.930502	log_gdp_pc_p4	0.000050	0.877707
1	between	80	0.829918	2	7.942788	log_gdp_pc_p4	-0.000131	0.864605
2	within	880	0.521024	2	7.944620	log_gdp_pc_p4	0.000523	0.372140

print(res.interpret())

Across 880 observations, a degree-4 polynomial relates **gini_regional** to **log_gdp_pc** (the extended Kuznets-waves specification). The three estimators read the association at different levels of variation:
- **Pooled OLS** (the raw cross-sectional pattern): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.93; its highest-order term is not statistically significant at conventional levels (R² = 0.744).
- **Between (cross-country)** (comparing country averages): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.83).
- **Within (two-way FE)** (within-country variation net of common year effects): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.521).
All three estimators agree on the curvature, so the shape is not an artefact of cross-country versus within-country variation.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._

What it is. Kuznets (1955) conjectured an inverted-U relationship between income inequality and economic development: as an economy industrializes, inequality first rises (labour shifts from agriculture to higher-paid industry at uneven speeds), then falls (education and redistribution spread the gains). The classic test regresses an inequality measure (a Gini) on log GDP per capita and its square; a positive linear term with a negative quadratic term traces the inverted-U. The Kuznets waves hypothesis generalizes this: the relationship may be far more nonlinear, so the development term enters up to the fourth power — gini = f(g, g^2, g^3, g^4) with g = log GDP per capita. A cubic admits an N-shape (inequality falls then rises again as a post-industrial, skill-biased phase sets in); a quartic admits a full wave with up to three turning points. Estimating the same polynomial under three panel estimators disentangles where the shape lives: pooled OLS uses all variation at once, the between estimator compares country averages (the cross-country curve), and the within (two-way fixed-effects) estimator uses only within-country movements net of common year shocks. Frisch-Waugh-Lovell partial-residual plots show the fitted wave after covariates and fixed effects are partialled out.

When to use it. Use it to test whether inequality and development trace an inverted-U, an N-shape or a richer wave across a panel of countries or regions, and to check whether that shape is a cross-sectional phenomenon (countries at different stages differ) or a within-country dynamic (a single country’s inequality rises then falls as it develops). Comparing the pooled, between and within estimators is the point: if the between curve is hump-shaped but the within curve is flat, the ‘curve’ is really a comparison of different countries, not a path any one country travels. Raise the polynomial degree only when theory or the data motivate the extra turning points — a higher-order term that is statistically indistinguishable from zero is just overfitting.

Watch out for.
- The Kuznets curve is a descriptive pattern, not a causal mechanism; omitted determinants (institutions, technology, trade, redistribution) can generate the same shape.
- A significant negative quadratic term is necessary but not sufficient for an inverted-U: the implied peak must lie inside the observed range of development, otherwise the curve is effectively monotonic over the data.
- High-order polynomials are unstable at the edges of the data and sensitive to a few extreme units; read the turning points only where the data are dense, and prefer the lowest degree that fits.
- The between and within estimators answer different questions and need not agree; the within (fixed-effects) curve discards all cross-country variation, so it can be flat even when the between curve is strongly hump-shaped (and vice versa).
- The specific inequality measure matters: a Gini, a top-income share and a Palma ratio can trace different curvature on the same data.
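The peak-inside-the-range caveat is easy to check for the quadratic case. The sketch below is a minimal illustration. The helper `inverted_u_peak` and the coefficient values are hypothetical and not part of expdpy.

```python
def inverted_u_peak(b1, b2, g_min, g_max):
    """For y = b1*g + b2*g**2 return (peak, inside_range).

    The peak is None unless b2 < 0, since only then is the curve an inverted U.
    """
    if b2 >= 0:
        return None, False
    peak = -b1 / (2.0 * b2)          # where dy/dg = b1 + 2*b2*g equals zero
    return peak, g_min <= peak <= g_max


# Hypothetical coefficients with log GDP per capita observed on [6, 11].
peak_a, inside_a = inverted_u_peak(0.8, -0.05, 6.0, 11.0)   # peak at g = 8
peak_b, inside_b = inverted_u_peak(0.8, -0.02, 6.0, 11.0)   # peak at g = 20
print(peak_a, inside_a)
print(peak_b, inside_b)
```

In the second case the quadratic term is negative but the peak lies far above the observed range, so over the data the fitted curve only rises.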

See also: fwl, fixed_effects, within_between_variation, correlation_vs_causation

References: Kuznets (1955), ‘Economic Growth and Income Inequality’, American Economic Review 45(1): 1-28; Gallup (2012), ‘Is there a Kuznets Curve?’, Portland State University; Palma (2011), ‘Homogeneous middles vs. heterogeneous tails’, Development and Change 42(1)