The Analyze module is where exploration becomes estimation. This page is a case study: having explored the country–year panel (see Explore), you now want to model the Kuznets curve — does regional inequality rise then fall as economies develop? — and stress-test the answer the way a careful analyst would. We move through a single, intuitive workflow: fit a first model → respect the panel structure → enrich the estimation → read the fitted model → stress-test the inference → choose the right panel estimator → assemble the flagship curve → ask a related convergence question → and finally a causal design. Every Analyze function appears once, each with a note on why you reach for it at that step.
The lead dataset is the bundled kuznets panel (80 countries observed every year over 2015–2025), whose inequality measure (gini_regional) traces an N-shaped pattern in (log) GDP per capita — a cubic — surrounded by realistic determinants: trade openness, resource rents, democracy, schooling, FDI, population.
Every Analyze function takes a pandas DataFrame (or a fitted result) and returns a small result object carrying a tidy .df, plus a publication-quality table (.etable / .gt) or an interactive Plotly figure (.fig). Most also offer a plain-language .interpret(). Estimation runs on pyfixest and linearmodels under the hood — you never hand-roll an OLS.
Note
Every reading below describes an association, never a cause — even the fixed-effects and event-study results. The .interpret() text is deliberately associational; the Learn module explains the assumptions a causal claim would additionally require.
Stage 0 — Set up the panel
A panel has two coordinates: an entity (here the country) and a time index (the year). set_labels attaches the data dictionary’s human-readable labels (so tables and axes read “Regional inequality (Gini)” instead of gini_regional), and with set_panel=True it also reads the entity / time tags and declares the panel — so every estimator below can omit entity= / time= and the fixed-effects and clustering defaults just work.
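Before declaring a panel it helps to confirm that each entity-time pair appears exactly once. The following is a minimal, dependency-free sketch of that check; the helper name is_balanced and the toy rows are illustrative assumptions, not part of the library.

```python
from itertools import product

def is_balanced(rows, entity="country", time="year"):
    """Return True when every entity is observed exactly once in every period."""
    keys = [(r[entity], r[time]) for r in rows]
    entities = {e for e, _ in keys}
    periods = {t for _, t in keys}
    no_duplicates = len(keys) == len(set(keys))
    full_grid = set(keys) == set(product(entities, periods))
    return no_duplicates and full_grid

# Toy panel with the same shape as kuznets: 80 entities x 11 years (2015-2025).
toy = [{"country": f"C{i:02d}", "year": y}
       for i in range(80) for y in range(2015, 2026)]

n_obs = len(toy)                    # 80 * 11 country-years
balanced = is_balanced(toy)
unbalanced = is_balanced(toy[:-1])  # drop one country-year to break the grid
```

The product 80 × 11 = 880 is the observation count that the regression tables on this page report.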
Start simple. analyze_regression_table fits OLS of inequality on a cubic in log GDP per capita — the functional form that can bend up, down, then up again (the N). A pooled regression treats every country-year as an independent draw:
pooled = ex.analyze_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
)
pooled.etable
| | Regional inequality (Gini) (1) |
|---|---|
| coef | |
| Log GDP per capita | 6.385*** (0.134) |
| Log GDP per capita² | -0.711*** (0.015) |
| Log GDP per capita³ | 0.026*** (0.001) |
| Intercept | -18.490*** (0.402) |
| stats | |
| Observations | 880 |
| R2 | 0.744 |

Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
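The turning points implied by a cubic follow from setting its first derivative to zero. Here is a small worked sketch using the rounded pooled coefficients from the table above. Because those coefficients carry only three decimals and the discriminant is small, the locations are sensitive to rounding and should be read as an illustration.

```python
import math

# Coefficients copied from the pooled table above (rounded to three decimals).
b1, b2, b3 = 6.385, -0.711, 0.026

# Fitted curve: f(g) = b1*g + b2*g**2 + b3*g**3 (+ intercept).
# Turning points solve f'(g) = b1 + 2*b2*g + 3*b3*g**2 = 0.
a, b, c = 3 * b3, 2 * b2, b1
disc = b * b - 4 * a * c
if disc <= 0:
    raise ValueError("no turning points: the fitted curve is monotone")

roots = sorted([(-b - math.sqrt(disc)) / (2 * a),
                (-b + math.sqrt(disc)) / (2 * a)])

def second_derivative(g):
    return 2 * b2 + 6 * b3 * g

# A negative second derivative marks a local maximum, a positive one a minimum.
kinds = ["maximum" if second_derivative(g) < 0 else "minimum" for g in roots]
peak, trough = roots
print(f"peak near {peak:.1f}, trough near {trough:.1f} (log GDP per capita)")
```

With these rounded inputs the curve rises to a maximum at roughly 8.0, falls to a minimum at roughly 10.2, then rises again, which is the N shape.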
But kuznets is a country–year panel, and pooled OLS confounds two very different comparisons: rich versus poor countries, and a country as it grows richer over time. The standard fix is to absorb two-way (country + year) fixed effects, which identifies the curve from within-country movement net of common shocks, with standard errors clustered by country. The fitted result is stored as fe and reused in the steps below.
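To see what absorbing two-way fixed effects does, the sketch below applies the within transformation by hand on a toy balanced panel. It is a dependency-free illustration rather than the library's estimator; the data, the helper names and the true slope of 0.5 are all assumptions made for the example.

```python
from statistics import mean

countries = ["A", "B", "C"]
years = [2015, 2016, 2017, 2018]
TRUE_SLOPE = 0.5

# Toy balanced panel: y = country effect + year effect + 0.5 * x (no noise).
rows = []
for i, country in enumerate(countries, start=1):
    for t, year in enumerate(years, start=1):
        x = float(i * t)  # varies within country and within year
        y = 10.0 * i + 2.0 * t + TRUE_SLOPE * x
        rows.append({"country": country, "year": year, "x": x, "y": y})

def two_way_demean(rows, var):
    """Subtract country and year means, add back the grand mean (balanced panels only)."""
    grand = mean(r[var] for r in rows)
    by_country = {c: mean(r[var] for r in rows if r["country"] == c) for c in countries}
    by_year = {t: mean(r[var] for r in rows if r["year"] == t) for t in years}
    return [r[var] - by_country[r["country"]] - by_year[r["year"]] + grand for r in rows]

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

pooled_slope = ols_slope([r["x"] for r in rows], [r["y"] for r in rows])
within_slope = ols_slope(two_way_demean(rows, "x"), two_way_demean(rows, "y"))
```

The pooled slope (about 2.67) is inflated by the country and year effects, while the within slope recovers 0.5. The simple demeaning shortcut is exact only for balanced panels; unbalanced data need an iterative or dummy-variable approach.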
Every result can explain itself in plain, associational language:
print(fe.interpret())
This OLS regression relates **gini_regional** to its regressors. Fixed effects for *country + year* absorb time-invariant differences, so coefficients reflect variation **within** each group. Standard errors are clustered by *country*.
- **log_gdp_pc**: each one-unit increase is associated with gini_regional that is 6.41 higher (statistically significant at the 1% level).
- **log_gdp_pc_sq**: each one-unit increase is associated with gini_regional that is 0.715 lower (statistically significant at the 1% level).
- **log_gdp_pc_cu**: each one-unit increase is associated with gini_regional that is 0.0261 higher (statistically significant at the 1% level).
Model fit: N = 880, R² = 0.874, within-R² = 0.521.
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
analyze_coefficient_plot puts the two specifications side by side, so you can see how absorbing fixed effects moves each coefficient and its confidence interval:
analyze_fwl_plot shows the same coefficient as a picture. Using the Frisch–Waugh–Lovell theorem, it residualizes both the outcome and the focal regressor (log_gdp_pc) on the other terms and the fixed effects, then scatters the two sets of residuals against each other. The fitted slope equals the focal coefficient in the table above: the multivariate estimate, reduced to a single readable picture.
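The Frisch–Waugh–Lovell equivalence can be checked numerically. This minimal sketch uses made-up data with one focal regressor x and one control z (all names and values are illustrative assumptions) and compares the multivariate coefficient with the slope from the residual-on-residual regression.

```python
from statistics import mean

def demean(v):
    m = mean(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residualize(v, control):
    """Residuals from regressing v on an intercept and one control."""
    vd, cd = demean(v), demean(control)
    slope = dot(cd, vd) / dot(cd, cd)
    return [a - slope * b for a, b in zip(vd, cd)]

# Toy data: y depends on a focal regressor x and a correlated control z.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 8.2, 8.8, 13.1, 13.9]

# Multivariate coefficient on x from the two-regressor normal equations.
xd, zd, yd = demean(x), demean(z), demean(y)
det = dot(xd, xd) * dot(zd, zd) - dot(xd, zd) ** 2
beta_x_full = (dot(zd, zd) * dot(xd, yd) - dot(xd, zd) * dot(zd, yd)) / det

# FWL: partial z out of both y and x, then regress residual on residual.
x_res, y_res = residualize(x, z), residualize(y, z)
beta_x_fwl = dot(x_res, y_res) / dot(x_res, x_res)
```

The two numbers agree up to floating-point error, which is why a scatter of the two residual series carries the same slope as the full regression.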
analyze_estimation is the richer companion to the regression table: same OLS / fixed-effects core, plus stepwise model comparison, multiple outcomes, weights, and serial-correlation-robust standard errors (Newey–West, Driscoll–Kraay). A cumulative-stepwise (csw) comparison adds one curvature term at a time — watch the linear coefficient move as the quadratic and cubic enter, the signature of a genuinely non-linear relationship:
Stage 3 — Read the fitted model
A fitted model carries more than a coefficient table. The next three tools all take the fitted result from Stage 1 and interrogate it.
analyze_predictions returns fitted values, residuals and actuals on the estimation sample — the raw material for residual diagnostics:
ex.analyze_predictions(fe).df.head()
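|   | actual | predicted | residual |
|---|----------|-----------|-----------|
| 0 | 0.085611 | 0.087443 | -0.001832 |
| 1 | 0.080064 | 0.127295 | -0.047231 |
| 2 | 0.212049 | 0.195043 | 0.017006 |
| 3 | 0.221831 | 0.233842 | -0.012010 |
| 4 | 0.261234 | 0.298890 | -0.037657 |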
analyze_fixef_plot ranks the country intercepts the fixed effects absorbed — which countries sit structurally high or low on inequality, before development plays any role (top 20 shown):
analyze_joint_test runs a Wald test that the curvature terms are jointly zero. Rejecting it is the formal statement that the relationship really bends — that a straight line would not do:
Joint F-test that [log_gdp_pc_sq, log_gdp_pc_cu] are all zero: statistic = 1099, p = 2.694e-239 — jointly statistically significant.
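The logic of a joint test can be sketched with the classical restricted-versus-unrestricted F statistic. This is not the library's Wald test, which would use the fitted model's own covariance matrix; it is a simpler relative that assumes homoskedastic errors, run on made-up data with a hand-written solver.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss(x, y, degree):
    """Residual sum of squares from an OLS polynomial fit of the given degree."""
    X = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Toy data: a cubic signal plus a small deterministic wiggle standing in for noise.
x = [i / 4 - 2.5 for i in range(21)]  # 21 points from -2.5 to 2.5
y = [0.5 * xi - 0.3 * xi**2 + 0.2 * xi**3 + 0.05 * (-1) ** i
     for i, xi in enumerate(x)]

rss_restricted = rss(x, y, degree=1)  # straight line: curvature terms forced to zero
rss_full = rss(x, y, degree=3)        # cubic: curvature terms free
q = 2                                 # number of restrictions tested
dof = len(x) - 4                      # observations minus cubic parameters
f_stat = ((rss_restricted - rss_full) / q) / (rss_full / dof)
```

A large f_stat means that dropping the squared and cubed terms costs a lot of fit relative to the residual noise, which is what rejecting the joint null expresses.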
Stage 4 — Stress-test the inference
Cluster-robust standard errors lean on having many clusters. When that is in doubt, large-sample p-values can be over-confident. analyze_robust_inference offers two finite-sample alternatives: randomization inference (ritest) and the wild cluster bootstrap (wildboot). Here we test a determinant — does trade openness move inequality? — re-randomizing within country.
Note
pyfixest’s randomization inference needs an integer cluster code, so we add a numeric country_id alongside the string country.
Trade-share coefficient -0.059: randomization-inference p = 0.128 over 500 permutations.
A randomization-inference p-value well above 0.05 is a useful caution: the association that an asymptotic cluster standard error might call significant does not survive a stricter, assumption-light test.
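The idea behind re-randomizing within country can be sketched in a few lines. The example below is a simplified stand-in, not pyfixest's ritest: it shuffles a regressor inside each group, recomputes a within-group slope each time, and reports the share of shuffled slopes at least as extreme as the observed one. The data are simulated and every name is an assumption.

```python
import random
from statistics import mean

def within_slope(x, y, groups):
    """Slope of y on x after removing group (country) means from both."""
    def demean(v):
        means = {g: mean(a for a, h in zip(v, groups) if h == g) for g in set(groups)}
        return [a - means[g] for a, g in zip(v, groups)]
    xd, yd = demean(x), demean(y)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

def permute_within(x, groups, rng):
    """Shuffle x separately inside each group, leaving group membership fixed."""
    out = list(x)
    for g in sorted(set(groups)):  # sorted for reproducible draws
        idx = [i for i, h in enumerate(groups) if h == g]
        vals = [x[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    return out

# Toy panel: 4 countries x 5 years with a weak negative within-country slope.
rng = random.Random(0)
groups = [g for g in "ABCD" for _ in range(5)]
x = [rng.gauss(0, 1) for _ in groups]
y = [-0.2 * xi + rng.gauss(0, 1) for xi in x]

observed = within_slope(x, y, groups)
n_perm = 500
draws = [within_slope(permute_within(x, groups, rng), y, groups)
         for _ in range(n_perm)]

# Two-sided p-value with the usual +1 correction so it can never be exactly zero.
p_value = (1 + sum(abs(d) >= abs(observed) for d in draws)) / (1 + n_perm)
```

The +1 correction is one common convention; a library may instead report the raw share, so small differences in the last digit are expected.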
Stage 5 — Which panel estimator?
Fixed effects are one choice among several. analyze_panel_table lays the classics side by side — pooled, between (cross-country means), fixed (within), and random effects:
T-stats reported in parentheses.
Fixed or random effects? Random effects is more efficient but assumes the country effect is uncorrelated with the regressors. analyze_hausman_test tests exactly that:
Hausman test (χ²(1) = 13.659, p = 0.0002192): **reject** the null — the random-effects assumption is violated, so prefer **fixed effects**. Note that failing to reject reflects a lack of evidence against random effects, not proof that it is correct.
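The reported p-value can be reproduced from the reported statistic with the standard library alone, because a χ²(1) variable is a squared standard normal. The sketch also shows the textbook single-coefficient form of the Hausman statistic; the inputs to that second calculation are hypothetical and are not taken from the fitted models.

```python
import math

def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(stat / 2.0))

def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Single-coefficient Hausman statistic: squared gap over the variance gap."""
    return (b_fe - b_re) ** 2 / (var_fe - var_re)

# Reproduce the p-value reported above from the reported statistic.
p_reported = chi2_sf_df1(13.659)

# Hypothetical inputs, only to show how the statistic is assembled.
h_demo = hausman_scalar(b_fe=0.030, b_re=0.050, var_fe=0.0004, var_re=0.0003)
p_demo = chi2_sf_df1(h_demo)
```

The statistic grows when the two estimates disagree by more than the efficiency gain of random effects can explain. With several coefficients the same idea uses a quadratic form and more degrees of freedom.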
analyze_cre_table gives the same comparison a more readable form: the correlated random effects (Mundlak) device augments a random-effects model with each regressor’s country mean. The coefficient on log_gdp_pc then equals its fixed-effects (within) estimate, while a test on the mean terms is the regression-form Hausman test — one table that holds the within estimate, the between signal, and the specification test together:
This Correlated Random Effects (Mundlak) model relates **gini_regional** to its regressors *and* their unit (entity) means. By the Mundlak equivalence the coefficient on each original regressor equals its **within (fixed-effects)** estimate, while the coefficient on the mean is the gap between the between- and within-unit associations.
- **log_gdp_pc** (within estimate): a one-unit increase is associated with gini_regional that is 0.0292 higher (not statistically significant at conventional levels).
Joint test that the 1 mean coefficient(s) are zero — the regression-form Hausman test — χ²(1) = 2.23, p = 0.136: it is indistinguishable from zero, so **random effects is admissible** (and more efficient than fixed effects).
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
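The Mundlak equivalence described above can be verified on a toy panel: regressing y on x and the country mean of x gives the same coefficient on x as the within (country-demeaned) regression. The sketch is dependency-free and uses plain OLS rather than a random-effects estimator; the data and names are illustrative assumptions.

```python
from statistics import mean

def demean(v):
    m = mean(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy panel: 3 countries x 4 years.
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
x = [1.0, 2.0, 4.0, 5.0, 3.0, 5.0, 6.0, 8.0, 6.0, 7.0, 9.0, 12.0]
y = [2.0, 2.5, 3.9, 4.1, 1.0, 2.2, 2.4, 3.8, 0.5, 0.9, 2.1, 3.2]

def group_means(v):
    return {g: mean(a for a, h in zip(v, groups) if h == g) for g in set(groups)}

def within(v):
    """Remove each country's own mean."""
    m = group_means(v)
    return [a - m[g] for a, g in zip(v, groups)]

# Within (fixed-effects) slope.
beta_within = dot(within(x), within(y)) / dot(within(x), within(x))

# Mundlak regression: y on x and x_bar (plus intercept) via the normal equations.
x_means = group_means(x)
x_bar = [x_means[g] for g in groups]
xd, md, yd = demean(x), demean(x_bar), demean(y)
det = dot(xd, xd) * dot(md, md) - dot(xd, md) ** 2
beta_x = (dot(md, md) * dot(xd, yd) - dot(xd, md) * dot(md, yd)) / det
beta_mean = (dot(xd, xd) * dot(md, yd) - dot(xd, md) * dot(xd, yd)) / det
```

Here beta_x matches beta_within up to floating-point error, and beta_mean is the coefficient whose test against zero plays the role of the Hausman test.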
Stage 6 — The flagship: Kuznets waves across estimators
analyze_kuznets_waves is the synthesis. It fits the extended polynomial gini = b_1*g + b_2*g^2 + ... + b_degree*g^degree under three estimators at once — pooled OLS, between, and within two-way fixed effects — so you can read off the one question that matters: is the N-shape a between-country pattern, a within-country pattern, or both? Controls (here trade openness) are partialled out by Frisch–Waugh–Lovell before the component plots are drawn.
The within (two-way FE) component plot isolates the wave that survives inside countries, net of common shocks — the most demanding test of the Kuznets hypothesis:
waves.fig_within
print(waves.interpret())
Across 880 observations, a degree-4 polynomial relates **gini_regional** to **log_gdp_pc** (the extended Kuznets-waves specification). The three estimators read the association at different levels of variation:
- **Pooled OLS** (the raw cross-sectional pattern): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.93; its highest-order term is not statistically significant at conventional levels (R² = 0.744).
- **Between (cross-country)** (comparing country averages): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.83).
- **Within (two-way FE)** (within-country variation net of common year effects): the fitted curve shows two turning points (an N or inverted-N shape), peaking near log_gdp_pc = 7.94; its highest-order term is not statistically significant at conventional levels (R² = 0.521).
All three estimators agree on the curvature, so the shape is not an artefact of cross-country versus within-country variation.
The between and within figures partial out trade_share via the Frisch-Waugh-Lovell theorem, so the plotted wave is net of those covariates.
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
The .gt_pooled, .gt_between and .gt_within tables hold the full cumulative-stepwise estimates behind each curve, and .summary reports the implied turning points.
Stage 7 — A related question: income convergence
Inequality between countries depends on whether poorer economies are catching up. The convergence tools answer that for the income driver itself, log_gdp_pc.
Warning
These estimators are designed for long balanced panels; the kuznets window is only 11 years (2015–2025), so read the results as an illustration of the tools, not a settled finding. For a full treatment on long panels see the dedicated notebooks for β-convergence, σ-convergence and convergence clubs.
analyze_beta_convergence asks whether initially poorer countries grow faster (β-convergence): it regresses annualized growth on the starting level and converts the slope into a speed and a half-life. With controls it reports the conditional version (a partial-residual plot) beside the unconditional one.
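The conversion from a growth-regression slope to a speed and a half-life is short arithmetic. The sketch below uses the standard textbook mapping and assumes the regression has the form (1/T)·ln(y_T / y_0) = a + β·ln(y_0). Whether the library uses exactly this parameterization is an assumption, and the slope value is hypothetical.

```python
import math

def convergence_speed(beta, horizon):
    """Annual convergence speed implied by the growth-regression slope.

    Inverts beta = -(1 - exp(-lam * T)) / T for lam.
    """
    inside = 1.0 + beta * horizon
    if inside <= 0:
        raise ValueError("slope too negative for this horizon")
    return -math.log(inside) / horizon

def half_life(speed):
    """Years needed to close half of the gap to the long-run path."""
    return math.log(2.0) / speed

# Hypothetical slope of -0.02 over the 10 years from 2015 to 2025.
lam = convergence_speed(beta=-0.02, horizon=10)
years_to_halve = half_life(lam)
```

With these inputs the implied speed is about 2.2% per year and the half-life about 31 years. A positive slope (divergence) makes the speed negative, in which case a half-life is not meaningful.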
analyze_sigma_convergence tracks whether the dispersion of income across countries shrinks over time. It reports the standard deviation, Gini and coefficient of variation per year, with a trend test; a negative slope indicates σ-convergence.
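The dispersion-and-trend calculation can be sketched with the standard library. The log incomes below are hypothetical and only the standard deviation is shown; the Gini and coefficient of variation would follow the same per-year pattern.

```python
from statistics import mean, pstdev

# Hypothetical log incomes for four countries in three years; the spread narrows.
log_income = {
    2015: [7.0, 8.0, 9.0, 10.0],
    2020: [7.4, 8.2, 9.0, 9.8],
    2025: [7.8, 8.4, 9.0, 9.6],
}

years = sorted(log_income)
sigma = [pstdev(log_income[y]) for y in years]  # cross-country dispersion per year

# OLS slope of dispersion on time: a negative value means dispersion is shrinking.
t_bar, s_bar = mean(years), mean(sigma)
trend = sum((t - t_bar) * (s - s_bar) for t, s in zip(years, sigma)) / sum(
    (t - t_bar) ** 2 for t in years
)
sigma_converging = trend < 0
```

A formal trend test would also need a standard error for the slope, which this sketch omits.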
analyze_convergence_clubs runs the Phillips–Sul log(t) test and, when a single converging group is rejected, clusters countries into convergence clubs that each approach their own path:
The tools so far describe associations. When a determinant is a datable policy or shock, a difference-in-differences / event study design can identify its dynamic effect. The kuznets panel has no such treatment, so we switch to the bundled staggered_did example — a panel where units adopt a treatment in different years.
from expdpy.data import load_staggered_did, load_staggered_did_data_def

did = ex.set_labels(
    load_staggered_did(), load_staggered_did_data_def(), set_panel=True
)
analyze_panel_view shows the treatment structure — who is treated and when — as a quilt over the unit-by-year grid, the first thing to inspect in any staggered design:
ex.analyze_panel_view(did, cohort="cohort").fig
analyze_event_study traces the dynamic treatment path with a built-in pre-trend check, using a modern staggered-adoption estimator (Gardner’s did2s here; twfe, Sun–Abraham saturated and lpdid are also available). Flat pre-trends and a clean post-treatment jump are what a credible design looks like:
es = ex.analyze_event_study(
    did, outcome="outcome", cohort="cohort", estimator="did2s"
)
es.fig
print(es.interpret())
This event study (estimator: **did2s**) traces the outcome by event time, with t = -1 as the baseline period.
⚠️ Some **pre-treatment** coefficients differ from zero, which weakens the parallel-trends assumption — read the post-treatment path with caution.
By event time 11, the estimated effect is 3.14 (95% interval excludes zero).
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
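The event-time axis behind such a plot is simple bookkeeping: each observation is dated relative to its unit's adoption year. A minimal sketch on a hypothetical staggered panel follows; the column names unit and year and the use of None for never-treated units are assumptions, not the bundled dataset's actual encoding.

```python
# Hypothetical staggered panel: "cohort" is the adoption year (None = never treated).
rows = (
    [{"unit": "u1", "year": y, "cohort": 2018} for y in range(2015, 2021)]
    + [{"unit": "u2", "year": y, "cohort": 2019} for y in range(2015, 2021)]
    + [{"unit": "u3", "year": y, "cohort": None} for y in range(2015, 2021)]
)

BASELINE = -1  # reference period: the year before adoption

for r in rows:
    if r["cohort"] is None:
        r["event_time"] = None  # never-treated units have no event clock
        r["treated"] = False
    else:
        r["event_time"] = r["year"] - r["cohort"]
        r["treated"] = r["event_time"] >= 0

event_times = sorted({r["event_time"] for r in rows if r["event_time"] is not None})
pre_periods = [k for k in event_times if k < BASELINE]  # used for the pre-trend check
post_periods = [k for k in event_times if k >= 0]       # trace the dynamic path
```

Coefficients on pre_periods should be close to zero if trends were parallel before adoption, and coefficients on post_periods trace the dynamic path relative to the baseline period.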
Where to go next
You have fit the Kuznets curve, shown the N largely survives within countries under two-way fixed effects, stress-tested a determinant with randomization inference, chosen among panel estimators, and seen how a staggered DiD design would identify a policy effect.
Explore panel data — the exploratory analysis that should precede every model.
Learn panel data — the ideas behind fixed effects, demeaning, correlated random effects and convergence, with runnable sandboxes.