This walkthrough mirrors the ExPanDaR functions vignette using the bundled kuznets panel (80 countries × 2015–2025) — a synthetic dataset, rich in control variables, whose regional inequality (gini_regional) traces an N-shaped Kuznets curve in income: it rises, falls, then rises again at very high income. The log of GDP per capita and its square and cube ship as columns (log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu) so the cubic curve is turnkey.
import numpy as np
import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
df[["country", "continent", "year", "gini_regional", "gdp_pc", "log_gdp_pc"]].head()
| | country | continent | year | gini_regional | gdp_pc | log_gdp_pc |
|---|---|---|---|---|---|---|
| 0 | country 1 | Continent A | 2015 | 0.085611 | 629.149402 | 6.444369 |
| 1 | country 1 | Continent A | 2016 | 0.080064 | 672.932711 | 6.511645 |
| 2 | country 1 | Continent A | 2017 | 0.212049 | 810.191773 | 6.697271 |
| 3 | country 1 | Continent A | 2018 | 0.221831 | 917.986443 | 6.822183 |
| 4 | country 1 | Continent A | 2019 | 0.261234 | 1079.513400 | 6.984266 |
Treating outliers
Winsorize the (skewed) numeric variables at the 1st/99th percentile:
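As a minimal sketch of what winsorizing at the 1st/99th percentile does, the snippet below uses plain pandas rather than any expdpy helper. The function name `winsorize_frame`, its `exclude` parameter, and the toy data are hypothetical and only illustrate the idea: each numeric column is clipped to its own 1% and 99% quantiles, while identifiers such as the year are left alone.

```python
import numpy as np
import pandas as pd


def winsorize_frame(df, lower=0.01, upper=0.99, exclude=()):
    """Clip each numeric column to its own [lower, upper] quantiles.

    Hypothetical helper for illustration; not part of expdpy.
    """
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    for col in numeric_cols:
        if col in exclude:
            continue  # leave identifiers (e.g. the year) untouched
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


# Toy data (assumed, not the kuznets panel): one right-skewed variable.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "country": ["country 1"] * 200,
    "year": np.arange(200),
    "gdp_pc": np.exp(rng.normal(8.0, 1.5, size=200)),
})
toy_w = winsorize_frame(toy, exclude=("year",))

# The extreme tails are pulled in; the bulk of the data is unchanged.
print(toy["gdp_pc"].max(), toy_w["gdp_pc"].max())
```

Winsorizing keeps every observation, which is why it is often preferred over truncation when the panel structure matters.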
Pearson correlations appear above, Spearman below the diagonal; significant cells are bold.
ex.prepare_correlation_table(analysis).gt
| | A | B | C | D |
|---|---|---|---|---|
| A: gini_regional | | 0.20 | -0.11 | 0.30 |
| B: gdp_pc | -0.20 | | 0.79 | -0.39 |
| C: school_enrollment | -0.21 | 0.95 | | -0.44 |
| D: resource_rents | 0.34 | -0.55 | -0.52 | |
This table reports Pearson correlations above and Spearman correlations below the diagonal. Number of observations: 735. Correlations with significance levels below 5% appear in bold.
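The layout of such a table (Pearson above the diagonal, Spearman below) can be sketched with pandas and NumPy alone. This is an illustration under stated assumptions, not how `prepare_correlation_table` is implemented: the function `pearson_spearman_table` and the toy columns are hypothetical, the data are assumed to have no missing values, and significance testing is left out. Spearman is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd


def pearson_spearman_table(df):
    """Pearson above the diagonal, Spearman below, NaN on the diagonal.

    Hypothetical helper; assumes numeric columns without missing values.
    """
    pearson = df.corr().to_numpy()          # pandas default method is Pearson
    spearman = df.rank().corr().to_numpy()  # Spearman = Pearson on ranks
    k = pearson.shape[0]
    upper = np.triu(np.ones((k, k), dtype=bool), k=1)
    lower = np.tril(np.ones((k, k), dtype=bool), k=-1)
    combined = np.full((k, k), np.nan)
    combined[upper] = pearson[upper]
    combined[lower] = spearman[lower]
    return pd.DataFrame(combined, index=df.columns, columns=df.columns)


# Toy data (assumed): a monotone but nonlinear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
toy = pd.DataFrame({"x": x, "exp_x": np.exp(x)})
table = pearson_spearman_table(toy)
print(table.round(2))
```

Because `exp_x` is a monotone transform of `x`, the rank correlation below the diagonal is 1 while the linear correlation above it is smaller. Gaps of this kind between the two triangles signal nonlinearity or skew.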
The N-shaped Kuznets curve — regional inequality against (log) GDP per capita, sized by population and colored by continent. The LOESS smoother traces the rise–fall–rise:
kuznets is a country–year panel, so the natural specification absorbs two-way (country + year) fixed effects — controlling for every time-invariant country trait and every common annual shock. A cubic in (log) GDP per capita still recovers the N — a positive, significant cubic term — within country, with standard errors clustered by country:
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
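To make the two-way fixed-effects cubic concrete, here is a minimal NumPy/pandas sketch on simulated data. It is not the expdpy regression call, and every name in it (`two_way_demean`, `unit`, `x`, `x_sq`, `x_cu`, the true coefficients) is an assumption chosen for illustration. It relies on the fact that, in a balanced panel, subtracting unit means and time means and adding back the grand mean is equivalent to including unit and time dummies. It reports point estimates only and does not compute clustered standard errors.

```python
import numpy as np
import pandas as pd


def two_way_demean(panel, cols, unit, time):
    """Two-way within transformation; exact only for balanced panels."""
    values = panel[cols]
    unit_means = values.groupby(panel[unit]).transform("mean")
    time_means = values.groupby(panel[time]).transform("mean")
    return values - unit_means - time_means + values.mean()


# Simulated balanced panel (assumed): 80 units observed over 11 years.
rng = np.random.default_rng(42)
n_units, n_years = 80, 11
unit = np.repeat(np.arange(n_units), n_years)
year = np.tile(np.arange(2015, 2015 + n_years), n_units)
x = rng.normal(0.0, 1.0, n_units)[unit] + rng.normal(0.0, 1.0, unit.size)

panel = pd.DataFrame({"unit": unit, "year": year, "x": x})
panel["x_sq"] = panel["x"] ** 2
panel["x_cu"] = panel["x"] ** 3

# True linear, quadratic, and cubic coefficients (positive cubic term).
true_beta = np.array([0.10, -0.30, 0.05])
unit_effect = rng.normal(0.0, 1.0, n_units)[unit]
year_effect = rng.normal(0.0, 1.0, n_years)[year - 2015]
panel["y"] = (
    panel[["x", "x_sq", "x_cu"]].to_numpy() @ true_beta
    + unit_effect
    + year_effect
    + rng.normal(0.0, 0.05, unit.size)
)

# Sweep out both sets of fixed effects, then run OLS on what remains.
within = two_way_demean(panel, ["y", "x", "x_sq", "x_cu"], "unit", "year")
beta_hat, *_ = np.linalg.lstsq(
    within[["x", "x_sq", "x_cu"]].to_numpy(),
    within["y"].to_numpy(),
    rcond=None,
)
print(dict(zip(["x", "x_sq", "x_cu"], beta_hat.round(3))))
```

The estimates land close to the simulated coefficients even though the unit and year effects are much larger than the noise, which is the point of absorbing them. With an unbalanced panel the simple demeaning above is no longer exact, and an iterative or dummy-variable approach would be needed.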
Frisch-Waugh-Lovell plot
The same coefficient, shown graphically. prepare_fwl_plot residualizes both the outcome and the focal regressor on the other regressors and the fixed effects, then scatters the two sets of residuals against each other. By the Frisch-Waugh-Lovell theorem, the fitted slope equals the focal coefficient in the regression above. Here that is the linear log_gdp_pc term, net of the quadratic and cubic terms and the country and year fixed effects:
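The theorem itself is easy to check numerically. The sketch below uses NumPy only and simulated data; it does not call `prepare_fwl_plot`, and the helper `residualize` and all variable names are hypothetical. It fits the full regression once, then regresses the residualized outcome on the residualized focal regressor, and compares the two slopes.

```python
import numpy as np


def residualize(target, controls):
    """Return the residuals of an OLS regression of target on controls."""
    coef, *_ = np.linalg.lstsq(controls, target, rcond=None)
    return target - controls @ coef


# Simulated data (assumed): a focal regressor plus its square, its cube,
# one further control, and an intercept.
rng = np.random.default_rng(7)
n = 500
focal = rng.normal(size=n)
control = rng.normal(size=n)
others = np.column_stack([np.ones(n), focal**2, focal**3, control])
y = (
    0.8 * focal
    - 0.3 * focal**2
    + 0.1 * focal**3
    + 0.5 * control
    + rng.normal(scale=0.2, size=n)
)

# Full regression: the focal coefficient is the first entry.
full_X = np.column_stack([focal, others])
full_coef, *_ = np.linalg.lstsq(full_X, y, rcond=None)

# FWL: partial the other regressors out of both sides, then fit a
# slope through the origin on the two residual series.
y_res = residualize(y, others)
x_res = residualize(focal, others)
fwl_slope = (x_res @ y_res) / (x_res @ x_res)

print(full_coef[0], fwl_slope)
```

The two numbers agree up to floating-point error. Fixed effects fit the same pattern: country and year dummies (or the equivalent demeaning) would simply be additional columns in `others`.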