kuznets is the package’s lead showcase dataset: a synthetic country–year panel of 80 countries over 2015–2025 (880 observations). It is modelled on Table 4 of the Lessmann & Seidel (2017) regional-inequality replication — the same variables (renamed to clean snake_case) plus a synthetic continent grouping — but the data is generated so that regional inequality traces an N-shaped Kuznets curve in income: inequality rises, falls, then rises again at very high income.
Country names are generic (country 1 … country 80), so the panel illustrates the package without standing in for any real economy. It is deliberately rich in control variables and in realistic features (skewed distributions, scattered missing values, a multi-level factor and a two-level dummy) so a single dataset can exercise every expdpy feature.
import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = load_kuznets()
df.head()
| | country | iso | year | continent | gini_regional | gdp_pc | population | resource_rents | arable_land | trade_share | ... | area | gasoline_price | aid | school_enrollment | gini_lights | polity2 | federal | log_gdp_pc | log_gdp_pc_sq | log_gdp_pc_cu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | country 1 | C01 | 2015 | Continent A | 0.085611 | 629.149402 | 13105399 | 14.973253 | 0.204988 | 0.411814 | ... | 36929.543281 | NaN | 2.200000e+09 | 12.194620 | NaN | -10.0 | 0 | 6.444369 | 41.529889 | 267.633916 |
| 1 | country 1 | C01 | 2016 | Continent A | 0.080064 | 672.932711 | 13163410 | 11.211238 | 0.187430 | 0.457198 | ... | 36929.543281 | 0.2 | 2.200000e+09 | 8.212516 | 0.139813 | NaN | 0 | 6.511645 | 42.401525 | 276.103693 |
| 2 | country 1 | C01 | 2017 | Continent A | 0.212049 | 810.191773 | 13221677 | 13.397164 | 0.206566 | 0.448067 | ... | 36929.543281 | NaN | 2.200000e+09 | 26.804331 | 0.139719 | -2.0 | 0 | 6.697271 | 44.853439 | 300.395632 |
| 3 | country 1 | C01 | 2018 | Continent A | 0.221831 | 917.986443 | 13280202 | 9.866297 | 0.204979 | 0.537570 | ... | 36929.543281 | NaN | 2.200000e+09 | 14.433636 | 0.407918 | -7.0 | 0 | 6.822183 | 46.542176 | 317.519223 |
| 4 | country 1 | C01 | 2019 | Continent A | 0.261234 | 1079.513400 | 13338987 | 15.495375 | 0.207841 | 0.424548 | ... | 36929.543281 | 0.2 | 2.200000e+09 | NaN | 0.153892 | NaN | 0 | 6.984266 | 48.779967 | 340.692248 |

5 rows × 21 columns
Variable dictionary
The bundled definition table marks the panel identifiers (cs_id / ts_id) and the variable types the app uses to populate its selectors. Each description also records the original Table-4 column name.
load_kuznets_data_def()
| | var_name | var_def | type | can_be_na |
|---|---|---|---|---|
| 0 | country | Country identifier (synthetic, generic names) | cs_id | False |
| 1 | iso | Country ISO code (synthetic, generic codes) | cs_id | False |
| 2 | year | Calendar year | ts_id | False |
| 3 | continent | Synthetic continent (grouping factor) | factor | True |
| 4 | gini_regional | Regional inequality Gini — the N-shaped Kuznet... | numeric | True |
| 5 | gdp_pc | National GDP per capita, USD (GDP_pc_Country) | numeric | True |
| 6 | population | National population (Pop_Country) | numeric | True |
| 7 | resource_rents | Natural-resource rents, % of GDP (Resources_re... | numeric | True |
| 8 | arable_land | Arable land share (Arable_land) | numeric | True |
| 9 | trade_share | Trade openness, trade/GDP (Trade_GDP_share) | numeric | True |
| 10 | fdi_share | FDI inflows, share of GDP (FDI_share_of_GDP) | numeric | True |
| 11 | area | Country area, km^2 (area) | numeric | True |
| 12 | gasoline_price | Gasoline price, USD/litre (price_gasoline) | numeric | True |
| 13 | aid | Net official development aid received, USD (Aid) | numeric | True |
| 14 | school_enrollment | Secondary-school enrollment, % gross (School_e... | numeric | True |
| 15 | gini_lights | Light-based inequality measure, control (GINIW... | numeric | True |
| 16 | polity2 | Democracy score, -10..10 (Polity2) | numeric | True |
| 17 | federal | Federal-state dummy, 0/1 (fedelupd2) | factor | True |
| 18 | log_gdp_pc | Natural log of gdp_pc (derived) | numeric | True |
| 19 | log_gdp_pc_sq | log_gdp_pc squared (derived) | numeric | True |
| 20 | log_gdp_pc_cu | log_gdp_pc cubed (derived) | numeric | True |
Highlights:
gini_regional: the outcome (regional inequality) that follows the N-shape.
gdp_pc and its derived log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu: income and the ready-made polynomial terms, so a cubic specification needs no extra transformations.
country (cs_id) and year (ts_id): the panel identifiers, and the natural two-way fixed effects for any regression on this dataset.
continent (5 levels) and federal (0/1): extra grouping factors for by-group views (and alternative fixed effects).
gasoline_price, school_enrollment, gini_lights, and polity2: these carry realistic missing values (heavier in the early years), so the missing-value heatmap has something to show.
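The derived income terms are plain powers of the natural log of gdp_pc, which can be checked by hand. The short sketch below recomputes them for the first row of the df.head() output (country 1, 2015) using only the standard library; the variable names simply mirror the column names, and the comparison values are the ones printed in that row.

```python
import math

# gdp_pc for country 1 in 2015, copied from the df.head() output
gdp_pc = 629.149402

log_gdp_pc = math.log(gdp_pc)        # natural log of income
log_gdp_pc_sq = log_gdp_pc ** 2      # squared term
log_gdp_pc_cu = log_gdp_pc ** 3      # cubed term

# Compare with the printed row: 6.444369, 41.529889, 267.633916
print(round(log_gdp_pc, 6), round(log_gdp_pc_sq, 6), round(log_gdp_pc_cu, 6))
```

Because the three terms ship with the data, a cubic specification can refer to them directly rather than building them inside a formula.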
The N-shaped Kuznets curve
A scatter of regional inequality against (log) GDP per capita, with a LOESS smoother, reveals the rise–fall–rise directly:
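To make the smoother less of a black box, here is a minimal sketch of what a LOESS-type fit computes: a local linear regression with tricube weights around each point, without the robustness iterations that full LOESS implementations add. It is written in plain NumPy and run on simulated stand-in data, not on kuznets; the function name loess, the frac parameter, and the cubic coefficients are assumptions chosen for illustration, not part of expdpy.

```python
import numpy as np

def loess(x, y, frac=0.5):
    """Local linear fit with tricube weights, evaluated at every x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(3, int(np.ceil(frac * x.size)))      # neighbours used in each local fit
    fitted = np.empty_like(y)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        h = np.sort(dist)[k - 1]                 # bandwidth: distance to the k-th nearest point
        w = np.clip(1 - (dist / h) ** 3, 0, 1) ** 3   # tricube kernel, zero outside the window
        sw = np.sqrt(w)
        A = np.column_stack([np.ones_like(x), x - x0])
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                      # local intercept is the fitted value at x0
    return fitted

# Simulated stand-in: an N-shaped cubic in log income plus noise
# (illustrative coefficients with turning points at 7.5 and 10).
rng = np.random.default_rng(1)
x = rng.uniform(6.0, 11.5, 300)
y = -1.0 + 0.45 * x - 0.0525 * x**2 + 0.002 * x**3 + rng.normal(0, 0.003, 300)
smooth = loess(x, y, frac=0.3)
```

With matplotlib available, plt.scatter(x, y) followed by plt.plot(np.sort(x), smooth[np.argsort(x)]) would draw the points and the smoothed curve. A smaller frac follows the data more closely, while a larger one gives a flatter curve.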
Because kuznets is a country–year panel, the regression absorbs two-way (country + year) fixed effects. This is the natural specification for panel data: it controls for every time-invariant country trait and every common annual shock. A cubic in (log) GDP per capita still recovers the N within country (the cubic term is positive and significant), with standard errors clustered by country:
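As a rough illustration of what such a specification computes, the sketch below fits a cubic with country and year fixed effects and country-clustered standard errors in plain NumPy. It is not expdpy's own regression routine: the helper names two_way_demean and fe_cubic_fit are hypothetical, the data is a simulated balanced panel (80 units, 11 years) rather than kuznets, and the coefficients are illustrative. It assumes a balanced panel with no missing values, where double demeaning is exactly the two-way within transformation, and it applies no small-sample correction to the clustered covariance.

```python
import numpy as np

def two_way_demean(v, unit, time):
    """Subtract unit and time means, add back the grand mean (balanced panel only)."""
    unit_mean = np.bincount(unit, weights=v) / np.bincount(unit)
    time_mean = np.bincount(time, weights=v) / np.bincount(time)
    return v - unit_mean[unit] - time_mean[time] + v.mean()

def fe_cubic_fit(y, x, unit, time):
    """Cubic in x with unit + time fixed effects; SEs clustered by unit (no df correction)."""
    X = np.column_stack([x, x**2, x**3])
    Xd = np.column_stack([two_way_demean(col, unit, time) for col in X.T])
    yd = two_way_demean(y, unit, time)
    beta, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    resid = yd - Xd @ beta
    bread = np.linalg.inv(Xd.T @ Xd)
    meat = np.zeros((3, 3))
    for g in np.unique(unit):                    # sum score outer products per cluster
        score = Xd[unit == g].T @ resid[unit == g]
        meat += np.outer(score, score)
    se = np.sqrt(np.diag(bread @ meat @ bread))  # sandwich estimator
    return beta, se

# Simulated stand-in panel (not the kuznets data).
rng = np.random.default_rng(0)
n_units, n_years = 80, 11
unit = np.repeat(np.arange(n_units), n_years)
time = np.tile(np.arange(n_years), n_units)
base = rng.uniform(6.0, 11.0, n_units)           # starting log income per unit
growth = rng.uniform(0.0, 0.08, n_units)         # unit-specific growth rate
x = base[unit] + growth[unit] * time + rng.normal(0, 0.05, unit.size)

true_beta = np.array([0.45, -0.0525, 0.002])     # illustrative N-shaped cubic
signal = np.column_stack([x, x**2, x**3]) @ true_beta
alpha = rng.normal(0, 0.05, n_units)             # unit fixed effects
gamma = rng.normal(0, 0.02, n_years)             # year fixed effects
y = signal + alpha[unit] + gamma[time] + rng.normal(0, 0.002, unit.size)

beta, se = fe_cubic_fit(y, x, unit, time)
print("coefficients:", beta)
print("clustered SEs:", se)
```

A positive coefficient on the cubed term, together with a negative squared term large enough to give the curve two turning points, is what produces the rise, fall, and second rise. On real data with missing values the panel is no longer balanced, so a dedicated fixed-effects estimator is the safer choice than this hand-rolled demeaning.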