The kuznets dataset

kuznets is the package’s lead showcase dataset: a synthetic country–year panel of 80 countries over 2015–2025 (880 observations). It is modelled on Table 4 of the Lessmann & Seidel (2017) regional-inequality replication and keeps the same variables (renamed to clean snake_case), plus a synthetic continent grouping. The values themselves are generated so that regional inequality traces an N-shaped Kuznets curve in income: inequality rises, falls, then rises again at very high income.

Country names are generic (country 1 … country 80), so the panel illustrates the package without standing in for any real economy. It is deliberately rich in control variables and in realistic features (skewed distributions, scattered missing values, a multi-level factor and a two-level dummy) so a single dataset can exercise every expdpy feature.

import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = load_kuznets()
df.head()

	country	iso	year	continent	gini_regional	gdp_pc	population	resource_rents	arable_land	trade_share	...	area	gasoline_price	aid	school_enrollment	gini_lights	polity2	log_gdp_pc	log_gdp_pc_sq	log_gdp_pc_cu
0	country 1	C01	2015	Continent A	0.085611	629.149402	13105399	14.973253	0.204988	0.411814	...	36929.543281	NaN	2.200000e+09	12.194620	NaN	-10.0	6.444369	41.529889	267.633916
1	country 1	C01	2016	Continent A	0.080064	672.932711	13163410	11.211238	0.187430	0.457198	...	36929.543281	0.2	2.200000e+09	8.212516	0.139813	NaN	6.511645	42.401525	276.103693
2	country 1	C01	2017	Continent A	0.212049	810.191773	13221677	13.397164	0.206566	0.448067	...	36929.543281	NaN	2.200000e+09	26.804331	0.139719	-2.0	6.697271	44.853439	300.395632
3	country 1	C01	2018	Continent A	0.221831	917.986443	13280202	9.866297	0.204979	0.537570	...	36929.543281	NaN	2.200000e+09	14.433636	0.407918	-7.0	6.822183	46.542176	317.519223
4	country 1	C01	2019	Continent A	0.261234	1079.513400	13338987	15.495375	0.207841	0.424548	...	36929.543281	0.2	2.200000e+09	NaN	0.153892	NaN	6.984266	48.779967	340.692248

5 rows × 21 columns

Variable dictionary

The bundled definition table marks the panel identifiers (entity / time) and the variable types the app uses to populate its selectors. For variables taken from the source table, the description also records the original Table 4 column name in parentheses.

load_kuznets_data_def()

	var_name	var_def	label	type	role	can_be_na
0	country	Country identifier (synthetic, generic names)	Country	entity		False
1	iso	Country ISO code (synthetic, generic codes)	ISO code	entity		False
2	year	Calendar year	Year	time		False
3	continent	Synthetic continent (grouping factor)	Continent	factor		True
4	gini_regional	Regional inequality Gini — the N-shaped Kuznet...	Regional inequality (Gini)	numeric	outcome	True
5	gdp_pc	National GDP per capita, USD (GDP_pc_Country)	GDP per capita (USD)	numeric		True
6	population	National population (Pop_Country)	Population	numeric		True
7	resource_rents	Natural-resource rents, % of GDP (Resources_re...	Resource rents (% of GDP)	numeric		True
8	arable_land	Arable land share (Arable_land)	Arable land (share)	numeric		True
9	trade_share	Trade openness, trade/GDP (Trade_GDP_share)	Trade openness (trade/GDP)	numeric		True
10	fdi_share	FDI inflows, share of GDP (FDI_share_of_GDP)	FDI inflows (% of GDP)	numeric		True
11	area	Country area, km^2 (area)	Area (km²)	numeric		True
12	gasoline_price	Gasoline price, USD/litre (price_gasoline)	Gasoline price (USD/litre)	numeric		True
13	aid	Net official development aid received, USD (Aid)	Official development aid (USD)	numeric		True
14	school_enrollment	Secondary-school enrollment, % gross (School_e...	Secondary enrollment (% gross)	numeric		True
15	gini_lights	Light-based inequality measure, control (GINIW...	Light-based inequality (Gini)	numeric		True
16	polity2	Democracy score, -10..10 (Polity2)	Democracy score (Polity2)	numeric		True
17	federal	Federal-state dummy, 0/1 (fedelupd2)	Federal state	factor		True
18	log_gdp_pc	Natural log of gdp_pc (derived)	Log GDP per capita	numeric	covariate	True
19	log_gdp_pc_sq	log_gdp_pc squared (derived)	Log GDP per capita²	numeric	covariate	True
20	log_gdp_pc_cu	log_gdp_pc cubed (derived)	Log GDP per capita³	numeric	covariate	True

Highlights:

gini_regional — the outcome that follows the N-shape (regional inequality).
gdp_pc and its derived log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu — income and the polynomial terms that make the cubic curve turnkey.
country (entity) and year (time) — the panel identifiers, and the natural two-way fixed effects for any regression on this dataset.
continent (5 levels) and federal (0/1) — extra grouping factors for by-group views (and alternative fixed effects).
gasoline_price, school_enrollment, gini_lights, polity2 carry realistic missing values (heavier in the early years), so the missing-value heatmap has something to show.
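The derived income terms can be checked by hand. The following is a minimal sketch that uses only the standard library and the values printed in the first row of the df.head() output above (country 1, 2015). The names shown and rebuilt are illustrative and not part of expdpy.

```python
import math

# Values copied from the first row of the df.head() output (country 1, 2015).
gdp_pc = 629.149402
shown = {
    "log_gdp_pc": 6.444369,
    "log_gdp_pc_sq": 41.529889,
    "log_gdp_pc_cu": 267.633916,
}

# Rebuild the polynomial terms as the variable dictionary describes them:
# natural log of gdp_pc, then its square and its cube.
log_gdp_pc = math.log(gdp_pc)
rebuilt = {
    "log_gdp_pc": log_gdp_pc,
    "log_gdp_pc_sq": log_gdp_pc ** 2,
    "log_gdp_pc_cu": log_gdp_pc ** 3,
}

for name, value in rebuilt.items():
    # The printed values are rounded to six decimals, so compare loosely.
    close = abs(value - shown[name]) < 1e-4
    print(f"{name}: rebuilt {value:.6f}, shown {shown[name]:.6f}, match={close}")
```

Because the squared and cubed terms are powers of the log (not logs of powers), they can be passed to a regression as ordinary covariates without further transformation.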

The N-shaped Kuznets curve

A scatter of regional inequality against (log) GDP per capita, with a LOESS smoother, reveals the rise–fall–rise directly:

ex.explore_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
).fig

Statistical evidence

Because kuznets is a country–year panel, the regression absorbs two-way (country + year) fixed effects. This is the natural specification for panel data: it controls for every time-invariant country trait and every common annual shock. A cubic in (log) GDP per capita still recovers the N within country. With standard errors clustered by country, the linear and cubic terms are positive, the squared term is negative, and all three are significant at the 0.1% level:

ex.analyze_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
).etable

	gini_regional
	(1)
coef
log_gdp_pc	6.411*** (0.210)
log_gdp_pc_sq	-0.715*** (0.023)
log_gdp_pc_cu	0.026*** (0.001)
fe
year	x
country	x
stats
Observations	880
R²	0.874
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
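The sign pattern in the table (positive, negative, positive) is what an N-shape requires, and the point estimates also imply where the curve turns. The following is a rough sketch, assuming the fitted curve is b1*x + b2*x² + b3*x³ in x = log_gdp_pc and using the rounded coefficients printed above. The helper name turning_points is hypothetical, and because the coefficients are rounded to three decimals the results are approximate.

```python
import math


def turning_points(b1, b2, b3):
    """Return (local_max, local_min) of b1*x + b2*x**2 + b3*x**3 when b3 > 0."""
    # Turning points solve the derivative: b1 + 2*b2*x + 3*b3*x**2 = 0.
    disc = (2 * b2) ** 2 - 4 * (3 * b3) * b1
    if b3 <= 0 or disc <= 0:
        raise ValueError("no N-shape: need b3 > 0 and two real turning points")
    root = math.sqrt(disc)
    # With b3 > 0 the slope is negative between the roots, so the smaller
    # root is the local maximum and the larger root is the local minimum.
    x_max = (-2 * b2 - root) / (6 * b3)
    x_min = (-2 * b2 + root) / (6 * b3)
    return x_max, x_min


# Rounded coefficients from the regression table above (assumption: the
# unrounded estimates would shift these turning points somewhat).
x_peak, x_trough = turning_points(6.411, -0.715, 0.026)

print(f"peak near log GDP p.c. {x_peak:.2f} (about {math.exp(x_peak):,.0f} USD)")
print(f"trough near log GDP p.c. {x_trough:.2f} (about {math.exp(x_trough):,.0f} USD)")
```

With these inputs the peak falls at a log GDP per capita of roughly 7.8 and the trough at roughly 10.5. The fixed effects shift the level of the curve but not its shape, so they do not enter the calculation. The discriminant is small relative to its terms, so the exact locations are sensitive to rounding and are best read as orders of magnitude.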

Launching the app on kuznets

The dataset ships a preset configuration that opens an app directly on the curve:

from expdpy.streamlit_app import ExploreApp
from expdpy.data import load_kuznets, load_kuznets_data_def, get_config

ExploreApp(load_kuznets(), df_def=load_kuznets_data_def(), config_list=get_config("kuznets"))

When launched without data, the bundled Streamlit apps default to this dataset in their picker.