The kuznets dataset

kuznets is the package’s lead showcase dataset: a synthetic country–year panel of 80 countries over 2015–2025 (880 observations). It is modelled on Table 4 of the Lessmann & Seidel (2017) regional-inequality replication — the same variables (renamed to clean snake_case) plus a synthetic continent grouping — but the data is generated so that regional inequality traces an N-shaped Kuznets curve in income: inequality rises, falls, then rises again at very high income.

Country names are generic (country 1 to country 80), so the panel illustrates the package without standing in for any real economy. It is deliberately rich in control variables and in realistic features (skewed distributions, scattered missing values, a multi-level factor and a two-level dummy), so a single dataset can exercise every expdpy feature.
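As a quick check on the panel dimensions, the sketch below builds the balanced country-year index in plain Python. It is an illustration only, not the package's generator. The `C01`-style code pattern is an assumption extrapolated from the one code visible in the sample output; the 80 × 11 = 880 count follows from the description above.

```python
from itertools import product

# 80 generic countries and the 11 calendar years 2015..2025 (inclusive).
countries = [f"country {i}" for i in range(1, 81)]
years = list(range(2015, 2026))

# Assumption: ISO-like codes follow the "C01" pattern seen for country 1.
iso_codes = {name: f"C{i:02d}" for i, name in enumerate(countries, start=1)}

# A balanced panel has exactly one row per (country, year) pair.
panel_index = list(product(countries, years))

print(len(countries), len(years), len(panel_index))
```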

import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = load_kuznets()
df.head()
country iso year continent gini_regional gdp_pc population resource_rents arable_land trade_share ... area gasoline_price aid school_enrollment gini_lights polity2 federal log_gdp_pc log_gdp_pc_sq log_gdp_pc_cu
0 country 1 C01 2015 Continent A 0.085611 629.149402 13105399 14.973253 0.204988 0.411814 ... 36929.543281 NaN 2.200000e+09 12.194620 NaN -10.0 0 6.444369 41.529889 267.633916
1 country 1 C01 2016 Continent A 0.080064 672.932711 13163410 11.211238 0.187430 0.457198 ... 36929.543281 0.2 2.200000e+09 8.212516 0.139813 NaN 0 6.511645 42.401525 276.103693
2 country 1 C01 2017 Continent A 0.212049 810.191773 13221677 13.397164 0.206566 0.448067 ... 36929.543281 NaN 2.200000e+09 26.804331 0.139719 -2.0 0 6.697271 44.853439 300.395632
3 country 1 C01 2018 Continent A 0.221831 917.986443 13280202 9.866297 0.204979 0.537570 ... 36929.543281 NaN 2.200000e+09 14.433636 0.407918 -7.0 0 6.822183 46.542176 317.519223
4 country 1 C01 2019 Continent A 0.261234 1079.513400 13338987 15.495375 0.207841 0.424548 ... 36929.543281 0.2 2.200000e+09 NaN 0.153892 NaN 0 6.984266 48.779967 340.692248

5 rows × 21 columns

Variable dictionary

The bundled definition table marks the panel identifiers (cs_id / ts_id) and the variable types the app uses to populate its selectors. Each description also records the original Table-4 column name.

load_kuznets_data_def()
var_name var_def type can_be_na
0 country Country identifier (synthetic, generic names) cs_id False
1 iso Country ISO code (synthetic, generic codes) cs_id False
2 year Calendar year ts_id False
3 continent Synthetic continent (grouping factor) factor True
4 gini_regional Regional inequality Gini — the N-shaped Kuznet... numeric True
5 gdp_pc National GDP per capita, USD (GDP_pc_Country) numeric True
6 population National population (Pop_Country) numeric True
7 resource_rents Natural-resource rents, % of GDP (Resources_re... numeric True
8 arable_land Arable land share (Arable_land) numeric True
9 trade_share Trade openness, trade/GDP (Trade_GDP_share) numeric True
10 fdi_share FDI inflows, share of GDP (FDI_share_of_GDP) numeric True
11 area Country area, km^2 (area) numeric True
12 gasoline_price Gasoline price, USD/litre (price_gasoline) numeric True
13 aid Net official development aid received, USD (Aid) numeric True
14 school_enrollment Secondary-school enrollment, % gross (School_e... numeric True
15 gini_lights Light-based inequality measure, control (GINIW... numeric True
16 polity2 Democracy score, -10..10 (Polity2) numeric True
17 federal Federal-state dummy, 0/1 (fedelupd2) factor True
18 log_gdp_pc Natural log of gdp_pc (derived) numeric True
19 log_gdp_pc_sq log_gdp_pc squared (derived) numeric True
20 log_gdp_pc_cu log_gdp_pc cubed (derived) numeric True

Highlights:

  • gini_regional — the outcome that follows the N-shape (regional inequality).
  • gdp_pc and its derived log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu — income and the polynomial terms that make the cubic curve turnkey.
  • country (cs_id) and year (ts_id) — the panel identifiers, and the natural two-way fixed effects for any regression on this dataset.
  • continent (5 levels) and federal (0/1) — extra grouping factors for by-group views (and alternative fixed effects).
  • gasoline_price, school_enrollment, gini_lights, polity2 carry realistic missing values (heavier in the early years), so the missing-value heatmap has something to show.
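The derived income terms can be reproduced from gdp_pc with the standard library alone. The sketch below is a minimal illustration rather than the package's own code: the helper name `income_terms` is hypothetical, and the input is the gdp_pc value from the first row of the df.head() output above.

```python
import math


def income_terms(gdp_pc: float) -> tuple[float, float, float]:
    """Return (log, log squared, log cubed) of GDP per capita.

    Hypothetical helper for illustration; not part of expdpy.
    """
    log_gdp = math.log(gdp_pc)  # natural log, as stated in the variable dictionary
    return log_gdp, log_gdp ** 2, log_gdp ** 3


# gdp_pc for country 1 in 2015, copied from the first row shown above.
log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu = income_terms(629.149402)
print(round(log_gdp_pc, 6), round(log_gdp_pc_sq, 6), round(log_gdp_pc_cu, 6))
```

The printed values should agree with the log_gdp_pc, log_gdp_pc_sq and log_gdp_pc_cu entries of that row up to rounding.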

The N-shaped Kuznets curve

A scatter of regional inequality against (log) GDP per capita, with a LOESS smoother, reveals the rise–fall–rise directly:

ex.prepare_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
)

Statistical evidence

Because kuznets is a country–year panel, the regression absorbs two-way (country + year) fixed effects — the natural specification for panel data, controlling for every time-invariant country trait and every common annual shock. A cubic in (log) GDP per capita still recovers the N within country — the cubic term is positive and significant — with standard errors clustered by country:

ex.prepare_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
).etable
                   gini_regional
                   (1)
coef
  log_gdp_pc       6.411*** (0.210)
  log_gdp_pc_sq    -0.715*** (0.023)
  log_gdp_pc_cu    0.026*** (0.001)
fe
  year             x
  country          x
stats
  Observations     880
  R2               0.874
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
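To read the N-shape off the coefficients, one can solve for the points where the slope of the fitted cubic in log_gdp_pc is zero. The sketch below is a back-of-the-envelope calculation using the rounded coefficients from the table above. Because the discriminant is small, the turning points are sensitive to that rounding, so treat the results as indicative only. The function name `turning_points` is hypothetical.

```python
import math


def turning_points(b1: float, b2: float, b3: float) -> tuple[float, float]:
    """Roots of the slope b1 + 2*b2*x + 3*b3*x**2 of a cubic in x.

    Hypothetical helper; assumes b3 != 0 and returns (smaller root, larger root).
    """
    a, b, c = 3 * b3, 2 * b2, b1
    disc = b * b - 4 * a * c
    if disc <= 0:
        raise ValueError("no distinct turning points: the fitted curve is monotone")
    root = math.sqrt(disc)
    lo, hi = sorted(((-b - root) / (2 * a), (-b + root) / (2 * a)))
    return lo, hi


# Rounded coefficients copied from the regression table above.
b1, b2, b3 = 6.411, -0.715, 0.026
peak, trough = turning_points(b1, b2, b3)

# With b3 > 0 the smaller root is a local maximum and the larger a local minimum.
for name, x in (("peak", peak), ("trough", trough)):
    print(f"{name}: log_gdp_pc = {x:.2f}, gdp_pc = {math.exp(x):,.0f} USD")
```

With these rounded inputs the peak falls at a log_gdp_pc of roughly 7.8 and the trough at roughly 10.5, which is the rise, fall and second rise described above.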

Launching the app on kuznets

The dataset ships a preset configuration that opens either app directly on the curve:

from expdpy.streamlit_app import ExPdPy  # or: from expdpy.app import ExPdPy
from expdpy.data import load_kuznets, load_kuznets_data_def, get_config

ExPdPy(load_kuznets(), df_def=load_kuznets_data_def(), config_list=get_config("kuznets"))

When launched without data, the bundled Streamlit app defaults to this dataset in its picker.