Quickstart

Run this Quickstart in the cloud — no install needed.
Every cell below runs as-is in a free Google Colab notebook.

This walkthrough mirrors the ExPanDaR functions vignette using the bundled kuznets panel (80 countries × 2015–2025, 880 country-year rows). The dataset is synthetic and rich in control variables. Its regional inequality measure (gini_regional) traces an N-shaped Kuznets curve in income: it rises, falls, then rises again at very high income. The natural log of GDP per capita ships as a column together with its square and cube (log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu), so the cubic specification needs no extra feature engineering.

import numpy as np
import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
df[["country", "continent", "year", "gini_regional", "gdp_pc", "log_gdp_pc"]].head()
country continent year gini_regional gdp_pc log_gdp_pc
0 country 1 Continent A 2015 0.085611 629.149402 6.444369
1 country 1 Continent A 2016 0.080064 672.932711 6.511645
2 country 1 Continent A 2017 0.212049 810.191773 6.697271
3 country 1 Continent A 2018 0.221831 917.986443 6.822183
4 country 1 Continent A 2019 0.261234 1079.513400 6.984266
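
The relationship between the income columns can be checked by hand. The sketch below takes the gdp_pc values from the first two rows printed above, applies the natural log, and builds the square and cube. That log_gdp_pc_sq and log_gdp_pc_cu are plain powers of log_gdp_pc is an assumption based on their names; this page does not confirm it.

```python
import numpy as np

# gdp_pc values copied from the first two rows of the head() output above
gdp_pc = np.array([629.149402, 672.932711])

log_gdp_pc = np.log(gdp_pc)      # natural log of GDP per capita
log_gdp_pc_sq = log_gdp_pc ** 2  # assumed definition of log_gdp_pc_sq
log_gdp_pc_cu = log_gdp_pc ** 3  # assumed definition of log_gdp_pc_cu

print(np.round(log_gdp_pc, 4))
```

The logged values agree with the printed log_gdp_pc column (6.444369 and 6.511645) up to rounding, which indicates that the column uses the natural log rather than log base 10.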

Treating outliers

Winsorize the (skewed) numeric variables at the 1st and 99th percentiles, then drop rows with missing values (735 complete rows remain):

analysis = ex.treat_outliers(
    df[["gini_regional", "gdp_pc", "school_enrollment", "resource_rents"]], percentile=0.01
).dropna()
analysis.describe().round(3)
gini_regional gdp_pc school_enrollment resource_rents
count 735.000 735.000 735.000 735.000
mean 0.274 25633.794 55.460 14.907
std 0.090 34241.842 26.014 8.593
min 0.085 666.721 6.000 1.902
25% 0.208 2606.858 34.218 8.919
50% 0.270 11351.537 56.896 13.034
75% 0.337 33357.890 75.026 18.790
max 0.509 150000.000 109.275 40.265
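
For readers new to winsorizing, the idea can be sketched in plain NumPy: values below the lower percentile are raised to it, and values above the upper percentile are lowered to it. This is a generic illustration, not the internals of treat_outliers; the helper name winsorize and the toy data are hypothetical.

```python
import numpy as np

def winsorize(values, percentile=0.01):
    """Clip values to the [percentile, 1 - percentile] quantile range.

    Hypothetical helper for illustration; NaNs are ignored when the
    cut-offs are computed and are passed through unchanged.
    """
    x = np.asarray(values, dtype=float)
    lower, upper = np.nanquantile(x, [percentile, 1 - percentile])
    return np.clip(x, lower, upper)

# Toy data: the integers 1..100 plus one extreme value
x = np.arange(1, 102, dtype=float)
x[-1] = 1000.0

w = winsorize(x, percentile=0.01)
print(w.min(), w.max())
```

Winsorizing keeps every row and only pulls the tails inward, which is why the row count in the table above changes only through the later dropna() call.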

Descriptive statistics

ex.prepare_descriptive_table(analysis).gt
Descriptive Statistics
N Mean Std. dev. Min. 25 % Median 75 % Max.
gini_regional 735 0.274 0.090 0.085 0.208 0.270 0.337 0.509
gdp_pc 735 25,633.794 34,241.842 666.721 2,606.858 11,351.537 33,357.890 150,000.000
school_enrollment 735 55.460 26.014 6.000 34.218 56.896 75.026 109.275
resource_rents 735 14.907 8.593 1.902 8.919 13.034 18.790 40.265

Correlations

Pearson correlations appear above, Spearman below the diagonal; significant cells are bold.

ex.prepare_correlation_table(analysis).gt
                            A      B      C      D
A: gini_regional                0.20  -0.11   0.30
B: gdp_pc               -0.20          0.79  -0.39
C: school_enrollment    -0.21   0.95         -0.44
D: resource_rents        0.34  -0.55  -0.52
This table reports Pearson correlations above and Spearman correlations below the diagonal. Number of observations: 735. Correlations with significance levels below 5% appear in bold.
ex.prepare_correlation_graph(analysis).fig
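
The layout of the correlation table (Pearson above the diagonal, Spearman below) can be reproduced with a few lines of NumPy. This is a sketch of the idea only, not the library's implementation: the function names are hypothetical, the rank helper assumes there are no ties, and significance testing is left out.

```python
import numpy as np

def ordinal_ranks(column):
    """Rank positions 0..n-1; assumes the column has no tied values."""
    order = np.argsort(column)
    ranks = np.empty(len(column))
    ranks[order] = np.arange(len(column))
    return ranks

def mixed_correlation_matrix(data):
    """Pearson in the upper triangle, Spearman in the lower triangle."""
    pearson = np.corrcoef(data, rowvar=False)
    ranked = np.column_stack([ordinal_ranks(col) for col in data.T])
    spearman = np.corrcoef(ranked, rowvar=False)  # Spearman = Pearson on ranks
    mixed = np.triu(pearson, k=1) + np.tril(spearman, k=-1)
    np.fill_diagonal(mixed, 1.0)
    return mixed

# Toy data: y is a monotone but strongly nonlinear function of x
x = np.arange(1.0, 6.0)
data = np.column_stack([x, np.exp(x)])
m = mixed_correlation_matrix(data)
print(np.round(m, 3))
```

In the toy data the Spearman entry (below the diagonal) is 1 because the relationship is monotone, while the Pearson entry (above the diagonal) is smaller because the relationship is not linear. Gaps between the two triangles of the table above can be read the same way.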

Extreme observations

ex.prepare_ext_obs_table(
    df, n=5, cs_id=["country"], ts_id="year", var="gini_regional"
).gt
country year gini_regional
country 80 2,023.000 0.568
country 80 2,024.000 0.539
country 69 2,025.000 0.532
country 80 2,022.000 0.528
country 79 2,021.000 0.524
... ... ...
country 4 2,020.000 0.035
country 4 2,021.000 0.033
country 4 2,024.000 0.020
country 4 2,025.000 0.020
country 4 2,023.000 0.020
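
The table above lists the n largest and n smallest values of one variable together with the panel identifiers. The same selection can be sketched with the standard library; the rows below are made-up toy values, not taken from kuznets.

```python
# Toy (country, year, gini_regional) rows; values are invented for illustration
rows = [
    ("country 1", 2015, 0.09),
    ("country 1", 2016, 0.21),
    ("country 2", 2015, 0.53),
    ("country 2", 2016, 0.02),
    ("country 3", 2015, 0.30),
    ("country 3", 2016, 0.41),
]

n = 2
# Sort from largest to smallest value of the variable of interest
ranked = sorted(rows, key=lambda row: row[2], reverse=True)
top, bottom = ranked[:n], ranked[-n:]

for row in top:
    print(row)
print("...")
for row in bottom:
    print(row)
```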

By-group views

ex.prepare_by_group_bar_graph(df, "continent", "gini_regional", np.nanmedian).fig
ex.prepare_by_group_violin_graph(df, "continent", "gini_regional")
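
The bar graph call passes np.nanmedian rather than np.median as the aggregation function. A minimal sketch of why that matters when a group contains missing values, using made-up numbers and hypothetical group labels:

```python
import numpy as np

# Toy data: group "A" contains one missing value
groups = np.array(["A", "A", "B", "B", "B"])
values = np.array([0.2, np.nan, 0.3, 0.5, 0.4])

# np.nanmedian skips NaNs within each group
medians = {
    str(g): float(np.nanmedian(values[groups == g])) for g in np.unique(groups)
}
# np.median propagates the NaN instead
plain_a = float(np.median(values[groups == "A"]))

print(medians)
print(plain_a)
```

With np.median, any group with a single missing observation would aggregate to NaN and lose its bar.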

Scatter plot with LOESS

The scatter plot shows the N-shaped Kuznets curve: regional inequality against (log) GDP per capita, with points sized by population and colored by continent. The LOESS smoother traces the rise, the fall, and the second rise:

ex.prepare_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
)
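
LOESS fits a separate weighted regression around each point and reads the smoothed value off that local fit. The sketch below is a bare-bones local-linear version with tricube weights, written to show the mechanism. It is not the smoother used by prepare_scatter_plot; the function name loess_smooth, the frac parameter, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def loess_smooth(x, y, frac=0.5):
    """Local-linear smoother with tricube weights (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))  # number of neighbours per local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        # Bandwidth: distance to the k-th nearest point (guarded against zero)
        h = max(np.sort(dist)[k - 1], np.finfo(float).eps)
        weights = np.clip(1 - (dist / h) ** 3, 0, None) ** 3
        # Weighted least squares for a line centred at x[i]
        sw = np.sqrt(weights)
        design = np.column_stack([np.ones(n), x - x[i]])
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0]  # intercept = fitted value at x[i]
    return fitted

# Simulated cubic relationship with noise (not the kuznets data)
rng = np.random.default_rng(0)
x = np.linspace(6, 12, 50)
y = 0.02 * (x - 9) ** 3 - 0.1 * (x - 9) + 0.3 + rng.normal(scale=0.02, size=50)
smooth = loess_smooth(x, y, frac=0.4)
```

A smaller frac follows the data more closely and a larger one gives a flatter curve. Because each local fit is a straight line, the smoother can follow a rise, fall, and rise without a cubic being imposed in advance.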

Regression with fixed effects and clustered SEs

kuznets is a country-year panel, so the natural specification absorbs two-way (country + year) fixed effects, which control for every time-invariant country trait and every common annual shock. Within country, a cubic in (log) GDP per capita still recovers the N: the linear, quadratic, and cubic terms come out positive, negative, and positive, each significant at the 0.1% level. Standard errors are clustered by country:

res = ex.prepare_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
)
res.etable
                   gini_regional (1)
coef
  log_gdp_pc          6.411*** (0.210)
  log_gdp_pc_sq      -0.715*** (0.023)
  log_gdp_pc_cu       0.026*** (0.001)
fe
  year                x
  country             x
stats
  Observations        880
  R2                  0.874
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
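
What "absorbing" two-way fixed effects means can be sketched on a small balanced panel: subtract the unit mean and the period mean from every variable, add back the grand mean, and run OLS on what is left. The example uses simulated data with a known slope of 0.5. It illustrates the within transformation only, is not the estimator behind prepare_regression_table, and does not compute clustered standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 6, 5

# Simulated balanced panel: unit effects, period effects, one regressor
unit_effect = rng.normal(size=(n_units, 1))
period_effect = rng.normal(size=(1, n_periods))
x = rng.normal(size=(n_units, n_periods))
y = unit_effect + period_effect + 0.5 * x  # true slope is 0.5, no noise

def two_way_demean(a):
    """Within transformation for a balanced units-by-periods array."""
    return (
        a
        - a.mean(axis=1, keepdims=True)  # remove unit means
        - a.mean(axis=0, keepdims=True)  # remove period means
        + a.mean()                       # add back the grand mean
    )

x_within, y_within = two_way_demean(x), two_way_demean(y)
beta = float((x_within * y_within).sum() / (x_within ** 2).sum())
print(round(beta, 6))
```

The simple demeaning formula holds only for balanced panels. Unbalanced panels need an iterative or dummy-variable approach.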

Frisch-Waugh-Lovell plot

prepare_fwl_plot visualizes a single coefficient from the regression above. It residualizes both the outcome and the focal regressor on the other regressors and the fixed effects, then scatters the two sets of residuals against each other. By the Frisch-Waugh-Lovell theorem, the slope fitted through that scatter equals the focal coefficient in the full regression. Here the focal term is the linear log_gdp_pc, net of the quadratic and cubic terms and the country and year fixed effects:

ex.prepare_fwl_plot(
    df,
    dv="gini_regional",
    var="log_gdp_pc",
    controls=["log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
).fig
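
The Frisch-Waugh-Lovell equality can be verified numerically. The sketch below uses simulated data with two controls (no fixed effects) and compares the focal coefficient from the full regression with the slope of residual on residual. Variable names and data are made up for illustration; this is not how prepare_fwl_plot is implemented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
c1 = rng.normal(size=n)
c2 = rng.normal(size=n)
x = 0.6 * c1 - 0.3 * c2 + rng.normal(size=n)  # focal regressor, correlated with controls
y = 1.0 + 2.0 * x - 1.0 * c1 + 0.5 * c2 + rng.normal(size=n)

def ols(design, target):
    """Least-squares coefficients of target on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)

# Full regression: coefficient on x is at index 1
beta_full = float(ols(np.column_stack([ones, x, c1, c2]), y)[1])

# FWL: residualize x and y on the controls, then regress residual on residual
controls = np.column_stack([ones, c1, c2])
x_res = x - controls @ ols(controls, x)
y_res = y - controls @ ols(controls, y)
slope = float(x_res @ y_res / (x_res @ x_res))

print(round(beta_full, 6), round(slope, 6))
```

The two numbers agree up to floating-point error, so the residual scatter shows exactly the variation that identifies the coefficient.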