Quickstart

Run this Quickstart in the cloud — no install needed.
Every cell below runs as-is in a free Google Colab notebook.

This walkthrough mirrors the ExPanDaR functions vignette using the bundled kuznets panel (80 countries × 2015–2025) — a synthetic dataset, rich in control variables, whose regional inequality (gini_regional) traces an N-shaped Kuznets curve in income: it rises, falls, then rises again at very high income. The log of GDP per capita and its square and cube ship as columns (log_gdp_pc, log_gdp_pc_sq, log_gdp_pc_cu) so the cubic curve is turnkey.

import numpy as np
import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
df[["country", "continent", "year", "gini_regional", "gdp_pc", "log_gdp_pc"]].head()

	country	continent	year	gini_regional	gdp_pc	log_gdp_pc
0	country 1	Continent A	2015	0.085611	629.149402	6.444369
1	country 1	Continent A	2016	0.080064	672.932711	6.511645
2	country 1	Continent A	2017	0.212049	810.191773	6.697271
3	country 1	Continent A	2018	0.221831	917.986443	6.822183
4	country 1	Continent A	2019	0.261234	1079.513400	6.984266
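
The rows shown above are consistent with log_gdp_pc being the natural log of gdp_pc. As a minimal sketch, assuming the squared and cubed columns are plain powers of that log, the three polynomial terms could be rebuilt by hand. The toy array is copied from the first rows of the output above and is purely illustrative:

```python
import numpy as np

# First three gdp_pc values from the head() output above.
gdp_pc = np.array([629.149402, 672.932711, 810.191773])

log_gdp_pc = np.log(gdp_pc)        # natural log
log_gdp_pc_sq = log_gdp_pc ** 2    # assumed: plain square of the log
log_gdp_pc_cu = log_gdp_pc ** 3    # assumed: plain cube of the log

print(np.round(log_gdp_pc, 6))
```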

Treating outliers

Winsorize the (skewed) numeric variables at the 1st and 99th percentiles:

analysis = ex.treat_outliers(
    df[["gini_regional", "gdp_pc", "school_enrollment", "resource_rents"]], percentile=0.01
).dropna()
analysis.describe().round(3)

	gini_regional	gdp_pc	school_enrollment	resource_rents
count	735.000	735.000	735.000	735.000
mean	0.274	25633.794	55.460	14.907
std	0.090	34241.842	26.014	8.593
min	0.085	666.721	6.000	1.902
25%	0.208	2606.858	34.218	8.919
50%	0.270	11351.537	56.896	13.034
75%	0.337	33357.890	75.026	18.790
max	0.509	150000.000	109.275	40.265
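
For intuition, winsorizing replaces values beyond a lower and an upper quantile with those quantile values, rather than dropping them. The helper below is a hypothetical NumPy sketch of that idea, not the implementation behind ex.treat_outliers; the quantile interpolation rule is NumPy's default and may differ from the library's:

```python
import numpy as np

def winsorize(values, percentile=0.01):
    """Clip values to the [percentile, 1 - percentile] quantile range.

    Hypothetical helper for illustration; NaNs are ignored when the
    quantiles are computed and stay NaN in the result.
    """
    x = np.asarray(values, dtype=float)
    lower, upper = np.nanquantile(x, [percentile, 1 - percentile])
    return np.clip(x, lower, upper)

# 99 well-behaved values plus one extreme outlier.
raw = np.append(np.arange(1, 100), 1000.0)
treated = winsorize(raw, percentile=0.01)
```

The outlier is pulled in toward the bulk of the data while the median is untouched, which is why winsorizing is a common first step before computing means and correlations.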

Descriptive statistics

ex.prepare_descriptive_table(analysis).gt

Descriptive Statistics
	N	Mean	Std. dev.	Min.	25 %	Median	75 %	Max.
gini_regional	735	0.274	0.090	0.085	0.208	0.270	0.337	0.509
gdp_pc	735	25,633.794	34,241.842	666.721	2,606.858	11,351.537	33,357.890	150,000.000
school_enrollment	735	55.460	26.014	6.000	34.218	56.896	75.026	109.275
resource_rents	735	14.907	8.593	1.902	8.919	13.034	18.790	40.265

Correlations

Pearson correlations appear above, Spearman below the diagonal; significant cells are bold.

ex.prepare_correlation_table(analysis).gt

	A	B	C	D
A: gini_regional		0.20	-0.11	0.30
B: gdp_pc	-0.20		0.79	-0.39
C: school_enrollment	-0.21	0.95		-0.44
D: resource_rents	0.34	-0.55	-0.52
This table reports Pearson correlations above and Spearman correlations below the diagonal. Number of observations: 735. Correlations with significance levels below 5% appear in bold.
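
The two triangles can differ noticeably because Spearman's coefficient is Pearson's coefficient computed on ranks, so it rewards any monotone relationship, while Pearson only rewards linear ones. A minimal NumPy sketch, with hypothetical helper names and no tie handling:

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def ranks(x):
    # Rank 1 = smallest value. Assumes no ties, which keeps the sketch short.
    order = np.argsort(x)
    result = np.empty(len(x))
    result[order] = np.arange(1, len(x) + 1)
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# A perfectly monotone but strongly nonlinear relationship.
x = np.linspace(1, 10, 50)
y = np.exp(x)

print(round(pearson(x, y), 2), round(spearman(x, y), 2))
```

A skewed variable such as gdp_pc is a plausible reason for a gap like the one between the B/C cells above (0.79 versus 0.95), although that reading is an interpretation rather than something the table itself establishes.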

ex.prepare_correlation_graph(analysis).fig

Extreme observations

ex.prepare_ext_obs_table(
    df, n=5, cs_id=["country"], ts_id="year", var="gini_regional"
).gt


country	year	gini_regional
country 80	2,023.000	0.568
country 80	2,024.000	0.539
country 69	2,025.000	0.532
country 80	2,022.000	0.528
country 79	2,021.000	0.524
...	...	...
country 4	2,020.000	0.035
country 4	2,021.000	0.033
country 4	2,024.000	0.020
country 4	2,025.000	0.020
country 4	2,023.000	0.020
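
The table lists the n largest and n smallest values of the chosen variable together with their panel identifiers. A rough pandas equivalent, using a hypothetical helper and toy data rather than the library's own code:

```python
import pandas as pd

def extreme_obs(frame, var, n=5):
    """Return the n largest and n smallest rows of `var`, largest first."""
    ranked = frame.dropna(subset=[var]).sort_values(var, ascending=False)
    return pd.concat([ranked.head(n), ranked.tail(n)])

# Toy panel slice; values are made up for illustration.
toy = pd.DataFrame({
    "country": [f"country {i}" for i in range(1, 9)],
    "year": [2020] * 8,
    "gini_regional": [0.31, 0.05, 0.52, 0.27, None, 0.44, 0.12, 0.38],
})
top_bottom = extreme_obs(toy, "gini_regional", n=2)
print(top_bottom)
```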

Time trends

ex.prepare_trend_graph(df, ts_id="year", var=["gini_regional"]).fig

ex.prepare_quantile_trend_graph(df, ts_id="year", var="gini_regional").fig
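
Both graphs summarize the cross-section separately for each period. The sketch below computes the kind of per-year aggregates such plots are typically built on: a mean with its standard error, and a set of quantiles. The specific statistics and quantile levels are assumptions modeled on ExPanDaR's defaults, and the toy values are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": np.repeat([2015, 2016, 2017], 4),
    "gini_regional": [0.10, 0.20, 0.30, 0.40,
                      0.12, 0.22, 0.32, 0.42,
                      0.15, 0.25, 0.35, 0.45],
})
by_year = toy.groupby("year")["gini_regional"]

# Mean and standard error of the mean per year (trend-graph style).
trend = pd.DataFrame({"mean": by_year.mean(), "se": by_year.sem()})

# One column per quantile level, one row per year (quantile-trend style).
quantiles = by_year.quantile([0.05, 0.25, 0.50, 0.75, 0.95]).unstack()
```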

By-group views

ex.prepare_by_group_bar_graph(df, "continent", "gini_regional", np.nanmedian).fig

ex.prepare_by_group_violin_graph(df, "continent", "gini_regional")

Scatter plot with LOESS

The N-shaped Kuznets curve — regional inequality against (log) GDP per capita, sized by population and colored by continent. The LOESS smoother traces the rise–fall–rise:

ex.prepare_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
)
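
LOESS fits a small weighted regression around each point and keeps only the fitted value at that point, which lets the curve bend without a global functional form. The following is a bare-bones local-linear version with tricube weights, written for illustration; the function name, the frac parameter and the simulated data are assumptions, and the library's smoother may differ in span, degree and robustness steps:

```python
import numpy as np

def loess(x, y, frac=0.5):
    """Local-linear smoother with tricube weights (illustrative sketch).

    Assumes distinct x values so that every bandwidth is positive.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))  # points in each neighbourhood
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # bandwidth: k-th smallest distance
        w = np.clip(1 - (dist / h) ** 3, 0, 1) ** 3  # tricube weights
        # Weighted least squares of y on (1, x - x[i]); the intercept
        # is the fitted value at x[i].
        design = np.column_stack([np.ones(n), x - x[i]])
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)[0]
        fitted[i] = beta[0]
    return fitted

# Simulated rise-fall-rise signal with noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 80)
signal = x ** 3 - 2 * x
y = signal + rng.normal(scale=0.5, size=x.size)
smooth = loess(x, y, frac=0.2)
```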

Regression with fixed effects and clustered SEs

kuznets is a country-year panel, so the natural specification absorbs two-way (country and year) fixed effects, controlling for every time-invariant country trait and every common annual shock. A cubic in log GDP per capita still recovers the N within country: the linear and cubic terms are positive, the quadratic term is negative, and all three are significant, with standard errors clustered by country:

res = ex.prepare_regression_table(
    df,
    dvs="gini_regional",
    idvs=["log_gdp_pc", "log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
)
res.etable

	gini_regional
	(1)
coef
log_gdp_pc	6.411*** (0.210)
log_gdp_pc_sq	-0.715*** (0.023)
log_gdp_pc_cu	0.026*** (0.001)
fe
year	x
country	x
stats
Observations	880
R²	0.874
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
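
To see what absorbing country and year fixed effects does, the sketch below simulates a balanced panel in which the regressor is correlated with the unit effect, then compares pooled OLS with the within estimator. On a balanced panel, two-way fixed effects are equivalent to subtracting unit means and period means and adding back the grand mean. All names and numbers are illustrative, and this is not how the library estimates the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 30, 8
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)

alpha = rng.normal(size=n_units)[unit]      # unit (country) effects
gamma = rng.normal(size=n_periods)[period]  # period (year) effects
x = rng.normal(size=unit.size) + alpha      # regressor correlated with alpha
y = 0.5 * x + alpha + gamma + 0.1 * rng.normal(size=unit.size)

def group_mean(v, g):
    """Mean of v within each group of g, broadcast back to row level."""
    return (np.bincount(g, weights=v) / np.bincount(g))[g]

def two_way_demean(v):
    # Exact for balanced panels only.
    return v - group_mean(v, unit) - group_mean(v, period) + v.mean()

x_w, y_w = two_way_demean(x), two_way_demean(y)
beta_within = float(x_w @ y_w / (x_w @ x_w))  # true slope is 0.5
beta_pooled = float(np.polyfit(x, y, 1)[0])   # ignores alpha and gamma

print(round(beta_pooled, 3), round(beta_within, 3))
```

The pooled slope absorbs part of the unit effect, while the within slope stays close to the value used in the simulation.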

Frisch-Waugh-Lovell plot

The same coefficient, visualized. prepare_fwl_plot residualizes both the outcome and the focal regressor on the other regressors and the fixed effects, then plots one set of residuals against the other. By the Frisch-Waugh-Lovell theorem, the fitted slope equals the focal coefficient in the regression above. Here the focal term is the linear log_gdp_pc term, net of the quadratic and cubic terms and the country and year fixed effects:

ex.prepare_fwl_plot(
    df,
    dv="gini_regional",
    var="log_gdp_pc",
    controls=["log_gdp_pc_sq", "log_gdp_pc_cu"],
    feffects=["country", "year"],
    clusters=["country"],
).fig
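
The theorem itself is easy to check numerically. In this self-contained sketch, on simulated data with illustrative names, the focal coefficient from the full regression matches the slope obtained by regressing residualized y on the residualized focal regressor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
controls = rng.normal(size=(n, 2))
focal = controls @ np.array([0.6, -0.3]) + rng.normal(size=n)
y = 1.5 * focal + controls @ np.array([2.0, 1.0]) + rng.normal(size=n)

def ols(design, target):
    return np.linalg.lstsq(design, target, rcond=None)[0]

# Full regression: focal regressor first, then constant and controls.
others = np.hstack([np.ones((n, 1)), controls])
beta_full = float(ols(np.hstack([focal[:, None], others]), y)[0])

# FWL: residualize y and the focal regressor on the other regressors,
# then regress residual on residual.
resid_y = y - others @ ols(others, y)
resid_x = focal - others @ ols(others, focal)
beta_fwl = float(resid_x @ resid_y / (resid_x @ resid_x))

print(np.isclose(beta_full, beta_fwl))
```

Fixed effects enter the same way: they are simply additional columns (or demeaning steps) in the set of "other" regressors.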