Explore panel data

The Explore module is your first look at a dataset — before you fit a single model. This page is a case study: you have just been handed a country–year panel and asked whether inequality follows the famous Kuznets curve in income — rising, then falling, as economies develop — and what determines it. We will answer that the way an analyst actually works: explore first, model later.

The lead dataset is the bundled kuznets panel (80 countries observed every year over 2015–2025). Its regional inequality measure (gini_regional) is engineered to trace an N-shaped Kuznets pattern in (log) GDP per capita, surrounded by a realistic set of determinants — trade openness, resource rents, democracy, schooling, FDI, population.

Every Explore function takes a pandas DataFrame and returns a small result object carrying a tidy .df plus an interactive Plotly figure (.fig) or a publication-quality Great Tables object (.gt). Many also offer a plain-language .interpret(). Read this page top to bottom: the functions are ordered as a workflow — know the panel → describe it → split its variation → trace its trends → compare groups → study relationships → measure its dynamics.

Note

This is exploratory analysis: every reading below describes an association, never a cause. The Analyze module turns these patterns into estimates, and Learn explains the ideas behind them.

Stage 0 — Set up the panel

A panel has two coordinates: an entity (here the country) and a time index (the year). Declaring them once means every call below can omit them, and panel-aware tables know to summarize by period. set_labels does double duty: it attaches the data dictionary’s human-readable labels (so figures say “Regional inequality (Gini)” instead of gini_regional), and with set_panel=True it also reads the entity / time tags from that dictionary and declares the panel.

import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = ex.set_labels(load_kuznets(), load_kuznets_data_def(), set_panel=True)
df[["country", "continent", "year", "gini_regional", "gdp_pc", "log_gdp_pc"]].head()

	country	continent	year	gini_regional	gdp_pc	log_gdp_pc
0	country 1	Continent A	2015	0.085611	629.149402	6.444369
1	country 1	Continent A	2016	0.080064	672.932711	6.511645
2	country 1	Continent A	2017	0.212049	810.191773	6.697271
3	country 1	Continent A	2018	0.221831	917.986443	6.822183
4	country 1	Continent A	2019	0.261234	1079.513400	6.984266

You can also declare the panel directly with set_panel, and read back what is stored with resolve_panel — handy when a step (a merge, a column subset) drops the metadata and you need to re-declare it:

ex.set_panel(df, entity="country", time="year")
ex.resolve_panel(df)  # -> ('country', 'year')

('country', 'year')

Stage 1 — Know the panel’s skeleton

Before looking at a single value, learn the shape of the data. Is the panel balanced? Are there gaps? Where is information missing? A surprising amount of analysis goes wrong because this step was skipped.

explore_panel_structure summarizes balance and coverage and draws a presence grid (one row per country, one column per year): solid means observed, blank means missing.

structure = ex.explore_panel_structure(df)
structure.gt

Panel Structure
	value
units	80
periods	11
observations	880
balanced	Yes
internal gaps	0
min obs per unit	11
max obs per unit	11
A cell is 'present' when the unit is observed in that period. An interior gap is a missing period between a unit's first and last.

structure.fig
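
To see what the summary is counting, the same checks can be sketched in plain pandas. This is an illustration under assumptions, not the library's implementation: the toy frame and the helper interior_gaps are hypothetical, and the gap definition follows the table note above (a missing period between a unit's first and last).

```python
import pandas as pd

# Toy panel (hypothetical data): unit "b" is missing 2016, an interior gap.
toy = pd.DataFrame({
    "country": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "year":    [2015, 2016, 2017, 2015, 2017, 2015, 2016, 2017],
})

# Presence grid: one row per unit, one column per period, True where observed.
presence = pd.crosstab(toy["country"], toy["year"]).astype(bool)

obs_per_unit = presence.sum(axis=1)
balanced = bool(presence.all().all())  # balanced = every unit seen in every period


def interior_gaps(row: pd.Series) -> int:
    """Count unobserved periods between a unit's first and last observed period."""
    observed = row[row].index
    span = row.loc[observed.min():observed.max()]
    return int((~span).sum())


gaps = presence.apply(interior_gaps, axis=1)
print(balanced)
print(obs_per_unit)
print(gaps)
```

On the bundled panel every unit has 11 observations and no gaps, so the equivalent of balanced would be True.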

Some variables are missing far more often than others. explore_missing_values_plot maps the fraction missing by variable and year. Notice that several determinants (gasoline_price, school_enrollment, polity2) are patchier, and heavier in the early years; resource_rents, by contrast, is observed in all 880 rows. Knowing this now prevents a model later from silently dropping a large share of your sample.

ex.explore_missing_values_plot(df).fig
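
The quantity behind that map is easy to reproduce by hand. The sketch below uses a hypothetical toy frame (the column names are made up for illustration) to compute the share missing per variable and year, and then shows how many rows survive if a model keeps only complete cases.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "polity" is fully missing in 2015, half missing in 2016.
toy = pd.DataFrame({
    "year":   [2015, 2015, 2016, 2016],
    "gini":   [0.25, 0.30, 0.27, 0.31],
    "polity": [np.nan, np.nan, 4.0, np.nan],
})

# Fraction missing: rows are years, columns are variables.
missing_share = toy.drop(columns="year").isna().groupby(toy["year"]).mean()
print(missing_share)

# Listwise deletion keeps only rows with no missing value in any column.
complete_cases = len(toy.dropna())
print(f"{complete_cases} of {len(toy)} rows are complete")
```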

explore_value_heatmap shows the outcome across the whole country-by-year grid. Standardizing by time (a z-score within each year) strips out the common trend, so what remains is each country’s relative position: who is persistently more unequal than its peers.

ex.explore_value_heatmap(df, var="gini_regional", standardize="by_time").fig
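
As a rough sketch of what standardizing by time means, the pandas code below z-scores a variable within each year. The toy data are hypothetical, and the use of the sample standard deviation is an assumption; the library may standardize slightly differently.

```python
import pandas as pd

# Toy panel (hypothetical): every country rises by 0.10 between the two years.
toy = pd.DataFrame({
    "country": ["a", "b", "c", "a", "b", "c"],
    "year":    [2015, 2015, 2015, 2016, 2016, 2016],
    "gini":    [0.20, 0.30, 0.40, 0.30, 0.40, 0.50],
})

# z-score within each year: subtract the year's mean, divide by the year's SD.
by_year = toy.groupby("year")["gini"]
toy["gini_z"] = (toy["gini"] - by_year.transform("mean")) / by_year.transform("std")

# Country-by-year grid of relative positions, the layout a heatmap would color.
grid = toy.pivot(index="country", columns="year", values="gini_z")
print(grid.round(2))
```

Because the common upward shift is removed, each country's z-score is the same in both years: only its rank among peers is left.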

Stage 2 — Clean and describe the variables

Now meet the variables one at a time: their level, spread, and shape.

Several determinants are right-skewed (GDP per capita, resource rents, aid): a handful of huge values can dominate a correlation or a scatter. treat_outliers winsorizes — it caps values at the 1st/99th percentile rather than dropping rows. We build a cleaned analysis frame here and reuse it for the relationship views in Stage 6.

cols = [
    "gini_regional", "gdp_pc", "log_gdp_pc", "trade_share",
    "resource_rents", "school_enrollment", "polity2",
]
analysis = ex.set_labels(
    ex.treat_outliers(df[cols], percentile=0.01), load_kuznets_data_def()
)
analysis.describe().round(2)

	gini_regional	gdp_pc	log_gdp_pc	trade_share	resource_rents	school_enrollment	polity2
count	880.00	880.00	880.00	880.00	880.00	735.00	788.00
mean	0.27	25524.00	9.21	0.61	14.95	55.46	0.55
std	0.09	34212.74	1.49	0.23	8.56	26.01	4.77
min	0.09	666.72	6.50	0.26	1.90	6.00	-10.00
25%	0.21	2512.16	7.83	0.44	9.02	34.22	-3.00
50%	0.27	11199.80	9.32	0.55	13.08	56.90	1.00
75%	0.33	33444.17	10.42	0.73	18.81	75.03	4.00
max	0.51	150000.00	11.92	1.32	40.27	109.28	10.00
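
To make the capping concrete, here is a minimal pandas sketch of winsorizing at the 1st and 99th percentiles. It is not treat_outliers itself: the helper winsorize and the toy series are hypothetical, and the library may compute its cutoffs differently.

```python
import numpy as np
import pandas as pd


def winsorize(s: pd.Series, percentile: float = 0.01) -> pd.Series:
    """Cap values at the given lower and upper quantiles; no rows are dropped."""
    lo, hi = s.quantile([percentile, 1 - percentile])
    return s.clip(lower=lo, upper=hi)


# Toy series (hypothetical): the values 1..100 plus one huge outlier.
x = pd.Series(np.arange(1, 102, dtype=float))
x.iloc[-1] = 10_000.0

w = winsorize(x)
print(len(x), len(w))      # same number of rows before and after
print(x.max(), w.max())    # the outlier is pulled in to the upper cutoff
print(x.mean(), w.mean())  # the mean is no longer dominated by one value
```

The design choice is the one named above: capping keeps every country-year in the sample, whereas trimming would shrink it.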

explore_descriptive_table reports the standard statistics for every numeric variable. Because the panel is declared, it lays them out by period — here the first and last year — so you can read level and change at a glance. The note beneath records the panel’s dimensions and which variables carry missing data.

ex.explore_descriptive_table(df, periods=[2015, 2025]).gt

Descriptive Statistics
	Mean		Std. dev.		Median		Min.		Max.
	2015	2025	2015	2025	2015	2025	2015	2025	2015	2025
Regional inequality (Gini)	0.257	0.279	0.087	0.106	0.256	0.263	0.086	0.020	0.455	0.532
GDP per capita (USD)	22,269.506	29,530.530	28,843.529	40,520.322	8,738.576	14,547.014	629.149	531.027	115,241.626	150,000.000
Population	18,438,095.075	21,017,625.750	45,331,372.978	51,130,240.237	4,639,542.000	5,395,351.000	209,936.000	263,897.000	331,159,497.000	368,704,165.000
Resource rents (% of GDP)	15.201	14.605	8.485	9.232	14.134	12.235	1.953	0.350	40.644	44.025
Arable land (share)	0.246	0.246	0.122	0.122	0.218	0.224	0.033	0.029	0.550	0.558
Trade openness (trade/GDP)	0.613	0.612	0.243	0.233	0.554	0.563	0.239	0.288	1.404	1.293
FDI inflows (% of GDP)	0.023	0.025	0.054	0.059	0.027	0.016	−0.082	−0.149	0.119	0.137
Area (km²)	450,980.746	450,980.746	1,023,860.606	1,023,860.606	125,400.969	125,400.969	1,224.257	1,224.257	8,189,082.989	8,189,082.989
Gasoline price (USD/litre)	0.530	0.574	0.213	0.216	0.492	0.616	0.200	0.200	0.950	1.024
Official development aid (USD)	347,510,465.864	345,373,820.793	508,075,434.134	492,345,744.460	124,807,975.154	141,595,753.694	−7,457,636.633	−7,407,267.673	2,200,000,000.000	2,200,000,000.000
Secondary enrollment (% gross)	53.845	57.257	26.238	27.313	53.060	60.128	6.261	9.809	104.200	125.014
Light-based inequality (Gini)	0.255	0.308	0.140	0.132	0.264	0.319	0.010	0.010	0.532	0.587
Democracy score (Polity2)	0.210	1.000	4.691	4.848	1.000	1.500	−10.000	−8.000	8.000	10.000
Federal state	0.200	0.200	0.403	0.403	0.000	0.000	0.000	0.000	1.000	1.000
Log GDP per capita	9.078	9.322	1.494	1.535	9.074	9.584	6.444	6.275	11.655	11.918
Log GDP per capita²	84.613	89.229	27.205	28.523	82.338	91.859	41.530	39.373	135.834	142.048
Log GDP per capita³	808.131	874.893	379.593	407.283	747.186	880.434	267.634	247.060	1,583.117	1,692.984
Observations: 880.
Variables with missing data: FDI inflows (% of GDP) (46), Gasoline price (USD/litre) (232), Official development aid (USD) (65), Secondary enrollment (% gross) (145), Light-based inequality (Gini) (133), Democracy score (Polity2) (92).
Panel: 80 units (Country) over 11 periods (Year); 11.0 observations per unit on average.
Columns show 2 of 11 periods (2015 to 2025).

explore_histogram shows the distribution of one variable. The outcome gini_regional is fairly symmetric; switch the variable to gdp_pc and you will see the long right tail that motivates the log transform used throughout the Kuznets literature (and provided here as log_gdp_pc).

ex.explore_histogram(df, "gini_regional", kde=True).fig

explore_bar_plot counts the levels of a categorical variable — how many observations fall in each continent.

ex.explore_bar_plot(df, "continent").fig

explore_ext_obs_table lists the most extreme observations — the country-years with the highest and lowest inequality. Extremes are where data errors hide, and where the substantive story is often most vivid.

ex.explore_ext_obs_table(df, n=5, var="gini_regional", entity=["country"], time="year").gt

Country	Year	Regional inequality (Gini)
Highest 5
country 80	2,023.000	0.568
country 80	2,024.000	0.539
country 69	2,025.000	0.532
country 80	2,022.000	0.528
country 79	2,021.000	0.524
Lowest 5
country 4	2,020.000	0.035
country 4	2,021.000	0.033
country 4	2,023.000	0.020
country 4	2,024.000	0.020
country 4	2,025.000	0.020

Stage 3 — Two kinds of variation: within vs between

This is the idea that makes panel data special. Every variable varies in two directions: across countries (between) and over time within a country (within). Mixing them up is the classic panel mistake, so it pays to separate them up front.

explore_xtsum_table decomposes each variable’s variation into overall, between, and within components (the Stata xtsum you may know). If almost all of a variable’s variation is between countries, then over an 11-year window it is nearly a fixed trait; if much is within, it genuinely moves over time.

xt = ex.explore_xtsum_table(df, var=["gini_regional", "log_gdp_pc", "trade_share"])
xt.gt

Within/Between Variation (xtsum)
	Mean	Std. dev.	Min.	Max.	N	n units	T-bar
Regional inequality (Gini)
Overall	0.273	0.091	0.020	0.568	880	80	11.000
Between		0.078	0.068	0.468
Within		0.048	0.076	0.451
Log GDP per capita
Overall	9.205	1.490	6.275	11.918	880	80	11.000
Between		1.482	6.483	11.839
Within		0.222	8.436	10.051
Trade openness (trade/GDP)
Overall	0.611	0.236	0.198	1.421	880	80	11.000
Between		0.233	0.290	1.335
Within		0.048	0.451	0.739
Overall = pooled. Between = across unit means. Within = over time inside a unit (deviations from the unit mean, recentered on the grand mean).

print(xt.interpret())

Splitting each variable's variation into **between** (differences across units) and **within** (variation over time inside a unit):
- **gini_regional**: between SD 0.0783, within SD 0.0478 — variation is mostly **across units** (between).
- **log_gdp_pc**: between SD 1.48, within SD 0.222 — variation is mostly **across units** (between).
- **trade_share**: between SD 0.233, within SD 0.048 — variation is mostly **across units** (between).

_Between variation drives cross-sectional comparisons; within variation is what fixed-effects (within) estimators rely on._
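
The decomposition itself takes only a few lines of pandas. The sketch below follows the table note (between = spread of unit means, within = deviations from the unit mean recentered on the grand mean); the toy data are hypothetical, and the sample-SD divisor is an assumption about the convention used.

```python
import pandas as pd

# Toy panel (hypothetical): two units far apart in level, each moving a little.
toy = pd.DataFrame({
    "country": ["a", "a", "a", "b", "b", "b"],
    "x":       [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
})

grand_mean = toy["x"].mean()
unit_mean = toy.groupby("country")["x"].transform("mean")

overall_sd = toy["x"].std()                            # pooled
between_sd = toy.groupby("country")["x"].mean().std()  # across unit means
within = toy["x"] - unit_mean + grand_mean             # recentered deviations
within_sd = within.std()                               # over time inside a unit

print(f"overall {overall_sd:.3f}  between {between_sd:.3f}  within {within_sd:.3f}")
```

Here almost all the spread is between units, the same pattern the interpretation above reports for all three variables.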

explore_spaghetti_plot makes the within variation visible: one faint line per country plus a bold mean overlay. Here are the development trajectories — most countries drift upward together, which is exactly the kind of common time trend a panel model will later absorb.

ex.explore_spaghetti_plot(df, var="log_gdp_pc").fig

Stage 4 — Trends over time

How does inequality evolve for the panel as a whole?

explore_trend_plot plots the cross-country mean each year with standard-error bars — the average trajectory.

ex.explore_trend_plot(df, var=["gini_regional"]).fig
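
The numbers behind such a plot are a grouped mean and its standard error. A minimal sketch on hypothetical toy data, assuming the error bars are the usual standard error of the mean across countries:

```python
import pandas as pd

# Toy panel (hypothetical): three countries observed in two years.
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "gini": [0.20, 0.30, 0.40, 0.30, 0.40, 0.50],
})

# Cross-country mean per year, its standard error (SD / sqrt(n)), and n.
trend = toy.groupby("year")["gini"].agg(["mean", "sem", "count"])
print(trend)
```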

A mean can hide what happens in the tails. explore_quantile_trend_plot tracks several quantiles at once, so you can see whether the spread of inequality across countries is widening or narrowing over time.

ex.explore_quantile_trend_plot(df, var="gini_regional").fig
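
A quick way to check widening or narrowing numerically is to compute the quantiles per year and difference the outer ones. The sketch below uses hypothetical toy data in which the median stays put while the spread doubles; the choice of the 10th, 50th and 90th percentiles is an assumption, not necessarily the function's default.

```python
import pandas as pd

# Toy data (hypothetical): same median in both years, wider spread in 2025.
toy = pd.DataFrame({
    "year": [2015] * 5 + [2025] * 5,
    "gini": [0.20, 0.25, 0.30, 0.35, 0.40,
             0.10, 0.20, 0.30, 0.40, 0.50],
})

# One row per year, one column per quantile.
quantiles = toy.groupby("year")["gini"].quantile([0.1, 0.5, 0.9]).unstack()

# Gap between the 90th and 10th percentile: growing means dispersion is widening.
spread = quantiles[0.9] - quantiles[0.1]
print(quantiles)
print(spread)
```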

explore_distribution_over_time shows the whole distribution shifting year by year as a ridgeline — each ridge is one year’s density. Is the distribution simply sliding, or is it also changing shape (skew, bimodality)?

ex.explore_distribution_over_time(df, var="gini_regional").fig

Stage 5 — Compare groups

The panel groups countries by continent (and flags federal states). Comparing across these groups is often the first hint of what drives the outcome.

explore_bar_plot_by_group reduces each group to a single statistic — here the mean level of inequality per continent.

ex.explore_bar_plot_by_group(df, "continent", "gini_regional").fig

A bar hides the spread inside each group. explore_violin_plot_by_group draws the full distribution per continent, so you can see overlap and outliers a mean would mask.

ex.explore_violin_plot_by_group(df, "continent", "gini_regional").fig

explore_trend_plot_by_group puts time back in: one trend line per continent, revealing whether the groups move together or diverge.

ex.explore_trend_plot_by_group(df, group_var="continent", var="gini_regional").fig

Stage 6 — Relationships and the Kuznets curve

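Now the central question. The correlation views below work on the winsorized analysis frame so a few extreme values do not distort the picture; the scatter plots return to the full df, which carries the continent, population, and country columns they need.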

explore_correlation_table reports pairwise associations — Pearson above the diagonal, Spearman (rank) below, significant cells in bold. This is the quickest scan of which determinants co-move with inequality.

ex.explore_correlation_table(analysis).gt

	A	B	C	D	E	F	G
A: Regional inequality (Gini)		0.20	-0.10	-0.16	0.31	-0.11	-0.07
B: GDP per capita (USD)	-0.19		0.83	-0.08	-0.38	0.79	0.65
C: Log GDP per capita	-0.19	1.00		0.03	-0.46	0.95	0.78
D: Trade openness (trade/GDP)	-0.15	0.06	0.06		-0.02	0.01	0.05
E: Resource rents (% of GDP)	0.35	-0.54	-0.54	-0.02		-0.44	-0.37
F: Secondary enrollment (% gross)	-0.21	0.95	0.95	0.05	-0.52		0.75
G: Democracy score (Polity2)	-0.15	0.78	0.78	0.06	-0.44	0.76
This table reports Pearson correlations above and Spearman correlations below the diagonal. The number of observations ranges from 663 to 880. Correlations with significance levels below 5% appear in bold.
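
The two-in-one layout can be rebuilt with pandas and NumPy, which also shows why the two coefficients can differ. In the table above, GDP per capita and its log have a Spearman correlation of 1.00 but a Pearson correlation of 0.83: a log is a monotone transform, so ranks are preserved while linearity is not. The sketch below reproduces that effect on simulated data; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "exp_x": np.exp(3 * x),          # monotone but strongly nonlinear in x
    "noise": rng.normal(size=200),
})

pearson = toy.corr(method="pearson").to_numpy()
spearman = toy.corr(method="spearman").to_numpy()

# Pearson strictly above the diagonal, Spearman below, diagonal left blank.
upper = np.triu(np.ones_like(pearson, dtype=bool), k=1)
combined = np.where(upper, pearson, spearman)
np.fill_diagonal(combined, np.nan)

table = pd.DataFrame(combined, index=toy.columns, columns=toy.columns).round(2)
print(table)
```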

explore_correlation_plot shows the same matrix as a heatmap (an ellipse style is also available) — easier to spot blocks of related variables at a glance.

ex.explore_correlation_plot(analysis).fig

And here it is — the N-shaped Kuznets curve. explore_scatter_plot puts inequality against (log) GDP per capita, colored by continent and sized by population, with a LOESS smoother that traces the rise–fall–rise without assuming any functional form.

ex.explore_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
).fig

But is that curve a story about rich versus poor countries (between) or about a country getting richer over time (within)? They can even have opposite signs — Simpson’s paradox for panels. explore_scatter_plot_within_between splits the relationship into pooled, between, and within clouds, each with its own fitted slope, so you can see which one the raw scatter was really showing.

wb = ex.explore_scatter_plot_within_between(df, x="log_gdp_pc", y="gini_regional")
wb.fig

print(wb.interpret())

The pooled association between **Log GDP per capita** and **Regional inequality (Gini)** has slope -0.00537. It blends a **between-unit** slope of -0.00616 (comparing unit averages) with a **within-unit** slope of 0.0292 (variation over time inside a unit).
The two diverge markedly — the cross-unit and over-time relationships tell different stories, a sign that unit-level confounders matter and that a fixed-effects specification would change the estimate.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
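
The three slopes can be computed by hand, which makes the sign flip easy to see. The sketch below is an illustration on hypothetical toy data built so that richer units have lower y on average while y rises with x inside every unit; it uses simple least-squares lines and is not the library's implementation.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical): within each unit y rises by 0.5 per unit of x,
# but units with higher average x have lower average y.
toy = pd.DataFrame({
    "country": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "y": [9.0, 9.5, 10.0, 6.0, 6.5, 7.0, 3.0, 3.5, 4.0],
})


def slope(x, y) -> float:
    """Least-squares slope of y on x."""
    return float(np.polyfit(x, y, 1)[0])


unit_means = toy.groupby("country")[["x", "y"]].mean()
demeaned = toy[["x", "y"]] - toy.groupby("country")[["x", "y"]].transform("mean")

pooled = slope(toy["x"], toy["y"])                 # all rows, ignoring units
between = slope(unit_means["x"], unit_means["y"])  # one point per unit
within = slope(demeaned["x"], demeaned["y"])       # deviations from unit means

print(f"pooled {pooled:.2f}  between {between:.2f}  within {within:.2f}")
```

The pooled slope is negative even though every unit's own slope is positive, which is the pattern the interpretation above flags when the between and within slopes diverge.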

Stage 7 — Dynamics and persistence

A panel lets you ask a question a cross-section cannot: is a country’s inequality sticky?

explore_transition_matrix bins inequality into states (here quartiles) and tabulates how often a country moves between them from one year to the next. A heavy diagonal means high persistence — where you start is where you stay.

ex.explore_transition_matrix(df, var="gini_regional", n_bins=4).fig
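
A transition matrix is a cross-tabulation of this period's state against the next period's state within the same country. The sketch below is a minimal version on hypothetical toy data; it uses two states cut at the pooled median for brevity (q=4 would give quartiles), and binning on the pooled distribution is an assumption about how the states are defined.

```python
import pandas as pd

# Toy panel (hypothetical): each country stays in its state, then switches once.
toy = pd.DataFrame({
    "country": ["a"] * 4 + ["b"] * 4,
    "year": [2015, 2016, 2017, 2018] * 2,
    "gini": [0.10, 0.12, 0.11, 0.30, 0.40, 0.42, 0.41, 0.15],
}).sort_values(["country", "year"])

# Bin the pooled values into states, stored as plain strings.
toy["state"] = pd.qcut(toy["gini"], q=2, labels=["low", "high"]).astype(str)

# Next period's state for the same country; the last year has no successor.
toy["next_state"] = toy.groupby("country")["state"].shift(-1)
pairs = toy.dropna(subset=["next_state"])

# Row-normalized: each row gives the probability of moving to each state.
transition = pd.crosstab(pairs["state"], pairs["next_state"], normalize="index")
print(transition.round(2))
```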

explore_within_persistence measures the same idea continuously: it plots this year’s value against last year’s within country, after removing each country’s mean, and reports the AR(1) serial correlation rho. The closer rho is to 1, the more this year’s inequality is a near-copy of last year’s.

persistence = ex.explore_within_persistence(df, var="gini_regional")
persistence.fig

print(persistence.interpret())

The within-unit (after removing each unit's mean) serial correlation is rho = 0.355 (n = 800 consecutive pairs): a **weak** relationship — this period's value relates to the previous one positively — high values tend to follow high values (persistence).

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
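
The same statistic can be approximated by hand: demean within unit, lag by one period within unit, and correlate. The sketch below does this on simulated data with a known persistence parameter; the simulation settings and column names are hypothetical, and the function may differ in details.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_periods, true_rho = 200, 11, 0.5

# Simulate each unit as a fixed level plus an AR(1) deviation around it.
rows = []
for unit in range(n_units):
    level = rng.normal(0.0, 1.0)
    dev = rng.normal(0.0, 1.0)
    for t in range(n_periods):
        dev = true_rho * dev + rng.normal(0.0, 1.0)
        rows.append((unit, t, level + dev))
toy = pd.DataFrame(rows, columns=["unit", "t", "y"])

# Remove each unit's mean, then pair every value with the previous period's.
toy["y_dm"] = toy["y"] - toy.groupby("unit")["y"].transform("mean")
toy["y_dm_lag"] = toy.groupby("unit")["y_dm"].shift(1)
pairs = toy.dropna(subset=["y_dm_lag"])

rho = pairs["y_dm"].corr(pairs["y_dm_lag"])
print(f"rho = {rho:.3f} from {len(pairs)} consecutive pairs")
```

Each unit contributes one fewer pair than it has periods, which is why 80 countries over 11 years give the 800 pairs reported above. Because each unit is demeaned over only 11 periods, the estimate tends to sit below the value used in the simulation, a short-panel effect worth keeping in mind when reading a modest rho.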

Where to go next

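You now have the full exploratory picture: a balanced panel with patchy determinants; an N-shaped curve in a panel where most of the variation lies between countries and where the between and within slopes disagree; and an inequality measure whose year-to-year persistence within country is only weak (rho = 0.355) once each country’s mean is removed. The natural next step is to estimate it.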

Analyze panel data — fit the cubic Kuznets curve with two-way fixed effects and clustered standard errors, then compare pooled / between / within estimators.
Learn panel data — the ideas behind within vs between, fixed effects and demeaning, with runnable sandboxes.