Explore panel data

The Explore module is your first look at a dataset — before you fit a single model. This page is a case study: you have just been handed a country–year panel and asked whether inequality follows the famous Kuznets curve in income — rising, then falling, as economies develop — and what determines it. We will answer that the way an analyst actually works: explore first, model later.

The lead dataset is the bundled kuznets panel (80 countries observed every year over 2015–2025). Its regional inequality measure (gini_regional) is engineered to trace an N-shaped Kuznets pattern in (log) GDP per capita, surrounded by a realistic set of determinants — trade openness, resource rents, democracy, schooling, FDI, population.

Every Explore function takes a pandas DataFrame and returns a small result object carrying a tidy .df plus an interactive Plotly figure (.fig) or a publication-quality Great Tables object (.gt). Many also offer a plain-language .interpret(). Read this page top to bottom: the functions are ordered as a workflow: know the panel → describe it → split its variation → trace its trends → compare groups → study relationships → measure its dynamics.

Note

This is exploratory analysis: every reading below describes an association, never a cause. The Analyze module turns these patterns into estimates, and Learn explains the ideas behind them.

Stage 0 — Set up the panel

A panel has two coordinates: an entity (here the country) and a time index (the year). Declaring them once means every call below can omit them, and panel-aware tables know to summarize by period. set_labels does double duty: it attaches the data dictionary’s human-readable labels (so figures say “Regional inequality (Gini)” instead of gini_regional), and with set_panel=True it also reads the entity / time tags from that dictionary and declares the panel.

import expdpy as ex
from expdpy.data import load_kuznets, load_kuznets_data_def

df = ex.set_labels(load_kuznets(), load_kuznets_data_def(), set_panel=True)
df[["country", "continent", "year", "gini_regional", "gdp_pc", "log_gdp_pc"]].head()
country continent year gini_regional gdp_pc log_gdp_pc
0 country 1 Continent A 2015 0.085611 629.149402 6.444369
1 country 1 Continent A 2016 0.080064 672.932711 6.511645
2 country 1 Continent A 2017 0.212049 810.191773 6.697271
3 country 1 Continent A 2018 0.221831 917.986443 6.822183
4 country 1 Continent A 2019 0.261234 1079.513400 6.984266

You can also declare the panel directly with set_panel, and read back what is stored with resolve_panel — handy when a step (a merge, a column subset) drops the metadata and you need to re-declare it:

ex.set_panel(df, entity="country", time="year")
ex.resolve_panel(df)  # -> ('country', 'year')
('country', 'year')

Stage 1 — Know the panel’s skeleton

Before looking at a single value, learn the shape of the data. Is the panel balanced? Are there gaps? Where is information missing? A surprising amount of analysis goes wrong because this step was skipped.

explore_panel_structure summarizes balance and coverage and draws a presence grid (one row per country, one column per year): solid means observed, blank means missing.

structure = ex.explore_panel_structure(df)
structure.gt
Panel Structure
value
units 80
periods 11
observations 880
balanced Yes
internal gaps 0
min obs per unit 11
max obs per unit 11
A cell is 'present' when the unit is observed in that period. An interior gap is a missing period between a unit's first and last.
structure.fig
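
If you want to see what such a skeleton check involves, here is a minimal pandas sketch on a made-up toy panel. It is not how explore_panel_structure is implemented; the helper name panel_skeleton and the toy data are hypothetical, and the sketch assumes integer, consecutive time periods with no duplicate (entity, time) rows.

```python
import pandas as pd

# Hypothetical toy panel: 3 countries, years 2015-2018; country "C" lacks 2016.
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4 + ["C"] * 3,
    "year": [2015, 2016, 2017, 2018] * 2 + [2015, 2017, 2018],
})

def panel_skeleton(data, entity, time):
    """Summarize balance and interior gaps for a long-format panel."""
    n_periods = data[time].nunique()
    per_unit = data.groupby(entity)[time].agg(["count", "min", "max"])
    # Number of periods between a unit's first and last observation, inclusive.
    span = per_unit["max"] - per_unit["min"] + 1
    return {
        "units": int(data[entity].nunique()),
        "periods": int(n_periods),
        "observations": int(len(data)),
        "balanced": bool((per_unit["count"] == n_periods).all()),
        # A gap is a period missing between a unit's first and last observation.
        "interior gaps": int((span - per_unit["count"]).sum()),
        "min obs per unit": int(per_unit["count"].min()),
        "max obs per unit": int(per_unit["count"].max()),
    }

summary = panel_skeleton(toy, "country", "year")
print(summary)
```

On this toy panel the summary reports an unbalanced panel with one interior gap (country C in 2016), which is exactly the kind of hole the presence grid makes visible.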

Some variables are missing far more often than others. explore_missing_values_plot maps the fraction missing by variable and year. Notice that several determinants (polity2, school_enrollment, gasoline_price) are patchier, with more gaps in the early years. Knowing this now prevents a model later from silently dropping half your sample.

ex.explore_missing_values_plot(df).fig
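
The number behind each cell of such a map is just a share of missing values. A rough pandas equivalent, shown on a made-up toy frame (the column names and values are illustrative, not the kuznets data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: polity2 is fully missing in 2015 and half missing in 2016.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "gini": [0.25, 0.30, 0.26, 0.31],
    "polity2": [np.nan, np.nan, 4.0, np.nan],
})

# Share of missing cells per year (rows) and variable (columns).
missing_share = toy.drop(columns="year").isna().groupby(toy["year"]).mean()
print(missing_share)
```

Each entry is the fraction of that year's rows where the variable is missing, so a value of 0.5 means a model using that variable would lose half of that year's observations.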

explore_value_heatmap shows the raw outcome across the whole country-by-year grid. Standardizing by time (a z-score within each year) strips out the common trend so that what remains is each country’s relative position — who is persistently more unequal than its peers.

ex.explore_value_heatmap(df, var="gini_regional", standardize="by_time").fig
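
To see what "standardizing by time" does, here is a small pandas sketch on made-up numbers. It assumes the z-score uses the sample standard deviation within each year; whether expdpy uses exactly that convention is an assumption.

```python
import pandas as pd

# Hypothetical toy panel: every country's Gini rises by 0.10 between the two years.
toy = pd.DataFrame({
    "country": ["A", "B", "C"] * 2,
    "year": [2015] * 3 + [2016] * 3,
    "gini": [0.20, 0.30, 0.40, 0.30, 0.40, 0.50],
})

# z-score within each year: subtract the year's mean, divide by the year's SD.
by_year = toy.groupby("year")["gini"]
toy["gini_z"] = (toy["gini"] - by_year.transform("mean")) / by_year.transform("std")

# Country-by-year grid of relative positions, the layout a heatmap would color.
grid = toy.pivot(index="country", columns="year", values="gini_z")
print(grid.round(2))
```

The common upward shift disappears: each country has the same z-score in both years, because its position relative to its peers did not change.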

Stage 2 — Clean and describe the variables

Now meet the variables one at a time: their level, spread, and shape.

Several determinants are right-skewed (GDP per capita, resource rents, aid): a handful of huge values can dominate a correlation or a scatter. treat_outliers winsorizes — it caps values at the 1st/99th percentile rather than dropping rows. We build a cleaned analysis frame here and reuse it for the relationship views in Stage 6.

cols = [
    "gini_regional", "gdp_pc", "log_gdp_pc", "trade_share",
    "resource_rents", "school_enrollment", "polity2",
]
analysis = ex.set_labels(
    ex.treat_outliers(df[cols], percentile=0.01), load_kuznets_data_def()
)
analysis.describe().round(2)
gini_regional gdp_pc log_gdp_pc trade_share resource_rents school_enrollment polity2
count 880.00 880.00 880.00 880.00 880.00 735.00 788.00
mean 0.27 25524.00 9.21 0.61 14.95 55.46 0.55
std 0.09 34212.74 1.49 0.23 8.56 26.01 4.77
min 0.09 666.72 6.50 0.26 1.90 6.00 -10.00
25% 0.21 2512.16 7.83 0.44 9.02 34.22 -3.00
50% 0.27 11199.80 9.32 0.55 13.08 56.90 1.00
75% 0.33 33444.17 10.42 0.73 18.81 75.03 4.00
max 0.51 150000.00 11.92 1.32 40.27 109.28 10.00
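
Winsorizing itself is a short operation. The sketch below shows the idea with plain pandas on simulated data; the helper name winsorize is hypothetical and this is not the treat_outliers source, only an illustration of capping at the 1st and 99th percentiles.

```python
import numpy as np
import pandas as pd

def winsorize(series, percentile=0.01):
    """Cap values at the lower/upper quantiles instead of dropping rows."""
    lower = series.quantile(percentile)
    upper = series.quantile(1 - percentile)
    return series.clip(lower=lower, upper=upper)

# Simulated right-skewed variable (log-normal), standing in for GDP per capita.
rng = np.random.default_rng(0)
gdp = pd.Series(np.exp(rng.normal(9, 1.5, size=1000)))

capped = winsorize(gdp, percentile=0.01)
print(len(gdp), len(capped))      # same number of rows
print(gdp.max(), capped.max())    # the extreme tail is pulled in
```

Every row survives, the median is untouched, and only the tails are pulled in to the cut-off values, which is why winsorizing is a gentler choice than deleting outliers.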

explore_descriptive_table reports the standard statistics for every numeric variable. Because the panel is declared, it lays them out by period — here the first and last year — so you can read level and change at a glance. The note beneath records the panel’s dimensions and which variables carry missing data.

ex.explore_descriptive_table(df, periods=[2015, 2025]).gt
Descriptive Statistics
Mean Std. dev. Median Min. Max.
2015 2025 2015 2025 2015 2025 2015 2025 2015 2025
Regional inequality (Gini) 0.257 0.279 0.087 0.106 0.256 0.263 0.086 0.020 0.455 0.532
GDP per capita (USD) 22,269.506 29,530.530 28,843.529 40,520.322 8,738.576 14,547.014 629.149 531.027 115,241.626 150,000.000
Population 18,438,095.075 21,017,625.750 45,331,372.978 51,130,240.237 4,639,542.000 5,395,351.000 209,936.000 263,897.000 331,159,497.000 368,704,165.000
Resource rents (% of GDP) 15.201 14.605 8.485 9.232 14.134 12.235 1.953 0.350 40.644 44.025
Arable land (share) 0.246 0.246 0.122 0.122 0.218 0.224 0.033 0.029 0.550 0.558
Trade openness (trade/GDP) 0.613 0.612 0.243 0.233 0.554 0.563 0.239 0.288 1.404 1.293
FDI inflows (% of GDP) 0.023 0.025 0.054 0.059 0.027 0.016 −0.082 −0.149 0.119 0.137
Area (km²) 450,980.746 450,980.746 1,023,860.606 1,023,860.606 125,400.969 125,400.969 1,224.257 1,224.257 8,189,082.989 8,189,082.989
Gasoline price (USD/litre) 0.530 0.574 0.213 0.216 0.492 0.616 0.200 0.200 0.950 1.024
Official development aid (USD) 347,510,465.864 345,373,820.793 508,075,434.134 492,345,744.460 124,807,975.154 141,595,753.694 −7,457,636.633 −7,407,267.673 2,200,000,000.000 2,200,000,000.000
Secondary enrollment (% gross) 53.845 57.257 26.238 27.313 53.060 60.128 6.261 9.809 104.200 125.014
Light-based inequality (Gini) 0.255 0.308 0.140 0.132 0.264 0.319 0.010 0.010 0.532 0.587
Democracy score (Polity2) 0.210 1.000 4.691 4.848 1.000 1.500 −10.000 −8.000 8.000 10.000
Federal state 0.200 0.200 0.403 0.403 0.000 0.000 0.000 0.000 1.000 1.000
Log GDP per capita 9.078 9.322 1.494 1.535 9.074 9.584 6.444 6.275 11.655 11.918
Log GDP per capita² 84.613 89.229 27.205 28.523 82.338 91.859 41.530 39.373 135.834 142.048
Log GDP per capita³ 808.131 874.893 379.593 407.283 747.186 880.434 267.634 247.060 1,583.117 1,692.984
Observations: 880.
Variables with missing data: FDI inflows (% of GDP) (46), Gasoline price (USD/litre) (232), Official development aid (USD) (65), Secondary enrollment (% gross) (145), Light-based inequality (Gini) (133), Democracy score (Polity2) (92).
Panel: 80 units (Country) over 11 periods (Year); 11.0 observations per unit on average.
Columns show 2 of 11 periods (2015 to 2025).

explore_histogram shows the distribution of one variable. The outcome gini_regional is fairly symmetric; switch the variable to gdp_pc and you will see the long right tail that motivates the log transform used throughout the Kuznets literature (and provided here as log_gdp_pc).

ex.explore_histogram(df, "gini_regional", kde=True).fig

explore_bar_plot counts the levels of a categorical variable — how many observations fall in each continent.

ex.explore_bar_plot(df, "continent").fig

explore_ext_obs_table lists the most extreme observations — the country-years with the highest and lowest inequality. Extremes are where data errors hide, and where the substantive story is often most vivid.

ex.explore_ext_obs_table(df, n=5, var="gini_regional", entity=["country"], time="year").gt
Country Year Regional inequality (Gini)
Highest 5
country 80 2,023.000 0.568
country 80 2,024.000 0.539
country 69 2,025.000 0.532
country 80 2,022.000 0.528
country 79 2,021.000 0.524
Lowest 5
country 4 2,020.000 0.035
country 4 2,021.000 0.033
country 4 2,023.000 0.020
country 4 2,024.000 0.020
country 4 2,025.000 0.020

Stage 3 — Two kinds of variation: within vs between

This is the idea that makes panel data special. Every variable varies in two directions: across countries (between) and over time within a country (within). Mixing them up is the classic panel mistake, so it pays to separate them up front.

explore_xtsum_table decomposes each variable’s variation into overall, between, and within components (the Stata xtsum you may know). If almost all of a variable’s variation is between countries, then over an 11-year window it is nearly a fixed trait; if much is within, it genuinely moves over time.

xt = ex.explore_xtsum_table(df, var=["gini_regional", "log_gdp_pc", "trade_share"])
xt.gt
Within/Between Variation (xtsum)
Mean Std. dev. Min. Max. N n units T-bar
Regional inequality (Gini)
Overall 0.273 0.091 0.020 0.568 880 80 11.000
Between 0.078 0.068 0.468
Within 0.048 0.076 0.451
Log GDP per capita
Overall 9.205 1.490 6.275 11.918 880 80 11.000
Between 1.482 6.483 11.839
Within 0.222 8.436 10.051
Trade openness (trade/GDP)
Overall 0.611 0.236 0.198 1.421 880 80 11.000
Between 0.233 0.290 1.335
Within 0.048 0.451 0.739
Overall = pooled. Between = across unit means. Within = over time inside a unit (deviations from the unit mean, recentered on the grand mean).
print(xt.interpret())
Splitting each variable's variation into **between** (differences across units) and **within** (variation over time inside a unit):
- **gini_regional**: between SD 0.0783, within SD 0.0478 — variation is mostly **across units** (between).
- **log_gdp_pc**: between SD 1.48, within SD 0.222 — variation is mostly **across units** (between).
- **trade_share**: between SD 0.233, within SD 0.048 — variation is mostly **across units** (between).

_Between variation drives cross-sectional comparisons; within variation is what fixed-effects (within) estimators rely on._
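
The decomposition follows directly from the table's footnote, and you can reproduce it by hand. A minimal pandas sketch on a made-up two-country panel, assuming sample standard deviations throughout:

```python
import pandas as pd

# Hypothetical toy panel: two countries far apart in level, each drifting slowly.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2015, 2016, 2017] * 2,
    "x": [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
})

grand_mean = toy["x"].mean()
unit_mean = toy.groupby("country")["x"].transform("mean")

# Overall: pooled SD across all country-years.
overall_sd = toy["x"].std()
# Between: SD of the unit means (one number per country).
between_sd = toy.groupby("country")["x"].mean().std()
# Within: deviations from the unit mean, recentered on the grand mean.
within = toy["x"] - unit_mean + grand_mean
within_sd = within.std()

print(round(overall_sd, 3), round(between_sd, 3), round(within_sd, 3))
```

Here the between SD (about 7.07) dwarfs the within SD (about 0.89): almost everything that distinguishes one observation from another is which country it belongs to, the same pattern the table shows for log GDP per capita.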

explore_spaghetti_plot makes the within variation visible: one faint line per country plus a bold mean overlay. Here are the development trajectories — most countries drift upward together, which is exactly the kind of common time trend a panel model will later absorb.

ex.explore_spaghetti_plot(df, var="log_gdp_pc").fig

Stage 5 — Compare groups

The panel groups countries by continent (and flags federal states). Comparing across these groups is often the first hint of what drives the outcome.

explore_bar_plot_by_group reduces each group to a single statistic — here the mean level of inequality per continent.

ex.explore_bar_plot_by_group(df, "continent", "gini_regional").fig

A bar hides the spread inside each group. explore_violin_plot_by_group draws the full distribution per continent, so you can see overlap and outliers a mean would mask.

ex.explore_violin_plot_by_group(df, "continent", "gini_regional").fig

explore_trend_plot_by_group puts time back in: one trend line per continent, revealing whether the groups move together or diverge.

ex.explore_trend_plot_by_group(df, group_var="continent", var="gini_regional").fig
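
The numbers behind these three group views are ordinary groupby summaries. A small pandas sketch on made-up data (the continent labels and values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data: two continents, two countries each, two years.
toy = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y"] * 2,
    "year": [2015] * 4 + [2016] * 4,
    "gini": [0.20, 0.30, 0.40, 0.50, 0.25, 0.35, 0.35, 0.45],
})

# One statistic per group: what a grouped bar plot shows.
level = toy.groupby("continent")["gini"].mean()

# Spread inside each group: what a single bar hides.
spread = toy.groupby("continent")["gini"].agg(["mean", "std", "min", "max"])

# One mean per group and year: the lines of a grouped trend plot.
trend = toy.groupby(["year", "continent"])["gini"].mean().unstack("continent")
gap = trend["Y"] - trend["X"]

print(level, spread, trend, sep="\n\n")
```

In this toy example the two continents differ in level, but the gap between them narrows from 2015 to 2016, something only the by-year view reveals.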

Stage 6 — Relationships and the Kuznets curve

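Now the central question. The correlation views below use the winsorized analysis frame so a few extreme values do not distort the picture; the scatter plots use the full df, which still carries the continent, population, and country columns they need.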

explore_correlation_table reports pairwise associations — Pearson above the diagonal, Spearman (rank) below, significant cells in bold. This is the quickest scan of which determinants co-move with inequality.

ex.explore_correlation_table(analysis).gt
A B C D E F G
A: Regional inequality (Gini) 0.20 -0.10 -0.16 0.31 -0.11 -0.07
B: GDP per capita (USD) -0.19 0.83 -0.08 -0.38 0.79 0.65
C: Log GDP per capita -0.19 1.00 0.03 -0.46 0.95 0.78
D: Trade openness (trade/GDP) -0.15 0.06 0.06 -0.02 0.01 0.05
E: Resource rents (% of GDP) 0.35 -0.54 -0.54 -0.02 -0.44 -0.37
F: Secondary enrollment (% gross) -0.21 0.95 0.95 0.05 -0.52 0.75
G: Democracy score (Polity2) -0.15 0.78 0.78 0.06 -0.44 0.76
This table reports Pearson correlations above and Spearman correlations below the diagonal. The number of observations ranges from 663 to 880. Correlations with significance levels below 5% appear in bold.
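
One detail in the table is worth unpacking: GDP per capita and its log have a Spearman correlation of 1.00 but a Pearson correlation of only 0.83. Spearman is a Pearson correlation computed on ranks, and a log transform never changes ranks. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical grid of log GDP values and the matching levels.
log_gdp = np.linspace(6.5, 12.0, 50)
toy = pd.DataFrame({"log_gdp": log_gdp, "gdp": np.exp(log_gdp)})

# Pearson measures linear association between the raw values.
pearson = toy.corr().loc["log_gdp", "gdp"]

# Spearman is Pearson applied to ranks, so any monotone transform gives 1.
spearman = toy.rank().corr().loc["log_gdp", "gdp"]

print(round(pearson, 2), round(spearman, 2))
```

Reading the two triangles side by side is therefore a quick check for nonlinearity: where Spearman is much stronger than Pearson, the relationship is monotone but curved.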

explore_correlation_plot shows the same matrix as a heatmap (an ellipse style is also available) — easier to spot blocks of related variables at a glance.

ex.explore_correlation_plot(analysis).fig

And here it is — the N-shaped Kuznets curve. explore_scatter_plot puts inequality against (log) GDP per capita, colored by continent and sized by population, with a LOESS smoother that traces the rise–fall–rise without assuming any functional form.

ex.explore_scatter_plot(
    df, x="log_gdp_pc", y="gini_regional", color="continent", size="population", loess=1
).fig

But is that curve a story about rich versus poor countries (between) or about a country getting richer over time (within)? They can even have opposite signs — Simpson’s paradox for panels. explore_scatter_plot_within_between splits the relationship into pooled, between, and within clouds, each with its own fitted slope, so you can see which one the raw scatter was really showing.

wb = ex.explore_scatter_plot_within_between(df, x="log_gdp_pc", y="gini_regional")
wb.fig
print(wb.interpret())
The pooled association between **Log GDP per capita** and **Regional inequality (Gini)** has slope -0.00537. It blends a **between-unit** slope of -0.00616 (comparing unit averages) with a **within-unit** slope of 0.0292 (variation over time inside a unit).
The two diverge markedly — the cross-unit and over-time relationships tell different stories, a sign that unit-level confounders matter and that a fixed-effects specification would change the estimate.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
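
The three slopes are easy to compute by hand, which helps when you want to see why they can disagree. This is a sketch with numpy and pandas on a made-up panel built to show a sign flip; it illustrates the pooled, between, and within slopes, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: the richer country (B) has lower y on average,
# yet inside each country y rises with x.
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [10.0, 11.0, 12.0, 4.0, 5.0, 6.0],
})

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    return float(np.polyfit(x, y, deg=1)[0])

unit_means = toy.groupby("country")[["x", "y"]].mean()
demeaned = toy[["x", "y"]] - toy.groupby("country")[["x", "y"]].transform("mean")

pooled = slope(toy["x"], toy["y"])                  # all country-years together
between = slope(unit_means["x"], unit_means["y"])   # one point per country
within = slope(demeaned["x"], demeaned["y"])        # deviations from country means

print(round(pooled, 3), round(between, 3), round(within, 3))
```

The between slope is -2 and the within slope is +1, while the pooled slope (about -1.31) sits in between and is dominated by the cross-country contrast. The raw scatter alone would never tell you that.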

Stage 7 — Dynamics and persistence

A panel lets you ask a question a cross-section cannot: is a country’s inequality sticky?

explore_transition_matrix bins inequality into states (here quartiles) and tabulates how often a country moves between them from one year to the next. A heavy diagonal means high persistence — where you start is where you stay.

ex.explore_transition_matrix(df, var="gini_regional", n_bins=4).fig
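
A transition matrix can be built with three pandas steps: bin, lag, cross-tabulate. The sketch below uses a made-up panel and a median split instead of quartiles to keep it readable. It assumes bins are formed on the pooled distribution and that consecutive rows within a country are consecutive years; how expdpy handles either point is not something this sketch claims.

```python
import pandas as pd

# Hypothetical toy panel: four countries, three years each.
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "year": [2015, 2016, 2017] * 4,
    "gini": [0.10, 0.11, 0.12, 0.20, 0.21, 0.35,
             0.30, 0.31, 0.22, 0.40, 0.41, 0.42],
}).sort_values(["country", "year"])

# 1. Bin the pooled values into states (two here; q=4 would give quartiles).
toy["state"] = pd.qcut(toy["gini"], q=2, labels=["low", "high"]).astype(str)

# 2. Attach next year's state within each country.
toy["next_state"] = toy.groupby("country")["state"].shift(-1)
pairs = toy.dropna(subset=["next_state"])

# 3. Row-normalized counts: share of moves from each state to each state.
transition = pd.crosstab(pairs["state"], pairs["next_state"], normalize="index")
print(transition)
```

Each row sums to 1, and here both diagonal entries are 0.75: three out of four year-to-year moves stay in the same state.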

explore_within_persistence measures the same idea continuously: this year’s value against last year’s, within country, with the AR(1) serial-correlation rho. The closer rho is to 1, the more inequality this year is just a near-copy of last year’s.

persistence = ex.explore_within_persistence(df, var="gini_regional")
persistence.fig
print(persistence.interpret())
The within-unit (after removing each unit's mean) serial correlation is rho = 0.355 (n = 800 consecutive pairs): a **weak** relationship — this period's value relates to the previous one positively — high values tend to follow high values (persistence).

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
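
Why remove each unit's mean before correlating a value with its lag? Because stable differences between countries look like persistence even when nothing persists over time. The simulated sketch below makes the point; it assumes the within rho is a Pearson correlation between the demeaned value and its one-period lag, which may differ in detail from what explore_within_persistence computes.

```python
import numpy as np
import pandas as pd

# Simulated panel: each country has a fixed level plus independent yearly noise,
# so there is no true year-to-year persistence within a country.
rng = np.random.default_rng(42)
n_units, years = 50, np.arange(2015, 2026)
level = rng.normal(0.30, 0.08, size=n_units)
noise = rng.normal(0.0, 0.03, size=(n_units, len(years)))
toy = pd.DataFrame({
    "country": np.repeat(np.arange(n_units), len(years)),
    "year": np.tile(years, n_units),
    "gini": (level[:, None] + noise).ravel(),
}).sort_values(["country", "year"])

def lag_corr(data, col):
    """Correlation between a column and its one-period lag within country."""
    lagged = data.groupby("country")[col].shift(1)
    return data[col].corr(lagged)   # pairs with a missing lag are dropped

toy["dev"] = toy["gini"] - toy.groupby("country")["gini"].transform("mean")
n_pairs = int(toy.groupby("country")["dev"].shift(1).notna().sum())

rho_raw = lag_corr(toy, "gini")     # mixes level differences with dynamics
rho_within = lag_corr(toy, "dev")   # dynamics only
print(n_pairs, round(rho_raw, 3), round(rho_within, 3))
```

The raw lag correlation should come out high purely because countries differ in level, while the within version stays near zero. With only 11 periods, demeaning also pulls the within estimate slightly downward, so treat a within rho as a descriptive number rather than a precise estimate of persistence.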

Where to go next

You now have the full exploratory picture: a balanced panel with patchy determinants, an N-shaped curve that is mostly a between-country pattern, and an inequality measure whose year-to-year persistence within countries is only weak (rho = 0.355). The natural next step is to estimate it.

  • Analyze panel data — fit the cubic Kuznets curve with two-way fixed effects and clustered standard errors, then compare pooled / between / within estimators.
  • Learn panel data — the ideas behind within vs between, fixed effects and demeaning, with runnable sandboxes.