The Explore module is your first look at a dataset — before you fit a single model. This page is a case study: you have just been handed a country–year panel and asked whether inequality follows the famous Kuznets curve in income — rising, then falling, as economies develop — and what determines it. We will answer that the way an analyst actually works: explore first, model later.
The lead dataset is the bundled kuznets panel (80 countries observed every year over 2015–2025). Its regional inequality measure (gini_regional) is engineered to trace an N-shaped variant of the Kuznets pattern (rise, fall, rise) in log GDP per capita, alongside a realistic set of determinants: trade openness, resource rents, democracy, schooling, FDI, and population.
Every Explore function takes a pandas DataFrame and returns a small result object carrying a tidy .df plus an interactive Plotly figure (.fig) or a publication-quality Great Tables object (.gt). Many also offer a plain-language .interpret(). Read this page top to bottom: the functions are ordered as a workflow — know the panel → describe it → split its variation → trace its trends → compare groups → study relationships → measure its dynamics.
Note
This is exploratory analysis: every reading below describes an association, never a cause. The Analyze module turns these patterns into estimates, and Learn explains the ideas behind them.
Stage 0 — Set up the panel
A panel has two coordinates: an entity (here the country) and a time index (the year). Declaring them once means every call below can omit them, and panel-aware tables know to summarize by period. set_labels does double duty: it attaches the data dictionary’s human-readable labels (so figures say “Regional inequality (Gini)” instead of gini_regional), and with set_panel=True it also reads the entity / time tags from that dictionary and declares the panel.
You can also declare the panel directly with set_panel, and read back what is stored with resolve_panel — handy when a step (a merge, a column subset) drops the metadata and you need to re-declare it:
Before looking at a single value, learn the shape of the data. Is the panel balanced? Are there gaps? Where is information missing? A surprising amount of analysis goes wrong because this step was skipped.
explore_panel_structure summarizes balance and coverage and draws a presence grid (one row per country, one column per year): solid means observed, blank means missing.
A cell is 'present' when the unit is observed in that period. An interior gap is a missing period between a unit's first and last.
structure.fig
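To make "balanced" and "interior gap" concrete, here is a minimal pandas sketch on a hypothetical three-country toy panel. It is not how explore_panel_structure is implemented; the frame and the helper name interior_gaps are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format panel: country B is not observed in 2016.
obs = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "year":    [2015, 2016, 2017, 2015, 2017, 2015, 2016, 2017],
})

# Presence grid: one row per country, one column per year, True where observed.
presence = pd.crosstab(obs["country"], obs["year"]) > 0

# A panel is balanced when every unit is observed in every period.
is_balanced = bool(presence.all().all())


def interior_gaps(row: pd.Series) -> int:
    """Count missing periods between a unit's first and last observed period."""
    observed = row[row].index                      # periods where the unit is present
    span = row.loc[observed.min():observed.max()]  # first..last, inclusive
    return int((~span).sum())


gaps = presence.apply(interior_gaps, axis=1)

print(presence)
print("balanced:", is_balanced)
print(gaps)
```

In this toy grid only country B has an interior gap, which is enough to make the panel unbalanced.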
Some variables are missing far more often than others. explore_missing_values_plot maps the fraction missing by variable and year. Notice that the determinants (resource_rents, polity2, school_enrollment, gasoline_price) are patchier, with more missing values in the early years. Knowing this now keeps a later model from silently dropping half your sample.
ex.explore_missing_values_plot(df).fig
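The quantity behind that map can be sketched in plain pandas: the share of missing values per variable within each year, plus the number of rows that would survive listwise deletion. The toy frame below is hypothetical and only borrows two column names from the panel.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: polity2 is patchier in the first year.
toy = pd.DataFrame({
    "year":        [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "polity2":     [np.nan, np.nan, 3.0, 7.0, np.nan, 5.0, 3.0, 7.0],
    "trade_share": [0.4, 0.5, 0.6, 0.7, 0.4, 0.5, 0.6, 0.7],
})

# Fraction missing by variable (columns) and year (rows).
missing_share = toy.drop(columns="year").isna().groupby(toy["year"]).mean()

# Rows a model would keep if it silently dropped incomplete cases.
complete_rows = len(toy.dropna())

print(missing_share)
print(f"{complete_rows} of {len(toy)} rows are complete")
```

Comparing complete_rows with the full row count is a quick way to see how much of the sample a regression on all variables would actually use.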
explore_value_heatmap shows the raw outcome across the whole country-by-year grid. Standardizing by time (a z-score within each year) strips out the common trend so that what remains is each country’s relative position — who is persistently more unequal than its peers.
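Standardizing by time amounts to a z-score computed separately within each year. A minimal sketch, assuming a hypothetical toy frame in which every country's inequality rises by the same 0.05 between two years:

```python
import pandas as pd

# Hypothetical toy panel with a common upward shift of 0.05 in 2016.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "gini_regional": [0.30, 0.40, 0.50, 0.35, 0.45, 0.55],
})

# z-score within each year: subtract the year's mean, divide by the year's SD.
by_year = toy.groupby("year")["gini_regional"]
toy["gini_z"] = (
    toy["gini_regional"] - by_year.transform("mean")
) / by_year.transform("std")

# Country-by-year grid of relative positions, as a heatmap would show it.
grid = toy.pivot(index="country", columns="year", values="gini_z")
print(grid)
```

Because the shift is common to all countries, each country's z-score is the same in both years: the trend is gone and only relative position remains.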
Now meet the variables one at a time: their level, spread, and shape.
Several determinants are right-skewed (GDP per capita, resource rents, aid): a handful of huge values can dominate a correlation or a scatter. treat_outliers winsorizes: it caps values at the 1st/99th percentile rather than dropping rows. We build a cleaned analysis frame here and reuse it for the relationship views in Stage 6.
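Winsorizing at the 1st/99th percentile can be sketched with pandas alone. This illustrates the general technique on simulated right-skewed data; it does not reproduce the signature or defaults of treat_outliers.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed variable (hypothetical stand-in for GDP per capita).
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000), name="gdp_pc")

# Cap values at the 1st and 99th percentiles instead of dropping rows.
lo, hi = x.quantile([0.01, 0.99])
x_w = x.clip(lower=lo, upper=hi)

print("rows before/after:", len(x), len(x_w))
print("max before/after:", x.max(), x_w.max())
```

The row count is unchanged; only the values in the two tails are pulled in to the cut-offs.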
explore_descriptive_table reports the standard statistics for every numeric variable. Because the panel is declared, it lays them out by period — here the first and last year — so you can read level and change at a glance. The note beneath records the panel’s dimensions and which variables carry missing data.
Variables with missing data: FDI inflows (% of GDP) (46), Gasoline price (USD/litre) (232), Official development aid (USD) (65), Secondary enrollment (% gross) (145), Light-based inequality (Gini) (133), Democracy score (Polity2) (92).
Panel: 80 units (Country) over 11 periods (Year); 11.0 observations per unit on average.
Columns show 2 of 11 periods (2015 to 2025).
explore_histogram shows the distribution of one variable. The outcome gini_regional is fairly symmetric; switch the variable to gdp_pc and you will see the long right tail that motivates the log transform used throughout the Kuznets literature (and provided here as log_gdp_pc).
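The effect of the log transform on a long right tail can be checked numerically with a skewness statistic. The sketch below uses simulated lognormal values as a hypothetical stand-in for gdp_pc, not the bundled data.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values (assumption: roughly lognormal, like many income series).
rng = np.random.default_rng(0)
gdp_pc = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=880))

skew_raw = gdp_pc.skew()          # strongly positive: long right tail
skew_log = np.log(gdp_pc).skew()  # close to zero: roughly symmetric

print(f"skewness raw: {skew_raw:.2f}, log: {skew_log:.2f}")
```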
explore_bar_plot counts the levels of a categorical variable — how many observations fall in each continent.
ex.explore_bar_plot(df, "continent").fig
explore_ext_obs_table lists the most extreme observations — the country-years with the highest and lowest inequality. Extremes are where data errors hide, and where the substantive story is often most vivid.
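Listing extremes is a sort-and-slice operation. A minimal pandas sketch on a hypothetical toy frame (the choice of two rows per tail is arbitrary):

```python
import pandas as pd

# Hypothetical country-year observations.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "year": [2015] * 6,
    "gini_regional": [0.31, 0.62, 0.28, 0.45, 0.12, 0.50],
})

k = 2  # rows to show from each tail
extremes = pd.concat([
    toy.nlargest(k, "gini_regional"),   # highest inequality first
    toy.nsmallest(k, "gini_regional"),  # then lowest inequality
])
print(extremes)
```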
Stage 3 — Two kinds of variation: within vs between
This is the idea that makes panel data special. Every variable varies in two directions: across countries (between) and over time within a country (within). Mixing them up is the classic panel mistake, so it pays to separate them up front.
explore_xtsum_table decomposes each variable’s variation into overall, between, and within components (the Stata xtsum you may know). If almost all of a variable’s variation is between countries, then over an 11-year window it is nearly a fixed trait; if much is within, it genuinely moves over time.
Overall = pooled. Between = across unit means. Within = over time inside a unit (deviations from the unit mean, recentered on the grand mean).
print(xt.interpret())
Splitting each variable's variation into **between** (differences across units) and **within** (variation over time inside a unit):
- **gini_regional**: between SD 0.0783, within SD 0.0478 — variation is mostly **across units** (between).
- **log_gdp_pc**: between SD 1.48, within SD 0.222 — variation is mostly **across units** (between).
- **trade_share**: between SD 0.233, within SD 0.048 — variation is mostly **across units** (between).
_Between variation drives cross-sectional comparisons; within variation is what fixed-effects (within) estimators rely on._
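The three components follow directly from the definitions in the table note. Here is a minimal sketch on a hypothetical two-country toy panel, using sample standard deviations; the library's exact degrees-of-freedom conventions are not confirmed here.

```python
import pandas as pd

# Hypothetical toy panel: levels differ a lot across countries, little over time.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2015, 2016, 2017] * 2,
    "x": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
})

grand_mean = toy["x"].mean()
unit_mean = toy.groupby("country")["x"].transform("mean")

overall_sd = toy["x"].std()                             # pooled
between_sd = toy.groupby("country")["x"].mean().std()   # across unit means
within = toy["x"] - unit_mean + grand_mean              # recentered on the grand mean
within_sd = within.std()                                # over time inside a unit

print(f"overall {overall_sd:.3f}, between {between_sd:.3f}, within {within_sd:.3f}")
```

In this toy case the between SD is roughly three times the within SD, the same qualitative pattern the interpretation reports for gini_regional, log_gdp_pc, and trade_share.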
explore_spaghetti_plot makes the within variation visible: one faint line per country plus a bold mean overlay. Here are the development trajectories — most countries drift upward together, which is exactly the kind of common time trend a panel model will later absorb.
A mean can hide what happens in the tails. explore_quantile_trend_plot tracks several quantiles at once, so you can see whether the spread of inequality across countries is widening or narrowing over time.
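Tracking several quantiles per year is a grouped quantile computation. A minimal sketch on hypothetical toy values, where the gap between the 90th and 10th percentiles serves as a simple measure of spread:

```python
import pandas as pd

# Hypothetical cross-sections for two years; the second year is more dispersed.
toy = pd.DataFrame({
    "year": [2015] * 5 + [2016] * 5,
    "gini_regional": [0.30, 0.35, 0.40, 0.45, 0.50,
                      0.25, 0.35, 0.40, 0.45, 0.55],
})

# One row per year, one column per quantile.
q = toy.groupby("year")["gini_regional"].quantile([0.1, 0.5, 0.9]).unstack()

# Widening spread shows up as a growing gap between the outer quantiles.
spread = q[0.9] - q[0.1]
print(q)
print(spread)
```

Here the median is identical in both years while the spread grows, which is exactly the kind of change a mean-only trend would miss.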
explore_distribution_over_time shows the whole distribution shifting year by year as a ridgeline — each ridge is one year’s density. Is the distribution simply sliding, or is it also changing shape (skew, bimodality)?
A bar hides the spread inside each group. explore_violin_plot_by_group draws the full distribution per continent, so you can see overlap and outliers a mean would mask.
Now the central question. We work on the winsorized analysis frame so a few extreme values do not distort the picture.
explore_correlation_table reports pairwise associations — Pearson above the diagonal, Spearman (rank) below, significant cells in bold. This is the quickest scan of which determinants co-move with inequality.
ex.explore_correlation_table(analysis).gt
| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| A: Regional inequality (Gini) | | 0.20 | -0.10 | -0.16 | 0.31 | -0.11 | -0.07 |
| B: GDP per capita (USD) | -0.19 | | 0.83 | -0.08 | -0.38 | 0.79 | 0.65 |
| C: Log GDP per capita | -0.19 | 1.00 | | 0.03 | -0.46 | 0.95 | 0.78 |
| D: Trade openness (trade/GDP) | -0.15 | 0.06 | 0.06 | | -0.02 | 0.01 | 0.05 |
| E: Resource rents (% of GDP) | 0.35 | -0.54 | -0.54 | -0.02 | | -0.44 | -0.37 |
| F: Secondary enrollment (% gross) | -0.21 | 0.95 | 0.95 | 0.05 | -0.52 | | 0.75 |
| G: Democracy score (Polity2) | -0.15 | 0.78 | 0.78 | 0.06 | -0.44 | 0.76 | |
This table reports Pearson correlations above and Spearman correlations below the diagonal. The number of observations ranges from 663 to 880. Correlations with significance levels below 5% appear in bold.
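A matrix with Pearson above and Spearman below the diagonal can be assembled from two ordinary correlation matrices. The sketch below uses simulated data and computes Spearman as the Pearson correlation of ranks; it illustrates the layout only and leaves out the significance testing.

```python
import numpy as np
import pandas as pd

# Simulated data: b is a monotone but nonlinear function of a, plus noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": np.exp(a) + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pearson = toy.corr().to_numpy()           # linear association
spearman = toy.rank().corr().to_numpy()   # rank association (Pearson on ranks)

k = toy.shape[1]
upper = np.triu(np.ones((k, k), dtype=bool), k=1)   # cells above the diagonal
lower = np.tril(np.ones((k, k), dtype=bool), k=-1)  # cells below the diagonal

combined = np.full((k, k), np.nan)  # diagonal stays empty
combined[upper] = pearson[upper]
combined[lower] = spearman[lower]

table = pd.DataFrame(combined, index=toy.columns, columns=toy.columns).round(2)
print(table)
```

This also explains why such a table is not symmetric. A log transform preserves ranks, so the Spearman cell for GDP per capita and log GDP per capita is 1.00 while the Pearson cell for the same pair is 0.83.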
explore_correlation_plot shows the same matrix as a heatmap (an ellipse style is also available) — easier to spot blocks of related variables at a glance.
ex.explore_correlation_plot(analysis).fig
And here it is — the N-shaped Kuznets curve. explore_scatter_plot puts inequality against (log) GDP per capita, colored by continent and sized by population, with a LOESS smoother that traces the rise–fall–rise without assuming any functional form.
But is that curve a story about rich versus poor countries (between) or about a country getting richer over time (within)? They can even have opposite signs — Simpson’s paradox for panels. explore_scatter_plot_within_between splits the relationship into pooled, between, and within clouds, each with its own fitted slope, so you can see which one the raw scatter was really showing.
The pooled association between **Log GDP per capita** and **Regional inequality (Gini)** has slope -0.00537. It blends a **between-unit** slope of -0.00616 (comparing unit averages) with a **within-unit** slope of 0.0292 (variation over time inside a unit).
The two diverge markedly — the cross-unit and over-time relationships tell different stories, a sign that unit-level confounders matter and that a fixed-effects specification would change the estimate.
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
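The pooled, between, and within slopes can be reproduced by hand with simple regressions on the raw data, the unit means, and the unit-demeaned data. A minimal sketch on a hypothetical toy panel built so that the signs disagree; the helper slope is an illustrative name, not a library function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: within each country y rises with x,
# but the country with higher x has lower y on average.
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3,
    "x": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "y": [10.0, 11.0, 12.0, 4.0, 5.0, 6.0],
})


def slope(x, y) -> float:
    """Slope of a simple least-squares line of y on x."""
    return float(np.polyfit(x, y, 1)[0])


# Pooled: ignore the panel structure entirely.
pooled = slope(toy["x"], toy["y"])

# Between: regress unit means on unit means.
means = toy.groupby("country")[["x", "y"]].mean()
between = slope(means["x"], means["y"])

# Within: subtract each unit's mean, then regress the deviations.
demeaned = toy[["x", "y"]] - toy.groupby("country")[["x", "y"]].transform("mean")
within = slope(demeaned["x"], demeaned["y"])

print(f"pooled {pooled:.3f}, between {between:.3f}, within {within:.3f}")
```

The pooled slope is negative because the between-country contrast dominates, even though the slope inside every country is positive. This is the same sign pattern as in the interpretation above.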
Stage 7 — Dynamics and persistence
A panel lets you ask a question a cross-section cannot: is a country’s inequality sticky?
explore_transition_matrix bins inequality into states (here quartiles) and tabulates how often a country moves between them from one year to the next. A heavy diagonal means high persistence — where you start is where you stay.
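A transition matrix of this kind can be built by binning, lagging within unit, and cross-tabulating. The sketch below assumes pooled quartile cut-offs and consecutive years on a hypothetical toy panel; the library's own binning rule may differ.

```python
import pandas as pd

# Hypothetical toy panel: four countries, three consecutive years each.
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "year": [2015, 2016, 2017] * 4,
    "gini_regional": [0.20, 0.21, 0.22, 0.30, 0.31, 0.45,
                      0.40, 0.41, 0.32, 0.50, 0.51, 0.52],
}).sort_values(["country", "year"])

# States: quartiles of the pooled distribution (an assumption of this sketch).
toy["state"] = pd.qcut(
    toy["gini_regional"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
).astype(str)

# Next year's state for the same country; the last year of each country has none.
toy["next_state"] = toy.groupby("country")["state"].shift(-1)
pairs = toy.dropna(subset=["next_state"])

# Row-normalized counts: share of moves from each state to each next state.
transition = pd.crosstab(pairs["state"], pairs["next_state"], normalize="index")
print(transition)
```

Each row sums to 1, so the diagonal entries read directly as the probability of staying in the same quartile from one year to the next.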
explore_within_persistence measures the same idea continuously: this year’s value against last year’s, within country, with the AR(1) serial-correlation rho. The closer rho is to 1, the more inequality this year is just a near-copy of last year’s.
The within-unit (after removing each unit's mean) serial correlation is rho = 0.355 (n = 800 consecutive pairs): a **weak** relationship — this period's value relates to the previous one positively — high values tend to follow high values (persistence).
_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
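The within-unit serial correlation can be sketched as the correlation between a unit-demeaned value and its own one-year lag. The simulated panel below is hypothetical (50 countries, 11 years, an AR(1) deviation around a country level) and is not the bundled data. Each unit contributes one fewer pair than it has years, which is also why 80 countries over 11 years yield n = 800 pairs.

```python
import numpy as np
import pandas as pd

# Simulated panel: country level plus an AR(1) deviation (true coefficient 0.6).
rng = np.random.default_rng(2)
rows = []
for country in range(50):
    level = rng.normal()
    dev = 0.0
    for year in range(2015, 2026):
        dev = 0.6 * dev + rng.normal(scale=0.5)
        rows.append((country, year, level + dev))
toy = pd.DataFrame(rows, columns=["country", "year", "y"])
toy = toy.sort_values(["country", "year"])

# Remove each unit's mean, then lag the demeaned value within unit.
toy["y_within"] = toy["y"] - toy.groupby("country")["y"].transform("mean")
toy["y_lag"] = toy.groupby("country")["y_within"].shift(1)

# Consecutive pairs: every year except each unit's first.
pairs = toy.dropna(subset=["y_lag"])
rho = pairs["y_within"].corr(pairs["y_lag"])

print(f"rho = {rho:.3f} from n = {len(pairs)} consecutive pairs")
```

With a short time dimension, demeaning tends to pull this correlation below the true autoregressive coefficient, so a modest rho in an 11-year panel should be read with that in mind.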
Where to go next
You now have the full exploratory picture: a balanced panel with patchy determinants, an N-shaped curve that is mostly a between-country pattern, and an inequality measure that varies more across countries than within them, with only modest year-to-year persistence inside a country (rho = 0.355). The natural next step is to estimate it.
Analyze panel data — fit the cubic Kuznets curve with two-way fixed effects and clustered standard errors, then compare pooled / between / within estimators.
Learn panel data — the ideas behind within vs between, fixed effects and demeaning, with runnable sandboxes.