treat_outliers

treat_outliers(x, percentile=0.01, *, truncate=False, by=None, method='linear')

Treat numerical outliers by winsorizing or truncating.

For each numeric variable, values outside the [percentile, 1 - percentile] quantile range are either clipped to the boundary (winsorizing, the default) or set to NaN (truncating). Non-finite values (inf/-inf) are set to NaN first.
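The behaviour described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: the function name `treat_sketch` is hypothetical, and the sketch assumes a one-dimensional input and the default linear quantile interpolation.

```python
import numpy as np


def treat_sketch(values, percentile=0.01, truncate=False):
    """Hypothetical illustration of winsorizing vs. truncating (not expdpy code)."""
    arr = np.asarray(values, dtype=float).copy()
    # Non-finite entries (inf, -inf) become NaN before cut-offs are computed.
    arr[~np.isfinite(arr)] = np.nan
    # Lower and upper cut-offs, ignoring NaN.
    lo, hi = np.nanquantile(arr, [percentile, 1 - percentile])
    if truncate:
        # Truncating: out-of-bounds values are set to NaN.
        arr[(arr < lo) | (arr > hi)] = np.nan
        return arr
    # Winsorizing: out-of-bounds values are clipped to the boundary; NaN stays NaN.
    return np.clip(arr, lo, hi)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100, float("inf")]
print(treat_sketch(sample, percentile=0.1))
print(treat_sketch(sample, percentile=0.1, truncate=True))
```

In the winsorized result the extreme values are pulled in to the cut-offs, while in the truncated result they are replaced by NaN, so the number of non-missing observations shrinks.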

Parameters

Name	Description	Default
x	A `pandas.Series`, `pandas.DataFrame`, or `numpy.ndarray`. For a DataFrame, only numeric columns are treated; other columns pass through unchanged.	required
percentile	The (symmetric) tail probability defining the cut-offs. Must satisfy `0 < percentile < 0.5`. Defaults to `0.01`.	`0.01`
truncate	If `True`, out-of-bounds values are set to `NaN`; otherwise they are clipped to the boundary value (winsorizing). Defaults to `False`.	`False`
by	Optional grouping. Either an array/Series of group labels (same length as `x`) or, for a DataFrame `x`, a column name. Outlier cut-offs are then computed within each group. Must not contain missing values.	`None`
method	Quantile interpolation method passed to `numpy.nanquantile`. Defaults to `"linear"` (equivalent to R’s `type = 7`). Use `"averaged_inverted_cdf"` to match Stata and R’s `type = 2`.	`'linear'`
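To see why the choice of `method` matters, the following sketch calls `numpy.nanquantile` directly on a tiny array. It assumes NumPy 1.22 or later, where the `method` keyword is available; the sample values are made up for illustration.

```python
import numpy as np

# Four finite values plus one NaN, which nanquantile ignores.
values = np.array([1.0, 2.0, 3.0, 4.0, np.nan])

# Default: linear interpolation between neighbouring order statistics.
linear = np.nanquantile(values, 0.25, method="linear")

# Alternative: averaged inverted CDF, which averages at discontinuities.
averaged = np.nanquantile(values, 0.25, method="averaged_inverted_cdf")

print(linear, averaged)
```

On small samples the two methods can give noticeably different cut-offs; on large samples the difference is usually minor, so the choice mostly matters when results must be reproduced exactly in another package.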

Returns

Name	Type	Description
	Same type as `x`	The outlier-treated data. A DataFrame keeps its non-numeric columns, column order and index; a Series keeps its index and name; an ndarray keeps its shape.

Examples

Basic — winsorize a single variable at the 1st/99th percentile:

import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
ex.treat_outliers(df["gdp_pc"], percentile=0.01).describe()

count       880.000000
mean      25524.001800
std       34212.744869
min         666.720522
25%        2512.157493
50%       11199.795097
75%       33444.168592
max      150000.000000
Name: gdp_pc, dtype: float64

Advanced — winsorize several columns at the 5th/95th percentile, with the cut-offs computed within each continent:

treated = ex.treat_outliers(
    df[["gini_regional", "gdp_pc"]], percentile=0.05, by=df["continent"]
)
treated.describe()

	gini_regional	gdp_pc
count	880.000000	880.000000
mean	0.273319	24703.010142
std	0.086432	32756.473008
min	0.080619	646.309044
25%	0.206811	2512.157493
50%	0.270045	11199.795097
75%	0.334933	32639.241341
max	0.503312	150000.000000
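The idea of computing cut-offs within each group can be sketched with a pandas `groupby`. This is an illustrative approximation under stated assumptions, not the library's code: `winsorize_by_group` is a hypothetical helper, and the toy data below is invented.

```python
import numpy as np
import pandas as pd


def winsorize_by_group(values: pd.Series, groups: pd.Series, percentile: float = 0.05) -> pd.Series:
    """Hypothetical helper: clip each value to its own group's quantile range."""

    def clip_group(g: pd.Series) -> pd.Series:
        # Cut-offs are computed from this group's values only.
        lo, hi = np.nanquantile(g.to_numpy(dtype=float), [percentile, 1 - percentile])
        return g.clip(lower=lo, upper=hi)

    # transform keeps the original index and row order.
    return values.groupby(groups).transform(clip_group)


values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, 1000.0])
groups = pd.Series(["a"] * 5 + ["b"] * 5)
print(winsorize_by_group(values, groups, percentile=0.25))
```

With pooled cut-offs, every value in group `a` would sit below the upper bound set by group `b`, so the outlier `100.0` would survive; group-wise cut-offs catch it.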