treat_outliers

treat_outliers(x, percentile=0.01, *, truncate=False, by=None, method='linear')

Treat numerical outliers by winsorizing or truncating.

For each numeric variable, values outside the [percentile, 1 - percentile] quantile range are either clipped to the boundary (winsorizing, the default) or set to NaN (truncating). Non-finite values (inf/-inf) are set to NaN first.
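The difference between the two treatments can be illustrated with plain NumPy. The following is a minimal sketch under stated assumptions, not the library's implementation: `sketch_treat` is a hypothetical helper that handles only a 1-D array, without grouping or a choice of interpolation method.

```python
import numpy as np

def sketch_treat(values, percentile=0.01, truncate=False):
    """Illustrative only: winsorize or truncate a 1-D float array."""
    out = np.asarray(values, dtype=float).copy()
    # Non-finite values (inf/-inf) become NaN before any cut-off is computed.
    out[~np.isfinite(out)] = np.nan
    # Cut-offs are the percentile and 1 - percentile quantiles, ignoring NaN.
    lo, hi = np.nanquantile(out, [percentile, 1 - percentile])
    if truncate:
        # Truncating: out-of-range values are replaced by NaN.
        out[(out < lo) | (out > hi)] = np.nan
    else:
        # Winsorizing: out-of-range values are clipped to the boundary.
        out = np.clip(out, lo, hi)
    return out

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, np.inf])
winsorized = sketch_treat(data, percentile=0.25)
truncated = sketch_treat(data, percentile=0.25, truncate=True)
```

A wide percentile of 0.25 is used here only so that the effect is visible on six values. Winsorizing keeps the number of non-missing observations unchanged, whereas truncating reduces it.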

Parameters

Name Type Description Default
x A pandas.Series, pandas.DataFrame, or numpy.ndarray. For a DataFrame, only numeric columns are treated; other columns pass through unchanged. required
percentile The (symmetric) tail probability defining the cut-offs. Must satisfy 0 < percentile < 0.5. Defaults to 0.01. 0.01
truncate If True, out-of-bounds values are set to NaN; otherwise they are clipped to the boundary value (winsorizing). Defaults to False. False
by Optional grouping. Either an array/Series of group labels (same length as x) or, for a DataFrame x, a column name. Outlier cut-offs are then computed within each group. Must not contain missing values. None
method Quantile interpolation method passed to numpy.nanquantile. Defaults to "linear" (equivalent to R’s type = 7). Use "averaged_inverted_cdf" to match Stata / R type = 2. 'linear'
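To see why the choice of method can matter, the two interpolation methods named above can be compared directly with numpy.nanquantile. This is a small sketch on made-up data; it assumes NumPy 1.22 or later, where the `method` keyword is available.

```python
import numpy as np

# Made-up sample; the NaN is ignored by nanquantile.
sample = np.array([1.0, 2.0, 3.0, 4.0, np.nan])

# "linear" interpolates between neighbouring order statistics (R type 7).
q_linear = np.nanquantile(sample, 0.25, method="linear")

# "averaged_inverted_cdf" averages the two adjacent order statistics where
# the empirical CDF is flat, and otherwise picks one of them (R type 2).
q_avg = np.nanquantile(sample, 0.25, method="averaged_inverted_cdf")

print(q_linear, q_avg)
```

On small samples the two cut-offs can differ noticeably, so the method is worth fixing explicitly when results have to be reproduced in another package.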

Returns

Name Type Description
Same type as x The outlier-treated data. A DataFrame keeps its non-numeric columns, column order and index; a Series keeps its index and name; an ndarray keeps its shape.

Examples

Basic — winsorize a single variable at the 1st/99th percentile:

import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
ex.treat_outliers(df["gdp_pc"], percentile=0.01).describe()

Advanced — winsorize several columns at the 5th/95th percentile, with the cut-offs computed within each continent:

treated = ex.treat_outliers(
    df[["gini_regional", "gdp_pc"]], percentile=0.05, by=df["continent"]
)
treated.describe()
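
What "cut-offs computed within each group" means can be sketched with pandas alone. The snippet below is an assumption-laden illustration rather than the library's code: the toy data, the column names and the helper `clip_to_quantiles` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, 1000.0],
})

def clip_to_quantiles(s, percentile=0.25):
    # Cut-offs come from the values of this group only.
    lo, hi = s.quantile([percentile, 1 - percentile])
    return s.clip(lower=lo, upper=hi)

# transform applies the function per group and keeps the original index.
toy["value_w"] = toy.groupby("group")["value"].transform(clip_to_quantiles)
```

With pooled cut-offs, every value in group "a" except 100 would lie at or below the lower bound, because group "b" is on a much larger scale. Computing the bounds per group avoids that.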