treat_outliers
treat_outliers(x, percentile=0.01, *, truncate=False, by=None, method='linear')

Treat numerical outliers by winsorizing or truncating.
For each numeric variable, values outside the [percentile, 1 - percentile] quantile range are either clipped to the boundary (winsorizing, the default) or set to NaN (truncating). Non-finite values (inf/-inf) are set to NaN first.
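The difference between winsorizing and truncating can be illustrated with a minimal NumPy sketch. It is not the library's implementation; the helper name `treat_sketch` and the toy data are hypothetical and only follow the behaviour described above.

```python
import numpy as np

def treat_sketch(values, percentile=0.01, truncate=False):
    """Hypothetical helper mirroring the documented behaviour (illustration only)."""
    out = np.asarray(values, dtype=float).copy()
    out[~np.isfinite(out)] = np.nan  # inf/-inf (and NaN) become NaN first
    lo, hi = np.nanquantile(out, [percentile, 1 - percentile])
    if truncate:
        # Truncating: out-of-bounds values become NaN
        out[(out < lo) | (out > hi)] = np.nan
    else:
        # Winsorizing: out-of-bounds values are clipped to the cut-offs
        out = np.clip(out, lo, hi)
    return out

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0, np.inf])
winsorized = treat_sketch(data, percentile=0.1)
truncated = treat_sketch(data, percentile=0.1, truncate=True)
print(winsorized)
print(truncated)
```

With these toy values the cut-offs are 1.9 and 108.1, so winsorizing pulls 1 up to 1.9 and 1000 down to 108.1, while truncating replaces both with NaN. The `inf` entry is NaN in either case.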
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| x | | A `pandas.Series`, `pandas.DataFrame`, or `numpy.ndarray`. For a DataFrame, only numeric columns are treated; other columns pass through unchanged. | required |
| percentile | | The (symmetric) tail probability defining the cut-offs. Must satisfy 0 < percentile < 0.5. Defaults to 0.01. | 0.01 |
| truncate | | If True, out-of-bounds values are set to NaN; otherwise they are clipped to the boundary value (winsorizing). Defaults to False. | False |
| by | | Optional grouping. Either an array/Series of group labels (same length as x) or, for a DataFrame x, a column name. Outlier cut-offs are then computed within each group. Must not contain missing values. | None |
| method | | Quantile interpolation method passed to `numpy.nanquantile`. Defaults to "linear" (equivalent to R’s type = 7). Use "averaged_inverted_cdf" to match Stata / R type = 2. | 'linear' |
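To see why the choice of `method` matters, the two interpolation rules named in the table can be compared by calling `numpy.nanquantile` directly. This is a small sketch on made-up data; it needs a NumPy version that supports the `method` keyword (1.22 or later) and does not call `treat_outliers` itself.

```python
import numpy as np

# Made-up sample; the NaN is ignored by nanquantile
sample = np.array([1.0, 2.0, 3.0, 4.0, np.nan])

# Default rule: linear interpolation between order statistics (R type = 7)
q_linear = np.nanquantile(sample, 0.25, method="linear")

# Averaged inverted CDF (R type = 2): averages neighbours at discontinuities
q_type2 = np.nanquantile(sample, 0.25, method="averaged_inverted_cdf")

print(q_linear, q_type2)
```

On this sample the lower quartile is 1.75 under "linear" and 1.5 under "averaged_inverted_cdf", so the cut-offs, and therefore which values count as outliers, can differ on small samples.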
Returns
| Name | Type | Description |
|---|---|---|
| | Same type as x | The outlier-treated data. A DataFrame keeps its non-numeric columns, column order and index; a Series keeps its index and name; an ndarray keeps its shape. |
Examples
Basic: winsorize a single variable at the 1st and 99th percentiles:
```python
import expdpy as ex
from expdpy.data import load_kuznets

df = load_kuznets()
ex.treat_outliers(df["gdp_pc"], percentile=0.01).describe()
```

Advanced: winsorize several columns at the 5th and 95th percentiles, with the cut-offs computed within each continent:
```python
treated = ex.treat_outliers(
    df[["gini_regional", "gdp_pc"]], percentile=0.05, by=df["continent"]
)
treated.describe()
```
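What "cut-offs computed within each group" means can be sketched with plain pandas on toy data. The frame `toy`, the helper `winsorize_group`, and the 25% tail are hypothetical choices for illustration; this is an assumption-based sketch of group-wise winsorizing, not the code behind the `by` argument.

```python
import numpy as np
import pandas as pd

# Toy data: two groups on very different scales, each with one extreme value
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, 1000.0],
})

def winsorize_group(s, percentile=0.25):
    """Hypothetical helper: clip one group's values to that group's own quantiles."""
    lo, hi = np.nanquantile(s.to_numpy(dtype=float), [percentile, 1 - percentile])
    return s.clip(lower=lo, upper=hi)

# transform applies the helper per group and returns values aligned to the original index
toy["value_w"] = toy.groupby("group")["value"].transform(winsorize_group)
print(toy)
```

Group "a" is clipped to [2, 4] and group "b" to [20, 40]. With pooled cut-offs, the larger scale of group "b" would dominate the upper bound and the 100 in group "a" would be treated very differently.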