Categorical Encoding

class CategoricalEncodingPreprocessor(method: str = 'label', cutoffs: List[float] | None = None, **kwargs)

Bases: BasePreprocessor

Preprocessor for categorical feature encoding.

Supports multiple encoding strategies including ordinal, one-hot, label, cyclic, and threshold encoding. Can encode both features and target variables based on configuration.

Parameters:
method : str or dict, default="label"

Encoding method: "ordinal", "onehot", "label", "cyclic", or "threshold". Alternatively, a dict mapping column names to methods, e.g. {"col1": "ordinal", "col2": "onehot"}. If the target's name matches a key in the dict, the target is encoded as well.

cutoffs : list, optional

For threshold encoding: list of cutoff values that define the bins. Example: [20, 40] creates three bins, where values below 20 map to 0, values from 20 to 40 map to 1, and values above 40 map to 2.

Attributes:
encoders : dict

Dictionary mapping feature names to their fitted encoder objects

target_encoder : object or None

Encoder for the target variable, set when the target's name matches a key in the method dict

is_fitted : bool

Whether the preprocessor has been fitted

Notes

The preprocessor supports the following encoding strategies:

- Ordinal: Maps categories to integers
- One-hot: Creates binary columns for each category
- Label: Maps categories to integers (for a single column)
- Cyclic: Creates sin/cos features for cyclical data
- Threshold: Bins continuous values into categories
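The cyclic strategy listed above can be illustrated with a short standalone sketch. It is not the library's implementation: the helper name `cyclic_encode` and its `period` argument are assumptions made for illustration, and only the standard library is used.

```python
import math
from typing import Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a value on a cycle of length `period` to (sin, cos) coordinates.

    Hypothetical helper: the name and signature are illustrative only.
    """
    angle = 2.0 * math.pi * (value / period)
    return math.sin(angle), math.cos(angle)


# Hours of the day: 23 and 0 are far apart as integers but adjacent on the circle.
late = cyclic_encode(23, period=24)
midnight = cyclic_encode(0, period=24)
noon = cyclic_encode(12, period=24)

# Distance in the encoded space reflects closeness on the cycle.
gap_late = math.dist(late, midnight)
gap_noon = math.dist(noon, midnight)
```

With this kind of encoding, hour 23 ends up much closer to hour 0 than hour 12 does, which a plain integer code would not capture.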

Examples

Label encoding for all categorical features:

>>> preprocessor = CategoricalEncodingPreprocessor(method="label")

One-hot encoding for all categorical features:

>>> preprocessor = CategoricalEncodingPreprocessor(method="onehot")

Mixed encoding strategies:

>>> preprocessor = CategoricalEncodingPreprocessor(
...     method={"category1": "onehot", "category2": "ordinal"}
... )

Threshold encoding with custom cutoffs:

>>> preprocessor = CategoricalEncodingPreprocessor(
...     method="threshold", cutoffs=[20, 40, 60]
... )
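
To make the binning behind threshold encoding concrete, here is a minimal standalone sketch using the same cutoffs as the example above. The helper `threshold_encode` is hypothetical, and the handling of values that sit exactly on a cutoff (assigned to the upper bin here) is an assumption; the documentation does not state which side is inclusive.

```python
from bisect import bisect_right
from typing import List


def threshold_encode(values: List[float], cutoffs: List[float]) -> List[int]:
    """Return the bin index of each value given ascending cutoffs.

    Assumption: a value equal to a cutoff falls into the upper bin.
    """
    ordered = sorted(cutoffs)
    # bisect_right counts how many cutoffs are <= the value, which is the bin index.
    return [bisect_right(ordered, v) for v in values]


bins = threshold_encode([5, 20, 39.9, 40, 75], cutoffs=[20, 40, 60])
# -> [0, 1, 1, 2, 3]
```

Three cutoffs produce four bins, numbered 0 through 3.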

export_params() -> Dict[str, Any]

Export parameters for serialization.

Returns:
Dict[str, Any]

Dictionary containing all parameters

fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> CategoricalEncodingPreprocessor

Fit the encoders to the data.

Learns encoding parameters from the training data for each categorical feature and optionally the target variable.

Parameters:
X : pd.DataFrame

Training data

y : pd.Series, optional

Target values

categorical_features : List[str], optional

List of categorical feature names to encode

Returns:
self : CategoricalEncodingPreprocessor

Fitted preprocessor

Notes

The method fits encoders for each categorical feature based on the specified encoding method. If the target variable name matches a key in the method dictionary, it will also be encoded.

fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> Tuple[DataFrame, Series | None]

Fit the encoders and transform the data.

Parameters:
X : pd.DataFrame

Data to fit and transform

y : pd.Series, optional

Target values

categorical_features : List[str], optional

List of categorical feature names to encode

Returns:
Tuple[pd.DataFrame, Optional[pd.Series]]

Tuple containing (encoded_X, encoded_y). The target y is encoded only if its name matches a key in the method dictionary.

Notes

This method combines fit and transform operations. The target will be automatically encoded if its name matches a key in the method dictionary.

get_feature_names(feature_names: List[str]) -> List[str]

Get the feature names after encoding.

Parameters:
feature_names : List[str]

Original feature names

Returns:
List[str]

Updated feature names after encoding

Notes

Feature names are updated based on the encoding method:

- One-hot encoding: Creates new features for each category
- Cyclic encoding: Creates sin and cos features
- Other methods: Preserve original feature names
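
The renaming rules above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the function `expand_feature_names`, its `methods` and `categories` arguments, and the `name_category`, `name_sin`, and `name_cos` naming patterns are all hypothetical, since the documentation does not specify how the generated names are formed.

```python
from typing import Dict, List


def expand_feature_names(
    feature_names: List[str],
    methods: Dict[str, str],
    categories: Dict[str, List[str]],
) -> List[str]:
    """Return output column names after encoding (illustrative naming only)."""
    expanded: List[str] = []
    for name in feature_names:
        method = methods.get(name)
        if method == "onehot":
            # One new column per category (assumed "<feature>_<category>" pattern).
            expanded.extend(f"{name}_{cat}" for cat in categories[name])
        elif method == "cyclic":
            # Two columns per feature: sine and cosine components.
            expanded.extend([f"{name}_sin", f"{name}_cos"])
        else:
            # Ordinal, label, and threshold keep the original column name.
            expanded.append(name)
    return expanded


names = expand_feature_names(
    ["color", "month", "size"],
    methods={"color": "onehot", "month": "cyclic", "size": "ordinal"},
    categories={"color": ["blue", "red"]},
)
# -> ["color_blue", "color_red", "month_sin", "month_cos", "size"]
```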

transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]

Transform features using the fitted encoders.

Parameters:
X : pd.DataFrame

Features to transform

y : pd.Series, optional

Target values to transform (if applicable)

Returns:
Tuple[pd.DataFrame, Optional[pd.Series]]

Tuple containing (encoded_X, encoded_y). The target y may be encoded if its name matches a key in the method dictionary, otherwise it’s returned unchanged

Raises:
ValueError

If preprocessor has not been fitted

Notes

The method applies the encoders learned during fit to X and, when the target's name matches a key in the method dictionary, to y.
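
The fit-before-transform contract described here, including the ValueError raised on an unfitted preprocessor, can be illustrated with a toy stand-in. The class `ToyLabelEncoder` below is hypothetical and works on plain lists rather than DataFrames; it only mirrors the documented `is_fitted` flag and error behavior, not the library's actual code.

```python
from typing import Dict, List


class ToyLabelEncoder:
    """Toy label encoder used only to show the fit/transform contract."""

    def __init__(self) -> None:
        self.mapping: Dict[str, int] = {}
        self.is_fitted = False

    def fit(self, values: List[str]) -> "ToyLabelEncoder":
        # Assign integer codes in sorted category order (an arbitrary choice).
        self.mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
        self.is_fitted = True
        return self

    def transform(self, values: List[str]) -> List[int]:
        if not self.is_fitted:
            raise ValueError("ToyLabelEncoder has not been fitted")
        return [self.mapping[v] for v in values]


encoder = ToyLabelEncoder()

# Calling transform before fit raises ValueError.
try:
    encoder.transform(["red"])
    raised = False
except ValueError:
    raised = True

# After fitting, the learned mapping is reused for new data.
codes = encoder.fit(["red", "blue", "red"]).transform(["blue", "red"])
# -> [0, 1]
```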