Preprocessing

class BasePreprocessor(**kwargs)

Bases: ABC

Abstract base class for all preprocessors.

Subclasses must implement the abstract methods fit, transform, and export_params. The fit/transform pair follows the scikit-learn estimator interface pattern, which keeps all preprocessing operations in the Brisk framework consistent.

Parameters:

**kwargs: Additional parameters specific to each preprocessor implementation

Attributes:

is_fitted (bool): Whether the preprocessor has been fitted to data

Notes

This abstract base class provides the common interface that all preprocessors must implement. It includes parameter validation, the standard fit/transform pattern, and utility methods for feature name handling and parameter export.

Examples

Create a custom preprocessor:

>>> class CustomPreprocessor(BasePreprocessor):
...     def _validate_params(self, **kwargs):
...         # Validate parameters
...         pass
...     def fit(self, X, y=None, categorical_features=None):
...         # Fit logic
...         return self
...     def transform(self, X):
...         # Transform logic
...         return X
...     def export_params(self):
...         # Export parameters
...         return {}
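The contract described above can be sketched without Brisk installed. In the sketch below, `SketchBasePreprocessor` is a hypothetical stand-in that only mirrors the documented interface (it is not the library's class), and `MeanCenterer` is an illustrative subclass invented for this example. It shows the three abstract methods, the `is_fitted` flag, the default `fit_transform` and `get_feature_names` behaviour, and the error on an unfitted `transform`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

import pandas as pd


class SketchBasePreprocessor(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    def __init__(self, **kwargs: Any) -> None:
        self.is_fitted = False
        self._validate_params(**kwargs)

    def _validate_params(self, **kwargs: Any) -> None:
        """Hook for parameter checks; a no-op in this sketch."""

    @abstractmethod
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None,
            categorical_features: Optional[List[str]] = None):
        ...

    @abstractmethod
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        ...

    @abstractmethod
    def export_params(self) -> Dict[str, Any]:
        ...

    def fit_transform(self, X, y=None, categorical_features=None):
        # Equivalent to calling fit() and then transform() on the same data.
        return self.fit(X, y, categorical_features).transform(X)

    def get_feature_names(self, feature_names: List[str]) -> List[str]:
        # Default: names pass through unchanged.
        return list(feature_names)


class MeanCenterer(SketchBasePreprocessor):
    """Illustrative subclass: subtracts the column means learned in fit."""

    def fit(self, X, y=None, categorical_features=None):
        self.means_ = X.mean()
        self.is_fitted = True
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("MeanCenterer must be fitted before transform")
        return X - self.means_

    def export_params(self):
        return {}


train = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
centered = MeanCenterer().fit_transform(train)
```

Calling `MeanCenterer().transform(train)` on a fresh instance raises `ValueError`, which is the behaviour the transform notes ask implementers to provide.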

abstractmethod export_params() → Dict[str, Any]

Export parameters for serialization and rerun functionality.

Returns:

Dict[str, Any]: Dictionary containing all parameters in JSON-serializable format

Notes

This method should return all parameters needed to recreate the preprocessor instance, suitable for JSON serialization.

abstractmethod fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → BasePreprocessor

Fit the preprocessor to the data.

Parameters:

X (pd.DataFrame): Training data
y (pd.Series, optional): Target values
categorical_features (List[str], optional): List of categorical feature names

Returns:

self (BasePreprocessor): Fitted preprocessor instance

Notes

This method should fit the preprocessor to the training data and set the is_fitted flag to True upon completion.

fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → DataFrame

Fit the preprocessor and transform the data.

Convenience method that combines fit and transform operations in a single call.

Parameters:

X (pd.DataFrame): Training data
y (pd.Series, optional): Target values
categorical_features (List[str], optional): List of categorical feature names

Returns:

pd.DataFrame: Transformed data

Notes

This method is equivalent to calling fit() followed by transform() on the same data. It’s provided for convenience and follows the scikit-learn pattern.

get_feature_names(feature_names: List[str]) → List[str]

Get the feature names after preprocessing.

Parameters:

feature_names (List[str]): Original feature names

Returns:

List[str]: Feature names after preprocessing

Notes

By default, this method returns the original feature names unchanged. Subclasses should override this method if preprocessing changes the number or names of features (e.g., one-hot encoding).

abstractmethod transform(X: DataFrame) → DataFrame

Transform the data using the fitted preprocessor.

Parameters:

X (pd.DataFrame): Data to transform

Returns:

pd.DataFrame: Transformed data

Notes

This method should apply the transformation learned during fit to the provided data. It should raise an error if called before the preprocessor has been fitted.

class MissingDataPreprocessor(strategy: str = 'drop_rows', impute_method: str = 'mean', constant_value: Any = 0, **kwargs)

Bases: BasePreprocessor

Preprocessor for handling missing values in datasets.

Provides strategies for dealing with missing data including dropping rows with missing values or imputing missing values using various statistical methods.

Parameters:

strategy (str, default="drop_rows"): Strategy for handling missing values: "drop_rows" or "impute"
impute_method (str, default="mean"): Imputation method when strategy="impute": "mean", "median", "mode", or "constant"
constant_value (Any, default=0): Constant value to use when impute_method="constant"

Attributes:

constant_values (dict): Dictionary mapping column names to their fitted imputation values
is_fitted (bool): Whether the preprocessor has been fitted

Notes

The preprocessor supports two main strategies:

1. Drop rows: Remove any rows containing missing values
2. Impute: Fill missing values using statistical methods

For imputation, the fill values are learned from the training data, and the same values are used to fill missing values in test data.

Examples

Drop rows with missing values:

>>> preprocessor = MissingDataPreprocessor(strategy="drop_rows")

Impute with mean values:

>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="mean"
... )

Impute with constant value:

>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="constant", constant_value=-1
... )

export_params() → Dict[str, Any]

Export parameters for serialization.

Returns:

Dict[str, Any]: Dictionary containing all parameters

fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → MissingDataPreprocessor

Fit the missing data preprocessor.

Learns imputation values from the training data for each column that contains missing values.

Parameters:

X (pd.DataFrame): Training data
y (pd.Series, optional): Target values (not used for missing data handling)

Returns:

self (MissingDataPreprocessor): Fitted preprocessor

Notes

For imputation methods, the preprocessor learns the appropriate values (mean, median, mode, or constant) for each column from the training data. These values are then used consistently for both training and test data.
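The train/test behaviour described here can be illustrated with plain pandas. This is a minimal sketch of the idea, not the Brisk implementation: the `fill_values` dict plays the role of the documented `constant_values` attribute, and the column names are made up for the example.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                      "income": [10.0, 30.0, np.nan]})
test = pd.DataFrame({"age": [np.nan, 50.0],
                     "income": [np.nan, 5.0]})

# "Fit": learn one fill value per column from the training data only.
fill_values = train.mean().to_dict()

# "Transform": reuse the same learned values for both splits, so the
# test set is never used to compute statistics.
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

# The "drop_rows" strategy needs no learned values: rows with any
# missing entry are simply removed.
train_dropped = train.dropna()
```

Swapping `train.mean()` for `train.median()` gives the "median" variant under the same assumptions.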

get_feature_names(feature_names: List[str]) → List[str]

Get the feature names after missing data handling.

Parameters:

feature_names (List[str]): Original feature names

Returns:

List[str]: Feature names (no columns are dropped in this simplified version)

Notes

In the current implementation, no columns are dropped during missing data handling, so feature names remain unchanged.

transform(X: DataFrame) → DataFrame

Transform the data by handling missing values.

Parameters:

X (pd.DataFrame): Data to transform

Returns:

pd.DataFrame: Data with missing values handled

Raises:

ValueError: If preprocessor has not been fitted

Notes

The transformation applies the strategy learned during fit:

- For "drop_rows": removes any rows with missing values
- For "impute": fills missing values using learned imputation values

class ScalingPreprocessor(method: str = 'standard', **kwargs)

Bases: BasePreprocessor

Preprocessor for scaling numerical features.

Provides various scaling methods for numerical features while preserving categorical features in their original form. Supports standard, min-max, robust, max-abs, and normalizer scaling methods.

Parameters:

method (str, default="standard"): Scaling method: "standard", "minmax", "robust", "maxabs", or "normalizer"

Attributes:

scaler (sklearn.preprocessing scaler): The fitted scaler object
_scaled_features (list): List of feature names that were scaled during fit
is_fitted (bool): Whether the preprocessor has been fitted

Notes

The preprocessor automatically excludes categorical features from scaling to preserve their original form. Only continuous numerical features are scaled using the specified method.

Examples

Standard scaling:

>>> preprocessor = ScalingPreprocessor(method="standard")

Min-max scaling:

>>> preprocessor = ScalingPreprocessor(method="minmax")

Robust scaling:

>>> preprocessor = ScalingPreprocessor(method="robust")

export_params() → Dict[str, Any]

Export parameters for serialization.

Returns:

Dict[str, Any]: Dictionary containing all parameters

fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → ScalingPreprocessor

Fit the scaler to the data.

Learns scaling parameters from the training data, excluding categorical features from scaling.

Parameters:

X (pd.DataFrame): Training data
y (pd.Series, optional): Target values (not used for scaling)
categorical_features (List[str], optional): List of categorical feature names to exclude from scaling

Returns:

self (ScalingPreprocessor): Fitted preprocessor

Notes

The scaler is fitted only on continuous features, excluding any categorical features specified in categorical_features. This ensures that categorical features remain in their original form while continuous features are properly scaled.
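As a rough illustration of scaling only the continuous columns, the sketch below standardizes by hand with pandas (zero mean, unit variance using the population standard deviation). It is an assumption-laden stand-in for the "standard" method, not the library's code; the column names and the `scaled_features` variable are invented for the example.

```python
import pandas as pd

X = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 60.0, 70.0],
    "color": ["red", "blue", "red"],
})
categorical_features = ["color"]

# Only columns not listed as categorical are scaled.
scaled_features = [c for c in X.columns if c not in categorical_features]

# "Fit": learn per-column statistics from the continuous columns.
means = X[scaled_features].mean()
stds = X[scaled_features].std(ddof=0)

# "Transform": scale the continuous columns, leave the rest untouched.
X_scaled = X.copy()
X_scaled[scaled_features] = (X[scaled_features] - means) / stds
```

The `color` column comes out exactly as it went in, which is the behaviour the notes describe for categorical features.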

fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → DataFrame

Fit the scaler and transform the data.

Parameters:

X (pd.DataFrame): Data to fit and transform
y (pd.Series, optional): Target values (not used for scaling)
categorical_features (List[str], optional): List of categorical feature names to exclude from scaling

Returns:

pd.DataFrame: Transformed data with scaled continuous features

Notes

This method combines fit and transform operations, scaling only the continuous features while preserving categorical features.

get_feature_names(feature_names: List[str] | None = None) → List[str]

Get the feature names after transformation.

Parameters:

feature_names (List[str], optional): Original feature names

Returns:

List[str]: Feature names after transformation (same as input)

Notes

Scaling does not change feature names, so the original names are returned unchanged.

transform(X: DataFrame) → DataFrame

Transform the data using the fitted scaler.

Applies scaling to continuous features while preserving categorical features in their original form.

Parameters:

X (pd.DataFrame): Data to transform

Returns:

pd.DataFrame: Transformed data with scaled continuous features

Notes

Only features that were scaled during fit are transformed. Categorical features remain unchanged.

class CategoricalEncodingPreprocessor(method: str = 'label', cutoffs: List[float] | None = None, **kwargs)

Bases: BasePreprocessor

Preprocessor for categorical feature encoding.

Supports multiple encoding strategies including ordinal, one-hot, label, cyclic, and threshold encoding. Can encode both features and target variables based on configuration.

Parameters:

method (str or dict, default="label"): Encoding method: "ordinal", "onehot", "label", "cyclic", or "threshold"; or a dict mapping column names to methods, e.g. {"col1": "ordinal", "col2": "onehot"}. If the target name matches a key in the dict, the target is encoded as well.
cutoffs (list, optional): For threshold encoding, the list of cutoff values that define the bins. Example: [20, 40] creates three bins: values below 20 map to 0, values from 20 to 40 map to 1, and values above 40 map to 2.

Attributes:

encoders (dict): Dictionary mapping feature names to their fitted encoder objects
target_encoder (object or None): Encoder for the target variable, set when the target name matches a key in the method dict
is_fitted (bool): Whether the preprocessor has been fitted

Notes

The preprocessor supports various encoding strategies:

- Ordinal: Maps categories to integers
- One-hot: Creates binary columns for each category
- Label: Maps categories to integers (for single column)
- Cyclic: Creates sin/cos features for cyclical data
- Threshold: Bins continuous values into categories
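To make the threshold strategy concrete, here is a small sketch that bins a numeric column with `numpy.digitize` using the cutoffs [20, 40] from the parameter description. How Brisk treats values that fall exactly on a cutoff is not stated in this page; the sketch assumes a value equal to a cutoff goes into the higher bin, which is the `numpy.digitize` default.

```python
import numpy as np
import pandas as pd

ages = pd.Series([10, 20, 30, 40, 50], name="age")
cutoffs = [20, 40]

# Bin index = number of cutoffs that are <= the value
# (assumed boundary rule; see the note above).
codes = pd.Series(
    np.digitize(ages.to_numpy(), bins=cutoffs),
    index=ages.index,
    name="age",
)
```

With two cutoffs there are three possible codes (0, 1, 2), and in general n cutoffs yield n + 1 bins.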

Examples

Label encoding for all categorical features:

>>> preprocessor = CategoricalEncodingPreprocessor(method="label")

One-hot encoding for all categorical features:

>>> preprocessor = CategoricalEncodingPreprocessor(method="onehot")

Mixed encoding strategies:

>>> preprocessor = CategoricalEncodingPreprocessor(
...     method={"category1": "onehot", "category2": "ordinal"}
... )

Threshold encoding with custom cutoffs:

>>> preprocessor = CategoricalEncodingPreprocessor(
...     method="threshold", cutoffs=[20, 40, 60]
... )

export_params() → Dict[str, Any]

Export parameters for serialization.

Returns:

Dict[str, Any]: Dictionary containing all parameters

fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → CategoricalEncodingPreprocessor

Fit the encoders to the data.

Learns encoding parameters from the training data for each categorical feature and optionally the target variable.

Parameters:

X (pd.DataFrame): Training data
y (pd.Series, optional): Target values
categorical_features (List[str], optional): List of categorical feature names to encode

Returns:

self (CategoricalEncodingPreprocessor): Fitted preprocessor

Notes

The method fits encoders for each categorical feature based on the specified encoding method. If the target variable name matches a key in the method dictionary, it will also be encoded.

fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) → DataFrame

Fit the encoders and transform the data.

Parameters:

X (pd.DataFrame): Data to fit and transform
y (pd.Series, optional): Target values
categorical_features (List[str], optional): List of categorical feature names to encode

Returns:

pd.DataFrame: Transformed data with encoded categorical features

Notes

This method combines fit and transform operations, encoding categorical features and optionally the target variable.

get_feature_names(feature_names: List[str]) → List[str]

Get the feature names after encoding.

Parameters:

feature_names (List[str]): Original feature names

Returns:

List[str]: Updated feature names after encoding

Notes

Feature names are updated based on the encoding method:

- One-hot encoding: Creates new features for each category
- Cyclic encoding: Creates sin and cos features
- Other methods: Preserve original feature names
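The way one-hot and cyclic encoding change the feature list can be sketched with pandas and NumPy. The output column names (`hour_sin`, `hour_cos`, `city_Oslo`, ...) and the period of 24 are assumptions chosen for illustration; the names Brisk actually generates, and how it determines the period, are not given on this page.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "hour": [0, 6, 12, 18],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
})

# Cyclic encoding: map the value onto a circle so that 23:00 and 00:00
# end up close together. One input column becomes two output columns.
period = 24  # assumed period for an hour-of-day feature
angle = 2 * np.pi * X["hour"] / period
cyclic = pd.DataFrame({"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)})

# One-hot encoding: one binary column per category.
onehot = pd.get_dummies(X["city"], prefix="city")

encoded = pd.concat([cyclic, onehot], axis=1)
new_feature_names = list(encoded.columns)
```

Two input features become four output features here, which is why this class overrides `get_feature_names` instead of returning the input names unchanged.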

transform(X: DataFrame, y: Series | None = None) → Tuple[DataFrame, Series]

Transform the data using the fitted encoders.

Parameters:

X (pd.DataFrame): Features to transform
y (pd.Series, optional): Target values to transform (if the target name matches a key in the method dict)

Returns:

Tuple[pd.DataFrame, pd.Series or None]: Always returns a tuple of (transformed features, transformed target or None)

Raises:

ValueError: If preprocessor has not been fitted

Notes

The method applies the learned encoding to both features and optionally the target variable. It always returns a tuple for consistency.

class FeatureSelectionPreprocessor(method: str = 'selectkbest', n_features_to_select: int = 5, feature_selection_cv: int = 3, estimator: Any | None = None, algorithm_config=None, feature_selection_estimator: str | None = None, problem_type: str = 'classification', **kwargs)

Bases: BasePreprocessor

Preprocessor for feature selection methods.

Supports various feature selection algorithms including SelectKBest, RFECV, and SequentialFeatureSelector. Can use different estimators for wrapper methods.

Parameters:

method (str, default="selectkbest"): Feature selection method ("selectkbest", "rfecv", "sequential")
n_features_to_select (int, default=5): Number of features to select
feature_selection_cv (int, default=3): Number of CV folds for RFECV and SequentialFeatureSelector
estimator (Any, optional): Direct estimator to use for RFECV and SequentialFeatureSelector
algorithm_config (AlgorithmCollection, optional): User-provided collection of AlgorithmWrapper objects to use for feature selection
feature_selection_estimator (str, optional): The name of the estimator to use for feature selection. If not specified, defaults to the first algorithm in the relevant wrapper list
problem_type (str, default="classification"): The type of problem ("classification" or "regression"). Used to determine the appropriate scoring function for SelectKBest.

Attributes:

selector (sklearn.feature_selection selector): The fitted feature selector object
scaler (sklearn.preprocessing scaler, optional): Fitted scaler for internal use (if provided)
is_fitted (bool): Whether the preprocessor has been fitted

Notes

The preprocessor supports three main feature selection methods:

1. SelectKBest: Selects the k best features based on statistical tests
2. RFECV: Recursive feature elimination with cross-validation
3. SequentialFeatureSelector: Sequential feature selection

For wrapper methods (RFECV, SequentialFeatureSelector), an estimator must be provided either directly or through algorithm_config.

Examples

SelectKBest for classification:

>>> preprocessor = FeatureSelectionPreprocessor(
...     method="selectkbest", n_features_to_select=10
... )

RFECV with custom estimator:

>>> from sklearn.ensemble import RandomForestClassifier
>>> preprocessor = FeatureSelectionPreprocessor(
...     method="rfecv", estimator=RandomForestClassifier()
... )

Sequential feature selection:

>>> preprocessor = FeatureSelectionPreprocessor(
...     method="sequential", n_features_to_select=5
... )

export_params() → Dict[str, Any]

Export parameters for serialization and rerun functionality.

Returns:

Dict[str, Any]: Dictionary containing all parameters in JSON-serializable format

Notes

Returns all parameters needed to recreate the preprocessor instance, suitable for JSON serialization. Note that complex objects like estimators may not be directly serializable.

fit(X: DataFrame, y: Series | None = None) → FeatureSelectionPreprocessor

Fit the feature selector to the data.

Learns feature selection parameters from the training data using the specified feature selection method. For wrapper methods (RFECV, SequentialFeatureSelector), the target variable is required.

Parameters:

X (pd.DataFrame): Training data features
y (pd.Series, optional): Target values (required for RFECV and SequentialFeatureSelector)

Returns:

self (FeatureSelectionPreprocessor): Fitted preprocessor

Raises:

ValueError: If y is required but not provided for wrapper methods, or if the preprocessor has not been fitted before transform

Notes

The method creates and fits the appropriate feature selector based on the specified method. If a scaler is provided, features are scaled before feature selection. The fitted selector can then be used to transform new data.

get_feature_names(feature_names: List[str]) → List[str]

Get the selected feature names after feature selection.

Parameters:

feature_names (List[str]): Original feature names

Returns:

List[str]: Names of selected features

Notes

Returns the names of features that were selected during fitting. If the preprocessor is not fitted or no selector is available, returns the original feature names unchanged.

transform(X: DataFrame) → DataFrame

Transform the data using the fitted selector.

Applies the learned feature selection to new data, returning only the selected features from the original unscaled data.

Parameters:

X (pd.DataFrame): Data to transform

Returns:

pd.DataFrame: Data with only selected features

Raises:

ValueError: If preprocessor has not been fitted before transform

Notes

The method returns the selected features from the original unscaled data, not the scaled version used during fitting. This ensures that the output data maintains the original scale and meaning.
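The "select on scaled data, return unscaled columns" pattern can be sketched directly with scikit-learn. This is an illustration under assumptions, not the Brisk code path: it uses `SelectKBest` with `f_classif` as the scoring function, scales by hand with pandas, and uses made-up data in which columns `a` and `c` separate the classes while `b` does not.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "b": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "c": [0.1, 0.2, 0.1, 0.9, 1.0, 0.8],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

# The selector is fitted on a scaled copy of the features...
X_scaled = (X - X.mean()) / X.std(ddof=0)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)

# ...but the boolean support mask is applied to the original frame,
# so the returned values keep their original scale.
mask = selector.get_support()
selected_names = list(X.columns[mask])
X_selected = X.loc[:, mask]
```

The same mask is what a `get_feature_names` implementation would use to reduce the input name list to the selected names.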
