Applying Preprocessing

Brisk provides built-in preprocessing capabilities for missing data handling, scaling, categorical encoding, and feature selection. You can configure preprocessing in your DataManager or use experiment groups to apply different preprocessing strategies. See Using Experiment Groups for more details.

Note

For the DataManager documentation, see the DataManager API reference.

Built-in Preprocessors

Brisk includes several built-in preprocessing classes with different methods:

Missing Data Handling (MissingDataPreprocessor):

  • strategy="drop_rows": Remove rows with missing values

  • strategy="impute": Fill missing values (impute_method="mean", "median", "mode", "constant")
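As a rough illustration of what these two strategies do, the sketch below applies row dropping and mean imputation to a small list of rows in plain Python. It does not use Brisk, and the helper names `drop_rows` and `impute_mean` are hypothetical; they only mirror the option names listed above.

```python
from statistics import mean

# Toy dataset: each row is [age, income]; None marks a missing value.
rows = [
    [25, 50000.0],
    [None, 62000.0],
    [40, None],
    [35, 58000.0],
]


def drop_rows(data):
    """Keep only rows with no missing values (the idea behind "drop_rows")."""
    return [row for row in data if all(value is not None for value in row)]


def impute_mean(data):
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(data[0])
    col_means = [
        mean(row[i] for row in data if row[i] is not None)
        for i in range(n_cols)
    ]
    return [
        [col_means[i] if row[i] is None else row[i] for i in range(n_cols)]
        for row in data
    ]


dropped = drop_rows(rows)    # only the complete rows survive
imputed = impute_mean(rows)  # same number of rows, gaps filled in

print(len(dropped), len(imputed))
print(imputed[1][0])  # imputed age for the second row
```

Dropping rows shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.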

Categorical Encoding (CategoricalEncodingPreprocessor):

  • method="ordinal": Ordinal encoding (preserves order)

  • method="onehot": One-hot encoding (creates binary columns)

  • method="label": Label encoding (assigns integers)

  • method="cyclic": Cyclic encoding (for circular features)

  • method="threshold": Threshold encoding (requires cutoffs parameter, e.g., cutoffs=[20, 40])

  • method={"col1": "onehot", "target": "label"}: Column-specific encoding (can encode target variable)
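Cyclic encoding is commonly implemented by mapping a value onto the unit circle with sine and cosine, so that the two ends of the range (for example hour 23 and hour 0) end up close together. The sketch below shows that general technique in plain Python. It is an assumption that Brisk's method="cyclic" behaves the same way, and `cyclic_encode` is a hypothetical helper, not part of Brisk.

```python
import math


def cyclic_encode(value, period):
    """Map a value with the given period to a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hours of the day repeat every 24 hours.
hour_0 = cyclic_encode(0, 24)
hour_12 = cyclic_encode(12, 24)
hour_23 = cyclic_encode(23, 24)

# Hour 23 is one step from hour 0 on the circle, while hour 12 is opposite it.
print(distance(hour_23, hour_0) < distance(hour_12, hour_0))
```

A plain integer encoding would place hour 23 far from hour 0, which is why circular features such as hour of day or month are usually treated separately.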

Scaling (ScalingPreprocessor):

  • method="standard": StandardScaler (mean=0, std=1)

  • method="minmax": MinMaxScaler (scales to 0-1 range)

  • method="robust": RobustScaler (uses median and IQR)

  • method="maxabs": MaxAbsScaler (scales by max absolute value)

  • method="normalizer": Normalizer (scales individual samples to unit norm)
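For reference, the arithmetic behind the first two methods can be sketched in plain Python. This shows the usual definitions of standard and min-max scaling on a single column; it is not Brisk's code, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center to mean 0 and divide by the population standard deviation.

    Assumes the column is not constant (standard deviation > 0).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1.

    Assumes the column is not constant (max > min).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, 4.0, 6.0, 8.0]
standardized = standard_scale(column)
rescaled = minmax_scale(column)

print(standardized)
print(rescaled)
```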

Feature Selection (FeatureSelectionPreprocessor):

  • method="selectkbest": Select K best features using statistical tests

  • method="rfecv": Recursive feature elimination with cross-validation

  • method="sequential": Sequential feature selection (forward/backward)

All feature selection methods require the n_features_to_select parameter. The rfecv and sequential methods also require the algorithm_config parameter.
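To make the idea of "select K best" concrete, here is a minimal sketch that scores each feature and keeps the top k. It uses absolute Pearson correlation with the target as a stand-in scoring function; which statistical test Brisk actually applies is not stated here, and `select_k_best` and `pearson` are hypothetical helpers written for this illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def select_k_best(features, target, k):
    """Return the names of the k features with the highest |correlation| to target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with the target
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related
    "age": [5.0, 4.0, 3.0, 2.0, 1.5],    # falls as the target rises
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]

selected = select_k_best(features, target, k=2)
print(selected)
```

Here k plays the role that n_features_to_select plays in the options above: it fixes how many features survive the selection step.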

Configuring Preprocessors

Configure preprocessors in your data.py file by adding them to the DataManager constructor. Preprocessors are applied in the order they appear in the list:

from brisk.data.data_manager import DataManager
from brisk.data.preprocessing import (
    MissingDataPreprocessor,
    CategoricalEncodingPreprocessor,
    ScalingPreprocessor,
    FeatureSelectionPreprocessor
)

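data_manager = DataManager(
    test_size=0.2,
    split_method="shuffle",
    # Preprocessors run in the order listed here.
    preprocessors=[
        # 1. Fill missing values with the column mean
        MissingDataPreprocessor(strategy="impute", impute_method="mean"),
        # 2. One-hot encode categorical features
        CategoricalEncodingPreprocessor(method="onehot"),
        # 3. Standardize features (mean=0, std=1)
        ScalingPreprocessor(method="standard"),
        # 4. Keep the 10 best features
        FeatureSelectionPreprocessor(method="selectkbest", n_features_to_select=10),
    ]
)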

This pipeline will: handle missing values → encode categories → scale features → select the top 10 features.

Pipeline Order Considerations

The order of preprocessors in your pipeline matters, because each step operates on the output of the previous one. Follow these guidelines:

  1. Missing Data - Handle missing values first before other transformations

  2. Categorical Encoding - Encode categorical features before scaling

  3. Scaling - Apply scaling after encoding to avoid issues with categorical data

  4. Feature Selection - Select features last, after all transformations are complete

Important

Incorrect preprocessing order can lead to data leakage or unexpected results. Always handle missing data first and feature selection last.
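The effect of ordering can be seen with a small, self-contained sketch. It is not Brisk code: `impute_zero`, `minmax_scale`, and `run_pipeline` are hypothetical helpers that mimic a constant imputation step, a min-max scaling step, and a pipeline that applies its steps in list order.

```python
def impute_zero(values):
    """Replace missing entries (None) with the constant 0.0."""
    return [0.0 if v is None else v for v in values]


def minmax_scale(values):
    """Scale observed values to the 0-1 range, leaving None untouched."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


def run_pipeline(values, steps):
    """Apply each step in list order, feeding its output to the next step."""
    for step in steps:
        values = step(values)
    return values


column = [10.0, None, 30.0]

# Same two steps, opposite order.
impute_first = run_pipeline(column, [impute_zero, minmax_scale])
scale_first = run_pipeline(column, [minmax_scale, impute_zero])

print(impute_first)
print(scale_first)
```

The two orderings give different columns. When imputation runs first, the filled-in 0.0 becomes the new minimum and shifts every scaled value. When scaling runs first, the filled-in value becomes indistinguishable from the true minimum of the column.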