Preprocessing
- class BasePreprocessor(**kwargs)
Bases: ABC
Abstract base class for all preprocessors.
All preprocessors must implement the fit and transform methods to follow the scikit-learn estimator interface pattern. This ensures consistency across all preprocessing operations in the Brisk framework.
- Parameters:
- **kwargs
Additional parameters specific to each preprocessor implementation
- Attributes:
- is_fitted : bool
Whether the preprocessor has been fitted to data
Notes
This abstract base class provides the common interface that all preprocessors must implement. It includes parameter validation, the standard fit/transform pattern, and utility methods for feature name handling and parameter export.
Examples
- Create a custom preprocessor:
>>> class CustomPreprocessor(BasePreprocessor):
...     def _validate_params(self, **kwargs):
...         # Validate parameters
...         pass
...     def fit(self, X, y=None, categorical_features=None):
...         # Fit logic
...         return self
...     def transform(self, X, y=None):
...         # Transform logic
...         return X, y
...     def export_params(self):
...         # Export parameters
...         return {}
- abstractmethod export_params() -> Dict[str, Any]
Export parameters for serialization and rerun functionality.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters in JSON-serializable format
Notes
This method should return all parameters needed to recreate the preprocessor instance, suitable for JSON serialization.
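The round trip described above can be sketched as follows. This is a minimal illustration, not Brisk's implementation: `SketchScaler` is a hypothetical class, and it assumes the exported dictionary can be passed straight back to the constructor as keyword arguments.

```python
import json


class SketchScaler:
    """Hypothetical preprocessor used only to illustrate the round trip."""

    def __init__(self, method: str = "standard") -> None:
        self.method = method

    def export_params(self) -> dict:
        # Return only plain, JSON-serializable values.
        return {"method": self.method}


original = SketchScaler(method="minmax")

# Serialize the exported parameters, e.g. to store them alongside a run.
payload = json.dumps(original.export_params())

# Recreate an equivalent instance from the stored parameters.
restored = SketchScaler(**json.loads(payload))
```

Keeping the exported values to strings, numbers, lists, and dicts is what makes the `json.dumps` call safe; objects such as estimators would need a separate representation.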
- abstractmethod fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> BasePreprocessor
Fit the preprocessor to the data.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values
- categorical_features : List[str], optional
List of categorical feature names
- Returns:
- self : BasePreprocessor
Fitted preprocessor instance
Notes
This method should fit the preprocessor to the training data and set the is_fitted flag to True upon completion.
- fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> Tuple[DataFrame, Series | None]
Fit the preprocessor and transform the data.
Convenience method that combines fit and transform operations in a single call.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values
- categorical_features : List[str], optional
List of categorical feature names
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (transformed_X, transformed_y)
Notes
This method is equivalent to calling fit() followed by transform() on the same data. It’s provided for convenience and follows the scikit-learn pattern.
- get_feature_names(feature_names: List[str]) -> List[str]
Get the feature names after preprocessing.
- Parameters:
- feature_names : List[str]
Original feature names
- Returns:
- List[str]
Feature names after preprocessing
Notes
By default, this method returns the original feature names unchanged. Subclasses should override this method if preprocessing changes the number or names of features (e.g., one-hot encoding).
- abstractmethod transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]
Transform the data using the fitted preprocessor.
- Parameters:
- X : pd.DataFrame
Features to transform
- y : pd.Series, optional
Target values to transform (if applicable)
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (transformed_X, transformed_y). The target y will be None if not provided, or transformed if the preprocessor modifies it (e.g., CategoricalEncodingPreprocessor)
- Raises:
- ValueError
If preprocessor has not been fitted
Notes
All preprocessors must return a tuple (X, y). Even if a preprocessor doesn’t transform y, it must still return the tuple with y unchanged. This allows all preprocessors to be called uniformly: X, y = preprocessor.transform(X, y)
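Because every preprocessor returns an (X, y) tuple, a sequence of them can be applied in a plain loop. The sketch below uses a stand-in base class to show the idea; `SketchPreprocessor`, `DropConstantColumns`, and `run_pipeline` are hypothetical names introduced for illustration, and the real BasePreprocessor may differ in detail.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

import pandas as pd


class SketchPreprocessor(ABC):
    """Hypothetical stand-in mirroring the documented fit/transform contract."""

    def __init__(self) -> None:
        self.is_fitted = False

    @abstractmethod
    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None,
            categorical_features: Optional[List[str]] = None):
        ...

    @abstractmethod
    def transform(self, X: pd.DataFrame, y: Optional[pd.Series] = None
                  ) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
        ...

    def fit_transform(self, X, y=None, categorical_features=None):
        # Equivalent to fit() followed by transform() on the same data.
        return self.fit(X, y, categorical_features).transform(X, y)


class DropConstantColumns(SketchPreprocessor):
    """Toy preprocessor: drops columns with a single unique value at fit time."""

    def fit(self, X, y=None, categorical_features=None):
        self.keep_ = [c for c in X.columns if X[c].nunique() > 1]
        self.is_fitted = True
        return self

    def transform(self, X, y=None):
        if not self.is_fitted:
            raise ValueError("Preprocessor must be fitted before transform")
        return X[self.keep_], y  # y is passed through unchanged


def run_pipeline(steps, X, y=None):
    # Every step returns (X, y), so the loop needs no special cases.
    for step in steps:
        X, y = step.fit_transform(X, y)
    return X, y


X = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
y = pd.Series([0, 1, 0])
X_out, y_out = run_pipeline([DropConstantColumns()], X, y)
```

The same loop works whether or not a given step modifies y, which is the point of requiring the tuple even from preprocessors that leave the target alone.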
- class MissingDataPreprocessor(strategy: str = 'drop_rows', impute_method: str = 'mean', constant_value: Any = 0, **kwargs)
Bases: BasePreprocessor
Preprocessor for handling missing values in datasets.
Provides strategies for dealing with missing data including dropping rows with missing values or imputing missing values using various statistical methods.
- Parameters:
- strategy : str, default="drop_rows"
Strategy for handling missing values: "drop_rows" or "impute"
- impute_method : str, default="mean"
Imputation method when strategy="impute": "mean", "median", "mode", or "constant"
- constant_value : Any, default=0
Constant value to use when impute_method="constant"
- Attributes:
- constant_values : dict
Dictionary mapping column names to their fitted imputation values
- is_fitted : bool
Whether the preprocessor has been fitted
Notes
The preprocessor supports two main strategies:
1. Drop rows: Remove any rows containing missing values
2. Impute: Fill missing values using statistical methods
For imputation, the method is fitted on training data and the same values are used to fill missing values in test data.
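The fit-on-train, apply-to-test behaviour can be sketched with plain pandas. This is an illustration of the idea under simple assumptions (mean imputation, numeric columns only), not the class's actual code; the variable names are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({"age": [20.0, None, 40.0], "income": [100.0, 200.0, None]})
test = pd.DataFrame({"age": [None, 50.0], "income": [None, 300.0]})

# "Fit": learn one fill value per column from the training data only.
fill_values = train.mean().to_dict()

# "Transform": reuse the training-set values for both splits, so no
# statistics are computed from the test data.
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

# The "drop_rows" strategy would instead remove incomplete rows.
train_dropped = train.dropna(axis=0)
```

Computing the fill values from the training split alone avoids leaking information from the test split into preprocessing.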
Examples
- Drop rows with missing values:
>>> preprocessor = MissingDataPreprocessor(strategy="drop_rows")
- Impute with mean values:
>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="mean"
... )
- Impute with constant value:
>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="constant", constant_value=-1
... )
- export_params() -> Dict[str, Any]
Export parameters for serialization.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters
- fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> MissingDataPreprocessor
Fit the missing data preprocessor.
Learns imputation values from the training data for each column that contains missing values.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values (not used for missing data handling)
- Returns:
- self : MissingDataPreprocessor
Fitted preprocessor
Notes
For imputation methods, the preprocessor determines the fill value for each column from the training data: the column mean, median, or mode, or the configured constant. These values are then used consistently for both training and test data.
- get_feature_names(feature_names: List[str]) -> List[str]
Get the feature names after missing data handling.
- Parameters:
- feature_names : List[str]
Original feature names
- Returns:
- List[str]
Feature names (no columns are dropped in this simplified version)
Notes
In the current implementation, no columns are dropped during missing data handling, so feature names remain unchanged.
- transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]
Transform the data by handling missing values.
- Parameters:
- X : pd.DataFrame
Features to transform
- y : pd.Series, optional
Target values (passed through unchanged)
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (transformed_X, y). The y is returned unchanged as this preprocessor only handles missing values in features
- Raises:
- ValueError
If preprocessor has not been fitted
Notes
The transformation applies the strategy learned during fit:
- For "drop_rows": removes any rows with missing values
- For "impute": fills missing values using learned imputation values
The target variable y is not modified by this preprocessor.
- class ScalingPreprocessor(method: str = 'standard', **kwargs)
Bases: BasePreprocessor
Preprocessor for scaling numerical features.
Provides various scaling methods for numerical features while preserving categorical features in their original form. Supports standard, min-max, robust, max-abs, and normalizer scaling methods.
- Parameters:
- method : str, default="standard"
Scaling method: "standard", "minmax", "robust", "maxabs", or "normalizer"
- Attributes:
- scaler : sklearn.preprocessing scaler
The fitted scaler object
- _scaled_features : list
List of feature names that were scaled during fit
- is_fitted : bool
Whether the preprocessor has been fitted
Notes
The preprocessor automatically excludes categorical features from scaling to preserve their original form. Only continuous numerical features are scaled using the specified method.
Examples
- Standard scaling:
>>> preprocessor = ScalingPreprocessor(method="standard")
- Min-max scaling:
>>> preprocessor = ScalingPreprocessor(method="minmax")
- Robust scaling:
>>> preprocessor = ScalingPreprocessor(method="robust")
- export_params() -> Dict[str, Any]
Export parameters for serialization.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters
- fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> ScalingPreprocessor
Fit the scaler to the data.
Learns scaling parameters from the training data, excluding categorical features from scaling.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values (not used for scaling)
- categorical_features : List[str], optional
List of categorical feature names to exclude from scaling
- Returns:
- self : ScalingPreprocessor
Fitted preprocessor
Notes
The scaler is fitted only on continuous features, excluding any categorical features specified in categorical_features. This ensures that categorical features remain in their original form while continuous features are properly scaled.
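A minimal sketch of scaling only the continuous columns is shown below. It computes standard scaling by hand with pandas rather than calling Brisk, so the column names and the `scaled_features` variable are assumptions for illustration; the population standard deviation (ddof=0) is used, which is the convention of scikit-learn's StandardScaler.

```python
import pandas as pd

X = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 60.0, 70.0],
    "color": [0, 1, 2],  # already-encoded categorical column
})
categorical_features = ["color"]

# Scale only the columns that are not listed as categorical.
scaled_features = [c for c in X.columns if c not in categorical_features]

# Standard scaling: subtract the mean, divide by the population std (ddof=0).
means = X[scaled_features].mean()
stds = X[scaled_features].std(ddof=0)

X_scaled = X.copy()
X_scaled[scaled_features] = (X[scaled_features] - means) / stds
```

After this step the continuous columns have zero mean and unit variance, while `color` keeps its original codes.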
- fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> Tuple[DataFrame, Series | None]
Fit the scaler and transform the data.
- Parameters:
- X : pd.DataFrame
Data to fit and transform
- y : pd.Series, optional
Target values (not used for scaling)
- categorical_features : List[str], optional
List of categorical feature names to exclude from scaling
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (scaled_X, y). The y is returned unchanged
Notes
This method combines fit and transform operations, scaling only the continuous features while preserving categorical features.
- get_feature_names(feature_names: List[str] | None = None) -> List[str]
Get the feature names after transformation.
- Parameters:
- feature_names : List[str], optional
Original feature names
- Returns:
- List[str]
Feature names after transformation (same as input)
Notes
Scaling does not change feature names, so the original names are returned unchanged.
- transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]
Transform the data using the fitted scaler.
Applies scaling to continuous features while preserving categorical features in their original form.
- Parameters:
- X : pd.DataFrame
Features to transform
- y : pd.Series, optional
Target values (passed through unchanged)
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (scaled_X, y). The y is returned unchanged as this preprocessor only scales features
Notes
Only features that were scaled during fit are transformed. Categorical features remain unchanged. The target variable y is not modified by this preprocessor.
- class CategoricalEncodingPreprocessor(method: str = 'label', cutoffs: List[float] | None = None, **kwargs)
Bases: BasePreprocessor
Preprocessor for categorical feature encoding.
Supports multiple encoding strategies including ordinal, one-hot, label, cyclic, and threshold encoding. Can encode both features and target variables based on configuration.
- Parameters:
- method : str or dict, default="label"
Encoding method: "ordinal", "onehot", "label", "cyclic", or "threshold". Alternatively, a dict mapping column names to methods, e.g. {"col1": "ordinal", "col2": "onehot"}. If the target's name matches a key in the dict, the target is encoded as well.
- cutoffs : list, optional
For threshold encoding: list of cutoff values to create bins. Example: [20, 40] creates bins: <20=0, 20-40=1, >40=2
- Attributes:
- encoders : dict
Dictionary mapping feature names to their fitted encoder objects
- target_encoder : object or None
Encoder for target variable if target name matches method dict
- is_fitted : bool
Whether the preprocessor has been fitted
Notes
The preprocessor supports various encoding strategies:
- Ordinal: Maps categories to integers
- One-hot: Creates binary columns for each category
- Label: Maps categories to integers (for single column)
- Cyclic: Creates sin/cos features for cyclical data
- Threshold: Bins continuous values into categories
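To make the threshold strategy concrete, the sketch below bins values against a list of cutoffs using the standard library. `threshold_encode` is a hypothetical helper, not a Brisk function. How Brisk treats a value exactly equal to a cutoff is not stated here, so the sketch assumes such values fall into the upper bin.

```python
from bisect import bisect_right
from typing import List, Sequence


def threshold_encode(values: Sequence[float], cutoffs: Sequence[float]) -> List[int]:
    """Map each value to the index of the bin it falls into.

    With n cutoffs there are n + 1 bins, numbered 0..n.
    Assumption: a value equal to a cutoff goes to the upper bin.
    """
    ordered = sorted(cutoffs)
    return [bisect_right(ordered, v) for v in values]


# cutoffs=[20, 40] gives three bins: below 20, between 20 and 40, above 40.
codes = threshold_encode([5, 25, 35, 90], [20, 40])
```

Adding a third cutoff, as in `cutoffs=[20, 40, 60]`, yields four bins numbered 0 to 3.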
Examples
- Label encoding for all categorical features:
>>> preprocessor = CategoricalEncodingPreprocessor(method="label")
- One-hot encoding for all categorical features:
>>> preprocessor = CategoricalEncodingPreprocessor(method="onehot")
- Mixed encoding strategies:
>>> preprocessor = CategoricalEncodingPreprocessor(
...     method={"category1": "onehot", "category2": "ordinal"}
... )
- Threshold encoding with custom cutoffs:
>>> preprocessor = CategoricalEncodingPreprocessor(
...     method="threshold", cutoffs=[20, 40, 60]
... )
- export_params() -> Dict[str, Any]
Export parameters for serialization.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters
- fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> CategoricalEncodingPreprocessor
Fit the encoders to the data.
Learns encoding parameters from the training data for each categorical feature and optionally the target variable.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values
- categorical_features : List[str], optional
List of categorical feature names to encode
- Returns:
- self : CategoricalEncodingPreprocessor
Fitted preprocessor
Notes
The method fits encoders for each categorical feature based on the specified encoding method. If the target variable name matches a key in the method dictionary, it will also be encoded.
- fit_transform(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> Tuple[DataFrame, Series | None]
Fit the encoders and transform the data.
- Parameters:
- X : pd.DataFrame
Data to fit and transform
- y : pd.Series, optional
Target values
- categorical_features : List[str], optional
List of categorical feature names to encode
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (encoded_X, encoded_y). The target y may be encoded if its name matches a key in the method dictionary
Notes
This method combines fit and transform operations. The target will be automatically encoded if its name matches a key in the method dictionary.
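The name-matching rule for the target can be sketched as below. `maybe_encode_target` is a hypothetical helper written for illustration; it always applies a simple sorted label encoding, whereas the real class presumably applies whichever method the dict specifies for that key.

```python
from typing import Dict, Optional, Tuple, Union

import pandas as pd


def maybe_encode_target(
    y: Optional[pd.Series], method: Union[str, Dict[str, str]]
) -> Tuple[Optional[pd.Series], Optional[Dict[str, int]]]:
    """Label-encode y only when its name is a key of the method dict."""
    if y is None or not isinstance(method, dict) or y.name not in method:
        return y, None  # target is passed through unchanged
    # Assumption: categories are numbered in sorted order.
    mapping = {cat: code for code, cat in enumerate(sorted(y.unique()))}
    return y.map(mapping), mapping


y = pd.Series(["cat", "dog", "cat"], name="species")

# "species" is a key of the dict, so the target is encoded.
encoded, mapping = maybe_encode_target(y, {"species": "label", "island": "onehot"})

# A plain string method names no columns, so the target is left alone.
unchanged, no_mapping = maybe_encode_target(y, "label")
```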
- get_feature_names(feature_names: List[str]) -> List[str]
Get the feature names after encoding.
- Parameters:
- feature_names : List[str]
Original feature names
- Returns:
- List[str]
Updated feature names after encoding
Notes
Feature names are updated based on the encoding method:
- One-hot encoding: Creates new features for each category
- Cyclic encoding: Creates sin and cos features
- Other methods: Preserve original feature names
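The following sketch shows how cyclic encoding produces two features and how the name list might expand. Both helpers are hypothetical. The `_sin`/`_cos` and `name_category` suffixes and the explicit `period` argument are assumptions made for the example; Brisk's actual naming and period handling are not specified here.

```python
import math
from typing import Dict, List, Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Project a cyclical value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def expand_feature_names(
    feature_names: List[str],
    methods: Dict[str, str],
    categories: Dict[str, List[str]],
) -> List[str]:
    """Return the feature names after encoding (naming scheme is assumed)."""
    out: List[str] = []
    for name in feature_names:
        method = methods.get(name)
        if method == "onehot":
            out.extend(f"{name}_{cat}" for cat in categories[name])
        elif method == "cyclic":
            out.extend([f"{name}_sin", f"{name}_cos"])
        else:
            out.append(name)  # ordinal, label, threshold keep the name
    return out


names = expand_feature_names(
    ["color", "hour", "size"],
    {"color": "onehot", "hour": "cyclic", "size": "ordinal"},
    {"color": ["blue", "red"]},
)
```

The sin/cos pair keeps neighbouring points on the cycle close together: hour 0 and hour 24 of a 24-hour period map to the same point, which a single integer column cannot express.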
- transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]
Transform features using the fitted encoders.
- Parameters:
- X : pd.DataFrame
Features to transform
- y : pd.Series, optional
Target values to transform (if applicable)
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (encoded_X, encoded_y). The target y may be encoded if its name matches a key in the method dictionary, otherwise it’s returned unchanged
- Raises:
- ValueError
If preprocessor has not been fitted
Notes
The method applies the encoders learned during fit to X and, if a target encoder was fitted, to y.
- class FeatureSelectionPreprocessor(method: str = 'selectkbest', n_features_to_select: int = 5, feature_selection_cv: int = 3, estimator: Any | None = None, algorithm_config=None, feature_selection_estimator: str | None = None, problem_type: str = 'classification', **kwargs)
Bases: BasePreprocessor
Preprocessor for feature selection methods.
Supports various feature selection algorithms including SelectKBest, RFECV, and SequentialFeatureSelector. Can use different estimators for wrapper methods.
- Parameters:
- method : str, default="selectkbest"
Feature selection method ("selectkbest", "rfecv", "sequential")
- n_features_to_select : int, default=5
Number of features to select
- feature_selection_cv : int, default=3
Number of CV folds for RFECV and SequentialFeatureSelector
- estimator : Any, optional
Direct estimator to use for RFECV and SequentialFeatureSelector
- algorithm_config : AlgorithmCollection, optional
User-provided collection of AlgorithmWrapper objects to use for feature selection
- feature_selection_estimator : str, optional
The name of the estimator to use for feature selection. If not specified, defaults to the first algorithm in the relevant wrapper list
- problem_type : str, default="classification"
The type of problem ("classification" or "regression"). Used to determine appropriate scoring function for SelectKBest.
- Attributes:
- selector : sklearn.feature_selection selector
The fitted feature selector object
- scaler : sklearn.preprocessing scaler, optional
Fitted scaler for internal use (if provided)
- is_fitted : bool
Whether the preprocessor has been fitted
Notes
The preprocessor supports three main feature selection methods:
1. SelectKBest: Selects k best features based on statistical tests
2. RFECV: Recursive feature elimination with cross-validation
3. SequentialFeatureSelector: Sequential feature selection
For wrapper methods (RFECV, SequentialFeatureSelector), an estimator must be provided either directly or through algorithm_config.
Examples
- SelectKBest for classification:
>>> preprocessor = FeatureSelectionPreprocessor(
...     method="selectkbest", n_features_to_select=10
... )
- RFECV with custom estimator:
>>> from sklearn.ensemble import RandomForestClassifier
>>> preprocessor = FeatureSelectionPreprocessor(
...     method="rfecv", estimator=RandomForestClassifier()
... )
- Sequential feature selection:
>>> preprocessor = FeatureSelectionPreprocessor(
...     method="sequential", n_features_to_select=5
... )
- export_params() -> Dict[str, Any]
Export parameters for serialization and rerun functionality.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters in JSON-serializable format
Notes
Returns all parameters needed to recreate the preprocessor instance, suitable for JSON serialization. Note that complex objects like estimators may not be directly serializable.
- fit(X: DataFrame, y: Series | None = None) -> FeatureSelectionPreprocessor
Fit the feature selector to the data.
Learns feature selection parameters from the training data using the specified feature selection method. For wrapper methods (RFECV, SequentialFeatureSelector), the target variable is required.
- Parameters:
- X : pd.DataFrame
Training data features
- y : pd.Series, optional
Target values (required for RFECV and SequentialFeatureSelector)
- Returns:
- self : FeatureSelectionPreprocessor
Fitted preprocessor
- Raises:
- ValueError
If y is required but not provided for wrapper methods
Notes
The method creates and fits the appropriate feature selector based on the specified method. If a scaler is provided, features are scaled before feature selection. The fitted selector can then be used to transform new data.
- get_feature_names(feature_names: List[str]) -> List[str]
Get the selected feature names after feature selection.
- Parameters:
- feature_names : List[str]
Original feature names
- Returns:
- List[str]
Names of selected features
Notes
Returns the names of features that were selected during fitting. If the preprocessor is not fitted or no selector is available, returns the original feature names unchanged.
- transform(X: DataFrame, y: Series | None = None) -> Tuple[DataFrame, Series | None]
Transform the data using the fitted selector.
Applies the learned feature selection to new data, returning only the selected features from the original unscaled data.
- Parameters:
- X : pd.DataFrame
Features to transform
- y : pd.Series, optional
Target values (passed through unchanged)
- Returns:
- Tuple[pd.DataFrame, Optional[pd.Series]]
Tuple containing (selected_X, y). The y is returned unchanged as this preprocessor only selects features
- Raises:
- ValueError
If preprocessor has not been fitted before transform
Notes
The method returns the selected features from the original unscaled data, not the scaled version used during fitting. This ensures that the output data maintains the original scale and meaning. The target variable y is not modified by this preprocessor.
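The select-on-scaled, return-unscaled behaviour can be sketched directly with scikit-learn. This is an illustration of the idea, not the class's internals. The toy data and the choice of StandardScaler with f_classif are assumptions made for the example.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "signal": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "noise": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],
    "weak": [10.0, 30.0, 20.0, 20.0, 30.0, 25.0],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

# Fit the selector on a scaled copy of the features.
X_scaled = StandardScaler().fit_transform(X)
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X_scaled, y)

# Use the boolean support mask to pick columns from the ORIGINAL frame,
# so the output keeps its original scale.
mask = selector.get_support()
selected = X.columns[mask].tolist()
X_selected = X[selected]
```

Indexing the original frame with the support mask means downstream steps see the raw values, and any later scaling is applied only once.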