Missing Data
- class MissingDataPreprocessor(strategy: str = 'drop_rows', impute_method: str = 'mean', constant_value: Any = 0, **kwargs)
Bases: BasePreprocessor
Preprocessor for handling missing values in datasets.
Provides strategies for dealing with missing data including dropping rows with missing values or imputing missing values using various statistical methods.
- Parameters:
- strategy : str, default="drop_rows"
Strategy for handling missing values: "drop_rows" or "impute"
- impute_method : str, default="mean"
Imputation method when strategy="impute": "mean", "median", "mode", or "constant"
- constant_value : Any, default=0
Constant value to use when impute_method="constant"
- Attributes:
- constant_values : dict
Dictionary mapping column names to their fitted imputation values
- is_fitted : bool
Whether the preprocessor has been fitted
Notes
The preprocessor supports two main strategies:
1. Drop rows: Remove any rows containing missing values
2. Impute: Fill missing values using statistical methods
For imputation, the method is fitted on training data and the same values are used to fill missing values in test data.
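The fit-on-train, apply-to-test behaviour described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not the library's implementation; the column names and data are made up for illustration.

```python
import pandas as pd

# Illustrative data; None becomes NaN in float columns.
train = pd.DataFrame({"age": [20.0, None, 40.0], "income": [10.0, 30.0, None]})
test = pd.DataFrame({"age": [None, 50.0], "income": [None, 5.0]})

# Learn the fill values from the training data only.
fill_values = train.mean()

# Reuse the same training-derived values for both splits,
# so no statistic is ever computed from the test data.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```

Here the missing test entries receive the training means (30.0 for `age`, 20.0 for `income`), regardless of what the test distribution looks like.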
Examples
- Drop rows with missing values:
>>> preprocessor = MissingDataPreprocessor(strategy="drop_rows")
- Impute with mean values:
>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="mean"
... )
- Impute with constant value:
>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="constant", constant_value=-1
... )
- export_params() -> Dict[str, Any]
Export parameters for serialization.
- Returns:
- Dict[str, Any]
Dictionary containing all parameters
- fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> MissingDataPreprocessor
Fit the missing data preprocessor.
Learns imputation values from the training data for each column that contains missing values.
- Parameters:
- X : pd.DataFrame
Training data
- y : pd.Series, optional
Target values (not used for missing data handling)
- Returns:
- self : MissingDataPreprocessor
Fitted preprocessor
Notes
For imputation methods, the preprocessor determines a fill value for each column: the mean, median, or mode is computed from the training data, while "constant" uses the configured constant_value. These values are then used consistently for both training and test data.
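One way the per-column fill values might be computed is sketched below. `learn_fill_values` is a hypothetical helper written for illustration, not part of the documented API, and the tie-breaking rule for "mode" (take the first mode) is an assumption.

```python
from typing import Any, Dict

import pandas as pd


def learn_fill_values(
    X: pd.DataFrame, method: str = "mean", constant_value: Any = 0
) -> Dict[str, Any]:
    """Hypothetical helper: one fill value per column that has missing entries."""
    values: Dict[str, Any] = {}
    # Only columns that actually contain missing values get an entry.
    for col in X.columns[X.isna().any().to_numpy()]:
        if method == "mean":
            values[col] = X[col].mean()
        elif method == "median":
            values[col] = X[col].median()
        elif method == "mode":
            # Assumption: on ties, use the first (smallest) mode.
            # Requires at least one non-missing value in the column.
            values[col] = X[col].mode().iloc[0]
        elif method == "constant":
            values[col] = constant_value
        else:
            raise ValueError(f"Unknown impute method: {method!r}")
    return values


train = pd.DataFrame(
    {"a": [1.0, None, 3.0, 8.0], "b": ["x", "x", None, "y"], "c": [1, 2, 3, 4]}
)
mode_values = learn_fill_values(train, method="mode")
```

Column `c` has no missing values, so it gets no entry. Note that "mean" and "median" only make sense for numeric columns; a mixed-type frame like `train` would need "mode" or "constant" for column `b`.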
- get_feature_names(feature_names: List[str]) -> List[str]
Get the feature names after missing data handling.
- Parameters:
- feature_names : List[str]
Original feature names
- Returns:
- List[str]
Feature names (no columns are dropped in this simplified version)
Notes
In the current implementation, no columns are dropped during missing data handling, so feature names remain unchanged.
- transform(X: DataFrame) -> DataFrame
Transform the data by handling missing values.
- Parameters:
- X : pd.DataFrame
Data to transform
- Returns:
- pd.DataFrame
Data with missing values handled
- Raises:
- ValueError
If the preprocessor has not been fitted
Notes
The transformation applies the strategy learned during fit:
- For "drop_rows": removes any rows with missing values
- For "impute": fills missing values using learned imputation values
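The two transform branches can be sketched with pandas primitives. `handle_missing` is a hypothetical stand-in for illustration only; the check on `fill_values` loosely mirrors the documented ValueError for an unfitted preprocessor, and the actual implementation may differ.

```python
from typing import Any, Dict, Optional

import pandas as pd


def handle_missing(
    X: pd.DataFrame, strategy: str, fill_values: Optional[Dict[str, Any]] = None
) -> pd.DataFrame:
    """Hypothetical sketch of the two strategies; returns a new DataFrame."""
    if strategy == "drop_rows":
        # Remove every row that has at least one missing value.
        return X.dropna(axis=0)
    if strategy == "impute":
        if fill_values is None:
            raise ValueError("Fill values must be learned before transforming")
        # Fill each column with its previously learned value.
        return X.fillna(value=fill_values)
    raise ValueError(f"Unknown strategy: {strategy!r}")


X = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
dropped = handle_missing(X, "drop_rows")
imputed = handle_missing(X, "impute", fill_values={"a": 2.0, "b": 4.5})
```

With "drop_rows" the output can have fewer rows than the input. Because pandas keeps the original index on the surviving rows, a target Series could be realigned afterwards with something like `y.loc[dropped.index]`.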