Missing Data

class MissingDataPreprocessor(strategy: str = 'drop_rows', impute_method: str = 'mean', constant_value: Any = 0, **kwargs)

Bases: BasePreprocessor

Preprocessor for handling missing values in datasets.

Provides two strategies for dealing with missing data: dropping rows that contain missing values, or imputing missing values with a per-column statistic or a constant.

Parameters:
strategy : str, default="drop_rows"

Strategy for handling missing values: "drop_rows" or "impute"

impute_method : str, default="mean"

Imputation method when strategy="impute": "mean", "median", "mode", or "constant"

constant_value : Any, default=0

Constant value to use when impute_method="constant"

Attributes:
constant_values : dict

Dictionary mapping column names to their fitted imputation values

is_fitted : bool

Whether the preprocessor has been fitted

Notes

The preprocessor supports two main strategies:

1. Drop rows: Remove any rows containing missing values
2. Impute: Fill missing values using statistical methods

For imputation, the method is fitted on training data and the same values are used to fill missing values in test data.
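The fit-then-reuse behavior described above can be sketched with plain pandas. This is a minimal illustration of the idea, not the class's actual implementation; the frames `train` and `test` and their values are made up for the example.

```python
import pandas as pd

# Hypothetical training and test frames with missing entries.
train = pd.DataFrame({"age": [20.0, None, 40.0], "income": [10.0, 30.0, None]})
test = pd.DataFrame({"age": [None, 50.0], "income": [None, 5.0]})

# "Fit": learn one fill value per column from the training data only.
fill_values = train.mean().to_dict()

# "Transform": reuse the training-derived values on both splits,
# so no statistic is ever computed from the test data.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```

Computing the statistics on the training split alone keeps information from the test split out of the fitted values.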

Examples

Drop rows with missing values:

>>> preprocessor = MissingDataPreprocessor(strategy="drop_rows")

Impute with mean values:

>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="mean"
... )

Impute with constant value:

>>> preprocessor = MissingDataPreprocessor(
...     strategy="impute", impute_method="constant", constant_value=-1
... )

export_params() -> Dict[str, Any]

Export parameters for serialization.

Returns:
Dict[str, Any]

Dictionary containing all parameters

fit(X: DataFrame, y: Series | None = None, categorical_features: List[str] | None = None) -> MissingDataPreprocessor

Fit the missing data preprocessor.

Learns imputation values from the training data for each column that contains missing values.

Parameters:
X : pd.DataFrame

Training data

y : pd.Series, optional

Target values (not used for missing data handling)

Returns:
self : MissingDataPreprocessor

Fitted preprocessor

Notes

For the statistical imputation methods (mean, median, mode), the preprocessor learns a fill value for each column from the training data; for "constant", the configured constant_value is used. The same values are then applied to both training and test data.
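As a rough sketch of what learning per-column values might look like, the helper below computes one fill value for each column that contains missing entries. The function name `learn_fill_values` is hypothetical and is not part of the documented API; the sketch assumes numeric columns for "mean" and "median" and picks the first mode when several values tie.

```python
from typing import Any, Dict

import pandas as pd


def learn_fill_values(
    X: pd.DataFrame, impute_method: str = "mean", constant_value: Any = 0
) -> Dict[str, Any]:
    """Hypothetical helper: one fill value per column that has missing entries."""
    values: Dict[str, Any] = {}
    for col in X.columns:
        if not X[col].isna().any():
            continue  # nothing to impute in this column
        if impute_method == "mean":
            values[col] = X[col].mean()
        elif impute_method == "median":
            values[col] = X[col].median()
        elif impute_method == "mode":
            modes = X[col].mode()  # sorted; may hold several values on ties
            values[col] = modes.iloc[0] if not modes.empty else constant_value
        elif impute_method == "constant":
            values[col] = constant_value
        else:
            raise ValueError(f"Unknown impute_method: {impute_method!r}")
    return values


# Made-up training data: "a" and "b" have gaps, "c" is complete.
X_train = pd.DataFrame(
    {"a": [1.0, None, 3.0, 8.0], "b": [2.0, 2.0, None, 5.0], "c": [1.0, 1.0, 1.0, 1.0]}
)
mean_values = learn_fill_values(X_train, "mean")
constant_values = learn_fill_values(X_train, "constant", constant_value=-1)
```

pandas skips missing entries by default when computing `mean`, `median`, and `mode`, so the learned values reflect only the observed data.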

get_feature_names(feature_names: List[str]) -> List[str]

Get the feature names after missing data handling.

Parameters:
feature_names : List[str]

Original feature names

Returns:
List[str]

Feature names (no columns are dropped in this simplified version)

Notes

In the current implementation, no columns are dropped during missing data handling, so feature names remain unchanged.

transform(X: DataFrame) -> DataFrame

Transform the data by handling missing values.

Parameters:
X : pd.DataFrame

Data to transform

Returns:
pd.DataFrame

Data with missing values handled

Raises:
ValueError

If preprocessor has not been fitted

Notes

The transformation applies the configured strategy:

- For "drop_rows": removes any rows with missing values
- For "impute": fills missing values using the imputation values learned during fit
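The two branches described above can be approximated with standard pandas calls. The function `apply_strategy` is a hypothetical stand-in, not the library's `transform`; in particular, whether the real method also raises for "drop_rows" when unfitted is not stated here, so this sketch only checks for fitted values in the "impute" branch.

```python
from typing import Any, Dict, Optional

import pandas as pd


def apply_strategy(
    X: pd.DataFrame, strategy: str, fill_values: Optional[Dict[str, Any]] = None
) -> pd.DataFrame:
    """Hypothetical stand-in for the documented transform step."""
    if strategy == "drop_rows":
        # Remove every row that has at least one missing value.
        return X.dropna(axis=0, how="any")
    if strategy == "impute":
        if fill_values is None:
            raise ValueError("No fitted imputation values; call fit first.")
        # Fill each column with its previously learned value.
        return X.fillna(fill_values)
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Made-up data with one gap in each column.
X = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
dropped = apply_strategy(X, "drop_rows")
imputed = apply_strategy(X, "impute", {"a": 2.0, "b": 4.5})
```

Note that dropping rows changes the number of rows while leaving the columns intact, whereas imputation preserves the shape of the input.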