DataSplitInfo

class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, filename: str, scaler: Any | None = None, features: List[str] | None = None, categorical_features: List[str] | None = None)

Bases: object

Store and analyze features and labels of training and testing splits.

This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots.

Parameters:

X_train (pd.DataFrame): The training features
X_test (pd.DataFrame): The testing features
y_train (pd.Series): The training labels
y_test (pd.Series): The testing labels
filename (str): The filename or table name of the dataset
scaler (object, optional): The scaler used for this split
features (list of str, optional): The order of input features
categorical_features (list of str, optional): List of categorical feature names

Attributes:

X_train (pd.DataFrame): The training features
X_test (pd.DataFrame): The testing features
y_train (pd.Series): The training labels
y_test (pd.Series): The testing labels
filename (str): The filename or table name of the dataset
scaler (object or None): The scaler used for this split
features (list of str or None): The order of input features
categorical_features (list of str): List of categorical features present in the training dataset
continuous_features (list of str): List of continuous features derived from the training dataset
continuous_stats (dict): Descriptive statistics for continuous features
categorical_stats (dict): Statistics for categorical features

Notes

If categorical_features is not provided, the class detects categorical features automatically. Statistics for both continuous and categorical features are calculated during initialization.
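The detection rule and the exact statistics are not specified on this page. As a rough illustration only, the sketch below assumes a simple heuristic (non-numeric and boolean columns count as categorical) and a small set of summary statistics. The helper detect_categorical and the keys in the resulting dictionaries are hypothetical and are not taken from the class itself.

```python
from typing import Dict, List

import pandas as pd


def detect_categorical(X: pd.DataFrame) -> List[str]:
    """Assumed heuristic: non-numeric and boolean columns are categorical."""
    return [
        name
        for name in X.columns
        if pd.api.types.is_bool_dtype(X[name])
        or not pd.api.types.is_numeric_dtype(X[name])
    ]


# Toy training features, invented for this example.
X_train = pd.DataFrame(
    {
        "age": [23, 35, 41, 58],
        "income": [31.5, 48.0, 52.3, 75.1],
        "region": ["north", "south", "south", "east"],
    }
)

categorical_features = detect_categorical(X_train)
continuous_features = [c for c in X_train.columns if c not in categorical_features]

# Hypothetical layout: one dictionary of summary values per continuous feature.
continuous_stats: Dict[str, Dict[str, float]] = {
    name: {
        "mean": float(X_train[name].mean()),
        "std": float(X_train[name].std()),
        "min": float(X_train[name].min()),
        "max": float(X_train[name].max()),
    }
    for name in continuous_features
}

# Hypothetical layout: category frequencies per categorical feature.
categorical_stats: Dict[str, Dict[str, int]] = {
    name: {str(k): int(v) for k, v in X_train[name].value_counts().items()}
    for name in categorical_features
}
```

Passing categorical_features explicitly to the constructor avoids relying on whatever detection rule the class actually applies.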

get_split_metadata() → Dict[str, Any]

Returns the split metadata used in certain metric calculations.

Returns: Dict[str, Any]: A dictionary containing the split metadata.

get_test() → Tuple[DataFrame, Series]

Returns the testing features and labels.

Returns: Tuple[pd.DataFrame, pd.Series]: A tuple containing the testing features and testing labels.

get_train() → Tuple[DataFrame, Series]

Returns the training features and labels.

Returns: Tuple[pd.DataFrame, pd.Series]: A tuple containing the training features and training labels.

get_train_test() → Tuple[DataFrame, DataFrame, Series, Series]

Returns both the training and testing splits.

Returns: Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: A tuple containing the training features, testing features, training labels, and testing labels.
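The return orders of the three getters differ, which is easy to get wrong when unpacking. The sketch below uses a hypothetical stand-in class, SplitStandIn, that mirrors only the documented return orders. It is not the real DataSplitInfo, whose import path and internals are not shown on this page.

```python
from typing import Tuple

import pandas as pd


class SplitStandIn:
    """Hypothetical stand-in that mirrors only the documented return orders."""

    def __init__(
        self,
        X_train: pd.DataFrame,
        X_test: pd.DataFrame,
        y_train: pd.Series,
        y_test: pd.Series,
    ) -> None:
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test

    def get_train(self) -> Tuple[pd.DataFrame, pd.Series]:
        # Features first, then labels.
        return self.X_train, self.y_train

    def get_test(self) -> Tuple[pd.DataFrame, pd.Series]:
        return self.X_test, self.y_test

    def get_train_test(
        self,
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
        # Both feature frames first, then both label series.
        return self.X_train, self.X_test, self.y_train, self.y_test


# Toy data, invented for this example: three training rows, one testing row.
X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([0, 1, 0, 1], name="label")
split = SplitStandIn(X.iloc[:3], X.iloc[3:], y.iloc[:3], y.iloc[3:])

X_train, y_train = split.get_train()
X_test, y_test = split.get_test()
X_tr, X_te, y_tr, y_te = split.get_train_test()
```

Note that get_train_test() groups the two feature frames before the two label series, so the second element is the testing features, not the training labels.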

save_distribution(dataset_dir: str) → None

Save the continuous and categorical statistics to JSON files, along with the distribution visualizations.

Args: dataset_dir (str): The directory where the statistics JSON files and visualizations will be saved.
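For readers who want to see what writing the two statistics dictionaries to JSON could look like, here is a minimal sketch using only the standard library. The function save_stats, the file names continuous_stats.json and categorical_stats.json, and the sample values are all assumptions made for illustration. The names and layout that save_distribution actually uses are not documented here, and the plotting step is omitted.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def save_stats(
    dataset_dir: str,
    continuous_stats: Dict[str, Any],
    categorical_stats: Dict[str, Any],
) -> List[str]:
    """Write each statistics dictionary to its own JSON file (hypothetical names)."""
    os.makedirs(dataset_dir, exist_ok=True)
    outputs = (
        ("continuous_stats.json", continuous_stats),
        ("categorical_stats.json", categorical_stats),
    )
    written = []
    for file_name, stats in outputs:
        path = os.path.join(dataset_dir, file_name)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(stats, handle, indent=4)
        written.append(path)
    return written


# Write into a temporary directory and read the first file back.
with tempfile.TemporaryDirectory() as tmp:
    written = save_stats(
        os.path.join(tmp, "dataset"),
        {"age": {"mean": 39.25}},
        {"region": {"south": 2, "north": 1, "east": 1}},
    )
    with open(written[0], encoding="utf-8") as handle:
        reloaded = json.load(handle)
```

The statistics must contain only JSON-serializable values, so NumPy or pandas scalars would need converting to plain Python numbers before being passed to json.dump.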
