DataSplitInfo

class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, group_index_train: Dict[str, array] | None, group_index_test: Dict[str, array] | None, filename: str, scaler: Any | None = None, features: List[str] | None = None, categorical_features: List[str] | None = None)

Bases: object

Store and analyze features and labels of training and testing splits.

This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots.

Parameters:
X_train : pd.DataFrame

The training features

X_test : pd.DataFrame

The testing features

y_train : pd.Series

The training labels

y_test : pd.Series

The testing labels

filename : str

The filename or table name of the dataset

scaler : object, optional

The scaler used for this split

features : list of str, optional

The order of input features

categorical_features : list of str, optional

List of categorical feature names

Attributes:
X_train : pd.DataFrame

The training features

X_test : pd.DataFrame

The testing features

y_train : pd.Series

The training labels

y_test : pd.Series

The testing labels

filename : str

The filename or table name of the dataset

scaler : object or None

The scaler used for this split

features : list of str or None

The order of input features

categorical_features : list of str

List of categorical features present in the training dataset

continuous_features : list of str

List of continuous features derived from the training dataset

continuous_stats : dict

Descriptive statistics for continuous features

categorical_stats : dict

Statistics for categorical features

Notes

The class automatically detects categorical features if not provided. Statistics are calculated for both continuous and categorical features during initialization.
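As a rough illustration of the behaviour described in the note, the sketch below splits columns into categorical and continuous and then computes simple statistics for each group. It is not the class's actual implementation: the detection rule (a column is categorical if any value is non-numeric), the helper names `split_feature_types` and `describe`, and the statistics chosen are all assumptions. Plain Python lists stand in for DataFrame columns so that the example runs without pandas.

```python
from collections import Counter
from statistics import mean, median, pstdev
from typing import Dict, List, Optional, Tuple


def split_feature_types(
    columns: Dict[str, list],
    categorical_features: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (categorical, continuous) column names."""
    if categorical_features is None:
        # Assumed rule: a column is categorical if any value is non-numeric.
        # bool is checked explicitly because it subclasses int in Python.
        categorical_features = [
            name
            for name, values in columns.items()
            if any(
                isinstance(v, bool) or not isinstance(v, (int, float))
                for v in values
            )
        ]
    continuous = [name for name in columns if name not in categorical_features]
    return list(categorical_features), continuous


def describe(columns, categorical, continuous):
    """Hypothetical helper: compute per-feature statistics for both kinds."""
    continuous_stats = {
        name: {
            "mean": mean(columns[name]),
            "median": median(columns[name]),
            "std": pstdev(columns[name]),
            "min": min(columns[name]),
            "max": max(columns[name]),
        }
        for name in continuous
    }
    categorical_stats = {
        name: {
            "num_unique": len(set(columns[name])),
            "frequencies": dict(Counter(columns[name])),
        }
        for name in categorical
    }
    return continuous_stats, categorical_stats


# Toy training data; no categorical list is passed, so detection runs.
X_train = {"age": [31, 45, 27, 52], "colour": ["red", "blue", "red", "green"]}
categorical, continuous = split_feature_types(X_train)
continuous_stats, categorical_stats = describe(X_train, categorical, continuous)
```

Passing an explicit `categorical_features` list to the helper skips detection, which mirrors the optional parameter documented above.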

get_split_metadata() → Dict[str, Any]

Returns the split metadata used in certain metric calculations.

Returns:

Dict[str, Any]: A dictionary containing the split metadata.

get_test() → Tuple[DataFrame, Series]

Returns the testing features and labels.

Returns:

Tuple[pd.DataFrame, pd.Series]: A tuple containing the testing features and testing labels.

get_train() → Tuple[DataFrame, Series]

Returns the training features and labels.

Returns:

Tuple[pd.DataFrame, pd.Series]: A tuple containing the training features and training labels.

get_train_test() → Tuple[DataFrame, DataFrame, Series, Series]

Returns both the training and testing splits.

Returns:

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: A tuple containing the training features, testing features, training labels, and testing labels.
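The ordering of the returned tuples is easy to get wrong when unpacking, so the sketch below mirrors the three getters with a small stand-in class. `SplitStandIn` is hypothetical and not part of the library; it reproduces only the documented return order, and plain lists take the place of DataFrame and Series objects.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class SplitStandIn:
    """Hypothetical stand-in that mimics only the documented getter order."""

    X_train: Any
    X_test: Any
    y_train: Any
    y_test: Any

    def get_train(self) -> Tuple[Any, Any]:
        # Features first, then labels.
        return self.X_train, self.y_train

    def get_test(self) -> Tuple[Any, Any]:
        # Features first, then labels.
        return self.X_test, self.y_test

    def get_train_test(self) -> Tuple[Any, Any, Any, Any]:
        # Both feature sets come before both label sets.
        return self.X_train, self.X_test, self.y_train, self.y_test


split = SplitStandIn(
    X_train=[[1.0], [2.0]], X_test=[[3.0]], y_train=[0, 1], y_test=[1]
)

# Unpack in the documented order: features (train, test), then labels.
X_train, X_test, y_train, y_test = split.get_train_test()
```

Note that `get_train_test()` groups by kind (features, then labels) rather than by split, so it is not simply `get_train()` followed by `get_test()`.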

save_distribution(dataset_dir: str) → None

Save the continuous and categorical statistics to JSON files, together with the feature distribution visualizations.

Args:

dataset_dir (str): The directory where the statistics JSON files and visualizations will be saved.
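A minimal sketch of writing two statistics dictionaries to JSON files in a target directory, assuming nothing about the real method beyond what is documented above. The function name `save_stats_sketch` and the file names `continuous_stats.json` and `categorical_stats.json` are placeholders chosen for illustration, not the names the library uses, and the visualizations are omitted.

```python
import json
import os
import tempfile


def save_stats_sketch(dataset_dir, continuous_stats, categorical_stats):
    """Hypothetical helper: dump both statistics dicts as JSON files."""
    # Create the target directory if it does not exist yet.
    os.makedirs(dataset_dir, exist_ok=True)
    paths = {}
    for label, stats in (
        ("continuous", continuous_stats),
        ("categorical", categorical_stats),
    ):
        # Placeholder file names; the real method may use different ones.
        path = os.path.join(dataset_dir, f"{label}_stats.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(stats, handle, indent=2)
        paths[label] = path
    return paths


# Write to a temporary directory and read one file back as a round-trip check.
with tempfile.TemporaryDirectory() as tmp:
    written = save_stats_sketch(
        os.path.join(tmp, "my_dataset"),
        {"age": {"mean": 38.75}},
        {"colour": {"num_unique": 3}},
    )
    with open(written["continuous"], encoding="utf-8") as handle:
        reloaded = json.load(handle)
```

Statistics produced by pandas or NumPy often contain types such as `numpy.float64` that `json.dump` cannot serialize directly, so a real implementation would likely convert values to built-in Python types first.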