DataSplitInfo
- class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, filename: str, scaler: Any | None = None, features: List[str] | None = None, categorical_features: List[str] | None = None)
Bases: object
Store and analyze features and labels of training and testing splits.
This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots.
- Parameters:
- X_train : pd.DataFrame
The training features
- X_test : pd.DataFrame
The testing features
- y_train : pd.Series
The training labels
- y_test : pd.Series
The testing labels
- filename : str
The filename or table name of the dataset
- scaler : object, optional
The scaler used for this split
- features : list of str, optional
The order of input features
- categorical_features : list of str, optional
List of categorical feature names
- Attributes:
- X_train : pd.DataFrame
The training features
- X_test : pd.DataFrame
The testing features
- y_train : pd.Series
The training labels
- y_test : pd.Series
The testing labels
- filename : str
The filename or table name of the dataset
- scaler : object or None
The scaler used for this split
- features : list of str or None
The order of input features
- categorical_features : list of str
List of categorical features present in the training dataset
- continuous_features : list of str
List of continuous features derived from the training dataset
- continuous_stats : dict
Descriptive statistics for continuous features
- categorical_stats : dict
Statistics for categorical features
Notes
If categorical_features is not provided, the class detects categorical features automatically. Statistics for both continuous and categorical features are calculated during initialization.
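The detection rule and the exact statistics are not specified here, so the following is only a minimal sketch of one plausible approach. The helpers `detect_categorical` and `summarize` are hypothetical names, and both the dtype-based rule and the chosen statistics are assumptions rather than the class's actual behavior.

```python
import pandas as pd


def detect_categorical(X: pd.DataFrame) -> list:
    """Assumed heuristic: treat non-numeric and boolean columns as categorical."""
    return [
        col for col in X.columns
        if not pd.api.types.is_numeric_dtype(X[col])
        or pd.api.types.is_bool_dtype(X[col])
    ]


def summarize(X: pd.DataFrame, categorical: list):
    """Build plain-dict statistics for continuous and categorical columns."""
    continuous = [col for col in X.columns if col not in categorical]
    continuous_stats = {
        col: {
            "mean": float(X[col].mean()),
            "std": float(X[col].std()),
            "min": float(X[col].min()),
            "max": float(X[col].max()),
        }
        for col in continuous
    }
    categorical_stats = {
        col: {
            "num_unique": int(X[col].nunique()),
            "frequencies": {
                str(value): int(count)
                for value, count in X[col].value_counts().items()
            },
        }
        for col in categorical
    }
    return continuous_stats, categorical_stats


# Illustrative training features (made-up values).
X_train = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 52.5, 61.0, 58.5],
    "region": ["north", "south", "north", "east"],
})

categorical = detect_categorical(X_train)
continuous_stats, categorical_stats = summarize(X_train, categorical)
```

In this sketch the statistics come from the training features only, in line with the attribute descriptions, which tie the feature lists to the training dataset.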
- get_split_metadata() → Dict[str, Any]
Returns the split metadata used in certain metric calculations.
- Returns:
Dict[str, Any]: A dictionary containing the split metadata.
- get_test() → Tuple[DataFrame, Series]
Returns the testing features and labels.
- Returns:
Tuple[pd.DataFrame, pd.Series]: A tuple containing the testing features and testing labels.
- get_train() → Tuple[DataFrame, Series]
Returns the training features and labels.
- Returns:
Tuple[pd.DataFrame, pd.Series]: A tuple containing the training features and training labels.
- get_train_test() → Tuple[DataFrame, DataFrame, Series, Series]
Returns both the training and testing splits.
- Returns:
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: A tuple containing the training features, testing features, training labels, and testing labels.
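The return order of the three accessors is easy to mix up when unpacking, so a small sketch may help. `SplitSketch` is a hypothetical stand-in that mirrors only the documented return shapes. It is not DataSplitInfo itself and omits the statistics, scaler, and metadata handling.

```python
from dataclasses import dataclass
from typing import Tuple

import pandas as pd


@dataclass
class SplitSketch:
    """Hypothetical stand-in mirroring the documented accessor signatures."""
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series

    def get_train(self) -> Tuple[pd.DataFrame, pd.Series]:
        # Features first, then labels.
        return self.X_train, self.y_train

    def get_test(self) -> Tuple[pd.DataFrame, pd.Series]:
        return self.X_test, self.y_test

    def get_train_test(self) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
        # Both feature frames come before both label series.
        return self.X_train, self.X_test, self.y_train, self.y_test


# Illustrative data (made-up values).
split = SplitSketch(
    X_train=pd.DataFrame({"x": [1.0, 2.0, 3.0]}),
    X_test=pd.DataFrame({"x": [4.0]}),
    y_train=pd.Series([0, 1, 0]),
    y_test=pd.Series([1]),
)

X_train, X_test, y_train, y_test = split.get_train_test()
X_tr, y_tr = split.get_train()
X_te, y_te = split.get_test()
```

Note that get_train_test() groups by kind (features, features, labels, labels) rather than by split, so unpacking it as `X_train, y_train, X_test, y_test` would silently swap a DataFrame and a Series.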
- save_distribution(dataset_dir: str) → None
Save the continuous and categorical statistics to JSON files.
- Args:
dataset_dir (str): The directory where the statistics JSON files and visualizations will be saved.
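As a rough illustration of writing two statistics dictionaries to JSON files in a target directory, consider the sketch below. The helper `save_stats` and the file names `continuous_stats.json` and `categorical_stats.json` are assumptions made for this example. The actual file names, layout, and plot output of save_distribution are not documented here.

```python
import json
import os
import tempfile


def save_stats(dataset_dir: str, continuous_stats: dict, categorical_stats: dict) -> dict:
    """Write each statistics dict to its own JSON file (hypothetical file names)."""
    os.makedirs(dataset_dir, exist_ok=True)
    paths = {}
    for name, stats in (
        ("continuous_stats", continuous_stats),
        ("categorical_stats", categorical_stats),
    ):
        path = os.path.join(dataset_dir, f"{name}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(stats, f, indent=2)
        paths[name] = path
    return paths


# Illustrative statistics (made-up values) written to a temporary directory.
demo_dir = tempfile.mkdtemp()
saved = save_stats(
    demo_dir,
    {"age": {"mean": 38.75}},
    {"region": {"north": 2}},
)

# Read one file back to confirm the round trip.
with open(saved["continuous_stats"], encoding="utf-8") as f:
    reloaded = json.load(f)
```

Values taken from pandas or NumPy (for example `numpy.float64` counts or means) generally need converting to built-in `float` or `int` before `json.dump` will accept them.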