DataSplitInfo
- class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, group_index_train: Dict[str, array] | None, group_index_test: Dict[str, array] | None, split_key: SplitKey, split_index: int, scaler: Any | None = None, categorical_features: List[str] | None = None, continuous_features: List[str] | None = None)
Bases: object
Store and analyze features and labels of training and testing splits.
This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots. It handles data scaling, feature categorization, and statistical analysis automatically.
- Parameters:
- X_train : pd.DataFrame
The training features
- X_test : pd.DataFrame
The testing features
- y_train : pd.Series
The training labels
- y_test : pd.Series
The testing labels
- group_index_train : Dict[str, np.array] or None
Index of the groups for the training split
- group_index_test : Dict[str, np.array] or None
Index of the groups for the testing split
- split_key : Tuple[str, str, str]
The split key (group_name, dataset_name, table_name)
- split_index : int
The split index in DataSplits container
- scaler : object, optional
The fitted scaler used for this split, by default None
- categorical_features : List[str], optional
List of categorical feature names, by default None
- continuous_features : List[str], optional
List of continuous feature names, by default None
- Attributes:
- group_name : str
The name of the experiment group
- dataset_name : str
The name of the dataset
- table_name : str
The name of the table
- features : List[str]
The order of input features
- split_index : int
The split index in DataSplits container
- services : ServiceBundle
The global services bundle
- X_train : pd.DataFrame
The training features
- X_test : pd.DataFrame
The testing features
- y_train : pd.Series
The training labels
- y_test : pd.Series
The testing labels
- group_index_train : Dict[str, np.array] or None
Index of the groups for the training split
- group_index_test : Dict[str, np.array] or None
Index of the groups for the testing split
- registry : EvaluatorRegistry
The evaluator registry with evaluators for datasets
- categorical_features : List[str]
List of categorical features present in the training dataset
- continuous_features : List[str]
List of continuous features derived from the training dataset
- scaler : object or None
The scaler used for this split
Notes
The class automatically detects categorical features if not provided. Statistics are calculated for both continuous and categorical features during initialization. The class also handles data scaling when a scaler is provided, ensuring that only continuous features are scaled while preserving categorical features in their original form.
Examples
- Create a basic data split info:
>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0
... )
- Create with specific feature types:
>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0,
...     categorical_features=["category1", "category2"],
...     continuous_features=["feature1", "feature2"]
... )
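The Notes above say that categorical features are detected automatically when they are not provided, but they do not describe the detection rule. The following is a minimal sketch of one plausible dtype-based heuristic, not the rule the class itself uses. The function name `detect_categorical_sketch` and the sample column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def detect_categorical_sketch(X: pd.DataFrame) -> list:
    """Hypothetical heuristic: boolean and non-numeric columns are categorical.

    This is an assumption for illustration; DataSplitInfo may apply a
    different rule.
    """
    categorical = []
    for name in X.columns:
        column = X[name]
        # Booleans count as numeric in pandas, so check them first.
        if is_bool_dtype(column) or not is_numeric_dtype(column):
            categorical.append(name)
    return categorical


X = pd.DataFrame(
    {
        "age": [31, 45, 27],
        "city": ["Oslo", "Lima", "Oslo"],
        "is_member": [True, False, True],
    }
)

categorical = detect_categorical_sketch(X)
# Everything that is not categorical is treated as continuous here.
continuous = [name for name in X.columns if name not in categorical]
```

Passing `categorical_features` and `continuous_features` explicitly, as in the second example above, avoids relying on any heuristic.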
- evaluate_data_split() -> None
Evaluate distribution of features in the train and test splits.
This method calculates descriptive statistics for both continuous and categorical features in the training and testing splits. It also generates plots including histograms, boxplots, bar plots, and correlation matrices.
The method uses the evaluator registry to get the appropriate evaluators for the dataset and then calls the evaluate method for each evaluator.
Notes
The evaluation process includes:
1. Setting up the reporting context
2. Calculating statistics for continuous features
3. Calculating statistics for categorical features
4. Generating histogram and box plots for continuous features
5. Generating bar plots for categorical features
6. Creating correlation matrices for continuous features
7. Clearing the reporting context
All plots and statistics are saved to the configured output directory.
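The statistics steps above can be approximated with plain pandas. This is a rough sketch of the kind of train/test comparison involved, assuming summary statistics for continuous features and value proportions for categorical ones. It does not use the evaluator registry or the reporting context, and the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy splits; real splits come from the DataSplitInfo instance.
X_train = pd.DataFrame(
    {"height": [1.0, 2.0, 3.0, 4.0], "color": ["red", "blue", "red", "red"]}
)
X_test = pd.DataFrame({"height": [2.0, 4.0], "color": ["blue", "red"]})

continuous_features = ["height"]
categorical_features = ["color"]

# Summary statistics for continuous features, train and test side by side.
continuous_stats = pd.concat(
    {
        "train": X_train[continuous_features].describe(),
        "test": X_test[continuous_features].describe(),
    },
    axis=1,
)

# Category proportions for each categorical feature in both splits.
categorical_stats = {
    name: pd.DataFrame(
        {
            "train": X_train[name].value_counts(normalize=True),
            "test": X_test[name].value_counts(normalize=True),
        }
    ).fillna(0.0)
    for name in categorical_features
}

# Correlation matrix over the continuous features of the training split.
train_correlation = X_train[continuous_features].corr()
```

Large differences between the train and test columns of these tables are the kind of distribution shift the generated plots are meant to make visible.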
- get_split_metadata() -> Dict[str, Any]
Return the split metadata used in certain metric calculations.
Provides metadata about the data split that can be used for metric calculations and reporting purposes.
- Returns:
- Dict[str, Any]
A dictionary containing the split metadata with keys:
- num_features: Number of features in the dataset
- num_samples: Total number of samples (train + test)
Examples
- Get split metadata:
>>> metadata = data_info.get_split_metadata()
>>> print(f"Features: {metadata['num_features']}")
>>> print(f"Samples: {metadata['num_samples']}")
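To make the two documented keys concrete, here is a small sketch of how such a dictionary could be derived from the splits. It assumes the values come directly from the shapes of the feature frames; the helper name `split_metadata_sketch` is hypothetical and is not part of the class.

```python
import pandas as pd


def split_metadata_sketch(X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    """Hypothetical helper mirroring the documented metadata keys."""
    return {
        # Number of feature columns (assumed identical in both splits).
        "num_features": X_train.shape[1],
        # Total rows across the training and testing splits.
        "num_samples": len(X_train) + len(X_test),
    }


X_train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.1, 0.2, 0.3]})
X_test = pd.DataFrame({"f1": [4.0], "f2": [0.4]})

metadata = split_metadata_sketch(X_train, X_test)
```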
- get_test() -> Tuple[DataFrame, Series]
Return the testing features and labels.
Returns the testing data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.
- Returns:
- Tuple[pd.DataFrame, pd.Series]
A tuple containing the testing features and testing labels. Features are scaled if a scaler is available and continuous features are present.
Notes
If a scaler is available and continuous features exist:
1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved
If no scaler is available, the original data is returned unchanged.
Examples
- Get scaled testing data:
>>> X_test, y_test = data_info.get_test()
>>> print(X_test.shape)  # (n_samples, n_features)
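The scaling steps in the Notes can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: `MeanStdScaler` is a hypothetical stand-in for any fitted scaler exposing `transform`, and `scale_continuous` is a hypothetical helper name.

```python
import pandas as pd


class MeanStdScaler:
    """Hypothetical stand-in for a fitted scaler with fit / transform."""

    def fit(self, X: pd.DataFrame) -> "MeanStdScaler":
        self.mean_ = X.mean()
        self.std_ = X.std(ddof=0)
        return self

    def transform(self, X: pd.DataFrame):
        # Return a plain array, as many scaler implementations do.
        return ((X - self.mean_) / self.std_).to_numpy()


def scale_continuous(X, scaler, categorical_features, continuous_features):
    """Scale only continuous columns and restore the original column order."""
    if scaler is None or not continuous_features:
        # No scaler or nothing to scale: return the data unchanged.
        return X
    categorical = X[categorical_features]
    scaled = pd.DataFrame(
        scaler.transform(X[continuous_features]),
        columns=continuous_features,
        index=X.index,
    )
    combined = pd.concat([categorical, scaled], axis=1)
    return combined[list(X.columns)]


X_train = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0], "cat": ["a", "b", "a"], "f2": [10.0, 20.0, 30.0]}
)
X_test = pd.DataFrame({"f1": [2.0, 4.0], "cat": ["b", "a"], "f2": [20.0, 40.0]})

# Fit on the training split only, then apply to the testing split.
scaler = MeanStdScaler().fit(X_train[["f1", "f2"]])
X_test_scaled = scale_continuous(X_test, scaler, ["cat"], ["f1", "f2"])
X_test_unscaled = scale_continuous(X_test, None, ["cat"], ["f1", "f2"])
```

Fitting the scaler on the training split and reusing it for the testing split keeps test-set statistics from leaking into the transformation.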
- get_train() -> Tuple[DataFrame, Series]
Return the training features and labels.
Returns the training data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.
- Returns:
- Tuple[pd.DataFrame, pd.Series]
A tuple containing the training features and training labels. Features are scaled if a scaler is available and continuous features are present.
Notes
If a scaler is available and continuous features exist:
1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved
If no scaler is available, the original data is returned unchanged.
Examples
- Get scaled training data:
>>> X_train, y_train = data_info.get_train()
- get_train_test() -> Tuple[DataFrame, DataFrame, Series, Series]
Return both the training and testing splits.
Convenience method that returns both training and testing data in a single call. Data is scaled if a scaler is available.
- Returns:
- Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
A tuple containing the training features, testing features, training labels, and testing labels.
Notes
This method is equivalent to calling get_train() and get_test() separately, but provides a more convenient interface when both splits are needed.
Examples
- Get both training and testing data:
>>> X_train, X_test, y_train, y_test = data_info.get_train_test()
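The return order is worth noting: features come first, then labels, rather than a train pair followed by a test pair. The sketch below illustrates the documented equivalence with `get_train()` and `get_test()` using a hypothetical stand-in class, `SplitSketch`, which stores the splits and applies no scaling.

```python
import pandas as pd


class SplitSketch:
    """Hypothetical stand-in for DataSplitInfo that applies no scaling."""

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def get_train(self):
        return self.X_train, self.y_train

    def get_test(self):
        return self.X_test, self.y_test

    def get_train_test(self):
        # Reuse the single-split getters, then regroup as features, labels.
        X_train, y_train = self.get_train()
        X_test, y_test = self.get_test()
        return X_train, X_test, y_train, y_test


split = SplitSketch(
    pd.DataFrame({"f1": [1.0, 2.0]}),
    pd.DataFrame({"f1": [3.0]}),
    pd.Series([0, 1]),
    pd.Series([1]),
)

X_train, X_test, y_train, y_test = split.get_train_test()
```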
- set_services(services: ServiceBundle | None = None)