DataSplitInfo

class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, group_index_train: Dict[str, array] | None, group_index_test: Dict[str, array] | None, split_key: Tuple[str, str, str], split_index: int, scaler: Any | None = None, categorical_features: List[str] | None = None, continuous_features: List[str] | None = None)

Bases: object

Store and analyze features and labels of training and testing splits.

This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots. It handles data scaling, feature categorization, and statistical analysis automatically.

Parameters:
X_train : pd.DataFrame

The training features

X_test : pd.DataFrame

The testing features

y_train : pd.Series

The training labels

y_test : pd.Series

The testing labels

group_index_train : Dict[str, np.array] or None

Index of the groups for the training split

group_index_test : Dict[str, np.array] or None

Index of the groups for the testing split

split_key : Tuple[str, str, str]

The split key (group_name, dataset_name, table_name)

split_index : int

The split index in DataSplits container

scaler : object, optional

The fitted scaler used for this split, by default None

categorical_features : List[str], optional

List of categorical feature names, by default None

continuous_features : List[str], optional

List of continuous feature names, by default None

Attributes:
group_name : str

The name of the experiment group

dataset_name : str

The name of the dataset

table_name : str

The name of the table

features : List[str]

The order of input features

split_index : int

The split index in DataSplits container

services : ServiceBundle

The global services bundle

X_train : pd.DataFrame

The training features

X_test : pd.DataFrame

The testing features

y_train : pd.Series

The training labels

y_test : pd.Series

The testing labels

group_index_train : Dict[str, np.array] or None

Index of the groups for the training split

group_index_test : Dict[str, np.array] or None

Index of the groups for the testing split

registry : EvaluatorRegistry

The evaluator registry with evaluators for datasets

categorical_features : List[str]

List of categorical features present in the training dataset

continuous_features : List[str]

List of continuous features derived from the training dataset

scaler : object or None

The scaler used for this split

Notes

The class automatically detects categorical features if not provided. Statistics are calculated for both continuous and categorical features during initialization. The class also handles data scaling when a scaler is provided, ensuring that only continuous features are scaled while preserving categorical features in their original form.
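
The detection rule itself is not described here. As a rough sketch only, one common heuristic treats non-numeric and boolean columns as categorical and everything else as continuous. The helper name `infer_feature_types` is hypothetical and is not part of this class; the actual detection logic may differ.

```python
from typing import List, Tuple

import pandas as pd


def infer_feature_types(X: pd.DataFrame) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split column names into categorical and continuous."""
    categorical = [
        col for col in X.columns
        # Assumption: non-numeric or boolean columns count as categorical.
        if not pd.api.types.is_numeric_dtype(X[col])
        or pd.api.types.is_bool_dtype(X[col])
    ]
    continuous = [col for col in X.columns if col not in categorical]
    return categorical, continuous


X_train = pd.DataFrame({
    "feature1": [0.5, 1.2, 3.4],
    "feature2": [10, 20, 30],
    "category1": ["a", "b", "a"],
})
categorical, continuous = infer_feature_types(X_train)
```

Passing `categorical_features` and `continuous_features` explicitly avoids relying on any such heuristic.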

Examples

Create a basic data split info:

>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0
... )

Create with specific feature types:

>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0,
...     categorical_features=["category1", "category2"],
...     continuous_features=["feature1", "feature2"]
... )
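
The examples above assume that X_train, X_test, y_train, and y_test already exist. A minimal sketch of preparing them with pandas follows; the data values, the "label" column name, and the 75/25 shuffled split are illustrative assumptions rather than anything this class requires.

```python
import pandas as pd

# Illustrative data; the feature names mirror those used in the example above.
df = pd.DataFrame({
    "feature1": [0.1, 0.4, 0.9, 1.3, 2.2, 2.8, 3.1, 4.0],
    "feature2": [5.0, 4.2, 3.9, 3.1, 2.5, 2.0, 1.4, 0.7],
    "category1": ["a", "b", "a", "b", "a", "a", "b", "b"],
    "category2": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "label": [0, 1, 0, 1, 0, 0, 1, 1],
})

# Shuffle reproducibly, then take the first 75% of rows for training.
shuffled = df.sample(frac=1.0, random_state=0)
n_train = int(len(shuffled) * 0.75)
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# Separate features (DataFrame) from labels (Series) for each split.
X_train, y_train = train.drop(columns="label"), train["label"]
X_test, y_test = test.drop(columns="label"), test["label"]
```
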

evaluate_data_split() → None

Evaluate distribution of features in the train and test splits.

This method calculates descriptive statistics for both continuous and categorical features in the training and testing splits. It also generates plots including histograms, boxplots, pie plots, and correlation matrices.

The method uses the evaluator registry to get the appropriate evaluators for the dataset and then calls the evaluate method for each evaluator.

Notes

The evaluation process includes:

1. Setting up the reporting context
2. Calculating statistics for continuous features
3. Calculating statistics for categorical features
4. Generating histogram and box plots for continuous features
5. Generating bar plots for categorical features
6. Creating correlation matrices for continuous features
7. Clearing the reporting context

All plots and statistics are saved to the configured output directory.
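
To give a feel for the statistics involved, the sketch below computes comparable summaries with plain pandas. It does not use the evaluator registry or the reporting context, and it skips plotting. The choice of `describe`, `value_counts`, and `corr` is an assumption about what such statistics might look like, not the evaluators' actual implementation.

```python
import pandas as pd

X_train = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [2.0, 4.0, 6.0, 8.0],
    "category1": ["a", "b", "a", "a"],
})
continuous_features = ["feature1", "feature2"]
categorical_features = ["category1"]

# Continuous features: count, mean, std, min, quartiles, and max per column.
continuous_stats = X_train[continuous_features].describe()

# Categorical features: relative frequency of each level per column.
categorical_stats = {
    col: X_train[col].value_counts(normalize=True)
    for col in categorical_features
}

# Pearson correlation matrix across the continuous features.
correlation = X_train[continuous_features].corr()
```

Running the same computations on X_test gives the matching summaries for the testing split.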

get_split_metadata() → Dict[str, Any]

Return the split metadata used in certain metric calculations.

Provides metadata about the data split that can be used for metric calculations and reporting purposes.

Returns:
Dict[str, Any]

A dictionary containing the split metadata with keys:

- num_features: Number of features in the dataset
- num_samples: Total number of samples (train + test)

Examples

Get split metadata:

>>> metadata = data_info.get_split_metadata()
>>> print(f"Features: {metadata['num_features']}")
>>> print(f"Samples: {metadata['num_samples']}")
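
For reference, the two documented keys can be derived from the split DataFrames roughly as follows. This is a sketch under the assumption that the feature count comes from the training columns and the sample count from the combined row counts; the method's internal computation is not shown here.

```python
import pandas as pd

X_train = pd.DataFrame({"feature1": [0.1, 0.2, 0.3], "feature2": [1, 2, 3]})
X_test = pd.DataFrame({"feature1": [0.4], "feature2": [4]})

# Assumed computation: columns of the training frame give the feature count,
# and training rows plus testing rows give the total sample count.
metadata = {
    "num_features": X_train.shape[1],
    "num_samples": len(X_train) + len(X_test),
}
```
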

get_test() → Tuple[DataFrame, Series]

Return the testing features and labels.

Returns the testing data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.

Returns:
Tuple[pd.DataFrame, pd.Series]

A tuple containing the testing features and testing labels. Features are scaled if a scaler is available and continuous features are present.

Notes

If a scaler is available and continuous features exist:

1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved

If no scaler is available, the original data is returned unchanged.

Examples

Get scaled testing data:

>>> X_test, y_test = data_info.get_test()
>>> print(X_test.shape)  # (n_samples, n_features)
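
The scaling steps described in the Notes can be sketched as below. `MinMaxStandIn` and `scale_continuous` are hypothetical names introduced for illustration. The sketch assumes only that the fitted scaler exposes a `transform` method, and it is not the class's own code.

```python
import pandas as pd


class MinMaxStandIn:
    """Hypothetical stand-in for a fitted scaler exposing `transform`."""

    def __init__(self, minimum: pd.Series, maximum: pd.Series):
        self.minimum = minimum
        self.maximum = maximum

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return (X - self.minimum) / (self.maximum - self.minimum)


def scale_continuous(X, scaler, categorical_features, continuous_features):
    """Scale only the continuous columns and keep the original column order."""
    if scaler is None or not continuous_features:
        return X  # nothing to scale: return the data unchanged
    categorical_part = X[categorical_features]
    scaled = scaler.transform(X[continuous_features])
    continuous_part = pd.DataFrame(
        scaled, columns=continuous_features, index=X.index
    )
    combined = pd.concat([categorical_part, continuous_part], axis=1)
    return combined[list(X.columns)]  # restore the original column order


continuous = ["feature1", "feature2"]
X_train = pd.DataFrame({
    "feature1": [0.0, 5.0, 10.0],
    "category1": ["a", "b", "a"],
    "feature2": [100.0, 150.0, 200.0],
})
X_test = pd.DataFrame({
    "feature1": [2.5], "category1": ["b"], "feature2": [175.0],
})

# The stand-in scaler is "fitted" on the training split only.
scaler = MinMaxStandIn(X_train[continuous].min(), X_train[continuous].max())
X_test_scaled = scale_continuous(X_test, scaler, ["category1"], continuous)
```

The testing split is transformed with parameters taken from the training split, which keeps information from the test data out of the scaling.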

get_train() → Tuple[DataFrame, Series]

Return the training features and labels.

Returns the training data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.

Returns:
Tuple[pd.DataFrame, pd.Series]

A tuple containing the training features and training labels. Features are scaled if a scaler is available and continuous features are present.

Notes

If a scaler is available and continuous features exist:

1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved

If no scaler is available, the original data is returned unchanged.

Examples

Get scaled training data:

>>> X_train, y_train = data_info.get_train()

get_train_test() → Tuple[DataFrame, DataFrame, Series, Series]

Return both the training and testing splits.

Convenience method that returns both training and testing data in a single call. Data is scaled if a scaler is available.

Returns:
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]

A tuple containing the training features, testing features, training labels, and testing labels.

Notes

This method is equivalent to calling get_train() and get_test() separately, but provides a more convenient interface when both splits are needed.

Examples

Get both training and testing data:

>>> X_train, X_test, y_train, y_test = data_info.get_train_test()
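
The return order deserves attention: both feature frames come first, followed by both label series, which is not the same as concatenating the results of get_train() and get_test(). The sketch below illustrates this with `SplitStandIn`, a hypothetical stand-in class that omits scaling and everything else DataSplitInfo does.

```python
import pandas as pd


class SplitStandIn:
    """Hypothetical stand-in exposing the three accessors described above."""

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def get_train(self):
        return self.X_train, self.y_train

    def get_test(self):
        return self.X_test, self.y_test

    def get_train_test(self):
        X_train, y_train = self.get_train()
        X_test, y_test = self.get_test()
        # Features first, then labels, matching the documented return order.
        return X_train, X_test, y_train, y_test


split = SplitStandIn(
    pd.DataFrame({"feature1": [1.0, 2.0]}),
    pd.DataFrame({"feature1": [3.0]}),
    pd.Series([0, 1]),
    pd.Series([1]),
)
X_train, X_test, y_train, y_test = split.get_train_test()
```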