DataSplitInfo

class DataSplitInfo(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, group_index_train: Dict[str, array] | None, group_index_test: Dict[str, array] | None, split_key: Tuple[str, str, str], split_index: int, scaler: Any | None = None, categorical_features: List[str] | None = None, continuous_features: List[str] | None = None)

Bases: object

Store and analyze features and labels of training and testing splits.

This class provides methods for calculating descriptive statistics for both continuous and categorical features, as well as visualizing the distributions of these features through various plots. It handles data scaling, feature categorization, and statistical analysis automatically.

Parameters:

X_train (pd.DataFrame): The training features
X_test (pd.DataFrame): The testing features
y_train (pd.Series): The training labels
y_test (pd.Series): The testing labels
group_index_train (Dict[str, np.array] or None): Index of the groups for the training split
group_index_test (Dict[str, np.array] or None): Index of the groups for the testing split
split_key (Tuple[str, str, str]): The split key (group_name, dataset_name, table_name)
split_index (int): The split index in the DataSplits container
scaler (object, optional): The fitted scaler used for this split, by default None
categorical_features (List[str], optional): List of categorical feature names, by default None
continuous_features (List[str], optional): List of continuous feature names, by default None

Attributes:

group_name (str): The name of the experiment group
dataset_name (str): The name of the dataset
table_name (str): The name of the table
features (List[str]): The order of input features
split_index (int): The split index in the DataSplits container
services (ServiceBundle): The global services bundle
X_train (pd.DataFrame): The training features
X_test (pd.DataFrame): The testing features
y_train (pd.Series): The training labels
y_test (pd.Series): The testing labels
group_index_train (Dict[str, np.array] or None): Index of the groups for the training split
group_index_test (Dict[str, np.array] or None): Index of the groups for the testing split
registry (EvaluatorRegistry): The evaluator registry with evaluators for datasets
categorical_features (List[str]): List of categorical features present in the training dataset
continuous_features (List[str]): List of continuous features derived from the training dataset
scaler (object or None): The scaler used for this split

Notes

The class automatically detects categorical features if not provided. Statistics are calculated for both continuous and categorical features during initialization. The class also handles data scaling when a scaler is provided, ensuring that only continuous features are scaled while preserving categorical features in their original form.
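The detection rule itself is not specified on this page. As a rough illustration only, the sketch below assumes a dtype-based heuristic: numeric, non-boolean columns are treated as continuous and everything else as categorical. The helper name detect_feature_types is hypothetical and is not part of DataSplitInfo.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def detect_feature_types(X: pd.DataFrame):
    """Hypothetical helper: split columns by dtype (an assumed heuristic)."""
    # Assumption: numeric, non-boolean columns are continuous.
    continuous = [
        col for col in X.columns
        if is_numeric_dtype(X[col]) and not is_bool_dtype(X[col])
    ]
    # Assumption: every remaining column is categorical.
    categorical = [col for col in X.columns if col not in continuous]
    return categorical, continuous


X_train = pd.DataFrame({
    "age": [34.0, 51.0, 27.0],
    "color": ["red", "blue", "red"],
    "income": [52000.0, 61000.0, 48000.0],
})
categorical, continuous = detect_feature_types(X_train)
```

Passing categorical_features and continuous_features explicitly avoids depending on whatever heuristic the class actually applies.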

Examples

Create a basic data split info:

>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0
... )

Create with specific feature types:

>>> data_info = DataSplitInfo(
...     X_train, X_test, y_train, y_test,
...     group_index_train=None, group_index_test=None,
...     split_key=("group1", "dataset.csv", None),
...     split_index=0,
...     categorical_features=["category1", "category2"],
...     continuous_features=["feature1", "feature2"]
... )

evaluate_data_split() → None

Evaluate distribution of features in the train and test splits.

This method calculates descriptive statistics for both continuous and categorical features in the training and testing splits. It also generates plots including histograms, boxplots, pie plots, and correlation matrices.

The method uses the evaluator registry to get the appropriate evaluators for the dataset and then calls the evaluate method for each evaluator.

Notes

The evaluation process includes:

1. Setting up the reporting context
2. Calculating statistics for continuous features
3. Calculating statistics for categorical features
4. Generating histogram and box plots for continuous features
5. Generating bar plots for categorical features
6. Creating correlation matrices for continuous features
7. Clearing the reporting context

All plots and statistics are saved to the configured output directory.
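The statistics steps can be approximated with plain pandas. The sketch below is not the evaluators' implementation. It assumes that summary statistics from describe() and category proportions from value_counts() are a reasonable stand-in, and the helper names continuous_stats and categorical_stats are hypothetical.

```python
import pandas as pd

X_train = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "color": ["red", "blue", "red", "red"],
})
X_test = pd.DataFrame({
    "age": [25.0, 45.0],
    "color": ["blue", "red"],
})


def continuous_stats(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: side-by-side summary statistics for one feature."""
    return pd.DataFrame({"train": train.describe(), "test": test.describe()})


def categorical_stats(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: side-by-side category proportions for one feature."""
    table = pd.DataFrame({
        "train": train.value_counts(normalize=True),
        "test": test.value_counts(normalize=True),
    })
    # A category missing from one split has proportion 0 in that split.
    return table.fillna(0.0)


age_stats = continuous_stats(X_train["age"], X_test["age"])
color_stats = categorical_stats(X_train["color"], X_test["color"])
```

Comparing the two columns of each table gives a quick check on whether the train and test distributions have drifted apart.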

get_split_metadata() → Dict[str, Any]

Return the split metadata used in certain metric calculations.

Provides metadata about the data split that can be used for metric calculations and reporting purposes.

Returns:

Dict[str, Any]: A dictionary containing the split metadata, with keys:

- num_features: Number of features in the dataset
- num_samples: Total number of samples (train + test)
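For orientation, the two documented keys could be derived from the stored frames roughly as follows. This is a sketch under the assumption that num_features is the column count of the training features. The function split_metadata is hypothetical and stands in for the method.

```python
import pandas as pd


def split_metadata(X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    """Hypothetical stand-in for get_split_metadata()."""
    return {
        # Assumption: features are counted as columns of the training frame.
        "num_features": X_train.shape[1],
        # Documented as the total number of samples (train + test).
        "num_samples": len(X_train) + len(X_test),
    }


X_train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "income": [1.0, 2.0, 3.0]})
X_test = pd.DataFrame({"age": [25.0, 45.0], "income": [4.0, 5.0]})
metadata = split_metadata(X_train, X_test)
```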

Examples

Get split metadata:

>>> metadata = data_info.get_split_metadata()
>>> print(f"Features: {metadata['num_features']}")
>>> print(f"Samples: {metadata['num_samples']}")

get_test() → Tuple[DataFrame, Series]

Return the testing features and labels.

Returns the testing data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.

Returns:

Tuple[pd.DataFrame, pd.Series]: A tuple containing the testing features and testing labels. Features are scaled if a scaler is available and continuous features are present.

Notes

If a scaler is available and continuous features exist:

1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved

If no scaler is available, the original data is returned unchanged.
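The scale-and-reorder behaviour described in these notes can be sketched as follows. This is an illustration, not the library's code. MeanStdScaler is a hypothetical stand-in for any fitted scaler that exposes transform(), and scale_continuous is a hypothetical helper.

```python
import pandas as pd


class MeanStdScaler:
    """Hypothetical stand-in for a fitted scaler exposing transform()."""

    def fit(self, X: pd.DataFrame) -> "MeanStdScaler":
        self.mean_ = X.mean()
        self.std_ = X.std(ddof=0)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return (X - self.mean_) / self.std_


def scale_continuous(X, scaler, categorical, continuous):
    """Scale only the continuous columns and keep the original column order."""
    if scaler is None or not continuous:
        return X  # nothing to scale: return the data unchanged
    scaled = pd.DataFrame(
        scaler.transform(X[continuous]), columns=continuous, index=X.index
    )
    combined = pd.concat([X[categorical], scaled], axis=1)
    return combined[X.columns]  # restore the original column order


X_train = pd.DataFrame({
    "age": [20.0, 40.0],
    "color": ["red", "blue"],
    "income": [10.0, 30.0],
})
X_test = pd.DataFrame({"age": [30.0], "color": ["red"], "income": [40.0]})

continuous = ["age", "income"]
categorical = ["color"]
scaler = MeanStdScaler().fit(X_train[continuous])  # fitted on training data only
X_test_scaled = scale_continuous(X_test, scaler, categorical, continuous)
```

Fitting the scaler on the training split alone and reusing it for the test split keeps test-set statistics out of the transformation.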

Examples

Get scaled testing data:

>>> X_test, y_test = data_info.get_test()
>>> print(X_test.shape)  # (n_samples, n_features)

get_train() → Tuple[DataFrame, Series]

Return the training features and labels.

Returns the training data with optional scaling applied to continuous features. Categorical features are preserved in their original form while continuous features are scaled using the fitted scaler.

Returns:

Tuple[pd.DataFrame, pd.Series]: A tuple containing the training features and training labels. Features are scaled if a scaler is available and continuous features are present.

Notes

If a scaler is available and continuous features exist:

1. Categorical features are kept in their original form
2. Continuous features are scaled using the fitted scaler
3. The two groups are concatenated and reordered so that the original column order is preserved

If no scaler is available, the original data is returned unchanged.

Examples

Get scaled training data:

>>> X_train, y_train = data_info.get_train()

get_train_test() → Tuple[DataFrame, DataFrame, Series, Series]

Return both the training and testing splits.

Convenience method that returns both training and testing data in a single call. Data is scaled if a scaler is available.

Returns:

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: A tuple containing the training features, testing features, training labels, and testing labels.

Notes

This method is equivalent to calling get_train() and get_test() separately, but provides a more convenient interface when both splits are needed.

Examples

Get both training and testing data:

>>> X_train, X_test, y_train, y_test = data_info.get_train_test()
