DataManager

class DataManager(test_size: float = 0.2, n_splits: int = 5, split_method: str = 'shuffle', group_column: str | None = None, stratified: bool = False, random_state: int | None = None, scale_method: str | None = None)

A class that handles data splitting logic for creating train-test splits.

This class allows users to configure different splitting strategies (e.g., shuffle, k-fold, stratified) and return train-test splits or cross-validation folds. It supports splitting based on groupings and includes options for data scaling.

Parameters:

test_size (float, optional): The proportion of the dataset to allocate to the test set, by default 0.2
n_splits (int, optional): Number of splits for cross-validation, by default 5
split_method (str, optional): The method to use for splitting (“shuffle” or “kfold”), by default “shuffle”
group_column (str, optional): The column to use for grouping (if any), by default None
stratified (bool, optional): Whether to use stratified sampling or cross-validation, by default False
random_state (int, optional): The random seed for reproducibility, by default None
scale_method (str, optional): The method to use for scaling (“standard”, “minmax”, “robust”, “maxabs”, “normalizer”), by default None
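
The scale_method names above match the names of scalers in scikit-learn's preprocessing module. The following is a minimal sketch of how such a name could be resolved to a scaler object. The SCALERS table and the make_scaler helper are hypothetical and not part of DataManager, and fitting the scaler on the training rows only is an assumption about sensible usage rather than documented behavior.

```python
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Hypothetical lookup table: scale_method name -> scikit-learn scaler class.
SCALERS = {
    "standard": StandardScaler,
    "minmax": MinMaxScaler,
    "robust": RobustScaler,
    "maxabs": MaxAbsScaler,
    "normalizer": Normalizer,
}


def make_scaler(scale_method):
    """Return a fresh scaler for the given name, or None when scaling is off."""
    if scale_method is None:
        return None
    if scale_method not in SCALERS:
        raise ValueError(f"Unknown scale_method: {scale_method!r}")
    return SCALERS[scale_method]()


# Fit on the training rows only, then reuse the fitted scaler on the test rows
# (an assumption made to avoid leaking test statistics into training).
X_train = [[0.0], [5.0], [10.0]]
X_test = [[2.5]]
scaler = make_scaler("minmax")
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```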

Attributes:

test_size (float): Proportion of dataset allocated to test set
n_splits (int): Number of splits for cross-validation
split_method (str): Method used for splitting
group_column (str or None): Column used for grouping
stratified (bool): Whether stratified sampling is used
random_state (int or None): Random seed for reproducibility
scale_method (str or None): Method used for scaling features
splitter (sklearn.model_selection._BaseKFold): The initialized scikit-learn splitter object
_splits (dict): Cache of previously computed splits
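
To make the role of the splitter attribute concrete, here is a minimal sketch of how a configuration might be mapped to a scikit-learn splitter. The choose_splitter function, its grouped flag (standing in for "a group_column was given"), and the exact mapping are assumptions for illustration. The class's own selection logic is not shown in this reference and may differ.

```python
from sklearn.model_selection import (
    GroupKFold,
    GroupShuffleSplit,
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    StratifiedShuffleSplit,
)


def choose_splitter(split_method="shuffle", test_size=0.2, n_splits=5,
                    grouped=False, stratified=False, random_state=None):
    """Hypothetical helper: pick a scikit-learn splitter for a configuration."""
    if grouped and stratified:
        # Left out on purpose to keep the sketch short.
        raise ValueError("grouped + stratified is not covered by this sketch")

    if split_method == "shuffle":
        # A single random train/test partition sized by test_size.
        kwargs = dict(n_splits=1, test_size=test_size, random_state=random_state)
        if grouped:
            return GroupShuffleSplit(**kwargs)
        if stratified:
            return StratifiedShuffleSplit(**kwargs)
        return ShuffleSplit(**kwargs)

    if split_method == "kfold":
        if grouped:
            return GroupKFold(n_splits=n_splits)
        # scikit-learn only accepts random_state when shuffle=True.
        shuffle = random_state is not None
        if stratified:
            return StratifiedKFold(n_splits=n_splits, shuffle=shuffle,
                                   random_state=random_state)
        return KFold(n_splits=n_splits, shuffle=shuffle,
                     random_state=random_state)

    raise ValueError(f"Unknown split_method: {split_method!r}")


# With the default-style configuration, 10 rows give 8 train and 2 test indices.
splitter = choose_splitter("shuffle", test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(list(range(10))))
```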

split(data_path: str, categorical_features: List[str], table_name: str | None = None, group_name: str | None = None, filename: str | None = None) → DataSplitInfo

Splits the data based on the preconfigured splitter.

Parameters:

data_path (str): Path to the dataset file
categorical_features (list of str): List of categorical feature names
table_name (str, optional): Name of the table in SQL database, by default None
group_name (str, optional): Name of the group for split caching, by default None
filename (str, optional): Filename for split caching, by default None

Returns:

DataSplitInfo: Object containing train/test splits and related information

Raises:

ValueError: If group_name is provided without filename or vice versa
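
The ValueError condition above amounts to "both caching arguments or neither". A minimal sketch of that check follows. The check_cache_args helper and the tuple it returns as a cache key are hypothetical, since the actual key format used by the _splits cache is not documented here.

```python
from typing import Optional, Tuple


def check_cache_args(group_name: Optional[str],
                     filename: Optional[str]) -> Optional[Tuple[str, str]]:
    """Validate the caching arguments and return a cache key, or None."""
    # Exactly one of the two being None means they were not given together.
    if (group_name is None) != (filename is None):
        raise ValueError("group_name and filename must be provided together")
    if group_name is None:
        return None  # no caching requested
    # Assumed key shape, chosen only for illustration.
    return (group_name, filename)
```

Calling it with both arguments yields a key, calling it with neither yields None, and supplying only one of them raises ValueError.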

to_markdown() → str

Creates a markdown representation of the DataManager configuration.

Returns:

str: Markdown formatted string describing the configuration.
