DataManager

class DataManager(test_size: float = 0.2, n_splits: int = 5, split_method: str = 'shuffle', group_column: str | None = None, stratified: bool = False, random_state: int | None = None, scale_method: str | None = None)

Handles the data-splitting logic for creating train-test splits.

This class lets users configure a splitting strategy (e.g., shuffle, k-fold, stratified) and returns train-test splits or cross-validation folds. It also supports group-based splitting and optional feature scaling.

Parameters:
test_size : float, optional

The proportion of the dataset to allocate to the test set, by default 0.2

n_splits : int, optional

Number of splits for cross-validation, by default 5

split_method : str, optional

The method to use for splitting (“shuffle” or “kfold”), by default “shuffle”

group_column : str, optional

The column to use for grouping (if any), by default None

stratified : bool, optional

Whether to stratify the train-test split or the cross-validation folds, by default False

random_state : int, optional

The random seed for reproducibility, by default None

scale_method : str, optional

The method to use for scaling (“standard”, “minmax”, “robust”, “maxabs”, “normalizer”), by default None

Attributes:
test_size : float

Proportion of dataset allocated to test set

n_splits : int

Number of splits for cross-validation

split_method : str

Method used for splitting

group_column : str or None

Column used for grouping

stratified : bool

Whether stratified sampling is used

random_state : int or None

Random seed for reproducibility

scale_method : str or None

Method used for scaling features

splitter : sklearn.model_selection._BaseKFold

The initialized scikit-learn splitter object

_splits : dict

Cache of previously computed splits
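
As a rough illustration of how split_method, group_column, and stratified could jointly determine the splitter attribute, the sketch below maps each combination to the name of a scikit-learn splitter class. The class names are real members of sklearn.model_selection, but the mapping and the helper pick_splitter_name are assumptions made for illustration, not DataManager's actual implementation.

```python
from typing import Optional

# Keys: (split_method, grouped?, stratified?).
# Values: class names from sklearn.model_selection.
# The mapping is an assumption about how the options combine.
_SPLITTERS = {
    ("shuffle", False, False): "ShuffleSplit",
    ("shuffle", False, True): "StratifiedShuffleSplit",
    ("shuffle", True, False): "GroupShuffleSplit",
    ("kfold", False, False): "KFold",
    ("kfold", False, True): "StratifiedKFold",
    ("kfold", True, False): "GroupKFold",
    ("kfold", True, True): "StratifiedGroupKFold",
}


def pick_splitter_name(split_method: str = "shuffle",
                       group_column: Optional[str] = None,
                       stratified: bool = False) -> str:
    """Return the scikit-learn splitter class name for a configuration."""
    if split_method not in ("shuffle", "kfold"):
        raise ValueError(f"Unknown split_method: {split_method!r}")
    key = (split_method, group_column is not None, stratified)
    if key not in _SPLITTERS:
        # scikit-learn ships no splitter that is shuffled, grouped,
        # and stratified at the same time.
        raise ValueError(
            "No scikit-learn splitter for a grouped, stratified shuffle split"
        )
    return _SPLITTERS[key]
```

With the default arguments the helper returns "ShuffleSplit"; a grouped, stratified k-fold configuration returns "StratifiedGroupKFold".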

split(data_path: str, categorical_features: List[str], table_name: str | None = None, group_name: str | None = None, filename: str | None = None) -> DataSplitInfo

Splits the data based on the preconfigured splitter.

Parameters:
data_path : str

Path to the dataset file

categorical_features : list of str

List of categorical feature names

table_name : str, optional

Name of the table in the SQL database, by default None

group_name : str, optional

Name of the group for split caching, by default None

filename : str, optional

Filename for split caching, by default None

Returns:
DataSplitInfo

Object containing train/test splits and related information

Raises:
ValueError

If group_name is provided without filename or vice versa
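
The pairing rule for group_name and filename can be sketched as a small validation helper. This is a minimal sketch, not the library's code: the function split_cache_key and the key format it returns are hypothetical, since the layout of the split cache is not documented here.

```python
from typing import Optional


def split_cache_key(group_name: Optional[str] = None,
                    filename: Optional[str] = None) -> Optional[str]:
    """Validate the caching arguments; return a key, or None if both are unset."""
    # The error case is exactly one of the two arguments being provided.
    if (group_name is None) != (filename is None):
        raise ValueError("group_name and filename must be provided together")
    if group_name is None:
        return None
    # Hypothetical key format, chosen only for this illustration.
    return f"{group_name}_{filename}"
```

Comparing the two "is None" checks with != catches both directions of the mismatch in a single condition.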

to_markdown() -> str

Creates a markdown representation of the DataManager configuration.

Returns:
str

Markdown-formatted string describing the configuration
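
To give a sense of what such a summary might look like, the sketch below renders the constructor options as a Markdown bullet list. SplitConfig is a hypothetical stand-in for DataManager, and the output layout is an assumption; the real method may format its output differently.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SplitConfig:
    """Hypothetical stand-in holding the same options as DataManager."""
    test_size: float = 0.2
    n_splits: int = 5
    split_method: str = "shuffle"
    group_column: Optional[str] = None
    stratified: bool = False
    random_state: Optional[int] = None
    scale_method: Optional[str] = None

    def to_markdown(self) -> str:
        # One bullet per option, in field-declaration order.
        return "\n".join(
            f"- {name}: {value}" for name, value in asdict(self).items()
        )


print(SplitConfig(split_method="kfold", stratified=True).to_markdown())
```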