DataManager
- class DataManager(test_size: float = 0.2, n_splits: int = 5, split_method: str = 'shuffle', group_column: str | None = None, stratified: bool = False, random_state: int | None = None, scale_method: str | None = None)
A class that handles data splitting logic for creating train-test splits.
This class allows users to configure different splitting strategies (e.g., shuffle, k-fold, stratified) and return train-test splits or cross-validation folds. It supports splitting based on groupings and includes options for data scaling.
- Parameters:
- test_size : float, optional
The proportion of the dataset to allocate to the test set, by default 0.2
- n_splits : int, optional
Number of splits for cross-validation, by default 5
- split_method : str, optional
The method to use for splitting (“shuffle” or “kfold”), by default “shuffle”
- group_column : str, optional
The column to use for grouping (if any), by default None
- stratified : bool, optional
Whether to use stratified sampling or cross-validation, by default False
- random_state : int, optional
The random seed for reproducibility, by default None
- scale_method : str, optional
The method to use for scaling (“standard”, “minmax”, “robust”, “maxabs”, “normalizer”), by default None
- Attributes:
- test_size : float
Proportion of dataset allocated to test set
- n_splits : int
Number of splits for cross-validation
- split_method : str
Method used for splitting
- group_column : str or None
Column used for grouping
- stratified : bool
Whether stratified sampling is used
- random_state : int or None
Random seed for reproducibility
- scale_method : str or None
Method used for scaling features
- splitter : sklearn.model_selection._BaseKFold
The initialized scikit-learn splitter object
- _splits : dict
Cache of previously computed splits
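The interaction between split_method, group_column, and stratified is easier to follow in code. The sketch below is not DataManager's implementation; it assumes the options map onto scikit-learn's standard splitter classes and returns only the class name, so it runs without scikit-learn installed. The helper name `choose_splitter_name` is hypothetical, and raising an error for the stratified, grouped shuffle case (for which scikit-learn has no built-in splitter) is an assumption made for this example.

```python
def choose_splitter_name(split_method="shuffle", group_column=None, stratified=False):
    """Return the name of a scikit-learn splitter class matching the options.

    Hypothetical helper for illustration only; not part of DataManager.
    """
    grouped = group_column is not None
    # Keys are (split_method, grouped, stratified); values are real
    # class names from sklearn.model_selection.
    table = {
        ("shuffle", False, False): "ShuffleSplit",
        ("shuffle", False, True): "StratifiedShuffleSplit",
        ("shuffle", True, False): "GroupShuffleSplit",
        ("kfold", False, False): "KFold",
        ("kfold", False, True): "StratifiedKFold",
        ("kfold", True, False): "GroupKFold",
        ("kfold", True, True): "StratifiedGroupKFold",
    }
    key = (split_method, grouped, stratified)
    if key not in table:
        # Assumed behavior: reject combinations with no matching splitter.
        raise ValueError(
            f"Unsupported combination: split_method={split_method!r}, "
            f"grouped={grouped}, stratified={stratified}"
        )
    return table[key]
```

With the default arguments this returns "ShuffleSplit"; passing split_method="kfold" with a group column and stratified=True returns "StratifiedGroupKFold".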
- split(data_path: str, categorical_features: List[str], table_name: str | None = None, group_name: str | None = None, filename: str | None = None) -> DataSplitInfo
Splits the data based on the preconfigured splitter.
- Parameters:
- data_path : str
Path to the dataset file
- categorical_features : list of str
List of categorical feature names
- table_name : str, optional
Name of the table in the SQL database, by default None
- group_name : str, optional
Name of the group for split caching, by default None
- filename : str, optional
Filename for split caching, by default None
- Returns:
- DataSplitInfo
Object containing train/test splits and related information
- Raises:
- ValueError
If group_name is provided without filename or vice versa
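The ValueError condition amounts to requiring that group_name and filename are either both given or both omitted. A minimal sketch of such a check follows; the function name `check_cache_args` and the idea of using the pair as a cache key are assumptions for illustration, not the library's documented behavior.

```python
def check_cache_args(group_name=None, filename=None):
    """Validate that group_name and filename are supplied together.

    Hypothetical helper; returns an assumed cache key, or None when
    neither argument is given.
    """
    # Exactly one of the two being None is the error case.
    if (group_name is None) != (filename is None):
        raise ValueError("group_name and filename must be provided together")
    if group_name is None:
        return None
    # Assumption: a (group_name, filename) tuple could index the split cache.
    return (group_name, filename)
```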
- to_markdown() -> str
Creates a markdown representation of the DataManager configuration.
- Returns:
- str
Markdown-formatted string describing the configuration
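The exact layout of the returned string is not specified here. As a rough illustration only, a configuration object with the same fields and defaults as the constructor could render itself as a markdown table like this; the class name `SplitConfig` and the table format are assumptions, not DataManager's actual output.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SplitConfig:
    """Hypothetical stand-in mirroring DataManager's constructor arguments."""
    test_size: float = 0.2
    n_splits: int = 5
    split_method: str = "shuffle"
    group_column: Optional[str] = None
    stratified: bool = False
    random_state: Optional[int] = None
    scale_method: Optional[str] = None

    def to_markdown(self) -> str:
        # One table row per configuration field, in declaration order.
        lines = ["| Parameter | Value |", "| --- | --- |"]
        for f in fields(self):
            lines.append(f"| `{f.name}` | `{getattr(self, f.name)!r}` |")
        return "\n".join(lines)
```

Calling `SplitConfig(n_splits=10).to_markdown()` yields a header, a separator, and one row for each of the seven settings.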