DataManager

class DataManager(test_size: float = 0.2, n_splits: int = 5, split_method: str = 'shuffle', group_column: str | None = None, stratified: bool = False, random_state: int | None = None, problem_type: str = 'classification', algorithm_config=None, preprocessors: List[BasePreprocessor] | None = None)

A class that handles data splitting logic for creating train-test splits.

This class lets users configure a splitting strategy (e.g., shuffle, k-fold, stratified) and returns train-test splits or cross-validation folds. It supports group-based splitting and includes a data preprocessing pipeline that handles missing data, categorical encoding, scaling, and feature selection.

Parameters:

test_size (float, optional): The proportion of the dataset to allocate to the test set, by default 0.2
n_splits (int, optional): Number of splits for cross-validation, by default 5
split_method (str, optional): The method to use for splitting (“shuffle” or “kfold”), by default “shuffle”
group_column (str, optional): The column to use for grouping (if any), by default None
stratified (bool, optional): Whether to use stratified sampling or cross-validation, by default False
random_state (int, optional): The random seed for reproducibility, by default None
problem_type (str, optional): The type of problem (“classification” or “regression”), by default “classification”
algorithm_config (AlgorithmCollection, optional): User-provided collection of AlgorithmWrapper objects to use for feature selection, by default None
preprocessors (List[BasePreprocessor], optional): List of preprocessor objects to apply to the data in sequence, by default None
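
To make test_size and n_splits concrete, the sketch below computes the partition sizes they imply, in plain Python. It assumes the rounding rules of scikit-learn's shuffle and k-fold splitters (the test count is rounded up, and the first `n_samples % n_splits` folds receive one extra sample). The helper names are hypothetical and are not part of DataManager.

```python
import math
from typing import List, Tuple


def shuffle_split_sizes(n_samples: int, test_size: float = 0.2) -> Tuple[int, int]:
    """Return (n_train, n_test) for a single shuffle split.

    Assumption: the test count is rounded up, as in scikit-learn's
    shuffle-based splitters.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def kfold_test_sizes(n_samples: int, n_splits: int = 5) -> List[int]:
    """Return the test-fold size for each of the n_splits folds.

    Assumption: the first n_samples % n_splits folds get one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


print(shuffle_split_sizes(100, test_size=0.2))  # (80, 20)
print(kfold_test_sizes(103, n_splits=5))        # [21, 21, 21, 20, 20]
```

With k-fold splitting every sample appears in exactly one test fold, so test_size only matters for the shuffle-based strategies.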

Attributes:

test_size (float): Proportion of dataset allocated to test set
n_splits (int): Number of splits for cross-validation
split_method (str): Method used for splitting
group_column (str or None): Column used for grouping
stratified (bool): Whether stratified sampling is used
random_state (int or None): Random seed for reproducibility
problem_type (str): Type of problem (classification or regression)
algorithm_config (list of AlgorithmWrapper or None): List of algorithms to use as feature selection estimators
preprocessors (List[BasePreprocessor]): List of preprocessors to apply to the data
splitter (sklearn.model_selection._BaseKFold): The initialized scikit-learn splitter object
_splits (dict): Cache of previously computed splits

Notes

The DataManager supports various splitting strategies:

- ShuffleSplit: Random train-test splits
- KFold: K-fold cross-validation
- StratifiedShuffleSplit: Stratified random splits
- StratifiedKFold: Stratified k-fold cross-validation
- GroupShuffleSplit: Group-aware random splits
- GroupKFold: Group-aware k-fold cross-validation
- StratifiedGroupKFold: Stratified group-aware k-fold
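
One plausible reading of how split_method, group_column, and stratified combine to select a strategy is sketched below. The function `choose_splitter_name` is hypothetical: the mapping is inferred from the list of supported strategies, not taken from DataManager's source, and the error for a grouped, stratified shuffle split is an assumption based on that combination being absent from the list.

```python
from typing import Optional


def choose_splitter_name(
    split_method: str = "shuffle",
    group_column: Optional[str] = None,
    stratified: bool = False,
) -> str:
    """Return the name of the scikit-learn splitter class for a configuration.

    Hypothetical helper: the mapping follows the documented list of
    strategies, not DataManager's actual implementation.
    """
    if split_method not in ("shuffle", "kfold"):
        raise ValueError(f"Unknown split_method: {split_method!r}")

    grouped = group_column is not None
    # Keys are (split_method, grouped, stratified).
    table = {
        ("shuffle", False, False): "ShuffleSplit",
        ("shuffle", False, True): "StratifiedShuffleSplit",
        ("shuffle", True, False): "GroupShuffleSplit",
        ("kfold", False, False): "KFold",
        ("kfold", False, True): "StratifiedKFold",
        ("kfold", True, False): "GroupKFold",
        ("kfold", True, True): "StratifiedGroupKFold",
    }
    key = (split_method, grouped, stratified)
    if key not in table:
        # Assumption: no listed strategy covers a grouped, stratified shuffle.
        raise ValueError("No supported splitter for a grouped, stratified shuffle split")
    return table[key]


print(choose_splitter_name())                                # ShuffleSplit
print(choose_splitter_name("kfold", "subject_id", True))     # StratifiedGroupKFold
```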

The preprocessing pipeline follows a fixed order:

1. Missing Data Handling
2. Categorical Encoding
3. Scaling (continuous features only)
4. Feature Selection
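
The four stages can be illustrated with a small standard-library sketch. Every function here is a hypothetical stand-in (mean imputation, one-hot encoding, standardization, and dropping constant columns) chosen only to show the order of operations. It does not use the brisk preprocessor classes, whose behavior may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Step 1: replace missing values (None) with the column mean."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def one_hot(name: str, values: List[str]) -> Dict[str, List[float]]:
    """Step 2: expand one categorical column into 0/1 indicator columns."""
    return {
        f"{name}_{category}": [1.0 if v == category else 0.0 for v in values]
        for category in sorted(set(values))
    }


def standardize(values: List[float]) -> List[float]:
    """Step 3: rescale a continuous column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def run_pipeline(
    continuous: Dict[str, List[Optional[float]]],
    categorical: Dict[str, List[str]],
) -> Dict[str, List[float]]:
    # 1. Missing data handling
    filled = {name: impute_mean(col) for name, col in continuous.items()}
    # 2. Categorical encoding
    encoded: Dict[str, List[float]] = {}
    for name, col in categorical.items():
        encoded.update(one_hot(name, col))
    # 3. Scaling, applied to the continuous columns only
    scaled = {name: standardize(col) for name, col in filled.items()}
    # 4. Feature selection (stand-in rule): drop constant columns
    features = {**scaled, **encoded}
    return {name: col for name, col in features.items() if len(set(col)) > 1}


result = run_pipeline(
    continuous={"age": [20.0, None, 40.0], "const": [1.0, 1.0, 1.0]},
    categorical={"color": ["red", "blue", "red"]},
)
print(sorted(result))  # ['age', 'color_blue', 'color_red']
```

Because scaling runs after encoding but touches only the continuous columns, the indicator columns keep their 0/1 values.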

Examples

Create a basic data manager:

>>> manager = DataManager(test_size=0.2, n_splits=5)

Create with preprocessing:

>>> from brisk.data.preprocessing import (
...     MissingDataPreprocessor,
...     ScalingPreprocessor,
... )
>>> preprocessors = [
...     MissingDataPreprocessor(strategy="mean"),
...     ScalingPreprocessor(method="standard")
... ]
>>> manager = DataManager(
...     test_size=0.2,
...     preprocessors=preprocessors
... )

Create splits for grouped data:

>>> manager = DataManager(
...     split_method="kfold",
...     group_column="subject_id",
...     stratified=True
... )

export_params() → dict

Export a JSON-serializable snapshot of the DataManager init params.

Creates a dictionary containing all initialization parameters and preprocessor configurations for serialization and rerun functionality.

Returns:

dict: Dictionary containing all DataManager parameters and preprocessor configurations in JSON-serializable format

Notes

The exported parameters include:

- All DataManager initialization parameters
- Preprocessor configurations (exported from each preprocessor)
- Parameter values in a format suitable for JSON serialization
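
The sketch below shows what JSON-serializable means in practice by round-tripping a snapshot through the standard `json` module. The `params` dictionary is a hand-written stand-in: its key names and the nesting of the preprocessor entries are assumptions, not the schema that export_params() actually produces.

```python
import json

# Hypothetical stand-in for the dictionary returned by export_params();
# the keys and the preprocessor layout are assumptions for illustration.
params = {
    "test_size": 0.2,
    "n_splits": 5,
    "split_method": "shuffle",
    "group_column": None,
    "stratified": False,
    "random_state": 42,
    "problem_type": "classification",
    "preprocessors": [
        {"type": "MissingDataPreprocessor", "params": {"strategy": "mean"}},
        {"type": "ScalingPreprocessor", "params": {"method": "standard"}},
    ],
}

# Only JSON-native types (numbers, strings, booleans, None, lists, dicts)
# are used, so the snapshot survives a dump/load round trip unchanged.
serialized = json.dumps(params, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(restored == params)  # True
```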

Examples

Export parameters for saving:

>>> manager = DataManager(test_size=0.2)
>>> params = manager.export_params()
>>> # Save params to file or database

split(data_path: str, categorical_features: List[str], group_name: str, filename: str, table_name: str | None = None) → DataSplitInfo

Split the data based on the preconfigured splitter.

Creates train-test splits for the specified dataset using the configured splitting strategy and applies the complete preprocessing pipeline. Results are cached so they can be retrieved later with the same parameters.

Parameters:

data_pathstr: Path to the dataset file
categorical_featuresList[str]: List of categorical feature names
group_namestr: Name of the group for split caching
filenamestr: Filename for split caching
table_namestr, optional: Name of the table in SQL database, by default None

Returns:

DataSplitInfo: Object containing train/test splits and related information

Raises:

ValueError: If group_name is provided without filename or vice versa

Notes

The split process:

1. Loads data from the specified path
2. Separates features and target variables
3. Handles grouping columns if specified
4. Creates splits using the configured splitter
5. Applies preprocessing pipeline to each split
6. Creates DataSplitInfo objects for each split
7. Caches results for efficient reuse
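
The caching in the final step can be pictured as a dictionary lookup that runs the expensive work only on a miss. The `SplitCache` class below is a hypothetical sketch: keying on (group_name, filename, table_name) is an assumption suggested by the parameter descriptions, and the real layout of `_splits` may differ.

```python
from typing import Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, str, Optional[str]]


class SplitCache:
    """Hypothetical cache; DataManager's real `_splits` layout may differ."""

    def __init__(self) -> None:
        self._splits: Dict[CacheKey, object] = {}

    def get_or_create(
        self,
        group_name: str,
        filename: str,
        table_name: Optional[str],
        build: Callable[[], object],
    ) -> object:
        # Mirror the documented rule: group_name and filename go together.
        if bool(group_name) != bool(filename):
            raise ValueError("group_name and filename must be provided together")
        key = (group_name, filename, table_name)
        if key not in self._splits:
            # Loading, splitting, and preprocessing would run here, once per key.
            self._splits[key] = build()
        return self._splits[key]


calls = []
cache = SplitCache()


def expensive_split() -> dict:
    calls.append(1)  # record each time the split is actually computed
    return {"train": [0, 1, 2, 3], "test": [4]}


first = cache.get_or_create("experiment1", "dataset1", None, expensive_split)
second = cache.get_or_create("experiment1", "dataset1", None, expensive_split)
print(first is second, len(calls))  # True 1
```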

Examples

Create splits for a dataset:

>>> manager = DataManager(test_size=0.2)
>>> splits = manager.split(
...     data_path="data.csv",
...     categorical_features=["category1", "category2"],
...     group_name="experiment1",
...     filename="dataset1"
... )

to_markdown() → str

Create a markdown representation of the DataManager configuration.

Generates a formatted markdown string describing the current DataManager configuration including splitting parameters and preprocessing pipeline.

Returns:

str: Markdown formatted string describing the configuration
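
As a rough idea of what such a string might contain, the sketch below renders a configuration dictionary and a list of preprocessor names as markdown. The function `config_to_markdown`, the headings, and the layout are all hypothetical; the text that to_markdown() actually produces is not specified here and may look different.

```python
from typing import Any, Dict, List


def config_to_markdown(config: Dict[str, Any], preprocessors: List[str]) -> str:
    """Render a configuration as markdown (hypothetical layout)."""
    lines = ["### DataManager Configuration", ""]
    # One bullet per splitting parameter
    lines += [f"- {name}: {value}" for name, value in config.items()]
    lines += ["", "### Preprocessing", ""]
    if preprocessors:
        # Number the steps to reflect the order in which they are applied
        lines += [f"{i}. {name}" for i, name in enumerate(preprocessors, start=1)]
    else:
        lines.append("None")
    return "\n".join(lines)


md = config_to_markdown(
    {"test_size": 0.2, "n_splits": 5, "split_method": "shuffle"},
    ["MissingDataPreprocessor", "ScalingPreprocessor"],
)
print(md)
```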

Examples

Generate configuration documentation:

>>> manager = DataManager(test_size=0.2, n_splits=5)
>>> md = manager.to_markdown()
>>> print(md)
