DataManager
- class DataManager(test_size: float = 0.2, n_splits: int = 5, split_method: str = 'shuffle', group_column: str | None = None, stratified: bool = False, random_state: int | None = None, problem_type: str = 'classification', algorithm_config=None, preprocessors: List[BasePreprocessor] | None = None)
A class that handles data splitting logic for creating train-test splits.
This class allows users to configure different splitting strategies (e.g., shuffle, k-fold, stratified) and return train-test splits or cross-validation folds. It supports splitting based on groupings and includes a data preprocessing pipeline with support for missing data handling, categorical encoding, scaling, and feature selection.
- Parameters:
- test_size : float, optional
The proportion of the dataset to allocate to the test set, by default 0.2
- n_splits : int, optional
Number of splits for cross-validation, by default 5
- split_method : str, optional
The method to use for splitting (“shuffle” or “kfold”), by default “shuffle”
- group_column : str, optional
The column to use for grouping (if any), by default None
- stratified : bool, optional
Whether to use stratified sampling or cross-validation, by default False
- random_state : int, optional
The random seed for reproducibility, by default None
- problem_type : str, optional
The type of problem (“classification” or “regression”), by default “classification”
- algorithm_config : AlgorithmCollection, optional
User-provided collection of AlgorithmWrapper objects to use for feature selection, by default None
- preprocessors : List[BasePreprocessor], optional
List of preprocessor objects to apply to the data in sequence, by default None
- Attributes:
- test_size : float
Proportion of dataset allocated to test set
- n_splits : int
Number of splits for cross-validation
- split_method : str
Method used for splitting
- group_column : str or None
Column used for grouping
- stratified : bool
Whether stratified sampling is used
- random_state : int or None
Random seed for reproducibility
- problem_type : str
Type of problem (classification or regression)
- algorithm_config : list of AlgorithmWrapper or None
List of algorithms to use as feature selection estimators
- preprocessors : List[BasePreprocessor]
List of preprocessors to apply to the data
- splitter : sklearn.model_selection._BaseKFold
The initialized scikit-learn splitter object
- _splits : dict
Cache of previously computed splits
Notes
The DataManager supports various splitting strategies:
- ShuffleSplit: Random train-test splits
- KFold: K-fold cross-validation
- StratifiedShuffleSplit: Stratified random splits
- StratifiedKFold: Stratified k-fold cross-validation
- GroupShuffleSplit: Group-aware random splits
- GroupKFold: Group-aware k-fold cross-validation
- StratifiedGroupKFold: Stratified group-aware k-fold
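One way to picture how the constructor arguments select among these strategies is a small lookup table. The sketch below is an assumption about how `split_method`, `group_column`, and `stratified` might combine, not the library's actual code. `choose_splitter_name` is a hypothetical helper that returns only the scikit-learn class name, so it runs without scikit-learn installed.

```python
from typing import Optional


def choose_splitter_name(
    split_method: str = "shuffle",
    group_column: Optional[str] = None,
    stratified: bool = False,
) -> str:
    """Return the name of the splitter class for a configuration.

    Hypothetical helper: the real DataManager may resolve this differently.
    """
    if split_method not in ("shuffle", "kfold"):
        raise ValueError(f"Unknown split_method: {split_method!r}")

    grouped = group_column is not None

    # Keys are (split_method, grouped, stratified); values are the
    # strategies listed in the Notes above.
    table = {
        ("shuffle", False, False): "ShuffleSplit",
        ("kfold", False, False): "KFold",
        ("shuffle", False, True): "StratifiedShuffleSplit",
        ("kfold", False, True): "StratifiedKFold",
        ("shuffle", True, False): "GroupShuffleSplit",
        ("kfold", True, False): "GroupKFold",
        ("kfold", True, True): "StratifiedGroupKFold",
    }

    key = (split_method, grouped, stratified)
    if key not in table:
        # The listed strategies include no stratified, group-aware shuffle split.
        raise ValueError(
            "No listed strategy combines shuffle splitting with both "
            "grouping and stratification"
        )
    return table[key]


print(choose_splitter_name("kfold", group_column="subject_id", stratified=True))
```

The one combination missing from the table (shuffle, grouped, stratified) raises an error here. How the real class treats that case is not stated in this reference.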
The preprocessing pipeline follows a fixed order:
1. Missing Data Handling
2. Categorical Encoding
3. Scaling (continuous features only)
4. Feature Selection
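To illustrate what a fixed stage order means in practice, the following minimal sketch sorts a set of requested stages into the order given above. The stage labels and the `order_steps` helper are hypothetical. Whether DataManager reorders the `preprocessors` list itself or expects callers to supply it in this order is not stated in this reference.

```python
from typing import Iterable, List

# Stage labels are illustrative only; they mirror the four steps named above.
PIPELINE_ORDER = (
    "missing_data",
    "categorical_encoding",
    "scaling",
    "feature_selection",
)


def order_steps(requested: Iterable[str]) -> List[str]:
    """Return the requested stages sorted into the fixed pipeline order."""
    requested = set(requested)
    unknown = requested - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f"Unknown pipeline stages: {sorted(unknown)}")
    # Iterate over the canonical order so the result never depends on
    # the order in which the caller listed the stages.
    return [stage for stage in PIPELINE_ORDER if stage in requested]


print(order_steps(["scaling", "missing_data"]))
```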
Examples
- Create a basic data manager:
>>> manager = DataManager(test_size=0.2, n_splits=5)
- Create with preprocessing:
>>> from brisk.data.preprocessing import (
...     MissingDataPreprocessor,
...     ScalingPreprocessor,
... )
>>> preprocessors = [
...     MissingDataPreprocessor(strategy="mean"),
...     ScalingPreprocessor(method="standard")
... ]
>>> manager = DataManager(
...     test_size=0.2,
...     preprocessors=preprocessors
... )
- Create splits for grouped data:
>>> manager = DataManager(
...     split_method="kfold",
...     group_column="subject_id",
...     stratified=True
... )
- export_params() → None
Export a JSON-serializable snapshot of the DataManager init params.
Creates a dictionary containing all initialization parameters and preprocessor configurations for serialization and rerun functionality.
- Returns:
- dict
Dictionary containing all DataManager parameters and preprocessor configurations in JSON-serializable format
Notes
The exported parameters include:
- All DataManager initialization parameters
- Preprocessor configurations (exported from each preprocessor)
- Parameter values in a format suitable for JSON serialization
Examples
- Export parameters for saving:
>>> manager = DataManager(test_size=0.2)
>>> params = manager.export_params()
>>> # Save params to file or database
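Because the snapshot is described as JSON-serializable, it should survive a round trip through the standard `json` module. The sketch below uses a hand-written dictionary as a stand-in. Its keys mirror the DataManager init signature, but the actual structure returned by `export_params()` is an assumption here and may differ.

```python
import json

# Hypothetical stand-in for the dictionary returned by export_params().
params = {
    "test_size": 0.2,
    "n_splits": 5,
    "split_method": "shuffle",
    "group_column": None,
    "stratified": False,
    "random_state": None,
    "problem_type": "classification",
    "preprocessors": [],
}

# Serialize to text; this raises TypeError if any value is not
# JSON-serializable, which is a quick sanity check on a snapshot.
serialized = json.dumps(params, indent=2, sort_keys=True)

# Restore the snapshot, e.g. to rebuild a manager for a rerun.
restored = json.loads(serialized)
print(restored == params)
```

Python's `None` maps to JSON `null` and back, so optional parameters such as `group_column` keep their unset state across the round trip.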
- split(data_path: str, categorical_features: List[str], group_name: str, filename: str, table_name: str | None = None) → DataSplitInfo
Split the data based on the preconfigured splitter.
Creates train-test splits for the specified dataset using the configured splitting strategy and applies the complete preprocessing pipeline. Results are cached so that they can be retrieved later using the same parameters.
- Parameters:
- data_path : str
Path to the dataset file
- categorical_features : List[str]
List of categorical feature names
- group_name : str
Name of the group for split caching
- filename : str
Filename for split caching
- table_name : str, optional
Name of the table in SQL database, by default None
- Returns:
- DataSplitInfo
Object containing train/test splits and related information
- Raises:
- ValueError
If group_name is provided without filename or vice versa
Notes
The split process:
1. Loads data from the specified path
2. Separates features and target variables
3. Handles grouping columns if specified
4. Creates splits using the configured splitter
5. Applies preprocessing pipeline to each split
6. Creates DataSplitInfo objects for each split
7. Caches results for efficient reuse
Examples
- Create splits for a dataset:
>>> manager = DataManager(test_size=0.2)
>>> splits = manager.split(
...     data_path="data.csv",
...     categorical_features=["category1", "category2"],
...     group_name="experiment1",
...     filename="dataset1"
... )
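The caching step and the `ValueError` condition described for `split` can be sketched with a plain dictionary. `SplitCache`, its key layout, and the `build` callback are all hypothetical. The reference only says that results are cached in `_splits` and that `group_name` and `filename` must be supplied together.

```python
from typing import Any, Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, str, Optional[str]]


class SplitCache:
    """Minimal memoization sketch; not the library's implementation."""

    def __init__(self) -> None:
        self._splits: Dict[CacheKey, Any] = {}

    def get_or_create(
        self,
        group_name: str,
        filename: str,
        table_name: Optional[str],
        build: Callable[[], Any],
    ) -> Any:
        # Mirror the documented error: one identifier without the other.
        if bool(group_name) != bool(filename):
            raise ValueError(
                "group_name and filename must be provided together"
            )
        # Assumed key layout: one cached entry per (group, file, table).
        key = (group_name, filename, table_name)
        if key not in self._splits:
            # build() stands in for loading, splitting, and preprocessing.
            self._splits[key] = build()
        return self._splits[key]


cache = SplitCache()
calls = []


def expensive_split() -> str:
    calls.append(1)  # record each time the split is actually computed
    return "split-info"


first = cache.get_or_create("experiment1", "dataset1", None, expensive_split)
second = cache.get_or_create("experiment1", "dataset1", None, expensive_split)
print(first is second, len(calls))
```

The second call returns the stored object without invoking `expensive_split` again, which is the behaviour that the final step of the split process describes.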
- to_markdown() → str
Create a markdown representation of the DataManager configuration.
Generates a formatted markdown string describing the current DataManager configuration including splitting parameters and preprocessing pipeline.
- Returns:
- str
Markdown formatted string describing the configuration
Examples
- Generate configuration documentation:
>>> manager = DataManager(test_size=0.2, n_splits=5)
>>> md = manager.to_markdown()
>>> print(md)
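The exact layout produced by `to_markdown()` is not shown in this reference. As a rough illustration of the idea, the sketch below renders a configuration dictionary as a markdown bullet list. The `config_to_markdown` helper, the heading text, and the bullet format are all assumptions.

```python
from typing import Any, Dict


def config_to_markdown(config: Dict[str, Any]) -> str:
    """Render configuration values as a markdown bullet list.

    Hypothetical helper: the real to_markdown() output may look different.
    """
    lines = ["### DataManager Configuration", ""]
    for name, value in config.items():
        # repr() keeps strings quoted and shows None explicitly.
        lines.append(f"- **{name}**: `{value!r}`")
    return "\n".join(lines)


md = config_to_markdown(
    {"test_size": 0.2, "n_splits": 5, "split_method": "shuffle"}
)
print(md)
```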