Using ExperimentGroups#

The ExperimentGroup object is used to define a group of experiments. This can be used to help organize your experiments. The ExperimentGroup class also allows you to override the default settings for a set of experiments. This allows you to test out different values without having to change the configuration files.

Modify DataManager#

By default Brisk will use the DataManager defined in the data.py configuration file. The data_config argument is used to modify the DataManager for the group. To do this you define a dictionary of arguments that will be passed to the DataManager constructor.

As an example imagine we define the following DataManager in the data.py file:

from brisk.data.data_manager import DataManager
from brisk.data.preprocessing import ScalingPreprocessor

data_manager = DataManager(
    test_size=0.2,
    split_method="shuffle",
    preprocessors=[ScalingPreprocessor(method="standard")]
)

This will use a shuffle split method to create a train/test split of 80/20. It will also use standard scaling to scale the input features.

What if we want to compare different preprocessing strategies? We can create experiment groups with different preprocessing configurations:

from brisk.data.preprocessing import (
    ScalingPreprocessor,
    FeatureSelectionPreprocessor
)

config.add_experiment_group(
    name="standard_scaling",
    datasets=["data.csv"],
    description="Use standard scaling (from data.py configuration)"
)

config.add_experiment_group(
    name="minmax_scaling",
    datasets=["data.csv"],
    data_config={"preprocessors": [ScalingPreprocessor(method="minmax")]},
    description="Use minmax scaling instead"
)

config.add_experiment_group(
    name="with_feature_selection",
    datasets=["data.csv"],
    data_config={"preprocessors": [
        ScalingPreprocessor(method="robust"),
        FeatureSelectionPreprocessor(method="selectkbest", n_features_to_select=10)
    ]},
    description="Robust scaling with top 10 features"
)

Any argument passed to the DataManager constructor can be modified in this way.

Note

If you are using multiple datasets in the ExperimentGroup then the same DataManager will be used for all of the datasets. If you want to use different DataManager for each dataset then you will need to define a separate ExperimentGroup for each dataset.

Note

For database files, you can specify datasets as tuples: datasets=[("database.db", "table_name")]

Modify Algorithms#

The algorithm_config argument is used to modify the hyperparameter grid used for the algorithm in this group. You provide a nested dictionary where the first key is the algorithm name and the second key is the hyperparameter to modify.

For example if we want to modify the hyperparameter grid for the ridge algorithm to only use the values [0.01, 0.05, 0.1, 0.5, 1.0] for the alpha hyperparameter we can do the following:

config.add_experiment_group(
    name="group1",
    datasets=["data.csv"],
    algorithms=["ridge"],
    algorithm_config={"ridge":
        {"alpha": [0.01, 0.05, 0.1, 0.5, 1.0]}
    }
)

This allows you to test out different hyperparameter values without having to change the algorithms.py file.

Important

algorithm_config will only modify the specified hyperparameter in the hyperparameter grid. This means the other hyperparameters will still use the values defined in the AlgorithmWrapper. It is not possible to remove a hyperparameter from the grid using algorithm_config.

Assign Different Workflows#

Each experiment group can use a different workflow by specifying the workflow parameter. This allows you to test different modeling approaches across experiment groups.

Setting Default Workflow#

By default, all experiment groups use the default_workflow specified in the Configuration object:

# settings.py
config = Configuration(
    default_algorithms=["linear", "ridge"],
    default_workflow="basic_workflow"  # All groups use this unless overridden
)

config.add_experiment_group(
    name="baseline",
    datasets=["data.csv"]
    # Uses default_workflow automatically
)

Assign Different Workflows for Specific Groups#

You can override the default workflow for specific experiment groups:

# settings.py
config = Configuration(
    default_algorithms=["linear", "ridge"],
    default_workflow="basic_workflow"  # Default for all groups
)

config.add_experiment_group(
    name="baseline_models",
    datasets=["data.csv"],
    algorithms=["linear", "ridge"]
    # Uses default_workflow (basic_workflow)
)

config.add_experiment_group(
    name="tuned_models",
    datasets=["data.csv"],
    algorithms=["rf", "svm"],
    workflow="advanced_workflow",  # Overrides default_workflow
    description="Hyperparameter tuning with customized evaluation"
)

Important

The workflow filename (without .py) must match the workflow name you specify in add_experiment_group(). For example, workflow="my_analysis" expects a file named workflows/my_analysis.py.

Note

For detailed examples of creating these workflow files, see the Project Structure guide.

Pass Workflow Arguments#

If you want to use workflow_args it is best practice to first define the default values in the Configuration object. This will ensure that the Workflow class has the correct attributes for every run. If a value is not set for a workflow argument in an experiment group, the workflow may fail when trying to use this argument.

As an example say we want to use a different value of kfold for each experiment group we can do the following. First we define the default value in the Configuration object:

config = Configuration(
    default_workflow="basic_workflow",
    default_algorithms=["linear", "ridge"],
    categorical_features={"housing_data.csv": ["neighborhood", "property_type"]},
    default_workflow_args={"kfold": 5}
)

Then we can create two experiment groups with different values of kfold.

config.add_experiment_group(
    name="group1",
    datasets=["data.csv"]
    # Uses default_workflow_args
)

config.add_experiment_group(
    name="group2",
    datasets=["data.csv"],
    workflow_args={"kfold": 10} # Overrides default_workflow_args
)

Now when we are creating a workflow we can access the kfold value as self.kfold:

def workflow(self):
    self.evaluate_model_cv(
        self.model, self.X_train, self.y_train,
        ["MAE", "MSE"], "evaluate_cv", cv=self.kfold
    )

You can include as many arguments as you want in the workflow_args dictionary. This is a good way to avoid hardcoding values in the Workflow class and helps ensure you use the same value for a particular argument across all method calls in the workflow.