Workflow
- class Workflow(evaluator: EvaluationManager, X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, output_dir: str, algorithm_names: List[str], feature_names: List[str], workflow_attributes: Dict[str, Any])
Base class for machine learning workflows. Delegates model evaluation and visualization to the EvaluationManager.
- Parameters:
- evaluator (EvaluationManager):
Manager for model evaluation and visualization
- X_train (DataFrame):
Training feature data
- X_test (DataFrame):
Test feature data
- y_train (Series):
Training target data
- y_test (Series):
Test target data
- output_dir (str):
Directory where results will be saved
- algorithm_names (list of str):
Names of the algorithms used
- feature_names (list of str):
Names of the features
- workflow_attributes (dict):
Additional attributes to be unpacked into the workflow
- Attributes:
- evaluator (EvaluationManager):
Manager for model evaluation
- X_train (DataFrame):
Training feature data
- X_test (DataFrame):
Test feature data
- y_train (Series):
Training target data
- y_test (Series):
Test target data
- output_dir (str):
Output directory path
- algorithm_names (list of str):
Algorithm names
- feature_names (list of str):
Feature names
- model1, model2, … (BaseEstimator):
Models unpacked from workflow_attributes
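The unpacking of workflow_attributes into attributes such as model1 and model2 can be pictured with a small stand-in class. This is only a sketch: `MiniWorkflow` is a hypothetical name, and using `setattr` is an assumption about how the attributes might be attached, not a description of the actual implementation.

```python
from typing import Any, Dict


class MiniWorkflow:
    """Hypothetical stand-in that mimics attribute unpacking only."""

    def __init__(self, output_dir: str, workflow_attributes: Dict[str, Any]) -> None:
        self.output_dir = output_dir
        # Assumption: each key becomes an instance attribute of the same name.
        for name, value in workflow_attributes.items():
            setattr(self, name, value)


# Plain strings stand in for estimators to keep the sketch dependency-free.
wf = MiniWorkflow("results", {"model1": "ridge", "model2": "lasso"})
print(wf.model1, wf.model2)
```

Under this reading, the keys of workflow_attributes determine the attribute names available inside the workflow.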
- compare_models(*models: BaseEstimator, X: DataFrame, y: Series, metrics: List[str], filename: str, calculate_diff: bool = False) -> Dict[str, Dict[str, float]]
Compare multiple models using specified metrics.
- Parameters:
- *models (BaseEstimator):
Models to compare
- X (DataFrame):
Feature data
- y (Series):
Target data
- metrics (list of str):
Names of metrics to calculate
- filename (str):
Output filename (without extension)
- calculate_diff (bool, optional):
Whether to compute differences between models, by default False
- Returns:
- dict
Nested dictionary containing metric results for each model
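To make the nested return value concrete, the sketch below builds a dictionary of the same general shape from precomputed predictions. Everything here is illustrative: `compare_predictions`, the `METRICS` table, the metric names, and the key used for differences are hypothetical, and the real method may lay out its keys differently.

```python
from typing import Callable, Dict, List, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def max_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical lookup from metric name to scoring function.
METRICS: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "MAE": mean_absolute_error,
    "max_error": max_error,
}


def compare_predictions(
    predictions: Dict[str, Sequence[float]],
    y_true: Sequence[float],
    metrics: List[str],
    calculate_diff: bool = False,
) -> Dict[str, Dict[str, float]]:
    # One inner dictionary of metric values per model name.
    results = {
        name: {m: METRICS[m](y_true, y_pred) for m in metrics}
        for name, y_pred in predictions.items()
    }
    if calculate_diff and len(predictions) > 1:
        names = list(predictions)
        base = names[0]
        # Assumption: differences are reported relative to the first model.
        for other in names[1:]:
            results[f"{other} - {base}"] = {
                m: results[other][m] - results[base][m] for m in metrics
            }
    return results


y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {"model_a": [1.0, 2.0, 3.0, 6.0], "model_b": [2.0, 3.0, 4.0, 5.0]}
report = compare_predictions(predictions, y_true, ["MAE", "max_error"], calculate_diff=True)
print(report)
```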
- confusion_matrix(model: Any, X: ndarray, y: ndarray, filename: str) -> None
Generate and save a confusion matrix.
- Parameters:
- model (Any):
Trained classification model with predict method
- X (ndarray):
The input features.
- y (ndarray):
The true target values.
- filename (str):
The name of the output file (without extension).
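For readers unfamiliar with the underlying table, a confusion matrix simply counts (true label, predicted label) pairs. The helper below is a dependency-free sketch with a hypothetical name, `confusion_counts`; it places true labels on rows and predicted labels on columns, which is a common convention but an assumption about this method's output.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def confusion_counts(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> List[List[int]]:
    # Use every label seen in either sequence, in sorted order.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    # Rows index the true label, columns index the predicted label.
    return [[pairs[(t, p)] for p in labels] for t in labels]


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[1, 1], [1, 2]]
```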
- evaluate_model(model: BaseEstimator, X: DataFrame, y: Series, metrics: List[str], filename: str) -> None
Evaluate model on specified metrics and save results.
- Parameters:
- model (BaseEstimator):
Trained model to evaluate
- X (DataFrame):
Feature data
- y (Series):
Target data
- metrics (list of str):
Names of metrics to calculate
- filename (str):
Output filename (without extension)
- evaluate_model_cv(model: BaseEstimator, X: DataFrame, y: Series, metrics: List[str], filename: str, cv: int = 5) -> None
Evaluate model using cross-validation.
- Parameters:
- model (BaseEstimator):
Model to evaluate
- X (DataFrame):
Feature data
- y (Series):
Target data
- metrics (list of str):
Names of metrics to calculate
- filename (str):
Output filename (without extension)
- cv (int, optional):
Number of cross-validation folds, by default 5
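The role of cv can be illustrated by how k-fold splitting partitions row indices. The sketch below uses contiguous, unshuffled folds; whether this method shuffles or stratifies is not stated here, so treat the splitting strategy and the `k_fold_indices` helper as assumptions.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, cv: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // cv + (1 if i < n_samples % cv else 0) for i in range(cv)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(12, cv=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample appears in exactly one test fold, so every metric is computed cv times on data the model did not see during that fit.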
- hyperparameter_tuning(model: BaseEstimator, method: str, X_train: DataFrame, y_train: Series, scorer: str, kf: int, num_rep: int, n_jobs: int, plot_results: bool = False) -> BaseEstimator
Perform hyperparameter tuning using grid or random search.
- Parameters:
- model (BaseEstimator):
Model to tune
- method ({'grid', 'random'}):
Search method to use
- X_train (DataFrame):
Training data
- y_train (Series):
Training targets
- scorer (str):
Scoring metric
- kf (int):
Number of cross-validation splits
- num_rep (int):
Number of CV repetitions
- n_jobs (int):
Number of parallel jobs
- plot_results (bool, optional):
Whether to plot hyperparameter performance, by default False
- Returns:
- BaseEstimator
Tuned model
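The difference between the two search methods, and the cost implied by kf and num_rep, can be sketched without any ML library. The `param_grid` dictionary, the `candidate_params` helper, and `n_iter` are hypothetical; the signature above does not show where the search space comes from. The fit count assumes repeated k-fold, that is kf × num_rep fits per candidate.

```python
import itertools
import random
from typing import Any, Dict, List


def candidate_params(
    param_grid: Dict[str, List[Any]], method: str, n_iter: int = 3, seed: int = 0
) -> List[Dict[str, Any]]:
    keys = sorted(param_grid)
    # Full Cartesian product of all listed values.
    grid = [dict(zip(keys, combo)) for combo in itertools.product(*(param_grid[k] for k in keys))]
    if method == "grid":
        return grid
    if method == "random":
        # Random search evaluates only a sample of the combinations.
        rng = random.Random(seed)
        return rng.sample(grid, min(n_iter, len(grid)))
    raise ValueError(f"method must be 'grid' or 'random', got {method!r}")


param_grid = {"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]}
grid_candidates = candidate_params(param_grid, "grid")
random_candidates = candidate_params(param_grid, "random", n_iter=3)

kf, num_rep = 5, 2
total_fits = len(grid_candidates) * kf * num_rep
print(len(grid_candidates), len(random_candidates), total_fits)
```

Grid search cost grows multiplicatively with each added hyperparameter, which is the usual reason to prefer the random method for large search spaces.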
- plot_confusion_heatmap(model: Any, X: ndarray, y: ndarray, filename: str) -> None
Plot a heatmap of the confusion matrix for a model.
- Parameters:
- model (Any):
The trained classification model with a predict method.
- X (np.ndarray):
The input features.
- y (np.ndarray):
The target labels.
- filename (str):
The path to save the confusion matrix heatmap image.
- plot_feature_importance(model: BaseEstimator, X: DataFrame, y: Series, threshold: int | float, feature_names: List[str], filename: str, metric: str, num_rep: int) -> None
Plot the feature importance for the model and save the plot.
- Parameters:
- model (BaseEstimator):
The model to evaluate.
- X (pd.DataFrame):
The input features.
- y (pd.Series):
The target data.
- threshold (Union[int, float]):
The number of features or the threshold to filter features by importance.
- feature_names (List[str]):
A list of feature names corresponding to the columns in X.
- filename (str):
The name of the output file (without extension).
- metric (str):
The metric to use for evaluation.
- num_rep (int):
The number of repetitions for calculating importance.
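The dual int/float type of threshold can be read in more than one way. The sketch below shows one plausible interpretation, stated here as an assumption: an int keeps the top N features, while a float keeps every feature whose importance is at least that value. The `select_features` helper and the sample scores are hypothetical.

```python
from typing import List, Sequence, Tuple, Union


def select_features(
    importances: Sequence[float],
    feature_names: Sequence[str],
    threshold: Union[int, float],
) -> List[Tuple[str, float]]:
    # Rank features from most to least important.
    ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
    if isinstance(threshold, int):
        # Assumed reading: an int is a count of features to keep.
        return ranked[:threshold]
    # Assumed reading: a float is a minimum importance cutoff.
    return [pair for pair in ranked if pair[1] >= threshold]


names = ["age", "income", "height", "weight"]
scores = [0.10, 0.55, 0.05, 0.30]
top_two = select_features(scores, names, 2)
above_cutoff = select_features(scores, names, 0.10)
print(top_two)
print(above_cutoff)
```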
- plot_learning_curve(model: BaseEstimator, X_train: DataFrame, y_train: Series, cv: int = 5, num_repeats: int = 1, n_jobs: int = -1, metric: str = 'neg_mean_absolute_error', filename: str = 'learning_curve') -> None
Plot learning curves showing model performance vs training size.
- Parameters:
- model (BaseEstimator):
Model to evaluate
- X_train (DataFrame):
Training features
- y_train (Series):
Training target values
- cv (int, optional):
Number of cross-validation folds, by default 5
- num_repeats (int, optional):
Number of times to repeat CV, by default 1
- n_jobs (int, optional):
Number of parallel jobs, by default -1
- metric (str, optional):
Scoring metric to use, by default 'neg_mean_absolute_error'
- filename (str, optional):
Name for output file, by default 'learning_curve'
- plot_model_comparison(*models: BaseEstimator, X: DataFrame, y: Series, metric: str, filename: str) -> None
Plot a comparison of multiple models based on the specified metric.
- Parameters:
- *models (BaseEstimator):
A variable number of model instances to evaluate.
- X (pd.DataFrame):
The input features.
- y (pd.Series):
The target data.
- metric (str):
The metric to evaluate and plot.
- filename (str):
The name of the output file (without extension).
- plot_precision_recall_curve(model: Any, X: ndarray, y: ndarray, filename: str, pos_label: int | None = 1) -> None
Plot a precision-recall curve with average precision.
- Parameters:
- model (Any):
The trained binary classification model.
- X (np.ndarray):
The input features.
- y (np.ndarray):
The true binary labels.
- filename (str):
The path to save the plot.
- pos_label (Optional[int]):
The label of the positive class, by default 1.
- plot_pred_vs_obs(model: BaseEstimator, X: DataFrame, y_true: Series, filename: str) -> None
Plot predicted vs. observed values and save the plot.
- Parameters:
- model (BaseEstimator):
The trained model.
- X (pd.DataFrame):
The input features.
- y_true (pd.Series):
The true target values.
- filename (str):
The name of the output file (without extension).
- plot_residuals(model: BaseEstimator, X: DataFrame, y: Series, filename: str, add_fit_line: bool = False) -> None
Plot the residuals of the model and save the plot.
- Parameters:
- model (BaseEstimator):
The trained model.
- X (pd.DataFrame):
The input features.
- y (pd.Series):
The true target values.
- filename (str):
The name of the output file (without extension).
- add_fit_line (bool):
Whether to add a line of best fit to the plot, by default False.
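The quantities behind a residual plot are simple to compute by hand. The sketch below assumes residuals are defined as observed minus predicted and that the optional fit line is an ordinary least-squares line through the residuals plotted against the predictions; both are assumptions, as are the helper names.

```python
from typing import List, Sequence, Tuple


def residuals(y_true: Sequence[float], y_pred: Sequence[float]) -> List[float]:
    # Assumed sign convention: observed minus predicted.
    return [t - p for t, p in zip(y_true, y_pred)]


def least_squares_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    # Closed-form simple linear regression: returns (slope, intercept).
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 3.0]
res = residuals(y_true, y_pred)
slope, intercept = least_squares_line(y_pred, res)
print(res, slope, intercept)
```

A fit line with a slope far from zero, as in this toy data, suggests the model's errors depend systematically on the predicted value.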
- plot_roc_curve(model: Any, X: ndarray, y: ndarray, filename: str, pos_label: int | None = 1) -> None
Plot a receiver operating characteristic (ROC) curve with the area under the curve (AUC).
- Parameters:
- model (Any):
The trained binary classification model.
- X (np.ndarray):
The input features.
- y (np.ndarray):
The true binary labels.
- filename (str):
The path to save the ROC curve image.
- pos_label (Optional[int]):
The label of the positive class.
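To show what the ROC curve and its area represent, and how pos_label selects the positive class, here is a dependency-free sketch that works on predicted scores. The `roc_points` and `auc` helpers are hypothetical and are not how this method is implemented; they only illustrate the calculation.

```python
from typing import List, Sequence, Tuple


def roc_points(
    y_true: Sequence[int], scores: Sequence[float], pos_label: int = 1
) -> List[Tuple[float, float]]:
    # Labels equal to pos_label are positives; everything else is negative.
    is_pos = [label == pos_label for label in y_true]
    n_pos = sum(is_pos)
    n_neg = len(is_pos) - n_pos
    points = [(0.0, 0.0)]
    # Lower the decision threshold step by step over the distinct scores.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_pos) if s >= threshold and p)
        fp = sum(1 for s, p in zip(scores, is_pos) if s >= threshold and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    # Trapezoidal rule over (false positive rate, true positive rate) pairs.
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, pos_label=1)
area = auc(points)
print(points, area)
```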
- abstractmethod workflow() -> None
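Because workflow() is abstract, a subclass is expected to supply it. The sketch below uses stand-in classes to show the pattern with Python's `abc` module; `BaseWorkflow` and `MyWorkflow` are hypothetical names, and it is an assumption that the documented class enforces the method in this way.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseWorkflow(ABC):
    """Hypothetical stand-in for the documented base class."""

    @abstractmethod
    def workflow(self) -> None:
        """Subclasses define the sequence of evaluation steps here."""


class MyWorkflow(BaseWorkflow):
    def __init__(self) -> None:
        self.steps: List[str] = []

    def workflow(self) -> None:
        # A real subclass would call methods such as evaluate_model here.
        self.steps.append("evaluate")


# Instantiating the abstract base directly fails with TypeError.
try:
    BaseWorkflow()
    instantiated = True
except TypeError:
    instantiated = False

wf = MyWorkflow()
wf.workflow()
print(instantiated, wf.steps)
```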