Dataset Measures

class ContinuousStatistics(method_name: str, description: str)

Bases: DatasetMeasureEvaluator

Calculate continuous statistics for a dataset.

This evaluator calculates comprehensive descriptive statistics for continuous features in both training and test datasets, including measures of central tendency, dispersion, and distribution shape.

Attributes:
name : str

The name of the evaluator, set to ‘continuous_statistics’

calculate_measures(train_data: DataFrame | Series, test_data: DataFrame | Series, feature_names: List[str]) -> Dict[str, Dict[str, float]]

Calculate continuous statistics for a dataset.

Calculates comprehensive descriptive statistics for continuous features including mean, median, standard deviation, variance, min/max values, percentiles, skewness, kurtosis, and coefficient of variation.

Parameters:
train_data : pd.DataFrame or pd.Series

The training data containing continuous features

test_data : pd.DataFrame or pd.Series

The test data containing continuous features

feature_names : List[str]

The names of the continuous features to calculate statistics for

Returns:
Dict[str, Dict[str, float]]

A nested dictionary containing statistics for each feature. Structure: {feature_name: {split: {statistic: value}}}, where split is ‘train’ or ‘test’ and statistic includes:

- mean, median, std_dev, variance
- min, max, range
- 25_percentile, 75_percentile
- skewness, kurtosis, coefficient_of_variation
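The following is a minimal sketch of how a result with this shape could be produced using plain pandas. It is not the evaluator's actual implementation: the helper names `describe_continuous` and `continuous_statistics` are hypothetical, the inputs are assumed to be DataFrames, and the choices of sample standard deviation (`ddof=1`) and excess kurtosis are pandas defaults rather than anything the documentation above confirms.

```python
import pandas as pd


def describe_continuous(series: pd.Series) -> dict:
    """Hypothetical helper: descriptive statistics for one numeric column."""
    s = series.dropna().astype(float)
    mean = s.mean()
    std_dev = s.std()  # assumption: sample standard deviation (ddof=1)
    return {
        "mean": mean,
        "median": s.median(),
        "std_dev": std_dev,
        "variance": s.var(),
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
        "25_percentile": s.quantile(0.25),
        "75_percentile": s.quantile(0.75),
        "skewness": s.skew(),
        "kurtosis": s.kurt(),  # assumption: excess kurtosis, as pandas reports it
        "coefficient_of_variation": std_dev / mean if mean != 0 else float("nan"),
    }


def continuous_statistics(train: pd.DataFrame, test: pd.DataFrame, feature_names: list) -> dict:
    """Hypothetical helper: build {feature_name: {split: {statistic: value}}}."""
    return {
        name: {
            "train": describe_continuous(train[name]),
            "test": describe_continuous(test[name]),
        }
        for name in feature_names
    }


# Toy data, invented purely for illustration.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})
test = pd.DataFrame({"age": [25.0, 35.0, 45.0, 55.0]})
stats = continuous_statistics(train, test, ["age"])
print(stats["age"]["train"]["mean"], stats["age"]["test"]["mean"])
```

Comparing `stats[feature]["train"]` against `stats[feature]["test"]` statistic by statistic gives a quick check for shifts between the two splits.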

class CategoricalStatistics(method_name: str, description: str)

Bases: DatasetMeasureEvaluator

Calculate categorical statistics for a dataset.

This evaluator calculates descriptive statistics for categorical features in both training and test datasets, including frequency distributions, proportions, entropy, and chi-square tests for distribution differences.

Attributes:
name : str

The name of the evaluator, set to ‘categorical_statistics’

calculate_measures(train_data: DataFrame | Series, test_data: DataFrame | Series, feature_names: List[str]) -> Dict[str, Dict[str, float]]

Calculate categorical statistics for a dataset.

Calculates comprehensive descriptive statistics for categorical features including frequency distributions, proportions, entropy, and chi-square tests to assess distribution differences between train and test sets.

Parameters:
train_data : pd.DataFrame or pd.Series

The training data containing categorical features

test_data : pd.DataFrame or pd.Series

The test data containing categorical features

feature_names : List[str]

The names of the categorical features to calculate statistics for

Returns:
Dict[str, Dict[str, float]]

A nested dictionary containing statistics for each feature. Structure: {feature_name: {split: {statistic: value}}}, where split is ‘train’ or ‘test’ and statistic includes:

- frequency: dict of value counts
- proportion: dict of normalized value counts
- num_unique: number of unique values
- entropy: Shannon entropy of the distribution
- chi_square: chi-square test results (chi2_stat, p_value, degrees_of_freedom)
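As a rough illustration of the statistics listed above, the sketch below computes them for a single categorical column with pandas, NumPy, and SciPy. It is not the evaluator's implementation: `describe_categorical` and `chi_square_train_vs_test` are hypothetical helpers, the base-2 logarithm for entropy is an assumption (the documentation does not state the base), and the chi-square test is assumed to be a contingency-table test of train counts against test counts.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def describe_categorical(series: pd.Series) -> dict:
    """Hypothetical helper: per-split statistics for one categorical column."""
    counts = series.value_counts()
    proportions = counts / counts.sum()
    # Shannon entropy; base 2 (bits) is an assumption.
    entropy = float(-(proportions * np.log2(proportions)).sum())
    return {
        "frequency": counts.to_dict(),
        "proportion": proportions.to_dict(),
        "num_unique": int(series.nunique()),
        "entropy": entropy,
    }


def chi_square_train_vs_test(train: pd.Series, test: pd.Series) -> dict:
    """Hypothetical helper: chi-square test on the train/test contingency table."""
    # Rows are categories, columns are splits; categories missing from a split count as 0.
    table = pd.concat(
        [train.value_counts(), test.value_counts()], axis=1, keys=["train", "test"]
    ).fillna(0)
    chi2_stat, p_value, dof, _expected = chi2_contingency(table.to_numpy())
    return {
        "chi2_stat": float(chi2_stat),
        "p_value": float(p_value),
        "degrees_of_freedom": int(dof),
    }


# Toy data, invented purely for illustration.
train_col = pd.Series(["a"] * 6 + ["b"] * 4, name="color")
test_col = pd.Series(["a"] * 3 + ["b"] * 2, name="color")

train_stats = describe_categorical(train_col)
chi = chi_square_train_vs_test(train_col, test_col)
print(train_stats["num_unique"], round(train_stats["entropy"], 3), chi["degrees_of_freedom"])
```

A small p_value would suggest that the category distribution differs between the train and test splits; in this toy example the proportions are identical, so the test statistic is zero.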