Dataset Measures
- class ContinuousStatistics(method_name: str, description: str)
Bases: DatasetMeasureEvaluator
Calculate continuous statistics for a dataset.
This evaluator calculates comprehensive descriptive statistics for continuous features in both training and test datasets, including measures of central tendency, dispersion, and distribution shape.
- Attributes:
- name: str
The name of the evaluator, set to ‘continuous_statistics’
- calculate_measures(train_data: DataFrame | Series, test_data: DataFrame | Series, feature_names: List[str]) → Dict[str, Dict[str, float]]
Calculate continuous statistics for a dataset.
Calculates comprehensive descriptive statistics for continuous features including mean, median, standard deviation, variance, min/max values, percentiles, skewness, kurtosis, and coefficient of variation.
- Parameters:
- train_data: pd.DataFrame or pd.Series
The training data containing continuous features
- test_data: pd.DataFrame or pd.Series
The test data containing continuous features
- feature_names: List[str]
The names of the continuous features to calculate statistics for
- Returns:
- Dict[str, Dict[str, float]]
A nested dictionary containing statistics for each feature. Structure: {feature_name: {split: {statistic: value}}}, where split is ‘train’ or ‘test’ and statistic is one of:
- mean, median, std_dev, variance
- min, max, range
- 25_percentile, 75_percentile
- skewness, kurtosis, coefficient_of_variation
Note that this structure is nested one level deeper than the annotated return type suggests.
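The nested return structure described above can be illustrated with a small standalone sketch built on plain pandas. It is not the library's implementation: the helper names `describe_continuous` and `sketch_continuous_statistics` are hypothetical, and the use of the sample standard deviation (ddof=1), pandas' excess kurtosis, and DataFrame-only inputs are assumptions made for the example.

```python
import math
from typing import Dict, List

import pandas as pd


def describe_continuous(series: pd.Series) -> Dict[str, float]:
    """Descriptive statistics for one continuous column (illustrative only)."""
    s = series.dropna().astype(float)
    mean = s.mean()
    std = s.std()  # assumption: sample standard deviation (ddof=1)
    return {
        "mean": mean,
        "median": s.median(),
        "std_dev": std,
        "variance": s.var(),
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
        "25_percentile": s.quantile(0.25),
        "75_percentile": s.quantile(0.75),
        "skewness": s.skew(),
        "kurtosis": s.kurt(),  # pandas reports excess kurtosis
        # Undefined when the mean is zero, so return NaN in that case.
        "coefficient_of_variation": std / mean if mean != 0 else math.nan,
    }


def sketch_continuous_statistics(
    train: pd.DataFrame, test: pd.DataFrame, feature_names: List[str]
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Build {feature_name: {split: {statistic: value}}} for both splits."""
    return {
        name: {
            "train": describe_continuous(train[name]),
            "test": describe_continuous(test[name]),
        }
        for name in feature_names
    }


# Toy data, made up for the example.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})
test = pd.DataFrame({"age": [25.0, 35.0, 45.0]})
stats = sketch_continuous_statistics(train, test, ["age"])
print(stats["age"]["train"]["mean"], stats["age"]["test"]["median"])
```

Reading a value then follows the feature, split, statistic order, for example `stats["age"]["train"]["std_dev"]`.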
- class CategoricalStatistics(method_name: str, description: str)
Bases: DatasetMeasureEvaluator
Calculate categorical statistics for a dataset.
This evaluator calculates descriptive statistics for categorical features in both training and test datasets, including frequency distributions, proportions, entropy, and chi-square tests for distribution differences.
- Attributes:
- name: str
The name of the evaluator, set to ‘categorical_statistics’
- calculate_measures(train_data: DataFrame | Series, test_data: DataFrame | Series, feature_names: List[str]) → Dict[str, Dict[str, float]]
Calculate categorical statistics for a dataset.
Calculates comprehensive descriptive statistics for categorical features including frequency distributions, proportions, entropy, and chi-square tests to assess distribution differences between train and test sets.
- Parameters:
- train_data: pd.DataFrame or pd.Series
The training data containing categorical features
- test_data: pd.DataFrame or pd.Series
The test data containing categorical features
- feature_names: List[str]
The names of the categorical features to calculate statistics for
- Returns:
- Dict[str, Dict[str, float]]
A nested dictionary containing statistics for each feature. Structure: {feature_name: {split: {statistic: value}}}, where split is ‘train’ or ‘test’ and statistic is one of:
- frequency: dict of value counts
- proportion: dict of normalized value counts
- num_unique: number of unique values
- entropy: Shannon entropy of the distribution
- chi_square: chi-square test results (chi2_stat, p_value, degrees_of_freedom)
Note that this structure is nested one level deeper than the annotated return type suggests, and that frequency, proportion, and chi_square hold several values each rather than a single float.
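The statistics listed above can be approximated with pandas and SciPy, as in the minimal sketch below. It is not the library's implementation: `describe_categorical` and `chi_square_train_vs_test` are hypothetical helper names, the entropy is computed in bits (base 2) as an assumption, and the chi-square test is assumed to be a contingency-table test of category counts in train versus test via `scipy.stats.chi2_contingency`. How the library attaches the chi-square result to each split is not specified in the text, so the sketch returns it separately.

```python
import math
from typing import Any, Dict

import pandas as pd
from scipy.stats import chi2_contingency


def describe_categorical(series: pd.Series) -> Dict[str, Any]:
    """Per-split statistics for one categorical column (illustrative only)."""
    s = series.dropna()
    counts = s.value_counts()
    proportions = s.value_counts(normalize=True)
    # Shannon entropy; base 2 is an assumption, the docs do not state the base.
    entropy = -sum(p * math.log2(p) for p in proportions if p > 0)
    return {
        "frequency": counts.to_dict(),
        "proportion": proportions.to_dict(),
        "num_unique": int(s.nunique()),
        "entropy": entropy,
    }


def chi_square_train_vs_test(train_col: pd.Series, test_col: pd.Series) -> Dict[str, float]:
    """Chi-square test on the category-by-split contingency table."""
    table = pd.concat(
        [train_col.value_counts(), test_col.value_counts()],
        axis=1,
        keys=["train", "test"],
    ).fillna(0)  # categories missing from one split get a count of zero
    chi2_stat, p_value, dof, _expected = chi2_contingency(table.to_numpy())
    return {
        "chi2_stat": float(chi2_stat),
        "p_value": float(p_value),
        "degrees_of_freedom": int(dof),
    }


# Toy data, made up for the example.
train_color = pd.Series(["a", "a", "b", "b"])
test_color = pd.Series(["a", "b", "b", "b"])
train_stats = describe_categorical(train_color)
test_stats = describe_categorical(test_color)
chi_square = chi_square_train_vs_test(train_color, test_color)
print(train_stats["entropy"], chi_square["degrees_of_freedom"])
```

A small p_value would suggest that the category distribution differs between the two splits. With counts as small as in this toy data the test has little power, so the numbers serve only to show the shape of the output.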