utils module
statista.utils
merge_small_bins(bin_count_observed, bin_count_fitted_data)
Merge small bins for goodness-of-fit tests (e.g., chi-square).
This utility merges adjacent "small" bins (those whose expected count is < 5) starting from the right-most bin and moving left, accumulating small bins until their combined expected count is >= 5. If a large (>= 5) bin is encountered while there is an accumulation, that accumulation is merged into that bin. If the left edge is reached with a remaining accumulation that was never merged into a large bin, the accumulation is appended as its own bin.
After merging, the expected counts are rescaled so that their sum equals the total observed count (required by Pearson's chi-square test), preserving the expected proportions within the merged structure.
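The rescaling step can be written out as a formula. The symbols are notation introduced here for illustration, not names from the library: $O_i$ and $E_j$ are the observed and expected counts of the merged bins, and $E_j'$ is the rescaled expected count.

```latex
E_j' = E_j \cdot \frac{\sum_i O_i}{\sum_k E_k},
\qquad \text{so that} \qquad
\sum_j E_j' = \sum_i O_i .
```

For instance, merged expected counts of 10 and 20 with an observed total of 60 give a scale factor of 60 / 30 = 2, so the rescaled expected counts are 20 and 40. The 1:2 proportion is unchanged. The formula also shows why a total expected count of 0 cannot be rescaled.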
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bin_count_observed | List[float] | Observed counts per original bin. Must be the same length as the other input sequence. | required |
bin_count_fitted_data | List[float] | Expected (model-fitted) counts per original bin. Must be the same length as the other input sequence. | required |
Returns:
Type | Description |
---|---|
Tuple[np.ndarray, np.ndarray] | Two 1D numpy arrays holding the merged observed counts and the merged, rescaled expected counts. |
Raises:
Type | Description |
---|---|
ZeroDivisionError | If the total expected count across merged bins is 0, rescaling cannot be performed (division by zero). This can happen if all expected counts are zero. |
ValueError | If the input sequences have different lengths. |
Notes
- The function assumes a one-to-one correspondence of observed and expected bins. If lengths differ, only a partial zip would occur; to avoid silent truncation, a ValueError is raised.
- Merging proceeds from right to left and the result is then reversed back to low-to-high order.
- The "< 5" rule is a common heuristic for chi-square tests to ensure adequate expected counts per bin.
Examples:
- Merge tail small bins with the nearest large bin on the left
- No merging when all expected counts are >= 5
- Accumulated leftmost small bins remain as their own bin if no large bin is found to the left
- Expected counts are rescaled to match the observed total while preserving proportions
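The four scenarios listed above can be illustrated with a minimal, self-contained sketch of the merging rule as this page describes it. This is not the code in statista/utils.py. The function name `merge_small_bins_sketch`, its `threshold` parameter, and the input values are hypothetical. It also assumes that an accumulation of small bins whose combined expected count reaches 5 on its own is emitted as a separate bin, which the real implementation may handle differently.

```python
from typing import List, Sequence, Tuple

import numpy as np


def merge_small_bins_sketch(
    observed: Sequence[float],
    expected: Sequence[float],
    threshold: float = 5.0,  # hypothetical parameter; the page describes a fixed "< 5" rule
) -> Tuple[np.ndarray, np.ndarray]:
    """Illustrative re-implementation of the merging rule described on this page."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")

    merged_obs: List[float] = []
    merged_exp: List[float] = []
    acc_obs = 0.0  # observed total of the bins accumulated so far
    acc_exp = 0.0  # expected total of the bins accumulated so far
    pending = False

    # Walk from the right-most bin to the left-most one.
    for obs, exp in zip(reversed(observed), reversed(expected)):
        acc_obs += float(obs)
        acc_exp += float(exp)
        pending = True
        # Emit a bin when a large bin absorbs the accumulation, or (assumption)
        # when the accumulated small bins reach the threshold on their own.
        if acc_exp >= threshold:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp, pending = 0.0, 0.0, False

    # Left edge reached with an unmerged accumulation: keep it as its own bin.
    if pending:
        merged_obs.append(acc_obs)
        merged_exp.append(acc_exp)

    # Restore low-to-high order.
    merged_obs.reverse()
    merged_exp.reverse()

    # Rescale expected counts to the observed total, keeping their proportions.
    # Plain Python floats raise ZeroDivisionError if the expected total is 0.
    scale = sum(merged_obs) / sum(merged_exp)
    return np.array(merged_obs), np.array(merged_exp) * scale


# 1. Tail small bins merge into the nearest large bin on the left.
obs1, exp1 = merge_small_bins_sketch([10, 8, 3, 1], [9, 9, 2, 2])
# obs1 -> [10, 12], exp1 -> [9, 13]

# 2. No merging when all expected counts are >= 5.
obs2, exp2 = merge_small_bins_sketch([10, 12, 9], [11, 10, 10])
# obs2 -> [10, 12, 9], exp2 -> [11, 10, 10]

# 3. Leftmost small bins with no large bin to their left stay as their own bin.
obs3, exp3 = merge_small_bins_sketch([2, 1, 10], [1, 2, 10])
# obs3 -> [3, 10], exp3 -> [3, 10]

# 4. Expected counts are rescaled to the observed total (60 / 30 = 2).
obs4, exp4 = merge_small_bins_sketch([30, 30], [10, 20])
# obs4 -> [30, 30], exp4 -> [20, 40]
```

In the first three cases the observed and expected totals already agree, so the scale factor is 1 and only the merging is visible. The fourth case isolates the rescaling step.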
Source code in statista/utils.py