utils module

statista.utils

merge_small_bins(bin_count_observed, bin_count_fitted_data)

Merge small bins for goodness-of-fit tests (e.g., chi-square).

This utility merges adjacent "small" bins (those whose expected count is < 5) into the nearest "large" bin (expected count >= 5) on their left. It walks from the right-most bin to the left, accumulating consecutive small bins; when a large bin is encountered while there is an accumulation, that accumulation is merged into that bin, even if the accumulated expected count has already reached 5. If the left edge is reached with a remaining accumulation that was never merged into a large bin, the accumulation is appended as its own bin, which may still have an expected count below 5.

After merging, the expected counts are rescaled so that their sum equals the total observed count (required by Pearson's chi-square test), preserving the expected proportions within the merged structure.
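
The rescaling step can be illustrated with plain Python lists. This is a minimal sketch, not the library's code: `rescale_expected` is a hypothetical helper that multiplies every expected count by the ratio of the observed total to the expected total, which leaves each bin's share of the total unchanged.

```python
from typing import List, Sequence


def rescale_expected(observed: Sequence[float], expected: Sequence[float]) -> List[float]:
    """Scale expected counts so their sum matches the observed total.

    Hypothetical helper for illustration; not part of statista.
    """
    scale = sum(observed) / sum(expected)  # one factor applied to every bin
    return [e * scale for e in expected]


observed = [10, 10]
expected = [4, 6]
rescaled = rescale_expected(observed, expected)

# Each bin keeps its share of the total: 4/10 before, 8/20 after.
shares_before = [e / sum(expected) for e in expected]
shares_after = [e / sum(rescaled) for e in rescaled]
```

Here the observed total is 20 and the expected total is 10, so every expected count is doubled.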

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| bin_count_observed | List[float] | Observed counts per original bin. Must be the same length as bin_count_fitted_data. Values should be non-negative. | required |
| bin_count_fitted_data | List[float] | Expected (model-fitted) counts per original bin. Must be the same length as bin_count_observed. Values should be non-negative. | required |

Returns:

| Type | Description |
|---|---|
| Tuple[np.ndarray, np.ndarray] | Two 1D numpy arrays (merged_observed, merged_expected) in low-to-high bin order after merging and rescaling. The two arrays are the same length, and merged_expected.sum() equals merged_observed.sum() up to floating-point rounding. |

Raises:

| Type | Description |
|---|---|
| ZeroDivisionError | If the total expected count across merged bins is 0, rescaling cannot be performed (division by zero). This can happen if all expected counts are zero. |
| ValueError | If the input sequences have different lengths. |
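
The length check matters because Python's built-in `zip` stops at the shorter input without complaint. The sketch below shows that truncation and a guard in the same spirit as the documented `ValueError`; `check_same_length` is a hypothetical stand-in, not the library's function.

```python
observed = [10, 3, 2]
expected = [10, 3]  # one bin short

# zip stops at the shorter input, so the last observed bin disappears silently
pairs = list(zip(observed, expected))


def check_same_length(obs, exp):
    """Hypothetical guard mirroring the documented ValueError."""
    if len(obs) != len(exp):
        raise ValueError("observed and expected must have the same length")


try:
    check_same_length(observed, expected)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```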

Notes
  • The function assumes a one-to-one correspondence of observed and expected bins. If lengths differ, only a partial zip would occur; to avoid silent truncation a ValueError is raised.
  • Merging proceeds from right to left and the result is then reversed back to low-to-high order.
  • The "< 5" rule is a common heuristic for chi-square tests to ensure adequate expected counts per bin.
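
To see which original bins end up together, the right-to-left pass can be sketched on bin indices instead of counts. `merge_groups` is a hypothetical helper written for this illustration. It assumes every small bin counts as pending, including bins with an expected count of 0, which may differ from the library's handling of that edge case.

```python
from typing import List, Sequence


def merge_groups(expected: Sequence[float], threshold: float = 5) -> List[List[int]]:
    """Return groups of original bin indices, in low-to-high order.

    Hypothetical helper for illustration; not part of statista.
    """
    groups: List[List[int]] = []
    pending: List[int] = []  # small bins waiting for a large bin to their left
    for idx in range(len(expected) - 1, -1, -1):  # right-most bin first
        if expected[idx] < threshold:
            pending.append(idx)
        else:
            # a large bin absorbs whatever small bins were collected to its right
            groups.append(sorted(pending + [idx]))
            pending = []
    if pending:  # left edge reached with small bins still pending
        groups.append(sorted(pending))
    return groups[::-1]  # reverse back to low-to-high order


tail_merge = merge_groups([10, 3, 2])   # all three bins form one group
left_over = merge_groups([2, 3, 5])     # bins 0 and 1 form their own group
```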

Examples:

  • Merge tail small bins with the nearest large bin on the left

    >>> from statista.utils import merge_small_bins
    >>> merge_small_bins([10, 3, 2], [10, 3, 2])
    (array([15]), array([15.]))
    
  • No merging when all expected counts are >= 5

    >>> merge_small_bins([10, 20, 30], [12, 18, 30])
    (array([10, 20, 30]), array([12., 18., 30.]))
    
  • Accumulated leftmost small bins remain as their own bin if no large bin is found to the left

    >>> merge_small_bins([10, 10], [4, 6])
    (array([10, 10]), array([ 8., 12.]))
    
  • Expected counts are rescaled to match the observed total while preserving proportions

    >>> merge_small_bins([5, 5, 10], [2, 3, 5])
    (array([10, 10]), array([10., 10.]))
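
Once the bins are merged, the two arrays can be fed into a Pearson chi-square statistic. The sketch below computes the statistic by hand for the output of the third example above; `pearson_chi_square` is a hypothetical helper, and plain lists stand in for the returned numpy arrays.

```python
def pearson_chi_square(observed, expected):
    """Sum of (O - E)^2 / E over bins (hypothetical helper, not part of statista)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


merged_observed = [10, 10]
merged_expected = [8.0, 12.0]  # output of merge_small_bins([10, 10], [4, 6])

# (10 - 8)^2 / 8 + (10 - 12)^2 / 12 = 0.5 + 0.333...
statistic = pearson_chi_square(merged_observed, merged_expected)
```

In practice a library routine such as `scipy.stats.chisquare(f_obs, f_exp)` could be used instead of the hand-written sum.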
    
Source code in statista/utils.py
def merge_small_bins(bin_count_observed: List[float], bin_count_fitted_data: List[float]):
    """Merge small bins for goodness-of-fit tests (e.g., chi-square).

    This utility merges adjacent "small" bins (those whose expected count is < 5)
    into the nearest "large" bin (expected count >= 5) on their left. It walks
    from the right-most bin to the left, accumulating consecutive small bins;
    when a large bin is encountered while there is an accumulation, that
    accumulation is merged into that bin, even if the accumulated expected count
    has already reached 5. If the left edge is reached with a remaining
    accumulation that was never merged into a large bin, the accumulation is
    appended as its own bin, which may still have an expected count below 5.

    After merging, the expected counts are rescaled so that their sum equals the
    total observed count (required by Pearson's chi-square test), preserving the
    expected proportions within the merged structure.

    Args:
        bin_count_observed (List[float]):
            Observed counts per original bin. Must be the same length as
            ``bin_count_fitted_data``. Values should be non-negative.
        bin_count_fitted_data (List[float]):
            Expected (model-fitted) counts per original bin. Must be the same
            length as ``bin_count_observed``. Values should be non-negative.

    Returns:
        Tuple[np.ndarray, np.ndarray]:
            Two 1D numpy arrays ``(merged_observed, merged_expected)`` in
            low-to-high bin order after merging and rescaling. The two arrays
            are the same length, and ``merged_expected.sum()`` equals
            ``merged_observed.sum()`` up to floating-point rounding.

    Raises:
        ZeroDivisionError: If the total expected count across merged bins is 0,
            rescaling cannot be performed (division by zero). This can happen if
            all expected counts are zero.
        ValueError: If the input sequences have different lengths.

    Notes:
        - The function assumes a one-to-one correspondence of observed and
          expected bins. If lengths differ, only a partial zip would occur; to
          avoid silent truncation a ``ValueError`` is raised.
        - Merging proceeds from right to left and the result is then reversed
          back to low-to-high order.
        - The "< 5" rule is a common heuristic for chi-square tests to ensure
          adequate expected counts per bin.

    Examples:
        - Merge tail small bins with the nearest large bin on the left

            ```python
            >>> from statista.utils import merge_small_bins
            >>> merge_small_bins([10, 3, 2], [10, 3, 2])
            (array([15]), array([15.]))

            ```

        - No merging when all expected counts are >= 5

            ```python
            >>> merge_small_bins([10, 20, 30], [12, 18, 30])
            (array([10, 20, 30]), array([12., 18., 30.]))

            ```

        - Accumulated leftmost small bins remain as their own bin if no large bin is found to the left

            ```python
            >>> merge_small_bins([10, 10], [4, 6])
            (array([10, 10]), array([ 8., 12.]))

            ```

        - Expected counts are rescaled to match the observed total while preserving proportions

            ```python
            >>> merge_small_bins([5, 5, 10], [2, 3, 5])
            (array([10, 10]), array([10., 10.]))

            ```
    """
    if len(bin_count_observed) != len(bin_count_fitted_data):
        raise ValueError("bin_count_observed and bin_count_fitted_data must have the same length.")

    # Merge tail bins whose expected counts are < 5
    merged_obs = []
    merged_exp = []
    accum_obs  = 0
    accum_exp  = 0

    # Work from the rightmost bin backwards, accumulating small bins until a
    # bin with expected count ≥ 5 is reached
    for observed, expected in reversed(list(zip(bin_count_observed, bin_count_fitted_data))):
        if expected < 5:
            accum_obs += observed
            accum_exp += expected
        else:
            if accum_exp > 0:
                # combine the accumulated small bins with this one
                accum_obs += observed
                accum_exp += expected
                merged_obs.append(accum_obs)
                merged_exp.append(accum_exp)
                accum_obs = accum_exp = 0
            else:
                # keep this bin separate
                merged_obs.append(observed)
                merged_exp.append(expected)

    # Append any remaining accumulated bins
    if accum_exp > 0:
        merged_obs.append(accum_obs)
        merged_exp.append(accum_exp)

    # Reverse the order back to low→high
    merged_obs = np.array(merged_obs[::-1])
    merged_exp = np.array(merged_exp[::-1]).astype(float)

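    # Rescale expected counts so they sum to the total number of observations
    # This is required for Pearson’s χ² test
    total_exp = merged_exp.sum()
    if total_exp == 0:
        # NumPy scalar division only warns and yields nan/inf, so raise explicitly
        raise ZeroDivisionError("Total expected count is 0; cannot rescale expected counts.")
    merged_exp *= merged_obs.sum() / total_exp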
    return merged_obs, merged_exp