Rhine River Discharge Analysis using Statista Distributions¶

This notebook demonstrates how to use the statista.distributions module to analyze discharge time series data from the Rhine River. We'll explore different probability distributions, fit them to the data, and calculate return periods and flood frequency curves.

The Rhine River is one of Europe's major rivers, flowing through several countries including Switzerland, Liechtenstein, Austria, Germany, France and the Netherlands. Analyzing discharge data is crucial for flood risk assessment, water resource management, and understanding the hydrological behavior of the river.

In this notebook, we will:

Load and preprocess discharge time series data from multiple gauges along the Rhine River
Fit different probability distributions to the data
Evaluate the goodness of fit for each distribution
Calculate return periods and flood frequency curves
Visualize the results

In [ ]:

Copied!





# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import statista distributions module
from statista.distributions import Distributions, PlottingPosition, Gumbel, GEV, Exponential, Normal

# Set plot style
plt.style.use('ggplot')

# Display all columns in pandas DataFrames
pd.set_option('display.max_columns', None)
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import statista distributions module
from statista.distributions import Distributions, PlottingPosition, Gumbel, GEV, Exponential, Normal

# Set plot style
plt.style.use('ggplot')

# Display all columns in pandas DataFrames
pd.set_option('display.max_columns', None)

Data Loading and Preprocessing¶

We'll load the Rhine River discharge data from the CSV file. The data contains daily discharge measurements from multiple gauges along the river. The first column is the date, and the remaining columns represent different gauges.

We need to handle missing values (empty strings) in the data and convert the date column to a datetime format. This preprocessing step is crucial for ensuring the quality of our analysis.

In [ ]:

Copied!





def load_and_preprocess_data(file_path):
    """
    Load and preprocess the Rhine River discharge data.

    - Reads a CSV file with a 'date' column and discharge columns for each gauge.
    - Parses the date column into `datetime` objects and sets it as the index.
    - Converts empty strings to NaN and casts all gauge columns to floats.
    - Returns a clean `DataFrame` ready for analysis.
    
    Args:
        file_path: Path to the CSV file containing the discharge data
        
    Returns:
        pandas.DataFrame: Preprocessed discharge data
    """
    # Load the data
    df = pd.read_csv(file_path)
    
    # Display the first few rows of the data
    print(f"Data shape: {df.shape}")
    print(df.head())
    
    # Convert the date column to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    # Set the date column as the index
    df.set_index('date', inplace=True)
    
    # Check for missing values
    print("\nNumber of missing values in each column:")
    print(df.isna().sum())
    
    # Convert empty strings to NaN
    df = df.replace('', np.nan)
    
    # Convert all columns to numeric
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Check for missing values again
    print("\nNumber of missing values after conversion:")
    print(df.isna().sum())
    
    # Display basic statistics
    print("\nBasic statistics:")
    print(df.describe())
    
    return df

# Define the path to the data file
file_path = '../../../examples/data/rhine-full-time-series.csv'

# Load and preprocess the data
print("Loading and preprocessing the data...")
df = load_and_preprocess_data(file_path)
def load_and_preprocess_data(file_path):
    """
    Load and preprocess the Rhine River discharge data.

    - Reads a CSV file with a 'date' column and discharge columns for each gauge.
    - Parses the date column into `datetime` objects and sets it as the index.
    - Converts empty strings to NaN and casts all gauge columns to floats.
    - Returns a clean `DataFrame` ready for analysis.
    
    Args:
        file_path: Path to the CSV file containing the discharge data
        
    Returns:
        pandas.DataFrame: Preprocessed discharge data
    """
    # Load the data
    df = pd.read_csv(file_path)
    
    # Display the first few rows of the data
    print(f"Data shape: {df.shape}")
    print(df.head())
    
    # Convert the date column to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    # Set the date column as the index
    df.set_index('date', inplace=True)
    
    # Check for missing values
    print("\nNumber of missing values in each column:")
    print(df.isna().sum())
    
    # Convert empty strings to NaN
    df = df.replace('', np.nan)
    
    # Convert all columns to numeric
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Check for missing values again
    print("\nNumber of missing values after conversion:")
    print(df.isna().sum())
    
    # Display basic statistics
    print("\nBasic statistics:")
    print(df.describe())
    
    return df

# Define the path to the data file
file_path = '../../../examples/data/rhine-full-time-series.csv'

# Load and preprocess the data
print("Loading and preprocessing the data...")
df = load_and_preprocess_data(file_path)

Exploratory Data Analysis¶

Let's explore the data to understand the discharge patterns at different gauges along the Rhine River. We'll visualize the time series, examine the distribution of discharge values, and identify any seasonal patterns.

Time series analysis helps us understand how discharge varies over time, while histograms provide insights into the statistical distribution of discharge values. This exploratory analysis is essential for selecting appropriate probability distributions for modeling.

In [ ]:

Copied!





def plot_time_series(df, selected_gauges):
    """
    Plot time series for selected gauges.
    
    Args:
        df: DataFrame containing the discharge data
        selected_gauges: List of gauge names to plot
    """
    plt.figure(figsize=(14, 8))
    for gauge in selected_gauges:
        if gauge in df.columns:
            plt.plot(df.index, df[gauge], label=gauge)
    plt.title('Discharge Time Series for Selected Gauges')
    plt.xlabel('Date')
    plt.ylabel('Discharge (m³/s)')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_histograms(df, selected_gauges):
    """
    Create histograms for selected gauges.
    
    Args:
        df: DataFrame containing the discharge data
        selected_gauges: List of gauge names to plot
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    for i, gauge in enumerate(selected_gauges):
        if gauge in df.columns and i < len(axes):
            # Use matplotlib's histogram function
            data = df[gauge].dropna()
            axes[i].hist(data, bins=20, density=True, alpha=0.7)
            
            # Add a density curve
            from scipy import stats
            min_val, max_val = data.min(), data.max()
            x = np.linspace(min_val, max_val, 1000)
            kde = stats.gaussian_kde(data)
            axes[i].plot(x, kde(x), 'r-', linewidth=2)
            
            axes[i].set_title(f'Distribution of Discharge at {gauge}')
            axes[i].set_xlabel('Discharge (m³/s)')
            axes[i].set_ylabel('Density')
    
    plt.tight_layout()
    plt.show()

# Define selected gauges for analysis
selected_gauges = ['rees', 'cologne', 'kaub', 'mainz']

# Plot time series
print("\nPlotting time series for selected gauges...")
plot_time_series(df, selected_gauges)

# Plot histograms
print("\nPlotting histograms for selected gauges...")
plot_histograms(df, selected_gauges)
def plot_time_series(df, selected_gauges):
    """
    Plot time series for selected gauges.
    
    Args:
        df: DataFrame containing the discharge data
        selected_gauges: List of gauge names to plot
    """
    plt.figure(figsize=(14, 8))
    for gauge in selected_gauges:
        if gauge in df.columns:
            plt.plot(df.index, df[gauge], label=gauge)
    plt.title('Discharge Time Series for Selected Gauges')
    plt.xlabel('Date')
    plt.ylabel('Discharge (m³/s)')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_histograms(df, selected_gauges):
    """
    Create histograms for selected gauges.
    
    Args:
        df: DataFrame containing the discharge data
        selected_gauges: List of gauge names to plot
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    for i, gauge in enumerate(selected_gauges):
        if gauge in df.columns and i < len(axes):
            # Use matplotlib's histogram function
            data = df[gauge].dropna()
            axes[i].hist(data, bins=20, density=True, alpha=0.7)
            
            # Add a density curve
            from scipy import stats
            min_val, max_val = data.min(), data.max()
            x = np.linspace(min_val, max_val, 1000)
            kde = stats.gaussian_kde(data)
            axes[i].plot(x, kde(x), 'r-', linewidth=2)
            
            axes[i].set_title(f'Distribution of Discharge at {gauge}')
            axes[i].set_xlabel('Discharge (m³/s)')
            axes[i].set_ylabel('Density')
    
    plt.tight_layout()
    plt.show()

# Define selected gauges for analysis
selected_gauges = ['rees', 'cologne', 'kaub', 'mainz']

# Plot time series
print("\nPlotting time series for selected gauges...")
plot_time_series(df, selected_gauges)

# Plot histograms
print("\nPlotting histograms for selected gauges...")
plot_histograms(df, selected_gauges)

Annual Maximum Discharge Analysis¶

For flood frequency analysis, we typically focus on annual maximum discharge values. This approach, known as the Annual Maximum Series (AMS) method, is widely used in hydrology for estimating the probability of extreme flood events.

By extracting the maximum discharge value for each year, we create a dataset that represents the most extreme conditions observed annually. This dataset is then used to fit probability distributions that model the frequency of extreme events.

In [ ]:

Copied!





def extract_annual_maxima(df):
    """
    Extract annual maximum discharge for each gauge.
    
    Args:
        df: DataFrame containing the discharge data
        
    Returns:
        pandas.DataFrame: Annual maximum discharge for each gauge
    """
    # Extract annual maximum discharge for each gauge
    annual_max = df.resample('YE').max()
    
    return annual_max

def plot_annual_maxima(annual_max, selected_gauges):
    """
    Plot annual maximum discharge for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        selected_gauges: List of gauge names to plot
    """
    plt.figure(figsize=(14, 8))
    for gauge in selected_gauges:
        if gauge in annual_max.columns:
            plt.plot(annual_max.index, annual_max[gauge], 'o-', label=gauge)
    plt.title('Annual Maximum Discharge for Selected Gauges')
    plt.xlabel('Year')
    plt.ylabel('Maximum Discharge (m³/s)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Extract annual maximum discharge
print("\nExtracting annual maximum discharge...")
annual_max = extract_annual_maxima(df)

# Plot annual maximum discharge
print("\nPlotting annual maximum discharge for selected gauges...")
plot_annual_maxima(annual_max, selected_gauges)
def extract_annual_maxima(df):
    """
    Extract annual maximum discharge for each gauge.
    
    Args:
        df: DataFrame containing the discharge data
        
    Returns:
        pandas.DataFrame: Annual maximum discharge for each gauge
    """
    # Extract annual maximum discharge for each gauge
    annual_max = df.resample('YE').max()
    
    return annual_max

def plot_annual_maxima(annual_max, selected_gauges):
    """
    Plot annual maximum discharge for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        selected_gauges: List of gauge names to plot
    """
    plt.figure(figsize=(14, 8))
    for gauge in selected_gauges:
        if gauge in annual_max.columns:
            plt.plot(annual_max.index, annual_max[gauge], 'o-', label=gauge)
    plt.title('Annual Maximum Discharge for Selected Gauges')
    plt.xlabel('Year')
    plt.ylabel('Maximum Discharge (m³/s)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Extract annual maximum discharge
print("\nExtracting annual maximum discharge...")
annual_max = extract_annual_maxima(df)

# Plot annual maximum discharge
print("\nPlotting annual maximum discharge for selected gauges...")
plot_annual_maxima(annual_max, selected_gauges)

Fitting Probability Distributions¶

Now we'll fit different probability distributions to the annual maximum discharge data for each gauge. We'll use the following distributions from the statista.distributions module:

Gumbel distribution: Also known as the Extreme Value Type I distribution, it's commonly used for modeling maximum values, such as annual flood peaks.
Generalized Extreme Value (GEV) distribution: A flexible three-parameter distribution that encompasses the Gumbel, Fréchet, and Weibull distributions, making it suitable for modeling extreme events.
Normal distribution: While not typically used for extreme values, it serves as a baseline for comparison.
Exponential distribution: Often used for modeling the time between events in a Poisson process, but can also be applied to certain hydrological variables.

We'll evaluate the goodness of fit using the Kolmogorov-Smirnov test, which measures the maximum difference between the empirical and theoretical cumulative distribution functions.

In [ ]:

Copied!





def fit_distributions(data, method="lmoments"):
    """
    Fit different distributions to the data and evaluate goodness of fit.
    
    Args:
        data: numpy array of discharge values
        method: fitting method ('mle' for Maximum Likelihood Estimation or 'lmoments' for L-moments)
        
    Returns:
        dict: Dictionary of fitted distribution objects and test results
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Initialize distributions
    gumbel = Gumbel(data=data)
    gev = GEV(data=data)
    normal = Normal(data=data)
    exponential = Exponential(data=data)
    
    # Fit distributions
    gumbel_params = gumbel.fit_model(method=method)
    gev_params = gev.fit_model(method=method)
    normal_params = normal.fit_model(method=method)
    exponential_params = exponential.fit_model(method=method)
    
    # Perform Kolmogorov-Smirnov test
    gumbel_ks = gumbel.ks()
    gev_ks = gev.ks()
    normal_ks = normal.ks()
    exponential_ks = exponential.ks()
    
    # Return results
    return {
        'Gumbel': {'dist': gumbel, 'params': gumbel_params, 'ks': gumbel_ks},
        'GEV': {'dist': gev, 'params': gev_params, 'ks': gev_ks},
        'Normal': {'dist': normal, 'params': normal_params, 'ks': normal_ks},
        'Exponential': {'dist': exponential, 'params': exponential_params, 'ks': exponential_ks}
    }

#Example usage:
for gauge in selected_gauges:
    if gauge in annual_max.columns:
        data = annual_max[gauge].values
        results = fit_distributions(data)
def fit_distributions(data, method="lmoments"):
    """
    Fit different distributions to the data and evaluate goodness of fit.
    
    Args:
        data: numpy array of discharge values
        method: fitting method ('mle' for Maximum Likelihood Estimation or 'lmoments' for L-moments)
        
    Returns:
        dict: Dictionary of fitted distribution objects and test results
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Initialize distributions
    gumbel = Gumbel(data=data)
    gev = GEV(data=data)
    normal = Normal(data=data)
    exponential = Exponential(data=data)
    
    # Fit distributions
    gumbel_params = gumbel.fit_model(method=method)
    gev_params = gev.fit_model(method=method)
    normal_params = normal.fit_model(method=method)
    exponential_params = exponential.fit_model(method=method)
    
    # Perform Kolmogorov-Smirnov test
    gumbel_ks = gumbel.ks()
    gev_ks = gev.ks()
    normal_ks = normal.ks()
    exponential_ks = exponential.ks()
    
    # Return results
    return {
        'Gumbel': {'dist': gumbel, 'params': gumbel_params, 'ks': gumbel_ks},
        'GEV': {'dist': gev, 'params': gev_params, 'ks': gev_ks},
        'Normal': {'dist': normal, 'params': normal_params, 'ks': normal_ks},
        'Exponential': {'dist': exponential, 'params': exponential_params, 'ks': exponential_ks}
    }

#Example usage:
for gauge in selected_gauges:
    if gauge in annual_max.columns:
        data = annual_max[gauge].values
        results = fit_distributions(data)

Visualizing Fitted Distributions¶

After fitting the distributions, it's important to visualize how well they match the empirical data. We'll create plots of both the probability density function (PDF) and the cumulative distribution function (CDF) for each distribution.

The PDF shows the relative likelihood of different discharge values, while the CDF shows the probability of not exceeding a given discharge value. These visualizations help us assess which distribution provides the best fit to the observed data.

In [ ]:

Copied!





def plot_fitted_distributions(data, fitted_dists, gauge_name):
    """
    Plot the empirical and fitted distributions.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical CDF using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    
    # Create figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Plot PDF
    ax1.hist(data, bins=20, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(min(data), max(data), 1000)
    
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Plot PDF
        # y_pdf = dist._pdf_eq(x, params)
        y_pdf = dist.pdf(data=x, parameters=params)
        ax1.plot(x, y_pdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
    
    ax1.set_title(f'Probability Density Function - {gauge_name}')
    ax1.set_xlabel('Discharge (m³/s)')
    ax1.set_ylabel('Density')
    ax1.legend()
    ax1.grid(True)
    
    # Plot CDF
    ax2.plot(data, pp, 'o', label='Empirical')
    
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Plot CDF
        y_cdf = dist.cdf(data=x, parameters=params)
        ax2.plot(x, y_cdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
    
    ax2.set_title(f'Cumulative Distribution Function - {gauge_name}')
    ax2.set_xlabel('Discharge (m³/s)')
    ax2.set_ylabel('Probability of Non-Exceedance')
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
def plot_fitted_distributions(data, fitted_dists, gauge_name):
    """
    Plot the empirical and fitted distributions.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical CDF using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    
    # Create figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Plot PDF
    ax1.hist(data, bins=20, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(min(data), max(data), 1000)
    
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Plot PDF
        # y_pdf = dist._pdf_eq(x, params)
        y_pdf = dist.pdf(data=x, parameters=params)
        ax1.plot(x, y_pdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
    
    ax1.set_title(f'Probability Density Function - {gauge_name}')
    ax1.set_xlabel('Discharge (m³/s)')
    ax1.set_ylabel('Density')
    ax1.legend()
    ax1.grid(True)
    
    # Plot CDF
    ax2.plot(data, pp, 'o', label='Empirical')
    
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Plot CDF
        y_cdf = dist.cdf(data=x, parameters=params)
        ax2.plot(x, y_cdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
    
    ax2.set_title(f'Cumulative Distribution Function - {gauge_name}')
    ax2.set_xlabel('Discharge (m³/s)')
    ax2.set_ylabel('Probability of Non-Exceedance')
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()

Flood Frequency Analysis¶

Flood frequency analysis is a statistical technique used to estimate the probability of occurrence of a given discharge magnitude. The results are typically expressed in terms of return periods, which represent the average time interval between events of a given magnitude.

The flood frequency curve plots discharge values against their corresponding return periods. This curve is a valuable tool for flood risk assessment and the design of hydraulic structures.

In [ ]:

Copied!





def plot_flood_frequency_curve(data, fitted_dists, gauge_name):
    """
    Calculate return periods and plot flood frequency curves.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical return periods using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    rp = PlottingPosition.return_period(pp)
    
    # Create figure
    plt.figure(figsize=(12, 8))
    
    # Plot empirical return periods
    plt.semilogx(rp, data, 'o', label='Empirical')
    
    # Generate return periods for plotting
    return_periods = np.logspace(0, 3, 1000)  # 1 to 1000 years
    non_exceed_prob = 1 - 1/return_periods
    
    # Plot theoretical return periods for each distribution
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Calculate quantiles for each return period
        quantiles = dist.inverse_cdf(non_exceed_prob, params)
        
        # Plot flood frequency curve
        plt.semilogx(return_periods, quantiles, label=name)
    
    plt.title(f'Flood Frequency Curve - {gauge_name}')
    plt.xlabel('Return Period (years)')
    plt.ylabel('Discharge (m³/s)')
    plt.grid(True)
    plt.legend()
    
    # Add vertical lines for common return periods
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500]
    for rp_val in common_rp:
        plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
        plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
    
    plt.show()
def plot_flood_frequency_curve(data, fitted_dists, gauge_name):
    """
    Calculate return periods and plot flood frequency curves.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical return periods using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    rp = PlottingPosition.return_period(pp)
    
    # Create figure
    plt.figure(figsize=(12, 8))
    
    # Plot empirical return periods
    plt.semilogx(rp, data, 'o', label='Empirical')
    
    # Generate return periods for plotting
    return_periods = np.logspace(0, 3, 1000)  # 1 to 1000 years
    non_exceed_prob = 1 - 1/return_periods
    
    # Plot theoretical return periods for each distribution
    for name, dist_info in fitted_dists.items():
        dist = dist_info['dist']
        params = dist_info['params']
        
        # Calculate quantiles for each return period
        quantiles = dist.inverse_cdf(non_exceed_prob, params)
        
        # Plot flood frequency curve
        plt.semilogx(return_periods, quantiles, label=name)
    
    plt.title(f'Flood Frequency Curve - {gauge_name}')
    plt.xlabel('Return Period (years)')
    plt.ylabel('Discharge (m³/s)')
    plt.grid(True)
    plt.legend()
    
    # Add vertical lines for common return periods
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500]
    for rp_val in common_rp:
        plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
        plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
    
    plt.show()

In [ ]:

Copied!





def analyze_distributions(annual_max, selected_gauges):
    """
    Fit distributions to annual maximum discharge for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        selected_gauges: List of gauge names to analyze
        
    Returns:
        dict: Dictionary of fitted distribution results for each gauge
    """
    results = {}
    
    for gauge in selected_gauges:
        if gauge in annual_max.columns:
            print(f"\nFitting distributions to {gauge}...")
            data = annual_max[gauge].values
            results[gauge] = fit_distributions(data)
            
            # Print goodness of fit results
            print(f"\nGoodness of fit results for {gauge}:")
            for dist_name, dist_info in results[gauge].items():
                ks_stat = dist_info['ks'][0]
                ks_pvalue = dist_info['ks'][1]
                print(f"{dist_name}: KS statistic = {ks_stat:.4f}, p-value = {ks_pvalue:.4f}")
            
            # Plot fitted distributions
            plot_fitted_distributions(data, results[gauge], gauge)
            
            # Plot flood frequency curve
            plot_flood_frequency_curve(data, results[gauge], gauge)
    
    return results

# Fit distributions and analyze results
print("\nFitting distributions and analyzing results...")
results = analyze_distributions(annual_max, selected_gauges)
def analyze_distributions(annual_max, selected_gauges):
    """
    Fit distributions to annual maximum discharge for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        selected_gauges: List of gauge names to analyze
        
    Returns:
        dict: Dictionary of fitted distribution results for each gauge
    """
    results = {}
    
    for gauge in selected_gauges:
        if gauge in annual_max.columns:
            print(f"\nFitting distributions to {gauge}...")
            data = annual_max[gauge].values
            results[gauge] = fit_distributions(data)
            
            # Print goodness of fit results
            print(f"\nGoodness of fit results for {gauge}:")
            for dist_name, dist_info in results[gauge].items():
                ks_stat = dist_info['ks'][0]
                ks_pvalue = dist_info['ks'][1]
                print(f"{dist_name}: KS statistic = {ks_stat:.4f}, p-value = {ks_pvalue:.4f}")
            
            # Plot fitted distributions
            plot_fitted_distributions(data, results[gauge], gauge)
            
            # Plot flood frequency curve
            plot_flood_frequency_curve(data, results[gauge], gauge)
    
    return results

# Fit distributions and analyze results
print("\nFitting distributions and analyzing results...")
results = analyze_distributions(annual_max, selected_gauges)

Design Floods¶

Design floods are discharge values associated with specific return periods. They are used in the design of hydraulic structures, flood protection measures, and risk assessment.

For example, a 100-year flood (a flood with a 1% annual probability of occurrence) is often used as a design standard for major infrastructure. By calculating design floods for different return periods, we can provide valuable information for flood risk management and infrastructure planning.

In [ ]:

Copied!





def find_best_distribution(fitted_dists):
    """
    Find the best-fitting distribution based on KS test p-value.
    
    Args:
        fitted_dists: dictionary of fitted distribution objects
        
    Returns:
        tuple: (best distribution name, distribution info)
    """
    best_dist = None
    best_pvalue = -1
    
    for name, dist_info in fitted_dists.items():
        pvalue = dist_info['ks'][1]
        if pvalue > best_pvalue:
            best_pvalue = pvalue
            best_dist = (name, dist_info)
    
    return best_dist

def calculate_design_floods(results):
    """
    Calculate design floods for common return periods.
    
    Args:
        results: Dictionary of fitted distribution results for each gauge
        
    Returns:
        pandas.DataFrame: Design floods for different return periods
    """
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
    design_floods = {}
    
    for gauge, fitted_dists in results.items():
        best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
        dist = best_dist_info['dist']
        params = best_dist_info['params']
        
        # Calculate non-exceedance probabilities for common return periods
        non_exceed_prob = 1 - 1/np.array(common_rp)
        
        # Calculate quantiles (design floods)
        quantiles = dist.inverse_cdf(non_exceed_prob, params)
        
        # Store results
        design_floods[gauge] = {
            'best_dist': best_dist_name,
            'return_periods': common_rp,
            'design_floods': quantiles
        }
    
    # Create a DataFrame to display design floods
    design_flood_df = pd.DataFrame(index=common_rp)
    for gauge, info in design_floods.items():
        design_flood_df[f"{gauge} ({info['best_dist']})"] = info['design_floods']
    
    design_flood_df.index.name = 'Return Period (years)'
    design_flood_df.columns.name = 'Gauge (Best Distribution)'
    
    return design_flood_df

# Calculate design floods
print("\nCalculating design floods...")
design_flood_df = calculate_design_floods(results)
print("\nDesign Floods (m³/s) for Different Return Periods:")
print(design_flood_df)
def find_best_distribution(fitted_dists):
    """
    Find the best-fitting distribution based on KS test p-value.
    
    Args:
        fitted_dists: dictionary of fitted distribution objects
        
    Returns:
        tuple: (best distribution name, distribution info)
    """
    best_dist = None
    best_pvalue = -1
    
    for name, dist_info in fitted_dists.items():
        pvalue = dist_info['ks'][1]
        if pvalue > best_pvalue:
            best_pvalue = pvalue
            best_dist = (name, dist_info)
    
    return best_dist

def calculate_design_floods(results):
    """
    Calculate design floods for common return periods.
    
    Args:
        results: Dictionary of fitted distribution results for each gauge
        
    Returns:
        pandas.DataFrame: Design floods for different return periods
    """
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
    design_floods = {}
    
    for gauge, fitted_dists in results.items():
        best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
        dist = best_dist_info['dist']
        params = best_dist_info['params']
        
        # Calculate non-exceedance probabilities for common return periods
        non_exceed_prob = 1 - 1/np.array(common_rp)
        
        # Calculate quantiles (design floods)
        quantiles = dist.inverse_cdf(non_exceed_prob, params)
        
        # Store results
        design_floods[gauge] = {
            'best_dist': best_dist_name,
            'return_periods': common_rp,
            'design_floods': quantiles
        }
    
    # Create a DataFrame to display design floods
    design_flood_df = pd.DataFrame(index=common_rp)
    for gauge, info in design_floods.items():
        design_flood_df[f"{gauge} ({info['best_dist']})"] = info['design_floods']
    
    design_flood_df.index.name = 'Return Period (years)'
    design_flood_df.columns.name = 'Gauge (Best Distribution)'
    
    return design_flood_df

# Calculate design floods
print("\nCalculating design floods...")
design_flood_df = calculate_design_floods(results)
print("\nDesign Floods (m³/s) for Different Return Periods:")
print(design_flood_df)

Flood Frequency Curves with Uncertainty¶

When estimating flood frequencies, it's important to acknowledge the uncertainty in our predictions. This uncertainty increases for larger return periods, as we're extrapolating beyond the range of observed data.

Due to limitations in the current implementation, we'll skip the confidence interval calculation, but in a complete analysis, you would want to include confidence intervals to show the uncertainty in the flood frequency estimates.

In [ ]:

Copied!





def plot_flood_frequency_with_ci(data, fitted_dists, gauge_name):
    """
    Plot flood frequency curve with simplified uncertainty representation.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical return periods using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    rp = PlottingPosition.return_period(pp)
    
    # Find best distribution
    best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
    dist = best_dist_info['dist']
    params = best_dist_info['params']
    
    # Generate return periods for plotting
    return_periods = np.logspace(0, 3, 1000)  # 1 to 1000 years
    non_exceed_prob = 1 - 1/return_periods

    # Calculate confidence intervals
    ci = dist.confidence_interval(alpha=0.1, prob_non_exceed=non_exceed_prob)

    # Create figure
    plt.figure(figsize=(12, 8))
    
    # Plot empirical return periods
    plt.semilogx(rp, data, 'o', label='Empirical')
    
    # Calculate quantiles for each return period
    quantiles = dist.inverse_cdf(non_exceed_prob, params)
    
    # Plot flood frequency curve
    plt.semilogx(return_periods, quantiles, label=best_dist_name)

    # Plot confidence intervals
    plt.fill_between(return_periods, ci[1], ci[0], alpha=0.2, label='90% Confidence Interval')
    
    plt.title(f'Flood Frequency Curve - {gauge_name} (Best Fit: {best_dist_name})')
    plt.xlabel('Return Period (years)')
    plt.ylabel('Discharge (m³/s)')
    plt.grid(True)
    plt.legend()
    
    # Add vertical lines for common return periods
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
    for rp_val in common_rp:
        plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
        plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
    
    plt.show()

def plot_confidence_intervals(annual_max, results, selected_gauges):
    """
    Plot flood frequency curves with simplified uncertainty representation for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        results: Dictionary of fitted distribution results for each gauge
        selected_gauges: List of gauge names to plot
    """
    for gauge in selected_gauges:
        if gauge in results:
            print(f"\nPlotting flood frequency curve for {gauge}...")
            data = annual_max[gauge].values
            plot_flood_frequency_with_ci(data, results[gauge], gauge)

# Plot flood frequency curves with simplified uncertainty representation
print("\nPlotting flood frequency curves...")
plot_confidence_intervals(annual_max, results, selected_gauges)
def plot_flood_frequency_with_ci(data, fitted_dists, gauge_name):
    """
    Plot flood frequency curve with simplified uncertainty representation.
    
    Args:
        data: numpy array of discharge values
        fitted_dists: dictionary of fitted distribution objects
        gauge_name: name of the gauge
    """
    # Remove NaN values
    data = data[~np.isnan(data)]
    
    # Sort data in ascending order
    data = np.sort(data)
    
    # Calculate empirical return periods using Weibull plotting position
    pp = PlottingPosition.weibul(data)
    rp = PlottingPosition.return_period(pp)
    
    # Find best distribution
    best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
    dist = best_dist_info['dist']
    params = best_dist_info['params']
    
    # Generate return periods for plotting
    return_periods = np.logspace(0, 3, 1000)  # 1 to 1000 years
    non_exceed_prob = 1 - 1/return_periods

    # Calculate confidence intervals
    ci = dist.confidence_interval(alpha=0.1, prob_non_exceed=non_exceed_prob)

    # Create figure
    plt.figure(figsize=(12, 8))
    
    # Plot empirical return periods
    plt.semilogx(rp, data, 'o', label='Empirical')
    
    # Calculate quantiles for each return period
    quantiles = dist.inverse_cdf(non_exceed_prob, params)
    
    # Plot flood frequency curve
    plt.semilogx(return_periods, quantiles, label=best_dist_name)

    # Plot confidence intervals
    plt.fill_between(return_periods, ci[1], ci[0], alpha=0.2, label='90% Confidence Interval')
    
    plt.title(f'Flood Frequency Curve - {gauge_name} (Best Fit: {best_dist_name})')
    plt.xlabel('Return Period (years)')
    plt.ylabel('Discharge (m³/s)')
    plt.grid(True)
    plt.legend()
    
    # Add vertical lines for common return periods
    common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
    for rp_val in common_rp:
        plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
        plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
    
    plt.show()

def plot_confidence_intervals(annual_max, results, selected_gauges):
    """
    Plot flood frequency curves with simplified uncertainty representation for selected gauges.
    
    Args:
        annual_max: DataFrame containing annual maximum discharge
        results: Dictionary of fitted distribution results for each gauge
        selected_gauges: List of gauge names to plot
    """
    for gauge in selected_gauges:
        if gauge in results:
            print(f"\nPlotting flood frequency curve for {gauge}...")
            data = annual_max[gauge].values
            plot_flood_frequency_with_ci(data, results[gauge], gauge)

# Plot flood frequency curves with simplified uncertainty representation
print("\nPlotting flood frequency curves...")
plot_confidence_intervals(annual_max, results, selected_gauges)

Summary and Conclusions¶

In this notebook, we've demonstrated how to use the statista.distributions module to analyze discharge time series data from the Rhine River. We've explored different probability distributions, fitted them to the data, and calculated return periods and flood frequency curves.

Key findings:

The best-fitting distribution varies between gauges, highlighting the importance of testing multiple distributions
The GEV and Gumbel distributions generally provide good fits for annual maximum discharge data, which is consistent with extreme value theory
Design floods increase with return period, but the rate of increase varies between gauges
Uncertainty in flood frequency estimates increases for larger return periods

This analysis provides valuable information for flood risk assessment, water resource management, and the design of hydraulic structures along the Rhine River.

In [ ]:

Copied!





# Print a summary of the analysis
print("\nAnalysis complete!")

# Summary of findings
print("\nKey Findings:")
print("- The best-fitting distribution varies between gauges, highlighting the importance of testing multiple distributions")
print("- The GEV and Gumbel distributions generally provide good fits for annual maximum discharge data, which is consistent with extreme value theory")
print("- Design floods increase with return period, but the rate of increase varies between gauges")
print("- Uncertainty in flood frequency estimates increases for larger return periods, though we couldn't visualize this due to implementation limitations")
# Print a summary of the analysis
print("\nAnalysis complete!")

# Summary of findings
print("\nKey Findings:")
print("- The best-fitting distribution varies between gauges, highlighting the importance of testing multiple distributions")
print("- The GEV and Gumbel distributions generally provide good fits for annual maximum discharge data, which is consistent with extreme value theory")
print("- Design floods increase with return period, but the rate of increase varies between gauges")
print("- Uncertainty in flood frequency estimates increases for larger return periods, though we couldn't visualize this due to implementation limitations")