Rhine River Discharge Analysis using Statista Distributions¶
This notebook demonstrates how to use the statista.distributions module to analyze discharge time series data from the Rhine River. We'll explore different probability distributions, fit them to the data, and calculate return periods and flood frequency curves.
The Rhine River is one of Europe's major rivers, flowing through several countries including Switzerland, Liechtenstein, Austria, Germany, France and the Netherlands. Analyzing discharge data is crucial for flood risk assessment, water resource management, and understanding the hydrological behavior of the river.
In this notebook, we will:
- Load and preprocess discharge time series data from multiple gauges along the Rhine River
- Fit different probability distributions to the data
- Evaluate the goodness of fit for each distribution
- Calculate return periods and flood frequency curves
- Visualize the results
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import statista distributions module
from statista.distributions import Distributions, PlottingPosition, Gumbel, GEV, Exponential, Normal
# Set plot style
plt.style.use('ggplot')
# Display all columns in pandas DataFrames
pd.set_option('display.max_columns', None)
Data Loading and Preprocessing¶
We'll load the Rhine River discharge data from the CSV file. The data contains daily discharge measurements from multiple gauges along the river. The first column is the date, and the remaining columns represent different gauges.
We need to handle missing values (empty strings) in the data and convert the date column to a datetime format. This preprocessing step is crucial for ensuring the quality of our analysis.
def load_and_preprocess_data(file_path):
"""
Load and preprocess the Rhine River discharge data.
- Reads a CSV file with a 'date' column and discharge columns for each gauge.
- Parses the date column into `datetime` objects and sets it as the index.
- Converts empty strings to NaN and casts all gauge columns to floats.
- Returns a clean `DataFrame` ready for analysis.
Args:
file_path: Path to the CSV file containing the discharge data
Returns:
pandas.DataFrame: Preprocessed discharge data
"""
# Load the data
df = pd.read_csv(file_path)
# Display the first few rows of the data
print(f"Data shape: {df.shape}")
print(df.head())
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Set the date column as the index
df.set_index('date', inplace=True)
# Check for missing values
print("\nNumber of missing values in each column:")
print(df.isna().sum())
# Convert empty strings to NaN
df = df.replace('', np.nan)
# Convert all columns to numeric
for col in df.columns:
df[col] = pd.to_numeric(df[col], errors='coerce')
# Check for missing values again
print("\nNumber of missing values after conversion:")
print(df.isna().sum())
# Display basic statistics
print("\nBasic statistics:")
print(df.describe())
return df
# Define the path to the data file
file_path = '../../../examples/data/rhine-full-time-series.csv'
# Load and preprocess the data
print("Loading and preprocessing the data...")
df = load_and_preprocess_data(file_path)
Exploratory Data Analysis¶
Let's explore the data to understand the discharge patterns at different gauges along the Rhine River. We'll visualize the time series, examine the distribution of discharge values, and identify any seasonal patterns.
Time series analysis helps us understand how discharge varies over time, while histograms provide insights into the statistical distribution of discharge values. This exploratory analysis is essential for selecting appropriate probability distributions for modeling.
def plot_time_series(df, selected_gauges):
"""
Plot time series for selected gauges.
Args:
df: DataFrame containing the discharge data
selected_gauges: List of gauge names to plot
"""
plt.figure(figsize=(14, 8))
for gauge in selected_gauges:
if gauge in df.columns:
plt.plot(df.index, df[gauge], label=gauge)
plt.title('Discharge Time Series for Selected Gauges')
plt.xlabel('Date')
plt.ylabel('Discharge (m³/s)')
plt.legend()
plt.grid(True)
plt.show()
def plot_histograms(df, selected_gauges):
"""
Create histograms for selected gauges.
Args:
df: DataFrame containing the discharge data
selected_gauges: List of gauge names to plot
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for i, gauge in enumerate(selected_gauges):
if gauge in df.columns and i < len(axes):
# Use matplotlib's histogram function
data = df[gauge].dropna()
axes[i].hist(data, bins=20, density=True, alpha=0.7)
# Add a density curve
from scipy import stats
min_val, max_val = data.min(), data.max()
x = np.linspace(min_val, max_val, 1000)
kde = stats.gaussian_kde(data)
axes[i].plot(x, kde(x), 'r-', linewidth=2)
axes[i].set_title(f'Distribution of Discharge at {gauge}')
axes[i].set_xlabel('Discharge (m³/s)')
axes[i].set_ylabel('Density')
plt.tight_layout()
plt.show()
# Define selected gauges for analysis
selected_gauges = ['rees', 'cologne', 'kaub', 'mainz']
# Plot time series
print("\nPlotting time series for selected gauges...")
plot_time_series(df, selected_gauges)
# Plot histograms
print("\nPlotting histograms for selected gauges...")
plot_histograms(df, selected_gauges)
Annual Maximum Discharge Analysis¶
For flood frequency analysis, we typically focus on annual maximum discharge values. This approach, known as the Annual Maximum Series (AMS) method, is widely used in hydrology for estimating the probability of extreme flood events.
By extracting the maximum discharge value for each year, we create a dataset that represents the most extreme conditions observed annually. This dataset is then used to fit probability distributions that model the frequency of extreme events.
def extract_annual_maxima(df):
"""
Extract annual maximum discharge for each gauge.
Args:
df: DataFrame containing the discharge data
Returns:
pandas.DataFrame: Annual maximum discharge for each gauge
"""
# Extract annual maximum discharge for each gauge
annual_max = df.resample('YE').max()
return annual_max
def plot_annual_maxima(annual_max, selected_gauges):
"""
Plot annual maximum discharge for selected gauges.
Args:
annual_max: DataFrame containing annual maximum discharge
selected_gauges: List of gauge names to plot
"""
plt.figure(figsize=(14, 8))
for gauge in selected_gauges:
if gauge in annual_max.columns:
plt.plot(annual_max.index, annual_max[gauge], 'o-', label=gauge)
plt.title('Annual Maximum Discharge for Selected Gauges')
plt.xlabel('Year')
plt.ylabel('Maximum Discharge (m³/s)')
plt.legend()
plt.grid(True)
plt.show()
# Extract annual maximum discharge
print("\nExtracting annual maximum discharge...")
annual_max = extract_annual_maxima(df)
# Plot annual maximum discharge
print("\nPlotting annual maximum discharge for selected gauges...")
plot_annual_maxima(annual_max, selected_gauges)
Fitting Probability Distributions¶
Now we'll fit different probability distributions to the annual maximum discharge data for each gauge. We'll use the following distributions from the statista.distributions module:
Gumbel distribution: Also known as the Extreme Value Type I distribution, it's commonly used for modeling maximum values, such as annual flood peaks.
Generalized Extreme Value (GEV) distribution: A flexible three-parameter distribution that encompasses the Gumbel, Fréchet, and Weibull distributions, making it suitable for modeling extreme events.
Normal distribution: While not typically used for extreme values, it serves as a baseline for comparison.
Exponential distribution: Often used for modeling the time between events in a Poisson process, but can also be applied to certain hydrological variables.
We'll evaluate the goodness of fit using the Kolmogorov-Smirnov test, which measures the maximum difference between the empirical and theoretical cumulative distribution functions.
def fit_distributions(data, method="lmoments"):
"""
Fit different distributions to the data and evaluate goodness of fit.
Args:
data: numpy array of discharge values
method: fitting method ('mle' for Maximum Likelihood Estimation or 'lmoments' for L-moments)
Returns:
dict: Dictionary of fitted distribution objects and test results
"""
# Remove NaN values
data = data[~np.isnan(data)]
# Sort data in ascending order
data = np.sort(data)
# Initialize distributions
gumbel = Gumbel(data=data)
gev = GEV(data=data)
normal = Normal(data=data)
exponential = Exponential(data=data)
# Fit distributions
gumbel_params = gumbel.fit_model(method=method)
gev_params = gev.fit_model(method=method)
normal_params = normal.fit_model(method=method)
exponential_params = exponential.fit_model(method=method)
# Perform Kolmogorov-Smirnov test
gumbel_ks = gumbel.ks()
gev_ks = gev.ks()
normal_ks = normal.ks()
exponential_ks = exponential.ks()
# Return results
return {
'Gumbel': {'dist': gumbel, 'params': gumbel_params, 'ks': gumbel_ks},
'GEV': {'dist': gev, 'params': gev_params, 'ks': gev_ks},
'Normal': {'dist': normal, 'params': normal_params, 'ks': normal_ks},
'Exponential': {'dist': exponential, 'params': exponential_params, 'ks': exponential_ks}
}
#Example usage:
for gauge in selected_gauges:
if gauge in annual_max.columns:
data = annual_max[gauge].values
results = fit_distributions(data)
Visualizing Fitted Distributions¶
After fitting the distributions, it's important to visualize how well they match the empirical data. We'll create plots of both the probability density function (PDF) and the cumulative distribution function (CDF) for each distribution.
The PDF shows the relative likelihood of different discharge values, while the CDF shows the probability of not exceeding a given discharge value. These visualizations help us assess which distribution provides the best fit to the observed data.
def plot_fitted_distributions(data, fitted_dists, gauge_name):
"""
Plot the empirical and fitted distributions.
Args:
data: numpy array of discharge values
fitted_dists: dictionary of fitted distribution objects
gauge_name: name of the gauge
"""
# Remove NaN values
data = data[~np.isnan(data)]
# Sort data in ascending order
data = np.sort(data)
# Calculate empirical CDF using Weibull plotting position
pp = PlottingPosition.weibul(data)
# Create figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Plot PDF
ax1.hist(data, bins=20, density=True, alpha=0.5, label='Empirical')
x = np.linspace(min(data), max(data), 1000)
for name, dist_info in fitted_dists.items():
dist = dist_info['dist']
params = dist_info['params']
# Plot PDF
# y_pdf = dist._pdf_eq(x, params)
y_pdf = dist.pdf(data=x, parameters=params)
ax1.plot(x, y_pdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
ax1.set_title(f'Probability Density Function - {gauge_name}')
ax1.set_xlabel('Discharge (m³/s)')
ax1.set_ylabel('Density')
ax1.legend()
ax1.grid(True)
# Plot CDF
ax2.plot(data, pp, 'o', label='Empirical')
for name, dist_info in fitted_dists.items():
dist = dist_info['dist']
params = dist_info['params']
# Plot CDF
y_cdf = dist.cdf(data=x, parameters=params)
ax2.plot(x, y_cdf, label=f'{name} (KS p-value: {dist_info["ks"][1]:.4f})')
ax2.set_title(f'Cumulative Distribution Function - {gauge_name}')
ax2.set_xlabel('Discharge (m³/s)')
ax2.set_ylabel('Probability of Non-Exceedance')
ax2.legend()
ax2.grid(True)
plt.tight_layout()
plt.show()
Flood Frequency Analysis¶
Flood frequency analysis is a statistical technique used to estimate the probability of occurrence of a given discharge magnitude. The results are typically expressed in terms of return periods, which represent the average time interval between events of a given magnitude.
The flood frequency curve plots discharge values against their corresponding return periods. This curve is a valuable tool for flood risk assessment and the design of hydraulic structures.
def plot_flood_frequency_curve(data, fitted_dists, gauge_name):
"""
Calculate return periods and plot flood frequency curves.
Args:
data: numpy array of discharge values
fitted_dists: dictionary of fitted distribution objects
gauge_name: name of the gauge
"""
# Remove NaN values
data = data[~np.isnan(data)]
# Sort data in ascending order
data = np.sort(data)
# Calculate empirical return periods using Weibull plotting position
pp = PlottingPosition.weibul(data)
rp = PlottingPosition.return_period(pp)
# Create figure
plt.figure(figsize=(12, 8))
# Plot empirical return periods
plt.semilogx(rp, data, 'o', label='Empirical')
# Generate return periods for plotting
return_periods = np.logspace(0, 3, 1000) # 1 to 1000 years
non_exceed_prob = 1 - 1/return_periods
# Plot theoretical return periods for each distribution
for name, dist_info in fitted_dists.items():
dist = dist_info['dist']
params = dist_info['params']
# Calculate quantiles for each return period
quantiles = dist.inverse_cdf(non_exceed_prob, params)
# Plot flood frequency curve
plt.semilogx(return_periods, quantiles, label=name)
plt.title(f'Flood Frequency Curve - {gauge_name}')
plt.xlabel('Return Period (years)')
plt.ylabel('Discharge (m³/s)')
plt.grid(True)
plt.legend()
# Add vertical lines for common return periods
common_rp = [2, 5, 10, 25, 50, 100, 200, 500]
for rp_val in common_rp:
plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
plt.show()
def analyze_distributions(annual_max, selected_gauges):
"""
Fit distributions to annual maximum discharge for selected gauges.
Args:
annual_max: DataFrame containing annual maximum discharge
selected_gauges: List of gauge names to analyze
Returns:
dict: Dictionary of fitted distribution results for each gauge
"""
results = {}
for gauge in selected_gauges:
if gauge in annual_max.columns:
print(f"\nFitting distributions to {gauge}...")
data = annual_max[gauge].values
results[gauge] = fit_distributions(data)
# Print goodness of fit results
print(f"\nGoodness of fit results for {gauge}:")
for dist_name, dist_info in results[gauge].items():
ks_stat = dist_info['ks'][0]
ks_pvalue = dist_info['ks'][1]
print(f"{dist_name}: KS statistic = {ks_stat:.4f}, p-value = {ks_pvalue:.4f}")
# Plot fitted distributions
plot_fitted_distributions(data, results[gauge], gauge)
# Plot flood frequency curve
plot_flood_frequency_curve(data, results[gauge], gauge)
return results
# Fit distributions and analyze results
print("\nFitting distributions and analyzing results...")
results = analyze_distributions(annual_max, selected_gauges)
Design Floods¶
Design floods are discharge values associated with specific return periods. They are used in the design of hydraulic structures, flood protection measures, and risk assessment.
For example, a 100-year flood (a flood with a 1% annual probability of occurrence) is often used as a design standard for major infrastructure. By calculating design floods for different return periods, we can provide valuable information for flood risk management and infrastructure planning.
def find_best_distribution(fitted_dists):
"""
Find the best-fitting distribution based on KS test p-value.
Args:
fitted_dists: dictionary of fitted distribution objects
Returns:
tuple: (best distribution name, distribution info)
"""
best_dist = None
best_pvalue = -1
for name, dist_info in fitted_dists.items():
pvalue = dist_info['ks'][1]
if pvalue > best_pvalue:
best_pvalue = pvalue
best_dist = (name, dist_info)
return best_dist
def calculate_design_floods(results):
"""
Calculate design floods for common return periods.
Args:
results: Dictionary of fitted distribution results for each gauge
Returns:
pandas.DataFrame: Design floods for different return periods
"""
common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
design_floods = {}
for gauge, fitted_dists in results.items():
best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
dist = best_dist_info['dist']
params = best_dist_info['params']
# Calculate non-exceedance probabilities for common return periods
non_exceed_prob = 1 - 1/np.array(common_rp)
# Calculate quantiles (design floods)
quantiles = dist.inverse_cdf(non_exceed_prob, params)
# Store results
design_floods[gauge] = {
'best_dist': best_dist_name,
'return_periods': common_rp,
'design_floods': quantiles
}
# Create a DataFrame to display design floods
design_flood_df = pd.DataFrame(index=common_rp)
for gauge, info in design_floods.items():
design_flood_df[f"{gauge} ({info['best_dist']})"] = info['design_floods']
design_flood_df.index.name = 'Return Period (years)'
design_flood_df.columns.name = 'Gauge (Best Distribution)'
return design_flood_df
# Calculate design floods
print("\nCalculating design floods...")
design_flood_df = calculate_design_floods(results)
print("\nDesign Floods (m³/s) for Different Return Periods:")
print(design_flood_df)
Flood Frequency Curves with Uncertainty¶
When estimating flood frequencies, it's important to acknowledge the uncertainty in our predictions. This uncertainty increases for larger return periods, as we're extrapolating beyond the range of observed data.
Due to limitations in the current implementation, we'll skip the confidence interval calculation, but in a complete analysis, you would want to include confidence intervals to show the uncertainty in the flood frequency estimates.
def plot_flood_frequency_with_ci(data, fitted_dists, gauge_name):
"""
Plot flood frequency curve with simplified uncertainty representation.
Args:
data: numpy array of discharge values
fitted_dists: dictionary of fitted distribution objects
gauge_name: name of the gauge
"""
# Remove NaN values
data = data[~np.isnan(data)]
# Sort data in ascending order
data = np.sort(data)
# Calculate empirical return periods using Weibull plotting position
pp = PlottingPosition.weibul(data)
rp = PlottingPosition.return_period(pp)
# Find best distribution
best_dist_name, best_dist_info = find_best_distribution(fitted_dists)
dist = best_dist_info['dist']
params = best_dist_info['params']
# Generate return periods for plotting
return_periods = np.logspace(0, 3, 1000) # 1 to 1000 years
non_exceed_prob = 1 - 1/return_periods
# Calculate confidence intervals
ci = dist.confidence_interval(alpha=0.1, prob_non_exceed=non_exceed_prob)
# Create figure
plt.figure(figsize=(12, 8))
# Plot empirical return periods
plt.semilogx(rp, data, 'o', label='Empirical')
# Calculate quantiles for each return period
quantiles = dist.inverse_cdf(non_exceed_prob, params)
# Plot flood frequency curve
plt.semilogx(return_periods, quantiles, label=best_dist_name)
# Plot confidence intervals
plt.fill_between(return_periods, ci[1], ci[0], alpha=0.2, label='90% Confidence Interval')
plt.title(f'Flood Frequency Curve - {gauge_name} (Best Fit: {best_dist_name})')
plt.xlabel('Return Period (years)')
plt.ylabel('Discharge (m³/s)')
plt.grid(True)
plt.legend()
# Add vertical lines for common return periods
common_rp = [2, 5, 10, 25, 50, 100, 200, 500, 1000]
for rp_val in common_rp:
plt.axvline(x=rp_val, color='gray', linestyle='--', alpha=0.5)
plt.text(rp_val, plt.ylim()[0], str(rp_val), ha='center', va='bottom', alpha=0.7)
plt.show()
def plot_confidence_intervals(annual_max, results, selected_gauges):
"""
Plot flood frequency curves with simplified uncertainty representation for selected gauges.
Args:
annual_max: DataFrame containing annual maximum discharge
results: Dictionary of fitted distribution results for each gauge
selected_gauges: List of gauge names to plot
"""
for gauge in selected_gauges:
if gauge in results:
print(f"\nPlotting flood frequency curve for {gauge}...")
data = annual_max[gauge].values
plot_flood_frequency_with_ci(data, results[gauge], gauge)
# Plot flood frequency curves with simplified uncertainty representation
print("\nPlotting flood frequency curves...")
plot_confidence_intervals(annual_max, results, selected_gauges)
Summary and Conclusions¶
In this notebook, we've demonstrated how to use the statista.distributions module to analyze discharge time series data from the Rhine River. We've explored different probability distributions, fitted them to the data, and calculated return periods and flood frequency curves.
Key findings:
- The best-fitting distribution varies between gauges, highlighting the importance of testing multiple distributions
- The GEV and Gumbel distributions generally provide good fits for annual maximum discharge data, which is consistent with extreme value theory
- Design floods increase with return period, but the rate of increase varies between gauges
- Uncertainty in flood frequency estimates increases for larger return periods
This analysis provides valuable information for flood risk assessment, water resource management, and the design of hydraulic structures along the Rhine River.
# Print a summary of the analysis
print("\nAnalysis complete!")
# Summary of findings
print("\nKey Findings:")
print("- The best-fitting distribution varies between gauges, highlighting the importance of testing multiple distributions")
print("- The GEV and Gumbel distributions generally provide good fits for annual maximum discharge data, which is consistent with extreme value theory")
print("- Design floods increase with return period, but the rate of increase varies between gauges")
print("- Uncertainty in flood frequency estimates increases for larger return periods, though we couldn't visualize this due to implementation limitations")