In operational hydrology, estimation of the predictive uncertainty of hydrological models used for flood modelling is essential for risk-based decision making for flood warning and emergency management. In the literature, there exists a variety of methods analysing and predicting uncertainty. However, studies devoted to comparing the performance of the methods in predicting uncertainty are limited. This paper focuses on the methods predicting model residual uncertainty that differ in methodological complexity: quantile regression (QR) and UNcertainty Estimation based on local Errors and Clustering (UNEEC). The comparison of the methods is aimed at investigating how well a simpler method using fewer input data performs compared with a more complex method with more predictors. We test these two methods on several catchments from the UK that vary in hydrological characteristics and the models used. Special attention is given to the methods' performance under different hydrological conditions. Furthermore, normality of model residuals in data clusters (identified by UNEEC) is analysed. It is found that basin lag time and forecast lead time have a large impact on the quantification of uncertainty and the presence of normality in model residuals' distribution. In general, both methods give similar results. At the same time, it is also shown that the UNEEC method provides better performance than QR for small catchments with changing hydrological dynamics, i.e. rapid response catchments. It is recommended that more case studies of catchments of distinct hydrologic behaviour, with diverse climatic conditions, and having various hydrological features, be considered.

The importance of accounting for uncertainty in hydrological models used in
flood early warning systems is widely recognized (e.g. Krzysztofowicz, 2001;
Pappenberger and Beven, 2006). Such an uncertainty in the model prediction
stems mainly from four important sources: perceptual model uncertainty, data
uncertainty, parameter estimation uncertainty, and model structural
uncertainty (e.g. Solomatine and Wagener, 2011). Analysis of

While the discussions on the necessity of evaluating the contributions of
various sources of errors to the overall model uncertainty have been going
for a long time (see e.g. Gupta et al., 2005; Brown and Heuvelink, 2005; Liu
and Gupta, 2007), there have also been attempts to estimate the

In this context, two classes of uncertainty analysis methods can be considered. The first one relates to the Bayesian framework with the meta-Gaussian transformation of data as its important part; these methods are based on a rigorous statistical framework. The following techniques and papers can be mentioned: the original Bayesian forecasting system (BFS) and the Hydrological Uncertainty Processor as its part (Krzysztofowicz, 1999; Krzysztofowicz and Kelly, 2000); its implementations and variations described in Montanari and Brath (2004), Reggiani and Weerts (2008), Reggiani et al. (2009), and Bogner and Pappenberger (2011); and the Model Conditional Processor (Todini, 2008; Coccia and Todini, 2011).

The other class of methods (of which two are dealt with in this paper) includes more “straightforward” ones which are directly oriented at predicting the properties (quantiles) of the residual error distribution by linear or non-linear regression (machine learning) techniques: quantile regression (QR) (Koenker and Basset, 1978) with its applications in hydrology reported by Solomatine and Shrestha (2009), Weerts et al. (2011), and López López et al. (2013); UNcertainty Estimation based on local Errors and Clustering (UNEEC) that uses machine learning techniques (Shrestha and Solomatine, 2006; Solomatine and Shrestha, 2009); and the dynamic uncertainty model by regression on absolute error (DUMBRAE) (Pianosi and Raso, 2012). In this paper we consider two methods from this class that differ in their methodological complexity: quantile regression (QR) and UNcertainty Estimation based on local Errors and Clustering (UNEEC).

Quantile regression (Koenker and Basset, 1978; Koenker and Hallock, 2001; Koenker, 2005) is basically a set of linear regression models (typically, two) where predictands (response variables) are the selected quantiles of the conditional distribution of some variables (discharge or water level in the present research study), and predictors are lagged values of the same variable. This methodology allows for examination of the entire distribution of the variable of interest rather than a single measure of the central tendency of its distribution (Koenker, 2005). QR models have been used in a broad range of applications: economics and financial market analysis (Kudryavtsev, 2009; Taylor, 2007), agriculture (Barnwal and Kotani, 2013), meteorology (Bremnes, 2004; Friederichs and Hense, 2007; Cannon, 2011), wind forecasting (Nielsen et al., 2006; Møller et al., 2008), the prediction of ozone concentrations (Baur et al., 2004; Munir et al., 2012), etc. In hydrological modelling the QR method has been applied as an uncertainty post-processing technique in previous research studies with different configurations.

The configurations of QR differ mainly in two aspects: treatment of the quantile crossing problem (a problem when quantiles of the lower order appear to be larger than those of the higher order) and the quantiles derivation in normal space using normal quantile transformation (NQT). Solomatine and Shrestha (2009) make use of the classical QR approach, without considering quantile crossing and NQT. Weerts et al. (2011), Verkade and Werner (2011), and Roscoe et al. (2012) apply QR to various deterministic hydrologic forecasts. The QR configuration investigated in these studies uses the water level or discharge forecasts as predictors to estimate the distribution quantiles of the model error. It includes a transformation into normal space using the NQT and the quantile crossing problem is addressed by imposing a fixed distribution of the predictand in the crossing domain. Singh et al. (2013) make use of a similar configuration differentiating two cases based on the similarities in information content between calibration and validation data periods. Coccia and Todini (2011) observe that QR's usefulness and performance depend on the assumed patterns in quantiles; for example, lack of linear variation of the error variance with the magnitude of the forecasts hinders reasonable estimation of the quantiles, especially for high flows/water levels. López López et al. (2014) apply QR to predict the quantiles of the environmental variables itself (water level) rather than the quantiles of the model error, and the four different configurations of QR are compared and extensively verified. It should be noted that, in this study, by design, the only predictor in QR is the deterministic model output for discharge/water level, and the quantiles of observed discharge/water level are estimated through linear regression.
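To make the idea of regressing quantiles of the observed variable on the deterministic model output more concrete, the following minimal Python sketch fits a linear quantile function by minimising the pinball (check) loss. It is only an illustration under stated assumptions: the function names, the exhaustive pairwise search, and the sample numbers are hypothetical and are not the configuration used in the cited studies, and the sketch does not treat quantile crossing or the NQT.

```python
from itertools import combinations


def pinball_loss(residual, tau):
    """Check (pinball) loss for one residual = observed - predicted."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def total_loss(x, y, a, b, tau):
    """Sum of pinball losses of the line a + b*x over all data points."""
    return sum(pinball_loss(yk - (a + b * xk), tau) for xk, yk in zip(x, y))


def fit_linear_quantile(x, y, tau):
    """Fit y ~ a + b*x for quantile level tau by exhaustive search.

    With a single predictor, a minimiser of the total pinball loss exists
    among the lines passing through two data points, so trying every pair
    is exact. The cost is O(n^3), so this suits small samples only.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid regression line
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = total_loss(x, y, a, b, tau)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


# Hypothetical simulated and observed water levels (illustrative values only).
simulated = [0.5, 0.7, 0.9, 1.1, 1.4, 1.6, 1.9, 2.3, 2.8, 3.2]
observed = [0.52, 0.66, 0.95, 1.02, 1.55, 1.48, 2.10, 2.05, 3.20, 2.90]

lower_a, lower_b = fit_linear_quantile(simulated, observed, 0.05)
upper_a, upper_b = fit_linear_quantile(simulated, observed, 0.95)

# 90 % prediction limits for a new simulated value of 2.0 (same units).
new_value = 2.0
print(lower_a + lower_b * new_value, upper_a + upper_b * new_value)
```

Because each quantile is fitted independently here, the two lines may cross outside the range of the data; this is the quantile crossing problem that the non-crossing configurations address.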

The UNEEC method was introduced 10 years ago (Shrestha and Solomatine, 2006;
Shrestha et al., 2006). The method builds a non-linear regression model
(machine learning, e.g. an artificial neural network) to estimate the
quantiles of the error distribution, and it assumes that residual uncertainty
depends on the modelled system state characteristics so that any variable can
be used as a predictor. A notable characteristic of UNEEC is the special
attention to achieving accuracy by local modelling of errors (by clustering
and treating clusters separately), so that particularities of different
hydrometeorological conditions, i.e. heterogeneities inherent to
rainfall–runoff process, are represented through different error

Solomatine and Shrestha (2009) presented their initial experiments to compare QR and UNEEC on one case study, and Weerts et al. (2011) discussed the experience with QR on another one. In this paper we go further and test the newer variants of these methods on several contrasting catchments that cover a wide range of climatic conditions and hydrological characteristics. The motivation here is to identify possible advantages and disadvantages of using the QR and UNEEC methods based on their comparative performance, especially during flooding conditions (i.e. for the data cluster associated with high flow/water level conditions). The knowledge gaps regarding the use of the methods with different parameterizations are addressed. For example, we now incorporate into UNEEC the autoregressive component by considering past error values (in addition to discharge and effective rainfall) in one case study, and model outputs for the state variables soil moisture deficit (SMD) and groundwater level (GW) are used as predictors (in addition to water level) in another case study. In the QR version implemented, the linear regression model was established to predict the quantiles of observed water levels conditioned on simulated/forecasted water levels. Furthermore, we present results of statistical analysis of error time series to better understand (hydrological) models' quality in relation to its effect on uncertainty analysis results, and to discuss the assumption of normality in the model residuals, particularly in view of the clustering approach employed within the framework of the UNEEC method. We apply methods to estimate predictive uncertainty in the Brue catchment (southwestern UK) and the Upper Severn catchments – Yeaton, Llanyblodwel, and Llanerfyl (Midlands, UK).

It should be noted that by design UNEEC uses a richer set of predictors than QR and a more sophisticated non-linear regression model, so the comparison between simple and complex models may seem unfair. However, more predictors may not bring more of the information needed for accurate prediction. Only experiments can establish this for each particular case. Our experience with data-driven models (and both QR and UNEEC are such) showed that adding more predictors does not necessarily mean higher accuracy on unseen data. Parsimony (Box et al., 2008) often leads to better generalization. In this study we compare the two uncertainty prediction methods, with the aim of investigating whether a simpler method using fewer input data may perform better than the more complex method with more predictors. Overall, selection of the most appropriate uncertainty processor for a specific catchment is a matter of compromise between its complexity and accuracy in consideration of the data availability and also the characteristics of the catchment, and we believe the findings of such a comparative analysis could be useful for the operational hydrology community.

The remainder of the paper is structured as follows. The next section describes the residual uncertainty analysis methods (QR and UNEEC) and the validation measures used. Section 3 describes the studied catchments and the experimental set-up. The results for error and uncertainty analyses are presented and discussed in Sect. 4. In Sect. 5 the main conclusions from the study and recommendations for future work are presented.

As in Solomatine and Shrestha (2009) and Weerts et al. (2011), we consider a
deterministic (hydrological) model

As mentioned, several QR configurations have been previously investigated for estimating the residual uncertainty. In López López et al. (2014; open access), four alternative configurations of QR were compared and verified for several catchments of the Upper Severn River. The comparative analysis included different experiments on the derivation of regression quantiles in original and in normal space using NQT, a piecewise linear configuration considering independent predictand domains, and avoiding the quantile crossing problem with a relatively recent technique (Bondell et al., 2010). The intercomparison showed that the reliability and sharpness vary across configurations, but in none of the configurations do these two forecast quality aspects improve simultaneously. Further analysis reveals that skills in terms of the various verification metrics (i.e. Brier skill score, BSS; mean continuous ranked probability skill score, CRPSS; and the relative operating characteristic score, ROCS) are very similar across the four configurations. Therefore, noting also the main idea behind the current study (which is to investigate how well a simpler method using fewer input data performs compared with a more complex method with more predictors), the simplest QR configuration (termed there the “QR1: non-crossing Quantile Regression”) was applied in this study. QR1 estimates the quantiles of the distribution of water level or discharge in the original domain, without any initial transformation, and avoids the quantile crossing problem. A brief description of the QR configuration used in the present work is given below (for details, the reader is referred to López López et al., 2014).

For every quantile

Quantile regression example scheme considering different quantiles.

Figure 1 illustrates the estimation of a selection of quantiles, including
the 0.95, 0.75, 0.25 and 0.05 quantiles. To obtain the QR function for a
specific quantile, e.g.

In UNEEC, a machine learning model, e.g. an artificial neural network,
instance-based learning (e.g.

Identify the set of predictor variables (the lagged rainfall data, soil moisture, flow, etc.) that describe the flow process based on their effect on the model error. These predictors can be selected using average mutual information (AMI) and correlation analysis. Using AMI brings the advantage of detection of non-linear relationships (Battiti, 1994).

Identify the fuzzy clusters in the data set in the space of predictor variables (using e.g. the fuzzy c-means method) (Fig. 2). The optimal number of clusters can be determined using the methods described, e.g. in Xie and Beni (1991), Halkidi et al. (2001), and Nasseri and Zahraie (2011).

For each cluster

For each data vector, calculate the “global” estimate of the quantile

Train a machine learning model (

An example of fuzzy clustering of input data (the predictors are
past rainfall at lag

In this study we use several statistical measures of uncertainty to evaluate and to some extent compare performances of QR and UNEEC. These are, namely, prediction interval coverage probability (PICP; Shrestha and Solomatine, 2006), mean prediction interval (MPI; Shrestha and Solomatine, 2006), and average relative interval length (ARIL; Jin et al., 2010). PICP has also been used by other authors (e.g. Laio and Tamea, 2007) as an important performance measure to estimate the accuracy of probabilistic forecasts.

PICP should be seen as the most important measure since it shows how many
observations fall into the estimated interval. PICP is the probability that
the observed values (

Ideally, the PICP value should be equal to or close to the specified CL.

MPI computes the average width of the uncertainty band (or prediction
interval), i.e. the distance between the upper and lower prediction limits
(PL

ARIL is similar to MPI and considers the average width of uncertainty bounds
in relation to the observed value:

A possibility to combine PICP and ARIL is to use the NUE indicator proposed
by Nasseri and Zahraie (2011):

Summary of the main basin characteristics for the catchments selected.

There is no single objective measure of the quality of an uncertainty
prediction method (since the “actual” uncertainty of the model (error

PICP indeed evaluates whether the expected percentage of observations falls into the predicted interval, and should be seen as an important average indicator of the predictor's performance. However, in the case of high noise in the model error (aleatoric uncertainty), the fact that PICP is far from 90 % could mean simply that none of the data-driven predictive models can capture the input–output dependencies and predict quantiles accurately. For comparative studies, however, PICP can very well be used: the method with PICP closest to 90 % should be seen as the best (with some tolerance). Additional analysis may be carried out to see whether the methods developed for the assessment of the probabilistic forecast quality can be used (Laio and Tamea, 2007) (it is not exactly the same as the residual uncertainty analysed here, but the mathematical apparatus seems to be transferable). In this paper, however, we have not considered these, so they can be recommended for exploration and testing in future studies.

It is also worth mentioning that all considered measures are averages and so should be used together with the uncertainty bound plots, whose visual analysis reveals more about how the different uncertainty prediction methods perform during particular periods.
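As an illustration of how the interval-based measures discussed in this section could be computed, the short Python sketch below follows their verbal descriptions: PICP as the share of observations inside the prediction interval, MPI as the average interval width, and ARIL as the average width relative to the observed value. The function names and the sample numbers are hypothetical, and the exact formulations should be checked against the cited references.

```python
def picp(observed, lower, upper):
    """Percentage of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, up in zip(observed, lower, upper) if lo <= y <= up)
    return 100.0 * inside / len(observed)


def mpi(lower, upper):
    """Mean width of the prediction interval (same units as the data)."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)


def aril(observed, lower, upper):
    """Mean interval width relative to the observed value.

    Assumes strictly positive observations (e.g. discharge), otherwise the
    ratio is undefined.
    """
    ratios = [(up - lo) / y for y, lo, up in zip(observed, lower, upper)]
    return sum(ratios) / len(ratios)


# Hypothetical observations and 90 % prediction limits (illustrative only).
obs = [2.0, 4.0, 5.0, 10.0]
pl_lower = [1.0, 3.0, 6.0, 8.0]
pl_upper = [3.0, 5.0, 8.0, 12.0]

print(picp(obs, pl_lower, pl_upper), mpi(pl_lower, pl_upper), aril(obs, pl_lower, pl_upper))
```

In this toy sample three of the four observations lie inside their intervals, so the coverage falls short of a nominal 90 % even though the intervals are fairly wide; this is the kind of trade-off between coverage and width that the measures are meant to expose.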

Located in the southwest of England, the Brue River catchment has a history
of severe flooding. Draining an area of 135 km

The flow in the Brue River was simulated by the HBV-96 model (Lindström
et al., 1997), which is an updated version of the HBV rainfall–runoff model
(Bergström, 1976). This lumped conceptual hydrological model consists of
subroutines for snow accumulation and melt (excluded for Brue), the soil
moisture accounting procedure, routines for runoff generation, and a simple
routing procedure (Fig. 3b). The input data used are hourly observations of
precipitation (basin average), air temperature, and potential
evapotranspiration (estimated by the modified Penman method) computed from
the 15 min data. The model time step is 1 h (

The uncertainty analyses conducted for the Brue catchment are based on
one-step-ahead flow estimates, i.e. LT

Flowing from the Cambrian Mountains (610 m) in Wales, the River Severn is
the longest river in Britain (about 354 km). It forms the border between
England and Wales and flows into the Bristol Channel. The river drains an
area of approximately 10 500 km

The Upper Severn catchments: Yeaton, Llanyblodwel and Llanerfyl.

In this work, the three sub-catchments of the Upper Severn River are analysed: Yeaton, Llanyblodwel, and Llanerfyl (Fig. 4). The area, elevation, mean flow, mean annual rainfall and basin lag time (time of concentration) information of the catchments are presented in Table 1. The Yeaton catchment is located at a lower elevation and over a flat area compared to Llanerfyl and Llanyblodwel. This catchment also has the longest basin lag time. The smallest catchment in terms of drainage area is Llanerfyl, which also has the shortest basin lag time (approx. 3–5 h) leading to flash floods, so that the predictive uncertainty information on flood forecast for this catchment has especially high importance.

In the Midlands Flood Forecasting System (MFSS; a Delft-FEWS forecast production system as described in Werner et al., 2013), the Upper Severn catchment is represented by a combination of numerical models for rainfall–runoff modelling (MCRM; Bailey and Dobson, 1981), hydrological routing (DODO; Wallingford, 1994), hydrodynamic routing (ISIS; Wallingford, 1997), and error correction (ARMA). The input data used within MFSS include (a) real-time spatial data (observed water level and rain-gauge data as well as air temperature and catchment average rainfall); (b) radar actuals; (c) radar forecasts; and (d) numerical weather prediction data (all provided by the UK Meteorological Office). The data available were split into two parts for calibration (7 March 2007, 08:00 UTC–7 March 2010, 08:00 UTC) and validation (7 March 2010, 20:00 UTC–7 March 2013, 08:00 UTC), preserving similar statistical properties in both data sets.

The forecasting system issues two forecasts per day (08:00 and 20:00 UTC) with a time horizon of 2 days. First, the estimates of internal states are obtained by running the models (which are forced with observed precipitation, evapotranspiration and temperature) in historical mode over the previous period. The state variables for the (hydrological) model are soil moisture deficit (SMD, the amount of water required to bring the current soil moisture content to field capacity in the root zone), groundwater level (GW), snow water equivalent (SWE), and snow density (SD). Using a stand-alone version of MFSS, the system (forced by the forecasted precipitation) is then run forward with a time step of 1 h.

It is important to note that this case study, unlike the Brue catchment,
includes errors in the meteorological forecast and the back transformation of
discharge to water level – via a rating curve – in a lumped manner.
Therefore, the effects of

The uncertainty analysis is aimed at estimating predictive uncertainty for
the forecast time series (

In all case studies the QR uncertainty prediction method employs a linear
regression model. While in the Brue catchment the linear regression model
estimates the quantile

Shrestha and Solomatine (2008) tested the UNEEC method on the Brue catchment
to assess residual uncertainty of the one-step-ahead flow estimates. The
predictors of model error identified using AMI and correlation analysis were
only lagged discharge (

In the Upper Severn case studies, a variety of predictors are considered for the model, e.g. observed and modelled water level, forecasted precipitation, and state variables (GW, SMD, SWE, SD). Although the benefits of using the soil moisture (observed or modelled) and groundwater level information for modelling rainfall–runoff processes and predicting runoff are well known in the literature (Aubert et al., 2003; Lee and Seo, 2011; Tayfur et al., 2014), we cannot cite any studies exploring the possible advantages of using such information for improving predictive capabilities of uncertainty analysis methods. Therefore, the dependence of model residuals on variables expressing the internal state of the catchments is also analysed.

Among the state variables, the most significant correlation with the model error was shown by GW and SMD. While GW was found to be positively correlated with model residuals (i.e. as GW increases, error increases too), SMD and model error had a negative correlation. The positive correlation between GW and model residuals can be explained by the fact that high groundwater levels are associated with excessive precipitation during which model errors are higher in magnitude. High soil moisture deficit, on the other hand, indicates that there has been no excessive precipitation and that the soil is not filled up with infiltrated water. High evaporation rates (causing soil to dry up) can also result in high soil moisture deficit. It should be noted that the latter is less likely to be valid for the Upper Severn catchments considering the prevailing climate in the region. Accordingly, lower soil moisture deficit is linked with excessive precipitation events such that soil moisture deficit is negatively correlated with the model error.

Eventually, on the basis of studying the correlations and AMI between
various candidate predictors and the output, and using expert judgement, the
following variables have been chosen to serve as candidate predictors:

the most recent precipitation (

the observed water level (

error (

state variables GW and SMD.
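A simple way to screen candidate predictors by average mutual information is sketched below in Python. It is an assumption-laden illustration, not the procedure used in this study: the histogram estimator, the number of bins, the function names, and the sample series are all hypothetical choices.

```python
import math
from collections import Counter


def bin_index(value, vmin, vmax, n_bins):
    """Index of the equal-width bin that contains value."""
    if vmax == vmin:
        return 0
    k = int((value - vmin) / (vmax - vmin) * n_bins)
    return min(k, n_bins - 1)  # put the maximum into the last bin


def mutual_information(x, y, n_bins=4):
    """Histogram estimate of the mutual information (in bits) of x and y."""
    n = len(x)
    bx = [bin_index(v, min(x), max(x), n_bins) for v in x]
    by = [bin_index(v, min(y), max(y), n_bins) for v in y]
    joint = Counter(zip(bx, by))
    count_x = Counter(bx)
    count_y = Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with counts
        mi += (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
    return mi


# Hypothetical hourly rainfall and model error series (illustrative only).
rain = [0.0, 0.0, 2.0, 5.0, 1.0, 0.0, 0.0, 3.0, 6.0, 0.5, 0.0, 0.0]
error = [0.0, 0.1, 0.0, 0.4, 1.1, 0.3, 0.1, 0.0, 0.6, 1.3, 0.2, 0.0]

# Compare how informative rainfall at different lags is about the error.
for lag in (1, 2, 3):
    print(lag, mutual_information(rain[:-lag], error[lag:]))
```

Unlike the linear correlation coefficient, this estimate also responds to non-linear dependence, although it is sensitive to the number of bins when the sample is short.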

In an attempt at removing least influential inputs, the set of variables
above was then subjected to the model-based optimization: the degree of
influence of various inputs has been explored by running the UNEEC predictor
for different sets of inputs and comparing the resulting PICP and MPI. It was
found that there were only negligible changes (and mostly no change) when

Fuzzy clustering in UNEEC is carried out by the fuzzy c-means method and employs six clusters with the fuzzy exponential coefficient set to 2. The number of clusters was chosen based on computation of the Partition Index (SC), the Separation Index (S) and the Xie and Beni Index (XB) (Bensaid et al., 1996; Xie and Beni, 1991). (It should be mentioned that the sensitivity of PICP and MPI to different numbers of clusters supports the choice of six clusters.)
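For readers unfamiliar with the fuzzy c-means method, a minimal Python sketch of the standard algorithm is given below. It is not the implementation used in this study: the random initialisation, the fixed number of iterations, the function name, and the toy data are assumptions made for illustration, and only two clusters are used here to keep the example small.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means; returns (centres, memberships).

    points: list of equal-length tuples; m: fuzzy exponent (m > 1).
    memberships[i][j] is the degree to which point i belongs to cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centres = []
    for _ in range(n_iter):
        # Cluster centres: means weighted by membership ** m.
        centres = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centres.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Memberships: proportional to distance ** (-2 / (m - 1)).
        for i in range(n):
            dist = [max(math.dist(points[i], c), 1e-12) for c in centres]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centres, u


# Hypothetical, already scaled predictor vectors (illustrative only).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, memberships = fuzzy_c_means(data, n_clusters=2, m=2.0)
for point, row in zip(data, memberships):
    print(point, [round(v, 3) for v in row])
```

In practice the predictors would be scaled to comparable ranges before clustering, since the Euclidean distance otherwise lets the variable with the largest numerical range dominate the partition.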

Within the variables considered in clustering, GW is the most influential
one. Fig. 5 shows fuzzy clustering of GW, SMD, and

It must be noted that in this study the hydrological model output is not included as an additional input to UNEEC (along with the observed discharge/water level) in any of the case studies. However, it may be worth exploring this idea in further studies.

Fuzzy clustering of GW (left, top) and its relation to the model
residuals (right), SMD (left, middle) and

Observed discharge, simulated discharge and model residuals during calibration and validation (Brue catchment).

Probability plots for model residuals (during training) for the Brue
catchment: comparison of the two fitted distributions: normal vs.

This part focuses on statistical error analysis (Sect. 4.1) and comparison of uncertainty analysis results (Sect. 4.2).

Understanding the quality of hydrological model outputs (e.g. water level
forecasts) is important in order to discuss uncertainty analysis results
provided by any method. For this purpose, we analyse the error time series
statistically. We also check the homoscedasticity (the assumption which
simplifies the mathematical and computational treatment of random variables)
of the model residuals. Furthermore, we investigate the normality of model
residuals through probability plots of the

Residual uncertainty varies in time and with the changing hydrometeorological situation, so in this paper we investigate the residual distribution for different hydrometeorological conditions represented by clusters found within the UNEEC method (on the training data set).

The observed discharge plotted against simulated discharge during calibration
and validation periods can be seen in Fig. 6a and c, respectively. During
calibration, although the model residuals are lower at flows higher than
35 m

Figure 6b and d show how model residuals change with increasing discharge values during calibration and validation periods, respectively. Clearly, model residuals of the Brue catchment are heteroscedastic; that is to say, the variance of model residuals varies with the effect being modelled, i.e. observed discharge.

Figure 7 presents probability plots for model residuals during training. The
top left plot compares the two selected distributions (normal distribution
and

Normality of the model residuals' distribution is further investigated for different hydrometeorological conditions as identified by clustering in the space of the predictor variables. Analysis of the probability plot for each cluster formed indicates that there is no significant departure from normality (with regard to the fitted normal distribution), unlike in the overall model residuals. The most striking result among all clusters is achieved in the one representing very high flow and high rainfall (Cluster 4, 0.95 % of total data) (Fig. 7, bottom middle). The distribution of all the other clusters (Clusters 1, 2, 3, and 5) was found to be more or less equally close to normal. When visually compared, these distributions were only slightly less close to normal with respect to Cluster 4.

Standard deviation of model error during calibration and validation (Upper Severn catchments).

Observed water level, forecasted water level and model residuals during calibration and validation (Llanyblodwel, lead time 6 h).

Probability plots for model residuals (during training) for the
Llanyblodwel catchment: comparison of fitted normal distributions for all the
lead times (top left); comparison of the two fitted distributions: normal vs.

Comparison of prediction limits for the 90 % confidence level
during validation:

The quality of (water level) forecasts is assessed based on the standard deviation of model error. The results are presented for different lead times in Fig. 8. During both calibration and validation, the standard deviation of error increases with lead time. A shorter basin lag time is also associated with a larger standard deviation. For example, the catchment with the shortest basin lag time, Llanerfyl, has the largest standard deviation for all lead times. Conversely, the smallest standard deviation always occurs in the catchment with the longest basin lag time, Yeaton. This is mainly due to the fact that the basin lag time represents the memory of a catchment. Hence, the flood forecasting capability of a hydrological model is affected negatively when the basin lag time is short.

The observed water levels are plotted against forecasted water levels in the Llanyblodwel catchment during calibration and validation for lead time 6 h in Fig. 9a and c, respectively. Figure 9b and d show model error plotted against observed water level on the logarithmic scale. Although it is not very clear from Fig. 9a (and Fig. 9c), it is evident from Fig. 9b (and Fig. 9d) that the model error increases with higher water levels, as expected. This confirms the heteroscedasticity of model residuals.
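A rough numerical counterpart to this visual check is to compare the spread of the residuals across groups of increasing observed value. The Python sketch below does this with equal-count bins; the function name, the binning choice, and the sample values are hypothetical, and it is meant as an aid to reading the plots rather than a formal test of heteroscedasticity.

```python
import statistics


def residual_spread_by_level(observed, residuals, n_bins=3):
    """Standard deviation of residuals within equal-count bins of
    increasing observed value (lowest observations first)."""
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    size = len(order) // n_bins
    spreads = []
    for b in range(n_bins):
        start = b * size
        # The last bin takes any leftover points.
        stop = (b + 1) * size if b < n_bins - 1 else len(order)
        group = [residuals[i] for i in order[start:stop]]
        spreads.append(statistics.pstdev(group))
    return spreads


# Hypothetical observed water levels (m) and model residuals (m).
levels = [0.4, 0.5, 0.6, 0.9, 1.1, 1.3, 1.8, 2.2, 2.9]
errors = [0.01, -0.02, 0.01, 0.05, -0.06, 0.04, 0.20, -0.25, 0.30]

print(residual_spread_by_level(levels, errors, n_bins=3))
```

A spread that grows from the lowest to the highest bin is consistent with heteroscedastic residuals, whereas roughly equal values would point to homoscedasticity.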

Normality of model residuals for the Llanyblodwel catchment for all lead
times was investigated (see Fig. 10, top left). Visual inspection of the
probability plots, on which the line joining the 25th and 75th percentiles
of the fitted normal distributions is superimposed, reveals that errors are
not normally distributed, i.e. the data do not fall on the straight line,
especially in the tails. Note that the departure
from normality increases with longer lead times. The top right plot in
Fig. 10 compares the two selected distributions (normal distribution and

Furthermore, a normality check for model residuals' distribution is made
individually for the data clusters corresponding to particular
hydrometeorological conditions. The variables used for clustering are
groundwater level (GW), soil moisture deficit (SMD), and observed water level
(

Uncertainty analysis results for 90 and 50 % confidence levels (Brue catchment).

TR: training; VD: validation.

PICP, MPI, and ARIL values for each cluster (training, 90 % confidence level, Brue): UNEEC vs. QR.

Both the Brue and Llanyblodwel case studies indicate that it is not possible to understand the origin of the model error in uncertainty assessment by looking at the probability plots of model residuals for each cluster. However, it is worthwhile mentioning that it is mostly the extreme events which make the overall distribution non-Gaussian. Classifying data so that different hydrometeorological conditions (most importantly, the extreme events) are separated helps to achieve homogeneity, and thus normality in model residuals' distribution. Therefore clustering can be suggested as an alternative to transformation of model residuals before applying any statistical methods to them.
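The cluster-wise normality check can be mimicked numerically, as in the Python sketch below. It builds the points of a normal probability plot, draws the reference line through the 25th and 75th percentiles, and reports the largest vertical departure from that line for each cluster. This is only a stand-in for the visual inspection used in the paper: the departure measure, the plotting positions, the hard cluster labels, and the sample values are all assumptions introduced for illustration.

```python
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist()


def probability_plot_points(residuals):
    """(theoretical z, ordered residual) pairs with positions (i + 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    return [(STD_NORMAL.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]


def quartile_line(residuals):
    """Slope and intercept of the line through the 25th and 75th percentiles."""
    q1, _, q3 = quantiles(residuals, n=4)
    z1, z3 = STD_NORMAL.inv_cdf(0.25), STD_NORMAL.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    return slope, q1 - slope * z1


def max_departure(residuals):
    """Largest vertical distance between the plot points and the quartile line."""
    slope, intercept = quartile_line(residuals)
    return max(
        abs(r - (intercept + slope * z)) for z, r in probability_plot_points(residuals)
    )


def departure_by_cluster(residuals, labels):
    """Apply max_departure separately to the residuals of each cluster label."""
    groups = {}
    for r, lab in zip(residuals, labels):
        groups.setdefault(lab, []).append(r)
    return {lab: max_departure(vals) for lab, vals in groups.items()}


# Hypothetical residuals with hard cluster labels (illustrative only).
sample = [-0.2, -0.1, 0.0, 0.1, 0.2, -1.5, -0.5, 0.4, 1.2, 6.0]
cluster = ["low flow"] * 5 + ["high flow"] * 5

print(max_departure(sample))
print(departure_by_cluster(sample, cluster))
```

Since the departure is expressed in the units of the residuals, it is best compared between clusters after dividing by the slope of the reference line, which serves as a robust estimate of the residual standard deviation.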

Uncertainty analysis results from both methods are evaluated and compared employing the validation measures explained in Sect. 2.2.

Validation measures PICP, MPI, and ARIL are provided in Table 2. In terms of PICP, even though QR provides PICP values slightly closer to 90 and 50 % during training, UNEEC was found to be more reliable in validation, especially for the 90 % CL. While the narrowest prediction interval on average is given by UNEEC during training for both 90 and 50 % CL, comparable MPI values are obtained during validation. QR has smaller ARIL values, particularly for the 90 % CL. However, on aggregate UNEEC yields better results than QR, especially in validation.

Looking at Fig. 11a, visual analysis of 90 % prediction intervals for the highest flow period in validation reveals that neither UNEEC nor QR is perfectly able to enclose the observations of high flows. Overall, in validation, the analysis results from UNEEC and QR are comparable for the highest peak event (Table 2). For medium peaks in validation, however, QR produces wider uncertainty bounds in comparison to UNEEC. This is illustrated in Fig. 11b. For this medium peak event it should be noted that the higher MPI (and ARIL) value by QR is not manifested in PICP – both methods have very close PICP values (Table 2). One of the reasons for this may relate to the fact that by design UNEEC uses more predictors that explain the (past) catchment behaviour and hence is able to “memorize” catchment behaviour better, and this is especially pronounced during the longer periods of medium flows rather than during high flows having shorter duration.

Comparison of UNEEC and QR based on both PICP and MPI during calibration period (7 March 2007, 08:00 UTC–7 March 2010, 08:00 UTC) and validation period (7 March 2010, 20:00 UTC–7 March 2013, 08:00 UTC) for the 90 % and 50 % confidence levels. (The size of the marker represents the lead time; i.e. the bigger the marker, the longer the lead time.)

MPI (left) and ARIL (right) values obtained during calibration period (7 March 2007, 08:00 UTC–7 March 2010, 08:00 UTC) and validation period (7 March 2010, 20:00 UTC–7 March 2013, 08:00 UTC) for the 90 % confidence level.

Comparison of prediction limits for the 90 % confidence level during validation (1 April 2012–7 March 2013): Yeaton, lead time 3 h (top); Llanyblodwel, lead time 6 h (middle); and Llanerfyl, lead time 12 h (bottom).

We have also compared performance of QR and UNEEC for each cluster found by UNEEC during training. Unlike for the whole data set (which is highly heterogeneous due to extremes in rainfall–runoff processes), analysis for each individual cluster focuses on more homogeneous data sets. Table 3 shows the corresponding PICP, MPI and ARIL. In general, it is difficult to decide which method is better – results are mixed. However, there is one observation that can be made. For most clusters there is a dependency between PICP and MPI: typically the higher MPI corresponds to PICP being closer to the CL (90 %). This may be explained by the fact that for narrow MPIs PICP would be under “pressure” and be lower (however, it would be difficult to generalize). For example, for the high flow cluster (Cluster 4), QR appears to be better in terms of PICP, whereas UNEEC ends up with very narrow MPI, and this is probably the reason why its PICP could not reach 90 % CL.
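The per-cluster comparison can be sketched as follows. This is only an illustration, assuming that each data point carries a cluster label and that PICP and MPI are defined as the coverage percentage and the mean interval width, respectively. The function name `measures_by_cluster` and the example values are hypothetical and do not reproduce Table 3.

```python
from collections import defaultdict


def measures_by_cluster(labels, observed, lower, upper):
    """Return {cluster label: (PICP in %, MPI)} computed separately per cluster."""
    groups = defaultdict(list)
    for c, y, lo, hi in zip(labels, observed, lower, upper):
        groups[c].append((y, lo, hi))

    result = {}
    for c, rows in groups.items():
        n = len(rows)
        # Coverage: percentage of observations inside the interval.
        picp = 100.0 * sum(lo <= y <= hi for y, lo, hi in rows) / n
        # Sharpness: mean width of the interval.
        mpi = sum(hi - lo for _, lo, hi in rows) / n
        result[c] = (picp, mpi)
    return result


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    labels = [1, 1, 2, 2]
    obs = [1.0, 2.0, 10.0, 12.0]
    low = [0.5, 2.5, 9.9, 11.9]
    upp = [1.5, 3.5, 10.1, 12.05]
    for cluster, (picp, mpi) in sorted(measures_by_cluster(labels, obs, low, upp).items()):
        print(cluster, picp, mpi)
```

Reporting both measures side by side for each cluster makes the trade-off discussed above visible: a very narrow interval is only useful if its coverage stays close to the confidence level.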

The reported comparison was done for the clusters found by UNEEC during training. In principle, a similar comparison can also be made for the homogeneous groups of data in the validation set; however, this may not make much sense since this set imitates the model in operation, and in operation all models are run for individual input vectors at each time step of the model run, and not for the whole set of data (so the “validation set” in operation will never exist).

For these catchments, in order to reflect performance for different lead times better, we are using the graphical representation of results.

Figure 12 shows the PICP values plotted against the MPI for the calibration and validation periods. The most important general conclusion is that both methods show excellent results in terms of PICP for 90 % CL. For the 50 % CL the results seem to be worse, especially for UNEEC – but the reader should take into account that for the low lead times the hydrological models are very accurate; hence MPI is extremely narrow (especially for 50 % CL) and it is no surprise that PICP cannot be accurately calculated. Furthermore, for the 90 % CL, the following can be said: for Yeaton, QR does slightly better than UNEEC; for Llanyblodwel, both methods are equally good; for Llanerfyl, the UNEEC method is a bit better than QR.

For the further analysis, Fig. 13 presents MPI and ARIL values for the 90 % CL on calibration and validation data sets. It can be seen that with the increase in the lead time, the forecast error obviously increases, and the values of both indicators follow. In view of the (high) model accuracy, the relatively low MPI values in the Yeaton catchment are not surprising for both methods. Overall, the results are mixed: for some catchments, QR is marginally better; for other catchments, UNEEC has the higher performance.

Comparison of prediction limits for the falling limb part of the
hydrographs (medium water levels) for the 90 % confidence level during
validation:

PICP, MPI, and ARIL values for MEDIUM water levels (validation, 90 % confidence level): UNEEC vs. QR.

For the further comparison of estimated prediction limits through uncertainty
plots, three cases are selected based on the relationship between basin lag
time and lead time. These cases are (1) Yeaton, lead time 3 h (lead time
< basin lag time), (2) Llanyblodwel, lead time 6 h (lead time

In Llanerfyl, one can notice a strange behaviour of the model
causing sharp changes in forecasted water levels (unstable model outputs),
and thus in prediction limits. Considering that the Llanerfyl catchment has a
basin lag time of

For the low water levels in Yeaton and Llanyblodwel, UNEEC gives wider prediction intervals as compared to QR. A possible explanation for this can be encapsulation of groundwater level information in UNEEC. Groundwater levels remain at higher levels for longer periods than water levels in the river (i.e. due to slow and long response times of groundwater levels to changing hydrometeorological conditions). Thus, using GW as an input variable in its non-linear model, UNEEC has the potential to provide an uncertainty band of larger widths for water levels when the groundwater level is high.

For the medium water levels in Yeaton and Llanyblodwel, QR gives wider prediction intervals than UNEEC, which is confirmed by the higher MPI and ARIL values (without any significant improvement in PICP) obtained by QR for medium water levels (Table 4). This is particularly true on the falling limb part of the hydrographs as exemplified in Fig. 15a and b (for Yeaton and Llanyblodwel, respectively). The averages of the MPI values over the three examples shown for Yeaton and Llanyblodwel are 0.0204 and 0.0201 m, respectively, for UNEEC, whereas for QR they are 0.0418 and 0.0295 m.

For peak water levels in the Yeaton and Llanyblodwel catchments, it is mostly QR that produces a higher upper prediction limit than UNEEC. Yet, this does not contribute to the overall performance of the method significantly. On the contrary, it is seen in some cases that such high upper prediction limits make the uncertainty band unnecessarily wide.

Continuous peaks prevail in the Llanerfyl catchment (as its basin lag time is far shorter than the forecast lead time of interest). Such continuous peaks occur during certain periods in the Llanyblodwel catchment too. In most of these cases, UNEEC gives a narrower uncertainty band, and the wider prediction interval computed by QR is redundant. That is to say, it does not contribute to the QR method's performance (as measured by PICP) at all in terms of its ability to enclose more observations within the band. For peak water levels, however, QR is slightly more informative than UNEEC.

Noticeably, upper prediction limits obtained by QR in the Llanerfyl catchment for the long-lasting falling limb part of the hydrograph (indicated by arrows in Fig. 14c) are too high, e.g. even greater than those provided by UNEEC. QR (in this study, by design) is a method building simple linear regression models that relate observed water levels to forecasted water levels only. Because of this rather simple mathematical formulation, the sensitivity of the computed upper prediction limit to the magnitude of the water level may increase and have an amplifying effect on the uncertainty band width.
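The way a single-predictor linear quantile model ties the upper limit to the magnitude of the predictor can be illustrated with a small sketch. It is not the QR configuration used in this study (which may, for instance, operate on transformed variables), and the names `pinball_loss` and `fit_linear_quantile` are hypothetical. The sketch fits a straight line for a given quantile by minimizing the pinball (check) loss. It relies on the known property that, for a linear quantile model with an intercept and one predictor, an optimal line passes through two data points, so a brute-force search over pairs is sufficient for small samples.

```python
from itertools import combinations


def pinball_loss(tau, residual):
    """Check loss for quantile tau, with residual = observed - predicted."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def fit_linear_quantile(x, y, tau):
    """Return (intercept, slope) of the line minimizing the total pinball loss.

    Brute-force search over all lines through two data points; suitable for
    small illustrative samples only (cost grows with the cube of the size).
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b * x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(pinball_loss(tau, yi - (intercept + slope * xi))
                   for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical data: the response departs strongly at the largest predictor.
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [0.0, 1.0, 2.0, 3.0, 10.0]
    print(fit_linear_quantile(x, y, 0.5))  # median line
    print(fit_linear_quantile(x, y, 0.9))  # upper-quantile line
```

In this toy example the upper-quantile line is steeper than the median line, so the distance between the two grows with the predictor value. This is one way in which a linear formulation can widen the band at high water levels.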

PICP, MPI, and ARIL values for each cluster (training, 90 % confidence level, Llanyblodwel, lead time 6 h): UNEEC vs. QR.

Table 5 shows the values of validation measures (PICP, MPI, and ARIL) for each cluster (obtained during training) for the Llanyblodwel catchment (lead time 6 h). For flood management, Cluster 2 (4.6 % of all data), with high groundwater levels and hence potentially corresponding to flood conditions, could be the most interesting one. In UNEEC, the highest MPI value was obtained for this cluster, with a relatively poor PICP value compared to the other clusters. With the QR method, the largest MPI was likewise obtained for this cluster. Both methods provide equally poor PICP values. Giving a wider uncertainty band than UNEEC on average, QR is less capable of estimating reasonable prediction limits for very high groundwater levels. This is also supported by its greater (12 %) ARIL value compared to UNEEC.

PICP and MPI values for Cluster 4 should be mentioned as well. This cluster represents the situations with the very low water levels, very low groundwater levels, and very high soil moisture deficit, and constitutes 16.6 % of all the data. In comparison to UNEEC, QR provides a PICP value very close to 90 % CL despite its slightly lower MPI. Thus, one can say that UNEEC fails in providing reliable uncertainty estimates for the extreme condition associated with very low water and groundwater levels. This can be due to the effect of using state variables as predictors. All in all, the state variables are calculated by the model and they cannot reflect real catchment conditions accurately, especially when the (hydrological) model is not very accurate. That is particularly true for the extreme events, considering that models mostly fail in simulating such events.

Overall, UNEEC is worse than QR on one cluster but better or equal on all other clusters; however, in general, both methods show reasonably good results in terms of PICP.

This study should be seen as
accompanying the study by López López et al. (2014) (and earlier work
on UNEEC and QR) and presents a comparative evaluation of uncertainty
analysis and prediction results from QR and UNEEC methods on the four
catchments that vary in hydrological characteristics and the models used:
Brue catchment (simulation mode) and Upper Severn catchments – Yeaton,
Llanyblodwel, and Llanerfyl (forecasting mode). The latter set of case
studies is important from a practical perspective in that the effect of lead
time on uncertainty analysis results and its relation to basin lag time is
demonstrated. For both QR and UNEEC different model configurations than their
previous applications are considered. One of the reasons to compare these two
methods was to understand if a simpler linear method (QR) using fewer input
data performs well compared to a more complex (non-linear) method (UNEEC)
with more predictors. The following conclusions can be drawn from the results
of this study:

In terms of ease of set-up (data preparation and calibration), preference should be given to QR simply because it is a simpler linear method with one input variable (in this study), whereas UNEEC has more steps and requires more data analysis. However, the model set-up is carried out only once, and in operation both methods can be easily used and both have very low running times (a fraction of a second on a standard PC) since they are based on algebraic calculations.

In almost all case studies both methods adequately represent residual uncertainty and provide similar results consistent with the understanding of the hydrological picture of the catchment and the accuracy of the (hydrological) models used. We can recommend both methods for use in flood forecasting.

In one case study, Llanerfyl, we found that UNEEC gave more adequate estimates than QR. This catchment has a shorter basin lag time and the model outputs for this catchment were characterized by a relatively high error, so our conclusion was that, in such a rapid response catchment, UNEEC's more sophisticated non-linear models were probably able to capture the relationships between the hydrometeorological and state variables and the quantiles better than QR's linear model.

A useful finding is that inclusion of a variable representing groundwater level (GW) as a predictor in UNEEC improves its performance for the Upper Severn catchments. This can be explained by the fact that this variable has a high level of information content about the state of a catchment. However, it should be noted that, in other catchments, using such information can be misleading due to slow (and long) response times of groundwater levels to changing hydrometeorological conditions. Yet, overall, it can be advised to make use of variables which can be representative of the hydrological response behaviour of a catchment for improving the predictive capacity of data-driven methods.

We recommend comparing the two presented methods (QR and UNEEC) with more predictive uncertainty methods which use different methodologies, such as HUP (Krzysztofowicz, 1999), the more recent MCP (Todini, 2008) and DUMBRAE (Pianosi and Raso, 2012). Yet another recommendation (induced by the referees' and Editor's suggestions) is to extend the list of the possible performance measures and to test the applicability of the methods developed for the assessment of the probabilistic forecast quality (Laio and Tamea, 2007), whose mathematical apparatus is transferable to the problem of residual uncertainty prediction.

It can also be recommended to test capabilities of different predictive uncertainty methods on theoretical cases with the known distributions, as well as on the catchments of distinct hydrologic behaviour, with diverse climatic conditions, and having various hydrological features. In this study, we found that the basin lag time is a notable characteristic of a catchment having great influence on uncertainty analysis results (as measured by PICP and MPI). When the lag time is longer, the catchment memorizes more information regarding its hydrological response characteristics.

On the other hand, exploring the performance of different methods on

When different predictive uncertainty methods are evaluated based on their comparative performance, it is more important to have validation measures incorporating certain aspects of rainfall–runoff processes, i.e. varying flow conditions. For example, the accuracy of the hydrological model decreases during high flow events, and thus the amount of residual uncertainty increases. This necessitates exploring validation measures linking the prediction interval to the (hydrological) model quality, e.g. by employing the weighted mean prediction interval (Dogulu et al., 2014).

There are other possibilities for further improvements in both of the presented methods. For example, different configurations of QR, alternative clustering techniques for UNEEC, and the use of instance-based learning (e.g. locally weighted regression) as the predicting model in UNEEC can be explored further.

The results presented in this paper were obtained during the thesis research of the European Erasmus Mundus Flood Risk Management Master students Nilay Dogulu and Patricia López López. The Environment Agency is gratefully acknowledged for providing the data for the Upper Severn catchments in the UK. We are very grateful to the two anonymous reviewers and the HESS Editor, E. Todini, for very useful comments and suggestions.

Edited by: E. Todini