In bcgov/BCWaterDischargeAnalysis: Analysis of stream discharge data

This will demonstrate the impact of missing values on the computations using a sample analysis of the stream discharge data from station WSC.08HA011 (Cowichan River near Duncan) after creating some missing values (at random) during the period of record.

The base data is provided as a sample dataset (WSC.08HA011) with this package. The basin has an area of 826 km**2. While the data frame runs from 1900 onwards, data was only populated in this example starting in 1965 onwards.

Loading the data

We first load the package, load the data, add a Year column, and look at the first few rows starting in 1965.

library(plyr)       # split-apply-combine paradigm

library(BCWaterDischargeAnalysis)


data(WSC.08HA011)
WSC.08HA011$Year <- as.numeric(format(WSC.08HA011$Date, "%Y"))
head( WSC.08HA011[ WSC.08HA011$Year ==1965,])

Notice that the flow dataframe MUST have two variables: Date and Q - the daily discharge value.

Variable Description

Date The date of the reading in standard R date class format.

For example to convert a 4-digit year, numeric month, and numeric day
of the month use

flow$Date <- as.Date{paste(flow$Year,'-',flow$Month,'-',flow$Day,sep="")
To convert a character string (e.g. '1/15/2001') use the

flow$Date <- as.Date(flow$chardate, "%d/%m/%Y")
The formatting codes (%Y etc) are explained in the help for the strptime() function.

Q The average daily flow as a numeric value.

Other variables in the data frame will be ignored.

We now set the variables limiting the years for which the analysis should be done and setting up other needed variables for the analysis.

Station.Code <- 'WSC-08HA011'
Station.Name <- 'Cowichan River near Duncan'
Station.Area <- 826    # square km's

start.year   <- 1965  # when do you want the analysis to start at.
end.year     <- 2012  # what is last year with complete data

We now create some missing values at random in the dataset.

# Create 1% missing values (approximately)
set.seed(3423432)
sum(is.na(WSC.08HA011[ start.year <= WSC.08HA011$Year & WSC.08HA011$Year <= end.year, "Q"]))

missing <- sample(1:nrow(WSC.08HA011), .0003*nrow(WSC.08HA011))
WSC.08HA011$Q[ missing ] <- NA
sum(is.na(WSC.08HA011[ start.year <= WSC.08HA011$Year & WSC.08HA011$Year <= end.year, "Q"]))

Impact of missing values

According to the WSC,

Monthly or annual values are not calculated unless there is complete daily record. If the gauge went down and the daily record cannot be reasonable estimated, then summaries at longer time scales (monthly, annual) are not computed.

This is controlled in all of the functions in this package using the na.rm argument (similar to many R functions such as mean() etc), but usage is slightly different from the standard R usage to allow future versions of the software to have finer control on dealing with missing values.

The treatment of missing values for all functions is controlled through the na.rm=list(na.rm.global=XXXXX) argument. The current default actions are:

Function | Default Action ---------|----------------- compute.Q.stat.annual | na.rm=list(na.rm.global=FALSE) compute.Q.stat.longterm | na.rm=list(na.rm.global=TRUE) compute.Q.percentile.longterm | na.rm=list(na.rm.global=TRUE) compute.volume.frequency.analysis | na.rm=list(na.rm.global=TRUE)

If na.rm=list(na.rm.global=FALSE), i.e. do NOT remove missing values before computing statistics, then if there is a missing value present in a series, the missingness is propagated forward. For example, if a month is missing a value, then the minimum flow for that month in that year is not computed, but the minimum flows in the other months are computed. The minimum flow for that year that has the missing value is also not computed. If the missing value was in January, then the JFM season missing is also not computed, but the minimum for AMJ etc are computed. Similarly 6 month averages are computed only if they have a complete record.
The rule is the same for both the CY (calendar-year) and WY (water-year) statistics.

The missing value also affects a rolling average that includes the missing value. This may cross month boundaries, e.g. if the missing value occurred on 31 January, then the 3 day rolling average for the 1 Feb is also missing so the statistics for February on the 3 day rolling average are also not computed. If the missing value occurred at the end of a year, the rolling average in January of the next year may be affected, and so on.

There is also an unintended consequence with the rolling averages. In the first year of the series, the 3, 7, and 30 day rolling averages for the 1st January will be missing and so any monthly/ season/ yearly statistic on the rolling averages will also be missing. The way around this is to INCLUDE data prior to 1 January, e.g. the actual data record runs from 1940 onwards, but you set the start.year=1941. Then the rolling average for 1941-01-01 can be computed.

Because the first year water-year starts in the previous October, many off the water-year statistics will be missing for the first year, unless you again include more data before the 1 January in the start.year (as above).

If na.rm=list(na.rm.glob=TRUE), then ALL statistics are computed with the missing value excluded, i.e. in a month with a missing value, the statistics are computed on the remaining values.

The default action to deal missing values for the long-term statistics is to EXCLUDE all missing values, and so statistics are still computed. For example, if 1942-05-12 has a missing value, any statistics about May will ignore this missing value and at the long-term averages will still be computed.

The default rule of the volume frequency analysis is to exclude all missing values so so you will get statistics for years without a complete record. If the na.rm argument is set to TRUE, then if a year has a missing values in finding the min etc, that year has a NA for the statistic and that year is then EXCLUDED from fitting the distribution. So if you had 40 years on the record, and 1 year had a missing value, then 39 years would be used in the analysis. Because the rolling average is always missing in the first year, any analysis involving a rolling average will miss the first year.

There is a bit of weirdness if you choose to analyze on the water-year and set the na.rm.global=TRUE, because it will compute statistics for the first water-year even though you may be missing data from the previous October prior to 1 January of start.year. Be careful.

Counting the number of missing values.

The simple summary statistics are computed as before.

# get a simple summary of the data
# Table of simple statistics by year to help check for outliers, etc
WSC.08HA011.sum <- plyr::ddply(WSC.08HA011[ WSC.08HA011$Year >= start.year & WSC.08HA011$Year <=end.year,], "Year", plyr::summarize,
         n.days   = length(Year),
         n.Q      = sum (!is.na(Q)),
         n.miss.Q = sum ( is.na(Q)),
         min.Q    = min (Q, na.rm=TRUE),
         max.Q    = max (Q, na.rm=TRUE),
         mean.Q   = mean(Q,na.rm=TRUE),
         sd.Q     = sd  (Q,na.rm=TRUE))
WSC.08HA011.sum

We see that there are now missing values in a few years (e.g. 1967).

Computing the Annual Statistics (on calendar- and water-year)

We compute the plethora of annual statistics with the default action to propagate missing values.

stat.annual <- compute.Q.stat.annual(
                          Station.Code  =Station.Code,
                          Station.Area  =Station.Area,
                          flow          =WSC.08HA011,
                          start.year    =start.year,
                          end.year      =end.year)

The returned list includes a dataframe of the missing days

stat.annual$dates.missing.flow

The returned dataframe of statistics has many, many statistics some of which are shown below. Notice that some are now missing (e.g. in 1967) over and above the missing in the rolling averages for the first year.

head(stat.annual$Q.stat.annual[, 1:10])

To ignore the missing values, we set the na.rm parameter:

stat.annual.ignore.missing <- compute.Q.stat.annual(
                          Station.Code  =Station.Code,
                          Station.Area  =Station.Area,
                          flow          =WSC.08HA011,
                          start.year    =start.year,
                          end.year      =end.year,
                          na.rm=list(na.rm.global=TRUE))
head(stat.annual.ignore.missing$Q.stat.annual[, 1:10])

Notice that all statistics are computed even if there were missing values.

Computing the long term summary statistics

The long-term summary statistics default action is to IGNORE the missing values.

stat.longterm <- compute.Q.stat.longterm(
                          Station.Code =Station.Code,
                          #Station.Area =Station.Area,
                          flow         =WSC.08HA011,
                          start.year   =start.year,
                          end.year     =end.year)
head(stat.longterm$Q.cy.stat.longterm)

If you wish missing values to propagate, use:

stat.longterm.missing <- compute.Q.stat.longterm(
                          Station.Code =Station.Code,
                          #Station.Area =Station.Area,
                          flow         =WSC.08HA011,
                          start.year   =start.year,
                          end.year     =end.year,
                          na.rm=list(na.rm.global=FALSE))
head(stat.longterm.missing$Q.cy.stat.longterm)

Computing the long term percentile statistics

The long-term percentile statistics default action is to IGNORE the missing values.

percentile.longterm <- compute.Q.percentile.longterm(
                          Station.Code=Station.Code,
                          #Station.Area=Station.Area,
                          flow        =WSC.08HA011,
                          start.year  =start.year,
                          end.year    =end.year)
head(percentile.longterm$Q.cy.percentile.stat)

If you want to propagate missing values, again set the na.rm argument:

percentile.longterm.missing <- compute.Q.percentile.longterm(
                          Station.Code=Station.Code,
                          #Station.Area=Station.Area,
                          flow        =WSC.08HA011,
                          start.year  =start.year,
                          end.year    =end.year,
                          na.rm=list(na.rm.global=FALSE))
head(percentile.longterm.missing$Q.cy.percentile.stat)

Computing a volume frequency analysis

The volume frequency analysis automatically drops missing values:

vfa.analysis <- compute.volume.frequency.analysis( 
                      Station.Code   =Station.Code, 
                      flow           =WSC.08HA011,
                      start.year     =start.year, 
                      end.year       =end.year)
vfa.analysis$fitted.quantiles.trans

If you set the na.rm parameter the output appears to be unaffected:

vfa.analysis.missing <- compute.volume.frequency.analysis( 
                      Station.Code   =Station.Code, 
                      flow           =WSC.08HA011,
                      start.year     =start.year, 
                      end.year       =end.year,
                      na.rm=list(na.rm.global=FALSE))
vfa.analysis.missing$fitted.quantiles.trans

The years with the missing values will be DROPPED from the analysis, and the analysis will continue with the remaining years of data, i.e. the sample size for fitting the distributions will be reduced. It is difficult to see this in action other than by actually looking at the plotting data for the frequency plot and comparing to the same summary with no missing data.

plyr::ddply(vfa.analysis.missing$plotdata, "Measure", plyr::summarize, n.years=length(Measure))
plyr::ddply(vfa.analysis$plotdata        , "Measure", plyr::summarize, n.years=length(Measure))