About

This vignette provides an introduction to the basic usage of the climex package. You will learn how to preprocess a time series and afterwards fit the generalized extreme value (GEV) and generalized Pareto (GP) distributions to it.

Preprocessing and fitting

The climex package is intended to provide tools for fitting both the GEV and the GP distribution in a stationary setting. Since the workflow differs depending on which limiting distribution is considered, I will cover both use cases in consecutive parts, starting with the GEV one.

For a thorough and didactic introduction to extreme value theory please refer to the book of Stuart Coles, "An Introduction to Statistical Modeling of Extreme Values", Springer, 2001.

For the sake of convenience, all of the analysis will be performed on the daily maximum temperature series of the Potsdam station in Germany, provided by the climex package via the variable temp.potsdam.

require( climex )
data( temp.potsdam )

## convenience function for plotting xts class time series using ggplot2
ttplot( temp.potsdam )

The GEV distribution

The essence of this branch of extreme value analysis is to split a time series into equally sized blocks and to extract only the maximum value of each block. These block maxima are then fitted via the maximum likelihood function of the generalized extreme value (GEV) distribution. In most use cases, especially when working with temperature series, the block length is set to one year to get rid of the seasonality in the data.
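Conceptually, annual block maxima can also be obtained with plain xts tools. The following one-liner merely illustrates the idea; it skips the preprocessing steps discussed below and is not a replacement for the climex functions.

require( xts )
## One maximum per calendar year; incomplete years and the seasonal
## cycle are not yet taken care of at this point.
annual.maxima.raw <- apply.yearly( temp.potsdam, max )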

Removing incomplete years

Since we are extracting the maximum value of each year, it is quite important that all of the years are actually complete. Imagine that for one year only data from January and December are available. The maximum of this year would then most likely be an artifact, leading to wrong GEV parameters in the fit.

A quite straightforward way to circumvent this problem is to remove all incomplete years from your time series. This will of course shorten the series, and it is certainly over the top when just one or two days are missing. On the other hand, it is a rather convenient way to ensure the validity of your analysis in a massively parallel setting, where the exploration of each individual time series becomes intractable.

temp.potsdam.complete <- remove.incomplete.years( temp.potsdam )

ttplot( temp.potsdam.complete )
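In case you want to see which years were dropped, you can count the observations per year by hand. The following sketch assumes a daily resolution of the series and is not part of the climex API.

## Number of observations per calendar year
obs.per.year <- table( format( index( temp.potsdam ), "%Y" ) )
## Years with fewer than 365 days are incomplete
names( obs.per.year )[ obs.per.year < 365 ]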

Discarding short time series

Just a remark: make sure the data you analyze contains a sufficient number of points! Since we want to extract the annual maxima, a minimal length of 30 years would be adequate for our climatological series. As a rule of thumb, choose your data and block size in such a way that you always end up with at least 30 points. This is important since we will fit the three-parameter GEV distribution in a later step, and with a smaller number of points we would not be able to do decent statistics.
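Such a sanity check is easily written by hand. The snippet below is just an illustrative sketch and not part of the climex package.

## Number of complete years remaining after the preprocessing
n.years <- length( unique( format( index( temp.potsdam.complete ),
                                   "%Y" ) ) )
if ( n.years < 30 )
  warning( "Less than 30 annual maxima; the GEV fit may be unreliable." )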

Removing seasonality

One of the basic assumptions of extreme value theory is that the time series is stationary and does not contain any correlations. This is a rather bold approximation (especially in the context of climate change), but our hands are tied when it comes to alternative approaches for estimating the return levels of our time series. So we bite the bullet and assume that our time series is indeed stationary and does not contain any long-range correlations. But there is still a very prominent short-range correlation present in e.g. the temperature: the annual cycle. We will get rid of it by calculating the anomalies: we calculate the long-term mean of each calendar day and subtract it from the corresponding values.

temp.potsdam.anomalies <- anomalies( temp.potsdam.complete )

ttplot( temp.potsdam.anomalies )
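To illustrate what the anomalies() function does conceptually, here is a hand-rolled version of the calendar-day mean subtraction. It may differ in detail from the actual climex implementation (e.g. in the handling of leap days).

## Long-term mean for each calendar day
day.of.year <- format( index( temp.potsdam.complete ), "%m-%d" )
daily.means <- tapply( as.numeric( temp.potsdam.complete ),
                       day.of.year, mean )
## Subtract each day's mean from the corresponding observations
anomalies.by.hand <- temp.potsdam.complete - daily.means[ day.of.year ]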

Working with anomalies involves a trade-off: on the one hand, way more data contributes to the analysis (in the raw series you cannot expect the hottest day of the year to occur during winter), and thus the asymptotic properties of the GEV distribution are more likely to be fulfilled (quarter-annual blocks may already be sufficient). On the other hand, the interpretation of the extreme events in the anomalies is more difficult than in the raw series.

So whether or not you perform this step of the preprocessing depends on your application at hand.

Blocking

In the final step of our preprocessing we will separate the time series into annual blocks and extract their maximal values. Even if some short-range correlations are left in the data, caused by the annual cycle or other sources, this step will get rid of them, since consecutive block maxima are typically far apart in time.

## By default the block length of the block() function is one year
temp.potsdam.blocked <- block( temp.potsdam.anomalies )

ttplot( temp.potsdam.blocked )

Fitting the GEV distribution

After we have extracted all the extreme events from our time series, it's time to fit the GEV distribution to them. In addition, we will also calculate the 100-year return level of the time series.

## The result will be a list of objects similar to the output of the
## optim function.
gev.potsdam <- fit.gev( temp.potsdam.blocked )

gev.return.level.potsdam <- return.level( gev.potsdam )
print( gev.return.level.potsdam )
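For reference, the m-year return level can also be derived by hand from the fitted GEV parameters by inverting the GEV distribution function (see Coles, 2001, Chapter 3). The layout of the fit object assumed below, an optim-like list with the parameters in $par ordered as location, scale, and shape, is an assumption; please check str( gev.potsdam ) for your version of the package.

## 100-year return level from the GEV quantile function:
## z_m = mu + sigma/xi * ( ( -log( 1 - 1/m ) )^( -xi ) - 1 )
m <- 100
mu <- gev.potsdam$par[ 1 ]      # location (assumed ordering)
sigma <- gev.potsdam$par[ 2 ]   # scale
xi <- gev.potsdam$par[ 3 ]      # shape
return.level.by.hand <- mu + sigma/xi *
  ( ( -log( 1 - 1/m ) )^( -xi ) - 1 )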

## another convenience function to easily bundle plots
multiplot( plotlist = list( ttplot( temp.potsdam.blocked ),
                           plot( gev.potsdam ) ), cols = 2,
          main = "Blocked temperature anomalies and GEV fit" )

The GP distribution

Even though the individual steps of the GP approach differ from the GEV ones, the overall character of the analysis is very similar. Again we want to have as much data as possible while still fulfilling the asymptotic conditions that allow us to fit the extreme events with the limiting distribution, and again we have to get rid of the short-range correlations. I will therefore keep the explanation more condensed than in the last section.

The main difference compared to the previous approach, apart from fitting the likelihood function of the GP instead of the GEV distribution, is that we now apply a threshold to our time series. All events above this threshold will be considered extreme events in our analysis.

Removing short-range correlations

Especially for climate time series it is quite evident that there will be short-range correlations among the threshold exceedances. We have all experienced heat waves or consecutive days of heavy rain.

To prevent these correlations from interfering with our analysis, we have to decluster the points obtained by applying the threshold. The declustering function looks at the time stamps of all exceedances and searches for gaps of at least a predefined length; let's call it x for now. All exceedances that are not separated by at least x points below the threshold will be grouped into one cluster. If we faced a heat wave with consecutive daily maximum temperatures of, say, 30, 31, 32, 29, 29, 33, and 30 degrees Celsius, with our threshold at 29.5 degrees and x equal to 3, all exceedances of this heat wave would be grouped into one single cluster.

After grouping all exceedances into clusters, only the maximum value of each cluster is extracted and fitted by the GP distribution.
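The following stand-alone sketch reproduces this grouping logic for the heat-wave example above. It is only an illustration of the idea; the actual climex implementation may differ.

## Heat-wave example: seven consecutive daily maximum temperatures
x.series <- c( 30, 31, 32, 29, 29, 33, 30 )
u <- 29.5    # threshold
x.gap <- 3   # minimal gap (in below-threshold points) between clusters
exceed.idx <- which( x.series > u )
## Start a new cluster whenever two consecutive exceedances are
## separated by at least x.gap points below the threshold
cluster.id <- cumsum( c( 1, ( diff( exceed.idx ) - 1 ) >= x.gap ) )
## Only the maximum of each cluster enters the GP fit
cluster.maxima <- tapply( x.series[ exceed.idx ], cluster.id, max )
print( cluster.maxima )   # a single cluster with maximum 33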

Fortunately, we do not have to set x manually; it is determined using the extremal index (see Ferro & Segers, 2003, Journal of the Royal Statistical Society, Series B).

## By default the threshold function of the climex package will decluster your
## time series. You can disable it by setting the decluster argument to FALSE
temp.potsdam.declustered <- threshold( temp.potsdam, threshold = 33 )

ttplot( temp.potsdam.declustered )

Fitting the GP distribution

After we have obtained our extreme events by applying a sufficiently high threshold to our time series and declustering the resulting points, we will fit the likelihood function of the generalized Pareto (GP) distribution to our data using the maximum likelihood approach and determine the 100-year return level.

## Fitting the GP distribution
## Since in the default setup the return level is calculated not in
## units of time but in numbers of exceedances, we have to provide the
## length of the original time series to rescale our return level to
## years.
gp.potsdam <- fit.gpd( temp.potsdam.declustered, threshold = 33,
                      total.length = length( temp.potsdam ) )

gp.return.level.potsdam <- gp.potsdam$return.level
print( gp.return.level.potsdam )
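Again for reference, the N-year return level can be derived by hand from the fitted GP parameters (see Coles, 2001, Chapter 4). As before, the assumed $par layout (scale, then shape) is not guaranteed; check str( gp.potsdam ) yourself.

## N-year GP return level: u + sigma/xi * ( ( m * zeta )^xi - 1 ),
## where zeta is the (declustered) probability of exceeding the
## threshold and m the number of observations corresponding to N years
u <- 33
zeta <- length( temp.potsdam.declustered ) / length( temp.potsdam )
m <- 100 * 365.25               # 100 years of daily observations
sigma <- gp.potsdam$par[ 1 ]    # scale (assumed ordering)
xi <- gp.potsdam$par[ 2 ]       # shape
gp.return.level.by.hand <- u + sigma/xi * ( ( m * zeta )^xi - 1 )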

## another convenience function to easily bundle plots
multiplot( plotlist = list( ttplot( temp.potsdam.declustered ),
                           plot( gp.potsdam ) ), cols = 2,
          main = "Declustered daily maximum temperature and its GP fit" )

