suppressPackageStartupMessages(library("foehnix"))
The "Tyrolean" data set provides hourly observations from two stations, namely "Ellbögen" and "Sattelberg" located in Tyrol, Austria.
Ellbögen is our target station (valley site) located in the Wipp Valley, a north-south oriented alpine valley on the northern side of the European Alps. To the north the Wipp Valley opens into the Inn Valley (close to Innsbruck, the capitol of Tyrol), to the south the valley ends at a narrow gap in the main Alpine ridge called Brennerpass ($1370~m$; the pass between Austria and Italy) flanked by mountains ($>2100~m$). The Wipp Valley is one of the lowest and most distinct cuts trough the Alpine mountain range and well known for south foehn (north of the Alps). Station Sattelberg serves as crest station and provides observations of the upstream air mass during south foehn events. The station is located on top of the mountain to the west of the pass.
library("leaflet") library("sp") stations <- data.frame(lon = c(11.42889, 11.47889), lat = c(47.18694, 47.01083), alt = c(1080, 2107), name = c("Ellbögen", "Sattelberg"), stringsAsFactors = FALSE) b <- list(x0 = min(stations$lon) - .2, x1 = max(stations$lon) + .2, y0 = max(stations$lat) + .2, y1 = max(stations$lat) + .2) m <- leaflet(width = "100%") %>% addTiles() %>% fitBounds(b$x0, b$y0, b$x1, b$x2) m <- setView(m, mean(stations$lon), mean(stations$lat), zoom = 10.5) for (i in 1:nrow(stations)) { m <- addPopups(m, stations$lon[i], stations$lat[i], sprintf("Station %s, %d m a.m.s.l.", stations$name[i], stations$alt[i])) } m <- addProviderTiles(m, "OpenTopoMap") m
The call demodata("tyrol")
returns the combined
data set for both station (Ellbögen and Sattelberg).
In addition, the potential temperature difference between the two stations
is calculated by reducing the dry air temperature from "Sattelberg"
to the height of "Ellbögen" (dry adiabatic lapse rate of 1K per 100m;
stored on diff_t
).
Details can be found on the demodata
reference page.
data <- demodata("tyrol") head(data, n = 3) # Check if our object is a numeric zoo object: c("is.zoo" = is.zoo(data), "is.numeric" = is.numeric(data), "is regular" = is.regular(data), "is strictly regular" = is.regular(data, strict = TRUE)) c("tepmoral resolution (seconds)" = deltat(data), "missing values" = sum(is.na(data)))
The data set returned is a regular numeric
zoo
time series object.
Note that the data set is not strictly regular (contains hourly observations,
but some are missing) and contains quite some missing values (NA
).
This is not a problem as the functions and methods will take care of missing
values and inflate the time series object (regular $\rightarrow$ strictly regular).
Important: The names of the variables in the Tyrolean data set are the "standard names" on which most functions and methods provided by this package are based on. To be precise:
t
, relative humidity rh
,
wind speed ff
, wind direction dd
(meteorological, degrees $\in [0, 360]$)crest_t
, relative humidity crest_rh
,
wind speed crest_ff
, wind direction crest_dd
($\in [0, 360]$)diff_t
(calculated
by demodata
... however, all functions arguments which allow to set custom names (see "Demos > Viejas (California, USA)" or function references).
After preparing the data set (regular or strictly regular zoo object with
numeric` values) we can investigate the observed wind information.
par(mfrow = c(1,2)) # Observed wind speed/wind direction "Ellboegen" # Expects a variable 'dd' and 'ff' by default. Thus, # 'ddvar'/'ffvar' do not have to be specified. windrose(data, main = "Windrose Ellbögen", breaks = seq(0, 35, by = 5)) # Observed wind speed/wind direction "Sattelberg" # Specify dd/ff of the crest station (ddvar, ffvar) windrose(data, ddvar = "crest_dd", ffvar = "crest_ff", main = "Windrose Sattelberg", hue = 270, breaks = seq(0, 35, by = 5))
Based on prior knowledge and the plots above we define two "foehn wind sectors" as follows:
dd
) needs to be
along valley within 43 and 223 degrees
(south-easterly; a r 223 - 43
degree sector).crest_dd
) needs to be within
90 and 270 degrees (south wind; r 270 - 90
degree sector).The wind sector(s) can be added to the windrose
plots for visual justification, but well also be used later when estimating the
foehnix
classification model.
par(mfrow = c(1,2)) # Ellboegen with custom wind sector windrose(data, windsector = list("south-east wind" = c(43, 223)), main = "Windrose Ellbögen", breaks = seq(0, 35, by = 5)) # Observed wind speed/wind direction "Sattelberg" # Specify custom variable names for dd/ff (crest_dd, crest_ff). windrose(data, ddvar = "crest_dd", ffvar = "crest_ff", windsector = list("south wind" = c(90, 270)), main = "Windrose Sattelberg", hue = 270, breaks = seq(0, 35, by = 5))
The most important step is to estimate the foehnix
classification model. We use the following model assumptions:
diff_t
is used as the main covariate to separate 'foehn'
from 'no foehn' events (potential temperature difference).rh
and ff
at valley site (relative humidity and
wind speed).dd = c(43, 223)
for Ellbögen and
crest_dd = c(90, 270)
for Sattelberg (see above).switch = TRUE
as high diff_temp
indicate stable stratification (no foehn).# Estimate the foehnix classification model filter <- list(dd = c(43, 223), crest_dd = c(90, 270)) mod <- foehnix(diff_t ~ rh + ff, data = data, switch = TRUE, filter = filter)
# Model summary summary(mod, detailed = TRUE)
The data set contains $N = r nrow(mod$data)
$ observations, $r nrow(data)
$ from
the data set itself (data
) and $r mod$inflated
$ due to inflation used to make the
time series object strictly regular.
Due to missing data $r length(mod$filter_obj$ugly)
$ observations are not considered
during model estimation (dd
, crest_dd
, diff_t
, rh
, or ff
missing),
$r length(mod$filter_obj$bad)
$ and are not included in model estimation as they do not
lie within the defined wind sectors (filter
).
Thus, the foehnix
model is based on a total number of
$r length(mod$filter_obj$good)
$ observations (or rows).
Once we have estimated the model we can check whether or not the two clusters
are well separated (foehn and no foehn).
This can be done by checking the "Cluster separation" summary provided by
summary
or checking the
posterior probability plot:
# Cluster separation (summary) summary(mod)$separation
The separation
matrix shows the prior probabilities,
the size (number of observations assigned to each component; posterior probability),
number of probabilities exceeding a threshold (default eps = 1e-4
), and the
ratio between the latter two. Ratios close to $1.0$ indicate that the two clusters
are well separated ($ratio > 0.5$ are already good for this application).
The "posterior probability histogram" (plot(..., which = "posterior")
) shows
the empirical histogram of estimated probabilities (for within-windsector
observations). Point masses around $0.0$ and $1.0$ indicate that we have two
well separated clusters (the probability to fall in one of the clusters is
always close to either $0$ or $1$).
plot(mod, which = "posterior", breaks = seq(0, 1, by = 0.05))
The following parameters are estimated for the two r mod$control$family$name
clusters:
r round(coef(mod)["mu1"], 2)
$,
$\sigma_1 = r round(coef(mod)["sigma1"], 2)
$ (parameter scale)r round(coef(mod)["mu2"], 2)
$,
$\sigma_2 = r round(coef(mod)["sigma2"], 2)
$ (parameter scale)r ifelse(sign(coef(mod)["rh"]), "positive", "negative")
rh
effect of r sprintf("%+.1f", 100 * exp(coef(mod)["rh"]) - 100)
percent per on
relative humidity and a
r ifelse(sign(coef(mod)["ff"]), "positive", "negative")
ff
effect of r sprintf("%+.1f", 100 * exp(coef(mod)["ff"]) - 100)
percent on wind speedcoef(mod)
In other words: if the relative humidity increases the probability that we observed foehn decreases, while the probability increases with increasing wind speed.
A foehnix
object comes with generic plots for graphical model
assessment.
The following figure shows the 'log-likelihood contribution' of
The abscissa shows (by default) the logarithm of the iterations during optimization.
# Log-likelihood contribution plot(mod, which = "loglikcontribution")
In addition, the coefficient paths during optimization can be visualized:
# Coefficient path plot(mod, which = 3L)
The left plot shows the parameters of the two components ($\mu_1$, $\log(\sigma_1)$, $\mu_2$, $\log(\sigma_2)$), the right one the standardized coefficients of the concomitant model.
Last but not least a histogram with the two clusters is plotted.
which = "hist"
creates an empirical density histogram separating "no foehn"
and "foehn" events adding the estimated distribution for these two clusters.
plot(mod, which = "hist")
The Californian demo data set has non-standard variable names (by purpose).
Thus, when calling tsplot
(time series plot) we do have to manually specify
these names.
# Some smaller quality issues in the data (should not be a big deal) start <- as.POSIXct("2017-02-01") end <- as.POSIXct("2017-02-12") # As we dont have the standard names: re-specify variable names. # In addition, use 'style = "advanced"' to show more details. tsplot(mod, style = "advanced", windsector = list(c(43, 223)), start = start, end = end)
devtools::load_all("..") windrose(mod, type = "hist", which = c("foehn", "nofoehn"), windsector = list(c(43, 223)), breaks = seq(0, 22, by = 2))
# Default image plot image(mod)
Customized plot which shows the "foehn frequency" for the interesting time period from August to April with custom colors and additional contour lines and custom aggregation period (two-weeks, 3-hourly).
# Customizing image plot devtools::load_all("..") image(mod, deltad = 10L, deltat = 2*3600, contours = TRUE, contour.col = "white", lwd = 2, labcex = 1.5, col = colorspace::sequential_hcl(51, "Purple-Yellow", rev = TRUE), xlim = c(212, 119), zlim = c(0, 0.5))
The model above (diff_t ~ rh + ff
) fits this station well, however,
we could of course change the model specifications.
Let us specify a second model solely based on the target station (valley site).
This can be important in practice if:
Thus, we estimate a second model mod_valley
using the following
model specification:
ell
; demodata("ellboegen")
)ff
(wind speed)rf
(relative humidity) dd
$\in [43, 223]$ (as above)# Load demo data set ell <- demodata("ellboegen") # Estimate alternative model mod_valley <- foehnix(ff ~ rh, data = ell, filter = list(dd = c(43, 223))) plot(mod_valley, which = "hist")
To compare the performance of the two models we can calculate
a set of information criterion.
The following matrix shows the sample size (N
), log-likelihood sum (loglik
),
Akaike information criterion (AIC
), Bayesian information criterion (BIC
),
and ignorance (IGN
).
get_IC <- function(x) c(N = length(x$filter_obj$good), logLik(x), AIC(x), BIC(x), IGN(x)) IC <- sapply(list(mod = mod, mod_valley = mod_valley), get_IC) round(IC, 1)
Warning: due to the model specification the two models are based on
different data sets (sample size strongly differs). Thus, we cannot directly compare
the models using loglik
, AIC
, or BIC
. IGN
accounts for the sample size
(mean negative log-likelihood), however, as the two models include different training
data sets this is inconclusive.
For a fair and reliable comparison one should estimate both models on the very same data set, e.g., as follows:
# Load the demo data set once more data_for_comparison <- demodata("tyrol") # Extract the required variables data_for_comparison <- subset(data_for_comparison, select = c(dd, ff, rh, diff_t, crest_dd)) # Remove all observations (rows) with missing values data_for_comparison <- na.omit(data_for_comparison) # Estimate alternative model "mod_comparison" # Note: using dd/crest_dd filter for both models (required to # end up with the same sample size) filter <- list(dd = c(43, 223), crest_dd = c(90, 270)) mod_comparison <- foehnix(formula(mod), data = data_for_comparison, filter = filter) mod_valley_comparison <- foehnix(formula(mod_valley), data = data_for_comparison, filter = filter)
Again, calculate the information criteria:
# Information criteria IC <- sapply(list(mod_comparison = mod_comparison, mod_valley_comparison = mod_valley_comparison), get_IC) round(IC, 1)
Both models are now based on the same sample ($N = r IC["N", 1L]
$). Thus,
we can compare the different scores. The more complex model (mod_comparison
)
outperforms the one solely based on observations from the valley site.
Thus, the model mod
should be preferred over the simpler one. However,
in practice it is often useful to have both (more than one) model to
fill in periods where e.g., the crest station does not provide data.
To fill gaps (missing values in the classification of model mod
) we
could combine the estimates of both models:
mod
. If not available,mod_valley
.# Combine results of both models: prob <- merge(mod = fitted(mod), mod_valley = fitted(mod_valley)) # Number of missing entries apply(prob, 2, function(x) sum(is.na(x))) # For demonstration purposes: find rows where mod is missing # while mod_valley returns a probability: idx <- which(is.na(prob$mod) & !is.na(prob$mod_valley))[1L] + seq(-3, 3) head(prob[idx, ]) # Combine the two (take probability from mod if available, # else from mod_valley; both missing: NA). combine_prob <- function(x) { if (all(is.na(x))) return(NA) return(head(na.omit(x), n = 1)) } prob <- apply(prob, 1, combine_prob) head(prob[idx]) # Check missing values sum(is.na(prob))
start <- as.POSIXct("2010-02-01") end <- as.POSIXct("2010-02-15") # Time series plot to compare "mod" (the most complex model) # and the four alternatives. tsplot(list(mod = mod, mod_valley = mod_valley), style = "advanced", windsector = list(dd = c(43, 223)), start = start, end = end)
# Formula and wind direction filter specification filter <- list(dd = c(43, 223), crest_dd = c(90, 270)) formula <- "ff ~ rh" # Gaussian models: Gaussian, left-censored Gaussian (at 0), # and left-truncated Gaussian (at 0). Note that, for demonstration # purposes, the censored/truncated models do 20 iterations. mod_gauss <- foehnix(formula, data = data, family = "gaussian", verbose = FALSE) mod_cgauss <- foehnix(formula, data = data, family = "gaussian", left = 0, tol = -Inf, maxit = 20, verbose = FALSE) mod_tgauss <- foehnix(formula, data = data, family = "gaussian", left = 0, tol = -Inf, maxit = 20, truncated = TRUE, verbose = FALSE) # Logistic models: logistic, left-censored logistic (at 0), # and left-truncated logistic (at 0). Again, fixed number of # iterations for the censored and truncated models. mod_logis <- foehnix(formula, data = data, family = "logistic", verbose = FALSE) mod_clogis <- foehnix(formula, data = data, family = "logistic", left = 0, tol = -Inf, maxit = 20, verbose = FALSE) mod_tlogis <- foehnix(formula, data = data, family = "logistic", left = 0, tol = -Inf, maxit = 20, truncated = TRUE, verbose = FALSE)
# Information criteria get_IC <- function(x) c(N = length(x$filter_obj$good), logLik(x), AIC(x), BIC(x), IGN(x)) IC <- sapply(list("gaussian" = mod_gauss, "gaussian censored" = mod_cgauss, "gaussian truncated" = mod_tgauss, "logistic" = mod_logis, "logistic censored" = mod_clogis, "logistic truncated" = mod_tlogis), get_IC) t(IC)
As the number of observations the classification is based on is the same for all models, we can directly compare the information criteria (i.e., AIC, BIC, log-likelihood). In all cases the "logistic" model seems to perform best among these models.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.