```r
library(learnr)
library(fpp2)
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(rstanarm)
library(bayesplot)
library(car)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
tutorial_options(exercise.eval = TRUE)
```
If you are running this tutorial locally, make sure the following packages can be loaded:

```r
library(learnr)
library(fpp2)
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(rstanarm)
library(car)
library(tsfe)  # version 1.0.1
```
Seasonal effects, or calendar anomalies, in financial markets have been the subject of much research, especially in equity markets. They manifest as the tendency of financial returns to display systematic patterns at certain times of the day, week, or year. As we saw in week 2, one of the most important such anomalies is the "day of the week effect", which is well documented in the financial literature. The existence of such effects directly contradicts weak-form market efficiency.
These predictable patterns could, in principle, be exploited in a trading strategy to achieve abnormal profits. In reality, profitable trading strategies may prove illusory because of trading costs, and because these effects may be attributable to time-varying stock market risk premiums. From a modelling perspective, if seasonality in $y_t$ is ignored it usually leads to autocorrelation at the order of the seasonality; for example, fifth-order autocorrelation if $y_t$ is a series of daily returns.
One of the simplest ways to model seasonality is with dummy variables in a regression. The number of dummies required depends on the frequency of the data: a series with $m$ seasons needs $m-1$ dummies alongside an intercept, so four dummies for daily returns observed over a five-day trading week.
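As a minimal sketch of this encoding (the `dow` factor below is a toy example, not part of the tutorial data), `model.matrix()` reveals the dummy columns that `lm()` builds from a factor:

```r
# toy factor of trading days (hypothetical, for illustration only)
dow <- factor(c("Mon", "Tue", "Wed", "Thu", "Fri"))
# model.matrix() shows the intercept plus the m - 1 = 4 dummy columns
# lm() would construct; the alphabetically first level ("Fri") is the baseline
model.matrix(~ dow)
```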
Create a daily Russell 2000 price index from `tsfe::indices`. Add a `log_return` variable and a day-of-week variable using the function `weekdays()`. Use the following code to create these variables. Hint: `weekdays()` extracts the day of the week from a date variable, and the `abbreviate = TRUE` option returns a text abbreviation.
```r
tsfe::indices %>%
  select(`RUSSELL 2000 - PRICE INDEX`, date) %>%
  drop_na() %>%
  rename(Price = `RUSSELL 2000 - PRICE INDEX`) %>%
  mutate(log_return = log(Price) - log(lag(Price)),
         day_of_week = weekdays(date, abbreviate = TRUE)) -> r2000_d
```
Is there any daily seasonality present in the autocorrelations of the daily log returns of the Russell 2000 index? Hint: use `ggAcf()`.
ggAcf(r2000_d$log_return)
Create a boxplot of the returns for each day of the week and discuss your results.
```r
r2000_d %>%
  ggplot(aes(y = log_return, x = day_of_week)) +
  geom_boxplot()
```
Using the `lm()` command, find a regression model for the daily log returns of the Russell 2000 index with only the day of the week as a predictor. Which day of the week produces the largest return?
```r
model1 <- lm(log_return ~ day_of_week, data = r2000_d)
m <- tidy(model1, conf.int = TRUE)
m %>%
  kable(digits = 5) %>%
  kable_classic_2()
```
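To see which day produces the largest return, it can help to compare the raw day-of-week means alongside the regression output; a minimal sketch using the `r2000_d` data created earlier (recall that the intercept corresponds to the baseline day, alphabetically `Fri`, and each dummy is measured relative to it):

```r
# mean log return by day of week; the day with the largest mean
# corresponds to the intercept-plus-dummy combination in model1
r2000_d %>%
  group_by(day_of_week) %>%
  summarise(mean_return = mean(log_return, na.rm = TRUE)) %>%
  arrange(desc(mean_return))
```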
In `R` we can use a linear restriction test from the `car` package to test this difference.
Roughly speaking, the standard error of the difference between two parameter estimates is $SE_{A-B}=\sqrt{SE_A^2+SE_B^2}$, assuming the estimates are uncorrelated. While this assumption is technically wrong in most practical applications, it often matters little.
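As a minimal sketch of this point, using `model1` from above: the exact standard error of a difference includes the covariance term from `vcov()`, which the rough formula drops.

```r
V <- vcov(model1)
# exact SE of (Intercept) - day_of_weekMon includes their covariance
se_exact <- sqrt(V["(Intercept)", "(Intercept)"] +
                 V["day_of_weekMon", "day_of_weekMon"] -
                 2 * V["(Intercept)", "day_of_weekMon"])
# the rough formula above drops the covariance term
se_approx <- sqrt(V["(Intercept)", "(Intercept)"] +
                  V["day_of_weekMon", "day_of_weekMon"])
c(exact = se_exact, approx = se_approx)
```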
Importantly, all null hypothesis testing results are only valid if the underlying model from which the test statistics are derived is valid.
Test the hypothesis of no difference:
linearHypothesis(model1,"(Intercept)=day_of_weekMon",test="F")
How would you validate your inference from the previous model? HINT: Are the residuals adequate?
checkresiduals(model1)
Statistics is about uncertainty and variation. Given the crisis around the misuse of p-values in finance, it is more important than ever to report coherent estimates of uncertainty for regression output.
Visualise confidence intervals for the parameter estimates.
```r
ggplot(m, aes(estimate, term, xmin = conf.low, xmax = conf.high, height = 0)) +
  geom_point() +
  geom_vline(xintercept = 0, lty = 4) +
  geom_errorbarh() +
  ggtitle("Regression parameter estimates with 95% CIs")
```
> Bayesian inference involves three steps that go beyond classical estimation. First, the data and model are combined to form a posterior distribution, which we typically summarize by a set of simulations of the parameters in the model. Second, we can propagate uncertainty in this distribution – that is, we can get simulation-based predictions for unobserved or future outcomes that account for uncertainty in the model parameters. Third, we can include additional information into the model using a prior distribution.
>
> – Gelman, A., Hill, J., & Vehtari, A. (2020). Prediction and Bayesian inference. In *Regression and Other Stories* (Analytical Methods for Social Research, pp. 113-130). Cambridge: Cambridge University Press. doi:10.1017/9781139161879.010
Bayesian inference allows us to incorporate model uncertainty as well as estimation uncertainty in a linear regression.
```r
(model2 <- stan_glm(log_return ~ day_of_week, data = r2000_d))
```
You will notice the model is run via a Markov Chain Monte Carlo sampling algorithm, which produces probabilistic output. We can then visualise this output using the following.
```{css, echo=FALSE, eval=TRUE}
.scroll-50 {
  max-height: 50px;
  overflow-y: auto;
  background-color: inherit;
}
```
```r
posterior <- as.matrix(model2)
plot_title <- ggtitle("Posterior distributions",
                      "with medians and 95% intervals")
mcmc_areas(posterior,
           pars = c("(Intercept)", "day_of_weekMon", "day_of_weekThu",
                    "day_of_weekTue", "day_of_weekWed"),
           prob = 0.95) +
  plot_title
```
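For a numeric rather than graphical summary of the same output, rstanarm's `posterior_interval()` reports credible intervals directly; a one-line sketch, assuming `model2` from above:

```r
# 95% posterior credible intervals for each coefficient in model2
posterior_interval(model2, prob = 0.95)
```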
Daily electricity demand for Victoria, Australia, during 2014 is contained in `elecdaily`. The data for the first 20 days can be obtained as follows.
daily20 <- head(elecdaily,20)
Plot the data and find the regression model for Demand with temperature as an explanatory variable. What is the relationship between demand and temperature?
```r
daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
Fit a linear regression model to the previously visualised relationship and display the results. Why is there a positive relationship?
(fit <- tslm(Demand ~ Temperature, data = daily20))
Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(fit)
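The question about influential observations can also be checked directly: `tslm()` fits via `lm()`, so base R's influence diagnostics should apply to `fit`. A minimal sketch (treating the applicability of `cooks.distance()` to this fit as an assumption):

```r
# Cook's distance for each of the 20 observations; unusually large values
# flag points with outsized influence on the fitted line
plot(cooks.distance(fit), type = "h",
     xlab = "Observation", ylab = "Cook's distance")
```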
Ignoring the modelling inadequacies for now, use the model to forecast the electricity demand you would expect for the next day if the maximum temperature was $15^\circ$C, and compare it with the forecast if the maximum temperature was $35^\circ$C. Do you believe these forecasts?
(fc <- forecast(fit, newdata=data.frame(Temperature=c(15,35))))
Plot Demand vs Temperature for all of the available data in `elecdaily`. What does this say about your model?
```r
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
The data set `mens400` contains the winning times (in seconds) for the men's 400 metres final in each Olympic Games from 1896 to 2016.
Plot the winning time against the year. Describe the main features of the scatterplot.
`qplot()` is a quick way to draw a scatterplot without making a tibble.
```r
qplot(time(mens400), mens400) +
  labs(y = "Winning time", x = "Year",
       title = "Men's 400 metres Olympics")
```
Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
```r
timetrend <- time(mens400)
fit <- tslm(mens400 ~ timetrend)
qplot(time(mens400), mens400) +
  labs(y = "Winning time", x = "Year",
       title = "Men's 400 metres Olympics",
       subtitle = "with regression line") +
  geom_smooth(method = "lm")
```
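Since `timetrend` is measured in years, the average rate of decrease per year can be read straight from its coefficient; a short sketch using the `fit` object above:

```r
# slope on timetrend: average change in winning time (seconds) per year;
# a negative value means winning times have been falling
coef(fit)["timetrend"]
```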
Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
```r
qplot(time(mens400), residuals(fit)) +
  geom_hline(yintercept = 0, col = "grey")
```
Ignoring the modelling inadequacies for now, predict the winning time for the men's 400 metres final in the 2020 Olympics. Give a prediction interval for your forecast. What assumptions have you made in these calculations?
For users of forecast 8.4 or earlier, it is necessary to use `lm()` to do the forecasting, due to issues with missing values.
forecast(fit, newdata=data.frame(timetrend=2020))
Type `easter(ausbeer)` and interpret what you see.
easter(ausbeer)  # the proportion of the Easter holiday falling in each quarter
An elasticity coefficient is the ratio of the percentage change in the forecast variable ($y$) to the percentage change in the predictor variable ($x$). Mathematically, the elasticity is defined as $(dy/dx)\times(x/y)$. Consider the log-log model,
$$\log y=\beta_0+\beta_1 \log x + \varepsilon$$
Express $y$ as a function of $x$ and show that the coefficient $\beta_1$ is the elasticity coefficient.
.content-box-yellow[ Taking the conditional expectation of both sides of the equation: $$\mathrm{E}(\log(y)\mid x) = \mathrm{E}(\beta_0 + \beta_1 \log(x) + \varepsilon\mid x) = \beta_0 + \beta_1\log(x).$$ Exponentiating expresses $y$ as a function of $x$: $y = e^{\beta_0}x^{\beta_1}$ (up to the error term). Differentiating both sides of the log equation with respect to $x$ gives $\frac{y'}{y} = \frac{\beta_1}{x}$, and hence $\beta_1 = \frac{y'x}{y}$. Since $y' = \frac{dy}{dx}$, this means $\beta_1 = \frac{dy}{dx}\times\frac{x}{y}$, which is exactly the elasticity coefficient. ]
The data set `fancy` concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy) + xlab("Year") + ylab("Sales")
Explain why it is necessary to take logarithms of these data before fitting a model.
HINT: The seasonal fluctuations grow with the level of the series, which suggests taking logs to make the pattern (and the variance) more stable.
```r
# Taking logarithms of the data
autoplot(log(fancy)) +
  ylab("log Sales")
```
Use `R` to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies, and a "surfing festival" dummy variable.
```r
# Create the festival dummy: the festival is held every March, but only
# from 1988 onwards, so March 1987 (observation 3) is set back to FALSE
festival <- cycle(fancy) == 3
festival[3] <- FALSE
# Fit a linear model to the logged data (by specifying lambda = 0)
fit <- tslm(fancy ~ trend + season + festival, lambda = 0)
# Check the fitted values
autoplot(fancy) +
  ylab("Sales") +
  autolayer(fitted(fit), series = "Fitted")
```
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
```r
autoplot(residuals(fit))
qplot(fitted(fit), residuals(fit))
```
Do boxplots of the residuals for each month. Does this reveal any problems with the model?
```r
month <- factor(cycle(residuals(fit)), labels = month.abb)
ggplot() +
  geom_boxplot(aes(x = month, y = residuals(fit), group = month))
```
What do the values of the coefficients tell you about each variable?
coefficients(fit)
* The `(Intercept)` is not directly interpretable on its own.
* The `trend` coefficient shows that the logarithm of sales increases on average by about 0.02 each month.
* The `season2` coefficient shows that in February the logarithm of sales is on average 0.25 higher than in January.
* The `season12` coefficient shows that in December the logarithm of sales is on average 1.96 higher than in January.
* The `festivalTRUE` coefficient shows that in a month with the surfing festival the logarithm of sales is on average 0.5 higher than in a comparable month without it.
What does the Breusch-Godfrey test tell you about your model?
checkresiduals(fit,plot = FALSE)
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
```r
future.festival <- rep(FALSE, 36)
future.festival[c(3, 15, 27)] <- TRUE
fit %>%
  forecast(h = 36, newdata = data.frame(festival = future.festival)) %>%
  autoplot()
```
Transform your predictions and intervals to obtain predictions and intervals for the raw data.
This was done automatically because I used `lambda = 0` inside `tslm()`.
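If the model had instead been fitted to `log(fancy)` directly, without the `lambda` argument, the back-transformation would have to be done by hand with `exp()`. A minimal sketch under that assumption (the names `fit_log` and `fc_log` are illustrative; note that exponentiating log-scale values gives medians rather than means on the raw scale):

```r
# refit on the log scale without lambda, then back-transform manually
fit_log <- tslm(log(fancy) ~ trend + season + festival)
fc_log <- forecast(fit_log, h = 36,
                   newdata = data.frame(festival = future.festival))
exp(fc_log$mean)   # point forecasts on the raw sales scale
exp(fc_log$lower)  # lower prediction interval bounds
exp(fc_log$upper)  # upper prediction interval bounds
```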