Two main template files are available; modify these, but keep their names:
report.Rmd
,
that you should expand with your answers and solutions for this project assignment,functions.R
file where you should place function definitions
needed by report.Rmd
, with associated documentation.When knitting your report to produce the report.html
file, ensure that both report.Rmd
and functions.R
are in the same folder and your working directory is set to this folder (run getwd()
to check). If you need to change your working directory then navigate to your project folder under the Files
panel, then click-on the settings cog icon in that panel and select Set As Working Directory
.
report.Rmd
and functions.R
.report.Rmd
.report.Rmd
that load the function definitions from functions.R
(your own function definitions).functions.R
without running it.styler
package Addin for restyling code for better and consistent readability works for both .R
and .Rmd
files.echo=TRUE
for analysis code chunks and echo=FALSE
for table display and plot-generating code chunks.When submitting, include a generated html version of the report with the file name report.html
. The project will be marked as a whole, between $0$ and $40$, using the marking guide posted on Learn.
\newpage
The Global Historical Climatology Network at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily provides historical weather data collected from all over the globe. A subset of the daily resolution data set is available in the StatCompLab package
containing data from eight weather stations in Scotland, covering the time period from 1 January 1960 to 31 December 2018. Some of the measurements are missing, either due to instrument problems or data collection issues.
See Tutorial07
for an exploratory introduction to the data, and techniques
for wrangling the data.
Load the data with
data(ghcnd_stations, package = "StatCompLab") data(ghcnd_values, package = "StatCompLab")
The ghcnd_stations
data frame has 5 variables:
ID
: The identifier code for each stationName
: The humanly readable station nameLatitude
: The latitude of the station location, in degreesLongitude
: The longitude of the station location, in degreesElevation
: The station elevation, in metres above sea levelThe station data set is small enough that you can view the whole thing,
e.g. with knitr::kable(ghcnd_stations)
.
You can try to find some of the locations on a map (Google maps and other online map
systems can usually interpret latitude and longitude searches).
The ghcnd_values
data frame has 7 variables:
ID
: The station identifier code for each observationYear
: The year the value was measuredMonth
: The month the value was measuredDay
: The day of the month the value was measuredDecYear
: "Decimal year", the measurement date converted to a fractional value,
where whole numbers correspond to 1 January, and fractional values correspond to
later dates within the year. This is useful for both plotting and modelling.Element
: One of "TMIN" (minimum temperature), "TMAX" (maximum temperature), or "PRCP" (precipitation), indicating what each value in the Value
variable represents Value
: Daily measured temperature (in degrees Celsius) or precipitation (in mm)Note: For this first half of the assignment,
you can essentially ignore that some of the data are missing; when defining averages,
just take the average of whatever observations are available in the relevant data subset.
If you construct averages with group_by()
and summarise()
& mean()
,
That will happen by itself.
The daily precipitation data is more challenging to model than temperature because of the mix of zeros and positive values. Plot temperature and precipitation data to show the behaviour, e.g., during one of the years, for all the stations.
Let winter be ${\text{Jan, Feb, Mar, Oct, Nov, Dec}}$, and let summer be ${\text{Apr, May, Jun, Jul, Aug, Sep}}$. For easier code structure, add this season information to the weather data object.
Construct a Monte Carlo permutation test for the hypotheses:
$$
\begin{aligned}
H_0:&\text{The rainfall distribution is the same in winter as in summer} \
H_1:&\text{The winter and summer distributions have different expected values}
\end{aligned}
$$
Use $T=|\text{winter average} - \text{summer average}|$ as a test statistic and add a Summer
column to the data
that is TRUE
for data in the defined summer months.
Compute separate p-values for each weather station and their respective Monte Carlo standard deviations.
Construct a function p_value_CI
to construct the interval.
Collect the results into a suitable data structure, present and discuss the results.
See Project2Hints for hints on constructing an approximate $95\%$ confidence interval for a p-value when most observed counts are zero.
For this second half of the project, you should first construct a version of the data set with a new variable, Value_sqrt_avg
, defined as the square root of the monthly averaged precipitation values.
As noted in the Project2Hints document (Handling non-Gaussian prediction), the precipitation values are very skewed with variance that increases with the mean value. Taking the square root of the monthly averages helps alleviate that issue, making a constant-variance model more plausible.
Here, you will define and estimate models for the square root of the monthly averaged precipitation values in Scotland.
As a basic model ($M_0$) for the square root of monthly average precipitation is defined by:
$M_0$: Value_sqrt_avg ~ Intercept + Longitude + Latitude + Elevation + DecYear
Covariates such as the spatial coordinates, elevation, and suitable $\cos$ and $\sin$ functions are used to capture seasonal variability. By adding covariates $\cos(2\pi k t)$ and $\sin(2\pi k t)$ of frequency $k=1,2,\dots$
we can also model the seasonal variability, defining models $M_1$, $M_2$, $M_3$,
and $M_4$ where the predictor expression for model $M_K$ adds
$$
\sum_{k=1}^K \left[ \gamma_{c,k} \cos(2\pi k t) + \gamma_{s,k} \sin(2\pi k t) \right]
$$
to the linear predictor of $M_0$, where the coefficients $\gamma_{c,k}$ and $\gamma_{s,k}$ are to be estimated.
The time variable $t$ is defined to be DecYear
so that the lowest frequency $k=1$ corresponds to a cosine function with a period of one full year.
Organise your code so that you can easily change the input data without having to change the estimation code and predict for new locations and times. You'll need to be able to run the estimation for different subsets of the data, as well as for manually constructed covariate values, e.g., to illustrate prediction at a new location not present in the available weather data.
Estimate the model parameters for $M_0, M_1, M_2, M_3$ and $M_4$.
Present and discuss the different models and the results of the model estimation.
We are interested in how well the model can predict precipitation at new locations. Therefore, construct a stratified cross-validation that groups the data by weather station and computes the prediction scores for each station, as well as the overall cross-validated average scores, aggregated to the 12 months of the year.
Present and discuss the results of the score assessments for both Squared Error and Dawid-Sebastiani scores.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.