```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 7)
```

# Introduction

This document walks through the basic functionality of the sdsmR package. sdsmR is an R implementation of the Statistical Downscaling Model (SDSM) tool, which is widely used by climate researchers. The hope is that sdsmR gives users the same functionality as SDSM, along with the advantages of the R language.

Statistical downscaling relates large-scale variables, such as those from General Circulation Model (GCM) output, to locally observed variables. It is commonly used for local impact studies, such as assessing the risk of flooding in a particular city under different climate change scenarios. At a high level, we build a statistical model of the relationship between the large-scale variables and the local variable, then use this model to predict how the local variable responds to various scenarios.
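
In the simplest case, and the one used throughout this document, that statistical model is a linear regression of the local variable on the large-scale predictors. As a rough sketch of the setup (standard regression notation, not a formula produced by the package itself):

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,$$

where $y_i$ is the local variable on day $i$ (for example, maximum temperature), $x_{i1}, \dots, x_{ip}$ are the large-scale predictors, and $\varepsilon_i$ is random error.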

We will walk through the same Blogsville example used in the SDSM manual, and show how to perform the same analysis using sdsmR.

# Exploratory Data Analysis

A critical part of building a statistical model is exploratory analysis. This includes graphical and numerical summaries, checking for abnormalities, and so on. In SDSM, this is usually done in the quality control and screen variables sections. Next, we'll walk through exploratory analysis using the sdsmR package.

## Accessing the data

Before starting, we must load the Blogsville data into R's workspace. Loading data into R is heavily dependent on the specific data-set, and there's no exact recipe that works for every dataset you'll encounter. For this reason, we provide an example of loading the Blogsville data manually in the appendix of this document. However, to focus on the functionality of sdsmR, we'll use the blogsville data-set which comes pre-installed with the package.

The blogsville data-set comes as a dataframe in R. A dataframe is one of the most common data structures used for statistical analysis in the R language. A dataframe looks like a matrix, with each row representing an observation and each column representing a variable. For example, every row in the blogsville dataset corresponds to a single day, and every column gives the value of a variable on that particular day.

The functionality of the sdsmR package is designed to work specifically with dataframes. If you're working through this tutorial with a different dataset, the best first step is to combine your data into a dataframe, as sketched below. This will allow you to work through the tutorial smoothly with your dataset of interest.
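
As a minimal sketch of what that looks like (the column names and values below are hypothetical, not part of the blogsville data), you might assemble daily observations into a dataframe like this:

```r
# Hypothetical example: combine a date vector and two daily variables
# into a single dataframe, one row per day.
dates <- seq(as.Date("1961-01-01"), by = "day", length.out = 365)
tmax <- rnorm(365, mean = 15, sd = 5)        # placeholder local variable
humidity <- runif(365, min = 0.3, max = 1)   # placeholder predictor
my_data <- data.frame(dates = dates, tmax = tmax, humidity = humidity)
head(my_data)
```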

```r
# Load the sdsmR package into memory, and view what the 
# blogsville dataframe looks like. 
library(sdsmR)
head(blogsville)
```

The above code loads the sdsmR package, which includes a dataframe named blogsville. For a detailed description of the data frame and the variables included, use the help command ?blogsville.

We've provided a quality_control function, similar to that found in SDSM. Currently, this function verifies that your dataset is a data frame and has a date column, and warns the user if any of the data is missing. We can use it on the Blogsville dataset as follows:

```r
quality_control(blogsville)
```

## Example plots 

After loading the dataset and checking it for errors, another key step in exploratory analysis is visualizing the data. For instance, we explore the relationships between predictor and predictand variables to see if there are any obvious correlations. For those familiar with the SDSM tool, the first stage of the analysis is usually the Screen Variables section. In SDSM, the three main plots for exploratory data analysis are:

1. The explained variance from each predictor on the predictand, by month.
2. Correlation between all predictor variables, by month. 
3. Scatterplots of the relationship between an individual predictor and predictand. 

Creating the first two plots is a bit tricky. For this reason, we have created a function called `generate_table`, which can be used to create these informative tables. This function takes as input a filepath, a dataframe, and the name of the predictand variable, and writes two pdf files containing the first two tables to the input filepath. For more detailed information and examples, you can access the help file with `?generate_table`.

For example, here is how I would save the explained variance and correlation tables for maximum temperature into my home directory:

```r
generate_table(plotname = "/home/lee/example", dataframe = blogsville, y = "tmax")
```

Next, let's consider how to create scatterplots of individual predictor/predictand relationships. If you look at the explained variance table created with the generate_table function, you'll notice that humidity (humxx) has the highest correlation with maximum temperature. The following code shows how to create a scatterplot of humidity against maximum temperature:

```r
# Produce a scatterplot of humidity vs. maximum temperature 
# from the blogsville dataframe using the plot function. Note that 
# the first two inputs are the columns of the dataframe, and 
# the rest are just convenient plotting options. 
plot(blogsville$humxx, blogsville$tmax, xlab = "Humidity", 
     ylab = "Maximum Temperature", main = "Humidity vs. Maximum Temperature", 
     cex = .1, cex.main = 1.8)

# Draw the line of the linear regression model overlaid on the 
# scatterplot. 
abline(lm(tmax ~ humxx, data = blogsville))
```

We can now reproduce all of the SDSM plots with the sdsmR package. However, it's important to point out that R has very powerful graphics capabilities, which means many more plots than the few we've seen above can be created. To give a flavor of this, we'll show how to create a different, informative plot with the Blogsville data: let's visualize how maximum temperature changes throughout the dataset by plotting it against time.

```r
# Pull out the year from the dates column in R, and 
# then add a variable to the dataframe which represents 
# the three different decades. 
dates <- as.numeric(format(blogsville$dates, "%Y"))
period <- rep(1, length(dates))
period[dates < 1970] <- 1
period[dates < 1980 & dates >= 1970] <- 2
period[dates < 1990 & dates >= 1980] <- 3
blogsville$period <- period

# First, set R's output window to contain 3 panels 
# for plotting. Next, plot the time series of maximum temp
# in each of the three different periods we specified above. 
# Make sure all of the scales are the same with the ylim argument. 
par(mfrow = c(3, 1))
plot(tmax ~ dates, data = blogsville[blogsville$period == 1, ], 
     cex = .2, col = "red", ylim = c(0, 30), 
     main = "Daily Maximum Temperature in Blogsville over
     three different decades")
plot(tmax ~ dates, data = blogsville[blogsville$period == 2, ], 
     cex = .2, col = "red", ylim = c(0, 30))
plot(tmax ~ dates, data = blogsville[blogsville$period == 3, ], 
     cex = .2, col = "red", ylim = c(0, 30))

# Remove the period variable from the blogsville dataframe. 
blogsville <- blogsville[, 1:8]
```

# Model Calibration

Next, we will see how to use sdsmR and the blogsville data-set to build a statistical model. This is the same step as the "Calibrate Model" section of the SDSM tool, and we'll use the same predictor variables used in the SDSM manual. We also follow the blogsville example in the SDSM manual by splitting our data into training and testing sets, using the former for model building and the latter for validation.

First, we show how to fit an annual model using the blogsville example, and that the values derived from it are identical to what you would obtain from the SDSM tool. At the bottom of the following code chunk, we print the R squared and standard error from the linear model fit in R, followed by a screenshot of the same model in SDSM, just to demonstrate that we're getting the same answers as SDSM.

```r
# Split the dataset in half into training and testing 
# sets. This is the same split used by Wilby in the SDSM
# manual, which divides the data into 1960-75 and 1976-90 
# time periods. 
train <- blogsville[1:5478, ]
test <- blogsville[5479:10957, ]

# Fit a linear model to the tmax predictand using all of the 
# predictor variables from blogsville. Note that we are using the 
# training data defined above in order to build this model. 
blogsville_annual_model <- lm(tmax ~ uxx + vxx + zxx + xx500 + humxx, data = train)

# Print the R squared and standard error from the blogsville
# annual model. Note that these numbers are identical to those 
# produced by SDSM!
summary(blogsville_annual_model)$r.squared
summary(blogsville_annual_model)$sigma
```

![SDSM Calibrate Model results](/home/lee/Dropbox/Work (Case Conflict)/ncar/outputs/images/sdsm_calibrate.PNG)

One issue with the annual model for blogsville is that maximum temperature depends heavily on the time of year, as seen in the image above. In SDSM, one usually deals with this by fitting separate models either by season or by month. In the next section, we'll show how to do this both manually and with the calibrate_model function.

## Monthly and Seasonal Models

We've seen how to fit annual models using R's lm (linear model) function; now let's see how to create more complicated, realistic models. In SDSM, an important option is specifying an annual, monthly, or seasonal model. If you choose a monthly or seasonal model, SDSM calibrates separate linear models for each time period. For example, if you select monthly, SDSM returns separate linear models for January, February, March, and so on. Carrying out this step in R requires a little bookkeeping, but is certainly feasible. We'll first show how to manually fit both monthly and seasonal models in R, then introduce the calibrate_model function, which automates this step for you.

```r
# Get a list of the months in our dataframe and add a month column 
# to the blogsville dataframe. Note that we store month as a factor 
# in R, which is the normal way that categorical variables are stored. 
months <- unique(format.Date(blogsville$dates, "%m"))
blogsville$month <- factor(format.Date(blogsville$dates, "%m"))

# Using this month variable, create another column which represents
# which season each data-point corresponds to. 
season <- rep(NA, nrow(blogsville))
season[blogsville$month %in% c("12", "01", "02")] <- "winter"
season[blogsville$month %in% c("03", "04", "05")] <- "spring"
season[blogsville$month %in% c("06", "07", "08")] <- "summer"
season[blogsville$month %in% c("09", "10", "11")] <- "fall"
blogsville$season <- factor(season)

# Next, set up a data-frame which will hold the summary statistics 
# for each monthly linear model. Then loop through each month, 
# subset the data frame so it only contains data for the 
# individual month, fit a linear model using this monthly dataset, 
# and record the r squared value along with the residual standard error. 
# Finally, note that we are storing all of these models in a list
# called month_mods, so you can access and explore the individual 
# models after calibration. 
month_mods <- list()
month_r2 <- data.frame(month = months, r.squared = NA, se = NA)
count <- 0
for (m in months) {
    count <- count + 1
    month_dataframe <- subset(blogsville, month == m)
    month_lm <- lm(tmax ~ uxx + vxx + zxx + xx500 + humxx, data = month_dataframe)
    month_mods[[m]] <- month_lm
    month_r2[count, "se"] <- round(summary(month_lm)$sigma, digits = 3)
    month_r2[count, "r.squared"] <- round(summary(month_lm)$r.squared, digits = 3)
}

# Perform the same calculations used for the monthly models above, 
# except using season instead. 
season_mods <- list()
season_r2 <- data.frame(season = unique(blogsville$season), r.squared = NA, se = NA) 
count <- 0
for (s in unique(blogsville$season)) {
    count <- count + 1 
    season_dataframe <- subset(blogsville, season == s)
    season_lm <- lm(tmax ~ uxx + vxx + zxx + xx500 + humxx, data = season_dataframe)
    season_mods[[s]] <- season_lm
    season_r2[count, "se"] <- round(summary(season_lm)$sigma, digits = 3)
    season_r2[count, "r.squared"] <- round(summary(season_lm)$r.squared, digits = 3)
}

# Print the r squared and residual standard error values. 
# Compare these with the output that SDSM produces to see
# that they are indeed the same. 
print(month_r2)
print(season_r2)

# Remove the month and season columns from the blogsville dataframe. 
blogsville <- blogsville[, 1:8]
```

What we've seen above is a way to compute a separate linear model for each month or each season. Note that we stored these models inside a list, one of R's fundamental data structures. To access one of these models, simply type season_mods$winter or month_mods[[1]], and to view a summary of one of these models (coefficients, r squared, and so on), type summary(season_mods$winter).
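
For example, to pull out the winter model from the list created above and look at its fit (a quick sketch using the season_mods object):

```r
# Access the winter model stored in the season_mods list and 
# inspect its coefficients, r squared, and residual standard error. 
winter_mod <- season_mods$winter
summary(winter_mod)$coefficients
summary(winter_mod)$r.squared
summary(winter_mod)$sigma
```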

An alternative to storing lists of linear models with the code above is the calibrate_model function, which comes with the sdsmR package. It works the same way, storing each model in a list. We can summarize the models with the summarize_models function, which also comes with the sdsmR package. This function prints the r squared, standard error, and coefficients of each model in the list.

```r
# Fit a blogsville model using the calibrate_model function. Note that 
# the !(names(train) %in% "pcrp") command is just making sure 
# that we are not using precipitation as a predictor for maximum 
# temperature. In situations where you're downscaling more than one variable, 
# it might make sense to create a separate data-frame for each. 
blogsville_mod <- calibrate_model(train[, !(names(train) %in% "pcrp")], 
                        y = "tmax", model_type = "monthly")
blogsville_summary <- summarize_models(blogsville_mod)
print(blogsville_summary$jan)
```

# Weather Generator

Now that we know how to calibrate our models, we'll show how to reproduce the Weather Generator section of the SDSM tool. Recall that we split the blogsville data-set into training and testing halves, the first from 1961-1975 and the second from 1976-1990. Generally, the weather generator is used to verify calibrated models, so we'll use our testing data set to see how well our model performs.

SDSM adds white noise, i.e. a mean-zero normal random variable, onto all of its predictions in order to create ensembles. The reasoning behind these ensembles is to more closely match the variance of the downscaled distribution to that of the observed distribution. We'll show how to generate these ensembles manually with our annual model, and then show how the generate_weather function automates this step.

```r
# Use the predict.lm function in R with our annual model, 
# along with the test data. This will output a numeric 
# vector of predictions for each observation in the test dataset. 
# Note that if we don't provide new data, the default is that 
# predict.lm will return the fitted values for this model. 
weather <- predict.lm(blogsville_annual_model, newdata = test[, -which(colnames(test) == "pcrp")])

# Add a normal random variable to the predictions based 
# on how many ensembles are desired. The parameters for the normal 
# random variable are mean 0, and standard deviation equal to 
# the standard error from the linear model. 
# In this case, we will create 20 different ensembles, 
# the default in SDSM. 
num_ensembles <- 20
num_predictions <- nrow(test)
sigma <- summary(blogsville_annual_model)$sigma
white_noise <- matrix(rnorm(num_predictions * num_ensembles, 
                    mean = 0, sd = sigma), ncol = num_ensembles)
ensembles <- white_noise + weather 
head(ensembles[, 1:6])
```

Similar to the way calibrate_model mimics the Calibrate Model section of SDSM, we've created the generate_weather function to mimic the functionality of the Weather Generator. With generate_weather, you use the output from calibrate_model and a new dataframe as inputs, and obtain a dataframe with predictions as the output. For more detailed information, access the ?generate_weather help file.

```r
blogsville_preds <- generate_weather(models = blogsville_mod, 
                        new_dataframe = test[, -which(colnames(test) == "pcrp")], 
                        uncertainty = "ensemble", num_ensembles = 20)
head(blogsville_preds[, 1:6])
```

## Prediction Intervals

Next, we'll show a different way of quantifying uncertainty in our downscaled predictions. Instead of adding white noise, we use prediction intervals, which come straight from the regression equation. The appeal of the interval approach is its coverage interpretation: approximately 95 percent of the prediction intervals will contain the true value. In the next code chunk, we show the R code which generates the predictions and the corresponding prediction intervals. Recall that the actual predictions are the same whether you're generating prediction intervals or using ensembles; what differs is the method for quantifying uncertainty.
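
For reference, under the usual linear regression assumptions (this is standard regression theory rather than anything specific to sdsmR), the prediction interval for a new observation with predictor vector $x_0$ has the form

$$\hat{y}_0 \pm t_{1-\alpha/2,\, n-p}\; \hat{\sigma}\sqrt{1 + x_0^{T}(X^{T}X)^{-1}x_0},$$

where $\hat{\sigma}$ is the residual standard error, $X$ is the design matrix from the calibration data, and $n - p$ is the residual degrees of freedom.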

```r
# Specify the interval = "prediction" option in predict.lm to 
# obtain prediction intervals for this new dataset. Then, 
# combine these intervals with the true observations. 
observed <- test[, "tmax"]
weather_intervals <- predict.lm(blogsville_annual_model, 
                        newdata = test[, -which(colnames(test) == "pcrp")], 
                        interval = "prediction")
prediction_results <- cbind(observed, weather_intervals)

# This demonstrates what percentage of observations our 
# intervals are trapping. The number is very close to 95%. Note 
# that one of the observations in the blogsville data-set is an NA, 
# which is why we need to use the na.rm = TRUE command in the sum. 
upper <- prediction_results[, "upr"]
lower <- prediction_results[, "lwr"]
trapped <- (observed < upper) & (observed > lower) 
sum(trapped, na.rm = TRUE) / length(trapped)
```

Finally, let's visualize how these two methods of quantifying uncertainty compare. To demonstrate, we will use predictions for the year 1975 in the Blogsville dataset.

```r
# Ensembles -------------------
par(mfrow = c(1, 2))
cols <- rainbow(20)
plot(test$dates[1:365], weather[1:365], type = "l", 
     ylim = c(0, 35), lwd = 3, 
     main = "Ensembles 1975", 
     xlab = "Date", ylab = "Maximum temperature")
for (i in 1:20) {
    lines(test$dates[1:365], ensembles[1:365, i], 
         col = cols[i], lwd = .2)
}
lines(test$dates[1:365], weather[1:365], type = "l")
points(test$dates[1:365], test[1:365, "tmax"], cex = 1, 
       col = "black", pch = 16)

# Prediction Intervals --------------
plot(test$dates[1:365], prediction_results[1:365, 2], 
     type = "l", lwd = 3, ylim = c(0, 35), col = "red" , 
     main = "Prediction Intervals 1975", 
     xlab = "Date", ylab = "Maximum temperature")
lines(test$dates[1:365], prediction_results[1:365, 3], 
      col = "purple", lwd = 2) 
lines(test$dates[1:365], prediction_results[1:365, 4], 
      col = "purple", lwd = 2)
points(test$dates[1:365], test[1:365, "tmax"], cex = 1, 
       col = "black", pch = 16)
par(mfrow = c(1, 1))
```

As we can see, both options do a fairly good job providing a range of weather to expect in future years. However, neither does a great job capturing the extremely hot days, which occur around the middle of July. This is a known issue with statistical downscaling: it tends not to capture extreme values well. We will see more of this theme in the conditional models section.
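
One informal way to check this numerically, using the objects created earlier in this section, is to compare the hottest observed day of 1975 with the most extreme values each method produces (a rough diagnostic, not part of the SDSM workflow):

```r
# Compare the maximum observed temperature in 1975 with the most 
# extreme values from the ensembles and the prediction intervals. 
max(test$tmax[1:365], na.rm = TRUE)                 # hottest observed day
max(ensembles[1:365, ], na.rm = TRUE)               # largest ensemble value
max(prediction_results[1:365, "upr"], na.rm = TRUE) # largest upper bound
```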

# Scenario Generation

Finally, we've arrived at the scenario generation section. Here we use GCM output to downscale local variables; in this case, we use predictors from the HadCM3 experiment, the same variables used in the Blogsville example of the SDSM manual. Note that generating scenarios from climate model output is mechanically the same as the generate_weather step in the previous section. However, instead of using a testing set of the NCEP re-analysis data to verify model output, we are using climate model output to guide the forecast. Since the calculations are the same, we can once again use the generate_weather function to generate different scenarios.

Recall that a key requirement is that your data is in a dataframe. It's also critical that the data you're using for scenario generation has the same predictor variables used for model calibration. Below we'll generate scenarios for both 1961-1990 and 2070-2099, and we will see that, as expected under climate change, maximum temperatures are on average higher in Blogsville during the 2070-2099 time period. Like the blogsville data, the climate model output comes with the sdsmR package, so we can load it into R's workspace directly.
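
Before doing so, a quick sanity check you might run (the predictor names here follow the blogsville dataframe) is to load one of the scenario dataframes and confirm it contains the predictors used for calibration:

```r
# Load a scenario dataframe and check it has the predictor columns 
# used to calibrate the blogsville models. 
data(weather_2070)
predictors <- c("uxx", "vxx", "zxx", "xx500", "humxx")
all(predictors %in% colnames(weather_2070))
```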

```r
# Load in the climate model output for the two time periods. Recall
# this comes pre-installed with the sdsmR package. 
data(weather_1960)
data(weather_2070)

# Generate the weather using the Blogsville monthly model. 
# Note that we are using the monthly model for this demonstration 
# since it is the same type used in the SDSM manual, and there is a clear 
# dependence between time of year and maximum temperature. 
blogsville_monthly_mod <- calibrate_model(train[, -which(colnames(train) == "pcrp")], 
                                y = "tmax", model_type = "monthly")
preds_1960 <- generate_weather(blogsville_monthly_mod, 
                               new_dataframe = weather_1960)
preds_2070 <- generate_weather(blogsville_monthly_mod, 
                               new_dataframe = weather_2070)

# Plot the two figures side by side. 
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
plot(preds_1960$dates[1:(365*3)], preds_1960$predictions[1:(365*3)], 
     type = "l", ylim = c(0, 35), xlab = "", ylab = "Maximum Temperature")
abline(h = mean(preds_1960$predictions), col = "red", lwd = 2)
plot(preds_2070$dates[1:(365*3)], preds_2070$predictions[1:(365*3)], 
     type = "l", ylim = c(0, 35), xlab = "", ylab = "")
abline(h = mean(preds_2070$predictions), col = "red", lwd = 2)
mtext("Temperature Max Scenario in Blogsville for Time Periods 
      1961-1963 and 2070-2072", side = 3, outer = TRUE)
```

In the above plot, we see the first three years of generated maximum temperatures for each of our two scenarios, along with the mean maximum temperature over all years in each scenario. Two things jump out. First, the average maximum temperature is clearly higher in the second scenario than the first. Second, this seems to be due to an overall shift, since both the minimum and maximum temperatures in the earlier scenario are lower.
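
To put a rough number on this shift, one might compare simple summaries of the two sets of predictions (a quick follow-up to the plot, not a formal analysis):

```r
# Difference in mean predicted maximum temperature between the 
# two scenario periods, along with the range of each. 
mean(preds_2070$predictions) - mean(preds_1960$predictions)
range(preds_1960$predictions)
range(preds_2070$predictions)
```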

# Conditional Models

Another important feature of the SDSM tool is fitting conditional models. For example, the SDSM manual fits a conditional model for precipitation. Since there are many days with no precipitation, SDSM fits a two-stage model. First, we create a new binary variable, $W_{i}$, which equals 1 if there is precipitation on day $i$, and 0 if there isn't. Next, we fit a linear regression model to $W_{i}$ using some set of predictor variables $X_{i}$, and call this the first stage model.

For stage two, we subset the data to only include days on which there is precipitation (i.e. when $W_{i} = 1$). Denote the amount of precipitation on day $i$ by $P_{i}$. Using the same predictor variables $X_{i}$ we used to fit the stage one model, we fit a linear model to $P_{i}$. Here is the code which performs these two steps:

```r
# Create a 0/1 variable which indicates whether or not there 
# was precipitation on this day. Make sure to do this on both 
# the training AND test dataset. 
train$wetday <- ifelse(train$pcrp > 0, 1, 0)
test$wetday <- ifelse(test$pcrp > 0, 1, 0)

# Fit a linear model to the wetday variable created above. 
prec_step1 <- lm(wetday ~ uxx + vxx + zxx + xx500 + humxx, 
                 data = train)

# Subset the training data to only include the wet days. Then, 
# use this dataset to fit a linear model for the amount of 
# precipitation. 
prec_train <- subset(train, wetday == 1)
prec_step2 <- lm(pcrp ~ uxx + vxx + zxx + xx500 + humxx, 
                 data = prec_train)
```

Now that we have our two linear models calibrated, we move to making predictions with them. First, we use the first linear model, prec_step1, to make a prediction for $W_{i}$ (whether or not it will rain on day $i$). Note that in SDSM, this value is truncated to lie between 0 and 1; we won't need that step here, because truncating the value doesn't affect the prediction. Next, a uniform random number between 0 and 1, call it $U_{i}$, is generated. If $U_{i}$ is greater than the predicted $W_{i}$, then we predict there will be no precipitation. If it's smaller, then we predict there will be precipitation on this day.

If we predict that there will be precipitation on day $i$ ($W_{i} = 1$), then we use our second model to predict how much precipitation will occur. Below is the R code which executes these steps to make predictions for the blogsville testing dataset.

```r
# Define a function which uses the two calibrated linear models 
# to predict the amount of precipitation on a given day. 
# This function will be used below to make predictions 
# for all of the days in the testing dataset. 
predict_row <- function(lm1, lm2, row) {
    # Use the first linear model to predict whether or not there
    # will be a wet day
    wet_day <- predict(lm1, row)
    rand <- runif(1)

    # Determine the precipitation based on the corresponding prediction.
    # If it's a wet day, predict the amount with the second model. 
    # If it's a dry day, return 0. 
    if (rand < wet_day) {
        precip_amount <- predict(lm2, row)
        return(precip_amount)
    } else {
        return(0)
    }
}

# Use this function to generate conditional predictions for the test set. 
cond_preds <- vector(mode = "numeric", length = 0)
for (i in 1:nrow(test)) {
    prediction <- predict_row(prec_step1, prec_step2, test[i, ])
    cond_preds <- c(cond_preds, prediction)
}
```

Similar to the other sections, we have incorporated a way to fit conditional models as part of the sdsmR package, via the process = "conditional" option of the calibrate_model function. When you calibrate a conditional model this way, generate_weather recognizes it, and the predictions follow the conditional prediction scheme described above. The following code shows how to fit a conditional model using these sdsmR functions, and then graphically displays what the predictions look like.

```r
# Calibrate the conditional model using the calibrate_model function. 
# Note that we are using train[, -7] as our data
# because we don't want to include tmax as a predictor variable.
blogsville_cond_mod <- calibrate_model(dataframe = train[, -7], y = "pcrp",
                           model_type = "monthly", process = "conditional")

# Using the calibrated conditional model, generate the weather for 
# the testing dataset. The predictions are returned in a dataframe, 
# just as before. 
blogsville_cond_preds <- generate_weather(models = blogsville_cond_mod, 
                                          new_dataframe = test)

# Plot the observed precipitation along with the conditional predictions. 
prec_observed <- test[, "pcrp"]
plot(blogsville_cond_preds$dates[1:365], prec_observed[1:365], cex = .2, 
     main = "Blogsville Observed vs. Predicted Precipitation 
     year 1975", ylab = "Precipitation Amount", xlab = "Time")
lines(blogsville_cond_preds$dates, blogsville_cond_preds$predictions, 
      lwd = .2)
```

# Summary

In this document, we've shown how to implement the main steps of statistical downscaling using the R language. We have replicated the functionality of the four main SDSM functions (and, to a lesser extent, quality_control):

  1. Screen Variables
  2. Calibrate Model
  3. Weather Generator
  4. Scenario Generator

To demonstrate the utility of the package for carrying out statistical downscaling, let's look at how concise the process can be made using sdsmR. In this example, we downscale maximum temperature, using a monthly model.

```r
data(blogsville)

# Screen Variables ---------------
generate_table(plotname = "/home/lee/blogsville_demo", dataframe = blogsville, y = "tmax")

# Calibrate Model -----------------
demo_mod <- calibrate_model(dataframe = blogsville[, -8], y = "tmax", model_type = "monthly")

# Weather Generator ---------------
demo_test <- generate_weather(models = demo_mod, new_dataframe = blogsville)

# Scenario Generator -------------
demo_scenario <- generate_weather(models = demo_mod, new_dataframe = weather_2070)
```

While building the tool, we've taken great care to include all of the options that are familiar to users of the SDSM tool, such as fitting monthly/seasonal models, including an autoregressive term, generating the same exploratory plots, and so on.

Throughout the document, we have shown how to obtain the SDSM outputs in two different ways. First, we've shown how to produce these results with raw R code, which can be adapted to specific use cases by modifying a few lines. Second, we've shown how to use the functions provided by the sdsmR package to obtain identical results. We have taken this approach because we believe there is value in both. If a user wants to get something done quickly, or doesn't want to type the same lines of code over and over, the pre-made functions are a good choice. However, if someone wants to understand the process of statistical downscaling and all of the steps involved, working through the raw code and adapting it to their own problem is a great way to learn.

To go a bit further in understanding what's happening under the hood with sdsmR, all of the code is available online at https://github.com/leerichardson/sdsmR

# Supplement: Loading Data

Before you can perform any of the calculations in this vignette, you must first have the data loaded into R's workspace as a dataframe. Generally speaking, this task is different for every data-set you encounter, so we decided not to include it in the main sections of this document. The idea behind a dataframe is that every row is an observation and every column is a variable. Once the data is in this form, executing statistical functions and algorithms becomes much simpler, and more reproducible for others.

To give a flavor of how to transform a dataset into a dataframe, we include the code and steps used to download the example blogsville dataset from the SDSM web page and convert it into a dataframe. While future situations most likely won't be identical to this one, hopefully some of the ideas will be useful when you need to convert your own data-set into a dataframe.

## Blogsville Data 

First, you can download the Blogsville data at the following link on the SDSM homepage http://co-public.lboro.ac.uk/cocwd/SDSM/blogsville.html. This data-set has four different directories:

* gcmx1961-90. GCM output from five different predictor variables from 1961-1990. Ultimately used to generate scenarios for the desired local variables in this time period.  
* gcmx2070-99. GCM output from five different predictor variables from 2070-2099. Used in the same way as the GCM output above. 
* ncep1961-90. NCEP re-analysis data from the five predictor variables from 1961-1990. This data is used to build the statistical model between our local variable and larger scale climate variables. 
* observed1961-90. These are the local variables we are hoping to understand. In this case, we have maximum temperature and precipitation. 

We need to bring the data into our R workspace in a useable form. In R, the most common way to do this is to combine all of our data into one data frame. We achieve this with the blogsville data using the following commands (note that there are many ways to do this):


```r
# Obtain a list of all the ncep predictor files
variable_names <- list.files("/home/lee/Dropbox/Work (Case Conflict)/ncar/data/blogsville/ncep1961-90/")
predictor_files <- paste0("/home/lee/Dropbox/Work (Case Conflict)/ncar/data/blogsville/ncep1961-90/", 
                          variable_names)

# Set up a data-frame to store the variables. 
# Do this by reading in the first file and figuring 
# out the number of rows, then allocating a dataframe of that length.     
temp <- read.table(predictor_files[1])
n <- dim(temp)[1]
predictor_matrix <- as.data.frame(matrix(NA, nrow=n, ncol=0))    

# Combine all of the predictor variables into a data frame 
# by looping through each file, reading it in, 
# and binding it (with cbind) to the predictor matrix. 
for (file in predictor_files) {
    single_pred <- read.table(file, stringsAsFactors=FALSE)        
    predictor_matrix <- cbind(predictor_matrix, single_pred)
}

# Rename the variables in the predictor matrix. 
# This will help us later on since it will make the names 
# in all of the different directories uniform
predictor_names <- c("uxx", "vxx", "zxx", "xx500", "humxx")
colnames(predictor_matrix) <- predictor_names

# Finally, read in our predictand variables, get a sequence of 
# dates, and combine these with our predictor matrix. Note that one 
# of the tmax observations comes in as a strange character, so we need 
# to deal with this as a special case. stringsAsFactors = FALSE is just 
# used out of habit, since it usually helps out. You can also set this as the
# default in your R options. 
tmax_predictand <- read.table("/home/lee/Dropbox/Work (Case Conflict)//ncar/data/blogsville/observed1961-90/TMAX.DAT", stringsAsFactors = FALSE)
colnames(tmax_predictand) <- "tmax"
pcrp_predictand <- read.table("/home/lee/Dropbox/Work (Case Conflict)/ncar/data/blogsville/observed1961-90/PRCP.DAT", stringsAsFactors = FALSE)
colnames(pcrp_predictand) <- "pcrp"

# WARNING: when downloading this data, in the observed1961-90/TMAX.DAT file,
# line number 10105 currently contains a non numeric symbol. To deal with this, 
# you can either open the file and replace the symbol with "NA", or use the line 
# tmax_predictand[10105,] <- NA after you read in this variable.

# Get a vector of the dates used in the blogsville data. Finally, 
# combine all of these together. 
dates <- seq(as.Date("1960-01-01"), by=1, len = 10957)
blogsville <- cbind(dates, predictor_matrix, tmax_predictand, pcrp_predictand)
head(blogsville)
```

