if(!"knitr" %in% installed.packages()){ install.packages("knitr", repos='http://cran.us.r-project.org') } if(!"xts" %in% installed.packages()){ install.packages("xts", repos='http://cran.us.r-project.org') } if(!"devtools" %in% installed.packages()){ install.packages("devtools", repos='http://cran.us.r-project.org') } if(!"TTFF" %in% installed.packages()){ devtools::install_github("troyhill/TTFF") } library(TTFF)
This document describes a possible alternative to the COP rainfall formula used to predict flows. A problem with the current approach is the substantial number of correlations among the 54 independent variables included. Principal component analysis is offered as a way to remove multicollinearity and reduce the dimensionality of the dataset while retaining all of the original variables and the information they provide.
Principal component analysis (PCA) is one approach to reducing many variables to a few synthetic variables that are linear combinations of the original inputs (scaled to mean = 0). Importantly for our case, redundancy in the input parameters is of no consequence for the synthetic variables produced.
To prepare the 41-year COP dataset for PCA, daily input data provided by the South Florida Water Management District were aggregated to weekly data. Mean values were used for stage, while sums were used for PET, rainfall, and lagged flows.
### generate data for period of record by compiling from tab-delimited .txt files rainDat <- system.file("extdata", "data_Daily_Rain.txt", package = "TTFF") PETDat <- system.file("extdata", "data_daily_PET.txt", package = "TTFF") stageDat <- system.file("extdata", "data_ALT_output_stage.txt", package = "TTFF") flowDat <- system.file("extdata", "data_ALTO_output_Flows.txt", package = "TTFF") por <- mergeData(rainfall = rainDat, PET = PETDat, stage = stageDat, flow = flowDat) ### the merged data is supplied and documented with the COPmod package ?COPmod::por
Applying PCA to the 54-variable COP input dataset shows that 81% of the variation in the dataset can be captured by four synthetic variables (Fig. 2; left panel). These dimensions generally reflect common variables. For example, dimensions 1 and 4 were dominated by PET, dimension 2 is dominated by rainfall, and dimension 3 is dominated by lagged flow data.
# Principal component analysis -------------------------------------------- pca.dat <- por colsToUse <- which(!names(pca.dat) %in% c("date", "Vero", "sumFlow")) # columns to use for PCA (exclude date, Vero, sumFlow) pca1 <- FactoMineR::PCA(pca.dat[, colsToUse], graph = FALSE, ncp = 4, scale.unit = TRUE) head(pca1$eig[, c(2,3)]) # ~81% of the variance in the data is captured in the first four principal components
plot(pca1, choix="var", title = "")
The PCA output can then be applied to future data to generate principal components
# Apply PCA to sample data -------------------------------------------- # generate a dataset on which to apply the PCA newDat <- data.frame(por[sample(1:nrow(por), size = nrow(por)*0.3), ]) pca.out <- data.frame(cbind(pca1$ind$coord, sumFlow = data.frame(por$sumFlow))) pca.lm <- lm(sumFlow ~ Dim.1 + Dim.2 + Dim.3 + Dim.4, data = pca.out) pcaPred <- FactoMineR::predict.PCA(pca1, newdata = newDat) testDat <- data.frame(pcaPred$coord, date = row.names(pcaPred$coord)) testDat$predFlow <- stats::coef(pca.lm)[1] + stats::coef(pca.lm)[2] * testDat$Dim.1 + stats::coef(pca.lm)[3] * testDat$Dim.2 + stats::coef(pca.lm)[4] * testDat$Dim.3 + stats::coef(pca.lm)[5] * testDat$Dim.4 ### this work is streamlined in the estimateFlow() function flowPredictions <- estimateFlow(newData = newDat, PCA_output = pca1, periodOfRecord = por)
To examine the performance of the PCA approach, the input data were randomly classified as training (80%) and test (20%) data. PCA was performed on the training dataset and applied to the test dataset, then used to predict flows in the test data. This process (randomized assignment, PCA, application to test data, discharge prediction) was repeated 100 times. In this bootstrapped approach, observed flows were estimated well by PCA-predicted flows, with an average pearson correlation coefficient of 0.92 (r2 = 0.84).
out2 <- bootstrapPCA(iter = 100, plot = FALSE, data = por, columnsUsed = colsToUse) summary(out2)
bootstrapPCA(iter = 1, plot = TRUE, data = por, columnsUsed = colsToUse)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.