In mhweber/spsurvey: Spatial Survey Design and Analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This document presents change analysis of a GRTS survey design. The resource employed in the analysis is rivers and streams in the 48 contiguous United States. Data was obtained from two surveys conducted by the U.S. Environmental Protection Agency: (1) the Wadeable Streams Assessment (WSA) in 2004 [@EPA:2006] and (2) the National Rivers and Streams Survey (NRSA) in 2008 and 2009 [@EPA:2016a]. Change analysis measures the difference between response variables that were estimated in two surveys. Both continuous and categorical response variables can be employed for change analysis. For a categorical response variable, change is estimated by the difference in category estimates for the two surveys, where a category estimate is the estimated proportion of values in a category. Note that a separate change estimate is calculated for each category of a categorical response variable. For a continuous response variable, change can be estimated for the mean, the median, or for both the mean and median. For a continuous response variable using the mean, change is estimated by the difference in estimated mean values for the two surveys. For change estimates using the median, the first step is to calculate an estimate of the median for the first survey. The estimated median from the first survey is then used to define two categories: (1) values that are less than or equal to the estimated median and (2) values that are greater than the estimated median. Once the categories are defined, change analysis for the median is identical to change analysis for a categorical variable, i.e., change is estimated by the difference in category estimates for the two surveys.

Preliminaries

The initial step is to use the library function to load the spsurvey package. After the package is loaded, a message is printed to the R console indicating that the spsurvey package was loaded successfully.

Load the spsurvey package

library(spsurvey)
library(sf)

Load the survey design and analytical variables data set

The original data file contains more than 2,400 records for change estimation. To produce a more manageable number of records, rivers and streams located in the Western Mountains Level III Ecoregion [@Omernik:1987] were retained in the data that will be analyzed, which produced a data set containing 668 records.

The next step is to load the data set, which includes both survey design variables and analytical variables. The data function is used to load the data set and assign it to a data frame named NRSA_2009. The nrow function is used to determine the number of rows in the NRSA_2009 data frame, and the resulting value is assigned to an object named nr. Finally, the initial six lines and the final six lines in the NRSA_2009 data frame are printed using the head and tail functions, respectively.

Load the survey design and analytical variables data set

data(NRSA_2009)
nr <- nrow(NRSA_2009)

Display the initial six lines in the data file:

head(NRSA_2009)

Display the final six lines in the data file:

tail(NRSA_2009)

The location of rivers and streams that were sampled in the Western Mountains Ecoregion is displayed below. The sample sites are displayed using a unique color for each survey.

Location of rivers and streams that were sampled in the Western Mountains Ecoregion by the U.S. Environmental Protection Agency during the Wadeable Streams Assessment (WSA) conducted in 2004 and the National Rivers and Streams Assessment (NRSA) conducted in 2008-2009.

Change Analysis

Change analysis will be investigated by examining three continuous response variables and three categorical response variables. The continuous response variables are total phosphorus concentration, total nitrogen concentration, and benthic macroinvertebrate multimetric index (MMI) score. The categorical response variables are condition class variables for each of the continuous response variables. Condition classes are created by grouping values for a continuous response variable into categories that reflect the impact of a response value on the overall ecological condition of a site. Categories used for condition classes are: "Good", "Fair" and "Poor".

The change.analysis function will be used to calculate change estimates. Six data frames constitute the primary input to the change.analysis function. The site ID provides the unique identifier for each sample site and is used to connect records among the data frames. The siteID variable in the NRSA_2009 data frame is assigned to the siteID variable (or variables in one case) in the data frames. The six data frames that will be created are named as follows: sites, repeats, subpop, design, data.cat, and data.cont. In order to obtain change estimates for the continuous variables using both the mean and median, the test argument of the change.analysis function is assigned the following value: c("mean", "median"). Note that the default value for the test argument is "mean".

The sites data frame identifies sites to use in the analysis and contains three variables: (1) siteID - site ID values, (2) Survey1 - a logical vector identifying sites for survey one, and (3) Survey2 - a logical vector identifying sites for survey two. The Survey1 variable is created by assigning the value TRUE to every site for which the Survey variable in the NRSA_2009 data frame equals the value "WSA". Similarly, the Survey2 variable is created by assigning the value TRUE to every site for which the Survey variable in the NRSA_2009 data frame equals the value "NRSA".

The repeats data frame identifies repeat visit sites and contains two variables: (1) siteID_1 - the site ID value for survey one and (2) siteID_2 - the site ID value for survey two. The siteID_1 variable is created by selecting values of the siteID variable in the NRSA_2009 data frame for which the the Survey variable in the NRSA_2009 data frame equals the value "WSA" and the Revisit_Site variable in the NRSA_2009 data frame equals "Y". The siteID_2 variable is created using an analogous process. For each row of the repeats data frame, the two site ID values must correspond to the same site. Note that the NRSA_2009 data frame has been organized so that repeat visit sites for WSA occur in the same order as repeat visit sites for NRSA.

The subpop data frame defines populations and, optionally, subpopulations for which estimates are desired. Unlike the sites and design data frames, the subpop data frame can contain an arbitrary number of columns. The first variable in the subpop data frame identifies site ID values and each subsequent variable identifies a type of population, where the variable name is used to identify type. A type variable identifies each site with a character value. If the number of unique values for a type variable is greater than one, then the set of values represent subpopulations of that type. When a type variable consists of a single unique value, then the type does not contain subpopulations. For this analysis, the subpop data frame contains three variables: (1) siteID - site ID values, (2) Western_Mountains - which will be used to calculate estimates for all of the sample sites combined, and (3) Stream_Size - which will be used to calculate estimates for each of the two classes of stream size (large and small). The rep (repeat) function is used to assign values to the Western_Mountains variable in the subpop data frame, and the Stream_Size variable in the NRSA_2009 data frame is assigned to the Stream_Size variable in the subpop data frame. Recall that nr, which is included in the call to the rep function, is an object containing the number of rows in the NRSA_2009 data frame.

The design data frame consists of survey design variables. For the analysis under consideration, the design data frame contains the following variables: (1) siteID - site ID values; (2) wgt - survey design weights; (3) xcoord - x-coordinates for location; and (4) ycoord - y-coordinates for location. The wgt, xcoord, and ycoord variables in the design data frame are assigned values using variables with the same names in the NRSA_2009 data frame.

The final two data frames, data.cat and data.cont, provide values of categorical response variables and continuous response variables, respectively. Like the subpop data frame, the data.cat and data.cont data frames can contain an arbitrary number of columns. The first variable in the data.cat data frame identifies site ID values and each subsequent variable identifies a categorical response response variable. For this analysis, the categorical response variables are Nitrogen_Condition, Phosphorus_Condition, and Benthic_MMI_Condition, which are assigned, respectively, variables NTL_cond, PTL_cond, and Benthic_MMI_cond in the NRSA_2009 data frame. For benthic MMI score, there are four sites (three in WSA and one in NRSA) for which the MMI score could not be calculated. Those sites are coded as NA for the Benthic_MMI variable and as category "Not Assessed" for the Benthic_MMI_cond variable.

The data.cont data frame is organized analogous to the data.cat data frame. The first variable in the data frame identifies site ID values and each subsequent variable identifies a continuous response response variable. For this analysis, the continuous response variables are Log_Total_Nitrogen, Log_Total_Phosphorus, and Benthic_MMI, which are assigned, respectively, variables NTL, PTL, and Benthic_MMI in the NRSA_2009 data frame. Note that total nitrogen and total phosphorus are analyzed using the base ten log scale, which are created by use of the log10 function.

Create the sites data frame:

sites <- data.frame(siteID=NRSA_2009$siteID,
                    Survey1=NRSA_2009$Survey == "WSA",
                    Survey2=NRSA_2009$Survey == "NRSA")

Create the repeats data frame:

repeats <- data.frame(siteID_1=NRSA_2009$siteID[NRSA_2009$Survey == "WSA" & NRSA_2009$Revisit_Site == "Y"], siteID_2=NRSA_2009$siteID[NRSA_2009$Survey == "NRSA" & NRSA_2009$Revisit_Site == "Y"])

Create the subpop data frame:

subpop <- data.frame(siteID=NRSA_2009$siteID, Western_Mountains=rep("Western Mountains", nr), Stream_Size=NRSA_2009$Stream_Size)

Create the design data frame:

design <- data.frame(siteID=NRSA_2009$siteID,
                     wgt=NRSA_2009$wgt,
                     xcoord=NRSA_2009$xcoord,
                     ycoord=NRSA_2009$ycoord)

Create the data.cat data frame:

data.cat <- data.frame(siteID=NRSA_2009$siteID,
                       Nitrogen_Condition=NRSA_2009$NTL_Cond,
                       Phosphorus_Condition=NRSA_2009$PTL_Cond,
                       Benthic_MMI_Condition=NRSA_2009$Benthic_MMI_Cond)

Create the data.cont data frame:

data.cont <- data.frame(siteID=NRSA_2009$siteID,
                        Log_Total_Phosphorus=log10(NRSA_2009$PTL+1),
                        Log_Total_Nitrogen=log10(NRSA_2009$NTL+1),
                        Benthic_MMI=NRSA_2009$Benthic_MMI)

Calculate change estimates:

Change_Estimates <- change.analysis(sites, repeats, subpop, design,
                                    data.cat, data.cont, 
                                    test=c("mean", "median"))

Like other functions in the spsurvey package, the change.analysis function generates warning messages when certain situations are encountered in the data. When warning messages are generated, the functions print a message to the R console window stating the number of warning messages and explaining the procedure for recovering the messages. The call to the change.analysis function generated thirty-seven warning messages. These messages fall into two categories: (1) cases where the number of repeat visit sites was less than two and (2) cases where a category level was not present among the repeat visit sites in one of the surveys. For both cases, covariance among the revisited sites was not included in calculation of the standard error estimate. The warnprnt function is used to display two of the warning messages.

warnprnt(m = c(1, 3))

The change estimates are displayed using the print function. For categorical response variables and continuous response variables using the median, change estimates are printed for the complete set of sites only. For continuous response variables using the mean, all change estimates are printed. The object produced by change.analysis is a list composed of three data frames. The first data frame, named catsum, contains estimates for categorical response variables. The second data frame, named contsum_mean, contains estimates for continuous response variables using the mean. The third data frame, named contsum_median, contains estimates for continuous response variables using the median. The catsum and contsum_median data frames will be described first. The initial four columns in those data frames identify the population (Type), subpopulation (Subpopulation), response variable (Indicator), and category of the response variable (Category). The next four columns provide results for change estimates using the percent scale: the change estimate (DiffEst.P), standard error of the estimate (StdError.P), lower confidence bound (LCB95Pct.P), and upper confidence bound (UCB95Pct.P). Argument conf for change.analysis allows control of the confidence bound level. The default value for conf is 95, hence the column names for confidence bounds contain the value 95. Supplying a different value to the conf argument will be reflected in the confidence bound names. Confidence bounds are obtained using the standard error and the Normal distribution multiplier corresponding to the confidence level. The next four columns provide results for change estimates using the size (units scale): the change estimate (DiffEst.U), standard error of the estimate (StdError.U), lower confidence bound (LCB95Pct.U), and upper confidence bound (UCB95Pct.U). For this data, the units are kilometers of stream length. The next nine columns provide estimates for survey one: the first column is the number of response values for a category (NResp); the next four columns contain survey one estimates, standard errors, and confidence bounds in the percent scale; and the final four columns contain survey one estimates, standard errors, and confidence bounds in the units scale. The final nine columns of the catsum data frame provide estimates for survey two using the format described for survey one.

Description of the contsum_mean data frame follows. The initial four columns in that data frame identify the population (Type), subpopulation (Subpopulation), response variable (Indicator), and statistic employed for the change estmate (Statistic). The Statistic column contains the value "Mean" as a reminder that change estimates for continuous response variable use the mean. The next four columns provide results for the change estimates: the change estimate (DiffEst), standard error of the estimate (StdError), lower confidence bound (LCB95Pct), and upper confidence bound (UCB95Pct). The next five columns provide estimates for survey one: the first column is the number of response values for a category (NResp); the next four columns contain survey one estimates, standard errors, and confidence bounds. The final five columns of the contsum_mean data frame provide estimates for survey two using the format described for survey one.

# Print Western Mountains change estimates for categorical variables
print(subset(Change_Estimates$catsum, Type == "Western_Mountains"))

# Print change estimates for continuous variables using the mean
print(Change_Estimates$contsum_mean)

# Print change estimates for continuous variables using the median
print(subset(Change_Estimates$contsum_median, Type == "Western_Mountains"))

The write.csv function is used to store the change estimates as comma-separated value (csv) files. Files in csv format can be read by programs such as Microsoft Excel. The three data frames created by the change.analysis function are stored in separate files.

write.csv(Change_Estimates$catsum, file="Change_Estimates_Categorical.csv", row.names=FALSE)
write.csv(Change_Estimates$contsum_mean, file="Change_Estimates_Continuous_Mean.csv", row.names=FALSE)
write.csv(Change_Estimates$contsum_median, file="Change_Estimates_Continuous_Median.csv", row.names=FALSE)