Analyzing Data"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  warning = FALSE, 
  message = FALSE
)

If you have yet not read the "Start Here" vignette, please do so by running

vignette("start-here", "spsurvey")

Introduction

spsurvey's analysis functions are used to analyze data in a variety of contexts. We focus mainly on using the NLA_PNW analysis data to introduce some of these analysis functions. The NLA_PNW data contains response variables measured at several lakes in the Pacific Northwest Region (PNW) of the United States. There are several variables in NLA_PNW you will use throughout this vignette:

Before proceeding, we load spsurvey by running

library(spsurvey)

Categorical Variable Analysis

Categorical variables are analyzed in spsurvey using the cat_analysis() function. The cat_analysis() function estimates the proportion of observations and the total units (i.e. extent) that belong to each level of the categorical variable (total units refer to the total number (point resources), total line length (linear network), or total area (areal network)). Several useful pieces of information are returned by cat_analysis(), including estimates, standard errors, margins of error, and confidence intervals. The analysis results contain columns with a .P and .U suffixes. The .P suffix corresponds to estimates of proportions for each category, while the .U suffix corresponds to estimates of total units (i.e. extent) for each category.

Unstratified Analysis

To estimate the proportion of total lakes in each nitrogen condition category and the total number of lakes in each nitrogen condition category, run

cat_ests <- cat_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "NITR_COND",
  weight = "WEIGHT"
)
cat_ests

The estimate of the proportion of lakes in Good condition is 51.35% with a 95% confidence interval of (36.8%, 65.9%), while the estimate of the total number of lakes in Good condition is 5484 lakes with a 95% confidence interval of (3086, 7882). In each case, the estimated standard error and margin of error is given. The confidence level can be changed using the conf argument. If more than one categorical variable is of interest, then vars can be a vector of variables and separate analyses are performed for each variable.

Sometimes the goal is to estimate proportions and totals separately for different subsets of the population -- these subsets are called subpopulations. To estimate the proportion of total lakes and in each nitrogen condition category the total number of lakes in each nitrogen condition category separately for California, Oregon, and Washington lakes, run

cat_ests_sp <- cat_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "NITR_COND",
  weight = "WEIGHT",
  subpop = "STATE"
)
cat_ests_sp

If more than one type of subpopulation is of interest, then subpop can be a vector of subpopulation variables and separate analyses are performed for each subpopulation. If both vars and subpops are vectors, separate analyses are performed for each variable and subpopulation combination.

Analysis results for all sites (ignoring subpopulations) can be presented alongside the subpopulation analysis results using the All_Sites argument:

cat_ests_sp <- cat_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "NITR_COND",
  weight = "WEIGHT",
  subpop = "STATE",
  All_Sites = TRUE
)
cat_ests_sp

Stratified Analysis

To estimate the proportion of total lakes in each nitrogen condition category and the total number of lakes in each nitrogen condition category stratified by URBAN category (whether the lake is classified as Urban or Non-Urban), run

strat_cat_ests <- cat_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "NITR_COND",
  weight = "WEIGHT",
  stratumID = "URBAN"
)
strat_cat_ests

To then compute these estimates separately for California, Oregon, and Washington, run

strat_cat_ests_sp <- cat_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "NITR_COND",
  weight = "WEIGHT",
  stratumID = "URBAN",
  subpop = "STATE"
)
strat_cat_ests_sp

Continuous Variable Analysis

Continuous variables are analyzed in spsurvey using the cont_analysis() function. The cont_analysis() function estimates cumulative distribution functions (CDFs), percentiles, and means of continuous variables. By default, all these quantities are estimated (though this can be changed using the statsitics argument to cont_analysis()). For the quantities requiring estimation, several useful pieces of information are returned by cont_analysis(), including estimates, standard errors, margins of error, and confidence intervals. The .P suffix corresponds to estimates of proportions for each variable, while the .U suffix corresponds to estimates of total units (i.e. extent) for each variable.

Unstratified Analysis

To estimate the cumulative distribution function (CDF), certain percentiles, and means of BMMI, run

cont_ests <- cont_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "BMMI",
  weight = "WEIGHT"
)

To view the analysis results for the mean estimates, run

cont_ests$Mean

Similarly, the CDF and select percentile estimates can be viewed (the output is omitted here) by running

cont_ests$CDF
cont_ests$Pct

To visualize the CDF estimates, run

plot(cont_ests$CDF)

The solid line indicates the CDF estimates, and the dashed lines indicate lower and upper 95% confidence interval bounds for the CDF estimates. cdf_plot() can equivalently be used in place of plot() (cdf_plot() is currently maintained for backwards compatibility with previous spsurvey versions).

To estimate the CDF, certain percentiles, and means of BMMI separately for each state, run

cont_ests_sp <- cont_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "BMMI",
  weight = "WEIGHT",
  subpop = "STATE"
)

To view the analysis results for the mean estimates, run

cont_ests_sp$Mean

Stratified Analysis

To estimate the CDF, certain percentiles, and means of BMMI for a design stratified by URBAN category, run

strat_cont_ests <- cont_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "BMMI",
  weight = "WEIGHT",
  stratumID = "URBAN"
)

To view the analysis results for the mean estimates, run

strat_cont_ests$Mean

To then compute these estimates separately for each state, run

strat_cont_ests_sp <- cont_analysis(
  NLA_PNW,
  siteID = "SITE_ID",
  vars = "BMMI",
  weight = "WEIGHT",
  stratumID = "URBAN",
  subpop = "STATE",
)

To view the analysis results for the mean estimates, run

strat_cont_ests_sp$Mean

Additional Analysis Approaches

Alongside the cat_analysis() and cont_analysis() functions, spsurvey offers functions for estimating attributable risk, relative risk, risk difference, change, and trend. Attributable risk analysis, relative risk analysis, and risk difference analysis quantify the attributable risk, relative risk, and risk difference, respectively, of environmental resources being in poor condition after exposure to a stressor. Attributable risk analysis is performed using the attrisk_analysis() function, relative risk analysis is performed using the relrisk_analysis() function, and risk difference analysis is performed using the diffrisk_analysis() function. Change and trend analysis capture the behavior of environmental resources between two samples, while trend analysis generalizes this approach to include more than two samples. Often, change and trend analyses are performed to study an environmental resource through time. Change analysis is performed using the change_analysis() function, and trend analysis is performed using the trend_analysis() function. The attrisk_analysis(), relrisk_analysis(), diffrisk_analysis(), change_analysis() and trend_analysis() functions all share very similar syntax with the cat_analysis() and cont_analysis() functions.

Attributable Risk, Relative Risk, and Risk Difference Analysis

The attributable risk is defined as $$1 - \frac{P(Response = Poor | Stressor = Good)}{P(Response = Poor)},$$ where $P(\cdot)$ is a probability and $P(\cdot | \cdot)$ is a conditional probability. The attributable risk measures the proportion of the response variable in poor condition that could be eliminated if the stressor was always in good condition. To estimate the attributable risk of benthic macroinvertebrates with a phosphorous condition stressor, run

attrisk_ests <- attrisk_analysis(
  NLA_PNW, 
  siteID = "SITE_ID",
  vars_response = "BMMI_COND",
  vars_stressor = "PHOS_COND",
  weight = "WEIGHT"
)
attrisk_ests

The relative risk is defined as $$\frac{P(Response = Poor | Stressor = Poor)}{P(Response = Poor | Stressor = Good)},$$ which measures the risk of the response variable being in poor condition relative to the stressor's condition. To estimate the relative risk of benthic macroinvertebrates being in poor condition with a phosphorous condition category stressor, run

relrisk_ests <- relrisk_analysis(
  NLA_PNW, 
  siteID = "SITE_ID",
  vars_response = "BMMI_COND",
  vars_stressor = "PHOS_COND",
  weight = "WEIGHT"
)
relrisk_ests

The risk difference is defined as $$P(Response = Poor | Stressor = Poor) - P(Response = Poor | Stressor = Good),$$ which measures the risk of the response variable being in poor condition differenced by the stressor's condition. To estimate the risk difference of benthic macroinvertebrates being in poor condition with a phosphorous condition category stressor, run

diffrisk_ests <- diffrisk_analysis(
  NLA_PNW, 
  siteID = "SITE_ID",
  vars_response = "BMMI_COND",
  vars_stressor = "PHOS_COND",
  weight = "WEIGHT"
)
diffrisk_ests

By default, the levels of the variables in vars_response and vars_stressor are assumed to equal "Poor" (event occurs) or "Good" (event does not occur). If those default levels do not match the levels of the variables in vars_response and vars_stressor, the levels of vars_response and vars_stressor must be explicitly stated using the response_levels and stressor_levels arguments, respectively. Similar to cat_analysis() and cont_analysis() from the previous sections, subpopulations and stratification are incorporated using subpops and stratumID, respectively. For more on attributable and relative risk in an environmental resource context, see Van Sickle and Paulsen (2008).

Change and Trend Analysis

To demonstrate change analysis, we use the NRSA_EPA7 data. There are several variables in NRSA_EPA7 you will use next:

To estimate the change between samples (time points) for BMMI (a continuous variable) and NITR_COND (a categorical variable), run

change_ests <- change_analysis(
  NRSA_EPA7,
  siteID = "SITE_ID",
  vars_cont = "BMMI",
  vars_cat = "NITR_COND",
  surveyID = "YEAR",
  weight = "WEIGHT"
)

The surveyID argument is the variable in the data distinguishing the different samples (YEAR in the previous example).

To view the analysis results for NITR_COND (the categorical variable), run

change_ests$catsum

Estimates are provided for the difference between the two samples and for each of the two individual samples (the _1 and _2 suffixes).

To view the analysis results for BMMI (the continuous variable), run

change_ests$contsum_mean

Though we don't show an example here, the median can be estimated using the test argument in change_anlaysis().

Trend analysis generalizes change analysis to more than two samples (usually time points). Though we omit an example here, the arguments to trend_analysis() are very similar to the arguments for change_analysis(). One difference is that trend_analysis() contains arguments that specify which statistical model to apply to the estimates from each sample.

Variance Estimation

The default variance estimator in spsurvey is the local neighborhood variance estimator (Stevens and Olsen, 2003). The local neighborhood variance estimator incorporates the spatial locations of design sites into the variance estimation process. Because the local neighborhood variance estimator incorporates spatial locations, the resulting variance estimate tends to be smaller than variance estimates from approaches ignoring spatial locations. The local neighborhood variance estimator requires x-coordinates and y-coordinates of the design sites. When the analysis data is an sf object, the coordinates from the data's geometry are used. When the analysis data is just a data frame, x-coordinates and y-coordinates (from an appropriate coordinate reference system) must be provided using the xcoord and ycoord arguments, respectively, of the analysis function being used.

Several additional variance estimation options are available in spsurvey's analysis functions through the vartype and jointprob arguments.

References

Sickle, J. V., & Paulsen, S. G. (2008). Assessing the attributable risks, relative risks, and regional extents of aquatic stressors. Journal of the North American Benthological Society, 27(4), 920-931.

Stevens Jr, D. L., & Olsen, A. R. (2003). Variance estimation for spatially balanced samples of environmental resources. Environmetrics, 14(6), 593-610.



Try the spsurvey package in your browser

Any scripts or data that you put into this service are public.

spsurvey documentation built on May 31, 2023, 6:25 p.m.