selectData: Select data for analysis from a larger data frame

View source: R/selectData.R

selectData    R Documentation

Select data for analysis from a larger data frame

Description

Select data for analysis from a larger data frame based on dependent variable, station, and layer. Depending on the settings, records with missing values are removed, log transformations are applied, and a centering date is added.

Usage

selectData(
  df,
  dep,
  stat,
  layer = NA,
  transform = TRUE,
  remMiss = TRUE,
  analySpec
)

Arguments

df

data frame

dep

dependent variable

stat

station

layer

layer (optional)

transform

logical field to return log-transformed value (TRUE [default])

remMiss

logical field to remove records where dependent variable, dep, is a missing value (TRUE [default])

analySpec

analytical specifications

Details

The returned data frame will include dyear and cyear. dyear is the decimal year computed using smwrBase::baseDay2decimal and smwrBase::baseDay. The minimum and maximum dyear are averaged to give centerYear, which is then used to compute the centering date: cyear = dyear - centerYear.
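The centering step can be sketched in base R as follows. This is a minimal illustration, not the package's code: the dyear values are made-up numbers standing in for the output of baseDay2decimal.

```r
# Hypothetical decimal-year values (stand-ins for baseDay2decimal output)
dyear <- c(1999.25, 2004.50, 2010.75)

# centerYear is the average of the minimum and maximum dyear
centerYear <- mean(range(dyear))

# cyear is the decimal year centered on centerYear
cyear <- dyear - centerYear
```

Centering this way places cyear = 0 at the midpoint of the period of record.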

The variable identified by dep is copied to the variable name dep+".orig" (e.g., chla.orig), allowing the user to track the original concentrations. A new column, recensor, is added. The value of recensor is FALSE unless the value of dep+".orig" is <= 0. In those cases, recensor is set to TRUE and the value of dep is set to "less than" a small positive value, which is stored as iSpec$recensor. If transform=TRUE, the returned data frame will also include a variable "ln"+dep (e.g., "lnchla" for log-transformed chla).
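The bookkeeping described above can be sketched with plain numeric columns. This is a simplified illustration under stated assumptions: the package records a recensored value as "less than" the small value, whereas this sketch simply substitutes the small value as a number, and recensorValue is a made-up stand-in for iSpec$recensor.

```r
# Hypothetical data with two non-positive chla values
df <- data.frame(chla = c(2.5, 0, -0.1, 7.2))
recensorValue <- 0.0001  # assumed stand-in for iSpec$recensor

# Keep a copy of the original values
df$chla.orig <- df$chla

# Flag records whose original value is <= 0
df$recensor <- df$chla.orig <= 0

# Simplification: replace flagged values with the small positive value
df$chla[df$recensor] <- recensorValue

# With transform = TRUE, a natural-log column "ln" + dep is added
df$lnchla <- log(df$chla)
```

Recensoring before the transformation keeps the natural log defined for every record.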

The data frame will include a column, intervention, which is a factor identifying different periods of record such as when different laboratory methods were used and is based on the data frame methodsList that is loaded into the global environment. This column is set to "A" with only 1 level if the data frame methodsList has not been loaded into the global environment.
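As a rough illustration of how such a factor could be derived, the sketch below labels each sample date with the period it falls in. The column names beginDate and endDate and all dates are hypothetical; the actual layout of methodsList and intervenList may differ.

```r
# Hypothetical table of intervention periods, labeled from "A"
intervenList <- data.frame(
  intervention = c("A", "B"),
  beginDate    = as.Date(c("1985-01-01", "1998-05-01")),
  endDate      = as.Date(c("1998-04-30", "2020-12-31"))
)

# Hypothetical sample dates
dates <- as.Date(c("1990-06-15", "1998-05-01", "2005-03-10"))

# Index of the period whose begin date is the latest one on or before each date
idx <- findInterval(as.numeric(dates), as.numeric(intervenList$beginDate))

# Factor with one level per period
intervention <- factor(intervenList$intervention[idx],
                       levels = intervenList$intervention)
```

With no period table available, the equivalent fallback would be a factor with the single level "A".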

The data frame will include a column, lowCensor, to indicate whether the data record occurs in a year with a low level of censoring over that particular year. The function gamTest uses this column to identify years of record (i.e., when lowCensor==FALSE) that should not be used in analyses.
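A yearly censoring flag of this kind can be sketched as below. The threshold maxCensorFrac and the censored column are assumptions made for illustration; the package's actual rule and column names are not given in this page.

```r
# Hypothetical records: year and whether the observation was censored
obs <- data.frame(
  year     = rep(c(2000, 2001), each = 4),
  censored = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Fraction of censored observations in each year
fracCensored <- tapply(obs$censored, obs$year, mean)

# Assumed threshold: years above it count as highly censored
maxCensorFrac <- 0.5
lowCensorByYear <- fracCensored <= maxCensorFrac

# Map the yearly flag back onto every record
obs$lowCensor <- as.vector(lowCensorByYear[as.character(obs$year)])

# Records a later analysis step would keep
kept <- obs[obs$lowCensor, ]
```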

If remMiss=TRUE, the returned data frame is down-selected by removing records where the variable identified by dep is missing; otherwise, no down-selection is performed.
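The effect of remMiss can be sketched in base R with a plain numeric column. The data values are made up, and the package's own handling of censored variables may differ.

```r
# Hypothetical data with one missing Secchi depth
df <- data.frame(station = "CB5.4", secchi = c(1.2, NA, 0.8))
dep <- "secchi"
remMiss <- TRUE

# Drop rows where the dependent variable is missing, but only if requested
selected <- if (remMiss) df[!is.na(df[[dep]]), , drop = FALSE] else df
```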

iSpec is a list containing the following elements:

dep - name of column where dependent variable is stored, could be "ln"+dep for variables that will be analyzed after natural log transformation

depOrig - name of original dependent variable, could be same as dep if no transformation is used

stat - name of station

stationMethodGroup - name of station group that the station belongs to, derived from station list (stationMasterList) and used to identify interventions specified in methodsList table

intervenNum - number of interventions found for this station and dependent variable as derived from methodsList table, a value of 1 is assigned if no methodsList entry is found

intervenList - data frame of interventions identified by beginning and ending date and labeled consecutively starting with "A"

layer - layer

layerName - layer name derived from layerLukup

transform - TRUE/FALSE indicating whether log transformations were taken

trendIncrease - an indicator for interpretation of an increasing concentration

logConst - not currently used

recensor - small value that observations <=0 are recensored to as "less than" the small value

censorFrac - data frame indicating the yearly number of observations and fraction of observations reported as less than, uncensored, interval censored, less than zero, and recensored; also includes a 'lowCensor' field indicating which years will be dropped by gamTest due to high yearly censoring

yearRangeDropped - year range of data that will be dropped due to censoring

censorFracSum - censoring overall summary

centerYear - centering year

parmName - parameter name

parmNamelc - parameter name in lower case

parmUnits - parameter units

statLayer - station/layer label, e.g., "LE3.1 (S)"

usgsGageID - ID of the USGS gage used for flow adjustments

usgsGageName - name of the USGS gage used for flow adjustments

numObservations - number of observations

dyearBegin - begin date in decimal form

dyearEnd - end date in decimal form

dyearLength - period of record length

yearBegin - period of record begin year

yearEnd - period of record end year

dateBegin - begin date

dateEnd - end date

The baseDay and baseDay2decimal functions have been added to this package from the smwrBase package.

Value

A nested list is returned. The first element of the nested list is the down-selected data frame. The second element is the list iSpec, which contains specifications for data extraction. See Examples for usage and Details for further discussion of the data processing and the components of each element.

Examples

## Not run: 
dfr    <- analysisOrganizeData(dataCensored)

# retrieve Secchi depth for Station CB5.4, no transformations are applied
dfr1   <- selectData(dfr[["df"]], 'secchi', 'CB5.4', 'S', transform=FALSE,
                    remMiss=FALSE, analySpec=dfr[["analySpec"]])
df1    <- dfr1[[1]]   # data frame of selected data
iSpec1 <- dfr1[[2]]   # metadata about selected data

# retrieve surface corrected chlorophyll-a concentrations for Station CB5.4,
# missing values are removed and transformation applied
dfr2   <- selectData(dfr[["df"]], 'chla', 'CB5.4', 'S', analySpec=dfr[["analySpec"]])
df2    <- dfr2[[1]]   # data frame of selected data
iSpec2 <- dfr2[[2]]   # metadata about selected data

## End(Not run)

leppott/baytrends documentation built on Nov. 2, 2024, 6:42 p.m.