ms_read_csv: read a csv into macrosheds format
In MacroSHEDS/macrosheds: Tools for interfacing with the MacroSheds dataset

ms_read_csv

R Documentation

read a csv into macrosheds format

Description

Reads CSV data that meets minimum criteria. Built to be as robust as possible to heterogeneous data formats and contents.

Usage

ms_read_csv(
  filepath,
  preprocessed_tibble,
  datetime_cols,
  datetime_tz,
  optionalize_nontoken_characters = ":",
  site_code_col,
  alt_site_code,
  data_cols,
  data_col_pattern,
  alt_datacol_pattern,
  is_sensor,
  set_to_NA,
  var_flagcol_pattern,
  alt_varflagcol_pattern,
  summary_flagcols,
  sampling_type = NULL
)

Arguments

`filepath`	character. path to local CSV.
`preprocessed_tibble`	tibble. A tibble in which every column is of type character. Supply this argument if a dataset requires modification before it can be processed by ms_read_raw_csv. This may be necessary if, e.g., time is stored in a format that can't be parsed by standard datetime format strings. Exactly one of filepath and preprocessed_tibble must be supplied.
`datetime_cols`	named character vector. names are column names that
`datetime_tz`	character. Time zone specification. Must be one of the values returned by OlsonNames().
`optionalize_nontoken_characters`	character vector. used when there might be
`site_code_col`	character. name of column containing site name information
`alt_site_code`	optional list. Names of list elements are desired site_codes within MacroSheds. List elements are character vectors of alternative names that might be encountered. Used when sites are misnamed or need to be changed due to inconsistencies within and across datasets.
`data_cols`	vector. vector of names of columns containing data. If elements of this vector are named, names are taken to be the column names as they exist in the file, and values are used to replace those names. Data columns that aren't referred to in this argument will be omitted from the output, as will their associated flag columns (if any).
`data_col_pattern`	character. A string containing the wildcard "#V#", which represents any number of characters. If data column names will be used as-is, this wildcard is all you need. If data column names contain recurring, superfluous characters, you can omit them with regex. For example, if data columns are named outflow_x, outflow_y, outflow_..., use data_col_pattern = 'outflow_#V#'; you then don't have to type the full names in your argument to data_cols.
`alt_datacol_pattern`	optional string with the same mechanics as data_col_pattern. Use this if there might be a second way in which column names are generated, e.g. output_x, output_y, output_...
`is_sensor`	logical. Either a single logical value, which will be applied to all variable columns, OR a named logical vector with the same length and names as data_cols. If the latter, names correspond to variable names in the file to be read. TRUE means the corresponding variable(s) was/were measured with a sensor (which may be susceptible to drift and/or fouling). FALSE means the measurement(s) was/were not recorded by a sensor; this non-sensor category includes analytical measurement in a lab, visual recording, etc.
`set_to_NA`	character. Values, such as 9999, that are proxies for NA and should be treated as NA.
`var_flagcol_pattern`	character. Optional string with the same mechanics as the other pattern parameters. This one is for columns containing flag information that is specific to one variable. If there's only one data column, omit this argument and use summary_flagcols for all flag information.
`alt_varflagcol_pattern`	character. Optional string with the same mechanics as the other pattern parameters. Use this if there are two naming conventions for variable-specific flag columns.
`summary_flagcols`	vector. optional unnamed vector of column names for flag columns that pertain to all variables
`sampling_type`	optional value that overrides identify_sampling, for cases where that function misidentifies the sampling type. This must be a single value, either 'G' or 'I', and is applied to all variables in the product.
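The pattern, renaming, site-code, and NA-proxy arguments above can be easier to follow on a small example. The following is a minimal base-R sketch, not the package's implementation: the table `raw`, the helpers `pattern_to_regex`, `extract_vars`, and `standardize_sites`, and all column and site names are hypothetical, and treating "#V#" as a regex capture group is an assumption about how the wildcard behaves.

```r
# Hypothetical raw table with all-character columns.
raw <- data.frame(
  date      = c("2020-01-01", "2020-01-02", "2020-01-03"),
  site      = c("W1", "watershed_1", "ws2"),
  outflow_x = c("1.2", "9999", "3.4"),
  outflow_y = c("0.5", "0.7", "9999"),
  output_z  = c("7", "8", "9"),
  stringsAsFactors = FALSE
)

# Assumption: "#V#" acts like a capture group for the variable name.
pattern_to_regex <- function(p) {
  paste0("^", sub("#V#", "(.+)", p, fixed = TRUE), "$")
}

# Return the variable names captured from the columns that match a pattern.
extract_vars <- function(cols, pattern) {
  rx <- pattern_to_regex(pattern)
  sub(rx, "\\1", cols[grepl(rx, cols)])
}

vars_main <- extract_vars(names(raw), "outflow_#V#")  # like data_col_pattern
vars_alt  <- extract_vars(names(raw), "output_#V#")   # like alt_datacol_pattern

# data_cols: names are the names found in the file, values replace them.
data_cols <- c(x = "discharge", y = "temp")
renamed <- unname(data_cols[vars_main])

# alt_site_code: list names are the desired codes, elements are aliases.
alt_site_code <- list(w1 = c("W1", "watershed_1"), w2 = c("W2", "ws2"))
standardize_sites <- function(x, alt) {
  for (code in names(alt)) x[x %in% alt[[code]]] <- code
  x
}
sites <- standardize_sites(raw$site, alt_site_code)

# set_to_NA: proxy values become NA before conversion to double.
set_to_NA <- "9999"
outflow_x <- raw$outflow_x
outflow_x[outflow_x %in% set_to_NA] <- NA
outflow_x <- as.numeric(outflow_x)
```

In this sketch, `vars_main` holds the short variable names `x` and `y`, which is why `data_cols` can refer to them without the `outflow_` prefix.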

Value

Returns a tibble of ordered and renamed columns, omitting any columns from the original file that do not contain data, flag/qaqc information, datetime, or site_code. All-NA data columns and their corresponding flag columns are also omitted, as are rows where all data values are NA. Rows with NA in the datetime or site_code column are dropped. Data columns are given type double; all other columns are given type character. Data and flag/qaqc columns are given two-letter prefixes representing sample regimen (I = installed vs. G = grab; S = sensor vs. N = non-sensor). Data columns are also given the suffix __|dat and flag/qaqc columns the suffix __|flg, which allow them to be cast into long format by ms_cast_and_reflag.
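The naming scheme described above can be illustrated with a short base-R sketch. The helper `ms_style_colname` and the variable names are hypothetical, and joining the prefix to the variable name with "_" is an assumption; the text above specifies only the two-letter prefix and the __|dat and __|flg suffixes.

```r
# Build an output-style column name from a variable name and its regimen.
# Assumption: prefix and variable name are separated by "_".
ms_style_colname <- function(var, sampling_type, is_sensor, kind = "dat") {
  stopifnot(sampling_type %in% c("I", "G"), kind %in% c("dat", "flg"))
  sensor_letter <- if (is_sensor) "S" else "N"
  paste0(sampling_type, sensor_letter, "_", var, "__|", kind)
}

# Named logical vector in the style of the is_sensor argument.
is_sensor <- c(discharge = TRUE, nitrate = FALSE)

dat_cols <- vapply(
  names(is_sensor),
  function(v) ms_style_colname(v, "I", is_sensor[[v]]),
  character(1),
  USE.NAMES = FALSE
)

# The literal "__|" suffix delimiter lets a column name be split back
# into its variable part and its kind (dat or flg).
parts <- strsplit(dat_cols, "__|", fixed = TRUE)
```

Splitting on the fixed string "__|" (rather than treating it as a regex, where "|" means alternation) is what keeps the variable part and the dat/flg part cleanly separable.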