read_data: Read FWF, DSV or EXCEL data files
In a-maldet/readall: Read FWF, DSV, EXCEL And SAS Files


The functions read_data(), read_data_fwf(), read_data_dsv() and read_data_excel() read FWF, DSV or EXCEL data files. The function read_data() is the heart of the readall package. It requires only a single function argument (a file_definition class object), which holds all the file information needed to read the data file. By passing a file_collection class object into read_data() instead, it is also possible to read multiple data files at once and store the concatenated data sets in a single data.frame. The functions read_data_fwf(), read_data_dsv() and read_data_excel() are less flexible but have a more conventional interface: these functions do not use file_definition class objects, but require the user to pass all file information directly as function arguments.

read_data(file_definition)

read_data_fwf(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  col_start = NULL,
  col_end = NULL,
  col_widths = NULL,
  sep_width = 0,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_dsv(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  sep = ";",
  header = TRUE,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_excel(
  file_path,
  specification_files = NULL,
  range = NULL,
  sheet = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  header = TRUE,
  skip_rows = 0,
  na = "",
  trim_ws = TRUE,
  n_max = Inf,
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_sas(
  file_path,
  specification_files = NULL,
  skip_rows = 0,
  n_max = Inf,
  encoding = NULL,
  to_lower = TRUE,
  rename_cols = FALSE,
  retype_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)
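
As an illustration of the direct-argument interface shown above, the following sketch writes a small semicolon-separated file and reads it. The file contents and the chosen `col_types` are made up for this example. The `read_data_dsv()` call uses only arguments that appear in the usage above and is guarded, since readall may not be installed; the `utils::read.delim()` baseline is included only so that the example runs with base R alone.

```r
# Write a small semicolon-separated file to a temporary location
# (hypothetical example data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID;Name;Value", "1;a;1.5", "2;b;2.5"), csv_path)

# Baseline with base R only, so the example runs without readall installed.
baseline <- utils::read.delim(csv_path, sep = ";", header = TRUE,
                              stringsAsFactors = FALSE)

# The corresponding readall call, using only arguments from the usage above.
# Guarded because readall may not be installed on this machine.
if (requireNamespace("readall", quietly = TRUE)) {
  dat <- readall::read_data_dsv(
    file_path = csv_path,
    col_types = c("integer", "character", "numeric"),
    sep       = ";",
    header    = TRUE,
    to_lower  = TRUE  # per the `to_lower` description, names are lower-cased
  )
  print(dat)
}
```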

`file_definition`	A file_definition class object holding all information needed for reading the data. This object can be created with one of the following functions: `new_file_definition()`: for reading FWF, DSV or EXCEL files; `new_file_definition_fwf()`: for reading FWF files; `new_file_definition_dsv()`: for reading DSV files; `new_file_definition_excel()`: for reading EXCEL files.
`file_path`	A string holding the path to the data file.
`specification_files`	An optional character vector holding the paths to the files in which the file structure is described.
`cols`	An optional list argument holding the column definitions. This argument can be used instead of the arguments `col_names`, `col_types`, `col_start`, `col_end` and `col_widths` in order to define the column structure. If the argument `cols` is used, then none of the `col_` arguments are allowed. The `cols` argument has the following structure: it is a list in which each list entry fully describes a single column. Each list entry must have the same subselection of the following possible list entries: `type`: (required) A string defining the data type of the column. The following values are allowed: `"character"`, `"logical"`, `"integer"`, `"numeric"` and `"NULL"` (for skipping this column). In the case of SAS files the `type` information can be omitted, since the data type information is stored in the SAS data files, but the argument `type` can still be useful in order to check that the read data columns have the expected data types. For SAS files this check is done automatically after reading the data with `read_data()`. `name`: (optional) A string holding the column name. `start`: (optional) A number holding the position of the first character of the column. `end`: (optional) A number holding the position of the last character of the column. `width`: (optional) A number holding the number of characters of the column. `col_meta`: (optional) A col_meta class object holding meta information for the specific column (column description, possible column values and descriptions of the possible column values). For details see section meta information.
`col_names`	An optional character vector holding the names of the columns. If omitted, then the strings `"x1"`, `"x2"`, ... will be used. In the case of DSV or EXCEL files: If the argument `header` is set to `TRUE`, then the column names given in the data header will be used instead. If `col_names` is also supplied, then the column names given in the DSV, EXCEL or SAS file will be compared with the names given in `col_names`. Sometimes it is useful to have the column names automatically transformed to lower case (directly after reading the data, but before comparing the column names). This can be achieved by setting `to_lower = TRUE`. Generally, the argument `cols` can be used instead in order to define the column names. If the argument `cols` is not `NULL`, then the argument `col_names` must be omitted.
`col_types`	A character vector defining the data types for each column. The following strings are allowed: `"character"`, `"logical"`, `"integer"`, `"numeric"` and `"NULL"` (for skipping this column). Generally, the argument `cols` can be used instead in order to define the column types. If the argument `cols` is not `NULL`, then the argument `col_types` must be omitted. In the case of SAS files the `col_types` information can be omitted, since the data type information is stored in the SAS data files, but the argument `col_types` can still be useful in order to check whether the data types in the read data are as expected. For SAS files this check is done automatically after reading the data with `read_data()`.
`col_start`	An optional numeric vector holding the positions of the first character of each column. Generally, the argument `cols` can be used instead, in order to define the column start positions. If the argument `cols` is not `NULL`, then the argument `col_start` must be omitted.
`col_end`	An optional numeric vector holding the positions of the last character of each column. The last vector entry (for the rightmost column) is the only entry that can be `NA`. In this case, the rightmost cells are always read up to the newline character. Generally, the argument `cols` can be used instead in order to define the column end positions. If the argument `cols` is not `NULL`, then the argument `col_end` must be omitted.
`col_widths`	An optional numeric vector holding the numbers of characters of each column. Generally, the argument `cols` can be used instead, in order to define the column widths. If the argument `cols` is not `NULL`, then the argument `col_widths` must be omitted.
`sep_width`	An optional number, defining the number of characters between each column (often `0`).
`skip_rows`	The number of rows to be skipped. In the case of DSV or EXCEL files: If the argument `header` is set to `TRUE`, then the first row is always assumed to be the header row.
`na`	A string representing missing values in the data file.
`decimal_mark`	A character, defining the decimal separator in numeric columns. Only the strings `"."` and `","` are allowed.
`big_mark`	A character, defining the thousands separator in numeric columns. Only the strings `"."` and `","` are allowed.
`trim_ws`	A logical value, defining if the character values should be stripped of all leading and trailing white space.
`n_max`	A number, defining the maximum number of rows to be read. If `n_max = Inf`, then all available rows will be read.
`encoding`	A string, defining which encoding should be assumed when reading the data file. The following values are allowed: `"UTF-8"`: For UTF-8 encoded files. `"latin1"`: For ISO 8859-1 (also called Latin-1) encoded files. This encoding is almost the same as Windows-1252 (also called ANSI). They differ only in 32 symbol codes (special symbols that are rarely used). In the case of SAS files, it is possible to set `encoding = NULL`. In this case, the encoding defined in the SAS data file header will be used.
`to_lower`	A logical flag, defining if the names of the columns should be transformed to lower case after reading the data set (by calling `read_data()`). This transformation will be applied before comparing the column names (in the case of SAS files, or DSV and EXCEL files with `header = TRUE`). In the case of `new_file_definition()` the `to_lower` argument overwrites the `to_lower` argument in the file_structure class object given in `file_structure`. If `to_lower` is omitted, then the `file_structure` class object remains unchanged. In the case of `new_file_definition_fwf()`, `new_file_definition_dsv()`, `new_file_definition_excel()` or `new_file_definition_sas()` the argument `to_lower` must be either `TRUE` or `FALSE`.
`adapters`	An optional list argument, holding a list of adapter functions (See section adapters).
`cols_keep`	Either `TRUE` or a character vector. If set to `TRUE`, then all columns of the data are kept when calling `read_data()`. If `cols_keep` is a character vector, then its values are the names of the columns that are kept when calling `read_data()`.
`extra_col_name`	An optional string defining the name of a column that will be added to the data set (after reading it with `read_data()`). Each entry of the column will have the single value given in `extra_col_val`. For example, this column is useful when reading similar data files for separate years (one could pass the current data set year to `extra_col_val` and set `extra_col_name = "year"`). If `extra_col_name` is omitted, then no column will be added to the data set and `extra_col_val` must be omitted as well.
`extra_col_val`	An optional value (any atomic type), which will be added (after reading the data set with `read_data()`) as an additional column with the column name given in `extra_col_name`. For example, this column is useful when reading similar data files for separate years (one could pass the current data set year to `extra_col_val` and set `extra_col_name = "year"`). If omitted, then no column will be added to the data set and the argument `extra_col_name` must be omitted as well.
`extra_col_file_path`	Either `FALSE` or a string. If set to `FALSE` no file-path-column will be added to the data set, when calling `read_data()`. If the argument `extra_col_file_path` is a string, then a column holding the file path of the data file will be added to the read data set, when calling `read_data()`. The string of `extra_col_file_path` will be used as column name for this additional column.
`...`	Additional function arguments for `readr::read_fwf()` in the case of FWF files, `utils::read.delim()` in the case of DSV files, or `readxl::read_excel()` in the case of EXCEL files.
`sep`	A string holding the column delimiter symbol.
`header`	A logical value, which defines if the first row contains the data headers. If set to `TRUE`, then the names given in the data header will be used as column names instead.
`rename_cols`	A logical value, which defines if the column names given in the data file should be overwritten by the column names given in the argument `col_names`. If `col_names` is not given, then `rename_cols` has no effect.
`range`	An optional string holding an EXCEL range string, which defines the data range in the spreadsheet. If `header` is set to `TRUE`, then the range must include a header row.
`sheet`	A string or an integer. If a string, the value is the name of the sheet to read. If an integer, the value is the position of the sheet to read (counting from `1`).
`retype_cols`	A logical value, which defines if the types of the columns given in the SAS file should be changed to the types given in the `col_types` argument. If `col_types` is not given, then `retype_cols` has no effect.
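
The `cols` list structure and the relationship between `col_start`, `col_end`, `col_widths` and `sep_width` can be illustrated with base R alone. The sketch below is not readall's implementation. It assumes 1-based character positions and that `sep_width` counts filler characters between neighbouring columns; the column definitions and the sample line are hypothetical.

```r
# Hypothetical column definitions in the list structure described for `cols`.
cols <- list(
  list(type = "integer",   name = "id",   width = 3),
  list(type = "character", name = "name", width = 5),
  list(type = "numeric",   name = "val",  width = 4)
)
sep_width <- 1  # assumed: one filler character between neighbouring columns

# Derive 1-based start and end positions from the widths.
widths    <- vapply(cols, function(col) col$width, numeric(1))
col_start <- cumsum(c(1, head(widths + sep_width, -1)))
col_end   <- col_start + widths - 1

# Cut one fixed-width line into its fields and strip surrounding white space.
line   <- "001 alice 12.5"
fields <- trimws(substring(line, col_start, col_end))
names(fields) <- vapply(cols, function(col) col$name, character(1))
print(fields)
```

Under these assumptions, giving either the widths or the start and end positions describes the same layout, which is why only one of the two descriptions is needed.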

The function read_data() can either read a single data file and return a data.frame or it can read multiple data files at once and return the concatenated data sets as a single data.frame. read_data() can read the following data file types:

FWF: Fixed width files. These files are text files in which the data is stored in columns that have a fixed character width.
DSV: Delimiter-separated value files. These files are text files in which the data is stored in columns that are separated by a delimiter character.
EXCEL: An excel file holding the data.

In order to read a single file with read_data(), a file_definition class object must be passed into the function argument file_definition. Such an object contains all information needed for reading a specific data file. When calling read_data(file_definition), where file_definition is a file_definition class object, the following tasks will be executed:

reading the data file specified in file_definition and storing the data to a data.frame
if the argument to_lower was set to TRUE, then replace all column names of the read data set by their lower-case versions.
if the column names were read from the data file and column names are also given by the col_names argument, then compare the read column names with the column names given in col_names and print a warning in case of discrepancies.
in the case of SAS-files: If the argument col_types was given as well, then compare the read data types of the data columns with the data types given in col_types and print a warning in case of discrepancies.
modifying the resulting data.frame by consecutively applying all adapter functions stored in the adapter function list file_definition$adapters. For details see section adapters.
Optionally adding a column with value file_definition$extra_col_val and column name file_definition$extra_col_name. For details see new_file_definition()
Optionally adding a character column holding the path of the read data file with column name defined in file_definition$extra_col_file_path.
If file_definition$cols_keep is not NULL, then only the columns defined in file_definition$cols_keep will be kept. If the attribute is NULL then all columns will be kept.
Finally the resulting data.frame will be returned.
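
The steps that follow the actual file read can be sketched in base R as below. This is only an illustration of the order of operations listed above, not readall's implementation: `post_process()` and its arguments are hypothetical stand-ins for the fields of a file_definition object, the SAS type check is left out, and `cols_keep` follows the argument description (`TRUE` or a character vector).

```r
# Hypothetical helper mirroring the post-read steps listed above.
post_process <- function(dat, file_path, col_names = NULL, to_lower = TRUE,
                         adapters = list(), extra_col_name = NULL,
                         extra_col_val = NULL, extra_col_file_path = FALSE,
                         cols_keep = TRUE) {
  # Lower-case the column names before any comparison.
  if (to_lower) names(dat) <- tolower(names(dat))
  # Compare the read names with the expected names and warn on mismatch.
  if (!is.null(col_names) && !identical(names(dat), col_names)) {
    warning("Column names in the file differ from `col_names`.")
  }
  # Apply the adapter functions one after another.
  for (adapter in adapters) dat <- adapter(dat)
  # Optionally add a constant column and a file path column.
  if (!is.null(extra_col_name)) {
    dat[[extra_col_name]] <- rep(extra_col_val, nrow(dat))
  }
  if (is.character(extra_col_file_path)) {
    dat[[extra_col_file_path]] <- rep(file_path, nrow(dat))
  }
  # Keep only the requested columns.
  if (is.character(cols_keep)) dat <- dat[, cols_keep, drop = FALSE]
  dat
}

# Usage with made-up data standing in for a freshly read file.
raw <- data.frame(ID = 1:2, Value = c(1.5, 2.5))
res <- post_process(
  raw, file_path = "data_2020.csv", col_names = c("id", "value"),
  adapters = list(function(d) { d$value2 <- d$value * 2; d }),
  extra_col_name = "year", extra_col_val = 2020,
  extra_col_file_path = "file",
  cols_keep = c("id", "value2", "year", "file")
)
print(res)
```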

In order to read multiple data files at once and automatically concatenate the resulting data.frames into a single data.frame, you first need to create a list of file_definition class objects by using the function new_file_collection(). Each list entry holds the meta data of a different data file. When read_data() is applied to a file_collection class object, the following tasks will be executed:

loop through the list and apply read_data() to every list entry. Since these entries are file_definition class objects, the tasks of reading single data files (as described above) will be executed for each list entry.
concatenate all resulting data.frames into a single data.frame.
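
The loop-and-concatenate behaviour described above can be sketched with base R. The plain lists below are hypothetical stand-ins for file_definition objects, and `read_one()` stands in for reading a single file; neither is part of readall.

```r
# Helper that writes example lines to a temporary file and returns its path.
write_tmp <- function(lines) {
  path <- tempfile(fileext = ".csv")
  writeLines(lines, path)
  path
}

# Hypothetical stand-in for a file_collection: one entry per data file.
collection <- list(
  list(file_path = write_tmp(c("id;value", "1;10", "2;20")), year = 2019),
  list(file_path = write_tmp(c("id;value", "3;30", "4;40")), year = 2020)
)

# Read a single file and tag it with its year (comparable to `extra_col_val`).
read_one <- function(def) {
  dat <- utils::read.delim(def$file_path, sep = ";", header = TRUE)
  dat$year <- def$year
  dat
}

# Loop through the list, then concatenate all results row-wise.
all_data <- do.call(rbind, lapply(collection, read_one))
print(all_data)
```

Row-wise concatenation with `rbind()` requires all data sets to have the same columns, which is one reason to harmonise them with adapter functions first.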

A data.frame holding the read data.

The function read_data() can read four different types of data files:

FWF: Fixed width files. These files are text files in which the data is stored in columns that have a fixed character width.
DSV: Delimiter-separated value files. These files are text files in which the data is stored in columns that are separated by a delimiter character.
EXCEL: An excel file holding the data.
SAS: A SAS file holding the data.

In order to read a data file with the function read_data(), it is useful to create a file_definition or file_structure class object holding all needed information about the data file:
- new_file_definition_fwf() or new_file_structure_fwf() for FWF files
- new_file_definition_dsv() or new_file_structure_dsv() for DSV files
- new_file_definition_excel() or new_file_structure_excel() for Excel files
- new_file_definition_sas() or new_file_structure_sas() for SAS files

The goal of the package readall is to read data files. For this purpose the package offers three different classes of objects for storing meta data about the data files:

file_structure class objects: Objects of this class can be used to define all file type specific information (e.g. column positions, column names, column types, delimiter symbols, rows to skip etc.). The idea is that one file_structure object may be valid for several files and can therefore be used to read multiple data files.
file_definition class objects: Objects of this class contain all information needed to read a single specific data file (path to the data file, file structure etc.). A file_definition class object contains a file_structure, which holds all file type specific information, but also other information that is only valid for this specific file.
file_collection class objects: A file_collection class object is simply a list holding multiple file_definition class objects. A file_collection class object can be used in order to read several data files at once and concatenate the data into a single data.frame.

An adapter function is a function that takes a data.frame as input argument and returns a modified version of this data.frame. The adapter functions are stored in an adapters class object, which is a special list that contains all adapter functions and a description text for each function. These class objects can be created by using the function new_adapters(). An adapters class object can be added to a file_structure, a file_definition or a file_collection class object. After reading a data file (by calling read_data(file_definition)), all adapter functions listed in the adapters argument of the file_definition class object will be applied consecutively to the loaded data set. Adapter functions can be added to an existing file_structure, file_definition or file_collection class object by using the function add_adapters(). Adapter functions can be used for several tasks:

adapt the data sets in such a way that they can be concatenated for multiple years
compute new variables from existing variables
fix errors in variables
transform the values of a variable of an older data set, such that it complies with a newer variable definition
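
A few adapter functions of the kinds listed above might look as follows. This is a sketch with made-up column names and codings. The plain list is only a stand-in for an adapters class object (which would be created with new_adapters()), and `Reduce()` is used here to imitate the consecutive application; it is not necessarily how readall applies them.

```r
# Adapter 1: harmonise a column name that changed between years.
rename_income <- function(dat) {
  names(dat)[names(dat) == "inc"] <- "income"
  dat
}

# Adapter 2: compute a new variable from an existing one.
add_income_k <- function(dat) {
  dat$income_k <- dat$income / 1000
  dat
}

# Adapter 3: translate an older coding (1/2) into a newer labelling.
recode_sex <- function(dat) {
  dat$sex <- c("male", "female")[dat$sex]
  dat
}

# Plain list standing in for an adapters class object.
adapter_list <- list(rename_income, add_income_k, recode_sex)

# Apply the adapters one after another, each on the previous result.
old_data <- data.frame(inc = c(30000, 45000), sex = c(1, 2))
adapted  <- Reduce(function(dat, adapter) adapter(dat), adapter_list, old_data)
print(adapted)
```

Because each adapter receives the output of the previous one, the order matters: here `add_income_k()` relies on the renaming done by `rename_income()`.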