new_file_definition: Create a new file_definition object
In a-maldet/readall: Read FWF, DSV, EXCEL And SAS Files

In order to read a data file with read_data(), you need to create a new file_definition object. The following functions are available:

new_file_definition(): Creates a file_definition object for FWF, DSV, EXCEL or SAS data files, depending on the file type supported by the given file_structure class object.
new_file_definition_fwf(): Creates a file_definition object for FWF files.
new_file_definition_dsv(): Creates a file_definition object for DSV files.
new_file_definition_excel(): Creates a file_definition object for EXCEL files.
new_file_definition_sas(): Creates a file_definition object for SAS files.

new_file_definition(
  file_path,
  file_structure,
  to_lower = NULL,
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  extra_adapters = new_adapters()
)

new_file_definition_fwf(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  col_start = NULL,
  col_end = NULL,
  col_widths = NULL,
  file_meta = NULL,
  sep_width = NULL,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

new_file_definition_dsv(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  file_meta = NULL,
  sep = ";",
  header = TRUE,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

new_file_definition_excel(
  file_path,
  specification_files = NULL,
  sheet = 1,
  range = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  file_meta = NULL,
  header = TRUE,
  skip_rows = 0,
  na = "",
  trim_ws = TRUE,
  n_max = Inf,
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

new_file_definition_sas(
  file_path,
  specification_files = NULL,
  file_meta = NULL,
  skip_rows = 0,
  n_max = Inf,
  encoding = NULL,
  to_lower = TRUE,
  rename_cols = FALSE,
  retype_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

`file_path`	A string holding the path to the data file.
`file_structure`	A file_structure class object. Objects of this type can be created with the functions `new_file_structure_fwf()`, `new_file_structure_dsv()`, `new_file_structure_excel()` or `new_file_structure_sas()` and fully define the file structure of the data files. The idea is that a single file_structure can be valid for multiple data files and can therefore be reused, whereas a `file_definition` class object also holds the path to the file and is therefore only valid for a single file.
`to_lower`	A logical flag, defining if the names of the columns should be transformed to lower case after reading the data set (by calling `read_data()`). This transformation is applied before comparing the column names (in the case of SAS files, or DSV and EXCEL files with `header = TRUE`). In the case of `new_file_definition()`, the `to_lower` argument overwrites the `to_lower` argument in the file_structure class object given in `file_structure`. If `to_lower` is omitted, then the `file_structure` class object remains unchanged. In the case of `new_file_definition_fwf()`, `new_file_definition_dsv()`, `new_file_definition_excel()` or `new_file_definition_sas()`, the argument `to_lower` must be either `TRUE` or `FALSE`.
`cols_keep`	Either `TRUE` or a character vector. If set to `TRUE`, then all columns of the data are kept when calling `read_data()`. If `cols_keep` is a character vector, then its values are the names of the columns that are kept when calling `read_data()`.
`extra_col_name`	An optional string, which defines the name of a column that will be added to the data set (after reading it with the function `read_data()`). Each entry of the column will have the single value given in `extra_col_val`. For example, this column is useful when reading similar data files for separate years (one could pass the year of the current data set to `extra_col_val` and set `extra_col_name = "year"`). If `extra_col_name` is omitted, no column will be added to the data set and `extra_col_val` must be omitted as well.
`extra_col_val`	An optional value (any atomic type), which will be added (after reading the data set with the function `read_data()`) as an additional column with the column name given in `extra_col_name`. For example, this column is useful when reading similar data files for separate years (one could pass the year of the current data set to `extra_col_val` and set `extra_col_name = "year"`). If omitted, then no column will be added to the data set and the argument `extra_col_name` must be omitted as well.
`extra_col_file_path`	Either `FALSE` or a string. If set to `FALSE` no file-path-column will be added to the data set, when calling `read_data()`. If the argument `extra_col_file_path` is a string, then a column holding the file path of the data file will be added to the read data set, when calling `read_data()`. The string of `extra_col_file_path` will be used as column name for this additional column.
`extra_adapters`	An optional adapters class object, which holds a list of adapter functions. These adapter functions will be added to the adapter functions already stored in the file_structure class object. For further details on adapter functions see section adapters.
`specification_files`	An optional character vector holding the paths to the files, where the file structure is described.
`cols`	An optional list argument, holding the column definitions. This argument can be used instead of the arguments `col_names`, `col_types`, `col_start`, `col_end` and `col_widths` in order to define the column structure. If the argument `cols` is used, then none of the `col_*` arguments are allowed. The `cols` argument has the following structure: It is a list, where each list entry fully describes a single column. Each list entry must have the same subselection of the following possible list entries. `type`: (mandatory) A string value defining the data type of the column. The following values are allowed: `"character"`, `"logical"`, `"integer"`, `"numeric"` and `"NULL"` (for skipping this column). In the case of SAS files the `type` information can be omitted, since the data type information is stored in the SAS data files, but the argument `type` can still be useful in order to check that the read data columns have the expected data types. For SAS files this check is done automatically after reading the data with `read_data()`. `name`: (optional) A string holding the column name. `start`: (optional) A number holding the position of the first character of the column. `end`: (optional) A number holding the position of the last character of the column. `width`: (optional) A number holding the number of characters of the column. `col_meta`: (optional) A col_meta class object, holding some meta information for the specific column (column description, possible column values + descriptions of possible column values). For details see section meta information.
`col_names`	An optional character vector holding the names of the columns. If omitted, then the strings `"x1"`, `"x2"`, ... will be used. In the case of DSV or EXCEL files: If the argument `header` is set to `TRUE`, then the column names given in the data header will be used instead. If `col_names` is also supplied, then the column names given in the DSV, EXCEL or SAS file will be compared with the names given in `col_names`. Sometimes it is useful to have the column names automatically transformed to lower case (directly after reading the data, but before comparing the column names). This can be achieved by setting `to_lower = TRUE`. Generally, the argument `cols` can be used instead, in order to define the column names. If the argument `cols` is not `NULL`, then the argument `col_names` must be omitted.
`col_types`	A character vector defining the data types for each column. The following strings are allowed: `"character"`, `"logical"`, `"integer"`, `"numeric"` and `"NULL"` (for skipping this column). Generally, the argument `cols` can be used instead, in order to define the column types. If the argument `cols` is not `NULL`, then the argument `col_types` must be omitted. In the case of SAS files the `col_types` information can be omitted, since the data type information is stored in the SAS data files, but the argument `col_types` can still be useful in order to check that the data types in the read data are as expected. For SAS files this check is done automatically after reading the data with `read_data()`.
`col_start`	An optional numeric vector holding the positions of the first character of each column. Generally, the argument `cols` can be used instead, in order to define the column start positions. If the argument `cols` is not `NULL`, then the argument `col_start` must be omitted.
`col_end`	An optional numeric vector holding the positions of the last character of each column. The last vector entry (for the rightmost column) is the only entry that can be `NA`. In this case, the cells of the rightmost column are always read up to the newline character. Generally, the argument `cols` can be used instead, in order to define the column end positions. If the argument `cols` is not `NULL`, then the argument `col_end` must be omitted.
`col_widths`	An optional numeric vector holding the numbers of characters of each column. Generally, the argument `cols` can be used instead, in order to define the column widths. If the argument `cols` is not `NULL`, then the argument `col_widths` must be omitted.
`file_meta`	An optional file_meta class object, holding some meta information for each data column (column description, possible column values + descriptions of possible column values). For details see section meta information. If the argument `cols` is not `NULL`, then the argument `file_meta` must be omitted.
`sep_width`	An optional number, defining the number of characters between each column (often `0`).
`skip_rows`	The number of rows to be skipped. In the case of DSV or EXCEL files: If the argument `header` is set to `TRUE`, then the first row is always assumed to be the header row.
`na`	A string representing missing values in the data file.
`decimal_mark`	A character, defining the decimal separator in numeric columns. Only the strings `"."` and `","` are allowed.
`big_mark`	A character, defining the thousands separator in numeric columns. Only the strings `"."` and `","` are allowed.
`trim_ws`	A logical value, defining if the character values should be stripped of all leading and trailing white space.
`n_max`	A number, defining the maximum number of rows to be read. If `n_max = Inf`, then all available rows will be read.
`encoding`	A string, defining which encoding should be assumed when reading the data file. The following values are allowed: `"UTF-8"`: For UTF-8 encoded files. `"latin1"`: For ISO 8859-1 (also called Latin-1) encoded files. This encoding is almost the same as Windows-1252 (also called ANSI). They differ only in 32 symbol codes (special symbols that are rarely used). In the case of SAS files, it is possible to set `encoding = NULL`. In this case, the encoding defined in the SAS data file header will be used.
`adapters`	An optional list argument, holding a list of adapter functions (See section adapters).
`...`	Additional function arguments for `readr::read_fwf()` in the case of FWF files, `utils::read.delim()` in the case of DSV files, or `readxl::read_excel()` in the case of EXCEL files.
`sep`	A string holding the column delimiter symbol.
`header`	A logical value, which defines if the first row contains the data headers. If set to `TRUE`, then the names given in the data header will be used as column names instead.
`rename_cols`	A logical value, which defines if the column names given in the data file should be overwritten by the column names given in the argument `col_names`. If `col_names` is not given, then `rename_cols` has no effect.
`sheet`	A string or an integer. If a string, the value is the name of the sheet to be read. If an integer, the value is the position of the sheet to be read (counting starts at `1`).
`range`	An optional string, holding an EXCEL range string, defining the data range in the spreadsheet. If `header` is set to `TRUE`, then the range must include a header row.
`retype_cols`	A logical value, which defines if the types of the columns given in the SAS file should be changed to the types given in the `col_types` argument. If `col_types` is not given, then `retype_cols` has no effect.
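
As an illustration of the `cols` argument, the following base R sketch builds such a list and derives the equivalent `col_*` vectors from it. The column names, types and positions are invented, and the file path in the commented-out call is hypothetical; only the argument names come from the usage section.

```r
# Hypothetical column layout for a fixed width file with three columns.
# Every entry uses the same set of fields (type, name, start, end),
# as the description of `cols` requires.
cols <- list(
  list(type = "character", name = "id",     start = 1, end = 5),
  list(type = "integer",   name = "age",    start = 6, end = 8),
  list(type = "numeric",   name = "income", start = 9, end = 18)
)

# The same layout expressed through the separate col_* vectors
col_names <- vapply(cols, function(col) col$name, character(1))
col_types <- vapply(cols, function(col) col$type, character(1))
col_start <- vapply(cols, function(col) col$start, numeric(1))
col_end   <- vapply(cols, function(col) col$end, numeric(1))

# Either form (but not both) would then be passed on, for example:
# new_file_definition_fwf(file_path = "data_2020.txt", cols = cols)
```

Keeping the layout in one list makes it harder for the name, type and position vectors to drift out of step with one another.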

A file_definition class object holding all information needed for reading the data file with read_data().

The function read_data() can read four different types of data files:

FWF: Fixed width files. These files are text files, where the data is stored in columns that have a fixed character width.
DSV: Delimiter-separated value files. These files are text files, where the data is stored in columns that are separated by a delimiter character.
EXCEL: An excel file holding the data.
SAS: A SAS file holding the data.

In order to read a data file with the function read_data(), it is useful to create a file_definition or file_structure class object, holding all the information needed about the data file:
- new_file_definition_fwf() or new_file_structure_fwf() for FWF files
- new_file_definition_dsv() or new_file_structure_dsv() for DSV files
- new_file_definition_excel() or new_file_structure_excel() for Excel files
- new_file_definition_sas() or new_file_structure_sas() for SAS files
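
As a minimal sketch of this workflow, the following writes a small semicolon-separated file to a temporary location and describes it with `new_file_definition_dsv()`. The sample data and column names are made up, the call uses only arguments listed in the usage section, and it has not been checked against a particular version of the package, so treat it as an outline rather than a reference.

```r
# Write a tiny example DSV file (hypothetical data)
path <- tempfile(fileext = ".csv")
writeLines(c("ID;Income", "1;2500.5", "2;3100.0"), path)

# Only run the readall part if the package is installed
if (requireNamespace("readall", quietly = TRUE)) {
  def <- readall::new_file_definition_dsv(
    file_path      = path,
    col_names      = c("id", "income"),   # compared with the lower-cased header
    col_types      = c("integer", "numeric"),
    sep            = ";",
    header         = TRUE,
    to_lower       = TRUE,
    extra_col_name = "year",              # name of the constant extra column
    extra_col_val  = 2020L                # value stored in every row of that column
  )
  dat <- readall::read_data(def)
  str(dat)
}
```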

The goal of the package readall is to read data files. For this purpose the package offers three different class objects in order to store meta data about the data files:

file_structure class objects: Objects of this class can be used in order to define all file type specific information (e.g. column positions, column names, column types, delimiter symbols, rows to skip etc.). The idea is that one file_structure object may be valid for several files and can therefore be used to read multiple data files.
file_definition class objects: Objects of this class contain all the information needed in order to read a single specific data file (path to the data file, file structure etc.). A file_definition class object contains a file_structure, which holds all file type specific information, but also other information that is only valid for this specific file.
file_collection class objects: A file_collection class object is simply a list holding multiple file_definition class objects. A file_collection class object can be used in order to read several data files at once and concatenate the data into a single data.frame.

An adapter function is a function that takes a data.frame as input argument and returns a modified version of this data.frame. The adapter functions are stored in an adapters class object, which is a special list that contains all adapter functions and a description text of each function. These class objects can be created by using the function new_adapters(). The adapters class objects can be added to a file_structure, a file_definition or a file_collection class object. After reading a data file (by calling read_data(file_definition)), all adapter functions listed in the adapters argument of the file_definition class object will be applied consecutively to the loaded data set. Adapter functions can be added to an existing file_structure, file_definition or file_collection class object by using the function add_adapters(). Adapter functions can be used for several tasks:

adapt the data sets in such a way that they can be concatenated for multiple years
compute new variables from existing variables
fix errors in variables
transform the values of a variable of an older data set, such that it complies with a newer variable definition
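
To make the idea concrete, here is a base R sketch of two adapter-style functions and of what applying them consecutively means. It does not use new_adapters() or add_adapters(), whose arguments are not shown on this page; the function names, the data and the Reduce() call are illustrative assumptions, not the package's implementation.

```r
# Hypothetical adapter: recode an older coding ("m"/"f") to a newer 1/2 coding
recode_sex <- function(df) {
  df$sex <- ifelse(df$sex == "m", 1L, 2L)
  df
}

# Hypothetical adapter: compute a new variable from existing variables
add_bmi <- function(df) {
  df$bmi <- df$weight_kg / (df$height_m ^ 2)
  df
}

old_data <- data.frame(
  sex = c("m", "f"),
  weight_kg = c(80, 60),
  height_m = c(2.0, 1.5),
  stringsAsFactors = FALSE
)

# Apply the functions one after another; each one receives
# the data.frame returned by the previous one
adapter_functions <- list(recode_sex, add_bmi)
adapted <- Reduce(function(df, f) f(df), adapter_functions, old_data)
```

Because every adapter takes a data.frame and returns a data.frame, the order of the list is the order of application, and later adapters can rely on columns created by earlier ones.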

The col_meta class objects are used in order to store some meta information about single data columns, like additional column descriptions and column value/level descriptions. In order to store meta information about a set of columns, a file_meta class object can be used. These objects store a list of col_meta class objects, where each col_meta class object corresponds to a specific column in a data set. The file_meta class objects are usually stored in file_structure class objects or file_definition class objects. When calling read_data(), the meta information is also appended to the resulting data.frame. The meta information stored in a file_structure class object, a file_definition class object or a read data.frame can be extracted by using the function get_meta(). A col_meta class object holds the following information:

desc: A string holding the column description.
values: A vector (character/logical/numeric) usually holding the possible column values (e.g. c(1, 2)) or a more abstract text version of the column values (e.g. c("JJJJMMDD", "99999999", "")).
values_desc: A character vector that corresponds to the values vector. Each entry of values_desc is a more detailed description of the corresponding entry in values. If some descriptions are not present, the entries are NA values.
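
The three fields can be pictured as parallel entries. The sketch below uses a plain R list as a stand-in rather than the package's own constructor (which is not documented on this page), and the column and its codes are invented.

```r
# Plain-list stand-in for the information a col_meta object holds
sex_meta <- list(
  desc        = "Sex of the respondent",
  values      = c(1, 2, 9),
  values_desc = c("male", "female", NA)   # NA: no description available
)

# values and values_desc correspond entry by entry
lookup <- data.frame(
  value = sex_meta$values,
  description = sex_meta$values_desc,
  stringsAsFactors = FALSE
)
```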