read_data: Read FWF, DSV or EXCEL data files

View source: R/read_data.R

Description

The functions read_data(), read_data_fwf(), read_data_dsv() and read_data_excel() are used to read FWF, DSV or EXCEL data files. The function read_data() is the heart of the readall package: it requires only a single function argument (a file_definition class object), which holds all file information needed to read the data file. By passing a file_collection class object into read_data() instead, it is also possible to read multiple data files at once and store the concatenated data sets in a single data.frame. The functions read_data_fwf(), read_data_dsv() and read_data_excel() are less flexible but have a more conventional interface, since these functions do not use file_definition class objects and instead require the user to pass all file information directly as function arguments.

Usage

read_data(file_definition)

read_data_fwf(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  col_start = NULL,
  col_end = NULL,
  col_widths = NULL,
  sep_width = 0,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_dsv(
  file_path,
  specification_files = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  sep = ";",
  header = TRUE,
  skip_rows = 0,
  na = "",
  decimal_mark = ".",
  big_mark = ",",
  trim_ws = TRUE,
  n_max = Inf,
  encoding = "latin1",
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_excel(
  file_path,
  specification_files = NULL,
  range = NULL,
  sheet = NULL,
  cols = NULL,
  col_names = NULL,
  col_types = NULL,
  header = TRUE,
  skip_rows = 0,
  na = "",
  trim_ws = TRUE,
  n_max = Inf,
  to_lower = TRUE,
  rename_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)

read_data_sas(
  file_path,
  specification_files = NULL,
  skip_rows = 0,
  n_max = Inf,
  encoding = NULL,
  to_lower = TRUE,
  rename_cols = FALSE,
  retype_cols = FALSE,
  adapters = new_adapters(),
  cols_keep = TRUE,
  extra_col_name = NULL,
  extra_col_val = NULL,
  extra_col_file_path = FALSE,
  ...
)
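
A minimal sketch of a direct call to read_data_dsv(), using only argument names from the usage above. The example file, its column names and its contents are made up for illustration, and the call has not been verified against the package, so the exact behaviour is an assumption. The call is skipped when readall is not installed.

```r
# Write a small semicolon-separated example file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("id;name;amount", "1;anna;10.5", "2;ben;20.25"), path)

# Only call the package if it is installed. The argument names follow the
# usage section; how the package reacts to them is an assumption.
if (requireNamespace("readall", quietly = TRUE)) {
  dat <- readall::read_data_dsv(
    file_path    = path,
    col_names    = c("id", "name", "amount"),
    col_types    = c("integer", "character", "numeric"),
    sep          = ";",
    header       = TRUE,
    decimal_mark = "."
  )
  str(dat)
} else {
  message("Package 'readall' is not installed; skipping the call.")
}
```
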

Arguments

file_definition

A file_definition class object holding all information needed for reading the data. This object can be created with one of the following functions:

  • new_file_definition(): For reading FWF, DSV or EXCEL files

  • new_file_definition_fwf(): For reading FWF files

  • new_file_definition_dsv(): For reading DSV files

  • new_file_definition_excel(): For reading EXCEL files

file_path

A string holding the path to the data file.

specification_files

An optional character vector holding the paths to the files, where the file structure is described.

cols

An optional list argument holding the column definitions. This argument can be used instead of the arguments col_names, col_types, col_start, col_end and col_widths in order to define the column structure. If the argument cols is used, then none of the col_* arguments may be supplied. The cols argument is a list in which each list entry fully describes a single column. Each list entry must have the same subselection of the following possible list entries:

  • type: (mandatory) A string defining the data type of the column. The following values are allowed: "character", "logical", "integer", "numeric" and "NULL" (for skipping this column). In the case of SAS files the type information can be omitted, since the data type information is stored in the SAS data files, but the type entry can still be useful in order to check that the columns read have the expected data types. For SAS files this check is done automatically after reading the data with read_data().

  • name: (optional) A string holding the column name.

  • start: (optional) A number holding the position of the first character of the column.

  • end: (optional) A number holding the position of the last character of the column.

  • width: (optional) A number holding the number of characters of the column.

  • col_meta: (optional) A col_meta class object, holding some meta information for the specific column (column description, possible column values + descriptions of possible column values). For details see section meta information.
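
As an illustration of the structure described above, the following sketch builds a cols list for a hypothetical three-column FWF layout (the column names and positions are invented). It uses only base R and does not call the package. It also derives the equivalent col_* vectors, which must not be passed together with cols.

```r
# A hypothetical FWF layout with three columns. Every entry uses the same
# set of fields (type, name, start, end), as the description requires.
cols <- list(
  list(type = "integer",   name = "id",     start = 1,  end = 4),
  list(type = "character", name = "name",   start = 5,  end = 14),
  list(type = "numeric",   name = "amount", start = 15, end = 22)
)

# Check that all entries share the same fields.
fields <- lapply(cols, function(col) sort(names(col)))
same_fields <- all(vapply(fields, identical, logical(1), fields[[1]]))

# The equivalent col_* vectors (an alternative to cols, not an addition).
col_names <- vapply(cols, function(col) col$name,  character(1))
col_types <- vapply(cols, function(col) col$type,  character(1))
col_start <- vapply(cols, function(col) col$start, numeric(1))
col_end   <- vapply(cols, function(col) col$end,   numeric(1))
```
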

col_names

An optional character vector holding the names of the columns. If omitted, then the strings "x1", "x2", ... will be used. In the case of DSV or EXCEL files: If the argument header is set to TRUE, then the column names given in the data header will be used instead. If col_names is also supplied, then the column names given in the DSV, EXCEL or SAS file will be compared with the names given in col_names. Sometimes it is useful to have the column names automatically transformed to lower case (directly after reading the data, but before comparing the column names). This can be achieved by setting to_lower = TRUE. Generally, the argument cols can be used instead in order to define the column names. If the argument cols is not NULL, then the argument col_names must be omitted.

col_types

A character vector defining the data types for each column. The following strings are allowed: "character", "logical", "integer", "numeric" and "NULL" (for skipping this column). Generally, the argument cols can be used instead in order to define the column types. If the argument cols is not NULL, then the argument col_types must be omitted. In the case of SAS files the col_types information can be omitted, since the data type information is stored in the SAS data files, but the argument col_types can still be useful in order to check whether the data types in the files read are as expected. For SAS files this check is done automatically after reading the data with read_data().

col_start

An optional numeric vector holding the positions of the first character of each column. Generally, the argument cols can be used instead, in order to define the column start positions. If the argument cols is not NULL, then the argument col_start must be omitted.

col_end

An optional numeric vector holding the positions of the last character of each column. The last vector entry (for the rightmost column) is the only entry that can be NA. In this case, the rightmost cells are always read up to the newline character. Generally, the argument cols can be used instead in order to define the column end positions. If the argument cols is not NULL, then the argument col_end must be omitted.

col_widths

An optional numeric vector holding the numbers of characters of each column. Generally, the argument cols can be used instead, in order to define the column widths. If the argument cols is not NULL, then the argument col_widths must be omitted.

sep_width

An optional number, defining the number of characters between each column (often 0).
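
The relationship between col_widths, sep_width, col_start and col_end can be sketched in base R as follows. The widths are invented, and the sketch assumes that character positions are counted from 1; the package documentation above does not state this explicitly.

```r
# Hypothetical layout: three columns with one separator character between them.
col_widths <- c(4, 10, 8)
sep_width  <- 1

# Start of column i = 1 + sum of all previous widths and separators
# (assuming 1-based character positions).
col_start <- cumsum(c(1, head(col_widths + sep_width, -1)))

# End of column i = start + width - 1.
col_end <- col_start + col_widths - 1
```
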

skip_rows

The number of rows to be skipped. In the case of DSV or EXCEL files: If the argument header is set to TRUE, then the first row is always assumed to be the header row.

na

A string representing missing values in the data file.

decimal_mark

A character, defining the decimal separator in numeric columns. Only the strings "." and "," are allowed.

big_mark

A character, defining the thousands separator in numeric columns. Only the strings "." and "," are allowed.
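
To make the roles of decimal_mark and big_mark concrete, here is a small base R sketch. The helper parse_marked_number() is hypothetical and is not part of readall; it only illustrates how the two marks are interpreted when a text value is turned into a number.

```r
# Hypothetical helper: convert text such as "1.234,56" to a number.
parse_marked_number <- function(x, decimal_mark = ".", big_mark = ",") {
  # Only "." and "," are allowed, and the two marks must differ.
  stopifnot(
    decimal_mark %in% c(".", ","),
    big_mark %in% c(".", ","),
    decimal_mark != big_mark
  )
  x <- gsub(big_mark, "", x, fixed = TRUE)      # drop thousands separators
  x <- sub(decimal_mark, ".", x, fixed = TRUE)  # normalise the decimal mark
  as.numeric(x)
}

german_style  <- parse_marked_number("1.234,56", decimal_mark = ",", big_mark = ".")
english_style <- parse_marked_number("1,234.56", decimal_mark = ".", big_mark = ",")
```
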

trim_ws

A logical value, defining if the character values should be stripped of all leading and trailing white spaces.

n_max

A number, defining the maximum number of rows to be read. If n_max = Inf, then all available rows will be read.

encoding

A string, defining which encoding should be assumed when reading the data file. The following values are allowed:

  • "UTF-8": For UTF-8 encoded files.

  • "latin1": For ISO 8859-1 (also called Latin-1) encoded files. This encoding is almost the same as Windows-1252 (also called ANSI); the two differ only in 32 symbol codes (special symbols that are rarely used).

In the case of SAS files, it is possible to set encoding = NULL. In this case, the encoding defined in the SAS data file header will be used.
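
Why the encoding matters can be seen with a short base R sketch that does not use readall. The same name is stored with a different number of bytes in Latin-1 and in UTF-8, so reading a file with the wrong assumption garbles non-ASCII characters.

```r
# The bytes of "Müller" as stored in a Latin-1 encoded file (ü = 0xFC).
latin1_text <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))

# Convert to UTF-8, where ü takes two bytes (0xC3 0xBC).
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

n_bytes_latin1 <- length(charToRaw(latin1_text))
n_bytes_utf8   <- length(charToRaw(utf8_text))
```
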

to_lower

A logical flag, defining if the names of the columns should be transformed to lower case after reading the data set (by calling read_data()). This transformation will be applied before comparing the column names (in the case of SAS files or DSV and EXCEL files with header = TRUE). In the case of new_file_definition() the to_lower argument overrides the to_lower argument in the file_structure class object given in file_structure. If to_lower is omitted, then the file_structure class object remains unchanged. In the case of new_file_definition_fwf(), new_file_definition_dsv(), new_file_definition_excel() or new_file_definition_sas() the argument to_lower must either be TRUE or FALSE.
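
The order of operations described above (lower-casing first, comparison second) can be sketched in base R. The header names are invented, and the comparison via identical() is an assumption about how such a check could look, not the package's actual code.

```r
header_names <- c("ID", "Name", "AMOUNT")  # names as found in the data file
col_names    <- c("id", "name", "amount")  # names the user expects
to_lower     <- TRUE

# Without lower-casing, the comparison would fail.
match_before <- identical(header_names, col_names)

# Transform first, then compare.
if (to_lower) header_names <- tolower(header_names)
match_after <- identical(header_names, col_names)
```
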

adapters

An optional list argument, holding a list of adapter functions (See section adapters).

cols_keep

Either TRUE or a character vector. If set to TRUE, then all columns of the data are kept when calling read_data(). If cols_keep is a character vector, then the values in cols_keep are the names of the columns that are kept when calling read_data().

extra_col_name

An optional string defining the name of a column that will be added to the data set (after reading it with the function read_data()). Each entry of the column will have the single value given in extra_col_val. For example: This column is useful when reading similar data files for separate years (one could pass the year of the current data set to extra_col_val and set extra_col_name = "year"). If extra_col_name is omitted, then no column will be added to the data set and extra_col_val must be omitted as well.

extra_col_val

An optional value (any atomic type), which will be added (after reading the data set with the function read_data()) as an additional column with the column name given in extra_col_name. For example: This column is useful when reading similar data files for separate years (one could pass the year of the current data set to extra_col_val and set extra_col_name = "year"). If omitted, then no column will be added to the data set and the argument extra_col_name must be omitted as well.

extra_col_file_path

Either FALSE or a string. If set to FALSE no file-path-column will be added to the data set, when calling read_data(). If the argument extra_col_file_path is a string, then a column holding the file path of the data file will be added to the read data set, when calling read_data(). The string of extra_col_file_path will be used as column name for this additional column.
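
The combined effect of cols_keep, extra_col_name, extra_col_val and extra_col_file_path can be pictured with a base R sketch. The data and the file path are invented, and the order in which the three steps are applied is an assumption; the package may apply them differently.

```r
# Hypothetical data set as it might look directly after reading.
dat <- data.frame(id = 1:2, name = c("anna", "ben"), amount = c(10.5, 20.25))
file_path <- "data/sales_2020.csv"  # hypothetical path, never opened here

cols_keep           <- c("id", "amount")
extra_col_name      <- "year"
extra_col_val       <- 2020L
extra_col_file_path <- "source_file"

# Keep only the requested columns (TRUE would keep everything).
if (!isTRUE(cols_keep)) dat <- dat[, cols_keep, drop = FALSE]

# Add a constant column holding extra_col_val.
if (!is.null(extra_col_name)) dat[[extra_col_name]] <- extra_col_val

# Add a column holding the file path, named after the given string.
if (is.character(extra_col_file_path)) dat[[extra_col_file_path]] <- file_path
```
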

...

Additional function arguments for

  • readr::read_fwf() in case of FWF files

  • utils::read.delim() in case of DSV files

  • readxl::read_excel() in case of EXCEL files

sep

A string holding the column delimiter symbol.

header

A logical value, which defines if the first row contains the data headers. If set to TRUE, then the names given in the data header will be used as column names instead.

rename_cols

A logical value, which defines if the column names given in the data file should be overwritten by the column names given in the argument col_names. If col_names is not given, then rename_cols has no effect.

range

An optional string, holding an EXCEL range string, defining the data range in the spreadsheet. If header is set to TRUE, then the range must include a header row.

sheet

A string or an integer number:

  • string: The value defines the name of the sheet, which should be read.

  • integer: The value defines the position of the sheet, which should be read (counting starts at 1).

retype_cols

A logical value, which defines if the types of the columns given in the SAS file should be changed to the types given in the col_types argument. If col_types is not given, then retype_cols has no effect.

Details

The function read_data() can either read a single data file and return a data.frame or it can read multiple data files at once and return the concatenated data sets as a single data.frame. read_data() can read the following data file types: FWF, DSV, EXCEL and SAS files.

In order to read a single file with read_data() a file_definition class object must be passed into the function argument file_definition. These file_definition class objects contain all information needed for reading a specific data file. When calling read_data(file_definition) where file_definition is a file_definition class object, the following tasks will be executed:

In order to read multiple data files at once and automatically concatenate the resulting data.frames into a single data.frame, you need to create a list of file_definition class objects first by using the function new_file_collection(). Each list entry holds the meta data of a different data file. When read_data() is applied on a file_collection class object, then the following tasks will be executed:
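
The idea of reading several files and concatenating them into one data.frame can be pictured with base R alone. This is not how readall implements it; the sketch uses utils::read.delim() and rbind() on two invented yearly files and adds a year column, similar in spirit to extra_col_name and extra_col_val.

```r
# Two hypothetical yearly files with the same structure.
files <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
writeLines(c("id;amount", "1;10.5", "2;20.25"), files[1])
writeLines(c("id;amount", "3;7.0"), files[2])
years <- c(2020L, 2021L)

# Read one file and tag its rows with the year it belongs to.
read_one <- function(path, year) {
  dat <- utils::read.delim(path, sep = ";", header = TRUE, dec = ".")
  dat$year <- year
  dat
}

# Read every file, then stack the resulting data.frames row-wise.
parts    <- unname(Map(read_one, files, years))
combined <- do.call(rbind, parts)
rownames(combined) <- NULL
```
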

Value

A data.frame holding the read data.

File types

The function read_data() can read four different types of data files: FWF, DSV, EXCEL and SAS files.

difference file_structure/file_definition/file_collection

The goal of the package readall is to read data files. For this purpose the package offers three different class objects in order to store meta data about the data files: file_structure, file_definition and file_collection.

adapters

An adapter function is a function that takes a data.frame as input argument and returns a modified version of this data.frame. The adapter functions are stored in an adapters class object, which is a special list that contains all adapter functions and a description text of each function. These class objects can be created by using the function new_adapters(). The adapters class objects can be added to a file_structure, a file_definition or a file_collection class object. After reading a data file (by calling read_data(file_definition)) all adapter functions listed in the adapters argument of the file_definition class object will be applied consecutively to the loaded data set. Adapter functions can be added to an existing file_structure, file_definition or file_collection class object by using the function add_adapters(). Adapter functions can be used for several tasks:
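
The consecutive application of adapter functions can be sketched in base R. The two adapters and the data are invented, and a plain list stands in for the adapters class object, because the arguments of new_adapters() are not described on this page.

```r
# Two hypothetical adapters: each takes a data.frame and returns a modified one.
drop_negative <- function(dat) dat[dat$amount >= 0, , drop = FALSE]
add_gross <- function(dat) {
  dat$gross <- dat$amount * 1.2
  dat
}

# A plain list as a stand-in for an adapters class object.
adapter_list <- list(drop_negative, add_gross)

dat <- data.frame(id = 1:3, amount = c(10, -5, 20))

# Apply the adapters one after another, feeding each result into the next.
result <- Reduce(function(d, adapter) adapter(d), adapter_list, dat)
```
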


a-maldet/readall documentation built on Dec. 18, 2021, 9:23 p.m.