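The package readall offers a single interface for reading
various types of data files. The following data file types are supported:
- FWF data files (fixed width format)
- DSV data files (delimiter-separated values, e.g. with "," or ";")
- EXCEL data files (*.xlsx or *.xlsm)
- SAS data files (*.sas7bdat or *.sas7bcat)

The package can be installed from GitHub:

    devtools::install_github('a-maldet/readall', build_vignettes = TRUE)
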
file_structure class objects

To read data files that are not stored as R data files, we often
have to supply additional information about the file structure to the
reading operation (column names, column types, delimiter symbols etc.).
Often multiple files share the same structure. Therefore, the readall
package offers the file_structure class, which stores all information
about the structure of a data file in a single file_structure class object.
Such an object can be created with one of the following functions:
- new_file_structure_fwf(): Define the file structure of an FWF data file
- new_file_structure_dsv(): Define the file structure of a DSV data file
- new_file_structure_excel(): Define the file structure of an EXCEL data file
- new_file_structure_sas(): Define the file structure of a SAS data file

The created file_structure class object can be used to read
all data files which share the defined structure.
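
The idea of describing a structure once and reusing it for several files can be illustrated in base R. This is only a sketch: `csv_structure` is a plain list and `read_with_structure()` is a hypothetical helper, neither is part of readall, and the file contents are made up.

```r
# Plain-list stand-in for a shared structure description (not a readall object).
csv_structure <- list(
  sep = ";",
  col_names = c("city", "adult"),
  col_types = c("character", "logical")
)

# Hypothetical helper: read one file using the shared description.
read_with_structure <- function(path, structure) {
  read.table(
    path,
    sep = structure$sep,
    col.names = structure$col_names,
    colClasses = structure$col_types,
    header = FALSE,
    stringsAsFactors = FALSE
  )
}

# Two temporary files that share the same layout.
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
writeLines(c("1010;TRUE", "8010;FALSE"), file_a)
writeLines(c("5020;TRUE"), file_b)

df_a <- read_with_structure(file_a, csv_structure)
df_b <- read_with_structure(file_b, csv_structure)
```

Both calls reuse the same description, so a change to the column names or the delimiter only has to be made in one place.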

file_definition class objects

In the previous section, we defined file_structure class objects, which
hold the file structure information. This type of information
can be valid for several data files. But there is also information
which is only valid for a single data file, for example the file path to
the data file.
Therefore, the readall package offers a file_definition class, which
extends the file_structure class. This means that
a file_definition class object contains all needed information about the
file structure and additionally some information that is
only valid for a specific file (like the file path).
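
The "extends" relationship can be pictured with S3 classes in base R. The sketch below is an assumption about how such an extension could look, not a description of readall's internals; `extend_to_definition()` is a hypothetical helper.

```r
# Stand-in for a structure description (plain list with an S3 class attribute).
structure_1 <- structure(
  list(col_names = c("sex", "age"), sep = ";"),
  class = "file_structure"
)

# Hypothetical constructor: add file-specific fields and a subclass.
extend_to_definition <- function(file_structure, file_path) {
  definition <- c(unclass(file_structure), list(file_path = file_path))
  class(definition) <- c("file_definition", class(file_structure))
  definition
}

definition_1 <- extend_to_definition(structure_1, "C:/file1.dat")

# The extended object still counts as a file_structure.
is_structure <- inherits(definition_1, "file_structure")
```

Because the subclass keeps the parent class in its class vector, code written for the structure object keeps working on the extended object.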
A file_definition class object holds all information that is necessary
to read a specific data file with the function read_data().
The following functions can be used to create a file_definition
class object:
- new_file_definition(): Takes an existing file_structure class object and
  creates a file_definition class object by appending all needed
  file-specific information to it.
  Depending on the file type defined in the given file_structure class object,
  the resulting file_definition class object can describe FWF, DSV, EXCEL
  or SAS data files.
- new_file_definition_fwf(): Create a file_definition class object for FWF data files
- new_file_definition_dsv(): Create a file_definition class object for DSV data files
- new_file_definition_excel(): Create a file_definition class object for EXCEL data files
- new_file_definition_sas(): Create a file_definition class object for SAS data files

file_collection class objects

Sometimes it is necessary not only to read a single data file, but to read a
collection of data files and concatenate all data sets into a
single data frame. The data files in such a collection
can differ in file structure or even in file type (for example
a mix of SAS, EXCEL, FWF and DSV data files).
For this case, the readall package offers a file_collection class.
In a file_collection class object, you can store multiple file_definition
class objects, which define the needed data files. When calling the function
read_data(), all defined data files will be read and automatically
concatenated into a single data.frame.
The function new_file_collection() is used to create such
file_collection class objects.
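
Reading several differently formatted files and stacking the results can be sketched in base R as follows. The `definitions` list is a hypothetical stand-in for a collection (plain lists, not readall objects), and the two temporary files only differ in their delimiter.

```r
# Two temporary files with the same columns but different delimiters.
file_comma <- tempfile(fileext = ".csv")
file_semicolon <- tempfile(fileext = ".csv")
writeLines(c("sex,age", "m,31", "f,45"), file_comma)
writeLines(c("sex;age", "f;28"), file_semicolon)

# Hypothetical per-file descriptions (plain lists, not readall objects).
definitions <- list(
  list(file_path = file_comma, sep = ","),
  list(file_path = file_semicolon, sep = ";")
)

# Read each file with its own settings, then stack the results.
data_frames <- lapply(definitions, function(def) {
  read.table(
    def$file_path,
    sep = def$sep,
    header = TRUE,
    colClasses = c("character", "numeric"),
    stringsAsFactors = FALSE
  )
})
combined <- do.call(rbind, data_frames)
```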
When reading a collection of data files,
it is often necessary to post-process each data.frame
before concatenating all data.frames,
e.g. recode some variable levels, calculate new columns or rename
existing columns.
For this reason, the readall package offers adapter functions, which
are functions of the type f: DATA.FRAME -> DATA.FRAME. This means
that an adapter function takes a data.frame and returns a modified version
of the data.frame.
An example of an adapter function would be the following function:

    g <- function(x) { x[x$a > 1,] }

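
A quick way to check an adapter function is to call it on a small hand-made data.frame. The sample data below is made up. The variant `g_safe` is a suggestion rather than part of the original example: row subsetting on a one-column data.frame drops to a plain vector by default, so `drop = FALSE` keeps the DATA.FRAME -> DATA.FRAME contract in that case.

```r
# The adapter from the text: keep only rows where column `a` exceeds 1.
g <- function(x) { x[x$a > 1, ] }

df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
result <- g(df)

# Variant with drop = FALSE, which still returns a data.frame when the
# input has only one column.
g_safe <- function(x) { x[x$a > 1, , drop = FALSE] }
single <- g_safe(data.frame(a = c(1, 2, 3)))
```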
Usually, one does not want to perform only a single data transformation, but
an entire list of data transformations. For this reason the readall package
offers the function new_adapters(), which can be used to store multiple
adapter functions in a single adapters class object. An adapters
class object is a list of adapter functions, which can be stored in a
file_structure or a file_definition class object.
For example:

    structure_excel_1 <- new_file_structure_excel(
      col_names = c("city", "adult"),
      col_types = c("character", "logical"),
      adapters = new_adapters(
        function(x) {
          names(x) <- c("CITY", "ADULT")
          x
        },
        function(x) {
          x <- x[!is.na(x$CITY),]
          x$CITY <- trimws(x$CITY)
          x
        }
      )
    )

When calling read_data(), the data will be read from the EXCEL file,
then the columns will automatically be renamed to CITY and ADULT, and
all rows with missing values for CITY will be removed. Furthermore,
the strings in CITY will be stripped of all leading and trailing white space.
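
Applying a list of adapter functions one after another amounts to a left fold over the list. The sketch below reproduces the two adapters from the example in base R; `apply_adapters()` is a hypothetical helper and says nothing about how readall applies adapters internally, and the input data is made up.

```r
# The two adapter functions from the example above, stored in a plain list.
adapters <- list(
  function(x) {
    names(x) <- c("CITY", "ADULT")
    x
  },
  function(x) {
    x <- x[!is.na(x$CITY), ]
    x$CITY <- trimws(x$CITY)
    x
  }
)

# Hypothetical helper: feed the output of each adapter into the next one.
apply_adapters <- function(df, adapters) {
  Reduce(function(data, adapter) adapter(data), adapters, df)
}

raw <- data.frame(
  city = c(" 1010 ", NA, "8010"),
  adult = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)
clean <- apply_adapters(raw, adapters)
```

The order of the list matters here: the second adapter refers to the column CITY, which only exists after the first adapter has renamed the columns.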
The package readall allows you to add meta data to file_structure
and file_definition class objects. This meta data can contain a detailed
description of each column and its value levels.
The meta data of a single data column is stored in a col_meta class object,
which can be created with the command new_col_meta().
The meta information for all columns can be collected with the command
new_file_meta().
Example-1:

    structure_excel_1 <- new_file_structure_excel(
      col_names = c("city", "adult"),
      col_types = c("character", "logical"),
      meta_list = new_file_meta(
        new_col_meta(
          desc = "National and international city codes",
          values = c("XXXX", "A", NA),
          values_desc = c("4 digit city code", "abroad", "missing")
        ),
        new_col_meta(
          desc = "Is the person an adult",
          values = c(TRUE, FALSE, NA),
          values_desc = c("The person is an adult", "The person is a child", "unknown")
        )
      )
    )

Example-2:

    structure_excel_1 <- new_file_structure_excel(
      cols = list(
        list(
          name = "city",
          type = "character",
          new_col_meta(
            desc = "National and international city codes",
            values = c("XXXX", "A", NA),
            values_desc = c("4 digit city code", "abroad", "missing")
          )
        ),
        list(
          name = "adult",
          type = "logical",
          new_col_meta(
            desc = "Is the person an adult",
            values = c(TRUE, FALSE, NA),
            values_desc = c("The person is an adult", "The person is a child", "unknown")
          )
        )
      )
    )

If a data file is read with the command read_data(), then all
specified meta information will be attached to the respective columns of the resulting
data.frame.
The meta data stored in a file_structure or a file_definition class object,
or in a data.frame generated by calling read_data(), can be extracted by
calling get_meta().
Example-3:

    df_meta <- get_meta(file_definition_1, cols = c("city", "adult"))

Example-4:

    data <- read_data(file_definition_1)
    df_meta <- get_meta(data, cols = c("city", "adult"))

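
To make the idea of column-level meta data concrete, the following base R sketch stores a description as an attribute on each column and collects the descriptions into one data.frame. Storing meta data in attributes is an assumption made for this sketch, not a statement about how readall does it; `collect_meta()` is a hypothetical helper, and its column names are borrowed from the get_meta() description in this document.

```r
# Attach a description to each column as an attribute (an assumption made for
# this sketch; it is not a statement about how readall stores meta data).
df <- data.frame(
  city = c("1010", "A", NA),
  adult = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)
attr(df$city, "desc") <- "National and international city codes"
attr(df$adult, "desc") <- "Is the person an adult"

# Hypothetical helper: collect the per-column descriptions into one data.frame.
collect_meta <- function(data, cols = names(data)) {
  data.frame(
    col_name = cols,
    col_id = match(cols, names(data)),
    col_type = vapply(cols, function(col) class(data[[col]])[1],
                      character(1), USE.NAMES = FALSE),
    col_desc = vapply(cols, function(col) {
      desc <- attr(data[[col]], "desc")
      if (is.null(desc)) NA_character_ else desc
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

df_meta <- collect_meta(df, cols = c("city", "adult"))
```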
The following section describes the most important functions of readall.
Each function is documented, and its documentation can be
displayed by calling ?readall::FUNCTIONNAME.

file_structure class objects

file_structure class objects contain all information about the file
structure that can be valid for several data files (e.g. column structure,
delimiter symbols, column names, data types etc.). These class objects
can be created with the following commands:
- new_file_structure_fwf(): File structure of FWF data files
- new_file_structure_dsv(): File structure of DSV files
- new_file_structure_excel(): File structure of EXCEL files
- new_file_structure_sas(): File structure of SAS files

Example:

    structure_fwf_1 <- new_file_structure_fwf(
      cols = list(
        list(type = "character", name = "sex", start = 1),
        list(type = "numeric", name = "age", start = 3),
        list(type = "numeric", name = "city", start = 7)
      ),
      sep_width = 1,
      adapters = new_adapters(
        function(x) {
          x$sex <- ifelse(x$sex == "m", "male", "female")
          x
        }
      )
    )

The code above creates a file_structure class object for FWF data files.
These data files contain three columns (sex, age and city codes).
The sex column starts at the first character of each row and contains a single
character. After that, there is a blank space (sep_width = 1), and then
comes the age column, consisting of three characters. After that
comes another blank space, and finally the city column, holding the
city codes. The created file_structure class object contains an
adapter function, which will automatically be executed after reading a
data file with this file_structure object. The adapter function
recodes the sex column to full-length strings.
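
The column positions described above can be checked by hand with `substr()` in base R. The two sample rows are made up, and the five-character width of the city code is an assumption for this sketch, since the text does not state it.

```r
# Two sample rows in the layout described above:
# position 1 = sex, 3-5 = age, 7 onwards = city code.
# A five-character city code is an assumption made for this sketch.
lines <- c("m  25 70123", "f 103 80456")

parsed <- data.frame(
  sex = substr(lines, 1, 1),
  age = as.numeric(substr(lines, 3, 5)),
  city = as.numeric(substr(lines, 7, 11)),
  stringsAsFactors = FALSE
)

# The adapter from the example recodes sex to full-length strings.
parsed$sex <- ifelse(parsed$sex == "m", "male", "female")
```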

file_definition class objects

A file_definition class object contains all information that can also be
stored in file_structure class objects (information about the file structure
which can be valid for all files of the same structure),
but it also contains information that is only valid for a single specific
data file. file_definition class objects can be created with the following
commands:
- new_file_definition(): Takes an existing file_structure class object and
  extends it to a file_definition class object.
  Depending on the file_structure class object, the resulting file_definition
  class object can describe FWF, DSV, EXCEL or SAS data files.
- new_file_definition_fwf(): Creates a file_definition class object for FWF data files
- new_file_definition_dsv(): Creates a file_definition class object for DSV data files
- new_file_definition_excel(): Creates a file_definition class object for EXCEL data files
- new_file_definition_sas(): Creates a file_definition class object for SAS data files

Example:

    file_definition_file_1 <- new_file_definition(
      file_path = "C:/file1.dat",
      file_structure = structure_fwf_1,
      extra_adapters = new_adapters(
        function(x) {
          # `&` compares element-wise; `&&` only works on single values
          x[x$age >= 30 & x$age < 40,]
        }
      )
    )

The created file_definition class object file_definition_file_1
extends the file_structure class object structure_fwf_1, which was defined
earlier.
Therefore, it is a file_definition for an FWF data file, which is
located at C:/file1.dat. After reading this data file, the data set is
automatically filtered by age, such that only persons aged at least 30 and
below 40 are kept.
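
A row filter on an age range needs one TRUE/FALSE value per row, so the two conditions have to be combined with the element-wise operator `&`. The operator `&&` is meant for single values: recent R versions raise an error when it is given longer vectors, and older versions silently used only the first element. A minimal base R illustration with made-up ages:

```r
people <- data.frame(age = c(25, 30, 39, 40))

# Element-wise `&` yields one TRUE/FALSE per row, which is what row
# filtering needs.
keep <- people$age >= 30 & people$age < 40
filtered <- people[keep, , drop = FALSE]
```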

file_collection class objects

A file_collection class object contains a list of file_definition
class objects, describing different data files, whose data should
automatically be concatenated after reading the data files.
A file_collection class object can be created with the command
new_file_collection().
Example:

    file_collection_1 <- new_file_collection(
      file_definition_file_1,
      file_definition_file_2,
      file_definition_file_3,
      cols_keep = c("sex", "age"),
      extra_adapters = new_adapters(
        function(x) {
          x[x$sex == "male",]
        }
      ),
      extra_col_file_path = "file"
    )

The file_collection class object in this example contains the
information for 3 different data files. If read_data() is applied to
file_collection_1, then the following steps are executed automatically:
1. Each data file is read into a data.frame.
2. In each data.frame only the columns sex and age are kept.
3. A column file is added to each data.frame, holding the file path of each data file.
4. The adapter functions are applied to each data.frame.
   For the data file defined in file_definition_file_1, three adapter functions
   were defined. First, the column sex is recoded, then the data.frame is
   filtered by age and finally only entries with sex == "male" are kept.
   For the data files defined in file_definition_file_2 and file_definition_file_3
   we don't know whether adapter functions were defined before
   (when the file_definition class objects were defined), but
   for both data files there is at least one adapter function defined, which
   filters sex == "male".
5. All data.frames are concatenated into a single data.frame.

The resulting data.frame could look something like this:

    data.frame(
      sex = rep("male", 7),
      age = c(16, 67, 31, 84, 47, 33, 98),
      file = c(
        "R:/daten/file1.dat", "R:/daten/file1.dat",
        "R:/daten/file2.sas", "R:/daten/file2.sas",
        "R:/daten/file3.excel", "R:/daten/file3.excel", "R:/daten/file3.excel"
      )
    )

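
The per-file steps listed above (keep columns, add the file path column, apply the adapter, concatenate) can be mimicked in base R on in-memory data. This is a sketch under assumptions: the `inputs` list stands in for data that has already been read, the file names and values are made up, and the order of the steps simply follows the list in the text.

```r
# In-memory stand-ins for the data read from two files (hypothetical paths).
inputs <- list(
  "file1.dat" = data.frame(sex = c("male", "female"), age = c(31, 35),
                           city = c(70123, 80456), stringsAsFactors = FALSE),
  "file2.sas" = data.frame(sex = c("male", "male"), age = c(84, 47),
                           city = c(10115, 20095), stringsAsFactors = FALSE)
)
cols_keep <- c("sex", "age")
keep_male <- function(x) x[x$sex == "male", , drop = FALSE]

processed <- lapply(names(inputs), function(path) {
  df <- inputs[[path]][, cols_keep, drop = FALSE]  # keep selected columns
  df$file <- path                                  # add the file path column
  keep_male(df)                                    # apply the adapter
})
combined <- do.call(rbind, processed)              # concatenate
```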
The following functions can be used to read data files:
- read_data(): Read the FWF, DSV, EXCEL or SAS data file, which was defined
  in the passed-in file_definition class object.
- read_data_fwf(): Read an FWF data file. Instead of passing in a
  file_definition class object holding all needed information,
  the needed file definitions are passed directly to the function.
- read_data_dsv(): Read a DSV data file. Instead of passing in a
  file_definition class object holding all needed information,
  the needed file definitions are passed directly to the function.
- read_data_excel(): Read an EXCEL data file. Instead of passing in a
  file_definition class object holding all needed information,
  the needed file definitions are passed directly to the function.
- read_data_sas(): Read a SAS data file. Instead of passing in a
  file_definition class object holding all needed information,
  the needed file definitions are passed directly to the function.

Example-1:
df1 <- read_data(file_definition_file_1)
Example-2:

    df2 <- read_data_fwf(
      file_path = "R:/daten/file1.dat",
      cols = list(
        list(type = "character", name = "sex", start = 1),
        list(type = "numeric", name = "age", start = 3),
        list(type = "numeric", name = "city", start = 7)
      ),
      sep_width = 1,
      adapters = new_adapters(
        function(x) {
          x$sex <- ifelse(x$sex == "m", "male", "female")
          x
        },
        function(x) {
          # `&` compares element-wise; `&&` only works on single values
          x[x$city >= 70000 & x$city < 80000,]
        }
      ),
      cols_keep = c("sex", "age"),
      extra_col_file_path = "file"
    )

The function read_data() does not only take file_definition class objects,
but also file_collection class objects. By using file_collection class
objects, we can also read multiple data files at once.
Example:
df_gesamt <- read_data(file_collection_1)
The command above reads all 3 data files defined in file_collection_1.
The following steps are executed automatically:
1. Each data file is read into a data.frame.
2. In each data.frame only the columns sex and age are kept.
3. A column file is added to each data.frame, holding the file path of each data file.
4. The adapter functions are applied to each data.frame.
   For the data file defined in file_definition_file_1, three adapter functions
   were defined. First, the column sex is recoded, then the data.frame is
   filtered by age and finally only entries with sex == "male" are kept.
   For the data files defined in file_definition_file_2 and file_definition_file_3
   we don't know whether adapter functions were defined before
   (when the file_definition class objects were defined), but
   for both data files there is at least one adapter function defined, which
   filters sex == "male".
5. All data.frames are concatenated into a single data.frame.

The function get_meta() can extract meta data from:
- file_structure class objects
- file_definition class objects
- file_collection class objects
- data.frames created by calling read_data() or read_data_*()
  (where * stands for fwf/dsv/excel/sas)

The command get_meta() returns a data.frame holding the following columns:
- col_name: A character column, holding the name of each data column
- col_id: A numeric column, holding the position of each data column
- col_type: A character column, holding the data type of each data column
- col_desc: A character column, describing each column in detail
- col_values: A character column, holding a text version of the value levels of each column
- col_values_desc: A character column, describing each value level of each column
- col_valid_start: A character column, giving some information about since when
  the variable has been valid
- col_valid_end: A character column, giving some information about until when
  the variable is valid
- file_path: A character column, holding the file paths of the data files.

file_definition or file_collection objects

Sometimes it is useful to modify some attributes of a file_definition or a
file_collection. The following commands can be useful:
- set_cols_keep(): Set the columns which should be kept when calling read_data()
- set_cols_keep_intersection(): Use the maximal possible set of columns
  for a file_collection. This means that the intersection of the
  column names of each data file is used.
- set_extra_col_file_path(): Set the name of the column in which
  the file path of the data files should be stored.
- set_n_max(): Set the maximum number of rows to be read when calling
  read_data(). If applied to a file_collection, this value is set for
  each file_definition contained in the collection.
- set_adapters(): Set the adapters attribute. This command overwrites all
  existing adapter functions. If applied to a file_collection, this
  command is applied to each file_definition contained in the collection.
- add_adapters(): Append a set of adapter functions to the already defined
  adapter functions. If applied to a file_collection, the adapter functions
  are appended to each file_definition contained in the collection.
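
The intersection of column names mentioned for set_cols_keep_intersection() can be computed in base R with `Reduce()` and `intersect()`. The column names below are made up, and the sketch only shows the set operation, not what readall does internally.

```r
# Column names of three hypothetical data files.
column_names <- list(
  c("sex", "age", "city"),
  c("age", "sex", "income"),
  c("sex", "age")
)

# The largest set of columns available in every file is the intersection.
common_cols <- Reduce(intersect, column_names)
```

Only columns present in every file survive, which is what makes row-wise concatenation of the individual data.frames straightforward.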