This R package extracts data from CSV files, transforms it, and saves the results as separate files. It also allows the user to load the data into a Postgres database.
The package was uploaded to RStudio Cloud. You can find it here: https://rstudio.cloud/project/1516312
The package is already deployed there, so you can run it without any installation.
All you have to do is type this command in its R console:
devtools::load_all(".")
You can skip this part if you are going to use RStudio Cloud. Otherwise, follow the steps below.
To install the package, you need R installed on your machine (https://www.r-project.org/).
Two more packages, devtools and roxygen2, are also required; you can install them by typing this command in bash:
$ R -e "install.packages(c('devtools', 'roxygen2'))"
Clone the Tidyverse-Data-Pipeline repository, for example:
$ git clone https://github.com/vonshick/Tidyverse-Data-Pipeline.git
Go to the source code directory, for example:
$ cd ~/Tidyverse-Data-Pipeline/
Run the package installation:
$ R -e "devtools::install()"
This script extracts data from CSV files and checks whether the data is correct (tasks 1 and 2).
I used the readr package to load the data. It automatically infers the schema of a CSV file, including the column types. If any record does not match the inferred pattern, readr emits a warning. No warnings appeared when loading the three files (all the numeric columns were parsed correctly).
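As an illustration of that behaviour (not code from this package), here is a minimal readr sketch; the file name is a made-up example:

library(readr)

# readr infers column types automatically while reading
users <- read_csv("users.csv")  # "users.csv" is a hypothetical file name
spec(users)      # inspect the inferred column specification
problems(users)  # list any records that failed to parse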
Only one data validation step was added to the script: testing whether the ID is unique in each data set. I didn't see a reason to remove any of the records. For example, I found some special characters in class names, but I was not convinced that this justified dropping the affected values.
In my opinion, the decision to remove records (or to automate their removal) should be made after a conversation with the data owner or client, who knows their business purpose best.
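For reference, here is a hedged sketch of the kind of uniqueness check described above; the function name and the default column name are my assumptions, not the package's actual code:

# Stop with an informative error if the ID column contains duplicates.
check_unique_ids <- function(df, id_col = "id") {
  dup <- duplicated(df[[id_col]])
  if (any(dup)) {
    stop(sprintf("%d duplicated values in column '%s'", sum(dup), id_col))
  }
  invisible(TRUE)
}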
The example tests I made during the data exploration are attached in the exploration.R file.
Command to run (in RStudio Cloud):
get_raw_data_sets()
To run the function from bash, type it this way:
$ R -e "get_raw_data_sets()"
Similarly for all the examples below.
This script executes the first script and transforms the data into the desired aggregations (tasks 3 and 4).
Command to run:
generate_output_data_sets()
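The actual aggregations are implemented inside generate_output_data_sets(); the snippet below only illustrates the general dplyr pattern, with a hypothetical grouping column:

library(dplyr)

# Count records per class; "class" is an illustrative column name.
aggregate_example <- function(df) {
  df %>%
    group_by(class) %>%
    summarise(n = n(), .groups = "drop")
}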
The script runs the previous scripts and saves the results to separate CSV files in the data_files directory.
Command to run:
save_output_csv_files()
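A minimal sketch of how such a save step could look, assuming the results are kept in a named list of data frames (the helper name and structure are my assumptions):

library(readr)

# Write each data frame in the list to data_files/<name>.csv.
save_csv_example <- function(results, dir = "data_files") {
  dir.create(dir, showWarnings = FALSE)
  for (name in names(results)) {
    write_csv(results[[name]], file.path(dir, paste0(name, ".csv")))
  }
}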
The script connects to a Postgres database and loads the result tables into it.
The information necessary to create a connection (database name, host, port, etc.) needs to be provided in the environment_variables.R file.
Command to run:
load_tables_to_database()
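For context, this is the usual DBI/RPostgres connection pattern; the environment variable names below are illustrative and may differ from the ones defined in environment_variables.R:

library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  dbname   = Sys.getenv("DB_NAME"),    # illustrative variable names
  host     = Sys.getenv("DB_HOST"),
  port     = as.integer(Sys.getenv("DB_PORT", "5432")),
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASSWORD")
)
dbWriteTable(con, "example_table", mtcars, overwrite = TRUE)  # sample table
dbDisconnect(con)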
Three simple unit tests were also provided. They check whether the data sets contain the appropriate columns.
Their source code is placed under this path:
tests/testthat/test-test_of_column_names_in_data.R
Command to run the tests:
devtools::test()
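As an illustration of what such a test looks like (the data set and expected columns here are made up, not the package's real ones):

library(testthat)

test_that("data set contains the expected columns", {
  df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # toy data
  expect_true(all(c("id", "name") %in% colnames(df)))
})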