
Tidyverse-Data-Pipeline

This R package extracts data from CSV files, transforms it, and saves the results as separate files. It also allows the user to load the data into a Postgres database.

The package was uploaded to RStudio Cloud. You can find it here: https://rstudio.cloud/project/1516312

The package is already deployed there, so you can run it without any installation.

All you have to do is type this command in its R console:

devtools::load_all(".")

If you'd prefer not to log in to RStudio Cloud, follow the steps below.

1. Local package installation

You can skip this part if you are going to use RStudio Cloud. Otherwise, follow the steps below.

Download and install R

To install the package, you need to have R installed on your machine (https://www.r-project.org/).

Install required R packages

Two additional packages, devtools and roxygen2, are required; you can install them by running this command in bash:

$ R -e "install.packages(c('devtools', 'roxygen2'))"

Install ETLPackage

Clone the Tidyverse-Data-Pipeline repository, for example:

$ git clone https://github.com/vonshick/Tidyverse-Data-Pipeline.git

Go to the source code directory, for example:

$ cd ~/Tidyverse-Data-Pipeline/

Run the package installation:

$ R -e "devtools::install()"

2. Script descriptions

2.1 extract_data_from_files.R

This script extracts data from CSV files and checks if the data is correct (tasks 1 and 2).

I used the readr package to load the data. It automatically infers the schema of a CSV file (including the column types). If any record does not match the inferred schema, a warning is raised. No warnings appeared while loading the three files (all the numeric columns were parsed correctly).
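
As an illustration, loading with readr comes down to calls of this shape. This is only a sketch: the file name below is hypothetical, and the actual paths live in extract_data_from_files.R.

library(readr)

# read_csv() guesses the column types from the data and warns
# about every value that does not match the guessed type
students <- read_csv("data_files/students.csv")  # hypothetical file name

# problems() lists the rows that triggered parsing warnings, if any
problems(students)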

Only one data validation step was added to the script: testing whether the ID is unique in each data set. I didn't see a reason to remove any of the records. For example, I found some special characters in class names, but I was not convinced that this justified dropping the affected values.

In my opinion, the decision to remove some records (or to automate their removal) should be made after a conversation with the data owner / client, who knows their business purpose best.
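
For illustration, the ID uniqueness check mentioned above could look roughly like the sketch below. The function name and the default column name are assumptions made for the example, not the package's actual code.

library(dplyr)

# hypothetical helper: stop if the ID column contains duplicates
check_id_uniqueness <- function(data_set, id_column = "id") {
  if (n_distinct(data_set[[id_column]]) != nrow(data_set)) {
    stop("Column '", id_column, "' is not unique in the data set")
  }
  invisible(TRUE)
}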

I attached the example tests I made during the data exploration in the exploration.R file.

Command to run (in RStudio Cloud):

get_raw_data_sets()

To run the function from bash, you need to wrap it like this:

$ R -e "get_raw_data_sets()"

The same applies to all the examples below.

2.2 transform_data_to_result_dataset.R

This script executes the first script and transforms the data into the desired aggregations (tasks 3 and 4).
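
The exact aggregations are defined in the script itself; purely to illustrate the shape of such a transformation, a tidyverse aggregation might look like the sketch below. The input data and column names are made up for the example.

library(dplyr)

# made-up input purely for illustration
students <- tibble(
  class_id = c(1, 1, 2),
  score    = c(70, 80, 90)
)

# hypothetical aggregation: count students and average their score per class
result <- students %>%
  group_by(class_id) %>%
  summarise(
    student_count = n(),
    average_score = mean(score, na.rm = TRUE),
    .groups = "drop"
  )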

Command to run:

generate_output_data_sets()

2.3 save_output_files.R

The script runs the previous scripts and saves the results to separate CSV files in the data_files directory.
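
Under the hood this presumably comes down to calls along these lines; the object and file names below are assumptions, not the package's actual ones.

library(readr)

# each result table goes to its own CSV file in data_files/
write_csv(result_students, "data_files/result_students.csv")
write_csv(result_classes, "data_files/result_classes.csv")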

Command to run:

save_output_csv_files()

2.4 load_tables_to_db.R

The script connects to a Postgres database and loads the result tables into it. The information necessary to create a connection (database name, host, port, etc.) needs to be provided in the environment_variables.R file.
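
A minimal sketch of such a connection using DBI and RPostgres is shown below. The variable names, the example values, and the table being written are assumptions; in the package the connection parameters come from environment_variables.R.

# environment_variables.R is assumed to define values like these
db_name     <- "etl_db"
db_host     <- "localhost"
db_port     <- 5432
db_user     <- "postgres"
db_password <- "postgres"

library(DBI)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  dbname   = db_name,
  host     = db_host,
  port     = db_port,
  user     = db_user,
  password = db_password
)

# write one of the result tables (hypothetical name) and close the connection
dbWriteTable(con, "result_students", result_students, overwrite = TRUE)
dbDisconnect(con)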

Command to run:

load_tables_to_database()

3. Unit tests

Three simple unit tests were also provided. They check whether the data sets contain the appropriate columns.

Their source code is placed under this path: tests/testthat/test-test_of_column_names_in_data.R

Command to run the tests:

devtools::test()
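
For illustration, a column-name test of this kind could look roughly like the sketch below. The expected columns and the assumption that get_raw_data_sets() returns a list with a students element are made up for the example; the real tests live in the file mentioned above.

library(testthat)

test_that("students data set contains the expected columns", {
  # hypothetical: assumes the loader returns a named list of data frames
  students <- get_raw_data_sets()$students
  expect_true(all(c("id", "name", "class_id") %in% colnames(students)))
})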

