README.md
In joshua-ruf/fidelis: Automate Data Reporting

fidelis

Automate data reporting with this R package!

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("joshua-ruf/fidelis",
                         build_vignettes = T)

Follow these steps to get R and the fidelis package installed on Windows:

First, download and install the most recent version of R from CRAN. As of 7/12/2019 that is 3.6.1 – if R is already installed run R.version.string to check your version.

Second, install RStudio. This is not strictly necessary, but the fidelis package makes use of the rstudioapi package for handling passwords, so it’s required to use some functions.

Third, Windows users need to install Rtools before installing the fidelis package. The installr package will handle the Rtools installation, simply run the code below and accept the default options.

install.packages('devtools')
# install devtools

if(Sys.info()['sysname'] == 'Windows' && !devtools::find_rtools()){
  install.packages('installr');
  installr::install.URL('https://cran.r-project.org/bin/windows/Rtools/Rtools35.exe')
  }
# install Rtools if not already installed; as of 7/12/2019 3.5 is the recommended version

Fourth, install fidelis from Github.

devtools::install_github('joshua-ruf/fidelis',
                         dependencies = T,
                         build_vignettes = T)

Fifth, install tidyverse and openxlsx. This is optional, but both are highly useful packages.

install.packages(c('tidyverse', 'openxlsx'))

At the heart of the fidelis package is really just an efficient integration of R with Greenplum/PostgreSQL. This section will serve as the Hello World! to just some of these functions. Throughout this tutorial I frequently make use of %>% from the magrittr package. This pipe allows us to split code onto multiple lines, making it easier to read (and also write). For example, sum(x, y) returns the same value as x %>% sum(y).

We’ll begin by creating a data.frame and uploading it to the database.

library(data.table) #data manipulation
library(magrittr) #pipes
library(dplyr) #data manipulation
library(formattable) #number formatting/coloring
library(knitr) #basic table creation
library(kableExtra) #table formatting
library(fidelis)

df <- expand.grid(stringsAsFactors = F,

  date = seq.Date(lubridate::round_date(Sys.Date(), 'month') - months(23),
                  lubridate::round_date(Sys.Date(), 'month'),
                  by = 'month'),
  product = LETTERS[1:6],
  region = letters[1:4]
) # 24 months of data for 6 products in 4 regions

n <- nrow(df) # number of rows in df

df <- df %>%
  data.table::as.data.table() %>%
  .[, `:=`(spend = runif(n = n)*1000,
           members = sample(100:200, size = n, replace = T))]
# add 2 new columns of random data, spend and members

fidelis::greenplum_connect(ask=F)
# create a greenplum connection object, note the result does not need to be assigned to an object
# the option ask=F tells the function not to ask for input but instead to look for a secret password
# see fidelis::create_secret() for more information.

fidelis::send_to_database(df, name = 'temptable')
#> <PostgreSQLResult>
# send df to the greenplum database as a temp table
# note: if name began with "sandbox." then the table would be permenant instead of temporary

fidelis::query("select * from temptable limit 5;")
#>         date product region      spend members
#> 1 2018-01-01       C      a 189.051225     122
#> 2 2019-05-01       E      a 672.625337     104
#> 3 2019-06-01       C      b 485.056984     129
#> 4 2018-10-01       F      b 869.956733     119
#> 5 2018-02-01       C      c   1.482337     167
# to run a SQL query pass sql code as a string through fidelis::query()
# assign to R object to save results

The query() function also has the ability to execute dynamic SQL queries. Suppose we want to run an analysis on only a couple of products in a certain region, we can create R vectors with that information and add those as options to the query() function.

products <- c('A', 'B')
regions <- 'd'

df_limited <- fidelis::query(

  "
  select *
  from temptable
  where product in %product%
    and region = %region%
  ;
  ",
  product = products,
  region = regions
)
# note how dynamic inputs are added:
# they must be named after the SQL text and surrounded by "%" in the SQL code
# see `?fidelis::query` for more information

unique(df_limited[, c('product', 'region')])
#>   product region
#> 1       A      d
#> 2       B      d
# test to make sure the dynamic where clause was successful!

The dynamic queries are especially useful if we are interested in working with dates. Suppose now that we are interested in comparing the last 3 months of 2017 to the last three months of 2018. The fidelis functions set_dates() and usdate() will help us prepare these inputs.

final_month <- as.Date('2018-12-01') # specify the final month we're interested in
rolling_window <- 3 # number of months to compare across years

dates_df <- final_month %>%
  fidelis::set_dates()

dates_df
#>         effper     col    col6    col9   col12
#>  1: 2018-12-01 current current current current
#>  2: 2018-11-01 current current current current
#>  3: 2018-10-01 current current current current
#>  4: 2018-09-01    <NA> current current current
#>  5: 2018-08-01    <NA> current current current
#>  6: 2018-07-01    <NA> current current current
#>  7: 2018-06-01    <NA>    <NA> current current
#>  8: 2018-05-01    <NA>    <NA> current current
#>  9: 2018-04-01    <NA>    <NA> current current
#> 10: 2018-03-01    <NA>    <NA>    <NA> current
#> 11: 2018-02-01    <NA>    <NA>    <NA> current
#> 12: 2018-01-01    <NA>    <NA>    <NA> current
#> 13: 2017-12-01   prior   prior   prior   prior
#> 14: 2017-11-01   prior   prior   prior   prior
#> 15: 2017-10-01   prior   prior   prior   prior
#> 16: 2017-09-01    <NA>   prior   prior   prior
#> 17: 2017-08-01    <NA>   prior   prior   prior
#> 18: 2017-07-01    <NA>   prior   prior   prior
#> 19: 2017-06-01    <NA>    <NA>   prior   prior
#> 20: 2017-05-01    <NA>    <NA>   prior   prior
#> 21: 2017-04-01    <NA>    <NA>   prior   prior
#> 22: 2017-03-01    <NA>    <NA>    <NA>   prior
#> 23: 2017-02-01    <NA>    <NA>    <NA>   prior
#> 24: 2017-01-01    <NA>    <NA>    <NA>   prior
#>         effper     col    col6    col9   col12

dates_df is a data.table that assigns each month for the last 24 months to one of three categories: ‘current’ period, ‘prior’ period, and NA (neither). The four col columns differentiate between the number of months to include in the rolling window.

dates_df <- dates_df[, c(1, 1 + rolling_window/3), with = F] %>%
  data.table::setnames(names(.), c('effper', 'col'))
# get the number of months rolling window based on the rolling_window object defined above.

current <- dates_df[col == 'current', effper] %>%
  fidelis::usdate()

prior <- dates_df[col == 'prior', effper] %>%
  fidelis::usdate()
# get only months that belong to the current and prior periods respectively
# also convert them to SQL friendly format with usdate()

Finally, we’ll run the query, calculating the sum of spend and members, by period, product, and region. Since Greenplum is a distributed system most aggregations will be faster in Greenplum than R, especially as the size of the data increases. Bottom line: do as much of our calculations in SQL as you can.

df_aggregated <- fidelis::query(

  "
  select case when date in %current% then 'current'
    else 'prior'
    end as col
    ,product
    ,region
    ,sum(spend) as spend
    ,sum(members) as members
  from temptable
  where (date in %current% or date in %prior%)
  group by 1,2,3
  ;
  ",
  current = current,
  prior = prior

)

head(df_aggregated, 10)
#>        col product region     spend members
#> 1  current       F      b 1608.1111     484
#> 2  current       F      c  713.7015     412
#> 3    prior       F      d 2491.9417     455
#> 4    prior       A      a 2319.2867     456
#> 5  current       F      a 1592.5014     413
#> 6    prior       A      c 1950.5786     449
#> 7    prior       A      b 2123.0189     449
#> 8    prior       F      c 1410.5791     402
#> 9  current       F      d 1880.2631     449
#> 10   prior       F      a 1302.8812     397

Great! One major limitation of this approach is that error handling can be quite difficult; a query will return an error but often the output does not fully display the error message. This problem lead to madlib(), a function that prepares the dynamic query but rather than send it to the Greenplum database, it prints to console. This makes it possible to copy and paste the query into another program for more thorough debugging.

fidelis::madlib(

  "
  select case when date in %current% then 'current'
    else 'prior'
    end as col
    ,product
    ,region
    ,sum(spend) as spend
    ,sum(members) as members
  from temptable
  where (date in %current% or date in %prior%)
  group by 1,2,3
  ;
  ",
  current = current,
  prior = prior

)
#> 
#>   select case when date in ('12/01/2018','11/01/2018','10/01/2018') then 'current'
#>     else 'prior'
#>     end as col
#>     ,product
#>     ,region
#>     ,sum(spend) as spend
#>     ,sum(members) as members
#>   from temptable
#>   where (date in ('12/01/2018','11/01/2018','10/01/2018') or date in ('12/01/2017','11/01/2017','10/01/2017'))
#>   group by 1,2,3
#>   ;
#>

Fortunately, this query runs with no errors! Now let’s create a summary table using formattable, knitr, kableExtra, and fidelis. A great reference was made by Hao Zhu.

data.table::setDT(df_aggregated)
# set as data.table

df_aggregated[, spend_per_member := spend/members]
# create new spend_per_member column

df_wide <- fidelis::add_var_cols2(df_aggregated, vars = c('spend_per_member'))
# reshape long to wide and add var/var_pct column for specified variable

for(n in fidelis::names_with(names(df_wide), with = 'spend', without = 'pct')){
  data.table::set(df_wide, j = n, value = formattable::currency(df_wide[[n]]))
} #format spend variables (except spend_per_member_var_pct) as currency

Be sure to specify results='asis' as an option in the following Rmarkdown chunk for best results.

left_cols <- 2 #specify how manyb left most columns should be left aligned (the rest are right algined)

df_wide %>%
  dplyr::mutate(
      spend_per_member_prior = formattable::color_bar('lightblue')(spend_per_member_prior),
      spend_per_member_current = formattable::color_bar('lightblue')(spend_per_member_current),
      spend_per_member_var = fidelis::red_green_tiles.r(spend_per_member_var),
      spend_per_member_var_pct = fidelis::red_green_text.r(spend_per_member_var_pct)
      ) %>% #apply colouring
  knitr::kable(
      escape = F, #necessary for html formatting to be read as html
      format = 'html',
      table.attr = "style = \"color: black;\"", #change font from grey to black
      col.names = c('Product', 'Region',
                    'Prior', 'Current', 'Var', 'Var (%)'),
      align = c(rep('l', left_cols),  rep('r', ncol(.) - left_cols)) #align columns left or right
      ) %>% 
  kableExtra::kable_styling(
      full_width = F,
      protect_latex = F, # useful for coloring formatting as well
      position = "left",
      bootstrap_options = c('condensed', 'hover')
      ) %>%
  kableExtra::column_spec(1, bold = T) %>%
  kableExtra::add_header_above(c(' ' = 2, 'Spend/ Members' = 4)) %>%
  kableExtra::collapse_rows(columns = 1, valign = 'top') %>%
  kableExtra::footnote(general = "Cool eh?") %>%
  htmltools::knit_print.html() #this is necessary for use with with results='asis'

To view all fidelis functions, simply run help(package = 'fidelis'). To learn more about a specific function run ? followed by the function name (i.e. ?currency.auto).

Included in the fidelis package are two Rmarkdown templates that you can use to make html reports, dashboard and fidelis_toc. They can easily be accessed in RStudio by going to: File >> New File >> R Markdown… >> From Template Once there, choose the template you’d like to use, optionally change the name and directory and then click OK.

The dashboard template creates a data dashboard using the flexdashboard package. A basic report might look like this:

The fidelis_toc template creates a report with a floating table of contents. It’s a nice alternative to regular Rmarkdown tabs, especially when a report needs many tabs. Without modification the template would create an html report like this:

I’ve written a brief tutorial outlining how to use R in tandem with Excel and Greenplum. Run browseVignettes('fidelis') to view the html output, as well as the source code if desired.

joshua-ruf/fidelis documentation built on July 20, 2019, 1:56 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com