```r
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(readr)
```
The purpose of this document is to evaluate when, how, and where data is accessed in the code of the Appsilon Marine application.
The GitHub repository contains a folder called "raw_data", which contains the zip file from the specifications. This needs to be unzipped in the same directory.
```r
raw_data_path <- here::here("raw_data", "ships.csv")
if (!file.exists(raw_data_path)) {
  # Unzip
  utils::unzip(here::here("raw_data", "ships_data.zip"), exdir = here::here("raw_data"))
}
```
Initial latency refers to the time it takes to load or connect to the data when the application starts. Our main options are:

- Reading the CSV with `readr::read_csv()`
- Reading the CSV with `data.table::fread()`
- Reading a pre-saved RDS file
- Connecting to an SQLite database
- Shipping a pre-calculated dataset inside the package
We are measuring load times on a laptop, which is significantly faster than the standard setup on shinyapps.io; however, for comparative purposes, and in the interest of time, we will use this as our baseline.
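A single `system.time()` call can be noisy, so a more robust baseline could average several iterations. A minimal sketch using the `bench` package (an added dependency, not used elsewhere in this document):

```r
# Sketch: average read timings over several iterations with bench::mark().
# Assumes `raw_data_path` has been defined as above.
library(bench)

timings <- bench::mark(
  readr = readr::read_csv(raw_data_path, col_types = readr::cols()),
  fread = data.table::fread(raw_data_path),
  iterations = 5,
  check = FALSE  # the two calls return different classes (tibble vs data.table)
)
timings[, c("expression", "median", "mem_alloc")]
```

The `check = FALSE` argument is needed because `bench::mark()` otherwise insists that all expressions return identical results.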
```r
system.time(
  df_csv_with_readr <- readr::read_csv(raw_data_path, col_types = cols())
)
```
This approach will clearly add significant latency.
```r
system.time(
  df_csv_with_data_table <- data.table::fread(raw_data_path)
)
```
This is significantly faster in terms of "elapsed" time; however, data.table takes advantage of underlying multithreaded code, which may not be available on a standard shinyapps.io Linux box.
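To approximate a single-core shinyapps.io instance locally, data.table's thread count can be capped before timing; `setDTthreads()` and `getDTthreads()` are part of the data.table API. A sketch:

```r
# Limit data.table to a single thread to mimic a single-core server,
# then repeat the timing for a fairer comparison.
old_threads <- data.table::getDTthreads()
data.table::setDTthreads(1)

system.time(
  df_csv_single_thread <- data.table::fread(raw_data_path)
)

data.table::setDTthreads(old_threads)  # restore the previous setting
```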
```r
tempfile1 <- tempfile()
tempfile2 <- tempfile()
readr::write_rds(df_csv_with_readr, file = tempfile1)
readr::write_rds(df_csv_with_data_table, file = tempfile2)
system.time(readr::read_rds(tempfile1))
system.time(readr::read_rds(tempfile2))
```
Reading the raw data from an RDS file cuts initial latency roughly in half.
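Note that `readr::write_rds()` writes uncompressed by default; compression trades a smaller file on disk for extra read time, which may matter for a shinyapps.io deployment. A sketch of the trade-off:

```r
# Compare uncompressed vs gzip-compressed RDS (both supported by readr).
tmp_plain <- tempfile()
tmp_gz    <- tempfile()
readr::write_rds(df_csv_with_readr, file = tmp_plain)
readr::write_rds(df_csv_with_readr, file = tmp_gz, compress = "gz")

file.size(tmp_plain)  # larger file on disk
file.size(tmp_gz)     # smaller file on disk
system.time(readr::read_rds(tmp_plain))
system.time(readr::read_rds(tmp_gz))
```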
For this test we are assuming SQLite, since it is a single file and easily deployed to shinyapps.io. There are two caveats when using SQLite:
```r
tempfile3 <- tempfile(fileext = ".sqlite")
system.time({
  library(RSQLite, quietly = TRUE)
  library(DBI)
})
con <- dbConnect(RSQLite::SQLite(), tempfile3)
dbWriteTable(
  con, "ships",
  df_csv_with_readr %>% rename(port2 = port) # To avoid duplicate column names
)
system.time(
  con <- dbConnect(RSQLite::SQLite(), tempfile3)
)
```
This approach results in essentially zero latency. However, two packages (DBI and RSQLite) need to be loaded, which adds back some latency.
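One advantage of the SQLite route not measured above is lazy access: with dbplyr installed, queries are translated to SQL and executed inside the database, so only the needed rows ever reach R. A sketch, assuming the connection `con` and table `ships` created above (the `ship_type` column name is a placeholder, not taken from the actual data):

```r
# Lazy table reference: no data is read until collect() is called.
library(dplyr)

ships_db <- dplyr::tbl(con, "ships")

# The filter is translated to SQL and run inside SQLite;
# only the matching rows are transferred into R.
ships_db %>%
  dplyr::filter(ship_type == 7) %>%  # `ship_type` is a hypothetical column
  dplyr::collect()
```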
```r
system.time(
  df_pre_calculated <- readr::read_rds(
    system.file("data/ships.RDS", package = "appsilonmarine")
  )
)
```
Again, near zero latency, and no additional packages to load.
For the purposes of this exercise we will use the pre-calculation approach. We recognize that this may not be an ideal solution for a production environment, since we will surely be dealing with new incoming data on a continuous basis. However, that would necessitate a database back-end and a pipeline for data cleansing, both of which are beyond the scope of the current requirement.
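For reference, the pre-calculated object read above via `system.file()` would be produced once at package build time by saving the cleaned data into `inst/data/`, which is installed as `data/` inside the package. A sketch (`df_cleaned` is a hypothetical name for the cleaned dataset; the cleaning itself is out of scope):

```r
# Run once at package build time, from the package root.
# `inst/data/` is copied to `data/` on installation, so the file is later
# retrievable with system.file("data/ships.RDS", package = "appsilonmarine").
dir.create(file.path("inst", "data"), recursive = TRUE, showWarnings = FALSE)
saveRDS(df_cleaned, file = file.path("inst", "data", "ships.RDS"))
```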
```r
format(object.size(df_csv_with_readr), units = "Mb")
format(object.size(df_csv_with_data_table), units = "Mb")
format(object.size(df_pre_calculated), units = "Mb")
```
Again, data.table is somewhat more memory-efficient than the tibble (likely due to its use of references), but the pre-calculated data, not surprisingly, takes up a small fraction of the RAM occupied by the alternative data structures. In any case, all of the numbers reported are small enough not to raise concerns about the application running out of resources, even on the smallest available shinyapps.io instance.
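`object.size()` can miscount objects that share memory, such as the reference-based structures noted above; `lobstr::obj_size()` accounts for shared components and may give a fairer reading. A sketch, adding `lobstr` as a dependency:

```r
# lobstr::obj_size() follows references and counts shared memory once.
library(lobstr)

lobstr::obj_size(df_csv_with_readr)
lobstr::obj_size(df_csv_with_data_table)
lobstr::obj_size(df_pre_calculated)

# Sizing multiple objects together reveals how much memory they share:
lobstr::obj_size(df_csv_with_readr, df_csv_with_data_table)
```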
We will deal with the actual speed of calculations in the post-hoc analysis.