knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = nzchar(Sys.getenv("COMPILE_VIG"))
)
Most StatCan tables are small and can easily be processed in memory. However, some tables are so large that this is not a feasible strategy. Table 43-10-0024 is one such example; its CSV file is several gigabytes in size. In cases like this it is more useful to store and access the data in parquet, feather, or SQLite database format using the get_cansim_connection function instead of the usual get_cansim. In these circumstances it is also useful to cache the data for longer than just the current R session, and the data_cache argument allows specifying a permanent location. It defaults to getOption("cansim.cache_path"), and if this option is not set the data will only be cached for the duration of the current session.
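For example, a permanent cache location can be set once (e.g. in your .Rprofile) via the option that the data_cache argument defaults to; the directory path below is just an illustration.

# any writable directory works; get_cansim_connection() picks this up via its data_cache default
options(cansim.cache_path = "~/cansim_cache")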
For this vignette we use the (rather small) motor vehicle sales data as an example.
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(cansim)
One main difference from the get_cansim function is that get_cansim_connection does not return the data itself but only a connection to the database. This allows us to filter the data before fetching it into memory. The package supports three formats with different advantages and disadvantages:
- parquet: The most efficient format for importing and storing data, but a bit slower when filtering and reading data.
- feather: A somewhat less efficient format for importing and storing data, but slightly faster than parquet when filtering and reading data.
- sqlite: A fairly inefficient format for importing and storing data, taking a lot more disk space, time, and a little more memory when importing and indexing, but after importing it allows for very efficient queries and filtering, and additionally allows for database level summary statistics.

By default get_cansim_connection uses the parquet format as the best all-around solution. For tables that update infrequently but are queried often in the meantime, where queries need to be fast and database level statistics may be useful, the sqlite format is likely preferable.
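To illustrate the database level summary statistics mentioned above, here is a minimal sketch with the sqlite format: dplyr verbs on the connection are translated to SQL and evaluated inside the database, and only the summarized result is read into R. The grouping by GEO is purely illustrative.

con <- get_cansim_connection("20-10-0001", format = 'sqlite')
con %>%
  group_by(GEO) %>%
  summarise(records = n()) %>%  # computed inside SQLite, not in R
  collect()
disconnect_cansim_sqlite(con)   # close the connection when done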
When accessing cached data the package automatically checks whether a newer version is available and issues a warning if the cached version is out of date. The auto_update argument can be set to TRUE to automatically refresh the table when needed, or a refresh can be forced manually by setting the refresh argument to TRUE when calling the function.
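For example, using the table from this vignette:

# automatically re-download whenever a newer version of the table is available
connection <- get_cansim_connection("20-10-0001", auto_update = TRUE)

# or force a one-off refresh of the cached table
connection <- get_cansim_connection("20-10-0001", refresh = TRUE)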
If the data is not cached the function will download it first and convert it to the specified format. The package is designed so that the differences between database formats are mostly abstracted away.
connection.parquet <- get_cansim_connection("20-10-0001") # format='parquet' is the default
glimpse(connection.parquet)
connection.feather <- get_cansim_connection("20-10-0001", format='feather')
glimpse(connection.feather)
connection.sqlite <- get_cansim_connection("20-10-0001", format='sqlite')
glimpse(connection.sqlite)
To make good use of the data we will have to look at the metadata and inspect the member columns and variables available.
get_cansim_table_overview("20-10-0001")
This gives us an understanding of the available variables. For the purpose of this vignette we are interested in the breakdown of sales units by Vehicle type in Canada overall. The data is stored in its raw form in the database; the only processing done is that it is augmented by the GeoUID column.
In order to work with the data we need to load it into memory, which is done by calling collect() on the connection object. If we want to make use of the additional metadata and processing the cansim package usually does, and the main operations performed on the connection were filters (rather than renaming or de-selecting columns needed for enriching with metadata), then we can use the custom collect_and_normalize function, which collects the data and at the same time normalizes it so that it appears the same way as if we had used the get_cansim function. This adds the category and hierarchy metadata and the normalized value column. In the case of sqlite connections we might want to pass the disconnect = TRUE argument to collect_and_normalize to close the connection after normalizing the data, or do that manually at a later time via disconnect_cansim_sqlite(connection). This is not required for parquet or feather connections.
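For a sqlite connection this might look like the following sketch (shown for illustration only; in the comparison below we keep the connection open and only disconnect at the end of the vignette). The filter is illustrative.

# collect, normalize, and close the sqlite connection in one step
data_example <- connection.sqlite %>%
  filter(GEO == "Canada") %>%
  collect_and_normalize(disconnect = TRUE)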
The collect_and_normalize()
interface is designed to be used in the same way across database formats. In this comparison we also add the "traditional" get_cansim()
approach that reads the entire table into memory and normalizes the data.
data.parquet <- connection.parquet %>%
  filter(GEO=="Canada", `Seasonal adjustment`=="Unadjusted", Sales=="Units",
         `Origin of manufacture`=="Total, country of manufacture",
         `Vehicle type` %in% c("Passenger cars","Trucks")) %>%
  collect_and_normalize()
data.parquet %>% head()
data.feather <- connection.feather %>%
  filter(GEO=="Canada", `Seasonal adjustment`=="Unadjusted", Sales=="Units",
         `Origin of manufacture`=="Total, country of manufacture",
         `Vehicle type` %in% c("Passenger cars","Trucks")) %>%
  collect_and_normalize()
data.feather %>% head()
data.sqlite <- connection.sqlite %>%
  filter(GEO=="Canada", `Seasonal adjustment`=="Unadjusted", Sales=="Units",
         `Origin of manufacture`=="Total, country of manufacture",
         `Vehicle type` %in% c("Passenger cars","Trucks")) %>%
  collect_and_normalize()
data.sqlite %>% head()
data.memory <- get_cansim("20-10-0001") %>%
  filter(GEO=="Canada", `Seasonal adjustment`=="Unadjusted", Sales=="Units",
         `Origin of manufacture`=="Total, country of manufacture",
         `Vehicle type` %in% c("Passenger cars","Trucks"))
data.memory %>% head()
We note that the syntax, and the resulting data frames, are identical.
With all three data formats producing the same output we can now work with the data as if it had been fetched and subsequently filtered via get_cansim.
Given the data we can further filter the date range and plot it.
data.parquet %>%
  filter(Date >= as.Date("1990-01-01")) %>%
  ggplot(aes(x=Date, y=val_norm, color=`Vehicle type`)) +
  geom_smooth(span=0.2, method='loess', formula=y ~ x) +
  theme(legend.position="bottom") +
  scale_y_continuous(labels=function(d) scales::comma(d, scale=10^-3, suffix="k")) +
  labs(title="Canada new motor vehicle sales", caption="StatCan Table 20-10-0001",
       x=NULL, y="Number of units")
To improve read performance of parquet and feather data one can specify a partitioning argument when calling get_cansim_connection. This will partition the data by the specified columns, which can be useful when filtering by these columns: only the relevant partitions are read, greatly increasing read performance at a slight cost in size on disk. If, for example, a dataset is mostly accessed by filtering on geographic regions, it might be useful to partition by GeoUID, or by the GEO column if querying data by name. More than one partitioning column can be specified, but this is only helpful for very large datasets with a high number of dimensions. A parquet dataset that is partitioned with the subsequent queries in mind is often faster at data retrieval than an indexed SQLite database. The arrow package has more guidance on partitioning and the tradeoffs.
Partitioning happens on initial data import, and changing the partitioning parameter in subsequent calls to get_cansim_connection() won't have any effect, although a warning will be issued if the specified partitioning is non-empty and differs from the initial partitioning. In some cases, for example when doing lots of data queries on the dataset, it might make sense to occasionally change the partitioning of the data in order to optimize read performance. This can be done with the cansim_repartition_cached_table() function, which takes a new_partitioning argument. Repartitioning happens fairly fast, taking up to several seconds on fairly large tables where the original CSV is several gigabytes in size.
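A sketch of repartitioning an already cached table; passing the table number as the first argument and choosing GEO as the new partitioning column are assumptions for illustration.

# switch the cached table to partitioning by GEO if later queries mostly filter by region name
cansim_repartition_cached_table("43-10-0024", new_partitioning = c("GEO"))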
Since we now have the option of a more permanent cache we should take care to manage that space properly. The list_cansim_cached_tables function gives us an overview of the cached data we have.
list_cansim_cached_tables()
If we want to free up disk space we can remove one or more cached tables. The following call will remove all cached "20-10-0001" tables in all formats and languages. Before that we disconnect from the sqlite database.
disconnect_cansim_sqlite(connection.sqlite)
remove_cansim_cached_tables("20-10-0001")