knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(diseasystore)
if (rlang::is_installed("withr")) {
  withr::local_options("tibble.print_min" = 5)
  withr::local_options("tibble.print_max" = 5)
  withr::local_options("diseasystore.verbose" = FALSE)
  withr::local_options("diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000)
} else {
  opts <- options(
    "tibble.print_min" = 5,
    "tibble.print_max" = 5,
    "diseasystore.verbose" = FALSE,
    "diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000
  )
}

# We have a "hard" dependency on duckdb to render parts of this vignette
suggests_available <- rlang::is_installed("duckdb")
not_on_cran <- interactive() || as.logical(Sys.getenv("NOT_CRAN", unset = "false"))
To see the available diseasystores on your system, you can use the available_diseasystores() function.
available_diseasystores()
This function looks for diseasystores on the current search path. By default, this will show the diseasystores bundled with the base package. If you have extended diseasystore with your own diseasystores, or with diseasystores from an external package, attaching that package to your search path will allow them to show up as available.
Note: diseasystores are found if they are defined within packages named diseasystore* and are of the class ?DiseasystoreBase.
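For illustration (the package name below is hypothetical), attaching such an extension package makes its diseasystores discoverable:

# Hypothetical extension package following the diseasystore* naming scheme
library(diseasystoreMyDisease)

# Its diseasystores would now be listed alongside the bundled ones
available_diseasystores()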
Each of these diseasystores may have their own vignette that further details their content, use and/or tips and tricks. This is for example the case with ?DiseasystoreGoogleCovid19.
To use a diseasystore, we first need to do some configuration. The diseasystores are designed to work with a database, where the computed features are stored. Each diseasystore may require individual configuration as listed in its documentation or accompanying vignette.

For this quick start, we will configure a ?DiseasystoreGoogleCovid19 to use a local {duckdb} database.
Ideally, we want to use a faster, more capable database to store the features in. The diseasystores use {SCDB} in the back end and can use any database back end supported by {SCDB}.
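As a sketch of what swapping the back end could look like (assuming the {RPostgres} package is installed and a PostgreSQL server is running; the connection details are placeholders):

# Placeholder connection details; adjust to your own database setup
target_conn <- DBI::dbConnect(
  RPostgres::Postgres(),
  dbname = "diseasystore",
  host = "localhost"
)

ds_postgres <- DiseasystoreGoogleCovid19$new(
  target_conn = target_conn,
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)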
# The files we need are stored remotely in Google's API
google_files <- c("by-age.csv", "demographics.csv", "index.csv", "weather.csv")
remote_conn <- diseasyoption("remote_conn", "DiseasystoreGoogleCovid19")

# In practice, it is best to make a local copy of the data which is
# stored in the "vignette_data" folder.
# This folder can either be in the package folder
# (preferred, please create the folder) or in the tempdir().
local_conn <- purrr::detect(
  "vignette_data",
  checkmate::test_directory_exists,
  .default = tempdir()
)

# Then we download the first n rows of each data set of interest
try({
  purrr::discard(
    google_files,
    ~ checkmate::test_file_exists(file.path(local_conn, .))
  ) |>
    purrr::walk(\(file) {
      paste0(remote_conn, file) |>
        readr::read_csv(n_max = 1000, show_col_types = FALSE, progress = FALSE) |>
        readr::write_csv(file.path(local_conn, file))
    })
})

# Check that the files are available after attempting to download
files_missing <- purrr::some(
  google_files,
  ~ !checkmate::test_file_exists(file.path(local_conn, .))
)
data_available <- !files_missing

ds <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(duckdb::duckdb()),
  source_conn = local_conn,
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)
ds <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(duckdb::duckdb()),
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)
When we create our new diseasystore instance, we also supply start_date and end_date arguments. These are not strictly required, but make getting features for this time interval simpler.
Once configured, we can query the available features in the diseasystore:
ds$available_features
These features can be retrieved individually (using the start_date and end_date we specified during the creation of ds):
ds$get_feature("n_hospital")
Notice that features have associated "key_*" and "valid_from/until" columns. These are used for one of the primary selling points of diseasystore, namely automatic aggregation.
To get features for other time intervals, we can manually supply start_date and/or end_date:
ds$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-02"))
The diseasystore automatically expands the computed features. Say a given "n_hospital" feature has been computed between 2020-03-01 and 2020-03-15. In this case, the call $get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20")) only needs to compute the feature between 2020-03-16 and 2020-03-20.
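For example, once "n_hospital" has been computed up to 2020-03-15, the wider request sketched below (assuming the source data covers the extra days) only triggers computation for the missing period:

# Only 2020-03-16 to 2020-03-20 needs to be computed; the rest is reused
ds$get_feature(
  "n_hospital",
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-20")
)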
By using {SCDB} as the back end, the features are stored even as new data becomes available. This way, we get a time-versioned record of the features provided by diseasystore.
The time stamp that features are computed under is controlled through the slice_ts argument. By default, diseasystores use today's date for this argument. The dynamic expansion of the features described above only applies within any given slice_ts. That is, if a feature has been computed for a time interval on one slice_ts, diseasystore will recompute the feature for any other slice_ts.
This way, feature computation can be incorporated into continuous integration (requesting features will preserve a history of computed features). Furthermore, post-hoc analyses can be performed by computing features as they would have looked on previous dates.
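A minimal sketch, assuming slice_ts can be supplied when the diseasystore is created (consult the class documentation for the exact interface):

# Compute features under an earlier time stamp, e.g. as they would
# have looked on 2020-06-01 (illustrative value)
ds_past <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(duckdb::duckdb()),
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15"),
  slice_ts = "2020-06-01 09:00:00"
)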
The real strength of diseasystore comes from its built-in automatic aggregation. We saw above that the features come with additional associated "key_*" and "valid_from/until" columns. This additional information is used to do automatic aggregation through the ?DiseasystoreBase$key_join_features() method (see the extending-diseasystore vignette for more details).
To use this method, you need to provide the observable that you want to aggregate and the stratification you want to apply to the aggregation. To see which features are considered "observables" and which are considered "stratifications", you can use the included methods:
ds$available_observables
ds$available_stratifications
Let's start with a simple example where we request no stratification (NULL):
ds$key_join_features(observable = "n_hospital", stratification = NULL)
This gives us the same feature information as ds$get_feature("n_hospital") but simplified to give the observable per day (in this case, the number of people hospitalised).
To specify a level of stratification, we need to supply a list of quosures (see help("topic-quosure", package = "rlang")).
ds$key_join_features(observable = "n_hospital", stratification = rlang::quos(country_id))
The stratification argument is very flexible, so we can supply any valid R expression:
ds$key_join_features(
  observable = "n_hospital",
  stratification = rlang::quos(country_id, old = age_group == "90+")
)
Sometimes, it is necessary to clear the computed features from the database. For this purpose, we provide the drop_diseasystore() function. By default, this deletes all stored features in the default diseasystore schema. The function also takes a pattern argument to match tables by and a schema argument to specify the schema to delete from[^1].
SCDB::get_tables(ds$target_conn, show_temporary = FALSE)
drop_diseasystore(conn = ds$target_conn)

SCDB::get_tables(ds$target_conn, show_temporary = FALSE)
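If we only wanted to drop some of the tables, a sketch of the more targeted clean-up using these arguments could look like this (the pattern and schema values are illustrative):

# Drop only tables matching the pattern within the given schema
drop_diseasystore(
  pattern = "n_hospital",
  schema = "ds",
  conn = ds$target_conn
)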
diseasystores have a number of options available to make configuration easier. These options all start with "diseasystore.". To see all options related to diseasystore, we can use the diseasyoption() function without arguments.
diseasyoption()
This returns all options related to diseasystore and its sister package {diseasy}. If you want the options for a specific package, you can use the namespace argument.
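For example, to restrict the list to the diseasystore options (assuming the package name is used as the namespace value):

# Only the options belonging to the diseasystore package
diseasyoption(namespace = "diseasystore")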
Notice that several options are set as empty strings (""). These are treated as NULL by diseasystore[^2].
Importantly, the options are scoped. Consider the above options for "source_conn". Looking at the list of options, we find "diseasystore.source_conn" and "diseasystore.DiseasystoreGoogleCovid19.source_conn". The former is a general setting while the latter is a specific setting for ?DiseasystoreGoogleCovid19. The general setting is used as a fallback if no specific setting is found. This allows you to set a general configuration to use and to overwrite it for specific cases.
To get the option related to a scope, we can use the diseasyoption() function:
diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19")
As we saw in the options, a source_conn option was defined specifically for ?DiseasystoreGoogleCovid19. If we try the same for the hypothetical DiseasystoreDiseaseY, we see that no value is defined, as we have not yet configured the fallback value.
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
If we change our general setting for source_conn and retry, we see that we get the fallback value.
options("diseasystore.source_conn" = file.path("local", "path")) diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
Finally, we can use the .default argument as a final fallback value in case no option is set for either the general or the specific case.
diseasyoption( "non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback" )
[^1]: If using SQLite as the back end, it will instead prepend the schema specification to the pattern before matching (e.g. "ds\..*").
[^2]: R's options() does not allow setting an option to NULL. By setting options as empty strings, the user can see the available options to set.
if (exists("ds")) rm(ds) gc() if (!rlang::is_installed("withr")) { options(opts) }