stoner
is a package to help with various tasks involving VIMC
touchstones. Its purpose is evolving somewhat with the needs of the VIMC
project; stoner is becoming an umbrella to keep these needs expressed in
a tested package. As such, it can be used in a number of modes.
Creation of touchstones is quite a common process, and new touchstones will often be based on a previous one. However, creating a touchstone involves additions to various related tables, so the code to create touchstones is not always trivial to review.

Dettl has helped here, encouraging separation of the extract, transform and load stages of an import, with testing of each stage, forcing the code for touchstone creation to be written in a way that separates those concerns and makes reviewing easier. Furthermore, it has often been possible to review a new import as a diff against a previously reviewed import.

Stoner takes this a step further by allowing touchstone creation to be expressed in csv meta-data, and providing function calls for the extract, transform and load stages.
The code for a stoner touchstone import is very simple. Dettl requires that we write extract, transform, and load functions, together with tests. So we create a dettl import as usual (see `dettl::dettl_new`), which begins a new import in our imports repo.

Dettl requires us to write various functions, which we can satisfy with single-line functions for a start.
| Dettl function | Stoner call |
|---|---|
| `extract(con)` | `stoner::stone_extract('.', con)` |
| `test-extract(extracted_data)` | `stoner::stone_test_extract(extracted_data)` |
| `transform(extracted_data)` | `stoner::stone_transform(extracted_data)` |
| `test-transform(transformed_data)` | `stoner::stone_test_transform(transformed_data)` |
| `load(transformed_data, con)` | `stoner::stone_load(transformed_data, con)` |
So for the minimal example, when writing the dettl import, delegate each of dettl's functions to the stoner handlers, passing the same arguments.
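As a sketch (the exact file layout should follow whatever `dettl::dettl_new` generates for your version of dettl; the function names and arguments here are those from the table above), the import code can simply delegate to stoner:

```R
# Minimal stoner-based dettl import: every stage delegates to stoner.
extract <- function(con) {
  stoner::stone_extract(".", con)   # reads the csv files in ./meta
}

transform <- function(extracted_data) {
  stoner::stone_transform(extracted_data)
}

load <- function(transformed_data, con) {
  stoner::stone_load(transformed_data, con)
}

# The corresponding tests call stoner's built-in checks.
test_extract <- function(extracted_data) {
  stoner::stone_test_extract(extracted_data)
}

test_transform <- function(transformed_data) {
  stoner::stone_test_transform(transformed_data)
}
```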
The minimal example on its own will do nothing and exit cleanly. To make stoner do some useful work, we write csv files in a folder called `meta` within the import folder. These csv files should be thought of as a specification of "how you would like things to be", rather than "what you want stoner to do". If rows in your csv file exist identically in the database, stoner will use the existing ones and not add duplicates. If the rows in your csv are provably absent from the database, stoner will add new ones.

If stoner detects that the items already exist, but not all of the csv data matches those items, then other factors come into play that determine whether stoner can update the database content or not. Imports are normally incremental additions to the database, but on some occasions it is useful to be able to make in-place edits - for example, to touchstones that are still in preparation.

The following are the csv files that stoner will recognise, their columns and formats, and notes on the requirements. Any failure to meet the requirements will cause the import to abort with an error message.

You do not have to provide all of the csvs, only those where you expect something to change, but you may find it good practice to "over-specify", since the result is that Stoner will check that the database tables are all as you expect. It may also be helpful to be able to compare complete touchstone definitions (ie, sets of stoner csv files) as a diff between two imports.
A `touchstone_name` refers to a broad ensemble of closely related runs; there will be at least one version for each `touchstone_name`, and it is the specific version that we colloquially refer to as 'a touchstone'.
knitr::kable(data_frame( `Column` = c("id", "description", "comment"), `Example` = c("`201910gavi`", "`October 2019 touchstone`", "`Standard GAVI`")))
* If the `id` is not found in the `touchstone_name` db table, then the row is added.
* If the `id` is found, and `description` and `comment` match, then the row in the csv is ignored.
* If a `touchstone_name` with that `id` exists, but `description` and/or `comment` differ from the original, then the fields in the database are updated in-place, provided that all touchstone versions using that `touchstone_name` (if any) are still `in-preparation`.
* `description` and `comment` must be non-empty. Conventionally, `description` has been used to describe the date and basic purpose, and `comment` for any further detail required.

A `touchstone` is a particular version of a `touchstone_name`, and is the basic unit of currency for touchstones in Montagu. Coverage, expectations and burden estimates are all attached to one of these versioned touchstones.
knitr::kable(data_frame( `Column` = c("id", "touchstone_name", "version", "status", "description", "comment"), `Example` = c("`201910gavi-1`", "`201910gavi`", "`1`", "`in-preparation`, `open` or `finished`", "`201910gavi (version 1)`", "`GAVI Version 1`")))
* `id` must be in the form `touchstone_name-version`.
* `touchstone_name` must match a `touchstone_name` id, either in the database, or in the accompanying `touchstone_name.csv`.
* If the `id` is not found, then stoner will add the new row.
* If the `id` is found and all other columns match, stoner ignores it.
* If the `id` is found, but any other column differs, then stoner will update the fields in the existing touchstone only if its status is `in-preparation`. Otherwise, it will fail with an error.
* `description` and `comment` must be non-empty. Typically, `description` has been very minimal, noting the existence of the touchstone, and `comment` records more details about why the touchstone version exists.

The `touchstone_country` table in the database should really be called `touchstone_country_disease`. For a given touchstone, it records which countries should be returned when groups download their demographic data. This might differ from the countries a group is expected to model for a certain touchstone; see the responsibilities.csv section for that.
knitr::kable(data_frame( `Column` = c("touchstone", "disease", "country"), `Example` = c("`201910gavi-1`", "`Measles;MenA`", "`AFG;BEN;IND;ZWE`")))
* The `touchstone` must exist in the database already, or in the `touchstone.csv` file included in your import.
* Each `disease` (semi-colon separated) must match the `id` field of the `disease` table. Stoner cannot currently add new diseases.
* Each `country` (semi-colon separated) must match the `id` column of the `country` table. Stoner cannot currently add new countries, but the `country` table should be complete.

The `touchstone_demographic_dataset` table determines which `demographic_statistic_type`s from which `demographic_source` will be used when providing demographic data for a particular touchstone. Generally, there will be a new demographic source each year, when either the IGME child mortality data, or the UNWPP population data, or both, get updated. Because these updates happen at different times (UNWPP every two years, and IGME yearly), sometimes a touchstone_demographic_dataset might incorporate fields from different sources, hence this table.
knitr::kable(data_frame( `Column` = c("demographic_source", "demographic_statistic_type", "touchstone"), `Example` = c("`dds-201910_2`", "`int_pop`", "`201910gavi-1`")))
* `demographic_statistic_type` and `demographic_source` are strings that must exist in the `code` column of the respective database tables.
* The `touchstone` must exist in the database already, or in the `touchstone.csv` file included in your import.

The scenario_type.csv file has the following columns:

| Column | Example |
|---|---|
| id | `stop` |
| name | `VIMC stop scenario` |
* If the `id` is not found in the database table, then new rows will be added.
* If the `id` is found in the database table, and the `name` matches that in your csv file, then the row is ignored.
* If the `id` exists, but the `name` differs, then:
  * if no scenario_description refers to this scenario_type `id`, the `name` in the database table for this id will be updated.
  * if a scenario_description does refer to this `id`, then stoner looks up the status of any touchstones that refer to that scenario_description, and will only perform the update if all are in the `in-preparation` state.

To override the `in-preparation` requirement, in the load phase of the import, use:

```R
stoner::stone_load(transformed_data, con, allow_overwrite_scenario_type = TRUE)
```
knitr::kable(data_frame( `Column` = c("id", "description", "disease", "scenario_type"), `Example` = c("`mena-routine-no-vaccination`", "`Description free text`", "`MenA`", "`stop`")))
* `id` has conventionally been in lower-case, and in the form `disease-coverage_type`.
* If the `id` does not exist in the database, stoner will add the new scenario_description.
* If the `id` exists, and all other columns match the existing values too, then the row is ignored.
* If the `id` exists, but other columns differ, then: if a `scenario` exists that refers to this `scenario_description`, and the touchstone associated with that scenario is not in the `in-preparation` state, the import fails with an error.

However, on occasion it has been desirable to change the description of a scenario while a touchstone referring to it is open. To override the `in-preparation` requirement, in the load phase of the import, use:

```R
stoner::stone_load(transformed_data, con, allow_overwrite_scenario_description = TRUE)
```
* `disease` must currently be one of `Cholera`, `HepB`, `Hib`, `HPV`, `JE`, `Measles`, `MenA`, `PCV`, `Rota`, `Rubella`, `Typhoid` or `YF`. These match the `id` column of the `disease` table. Stoner cannot currently add new diseases; this is an admin task done separately.
* `scenario_type` must be the `id` of a `scenario_type`, either in the database table, or in a `scenario_type.csv` as part of your import.

Most of the work for implementing a touchstone is done in responsibilities.csv, in which we add the scenario, responsibility and expectations (including countries and outcomes) that form the tasks different groups must perform.

| Column | Example |
|---|---|
| modelling_group | `IC-Hallet` |
| disease | `HepB` |
| touchstone | `201910gavi-1` |
| scenario | `hepb-no-vaccination;hepb-bd-routine-bestcase` |
| scenario_type | `standard` |
| age_min_inclusive | `0` |
| age_max_inclusive | `99` |
| cohort_min_inclusive | `1901` |
| cohort_max_inclusive | `2100` |
| year_min_inclusive | `2000` |
| year_max_inclusive | `2100` |
| countries | `AFG;BEN;COD` |
| outcomes | `dalys;deaths;cases` |
* `modelling_group` must match an `id` of the `modelling_group` table. Stoner can't add new modelling groups.
* `disease` must match an `id` of the `disease` table. Stoner can't add new diseases either.
* `touchstone` must exist either in the `touchstone` table, or in the `touchstone.csv` as part of your import.
* `scenario` here is a semi-colon separated list of scenarios - which are actually `scenario_description` ids. The matching description must exist in either the `scenario_description` table, or `scenario_description.csv` in your import.
* Each country in the semi-colon separated `countries` list must match the `id` column of the `country` table. Stoner cannot currently add new countries, but the `country` table should be complete.
* Each outcome in the `outcomes` list must match the `code` column of the `burden_outcome` table. Stoner cannot currently add new burden outcomes.

The responsibilities.csv file may cause changes to the `scenario`, `responsibility_set`, `responsibility`, `burden_estimate_expectation`, `burden_estimate_country_expectation` and `burden_estimate_outcome_expectation` tables. Where possible, existing rows are re-used, rather than creating duplicates.
* A `scenario` is defined by `scenario_description` and `touchstone`. If combinations of those exist in responsibilities.csv that aren't in the `scenario` database table, then new rows get added, as long as the touchstone concerned is still `in-prep`.
* A `responsibility_set` is defined by `modelling_group` and `touchstone`. If combinations exist in responsibilities.csv that aren't in the database, they get created, again only while the touchstone is `in-prep`.
* A `burden_estimate_expectation` is defined by the age, cohort and year bounds, the touchstone (the `touchstone_name` - no version number), and a description, conventionally in the form `disease:group:scenario_type`, where the `scenario_type` might be a particular scenario (if expectations need to be specific to that scenario), or, in many cases, `scenario_type` has been defined as `standard`, allowing the same expectation definition to be shared for different scenarios (for a particular group and disease). These can only be changed while the touchstone is `in-prep`.
* For each `burden_estimate_expectation`, the rows in the `burden_estimate_country_expectation` table list the countries for which we are expecting estimates to be uploaded by a particular group, for a particular disease, and a particular scenario. The `burden_estimate_expectation` in this table is a numerical id, referring to either a newly created expectation as a result of your import, or an expectation that previously existed and matched the details exactly.
* Similarly, rows in the `burden_estimate_outcome_expectation` table list the expected outcomes that a group will upload for a particular scenario and disease. The outcomes must exist in the `burden_outcome` table - but Stoner cannot change the contents of that table at present. Again, changes are only made while the touchstone is `in-prep`.
* The `responsibility` table also contains `current_burden_estimate_set` and `current_stochastic_burden_estimate_set` - both of which are nullable, and are left at `NA` by default - and `is_open`, which Stoner will set to `TRUE` as default.

Firstly, have your `test-extract` call `stoner::stone_test_extract(extracted_data)`, and `test-transform` call `stoner::stone_test_transform(transformed_data)`, for the built-in tests to be called. Most likely, there is nothing else useful you can write for these tests, if your extract and transform functions are simply calling Stoner's.
Possibly the best approach to tests is to write the `test-queries` function for dettl in the following form:
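(The exact shape of this function depends on your dettl version and `dettl.yml`; the sketch below simply counts rows in the tables the import is expected to touch, and the table choices are illustrative.)

```R
test_queries <- function(con) {
  # Count rows in each table we expect the import to change.
  count <- function(table) {
    DBI::dbGetQuery(con, paste("SELECT COUNT(*) AS n FROM", table))$n
  }
  list(
    touchstones = count("touchstone"),
    scenarios = count("scenario"),
    responsibilities = count("responsibility")
  )
}
```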
and a `test_load.R` that tests how many rows have been added, for example:
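(Assuming dettl makes the query results available as `before` and `after` in the load tests, and that - purely for illustration - your csv files describe one new touchstone and two new scenarios.)

```R
testthat::test_that("rows were added as expected", {
  testthat::expect_equal(after$touchstones - before$touchstones, 1)
  testthat::expect_equal(after$scenarios - before$scenarios, 2)
})
```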
For this though, you will have to have prior knowledge about how many of the rows in your various CSV files exist already in the database, and how many you are expecting will need to be created.
The need for fast-forwarding arises when a new touchstone version is created while groups still have valid burden estimates uploaded against a previous version. Fast-forwarding is therefore a process where burden estimates are moved from one touchstone to another - or more specifically, from one responsibility_set to another (since a responsibility_set is defined by modelling_group and touchstone).

Suppose, then, that we have the new touchstone ready, and we have, potentially, some burden estimate sets to migrate. Let's consider what fast-forwarding does, first for a single scenario and a single modelling_group.

We specify that we want to fast-forward an existing burden estimate set for a certain modelling_group and scenario, from one touchstone to another. We'll see how to specify that in a simple CSV file shortly. A stoner import running on that CSV file then does essentially the following:
* If necessary, create a new `responsibility_set` for the modelling_group, in the destination touchstone. If the `responsibility_set` already exists, that's fine; we'll use the existing one.
* If necessary, create a new `responsibility` within the new `responsibility_set`, for the specified scenario. If the `responsibility` already exists, but there is no burden estimate set associated with it, then we can continue using the existing `responsibility`. If a burden estimate set is already associated with that `responsibility`, we abort and don't fast-forward.

Fast-forwarding a burden_estimate_set then means copying the `current_burden_estimate_set` value from one responsibility to another - from the older into the newer - and setting the older to `NA`.

Additionally, when a new `responsibility_set` is created by stoner, it will copy the most recent `responsibility_set_comment` from the old to the new `responsibility_set`, noting that the new one was created by fast-forwarding. Similarly for `responsibility_comment`s: if any work is done (either creating a new responsibility, or setting `current_burden_estimate_set` on the new responsibility for the first time), then the most recent `responsibility_comment` (if there is one) will be copied to the new responsibility, with a note about fast-forwarding.

Write a `fast_forward.csv` file in the following form.

| Column | Example |
|---|---|
| modelling_group | IC-Hallett;Li |
| scenario | hepb-no-vaccination |
| touchstone_from | 202110gavi-2 |
| touchstone_to | 202110gavi-3 |
Note that fast-forwarding must be the only thing in the import, and the only .csv file in use. Combining fast-forwarding with other touchstone creation or management functions is too stressful to contemplate. Do them separately, and test them separately.

Also, while you can fast-forward between different touchstones in the same CSV file, be careful, as it gets confusing. Stoner will not let you fast-forward into, and out of, the same touchstone (ie, from version 1 to 2, and from 2 to 3) in the same CSV file.

The `modelling_group` and `scenario` columns can either be single items, with multiple rows in the csv file; they can be semi-colon separated, to give multiple combinations of groups and scenarios; or they can be the wildcard `*`, to match anything.
Include all the standard stoner one-liners for the extract, test-extract, transform, test-transform, and load stages, as above, and ensure that in `dettl.yml` for the import, automatic loading is not enabled.
When modelling groups upload more than one burden estimate set for the same
responsibility (that is, the same touchstone, scenario, disease), only the
most recent is regarded as interesting, and is marked as the
current_burden_estimate_set
for the responsibility. To save space (for
some groups, a considerable amount of space), the old orphaned
burden estimate sets can be deleted.
Note that this should be considered a "final" delete; rows will be dropped
from the burden_estimate_set
table, and especially the burden_estimate
table. While rolling the database back via backups is possible, it's not
desirable. That said, there should be no reason to keep previous versions
of a burden estimate set. If both the old and new versions are important,
they should both be "current" burden estimate sets, in different touchstones
or responsibilities perhaps.
Write a prune.csv
file in the following form.
knitr::kable(data_frame( `Column` = c("modelling_group", "disease", "scenario", "touchstone"), `Example` = c("IC-Hallett;Li", "*", "hepb-no-vaccination", "202110gavi-2;202110gavi-3") ))
Each field can be semi-colon-separated, and the result is that all the possibilities are multiplied out. (So in the above example, both touchstones, for both modelling groups, will be examined for pruning opportunities.)
You can also include multiple lines in the CSV file, which will be considered one at a time, thus allowing flexibility to look at a number of specific combinations for pruning.
The *
is a wildcard, and in the simplest case, all the fields
can be left as *
, to look for pruning opportunities in the
entire history of burden estimate sets.
Note that if prune.csv
exists, no other csv
file
should be included - that is: a pruning import should just do
pruning, and not any other functionality. This keeps things
simple, which is a good thing since here we are (somewhat
uniquely) performing a deletion of data.
Include all the standard stoner one-liners for the extract, test-extract, transform, test-transform, and load stages, as above, and ensure that in `dettl.yml` for the import, automatic loading is not enabled.
`stoner::stone_dump(con, touchstone, path)`, called with a database connection, a touchstone, and an output path, will produce csv files of everything connected with that touchstone, in the form stoner would use to import, as described above. This might be useful if you want to download an existing touchstone, edit some details (including the touchstone id), and upload a modified version.
Modelling groups submit stochastic data to VIMC by responding to a Dropbox File Request. A stochastic set consists of 200 runs for each scenario for that group, using a range of different parameters that are intended to capture the uncertainty in the model.
After some initial sanity checks (which are manual at present), the incoming csvs are compressed with `xz` with maximum settings, which provides the best compression for csvs, but fast decompression, and seamless decompression in R. (Windows command line: `xz -z -k -9 -e *.csv`)
The incoming stochastics are always separated by scenario, and may be further separated for convenience; some groups have provided a file per country, others a file per stochastic run. From these, we create four intermediate files for each group, which eliminate age by summing either over a calendar year, or over a birth cohort (year - age), and, for each option, either include all ages, or filter just ages 0 to 4. They include just the `cases`, `deaths` and `dalys` outcomes (which might be calculated by summing more detailed outcomes a group provides) for each scenario in columns. The idea is that impact between scenarios can then be calculated simply by doing maths on values from the same row of the file.
These four files are later uploaded to four separate tables on the annex database.
Note that the production of the intermediate files can take a few hours per group, whereas the upload to annex takes only a few minutes. Storing the intermediate files can be useful should we need to redeploy annex at any point.
Also note the examples below assume you have a connection to the production
database (con
), and later, a connection to the annex database (annex
). See
the end for notes on getting those connections in different ways.
In the simplest case, a group uploads a single csv file per scenario as follows:-
knitr::kable(data_frame( `disease` = c("YF","YF","YF","YF","YF"), `run_id` = c(1,1,1,1,1), `year` = c(2000,2001,2002,2003,2004), `age` = c(0,0,0,0,0), `country` = c('AGO','AGO','AGO','AGO','AGO'), `country_name` = c('Angola','Angola','Angola','Angola','Angola'), `cohort_size` = c(677439,700540,725742,753178,782967), `cases` = c(59,61,66,69,71), `deaths` = c(22,23,24,25,26), `dalys` = c(1233,1390,1330,1196,1490) ))
which would continue for all the countries, years and ages, for 200 runs of a particular scenario. A separate file would exist for each scenario. To transform this into the four intermediate files, we might write below - where the argument names are included just for clarity, and are not needed.
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario.csv.xz",
  cert = "certfile",
  index_start = NA,
  index_end = NA,
  out_path = "E:/Stochastic_Outputs")
```
This assumes that in the in_path
folder, 6 .csv files are present. The
file
argument indicates the template for those files. In this case we
are assuming all the files follow the same template, where :scenario
will be replaced by each of the 6 specified scenarios in turn. If the
files do not obey such a simple templating, then you can supply a vector
of strings for file
, to indicate which files; just note there should be
either a one-to-one mapping, or a many-to-one mapping between the different
scenarios, and the different files indicated.
In this example, there is only one file per scenario; the index_start
and
index_end
arguments are set to NA
, and there is no reference to :index
in
the file
template. We will see later multi-file examples where these three
fields are changed to describe the sequence of files we are expecting.
The result is that four files are written - below is an abbreviated section of each.
| run_id | year | country | cases_novac | dalys_novac | deaths_novac | cases_prevbest | dalys_prevbest | deaths_prevbest |
|---|---|---|---|---|---|---|---|---|
| 1 | 2000 | 24 | 1219 | 21388 | 452 | 1165 | 20219 | 432 |
| 1 | 2001 | 24 | 1269 | 22884 | 471 | 1199 | 21353 | 444 |
| 1 | 2002 | 24 | 1319 | 24129 | 494 | 1235 | 22207 | 461 |
So here, we have in each row, the cases, deaths and dalys summed over age for a country and calendar year, for each scenario.
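For example, deaths averted in the preventive-bestcase scenario, relative to no vaccination, can be computed row by row. (A sketch: the output file name below is illustrative; the column names are those shown above.)

```R
# Read one of the intermediate files and compute impact per row.
calendar <- read.csv("IC-Garske_YF_calendar.csv")
calendar$deaths_averted <- calendar$deaths_novac - calendar$deaths_prevbest
head(calendar[, c("run_id", "year", "country", "deaths_averted")])
```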
| run_id | year | country | cases_novac | dalys_novac | deaths_novac | cases_prevbest | dalys_prevbest | deaths_prevbest |
|---|---|---|---|---|---|---|---|---|
| 1 | 2000 | 24 | 269 | 5710 | 100 | 215 | 4541 | 80 |
| 1 | 2001 | 24 | 280 | 6220 | 105 | 210 | 4689 | 78 |
| 1 | 2002 | 24 | 290 | 6564 | 110 | 213 | 4849 | 80 |
This is similar to the calendar year, but ages five and above are ignored, when summing over age, so the numbers are all smaller.
| run_id | cohort | country | cases_novac | dalys_novac | deaths_novac | cases_prevbest | dalys_prevbest | deaths_prevbest |
|---|---|---|---|---|---|---|---|---|
| 1 | 1900 | 24 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1901 | 24 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1902 | 24 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2000 | 24 | 3149 | 44542 | 1184 | 774 | 15763 | 280 |
| 1 | 2001 | 24 | 3261 | 47051 | 1222 | 809 | 16902 | 284 |
| 1 | 2002 | 24 | 3384 | 51399 | 1269 | 799 | 17573 | 283 |
The `cohort` is calculated by subtracting `age` from `year`; it answers the question of when people of a certain age in a certain calendar year were born. Notice the `cohort` column instead of `year`. This model includes 100-year-olds alive in calendar year 2000, so these were born in the year 1900, but no yellow fever cases or deaths for these scenarios are recorded for that birth cohort.
| run_id | cohort | country | cases_novac | dalys_novac | deaths_novac | cases_prevbest | dalys_prevbest | deaths_prevbest |
|---|---|---|---|---|---|---|---|---|
| 1 | 1996 | 24 | 49 | 1010 | 18 | 49 | 1010 | 18 |
| 1 | 1997 | 24 | 102 | 2196 | 38 | 86 | 1854 | 32 |
| 1 | 1998 | 24 | 160 | 3626 | 60 | 122 | 2778 | 45 |
| 1 | 1999 | 24 | 221 | 4483 | 83 | 152 | 3086 | 57 |
| 1 | 2000 | 24 | 289 | 6057 | 108 | 207 | 4346 | 78 |
| 1 | 2001 | 24 | 297 | 6915 | 112 | 225 | 5232 | 84 |
| 1 | 2002 | 234 | 310 | 7223 | 116 | 234 | 5464 | 87 |
This is similar to birth cohort, but only considering those age 4 or less. Hence, the oldest age group in the year 2000 (where calendar years begin for this model) will be 4, and they were born in 1996, which is the first birth cohort.
Some groups submit a file per stochastic run, or a file per country. Some have even arbitrarily started a new file when one file has become, say, 10Mb in size. Stoner doesn't mind at what point the files are split, except that data for two scenarios cannot exist in the same file, and the files that make up a set must be numbered with contiguous integers.
The example below will expect runs numbered from 1 to 200, as indicated with
index_start
and index_end
. Also notice the presence of the :index
placeholder in the file
stub, which will be replaced with the sequence number
when the files are parsed.
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "certfile",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs")
```
Some groups might also submit different numbers of files for each scenario. For example, HepB for some groups requires different numbers of countries to be modelled for different scenarios, depending on what campaigns were made in those countries. If a group wishes to split their results by country, they will then have different numbers of files per scenario. In this case, `index_start` and `index_end` can be vectors, of the same length as the `scenarios` vector, giving the start and end ids for each scenario.
Stoner can also support a mixture of single and multi-files for different
scenarios. For that case, you'll need vectors for both the file
stub, and
the index_start
and index_end
- Stoner will test that whenever the file
stub contains :index
, the index_start
and index_end
are specified,
otherwise not.
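For example (an illustrative sketch; here two scenarios arrive as single files and one is split into 200 numbered files):

```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = c(":scenario.csv.xz", ":scenario.csv.xz", ":scenario_:index.csv.xz"),
  cert = "certfile",
  index_start = c(NA, NA, 1),   # NA where the file stub has no :index
  index_end = c(NA, NA, 200),
  out_path = "E:/Stochastic_Outputs")
```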
Some groups provide multiple deaths or cases categories which need to be summed to give the total deaths or cases. The example below uses the optional `outcomes` argument, where we can give a vector of column names to be summed for each named burden outcome. All the columns mentioned must exist in the incoming data (and in the responsibilities for that group and disease too); they are summed to give the final outcome.
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "certfile",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs",
  outcomes = list(
    deaths = c("deaths_cat1", "deaths_cat2"),
    cases = c("cases_cat1", "cases_cat2"),
    dalys = "dalys"))
```
Occasionally, a group omits the `run_id` column in their input data. In practice this only happens when the `run_id` is specified as part of the filename. To handle this, set the optional `runid_from_file` argument to `TRUE` - in that case, `index_start` and `index_end` must be `1` and `200` respectively, and `:index` must be included in the file template for all scenarios (whether specified as a vector, or a singleton).
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "certfile",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs",
  runid_from_file = TRUE)
```
Some groups have also omitted the constant `disease` column from their stochastic results. This would normally generate a warning but work correctly in any case; to silence the warning, set the optional `allow_missing_disease` argument to `TRUE`.
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "certfile",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs",
  allow_missing_disease = TRUE)
```
As noted above, different numbers of countries per scenario can occur, with HepB being an example. If this is the case, besides dealing with a different number of files per scenario (if the group split their files by country), there is nothing you need to do for Stoner to process this properly. In the output CSV files, any country for which there is no data for a particular scenario will have `NA` for that scenario's columns. Care might be needed in later analysis to ensure that comparisons or impact calculations only occur where all the values are not `NA`.
When groups upload the parameters for their stochastic runs into Montagu, they are provided with a certificate - a small JSON file providing metadata, and confirmation of the upload information. The certificate should be provided by the group along with the stochastic data files that were produced using the parameters they uploaded.
By default, stoner will verify that the certificate file exists, and check that the metadata recorded on production for the uploaded parameter set (modelling group, touchstone, disease) matches the arguments you provide when you call `stone_stochastic_process`.
Should you be lacking a group's certificate, but still want to attempt to
process the stochastic data, then set the option bypass_cert_check
to be
TRUE:-
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs",
  bypass_cert_check = TRUE)
```
You can also manually perform validation of a certificate file without processing stochastic data, with the call:-
```R
stone_stochastic_cert_verify(con, "certfile", "IC-Garske", "201910gavi-5", "YF")
```
This call will stop with an error if either the modelling group or the touchstone does not match the details used to submit the parameter set and obtain the `certfile` provided here.
The processed CSV files can be uploaded to annex automatically, if an additional database connection `annex` is provided, and the `upload_to_annex` argument is set to `TRUE`. The files will be uploaded after processing.
```R
stone_stochastic_process(
  con = con,
  modelling_group = "IC-Garske",
  disease = "YF",
  touchstone = "201910gavi-4",
  scenarios = c("yf-no-vaccination", "yf-preventive-bestcase",
                "yf-preventive-default", "yf-routine-bestcase",
                "yf-routine-default", "yf-stop"),
  in_path = "E:/Dropbox/File Requests/IC-Garske",
  file = ":scenario_:index.csv.xz",
  cert = "certfile",
  index_start = 1,
  index_end = 200,
  out_path = "E:/Stochastic_Outputs",
  upload_to_annex = TRUE,
  annex = annex,
  allow_new_database = FALSE)
```
If `allow_new_database` is set to `TRUE`, then Stoner will try to create the `stochastic_file` index table on annex; this will only be wanted the first time data is uploaded to a new, empty database, so typically it will be left as `FALSE`.
The result of uploading is that four new rows will be added to the
stochastic_file
table, for example:-
| id | touchstone | modelling_group | disease | is_cohort | is_under5 | version | creation_date |
|---|---|---|---|---|---|---|---|
| 1 | 201910gavi-4 | IC-Garske | YF | FALSE | TRUE | 1 | 2020-08-06 |
| 2 | 201910gavi-4 | IC-Garske | YF | TRUE | TRUE | 1 | 2020-08-06 |
| 3 | 201910gavi-4 | IC-Garske | YF | FALSE | FALSE | 1 | 2020-08-06 |
| 4 | 201910gavi-4 | IC-Garske | YF | TRUE | FALSE | 1 | 2020-08-06 |
Four new tables, named in the form `stochastic_` followed by the `id` field listed in the table above, will also have been created; these are uploaded copies of the final CSV files. If further uploads are made that match the `touchstone`, `modelling_group`, `disease`, `is_cohort` and `is_under5`, then the new data will overwrite the existing data, and the `version` and `creation_date` in the table above will be updated.
You can also call `stone_stochastic_upload` directly, if you have CSV files ready to upload. Call the function as below to upload a single CSV file. (Vectors for multiple scenarios in one go are not currently supported in this function.)
```R
stone_stochastic_upload(
  file = 'IC-Garske_YF_calendar_u5.csv',
  con = con,
  annex = annex,
  modelling_group = 'IC-Garske',
  disease = 'YF',
  touchstone = '201910gavi-4',
  is_cohort = FALSE,
  is_under5 = TRUE)
```
The filename is treated as arbitrary; `is_cohort` and `is_under5` need specifying to describe the data being uploaded. If this is the first ever upload to a new database, then the optional `allow_new_database` will enable creation of the `stochastic_file` table.

#### The testing argument

`stone_stochastic_process` and `stone_stochastic_upload` both take a `testing` logical argument; ignore this, as it is only used as part of the tests, in which a fake annex database is set up.

#### Database connections (and where to find them)

We use the `vaultr` package, and assume that the `VAULT_ADDR` and `VAULT_AUTH_GITHUB_TOKEN` environment variables are set up - we won't go into doing that here. A read-only connection to the production database is used to validate the outcomes and countries against those in a group's expectations. To get the connection to production:

```R
vault <- vaultr::vault_client(login = "github")
password <- vault$read("/secret/vimc/database/production/users/readonly")$password
con <- DBI::dbConnect(RPostgres::Postgres(),
  dbname = "montagu",
  host = "production.montagu.dide.ic.ac.uk",
  port = 5432,
  password = password,
  user = "readonly")
```

To get a connection to annex:

```R
password <- vault$read("/secret/vimc/annex/users/vimc")$password
annex <- DBI::dbConnect(RPostgres::Postgres(),
  dbname = "montagu",
  host = "annex.montagu.dide.ic.ac.uk",
  port = 15432,
  password = password,
  user = "vimc")
```
However, rather than acquiring connections as above and manually running ad hoc database queries on annex, it will be better to express imports to annex using `dettl`. The imports are made a little more complex than usual by the length of time taken to do the data reduction, the potentially very large amount of RAM required, and the possibility that data will be replaced on annex with subsequent versions. Nevertheless, it would be good to have a formal process for uploading data to annex, and `dettl` would be a good way.
For example:

* Extract stage: read the meta data, which would contain a list of groups and locations for different stochastic datasets to be processed. Also look up the necessary metadata for those files - responsibilities and outcomes.
* Transform stage: perform the reduction, producing csv files. We would need `dettl` to not try to validate the output against specific database tables, since we often need to create those tables on annex as part of the upload.
* Load stage: perform the uploads on the created csv files, updating the `stochastic_file` table and adding new tables on annex.