```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
spartanDB has been created to provide all additional requirements to link the sensitivity analysis techniques in the spartan package with an auto-generated MySQL database. This way, results from all sensitivity analyses, machine learning algorithm generated emulations, approximate Bayesian computation, and optimisation experiments of a simulator are in one place, aiding transparency and reproducibility of generated statistical analyses, and increasing performance of the spartan package. Parameter value sets generated for sensitivity analyses can be added to the database. Results from executions of the simulator under those conditions can then be added to the database, with reference to those parameter sets and details of the experiment. This database functionality then aids mining of these simulation results for the machine learning emulation techniques that were introduced in spartan version 3. Details of these emulation/ensemble experiments can also be added to the database, as well as results from the techniques in spartan that utilise the emulators to perform time-intensive analyses. Similarly to spartan, we include detailed examples that show how parameter value sample sets are stored in the database, the results of executions under those conditions added, and analysed results produced by spartan stored alongside those experimental setups.

In all, this method establishes five database tables:

* experiment: stores details of the performed experiment (description, date, and ID)
* parameters: stores details of parameter values run under each experiment
* results: stores details of simulation responses for the parameter values in the parameters table
* analysed_results: stores summary results of those in the results table. Used when replicate runs are performed for parameter sets (as in the tutorial dataset)
* generated_stats: stores the generated statistics for all spartan methods
Methods are provided to mine these tables to reproduce experiment output, to check the conditions under which an experiment was run, and to save having to perform each analysis again.
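As a simple illustration, the experiment table listed above can also be queried directly from any MySQL client to review which experiments are stored (the table name comes from the list above; the exact columns spartanDB creates may differ from what you expect, so `SELECT *` is used here):

```sql
-- List all experiments that spartanDB has recorded in the schema.
-- Run from a MySQL client connected to the spartanDB schema.
SELECT * FROM experiment;
```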
The following are required to run the spartanDB and spartan methods described here:
Our demonstration utilises data from our previously described agent-based lymphoid tissue development simulator (Patel et al., 2012; Alden et al., 2012). This is the same data on which the functionality of the spartan package was generated, and as such, we refer the reader to the vignettes for spartan for further detail of the case study. This vignette focuses on the link spartanDB provides between a MySQL result storage system and the analysis methods provided by spartan.
SpartanDB assumes that a working MySQL installation exists on the machine upon which the package is being used, and that a database user is set up that has ALTER, CREATE, DELETE, DROP, INSERT, SELECT, and UPDATE privileges. A schema should be created in this database in which the tables created by spartanDB will be stored. A MySQL settings file is used to state the settings that should be used to connect to the database. The below is an example of such a settings file, which would connect to the schema spartan_db:
```
[spartan_db]
user=spartan_db_user
password=database_password
host=127.0.0.1
port=3306
database=spartan_db
```
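For reference, a schema and user matching the example settings file above could be created as follows. This is a sketch using the user name and password from the example file, and grants exactly the privileges listed earlier; adjust the names, password, and host to your own setup:

```sql
-- Create the schema that spartanDB will populate
CREATE DATABASE spartan_db;

-- Create a user and grant the privileges spartanDB requires
CREATE USER 'spartan_db_user'@'localhost' IDENTIFIED BY 'database_password';
GRANT ALTER, CREATE, DELETE, DROP, INSERT, SELECT, UPDATE
  ON spartan_db.* TO 'spartan_db_user'@'localhost';
FLUSH PRIVILEGES;
```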
An R script that connects to this database should then be initialised as follows:
```{R,eval=FALSE}
library(RMySQL)
library(spartan)
library(spartanDB)
# R needs a full path to find the settings file
rmysql.settingsfile <- "path_to_settings_file.cnf"
rmysql.db <- "spartan_db"
dblink <- dbConnect(MySQL(), default.file=rmysql.settingsfile, group=rmysql.db)
```
SpartanDB then has methods for both creating and deleting the table structure the package requires. The creation method needs to know the names of the parameters and responses of the simulator for which the results are being stored.
```{R,eval=FALSE}
parameters <- c("stableBindProbability", "chemokineExpressionThreshold",
                "initialChemokineExpressionValue", "maxChemokineExpressionValue",
                "maxProbabilityOfAdhesion", "adhesionFactorExpressionSlope")
measures <- c("Velocity", "Displacement")
# Create the required tables
create_database_structure(dblink, parameters, measures)
# Should you need to remove the table structure again:
delete_database_structure(dblink)
```
The dblink object created above will then be passed to each spartanDB function. At the end of the script, this connection should be closed:
```{R,eval=FALSE}
dbDisconnect(dblink)
```
With the MySQL database set up as detailed in the previous section, spartan can be used to create parameter value sets for a robustness analysis, which are stored in the database under a specified experiment. Once these are executed, functions are provided for adding these results to the database, associating these with the specified experiment, and then analysing these results to produce insights from this statistical analysis. Full detail of this analysis is not provided here, as this would duplicate much of the information in the spartan vignettes. Instead, we refer the reader to the detail in the spartan package.
Parameter value sets are generated by specifying the parameter names, baseline/calibrated value, minimum and maximum value of the range being explored, and the increment value to apply in sampling.
```{R,eval=FALSE}
parameters <- c("stableBindProbability", "chemokineExpressionThreshold",
                "initialChemokineExpressionValue", "maxChemokineExpressionValue",
                "maxProbabilityOfAdhesion", "adhesionFactorExpressionSlope")
measures <- c("Velocity", "Displacement")
baseline <- c(50, 0.3, 0.2, 0.04, 0.60, 1.0)
minvals <- c(10, 0.10, 0.10, 0.015, 0.1, 0.25)
maxvals <- c(100, 0.9, 0.50, 0.08, 0.95, 5.0)
incvals <- c(10, 0.1, 0.05, 0.005, 0.05, 0.25)
# Experiment is created in the database by specifying an experiment description,
# such as that below. This will be created with the current date.
generate_robustness_set_in_db(dblink, parameters, baseline, minvals, maxvals, incvals,
                              experiment_id=NULL, experiment_description="PPSim Robustness")
# If you wish, you can specify the date
generate_robustness_set_in_db(dblink, parameters, baseline, minvals, maxvals, incvals,
                              experiment_id=NULL, experiment_description="PPSim Robustness",
                              experiment_date="2018-09-03")
# If you have already established an experiment in the database and know the experiment ID
# (the primary key), you can also generate a parameter set for that experiment
# (though use of this method is more unlikely)
generate_robustness_set_in_db(dblink, parameters, baseline, minvals, maxvals, incvals,
                              experiment_id=2)
# You can then download this sample as a CSV file. Again you can do this using
# experiment_description and date, or by experiment ID
output_directory <- "~/Documents/"
download_sample_as_csvfile(output_directory, dblink, experiment_id=1)
download_sample_as_csvfile(output_directory, dblink,
                           experiment_description="PPSim Robustness",
                           experiment_date="2018-10-29")
```
In the case where a pre-generated sample already exists, spartanDB contains a method to add this to the database, creating a new experiment. This pre-generated sample should exist as an R object (so if this was a CSV file output from spartan, one would have to read this in first):
```{R,eval=FALSE}
# The package contains a pre-generated sample as an R object, for use as an exemplar
data(ppsim_robustness_set)
# Read these into the database:
add_existing_robustness_sample_to_database(dblink, parameters, ppsim_robustness_set,
                                           experiment_description="Original PPSim Robustness")
# We ran this and the message stated this created experiment ID 2.
# This will be used in the next section.
# If you want to specify the experiment date, you can do this with the
# experiment_date argument too
```
A message will be returned stating that the parameter set has been added to the database, with a stated experiment ID.
For simulations that require replicate executions to obtain a representative result for a given parameter set, you can store all replicates in the spartan_results database table. Results can be added to the database from two formats: an R object in the environment that contains all the results, or a CSV file. In both cases, the columns should be the parameter values followed by the simulation responses, with one result per row. This file can be created with methods available in spartan from raw simulation result files - see the spartan vignette for more information if needed. In the example below we use data from the spartan tutorial, which has been stored in the package to aid demonstration:
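To make the required layout concrete, the sketch below builds a tiny results object of the shape described above: parameter value columns first, then the simulation response columns, one row per replicate run. The values are illustrative only and do not come from the tutorial dataset:

```r
# Illustrative results object: parameter columns first, then response columns.
# Two rows represent two replicate executions of the same parameter set.
example_results <- data.frame(
  stableBindProbability = c(50, 50),
  chemokineExpressionThreshold = c(0.3, 0.3),
  initialChemokineExpressionValue = c(0.2, 0.2),
  maxChemokineExpressionValue = c(0.04, 0.04),
  maxProbabilityOfAdhesion = c(0.6, 0.6),
  adhesionFactorExpressionSlope = c(1.0, 1.0),
  Velocity = c(4.47, 4.52),       # simulation responses follow the parameters
  Displacement = c(28.5, 29.1)
)
```

An object of this shape (or a CSV file with the same columns) is what the add-results functions below expect.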
```{R,eval=FALSE}
measures <- c("Velocity", "Displacement")
data(ppsim_robustness_results)
# Add the data from this R object to the database, using the experiment ID that was
# returned when adding the parameters earlier (2 in this case)
add_lhc_and_robustness_sim_results(dblink, parameters, measures, experiment_id=2,
                                   results_obj=ppsim_robustness_results)
# Again you could also do this using description and date
add_lhc_and_robustness_sim_results(dblink, parameters, measures,
                                   experiment_description="Original PPSim Robustness",
                                   experiment_date="2018-10-29",
                                   results_obj=ppsim_robustness_results)
# If you are adding a CSV file, you would do so by:
add_lhc_and_robustness_sim_results(dblink, parameters, measures, experiment_id=2,
                                   results_csv="/path/to/csv_file.csv")
```
With these results in the database, you can then use spartan to generate the A-Test statistics produced by a robustness analysis (see the spartan vignette for full details).
```{R,eval=FALSE}
generate_robustness_analysis(dblink, parameters, measures, baseline, experiment_id=2)
# Again, you could use experiment description and date rather than experiment ID if you prefer
# With the analysis complete you can then graph these results. As these are stored in the
# database, you can graph these again at any point should you wish
output_directory_for_graphs <- "~/Documents"
graph_robustness_analysis(dblink, output_directory_for_graphs, parameters, measures,
                          experiment_id=2)
```
Similarly to the above, methods are provided to generate sets of parameter values using a latin-hypercube, and then produce and store the analysis of simulation executions run under those conditions. Again, for full detail of the implementation of the technique, see the spartan vignettes.
Samples are generated by specifying the parameter names, minimum and maximum values of the range being explored, the number of sets to generate, and the sampling algorithm for the lhs package (either normal or optimal).
```{R,eval=FALSE}
parameters <- c("stableBindProbability", "chemokineExpressionThreshold",
                "initialChemokineExpressionValue", "maxChemokineExpressionValue",
                "maxProbabilityOfAdhesion", "adhesionFactorExpressionSlope")
minvals <- c(10, 0.10, 0.10, 0.015, 0.1, 0.25)
maxvals <- c(100, 0.9, 0.50, 0.08, 0.95, 5.0)
number_samples <- 500
algorithm <- "normal"
# Experiment is created in the database by specifying an experiment description,
# such as that below. This will be created with the current date.
generate_lhc_set_in_db(dblink, parameters, number_samples, minvals, maxvals, algorithm,
                       experiment_description="ppsim lhc dataset")
# Similarly to robustness analysis, you can also specify the date using the
# experiment_date argument, or if you wanted to, specify the experiment ID if this
# already exists in the database using the experiment_id argument (more unlikely)
# You can download the generated samples from the database. In this case, we know
# the ID of this experiment was 3
output_directory <- "~/Documents"
download_sample_as_csvfile(output_directory, dblink, experiment_id=3)
```
In the case where a pre-generated sample already exists (maybe generated by spartan), spartanDB contains a method to add this to the database, creating a new experiment. This sample should be in the R environment, specified as the name of an R object. If it exists as a CSV file (as output by spartan), it should be read into R first.
```{R,eval=FALSE}
# For demonstration purposes, we have included a pre-generated sample in the package
data(pregenerated_lhc)
add_existing_lhc_sample_to_database(dblink, pregenerated_lhc,
                                    experiment_description="existing ppsim lhc dataset")
# Similarly to robustness, we were told in the output message this was experiment ID 4,
# and we will use this in the analysis section that follows
# Again, if you want to specify a date for the experiment, you can do this with the
# experiment_date argument
```
Similarly to robustness analysis above, replicate executions of generated latin-hypercube parameter sets are stored in the spartan_results database table. Results can be added to the database from two formats: an R object in the environment that contains all the results, or a CSV file. In both cases, the columns should be the parameter values followed by the simulation responses, with one result per row. This file can be created with methods available in spartan from raw simulation result files - see the spartan vignette for more information if needed. In the example below we use data from the spartan tutorial, which we have included as an R data object in this package.
```{R,eval=FALSE}
data(ppsim_lhc_results)
# We recall above we added the parameters and the package told us this was experiment ID 4,
# so we use that here. We could have specified experiment_date and experiment_description
# instead. This is a fair-sized data set, and thus may take a while...
add_lhc_and_robustness_sim_results(dblink, parameters, measures, experiment_id=4,
                                   results_obj=ppsim_lhc_results)
# If you had a CSV file containing all the results, you would use this call:
add_lhc_and_robustness_sim_results(dblink, parameters, measures, experiment_id=4,
                                   results_csv="~/path/to/csv_file.csv")
```
With these results in the database, you can then use spartan to generate the Partial Rank Correlation Coefficient statistics and graphs produced by this analysis technique (see the spartan vignette for full details). In this case, the replicate executions of each parameter set are summarised to create a summary response under those parameter conditions, which is stored in the analysed_results table. The analysis statistics are stored in the generated_stats table:
```{R,eval=FALSE}
# In the example dataset, we had a number of replicate runs per parameter set.
# Summarise the behaviour of each set. We use the same experiment ID as above
# (though date and description are possible)
# Again this may take some time on some setups
measures <- c("Velocity", "Displacement")
summarise_replicate_lhc_runs(dblink, measures, experiment_id=4)
# Now we have the data in a format that spartan can process - so we'll do the analysis
generate_lhc_analysis(dblink, parameters, measures, experiment_id=4)
# Now produce plots of these stats held in the DB
output_directory <- "~/Desktop/"
# This method requires the scale of each measure (for display on the axes)
measure_scale <- c("microns/min", "microns")
# Again using experiment_id, but you can use date and description. Also you can specify
# output format, including PDF, PNG, BMP, etc
graph_lhc_analysis(dblink, parameters, measures, measure_scale, output_directory,
                   experiment_id=4, output_type=c("PDF"))
```
Methods are provided to generate sets of parameter values using the eFAST technique, and then produce and store the analysis of simulation executions run under those conditions. Again for full detail of the implementation of the technique, see the spartan vignettes.
Samples are generated by specifying the parameter names, minimum and maximum values of the range being explored, the number of samples to generate for each parameter, and the number of resample curves to employ.
```{R,eval=FALSE}
parameters <- c("stableBindProbability", "chemokineExpressionThreshold",
                "initialChemokineExpressionValue", "maxChemokineExpressionValue",
                "maxProbabilityOfAdhesion", "adhesionFactorExpressionSlope")
minvals <- c(10, 0.10, 0.10, 0.015, 0.1, 0.25)
maxvals <- c(100, 0.9, 0.50, 0.08, 0.95, 5.0)
number_samples <- 65
number_curves <- 3
# Experiment is created in the database by specifying an experiment description,
# such as that below. This will be created with the current date.
generate_efast_set_in_db(dblink, parameters, number_samples, minvals, maxvals,
                         number_curves, experiment_description="PPSim eFAST")
# If you wish, you can specify the date
generate_efast_set_in_db(dblink, parameters, number_samples, minvals, maxvals,
                         number_curves, experiment_description="PPSim eFAST",
                         experiment_date="2018-09-03")
# If you have already established an experiment in the database and know the experiment ID
# (the primary key), you can also generate a parameter set for that experiment
# (though use of this method is unlikely)
generate_efast_set_in_db(dblink, parameters, number_samples, minvals, maxvals,
                         number_curves, experiment_id=5)
# You can then download this sample as a CSV file. Again you can do this using
# experiment_description and date, or by experiment ID
# In this case, the sample indicates which parameter of interest and resample curve
# the sample is for
output_directory <- "~/Documents/"
download_sample_as_csvfile(output_directory, dblink, experiment_id=5)
```
Similarly to the methods above, if you have a pre-generated sample (produced by spartan, which outputs one CSV file per parameter/curve pair) you can add this to the database. To ease demonstration of this approach, a zip file containing an example set of CSV files is available on the spartan website. Alternatively, if you have an R object containing the parameter samples, these can also be added straight to the database. An example R object containing a set of samples for the case study has been included in the package.
```{R,eval=FALSE}
# Download of the example CSV files. These are extracted into an efast directory:
dir.create(file.path(getwd(), "efast"), showWarnings = FALSE)
unzip(system.file("extdata", "pregenerated_efast_sample.zip", package="spartanDB"),
      exdir=file.path(getwd(), "efast"))
add_existing_efast_sample_to_database(dblink, parameters, number_curves,
                                      parameter_set_path=file.path(getwd(), "efast"),
                                      experiment_description="Pre-Generated CSV PPSim eFAST")
# Or addition of an R object:
data(pregenerated_efast_set)
add_existing_efast_sample_to_database(dblink, parameters, number_curves,
                                      parameters_r_object=pregenerated_efast_set,
                                      experiment_description="Pre-Generated R Object PPSim eFAST")
# When we ran this, the parameter set was added to the database with experiment ID 6,
# which we use in the next section to store the analysis of these parameters
# Similarly to all methods above, you can specify the experiment_date if you don't want
# to use that day's date, as well as experiment_id (more unlikely)
```
As above, replicate executions of generated eFAST parameter sets are stored in the spartan_results database table. However, this method assumes that there are a number of result files, one per parameter/resample curve pair, as described in the description of this analysis approach in the spartan package. Similarly to the two methods above, spartan does contain methods to create these files from raw simulation result files - see the spartan vignette for more information if needed. In the example below we use data from the spartan tutorial; a zip file containing these data files is available for download with the spartan package.
```{R,eval=FALSE}
measures <- c("Velocity", "Displacement")
# Download of the zip file. Extract this to a directory on your system
sample_results <- "~/Documents/spartanDB/test_data/eFAST_Sample_Outputs.zip"
unzip(sample_results, exdir=file.path(getwd(), "efast"))
# Here we are adding to experiment ID 6, which we generated when adding the parameter
# sets above. If you wanted you could specify description and date of that experiment instead
add_efast_sim_results_from_csv_files(dblink, file.path(getwd(), "efast"), parameters,
                                     measures, number_curves, experiment_id=6)
# With the results in the database, we can perform the analysis
# Create summary stats from the replicates. These will be stored in the analysed_results table
# We use the experiment ID created when adding the parameters above - but recall you could
# do this with experiment description and date
summarise_replicate_efast_runs(dblink, parameters, measures, experiment_id=6)
# Now do the eFAST Analysis - statistics are stored in the generated_stats table
output_directory <- "~/Documents/"
generate_efast_analysis(dblink, parameters, measures, experiment_id=6, graph_results=TRUE,
                        output_directory=output_directory)
```
Using a surrogate tool in place of an original simulator, an emulator, can reduce resource requirements for model analysis and thus enrich understanding of how a model functions and relates back to the problem domain. The creation of an emulator from simulation data is described in the spartan vignette "Expedited and Enriched Analyses Using Emulations & Ensembles". Each emulator development method has the potential to perform very differently for the same dataset. As such, generating a prediction from a combination of emulators, rather than a single emulation alone, may provide an increase in performance, as seen in spartan Technique 7. Possessing data within a database makes it possible to mine that database for the data to use to train machine learning algorithms. These data may then come from more than one experiment, and thus it is possible to use the database to keep refining the emulators over time. In addition, by storing the data used to train, test, and validate each machine learning algorithm, it becomes possible to recreate the emulator at a later timepoint if need be, aiding result reproducibility.
In this example, we are going to use the Latin-Hypercube generated dataset seen earlier in this vignette, under experiment ID 4. We can create emulators for a number of machine learning algorithms:
```{R,eval=FALSE}
# Use emulator_list to specify the algorithms to use - these include SVM, NNET, RF, GLM,
# and GP (see the spartan vignette for more detail)
emulator_list <- c("RF", "SVM")
sim_emulators <- create_emulators_from_database_experiments(dblink, parameters, measures,
                                                            emulator_list,
                                                            normalise_set=TRUE,
                                                            experiment_id=4)
# In our run through, this was created as experiment ID 7
```
This will store the training, test, and validation datasets in the database, along with performance statistics for all emulators. If you have already done this, and want to recover this data, you can recreate the emulators using the method:
```{R,eval=FALSE}
emulator_list <- c("RF", "SVM")
# Note the experiment ID should be that used to create the emulators.
# Again you can use experiment date and description if you prefer
sim_emulators <- regenerate_emulators_from_db_data(dblink, parameters, measures,
                                                   emulator_list, normalise_set=TRUE,
                                                   experiment_id=7)
```
You can then use these emulations to make predictions. In emulator creation, the experiment data was separated into training, test, and validation sets. Below we show retrieving the validation set from the database and using the emulators to make predictions of the output for those parameter conditions. We then show that this experiment can be stored in the database:
```{R,eval=FALSE}
validation_set <- retrieve_validation_set_from_db_for_emulator(dblink, parameters,
                                                               measures, experiment_id=7)
use_emulators_to_make_and_store_predictions(dblink, sim_emulators, parameters, measures,
                                            validation_set, normalise=FALSE,
                                            normalise_result=TRUE,
                                            experiment_description="Predict Validation Set")
```
As also shown in the spartan vignettes, predictions may be more accurate when a number of emulators are combined to form an ensemble: one predictive tool that makes predictions by weighting the predictions made by each emulator. These can also be created in this package and the statistics from their generation stored in the database:

```{R,eval=FALSE}
# Make an ensemble of two emulators in this case below. The emulators are made, followed
# by the ensemble. Again this uses the data from the latin-hypercube experiment above,
# in the database as experiment 4
ensemble <- generate_emulators_and_ensemble_using_db(dblink, parameters, measures,
                                                     emulator_list=c("RF", "SVM"),
                                                     normalise_set=TRUE, experiment_id=4)
# You can then use this ensemble to make predictions, in a similar way to that shown above.
# The ensemble was generated with experiment ID 9
validation_set <- retrieve_validation_set_from_db_for_emulator(dblink, parameters,
                                                               measures, experiment_id=9)
use_ensemble_to_make_and_store_predictions(dblink, ensemble, parameters, measures,
                                           validation_set, normalise=TRUE,
                                           normalise_result=TRUE,
                                           experiment_description="Predict Validation Set with Ensemble")
```
As detailed in the spartan vignette, once an emulator or ensemble is generated, it becomes possible to use this in place of the simulator to perform a sensitivity analysis. You can then combine this approach with the methods above to store the results of an emulated sensitivity analysis in the database, as follows:
```{R,eval=FALSE}
emulated_lhc_values <- spartan::lhc_generate_lhc_sample(NULL, parameters, 500, minvals,
                                                        maxvals, "normal", write_csv=FALSE)
analyse_and_add_emulated_lhc_to_db(dblink, emulated_lhc_values, ensemble, parameters,
                                   measures, experiment_description="Emulated LHC Analysis",
                                   output_directory="/home/kja505/Desktop",
                                   normalise_sample=TRUE)
emulated_efast_values <- efast_generate_sample(NULL, 3, 65, c(parameters, "Dummy"),
                                               c(minvals, 0), c(maxvals, 1),
                                               write_csv=FALSE, return_sample=TRUE)
analyse_and_add_emulated_efast_to_db(dblink, emulated_efast_values, ensemble, parameters,
                                     measures, experiment_description="Emulated eFAST Analysis",
                                     graph_results=TRUE,
                                     output_directory="/home/kja505/Desktop",
                                     normalise_sample=TRUE, normalise_result=TRUE)
```
## SpartanDB with Spartan Technique 9: Using Ensemble for Approximate Bayesian Computation with EasyABC

Possessing an ensemble makes it possible to perform some analyses that may not have been tractable previously. This technique, detailed in the spartan vignette, uses the ensemble to perform an Approximate Bayesian Computation technique that predicts the posterior distribution for each parameter. For full details see the spartan vignette. In this package, the results from an ABC analysis are stored in the database so these can be recovered later, for plotting or further analysis.

```{R,eval=FALSE}
# Whether parameter sets generated by the EasyABC sampling algorithm need to be
# normalised prior to input into the ensemble
normalise_values <- TRUE
# Whether the generated predictions from the ensemble need to be rescaled
normalise_result <- TRUE
# Set prior distribution for each parameter
prior <- list(c("unif", 0, 100), c("unif", 0.1, 0.9), c("unif", 0.1, 0.5),
              c("unif", 0.015, 0.08), c("unif", 0.1, 1.0), c("unif", 0.25, 5.0))
# Set the summary statistics you would like the ideal parameter sets to get close to producing
sum_stat_obs <- c(4.4677342593, 28.5051144444)
# Create an abc object with this information, for feeding into the EasyABC methods.
# See the spartan vignette for more info
abc_set <- create_abc_settings_object(parameters, measures, ensemble, normalise_values,
                                      normalise_result, file_out = FALSE)
# Number of parameter sets to generate
numRunsUnderThreshold <- 100
# Declining tolerance values for use by the algorithm.
# See the EasyABC package for more detail here
tolerance <- c(20, 15, 10.00, 7, 5.00)
# Run the EasyABC method to generate the predicted posterior distribution
abc_resultSet <- ABC_sequential(method="Beaumont", model=ensemble_abc_wrapper, prior=prior,
                                nb_simul=numRunsUnderThreshold,
                                summary_stat_target=sum_stat_obs,
                                tolerance_tab=tolerance, verbose=FALSE)
# Store these results in the database
store_abc_experiment_in_db(dblink, abc_set, abc_resultSet, parameters, measures,
                           experiment_description="ABC for PPSim Parameters",
                           graph_results=TRUE, output_directory="/home/kja505/Desktop")
# The above can produce plots of the posterior distributions. However, we can retrieve
# stored results (i.e. predicted posterior distributions), for plotting
retrieve_abc_experiment_for_plotting(dblink, parameters,
                                     experiment_description="ABC for PPSim Parameters",
                                     experiment_date = Sys.Date())
```