OpenStats is a freely available R package that provides statistical methods and detailed analyses to facilitate the difficult process of identifying abnormal phenotypes. The package performs several checks and cleaning steps on the input data prior to the statistical analysis. For continuous data, a linear mixed model with an optional model selection routine is implemented, whilst for categorical data, Fisher's exact test is implemented. For cases where the linear mixed model fails, the Reference Range Plus method is available for a quick, simple analysis of the continuous data. The user can inspect and diagnose the final fitted model with the visualisation tools that come with the software. Furthermore, the user can export or report the outputs in the form of either a standard R list or JavaScript Object Notation (JSON). OpenStats has been tested and demonstrated on more than 2.5 million analyses from the International Mouse Phenotyping Consortium (IMPC).
The User's Guide, with more details about the statistical analysis, is available as part of the online documentation at https://rpubs.com/hamedhm/openstats. The project GitHub repository, including the development version of the package, is available at https://git.io/JeOVN.
OpenStats can be installed using the standard R package installation routine:
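# OpenStats is distributed through Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("OpenStats")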
OpenStats consists of one input layer and three operational layers:
The OpenStatsList function performs data processing and creates an OpenStatsList object. As input, the OpenStatsList function requires a dataset of phenotypic data in the form of a data frame, for instance a dataset read from a csv, tsv or txt file. The data are organised with rows for samples and columns for features. The following shows an example of the input data, where rows represent mice and columns represent features (mouse id, treatment group, sex, age of animal in days):
library(OpenStats)
###################
# Data preparation
###################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
read.csv(fileCon, as.is = TRUE)[60:75, c(
  "external_sample_id",
  "biological_sample_group",
  "sex",
  "age_in_days"
)]
The main preprocessing tasks performed by the OpenStatsList function are:
By "terminology unification" we mean agreeing on the terminology used to describe the data (variables) that are essential for the analysis. The OpenStats package uses the following nomenclature for the names of columns: "Genotype" (the only mandatory variable), "Sex", "Batch", "LifeStage" and "Weight". In addition, the expected (default) values of Sex and LifeStage are "Male/Female" and "Early/Late" respectively. However, the user can define custom levels by setting dataset.values.male, dataset.values.female, dataset.values.early and dataset.values.late in the OpenStatsList function. The missing value is specified by the dataset.values.missingValue argument and is set to NA.
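As a rough illustration of what this unification amounts to, the base R sketch below renames raw columns to the OpenStats nomenclature and recodes custom sex labels to the default "Male"/"Female" levels. It is not the package's internal implementation: the helper unify_terms and the toy data are hypothetical and exist only for this example.

```r
# Hypothetical helper (NOT part of OpenStats): rename columns to the standard
# nomenclature and recode custom sex labels to "Male"/"Female".
unify_terms <- function(df, colmap, male = "Male", female = "Female") {
  # colmap: named character vector; names = standard names, values = raw names
  idx <- match(colmap, names(df))
  if (anyNA(idx)) {
    stop("Column(s) not found: ", paste(colmap[is.na(idx)], collapse = ", "))
  }
  names(df)[idx] <- names(colmap)
  if ("Sex" %in% names(df)) {
    sex <- as.character(df$Sex)
    recoded <- ifelse(sex == male, "Male",
      ifelse(sex == female, "Female", NA_character_)
    )
    df$Sex <- factor(recoded, levels = c("Female", "Male"))
  }
  df
}

# Toy input using raw column names and custom sex labels (made-up values)
raw <- data.frame(
  biological_sample_group = c("control", "experimental", "control"),
  sex = c("m", "f", "m"),
  weight = c(21.3, 19.8, 22.1),
  stringsAsFactors = FALSE
)

clean <- unify_terms(
  raw,
  colmap = c(Genotype = "biological_sample_group", Sex = "sex", Weight = "weight"),
  male = "m", female = "f"
)
str(clean)
```

In OpenStats itself the same mapping is expressed through the dataset.colname.* and dataset.values.* arguments of OpenStatsList rather than a separate helper.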
The statistical analysis requires exactly two "Genotype" groups for comparison (e.g. wild-type versus knockout). Thus the OpenStatsList function requires users to define the reference genotype (mandatory argument refGenotype, with default value "control") and the test genotype (mandatory argument testGenotype, with default value "experimental"). If the OpenStatsList function argument dataset.clean is set to TRUE, then all records with genotype values other than the reference or test genotype are filtered out.
All tasks in OpenStats are accompanied by step-by-step reports, error messages, warnings and/or other useful information about the progress of the function. If messages are not desired, the debug argument of the OpenStatsList function can be set to FALSE to suppress them.
The chunk of code below demonstrates the use of OpenStatsList with messages enabled (the default) and with messages disabled (debug = FALSE):
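The filtering step described above can be pictured with a few lines of base R. This is only a sketch of the idea on made-up data, not the code that OpenStatsList runs; the third genotype label "het" is invented for the example.

```r
# Toy data with a third genotype that is neither reference nor test
df <- data.frame(
  Genotype = c("control", "experimental", "het", "control", "experimental"),
  data_point = c(1.2, 1.9, 1.5, 1.1, 2.0),
  stringsAsFactors = FALSE
)
refGenotype <- "control"
testGenotype <- "experimental"

# Keep only records that belong to the two groups being compared
keep <- df$Genotype %in% c(refGenotype, testGenotype)
df_clean <- df[keep, , drop = FALSE]

# Put the reference level first so effects read as test versus reference
df_clean$Genotype <- factor(df_clean$Genotype,
  levels = c(refGenotype, testGenotype)
)
table(df_clean$Genotype)
```

Ordering the factor levels with the reference genotype first is a common convention in R modelling, since the first level is treated as the baseline under the default treatment contrasts.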
#######################################
# Default behaviour with messages
#######################################
library(OpenStats)
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex"
)
#######################################
# OpenStatsList behaviour without messages
#######################################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
# No output printed
The output of the OpenStatsList function is an OpenStatsList object that contains a cleaned dataset as well as a copy of the original dataset. OpenStats supports plot and summary/print for the OpenStatsList object. Below is an example of the OpenStatsList function accompanied by the plot and summary:
library(OpenStats)
df <- read.csv(system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
))
OpenStatsList <- OpenStatsList(
  dataset = df,
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.sex = "sex",
  dataset.colname.weight = "weight",
  debug = FALSE
)
p <- plot(OpenStatsList,
  vars = c("Sex", "Genotype", "data_point"),
  ask = TRUE
)
# Plot categorical variables
p$Categorical
# Plot continuous variable
p$Continuous
summary(OpenStatsList,
  style = "grid",
  varnumbers = FALSE, # See more options ?summarytools::dfSummary
  graph.col = FALSE, # Do not show the graph column
  valid.col = FALSE,
  vars = c("Sex", "Genotype", "data_point")
)
The OpenStatsList object stores many characteristics of the data, for instance the reference genotype, test genotype, original column names and factor levels.
The OpenStats package contains three statistical frameworks for the identification of phenodeviants: the linear mixed model (MM) framework for continuous data, the Fisher's exact test (FE) framework for categorical data and the Reference Range Plus (RR) framework for continuous data.
The OpenStatsAnalysis function works as a hub for the different statistical analysis methods. It checks the dependent variable, the data, missing values and invalid terms in the model (such as terms that do not exist in the input data), runs the selected statistical analysis framework and returns the modelling/testing results. All analysis frameworks output a statistical significance measure, an effect size measure, model diagnostics and graphical visualisations.
Here we explain the main bits of the OpenStatsAnalysis function:
The possible values for the method argument are "MM", which stands for the mixed model framework, "FE" for the Fisher's exact test framework and "RR" for the Reference Range Plus framework. The semantic naming of the input arguments of the OpenStatsAnalysis function allows a natural distinction between them. For example, the "MM_", "RR_" and "FE_" prefixes mark the arguments that can be set in the corresponding frameworks. Having said that,
The OpenStatsAnalysis function performs basic checks to ensure that the data and model match and that the model is feasible for the type of data, and it reports the step-by-step progress of the function. Some of the checks and operations are listed below:
All frameworks are equipped with a step-by-step report of the progress of the function. Warnings, errors and messages are reported to the user. If the function encounters a critical failure, the output object contains a slot called "messages" that reports the cause of the failure.
The OpenStatsAnalysis output consists of three elements, namely input, output and extra. The input element encapsulates the input parameters to the function, output holds the analysis results and extra keeps some additional processing of the data/model. Below is an example output from the Reference Range Plus framework:
library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# Reference range framework
#################
RR_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "RR",
  RR_formula = data_point ~ Genotype + Sex,
  debug = FALSE
)
lapply(RR_result, names)
# lapply(RR_result$output, names)
In this section, we show some examples of the functionalities in OpenStats for the continuous and categorical data. Each section contains the code and different possible scenarios.
The linear mixed model framework applies to continuous data. In this example, data is extracted from the sample data that accompany the software. Here, "Genotype" is the effect of interest. The response is stored in the variable "data_point", and genotype (Genotype) and body weight (Weight) are covariates. The model selection is left to the default, stepwise, and the between-group covariance structure is assumed proportional to the genotype levels (different variation for controls than for mutants):
library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight
)
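To make the structure of such a model more concrete, the sketch below fits a comparable linear mixed model directly with the nlme package on simulated data. It is an illustration only, under stated assumptions: a random intercept per Batch and a separate residual variance per Genotype level (via varIdent) are assumed here to mirror the description above, and the simulated data and all parameter values are invented. No model selection step is performed.

```r
library(nlme) # recommended package that ships with standard R installations

set.seed(1)
n_batch <- 10
n_per <- 20

# Simulated data: genotypes alternate within each batch (made-up design)
sim <- data.frame(
  Batch = factor(rep(seq_len(n_batch), each = n_per)),
  Genotype = factor(
    rep(c("control", "experimental"), times = n_batch * n_per / 2),
    levels = c("control", "experimental")
  )
)
sim$Weight <- rnorm(nrow(sim), mean = 25, sd = 3)

# Batch-level shifts and genotype-specific residual spread
batch_effect <- rnorm(n_batch, sd = 0.5)[sim$Batch]
noise_sd <- ifelse(sim$Genotype == "control", 1, 1.5)
sim$data_point <- 10 + 0.8 * (sim$Genotype == "experimental") +
  0.2 * sim$Weight + batch_effect + rnorm(nrow(sim), sd = noise_sd)

# Assumed structure: random intercept for Batch, variance differing by Genotype
fit <- lme(
  fixed = data_point ~ Genotype + Weight,
  random = ~ 1 | Batch,
  weights = varIdent(form = ~ 1 | Genotype),
  data = sim,
  method = "REML"
)

# Fixed-effect estimates, standard errors and p-values
summary(fit)$tTable
```

The row labelled Genotypeexperimental in the table is the estimated genotype effect, adjusted for body weight.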
OpenStats allows fitting submodels from an input model. This is called Split model effects in the outputs and it is mainly useful for reporting sex-specific, age-specific and similar effects. This is performed by creating submodels of a full model. For instance, for the input fixed effect model, MM_fixed, $Response\sim Genotype+Sex+Weight$, a possible submodel is $Response \sim Sex+Sex:Genotype + Weight$, which can be used to estimate sex-specific effects for genotype. This model is then estimated under the configuration of the optimal model. One can turn off Split model effects by setting the fourth element of "MM_optimise" to FALSE.
An alternative to analytically estimating the submodels is to break the input data into splits and run the model on each subset of the data. This can be performed by passing the output of the OpenStatsAnalysis function, an OpenStatsMM object, to the function OpenStatsComplementarySplit. This function takes the OpenStatsMM object as input together with a set of variable names that split the data. The output is stored in an OpenStatsComplementarySplit object. The example below shows a split on "Sex":
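The effect of this reparameterisation can be seen on a small simulated example. The sketch below uses an ordinary linear model (lm) in place of the mixed model purely to keep the example short; the data and effect sizes are invented, and the point is only how the submodel formula changes the coefficients that are estimated.

```r
set.seed(2)
n <- 120

# Simulated data with a genotype effect present in males only (made-up values)
sim <- data.frame(
  Genotype = factor(rep(c("control", "experimental"), each = n / 2)),
  Sex = factor(rep(c("Female", "Male"), times = n / 2)),
  Weight = rnorm(n, mean = 25, sd = 3)
)
sim$Response <- 5 + 0.3 * sim$Weight +
  1.0 * (sim$Genotype == "experimental" & sim$Sex == "Male") +
  rnorm(n)

# Full model: a single genotype effect shared by both sexes
full_fit <- lm(Response ~ Genotype + Sex + Weight, data = sim)

# Submodel: one genotype effect per sex level
split_fit <- lm(Response ~ Sex + Sex:Genotype + Weight, data = sim)

names(coef(full_fit))
names(coef(split_fit))
```

Because the Genotype main effect is dropped, R expands Sex:Genotype into one genotype coefficient for each sex, which is what makes the estimates readable as sex-specific genotype effects.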
library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight,
  debug = FALSE
)
# SplitEffect estimation with respect to the Sex levels
Spliteffect <- OpenStatsComplementarySplit(
  object = MM_result,
  variables = "Sex"
)
class(Spliteffect)
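Conceptually, splitting the data amounts to fitting the same model once per subset. The base R sketch below shows that idea on simulated data, again with lm standing in for the mixed model; it is not how OpenStatsComplementarySplit is implemented, and the data are invented.

```r
set.seed(3)
n <- 120

# Simulated data (made-up values)
sim <- data.frame(
  Genotype = factor(rep(c("control", "experimental"), each = n / 2)),
  Sex = factor(rep(c("Female", "Male"), times = n / 2)),
  Weight = rnorm(n, mean = 25, sd = 3)
)
sim$Response <- 5 + 0.3 * sim$Weight +
  1.0 * (sim$Genotype == "experimental" & sim$Sex == "Male") +
  rnorm(n)

# One data frame per sex, then one model per data frame
by_sex <- split(sim, sim$Sex)
fits <- lapply(by_sex, function(d) lm(Response ~ Genotype + Weight, data = d))

# Genotype effect estimated separately within each sex
genotype_effects <- vapply(
  fits,
  function(f) unname(coef(f)["Genotypeexperimental"]),
  numeric(1)
)
genotype_effects
```

Unlike the submodel approach, each subset here gets its own weight slope and residual variance, so the two approaches need not give identical estimates.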
The Reference Range Plus framework applies to continuous data. In this example, data is extracted from the sample data that accompany the software. Here, "Genotype" is the effect of interest. The response is stored in the variable "data_point", and genotype (Genotype) and sex (Sex) are covariates.
library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# Reference range framework
#################
RR_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "RR",
  RR_formula = data_point ~ Genotype + Sex
)
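For readers unfamiliar with reference-range style analyses, the sketch below shows one way such a method can work in base R. The details are assumptions for illustration and are not taken from this text: control animals define a central 95% reference range, each data point is labelled Low, Normal or High against it, and the resulting categories are compared between genotypes with Fisher's exact test. The simulated data are invented.

```r
set.seed(4)

# Simulated data: 200 controls and 40 mutants with a shifted mean (made up)
sim <- data.frame(
  Genotype = factor(rep(c("control", "experimental"), times = c(200, 40))),
  data_point = c(rnorm(200, mean = 10, sd = 1), rnorm(40, mean = 11.5, sd = 1))
)

# Assumed reference range: central 95% of the control distribution
limits <- quantile(sim$data_point[sim$Genotype == "control"],
  probs = c(0.025, 0.975)
)

# Discretise every observation against the control-derived limits
sim$category <- cut(sim$data_point,
  breaks = c(-Inf, limits, Inf),
  labels = c("Low", "Normal", "High")
)

# Two 2x2 comparisons: High versus the rest, and Low versus the rest
is_high <- factor(sim$category == "High", levels = c(FALSE, TRUE))
is_low <- factor(sim$category == "Low", levels = c(FALSE, TRUE))
p_high <- fisher.test(table(sim$Genotype, is_high))$p.value
p_low <- fisher.test(table(sim$Genotype, is_low))$p.value

table(sim$Genotype, sim$category)
c(high = p_high, low = p_low)
```

The appeal of this kind of approach is that it only relies on the empirical spread of the controls, which is why it can serve as a fallback when a parametric model cannot be fitted.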
The Fisher's exact test framework applies to categorical data. In this example, data is extracted from the sample data that accompany the software. Here, Genotype is the effect of interest. The response is stored in the variable category, and Genotype and Sex are the covariates.
library(OpenStats)
#################
# Categorical data - Creating OpenStatsList object
#################
fileCat <- system.file("extdata", "test_categorical.csv",
  package = "OpenStats"
)
test_Cat <- OpenStatsList(
  dataset = read.csv(fileCat, na.strings = "-"),
  testGenotype = "Aff3/Aff3",
  refGenotype = "+/+",
  dataset.colname.genotype = "Genotype",
  dataset.colname.batch = "Assay.Date",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "Weight",
  dataset.colname.sex = "Sex",
  debug = FALSE
)
#################
# Fisher's exact test framework
#################
FE_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cat,
  method = "FE",
  FE_formula = Thoracic.Processes ~ Genotype + Sex
)
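At its core, this framework relies on Fisher's exact test applied to a contingency table of genotype against the categorical response. The minimal sketch below runs the test with base R's fisher.test on a hand-made 2x2 table. The counts and the "Normal"/"Abnormal" labels are hypothetical and are not taken from the packaged test_categorical.csv file.

```r
# Hypothetical counts: rows are genotypes, columns are observed categories
tab <- matrix(
  c(18, 2,
    9, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Genotype = c("+/+", "Aff3/Aff3"),
    Outcome = c("Normal", "Abnormal")
  )
)

res <- fisher.test(tab)
res$p.value  # significance measure
res$estimate # conditional odds ratio, one possible effect size measure
```

Repeating the same test on the male-only and female-only subsets of the data would be one way to obtain sex-specific results.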
The OpenStats package stores the input data in an OpenStatsList object and the results of the statistical analyses in an OpenStatsMM/RR/FE or OpenStatsComplementarySplit object. The standard summary/print functions can be used to print a summary table. The summary table encompasses:
The function OpenStatsReport can be used to create a detailed summary table from an OpenStatsMM/RR/FE object in the form of either a list or JSON. The following is an example of the summary output of the linear mixed model framework.
library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight,
  debug = FALSE
)
summary(MM_result)
The OpenStatsReport function was developed for large-scale applications where automated processing is required. The following is the JSON output of the function for an OpenStatsMM object (cut to the first 1500 characters):
strtrim( OpenStatsReport( object = MM_result, JSON = TRUE, RemoveNullKeys = TRUE, pretty = TRUE ), 1500 )
Graphics in OpenStats are as easy as calling the plot() function on an OpenStatsList or an OpenStatsMM/FE/RR object. Calling the plot function on the OpenStatsList object is shown below:
library(OpenStats)
###################
file <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
###################
# OpenStatsList object
###################
OpenStatsList <- OpenStatsList(
  dataset = read.csv(file),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.sex = "sex",
  dataset.colname.weight = "weight",
  debug = FALSE
)
plot(OpenStatsList)
summary(
  OpenStatsList,
  style = "grid",
  varnumbers = FALSE, # See more options ?summarytools::dfSummary
  graph.col = FALSE, # Do not show the graph column
  valid.col = FALSE
)
There are also graphics for the OpenStatsMM/FE/RR objects. Here is the list of plots for each framework:
Linear mixed model framework:
Reference Range Plus framework:
Fisher's exact test framework:
Below is an example for the OpenStatsMM output:
plot(MM_result, col = 2)
sessionInfo()