An R package for statistically detecting changes in large datasets.
The variant package can be found at https://github.com/pgrugwiro/variant . To install it, run devtools::install_github("pgrugwiro/variant"). Note that the devtools package must be installed before a package can be installed from GitHub this way.
library(variant)
df1 <- data.frame(matrix(rnorm(1000000), ncol = 100))
df2 <- data.frame(matrix(rnorm(500000), ncol = 100))
variant_signal(df1, df2, alpha = 0.5)
This package contains the following functions: cil, cormat, df_check, fisher_trans, no_overlap, percent_overlap, and variant_signal. The helper functions work together to carry out the required computation and plotting. The functions are described below in alphabetical order. The full code for the helper functions can be found in appendix 2.
cil()
The cil() function calculates the confidence interval parameters of a numeric vector. It takes as input the numeric vector and the desired level of significance. As output, it gives the lower and upper bounds of the confidence interval for the population mean, as well as the length of the interval, i.e., the difference between the upper bound and the lower bound. The primary use of this function is in the graphic representation of confidence intervals, for which any two of the three outputs suffice: the lower bound and the length, the upper bound and the length, or the lower and upper bounds.
cil() in use:
library(variant)
vecx <- rnorm(100)
cil(vecx, 0.05)
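To make the three outputs concrete, here is a minimal base R sketch of a confidence interval for the mean. It assumes a t-based interval, which may differ from what cil() actually does, and the name ci_sketch is hypothetical rather than part of the package.

```r
# Hypothetical helper, not the package's cil(): a t-based confidence
# interval for the mean of a numeric vector (an assumption of this sketch).
ci_sketch <- function(x, alpha = 0.05) {
  stopifnot(is.numeric(x), length(x) > 1)
  n <- length(x)
  # Half-width of the two-sided (1 - alpha) interval.
  half_width <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  lower <- mean(x) - half_width
  upper <- mean(x) + half_width
  c(lower = lower, upper = upper, length = upper - lower)
}

set.seed(1)
vecx <- rnorm(100)
res <- ci_sketch(vecx, 0.05)
res
```

Any two of the returned values determine the third, which is why two are enough for plotting: for example, the lower bound plus the length gives the upper bound.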
cormat()
The cormat() function calculates the correlation matrix of a numeric data frame. It is specifically designed to handle large datasets by making use of parallel computing. It takes as input a numeric data frame with n columns and any number of rows, and it returns an n x (n-1) matrix of correlation coefficients. There are n-1 columns because the correlation coefficient of each variable with itself is removed from the output.
cormat() in use:
library(variant)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
cormat(df)
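The n x (n-1) shape can be illustrated with base R alone. The sketch below is not the package's cormat(): it leaves out the parallel computation, the name cormat_sketch is hypothetical, and the row-per-variable layout is an assumption.

```r
# Hypothetical sketch, not the package's cormat(): build the full
# correlation matrix with base R, then drop each variable's
# self-correlation so that every row keeps n - 1 coefficients.
cormat_sketch <- function(df) {
  stopifnot(is.data.frame(df), all(vapply(df, is.numeric, logical(1))))
  full <- cor(df)          # n x n matrix with 1s on the diagonal
  n <- ncol(full)
  # Row i of the result holds the correlations of variable i with the others.
  out <- do.call(rbind, lapply(seq_len(n), function(i) unname(full[i, -i])))
  rownames(out) <- colnames(full)
  out
}

set.seed(1)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
m <- cormat_sketch(df)
dim(m)   # 3 rows, 2 columns
```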
df_check()
The df_check() function guides the user when the input data frames do not satisfy the requirements. For example, the two data frames being analysed must have an equal number of variables, i.e., columns. They must also fulfil the numeric data type requirement. If any of these requirements is not fulfilled, an error message is displayed indicating that a problem exists. This saves the user time in diagnosing the problem. The function takes as input two data frames and displays a message on the validity of the data frames as output.
df_check() in use:
library(variant)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
df2 <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
df_check(df, df2)
library(variant)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
df2 <- data.frame(a = rnorm(100), b = rnorm(100), c = rep("a", 100))
df_check(df, df2)
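The two requirements can be expressed as a few explicit checks. The following is a rough sketch only: df_check_sketch is a hypothetical name, the message texts are invented for illustration, and raising an error with stop() is an assumption about how problems are reported.

```r
# Hypothetical validator, not the package's df_check().
# The message wording and the use of stop() are assumptions.
df_check_sketch <- function(df1, df2) {
  if (!is.data.frame(df1) || !is.data.frame(df2)) {
    stop("Both inputs must be data frames.")
  }
  if (ncol(df1) != ncol(df2)) {
    stop("The data frames must have the same number of columns.")
  }
  all_numeric <- function(df) all(vapply(df, is.numeric, logical(1)))
  if (!all_numeric(df1) || !all_numeric(df2)) {
    stop("All columns in both data frames must be numeric.")
  }
  message("Both data frames are valid.")
  invisible(TRUE)
}

good <- data.frame(a = rnorm(10), b = rnorm(10))
bad <- data.frame(a = rnorm(10), b = rep("a", 10))
ok <- df_check_sketch(good, good)
# Capture the error text instead of halting the script.
problem <- tryCatch(df_check_sketch(good, bad), error = function(e) conditionMessage(e))
problem
```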
fisher_trans()
The fisher_trans() function transforms correlation coefficients ρ into z-scores using the Fisher transformation. This is a necessary step before conducting parametric analyses on the distribution of correlation coefficients. The function takes as input a numeric value or a vector of numeric values and returns a corresponding numeric value or vector of numeric values.
fisher_trans() in use:
library(variant)
rho <- 0.89
fisher_trans(rho)
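For reference, the standard Fisher transformation is z = 0.5 * log((1 + ρ) / (1 - ρ)), which is the inverse hyperbolic tangent of ρ. The sketch below writes it out in base R; fisher_sketch is a hypothetical stand-in, and whether fisher_trans() applies any additional scaling is not stated here.

```r
# Hypothetical stand-in, not the package's fisher_trans():
# the standard Fisher z-transformation of a correlation coefficient.
fisher_sketch <- function(rho) {
  stopifnot(is.numeric(rho), all(abs(rho) < 1))
  0.5 * log((1 + rho) / (1 - rho))   # identical to atanh(rho)
}

z <- fisher_sketch(0.89)
z
fisher_sketch(c(-0.5, 0, 0.5))   # works element-wise on vectors
tanh(z)                          # the inverse transformation recovers rho
```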
no_overlap()
The no_overlap() function determines whether the confidence intervals for the population means of two numeric vectors overlap at a desired level of significance. It takes as input two vectors and the level of significance, and returns TRUE if the intervals do not overlap and FALSE if they do.
no_overlap() in use:
library(variant)
vecx <- rnorm(100)
vecy <- rnorm(100)
no_overlap(vecx, vecy, 0.05)
library(variant)
vecx <- rnorm(100)
vecy <- rnorm(100, 5, 0.5)
no_overlap(vecx, vecy, 0.05)
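The overlap test itself reduces to comparing interval bounds. A minimal sketch, assuming t-based confidence intervals for the means (ci_bounds and no_overlap_sketch are hypothetical names, not package functions):

```r
# Hypothetical helpers, not the package's code. A t-based interval
# for the mean is an assumption of this sketch.
ci_bounds <- function(x, alpha) {
  half_width <- qt(1 - alpha / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

no_overlap_sketch <- function(x, y, alpha = 0.05) {
  cx <- ci_bounds(x, alpha)
  cy <- ci_bounds(y, alpha)
  # The intervals are disjoint when one lies entirely below the other.
  unname(cx["upper"] < cy["lower"] || cy["upper"] < cx["lower"])
}

x <- c(1, 2, 3, 4, 5)
no_overlap_sketch(x, x, 0.05)         # identical intervals overlap
no_overlap_sketch(x, x + 100, 0.05)   # intervals far apart do not
```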
percent_overlap()
The percent_overlap() function calculates the percentage overlap between the confidence intervals of two numeric vectors, using equation 3 described in chapter 2. It takes as input two numeric vectors and a desired level of significance, and it outputs the percentage overlap between the confidence intervals for the population means of the input vectors. If the confidence intervals do not overlap, e.g., the upper bound of one vector's confidence interval is lower than the lower bound of the other's, a negative percentage overlap is returned.
percent_overlap() in use:
library(variant)
vecx <- rnorm(100)
vecy <- rnorm(100)
percent_overlap(vecx, vecy, 0.05)
library(variant)
vecx <- rnorm(100)
vecy <- rnorm(100, 5, 0.5)
percent_overlap(vecx, vecy, 0.05)
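Equation 3 is not reproduced in this section, so the sketch below assumes one plausible definition: the signed length of the intersection of the two intervals divided by the length of their combined span. This may not match the package's formula; it is only meant to show how a negative value can arise when the intervals are disjoint. The names ci_bounds and percent_overlap_sketch are hypothetical.

```r
# Hypothetical sketch under an ASSUMED definition of percentage overlap;
# it is not necessarily equation 3 and not the package's percent_overlap().
ci_bounds <- function(x, alpha) {
  half_width <- qt(1 - alpha / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

percent_overlap_sketch <- function(x, y, alpha = 0.05) {
  cx <- ci_bounds(x, alpha)
  cy <- ci_bounds(y, alpha)
  # Signed intersection length: negative when there is a gap between intervals.
  intersection <- min(cx["upper"], cy["upper"]) - max(cx["lower"], cy["lower"])
  # Total span covered by the two intervals together.
  span <- max(cx["upper"], cy["upper"]) - min(cx["lower"], cy["lower"])
  100 * intersection / span
}

x <- c(1, 2, 3, 4, 5)
percent_overlap_sketch(x, x, 0.05)         # identical intervals: full overlap
percent_overlap_sketch(x, x + 100, 0.05)   # disjoint intervals: negative value
```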
variant_signal()
variant_signal() is the last and most important function in the package. It pipelines all the helper functions described above and uses their output to compute the signal. It takes as input two data frames and a desired level of significance. It first checks the validity of the data frames using the df_check() function and, if they are valid, computes the correlation coefficient matrices for both data frames using the cormat() function. The correlation coefficients are transformed into their z-scores using the fisher_trans() function. Confidence interval parameters for the mean correlation coefficient of each variable are then calculated using the cil() function, and the overlap percentages and the presence or absence of overlap are determined with the percent_overlap() and no_overlap() functions, respectively. The variant_signal() function then calculates the average change for the non-overlapping category and compares it to the overall change. This comparison is expressed as a percentage and is called the "signal". Additional outputs of variant_signal() are a list of percentage overlaps as computed by percent_overlap(), the indices for which there is no overlap as determined by no_overlap(), and confidence interval plots for all the variables and for the non-overlapping variables.
variant_signal() in use:
library(variant)
df1 <- data.frame(matrix(rnorm(10000), ncol = 10))
df2 <- data.frame(matrix(rnorm(1000), ncol = 10))
variant_signal(df1, df2, 0.2)
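The order of the steps can be traced with a rough base R walk-through. This is a sketch under several assumptions, not the package's implementation: pipeline_sketch and ci_bounds are hypothetical names, the intervals are t-based, each variable's interval is taken over its n - 1 transformed correlations, and the sketch stops before the final signal percentage and the plots because their exact formulas are not given here.

```r
# Hypothetical walk-through of the described steps using base R only.
# It does NOT compute the final "signal" or produce the plots.
ci_bounds <- function(x, alpha) {
  half_width <- qt(1 - alpha / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

pipeline_sketch <- function(df1, df2, alpha = 0.05) {
  # Step 1: validity checks (same column count, numeric columns only).
  stopifnot(ncol(df1) == ncol(df2),
            all(vapply(df1, is.numeric, logical(1))),
            all(vapply(df2, is.numeric, logical(1))))
  # Steps 2 and 3: correlation matrix, then Fisher transformation.
  # Self-correlations are set to NA so they are excluded later.
  z_matrix <- function(df) {
    r <- cor(df)
    diag(r) <- NA
    atanh(r)
  }
  # Step 4: one confidence interval per variable (assumed layout).
  ci_rows <- function(z) {
    t(apply(z, 1, function(row) ci_bounds(row[!is.na(row)], alpha)))
  }
  ci1 <- ci_rows(z_matrix(df1))
  ci2 <- ci_rows(z_matrix(df2))
  # Step 5: flag variables whose two intervals are disjoint.
  disjoint <- ci1[, "upper"] < ci2[, "lower"] | ci2[, "upper"] < ci1[, "lower"]
  list(ci1 = ci1, ci2 = ci2, no_overlap_index = unname(which(disjoint)))
}

set.seed(7)
df1 <- data.frame(matrix(rnorm(10000), ncol = 10))
df2 <- data.frame(matrix(rnorm(1000), ncol = 10))
res <- pipeline_sketch(df1, df2, 0.2)
res$no_overlap_index
```

Comparing a data frame with itself is a quick sanity check for a pipeline like this: every pair of intervals is identical, so no variable should be flagged.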