cat("this will be hidden; use for general initializations.\n")
library(superb)
library(ggplot2)

The package `superb` includes the function `GRD()`. This function is used to easily generate
random data sets. With a few options, it is possible to obtain data from any design, with
any effects. This function, first created for SPSS [@hc14, @hc15], was exported to R [@ch19].
A brief report shows one possible use in the classroom for teaching statistics to undergraduates [@c20].
This vignette illustrates some of its uses.
The simplest use relies on the default values:

dta <- GRD()
head(dta)
By default, one hundred scores are generated from a normal distribution with
mean 0 and standard deviation of 1. In other words, it generates 100 z scores.
The dependent variable, the last column in the generated data frame,
is called `DV` by default. The first column is an "id" column containing a
number identifying each simulated participant. To change the dependent
variable's name, use
dta <- GRD( RenameDV = "score" )
To add various groups to the dataset, use the argument `BSFactors`, as in
dta <- GRD( BSFactors = 'Group(3)')
There will be 100 random z scores in each of three groups, for a total of 300 data points. The group
number will be given in an additional column, here called `Group`. A factorial
design can be generated with more than one factor, such as
dta <- GRD( BSFactors = c('Surgery(2)', 'Therapy(3)') )
which results in 2 $\times$ 3, that is, 6 different groups, crossing all the levels of Surgery (1 and 2) and all the levels of Therapy (1, 2 and 3). The levels can receive names rather than numbers, as in

dta <- GRD( BSFactors = c('Surgery(yes, no)', 'Therapy(CBT,Control,Exercise)') )
unique(dta$Surgery)
unique(dta$Therapy)
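A quick way to check that the design is fully crossed is to cross-tabulate the two factor columns. The sketch below assumes the default of 100 simulated participants per group described in this vignette; `cells` is just an illustrative variable name.

```r
library(superb)

# Same call as above: 2 levels of Surgery crossed with 3 levels of Therapy
dta <- GRD( BSFactors = c('Surgery(yes, no)', 'Therapy(CBT,Control,Exercise)') )

# One cell per combination of levels; each cell counts its participants
cells <- table(dta$Surgery, dta$Therapy)
print(cells)
```

Each of the six cells should contain the same number of rows, since no group sizes were specified.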
Finally, within-subject factors can also be given, as in

dta <- GRD(
    BSFactors = c('Surgery(yes,no)', 'Therapy(CBT, Control,Exercise)'),
    WSFactors = 'Contrast(C1,C2,C3)'
)
For within-subject designs, the repeated measures will appear in distinct columns (here "DV.C1", "DV.C2", and "DV.C3"). This format is called wide format, meaning that the repeated measures are all on the same line for a given simulated participant.
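Some analyses expect long format instead (one row per measurement). As a minimal sketch, the base R function `reshape()` can do the conversion. To keep the example independent of random generation, it uses a small hand-written data frame (hypothetical values) that mimics the wide layout described above; the names `wide` and `long` are illustrative only.

```r
# Hypothetical wide-format data with the same column layout as described above
wide <- data.frame(
    id      = 1:4,
    Surgery = c("yes", "yes", "no", "no"),
    DV.C1   = c( 0.1, -0.3,  1.2,  0.4),
    DV.C2   = c( 0.5,  0.2, -0.7,  1.1),
    DV.C3   = c(-1.0,  0.8,  0.3, -0.2)
)

# Stack the three repeated measures into a single DV column
long <- reshape(wide,
    direction = "long",
    varying   = c("DV.C1", "DV.C2", "DV.C3"),
    v.names   = "DV",
    timevar   = "Contrast",
    times     = c("C1", "C2", "C3"),
    idvar     = "id"
)
long <- long[order(long$id), ]   # group the rows by participant
print(long)
```

The result has three rows per simulated participant, with the level of the within-subject factor stored in the `Contrast` column.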
The default is to generate 100 participants in each between-subject group.
This default can be changed with `SubjectsPerGroup`. The most straightforward
specification is, e.g., `SubjectsPerGroup = 25` for 25 participants in each
group. Unequal group sizes can be specified with:

dta <- GRD(
    BSFactors = "Therapy(3)",
    SubjectsPerGroup = c(2, 5, 1)
)
dta
To sample random data, it is necessary to specify a theoretical population distribution.
The default is to use a normal distribution (the famous "bell-shaped" curve). That population
has a grand mean (`GM`, $\mu$) given by the element `mean` and a standard deviation ($\sigma$) given
by the element `stddev`. These can be redefined using the argument `Population` with a
list of the relevant elements. In the following example, IQ scores are simulated:

dta <- GRD(
    RenameDV = "IQ",
    Population = list(mean = 100, stddev = 15)
)
hist(dta$IQ)
(Increase the number of participants using `SubjectsPerGroup` to, say, 10,000, and the bell-shaped
curve will be evident!)
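With a large sample, the sample statistics should land close to the requested population parameters. The following is a small sanity check under that assumption; `stats` is an illustrative variable name.

```r
library(superb)

# Large sample so that sampling error is small
dta <- GRD(
    RenameDV = "IQ",
    SubjectsPerGroup = 10000,
    Population = list(mean = 100, stddev = 15)
)

# Should be close to 100 and 15, up to sampling error
stats <- c(mean = mean(dta$IQ), sd = sd(dta$IQ))
print(round(stats, 2))
```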
Internally, the above call to `GRD()` will use `rnorm` to generate the
scores, passing along for the mean parameter the grand mean (internally called
`GM`) and for the standard deviation parameter the provided standard deviation
(internally called `STDDEV`). This can be explicitly stated using the element
`scores`, as in:

dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,  # this sets GM to 100
        stddev = 15,   # this sets STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = STDDEV )"
    )
)
Using `scores`, it is possible to alter the parameters, for example, to have a mean proportional
to the group number, or the standard deviation proportional to the group number, as in:

dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,  # this sets GM to 100
        stddev = 15,   # this sets STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = Group * STDDEV )"
    )
)
superb( DV ~ Group, dta, plotStyle = "pointjitterviolin" )
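Besides the plot, the effect of `sd = Group * STDDEV` can be checked numerically by computing the standard deviation within each group. This sketch reuses the same `scores` expression as the example above, with a larger sample (an arbitrary choice) so that the estimates are stable; `sds` is an illustrative name.

```r
library(superb)

dta <- GRD(
    BSFactors = "Group(2)",
    SubjectsPerGroup = 1000,
    Population = list(
        mean   = 100,
        stddev = 15,
        scores = "rnorm(1, mean = GM, sd = Group * STDDEV )"
    )
)

# Standard deviation of DV within each group:
# roughly 15 in group 1 and roughly 30 in group 2
sds <- tapply(dta$DV, dta$Group, sd)
print(round(sds, 1))
```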
Any valid R instruction can be placed in the `scores` element, such
as `scores = "rnorm(1, mean = GM, sd = ifelse(Group==1,10,50) )"` to
select the standard deviation according to `Group`, or
`scores = "1"` to generate constants. Other theoretical distributions
can also be chosen, as in:

dta <- GRD(
    SubjectsPerGroup = 5000,
    RenameDV = "RT",
    Population = list(
        scores = "rweibull(1, shape=2, scale=40)+250"
    )
)
hist(dta$RT, breaks = seq(250, 425, by = 5))
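To see what this shifted Weibull population looks like in numbers, its theoretical mean can be computed from the standard formula for the mean of a Weibull distribution, scale $\times\ \Gamma(1 + 1/\text{shape})$, plus the shift. The sketch below uses base R only (it calls `rweibull()` directly rather than going through `GRD()`), and the variable names are illustrative.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

shape <- 2
scale <- 40
shift <- 250

# Theoretical mean of the shifted Weibull: about 285.45
theoretical_mean <- scale * gamma(1 + 1 / shape) + shift

# Direct simulation of 5000 response times from the same distribution
rt <- rweibull(5000, shape = shape, scale = scale) + shift

print(c(theoretical = theoretical_mean, simulated = mean(rt)))
```

No simulated score can fall below the shift of 250, which is why the histogram above starts at that value.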
It is possible to generate non-null effects on the factors using
the argument `Effects`. Effects can be `slope(x)` (an increase of `x`
points for each level of the factor), `extent(x)` (a total increase of
`x` over all the levels), or `custom(x, y, etc)` for an effect of `x` points for
the first level of the factor, `y` points for the second, etc.
Here is a slope effect:

dta <- GRD(
    BSFactors = 'Therapy(CBT, Control, Exercise)',
    WSFactors = 'Contrast(3)',
    SubjectsPerGroup = 1000,
    Effects = list('Contrast' = slope(2))
)
superb( crange(DV.1, DV.3) ~ Therapy, dta,
    WSFactors = "Contrast(3)",
    plotStyle = "line"
)
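The size of a slope effect can be verified from the column means: with `slope(2)`, the mean should rise by about 2 points from one level to the next. This is a minimal check using a simplified design with the within-subject factor only (a choice made here for brevity); `means` and `steps` are illustrative names.

```r
library(superb)

dta <- GRD(
    WSFactors = 'Contrast(3)',
    SubjectsPerGroup = 1000,
    Effects = list('Contrast' = slope(2))
)

# Mean of each repeated measure, then the step between successive levels
means <- colMeans(dta[, c("DV.1", "DV.2", "DV.3")])
steps <- diff(means)
print(round(steps, 2))   # both steps should be close to 2
```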
Effects can also be any R code manipulating the factors, using `Rexpression`.
One example:

dta <- GRD(
    BSFactors = 'Therapy(CBT,Control,Exercise)',
    WSFactors = 'Contrast(3) ',
    SubjectsPerGroup = 1000,
    Effects = list(
        "code1" = Rexpression("if (Therapy =='CBT'){-1} else {0}"),
        "code2" = Rexpression("if (Contrast ==3) {+1} else {0}")
    )
)
superb( crange(DV.1, DV.3) ~ Therapy, dta,
    WSFactors = "Contrast(3)",
    plotStyle = "line"
)
Repeated measures can also be generated from a multivariate normal
distribution with a correlation `rho`, with, e.g.,

dta <- GRD(
    WSFactors = 'Difficulty(1, 2)',
    SubjectsPerGroup = 1000,
    Population = list(mean = 0, stddev = 20, rho = 0.5)
)
plot(dta$DV.1, dta$DV.2)
In the case of a multivariate normal distribution, the parameters for the mean and the standard deviations can be vectors of length equal to the number of repeated measures. However, covariances are constants.

dta <- GRD(
    WSFactors = 'Difficulty(1, 2)',
    SubjectsPerGroup = 1000,
    Population = list(mean = c(10, 2), stddev = c(1, 0.2), rho = -0.85)
)
plot(dta$DV.1, dta$DV.2)
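The scatter plot shows the negative relation visually; the sample correlation gives the same information as a number. A minimal check, assuming the two repeated measures are stored in `DV.1` and `DV.2` as in the plotting call above (`r` is an illustrative name):

```r
library(superb)

dta <- GRD(
    WSFactors = 'Difficulty(1, 2)',
    SubjectsPerGroup = 1000,
    Population = list(mean = c(10, 2), stddev = c(1, 0.2), rho = -0.85)
)

# Sample correlation between the two measures; should be near -0.85
r <- cor(dta$DV.1, dta$DV.2)
print(round(r, 3))
```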
Contaminants can be inserted in the simulated data using `Contaminant`.
This argument works exactly like `Population` except for the additional
option `proportion`, which indicates the proportion of contaminants in
the samples:

dta <- GRD(
    SubjectsPerGroup = 5000,
    Population  = list( mean = 100, stddev = 15 ),
    Contaminant = list( mean = 200, stddev = 15, proportion = 0.10 )
)
hist(dta$DV, breaks = seq(-25, 300, by = 2.5))
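Because the two distributions in this example barely overlap, the share of contaminated scores can be estimated with a simple cutoff. The cutoff of 150 is an assumption made for this sketch: it lies more than 3 standard deviations from both means, so very few scores end up on the wrong side. `prop_high` is an illustrative name.

```r
library(superb)

dta <- GRD(
    SubjectsPerGroup = 5000,
    Population  = list( mean = 100, stddev = 15 ),
    Contaminant = list( mean = 200, stddev = 15, proportion = 0.10 )
)

# Proportion of scores above the cutoff; should be near 0.10
prop_high <- mean(dta$DV > 150)
print(prop_high)
```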
Contaminants can be normally distributed (as above) or come from any theoretical distribution which can be simulated in R:

dta <- GRD(
    SubjectsPerGroup = 10000,
    Population  = list( mean = 100, stddev = 15 ),
    Contaminant = list(
        proportion = 0.10,
        scores = "rweibull(1,shape=1.5, scale=30)+1.5*GM"
    )
)
hist(dta$DV, breaks = seq(0, 365, by = 2.5))
Finally, contaminants can be used to add missing data (missing completely at random) with:

dta <- GRD(
    BSFactors = "grp(2)",
    WSFactors = "Moment (2)",
    SubjectsPerGroup = 1000,
    Effects = list("grp" = slope(100) ),
    Population  = list(mean = 0, stddev = 20, rho = -0.85),
    Contaminant = list(scores = "NA", proportion = 0.2)
)
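The amount of missing data actually produced can be inspected by counting `NA` values in the dependent-variable columns. The sketch below selects those columns by their `DV` prefix, which is an assumption based on the column names seen elsewhere in this vignette; `dvcols` and `prop_missing` are illustrative names.

```r
library(superb)

dta <- GRD(
    BSFactors = "grp(2)",
    WSFactors = "Moment (2)",
    SubjectsPerGroup = 1000,
    Effects = list("grp" = slope(100) ),
    Population  = list(mean = 0, stddev = 20, rho = -0.85),
    Contaminant = list(scores = "NA", proportion = 0.2)
)

# Columns holding the repeated measures (assumed to start with "DV")
dvcols <- grep("^DV", names(dta), value = TRUE)

# Proportion of missing values in each of those columns; expected near 0.2
prop_missing <- colMeans(is.na(dta[, dvcols, drop = FALSE]))
print(round(prop_missing, 3))
```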
`GRD()` is a convenient function to generate almost any sort of data set
with any form of effects. The data can simulate any factorial design
involving between-subject designs, repeated-measure designs, and
multivariate data.
One use is of course in the classroom: students can test their skills by generating random data sets and running statistical procedures. To illustrate type-I errors, it then becomes easy to generate data with no effect whatsoever and ask the students who obtain a rejection decision to raise their hands.
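The show-of-hands demonstration can also be mimicked in a short simulation, where each replication plays the role of one student. This is only a sketch: the number of replications, the group size, the seed, and the choice of a two-group t test are all assumptions made for illustration.

```r
library(superb)

set.seed(2024)   # arbitrary seed, for reproducibility only
nsim <- 200      # number of simulated "students"

# Each replication generates two groups with no true difference
# and records the p value of a t test comparing them
pvals <- replicate(nsim, {
    dta <- GRD( BSFactors = "Group(2)", SubjectsPerGroup = 25 )
    t.test(DV ~ Group, data = dta)$p.value
})

# Proportion of "raised hands": rejections at the .05 level
rejection_rate <- mean(pvals < 0.05)
print(rejection_rate)
```

Since no effect was simulated, every rejection is a type-I error, and the rejection rate should hover around the 5% significance level.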