repMean  R Documentation 
Compute totals, means, adjusted means, mean differences, variances and standard deviations
with standard errors in random or clustered or complex samples. Variance estimation in complex cluster
designs based on Jackknife (JK1, JK2) or Balanced Repeated Replicates (BRR) procedure. Moreover, analyses
can be customized for multiple or nested imputed variables, applying the combination rules of Rubin (1987)
for imputed data and Rubin (2003) for nested imputed data. Conceptually, the function combines replication
methods and methods for multiple imputed data. Trend estimation as usual in largescale assessments is supported as well.
Technically, this is a wrapper for the svymean
and svyvar
functions of the survey
package.
repMean (datL, ID, wgt = NULL, type = c("none", "JK2", "JK1", "BRR", "Fay"), PSU = NULL, repInd = NULL, repWgt = NULL, nest=NULL, imp=NULL, groups = NULL, group.splits = length(groups), group.differences.by = NULL, cross.differences = FALSE, crossDiffSE = c("wec", "rep","old"), adjust = NULL, useEffectLiteR = FALSE, nBoot = 100, group.delimiter = "_", trend = NULL, linkErr = NULL, dependent, na.rm = FALSE, doCheck = TRUE, engine = c("survey", "BIFIEsurvey"), scale = 1, rscales = 1, mse=TRUE, rho=NULL, hetero=TRUE, se_type = c("HC3", "HC0", "HC1", "HC2", "CR0", "CR2"), clusters = NULL, crossDiffSE.engine= c("lavaan", "lm"), stochasticGroupSizes = FALSE, verbose = TRUE, progress = TRUE)
datL 
Data frame in the long format (i.e. each line represents one ID unit in one imputation of one nest) containing all variables for analysis. 
ID 
Variable name or column number of student identifier (ID) variable. ID variable must not contain any missing values. 
wgt 
Optional: Variable name or column number of weighting variable. If no weighting variable is specified, all cases will be equally weighted. 
type 
Defines the replication method for cluster replicates which is to be applied. Depending on 
PSU 
Variable name or column number of variable indicating the primary sampling unit (PSU). When a jackknife procedure is applied,
the PSU is the jackknife zone variable. If 
repInd 
Variable name or column number of variable indicating replicate ID. In a jackknife procedure, this is the jackknife replicate
variable. If 
repWgt 
Normally, replicate weights are created by 
nest 
Optional: name or column number of the nesting variable. Only applies in nested multiple imputed data sets. 
imp 
Optional: name or column number of the imputation variable. Only applies in multiple imputed data sets. 
groups 
Optional: vector of names or column numbers of one or more grouping variables. 
group.splits 
Optional: If groups are defined, 
group.differences.by 
Optional: Specifies variable group differences should be computed for. The corresponding variable must be included in
the 
cross.differences 
Either a list of vectors, specifying the pairs of levels for which crosslevel differences should be computed. Alternatively, if TRUE, crosslevel differences for all pairs of levels are computed. If FALSE, no crosslevel differences are computed. (see example 2a, 3, and 4) 
crossDiffSE 
Method for standard error estimation for cross level differences, where groups are dependent.

adjust 
Variable name or column number of variable(s) for which adjusted means should be computed. Note: to date, only numeric or 0/1 dichotomous variables are allowed. 
useEffectLiteR 
Logical: use the 
nBoot 
Without replicates (i.e., for completely random samples), the 
group.delimiter 
Character string which separates the group names in the output frame. 
trend 
Optional: name or column number of the trend variable. Note: Trend variable must have exact two levels. Levels for grouping variables must be equal in both 'sub populations' partitioned by the trend variable. 
linkErr 
Optional: Either the name or column number of the linking error variable. If 'NULL', a linking error of 0 will be assumed in trend estimation.
Alternatively, the linking error may be given as a single scalar value (i.e. 'linkErr = 1.225'). Alternatively, linking errors may be
given as data.frame with following specifications: Two columns, named 
dependent 
Variable name or column number of the dependent variable. 
na.rm 
Logical: Should cases with missing values be dropped? 
doCheck 
Logical: Check the data for consistency before analysis? If 
engine 
Which package should be used for estimation? 
scale 
scaling constant for variance, for details, see help page of 
rscales 
scaling constant for variance, for details, see help page of 
mse 
Logical: If 
rho 
Shrinkage factor for weights in Fay's method. See help page of 
hetero 
Logical: Assume heteroscedastic variance for weighted effect coding? 
se_type 
The sort of standard error sought for cross level differences. Only applies if 
clusters 
Optional: Variable name or column number of cluster variable. Only necessary if weighted effecting coding
should be performed using heteroscedastic variances. See the help page of 
crossDiffSE.engine 
Software implementation used for estimating crosslevel differences. Choices are either 
stochasticGroupSizes 
Logical: Assume stochastic group sizes for using weighted effect coding in crosslevel differences? Note: To date,
only 
verbose 
Logical: Show analysis information on console? 
progress 
Logical: Show progress bar on console? 
Function first creates replicate weights based on PSU and repInd variables (if defined) according to JK2 or BRR procedure
as implemented in WesVar. According to multiple imputed data sets, a workbook with several analyses is created.
The function afterwards serves as a wrapper for svymean
called by svyby
implemented in
the ‘survey’ package. The results of the several analyses are then pooled according to Rubin's rule.
A list of data frames in the long format. The output can be summarized using the report
function.
The first element of the list is a list with either one (no trend analyses) or two (trend analyses)
data frames with at least six columns each. For each subpopulation denoted by the groups
statement, each parameter (i.e., mean, variance, or group differences) and each coefficient (i.e., the estimate
and the corresponding standard error) the corresponding value is given.
group 
Denotes the group an analysis belongs to. If no groups were specified and/or analysis for the whole sample were requested, the value of ‘group’ is ‘wholeGroup’. 
depVar 
Denotes the name of the dependent variable in the analysis. 
modus 
Denotes the mode of the analysis. For example, if a JK2 analysis without sampling weights was conducted, ‘modus’ takes the value ‘jk2.unweighted’. If a analysis without any replicates but with sampling weights was conducted, ‘modus’ takes the value ‘weighted’. 
parameter 
Denotes the parameter of the regression model for which the corresponding value is given further. Amongst others, the ‘parameter’ column takes the values ‘mean’, ‘sd’, ‘var’ and ‘meanGroupDiff’ if group differences were requested. 
coefficient 
Denotes the coefficient for which the corresponding value is given further. Takes the values ‘est’ (estimate) and ‘se’ (standard error of the estimate). 
value 
The value of the parameter estimate in the corresponding group. 
If groups were specified, further columns which are denoted by the group names are added to the data frame.
te Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., SchmidtCatran, A., & Konig, R. (2017). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health. 62, 163–167.
Sachse, K. A. & Haag, N. (2017). Standard errors for national trends in international largescale assessments in the case of crossnational differential item functioning. Applied Measurement in Education, 30, (2), 102116. http://dx.doi.org/10.1080/08957347.2017.1283315
Weirich, S., Hecht, M., Becker, B. et al. Comparing group means with the total mean in random samples, surveys, and largescale assessments: A tutorial and software illustration. Behav Res (2021). https://doi.org/10.3758/s13428021015531
data(lsa) ### Example 1: only means, SD and variances for each country ### We only consider domain 'reading' rd < lsa[which(lsa[,"domain"] == "reading"),] ### We only consider the first "nest". rdN1 < rd[which(rd[,"nest"] == 1),] ### First, we only consider year 2010 rdN1y10< rdN1[which(rdN1[,"year"] == 2010),] ### mean estimation means1 < repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone", repInd = "jkrep", imp="imp", groups = "country", dependent = "score", na.rm=FALSE, doCheck=TRUE, engine = "BIFIEsurvey") ### reporting function: the function does not know which content domain is being considered, ### so the user may add new columns in the output using the 'add' argument res1 < report(means1, add = list(domain = "reading")) ### Example 1a: Additionally to example 1, we decide to estimate whether ### each country's mean differ significantly from the overall mean as well ### as from the individual means of the other contries means1a< repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone", repInd = "jkrep", imp="imp", groups = "country", group.splits = 0:1, group.differences.by = "country", cross.differences = TRUE, dependent = "score", na.rm=FALSE, doCheck=TRUE, hetero=FALSE) res1a < report(means1a, add = list(domain = "reading")) ### See that the means of 'countryA' and 'countryB' significantly differ from the overall mean. print(res1a[intersect(which(res1a[,"comparison"] == "crossDiff"), which(res1a[,"parameter"] == "mean")),], digits = 3) ### Example 2: Sex differences by country. Assume equally weighted cases by omitting ### 'wgt' argument. means2 < repMean(datL = rdN1y10, ID="idstud", type = "JK2", PSU = "jkzone", repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex", dependent = "score", na.rm=FALSE, doCheck=TRUE, crossDiffSE.engine= "lm") res2 < report(means2,add = list(domain = "reading")) ### Example 2a: Additionally to example 2, we decide to estimate whether ### each country's mean differ significantly from the overall mean. (Note: by default, ### such cross level differences are estimated using 'weighted effect coding'. Use the ### 'crossDiffSE' argument to choose alternative methods.) Moreover, we estimate whether ### each country's sex difference differ significantly from the sex difference in the ### whole population. means2a< repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone", repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE, doCheck=TRUE, crossDiffSE.engine= "lm", clusters = "idclass") res2a < report(means2a,add = list(domain = "reading"), trendDiffs = TRUE) ### Third example: like example 2a, but using nested imputations of dependent variable, ### and additionally estimating trend: use 'rd' instead of 'rdN1y10' ### assume equally weighted cases by omitting 'wgt' argument ### ignoring jackknife by omitting 'type', 'PSU' and 'repInd' argument means3T< repMean(datL = rd, ID="idstud", imp="imp", nest="nest", groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year", linkErr = "leScore", crossDiffSE = "wec", crossDiffSE.engine= "lavaan") res3T < report(means3T, add = list(domain = "reading")) ### Example 3a: like example 3, but providing linking errors in an additional data.frame ### This is optional for two measurement occasions but mandatory if the analysis contains ### more than two measurement occasions linkErr< data.frame ( trendLevel1 = 2010, trendLevel2 = 2015, depVar = "score", parameter = "mean", unique(lsa[,c("domain", "leScore")]), stringsAsFactors = FALSE) colnames(linkErr) < car::recode(colnames(linkErr), "'leScore'='linkingError'") ### note that the linking errors for the specified domain have to be chosen via ### subsetting means3a< repMean(datL = rd, ID="idstud", imp="imp", nest="nest", groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year", linkErr = linkErr[which(linkErr[,"domain"] == "reading"),], crossDiffSE = "wec", crossDiffSE.engine= "lavaan") res3a < report(means3a, add = list(domain = "reading")) ### Fourth example: using a loop do analyse 'reading' and 'listening' comprehension ### in one function call. Again with group and cross differences and trends, and ### trend differences ### we use weights but omit jackknife analysis by omitting 'type', 'PSU' and 'repInd' ### argument means4T< by ( data = lsa, INDICES = lsa[,"domain"], FUN = function (sub.dat) { repMean(datL = sub.dat, ID="idstud", wgt="wgt", imp="imp", nest="nest", groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year", linkErr = "leScore", crossDiffSE.engine= "lm") }) ret4T < do.call("rbind", lapply(names(means4T), FUN = function ( domain ) { report(means4T[[domain]], trendDiffs = TRUE, add = list(domain = domain))})) ### Fifth example: compute adjusted means, also with trend estimation ### Note: all covariates must be numeric or 0/1 dichotomous rdN1[,"mignum"] < as.numeric(rdN1[,"mig"]) rdN1[,"sexnum"] < car::recode(rdN1[,"sex"], "'male'=0; 'female'=1", as.numeric=TRUE, as.factor=FALSE) means5 < repMean(datL = rdN1, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone", repInd = "jkrep", imp="imp", groups = "country", adjust = c("sexnum", "ses", "mignum"), useEffectLiteR = FALSE, dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year", linkErr = "leScore") res5 < report(means5, add = list(domain = "reading")) ## Not run: ############################################################################################ # Example 6: R code for running the PISA 2015 science example to compare group means # # with the total mean using weighted effect coding # ############################################################################################ # Warning: large PISA data set requires at least 16 GB free working memory (RAM): ### define necessary directories (note: writing permissions required) folder < tempdir() ### download PISA 2015 zipped student questionnaire data (420 MB) to a folder with ### writing permissions download.file(url = "https://webfs.oecd.org/pisa/PUF_SPSS_COMBINED_CMB_STU_QQQ.zip", destfile = file.path(folder, "pisa2015.zip")) ### unzip PISA 2015 student questionnaire data (1.5 GB) to temporary folder zip::unzip(zipfile = file.path(folder, "pisa2015.zip"), files= "CY6_MS_CMB_STU_QQQ.sav", exdir=folder) ### read data pisa < foreign::read.spss(file.path (folder, "CY6_MS_CMB_STU_QQQ.sav"), to.data.frame=TRUE, use.value.labels = FALSE, use.missings = TRUE) # dependent variables measure.vars < paste0("PV", 1:10, "SCIE") ### choose desired variables and reshape into the long format # 'CNTSTUID' = individual student identifier # 'CNT' = country identifier # 'SENWT' = senate weight (assume a population of 5000 in each country) # 'W_FSTUWT' = final student weight # 'OECD' = dummy variable indicating which country is part of the OECD # 'W_FSTURWT' (1 to 80) = balanced repeated replicate weights # 'PV1SCIE' to 'PV10SCIE' = 10 plausible values of (latent) science performance pisaLong < reshape2::melt(pisa, id.vars = c("CNTSTUID", "CNT", "SENWT", "W_FSTUWT", "OECD", paste0("W_FSTURWT", 1:80)), measure.vars = measure.vars, value.name = "value", variable.name="imp", na.rm=TRUE) ### choose OECD countries oecd < pisaLong[which(pisaLong[,"OECD"] == 1),] ### analyze data ### analysis takes approximately 30 minutes on an Intel i56500 machine with 32 GB RAM means < repMean( datL = oecd, # data.frame in the long format ID = "CNTSTUID", # student identifier dependent = "value", # the dependent variable in the data groups = "CNT", # the grouping variable wgt = "SENWT", # (optional) weighting variable. We use senate # weights (assume a population of 5000 in each # country) type = "Fay", # type of replication method. Corresponding to # the PISA sampling method, we use "Fay" rho = 0.5, # shrinkage factor for weights in Fay's method scale = NULL, # scaling constant for variance, set to NULL # according to PISA's sampling method rscales = NULL, # scaling constant for variance, set to NULL # according to PISA's sampling method repWgt = paste0("W_FSTURWT", 1:80), # the replicate weights, # provided by the OECD imp = "imp", # the imputation variable mse = FALSE, # if TRUE, compute variances based on sum of # squares around the point estimate, rather # than the mean of the replicates. group.splits = 0:1, # defining the 'levels' for which means should # be computed. 0:1 implies that means for the # whole sample (level 0) as well as for groups # (level 1) are computed cross.differences = TRUE, # defines whether (and which) cross level mean # differences should be computed. TRUE means # that all cross level mean differences are # computed crossDiffSE = "wec", # method for standard errors of mean # differences crossDiffSE.engine = "lm", # software implementation for standard # errors of mean differences hetero = TRUE, # assume heteroscedastic group variances stochasticGroupSizes = FALSE # assume fixed group sizes ) ### call a reporting function to generate userfriendly output results < report(means, exclude = c("Ncases", "NcasesValid", "var", "sd")) ## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.