testRetest: Find various test-retest statistics, including test, person...

Description Usage Arguments Details Value Note Author(s) References See Also Examples


Given two presentations of a test, it is straightforward to find the test-retest reliablity, as well as the item reliability and person stability across items. Using the multi-level structure of the data, it is also possible to do a variance deomposition to find variance components for people, items, time, people x time, people x items, and items x time as well as the residual variance. This leads to various generalizability cofficients.


testRetest(t1,t2=NULL,keys=NULL,id="id", time=  "time", select=NULL, 
check.keys=TRUE, warnings=TRUE,lmer=TRUE,sort=TRUE)



a data.frame or matrix for the first time of measurement.


a data.frame or matrix for the second time of measurement. May be NULL if time is specifed in t1


item names (or locations) to analyze, preface by "-" to reverse score.


subject identification codes to match across time


The name of the time variable identifying time 1 or 2 if just one data set is supplied.


A subset of items to analyze


If TRUE will automatically reverse items based upon their correlation with the first principal component. Will throw a warning when doing so, but some people seem to miss this kind of message.


If TRUE, then warn when items are reverse scored


If TRUE, include the lmer variance decomposition. By default, this is true, but this can lead to long times for large data sets.


If TRUE, the data are sorted by id and time. This allows for random ordering of data, but will fail if ids are duplicated in different studies. In that case, we need to add a constant to the ids for each study. See the last example.


There are many ways of measuring reliability. Test - Retest is one way. If the time interval is very short (or immediate), this is known as a dependability correlation, if the time interval is longer, a stability coefficient. In all cases, this is a correlation between two measures at different time points. Given the multi-level nature of these data, it is possible to find variance components associated with individuals, time, item, and time by item, etc. This leads to several different estimates of reliability (see multilevel.reliability for a discussion and references).

It is also possible to find the subject reliability across time (this is the correlation across the items at time 1 with time 2 for each subject). This is a sign of subject reliability (Wood et al, 2017). Items can show differing amounts of test-retest reliability over time. Unfortunately, the within person correlation has problems if people do not differ very much across items. If all items are in the same keyed direction, and measuring the same construct, then the response profile for an individual is essentially flat. This implies that the even with almost perfect reproducibility, that the correlation can actually be negative. The within person distance (d2) across items is just the mean of the squared differences for each item. Although highly negatively correlated with the rqq score, this does distinguish between random responders (high dqq and low rqq) from consistent responders with lower variance (low dqq and low rqq).

Several individual statistics are reported in the scores object. These can be displayed by using pairs.panels for a graphic display of the relationship and ranges of the various measures.

Although meant to decompose the variance for tests with items nested within tests, if just given two tests, the variance components for people and for time will also be shown. The resulting variance ratio of people to total variance is the intraclass correlation between the two tests. See also ICC for the more general case.



The time 1 time 2 correlation of scaled scores across time


Guttman's lambda 3 (aka alpha) and lambda 6* (item reliabilities based upon smcs) are found for the scales at times 1 and 2.


The within subject test retest reliability of response patterns over items


Item reliabilities, item loadings at time 1 and 2, item means at time 1 and time 2


A data frame of principal component scores at time 1 and time 2, raw scores from time 1 and time 2, the within person standard deviation for time 1 and time 2, and the rqq and dqq scores for each subject.


If given separate t1 and t2 data.frames, this is combination suitable for using multilevel.reliability


A key vector showing which items have been reversed


The multilevel output


lmer=TRUE is the default and will do the variance decomposition using lmer. This will take some time. For 3032 cases with 10 items from the msqR and sai data set, this takes 92 seconds, but just .63 seconds if lmer = FALSE. For the 1895 subjects with repeated measures on the sai, it takes 85 seconds with lmer and .38 without out lmer.

In the case of just two tests (no items specified), the item based statistics (alpha, rqq, item.stats, scores, xy.df) are not reported.

Several examples are given. The first takes 200 cases from the sai data set. Subjects were given the link[psychTools]{sai} twice with an intervening mood manipulation (four types of short film clips, with or without placebo/caffeine). The test retest stability of the sai are based upon the 20 sai items. The second example compares the scores of the 10 sai items that overlap with 10 items from the msqR data set from the same study. link[psychTools]{sai} and msqR were given immediately after each other and although the format differs slightly, can be seen as measures of dependability.

The third example considers the case of the Impulsivity scale from the Eysenck Personality Inventory. This is a nice example of how different estimates of reliability will differ. The alpha reliability is .51 but the test-retest correlation across several weeks is .71! That is to say, the items don't correlate very much with each other (alpha) but do with themselves across time (test-retest).

The fourth example takes the same data and jumble the subjects (who are identified by their ids). This result should be the same as the prior result because the data are automatically sorted by id.

Although typically, there are as many subjects at time 1 as time 2, testRetest will handle the case of a different number of subjects. The data are first sorted by time and id, and then those cases from time 1 that are matched at time 2 are analyzed. It is important to note that the subject id numbers must be unique.


William Revelle


Cattell, R. B. (1964). Validity and reliability: A proposed more basic set of concepts. Journal of Educational Psychology, 55(1), 1 - 22. doi: 10.1037/h0046462

Cranford, J. A., Shrout, P. E., Iida, M., Rafaeli, E., Yip, T., \& Bolger, N. (2006). A procedure for evaluating sensitivity to within-person change: Can mood measures in diary studies detect change reliably? Personality and Social Psychology Bulletin, 32(7), 917-929.

DeSimone, J. A. (2015). New techniques for evaluating temporal consistency. Organizational Research Methods, 18(1), 133-152. doi: 10.1177/1094428114553061

Revelle, W. and Condon, D. Reliability from alpha to omega: A tutorial. Psychological Assessment, 31 (12) 1395-1411.

Revelle, W. (in preparation) An introduction to psychometric theory with applications in R. Springer. (Available online at https://personality-project.org/r/book/).

Shrout, P. E., & Lane, S. P. (2012). Psychometrics. In Handbook of research methods for studying daily life. Guilford Press.

Wood, D., Harms, P. D., Lowman, G. H., & DeSimone, J. A. (2017). Response speed and response consistency as mutually validating indicators of data quality in online samples. Social Psychological and Personality Science, 8(4), 454-464. doi: 10.1177/1948550617703168

See Also

alpha, omega scoreItems, cor2


  #for faster compiling, dont test 
#lmer set to FALSE for speed.
#set lmer to TRUE to get variance components
sai.xray <- subset(psychTools::sai,psychTools::sai$study=="XRAY")
#The case where the two measures are identified by time
#automatically reverses items but throws a warning
stability <- testRetest(sai.xray[-c(1,3)],lmer=FALSE) 
stability  #show the results
#get a second data set
sai.xray1 <- subset(sai.xray,sai.xray$time==1)
msq.xray <- subset(psychTools::msqR,
 (psychTools::msqR$study=="XRAY") & (psychTools::msqR$time==1))
select <- colnames(sai.xray1)[is.element(colnames(sai.xray1 ),colnames(psychTools::msqR))] 

select <-select[-c(1:3)]  #get rid of the id information
#The case where the two times are in the form x, y

dependability <-  testRetest(sai.xray1,msq.xray,keys=select,lmer=FALSE)
dependability  #show the results

#now examine the Impulsivity subscale of the EPI
#use the epiR data set which includes epi.keys
#Imp <- selectFromKeys(epi.keys$Imp)   #fixed temporarily with 
Imp <- c("V1", "V3", "V8", "V10","V13" ,"V22", "V39" , "V5" , "V41")
imp.analysis <- testRetest(psychTools::epiR,select=Imp) #test-retest = .7, alpha=.51,.51 

#demonstrate random ordering  -- the results should be the same
n.obs <- NROW(psychTools::epiR)
ss <- sample(n.obs,n.obs)
temp.epi <- psychTools::epiR
temp.epi <-char2numeric(temp.epi)  #make the study numeric
temp.epi$id <- temp.epi$id + 300*temp.epi$study
random.epi <- temp.epi[ss,]
random.imp.analysis <- testRetest(random.epi,select=Imp)

psych documentation built on June 19, 2021, 1:06 a.m.