
relfeas

The goal of relfeas is to allow researchers to use the reliability reported in test-retest studies to assess the feasibility of new study designs. This R package accompanies the preprint: We need to talk about reliability: Making better use of test-retest studies for study design and interpretation.

Note: The package is still a bit rough around the edges for some functions. If you find anything wrong or ugly, or if I've not included something obviously useful that would follow from what's presented here, please don't be shy about letting me know in the Issues or by email at mathesong[at]gmail.com

Installation

You can install relfeas from GitHub with:

# install.packages("devtools")
devtools::install_github("mathesong/relfeas")

Examples from the paper

Below, I detail the calculations performed in the manuscript and show the package code used. Please refer to the manuscript for context.

library(relfeas)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(pwr)
library(knitr)

Example 1

Summary: this example shows how reliability estimated in a test-retest validation study with low inter-individual variance and low reliability can be extrapolated to an applied study with higher inter-individual variance. Despite the low reliability estimated in the test-retest study, the measurement demonstrates high reliability for the research question of the applied study.

sd2extrapRel(sd = 0.32, icc_original = 0.32,
             sd_original = 0.10)

Even if the measurement error were to double due to the use of partial volume effect correction in the applied study (an extreme assumption), this conclusion would still hold.

sd2extrapRel(sd = 0.32, icc_original = 0.32,
             sd_original = 0.10, tau = 2)
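
For intuition, the same extrapolation can be done by hand using the standard relation between the standard error of measurement (SEM) and the ICC, SEM = SD * sqrt(1 - ICC): the SEM estimated in the test-retest sample is assumed to carry over to the new sample (scaled by tau), and the new reliability then follows from the new SD. This is only a sketch of what sd2extrapRel() presumably does; see the function documentation and the paper for the definitive formulation.

# By-hand sketch of the extrapolation above (assumes SEM = SD * sqrt(1 - ICC),
# and that tau simply scales the SEM)
sd_original  <- 0.10
icc_original <- 0.32
sd_new       <- 0.32

sem_original <- sd_original * sqrt(1 - icc_original)  # measurement error in the TRT sample

# Extrapolated reliability in the more variable applied sample
1 - sem_original^2 / sd_new^2

# ... and with the measurement error doubled (tau = 2)
1 - (2 * sem_original)^2 / sd_new^2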

Example 2

Summary: this example shows how the reliability of measurements has enormous implications for power analysis when planning a study. In this example, both measures show reliability ≥ 0.7, but when the reliability of these measures is taken into account in the power analysis, the minimum required sample size for this research question (more details in the paper) nearly doubles.

r_attenuation(0.8, 0.7)

pwr::pwr.r.test(r=sqrt(0.3), power=0.8)

pwr::pwr.r.test(r=sqrt(0.3)*r_attenuation(0.8, 0.7), power=0.8)
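
r_attenuation() presumably implements the classical Spearman attenuation: an observed correlation is shrunk by the square root of the product of the two measures' reliabilities, r_observed = r_true * sqrt(rel_x * rel_y). A minimal by-hand version of the numbers above:

# Spearman attenuation by hand (a sketch of what r_attenuation(0.8, 0.7)
# presumably returns: sqrt(0.8 * 0.7))
r_true      <- sqrt(0.3)        # true correlation corresponding to R^2 = 0.3
attenuation <- sqrt(0.8 * 0.7)  # ~0.75
r_observed  <- r_true * attenuation

# Ratio of required sample sizes with and without attenuation
pwr::pwr.r.test(r = r_observed, power = 0.8)$n /
  pwr::pwr.r.test(r = r_true, power = 0.8)$n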

Example 3

Summary: the reliability of individual measurements, or the robustness of within-individual effects, is often mistakenly taken as a proxy for good reliability in assessing between-individual differences in within-individual effects (see the excellent paper by Hedge, Powell and Sumner, 2017). Here we demonstrate this for this particular application.

Characteristics of the sample:

# Sample characteristics

meanbp <- 1.91
delta_mean <- -0.12    # Mean percentage change in difference study
delta_sd <- 0.1        # SD of percentage change in difference study
sem <- icc2sem(icc = 0.8, sd = 0.22)
delta_sd_trt <- 0.073  # SD of percentage change in test-retest study


# Characteristics across two measurements
sd <- delta_sd * meanbp
(delta_icc <- sem2icc(sem * 2, sd))
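
These conversions presumably rest on the same SEM/ICC relation as above, SEM = SD * sqrt(1 - ICC), i.e. ICC = 1 - SEM^2/SD^2. A by-hand sketch of the two calls, with the SEM doubled as in the code above for the change across two measurements:

# By-hand version of the ICC/SEM conversions (sketch; assumes
# SEM = SD * sqrt(1 - ICC), i.e. ICC = 1 - SEM^2 / SD^2)
sem_hand <- 0.22 * sqrt(1 - 0.8)   # equivalent to icc2sem(icc = 0.8, sd = 0.22)
sd_delta <- 0.1 * 1.91             # SD of the change scores (delta_sd * meanbp)
1 - (2 * sem_hand)^2 / sd_delta^2  # equivalent to sem2icc(sem * 2, sd)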

Within-individual effects:

# Smallest detectable difference: individual
(sdd_indiv <- 100*((sem*1.96*sqrt(2))/meanbp))

# Smallest detectable difference: group
samplesize <- 2
(sdd_group <- 100*(((sem/sqrt(samplesize))*1.96*sqrt(2))/meanbp))

# Within-individual power analysis
(power_n_within <- pwr::pwr.t.test(d = delta_mean/delta_sd_trt, sig.level = 0.05,
                                   type = "paired", alternative = "less",
                                   power = 0.8))

Between-individual assessment of within-individual changes:

ss_total <- sumStat_total(n1 = 20, mean1 = abs(delta_mean*meanbp), sd1 = delta_sd*meanbp,
                          n2 = 20, mean2 = abs(2.5*delta_mean*meanbp), sd2 = delta_sd*meanbp)

# Estimation of reliability
(delta_icc_patcntrl <- sem2icc(sem*2, ss_total$sd_total))

(d_true <- ss_total$d)

# Using this estimated reliability for estimation of attenuation
(d_meas <- d_attenuation(rel_total = delta_icc_patcntrl, d = d_true))

# Increase in the number of required participants after taking reliability into account
(pwr_increase_unpaired <- pwr::pwr.t.test(d=d_meas, power = 0.8, alternative = "greater")$n /
  pwr::pwr.t.test(d=d_true, power = 0.8, alternative = "greater")$n)

(pwr_increase_paired <- pwr::pwr.t.test(d=d_meas, power = 0.8, type = "paired", alternative = "greater")$n /
  pwr::pwr.t.test(d=d_true, power = 0.8, type = "paired", alternative = "greater")$n)
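
d_attenuation() presumably applies the analogous correction for standardised mean differences: unreliability inflates the observed standard deviation by 1/sqrt(reliability), so the observed effect shrinks to d_observed = d_true * sqrt(reliability), and the required sample size grows by roughly a factor of 1/reliability. A quick check against the objects computed above:

# Sketch of the attenuation relation assumed here; should roughly match
# d_meas and the pwr_increase ratios if d_attenuation() uses d * sqrt(rel)
d_true * sqrt(delta_icc_patcntrl)

# Approximate inflation of the required sample size (ignores the
# small-sample behaviour of the t distribution)
1 / delta_icc_patcntrl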

Example 4

Summary: a measurement outcome with high reliability but also high variance (e.g. PBR28 for translocator protein, TSPO) is less well suited for assessing small proportional changes within individuals than a measurement outcome with low variance, even if the latter has relatively low reliability (e.g. AZ10419369 for frontal cortex serotonin 1B receptors). However, larger proportional within-individual changes are more likely for the former due to its larger variance.

tspo_hab_10es <- 10/42
ser1b_10es <- 10/6

(tspo_n <- round(pwr::pwr.t.test(d = tspo_hab_10es, power = 0.8)$n))

(ser1b_n <- round(pwr::pwr.t.test(d = ser1b_10es, power = 0.8)$n))
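
The effect sizes above are simply a 10% change expressed in units of each measure's between-subject variability, the denominators presumably being the respective coefficients of variation in percent. Since the required sample size scales approximately with 1/d^2 in large samples, the penalty for the high-variance measure can also be seen directly:

# Rough sample size penalty implied by the two coefficients of variation
(42 / 6)^2        # ~49-fold more variance to overcome

# Ratio from the power calculations above; somewhat lower, because the
# small-sample correction inflates ser1b_n
tspo_n / ser1b_n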

Example 5

Variance reduction strategies for PBR28 (see paper) lead to new outcome measures with poor reliability (see the paper by Matheson et al., 2017). Very large differences between groups are required before these new outcome measures become reliable for applied research questions.

One can conceptualise the required Cohen's D in various ways to get an idea of whether such a large effect is reasonable for a particular research question (more details in the paper).

#########
# Basic #
#########

(sd_basic <- extrapRel2sd(0.7, 0.5))

d_basic <- sdtot2mean2(sd_total = sd_basic, n1 = 20, n2=20, mean1 = 1)

(es_basic <- cohend_convert(d=d_basic$d))


##############
# Acceptable #
##############

(sd_acceptable <- extrapRel2sd(0.8, 0.5))

d_acceptable <- sdtot2mean2(sd_total = sd_acceptable, n1 = 20, n2=20, mean1 = 1)

(es_acceptable <- cohend_convert(d=d_acceptable$d))


############
# Clinical #
############

(sd_clin <- extrapRel2sd(0.9, 0.5))

d_clin <- sdtot2mean2(sd_total = sd_clin, n1 = 20, n2=20, mean1 = 1)

(es_clin <- cohend_convert(d=d_clin$d))
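
Each block above presumably chains the relations used earlier: extrapRel2sd() gives the total between-subject SD required to reach a target extrapolated reliability, sdtot2mean2() solves for the second group mean (and hence the Cohen's D) that would produce that total SD in a two-group sample, and cohend_convert() re-expresses the resulting D as more interpretable metrics. As a rough sketch of the first step for the 'basic' case, assuming the outcome is scaled to an SD of 1 in the original sample and that the arguments are (target reliability, original reliability):

# Sketch of the reliability-to-SD step (assumptions: original SD normalised
# to 1, SEM = SD * sqrt(1 - ICC), argument order of extrapRel2sd() as above)
icc_original <- 0.5
icc_target   <- 0.7

sem_original <- 1 * sqrt(1 - icc_original)
sem_original / sqrt(1 - icc_target)  # total SD required; compare with sd_basic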

We can also plot these effects to get a better idea of how they would look.

graphtheme <- theme(axis.text.x=element_blank(),
                    axis.text.y=element_blank())

d_fig <- grid.arrange(
  plot_difference(d = d_basic$d) + 
    labs(x='Lowest Acceptable Reliability (0.7)', title=NULL, y='Probability') + 
    graphtheme,
  plot_difference(d = d_acceptable$d) + 
    labs(x='Acceptable Reliability (0.8)', title=NULL, y='Probability') + 
    graphtheme,
  plot_difference(d = d_clin$d) + 
    labs(x='Clinical Reliability (0.9)', title=NULL, y='Probability') + 
    graphtheme,
  nrow=1)

Note that the figure above, as well as the function to generate these figures and these alternative explanation metrics describing effect sizes, are inspired by the work of Kristoffer Magnusson, a.k.a. R Psychologist, and his interactive Cohen's D figure as well as his description of where Cohen was wrong about overlap.

Other Examples

Not all of the package's functions are used in the paper. Here I demonstrate a couple of others.

Effect Size Comparisons

It is extremely important to consider effect sizes when designing studies. For this purpose, I have included plotting functions such as those in Example 5 above, as well as a function for converting between Cohen's D and alternative effect size metrics, in case one feels more intuitive than another.

## If one wishes to know the true size of what a particular Cohen's D 
## represents (e.g. as a Common Language Effect Size)

cohend_convert(d = 1)

## If one wishes to know the corresponding Cohen's D for an application
## where the effect size is more intuitively understood as, for example,
## a Common Language Effect Size

cohend_convert(cles = 0.8)
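
For reference, the Common Language Effect Size for two normal distributions with equal SDs is the probability that a randomly drawn member of one group exceeds a randomly drawn member of the other, which works out to CLES = pnorm(d / sqrt(2)). A minimal sketch of that conversion in both directions (cohend_convert() reports this alongside other metrics, though its exact output format may differ):

# CLES from Cohen's D and back (the standard McGraw & Wong relation;
# a sketch independent of relfeas)
d_to_cles <- function(d) pnorm(d / sqrt(2))
cles_to_d <- function(cles) qnorm(cles) * sqrt(2)

d_to_cles(1)    # ~0.76: a random case exceeds a random control ~76% of the time
cles_to_d(0.8)  # ~1.19: the Cohen's D implied by a CLES of 0.8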

Performing test-retest analysis

I have also included a test-retest analysis function that computes all the usual metrics reported in PET test-retest studies, as well as a few other useful ones. It produces tidy output, so it can be used with tidy data within R pipelines. I will use some data from the agRee package.

First, we prepare the data in a tidy format:

data("petVT", package = 'agRee')

Region <- map_df(petVT, ~nrow(.)) %>% 
  gather(ROI, participants) %>% 
  group_by(ROI) %>% 
  map2(.x = .$ROI, .y = .$participants, .f = ~ rep(x = .x, times = .y)) %>% 
  do.call('c', .)

petVT <- do.call(rbind, petVT) %>% 
  as_tibble() %>% 
  rename(Measurement1=V1, Measurement2=V2) %>% 
  mutate(Region = Region) %>% 
  mutate(Participant = 1:n()) %>% 
  gather(Measurement, Outcome, -Participant, -Region)

Now the data looks as follows:

head(petVT)

Then we can perform a test-retest analysis for each separate study/outcome/region.

calc_trt <- petVT %>% 
  group_by(Region) %>% 
  nest() %>% 
  mutate(outcomes = map(data, ~ trt(data = .x, values = 'Outcome', 
                                    cases = 'Participant', rater = 'Measurement')))

trt_out <- map_df(calc_trt$outcomes, 'tidy') %>% 
  mutate(Region = calc_trt$Region) %>% 
  select(Region, everything())

kable(trt_out, digits = 2)

The table lists the following for each sample:

Distribution
Reliability
Measurement Imprecision
Variation
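
As a rough cross-check of one of these metrics, the absolute variability commonly reported in PET test-retest studies (the mean absolute percent difference between the two measurements) can be computed directly from the tidy data; trt() may define or label this metric slightly differently:

# Mean absolute percent difference per region (a common definition of
# absolute variability; a sketch only, independent of trt())
petVT %>% 
  spread(Measurement, Outcome) %>% 
  group_by(Region) %>% 
  summarise(abs_var_percent = mean(200 * abs(Measurement1 - Measurement2) /
                                     (Measurement1 + Measurement2)))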

