```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
```r
library(aggreCAT)
library(tidyverse)
```
The [aggreCAT]{.pkg} package, and the mathematical aggregators therein, were developed by the repliCATS (Collaborative Assessments for Trustworthy Science) project as part of the SCORE program (Systematizing Confidence in Open Research and Evidence), funded by DARPA (Defense Advanced Research Projects Agency) [@alipourfard2021]. The SCORE program is the largest replication project in science to date, and aims to build automated tools that can rapidly and reliably assign "Confidence Scores" to research claims from empirical studies in the Social and Behavioural Sciences (SBS). Confidence Scores are quantitative measures of the likely reproducibility or replicability of a research claim or result, and may be used by consumers of scientific research as a proxy measure of a claim's credibility in the absence of replication effort [@alipourfard2021].
Replications are time-consuming and costly [@Isager2020], and studies have shown that replication outcomes can be reliably elicited from researchers [@Gordon2020]. Consequently, the DARPA SCORE program generated Confidence Scores for $> 4000$ SBS claims using expert elicitation based on two very different strategies -- prediction markets [@Gordon2020] and the IDEA protocol [@hemming2017], the latter of which is used by the repliCATS project [@Fraser:2021]. A proportion of these research claims were randomly selected for direct replication, against which the elicited and aggregated Confidence Scores are 'ground-truthed' or verified. These verified scores, in turn, support the development of artificial intelligence tools that can assign Confidence Scores automatically.
The [aggreCAT]{.pkg} package includes the core dataset data_ratings
consisting of judgements elicited during a pilot experiment exploring
the performance of IDEA groups in assessing replicability of a set of
claims with "known outcomes." "Known-outcome" claims are SBS research
claims that have been subject to replication studies in previous
large-scale replication projects[^1]. Data were collected using the
repliCATS IDEA protocol at a two-day workshop[^2] in the Netherlands in July 2019, at which 25 participants assessed the replicability of 25 unique SBS claims. In addition to the probabilistic estimates provided for each research claim, participants were asked to rate each claim's plausibility and comprehensibility, to state whether they were involved in any aspect of the original study, and to provide the reasoning behind their quantitative estimates, which was used to form measures of reasoning breadth and engagement [@Fraser:2021].
[^1]: Many Labs 1, 2, and 3 [@Klein2014; @Klein2018ManyL2; @Ebersole2016], the Social Sciences Replication Project [@Camerer2018], and the Reproducibility Project: Psychology [@aac4716].
[^2]: See @Hanea2021 for details. The workshop was held at the annual meeting of the Society for the Improvement of Psychological Science (SIPS), <https://osf.io/ndzpt/>.
data_ratings is a tidy [data.frame]{.class} wherein each
observation (or row) corresponds to a single value in the set of
values constituting a participant's complete assessment of a research
claim. Each research claim is assigned a unique paper_id, and each
participant has a unique (and anonymous) user_name. The variable
round denotes the round in which each value was elicited (round_1
or round_2). question denotes the type of question the value
pertains to: direct_replication for probabilistic judgements about the
replicability of the claim, belief_binary for participants' belief in
the plausibility of the claim, comprehension for participants'
comprehensibility ratings, and involved_binary for involvement in the
original study. An additional column element maintains the tidy
structure of the data, while capturing the multiple values that
comprise a full assessment of the replicability (direct_replication)
of a claim: three_point_best, three_point_lower, and
three_point_upper denote the best estimate and lower and upper bounds
respectively. binary_question describes the element for both the
plausibility rating (belief_binary) and involvement
(involved_binary) questions, whereas likert_binary is the element
describing a participant's comprehension rating. Judgements are recorded in the value column: three-point estimates are percentage probabilities in the interval $(0, 100)$; the binary_question values, corresponding to plausibility and involvement, are binary (1 for the affirmative, -1 for the negative); and comprehension ratings on the likert_binary element use a seven-point Likert scale from 1 to 7. Note that additional columns with participant attributes can be included in the ratings dataset if required by the user; we include the group column in data_ratings, which records the group the participant was assigned to. Below we show some example data for a single participant on a single claim to illustrate the structure of the core data_ratings dataset.
```r
aggreCAT::data_ratings %>%
  dplyr::filter(
    paper_id == dplyr::first(paper_id),
    user_name == dplyr::first(user_name)
  ) %>%
  head()
```
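As a quick check of this structure, the question and element combinations and their value ranges can be tabulated with [dplyr]{.pkg}; this is a minimal sketch, and the ranges it returns should match the scales described above.

```r
# Tabulate each question/element combination together with the range of
# recorded values; the ranges should match the scales described above.
aggreCAT::data_ratings %>%
  dplyr::group_by(question, element) %>%
  dplyr::summarise(
    min_value = min(value),
    max_value = max(value),
    .groups = "drop"
  )
```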
Not all of the data needed to construct performance weights are contained in data_ratings. Additional data collected as part of the repliCATS IDEA protocol are stored in separate datasets. Participants provided justifications for giving
particular judgements, and these are contained in data_justifications.
On the repliCATS platform users were given the option to comment on
others' justifications (data_comments), to vote on others' comments
(data_comment_ratings) and on others' justifications
(data_justification_ratings). Finally, [aggreCAT]{.pkg} contains three
'supplementary' datasets containing data collected externally to the
repliCATS IDEA protocol: data_supp_quiz, data_supp_priors, and
data_supp_reasons.
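The full inventory of datasets shipped with [aggreCAT]{.pkg} can be listed with base R's data():

```r
# List every dataset bundled with the aggreCAT package
data(package = "aggreCAT")
```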
Prior to the workshop, participants were asked to complete an optional quiz on statistical concepts and meta-research, knowledge that we expected to aid reliable evaluation of the replicability of research claims. Quiz responses are contained in data_supp_quiz and are used to construct
performance weights for the aggregation method QuizWAgg, where each
participant receives a quiz_score if they completed the quiz, and NA
if they did not attempt the quiz [see @Hanea2021 for further details]. Additional methods of scoring the quiz responses are provided in data_supp_quiz.
```r
aggreCAT::data_supp_quiz
```
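To illustrate how quiz scores might feed into a weighted aggregation, the sketch below normalises them into weights that sum to one. It assumes data_supp_quiz is keyed by user_name alongside the quiz_score column described above, and the handling of non-attempters (imputing the mean score) is purely illustrative rather than the QuizWAgg method itself.

```r
# Illustrative only: convert quiz scores into normalised weights.
# Assumes columns user_name and quiz_score; participants who did not
# attempt the quiz (NA) are given the mean score here for simplicity.
quiz_weights <- aggreCAT::data_supp_quiz %>%
  dplyr::mutate(
    quiz_score = dplyr::coalesce(quiz_score, mean(quiz_score, na.rm = TRUE)),
    weight = quiz_score / sum(quiz_score)
  ) %>%
  dplyr::select(user_name, weight)
```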
The ReasonWAgg aggregation method weights each participant's Best Estimate $B_{i,c}$ for claim $c$ by the number of unique reasons they gave in support of it; the underlying reasoning data are contained in data_supp_reasons.
Qualitative statements made by individuals during claim evaluation were
recorded on the repliCATS platform [@Pearson2021] and coded as falling
into one of 25 unique reasoning categories by the repliCATS Reasoning
team [@Wintle:2021]. Reasoning categories include plausibility of the
claim, effect size, sample size, presence of a power analysis,
transparency of reporting, and journal reporting [@Hanea2021]. Within
data_supp_reasons, the reasoning categories that passed our inter-coder reliability threshold appear as columns whose names are prefixed with RW. For each claim paper_id, each participant user_name receives a binary indicator for each category: 1 if they included that reasoning category in support of their Best Estimate for that claim, and 0 otherwise. See ReasoningWAgg() for details on the ReasonWAgg aggregation method.
```r
aggreCAT::data_supp_reasons %>% glimpse()
```
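The raw quantity ReasonWAgg weights on -- the number of reasoning categories a participant invoked for a claim -- can be recovered by summing the RW-prefixed indicator columns. A minimal sketch, assuming the column names follow the description above:

```r
# Count the reasoning categories each participant invoked per claim by
# summing the RW-prefixed indicator columns.
aggreCAT::data_supp_reasons %>%
  dplyr::mutate(
    n_reasons = rowSums(dplyr::across(dplyr::starts_with("RW")))
  ) %>%
  dplyr::select(paper_id, user_name, n_reasons)
```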
The method BayPRIORsAgg (implemented in BayesianWAgg()) takes a prior probability that a claim will replicate, estimated from a predictive model [@Gould2021a], and updates it using an aggregate of the Best Estimates of all participants assessing that claim $c$ [@Hanea2021]. The prior data are contained in data_supp_priors, with each claim in column paper_id assigned a prior probability (on the logit scale) of the claim replicating in column prior_means.
```r
aggreCAT::data_supp_priors
```
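Because the priors are stored on the logit scale, they can be mapped back to probabilities with the inverse-logit function; a quick sketch:

```r
# Convert logit-scale prior means back to probabilities via the
# inverse-logit (plogis).
aggreCAT::data_supp_priors %>%
  dplyr::mutate(prior_prob = stats::plogis(prior_means))
```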