knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
has_ggplot <- requireNamespace("ggplot2", quietly = TRUE)
The detector functions (rt_all_pmc(), rt_data_code_pmc()) describe one
article at a time. Most studies of research transparency instead ask
corpus-level questions: across thousands of articles, how often is each practice
present? Is it improving over time? Does it differ by journal or article type?
This vignette shows how to go from per-article detector output to that kind of
summary, using rt_summary(), rt_score() and rt_plot().
Running a detector on a single article returns a one-row table of indicators:
library(rtransparency)
xml <- system.file(
  "extdata", "PMID32171256-PMC7071725.xml",
  package = "rtransparency"
)
one <- rt_all_pmc(xml, remove_ns = TRUE)
one[, c("pmid", "is_coi_pred", "is_fund_pred", "is_register_pred")]
To study a corpus you run a detector over many files and stack the rows;
purrr::map_dfr(files, rt_all_pmc, remove_ns = TRUE) returns all eight
indicators per article in one pass. The result is one row per article with the
indicator columns is_coi_pred, is_fund_pred, is_register_pred,
is_open_data, is_open_code, is_novelty_pred, is_replication_pred and
is_ai_pred.
is_ai_pred is NA for articles published before 2023, and rt_summary()
drops those NAs, so the AI-disclosure prevalence is computed only over the
articles where the indicator applies.
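The effect of dropping NAs on the denominator can be seen in a toy example. This is a base-R sketch with made-up values, not package output; the `toy` data frame and the variable names are hypothetical.

```r
# Hypothetical toy corpus: is_ai_pred is NA before 2023 (values are made up).
toy <- data.frame(
  year       = c(2021, 2022, 2023, 2023, 2024),
  is_ai_pred = c(NA, NA, TRUE, FALSE, FALSE)
)

assessed   <- sum(!is.na(toy$is_ai_pred))        # articles where the indicator applies
detected   <- sum(toy$is_ai_pred, na.rm = TRUE)  # positives among them
ai_percent <- 100 * detected / assessed

assessed    # 3, not 5
ai_percent  # about 33.3, rather than 20 if the NAs were counted as negatives
```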
This package ships a small simulated table of that shape, rt_demo, so the
rest of the vignette runs without downloading anything:
data(rt_demo)
head(rt_demo)
rt_summary() reports, for each indicator, how many articles were assessed, how
many were positive, the apparent prevalence and its 95% confidence interval:
s <- rt_summary(rt_demo)
knitr::kable(
  s[, c("label", "n_articles", "n_detected", "percent", "conf_low", "conf_high")],
  digits = 1,
  col.names = c("Indicator", "Assessed", "Detected", "%", "CI low", "CI high")
)
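For intuition, an apparent prevalence and a 95% interval can be computed by hand from two counts. The sketch below uses base R's exact (Clopper-Pearson) interval as an assumption; this vignette does not state which interval method rt_summary() uses, and the counts are made up.

```r
# Hypothetical counts (made up for illustration).
n_articles <- 200
n_detected <- 50

percent <- 100 * n_detected / n_articles

# Exact (Clopper-Pearson) 95% interval from base R; the interval method
# used by rt_summary() may differ.
ci <- 100 * binom.test(n_detected, n_articles, conf.level = 0.95)$conf.int

c(percent = percent, conf_low = ci[1], conf_high = ci[2])
```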
A text-mining detector is not perfect, so the observed prevalence is a biased
estimate of the true prevalence. rt_summary() corrects for this using each
detector's sensitivity and specificity estimates (the Rogan-Gladen estimator).
The correction is on by default and adds adj_percent, adj_low and
adj_high:
knitr::kable(
  s[, c("label", "percent", "adj_percent", "adj_low", "adj_high")],
  digits = 1,
  col.names = c("Indicator", "Apparent %", "Corrected %", "CI low", "CI high")
)
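The Rogan-Gladen point estimate itself is a one-line formula: corrected = (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is a minimal base-R illustration; `rogan_gladen()` is a hypothetical helper rather than a function of this package, the inputs are made up, and truncating the result to [0, 1] is an assumption about how out-of-range values might be handled.

```r
# Rogan-Gladen estimator: corrected = (apparent + Sp - 1) / (Se + Sp - 1).
# `rogan_gladen()` is a hypothetical helper, not a function of this package.
rogan_gladen <- function(apparent, sensitivity, specificity) {
  youden <- sensitivity + specificity - 1
  stopifnot(youden > 0)  # the detector must perform better than chance
  adjusted <- (apparent + specificity - 1) / youden
  pmin(pmax(adjusted, 0), 1)  # truncate to [0, 1] (an assumption)
}

# Made-up inputs: 30% apparent prevalence, Se = 0.90, Sp = 0.95.
rogan_gladen(0.30, sensitivity = 0.90, specificity = 0.95)  # about 0.294
```

With imperfect specificity, some apparent positives are false positives, so the corrected value here falls below the apparent 30%.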
The accuracy values come from rt_accuracy:
rt_accuracy
AI-use disclosure has no bundled accuracy estimate here, so its corrected value
is NA. Novelty's estimate comes from a hand-labeled gold set
(inst/benchmark/results_novelty_replication.md); the data/code values are
reproducible benchmark estimates for the native detector, not untouched
external-validation estimates. Replication's correction is approximate: its
sensitivity comes from a replication-enriched sample and its specificity from
the representative 2023 sample, so unlike conflicts of interest, funding and
registration it does not rest on a single validation design. In addition, the
Rogan-Gladen interval does not propagate uncertainty in these accuracy estimates.
To use your own validation (or the published oddpub values for data and
code), pass any table with variable, sensitivity and specificity columns:
my_acc <- rt_accuracy
my_acc$sensitivity[my_acc$variable == "is_open_data"] <- 0.758
rt_summary(rt_demo, indicators = "is_open_data", accuracy = my_acc)[, c("label", "percent", "adj_percent")]
rt_score() adds a per-article count of the openness practices met (conflicts
of interest, funding, registration, data and code). Tabulating it shows how many
articles meet zero, one, two ... of the five practices:
scored <- rt_score(rt_demo)
knitr::kable(
  as.data.frame(table(`Practices met` = scored$n_indicators)),
  col.names = c("Practices met", "Articles")
)
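Conceptually, a count of this kind is a row-wise sum over the five indicator columns. The following base-R sketch illustrates the idea on a made-up table; it is not the implementation of rt_score(), the `toy` data and the `n_met` column are hypothetical, and treating NAs as "not met" is an assumption.

```r
# Hypothetical toy table with the five openness indicators (values made up).
toy <- data.frame(
  is_coi_pred      = c(TRUE,  TRUE,  FALSE),
  is_fund_pred     = c(TRUE,  FALSE, FALSE),
  is_register_pred = c(FALSE, FALSE, FALSE),
  is_open_data     = c(TRUE,  FALSE, FALSE),
  is_open_code     = c(TRUE,  FALSE, FALSE)
)
practices <- c("is_coi_pred", "is_fund_pred", "is_register_pred",
               "is_open_data", "is_open_code")

# Count the practices met per article. How rt_score() treats NAs is not
# shown in this vignette, so na.rm = TRUE is an assumption.
toy$n_met <- rowSums(toy[practices], na.rm = TRUE)
table(toy$n_met)
```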
Pass by to summarize within a grouping column, such as article type:
by_type <- rt_summary(rt_demo, by = "type", adjust = FALSE)
knitr::kable(
  by_type[by_type$indicator == "is_open_data",
          c("type", "label", "n_articles", "percent")],
  digits = 1,
  col.names = c("Type", "Indicator", "Assessed", "%")
)
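A grouped summary of this kind amounts to computing the apparent prevalence separately within each level of the grouping column. A rough base-R equivalent, on a made-up table with hypothetical group labels, might look like this:

```r
# Hypothetical toy table (values made up); `type` mirrors the grouping column.
toy <- data.frame(
  type         = c("research", "research", "research", "review", "review"),
  is_open_data = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Percent positive within each group (apparent prevalence only, no correction).
by_group <- aggregate(is_open_data ~ type, data = toy,
                      FUN = function(x) 100 * mean(x))
by_group
```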
rt_plot() returns a ggplot, so it composes with the usual ggplot2 layers.
The default is a prevalence bar chart:
library(ggplot2)
rt_plot(rt_demo) + ggtitle("Transparency indicators in rt_demo")
Use type = "trend" with a year column to see prevalence over time:
rt_plot(rt_demo, type = "trend", year = "year")
The AI-disclosure line begins only in 2023, because the indicator is NA
before then; the rising data-sharing and AI lines illustrate the kind of trend
these summaries are meant to surface. Restrict a plot to particular indicators
with indicators =, for example to follow AI-use disclosure on its own:
rt_plot(rt_demo, type = "trend", year = "year", indicators = "is_ai_pred") +
  ggtitle("Disclosure of generative-AI use, 2023 onward")
Set adjusted = TRUE in either plot to show the error-corrected prevalence
instead of the apparent prevalence.
A typical analysis is therefore: run a detector over your corpus, stack the rows,
then summarize, score and plot the result:
results <- purrr::map_dfr(xml_files, rt_all_pmc, remove_ns = TRUE)
rt_summary(results)  # prevalence + corrected prevalence
rt_score(results)    # per-article practice count
rt_plot(results, type = "trend", year = "year")
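One way to build the vector of files handed to the detector is base R's list.files(). This is only a sketch: the directory name is a placeholder, and it assumes the corpus is stored as one XML file per article.

```r
# "path/to/corpus" is a placeholder for a directory of PMC XML files.
xml_files <- list.files("path/to/corpus", pattern = "\\.xml$",
                        full.names = TRUE)
length(xml_files)  # number of articles that would be processed
```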
For the per-indicator detection methodology, see
vignette("rtransparency").