Summarizing transparency across a corpus

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
has_ggplot <- requireNamespace("ggplot2", quietly = TRUE)

The detector functions (rt_all_pmc(), rt_data_code_pmc()) describe one article at a time. Most studies of research transparency instead ask corpus-level questions: across thousands of articles, how often is each practice present? Is it improving over time? Does it differ by journal or article type?

This vignette shows how to go from per-article detector output to that kind of summary, using rt_summary(), rt_score() and rt_plot().

From one article to many

Running a detector on a single article returns a one-row table of indicators:

library(rtransparency)

xml <- system.file(
  "extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency"
)
one <- rt_all_pmc(xml, remove_ns = TRUE)
one[, c("pmid", "is_coi_pred", "is_fund_pred", "is_register_pred")]

To study a corpus you run a detector over many files and stack the rows; purrr::map_dfr(files, rt_all_pmc, remove_ns = TRUE) returns all eight indicators per article in one pass. The result is one row per article with the indicator columns is_coi_pred, is_fund_pred, is_register_pred, is_open_data, is_open_code, is_novelty_pred, is_replication_pred and is_ai_pred. is_ai_pred is NA for articles published before 2023, and rt_summary() drops those NAs, so the AI-disclosure prevalence is computed only over the articles where the indicator applies.

This package ships a small simulated table of that shape, rt_demo, so the rest of the vignette runs without downloading anything:

data(rt_demo)
head(rt_demo)
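To make the NA handling described above concrete, here is a minimal base-R sketch of a prevalence computed only over the articles where an indicator applies. The vector values are invented for illustration and are not taken from rt_demo; the sketch mirrors the described behaviour and is not the package's internal code.

```r
# Toy indicator column: NA marks articles where the indicator does not apply
# (values are invented for illustration, not taken from rt_demo).
is_ai_pred <- c(NA, NA, NA, FALSE, TRUE, FALSE, TRUE)

n_assessed <- sum(!is.na(is_ai_pred))        # articles where the indicator applies
n_detected <- sum(is_ai_pred, na.rm = TRUE)  # positives among those articles
percent    <- 100 * n_detected / n_assessed  # prevalence over assessed articles only

c(n_assessed = n_assessed, n_detected = n_detected, percent = percent)
```

Dividing by the number of assessed articles rather than by the full corpus size keeps pre-2023 articles from diluting the estimate.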

Prevalence of each indicator

rt_summary() reports, for each indicator, how many articles were assessed, how many were positive, the apparent prevalence and its 95% confidence interval:

s <- rt_summary(rt_demo)
knitr::kable(
  s[, c("label", "n_articles", "n_detected", "percent", "conf_low", "conf_high")],
  digits = 1,
  col.names = c("Indicator", "Assessed", "Detected", "%", "CI low", "CI high")
)
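The vignette does not say which interval method rt_summary() uses. As a rough illustration of what a 95% confidence interval around an apparent prevalence looks like, the sketch below uses the exact binomial (Clopper-Pearson) interval from stats::binom.test(); the choice of method and the counts are assumptions, so the numbers need not match rt_summary() output.

```r
# Hypothetical counts, chosen only for illustration.
n_articles <- 200  # articles assessed
n_detected <- 58   # articles flagged by the detector

percent <- 100 * n_detected / n_articles

# Exact binomial (Clopper-Pearson) 95% interval, rescaled to percent.
# Assumption: rt_summary() may use a different interval method.
ci <- 100 * binom.test(n_detected, n_articles, conf.level = 0.95)$conf.int

round(c(percent = percent, conf_low = ci[1], conf_high = ci[2]), 1)
```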

Correcting for detector error

A text-mining detector is not perfect, so the apparent (observed) prevalence is a biased estimate of the true prevalence. rt_summary() corrects for this with the Rogan-Gladen estimator, using each detector's estimated sensitivity and specificity. The correction is on by default and adds the columns adj_percent, adj_low and adj_high:

knitr::kable(
  s[, c("label", "percent", "adj_percent", "adj_low", "adj_high")],
  digits = 1,
  col.names = c("Indicator", "Apparent %", "Corrected %", "CI low", "CI high")
)
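For readers unfamiliar with the Rogan-Gladen estimator, the point estimate is (apparent + specificity - 1) / (sensitivity + specificity - 1). The function below is a minimal sketch of that formula on the proportion scale; the name rogan_gladen() is hypothetical, and truncating the result to [0, 1] is an assumption rather than documented rt_summary() behaviour.

```r
# Rogan-Gladen point estimate of true prevalence (proportions, not percents).
# rogan_gladen() is a hypothetical helper, not a function exported by the package.
rogan_gladen <- function(apparent, sensitivity, specificity) {
  youden <- sensitivity + specificity - 1
  # The estimator is only meaningful when the detector beats chance.
  stopifnot(all(youden > 0))
  adjusted <- (apparent + specificity - 1) / youden
  # Assumption: clip to the valid range, since the raw estimate can fall outside it.
  pmin(pmax(adjusted, 0), 1)
}

# Illustrative values: 30% apparent prevalence, sensitivity 0.80, specificity 0.95.
rogan_gladen(apparent = 0.30, sensitivity = 0.80, specificity = 0.95)
```

With these illustrative values the corrected prevalence is about one third: the detector misses a fifth of true positives, which outweighs the few false positives it adds.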

The accuracy values come from rt_accuracy:

rt_accuracy

AI-use disclosure has no bundled accuracy estimate, so its corrected value is NA. Novelty's estimate comes from a hand-labeled gold set (inst/benchmark/results_novelty_replication.md). The data and code values are reproducible benchmark estimates for the native detector, not untouched external-validation estimates. Replication's correction is approximate: its sensitivity comes from a replication-enriched sample and its specificity from the representative 2023 sample, so, unlike conflicts of interest, funding and registration, it does not rest on a single-design validation. The Rogan-Gladen interval also does not propagate the uncertainty in these sensitivity and specificity estimates. To use your own validation (or the published oddpub values for data and code), pass any table with variable, sensitivity and specificity columns:

my_acc <- rt_accuracy
my_acc$sensitivity[my_acc$variable == "is_open_data"] <- 0.758
rt_summary(rt_demo, indicators = "is_open_data", accuracy = my_acc)[,
  c("label", "percent", "adj_percent")]
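If you have validated a detector against hand-labeled articles, sensitivity and specificity follow directly from the confusion counts. The sketch below builds a table with the three columns named above; the counts are invented and the object name my_validation is hypothetical.

```r
# Hypothetical validation counts against a hand-labeled sample.
tp <- 45   # detector positive, truly positive
fn <- 15   # detector negative, truly positive
tn <- 130  # detector negative, truly negative
fp <- 10   # detector positive, truly negative

my_validation <- data.frame(
  variable    = "is_open_data",
  sensitivity = tp / (tp + fn),  # share of true positives the detector finds
  specificity = tn / (tn + fp)   # share of true negatives it correctly rejects
)
my_validation
```

A table of this shape has the variable, sensitivity and specificity columns that the accuracy argument expects.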

How many practices per article

rt_score() adds a per-article count of the openness practices met (conflicts of interest, funding, registration, data and code). Tabulating it shows how many articles meet zero, one, two ... of the five practices:

scored <- rt_score(rt_demo)
knitr::kable(
  as.data.frame(table(`Practices met` = scored$n_indicators)),
  col.names = c("Practices met", "Articles")
)
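Conceptually, a practice count is a row sum over the five openness indicators. The sketch below reproduces that idea in base R on an invented three-article table; the column n_met is a hypothetical name, and treating NA as "not met" is an assumption, not documented rt_score() behaviour.

```r
# Invented three-article table with the five openness indicators.
toy <- data.frame(
  is_coi_pred      = c(TRUE,  TRUE,  FALSE),
  is_fund_pred     = c(TRUE,  FALSE, FALSE),
  is_register_pred = c(FALSE, FALSE, FALSE),
  is_open_data     = c(TRUE,  FALSE, FALSE),
  is_open_code     = c(TRUE,  FALSE, FALSE)
)

practices <- c("is_coi_pred", "is_fund_pred", "is_register_pred",
               "is_open_data", "is_open_code")

# Count TRUE values per row; na.rm = TRUE treats a missing indicator as not met
# (an assumption made for this sketch).
toy$n_met <- rowSums(toy[, practices], na.rm = TRUE)

table(`Practices met` = toy$n_met)
```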

Subgroups

Pass the by argument to summarize within a grouping column, such as article type:

by_type <- rt_summary(rt_demo, by = "type", adjust = FALSE)
knitr::kable(
  by_type[by_type$indicator == "is_open_data",
          c("type", "label", "n_articles", "percent")],
  digits = 1,
  col.names = c("Type", "Indicator", "Assessed", "%")
)
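A grouped summary amounts to computing the same prevalence within each level of the grouping column. As a base-R sketch of that idea (the type labels and values are invented, and this is not the package's implementation):

```r
# Invented five-article table with a grouping column.
toy <- data.frame(
  type         = c("research", "research", "research", "review", "review"),
  is_open_data = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Articles assessed and percent positive within each article type.
n       <- tapply(!is.na(toy$is_open_data), toy$type, sum)
percent <- tapply(toy$is_open_data, toy$type,
                  function(x) 100 * mean(x, na.rm = TRUE))

data.frame(
  type       = names(percent),
  n_articles = as.vector(n),
  percent    = as.vector(percent)
)
```

Small subgroups give wide intervals, so it is worth reading the n_articles column alongside the percentages.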

Plots

rt_plot() returns a ggplot, so it composes with the usual ggplot2 layers. The default is a prevalence bar chart:

library(ggplot2)
rt_plot(rt_demo) + ggtitle("Transparency indicators in rt_demo")

Use type = "trend" with a year column to see prevalence over time:

rt_plot(rt_demo, type = "trend", year = "year")

The AI-disclosure line begins only in 2023, because the indicator is NA before then; the rising data-sharing and AI lines illustrate the kind of trend these summaries are meant to surface. Restrict a plot to particular indicators with indicators =, for example to follow AI-use disclosure on its own:

rt_plot(rt_demo, type = "trend", year = "year", indicators = "is_ai_pred") +
  ggtitle("Disclosure of generative-AI use, 2023 onward")
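The reason a trend line can start partway through the time range is that years with no assessed articles have no prevalence to plot. A minimal base-R sketch of the yearly figures behind such a line, using invented values rather than rt_demo:

```r
# Invented data: the indicator is NA before 2023.
toy <- data.frame(
  year       = c(2021, 2021, 2022, 2023, 2023, 2024, 2024),
  is_ai_pred = c(NA, NA, NA, FALSE, TRUE, TRUE, TRUE)
)

# Keep only articles where the indicator applies, then average within year.
assessed <- toy[!is.na(toy$is_ai_pred), ]
trend <- aggregate(is_ai_pred ~ year, data = assessed,
                   FUN = function(x) 100 * mean(x))
names(trend)[2] <- "percent"

trend  # only 2023 and 2024 remain; earlier years have nothing to plot
```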

Set adjusted = TRUE in either plot to show the error-corrected prevalence instead of the apparent prevalence.

Putting it together

A typical analysis therefore runs a detector over your corpus, stacks the rows, and then summarizes, scores and plots the result:

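# xml_files: paths to your PMC XML files (the directory below is a placeholder)
xml_files <- list.files("path/to/pmc_xml", pattern = "\\.xml$", full.names = TRUE)
results <- purrr::map_dfr(xml_files, rt_all_pmc, remove_ns = TRUE)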
rt_summary(results)                       # prevalence + corrected prevalence
rt_score(results)                         # per-article practice count
rt_plot(results, type = "trend", year = "year")