`rtransparency` automatically identifies and extracts **indicators of research
transparency** from the full text of biomedical articles, in both PubMed Central
(PMC) JATS XML and plain-text (PDF-derived) form. Every prediction comes with the
exact statement that triggered it, so results are auditable rather than a black
box. Detection is rule-based (curated regular expressions over the relevant
article sections), self-contained (no GitHub-only or AGPL dependencies), and
ships with reproducible accuracy benchmarks.
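
As a minimal sketch of what rule-based, auditable detection means, the base-R snippet below splits a passage into sentences and returns both a prediction and the sentence that triggered it. The `detect_statement()` helper and the toy pattern are illustrative assumptions; they are not the package's curated expressions or its API.

```r
# Hypothetical helper for illustration only; NOT part of rtransparency.
detect_statement <- function(text, pattern) {
  # Naive sentence split: whitespace that follows terminal punctuation.
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  hits <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
  list(
    is_pred   = any(hits),
    # Return the first matching sentence so the prediction can be audited.
    statement = if (any(hits)) sentences[which(hits)[1]] else NA_character_
  )
}

# Toy pattern (an assumption), far simpler than a curated rule set.
toy_coi_pattern <- "\\b(competing|conflicts? of) interests?\\b"
txt <- "We enrolled 40 patients. The authors declare no competing interests."
res <- detect_statement(txt, toy_coi_pattern)
res$is_pred
res$statement
```

Returning the matched sentence alongside the logical flag is what lets a reviewer check each prediction by eye.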
## The eight indicators
| Indicator | Detects | XML function | Text function |
|---|---|---|---|
| **Conflicts of interest** | A COI disclosure is present (including "no competing interests") | `rt_coi_pmc` | `rt_coi` |
| **Funding** | A statement that funding was received | `rt_fund_pmc` | `rt_fund` |
| **Protocol registration** | A trial/protocol registration identifier or statement (NCT, ISRCTN, PROSPERO, OSF, ChiCTR, DRKS, ANZCTR, IRCT, UMIN, ...) | `rt_register_pmc` | `rt_register` |
| **Novelty** | The article claims its own work is novel or first | `rt_novelty_pmc` | `rt_novelty` |
| **Replication** | A replication or external/independent validation was performed | `rt_replication_pmc` | `rt_replication` |
| **Data sharing** | The authors' own data are made available (repository, accession, or in-article) | `rt_data_code_pmc` | `rt_data_code` |
| **Code sharing** | The authors' own analysis code is shared | `rt_data_code_pmc` | `rt_data_code` |
| **AI disclosure** | A statement discloses generative-AI use in manuscript preparation (2023+) | `rt_ai_pmc` | `rt_ai` |

Conflicts of interest and AI disclosure are **disclosure-based**: a statement on
the topic counts whether the disclosure is positive or negative. Conflict-of-
interest and funding statements are detected not only in English but also in
**Spanish, Portuguese, French, German and Italian**.
## Installation
wzxhzdk:0
No GitHub-only or AGPL dependencies are required; data and code detection is
native (it no longer wraps `oddpub`). `rt_read_pdf()` (PDF to text) additionally
needs the poppler `pdftotext` utility on your system. The optional `furrr` and
`future` packages enable parallel corpus processing; `ggplot2` enables plotting.
## Quick start: all eight indicators in one call
wzxhzdk:1
`rt_all_pmc()` returns one row with the eight predictions, the extracted
statement for each, article identifiers and metadata, the year, and
`is_success`. `is_ai_pred` is `NA` for articles published before 2023.
## Per-indicator functions
Each indicator can be run on its own, for a PMC XML file or a plain-text file:
wzxhzdk:2
## Corpus-scale processing
`rt_all_pmc_dir()` runs all eight indicators over an entire directory (or a
vector of paths). It is built for large corpora:
wzxhzdk:3
- **Resumable**: with `output`, results are written to a CSV in chunks; a re-run
skips files already recorded and appends only the new ones.
- **Failure-isolated**: a malformed file yields an `is_success = FALSE` row
instead of aborting the run.
- **Parallel**: set `future::plan("multisession")` and `parallel = TRUE`.
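
The resumable and failure-isolated behaviour described in the list above can be sketched in base R. This is a generic pattern under stated assumptions, not the package's implementation: `process_one()` and `run_resumable()` are hypothetical names, and the stand-in detector simply fails on any path containing "bad".

```r
# Hypothetical stand-in for a per-article detector; not rtransparency code.
process_one <- function(path) {
  if (grepl("bad", path)) stop("malformed file")
  data.frame(file = path, is_success = TRUE)
}

run_resumable <- function(paths, output, chunk_size = 2) {
  # Resumable: skip files already recorded in the output CSV.
  done <- if (file.exists(output)) read.csv(output)$file else character(0)
  todo <- setdiff(paths, done)
  if (length(todo) == 0) return(invisible(0L))
  chunks <- split(todo, ceiling(seq_along(todo) / chunk_size))
  for (chunk in chunks) {
    rows <- lapply(chunk, function(p) {
      # Failure-isolated: an error becomes an is_success = FALSE row.
      tryCatch(process_one(p),
               error = function(e) data.frame(file = p, is_success = FALSE))
    })
    out <- do.call(rbind, rows)
    first_write <- !file.exists(output)
    # Write the header once, then append one chunk at a time.
    write.table(out, output, sep = ",", row.names = FALSE,
                col.names = first_write, append = !first_write)
  }
  invisible(length(todo))
}

out_csv <- tempfile(fileext = ".csv")
n1 <- run_resumable(c("a.xml", "bad.xml", "c.xml"), out_csv)
n2 <- run_resumable(c("a.xml", "bad.xml", "c.xml", "d.xml"), out_csv)
res <- read.csv(out_csv)
```

The second call processes only `d.xml`, because the other three paths are already recorded in the CSV.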
## Plain-text input
The same detectors run on plain-text (PDF-derived) articles. `rt_read_pdf()`
returns the extracted text as a character string; write it to a `.txt` file,
then point the text detectors (which share the PMC detection logic) at that file:
wzxhzdk:4
`rt_ai()` is the plain-text counterpart of `rt_ai_pmc()`. Because a text file
carries no reliable publication date, it applies **no 2023 year gate**: it
returns `TRUE`/`FALSE`, never `NA`. It also cannot confine the scan to
back-matter sections. Restrict its use to 2023-or-later articles, and expect a
slightly higher false-positive rate on papers that use AI as a research method.
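
If publication years are available from another source, the year gate can be re-applied by hand. The sketch below assumes a hypothetical results table with `file`, `year` and `is_ai_pred` columns; only the `is_ai_pred` name comes from this README, and the data are invented for illustration.

```r
# Hypothetical plain-text results joined with externally sourced years.
txt_results <- data.frame(
  file       = c("a.txt", "b.txt", "c.txt"),
  year       = c(2021, 2023, NA),
  is_ai_pred = c(TRUE, TRUE, FALSE)
)

gated <- txt_results
# Keep a prediction only when the year is known and is 2023 or later.
keep <- !is.na(gated$year) & gated$year >= 2023
gated$is_ai_pred[!keep] <- NA
gated
```

Treating an unknown year as `NA` rather than `FALSE` is a conservative choice made for this sketch; it avoids counting ungated articles as negatives.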
## Summarizing a corpus
Once you have one row per article, summarize the corpus:
wzxhzdk:5
The accuracy correction uses the bundled `rt_accuracy` table (detector
sensitivity and specificity for seven indicators). Supply your own estimates:
wzxhzdk:6
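
This README does not state the correction formula. A common way to adjust an observed prevalence for detector sensitivity and specificity is the Rogan-Gladen estimator, sketched below as an assumption about the kind of correction involved rather than a description of the package's code. The `rogan_gladen()` helper is hypothetical, the 20% observed prevalence is invented, and the sensitivity and specificity are the data-sharing values from the Validation section.

```r
# Hypothetical helper: Rogan-Gladen adjustment, clamped to [0, 1].
rogan_gladen <- function(observed, sensitivity, specificity) {
  adjusted <- (observed + specificity - 1) / (sensitivity + specificity - 1)
  pmin(pmax(adjusted, 0), 1)
}

# Data sharing: sensitivity 76.5%, specificity 99.0%; 20% of articles flagged.
corrected <- rogan_gladen(observed = 0.20, sensitivity = 0.765, specificity = 0.990)
round(corrected, 3)  # 0.252
```

Because the detector misses some true data-sharing statements, the corrected estimate is higher than the observed 20%.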
## Linking to FAIR assessment
The data- and code-availability links the detector extracts (`open_data_links`,
`open_code_links`) can be passed to FAIR-assessment tooling such as
[`rfair`](https://github.com/choxos/rfair) to score the findability and
accessibility of the shared resources.
## Validation
The indicators below are benchmarked against the human-labeled XML benchmark of
Serghiou et al. (2021). The benchmark is reproducible under
`data-raw/benchmark/`, with results in `inst/benchmark/`:
| Indicator | Sensitivity | Specificity |
|---|---|---|
| Conflicts of interest | 94.0% | 100% |
| Funding | 100% | 95.7% |
| Protocol registration | 99.2% | 96.9% |
| Data sharing | 76.5% | 99.0% |
| Code sharing | 88.1% | 99.5% |

Registration and code in the table above are labeled independently of the
detector. COI, funding and data labels in the 1000-article 2023 sample were
reconciled against detector-extracted statements (detector-adjudicated), so their
agreement is not a fully independent estimate. Data sharing is deliberately
precision-favoring: its 76.5% sensitivity trades recall for 99.0% specificity
(the original `oddpub` algorithm scores about 84% sensitivity and 97%
specificity on this set).
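
For readers who want to benchmark against their own labels, sensitivity and specificity can be computed from paired human labels and detector predictions. This is a minimal sketch with invented vectors; `sens_spec()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: sensitivity and specificity from logical vectors.
sens_spec <- function(truth, pred) {
  stopifnot(length(truth) == length(pred))
  c(
    sensitivity = sum(pred & truth) / sum(truth),     # true positives / positives
    specificity = sum(!pred & !truth) / sum(!truth)   # true negatives / negatives
  )
}

# Invented example: 4 labeled positives, 6 labeled negatives.
truth <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
acc <- sens_spec(truth, pred)
acc
```

Here the detector finds 3 of 4 positives (sensitivity 0.75) and correctly rejects 5 of 6 negatives (specificity about 0.83).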
The newer indicators are validated against maintainer-built, hand-labeled
benchmarks in `inst/benchmark/`:
| Indicator | Sensitivity | Specificity | Basis |
|---|---|---|---|
| Novelty | 83.8% | 95.2% | hand-labeled novelty/replication gold set |
| Replication | 92.8% | 98.5% | replication-enriched sample (111 positives); correction is approximate |
| AI-use disclosure | not accuracy-corrected | n/a | experimental; only 9 positives in the 2023 sample |

Replication's correction mixes designs (sensitivity from the enriched sample,
specificity from the representative 2023 sample), so it is less clean than the
single-design corrections above. AI-use disclosure is reported uncorrected and is
excluded from `rt_accuracy` until a larger labeled post-2022 sample exists. Two
further benchmarks live in `inst/benchmark/`: a **five-language sample** for
multilingual COI and funding, and a **TXT-parity benchmark** comparing the text
and XML detectors.
See `vignette("rtransparency")` for the methodology and `vignette("scope-and-limitations")`
for what each indicator does and does not capture.
## Documentation
- `vignette("rtransparency")` — introduction and methodology
- `vignette("transparency-summary")` — corpus prevalence, scoring and plotting
- `vignette("ai-disclosure")` — the AI-use disclosure indicator in depth
- `vignette("scope-and-limitations")` — indicator semantics, limitations, output schema
- Package website:
## Lineage and citation
This package is an enhanced, renamed fork of the original **`rtransparent`**
tool by Stylianos (Stelios) Serghiou. The fork is maintained by Ahmad
Sofi-Mahmudi
([ORCID 0000-0001-6829-0823](https://orcid.org/0000-0001-6829-0823), GitHub
[@choxos](https://github.com/choxos)). It adds the novelty, replication and
AI-disclosure indicators, a natively re-implemented data/code detector,
multilingual COI and funding detection, plain-text parity, and corpus-scale
batch processing. Serghiou is credited as an author.
The foundational paper: Serghiou et al., *Assessment of transparency indicators
across the biomedical literature: How open is open?* PLOS Biology, 2021,
[doi:10.1371/journal.pbio.3001107](https://doi.org/10.1371/journal.pbio.3001107).
Run `citation("rtransparency")` for both references.
## Getting help
Please file bugs or questions as issues in the package's issue tracker, with a
minimal reproducible example.