orm_extract: Extract risk categories from bibliographic records

View source: R/orm_extract.R

orm_extract R Documentation

Description

orm_extract() scans the selected text fields of each record (by default the title, abstract, and keywords) against the active risk dictionary and builds a binary presence matrix (records x risk categories). It also flags whether each study contains direct worker exposure data, which is the key signal for computing the WRDI indicator.

Matching is case-insensitive and uses whole-word boundary detection to avoid false positives (e.g. "laser" does not match "eyelaser").
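
The matching rule described above can be illustrated with base R regular expressions. This is only a minimal sketch, not the package's actual implementation: match_term() is a hypothetical helper, and it assumes that terms are matched with \b word boundaries and that they contain no regex metacharacters.

```r
# Hypothetical helper (not part of orisma): case-insensitive, whole-word matching.
# Assumes `term` contains no regex metacharacters that would need escaping.
match_term <- function(text, term) {
  pattern <- paste0("\\b", term, "\\b")
  grepl(pattern, text, ignore.case = TRUE, perl = TRUE)
}

hit  <- match_term("Laser safety in welding workshops", "laser")  # TRUE
miss <- match_term("A new EyeLaser device", "laser")              # FALSE
```

The second call returns FALSE because "laser" is preceded by a word character inside "EyeLaser", so there is no word boundary before it.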

Usage

orm_extract(
  refs,
  dict = orm_dict(),
  fields = c("title", "abstract", "keywords"),
  lang = getOption("orisma.lang", "en"),
  verbose = getOption("orisma.verbose", TRUE)
)

Arguments

refs

An orisma_refs object (output of orm_load() or orm_dedup()).

dict

An orisma_dict object. Default: orm_dict() (ISO 45001 / INSST / NIOSH).

fields

Character vector of the text fields to search. Default: c("title", "abstract", "keywords").

lang

Character. Language code, either "en" or "es". Default: getOption("orisma.lang", "en").

verbose

Logical. If TRUE, print progress messages. Default: getOption("orisma.verbose", TRUE).
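
Because the lang and verbose defaults shown in Usage are read through getOption(), they can presumably be changed for a whole session with options(). The sketch below uses only base R and the option names from the Usage section. It does not call orisma itself, so how the package reacts to these values is an assumption.

```r
# Set session-wide defaults; options() returns the previous values invisibly.
old <- options(orisma.lang = "es", orisma.verbose = FALSE)

# These are the expressions used as defaults in the Usage section.
lang_now    <- getOption("orisma.lang", "en")      # "es"
verbose_now <- getOption("orisma.verbose", TRUE)   # FALSE

# Restore the previous settings.
options(old)
```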

Value

A list (class orisma_matrix) containing:

refs

Original orisma_refs tibble with added columns: one binary column per risk category (cat_*), n_categories (total categories matched), and has_worker_data (logical).

matrix

Pure binary matrix (records x categories) for downstream analysis.

dict

The dictionary used.

categories

Category metadata tibble.
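
As a rough illustration of the downstream analysis that the matrix element allows, the sketch below builds a toy binary matrix in base R and summarises it. The record and category names are hypothetical stand-ins, not output from orm_extract().

```r
# Toy stand-in for the `matrix` element: 3 records x 3 hypothetical categories.
m <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    paste0("rec", 1:3),
    c("cat_noise", "cat_chemical", "cat_ergonomic")
  )
)

category_counts <- colSums(m)    # number of records matched per category
n_categories    <- rowSums(m)    # number of categories matched per record
cooccurrence    <- crossprod(m)  # category x category co-occurrence counts
```

The diagonal of cooccurrence repeats the per-category counts, and each off-diagonal cell counts the records in which both categories were matched.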

Examples

## Not run: 
refs   <- orm_load("my_references/")
deduped <- orm_dedup(refs)

# Use default dictionary
mx <- orm_extract(deduped)

# Use a customised dictionary
dict <- orm_dict()
dict <- orm_dict_add_terms(dict, "nanoparticles", c("nano-dust", "UFP"))
mx   <- orm_extract(deduped, dict = dict)

# Restrict to title + abstract only
mx <- orm_extract(deduped, fields = c("title", "abstract"))

## End(Not run)


orisma documentation built on May 19, 2026, 1:07 a.m.