extract_entities_workflow: Extract entities from text with improved efficiency using...
In LBDiscover: Literature-Based Discovery Tools for Biomedical Research

extract_entities_workflow

R Documentation

Extract entities from text with improved efficiency using only base R

Description

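This function provides a complete workflow for extracting entities from text using dictionaries drawn from multiple sources (local, MeSH, UMLS, and an optional custom dictionary). It supports batched and optionally parallel processing, dictionary caching, and robust error handling.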
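
To make the idea of dictionary-based extraction concrete, here is a minimal base R sketch. It is not the package's implementation: the helper `match_dictionary()`, the `doc_id` column, and the `term`/`type` dictionary layout are all assumptions made for illustration.

```r
# Toy input; "abstract" mirrors the documented default for text_column,
# while doc_id is an assumed identifier column.
articles <- data.frame(
  doc_id = 1:2,
  abstract = c(
    "Aspirin reduces the risk of stroke.",
    "BRCA1 mutations are linked to breast cancer."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary layout: one term per row with its entity type.
dictionary <- data.frame(
  term = c("aspirin", "stroke", "BRCA1", "breast cancer"),
  type = c("drug", "disease", "gene", "disease"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of LBDiscover): case-insensitive
# substring matching of every dictionary term against every document.
match_dictionary <- function(text_data, text_column, dictionary) {
  texts <- tolower(text_data[[text_column]])
  hits <- lapply(seq_len(nrow(dictionary)), function(i) {
    found <- grepl(tolower(dictionary$term[i]), texts, fixed = TRUE)
    if (!any(found)) return(NULL)
    data.frame(
      doc_id = text_data$doc_id[found],
      entity = dictionary$term[i],
      entity_type = dictionary$type[i],
      stringsAsFactors = FALSE
    )
  })
  # rbind() skips the NULL entries left by terms that matched nothing.
  do.call(rbind, hits)
}

entities <- match_dictionary(articles, "abstract", dictionary)
print(entities)
```

Plain substring matching also hits longer words that contain a term (for example "strokes"), which is one reason a real workflow may sanitize its dictionaries first.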

Usage

extract_entities_workflow(
  text_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  dictionary_sources = c("local", "mesh", "umls"),
  additional_mesh_queries = NULL,
  sanitize = TRUE,
  api_key = NULL,
  custom_dictionary = NULL,
  max_terms_per_type = 200,
  verbose = TRUE,
  batch_size = 500,
  parallel = FALSE,
  num_cores = 2,
  cache_dictionaries = TRUE
)

Arguments

`text_data`	A data frame containing article text data.
`text_column`	Name of the column containing the text to process. Default is "abstract".
`entity_types`	Character vector of entity types to include. Default is c("disease", "drug", "gene").
`dictionary_sources`	Character vector of sources for entity dictionaries. Default is c("local", "mesh", "umls").
`additional_mesh_queries`	Named list of additional MeSH queries. Default is NULL.
`sanitize`	Logical. If TRUE, sanitizes dictionaries before extraction. Default is TRUE.
`api_key`	API key for UMLS access, used when "umls" is in `dictionary_sources`. Default is NULL.
`custom_dictionary`	A data frame of custom dictionary entries to include in the entity extraction process. Default is NULL.
`max_terms_per_type`	Maximum number of terms to fetch per entity type. Default is 200.
`verbose`	Logical. If TRUE, prints detailed progress information. Default is TRUE.
`batch_size`	Number of documents to process in a single batch. Default is 500.
`parallel`	Logical. If TRUE, uses parallel processing when available. Default is FALSE.
`num_cores`	Number of cores to use for parallel processing. Default is 2.
`cache_dictionaries`	Logical. If TRUE, caches dictionaries for faster reuse. Default is TRUE.
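
Examples

The call below is a hedged sketch built only from the signature shown under Usage. The toy `articles` data frame is made up, and it is an assumption that `dictionary_sources = "local"` runs without a UMLS API key and that a single text column is sufficient input. The return value is not described in this section, so the sketch only inspects it with `str()`.

```r
# Toy input; only the "abstract" column name comes from the documented default.
articles <- data.frame(
  abstract = c(
    "Aspirin reduces the risk of stroke.",
    "BRCA1 mutations are linked to breast cancer."
  ),
  stringsAsFactors = FALSE
)

# Run the workflow only if the package is installed.
if (requireNamespace("LBDiscover", quietly = TRUE)) {
  entities <- LBDiscover::extract_entities_workflow(
    text_data = articles,
    text_column = "abstract",
    entity_types = c("disease", "drug", "gene"),
    dictionary_sources = "local",  # assumption: avoids the UMLS API key
    max_terms_per_type = 200,
    batch_size = 500,
    parallel = FALSE,
    verbose = FALSE
  )
  # Inspect the structure of the result rather than assuming its columns.
  str(entities)
}
```

To include "umls" in `dictionary_sources`, pass a key through `api_key`; reading it from an environment variable with `Sys.getenv()` keeps it out of scripts.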