View source: R/text_preprocessing.R
extract_entities_workflow | R Documentation |
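Runs a complete workflow for extracting entities from text, using dictionaries drawn from multiple sources (by default local, MeSH, and UMLS). Performance options include batch processing, optional parallel execution, and dictionary caching; the workflow also includes error handling.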
extract_entities_workflow(
text_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene"),
dictionary_sources = c("local", "mesh", "umls"),
additional_mesh_queries = NULL,
sanitize = TRUE,
api_key = NULL,
custom_dictionary = NULL,
max_terms_per_type = 200,
verbose = TRUE,
batch_size = 500,
parallel = FALSE,
num_cores = 2,
cache_dictionaries = TRUE
)
text_data | A data frame containing article text data. |
text_column | Name of the column containing text to process. |
entity_types | Character vector of entity types to include. |
dictionary_sources | Character vector of sources for entity dictionaries. |
additional_mesh_queries | Named list of additional MeSH queries. |
sanitize | Logical. If TRUE, sanitizes dictionaries before extraction. |
api_key | API key for UMLS access (if "umls" is in dictionary_sources). |
custom_dictionary | A data frame containing custom dictionary entries to incorporate into the entity extraction process. |
max_terms_per_type | Maximum number of terms to fetch per entity type. Default is 200. |
verbose | Logical. If TRUE, prints detailed progress information. |
batch_size | Number of documents to process in a single batch. Default is 500. |
parallel | Logical. If TRUE, uses parallel processing when available. Default is FALSE. |
num_cores | Number of cores to use for parallel processing. Default is 2. |
cache_dictionaries | Logical. If TRUE, caches dictionaries for faster reuse. Default is TRUE. |
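To make the batch_size argument concrete, the following is a minimal base R sketch of how a set of documents could be divided into batches of at most 500 rows. It is an illustration only: this page does not show how the function batches internally, and the real implementation may differ. The helper name split_into_batches and the toy data frame are hypothetical.

```r
# Hypothetical helper: split a data frame into consecutive batches of at most
# `batch_size` rows. This is NOT the package's internal code, only a sketch.
split_into_batches <- function(df, batch_size = 500) {
  stopifnot(is.data.frame(df), batch_size >= 1)
  if (nrow(df) == 0) return(list())
  # Assign each row a batch index: rows 1..batch_size get 1, the next block 2, ...
  batch_id <- ceiling(seq_len(nrow(df)) / batch_size)
  split(df, batch_id)
}

# Toy input: 1200 placeholder abstracts (made-up data for illustration).
toy <- data.frame(
  doc_id = seq_len(1200),
  abstract = paste("placeholder abstract", seq_len(1200)),
  stringsAsFactors = FALSE
)

batches <- split_into_batches(toy, batch_size = 500)
length(batches)                    # number of batches
vapply(batches, nrow, integer(1))  # rows in each batch
```

With 1200 rows and the default batch size, the last batch is smaller than the others, which is the usual trade-off of fixed-size batching: memory use per batch is bounded, at the cost of one partial batch at the end.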
A data frame with extracted entities, their types, and positions.
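A call might look like the sketch below, which uses only the arguments listed in the usage above. The toy data frame is made up for illustration, and restricting dictionary_sources to "local" is an assumption made here so that no api_key is passed; whether that source works without network access is not stated on this page. The call is guarded so that the snippet runs without error when the function is not available in the session.

```r
# Toy input: one row per article, with text in the default "abstract" column.
articles <- data.frame(
  doc_id = 1:2,
  abstract = c(
    "Metformin is widely used in the treatment of type 2 diabetes.",
    "Mutations in BRCA1 are associated with breast cancer risk."
  ),
  stringsAsFactors = FALSE
)

# Run only if the function is available in the current session,
# i.e. the package that provides it has been attached.
if (exists("extract_entities_workflow", mode = "function")) {
  entities <- extract_entities_workflow(
    text_data = articles,
    text_column = "abstract",
    entity_types = c("disease", "drug", "gene"),
    dictionary_sources = "local",  # assumption: skips the UMLS source, so no api_key is passed
    batch_size = 500,
    parallel = FALSE,
    verbose = FALSE
  )
  # Inspect the returned data frame; its column names are not documented on this page.
  str(entities)
}
```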