R/parseMedxn.R
In EHR: Electronic Health Record (EHR) Data Processing and Analysis Tool


#' Parse MedXN NLP Output
#'
#' Takes files with the raw medication extraction output generated by the MedXN
#' natural language processing system and converts it into a standardized format.
#'
#' Different medication extraction systems format their output differently. Before the
#' extracted information can be processed, the output from each system must be converted
#' into a standardized format. Extracted expressions for each drug entity (e.g., drug
#' name, strength, frequency) receive their own column, formatted as
#' "extracted expression::start position::stop position". If multiple expressions are
#' extracted for the same entity, they are separated by backticks.
#'
#' MedXN output files anchor each extracted attribute to a specific drug name extraction.
#'
#' A single MedXN output file can combine the results from multiple clinical notes.
#' In that case, the start of certain lines marks where the output for a new observation
#' (i.e., a new clinical note) begins. Set the argument \code{begText} to a regular
#' expression that matches the start of those lines.
#'
#' See the EHR vignettes for Extract-Med and Pro-Med-NLP, as well as Dose Building Using
#' Example Vanderbilt EHR Data, for details.
#'
#' @param filename File name for single file containing MedXN output.
#' @param begText A regular expression that would indicate the beginning of a new
#' observation (i.e., extracted clinical note).
#'
#' @return A data.table object with columns for filename, drugname, strength, dose, route,
#' freq, and duration. The filename column contains the file name corresponding to the
#' clinical note. Each entity column has the format
#' "extracted expression::start position::stop position".
#'
#' @examples
#' mxn_output <- system.file("examples", "lam_medxn.csv", package = "EHR")
#' mxn_parsed <- parseMedXN(mxn_output, begText = "^ID[0-9]+_[0-9-]+_")
#' mxn_parsed
#' @export

parseMedXN <- function(filename, begText = "^[R0-9]+_[0-9-]+_[0-9]+_") {
  con <- file(filename, 'r', blocking = TRUE)
  cnt <- 1
  bld <- list()
  while(TRUE) {
    l <- readLines(con, n = 10000)
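    # each record should start with text matching begText (default: GRID_date_note);
    # lines that do not match are appended to the record that precedes them.
    # note: a record split across a 10000-line chunk boundary is not rejoined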
    lineStart <- grepl(begText, l)
    ix <- cumsum(lineStart)
    ll <- sapply(split(l, ix), paste, collapse = ' ', USE.NAMES = FALSE)
    # split each record on the pipe delimiter, yielding one vector per field
    bld[[cnt]] <- data.table::tstrsplit(ll, "|", fixed = TRUE)
    if(length(l) < 10000) break
    cnt <- cnt + 1
  }
  close(con)
  rdf <- vector('list', cnt)
  for(i in seq_along(bld)) {
    df <- as.data.frame(bld[[i]][1:9], stringsAsFactors = FALSE)
    names(df) <- paste0('V', 1:9)
    rdf[[i]] <- df
  }
  alldf <- do.call(rbind, rdf)
  x <- data.table::as.data.table(alldf[, c(1,2,4,5,7,8,9)])
  data.table::setnames(x, c("filename", "drugname", "strength", "dose", "route", "freq", "duration"))
  # return the table visibly (setnames() returns its result invisibly)
  x
}
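
# Illustrative sketch only, not part of the EHR package API: the documentation
# above says each entity cell is formatted as
# "extracted expression::start position::stop position", with multiple
# expressions separated by backticks. The hypothetical helper below
# (splitEntityCell is an assumed name) shows one way such a cell could be
# unpacked in base R. It assumes the extracted expression itself contains
# neither "::" nor a backtick.
splitEntityCell <- function(cell) {
  # separate multiple extractions, which are delimited by backticks
  parts <- strsplit(cell, "`", fixed = TRUE)[[1]]
  # split each extraction into expression, start position, and stop position
  fields <- strsplit(parts, "::", fixed = TRUE)
  data.frame(
    expr  = vapply(fields, `[`, character(1), 1),
    start = as.integer(vapply(fields, `[`, character(1), 2)),
    stop  = as.integer(vapply(fields, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Usage with a made-up cell value (not taken from real MedXN output):
# splitEntityCell("50 mg::10::15`100 mg::40::46")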