CRFsuite tagger"

library(shiny)
library(flexdashboard)
tags$script('
  function getSelectionText() {
    var text = "";
    var offset = 0;
    if (window.getSelection) {
      s = window.getSelection();
      text = s.toString();
      range = s.getRangeAt(0).cloneRange();
      offset = range.startOffset;
    } else if (document.selection) {
      range = document.selection.createRange();
      text = range.text;
      offset = range.range.startOffset;
    }
    return {
        text: text,
        offset: offset
    };
  }
  document.onmouseup = document.onkeyup = document.onselectionchange = function() {
    var selection = getSelectionText();
    var ui_txt_id = document.getElementById("ui_txt");
    //var ui_txt_text = ui_txt_id.innerText;
    var ui_txt_text = ui_txt_id.textContent;
    Shiny.onInputChange("visible_text", ui_txt_text);
    Shiny.onInputChange("highlighted_text", selection.text);
    Shiny.onInputChange("highlighted_offset", selection.offset);
  };
')
dashdata <- list()
dashdata$file <- "default"
dashdata$save_as <- tempfile(pattern = sprintf("crfsuite_annotation_%s", format(Sys.time(), "%Y%m%d_%H%M%S")), fileext = ".rds")
dashdata$categories <- c("OTHER")
dashdata$rawdata <- data.frame(doc_id = c("introduction", "example"), text = c(
"This tool allows to easily annnote parts of plain text with categories. Why?\n   o In order to facilitate getting a training dataset for constructing models with the crfsuite R package\n   o To allow to easily build statistical models for named entity recognition, chunking, parsing or intent classification\n\nHow to annotate?\n   o Select a part of the text, indicate which category your selection belongs to and press save.\n   o Done? Press 'Annotate another document' or go to the tab 'upload new set of documents' to provide your dataset to tag.\n\nStart trying this out by providing a list of categories in the tab 'Manage tagging categories', press on 'Annotate another document' and try it out. Once done, upload your own set of documents to tag in the tab 'Upload new set of documents'",

#"<b>This tool allows to easily annnote parts of plain text with categories. Why?</b> <ul><li>In order to facilitate getting a training dataset for constructing models with the crfsuite R package</li><li>To allow to easily build statistical models for named entity recognition, chunking, parsing or intent classification</li></ul><b>How to annotate?</b><ol><li>Select a part of the text, indicate which category your selection belongs to and press save. </li><li>Done? Press 'Annotate another document' or go to the tab 'upload new set of documents' to provide your dataset to tag.</li></ol>Start trying this out by providing a list of categories in the tab 'Manage tagging categories', press on 'Annotate another document' and try it out. Once done, upload your own set of documents to tag in the tab 'Upload new set of documents'", 
'The standard Lorem Ipsum passage, used since the 1500s

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
Section 1.10.32 of "de Finibus Bonorum et Malorum", written by Cicero in 45 BC

"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?"
1914 translation by H. Rackham

"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, but because occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that produces no resultant pleasure?"
Section 1.10.33 of "de Finibus Bonorum et Malorum", written by Cicero in 45 BC

"At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut officiis debitis aut rerum necessitatibus saepe eveniet ut et voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat."
1914 translation by H. Rackham

"On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains."
'), stringsAsFactors = FALSE)
dashdata$current_row <- 0
dashdata$total_rows <- nrow(dashdata$rawdata)
current_rds <- reactiveVal("startup")

save_annotation <- function(result){
  if(length(result$doc_id) == 0 || length(result$chunk) == 0){
    return(invisible())
  }
  if(file.exists(dashdata$save_as)){
    out <- readRDS(file = dashdata$save_as)
    if(result$doc_id %in% out$doc_id){
      out <- subset(out, !(doc_id %in% result$doc_id & is.na(start)))
    }
    out <- rbind(out, data.frame(result, stringsAsFactors = FALSE))
    out$chunk_id <- seq_len(nrow(out))
    out <- out[, c("annotation_time", "doc_id", "text", "text_visible", "start", "end", "chunk_id", "chunk_entity", "chunk")]
    class(out) <- c("chunkrange", "data.frame")
    saveRDS(out, file = dashdata$save_as)
  }else{
    out <- data.frame(result, stringsAsFactors = FALSE)
    out$chunk_id <- seq_len(nrow(out))
    out <- out[, c("annotation_time", "doc_id", "text", "text_visible", "start", "end", "chunk_id", "chunk_entity", "chunk")]
    class(out) <- c("chunkrange", "data.frame")
    saveRDS(out, file = dashdata$save_as)
  }
  invisible(out)
}
currentText <- reactive({
  input$ui_next
  current_rds()
  dashdata$current_row <<- dashdata$current_row + 1
  if(nrow(dashdata$rawdata) < dashdata$current_row){
    text <- character()
    doc_id <- character()
    if(nrow(dashdata$rawdata) > 0){
      showModal(modalDialog(title = "Data selection", sprintf("Finished going through all %s records of the provided data.\nPlease upload more data",
                                                              nrow(dashdata$rawdata))))  
    }
  }else{
    text <- dashdata$rawdata$text[dashdata$current_row]
    doc_id <- dashdata$rawdata$doc_id[dashdata$current_row]
  }
  info <- list(doc_id = doc_id, text = text, row = dashdata$current_row, total_rows = nrow(dashdata$rawdata))
  result <- list(annotation_time = Sys.time(), 
               doc_id = info$doc_id,
               text = info$text,
               text_visible = info$text,
               start = NA_integer_,
               end = NA_integer_,
               chunk_id = NA_integer_,
               chunk_entity = NA_character_, 
               chunk = NA_character_)
  save_annotation(result)
  info
})

Annotation

Inputs {.sidebar}

Current document

renderUI({
  current <- currentText()
  if(length(current$doc_id) == 0){
    current$doc_id <- "no-more-documents"
    current$row <- 0
  }
  HTML(sprintf("Doc %s/%s<br>doc_id: %s", current$row, current$total_rows, current$doc_id))
})
tags$br()
actionButton(inputId = "ui_next", label = "Annotate another document")

Indicate category

radioButtons(inputId = "ui_tag", label = "Category of selected text", choices = dashdata$categories)
actionButton(inputId = "ui_save", label = "Save")

Column 1 {.tabset}

Annotate

textOutput("ui_txt")
tags$style(type="text/css", "#ui_txt {white-space: pre-wrap; overflow-y: auto;}")
output$ui_txt <- renderText({
  input$ui_save
  info <- currentText()
  HTML(info$text)
})
observeEvent(input$ui_save, {
  info <- currentText()
  start <- input$highlighted_offset + 1
  end <- start + nchar(input$highlighted_text) - 1
  result <- list(annotation_time = Sys.time(), 
               doc_id = info$doc_id,
               text = info$text,
               text_visible = input$visible_text,
               start = start,
               end = end,
               chunk_id = NA_integer_,
               chunk_entity = input$ui_tag,
               chunk = input$highlighted_text)
  save_annotation(result)
  showModal(modalDialog(title = "Saved data",
                        "Annotation saved",
                        tags$ul(
                          tags$li(sprintf("File: %s", dashdata$save_as)), 
                          tags$li(sprintf("Directory: %s", getwd()))),
                        "Selected text: ",
                        tags$ul(
                          tags$li(sprintf("%s", result$chunk)),
                          tags$li(sprintf("Category: %s", result$chunk_entity)),
                          tags$li(sprintf("Text position %s - %s", result$start, result$end))),
                        size = "l", easyClose = TRUE, fade = FALSE))
})

Manage tagging categories

tags$blockquote("Define a new category to label text")
textInput(inputId = "ui_category_new", label = "New category")
actionButton(inputId = "ui_category_new_start", label = "Update")  
tags$hr()
tags$br()
tags$blockquote("Remove a category")
selectInput(inputId = "ui_category_delete", label = "Delete this category", 
            choices = dashdata$categories, selected = "OTHER")
actionButton(inputId = "ui_category_delete_start", label = "Delete")
observeEvent(input$ui_category_new_start, {
  categories <- setdiff(union(dashdata$categories, input$ui_category_new), "")
  categories <- toupper(categories)
  if(all(categories %in% dashdata$categories)){
    showModal(modalDialog(title = "Update information", sprintf("Category %s already exists", input$ui_category_new)))
  }else{
    showModal(modalDialog(title = "Update information", sprintf("Category %s added", input$ui_category_new)))
  }
  dashdata$categories <<- categories
  updateRadioButtons(session, "ui_tag", choices = dashdata$categories)
  updateSelectInput(session, "ui_category_delete", choices = dashdata$categories)
})
observeEvent(input$ui_category_delete_start, {
  categories <<- setdiff(dashdata$categories, setdiff(input$ui_category_delete, "OTHER"))
  categories <- toupper(categories)
  if(length(setdiff(dashdata$categories, categories)) > 0){
    showModal(modalDialog(title = "Update information", sprintf("Category %s deleted", input$ui_category_delete)))
  }else{
    showModal(modalDialog(title = "Update information", sprintf("Category %s can not be deleted", input$ui_category_delete)))
  }
  dashdata$categories <<- categories
  updateRadioButtons(session, "ui_tag", choices = dashdata$categories)
  updateSelectInput(session, "ui_category_delete", choices = dashdata$categories)
})

Upload new set of documents

tags$blockquote("Provide here the set of documents that you want to tag.")

A set of documents should be a data.frame with at least columns doc_id and text. This data.frame should be saved with saveRDS. An example is shown below. This dataset can be loaded below. After you selected the dataset, go to the tab 'Annotate' and start annotating your text using the categories that you provided in tab 'Manage tagging categories'.

x <- data.frame(doc_id = c("doc1", "doc2"), 
                text = c("The salmon and the creme brulee at the Fishbone were delicious", 
                         "I went to see the Taj Mahal after climbing the Mount Everest"), 
                stringsAsFactors = FALSE)
saveRDS(x, file = "to-annotate.rds")
fileInput(inputId = "ui_tid_input", label = "Choose .rds file")
dataTableOutput(outputId = "ui_rawdata")
reactive({
  input$ui_tid_input
  print(input$ui_tid_input$name)
  if(!is.null(input$ui_tid_input) && file.exists(input$ui_tid_input$datapath)){
    if(tools::file_ext(input$ui_tid_input$name) %in% c("rds", "RDS", "Rds")){
      dashdata$save_as <<- sprintf("crfsuite_annotation_%s.rds", gsub(".rds$|.RDS$", "", input$ui_tid_input$name))
      dashdata$rawdata <<- readRDS(input$ui_tid_input$datapath) 
      if(!inherits(dashdata$rawdata, "data.frame") || !all(c("doc_id", "text") %in% colnames(dashdata$rawdata))){
        showModal(modalDialog(title = "New data upload failed", 
                              "Uploaded data is not a data.frame with columns doc_id and text."))  
      }else{
        dashdata$current_row <<- 0
        dashdata$total_rows <<- nrow(dashdata$rawdata)
        current_rds(dashdata$save_as)
        output$ui_rawdata <- renderDataTable(dashdata$rawdata, options = list(pageLength = 2, lengthMenu = c(1, 2, 5, 10, 15, 20)))
        showModal(modalDialog(title = "New data uploaded", 
                              tags$ul(
                              tags$li(sprintf("New data uploaded with %s rows", nrow(dashdata$rawdata))), 
                              tags$li(sprintf("Annotations will be saved in file: %s", dashdata$save_as)),
                              tags$li(sprintf("This file will be stored at folder %s", getwd())))))  
      }  
    }else{
      showModal(modalDialog(title = "New data upload failed", "Uploaded data is not a .rds file"))  
    }
  }
})

Show results

dataTableOutput(outputId = "ui_taggeddata")
output$ui_taggeddata <- renderDataTable({
  input$ui_next
  if(file.exists(dashdata$save_as)){
    data <- readRDS(dashdata$save_as)
    data <- data[, c("doc_id", "start", "end", "chunk_entity", "chunk")]
    data <- data[rev(seq_len(nrow(data))), ]
  }else{
    data <- data.frame(doc_id = character(), start = integer(), end = integer(), chunk_entity = character(), chunk = character(), stringsAsFactors = FALSE)
  }
  data
}, options = list(pageLength = 10, lengthMenu = c(1, 2, 5, 10, 15, 20), rownames = FALSE))


Try the crfsuite package in your browser

Any scripts or data that you put into this service are public.

crfsuite documentation built on May 31, 2019, 5:04 p.m.