knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  progress = FALSE,
  error = FALSE, 
  message = FALSE,
  warning = FALSE,
  rownames.print = FALSE
)

options(digits = 2)

This vignette presents three examples of using slowrake() to find keywords in text. Each example runs RAKE on a different type of document: a webpage, a set of patent abstracts, and a journal article.

Webpage

1. Download the HTML and run RAKE

# Load the libraries needed for all three applications
library(slowraker)
library(httr)
library(xml2)
library(patentsview)
library(dplyr)
library(pdftools)
library(stringr)
library(knitr)

# The webpage of interest - slowraker's "Getting started" page
url <- "https://crew102.github.io/slowraker/articles/getting-started.html"

GET(url) %>% 
  content("text") %>% 
  read_html() %>% 
  xml_find_all(".//p") %>% 
  xml_text() %>% 
  paste(collapse = " ") %>%
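  # Replace non-ASCII characters with a quote character ('"')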
  iconv(from = "UTF-8", "ASCII", sub = '"') %>% 
  slowrake() %>%
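  # slowrake() returns a list of keyword data frames (one per document), so
  # extract the lone element for our single document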
  .[[1]]

Patent abstracts

1. Download patent data

# Download data from the PatentsView API for 10 patents with the phrase
# "keyword extraction" in their abstracts
pv_data <- search_pv(
  query = qry_funs$text_phrase(patent_abstract = "keyword extraction"),
  fields = c("patent_number", "patent_title", "patent_abstract"),
  per_page = 10
)

# Look at the data
patents <- pv_data$data$patents
kable(head(patents, n = 2))

2. Run RAKE on the abstracts

rakelist <- slowrake(
  patents$patent_abstract,
  stop_words = c("method", smart_words), # Consider "method" to be a stop word
  stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)] # Treat all non-noun POS tags as stop POS tags
)

# Create a single data frame with all patents' keywords
out_rake <- rbind_rakelist(rakelist, doc_id = patents$patent_number)
out_rake

3. Show each patent's top keyword

out_rake %>% 
  group_by(doc_id) %>% 
  arrange(desc(score)) %>%
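  # Keep only the highest-scoring keyword for each patent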
  slice(1) %>% 
  inner_join(patents, by = c("doc_id" = "patent_number")) %>% 
  rename(patent_number = doc_id, top_keyword = keyword) %>% 
  select(matches("number|title|keyword")) %>%
  head() %>%
  kable()

Journal article

1. Get PDF text

# The journal article of interest - Rose et al. (i.e., the RAKE publication)
url <- "http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf"

# Download file and pull out text layer from PDF
destfile <- tempfile()
GET(url, write_disk(destfile))
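# pdf_text() returns a character vector with one element per page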
raw_txt <- pdf_text(destfile)

2. Apply basic text cleaning

# Helper function for text removal
sub_all <- function(regex_vec, txt) {
  pattern <- paste0(regex_vec, collapse = "|")
  gsub(pattern, " ", txt)
}
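
For instance (a toy input, not part of the article's pipeline), every match of any pattern in the vector is collapsed to a single space:

# Toy example of sub_all(): both "foo" and "baar" become spaces
sub_all(c("foo", "ba+r"), "foo and baar")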

txt1 <- paste0(raw_txt, collapse = " ") %>% 
  gsub("\\r\\n", " ", .) %>% 
  gsub("[[:space:]]{2,}", " ", .)

# Regexes to capture text that we don't want to run RAKE on
remove1 <- "Acknowledgements.*" # Drop everything from the acknowledgements onward
remove2 <- "TEXT MINING" # Recurring all-caps header text
remove3 <- "AUTOMATIC KEYWORD EXTRACTION" # Recurring all-caps header text

txt2 <- sub_all(c(remove1, remove2, remove3), txt1)

3. Detect and remove tables

There are some sections of the PDF's text that we don't want to run RAKE on, such as the text found in tables. The problem with tables is that they usually don't contain typical phrase delimiters (e.g., periods and commas). Instead, the cell of the table acts as a sort of delimiter. It can be very difficult to parse a table's cells out in a PDF document, though, so we'll just try to identify/remove the tables themselves.[^1]
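
As a toy illustration of this (made-up strings, not from the article), RAKE fuses undelimited content words into a single run-on keyword, while punctuation splits them into separate phrases:

# With no delimiters, the four content words merge into one long keyword
slowrake("keyword extraction corpus statistics")[[1]]
# With periods acting as delimiters, we get two separate keywords
slowrake("keyword extraction. corpus statistics.")[[1]]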

The tables in this article mostly contain numbers. If we split the article into text chunks using a digit delimiter, it's likely that most of a table's chunks will be relatively small in size. We can use this fact to help us identify which text chunks correspond to tables and which correspond to paragraphs.

# Numbers generally appear in paragraphs in two ways in this article: when the
# authors refer to results in a specific table/figure (e.g., "the sample
# abstract shown in Figure 1.1"), and when the authors reference another
# article (e.g., "Andrade and Valencia (1998) base their approach"). Remove
# these instances so that paragraphs don't get split into small chunks, which
# would make them hard to tell apart from tables.
remove4 <- "(Table|Figure) [[:digit:].]{1,}"
remove5 <- "\\([12][[:digit:]]{3}\\)"
txt3 <- sub_all(c(remove4, remove5), txt2)

# Split text into chunks based on digit delimiter
txt_chunks <- unlist(strsplit(txt3, "[[:digit:]]"))

# Use number of alpha chars found in a chunk as an indicator of its size
num_alpha <- str_count(txt_chunks, "[[:alpha:]]")

# Use kmeans to distinguish tables from paragraphs. The cluster whose center
# is larger corresponds to paragraph-sized chunks.
clust <- kmeans(num_alpha, centers = 2)
good_clust <- which.max(clust$centers)

# Only keep chunks that are thought to be paragraphs
good_chunks <- txt_chunks[clust$cluster == good_clust]
final_txt <- paste(good_chunks, collapse = " ")
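
As a quick sanity check (optional, and not part of the original write-up), we can confirm that the two cluster centers are well separated, with the larger center corresponding to paragraph-sized chunks:

# The center of the cluster we kept should be much larger than the other
clust$centers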

4. Run RAKE

rakelist <- slowrake(final_txt)

kable(head(rakelist[[1]]))

5. Filter out bad keywords

The fact that some of the keywords shown above are very long suggests we missed something in Step 4. It turns out that our method mistook one of the tables (Table 1.1 of the article) for a paragraph. Table 1.1 is somewhat atypical in that it doesn't contain any numbers, so it makes sense that our method missed it.

To clean up the results, let's apply an ad hoc filter to the keywords, removing those whose unusual length indicates that a phrase run-on has occurred (and hence that the keyword is no good).

# Function to remove keywords that occur only once and have more than max_word_cnt member words
filter.rakelist <- function(x, max_word_cnt = 3) {
  structure(
    lapply(x, function(r) {
      word_cnt <- str_count(r$keyword, " ") + 1
      to_filter <- r$freq == 1 & word_cnt > max_word_cnt
      r[!to_filter, ]
    }),
    class = c("rakelist", "list")
  )
}

# Make filter() generic so that calling it on a rakelist dispatches to
# filter.rakelist(). Note that this masks dplyr::filter() in the session.
filter <- function(x) UseMethod("filter")

filter(rakelist)[[1]]

[^1]: There are better solutions for identifying and parsing tables in PDFs than the one I use here (e.g., the tabula Java library and its corresponding R wrapper, tabulizer).


