In TommyJones/textmineR: Functions for Text Mining and Topic Modeling

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this example we'll use text embeddings and a bit of network analysis to build a basic document summarizer.

Many document summarizers, as the one we'll build here, do not generate language. Instead, they break a document down into sentences and then use some mechanism to score each sentence for relevance. Sentences with the top scores are returned as the "summary." For more information on summarization, a good place to start is here.

The summarizer we'll build is a version of the TextRank algorithm. We will split a document into sentences, create a nearest-neighbor network where sentences are connected to other similar sentences, and rank the sentences according to eigenvector centrality.

We will use a word embedding model, created on a whole corpus, to project the sentences into the embedding space. Once in the embedding space, we will measure similarity between documents using Hellinger distance. Hellinger distance is a metric specifically for probability distributions. Since we'll use LDA to create embeddings to a probability space, it's a useful measure.

Getting started

We'll use the movie review data set from text2vec again. The first thing we need to do is create a TCM and embedding model. We will skip evaluation such as R-squared, coherence, inspecting top terms, etc. However, in any real application, I'd strongly suggest evaluating your models at every step of the way.

library(textmineR)

# load the data
data(movie_review, package = "text2vec")

# let's take a sample so the demo will run quickly
# note: textmineR is generally quite scaleable, depending on your system
set.seed(123)
s <- sample(1:nrow(movie_review), 200)

movie_review <- movie_review[ s , ]

# let's get those nasty "<br />" symbols out of the way
movie_review$review <- stringr::str_replace_all(movie_review$review, "<br */>", "")

# First create a TCM using skip grams, we'll use a 5-word window
# most options available on CreateDtm are also available for CreateTcm
tcm <- CreateTcm(doc_vec = movie_review$review,
                 skipgram_window = 10,
                 verbose = FALSE,
                 cpus = 2)

# use LDA to get embeddings into probability space
# This will take considerably longer as the TCM matrix has many more rows 
# than a DTM
embeddings <- FitLdaModel(dtm = tcm,
                          k = 50,
                          iterations = 200,
                          burnin = 180,
                          alpha = 0.1,
                          beta = 0.05,
                          optimize_alpha = TRUE,
                          calc_likelihood = FALSE,
                          calc_coherence = FALSE,
                          calc_r2 = FALSE,
                          cpus = 2)

Building a basic document summarizer

Let's use the above embeddings model to create a document summarizer. This will return the three most relevant sentences in each review.

The summarizer works best as a function, as we have many documents to summarize. The function summarizer is defined in the next section. However, let's look at some key bits of code in detail.

The variable doc represents a single document, or a single element of a character vector.

In the code chunk below, we split the document into sentences using the stringi package. Then we embed each sentence under the model built on our whole corpus, above.

  # parse it into sentences
  sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]

  names(sent) <- seq_along(sent) # so we know index and order

  # embed the sentences in the model
  e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)

  # remove any documents with 2 or fewer words
  e <- e[ rowSums(e) > 2 , ]

  vocab <- intersect(colnames(e), colnames(gamma))

  e <- e / rowSums(e)

  e <- e[ , vocab ] %*% t(gamma[ , vocab ])

  e <- as.matrix(e)

Next, we measure the distance between each of the sentences within the embedding space.

  # get the pairwise distances between each embedded sentence
  e_dist <- CalcHellingerDist(e)

Since we are using a distance measure whose values fall between $0$ and $1$, we can take $1 - distance$ to get a similarity. We'll also re-scale it to be between 0 and 100. (The rescaling is just a cautionary measure so that we don't run into numerical precision issues when performing calculations downstream.)

  # turn into a similarity matrix
  g <- (1 - e_dist) * 100

If you consider a similarity matrix to be an adjacency matrix, then you have a fully-connected graph. For the sake of potentially faster computation and with the hope of eliminating some noise, we will delete some edges. Going row-by-row, we will keep connections only to the top 3 most similar sentences.

  # we don't need sentences connected to themselves
  diag(g) <- 0

  # turn into a nearest-neighbor graph
  g <- apply(g, 1, function(x){
    x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
    x
  })

  # by taking pointwise max, we'll make the matrix symmetric again
  g <- pmax(g, t(g))

Using the igraph package (with its own objects) to calculate eigenvector centrality. From there, we'll take the top three sentences.

  g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)

  # calculate eigenvector centrality
  ev <- evcent(g)

  # format the result
  result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ]

  result <- result[ order(as.numeric(names(result))) ]

  paste(result, collapse = " ")

Pulling it all together

The code below puts it all together in a single function. The first few lines vectorize the code, so that we can summarize multiple documents from a single function call.

library(igraph) 

# let's do this in a function

summarizer <- function(doc, gamma) {

  # recursive fanciness to handle multiple docs at once
  if (length(doc) > 1 )
    # use a try statement to catch any weirdness that may arise
    return(sapply(doc, function(d) try(summarizer(d, gamma))))

  # parse it into sentences
  sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]

  names(sent) <- seq_along(sent) # so we know index and order

  # embed the sentences in the model
  e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)

  # remove any documents with 2 or fewer words
  e <- e[ rowSums(e) > 2 , ]

  vocab <- intersect(colnames(e), colnames(gamma))

  e <- e / rowSums(e)

  e <- e[ , vocab ] %*% t(gamma[ , vocab ])

  e <- as.matrix(e)

  # get the pairwise distances between each embedded sentence
  e_dist <- CalcHellingerDist(e)

  # turn into a similarity matrix
  g <- (1 - e_dist) * 100

  # we don't need sentences connected to themselves
  diag(g) <- 0

  # turn into a nearest-neighbor graph
  g <- apply(g, 1, function(x){
    x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
    x
  })

  # by taking pointwise max, we'll make the matrix symmetric again
  g <- pmax(g, t(g))

  g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)

  # calculate eigenvector centrality
  ev <- evcent(g)

  # format the result
  result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ]

  result <- result[ order(as.numeric(names(result))) ]

  paste(result, collapse = " ")
}

How well did we do? Let's look at summaries from the first three reviews.

# Let's see the summary of the first couple of reviews
docs <- movie_review$review[ 1:3 ]
names(docs) <- movie_review$id[ 1:3 ]

sums <- summarizer(docs, gamma = embeddings$gamma)

sums