TextSummary


Description

Build a text summary by extracting relevant sentences from your text. The training dataset should consist of several documents; in each document, sentences should be separated by periods.

While fitting the model, a 'term frequency - inverse document frequency' (TF-IDF) matrix, which reflects how important a word is to a document, is calculated first. Vector representations for words are then obtained from the 'global vectors for word representation' algorithm (GloVe). When the model is applied to new data, the GloVe word vectors for each word are weighted by their TF-IDF weights and averaged to give a sentence vector or a document vector. The magnitude of this sentence vector gives the importance of that sentence within the document. Another way to obtain the importance of a sentence is to calculate the cosine similarity between the sentence vector and the document vector.

The output can be either at the sentence level (sentences and their weights are returned) or at the document level (the summary for each document is returned). It is useful to first get sentence-level output and inspect quantiles of the sentence weights to determine a cutoff threshold for the weights. This threshold can then be used in the document-level output.

This method is a variation of the TF-IDF extractive summarization method mentioned in a review paper by Gupta (2010) <doi:10.4304/jetwi.2.3.258-268>.
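The sentence-scoring idea can be sketched in a few lines of base R. The word vectors and TF-IDF weights below are made up purely for illustration; the package itself derives them from GloVe and a TF-IDF matrix fitted on the training data:

```r
# Toy word vectors and TF-IDF weights (invented for illustration only)
word_vectors <- rbind(
  battery = c(0.9, 0.1),
  life    = c(0.8, 0.3),
  good    = c(0.2, 0.7)
)
tfidf <- c(battery = 1.5, life = 1.2, good = 0.4)

# Weight each word vector by its TF-IDF weight, then average:
# this gives a sentence (or document) vector
sentence_vector <- function(words) {
  v <- word_vectors[words, , drop = FALSE] * tfidf[words]
  colMeans(v)
}

s1  <- sentence_vector(c("battery", "life"))          # one sentence
doc <- sentence_vector(c("battery", "life", "good"))  # whole document

# Importance as the magnitude of the sentence vector
sqrt(sum(s1^2))
# Importance as cosine similarity between sentence and document vectors
sum(s1 * doc) / (sqrt(sum(s1^2)) * sqrt(sum(doc^2)))
```

Both scores rank sentences within a document; the package exposes them through the weight_method argument of $transform.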

Usage

TextSummary

Format

R6Class object.

Usage

For usage details see Methods, Arguments and Examples sections.

TextSummaryModel <- TextSummary$new( stopword_list )

TextSummaryModel$fit(x)

TextSummaryModel$transform(df, doc_id, txt_col, summary_col,
  weight_method = c('Magnitude', 'DocSimilarity'),
  topN = 3, weight_threshold = 10,
  return_sentences = FALSE, replace_char = '',
  avg_weight_by_word_count = TRUE)

Methods

$new( stopword_list )

Creates a TextSummary model

$fit(x)

Fits the model to an input vector (dataframe column) of documents

$transform(df, doc_id, txt_col, summary_col, weight_method = c('Magnitude', 'DocSimilarity'), topN = 3, weight_threshold = 10, return_sentences = FALSE, replace_char = '', avg_weight_by_word_count = TRUE)

Transforms new data df using the model built on the training data

Arguments

TextSummaryModel

A TextSummary object

x

An input vector (dataframe column) of documents, preprocessed as necessary to remove case, punctuation, etc. (except periods that indicate sentence boundaries)

df

A dataframe containing document ids and document text. Any other columns are passed through unchanged

doc_id

column name that contains the document ids

txt_col

column name that contains the document text

summary_col

column name for the output summary. This column will be added to df

weight_method

Specifies how sentence importance is calculated. weight_method = "Magnitude" uses the magnitude of the sentence vector as the weight. If avg_weight_by_word_count = TRUE, the magnitude is divided by the word count, which typically favors shorter sentences; if avg_weight_by_word_count = FALSE, the raw magnitude is used, which typically favors longer sentences. weight_method = "DocSimilarity" calculates sentence importance as the cosine similarity between the sentence vector and the document vector; avg_weight_by_word_count plays no role in the "DocSimilarity" method

topN

top N sentences to keep in the output

weight_threshold

threshold above which sentences are considered for inclusion in the summary

return_sentences

TRUE: sentences and their weights are returned; topN, weight_threshold, and replace_char are ignored. FALSE: the topN sentences with weights above weight_threshold are included in the summary.

replace_char

Irrelevant sentences are replaced by replace_char (use replace_char = "" to remove them completely)

avg_weight_by_word_count

If TRUE, the sentence weights are divided by the number of words in the sentence.
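A toy illustration of why this flag matters (the sentence vectors below are made up, not package internals): combining more word vectors tends to produce a larger magnitude, so raw magnitudes favor long sentences, and dividing by the word count removes that bias:

```r
# Made-up sentence vectors; imagine each is built from its words'
# TF-IDF-weighted GloVe vectors.
short_vec <- c(1.2, 0.4)   # a 2-word sentence
long_vec  <- c(2.0, 1.1)   # a 6-word sentence

magnitude <- function(v) sqrt(sum(v^2))

# Raw magnitude (avg_weight_by_word_count = FALSE) favors the longer sentence:
magnitude(long_vec) > magnitude(short_vec)            # TRUE
# Per-word weight (avg_weight_by_word_count = TRUE) favors the shorter one:
magnitude(long_vec) / 6 < magnitude(short_vec) / 2    # TRUE
```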

Examples

{ 
library(RtextSummary)
library(stringr)
library(tidyr)
library(dplyr)

data("opinosis")
 
# the data is reduced to pass CRAN checks of <5 sec run-time
# delete the line below to build the model on the entire dataset
opinosis = opinosis[1:2, ] %>% mutate(text = substr(text, 0, 10))
 
# 'stopwords_longlist' is a very long list of stopwords. 
# it is not used in this example but can be useful for other datasets
data("stopwords_longlist") 

opinosis$text = stringr::str_replace_all(
  stringr::str_to_lower(opinosis$text),'[^a-z. ]','' )

# -- the model will be fit at the sentence level, which works well for this dataset
# for other datasets, also try fitting at the document level by commenting out the two lines below
tempdf = opinosis%>%
  tidyr::separate_rows(text, sep = '\\.')
# ----------------------------------------

summary.model = TextSummary$new( stopword_list = c() ) 
summary.model$fit(tempdf$text)

# the parameters below work well for this dataset. 
# For other datasets, try changing weight_method and avg_weight_by_word_count
df_sentence_level = summary.model$transform(
  opinosis,
  doc_id = 'topics',
  txt_col = 'text',
  summary_col = 'summary',
  weight_method = 'Magnitude', 
  return_sentences = TRUE,
  avg_weight_by_word_count = TRUE 
)

# explore weight thresholds
quantile(df_sentence_level$wt, seq(0,1,0.1))


df_summary = summary.model$transform(
  opinosis,
  doc_id = 'topics',
  txt_col = 'text',
  summary_col = 'summary',
  weight_method = 'Magnitude', 
  topN = 1,
  weight_threshold=quantile(df_sentence_level$wt, 0.3 ),
  return_sentences = FALSE,
  replace_char = '',
  avg_weight_by_word_count = TRUE
)
}

RtextSummary documentation built on June 7, 2019, 9:03 a.m.