Description
Build a text summary by extracting the most relevant sentences from your text. The training dataset should consist of several documents, each with sentences separated by periods.

While fitting the model, a 'term frequency - inverse document frequency' (TF-IDF) matrix, which reflects how important a word is to a document, is calculated first. Vector representations for words are then obtained from the 'global vectors for word representation' (GloVe) algorithm. When the model is applied to new data, the GloVe word vectors for each word are weighted by their TF-IDF weights and averaged to give a sentence vector or a document vector. The magnitude of this sentence vector gives the importance of that sentence within the document. Alternatively, the importance of a sentence can be calculated as the cosine similarity between the sentence vector and the document vector.

The output can be either at the sentence level (sentences and their weights are returned) or at the document level (the summary for each document is returned). It is useful to first get sentence-level output and inspect quantiles of the sentence weights to determine a cutoff threshold for the weights; this threshold can then be used in the document-level output. This method is a variation of the TF-IDF extractive summarization method reviewed in Gupta (2010) <doi:10.4304/jetwi.2.3.258-268>.
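The scoring step described above can be sketched in a few lines of base R. Everything in this sketch is illustrative only: the word vectors, TF-IDF weights and sentence are made-up toy values, and this is not the package's internal code.

```r
# Illustrative sketch (not package internals): score one sentence against its
# document. The embeddings and TF-IDF weights below are invented toy values.

# toy "GloVe" word vectors, one row per word
vecs <- rbind(good    = c(0.9, 0.1),
              battery = c(0.2, 0.8),
              life    = c(0.3, 0.7))
# assumed TF-IDF weights for the same words
tfidf <- c(good = 0.5, battery = 1.2, life = 0.9)

sentence <- c("good", "battery", "life")

# TF-IDF-weighted average of the word vectors -> sentence vector
sent_vec <- colSums(vecs[sentence, ] * tfidf[sentence]) / length(sentence)
# document vector: here simply the average of all word vectors
doc_vec <- colMeans(vecs)

# weight_method = 'Magnitude': importance is the length of the sentence vector
magnitude <- sqrt(sum(sent_vec^2))

# weight_method = 'DocSimilarity': cosine similarity to the document vector
doc_similarity <- sum(sent_vec * doc_vec) /
  (sqrt(sum(sent_vec^2)) * sqrt(sum(doc_vec^2)))
```

A longer sentence accumulates more (weighted) vectors before averaging, which is why the `Magnitude` method offers the option of dividing the weight by the word count.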
Format

An R6Class object.

For usage details see the Methods, Arguments and Examples sections.
Methods

$new(stopword_list)
  Creates a TextSummary model.

$fit(x)
  Fits the model to an input vector (data frame column) of documents.

$transform(df, doc_id, txt_col, summary_col, weight_method = c('Magnitude', 'DocSimilarity'), topN = 3, weight_threshold = 10, return_sentences = FALSE, replace_char = '', avg_weight_by_word_count = TRUE)
  Transforms new data df using the model built on the training data.
Arguments

object
  A TextSummary object.
x
  An input vector (data frame column) of documents, preprocessed as necessary to remove case, punctuation, etc. (except the periods that indicate sentence boundaries).

df
  A data frame containing document ids and documents. Any other columns are passed through unchanged.

doc_id
  Name of the column that contains the document ids.

txt_col
  Name of the column that contains the document text.

summary_col
  Name of the column for the output summary. This column will be added to df.

weight_method
  Specifies how sentence importance is calculated. weight_method = "Magnitude" takes the weight to be the magnitude of the sentence vector. If avg_weight_by_word_count = TRUE, the magnitude is divided by the word count, which typically favors shorter sentences; if avg_weight_by_word_count = FALSE, the raw magnitude is used, which typically favors longer sentences. weight_method = "DocSimilarity" calculates sentence importance as the cosine similarity between the sentence vector and the document vector; avg_weight_by_word_count plays no role in this method.

topN
  Number of top-weighted sentences to keep in the output.

weight_threshold
  Threshold above which sentences are considered for inclusion in the summary.

return_sentences
  If TRUE, sentences and their weights are returned, and topN, weight_threshold and replace_char are ignored. If FALSE, the topN sentences with weights above weight_threshold are included in the summary, and the irrelevant sentences are replaced by replace_char (use replace_char = "" to remove them completely).

avg_weight_by_word_count
  If TRUE, the sentence weights are divided by the number of words in the sentence.
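As a toy illustration of how avg_weight_by_word_count interacts with weight_method = 'Magnitude' (not package code; the sentence vectors and word counts below are invented): raw magnitudes tend to grow with sentence length, and dividing by word count removes that advantage.

```r
# Invented sentence vectors and word counts, purely to illustrate the effect
# of avg_weight_by_word_count on 'Magnitude' weights.
mag <- function(v) sqrt(sum(v^2))

short_vec <- c(0.4, 0.3)  # sentence vector of a hypothetical 3-word sentence
long_vec  <- c(0.9, 0.8)  # sentence vector of a hypothetical 12-word sentence

mag(long_vec) > mag(short_vec)           # raw magnitude favors the long sentence
mag(short_vec) / 3 > mag(long_vec) / 12  # per-word weight favors the short one
```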
Examples

{
library(RtextSummary)
library(stringr)
library(tidyr)
library(dplyr)
data("opinosis")
# the data is reduced to pass CRAN checks of <5 sec run-time
# delete the line below to build the model on the entire dataset
opinosis = opinosis[1:2, ] %>% mutate(text = substr(text, 1, 10))
# 'stopwords_longlist' is a very long list of stopwords.
# it is not used in this example but can be useful for other datasets
data("stopwords_longlist")
opinosis$text = stringr::str_replace_all(
stringr::str_to_lower(opinosis$text),'[^a-z. ]','' )
# -- the model will be fit at the sentence level, which works well for this dataset
# for other datasets, also try fitting at the document level by commenting out the two lines below
tempdf = opinosis %>%
  tidyr::separate_rows(text, sep = '\\.')
# ----------------------------------------
summary.model = TextSummary$new( stopword_list = c() )
summary.model$fit(tempdf$text)
# the parameters below work well for this dataset.
# For other datasets, try changing weight_method and avg_weight_by_word_count
df_sentence_level = summary.model$transform(
opinosis,
doc_id = 'topics',
txt_col = 'text',
summary_col = 'summary',
weight_method = 'Magnitude',
return_sentences = TRUE,
avg_weight_by_word_count = TRUE
)
# explore weight thresholds
quantile(df_sentence_level$wt, seq(0,1,0.1))
df_summary = summary.model$transform(
opinosis,
doc_id = 'topics',
txt_col = 'text',
summary_col = 'summary',
weight_method = 'Magnitude',
topN = 1,
weight_threshold = quantile(df_sentence_level$wt, 0.3),
return_sentences = FALSE,
replace_char = '',
avg_weight_by_word_count = TRUE
)
}