textrank_sentences: Textrank - extract relevant sentences


View source: R/textrank.R

Description

The textrank algorithm is a technique to rank sentences in order of importance.

To find relevant sentences, the textrank algorithm needs 2 inputs: a data.frame (data) with sentences and a data.frame (terminology) containing the tokens which are part of each sentence.
Based on these 2 datasets, it calculates the pairwise distance between sentences by computing how many terms overlap (Jaccard distance, implemented in textrank_jaccard). These pairwise distances are then passed on to Google's pagerank algorithm to identify the most relevant sentences.
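
The term-overlap measure described above can be sketched in base R. This is a minimal illustration, assuming the measure is the number of shared tokens divided by the number of distinct tokens across both sentences; jaccard_overlap is a hypothetical helper written for this example and is not the package's textrank_jaccard.

```r
# Hypothetical helper (not part of textrank): Jaccard overlap of two token vectors,
# assumed here to be |intersection| / |union| of the distinct tokens.
jaccard_overlap <- function(tokens_a, tokens_b) {
  tokens_a <- unique(tokens_a)
  tokens_b <- unique(tokens_b)
  total <- length(union(tokens_a, tokens_b))
  if (total == 0) {
    return(0)  # two empty sentences share nothing
  }
  length(intersect(tokens_a, tokens_b)) / total
}

s1 <- c("data", "analysis", "report")
s2 <- c("data", "analysis", "interest", "statistics")
overlap <- jaccard_overlap(s1, s2)  # 2 shared tokens out of 5 distinct tokens
print(overlap)
```

Sentences sharing more of their terms get a higher value, which is what the ranking step then uses as the strength of the link between them.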

If data contains many sentences, it makes sense not to compute all pairwise sentence distances but instead to limit the calculation of the Jaccard distance to the sentence combinations which the Minhash algorithm selects as candidates. This is implemented in textrank_candidates_lsh and an example is shown below.

Usage

textrank_sentences(data, terminology, textrank_dist = textrank_jaccard,
  textrank_candidates = textrank_candidates_all(data$textrank_id),
  max = 1000, options_pagerank = list(directed = FALSE), ...)

Arguments

data

a data.frame with 1 row per sentence where the first column is an identifier of a sentence (e.g. textrank_id) and the second column is the raw sentence. See the example.

terminology

a data.frame with one row per token, indicating which token is part of each sentence. The first column in this data.frame is the identifier which corresponds to the first column of data. The second column contains the token which is part of the sentence and which will be passed on to textrank_dist. See the example.

textrank_dist

a function which calculates the distance between 2 sentences, each represented by a vector of tokens. The first 2 arguments of the function are the tokens in sentence1 and sentence2. The function should return a numeric value of length one. Despite the name, a larger value indicates a stronger connection between the 2 sentences. Defaults to the Jaccard distance (textrank_jaccard), indicating the percentage of common tokens.
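
As an illustration of the contract described for textrank_dist, here is a minimal sketch of a custom function. overlap_coefficient is a hypothetical name chosen for this example; it divides the number of shared tokens by the size of the smaller sentence, so larger values mean a stronger connection.

```r
# Hypothetical custom measure (not part of textrank): overlap coefficient.
# First 2 arguments are the token vectors of the 2 sentences; returns one number.
overlap_coefficient <- function(sentence1, sentence2, ...) {
  sentence1 <- unique(sentence1)
  sentence2 <- unique(sentence2)
  smallest <- min(length(sentence1), length(sentence2))
  if (smallest == 0) {
    return(0)  # an empty sentence is connected to nothing
  }
  length(intersect(sentence1, sentence2)) / smallest
}

score <- overlap_coefficient(c("data", "analysis"), c("data", "analysis", "report"))
print(score)

# Assumed usage, following the signature shown under Usage:
# tr <- textrank_sentences(data = sentences, terminology = terminology,
#                          textrank_dist = overlap_coefficient)
```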

textrank_candidates

a data.frame of candidate sentence to sentence comparisons with columns textrank_id_1 and textrank_id_2 indicating for which combination of sentences we want to compute the Jaccard distance or the distance function as provided in textrank_dist. See for example textrank_candidates_all or textrank_candidates_lsh.
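
To show the shape of such a candidate data.frame, the sketch below builds every pair of sentence identifiers in base R. It only illustrates the expected columns; the identifiers are made up, and it is not claimed to be how textrank_candidates_all is implemented.

```r
# Made-up sentence identifiers for illustration
ids <- 1:4

# Every unordered pair of identifiers: one column per pair
pairs <- utils::combn(ids, 2)

# Candidate comparisons with the column names expected by textrank_sentences
candidates <- data.frame(textrank_id_1 = pairs[1, ],
                         textrank_id_2 = pairs[2, ])
print(candidates)
```

With n sentences this yields n * (n - 1) / 2 rows, which is why restricting the candidates matters for large inputs.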

max

integer limiting the number of sentence to sentence combinations to compute. If provided, only this maximum number of rows is taken from textrank_candidates.

options_pagerank

a list of arguments passed on to page_rank

...

arguments passed on to textrank_dist

Value

an object of class textrank_sentences which is a list with elements:

See Also

page_rank, textrank_candidates_all, textrank_candidates_lsh, textrank_jaccard

Examples

library(textrank)
library(udpipe)
data(joboffer)
head(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
cat(sentences$sentence)
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
head(terminology)

## Textrank for finding the most relevant sentences
tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 2)
summary(tr, n = 5, keep.sentence.order = TRUE)

## Not run: 
## Using minhash to reduce sentence combinations - relevant if you have a lot of sentences
library(textreuse)
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)

## End(Not run)
## You can also reduce the number of sentence combinations by sampling
tr <- textrank_sentences(data = sentences, terminology = terminology, max = 100)
tr
summary(tr, n = 2)

Example output

  doc_id paragraph_id sentence_id
1   doc1            1           1
2   doc1            1           1
3   doc1            1           1
4   doc1            1           1
5   doc1            1           1
6   doc1            1           1
                                                    sentence token_id
1 Statistical expert / data scientist / analytical developer        1
2 Statistical expert / data scientist / analytical developer        2
3 Statistical expert / data scientist / analytical developer        3
4 Statistical expert / data scientist / analytical developer        4
5 Statistical expert / data scientist / analytical developer        5
6 Statistical expert / data scientist / analytical developer        6
        token       lemma  upos xpos       feats head_token_id  dep_rel deps
1 Statistical Statistical   ADJ   JJ  Degree=Pos             2     amod <NA>
2      expert      expert  NOUN   NN Number=Sing             0     root <NA>
3           /           / PUNCT    ,        <NA>             5       cc <NA>
4        data        data  NOUN   NN Number=Sing             5 compound <NA>
5   scientist   scientist  NOUN   NN Number=Sing             2     conj <NA>
6           /           / PUNCT    ,        <NA>             8       cc <NA>
  misc
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
Statistical expert / data scientist / analytical developer BNOSAC (Belgium Network of Open Source Analytical Consultants), is a Belgium consultancy company specialized in data analysis and statistical consultancy using open source tools. In order to increase and enhance the services provided to our clients, we are on the lookout for an all-round statistical expert, data scientist and analytical developer. Function: Your main task will be the execution of a diverse range of consultancy services in the field of statistics and data science. You will be involved in a small team where you handle the consultancy services from the start of the project until the end. This covers: Joint meeting with clients on the topic of the analysis. Acquaintance with the data. Analysis of the techniques that are required to execute the study. Mostly standard statistical and biostatistical modelling, predictive analytics & machine learning techniques. Perform statistical design, modeling and analysis, together with more seniors. Building the report on the data analysis. Automating and R/Python package development. Integration of the models into the existing architecture. Giving advise to the client on the research questions, design or integration. Next to that, you will help in building data products and help sell them. These cover text mining, integration of predictive analytics in existing tools and the creation of specific data analysis tools and web services. You also might be involved in providing data science related courses for clients. Profile: You have a master degree in the domain of Statistics, Biostatistics, Mathematics, Commercial or Industrial Engineering, Economics or similar. You have a strong interest in statistics and data analysis. You have good communication skills, are fluent in English and know either Dutch or French. You soak up new knowledge and either just make things work or have the attitude of 'I can do this'. 
Besides this, you have attention to detail and adapt to changes quickly. You have programming experience in R or you really want to switch to using R. You have a sound knowledge of another data analysis language (Python, SQL, javascript) and you don't care in which relational database, Excel, bigdata or noSQL store your data is located. Interested in robotics is a plus. Offer: A half or full-time employment depending on your personal situation. The ability to get involved in a whole range of sectors and topics and the flexibility to shape your own future. The usage of a diverse range of statistical & data science techniques. Support in getting up to speed quickly in the usage of R. An environment in which you can develop your talent and make your own proposals the standard way to go. Liberty in managing your open source projects during working hours. Contact: To apply or in order to get more information about the job content, please contact us at: http://bnosac.be/index.php/contact/get-in-touch
  textrank_id       lemma
1           1 Statistical
2           1      expert
4           1        data
5           1   scientist
7           1  analytical
8           1   developer
[1] "Building the report on the data analysis."                  
[2] "You have a strong interest in statistics and data analysis."
[1] "BNOSAC (Belgium Network of Open Source Analytical Consultants), is a Belgium consultancy company specialized in data analysis and statistical consultancy using open source tools."
[2] "In order to increase and enhance the services provided to our clients, we are on the lookout for an all-round statistical expert, data scientist and analytical developer."        
[3] "Building the report on the data analysis."                                                                                                                                         
[4] "You have a strong interest in statistics and data analysis."                                                                                                                       
[5] "The usage of a diverse range of statistical & data science techniques."                                                                                                            
[1] "You have a strong interest in statistics and data analysis."
[2] "Building the report on the data analysis."                  
Textrank on sentences, showing top 5 most important sentences found:
  1. Building the report on the data analysis.
  2. You have a sound knowledge of another data analysis language (Python, SQL, javascript) and you don't care in which relational database, Excel, bigdata or noSQL store your data is located.
  3. You have a strong interest in statistics and data analysis.
  4. Acquaintance with the data.
  5. Next to that, you will help in building data products and help sell them.
[1] "Building the report on the data analysis."                                                                                                                                                  
[2] "You have a sound knowledge of another data analysis language (Python, SQL, javascript) and you don't care in which relational database, Excel, bigdata or noSQL store your data is located."

textrank documentation built on May 2, 2019, 2:09 p.m.