keywords_phrases: Extract phrases - a sequence of terms which follow each other...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

keywords_phrases

R Documentation

Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags

Description

This function allows to extract phrases, like simple noun phrases, complex noun phrases or any exact sequence of parts of speech tag patterns.
An example use case of this is to get all text where an adjective is followed by a noun or for example to get all phrases consisting of a preposition which is followed by a noun which is next followed by a verb. More complex patterns are shown in the details below.

Usage

keywords_phrases(
  x,
  term = x,
  pattern,
  is_regex = FALSE,
  sep = " ",
  ngram_max = 8,
  detailed = TRUE
)

phrases(
  x,
  term = x,
  pattern,
  is_regex = FALSE,
  sep = " ",
  ngram_max = 8,
  detailed = TRUE
)

Arguments

`x`	a character vector of Parts of Speech tags where we want to locate a relevant sequence of POS tags as defined in `pattern`
`term`	a character vector of the same length as `x` with the words or terms corresponding to the tags in `x`
`pattern`	In case `is_regex` is set to FALSE, `pattern` should be a character vector with a sequence of POS tags to identify in `x`. The length of the character vector should be bigger than 1. In case `is_regex` is set to TRUE, this should be a regular expressions which will be used on a concatenated version of `x` to identify the locations where these regular expression occur. See the examples below.
`is_regex`	logical indicating if `pattern` can be considered as a regular expression or if it is just a character vector of POS tags. Defaults to FALSE, indicating `pattern` is not a regular expression.
`sep`	character indicating how to collapse the phrase of terms which are found. Defaults to using a space.
`ngram_max`	an integer indicating to allow phrases to be found up to `ngram` maximum number of terms following each other. Only used if is_regex is set to TRUE. Defaults to 8.
`detailed`	logical indicating to return the exact positions where the phrase was found (set to `TRUE`) or just how many times each phrase is occurring (set to `FALSE`). Defaults to `TRUE`.

Details

Common phrases which you might be interested in and which can be supplied to pattern are

Simple noun phrase: "(A|N)*N(P+D*(A|N)*N)*"
Simple verb Phrase: "((A|N)*N(P+D*(A|N)*N)*P*(M|V)*V(M|V)*|(M|V)*V(M|V)*D*(A|N)*N(P+D*(A|N)*N)*|(M|V)*V(M|V)*(P+D*(A|N)*N)+|(A|N)*N(P+D*(A|N)*N)*P*((M|V)*V(M|V)*D*(A|N)*N(P+D*(A|N)*N)*|(M|V)*V(M|V)*(P+D*(A|N)*N)+))"
Noun hrase with coordination conjuction: "((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)"
Verb phrase with coordination conjuction: "(((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)(P(CP)*)*(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*(D(CD)*)*((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)+|((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)(P(CP)*)*((M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*(D(CD)*)*((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)+))"

See the examples.
Mark that this functionality is also implemented in the phrasemachine package where it is implemented using plain R code, while the implementation in this package uses a more quick Rcpp implementation for extracting these kind of regular expression like phrases.

Value

If argument detailed is set to TRUE a data.frame with columns

keyword: the phrase which corresponds to the collapsed terms of where the pattern was found
ngram: the length of the phrase
pattern: the pattern which was found
start: the starting index of x where the pattern was found
end: the ending index of x where the pattern was found

If argument detailed is set to FALSE will return aggregate frequency statistics in a data.frame containing the columns keyword, ngram and freq (how many time it is occurring)

Examples

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")

## Find exactly this sequence of POS tags
np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), sep = "-")
head(np)
np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), term = x$token)
head(np)

## Find noun phrases with the following regular expression: (A|N)+N(P+D*(A|N)*N)*
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")
nounphrases <- keywords_phrases(x$phrase_tag, term = x$token, 
                                pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE, 
                                ngram_max = 4, 
                                detailed = TRUE)
head(nounphrases, 10)
head(sort(table(nounphrases$keyword), decreasing=TRUE), 20)

## Find frequent sequences of POS tags
library(data.table)
x <- as.data.table(x)
x <- x[, pos_sequence := txt_nextgram(x = xpos, n = 3), by = list(doc_id, sentence_id)]
tail(sort(table(x$pos_sequence)))
np <- keywords_phrases(x$xpos, term = x$token, pattern = c("IN", "DT", "NN"))
head(np)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.

udpipe index

README.md UDPipe Natural Language Processing - Basic Analytical Use Cases UDPipe Natural Language Processing - Model Building UDPipe Natural Language Processing - Parallel UDPipe Natural Language Processing - Text Annotation UDPipe Natural Language Processing - Topic Modelling Use Cases UDPipe Natural Language Processing - Try it out UDPipe Natural Language Processing - Universe

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

keywords_phrases: Extract phrases - a sequence of terms which follow each other...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags

Description

Usage

Arguments

Details

Value

See Also

Examples

Related to keywords_phrases in udpipe...

R Package Documentation

Browse R Packages

We want your feedback!

udpipe Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

keywords_phrases: Extract phrases - a sequence of terms which follow each other... In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags

Description

Usage

Arguments

Details

Value

See Also

Examples

Related to keywords_phrases in udpipe...

R Package Documentation

Browse R Packages

We want your feedback!

udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

keywords_phrases: Extract phrases - a sequence of terms which follow each other...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit