txt_context: Based on a vector with a word sequence, get n-grams (looking...

View source: R/utils.R

txt_context R Documentation

Based on a vector with a word sequence, get n-grams (looking forward + backward)

Description

If you have annotated your text using udpipe_annotate, your text is tokenised into a sequence of words. Based on this vector of words in sequence, getting n-grams comes down to looking at the previous/next word, then the word before/after that, and so forth. These words can be pasted together to form an n-gram.
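The look-back/look-ahead idea described above can be sketched in base R. This is only an illustration, not how udpipe implements txt_context: the helper shift_by is a hypothetical name introduced here, and the sketch always returns NA when a neighbouring word is missing.

```r
# Hypothetical helper (not part of udpipe): shift a vector by k positions,
# padding with NA where there is no neighbour to look at.
shift_by <- function(x, k) {
  len <- length(x)
  idx <- seq_len(len) + k
  idx[idx < 1 | idx > len] <- NA
  x[idx]
}

x <- c("We", "walked", "anxiously", "to", "the", "doctor", "!")
previous_word <- shift_by(x, -1)  # look 1 word back
next_word     <- shift_by(x, 1)   # look 1 word ahead

# Paste previous word + word itself + next word into a 3-gram;
# if a neighbour is missing, the result is NA
trigram <- ifelse(is.na(previous_word) | is.na(next_word),
                  NA_character_,
                  paste(previous_word, x, next_word, sep = " "))
data.frame(x, trigram)
```

The first and last elements have no complete window in this sketch, so they come out as NA.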

Usage

txt_context(x, n = c(-1, 0, 1), sep = " ", na.rm = FALSE)

Arguments

x

a character vector where each element is just 1 term or word

n

an integer vector of offsets indicating how many terms to look back and ahead: negative values look back, 0 is the term itself, positive values look ahead

sep

a character string used as the separator when pasting the subsequent words together

na.rm

logical. If set to TRUE, keeps all available text even if it cannot look back/ahead by the amount specified by n. If set to FALSE, the resulting value is NA if at least one element is NA or if it cannot look back/ahead by the amount specified by n.
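To make the na.rm behaviour concrete, here is a minimal base R sketch of the pasting step. The function paste_context is a hypothetical helper written for this illustration and is not udpipe's own code; returning NA when every element of a window is missing is an assumption of the sketch.

```r
# Hypothetical helper (not udpipe's implementation): paste several
# equally long character vectors together element by element.
paste_context <- function(parts, sep = " ", na.rm = FALSE) {
  apply(do.call(cbind, parts), 1, function(row) {
    # With na.rm = TRUE, drop missing words before pasting
    if (na.rm) row <- row[!is.na(row)]
    # Assumption of this sketch: nothing left, or any NA left, gives NA
    if (length(row) == 0 || anyNA(row)) {
      NA_character_
    } else {
      paste(row, collapse = sep)
    }
  })
}

# Previous word and the word itself; the first word has no predecessor
parts <- list(c(NA, "We", "walked"), c("We", "walked", "home"))
strict  <- paste_context(parts, na.rm = FALSE)
lenient <- paste_context(parts, na.rm = TRUE)
data.frame(strict, lenient)
```

In this sketch the first element is NA under na.rm = FALSE and "We" under na.rm = TRUE; the other elements are the same in both cases.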

Value

a character vector of the same length as x with the n-grams

See Also

txt_paste, txt_next, txt_previous, shift

Examples

library(udpipe)
x <- c("We", "walked", "anxiously", "to", "the", "doctor", "!")

## Look 1 word before + word itself
y <- txt_context(x, n = c(-1, 0), na.rm = FALSE)
data.frame(x, y)
## Look 1 word before + word itself + 1 word after
y <- txt_context(x, n = c(-1, 0, 1), na.rm = FALSE)
data.frame(x, y)
## Same window, but keep the available words if it cannot look back/ahead
y <- txt_context(x, n = c(-1, 0, 1), na.rm = TRUE)
data.frame(x, y)

## Look 2 words before + word itself + 1 word after
## even if not all words are there
y <- txt_context(x, n = c(-2, -1, 0, 1), na.rm = TRUE, sep = "_")
data.frame(x, y)
## Look 2 words before + 2 words after (without the word itself),
## giving NA if not all words are there
y <- txt_context(x, n = c(-2, -1, 1, 2), na.rm = FALSE, sep = "_")
data.frame(x, y)

## Word sequence containing missing values
x <- c("We", NA, NA, "to", "the", "doctor", "!")
y <- txt_context(x, n = c(-1, 0), na.rm = FALSE)
data.frame(x, y)
y <- txt_context(x, n = c(-1, 0), na.rm = TRUE)
data.frame(x, y)

## Get the context of each lemma within each sentence of annotated text
library(data.table)
data(brussels_reviews_anno, package = "udpipe")
x      <- as.data.table(brussels_reviews_anno)
x      <- subset(x, doc_id %in% txt_sample(unique(x$doc_id), n = 10))
x      <- x[, context := txt_context(lemma), by = list(doc_id, sentence_id)]
head(x, 20)
x$term <- sprintf("%s/%s", x$lemma, x$upos)
x      <- x[, context := txt_context(term), by = list(doc_id, sentence_id)]
head(x, 20)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.