get_context: Get context words (words within a symmetric window around the...

View source: R/get_context.R

get_context R Documentation

Get context words (words within a symmetric window around the target word/phrase) surrounding a user-defined target.

Description

A wrapper around quanteda's kwic() function. To speed up processing, it subsets the documents to those in which target is present before tokenizing, then concatenates kwic()'s pre/post variables into a single context column.
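The three steps described above (subset, tokenize and run kwic(), concatenate pre/post) can be sketched roughly as follows. This is an illustration only, not the package's actual source: the toy documents, the grepl() pre-filter and the variable names are assumptions, and it presumes quanteda is installed.

```r
library(quanteda)

# Toy corpus (made up for illustration)
docs <- c(d1 = "The Senate debated immigration reform at length today",
          d2 = "The budget vote was postponed again")
target <- "immigration"

# Step 1: keep only documents that mention the target (assumed pre-filter)
keep <- grepl(target, tolower(docs), fixed = TRUE)

# Step 2: tokenize the remaining documents and locate the target with kwic()
toks <- tokens(docs[keep], what = "word")
kw <- as.data.frame(kwic(toks, pattern = phrase(target), window = 2,
                         valuetype = "fixed", case_insensitive = TRUE))

# Step 3: concatenate kwic()'s pre and post columns into one context column
out <- data.frame(docname = kw$docname,
                  target  = kw$keyword,
                  context = trimws(paste(kw$pre, kw$post)),
                  stringsAsFactors = FALSE)
print(out)
```

Filtering before tokenizing means that only documents that can contain a match are tokenized, which is where the speed-up comes from on large corpora.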

Usage

get_context(
  x,
  target,
  window = 6L,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  what = "word",
  verbose = TRUE
)

Arguments

x

(character) vector - this is the set of documents (corpus) of interest.

target

(character) vector - these are the target words whose contexts we want to evaluate. This vector may include a single token, a phrase, or multiple tokens and/or phrases.

window

(numeric) - defines the size of a context (number of words on each side of the target).

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) - if TRUE, a context must have exactly window x 2 tokens; if FALSE, it can have window x 2 or fewer (e.g. if a document begins with a target word, the context will have window tokens rather than window x 2).

what

(character) defines which quanteda tokenizer to use. You will rarely want to change this. For Chinese text you may want to set what = 'fastestword'.

verbose

(logical) - if TRUE, report the total number of target instances found.
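The interaction between window and hard_cut can be illustrated with a small base R sketch. The helpers window_context() and keep_context() are hypothetical and not part of conText; they only mimic the rule as described in the argument list above.

```r
# Hypothetical helper: tokens within `window` positions of index `i`,
# excluding the target token itself
window_context <- function(tokens, i, window = 6L) {
  n <- length(tokens)
  idx <- setdiff(seq(max(1L, i - window), min(n, i + window)), i)
  tokens[idx]
}

# Hypothetical helper: with hard_cut = TRUE, keep only full-sized contexts
keep_context <- function(context, window, hard_cut = FALSE) {
  !hard_cut || length(context) == 2L * window
}

toks <- c("immigration", "reform", "was", "debated", "in", "the", "senate")

# Target at the start of the document: only the right-hand side is available
ctx_start <- window_context(toks, i = 1L, window = 2L)
# Target in the middle: both sides are complete
ctx_mid <- window_context(toks, i = 4L, window = 2L)

keep_context(ctx_start, window = 2L, hard_cut = TRUE)   # truncated context is dropped
keep_context(ctx_start, window = 2L, hard_cut = FALSE)  # truncated context is kept
keep_context(ctx_mid,   window = 2L, hard_cut = TRUE)   # full context is kept
```

With hard_cut = FALSE, the truncated context at the start of the document is retained with window tokens instead of window x 2.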

Value

a data.frame with the following columns:

docname

(character) name of the document to which each instance belongs.

target

(character) targets.

context

(character) pre/post variables in kwic() output concatenated.

Note

target in the returned data.frame is equivalent to kwic()'s keyword output variable, so it may not match the user-defined target exactly if valuetype is not "fixed".

Examples

# get context words surrounding the term immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = FALSE,
                                   hard_cut = FALSE, verbose = FALSE)

conText documentation built on Feb. 16, 2023, 7:32 p.m.