get_context: Get context words (words within a symmetric window around the...
In conText: 'a la Carte' on Text (ConText) Embedding Regression

get_context

R Documentation

Get context words (words within a symmetric window around the target word/phrase) surrounding a user-defined target.

Description

A wrapper function for quanteda's kwic() function. It subsets the documents to those in which target is present before tokenizing, which speeds up processing, and concatenates kwic's pre/post variables into a single context column.

Usage

get_context(
  x,
  target,
  window = 6L,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  what = "word",
  verbose = TRUE
)

Arguments

`x`	(character) vector - this is the set of documents (corpus) of interest.
`target`	(character) vector - these are the target words whose contexts we want to evaluate. This vector may include a single token, a phrase, or multiple tokens and/or phrases.
`window`	(numeric) - defines the size of a context (words around the target).
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
`case_insensitive`	logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`hard_cut`	(logical) - if TRUE, a context must have exactly `window` x 2 tokens; if FALSE, it can have `window` x 2 or fewer (e.g. if a document begins with a target word, its context will have `window` tokens rather than `window` x 2).
`what`	(character) defines which quanteda tokenizer to use. You will rarely want to change this. For Chinese text you may want to set `what = 'fastestword'`.
`verbose`	(logical) - if TRUE, report the total number of target instances found.
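
The interaction between `window` and `hard_cut` can be illustrated with a minimal base-R sketch. This is not conText's implementation: `sketch_context()` is a hypothetical helper that assumes naive whitespace tokenization and a single-token target, and the example document is made up.

```r
# Hypothetical helper (NOT part of conText): mimics the window / hard_cut logic
# on a single document using naive whitespace tokenization.
sketch_context <- function(doc, target, window = 6L, hard_cut = FALSE) {
  toks <- strsplit(doc, "\\s+")[[1]]
  hits <- which(tolower(toks) == tolower(target))  # positions of the target
  contexts <- vapply(hits, function(i) {
    pre  <- utils::tail(toks[seq_len(i - 1)], window)  # up to `window` tokens before
    post <- utils::head(toks[-seq_len(i)], window)     # up to `window` tokens after
    paste(c(pre, post), collapse = " ")
  }, character(1))
  if (hard_cut) {
    # keep only contexts with exactly window x 2 tokens
    n_tokens <- lengths(strsplit(contexts, " ", fixed = TRUE))
    contexts <- contexts[n_tokens == 2L * window]
  }
  contexts
}

doc <- "immigration reform stalled because the immigration debate divided both parties"

sketch_context(doc, "immigration", window = 2L)
# "reform stalled"  "because the debate divided"

sketch_context(doc, "immigration", window = 2L, hard_cut = TRUE)
# "because the debate divided"
```

The first instance opens the document, so it has no preceding tokens and its context is only `window` tokens long; with `hard_cut = TRUE` that instance is dropped.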

Value

a data.frame with the following columns:

docname: (character) document name to which instances belong.
target: (character) targets.
context: (character) pre/post variables in kwic() output concatenated.

Note

target in the return data.frame is equivalent to kwic()'s keyword output variable, so it may not match the user-defined target exactly if valuetype is not fixed.
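
To see why the returned target can differ from the pattern supplied, consider a base-R analogy of glob matching. This sketch does not call quanteda or conText; the token vector is made up, and `utils::glob2rx()` stands in for the kind of wildcard matching that `valuetype = "glob"` requests.

```r
# Hypothetical tokens; glob2rx() converts a glob pattern into a regular expression.
toks    <- c("Immigration", "immigrants", "border", "immigrate")
pattern <- "immigr*"

# Case-insensitive match, analogous to case_insensitive = TRUE
matched <- toks[grepl(utils::glob2rx(pattern), toks, ignore.case = TRUE)]
matched
# "Immigration" "immigrants"  "immigrate"

# None of the matched keywords equals the user-defined pattern itself
any(matched == pattern)
# FALSE
```

With a wildcard or regular-expression target, the target column therefore holds the matched keywords rather than the pattern that produced them.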

Examples

# get context words surrounding the term immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = FALSE,
                                   hard_cut = FALSE, verbose = FALSE)

conText documentation built on Feb. 16, 2023, 7:32 p.m.