colloc_default: Collocates retrieval for raw corpus text
In gederajeg/corplingr: Tidy Concordances, Collocates, and Wordlist


View source: R/corplingr_colloc_default.R

This function retrieves collocates for a word within a user-defined context window from raw (unannotated) corpus texts. It takes a vectorised approach: the corpus is treated as a vector of words, and the collocates are located by their vector positions relative to the vector position of the node word. The argument `tokenise_corpus_to_sentence` (see below) lets the user first split the raw input corpus into a character vector in which each element corresponds to one sentence.
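The position-based idea described above can be sketched in base R as follows. This is a minimal illustration, not the package's actual implementation: the toy `words` vector, the `offsets` helper, and the output column names are all assumptions made for the example.

```r
# Toy word vector standing in for a tokenised corpus (hypothetical data)
words <- c("tiang", "nuju", "ka", "peken", "ipun", "nuju", "mulih")
span <- 2

# Vector positions of the node word
node_pos <- grep("^nuju$", words, ignore.case = TRUE)

# Offsets for a two-sided ("b") window: -span, ..., -1, 1, ..., span
offsets <- c(-rev(seq_len(span)), seq_len(span))

# Every node position combined with every offset
grid <- expand.grid(offset = offsets, node = node_pos)
grid$pos <- grid$node + grid$offset

# Drop positions that fall outside the word vector
grid <- grid[grid$pos >= 1 & grid$pos <= length(words), ]

# Look up the collocates by position
collocates <- data.frame(
  node_pos = grid$node,
  span     = grid$offset,
  word     = words[grid$pos],
  stringsAsFactors = FALSE
)
print(collocates)
```

Negative `span` values here mark collocates to the left of the node and positive values those to the right; how the package itself labels span positions may differ.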

colloc_default(
  corpus_path = NULL,
  corpus_list = NULL,
  pattern = NULL,
  window = "b",
  span = 3,
  word_split_regex = "([^a-zA-Z-]+|--)",
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  tokenise_corpus_to_sentence = TRUE
)

`corpus_path`	character vector of (full) file paths to the corpus text files in `.txt` plain-text format.
`corpus_list`	a named list object containing elements constituting a corpus text. The name of each element should correspond to the corpus file. There can be more than one element (hence more than one corpus text) within this list object.
`pattern`	regular expression or exact string for the node word to search for.
`window`	direction of the context window for the collocates: `"r"` (right of the node), `"l"` (left of the node), or `"b"` (both left and right of the node; the default).
`span`	integer indicating how many words on each side of the node fall within the collocate window.
`word_split_regex`	user-defined regular expression used to tokenise the corpus. The default splits at non-alphabetic characters but retains the hyphen "-" so that, for instance, reduplicated forms stay intact. The regex for this default setting is `"([^a-zA-Z-]+|--)"`. An alternative splitting regex may also keep various characters with diacritics (e.g., `'([^a-zA-Z\u00c0-\u00d6\u00d9-\u00f6\u00f9-\u00ff\u0100-\u017e\u1e00-\u1eff]+|--)'`).
`case_insensitive`	whether the search pattern ignores case (TRUE – the default) or not (FALSE).
`to_lower_colloc`	whether to lowercase the retrieved collocates (TRUE – default) or not (FALSE).
`tokenise_corpus_to_sentence`	whether to tokenise the input corpus by sentence first, so that the retrieved collocates do not cross sentence boundaries. The default is `TRUE`; it uses `stri_split_boundaries` to split the corpus into sentences before tokenising each sentence into words with `str_split`.
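To illustrate what the default `word_split_regex` and sentence-level tokenisation do, here is a small base R sketch. It is an approximation under stated assumptions: the sentence splitter below is a simple regex stand-in (the function itself is documented as using `stri_split_boundaries`), and the sample `text` and the `tokenise()` helper are hypothetical.

```r
# Hypothetical sample text with a reduplicated form and a double hyphen
text <- "Anak-anak nuju -- ka peken. Tiang nuju mulih."

# Stand-in sentence splitter: split at whitespace that follows ., ! or ?
# (an assumption for this sketch; not the package's sentence tokeniser)
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Default word-splitting regex from the Usage section
word_split_regex <- "([^a-zA-Z-]+|--)"

# Split one sentence into words and drop the empty strings
# that adjacent delimiters leave behind
tokenise <- function(x) {
  tok <- strsplit(x, word_split_regex)[[1]]
  tok[nzchar(tok)]
}

tokens_by_sentence <- lapply(sentences, tokenise)
print(tokens_by_sentence)
```

Because each sentence is tokenised separately, a context window computed within one element of `tokens_by_sentence` cannot reach into the neighbouring sentence. Note that the single hyphen in "Anak-anak" is kept, while the standalone "--" is treated as a delimiter.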

A list of three elements:

A tibble of all words in the corpus including the sentence number;
A tibble of all retrieved collocates, including their span position and sentence number;
Regular expression object of the search pattern.

## Not run: 
# do the collocate search using "corpus_path" input-option
df <- colloc_default(corpus_path = orti_bali_path,
                     pattern = "^nuju$",
                     window = "b", # focusing on both left and right context window
                     span = 3) # retrieve 3 collocates to the left and right of the node

## End(Not run)