colloc_sentence: Collocates retrieval on sentence-based corpus
In gederajeg/corplingr: Tidy Concordances, Collocates, and Wordlist

View source: R/corplingr_colloc_sentence.R

Perform a collocate search for a given word/pattern in corpus text files with one sentence per line (e.g., the Leipzig Corpora; cf. Details below). If the lines of the corpus do not each correspond to a sentence, use colloc_default instead.

colloc_sentence(
  corpus_path = "(full) filepath to sentence-based corpus",
  leipzig_input = TRUE,
  pattern = "regular expressions",
  window = c("r", "l", "b"),
  span = 3,
  case_insensitive = TRUE,
  to_lower_colloc = TRUE,
  save_interim_results = FALSE,
  coll_output_name = "colloc_tibble_out.txt"
)

`corpus_path`	character vector of (full) file paths to the corpus text files in `.txt` plain-text format. Each file must be a sentence-based corpus, meaning that each line of the file corresponds to one sentence. The sentences can follow one another in a cohesive sequence (e.g., taken from a novel) or be randomised (as in the Leipzig Corpora).
`leipzig_input`	logical; whether the input corpus is specifically from the Leipzig Corpora (`TRUE`), in which case the function removes the sentence number at the beginning of each line.
`pattern`	regular expression or exact string for the target (node) word/pattern.
`window`	direction of the collocate window: `"r"` (right of the node), `"l"` (left of the node), or `"b"` (both left and right of the node; the default).
`span`	integer vector indicating the span of the collocate scope.
`case_insensitive`	whether the search ignores case (TRUE – the default) or not (FALSE).
`to_lower_colloc`	whether to lowercase the retrieved collocates and the nodes (TRUE – default) or not (FALSE).
`save_interim_results`	whether to output the interim results (per corpus file) into a tab-separated plain text (TRUE) or not (FALSE – default).
`coll_output_name`	name of the file for the collocate tables.
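
As a rough illustration of what `leipzig_input = TRUE` implies, the base R sketch below strips a leading sentence number from each line. The sample lines and the layout (a number, whitespace, then the sentence) are assumptions made for illustration; this is not the package's own code.

```r
# Hypothetical sample lines in an assumed Leipzig-style layout:
# a sentence number, whitespace (here a tab), then the sentence itself.
leipzig_lines <- c("1\tTim itu akhirnya terkalahkan.",
                   "2\tMereka tidak terkalahkan musim ini.")

# Remove the leading number and the whitespace that follows it,
# leaving one bare sentence per element.
sentences <- sub("^[0-9]+[[:space:]]+", "", leipzig_lines)
sentences
```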

This function, which is largely built on top of the tidyverse, is specifically designed for collocate searches that do not cross the boundary of the sentence in which the search word/pattern occurs. The reason is that the sentences in a corpus can be randomised and totally unrelated to one another (as in the Leipzig Corpora). It is therefore important that the collocates of the search word/pattern stay within the sentence in which the word/pattern occurs. In this way, the function aims to maintain the cohesiveness of meaning around the word.
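
The idea of a sentence-bounded window can be sketched in base R as follows. The helper `collocates_in_sentence()` is hypothetical and written only for illustration; its tokenisation and argument handling are assumptions and do not reproduce the internals of `colloc_sentence()`.

```r
# Hypothetical helper (not part of corplingr): collect the words within
# `span` positions of each node match, never leaving the sentence.
collocates_in_sentence <- function(sentence, pattern, span = 1,
                                   window = c("b", "l", "r")) {
  window <- match.arg(window)
  # Naive tokenisation: split on anything that is not alphanumeric or a hyphen.
  words <- unlist(strsplit(tolower(sentence), "[^[:alnum:]-]+"))
  words <- words[nzchar(words)]
  nodes <- grep(pattern, words, perl = TRUE)
  out <- character(0)
  for (i in nodes) {
    left  <- if (window %in% c("l", "b")) seq(i - span, i - 1) else integer(0)
    right <- if (window %in% c("r", "b")) seq(i + 1, i + span) else integer(0)
    idx <- c(left, right)
    # Clip the window at the sentence boundaries.
    idx <- idx[idx >= 1 & idx <= length(words)]
    out <- c(out, words[idx])
  }
  out
}

# Two unrelated (made-up) sentences, one per element, as in a sentence-based corpus.
sentences <- c("Tim itu akhirnya terkalahkan",
               "Mereka tidak terkalahkan musim ini")
colls <- unlist(lapply(sentences, collocates_in_sentence,
                       pattern = "\\bterkalahkan\\b", span = 2, window = "b"))
colls
```

Because each sentence is processed on its own, the right-hand window of the node at the end of the first sentence stays empty instead of spilling into the first words of the second sentence.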

Moreover, the function only outputs the raw collocate data; it neither tabulates the frequency of the collocates nor computes association measures between the collocates and the search word/pattern. Future iterations of this package aim to accommodate these features.

A tbl_df of raw collocates.

## Not run: 
# get the path of the Leipzig corpora
leipzig_corpus_path <- c("/my/path/to/leipzig_corpus_1M_1.txt",
                         "/my/path/to/leipzig_corpus_300K_2.txt",
                         "/my/path/to/leipzig_corpus_300K_3.txt")
# retrieve collocate list
df <- colloc_sentence(corpus_path = leipzig_corpus_path[2:3],
                      leipzig_input = TRUE,
                      pattern = "\\bterkalahkan\\b",
                      window = "l",
                      span = 1,
                      case_insensitive = TRUE,
                      to_lower_colloc = TRUE,
                      save_interim_results = FALSE)

# see the output
df

# count the frequency of the collocates
# (called without the pipe so that no package needs to be attached)
dplyr::count(df, w, sort = TRUE)

## End(Not run)