View source: R/corplingr_colloc_sentence.R
Performs a collocate search for a given word/pattern in corpus text files that contain one sentence per line (e.g., the Leipzig Corpora) (cf. Details below).
If the input is structured otherwise, such that a line of the corpus does not correspond to a sentence, use colloc_default.
corpus_path | character strings of (full) filepath for the corpus text files in
leipzig_input | logical; to check if the input corpus is specifically the Leipzig corpus files (
pattern | regular expressions/exact patterns for the target pattern.
window | window-span direction of the collocates:
span | integer vector indicating the span of the collocate scope.
case_insensitive | whether the search ignores case (TRUE, the default) or not (FALSE).
to_lower_colloc | whether to lowercase the retrieved collocates and the nodes (TRUE, the default) or not (FALSE).
save_interim_results | whether to output the interim results (per corpus file) into a tab-separated plain text file (TRUE) or not (FALSE, the default).
coll_output_name | name of the file for the collocate tables.
This function, which is largely built on top of the tidyverse, is specifically designed for collocate searches that do not cross the boundary of the sentence in which the search word/pattern occurs.
The reason is that the sentences in a corpus can be randomised and totally unrelated to one another (as in the Leipzig Corpora).
Thus, it is important to keep the collocates of the search word/pattern within the boundary of the sentence in which the word/pattern occurs. This aims to maintain the cohesiveness of the meaning of the word.
Moreover, the function only outputs the raw collocate data: it neither tabulates the frequency of the collocates nor computes association measures between the collocates and the search word/pattern. A future iteration of this package aims to accommodate these features.
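The sentence-bounded behaviour described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name sentence_collocates, its output columns sent_id, node and position, the whitespace tokenisation, and the window codes "r" and "b" are assumptions made for illustration (only "l" and the column w appear in the Examples).

```r
# Hypothetical helper (not part of the package): collect collocates of a node
# pattern without ever crossing a line (= sentence) boundary.
sentence_collocates <- function(sentences, pattern, span = 1L,
                                window = c("l", "r", "b"),
                                ignore_case = TRUE) {
  window <- match.arg(window)  # "r" and "b" are assumed codes
  span <- as.integer(span)
  out <- list()
  for (i in seq_along(sentences)) {
    # naive whitespace tokenisation; punctuation stays attached to tokens
    tokens <- strsplit(trimws(sentences[i]), "\\s+")[[1]]
    hits <- grep(pattern, tokens, ignore.case = ignore_case, perl = TRUE)
    for (h in hits) {
      left  <- if (window %in% c("l", "b")) (h - span):(h - 1L) else integer(0)
      right <- if (window %in% c("r", "b")) (h + 1L):(h + span) else integer(0)
      pos <- c(left, right)
      # clip the window to the current sentence: positions outside it are dropped
      pos <- pos[pos >= 1L & pos <= length(tokens)]
      if (length(pos) > 0L) {
        out[[length(out) + 1L]] <- data.frame(
          sent_id = i, node = tokens[h], w = tokens[pos],
          position = pos - h, stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(out) == 0L) {
    return(data.frame(sent_id = integer(0), node = character(0),
                      w = character(0), position = integer(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, out)
}

# Toy corpus: one sentence per line, in arbitrary order
sentences <- c("Tim itu akhirnya terkalahkan juga",
               "Terkalahkan oleh lawan yang kuat",
               "Dia tidak terkalahkan")

res <- sentence_collocates(sentences, "\\bterkalahkan\\b",
                           span = 1L, window = "l")
res$w          # "akhirnya" "tidak"
table(res$w)   # simple frequency tabulation of the raw collocates
```

In the second toy sentence the node is the first token, so its left window is empty; "juga" from the preceding line is not picked up, which is the point of keeping the window inside the sentence.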
A tbl_df of raw collocates.
## Not run:
# get the path of the Leipzig corpora
leipzig_corpus_path <- c("/my/path/to/leipzig_corpus_1M_1.txt",
                         "/my/path/to/leipzig_corpus_300K_2.txt",
                         "/my/path/to/leipzig_corpus_300K_3.txt")
# retrieve collocate list
df <- colloc_sentence(corpus_path = leipzig_corpus_path[2:3],
                      leipzig_input = TRUE,
                      pattern = "\\bterkalahkan\\b",
                      window = "l",
                      span = 1,
                      case_insensitive = TRUE,
                      to_lower_colloc = TRUE,
                      save_interim_results = FALSE)
# see the output
df
# count the frequency of the collocates
dplyr::count(df, w, sort = TRUE)
## End(Not run)