text_cooc: Build textual collocation frequencies.

Description

Builds textual collocation frequencies for a specific node.

Usage

text_cooc(x, 
          re_node,
          re_boundary = NA,
          re_drop_line = NA,
          line_glue = NA,
          re_cut_area = NA,
          re_token_splitter = "\\s+",
          re_token_extractor = "[^\\s]+",
          re_drop_token = NA,
          re_token_transf_in = NA,
          token_transf_out = NA,
          token_to_lower = TRUE,
          perl = TRUE,
          blocksize = 300,
          verbose = FALSE,
          dot_blocksize = 10,
          file_encoding = "UTF-8")

Arguments

x

an object containing the names of the corpus files that make up the corpus.

re_node

regular expression used for identifying instances of the ‘node’, i.e. the target item, for which surface collocation information is collected. Any token that contains a match for re_node is considered to be an instance of the ‘node’.
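
As a hedged illustration (not code from the package itself), the matching behaviour can be mimicked with grepl(); the node regex below is hypothetical:

  # re_node is tested against individual tokens; any token that
  # contains a match counts as an instance of the node
  tokens  <- c("the", "government", "governments", "of", "France")
  re_node <- "governments?"                # hypothetical node regex
  grepl(re_node, tokens, perl = TRUE)
  # [1] FALSE  TRUE  TRUE FALSE FALSE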

re_boundary

regular expression used for identifying boundaries between ‘textual units’. Any token that contains a match for re_boundary is considered to be a boundary in between two ‘textual units’.
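
A minimal sketch of this matching behaviour, assuming hypothetical sentence tags that survive tokenization as tokens of their own:

  # tokens matching re_boundary mark the edges of textual units
  tokens      <- c("<s>", "The", "cat", "sat", "</s>", "<s>", "It", "slept", "</s>")
  re_boundary <- "^</?s>$"                 # hypothetical boundary regex
  grepl(re_boundary, tokens, perl = TRUE)
  # [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE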

re_drop_line

if re_drop_line is NA, then this argument is ignored. Otherwise, re_drop_line is a character vector (assumed to be of length 1) containing a regular expression. Lines in x that contain a match for re_drop_line are treated as not belonging to the corpus and are excluded from the results.
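
A minimal sketch of the effect, using a hypothetical header regex:

  # lines matching re_drop_line are excluded before further processing
  lines        <- c("<header id=1>", "The cat sat.", "It slept.")
  re_drop_line <- "^<header"               # hypothetical regex
  lines[!grepl(re_drop_line, lines, perl = TRUE)]
  # [1] "The cat sat." "It slept."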

line_glue

if line_glue is NA, then this argument is ignored. Otherwise, all lines in a corpus file (or in x, if as_text is TRUE) are glued together into one character vector of length 1, with the string line_glue pasted in between consecutive lines. The value of line_glue can also be the empty string "". The ‘line glue’ operation is conducted immediately after the ‘drop line’ operation.
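
The effect can be mimicked with paste(); this sketch is merely illustrative:

  # consecutive lines are glued together with line_glue in between
  lines     <- c("The cat", "sat on", "the mat.")
  line_glue <- " "
  paste(lines, collapse = line_glue)
  # [1] "The cat sat on the mat."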

re_cut_area

if re_cut_area is NA, then this argument is ignored. Otherwise, all matches in a corpus file (or in x, if as_text is TRUE) are ‘cut out’ of the text prior to the identification of the tokens in the text (and are therefore not taken into account when identifying the tokens). The ‘cut area’ operation is conducted immediately after the ‘line glue’ operation.
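
A hedged illustration with gsub(), cutting out hypothetical XML tags:

  # matches for re_cut_area are removed before tokens are identified
  txt         <- "<w pos=\"DET\">The</w> <w pos=\"N\">cat</w>"
  re_cut_area <- "<[^>]*>"                 # hypothetical regex
  gsub(re_cut_area, "", txt, perl = TRUE)
  # [1] "The cat"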

re_token_splitter

the actual token identification is based either on re_token_splitter, a regular expression that identifies the areas between the tokens, or on re_token_extractor, a regular expression that identifies the areas that are the tokens themselves. The first mechanism is the default: the argument re_token_extractor is only used if re_token_splitter is NA.

More specifically, re_token_splitter is a regular expression that identifies the locations where lines in the corpus files are split into tokens. The ‘token identification’ operation is conducted immediately after the ‘cut area’ operation.
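
The default mechanism can be mimicked with strsplit(); a minimal sketch:

  # with the default re_token_splitter "\\s+", lines are split on
  # whitespace and the pieces in between are the tokens
  txt <- "The cat   sat"
  strsplit(txt, "\\s+", perl = TRUE)[[1]]
  # [1] "The" "cat" "sat"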

re_token_extractor

a regular expression that identifies the locations of the actual tokens. This argument is only used if re_token_splitter is NA. Whereas matches for re_token_splitter are identified as the areas between the tokens, matches for re_token_extractor are identified as the areas that are the tokens themselves. Currently, the implementation of re_token_extractor is considerably less time-efficient than that of re_token_splitter. The ‘token identification’ operation is conducted immediately after the ‘cut area’ operation.
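
The extractor mechanism can be mimicked with gregexpr() and regmatches(); a minimal sketch using the default pattern:

  # with the default re_token_extractor "[^\\s]+", the matches
  # themselves are the tokens
  txt <- "The cat   sat"
  regmatches(txt, gregexpr("[^\\s]+", txt, perl = TRUE))[[1]]
  # [1] "The" "cat" "sat"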

re_drop_token

a regular expression that identifies tokens that are to be excluded from the results. Any token that contains a match for re_drop_token is removed from the results. If re_drop_token is NA, this argument is ignored. The ‘drop token’ operation is conducted immediately after the ‘token identification’ operation.
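
A minimal sketch, dropping hypothetical punctuation-only tokens:

  # tokens containing a match for re_drop_token are removed
  tokens        <- c("The", "cat", ",", "sat", ".")
  re_drop_token <- "^[[:punct:]]+$"        # hypothetical regex
  tokens[!grepl(re_drop_token, tokens, perl = TRUE)]
  # [1] "The" "cat" "sat"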

re_token_transf_in

a regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument token_transf_out.

If both re_token_transf_in and token_transf_out differ from NA, then all matches, in the tokens, for the regular expression re_token_transf_in are replaced with the replacement string token_transf_out.

The ‘token transformation’ operation is conducted immediately after the ‘drop token’ operation.

token_transf_out

a ‘replacement string’. This argument works together with re_token_transf_in and is ignored if re_token_transf_in is NA.
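
The transformation can be mimicked with gsub(); the regex below is a hypothetical (and crude) way of stripping a final s:

  # matches for re_token_transf_in are replaced with token_transf_out
  tokens             <- c("cats", "dogs", "sat")
  re_token_transf_in <- "s$"               # hypothetical regex
  token_transf_out   <- ""
  gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)
  # [1] "cat" "dog" "sat"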

token_to_lower

a boolean value that determines whether or not tokens must be converted to lowercase before returning the result.

The ‘token to lower’ operation is conducted immediately after the ‘token transformation’ operation.

perl

a boolean value that determines whether or not the PCRE regular expression flavor is being used in the arguments that contain regular expressions.

blocksize

a number indicating how many corpus files are read into memory ‘at each individual step’ of the procedure. Normally the default value of 300 should not need to be changed, but when one works with exceptionally small corpus files it may be worthwhile to use a higher number, and when one works with exceptionally large corpus files it may be worthwhile to use a lower number.

verbose

if verbose is TRUE, messages are printed to the console to indicate progress.

dot_blocksize

if verbose is TRUE, dots are printed to the console to indicate progress; dot_blocksize determines how many corpus files are processed for each dot that is printed.

file_encoding

file encoding that is assumed in the corpus files.

Details

Two major steps can be distinguished in the procedure conducted by text_cooc. The first major step is the identification of the (sequence of) tokens that, for the purpose of this analysis, will be considered to be the content of the corpus. The function arguments that jointly determine the details of this step are re_drop_line, line_glue, re_cut_area, re_token_splitter, re_token_extractor, re_drop_token, re_token_transf_in, token_transf_out, and token_to_lower. The sequence of tokens that is the ultimate outcome of this step is then handed over to the second major step of the procedure.

The second major step is the establishment of the co-occurrence frequencies. The function arguments that jointly determine the details of this step are re_node and re_boundary. It is important to note that this second step is conducted after the tokens of the corpus have been identified, and that it applies to the sequence of tokens, not to the original text. More specifically, the regular expressions re_node and re_boundary are tested against individual tokens, as identified by the token identification procedure.
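
A hedged sketch of a complete call; the corpus directory, the node regex, and the boundary regex below are hypothetical, and the appropriate choices depend on the corpus at hand:

  # collect collocation frequencies for the node "government(s)",
  # with <s>/</s> tokens serving as boundaries between textual units
  library(mclm)
  corpus_files <- list.files("my_corpus", full.names = TRUE)  # hypothetical path
  ci <- text_cooc(corpus_files,
                  re_node     = "governments?",
                  re_boundary = "^</?s>$")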

Value

The function text_cooc returns an object of the class "cooc_info", containing information on co-occurrence frequencies.
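
Continuing the hypothetical call shown under Details, str() offers a generic way of inspecting the returned object (its internal components are not described here):

  # inspect the structure of the returned "cooc_info" object
  str(ci)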

