Description

Builds textual collocation frequencies for a specific node.

Usage
text_cooc(x,
          re_node,
          re_boundary = NA,
          re_drop_line = NA,
          line_glue = NA,
          re_cut_area = NA,
          re_token_splitter = "\\s+",
          re_token_extractor = "[^\\s]+",
          re_drop_token = NA,
          re_token_transf_in = NA,
          token_transf_out = NA,
          token_to_lower = TRUE,
          perl = TRUE,
          blocksize = 300,
          verbose = FALSE,
          dot_blocksize = 10,
          file_encoding = "UTF-8")
Arguments

x

the corpus source, typically a character vector containing the names of the corpus files.
re_node

a regular expression used for identifying instances of the
‘node’, i.e. the target item for which collocation
information is collected. Any token that contains a match for
re_node is treated as an instance of the node.
re_boundary

a regular expression used for identifying boundaries between
‘textual units’. Any token that contains a match for
re_boundary is treated as a boundary between textual units.
re_drop_line

if re_drop_line is not NA, then lines in the corpus that contain a
match for re_drop_line are removed from the data before any further
processing.
line_glue

if line_glue is not NA, then all lines in a corpus file are glued
together into one single string, with the value of line_glue pasted
between consecutive lines. The ‘line glue’ operation is conducted
immediately after the ‘drop line’ operation.
re_cut_area

if re_cut_area is not NA, then all matches for re_cut_area are
removed from the corpus text. The ‘cut area’ operation is conducted
immediately after the ‘line glue’ operation.
re_token_splitter

a regular expression that identifies the areas between the tokens.
The actual token identification is either based on re_token_splitter
or on re_token_extractor; more specifically, re_token_extractor is
only used if re_token_splitter is NA.
re_token_extractor

a regular expression that identifies the locations of the actual
tokens. This argument is only used if re_token_splitter is NA.
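The splitter-versus-extractor distinction can be mimicked in base R. The snippet below is only an illustration of the two tokenization modes with the default patterns from the usage section above; the sample sentence is invented for the example and this is not the function's internal code.

```r
txt <- "The engine's output is a stream of tokens."

# Mode 1: re_token_splitter describes the areas BETWEEN tokens.
tokens_split <- unlist(strsplit(txt, "\\s+", perl = TRUE))

# Mode 2: re_token_extractor describes the tokens themselves
# (only consulted when re_token_splitter is NA).
m <- gregexpr("[^\\s]+", txt, perl = TRUE)
tokens_extract <- regmatches(txt, m)[[1]]

# For simple whitespace-delimited text the two modes agree.
identical(tokens_split, tokens_extract)  # TRUE
```

The extractor mode becomes useful when tokens cannot be characterized by what separates them, e.g. when only certain character runs should count as tokens.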
re_drop_token

a regular expression that identifies tokens that are to be excluded
from the results. Any token that contains a match for re_drop_token
is removed from the results. The ‘drop token’ operation is conducted
immediately after the token identification operation.
re_token_transf_in

a regular expression that identifies areas in the tokens that are to
be transformed. This argument works together with the argument
token_transf_out. If both re_token_transf_in and token_transf_out
differ from NA, then all matches in the tokens for re_token_transf_in
are replaced with the replacement string token_transf_out. The ‘token
transformation’ operation is conducted immediately after the ‘drop
token’ operation.
token_transf_out

a ‘replacement string’. This argument works together with
re_token_transf_in and is ignored if re_token_transf_in is NA.
token_to_lower

a boolean value that determines whether or not tokens are converted
to lowercase before the result is returned. The ‘token to lower’
operation is conducted immediately after the ‘token transformation’
operation.
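The order of the three token operations described above (‘drop token’, then ‘token transformation’, then ‘token to lower’) can be sketched in base R. The token vector and the patterns below are invented for the illustration; this is not the function's internal code.

```r
tokens <- c("The", "<p>", "cats", "sat", "<p>", "Mats")

re_drop_token      <- "^<.*>$"  # hypothetical: drop markup-like tokens
re_token_transf_in <- "s$"      # hypothetical: strip a word-final "s"
token_transf_out   <- ""

# 1. 'drop token': remove tokens matching re_drop_token
tokens <- tokens[!grepl(re_drop_token, tokens, perl = TRUE)]
# 2. 'token transformation': replace matches with token_transf_out
tokens <- gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)
# 3. 'token to lower': lowercase the surviving tokens
tokens <- tolower(tokens)

tokens  # "the" "cat" "sat" "mat"
```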
perl

a boolean value that determines whether or not the PCRE flavor of
regular expressions is used in the arguments that contain regular
expressions.
blocksize

a number that indicates how many corpus files are read into memory
‘at each individual step’ during the steps in the procedure; normally
the default value of 300 should not be changed.
verbose

if verbose is TRUE, progress information is written to the console
while the procedure is running.
dot_blocksize

if verbose is TRUE, a dot is written to the console each time an
additional dot_blocksize corpus files have been processed.
file_encoding

the file encoding that is assumed in the corpus files.
Details

Two major steps can be distinguished in the procedure conducted by
text_cooc. The first major step is the identification of the
(sequence of) tokens that, for the purpose of this analysis, will be
considered to be the content of the corpus. The function arguments
that jointly determine the details of this step are re_drop_line,
line_glue, re_cut_area, re_token_splitter, re_token_extractor,
re_drop_token, re_token_transf_in, token_transf_out, and
token_to_lower. The sequence of tokens that is the ultimate outcome
of this step is then handed over to the second major step of the
procedure.
The second major step is the establishment of the co-occurrence
frequencies. The function arguments that jointly determine the
details of this step are re_node and re_boundary. It is important to
know that this second step is conducted after the tokens of the
corpus have been identified, and that it is applied to the sequence
of tokens, not to the original text. More specifically, the regular
expressions re_node and re_boundary are tested against individual
tokens, as identified by the token identification procedure.
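This second step can be sketched in base R. The snippet assumes a hypothetical boundary token "&lt;eou&gt;" and a hypothetical node "cat"; it only illustrates the point that re_node and re_boundary are matched against individual tokens, and is not the function's actual implementation.

```r
tokens <- c("the", "cat", "sat", "<eou>", "a", "dog", "sat", "<eou>",
            "the", "cat", "slept")
re_node     <- "^cat$"    # hypothetical node pattern
re_boundary <- "^<eou>$"  # hypothetical boundary pattern

# split the token sequence into textual units at the boundary tokens
is_boundary <- grepl(re_boundary, tokens, perl = TRUE)
unit_id <- cumsum(is_boundary)
units <- split(tokens[!is_boundary], unit_id[!is_boundary])

# tally the tokens that share a textual unit with the node
cooc <- table(unlist(lapply(units, function(u) {
  if (any(grepl(re_node, u, perl = TRUE))) u[!grepl(re_node, u, perl = TRUE)]
})))
cooc  # sat: 1, slept: 1, the: 2
```

Note that "dog" and "a" are not counted: they occur in a textual unit that contains no instance of the node.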
Value

The function text_cooc returns an object of the class "cooc_info",
containing information on co-occurrence frequencies.