conc_re: Build concordance for the matches of a regular expression
In wai-wong-reimagine/mclm: Mastering Corpus Linguistics Methods


The function conc_re builds a concordance for the matches of a regular expression. The result is a dataset that can be written to a file with the function write_dataset. The function mimics the behaviour of the concordance tool in the program AntConc.

conc_re(pattern,
        x,
        c_left = 200,
        c_right = 200,
        perl = TRUE,
        after_line = "\n",
        file_encoding = "UTF-8",
        as_text = FALSE)

`pattern`	a character string containing the regular expression that serves as the search term for the concordancer.
`x`	a character vector that determines which text is used as the corpus. If `as_text = TRUE`, the content of `x` is treated as the actual corpus text. If `as_text = FALSE`, `x` is treated as a vector of filenames, which are interpreted as the names of the corpus files that contain the actual corpus data.
`c_left`	a number that specifies how many characters to the left of each match are included in the result as the left co-text of the match.
`c_right`	a number that specifies how many characters to the right of each match are included in the result as the right co-text of the match.
`perl`	if `perl = TRUE`, `pattern` is treated as a PCRE flavor regular expression. Otherwise, `pattern` is treated as a regular expression in R's default regular expression flavor.
`after_line`	prior to the actual search operation, the lines of each corpus file are concatenated into a single character string, with the value of `after_line` as the separator between lines. If `as_text = TRUE`, `after_line` is ignored.
`file_encoding`	the encoding in which each corpus file is read. If `as_text = TRUE`, `file_encoding` is ignored. If `as_text = FALSE`, `file_encoding` can be either a character vector of length one or a character vector with the same length as `x`. In the former case, all files in `x` are assumed to have the same encoding. In the latter case, different files can have different encodings.
`as_text`	if `as_text = TRUE`, the content of `x` is treated as the actual corpus text. If `as_text = FALSE`, `x` is treated as a vector of filenames, which are interpreted as the names of the corpus files that contain the actual corpus data.
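
To make the roles of `pattern`, `c_left`, `c_right`, `perl`, and `after_line` more concrete, the following is a minimal base-R sketch of the kind of extraction these arguments describe. It is not the package's implementation: `sketch_conc` is a hypothetical helper introduced here for illustration, it works on a single in-memory string, and it does not normalize whitespace or handle zero-length matches.

```r
# Hypothetical helper, NOT part of mclm: a base-R approximation of the
# match and co-text extraction described by the arguments above.
sketch_conc <- function(pattern, text, c_left = 200, c_right = 200,
                        perl = TRUE) {
  # Locate all matches of the pattern in the single text string.
  m <- gregexpr(pattern, text, perl = perl)[[1]]
  if (m[1] == -1) {
    # No match: return an empty data frame with the expected columns.
    return(data.frame(left = character(0), match = character(0),
                      right = character(0), stringsAsFactors = FALSE))
  }
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  data.frame(
    # Up to c_left characters before the match (fewer at the text start).
    left  = substring(text, pmax(start - c_left, 1), start - 1L),
    match = substring(text, start, end),
    # Up to c_right characters after the match (fewer at the text end).
    right = substring(text, end + 1L, end + c_right),
    stringsAsFactors = FALSE
  )
}

# The lines of a (simulated) corpus file are first joined into one string,
# using the after_line separator ("\n" by default).
corpus_lines <- c("The cat sat.", "The dog ran.")
text <- paste(corpus_lines, collapse = "\n")

res <- sketch_conc("The", text, c_left = 5, c_right = 5)
print(res)
```

Because the lines are joined before searching, the left co-text of the second match reaches back across the line boundary into the first line.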

To ensure that the columns left, match, and right in the output of conc_re do not contain any TAB or NEWLINE characters, whitespace in these items is ‘normalized’. More specifically, each stretch of whitespace, i.e. each uninterrupted sequence of whitespace characters, is replaced by a single SPACE character.
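
The normalization described above can be approximated in base R with a single substitution. This is only a sketch of the idea: `normalize_ws` is a hypothetical name, and the exact definition of a whitespace character used inside conc_re is assumed here to correspond to the regular expression class `\s`.

```r
# Hypothetical helper, NOT part of mclm: collapse every run of whitespace
# (spaces, tabs, newlines, ...) into a single space character.
normalize_ws <- function(x) gsub("\\s+", " ", x, perl = TRUE)

normalized <- normalize_ws("sat.\n\tThe  dog")
print(normalized)
```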

In a dataset returned by conc_re, the values of the items glob_id and id are always identical. The item glob_id only becomes useful later, for instance when two datasets are merged.
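
The difference between the two identifiers can be illustrated by stacking the results of two queries. The sketch below uses plain data frames with made-up rows as stand-ins for conc objects, and renumbers glob_id by hand; how the package itself handles merging is not covered here, so this is an assumption about one possible workflow.

```r
# Plain data frames standing in for the output of two separate queries
# (illustrative rows only, not real conc_re output).
conc_a <- data.frame(glob_id = 1:2, id = 1:2,
                     match = c("run", "ran"),
                     stringsAsFactors = FALSE)
conc_b <- data.frame(glob_id = 1:3, id = 1:3,
                     match = c("walk", "walks", "walked"),
                     stringsAsFactors = FALSE)

# Stack the two datasets, then renumber glob_id over the combined rows.
# id keeps the position of each match within its own query.
merged <- rbind(conc_a, conc_b)
merged$glob_id <- seq_len(nrow(merged))
print(merged)
```

After the merge, id restarts at 1 for the second query, while glob_id numbers all matches consecutively.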

Returns an object of the class conc, which is a kind of data frame in which each row is a match, with the following columns:

`glob_id`	Number indicating the position of the match in the overall list of matches.
`id`	Number indicating the position of the match in the list of matches for one specific query.
`source`	Either the filename of the file in which the match was found (in case of the setting `as_text = FALSE`), or the string ‘-’ (in case of the setting `as_text = TRUE`).
`left`	The left-hand co-text of each match.
`match`	The actual match.
`right`	The right-hand co-text of each match.