conc_re: Build concordance for the matches of a regular expression


Description

The function conc_re builds a concordance for the matches of a regular expression. The result is a dataset that can be written to a file with the function write_dataset. The function mimics the behaviour of the concordance tool in the program AntConc.

Usage

conc_re(pattern,
        x,
        c_left = 200,
        c_right = 200,
        perl = TRUE,
        after_line = "\n",
        file_encoding = "UTF-8",
        as_text = FALSE) 

Arguments

pattern

the argument pattern is a character string that contains the regular expression that serves as search term for the concordancer.

x

the argument x is a character vector that determines which text is used as the corpus. If as_text = TRUE, the content of x is treated as the actual corpus text. If as_text = FALSE, x is treated as a vector of filenames, namely the names of the corpus files that contain the actual corpus data.

c_left

the argument c_left is a number that specifies how many characters to the left of each match must be included in the result as the left co-text of the match.

c_right

the argument c_right is a number that specifies how many characters to the right of each match must be included in the result as the right co-text of the match.
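The role of c_left and c_right can be sketched in base R. This is only an illustration of character-based co-text windows, not the actual implementation of conc_re; the variable names text, start, end, and kwic are assumptions introduced for the example.

```r
# Illustrative sketch in base R; not the mclm implementation.
text <- "A very small corpus."
pattern <- "\\w+"
c_left <- 5   # number of characters kept to the left of each match
c_right <- 5  # number of characters kept to the right of each match

# Locate all matches in the single corpus string.
m <- gregexpr(pattern, text, perl = TRUE)[[1]]
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L

# Cut out the match and its co-text, clipping at the text boundaries.
kwic <- data.frame(
  left  = substring(text, pmax(start - c_left, 1L), start - 1L),
  match = substring(text, start, end),
  right = substring(text, end + 1L, pmin(end + c_right, nchar(text))),
  stringsAsFactors = FALSE
)
print(kwic)
```

With the small values used here, the co-text is cut off in the middle of words; the defaults of 200 characters give a much wider window.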

perl

if perl = TRUE, the argument pattern is treated as a PCRE-flavor regular expression. Otherwise, pattern is treated as a regular expression in R's default regular expression flavor.
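As an illustration of what the PCRE flavor allows, the following base R sketch uses a lookahead, which is PCRE syntax. It calls gregexpr and regmatches directly rather than conc_re, on the assumption that the perl setting has the same meaning in both.

```r
# Illustrative sketch in base R; conc_re itself is not called here.
text <- "A very small corpus."

# (?= corpus) is a PCRE lookahead: match a word only if " corpus" follows,
# without including " corpus" in the match itself.
m <- regmatches(text, gregexpr("\\w+(?= corpus)", text, perl = TRUE))[[1]]
print(m)
```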

after_line

prior to the actual search operation, the lines of each corpus file are concatenated into a single character string, with the value of after_line as the separator between lines. If as_text = TRUE, the argument after_line is ignored.
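The concatenation step can be sketched in base R as follows. This is an assumption about the general approach (read the lines, then join them with the separator), not a copy of the package code; path, lines, and corpus_text are names introduced for the example.

```r
# Illustrative sketch in base R; not the mclm implementation.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), path)

after_line <- "\n"
lines <- readLines(path, encoding = "UTF-8")

# Join all lines of the file into one string, separated by after_line.
corpus_text <- paste(lines, collapse = after_line)

# Because the text is now one string, a pattern can span a line boundary.
spans_boundary <- grepl("line\\nsecond", corpus_text, perl = TRUE)
print(spans_boundary)
```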

file_encoding

each corpus file is read as a text file with the encoding given in file_encoding. If as_text = TRUE, the argument file_encoding is ignored. If as_text = FALSE, file_encoding can be either a character vector of length one or a character vector with the same length as x. In the former case, all files in x are assumed to have the same encoding. In the latter case, different files can have different encodings.

as_text

if as_text = TRUE, the content of the argument x is treated as the actual corpus text. If as_text = FALSE, x is treated as a vector of filenames, namely the names of the corpus files that contain the actual corpus data.

Details

To make sure that the columns left, match, and right in the output of conc_re do not contain any TAB or NEWLINE characters, whitespace in these items is ‘normalized’. More specifically, each stretch of whitespace, i.e. each uninterrupted sequence of whitespace characters, is replaced by a single SPACE character.
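The effect of this normalization can be reproduced with a single base R substitution. The sketch below only illustrates the described behaviour; it is not taken from the package source, and raw and normalized are names introduced for the example.

```r
# Illustrative sketch in base R; not the mclm implementation.
raw <- "small\tcorpus\n\n  with   gaps"

# Replace every uninterrupted run of whitespace with one space.
normalized <- gsub("\\s+", " ", raw, perl = TRUE)
print(normalized)
```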

The values of the items glob_id and id are always identical in a dataset that is the direct output of conc_re. The item glob_id only becomes useful later, for instance when two datasets are merged.
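The difference between the two identifiers can be pictured with plain data frames. The sketch below is hypothetical: it stacks two hand-made result tables with rbind and renumbers glob_id, which is an assumption about how a merged dataset would be numbered and not a demonstration of any merge function in the package.

```r
# Hypothetical sketch with plain data frames; not actual conc objects.
q1 <- data.frame(glob_id = 1:2, id = 1:2,
                 match = c("small", "corpus"),
                 stringsAsFactors = FALSE)
q2 <- data.frame(glob_id = 1:3, id = 1:3,
                 match = c("A", "very", "small"),
                 stringsAsFactors = FALSE)

# Stack both result sets; id keeps the per-query position,
# while glob_id is renumbered over the combined list (assumed behaviour).
merged <- rbind(q1, q2)
merged$glob_id <- seq_len(nrow(merged))
print(merged)
```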

Value

Returns an object of the class conc, which is a kind of data frame with one row per match and the following columns:

glob_id

Number indicating the position of the match in the overall list of matches.

id

Number indicating the position of the match in the list of matches for one specific query.

source

Either the filename of the file in which the match was found (in case of the setting as_text = FALSE), or the string ‘-’ (in case of the setting as_text = TRUE).

left

The left-hand co-text of each match.

match

The actual match.

right

The right-hand co-text of each match.

Examples

(conc_data <- conc_re('\\w+', 'A very small corpus.', as_text = TRUE))
print_kwic(conc_data)

wai-wong-reimagine/mclm documentation built on May 16, 2019, 9:12 p.m.