gederajeg/paracorp: Concordancer for parallel, bilingual corpora

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

paracorp

The goal of paracorp is to provide R functionality for generating parallel concordances (Keyword-in-Context [KWIC] displays) from parallel/bilingual corpora. The first attempt is implemented in the para_conc() function, which is built on top of the tidyverse suite of packages. Please use the following citation if paracorp is used in publications:

citation("paracorp")

The paracorp package is part of the following research project [@rajeg_material_2021]:

Rajeg, Gede Primahadi Wijaya, I Made Rajeg, Putu Dea Indah Kartini & I Gede Semara Dharma Putra. 2021. Material pendukung untuk MODEL KAJIAN TERJEMAHAN BERBASIS BANK DATA TERJEMAHAN DIGITAL INGGRIS-INDONESIA DAN IMPLIKASI PEDAGOGISNYA. Open Science Framework. https://doi.org/10.17605/OSF.IO/Y6ESA. https://osf.io/y6esa/.

The output of the research has been disseminated in several seminars [@rajeg_pemanfaatan_2021; @rajeg_derajat_2021].

Installation

You can install the development version of paracorp from GitHub with:

# install.packages("devtools")
devtools::install_github("gederajeg/paracorp")

Examples

The paracorp package comes with internal sample data from an English-Indonesian parallel corpus of the science genre developed by the PAN BPPT project [@adriani_development_2009; @bppt_statistical_2009]. The data are available as two character vectors: sci_en (the English text) and sci_id (the Indonesian version), which are aligned line by line.
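As a rough illustration of what line-by-line alignment means, the base R sketch below pairs two toy character vectors. The names toy_en and toy_id and their sentences are invented for this illustration; they are not the package data, and the code does not call paracorp.

```r
# Toy stand-ins for sci_en and sci_id (hypothetical sentences, not the package data)
toy_en <- c("The sample should be heated.", "Water boils at 100 degrees.")
toy_id <- c("Sampel harus dipanaskan.", "Air mendidih pada 100 derajat.")

# Line-aligned vectors must have the same length
stopifnot(length(toy_en) == length(toy_id))

# Element i of one vector is the translation of element i of the other
aligned <- data.frame(en = toy_en, id = toy_id, stringsAsFactors = FALSE)

# Look up the Indonesian line paired with the English line containing "should"
aligned$id[grepl("\\bshould\\b", aligned$en, perl = TRUE)]
```

Because the pairing relies only on position, any reordering or filtering has to be applied to both vectors together.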

The code snippet below shows how to generate a parallel concordance with the English modal verb "should" as the search term and present its Indonesian translation (shown in the TRANSLATION column of the output table).

library(paracorp) # load the package

# in this example, the English text is used as the source text
my_para_conc <- para_conc(source_text = sci_en, 
                          target_text = sci_id, 
                          pattern = "\\bshould\\b", # regular expression pattern
                          conc_sample = 20) # retrieve 20 random concordance lines

# peek into the results as tibble/data frame
head(my_para_conc)

The printed messages show that, by default, para_conc() also saves the concordance to a tab-separated plain-text file (called 'parallel_conc.txt' by default), in addition to returning the concordance as a tibble/data frame. The tab-separated 'parallel_conc.txt' file can be opened in MS Excel for further corpus-based analyses.
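A tab-separated file of this kind can also be read back into R instead of MS Excel. The sketch below is a minimal base R round trip on a temporary file. Apart from TRANSLATION, the column names (LEFT, NODE, RIGHT) and the rows are placeholders invented for this illustration, and it is an assumption that the file written by para_conc() has a header row.

```r
# Hypothetical concordance table: only TRANSLATION is a column name mentioned
# in this README; LEFT, NODE, and RIGHT are placeholder names
toy_conc <- data.frame(
  LEFT = c("The sample", "Results"),
  NODE = c("should", "should"),
  RIGHT = c("be heated.", "be reported."),
  TRANSLATION = c("Sampel harus dipanaskan.", "Hasil harus dilaporkan."),
  stringsAsFactors = FALSE
)

# Write the table as tab-separated plain text to a temporary file
tmp <- tempfile(fileext = ".txt")
write.table(toy_conc, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; quote = "" keeps apostrophes and quotation marks in the text intact
conc_back <- read.delim(tmp, quote = "", stringsAsFactors = FALSE)
conc_back$TRANSLATION
```

To read an actual output file, replace tmp with the path to 'parallel_conc.txt', assuming it was written to the current working directory.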

Suppressing the automatic plain-text output

You can suppress the automatic plain-text output by specifying filename = FALSE, as shown below. In this case, para_conc() returns only the tibble/data frame.

# suppress automatic output file behaviour with `filename = FALSE`
my_para_conc <- para_conc(source_text = sci_en, 
                          target_text = sci_id, 
                          pattern = "\\bshould\\b", # regular expression pattern
                          conc_sample = 20, # retrieve 20 random concordance lines
                          filename = FALSE) # suppress automatic output file 

# peek into the results as tibble/data frame
head(my_para_conc)

Switching the source- and target-text inputs

Moreover, the input corpora can be swapped depending on the nature of the corpora or the research question(s). In the example below, the Indonesian text is passed to the source_text argument while the English text is passed to the target_text argument. In this case, the string in the pattern argument of para_conc() should represent the Indonesian search keyword.

# in this example, the Indonesian text is used as the source text
my_para_conc <- para_conc(source_text = sci_id, 
                          target_text = sci_en, 
                          pattern = "\\bmungkin\\b", # regular expression pattern
                          conc_sample = 20) # retrieve 20 random concordance lines

# peek into the results as tibble/data frame
head(my_para_conc)

Sampling numbers

If the requested sample size is greater than or equal to the number of matches for the search pattern, para_conc() prints a message indicating the situation and retrieves all matches found, rather than drawing a sample that is smaller than the total number of matches.

The snippet below shows the scenario and the printed message when the requested sample size is equal to the number of matches.

# sample number requested is equal to the matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 64, filename = FALSE)

Meanwhile, the snippet below shows the scenario and the printed message when the requested sample size is greater than the number of matches.

# sample number requested is greater than the matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 67, filename = FALSE)
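The sampling rule described in this section can be sketched in base R as follows. The helper sample_or_all() and its message text are hypothetical and written only for this illustration; they are not paracorp's internal code, and the 64 placeholder hits simply mirror the match count implied by the examples above.

```r
# Hypothetical helper mirroring the rule described above (not paracorp's own code)
sample_or_all <- function(matches, n) {
  if (n >= length(matches)) {
    # Requested sample is not smaller than the matches: return everything
    message("Requested sample size (", n, ") is not smaller than the number of ",
            "matches (", length(matches), "); returning all matches.")
    return(matches)
  }
  # Otherwise draw n matches at random, without replacement
  matches[sample.int(length(matches), n)]
}

hits <- paste("line", 1:64)      # 64 placeholder concordance lines

length(sample_or_all(hits, 20))  # 20 randomly sampled lines
length(sample_or_all(hits, 64))  # all 64 lines, with a message
length(sample_or_all(hits, 67))  # all 64 lines, with a message
```

Sampling by index with sample.int() avoids the surprise that sample(x, n) treats a single number x as the range 1:x.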

No matches

When no matches are found for the string given in the pattern argument, para_conc() prints a message saying so and produces no output. See the example below.

# For instance, searching for an Indonesian word when the source text is in English
# will most likely produce such a no-match message.
para_conc(sci_en, sci_id, pattern = "\\bmungkin\\b", conc_sample = 20, filename = FALSE)
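Before running a search, it can help to check how many lines a pattern hits. The sketch below does this with base R's grepl() on invented toy sentences. The helper count_hits() is hypothetical, it counts lines containing at least one match rather than individual matches, and it is an assumption that para_conc()'s own regular-expression engine treats these patterns the same way.

```r
# Toy English lines (hypothetical, not the package data)
toy_en <- c("The sample should be heated.", "He looked over his shoulder.")

# Hypothetical helper: number of lines with at least one match for the pattern
count_hits <- function(text, pattern) sum(grepl(pattern, text, perl = TRUE))

count_hits(toy_en, "\\bshould\\b")   # 1: the word boundaries exclude "shoulder"
count_hits(toy_en, "should")         # 2: the bare string also matches "shoulder"
count_hits(toy_en, "\\bmungkin\\b")  # 0: an Indonesian word in English text
```

A count of zero signals that the search would end in the no-match situation, while comparing the bounded and unbounded patterns shows whether a bare string picks up unintended words.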