How to add a sentence annotation to an indexed corpus

About

Adding a sentence annotation as a structural attribute to an existing corpus is a frequent scenario. This vignette offers a basic recipe for a corpus that already includes a part-of-speech annotation. The GermaParlSample corpus serves as an example.

Getting started

In addition to the cwbtools package, we use the functionality of the RcppCWB package to decode the p-attribute 'pos'. The same could be achieved using the higher-level get_token_stream() method of the polmineR package (sketched below), but we want to avoid creating an additional dependency of this package.

library(cwbtools)
library(RcppCWB)
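
As an aside, the polmineR route mentioned above would look roughly as follows. This is only a sketch and is not run in this vignette; it assumes polmineR is installed and the corpus has already been set up.

# Illustrative only (not run here): higher-level polmineR equivalent of
# decoding the 'pos' p-attribute for the entire corpus.
library(polmineR)
pos <- get_token_stream("GERMAPARLSAMPLE", p_attribute = "pos")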

For the purposes of this vignette, we will work with a temporary copy of the corpus we wish to augment, so we create temporary registry and corpus directories.

registry_dir_tmp <- fs::path(tempdir(), "registry_dir_tmp")
corpus_dir_tmp <- fs::path(tempdir(), "corpus_dir_tmp")

dir.create(path = registry_dir_tmp)
dir.create(path = corpus_dir_tmp)

regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_dir_tmp)

The GermaParlSample corpus can be downloaded from Zenodo as follows.

is_corpus_available <- cwbtools::corpus_install(
  doi = "10.5281/zenodo.3823245",
  registry_dir = registry_dir_tmp, corpus_dir = corpus_dir_tmp,
  verbose = FALSE
)
list.files(file.path(corpus_dir_tmp, "germaparlsample"))

Recipe

We generate the data for the sentence annotation from the part-of-speech annotation that is already present.

First, we decode the p-attribute "pos".

germaparl_size <- cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "word", attribute_type = "p"
)
cpos_vec <- seq.int(from = 0L, to = germaparl_size - 1L)
ids <- cl_cpos2id(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", cpos = cpos_vec)
pos <- cl_id2str(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", id = ids)

The pos tag for sentence-ending punctuation in the Stuttgart-Tübingen Tag Set (STTS) is "$.". From this information, we can easily generate a region matrix with the start and end corpus positions of sentences.

sentence_end <- grep("\\$\\.", pos)
sentence_factor <- cut(x = cpos_vec, breaks = c(0L, sentence_end), include.lowest = TRUE, right = FALSE)
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
region_matrix <- do.call(rbind, lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)])))

Let us see what this looks like:

head(region_matrix)

And this is how the new annotation layer is written back to the corpus.

s_attribute_encode(
  values = as.character(seq.int(from = 0L, to = nrow(region_matrix) - 1L)),
  data_dir = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["home"]],
  s_attribute = "s",
  corpus = "GERMAPARLSAMPLE",
  region_matrix = region_matrix,
  method = "R",
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["properties"]][["charset"]],
  delete = TRUE,
  verbose = TRUE
)

Checking the result

To see whether everything has worked, we get the left and right boundaries of the sentence that contains corpus position 60 and decode the tokens within that region.

left <- cl_cpos2lbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
right <- cl_cpos2rbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
ids <- cl_cpos2id("GERMAPARLSAMPLE", cpos = left:right, p_attribute = "word")
cl_id2str("GERMAPARLSAMPLE", p_attribute = "word", id = ids)
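
As an additional plausibility check (a suggestion beyond the original recipe), the number of regions of the newly encoded structural attribute can be queried and compared with the number of rows of the region matrix we passed to s_attribute_encode().

# Number of regions of the new s-attribute 's';
# this should equal nrow(region_matrix).
cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "s",
  attribute_type = "s"
)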

Cleaning up

As a matter of housekeeping, we remove the temporary directories and restore the value of the environment variable CORPUS_REGISTRY.

unlink(corpus_dir_tmp, recursive = TRUE)
unlink(registry_dir_tmp, recursive = TRUE)

Sys.setenv(CORPUS_REGISTRY = regdir_envvar)

Next Steps

Using the part-of-speech annotation is a basic approach to obtain the data for annotating sentences. An alternative would be to use the NLP annotation machinery of an integrated tool such as Stanford CoreNLP or OpenNLP. But that is a different story to be told.
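
To give a flavour of that alternative route, here is a minimal sketch of sentence detection with the openNLP package, assuming it and its model data are installed; mapping the detected sentence boundaries back to corpus positions would still be up to you.

# Minimal sketch of sentence detection with openNLP (not used in this recipe).
library(NLP)
library(openNLP)

txt <- as.String("The cat sat on the mat. The dog barked.")
sent_annotator <- Maxent_Sent_Token_Annotator()
sentences <- annotate(txt, sent_annotator)
txt[sentences]  # extract the detected sentences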


