
Overview

kwic is an R package for producing keyword-in-context (KWIC) concordances from linguistic corpora:

library(kwic)
data(dickensl)
kwic(dickensl, "the")

The kwic package is designed to fit into existing R data structures. The method works for several representations of corpora:

The first three corpus types are defined in the Text Interchange Formats: https://github.com/ropensci/tif

Several options give fine-grained control over the output.

data(dickensl)
k <- kwic(dickensl, "the")
print(k, sort.by="right")
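
To make the idea of a concordance sorted by right context concrete, here is a minimal base-R sketch. It is not the package's implementation: the helper `simple_kwic` and its arguments are hypothetical names introduced for illustration, and it only handles a fixed search string over a character vector with a window measured in characters.

```r
# Hypothetical helper, not part of the kwic package: a character-window
# concordance for a fixed string over a character vector.
simple_kwic <- function(texts, pattern, left = 20, right = 20) {
  rows <- list()
  for (i in seq_along(texts)) {
    starts <- gregexpr(pattern, texts[i], fixed = TRUE)[[1]]
    if (starts[1] == -1) next            # no match in this text
    for (s in starts) {
      e <- s + nchar(pattern) - 1        # last character of the node
      rows[[length(rows) + 1]] <- data.frame(
        doc   = i,
        left  = substr(texts[i], max(1, s - left), s - 1),
        node  = substr(texts[i], s, e),
        right = substr(texts[i], e + 1, e + right),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

texts <- c("It was the best of times, it was the worst of times",
           "the age of wisdom")
k <- simple_kwic(texts, "the", left = 10, right = 10)
k[order(k$right), ]   # order the lines by their right context
```

Because the window is counted in characters, contexts can start or end in the middle of a word.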

Building kwic

The pattern argument is interpreted either as a regular expression or as a fixed string, depending on the value of the argument "fixed":

data(dickensl)
k <- kwic(dickensl, "(is|are|was)", fixed=FALSE)
print(k, sort.by="right")
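
The difference between the two readings of a pattern can be seen with base R's `grepl`, which has a `fixed` argument of its own. This is only an analogy under the assumption that the package's "fixed" argument behaves similarly; the sample strings are made up.

```r
x <- c("he is here", "the literal text (is|are|was)")

# Regular expression: the alternation matches "is", "are" or "was" anywhere.
as_regexp <- grepl("(is|are|was)", x, fixed = FALSE)

# Fixed string: parentheses and bars are ordinary characters, so only the
# second element, which contains the pattern verbatim, matches.
as_fixed <- grepl("(is|are|was)", x, fixed = TRUE)

as_regexp
as_fixed
```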

To make the regexp match whole tokens rather than any substring within a token, add word-boundary anchors (\\b) around it:

data(dickensl)
k <- kwic(dickensl, "\\b(is|are|was)\\b", fixed=FALSE)
print(k, sort.by="right")
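
The effect of the anchors can be checked on a few made-up tokens with base R's `grepl`, independently of the package:

```r
tokens <- c("is", "this", "are", "area", "was", "wash")

# Without anchors the alternation also matches inside longer tokens.
loose <- grepl("(is|are|was)", tokens)

# With \\b on both sides only whole-token occurrences match.
strict <- grepl("\\b(is|are|was)\\b", tokens)

tokens[loose]
tokens[strict]
```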

Printing kwic

Select which lines are printed with the from and to arguments:

k <- kwic(dickensl, "the")
print(k, sort.by="right", from=3, to=4)

Token-sensitive kwic

The window size may be given as a number of characters or as a number of tokens, according to the value of the parameter "unit". Such token-sensitive kwic is possible only on tokenized corpora (which excludes vectors of untokenized strings).

With token-sensitive kwic, no token is truncated at the beginning or the end of the line. This may be useful if the kwic lines are to be used as a subcorpus for further analyses.

In the following example five tokens are displayed on both sides:

data(dickensl)
k <- kwic(dickensl, "the", 5, 5, unit="token")
print(k)
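
As a rough illustration of a window counted in tokens, the sketch below builds concordance lines from a character vector of tokens using base R only. The helper `token_kwic` is a hypothetical name and not the package's code; it assumes the text is already tokenized and that the node is matched by exact equality.

```r
# Hypothetical helper, not the package's code: token-window concordance
# over an already tokenized text (a character vector of tokens).
token_kwic <- function(tokens, node, left = 5, right = 5) {
  hits <- which(tokens == node)
  data.frame(
    # the last `left` tokens before the node
    left  = vapply(hits, function(i)
      paste(tail(tokens[seq_len(i - 1)], left), collapse = " "), character(1)),
    node  = tokens[hits],
    # the first `right` tokens after the node
    right = vapply(hits, function(i)
      paste(head(tokens[-seq_len(i)], right), collapse = " "), character(1)),
    stringsAsFactors = FALSE
  )
}

tokens <- strsplit("it was the best of times it was the worst of times", " ")[[1]]
k2 <- token_kwic(tokens, "the", left = 2, right = 2)
k2
```

Since whole tokens are taken on each side, no word is cut at the edge of a line.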

With this type of kwic, sorting lines by the left context produces a different effect: the lines are sorted by the beginning of the last word on the left, not by the last character on the left:

print(k, sort.by="left")

Sorting can also use the token at the n-th position to the left or right of the node. With "sort.by=-2", lines are ordered by the second token on the left; with "sort.by=2", by the second token on the right.

print(k, sort.by=-2)
print(k, sort.by=2)
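
The signed-position convention can be sketched in base R as follows. The data structure (a list of lines, each holding its contexts as token vectors) and the helpers `key_at` and `sort_lines` are assumptions made for this illustration, not the package's internals.

```r
# Hypothetical data: each concordance line keeps its contexts as token vectors.
conc <- list(
  list(left = c("it", "was"),  node = "the", right = c("best", "of")),
  list(left = c("in", "all"),  node = "the", right = c("worst", "age")),
  list(left = c("at", "once"), node = "the", right = c("age", "of"))
)

# Token at signed position n: n < 0 counts leftwards from the node,
# n > 0 counts rightwards.
key_at <- function(line, n) {
  if (n < 0) rev(line$left)[-n] else line$right[n]
}

# Reorder the lines by the token found at position n.
sort_lines <- function(conc, n) {
  keys <- vapply(conc, key_at, character(1), n = n)
  conc[order(keys)]
}

by_left2  <- sort_lines(conc, -2)  # second token on the left
by_right2 <- sort_lines(conc, 2)   # second token on the right
```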

Case study

Reading plain text files in a directory with tm

Kwic does not address the issue of reading files or walking through directories. The tm package handles this nicely:

library(tm)
d <- system.file("plaintexts", package="kwic")
corpus <- VCorpus(
  DirSource(directory=d, encoding="UTF-8"),
  readerControl = list(reader=readPlain)
)
kwic(corpus, "the")

Reading tagged (tabulated) files in a directory

Here, several tagged (tabulated) files are in a directory. First, we list the file names:

d <- system.file("taggedtexts", package="kwic")
files <- dir(d, pattern = "\\.txt$")  # pattern is a regular expression, not a glob

Below, each file is read as a data frame and stored in a list. The data frames are then combined into one large data.frame with rbind. We also add a column holding an id for each text.

corpusl <- lapply(
  files,
  function(x) read.table(
    file.path(d, x),
    quote="", sep="\t", header=TRUE,
    fileEncoding="ISO-8859-1", stringsAsFactors=FALSE
  )
)

corpus <- do.call("rbind", corpusl)

corpus$doc_id <- rep(files, times=sapply(corpusl, nrow))

kwic(corpus, "Paris", token.column="lemme", left=30, right=30) #, unit="token"
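
The combine-and-label step can be tried without any files on disk. In this sketch the two small data frames, their column names and the file names are made-up stand-ins for the tagged texts:

```r
# Toy stand-ins for the per-file data frames (hypothetical contents).
toy_list <- list(
  data.frame(word = c("I", "went"), lemme = c("I", "go"),
             stringsAsFactors = FALSE),
  data.frame(word = c("to", "Paris", "yesterday"),
             lemme = c("to", "Paris", "yesterday"),
             stringsAsFactors = FALSE)
)
toy_files <- c("a.txt", "b.txt")

# Stack the data frames, then repeat each file name once per row it contributed.
toy_corpus <- do.call("rbind", toy_list)
toy_corpus$doc_id <- rep(toy_files, times = sapply(toy_list, nrow))
toy_corpus
```

Repeating each name by the row count of its data frame keeps the ids aligned with the tokens, provided the list and the vector of file names are in the same order.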


sylvainloiseau/kwic documentation built on May 26, 2019, 5:31 a.m.