knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
kwic is an R package for producing keyword-in-context (KWIC) concordances from linguistic corpora:
library(kwic)
data(dickensl)
kwic(dickensl, "the")
The kwic package is intended to fit into existing R data structures. Its methods work with several representations of corpora:
The first three corpus types are defined in the Text Interchange Formats (tif): https://github.com/ropensci/tif
Several options give fine control over the output.
data(dickensl)
k <- kwic(dickensl, "the")
print(k, sort.by="right")
The pattern argument is interpreted either as a regexp or as a fixed string, depending on the value of the argument "fixed":
data(dickensl)
k <- kwic(dickensl, "(is|are|was)", fixed=FALSE)
print(k, sort.by="right")
To make the regexp match whole tokens rather than any substring of a token, add word-boundary anchors:
data(dickensl)
k <- kwic(dickensl, "\\b(is|are|was)\\b", fixed=FALSE)
print(k, sort.by="right")
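The effect of the anchors can be checked outside kwic. The sketch below uses only base R's grepl on a made-up vector of tokens (the token list is an assumption for illustration, not data from the package) to compare the unanchored and anchored patterns:

```r
# Base R only: how word-boundary anchors change what a regexp matches.
tokens <- c("is", "this", "are", "area", "was", "wash")

# Without anchors, the pattern matches any token containing the substring.
unanchored <- grepl("(is|are|was)", tokens)

# With \\b on both sides, only whole-word matches remain.
anchored <- grepl("\\b(is|are|was)\\b", tokens)

tokens[unanchored]  # all six tokens
tokens[anchored]    # "is" "are" "was"
```

Tokens such as "this" or "area" are matched by the unanchored pattern only, which is why the anchored form is preferable when whole words are wanted.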
Select which lines are printed with the arguments "from" and "to":
k <- kwic(dickensl, "the")
print(k, sort.by="right", from=3, to=4)
Window size may be given as a number of characters or as a number of tokens, according to the value of the parameter "unit". Such 'tokens-sensitive kwic' is possible only on tokenized corpora (that is, not on vectors of untokenized strings).
With 'tokens-sensitive kwic', no token is truncated at the beginning or the end of the line. This may be useful if the kwic lines are to be used as a subcorpus for further analyses.
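The difference between the two units can be illustrated without kwic. The following base R sketch is not the package's implementation. It cuts a character window and a token window around a node by hand, on a sample sentence chosen for illustration, and assumes a naive split on spaces:

```r
# Base R only: character window vs. token window around a node word.
text <- "It was the best of times, it was the worst of times"
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Character window: 10 characters on each side of the node "best".
start <- as.integer(regexpr("best", text, fixed = TRUE))
end <- start + nchar("best") - 1
left_chars  <- substr(text, max(1, start - 10), start - 1)
right_chars <- substr(text, end + 1, min(nchar(text), end + 10))

# Token window: 2 whole tokens on each side of the node.
node <- which(tokens == "best")
left_tokens  <- tokens[max(1, node - 2):(node - 1)]
right_tokens <- tokens[(node + 1):min(length(tokens), node + 2)]

left_chars   # "t was the " : the first word "It" is cut in half
left_tokens  # "was" "the"  : only whole tokens
```

The character window cuts wherever the character count runs out, while the token window always ends on a token boundary.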
In the following example, five tokens are displayed on each side:
data(dickensl)
k <- kwic(dickensl, "the", 5, 5, unit="token")
print(k)
With this type of kwic, sorting lines by the left context produces a different effect: the lines are sorted by the beginning of the last word on the left, and not by the last character on the left:
print(k, sort.by="left")
Moreover, sorting may operate on any token at the n-th position at the left or the right of the node. With "sort.by=-2", the ordering is done using the second token on the left. With "sort.by=2", the ordering is done using the second token on the right.
print(k, sort.by=-2)
print(k, sort.by=2)
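To make the positional convention concrete, here is a minimal base R sketch of ordering lines by the n-th token counted outward from the node. It does not reproduce kwic's internals. The helper nth_token and the three toy lines are hypothetical and exist only for this illustration:

```r
# Toy concordance lines (made-up data, not output from kwic).
left  <- c("it was", "in short", "we had")
right <- c("worst of", "age was", "best and")

# Hypothetical helper: pick the n-th token counted outward from the node.
# n < 0 looks in the left context, n > 0 in the right context.
nth_token <- function(left, right, n) {
  side <- if (n < 0) left else right
  vapply(strsplit(side, " ", fixed = TRUE), function(tok) {
    i <- if (n < 0) length(tok) + n + 1 else n
    if (i < 1 || i > length(tok)) NA_character_ else tok[i]
  }, character(1))
}

# method = "radix" gives a locale-independent ordering.
by_left2  <- order(nth_token(left, right, -2), method = "radix")
by_right2 <- order(nth_token(left, right, 2), method = "radix")

by_left2   # 2 1 3 : keys "it", "in", "we" sort as in, it, we
by_right2  # 3 1 2 : keys "of", "was", "and" sort as and, of, was
```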
kwic does not itself read files or walk through directories. The tm package handles this well:
library(tm)
# Directory of plain-text files shipped with the kwic package
d <- system.file("plaintexts", package="kwic")
corpus <- VCorpus(
  DirSource(directory=d, encoding="UTF-8"),
  readerControl = list(reader=readPlain)
)
kwic(corpus, "the")
Here, several tagged (tab-separated) files are in a directory. First, we list the file names:
d <- system.file("taggedtexts", package="kwic")
# pattern is a regular expression, not a glob
files <- dir(d, pattern = "\\.txt$")
Below, each file is read as a data frame and the data frames are stored in a list. They are then combined into one large data.frame with rbind. We also add a doc_id column identifying the text each row comes from.
# Read each tab-separated file into its own data frame
corpusl <- lapply(
  files,
  function(x) read.table(
    file.path(d, x),
    quote="", sep="\t", header = TRUE,
    fileEncoding="ISO-8859-1", stringsAsFactors = FALSE
  )
)
# Stack the data frames and label each row with its source file
corpus <- do.call("rbind", corpusl)
corpus$doc_id <- rep(files, times=sapply(corpusl, nrow))
kwic(corpus, "Paris", token.column="lemme", left=30, right=30) #, unit="token"
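The combine-and-label step can be tried on in-memory data without reading any files. In this sketch the two small data frames and the file names a.txt and b.txt are invented for illustration. Only the rbind and rep pattern mirrors the code above:

```r
# Two made-up tagged "documents" with the same columns.
per_doc <- list(
  data.frame(token = c("Paris", "brille"),
             lemme = c("Paris", "briller"),
             stringsAsFactors = FALSE),
  data.frame(token = c("Il", "pleut", "."),
             lemme = c("il", "pleuvoir", "."),
             stringsAsFactors = FALSE)
)
doc_names <- c("a.txt", "b.txt")  # hypothetical file names

# Stack the data frames, then repeat each name once per row of its document.
combined <- do.call("rbind", per_doc)
combined$doc_id <- rep(doc_names, times = sapply(per_doc, nrow))

combined$doc_id  # "a.txt" "a.txt" "b.txt" "b.txt" "b.txt"
```

Because rep receives the row count of each data frame, every row of the combined table carries the name of the file it came from.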