read.corp.LCC: Import LCC data
In koRpus: Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity


Read data from LCC[1] formatted corpora (Quasthoff, Richter & Biemann, 2006).

read.corp.LCC(
  LCC.path,
  format = "flatfile",
  fileEncoding = "UTF-8",
  n = -1,
  keep.temp = FALSE,
  prefix = NULL,
  bigrams = FALSE,
  cooccurence = FALSE,
  caseSens = TRUE
)

`LCC.path`	A character string, either the path to a .tar/.tar.gz/.zip file in LCC format (flatfile), or the path to the directory containing the unpacked archive.
`format`	Either "flatfile" or "MySQL", depending on the type of LCC data.
`fileEncoding`	A character string naming the encoding of the LCC files. Old zip archives used "ISO_8859-1". This option will only influence the reading of meta information, as the actual database encoding is derived from there.
`n`	An integer value defining how many lines of data should be read if `format="flatfile"`. The default of -1 reads all lines.
`keep.temp`	Logical. If `LCC.path` is a tarred/zipped archive, setting `keep.temp=TRUE` will keep the temporarily unpacked files for further use. By default all temporary files will be removed when the function ends.
`prefix`	Character string, giving the prefix for the file names in the archive. Needed for newer LCC tar archives if they are already decompressed (autodetected if `LCC.path` points to the tar archive directly).
`bigrams`	Logical, whether information on bigrams should be imported. This is `FALSE` by default, because it might make the objects quite large. Note that this only works with `n = -1`, because otherwise the tokens cannot be looked up.
`cooccurence`	Logical, like `bigrams`, but for information on co-occurrences of tokens in a sentence.
`caseSens`	Logical, if `FALSE` forces all frequency statistics to be calculated regardless of the tokens' case. Otherwise, if the imported database supports it, you will get different frequencies for the same tokens in different cases (e.g., "one" and "One").
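
The effect of `caseSens` can be pictured with a small base R sketch. The data frame `types` and its column names are made up for illustration; they are not the internal layout of a kRp.corp.freq object, and this is not necessarily how koRpus merges cases internally.

```r
# Hypothetical frequency table (illustrative names and numbers only)
types <- data.frame(
  word = c("one", "One", "two"),
  freq = c(10, 4, 7),
  stringsAsFactors = FALSE
)

# Case-sensitive view: "one" and "One" remain separate types
case.sens <- types

# Case-insensitive view: fold tokens to lower case,
# then sum the frequencies of tokens that now coincide
types$word.lower <- tolower(types$word)
case.insens <- aggregate(freq ~ word.lower, data = types, FUN = sum)
```

In this toy table, "one" and "One" collapse into a single entry whose frequency is the sum of both.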

The LCC database can either be unpacked or still be a .tar/.tar.gz/.zip archive. In the latter case, all necessary files are extracted to a temporary location automatically and, by default, removed again when the function has finished reading from them.
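
The unpack, read and clean-up pattern described here can be sketched in base R as follows. `unpack_and_read` is a hypothetical helper written for this illustration; it is not part of koRpus and may differ from what `read.corp.LCC` does internally.

```r
# Hypothetical helper (not a koRpus function): unpack an archive into a
# temporary directory, apply FUN to that directory, then clean up.
unpack_and_read <- function(archive, FUN, keep.temp = FALSE) {
  tmp.dir <- tempfile("LCC")
  dir.create(tmp.dir)
  if (!isTRUE(keep.temp)) {
    # remove the temporary files once the function exits
    on.exit(unlink(tmp.dir, recursive = TRUE), add = TRUE)
  }
  if (grepl("\\.zip$", archive, ignore.case = TRUE)) {
    utils::unzip(archive, exdir = tmp.dir)
  } else {
    # utils::untar() handles both .tar and .tar.gz archives
    utils::untar(archive, exdir = tmp.dir)
  }
  FUN(tmp.dir)
}
```

With `keep.temp = TRUE` the sketch leaves the unpacked directory in place, mirroring the behaviour documented for the `keep.temp` argument.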

Newer LCC archives no longer feature the *-meta.txt file, resulting in less meta information in the object. In these cases, the total number of tokens is calculated as the sum of the types' frequencies.
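
As a minimal illustration of that fallback, assuming a made-up table of types with invented frequencies (again, not the actual object layout):

```r
# Hypothetical types table (illustrative names and numbers only)
types <- data.frame(
  word = c("the", "of", "corpus"),
  freq = c(120, 80, 5),
  stringsAsFactors = FALSE
)

# Without meta information, the token total is the sum of all type frequencies
total.tokens <- sum(types$freq)  # 120 + 80 + 5 = 205
```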

An object of class kRp.corp.freq.

Please note that MySQL support is not implemented yet.

Quasthoff, U., Richter, M. & Biemann, C. (2006). Corpus Portal for Search in Monolingual Corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, 1799–1802.

[1] https://wortschatz.uni-leipzig.de/en/download/

kRp.corp.freq

## Not run: 
# old format .zip archive
my.LCC.data <- read.corp.LCC(
  file.path("~","mydata","corpora","de05_3M.zip")
)
# new format tar archive
my.LCC.data <- read.corp.LCC(
  file.path("~","mydata","corpora","rus_web_2002_300K-text.tar")
)
# in case the tar archive was already unpacked
my.LCC.data <- read.corp.LCC(
  file.path("~","mydata","corpora","rus_web_2002_300K-text"),
  prefix="rus_web_2002_300K-"
)
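# use the imported corpus data in a frequency analysis
# (tokenized.obj stands for a previously tokenized text object)
freq.analysis(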
  tokenized.obj,
  corp.freq=my.LCC.data
)

## End(Not run)