README.md
In corpus: Text Corpus Analysis

Corpus (C Library)

Text corpus analysis.

Corpus is a C library for analyzing text data, with full support for Unicode. It is designed to support processing large data files that do not fit into memory.

In principle Corpus can support arbitrary input, but it is designed to analyze text stored in JavaScript Object Notation (JSON) format. The text objects provided by the library can either refer to raw UTF-8 encoded text, or they can refer to a JSON string with backslash (\) escapes like \n, \t, and \u2603. On 64-bit architectures, text objects take 16 bytes: an 8-byte pointer to the data, 1 bit indicating whether to interpret backslash as a JSON-style escape, 1 bit indicating whether the text includes a byte sequence that may decode to a non-ASCII character, and 62 bits to store the encoding size, in bytes.
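The 16-byte layout described above can be sketched as a pointer plus one packed 64-bit attribute word. This is a minimal illustration, not the library's actual definition: the names `struct text`, `text_make`, and the `TEXT_*` macros are hypothetical, and the choice of which bit positions hold the two flags is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and bit positions, chosen for illustration only. */
#define TEXT_ESC_BIT   ((uint64_t)1 << 63) /* backslash is a JSON-style escape */
#define TEXT_UTF8_BIT  ((uint64_t)1 << 62) /* bytes may decode to non-ASCII */
#define TEXT_SIZE_MASK (~(TEXT_ESC_BIT | TEXT_UTF8_BIT)) /* low 62 bits */

struct text {
    const uint8_t *ptr; /* start of the data; not NUL-terminated, not owned */
    uint64_t attr;      /* two flag bits plus the encoded size in bytes */
};

/* Build a text object that refers to `size` bytes starting at `ptr`. */
static struct text text_make(const uint8_t *ptr, uint64_t size,
                             int has_esc, int has_utf8)
{
    struct text t;
    assert(size <= TEXT_SIZE_MASK); /* the size must fit in 62 bits */
    t.ptr = ptr;
    t.attr = size
           | (has_esc ? TEXT_ESC_BIT : 0)
           | (has_utf8 ? TEXT_UTF8_BIT : 0);
    return t;
}

static uint64_t text_size(const struct text *t)
{
    return t->attr & TEXT_SIZE_MASK;
}

static int text_has_esc(const struct text *t)
{
    return (t->attr & TEXT_ESC_BIT) != 0;
}

static int text_has_utf8(const struct text *t)
{
    return (t->attr & TEXT_UTF8_BIT) != 0;
}
```

Because the object stores only a pointer and a length, it can refer to bytes inside a larger buffer (such as a mapped file) without copying them.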

In typical usage, you will memory-map a newline-delimited JSON data file, validate and type-check the data values using the datatype.h interface, extract the appropriate fields using the data.h interface, and create text objects that point into the file. By memory-mapping the file, you let the operating system move data between the hard drive and RAM whenever necessary, so you can process a large data set seamlessly without loading everything into RAM at the same time.
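The memory-mapping step can be sketched with plain POSIX calls. This example does not use the Corpus interfaces (filebuf.h, datatype.h, data.h); `for_each_line`, `line_fn`, and `sum_len` are hypothetical names, and the code assumes a POSIX system. It maps a file read-only and hands each newline-delimited record to a callback as a pointer and a length, without copying.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Called once per record; `line` points into the mapping (no copy). */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

/* Map `path` read-only and call `fn` on each newline-delimited record.
 * Returns the number of records, or -1 on error. */
static long for_each_line(const char *path, line_fn fn, void *ctx)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) { /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t size = (size_t)st.st_size;
    char *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    long count = 0;
    const char *p = base;
    const char *end = base + size;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *stop = nl ? nl : end; /* last record may lack a newline */
        fn(p, (size_t)(stop - p), ctx);
        count++;
        p = nl ? nl + 1 : end;
    }

    /* Pointers handed to `fn` are invalid once the mapping is released. */
    munmap(base, size);
    return count;
}

/* Example callback: accumulate the total record length in bytes. */
static void sum_len(const char *line, size_t len, void *ctx)
{
    (void)line;
    *(size_t *)ctx += len;
}
```

In a real program the mapping would stay alive for as long as any text object points into it; the sketch unmaps at the end only to keep the example self-contained.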

For more information on JSON support in Corpus, see the notes on JSON as understood by Corpus.

Corpus can segment text into sentences or words according to the rules described in Unicode Standard Annex #29. To segment text into sentences or words, use the sentscan.h or wordscan.h interface, respectively.

Corpus supports the following text normalization transformations:

transforming to Unicode NFC or NFKC normal form;
performing Unicode case folding (using the default mappings, not the locale-specific ones);
performing quote folding, replacing quote characters like single quotes, double quotes, and apostrophes with ASCII single quote (');
removing Unicode default ignorable characters like zero-width-space and soft hyphen;
stemming, using one of the algorithms supported by the Snowball stemming library.

These normalizations can be applied to arbitrary text, but they are designed to be applied to individual word tokens, so that the results can be cached and re-used. The symtab.h and token.h interfaces support these normalizations.
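As an illustration of one of the normalizations listed above, the following sketch performs a simplified quote fold on a UTF-8 buffer. It is not the library's implementation: `fold_quotes` is a hypothetical name, and the set of characters handled (ASCII double quote, U+2018, U+2019, U+201C, U+201D) is an assumed subset chosen for brevity.

```c
#include <stddef.h>

/* Fold a few common quote characters to ASCII apostrophe ('), in place.
 * Handles the ASCII double quote and the curly quotes U+2018, U+2019,
 * U+201C, and U+201D, which encode in UTF-8 as E2 80 98/99/9C/9D.
 * Returns the new length in bytes, which never exceeds `len`. */
static size_t fold_quotes(unsigned char *buf, size_t len)
{
    size_t r = 0; /* read position */
    size_t w = 0; /* write position; always <= r */

    while (r < len) {
        if (buf[r] == '"') {
            buf[w++] = '\'';
            r += 1;
        } else if (r + 2 < len && buf[r] == 0xE2 && buf[r + 1] == 0x80 &&
                   (buf[r + 2] == 0x98 || buf[r + 2] == 0x99 ||
                    buf[r + 2] == 0x9C || buf[r + 2] == 0x9D)) {
            buf[w++] = '\''; /* a 3-byte sequence shrinks to 1 byte */
            r += 3;
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

Because the fold only ever shrinks the text, it can run in place on a small per-token buffer, which fits the token-level caching approach described above.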

Corpus is designed to be embedded into other language environments. The R interface is under development concurrently with the library.

There are no dependencies, but to build the documentation, you will need the Doxygen program, and to build the tests, you will need to install the Check Unit Testing library. Running make will build libcorpus.a and the corpus command-line tool. Running make doc will run Doxygen to build the documentation. Running make check will run the tests.

Everything should work on Windows; the only platform-specific code is the memory-mapping used internally by the filebuf.h interface.

Corpus is released under the Apache License, Version 2.0. The stemming algorithms used by Corpus come from the Snowball library and are subject to the conditions of the 3-clause BSD license. Portions of Corpus rely on data from the Unicode Character Database and are subject to the terms of the Unicode License.

corpus documentation built on May 2, 2021, 9:06 a.m.
