Text corpus analysis.
Corpus is a C library for analyzing text data, with full support for Unicode. It is designed to process large data files that do not fit into memory.
In principle Corpus can support arbitrary input, but it is designed to
analyze text stored in JavaScript Object Notation (JSON) format. The text
objects provided by the library can either refer to raw UTF-8 encoded
text, or they can refer to a JSON string with backslash (\) escapes like
\n, \t, and \u2603. On 64-bit architectures, text objects take 16 bytes:
an 8-byte pointer to the data, 1 bit indicating whether to interpret backslash
as a JSON-style escape, 1 bit indicating whether the text includes a byte
sequence that may decode to a non-ASCII character, and 62 bits to store the
encoding size, in bytes.
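The 16-byte layout described above can be sketched as a pointer plus one machine word that packs the two flag bits and the size. This is only an illustration: the struct, macro, and function names below (text_sketch, TEXT_ESC_BIT, text_make, and so on) are hypothetical and are not taken from the Corpus headers, and the choice of the two highest bits for the flags is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not the library's definition. */
struct text_sketch {
    const uint8_t *ptr; /* pointer to the first byte of the data */
    size_t attr;        /* two flag bits plus the size, in bytes */
};

/* Assumption: the two highest bits of attr hold the flags. */
#define TEXT_BITS      (sizeof(size_t) * CHAR_BIT)
#define TEXT_ESC_BIT   ((size_t)1 << (TEXT_BITS - 1)) /* backslash is an escape */
#define TEXT_UTF8_BIT  ((size_t)1 << (TEXT_BITS - 2)) /* may contain non-ASCII */
#define TEXT_SIZE_MASK (~(TEXT_ESC_BIT | TEXT_UTF8_BIT))

/* Build a text object that points at existing bytes without copying them. */
static struct text_sketch text_make(const uint8_t *ptr, size_t size,
                                    int has_esc, int has_utf8)
{
    struct text_sketch t;

    assert(size <= TEXT_SIZE_MASK); /* size must fit in the remaining bits */
    t.ptr = ptr;
    t.attr = size;
    if (has_esc)
        t.attr |= TEXT_ESC_BIT;
    if (has_utf8)
        t.attr |= TEXT_UTF8_BIT;
    return t;
}

/* Size of the encoded text, in bytes, with the flag bits masked off. */
static size_t text_size(const struct text_sketch *t)
{
    return t->attr & TEXT_SIZE_MASK;
}

/* Nonzero if backslashes should be read as JSON-style escapes. */
static int text_has_esc(const struct text_sketch *t)
{
    return (t->attr & TEXT_ESC_BIT) != 0;
}

/* Nonzero if the bytes may decode to a non-ASCII character. */
static int text_has_utf8(const struct text_sketch *t)
{
    return (t->attr & TEXT_UTF8_BIT) != 0;
}
```

Because the object only stores a pointer and a word of attributes, it can refer to bytes that live elsewhere, such as inside a memory-mapped file, without owning or copying them.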
In typical usage, you will memory-map a newline-delimited JSON
data file, validate and type-check the data values using the datatype.h
interface, extract the appropriate fields using the data.h interface, and
create text objects that point into the file. By memory mapping the file,
you can let the operating system move data between the hard drive and RAM
whenever necessary. You can process a large data set seamlessly without
loading everything into RAM at the same time.
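The memory-mapping step can be sketched with plain POSIX calls. The example below does not use the Corpus filebuf.h, datatype.h, or data.h interfaces; the names file_map, file_map_open, file_map_close, and next_line are hypothetical helpers written for this illustration, and the code assumes a POSIX system with mmap.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: a read-only mapping of a whole file. */
struct file_map {
    const char *data; /* start of the mapping, or NULL for an empty file */
    size_t size;      /* file size, in bytes */
};

/* Map the file at path read-only. Returns 0 on success, -1 on failure. */
static int file_map_open(struct file_map *m, const char *path)
{
    struct stat st;
    void *addr;
    int fd;

    assert(m != NULL && path != NULL);
    m->data = NULL;
    m->size = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size > 0) { /* mmap rejects zero-length mappings */
        addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) {
            close(fd);
            return -1;
        }
        m->data = addr;
        m->size = (size_t)st.st_size;
    }
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return 0;
}

/* Release the mapping; pointers into it become invalid. */
static void file_map_close(struct file_map *m)
{
    if (m->data != NULL)
        munmap((void *)m->data, m->size);
    m->data = NULL;
    m->size = 0;
}

/*
 * Fetch the next newline-delimited record starting at *pos. On success,
 * sets *line and *len (newline excluded), advances *pos, and returns 1.
 * Returns 0 at end of file. The line points into the mapping; nothing
 * is copied.
 */
static int next_line(const struct file_map *m, size_t *pos,
                     const char **line, size_t *len)
{
    const char *start, *nl;
    size_t rest;

    if (*pos >= m->size)
        return 0;
    start = m->data + *pos;
    rest = m->size - *pos;
    nl = memchr(start, '\n', rest);
    *line = start;
    *len = nl ? (size_t)(nl - start) : rest;
    *pos += *len + (nl ? 1 : 0);
    return 1;
}
```

Each record returned by next_line is a pointer and a length into the mapped file, so the operating system pages data in only when a record is actually read.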
For more information on JSON support in Corpus, see the notes on JSON as understood by Corpus.
Corpus can segment text into sentences or words according to the rules
described in Unicode Standard Annex #29. To segment text
into sentences or words, use the sentscan.h or wordscan.h interface,
respectively.
Corpus supports the following text normalization transformations:
transforming to Unicode NFC or NFKC normal form;
performing Unicode case folding (using the default mappings, not the locale-specific ones);
performing quote folding, replacing quote characters like single
quotes, double quotes, and apostrophes with ASCII single quote (');
removing Unicode default ignorable characters like zero-width space and soft hyphen;
stemming, using one of the algorithms supported by the Snowball stemming library.
These normalizations can be applied to arbitrary text, but they are designed
to be applied to individual word tokens, so that the results can be cached
and re-used. The symtab.h and token.h interfaces support these
normalizations.
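As an illustration of one of the transformations listed above, here is a minimal sketch of quote folding applied to a single UTF-8 encoded token. It does not use the symtab.h or token.h interfaces; fold_quotes is a hypothetical function, and the set of characters it folds (ASCII double quote plus U+2018, U+2019, U+201C, and U+201D) is an assumption chosen for the example, not the library's actual list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fold quote characters in a UTF-8 buffer to ASCII single quote ('),
 * in place. Returns the new length in bytes, which is never larger than
 * len because each three-byte curly quote shrinks to a single byte.
 *
 * Assumed set for this sketch: ASCII double quote (0x22) and the curly
 * quotes U+2018, U+2019, U+201C, U+201D, encoded as E2 80 98/99/9C/9D.
 */
static size_t fold_quotes(uint8_t *buf, size_t len)
{
    size_t r = 0; /* read position */
    size_t w = 0; /* write position, always <= r */

    assert(buf != NULL || len == 0);

    while (r < len) {
        uint8_t b = buf[r];

        if (b == '"') {
            buf[w++] = '\'';
            r += 1;
        } else if (b == 0xE2 && r + 2 < len && buf[r + 1] == 0x80 &&
                   (buf[r + 2] == 0x98 || buf[r + 2] == 0x99 ||
                    buf[r + 2] == 0x9C || buf[r + 2] == 0x9D)) {
            buf[w++] = '\'';
            r += 3;
        } else {
            buf[w++] = b; /* copy every other byte unchanged */
            r += 1;
        }
    }
    return w;
}
```

Since the function works on one token at a time, its result for a given token is always the same, which is what makes caching the normalized form per distinct token worthwhile.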
Corpus is designed to be embedded into other language environments. The R interface is under development concurrently with the library.
There are no dependencies, but to build the documentation, you will need the
Doxygen program, and to build the tests, you will need
to install the Check Unit Testing library. Running make will
build libcorpus.a and the corpus command-line tool. Running make doc
will run Doxygen to build the documentation. Running make check will
run the tests.
Everything should work on Windows; the only platform-specific code is the
memory-mapping used internally by the filebuf.h interface.
Corpus is released under the Apache License, Version 2.0. The stemming algorithms used by Corpus come from the Snowball library and are subject to the conditions of the 3-clause BSD license. Portions of Corpus rely on data from the Unicode Character Database and are subject to the terms of the Unicode License.