TC_profiles: TextCat N-Gram Profiles
In textcat: N-Gram Based Text Categorization

TC_profiles

R Documentation

TextCat N-Gram Profiles

Description

TextCat n-gram byte and character profile databases for language identification.

Usage

TC_char_profiles
TC_byte_profiles

Details

TextCat (https://www.let.rug.nl/vannoord/TextCat/) is Gertjan van Noord's Perl implementation of the Cavnar and Trenkle “N-Gram-Based Text Categorization” technique, which was subsequently integrated into SpamAssassin. It provides byte n-gram profiles for 74 “languages” (more precisely, language/encoding combinations). The WiseGuys C library reimplementation libtextcat adds one more non-empty profile (see https://wiki.documentfoundation.org/Libexttextcat).
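The Cavnar and Trenkle technique mentioned above ranks the most frequent n-grams of a text and compares rank lists with an "out-of-place" measure. The following is a minimal base R sketch of that idea, not the code used by TextCat or by this package: the helper names (ngram_profile, out_of_place, classify), the toy training sentences, and the settings n = 1:3 and size = 50 are all assumptions made for illustration.

```r
## Hypothetical helper: rank-ordered character n-gram profile of a text.
## n and size are illustrative defaults, not the settings of the TC profiles.
ngram_profile <- function(text, n = 1:3, size = 50L) {
    ## Split into lower-case words and join them with "_" as boundary marker.
    words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1L]]
    words <- words[nzchar(words)]
    chars <- strsplit(paste0("_", paste(words, collapse = "_"), "_"), "")[[1L]]
    ## Collect all n-grams for each requested length k.
    grams <- unlist(lapply(n, function(k) {
        if (length(chars) < k) return(character())
        starts <- seq_len(length(chars) - k + 1L)
        vapply(starts,
               function(i) paste(chars[i:(i + k - 1L)], collapse = ""),
               "")
    }))
    ## Keep the 'size' most frequent n-grams, most frequent first.
    counts <- sort(table(grams), decreasing = TRUE)
    names(counts)[seq_len(min(size, length(counts)))]
}

## Hypothetical helper: out-of-place distance between two rank lists.
## n-grams missing from the category profile get the maximum penalty.
out_of_place <- function(doc, cat) {
    pos <- match(doc, cat)
    as.numeric(sum(ifelse(is.na(pos), length(cat), abs(pos - seq_along(doc)))))
}

## Toy training data (assumed, far too small for real use).
training <- c(
    english = "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    german  = "der schnelle braune fuchs springt ueber den faulen hund und dann schlaeft der hund in der sonne"
)
profiles <- lapply(training, ngram_profile)

## Assign the category whose profile is closest to the document profile.
classify <- function(text) {
    d <- vapply(profiles, out_of_place, numeric(1L), doc = ngram_profile(text))
    names(which.min(d))
}

classify("the dog and the fox")
classify("der hund und der fuchs")
```

Real profiles are built from much larger samples; the sketch only shows how rank lists and the distance fit together.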

TC_byte_profiles provides these byte profiles.

TC_char_profiles provides a subset of 56 character profiles obtained by converting the byte sequences to UTF-8 strings where possible.
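To see why byte profiles are tied to an encoding while character profiles are not, the base R sketch below compares the byte and character views of one string. It only illustrates the distinction; the example string and variable names are assumptions, and this is not the procedure that was used to derive TC_char_profiles.

```r
## "Gruesse" with u-umlaut and sharp s, written with escapes (ASCII source).
x_utf8 <- enc2utf8("Gr\u00fc\u00dfe")
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

## Same characters, different byte lengths depending on the encoding.
nchar(x_utf8, type = "chars")
nchar(x_utf8, type = "bytes")
nchar(x_latin1, type = "bytes")

## Byte n-grams therefore differ between encodings of the same text ...
bytes_utf8 <- charToRaw(x_utf8)
bytes_latin1 <- charToRaw(x_latin1)
byte_bigrams <- paste0(head(bytes_utf8, -1L), tail(bytes_utf8, -1L))  # hex pairs

## ... while character n-grams are taken from the decoded string.
chars <- strsplit(x_utf8, "")[[1L]]
char_bigrams <- paste0(head(chars, -1L), tail(chars, -1L))

## Converting latin1 bytes back to UTF-8 recovers the same string.
back <- iconv(x_latin1, from = "latin1", to = "UTF-8")
identical(charToRaw(back), bytes_utf8)

## Not every byte sequence is valid UTF-8: a lone 0xfc byte is not.
validUTF8(rawToChar(as.raw(0xfc)))
```

Whether a given n-gram in TC_byte_profiles is kept as bytes or read as characters is recorded in the "options" attribute queried in the Examples.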

The category ids are unchanged from the original and give the full (English) name of the language, optionally combined with the name of the encoding script. Note that ‘scots’ indicates Scots, the Germanic language variety historically spoken in Lowland Scotland and parts of Ulster. It is to be distinguished from Scottish Gaelic (named ‘scots_gaelic’ in the profiles), the Celtic language variety spoken in most of the western Highlands and in the Hebrides (see https://en.wikipedia.org/wiki/Scots_language).

Examples

## Languages in the TC byte profiles:
names(TC_byte_profiles)
## Languages only in the TC byte profiles:
setdiff(names(TC_byte_profiles), names(TC_char_profiles))
## Key options used for the profiles:
attr(TC_byte_profiles, "options")[c("n", "size", "reduce", "useBytes")]
attr(TC_char_profiles, "options")[c("n", "size", "reduce", "useBytes")]

textcat documentation built on April 3, 2025, 9:24 p.m.