ECIMCI_profiles: ECI/MCI N-Gram Profiles
In textcat: N-Gram Based Text Categorization

ECIMCI_profiles

R Documentation

ECI/MCI N-Gram Profiles

Description

N-gram profile db for 26 languages based on the European Corpus Initiative Multilingual Corpus I.

Usage

ECIMCI_profiles

Details

This profile db was built by Johannes Rauch, using the ECI/MCI corpus (http://www.elsnet.org/eci.html) and the default options employed by package textcat, with all text documents encoded in UTF-8.

The category ids used for the db are the respective IETF language tags (see parse_IETF_language_tag in package NLP), using the ISO 639-2 Part B language subtags and, for Serbian, the script employed: "scc-Cyrl" and "scc-Latn" denote Serbian written in Cyrillic and Latin script, respectively. All other languages in the profile db are written in Latin script.

References

S. Armstrong-Warwick, H. S. Thompson, D. McKelvie and D. Petitpierre (1994), Data in Your Language: The ECI Multilingual Corpus 1. In “Proceedings of the International Workshop on Sharable Natural Language Resources” (Nara, Japan), 97–106. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.950

Examples

## Languages in the ECI/MCI profile db:
names(ECIMCI_profiles)
## Key options used for the profile:
attr(ECIMCI_profiles, "options")[c("n", "size", "reduce", "useBytes")]
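
The profile db can also be supplied to textcat() in place of its default profiles, in which case the returned categories are the IETF language tags described in Details. The following is a minimal sketch, assuming package textcat is installed; the two sample sentences are hypothetical inputs chosen only for illustration, and no particular result is claimed for them.

```r
library("textcat")

## Hypothetical sample inputs, used only for illustration.
samples <- c("This is a short sentence written in English.",
             "Dies ist ein kurzer Satz in deutscher Sprache.")

## Categorize against the ECI/MCI profiles instead of the default db.
langs <- textcat(samples, p = ECIMCI_profiles)
langs

## Any non-missing result is one of the category ids of the db.
all(langs[!is.na(langs)] %in% names(ECIMCI_profiles))
```

Short inputs such as these carry few n-grams, so the categorization may be less reliable than for longer texts.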

textcat documentation built on April 3, 2025, 9:24 p.m.