An experimental study of language models, corpus design, and word prediction efficacy, conducted in four phases. The initial phase was an exploratory data analysis of the English-language portion of the HC Corpus, a collection of freely available texts comprising over 2.5 billion words from 67 languages. Next, linguistically representative corpora of various sizes were built, and training, validation, and test sets were preprocessed for modeling. The subsequent language modeling phase concerned the implementation of Good-Turing/Katz, Kneser-Ney, Modified Kneser-Ney, and topic model language models. Finally, the language models were run on corpora of various sizes, and perplexity was measured to assess word prediction accuracy.
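For readers unfamiliar with the evaluation metric, the conventional textbook definition of perplexity for an n-gram model is sketched below. This is an assumption about the form of the measure: the description does not state exactly how the package computes it. The symbols are illustrative: $N$ is the number of words in the test set, $n$ is the model order, and $P$ is the smoothed probability assigned by the model.

```latex
\[
\mathrm{PP}(w_1,\dots,w_N)
  = P(w_1,\dots,w_N)^{-1/N}
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}
      \log P\bigl(w_i \mid w_{i-n+1},\dots,w_{i-1}\bigr)\right)
\]
```

Lower perplexity means the model assigns higher probability to the held-out text. A model that spreads probability uniformly over a vocabulary of $V$ words has perplexity $V$.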
Package details | |
---|---|
Maintainer | John James <j2sdatalab@gmail.com> |
License | MIT |
Version | 0.1.0 |
Package repository | View on GitHub |
Installation | Install the latest version of this package by entering the following in R: |