DataScienceSalon/predictifyR.3.0: Word Prediction Language Model Evaluation

An experimental study of language models, corpus design, and word prediction efficacy, conducted in four phases. The first phase was an exploratory data analysis of the English-language HC Corpus, a collection of freely available texts comprising over 2.5 billion words from 67 languages. Next, linguistically representative corpora of various sizes were built, and training, validation, and test sets were preprocessed for modeling. The third phase implemented the Good-Turing / Katz, Kneser-Ney, Modified Kneser-Ney, and Topic Model language models. Finally, the language models were run on corpora of various sizes, and perplexity was measured to assess word prediction accuracy.
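
To make the evaluation metric concrete, the following is a minimal sketch of how perplexity can be computed for a bigram model. It is not the package's code: the function names, the toy corpus, and the use of add-one (Laplace) smoothing in place of the Katz or Kneser-Ney models named above are all assumptions made for illustration, and the sketch assumes every test word occurs in the training vocabulary.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count contexts and bigrams in a list of tokens (hypothetical helper)."""
    # A context is any token that is followed by another token.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_prob(prev, word, contexts, bigrams, vocab_size):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def perplexity(test_tokens, contexts, bigrams, vocab_size):
    """Exponential of the average negative log-probability per predicted word."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_sum = sum(
        math.log(bigram_prob(prev, word, contexts, bigrams, vocab_size))
        for prev, word in pairs
    )
    return math.exp(-log_sum / len(pairs))


# Toy training text, assumed for illustration only.
train = "the cat sat on the mat the cat ate".split()
vocab = set(train)
contexts, bigrams = train_bigram(train)

# A word order seen in training should score lower (better) than an unseen one.
ppl_seen = perplexity("the cat sat".split(), contexts, bigrams, len(vocab))
ppl_unseen = perplexity("sat cat the".split(), contexts, bigrams, len(vocab))
print(ppl_seen, ppl_unseen)
```

Lower perplexity means the model assigns higher probability to the test text, which is why it serves as a proxy for word prediction accuracy when comparing models or corpus sizes.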

Package details

Maintainer: John James <[email protected]>
Package repository: GitHub (DataScienceSalon/predictifyR.3.0)
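Installation: Install the latest version of this package from GitHub by entering the following in R, for example using the remotes package: remotes::install_github("DataScienceSalon/predictifyR.3.0")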
DataScienceSalon/predictifyR.3.0 documentation built on Aug. 19, 2017, 12:13 a.m.