
Language Data

A set of language datasets and the code that creates them. These datasets provide a starting point for data visualization, transformation and analysis.

Install from GitHub with devtools::install_github("francojc/langdata").
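As a minimal sketch of a first session, the snippet below installs the package and then lists whatever datasets it ships. The install command is the one given above. The check for devtools and the use of base R's data() to discover dataset names are assumptions about a typical setup, not steps documented by the package itself.

```r
# Install devtools first if it is not already available (assumed setup step)
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install the package from GitHub (command taken from this README)
devtools::install_github("francojc/langdata")

# List the datasets bundled with the installed package, rather than
# guessing at their object names
data(package = "langdata")
```

Listing the datasets with data() avoids relying on object names that this README does not state.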

Datasets

Switchboard Dialog Act Corpus

A corpus of 1,115 spontaneous conversations among 440 speakers of American English. The original corpus files and documentation are available from the Linguistic Data Consortium.

Brown Corpus

A dataset containing 1,155,866 tokenized words from a sample of American English spanning 15 genre categories. The original corpus files and documentation are available from the Natural Language Toolkit data repository.

...



francojc/langdata documentation built on May 31, 2019, 2:48 p.m.