prepConversations: Prep transcripts for SITS model

View source: R/prepConversations.R

Description

Takes the path to a corpus of raw conversation transcripts, preprocesses the text, and writes the files needed to run the SITS model to a local folder.

Usage

prepConversations(rawCorpusPath, sitsCorpusPath, corpusName, overwrite = TRUE,
  returnData = FALSE, ...)

Arguments

rawCorpusPath

Path to the folder containing the raw text files. Files must have a .txt extension, with each row holding tab-separated values of (1) row number, (2) speaker identifier, and (3) turn text.

sitsCorpusPath

Path to folder in which to store prepared SITS corpus.

corpusName

Desired name of corpus.

overwrite

Boolean indicating whether or not to overwrite existing files at sitsCorpusPath.

returnData

Boolean indicating whether or not to return a data frame of transcript information, including the preprocessed text.

...

Other arguments passed to stm::textProcessor() and stm::prepDocuments() for text preprocessing.
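The input layout described for rawCorpusPath can be sketched in base R as follows. The folder name, file name, and turn text are made up for illustration, and the sketch assumes the files carry no header row, which the documentation does not state.

```r
# Hypothetical example data; folder and file names are illustrative only.
rawCorpusPath <- file.path(tempdir(), "raw_corpus")
dir.create(rawCorpusPath, showWarnings = FALSE)

# One row per turn: (1) row number, (2) speaker identifier, (3) turn text.
transcript <- data.frame(
  row     = 1:3,
  speaker = c("A", "B", "A"),
  text    = c("Shall we start with the budget?",
              "Yes, the budget first, then staffing.",
              "Agreed."),
  stringsAsFactors = FALSE
)

# Write tab-separated values with no header (an assumption) and no quoting.
transcriptFile <- file.path(rawCorpusPath, "meeting1.txt")
write.table(transcript, transcriptFile, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the three-column layout.
check <- read.delim(transcriptFile, header = FALSE, quote = "",
                    col.names = c("row", "speaker", "text"),
                    stringsAsFactors = FALSE)
str(check)
```

Disabling quoting on both write and read keeps apostrophes and quotation marks in the turn text from being interpreted as field delimiters.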

Details

This function has two purposes. First, it preprocesses raw transcripts. It does so using two functions from the stm package, textProcessor() and prepDocuments(). See the help files and examples in the stm package for more information on the basic text cleaning operations performed by these functions. By default, both functions are called with their default arguments.
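To give a feel for the preprocessing step, here is a minimal sketch that calls the two stm functions directly with their default cleaning options on a few made-up turns. How prepConversations() itself arranges these calls (for example, whether each turn is treated as a document) is not described here, so treat the setup as an assumption. The block only runs if stm, tm, and SnowballC are installed.

```r
# Made-up turns standing in for the text column of a transcript.
turns <- c("Shall we start with the budget?",
           "Yes, the budget first, then staffing.",
           "The staffing budget looks fine to me.")

# textProcessor() relies on tm (and SnowballC for stemming), so check all three.
pkgs <- c("stm", "tm", "SnowballC")
havePkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

if (havePkgs) {
  # Default cleaning: lowercasing, stopword/number/punctuation removal, stemming.
  processed <- stm::textProcessor(documents = turns,
                                  metadata = data.frame(turn = seq_along(turns)),
                                  verbose = FALSE)

  # Default lower.thresh drops terms that appear in only one document.
  prepped <- stm::prepDocuments(processed$documents, processed$vocab,
                                processed$meta, verbose = FALSE)

  print(prepped$vocab)
}
```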

The second purpose of this function is to format and write transcript information into the six files needed for the SITS model. These files are stored in the path provided by the sitsCorpusPath argument and are named using the corpusName argument. More information on the formatting required by the SITS model can be found at https://github.com/vietansegan/sits.

Value

Optionally returns the preprocessed transcripts in the form of a data frame.
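A call that requests the data frame might look like the sketch below. The argument names come from the Usage section; the package name sitsr is taken from the documentation footer; the folder names, corpus name, and transcripts are invented. Whether a corpus this small passes preprocessing, and whether sitsCorpusPath must exist beforehand, are not documented, so the sketch creates the folder and makes no claim about the output.

```r
# Hypothetical folders under the session's temporary directory.
rawCorpusPath  <- file.path(tempdir(), "raw_corpus_demo")
sitsCorpusPath <- file.path(tempdir(), "sits_corpus_demo")
dir.create(rawCorpusPath, showWarnings = FALSE)
dir.create(sitsCorpusPath, showWarnings = FALSE)

# Two invented transcripts: row number, speaker, turn text, tab separated.
writeLines(c("1\tA\tShall we start with the budget?",
             "2\tB\tYes, the budget first, then staffing.",
             "3\tA\tThe staffing budget looks fine."),
           file.path(rawCorpusPath, "meeting1.txt"))
writeLines(c("1\tC\tThe budget review is next week.",
             "2\tA\tThen staffing plans should wait for the review.",
             "3\tC\tStaffing and budget go together."),
           file.path(rawCorpusPath, "meeting2.txt"))

# Only attempt the call when the package is available.
if (requireNamespace("sitsr", quietly = TRUE)) {
  transcripts <- sitsr::prepConversations(
    rawCorpusPath  = rawCorpusPath,
    sitsCorpusPath = sitsCorpusPath,
    corpusName     = "demo",
    overwrite      = TRUE,
    returnData     = TRUE   # ask for the data frame described under Value
  )
  str(transcripts)
  print(list.files(sitsCorpusPath))  # files written for the SITS model
}
```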

References

Nguyen, Viet-An, Jordan Boyd-Graber and Philip Resnik. 2012. “SITS: A hierarchical nonparametric model using speaker identity for topic segmentation in multiparty conversations.” In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pp. 78–87.

Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry and Yuanxin Wang. 2014. “Modeling topic control to detect influence in conversations using nonparametric topic models.” Machine Learning 95(3):381–421.

Nguyen, Viet-An. 2014. “Speaker Identity for Topic Segmentation (SITS).” https://github.com/vietansegan/sits.

See Also

textProcessor, prepDocuments


erossiter/sitsr documentation built on May 23, 2019, 7:34 a.m.