prepConversations: Prep transcripts for SITS model
In erossiter/sitsr: R wrapper for SITS


View source: R/prepConversations.R

Takes the path to a corpus of raw conversation transcripts, preprocesses the text, and writes the files needed to run the SITS model to a local folder.

prepConversations(rawCorpusPath, sitsCorpusPath, corpusName, overwrite = TRUE, returnData = FALSE, ...)

`rawCorpusPath`	Path to folder containing raw text files. Files must have a .txt extension, with each row holding tab-separated values of (1) row number, (2) speaker identifier, and (3) turn text.
`sitsCorpusPath`	Path to folder in which to store prepared SITS corpus.
`corpusName`	Desired name of corpus.
`overwrite`	Boolean indicating whether or not to overwrite existing files at sitsCorpusPath.
`returnData`	Boolean indicating whether to return a data frame of transcript information, including the preprocessed text.
`...`	Other arguments passed to stm::textProcessor() and stm::prepDocuments() for text preprocessing.
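The input layout described for rawCorpusPath can be sketched in base R as follows. This is only an illustration: the folder name, file name, and turn text are made up, and writing the file without a header row or quoting is an assumption, since the documentation specifies only the column order.

```r
# Build one raw transcript in the three-column layout:
# (1) row number, (2) speaker identifier, (3) turn text.
turns <- data.frame(
  row     = 1:3,
  speaker = c("A", "B", "A"),
  text    = c("Shall we start with the budget?",
              "Yes, the budget is the main item today.",
              "Good, then let us begin."),
  stringsAsFactors = FALSE
)

# Hypothetical folder and file name; the file must end in .txt.
raw_dir <- file.path(tempdir(), "raw_transcripts")
dir.create(raw_dir, showWarnings = FALSE, recursive = TRUE)
raw_file <- file.path(raw_dir, "conversation1.txt")

# Assumption: no header row and no quoting around fields.
write.table(turns, raw_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm it parses as three tab-separated columns.
check <- read.delim(raw_file, header = FALSE, quote = "",
                    col.names = c("row", "speaker", "text"),
                    stringsAsFactors = FALSE)
print(check)
```

Each conversation would go in its own .txt file inside the same folder, which is then passed as rawCorpusPath.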

This function has two purposes. First, it preprocesses the raw transcripts using two functions from the stm package, textProcessor() and prepDocuments(). See the help files and examples in the stm package for more information on the basic text cleaning operations these functions perform. Unless other values are supplied through ..., both functions are called with their default arguments.

The second purpose of this function is to format the transcript information and write it to the six files needed for the SITS model. These files are stored in the folder given by the sitsCorpusPath argument and are named using the corpusName argument. More information on the formatting the SITS model requires is available at https://github.com/vietansegan/sits.

Optionally (when returnData = TRUE) returns the preprocessed transcripts as a data frame.
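A call using the arguments documented above might look like the sketch below. It assumes the package installs under the name sitsr (taken from the repository name), and the folder locations and the corpus name "demo" are hypothetical. The call is skipped unless the package is installed and the raw folder already contains .txt transcripts.

```r
# Hypothetical locations; replace with your own folders.
raw_dir  <- file.path(tempdir(), "raw_transcripts")  # folder of raw .txt transcripts
sits_dir <- file.path(tempdir(), "sits_corpus")      # output folder for the SITS files

# Only attempt the call when the package and some input files are present.
has_pkg <- requireNamespace("sitsr", quietly = TRUE)
has_raw <- length(list.files(raw_dir, pattern = "\\.txt$")) > 0

if (has_pkg && has_raw) {
  # Create the output folder up front; whether the function creates it
  # itself is not stated in this documentation.
  dir.create(sits_dir, showWarnings = FALSE, recursive = TRUE)

  transcripts <- sitsr::prepConversations(
    rawCorpusPath  = raw_dir,
    sitsCorpusPath = sits_dir,
    corpusName     = "demo",   # hypothetical corpus name
    overwrite      = TRUE,
    returnData     = TRUE      # ask for the data frame of preprocessed text
  )

  str(transcripts)             # inspect the returned data frame
  print(list.files(sits_dir))  # files written for the SITS model
} else {
  message("Skipping: sitsr is not installed or no .txt transcripts were found.")
}
```

Setting returnData = TRUE is useful for checking what the preprocessing removed before running the model on the written files.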

Nguyen, Viet-An, Jordan Boyd-Graber and Philip Resnik. 2012. “SITS: A hierarchical nonparametric model using speaker identity for topic segmentation in multiparty conversations.” In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pp. 78–87.

Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry and Yuanxin Wang. 2014. “Modeling topic control to detect influence in conversations using nonparametric topic models.” Machine Learning 95(3):381–421.

Nguyen, Viet-An. 2014. “Speaker Identity for Topic Segmentation (SITS).” https://github.com/vietansegan/sits.

textProcessor, prepDocuments

erossiter/sitsr documentation built on May 23, 2019, 7:34 a.m.