sitsr: R wrapper for SITS

Overview

sitsr lets researchers use the Speaker Identity for Topic Segmentation (SITS) model, which is written in Java, from R. It uses the rstudioapi package to run a locally saved copy of the Java program sits from the Terminal shell in RStudio.

See the following papers for more information about the SITS model:

Nguyen, Viet-An, Jordan Boyd-Graber and Philip Resnik. 2012. “SITS: A hierarchical nonparametric model using speaker identity for topic segmentation in multiparty conversations.” In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics pp. 78–87.
Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry and Yuanxin Wang. 2014. “Modeling topic control to detect influence in conversations using nonparametric topic models.” Machine Learning 95(3):381–421.

Installation

sitsr is not on CRAN, so you must install the package from GitHub:

# install.packages("devtools")
devtools::install_github("erossiter/sitsr")

Importantly, before using sitsr, you must fork/download sits from here and save it locally.

If you encounter a clear bug, please let me know here.

Usage

At this point, transcripts must be saved as tab-separated text files that take the form:

|   |          |                      |
|---|----------|----------------------|
| 1 | Speaker1 | Speaker1's turn text |
| 2 | Speaker2 | Speaker2's turn text |
| 3 | Speaker1 | Speaker1's turn text |

Example transcripts can be found here.
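As a rough illustration of the format above, a transcript could be written from R with base write.table. This is only a sketch: the column names, the debate1.txt file name, and the choice to omit a header row are assumptions made for the example, not requirements documented by sitsr.

```r
# Hypothetical turns; the column names are for illustration only and
# are not written to the file.
turns <- data.frame(
  turn    = 1:3,
  speaker = c("Speaker1", "Speaker2", "Speaker1"),
  text    = c("Speaker1's turn text",
              "Speaker2's turn text",
              "Speaker1's turn text"),
  stringsAsFactors = FALSE
)

# Write into a temporary folder so the example does not touch real data.
out_dir <- file.path(tempdir(), "debates2016")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "debate1.txt")  # assumed file name

# Tab-separated, no quoting, no row names, no header row (assumption).
write.table(turns, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Inspect the raw lines that were written.
readLines(out_file)
```

Each line of the resulting file holds a turn number, a speaker, and the turn text, separated by tabs.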

The function prepConversations reads in, cleans, and formats transcripts for the SITS model. For example, if the transcripts from the 2016 presidential debates were saved in a folder in my current working directory called debates2016, I would do the following:

prepConversations(rawCorpusPath = "debates2016", 
                  sitsCorpusPath = "debates2016_prepped", 
                  corpusName = "debates2016",
                  returnData = FALSE)

Note that prepConversations relies on the stm package functions stm::textProcessor and stm::prepDocuments. Additional preprocessing arguments for these functions can be passed to prepConversations; if none are supplied, their default arguments are used.

The next step is to run the model with runSits. This function builds the locally saved sits package, written in Java, and runs the model via the Terminal shell in RStudio, all using the rstudioapi package. Results are saved locally at the location specified by outputPath, so runSits itself returns only a list of model specifications.

model <- runSits(sitsPath = "sits",
                 corpusName = "debates2016",
                 sitsCorpusPath = "debates2016_prepped",
                 outputPath = "debates2016_results",
                 model = "param",
                 K = 10,
                 alpha = .01,
                 beta = .01,
                 gam = .01,
                 burnIn = 100,
                 maxIter = 500,
                 sampleLag = 5)
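For readers unfamiliar with how a shell command can be run in the RStudio Terminal from R, here is a minimal sketch using rstudioapi::terminalExecute and rstudioapi::terminalExitCode. It is not runSits's actual implementation, which may use different rstudioapi calls; the helper name run_in_rstudio_terminal and the java -version command in the usage line are placeholders chosen for illustration.

```r
# Hypothetical helper: run one shell command in a new RStudio terminal
# and wait for it to finish. Must be called from within RStudio.
run_in_rstudio_terminal <- function(command, working_dir = NULL) {
  # Start the command in a new terminal and keep its identifier.
  id <- rstudioapi::terminalExecute(command,
                                    workingDir = working_dir,
                                    show = TRUE)

  # terminalExitCode() returns NULL while the process is still running.
  while (is.null(rstudioapi::terminalExitCode(id))) {
    Sys.sleep(0.5)
  }

  # Return the exit code of the finished process (0 usually means success).
  rstudioapi::terminalExitCode(id)
}

# Example usage (placeholder command, run inside RStudio only):
# run_in_rstudio_terminal("java -version")
```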

The last step is to use the function readSits to read into R the model results saved locally by runSits.

output <- readSits(model)

Current Limitations

Currently, sitsr has been tested only on my Mac with my current version of R. Also, because runSits relies on the rstudioapi package, it must be run from within RStudio.
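Given these limitations, it may help to check the environment before calling runSits. The helper below is a hypothetical sketch and not part of sitsr; it assumes that a java executable on the PATH is needed to build and run sits, which this README does not state explicitly.

```r
# Hypothetical pre-flight check; not a sitsr function.
check_sitsr_prereqs <- function(sits_path) {
  c(
    # TRUE only when rstudioapi is installed and R is running inside RStudio.
    rstudio  = requireNamespace("rstudioapi", quietly = TRUE) &&
                 rstudioapi::isAvailable(),
    # Assumption: sits needs a java executable that can be found on the PATH.
    java     = unname(nzchar(Sys.which("java"))),
    # The locally saved copy of sits should exist at the given path.
    sits_dir = dir.exists(sits_path)
  )
}

# Example usage with the sitsPath value used earlier:
check_sitsr_prereqs("sits")
```

Any FALSE entry in the returned vector points to something worth fixing before running the model.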