sitsr provides a way for researchers to use the Speaker Identity for Topic Segmentation (SITS) model, written in Java, from R. It uses the rstudioapi package to run a locally saved copy of the Java program sits from the Terminal shell in RStudio.
See the following papers for more information about the SITS model:
Nguyen, Viet-An, Jordan Boyd-Graber and Philip Resnik. 2012. “SITS: A hierarchical nonparametric model using speaker identity for topic segmentation in multiparty conversations.” In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pp. 78–87.
Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry and Yuanxin Wang. 2014. “Modeling topic control to detect influence in conversations using nonparametric topic models.” Machine Learning 95(3):381–421.
sitsr is not on CRAN, so you must install the package from GitHub:
# install.packages("devtools")
devtools::install_github("erossiter/sitsr")
Importantly, before using sitsr, you must fork/download sits from here and save it locally.
If you encounter a clear bug, please let me know here.
At this point, transcripts must be saved as tab-separated text files that take the form:
| | | |
|-----|----------|----------------------|
| 1 | Speaker1 | Speaker1's turn text |
| 2 | Speaker2 | Speaker2's turn text |
| 3 | Speaker1 | Speaker1's turn text |
Example transcripts can be found here.
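As a rough illustration of that layout, the sketch below writes one made-up conversation to a tab-separated text file using base R only. The turn text, the folder name example_corpus, and the file name conversation1.txt are hypothetical, and leaving out a header row is an assumption based on the empty header in the table above.

```r
# Hypothetical example data: one row per speaker turn.
turns <- data.frame(
  turn    = 1:3,
  speaker = c("Speaker1", "Speaker2", "Speaker1"),
  text    = c("Good evening and welcome.",
              "Thank you for having me.",
              "Let's begin with the economy."),
  stringsAsFactors = FALSE
)

# Assumed layout: a folder holding one .txt file per conversation.
corpus_dir <- file.path(tempdir(), "example_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
transcript_file <- file.path(corpus_dir, "conversation1.txt")

# Tab-separated, no quoting, no row names.
# Omitting the header row is an assumption, not documented behavior.
write.table(turns, transcript_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm it has three tab-separated columns.
check <- read.delim(transcript_file, header = FALSE, quote = "",
                    stringsAsFactors = FALSE)
```

A folder built this way could then, presumably, be passed as rawCorpusPath to prepConversations.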
The function prepConversations reads in, cleans, and formats transcripts for the SITS model. For example, if the transcripts from the 2016 presidential debates were saved in a folder called debates2016 in my current working directory, I would do the following:
prepConversations(rawCorpusPath = "debates2016",
                  sitsCorpusPath = "debates2016_prepped",
                  corpusName = "debates2016",
                  returnData = FALSE)
Note that prepConversations uses the stm package functions stm::textProcessor and stm::prepDocuments, and additional preprocessing arguments for these functions can be passed to prepConversations. Currently, prepConversations uses all default arguments.
The next step is to run the model with runSits. This function builds the locally saved sits package written in Java and runs the model via the Terminal shell in RStudio, all using the rstudioapi package. Results are saved locally at the location specified by outputPath, so runSits itself returns only a list of model specifications.
model <- runSits(sitsPath = "sits",
                 corpusName = "debates2016",
                 sitsCorpusPath = "debates2016_prepped",
                 outputPath = "debates2016_results",
                 model = "param",
                 K = 10,
                 alpha = .01,
                 beta = .01,
                 gam = .01,
                 burnIn = 100,
                 maxIter = 500,
                 sampleLag = 5)
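Because runSits relies on a locally saved copy of sits and on the prepped corpus folder, it may help to confirm that both directories exist before starting a long run. The helper below is a minimal sketch in base R; missingSitsPaths is a hypothetical name and not part of sitsr. It leaves out outputPath because this README does not say whether that folder must already exist.

```r
# Hypothetical helper, not part of sitsr: returns the (named) paths
# among the runSits() inputs that are not existing directories.
missingSitsPaths <- function(sitsPath, sitsCorpusPath) {
  paths <- c(sitsPath = sitsPath, sitsCorpusPath = sitsCorpusPath)
  paths[!dir.exists(paths)]
}

# Same example paths as in the runSits() call above.
missing <- missingSitsPaths(sitsPath = "sits",
                            sitsCorpusPath = "debates2016_prepped")
if (length(missing) > 0) {
  message("Missing directories: ",
          paste(names(missing), missing, sep = " = ", collapse = ", "))
}
```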
The last step is to use the function readSits to read into R the model results saved locally by runSits.
output <- readSits(model)
Currently, sitsr has not been tested beyond my Mac and my current version of R. Also, because it relies on the rstudioapi package, the function runSits must be run from within RStudio.
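One way to guard a script against being run outside RStudio is to check for an RStudio session first. This is a small sketch, not part of sitsr: inRStudio is a hypothetical helper name, and it relies only on rstudioapi::isAvailable().

```r
# Hypothetical helper, not part of sitsr.
# TRUE only when rstudioapi is installed and the session runs inside RStudio.
inRStudio <- function() {
  requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()
}

if (!inRStudio()) {
  message("runSits() needs an RStudio session; skipping the model run.")
}
```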