README.md

ContentStructure Travis-CI Build Status

NOTE: This package is still under development and is very much a work in progress. Please do not use in any publication without express consent of the authors. PLEASE REPORT ANY BUGS OR ERRORS TO mzd5530@psu.edu.

Model Overview

An R package which implements an extension to the TPME model of Krafft et al. (2012) [Available here]. The three main extensions we make to model are detailed below:

A paper detailing the extended and generalized model is currently under revision. A link to the manuscript will be posted as soon as it is available.

Installation

Requirements for using C++ code with R

Note that if you are using a Mac, you will need to start by making sure you have Xcode + developer tools installed or you will not be able to compile the C++ code that is used in the samplers for this package. You will need to go here: https://developer.apple.com/xcode/downloads/ and then select the link to the additional downloads page which will prompt you to enter you apple ID. This will let you download the developer tools. This requirement is not unique to this package, but is necessary for all packages that use Rcpp.

If you are using a Windows machine, you will need to make sure you have the latest release of R (3.2.0+) and will also need to install Rtools (v33 or higher, available here http://cran.r-project.org/bin/windows/Rtools/) before you can use any packages with C++ code in them. It is also highly advised that you use RStudio to download and install the package as it seems to play nicer with Rcpp under Windows. You may also want to visit this blog post which has more information on making C++ work with R under Windows.

If you are using a Linux distro, make sure you have a C++ complier installed, but in general, you should not run into as many issues

Installing The Package

To install this package from Github, you will need to Hadley Wickham's devtools package installed.

install.packages("devtools")
library("devtools")

Now we can install from Github using the following line:

devtools::install_github("matthewjdenny/ContentStructure")

I have had success installing with R 3.2.0+ installed but if you do not have the latest version of R installed, or run into some install errors (please email if you do), it should work as long as you install the dependencies first with the following block of code:

install.packages( pkgs = c("BH","coda","RcppArmadillo","gridBase",
"gplots","slam","vegan"), dependencies = TRUE)

If all went well, check out the ?ContentStructure help file to see a full working example with info on how the data should look.

Basic Useage

The package provides two functions: Run_Full_Model() and Create_Output() which will actually run the model for text valued communication networks and then generate descriptive and diagnostic output. To run this model, you will need to provide input similar to the example vocabulary, author_attributes, document_edge_matrix, and document_word_matrix dataframes included as examples with the package. You will need to format the data similarly or estimation will not work, and we recommend giving the same names to your files. This version of the package currently supports the inclusion of up to one binary author covariate in the model, whch can be specified by giving the variable name as it appears in the author_attributes object. Also of note, when specifying the data_directory parameter in these fuctions, be sure to include a / after the name of the folder you wish to use for output/data as the function expects this. The sampling code for the Latent Space Model (LSM) portion of the model is pretty fast so we recommend specifying a relatively large number of iterations for the final step of the model where the LSM is run to convergence. We have found that sample_step_burnin = 1000000, sample_step_iterations = 5000000 and sample_step_sample_every = 2000 seem to work pretty well for publication quality results with networks of 20-30 people.

Example

Here is some example code that will run the model an generate output on a toy dataset of 121 emails between 20 department managers.

# Load in necessary data
data(vocabulary)
data(author_attributes)
data(document_edge_matrix)
data(document_word_matrix)

# Run Model
Estimation_Results <- Run_Full_Model(
  main_iterations = 100, 
  sample_step_burnin = 2000, 
  sample_step_iterations = 8000,
  sample_step_sample_every = 20,
  topics = 6,
  clusters = 2,
  mixing_variable = "Gender",
  Auth_Attr = author_attributes, 
  Doc_Edge_Matrix = document_edge_matrix ,
  Doc_Word_Matrix = document_word_matrix, 
  Vocab = vocabulary,
  Seed = 123456
  )

# Generate Output
Data <- Create_Output(
  data_name = "Testing",
  Estimation_Results = Estimation_Results,
  print_agg_stats = TRUE,
  Topic_Model_Burnin = 50,
  Skip = 0, 
  Thin = 2,
  Used_MP = TRUE,
  MP_Name = "Gender",
  Auth_Attr = author_attributes,
  Vocabulary = vocabulary
  )

Output

The Create_Output() function will save a number of PDF's in the data_directory (if save_results = T), which will often be the most interesting a visually interpretable output, but it alo returns a list object that has the following members:

The model also returns an R list object holding the raw output from the MCMC chain with the following members:

Testing

So far, this package has been tested successfully on OSX 10.9.5, Windows 7 and CentOS 6.6. Please email me at mzd5530@psu.edu if you have success on another OS or run into any problems. This Package is also continually checked every time a commit is made using Travis.



matthewjdenny/ContentStructure documentation built on May 21, 2019, 1:01 p.m.