Social network data is inherently complicated. In ordinary survey analysis, one typically has data on individuals, i.e., one row of a table per individual, with a set of variables (observations on that individual) extending across the columns. If the study in question is longitudinal, there will typically be multiple such records for each individual, one for each observational period (wave). Some of these variables are fixed, however (such as DOB, gender, etc.), raising the question of whether they should be kept in a separate table, linked to the longitudinal variables by a subject ID (SID). However, since most statistical analysis software cannot handle SQL-style table joins, this type of data organization is generally fudged by including the fixed variables repetitively, along with the actual time-varying variables, in each wave record. But at least the familiar two-dimensional data structure can be retained.
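For concreteness, here is a minimal R sketch of that fudge (the variable names are hypothetical): one record per individual per wave, with the fixed variable repeated in every wave record.

# One row per individual per wave; 'gender' is fixed but stored
# repetitively in each wave record (hypothetical variable names).
survey <- data.frame(
  SID     = c(1001, 1001, 1002, 1002),
  wave    = c(1, 2, 1, 2),
  gender  = c("F", "F", "M", "M"),  # fixed variable, repeated
  depress = c(2.1, 2.4, 3.4, 3.1)   # time-varying observation
)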
Where social network data are involved, this fudge is not available. The analyst or data scientist is forced to contemplate more logical and efficient data structures, which, however, are also more complicated. Further, there is no particular consensus on what these data structures should look like, and most statistical software (e.g. SPSS, SAS, Stata) provides no machinery either for creating a workable structure or for accessing the data in analytically meaningful ways. There is the additional problem of reproducibility--subsets of data need to be easily linked to the analyses performed on them, and the more complicated the relevant scripts become, the greater the chances of error and confusion.
The 'RSTools' package attempts to address some of these issues. It is the result of my experience setting up social network analyses over the last 13 years. In earlier times, and with simpler studies (e.g. just survey data, typically longitudinal), I was able to get by with a more ad hoc approach: I maintained a core set of scoring code (for multi-item variables and the like), but tended to build subfiles in a very analysis-specific way, using file names and folders to separate different sub-projects. This worked well enough most of the time, but it invariably led to code duplication due to the lack of modularity, and to a kind of rolling complexification that always seemed to occur as a project moved forward.
The package implements a project-specific set of social network data management functions, but attempts to do so based on principled design considerations. The functions (a) pull data from a raw (but clean) data repository, (b) create useful intermediate objects that can be kept in a workspace and used for (c) creating data objects usable by the social network analysis packages 'network' and 'RSiena'. At least for the particular project it was implemented for (ORI's "Peer Influences" project, J. Light & J.C. Rusby, Co-Principal Investigators), it implies a specific, straightforward, clean, and highly reproducible workflow. It remains to be seen how well this approach translates to other projects, but at least it may provide a kind of roadmap for package development in other analytical contexts.
The name 'RSTools' was chosen because the primary analyses anticipated are Snijders' Stochastic Actor Oriented Models (SAOMs), as implemented by the package 'RSiena'. However, it turns out that 'network'-class networks--which RSiena cannot use directly--are also very handy: they are easily graphed, and both the 'network' and 'sna' packages can calculate various descriptive statistics for 'network'-class objects at the network, vertex, and edge levels. Consequently, the workflow assumes that both dgTMatrix (sparse matrix) and 'network'-class networks will be desired, and that there should be a way to create them from the same underlying, known-consistent sets of data objects.
The RSTools package roughly implements a so-called 3-Tier, client-server architecture. Tier 1 is the "database" layer, interacting directly with the repository for the raw data--in this case, a SQL Server database. Tier 1 functions are especially project-specific, of course, because they must deal with the particular table and variable (column) structure developed to hold project data in a logical way.
The Peer Influences study is a longitudinal study of social network dynamics and behavioral development, with network and behavioral observations taken 3x/yr, from the end of 8th grade to the beginning of 11th grade. Students were drawn from seven different school districts; hence the study comprises seven distinct networks. The networks are considered separate when the students are in 8th grade (this applies only to the first wave), but are thereafter combined into district-wide networks. This database holds student response data in three tables: SMaster (one row for any student who ever did a survey, containing non-time-varying data such as DOB, gender, and ethnicity), SAffiliation (an edge list-formatted table of social network relationships), and SWave (a conventional longitudinal survey table with one row per student per wave of survey participation).
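To make this schema concrete, here is a hedged sketch of single rows from each table; column names other than those mentioned in the text are assumptions.

# Illustrative rows only; real column names and types may differ.
SMaster      <- data.frame(SID = 101, DOB = "2001-05-17", gender = "F")
SAffiliation <- data.frame(SID = 101, alterSID = 107, wave = 3)   # edge list format
SWave        <- data.frame(SID = 101, wave = 3, alcLifetime = 2)  # one row per SID per wave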
Tier 1 functions extract data from each of the database tables SMaster, SAffiliation, and SWave, and create R data objects as dataframes or data.tables (as implemented by the package 'data.table'), or lists containing such objects--which can then be used as-is, or further subset in the course of being processed into objects that the 'RSiena' package will accept. It is the need to work with these database tables and their contents that makes this layer necessarily customized to this specific project.
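A minimal sketch of what such an extraction might look like, assuming an ODBC connection to the SQL Server repository (the DSN name is a placeholder; the actual Tier 1 functions wrap queries of this kind):

# Connect to the repository and pull one table into a data.table.
library(DBI)
library(data.table)
con   <- dbConnect(odbc::odbc(), dsn = "PeerInfluences")  # placeholder DSN
SWave <- as.data.table(dbGetQuery(con, "SELECT * FROM SWave"))
dbDisconnect(con)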
At this point, I want to insert an editorial comment. In my view, social network studies should normally store data in a relational database. There are many reasons, but the primary one is the ability to enforce data integrity, i.e., to require that the data be internally consistent. This means, for example, that a nonexistent participant cannot be added, a participant cannot be assigned to the wrong network or to a nonexistent network, and so on. It is much easier to prevent such problems in the course of designing the database structure than it is to fix inconsistent data after the fact.
In any case, the whole point of having a Tier 1 in 3-Tier client-server systems is to allow as much re-use of code as possible in future projects. Tier 1 is supposed to "insulate" the higher layers from details of the data that are almost sure to vary across projects. Functions at that level will usually need to be customized for particular projects, repository data formats, etc., but if the software is well designed, less customization at higher levels will be necessary.
In the RSTools package, I have tried to make some progress toward this goal, but there is still a lot of quite project-specific code in the middle-tier functions. Hopefully, though, as I get more experience and input from fellow network researchers--and, as a result, perhaps a deeper understanding of which abstractions best fulfill these goals--the package will continue to generalize.
The middle tier is divided in two. A "Middle Layer Object" (MLO) tier creates objects useful for further data selection within R (i.e., without recourse to Tier 1 functions again), as well as objects that hold study data in R-native formats (mainly dataframes or data.tables), essentially "staged" to be processed into RSiena-compatible form. The Variable Creation and Scoring (VCS) functions then work with the middle layer objects to create data objects in the exact format required by the RSiena model object creation functions.
(As an aside, one might also consider the VCS functions to be yet a third tier, meaning the whole architecture is actually four tiers....)
Finally, the third (fourth?) tier is usually called the "presentation tier", and is just the user interface (UI). In this application, the presentation tier is simply a script that puts these functions together to create a longitudinal network analysis, which for the moment means a SAOM.
The following rather bare-bones description provides an overview of RSTools, presented in terms of the expected analysis workflow.
Figure 1 shows a workflow supported by RSTools functions. The workflow consists of four groups of operations:
- Create a Network Analysis Set
- Create data selection objects
- Create variable and composition change-compatible objects
- Create RSiena modeling objects
We will talk through each of these groups of operations. It will be helpful to refer to Figure 1 during this discussion.
I will use the term Network Analysis Set to refer to a set of observations that are internally consistent and can be sensibly modeled using SAOM. By "internally consistent", I mean that every network and variable in the set refers to the same set of individuals, observed over the same set of waves.
Thus, in general, a Network Analysis Set refers explicitly to a set of basic observational units (typically individuals) and a set of waves. Any variables allowed in SAOM (see the RSiena Manual for a complete description and discussion) are assumed to refer to the same individuals and waves, so that together they comprise a well-defined SAOM analysis.
There is only one operation in this step: apply the function getNetworkSet. Its inputs include a vector of requested waves ('pWavVec') and the desired network format (sparse matrix or 'network' class).
Output from this function is a list of length W+1, where W is the number of waves requested. Each of the first W list elements contains the network for the corresponding wave requested in pWavVec; the (W+1)st element is an ordered vector of Study IDs (SIDs) for every individual included in the Analysis Set on any requested wave. In Figure 1, this list is called 'netSetxx', where xx = SP or NT depending on the network format.
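A hedged sketch of this step follows; 'pWavVec' appears in the text, but the format argument and its name are assumptions.

# Request waves 1-5 as sparse (dgTMatrix) networks; 'pFormat' is assumed.
netSetSP <- getNetworkSet(pWavVec = 1:5, pFormat = "SP")
W    <- length(netSetSP) - 1   # number of waves in the Analysis Set
sids <- netSetSP[[W + 1]]      # ordered SIDs across all requested waves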
In this draft of the package vignette, I will not further consider descriptive analysis with 'network' class networks; this will be rectified in later versions. Instead, I will focus on workflow for SAOM.
In this step, two dataframes are created. 'eligXWave' has a column of ordered SIDs, followed by one column for each wave in the Analysis Set of interest. These columns contain indicator variables (0/1), equal to 1 if the SID was eligible to do the survey in that wave, whether or not he/she actually completed one. This information is of course needed to calculate response rates, but it is also important in network studies because even if an individual did not complete a survey in wave w, the fact that they were eligible probably means they could have been selected by others in the network part of the questionnaire. Including individuals who are only selected by others may be a good idea in some cases, e.g. a network-only analysis where the relationship in question is symmetric.
Determining eligibility is not always simple. In the Peer Influences project, participants might become eligible to be surveyed in the middle of a particular wave period, or be eligible for a while and then transfer to another school (sometimes one also participating in the project), and so on. In other words--unlike in simpler projects--survey eligibility is specific to both time (actually, day) and school (network). The database keeps a date-stamped log of survey eligibility for each student at each school (whose names and district IDs we obtained from the schools a few weeks before each scheduled survey period). Surveys are also date-stamped. Thus we consider a student survey-eligible for a given school and wave if that student was eligible on any day during which we were conducting surveys in that school. These overlaps could all be inferred programmatically, and we wrote a stored procedure (SQL script) to return the necessary eligibility information. This 'sproc' is called by getEligXWave, which uses it to create 'eligXWave', thus encapsulating a great deal of complicated logic. Such encapsulation is a major factor in deciding whether it is worthwhile to write a general function to solve some problem, rather than simply writing a few lines of code in the master analysis script.
A related task is summarizing who actually completed a survey each wave. Some eligible students in our project did not complete surveys, either because they were absent from school on the days surveys were being conducted (in school computer labs) and could not be rounded up subsequently, or because they chose not to participate for one reason or another. This information is summarized in a dataframe 'diditXWave', returned by the function getDiditXWave. This function makes use of a helper function, allNQNotNA, which checks to see not only whether a survey record was created for that individual and wave, but also whether any of a small set of questions were actually answered (as it turns out, the lifetime substance use questions). The resulting dataframe is structured like 'eligXWave', with the first column the SID, and then W additional columns containing 1's if the individual with that SID 'completed' that wave, and 0 otherwise (including possibly because he or she was ineligible).
This step also creates two related vectors of SIDs: 'netSetSID.elig', containing all individuals survey-eligible on any wave in some Analysis Set, and 'netSetSID.didx', containing individuals satisfying a particular set of survey completion criteria. The first can be generated either by making an alias or copy of the last list element of 'netSetxx' or, less optimally, by running getEligNodes directly (getEligNodes is used within getNetworkSet). The second is usually a simple row selection on 'diditXWave', so there is no special function to create it.
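A minimal sketch of this step, assuming getEligXWave and getDiditXWave need no further arguments here, and using an illustrative completion criterion:

# SID column plus one 0/1 indicator column per wave in each dataframe.
eligXWave  <- getEligXWave()
diditXWave <- getDiditXWave()

# Eligible-on-any-wave SIDs: alias the last element of the Step 1 list.
netSetSID.elig <- netSetSP[[length(netSetSP)]]

# Completion-based SIDs via a simple row selection on 'diditXWave';
# here, individuals who completed every requested wave (an assumption).
W <- ncol(diditXWave) - 1
netSetSID.didx <- diditXWave$SID[rowSums(diditXWave[, -1]) == W]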
In this step, objects needed in a SAOM are created in a way that simplifies assuring their compatibility with an Analysis Set, while also putting them in formats that can be directly accepted by RSiena-specific object creation functions. The objects supported by RSTools so far are (a) composition change lists, (b) fixed covariates, and (c) time-varying covariates--dependent network variables were already dealt with in Step 1. The order in which these objects are created doesn't matter, but we will follow the order shown in Figure 1 for the sake of clarity.
Task [i] passes the previously created object 'netSetSID.didx' to the function makeCCVec, which then returns a list of vectors that can be passed directly to RSiena::sienaCompositionChange. The structure of the returned list is described in the RSiena Manual.
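In code, following the text (any additional arguments to makeCCVec are unknown and omitted here):

# Composition change vectors, ready for RSiena::sienaCompositionChange.
ccVec <- makeCCVec(netSetSID.didx)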
Task [ii] is to create fixed covariate-compatible objects. Aside from pulling these data from the database into a dataframe using getFixedCovs (step [iia]), there are no special functions for this process, because it is so straightforward. Once the fixed covariate dataframe has been created, one simply extracts the column of interest (say, "gender") and selects only the rows with SIDs in the Analysis Set, e.g. from 'netSetSID.didx', or wherever one has put the Analysis Set SIDs generated in Step 1. This process is depicted in step [iib].
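A sketch of steps [iia] and [iib]; the column name 'gender' follows the text, everything else is an assumption:

# Step [iia]: pull the fixed covariates from the database.
fixedCovs <- getFixedCovs()
# Step [iib]: extract the column of interest, in Analysis Set SID order.
Gndr <- fixedCovs$gender[match(netSetSID.didx, fixedCovs$SID)]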
Task [iii] creates dataframes that are ready to be converted into RSiena time-varying covariates or dependent behavior variables (we will refer to both as time-varying covariates, or TVCs, for brevity). First (step [iiia]), the time-varying survey data are extracted by the function getNQTVCs. This function returns a dataframe 'TVCov' that is an R image of the 'SWave' database table, with all time-varying individual-level variables across the columns, and one row for each SID and wave in the school and wave sets requested. Thirty-day substance use items are set to zero if the participant indicated no lifetime use, since in that case the 30-day use questions were not presented in the survey.
Step [iiib] again has no special functional support: 'TVCov' rows are selected using the Analysis Set SID vector 'netSetSID.didx' (or equivalent), creating an Analysis Set-compatible dataframe, 'TVCov.didx'. Then step [iiic] invokes makeTVTbl, using 'TVCov.didx' as input. One input parameter allows the user to select a particular TVC with a two-character string. These codes are interpreted by an internal function, getTVCCols, which translates the requested variable into a specific database column or, for a multi-item scale, columns (these are now accessed from 'TVCov.didx', of course). The user may also specify the maximum number of these items that can be NA before the scale is scored NA for that individual and wave; below that threshold, the average of the non-missing item ratings is used as the scale score (this approach could be extended to more complicated, scale-specific scoring algorithms if necessary). Finally, the scale (or item, if there is just one) may be recoded with cut points supplied by the user as a parameter vector. Often it is best to make one scoring run with no cut-point recoding (the default, if the cut point parameter vector is not specified), and then look at the distribution to decide which cut points would be most meaningful. Recall that RSiena variables should normally have no more than about 4 or 5 ordered categories. The function returns a dataframe with W+1 columns: SID, and then the scored value of the selected TVC for each requested wave. This format can be input directly to the RSiena functions creating either behavior dependent variables or time-varying covariates; one need only select the last W columns for input to these functions, e.g. 'xxTbl[, c(2:(W+1))]', where 'xx' is replaced by two (or more, if you like) letters describing the covariate. Thus in Figure 1 we show 'ALTbl' as an example, where each column gives the user's response to the question, "How many times have you had 1 drink of alcohol or more in your entire life, up to now?"
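The following sketch strings steps [iiia]-[iiic] together; all argument names are assumptions, though the text specifies a two-character variable code, an NA threshold, and an optional cut-point vector:

# Step [iiia]: pull the time-varying covariate table (an image of SWave).
TVCov <- getNQTVCs()
# Step [iiib]: restrict rows to the Analysis Set.
TVCov.didx <- TVCov[TVCov$SID %in% netSetSID.didx, ]
# Step [iiic]: score the lifetime alcohol use variable; argument names assumed.
ALTbl <- makeTVTbl(TVCov.didx,
                   pTVC       = "AL",           # two-character TVC code
                   pMaxNA     = 1,              # max NA items before scoring NA
                   pCutPoints = c(0, 1, 2, 5))  # optional recode cut points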
There is also a special scoring function (step [iiid]) for creating up-only, time-to-event variables. In this study, these are applied only to substance use, so the function, createOnset, takes one of the 'lifetime' substance use response tables (e.g. ALTbl, per above) as input, and creates a variable that is 0 across columns 2:(W+1) until the respondent first indicates any (or some threshold of) lifetime use of that substance; from that wave on, the variable has the value 1. The function createNewOnset creates a similar dataframe, except that in each wave w, an individual is given the value 0 if he/she has not started behavior xx by wave w and will not have started by wave w+1, 1 if he/she has not started the behavior by wave w but will start at wave w+1, and 2 if he/she had already started the behavior by wave w. This scoring allows network visualizations showing, for a given wave, the xx behavior of those connected to each individual who is about to onset (next wave) to xx. This can give a sense of whether affiliations with xx-behavers might encourage new onsets; that is, an influence effect.
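In code, per the text (exact signatures assumed):

# Step [iiid]: up-only onset variables derived from the lifetime-use table.
AOTbl    <- createOnset(ALTbl)     # 0 until first lifetime use, 1 thereafter
AONewTbl <- createNewOnset(ALTbl)  # 0/1/2 coding described above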
RSiena objects used in a SAOM may now be created by simply passing the objects from previous steps (1 and 3, specifically) to the relevant RSiena functions.
A composition change object is obtained with
cc <- sienaCompositionChange(ccVec)
A fixed (constant) covariate is obtained with, e.g.,
sex <- coCovar(Gndr)
A behavior dependent variable is created via
ao <- sienaNet(as.matrix(AOTbl[, c(2:(W+1))]), type = "behavior")
Finally, a network dependent variable is created with a special function that accepts the entire output of 'getNetworkSet' as its only input:
net <- makeSAOMNet(netSetSP)
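With these objects in hand, the standard RSiena assembly follows (this is RSiena's own API, not RSTools-specific):

# Bundle the dependent variables and covariates, then get the effects table.
mydata <- sienaDataCreate(net, ao, sex, cc)
myeff  <- getEffects(mydata)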