Estimating Semantic Networks
In SemNetCleaner: An Automated Cleaning Tool for Semantic and Linguistic Data

knitr::opts_chunk$set(echo = TRUE)

Vignette taken directly from @christensen2019semna

With the binary response matrix, semantic networks can now be estimated. In the last few years, various computational approaches have been proposed to estimate semantic networks from verbal fluency data [@goni2011semantic; @kenett2013semantic; @lerner2009network; @zemla2018estimating]. Moreover, a number of R packages are capable of estimating semantic networks [e.g., corpustools; @welbers2018corpustools] and networks more generally [e.g., igraph; @csardi2006igraph and qgraph; @epskamp2012qgraph]. As described earlier, this tutorial follows the approach developed by Kenett and colleagues, which estimates semantic networks from correlations between the association profiles of verbal fluency responses across the sample [@borodkin2016pumpkin; @kenett2013semantic; @kenett2016structure].

The SemNetCleaner, SemNeT, and NetworkToolbox packages in R will be used to execute this stage of the pipeline. The SemNetCleaner package will be used to further process the binary response matrix into a finalized format for network estimation. The SemNeT package [@christensen2019semnet] contains several functions for the analysis of semantic networks, including a function to compute the association profiles of verbal fluency responses. The NetworkToolbox package [@christensen2019networktoolbox] contains functions for network analysis more generally, including functions to estimate and analyze networks. This package will be used to estimate the semantic networks from the association profile matrices.

Process

Kenett and colleagues' approach begins by splitting the binary response matrix into groups. Next, for each group, only responses that are provided by two or more participants are retained [e.g., @borodkin2016pumpkin]. This is done to minimize spurious associations driven by idiosyncratic responses in the sample. Finally, the binary response matrices are "equated": their responses are matched so that each group retains only the responses that were given in every group [@kenett2013semantic].

This step is particularly important because some groups may have a different number of responses (i.e., nodes), which can introduce confounding factors [e.g., biased comparison of network parameters; @van2010comparing]. By equating the binary response matrices, the networks can be compared using the same nodes, ruling out alternative explanations of the results (e.g., difference in network structure) that could be due to differences in the number of nodes [@borodkin2016pumpkin]. Once this process is complete, the networks can be estimated using a network estimation method.

We continue with the example of the dataset analyzed by @christensen2018remotely that estimated and compared semantic networks of two groups---low and high openness to experience groups. While we focus on estimating and comparing two groups, the functions in our R packages are capable of handling more than two groups.

Preparation for network estimation

The binary response matrix (i.e., corr.clean$binary) from the preprocessing step contains the responses for both the low and high openness to experience groups. To continue with our pipeline, we need to separate the binary response matrix into two groups. This can be done using the Group variable with the following code:

# Attach 'Group' variable to the binary response matrix
behav <- cbind(open.animals$Group, corr.clean$binary)
# Create low and high openness to experience response matrices
low <- behav[which(behav[,1]==1),-1]
high <- behav[which(behav[,1]==2),-1]

The resulting matrices are the binary response matrices for the low and high openness to experience groups. For users who would like to use other network estimation methods that are not included in R, these binary response matrices can be exported using the following code:

# Save binary response matrices
write.csv(low, "low_BRM.csv", row.names = TRUE)
write.csv(high, "high_BRM.csv", row.names = TRUE)
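For designs with more than two groups, the same split and export can be written once for any number of groups. The sketch below uses only base R and toy data; `toy.brm`, `toy.group`, and `groups` are hypothetical names introduced for illustration and are not objects created by the packages.

```r
# Toy binary response matrix: 6 participants (rows) x 4 responses (columns)
toy.brm <- matrix(
  c(1, 0, 1, 0,
    1, 1, 0, 0,
    0, 1, 1, 0,
    1, 1, 0, 1,
    0, 1, 1, 1,
    1, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("P", 1:6), c("cat", "dog", "fish", "bird"))
)

# Hypothetical grouping variable with three groups
toy.group <- c(1, 1, 2, 2, 3, 3)

# Split the row indices by group, then pull each group's rows;
# each list element is one group's binary response matrix
groups <- lapply(
  split(seq_len(nrow(toy.brm)), toy.group),
  function(rows) toy.brm[rows, , drop = FALSE]
)

# Write one file per group (to a temporary directory in this sketch)
for (g in names(groups)) {
  write.csv(groups[[g]],
            file.path(tempdir(), paste0("group", g, "_BRM.csv")),
            row.names = TRUE)
}
```

Using `drop = FALSE` keeps each element a matrix even if a group contains a single participant.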

Continuing with our pipeline, we aim to minimize the number of spurious associations in the network. This can be executed with the following code:

# Finalize matrices so that each response
# has been given by at least two participants
final.low <- finalize(low, minCase = 2)
final.high <- finalize(high, minCase = 2)

The finalize function removes responses (columns) that were given by fewer than a minimum number of participants. This minimum is set with the minCase argument. The argument defaults to 2, which is consistent with our approach; however, users may wish to set a higher minimum to further guard against spurious associations. Next, the responses are equated to control for differences in the number of nodes. To do this, the following code can be used:

# Equate the responses across the networks
eq <- equate(final.low, final.high)
equate.low <- eq$final.low
equate.high <- eq$final.high

The equate function will match the responses across any number of groups. If there are more than two groups, then they simply need to be entered (separated by commas) into the function. The output of equate is a list of binary response matrices that have been matched across groups. Each group's matrix is nested in the output and labeled with the name of the object used as input (e.g., input = final.low and output = eq$final.low).
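To make the filtering and equating steps concrete, they can be mimicked in base R on toy data. This is a sketch of the logic described above, not the packages' source code: `keep_min_cases` and `equate_two` are hypothetical helpers, and the real finalize and equate functions may handle details such as input checks differently.

```r
# Keep only responses (columns) given by at least `min_case` participants
keep_min_cases <- function(brm, min_case = 2) {
  brm[, colSums(brm) >= min_case, drop = FALSE]
}

# Keep only responses (columns) that appear in both groups
equate_two <- function(a, b) {
  shared <- intersect(colnames(a), colnames(b))
  list(a = a[, shared, drop = FALSE], b = b[, shared, drop = FALSE])
}

responses <- c("cat", "dog", "fish", "bird")

# Toy group matrices: 3 participants (rows) x 4 responses (columns)
toy.low <- matrix(c(1, 1, 0, 1,
                    1, 0, 0, 1,
                    0, 1, 1, 1),
                  nrow = 3, byrow = TRUE, dimnames = list(NULL, responses))
toy.high <- matrix(c(1, 0, 1, 0,
                     1, 1, 1, 1,
                     1, 0, 0, 1),
                   nrow = 3, byrow = TRUE, dimnames = list(NULL, responses))

# "fish" is dropped from the low group and "dog" from the high group,
# because each was given by only one participant
toy.final.low <- keep_min_cases(toy.low)
toy.final.high <- keep_min_cases(toy.high)

# Only responses that survived in both groups remain
toy.eq <- equate_two(toy.final.low, toy.final.high)
colnames(toy.eq$a)
```

After both steps, the two toy matrices share exactly the same columns, so networks built from them have the same nodes.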

Now that the binary response matrix has been separated into two groups based on our behavioral measure and the responses have been equated between the two groups, the networks can be estimated.

Network estimation

The networks that Kenett et al. estimate are called correlation-based networks [@zemla2018estimating]. They are called correlation-based networks because they are estimated from how often responses co-occur across the group [@borodkin2016pumpkin; @kenett2013semantic]. Common association measures that have been used with this approach are Pearson's pairwise correlation [e.g., @kenett2013semantic] and cosine similarity [e.g., @christensen2018remotely]. Thus, the nodes in this network represent verbal fluency responses and the edges represent their association.

In our example of the work by @christensen2018remotely, the cosine similarity was used to compute the association profiles of the responses. We can apply this similarity measure with the following code:

# Compute cosine similarity for the 'low' and
# 'high' equated binary response matrices
cosine.low <- similarity(equate.low, method = "cosine")
cosine.high <- similarity(equate.high, method = "cosine")

The similarity function in the SemNeT package computes an association matrix from the equated response matrices. The method argument selects the association measure that is used. Here, we use the "cosine" similarity measure; however, there are a number of other similarity measures, such as Pearson's correlation (method = "cor"), that can be applied (see ?similarity for more options). With these association matrices, a network estimation method can be applied.
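For intuition, the cosine similarity between two binary response columns is the number of participants who gave both responses divided by the square root of the product of the number who gave each. The base R sketch below computes it on toy data; `cosine_sketch` is a hypothetical helper written for illustration, and the similarity function is not necessarily implemented this way.

```r
# Cosine similarity between all pairs of columns of a binary response matrix.
# Assumes every column has at least one 1 (true after the finalize step);
# an all-zero column would produce NaN.
cosine_sketch <- function(brm) {
  co <- crossprod(brm)       # co[i, j]: participants who gave both i and j
  norms <- sqrt(diag(co))    # Euclidean norm of each column
  co / outer(norms, norms)
}

# Toy matrix: 4 participants (rows) x 3 responses (columns)
toy <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("cat", "dog", "fish")))

toy.cosine <- cosine_sketch(toy)

# "cat" and "dog" were each given by 3 participants and co-occur in 2,
# so their similarity is 2 / sqrt(3 * 3) = 2/3
toy.cosine["cat", "dog"]
```

The result is a symmetric matrix with ones on the diagonal, which is the form of association matrix that the filtering step takes as input.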

To further minimize spurious relations, we proceed to apply a filter over each association matrix. The purpose of applying a network filtering method is to minimize spurious associations and retain the most relevant information in the network [@tumminello2005tool]. Network estimation methods have certain criteria for retaining edges (e.g., statistical significance), which creates a more parsimonious model [@barfuss2016parsimonious]. For Kenett and colleagues' approach, a family of network estimation methods known as Information Filtering Networks [@barfuss2016parsimonious; @christensen2018network] has been applied.

The Information Filtering Networks methods apply various geometric constraints on the associations of the data to identify the most relevant information between nodes (i.e., edges) in a network [@christensen2018network]. Common Information Filtering Network approaches are the minimal spanning tree [@mantegna1999hierarchical], planar maximally filtered graph [@tumminello2005tool], triangulated maximally filtered graph [@massara2016network], and maximally filtered clique forest [@massara2019learning].

In @christensen2018remotely, the triangulated maximally filtered graph (TMFG) method was applied. The TMFG algorithm identifies the most important edges in a network by first connecting the four nodes that have the highest sum of edge weights (i.e., association) across all nodes. Next, the algorithm identifies and adds an additional node, which maximizes its sum of edge weights to the other connected nodes. The algorithm continues until every node is connected in the network [@golino2018ega3; @massara2016network].
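The procedure described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the NetworkToolbox implementation: `tmfg_sketch` is a hypothetical function, the seed nodes are chosen by their plain sum of edge weights as in the description above, ties are broken by taking the first maximum, and each new node is attached inside one triangular face of the growing network.

```r
tmfg_sketch <- function(W) {
  n <- ncol(W)
  stopifnot(n >= 4, isSymmetric(unname(W)))
  diag(W) <- 0
  A <- matrix(0, n, n, dimnames = dimnames(W))

  # Seed: the four nodes with the largest summed association, fully connected
  seed <- order(rowSums(W), decreasing = TRUE)[1:4]
  A[seed, seed] <- W[seed, seed]
  faces <- t(combn(seed, 3))          # the four triangular faces (one per row)
  remaining <- setdiff(seq_len(n), seed)

  while (length(remaining) > 0) {
    # gain[i, f]: summed weight from remaining node i to the vertices of face f
    gain <- sapply(seq_len(nrow(faces)), function(f) {
      rowSums(W[remaining, faces[f, ], drop = FALSE])
    })
    gain <- matrix(gain, nrow = length(remaining))
    best <- which(gain == max(gain), arr.ind = TRUE)[1, ]
    node <- remaining[best[1]]
    face <- faces[best[2], ]

    # Connect the chosen node to the three vertices of the chosen face
    A[node, face] <- W[node, face]
    A[face, node] <- W[face, node]

    # The filled face is replaced by the three new faces it creates
    faces <- rbind(faces[-best[2], , drop = FALSE],
                   c(node, face[1], face[2]),
                   c(node, face[1], face[3]),
                   c(node, face[2], face[3]))
    remaining <- remaining[-best[1]]
  }
  A
}

# Toy symmetric association matrix with 8 nodes and positive weights
set.seed(1)
toy.W <- matrix(runif(64), 8, 8)
toy.W <- (toy.W + t(toy.W)) / 2
diag(toy.W) <- 1

toy.net <- tmfg_sketch(toy.W)
sum(toy.net[upper.tri(toy.net)] != 0)   # number of retained edges
```

The seed contributes six edges and every later node adds exactly three, so the edge count depends only on the number of nodes.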

The resulting network has $3n - 6$ edges (where $n$ is the number of nodes) and is planar [i.e., it can be drawn on a plane without any edges crossing; @tumminello2005tool]. Because the number of edges is a function of the number of nodes, networks with the same number of nodes will have the same number of edges. This is advantageous for comparing network structures because it removes the confound of differences between networks being due to differences in the number of edges [@christensen2018network; @van2010comparing]. The TMFG method can be implemented, using the NetworkToolbox package in R,[^2] with the following code:

[^2]: Note that other filtering methods can also be applied using the NetworkToolbox including the minimal spanning tree, maximally filtered clique forest, and several thresholding methods.

# Estimate 'low' and 'high' openness to experience networks
net.low <- TMFG(cosine.low)$A
net.high <- TMFG(cosine.high)$A

The output of these functions is a TMFG filtered semantic network for the low and high openness to experience groups. To save these networks outside of R so that other programs can be applied, the following code can be used:

# Save the networks
write.csv(net.low, "network_low.csv", row.names = FALSE)
write.csv(net.high, "network_high.csv", row.names = FALSE)

These networks are weighted, meaning that the edges correspond to the magnitude of association between nodes. It's common, however, for the edges to be converted to binary values [i.e., 1 = edge present and 0 = edge absent; @abbott2015random; @kenett2013semantic; @kenett2014investigating]. To convert a weighted network into one that is unweighted, the binarize function can be used:

# Binarize the networks (optional)
net.low <- binarize(net.low)
net.high <- binarize(net.high)
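Conceptually, binarizing replaces every nonzero edge weight with 1 and leaves absent edges at 0. A base R equivalent on a toy weighted matrix is sketched below; `toy.weighted` and `toy.binary` are hypothetical objects, and the sketch only illustrates the idea rather than reproducing the binarize function.

```r
# Toy weighted network with three nodes and two edges
toy.weighted <- matrix(c(0,   0.4, 0,
                         0.4, 0,   0.7,
                         0,   0.7, 0),
                       nrow = 3, byrow = TRUE)

# 1 = edge present, 0 = edge absent
toy.binary <- ifelse(toy.weighted != 0, 1, 0)

# The number of edges is unchanged; only the weights are discarded
sum(toy.binary[upper.tri(toy.binary)])
```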

It's worth noting that, despite differences in edge weights, it has been shown that weighted and unweighted semantic networks typically correspond to one another [@abbott2015random]. When computing network measures in SemNeT, the edges will be binarized by default, meaning the measures are computed on the unweighted network. There are options, however, to compute weighted measures when the networks are left weighted; therefore, it's often preferable to keep the networks weighted.

Summary

In this section, we discussed and applied one approach for estimating group-based semantic networks using functions in SemNetCleaner, SemNeT, and NetworkToolbox. In this process, the binary response matrix was split into groups, idiosyncratic responses were removed, and group binary response matrices were equated (using SemNetCleaner). Then, a similarity measure was applied to these group matrices (using SemNeT) and a network estimation method was applied (using NetworkToolbox).

Notably, there are other approaches for estimating semantic networks [e.g., @zemla2018estimating]. These other approaches fit seamlessly into our SemNA pipeline. For example, the binary response matrix from the preprocessing step can be used in another network estimation procedure. The output of the network estimation step is one or more networks that are ready to be analyzed in the statistical analysis step of the pipeline. Effectively, this makes the network estimation step in the pipeline exchangeable with any other network estimation procedure.