Diversity and Regression
In migraph: Inferential Methods for Multimodal and Other Networks

library(learnr)
library(manynet)
library(migraph)
library(autograph)
clear_glossary()
knitr::opts_chunk$set(echo = FALSE)
marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive")
marvel_friends <- to_giant(marvel_friends)
marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))

This tutorial

gif of deadpool greeting

Within a network, nodes and ties vary. But how much do they vary and are there associations in this variance? By the end of this tutorial, you will be able to:

[ ] measure the richness of a network in terms of attribute types
[ ] measure how diverse a network is
[ ] measure how heterophilous the ties in a network are
[ ] test how heterophilous the ties in a network are compared to random networks
[ ] test how heterophilous the ties in a network are compared to permutations of the original network
[ ] regress a network's ties on other ties and/or nodal or dyadic attributes

Initial visualisation

For this session, we'll explore a couple of different datasets. First, let's examine homogeneity/heterogeneity in the Marvel relationships dataset from {manynet}, ison_marvel_relationships. The dataset is quite complicated, so to make this simpler, let's concentrate on:

just the positive (friendship) ties and not the negative (enmity) ties
the main (giant) component without any isolates
just those characters that appear in the comics more than average

Fortunately, all these data cleaning moves are easy to do in {manynet}, and can be seen in the following chunk in order:

# since the dataset is a 'signed' graph, we want to get just the
# positively signed ties to get the friendship graph 
# (and lose the enmity relations)
to_unsigned(____, keep = "positive")

# to_giant() is a quick easy way to get the giant/main component
to_giant(____)

to_subgraph(____, Appearances >= mean(Appearances))

# don't forget to assign the results!
marvel_friends <- ____

marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive")
marvel_friends <- to_giant(marvel_friends)
marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))
marvel_friends

marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive")
marvel_friends <- to_giant(marvel_friends)
marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))
marvel_friends

This gives us an undirected network of nearly twenty characters and a little more than 100 edges. Recall that this data has several nodal attributes. Let's explore a couple of these attributes, "Gender" and "PowerOrigin", visually using graphr().

# Since both Gender and PowerOrigin are categorical variables
# you will need to use two different aesthetic dimensions to
# represent them together.
# Which will you present as shapes and which as colors?
graphr(____, 
       node_shape = ____,
       node_color = ____)

graphr(marvel_friends, 
       node_shape = "Gender",
       node_color = "PowerOrigin")

These variables seem to be distributed unevenly across the network. There seems to be some homophily -- or like choosing like -- operating here, but it is difficult to say conclusively because there are clearly more male than female superheros, as well as clearly more superheros of mutant origin than others. So what might seem like homophily could just be a result of there being many more opportunities for ties between nodes of some categories. We therefore need to establish how diverse this network really is.

Measuring richness

Some categorical variables have more categories than others, and this can be thought of as indicating greater diversity than e.g. a binary categorical variable. A measure of this 'richness' is inherited from the study of biodiversity. Richness calculates the number of different categories that appear in the data for a given variable. Let's calculate the richness of all of the variables in our Marvel data.

net_node_attributes(marvel_friends)

net_richness(____, ____)

net_richness(marvel_friends, "Gender")
net_richness(marvel_friends, "PowerOrigin")
net_richness(marvel_friends, "Attractive")
net_richness(marvel_friends, "Rich")
net_richness(marvel_friends, "Intellect")

question("Which variable is the most 'diverse' according to this richness measure?",
  answer("Gender"),
  answer("PowerOrigin", 
         correct = TRUE,
         message = "There are four categories available in this data for power origin, while the other variables include only two categories each."),
  answer("Attractive"),
  answer("Rich"),
  answer("Intellect"),
  random_answer_order = TRUE,
  allow_retry = TRUE
)

Note though that 'richness' as a network measure does not include any sense of how distributed these categories are around the network. There is a measure of nodal richness available, which counts the number of different categories to which each node is tied, but this does not offer a summary of how evenly distributed the appearance of categories are. For that we would need to move on to something like the Blau index.

Measuring diversity

Another measure that reflects the diversity in the network for each attribute is the Blau Index. Recall that the Blau index for any given variable is:

$$1 - \sum p_i^2$$

where $p_i$ represents the proportion of observations holding $i$, which indexes each of the given categories. A perfectly homogeneous group (all the same category) would receive a score of 0, while a perfectly heterogeneous group (with members spread evenly over the maximum categories) would receive a score of 1. Obtain the network diversity scores for the five attributes used above.

net_diversity(____, ____)

net_diversity(marvel_friends, "Gender")
net_diversity(marvel_friends, "PowerOrigin")
net_diversity(marvel_friends, "Attractive")
net_diversity(marvel_friends, "Rich")
net_diversity(marvel_friends, "Intellect")

Looks like there is more diversity in terms of where these characters got their powers, whether they have significant intellectual powers, and their gender, than their attractiveness or their wealth.

We can also cross-reference this diversity. For example, we might be interested in whether our comic book heroes are equally gender diverse across their (power) origin stories, or equally intellectually diverse across gender.^[Note that this works for calculated categorical variables too, such as cluster/group assignment from community detection or equivalence classes.]

net_diversity(____, ____, ____)

net_diversity(marvel_friends, "Gender", "PowerOrigin")
as.factor(node_attribute(marvel_friends, "PowerOrigin")) # view categories in PowerOrigin
net_diversity(marvel_friends, "Intellect", "Gender")

Note that the length of the vector returned as a result is the number of categories in the second category listed. It looks like some origin stories are more gender diverse than others. Gods (just Thor here) and humans are all men, whereas those with mutant or radiation origin stories are more gender diverse. There doesn't appear to be much difference in intellect across gender categories in this data.

Ok, this tells us about how (un)even the distribution of these variables is in this network, but it doesn't necessarily tell us whether ties are appearing more frequently between nodes of similar (or different) categories. For that we need to look at homophily/heterophily.

Measuring heterophily

The EI (or E-I) index offers a way to measure the degree to which ties appear between rather than within groups of nodes of the same category. Recall that the EI index is calculated as:

$$\frac{E-I}{E+I}$$

where $E$ is the number of ties present between a variable's categories (i.e. external), and $I$ is the number of ties present within a variable's categories (i.e. internal). As such, an EI index of -1 suggests perfect homophily, whereas an EI index of +1 suggests perfect heterophily. (This is why the function is called net_heterophily()).

Check how homophilic three variables in the network are, "Gender", "PowerOrigin", and "Attractive". Please assign the results so that we can use them later.

net_heterophily(____, ____)

(obs.gender <- net_heterophily(marvel_friends, "Gender"))
(obs.powers <- net_heterophily(marvel_friends, "PowerOrigin")) 
(obs.attract <- net_heterophily(marvel_friends, "Attractive"))

question("For which variables is there a sign of some homophily effect according to the EI index? (Choose all that apply)",
  answer("Gender", 
         correct = TRUE, 
         message = "Yes, looks like there might be some gender homophily present."),
  answer("PowerOrigin", 
         message = "The score for power origin homophily is so close to 0 that it does not seem to signal much."),
  answer("Attractive", 
         correct = TRUE, 
         message = "And looks like a fairly large effect for homophily on the basis of looks..."),
  allow_retry = TRUE
)

Ultimately though, these are just scores, and doesn't tell us whether this is any more or less than what we might expect the score to be by chance for a network of this size and density and distribution of that attribute.

CUG tests

To see whether we should be surprised by scores this high/low, we compare these scores with those from a series of random graphs (Erdös-Renyi/Bernoulli) of the same dimensions and distribution of the attribute. This can help us establish whether there is more homophily or heterophily than expected by chance.

This is often called a conditional uniform graph or CUG test, but {migraph} uses more descriptive function names, such as test_random(). Plot the results of running this function with respect to the EI index on each of the three variables. These plots can be made if you have {autograph} installed and loaded. You can specify that one thousand simulations should be used using times = 1000.

rand.____ <- test_random(____, FUN = ____, attribute = ____, times = ___)

plot(rand.____)

rand.gender <- test_random(marvel_friends, 
                           net_heterophily, attribute = "Gender", 
                           times = 1000)
rand.power <- test_random(marvel_friends, 
                          net_heterophily, attribute = "PowerOrigin", 
                          times = 1000)
rand.attract <- test_random(marvel_friends, 
                            net_heterophily, attribute = "Attractive", 
                            times = 1000)
plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute")
plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute")
plot(rand.attract) + ggtitle("CUG test results for 'Attractive' attribute")

rand.gender <- test_random(marvel_friends, 
                           net_heterophily, attribute = "Gender", 
                           times = 1000)
rand.power <- test_random(marvel_friends, 
                          net_heterophily, attribute = "PowerOrigin", 
                          times = 1000)
rand.attract <- test_random(marvel_friends, 
                            net_heterophily, attribute = "Attractive", 
                            times = 1000)
plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute")
plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute")
plot(rand.attract) + ggtitle("CUG test results for 'Attractive' attribute")

The plots of these results use a dotted vertical line for 0 where this is in bounds, a red vertical line for the observed score, and a density plot of the scores from the randomly generated networks. The grey tails of the distribution are a visual aid indicating the most extreme 5% of scores from the distribution. The results are really interesting.

gif of ironman explosion

Despite being the larger coefficients (in absolute terms), it looks like we cannot reject the null hypothesis that there is no homophily for gender nor for attractiveness. Both observed scores fall within the range of scores we would expect from randomly generated networks with the same distribution of that variable.

However, we can reject the null hypothesis with respect to their power origin story. While the coefficient itself is close to 0 (neither strong homophily nor heterophily), all the random networks generated returned larger EI scores, between .1 and .4.

question("Which of the following is true?",
  answer("Attractive superheros only hang out with other attractive superheros.",
         message = paste(learnr::random_encouragement(), '<img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExeW9jbXg5dDBndjRpcWc2enhyOTZvbjloeGFpbmJndXNrMm05MHdtZiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ziLadIVnOGCKk/giphy.gif" alt="gif of thor frustration"/>')),
  answer("There is significantly less power origin heterophily than expected.",
         correct = TRUE,
         message = learnr::random_praise()),
  answer("There is no evidence of gender homophily.",
         correct = TRUE,
         message = learnr::random_praise()),
  answer("There is significant power origin homophily.", 
         message = "Remember that the coefficient does not suggest many homophilous ties."),
  random_answer_order = TRUE,
  allow_retry = TRUE
)

QAP tests

Ah, but perhaps random graphs are not the best reference group for establishing whether there is a significant homophily effect here. After all, social networks are not completely random; they are structured in particular ways, such as some nodes having higher degrees than others, or there being a core and periphery or community topology.

Another approach to establishing a baseline for whether we should be surprised by a given score or not is to use permutations of the underlying network instead of random graphs. Permuting the network retains the structure of the network because the ties are kept and only the labels (variables) are reassigned randomly. Let's first plot the observed data and some permuted data next to each other.

graphr(generate_permutation(____, with_attr = TRUE), ____)

old <- graphr(marvel_friends,
              labels = FALSE, node_size = 6,
              node_color = "PowerOrigin",
              node_shape = "Gender") + ggtitle("Original network")
new <- graphr(generate_permutation(marvel_friends, with_attr = TRUE),
              labels = FALSE, node_size = 6,
              node_color = "PowerOrigin",
              node_shape = "Gender") + ggtitle("Permuted network")
old + new

question("Which of the following is true?",
  answer("Random networks retain the structure of the original network.",
         message = paste(learnr::random_encouragement(), '<img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExeW9jbXg5dDBndjRpcWc2enhyOTZvbjloeGFpbmJndXNrMm05MHdtZiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ziLadIVnOGCKk/giphy.gif" alt="gif of thor frustration"/>')),
  answer("Permuted networks retain the structure of the original network.",
         correct = TRUE,
         message = learnr::random_praise()),
  answer("Both random and permuted networks retain the proportion of attributes from the original network.",
         correct = TRUE,
         message = learnr::random_praise()),
  answer("Permuted networks retain the ties among the same nodes from the original network.", 
         message = "Permuted networks randomly reassign any variables and change positions of nodes by swapping the rows and columns of an adjacency matrix."),
  random_answer_order = TRUE,
  allow_retry = TRUE
)

This single permutation suggests a more even mixing of these attributes is possible, but it is just a single permutation. Let's try a test that runs this over a succession of permutations, just as we did with random graphs. Plot the results for gender and power according to the random and permutation baselines.

test_permutation(____, FUN = ____, attribute = ____,
                                times = ____)

(perm.gender <- test_permutation(marvel_friends, 
                                net_heterophily, attribute = "Gender",
                                times = 1000))
(perm.power <- test_permutation(marvel_friends, 
                                net_heterophily, attribute = "PowerOrigin",
                                times = 1000))

(plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute") + 
    theme(plot.title = element_text(size=8)) | 
    plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute") + 
    theme(plot.title = element_text(size=8))) /
(plot(perm.gender) + ggtitle("QAP test results for 'Gender' attribute") + 
   theme(plot.title = element_text(size=8)) | 
   plot(perm.power) + ggtitle("QAP test results for 'PowerOrigin' attribute") + 
   theme(plot.title = element_text(size=8)))

Again, we see that there is perhaps nothing so surprising that we got the homophily score for gender that we did, but the lack of power origin heterophily is surprising. Note how the distributions are generally wider when permuting the observed network than creating a random distribution (be mindful of the scale of the x-axis). That is, taking into account the structure of the network leads us to expect a larger spread in the EI index than when the variable is distributed across a random network.

question("What can we say from these results?",
  answer("We can reject the null hypothesis that there is no homophily for gender in the comparison with random networks",
         message = "We cannot reject the null hypothesis since the result is in the range of expected scores across randomised networks, indicated by the red line between the grey shaded zones."),
  answer("We can reject the null hypothesis that there is no homophily for power origin in the comparison with random networks", 
         correct = TRUE,
         message = learnr::random_praise()),
  answer("We can reject the null hypothesis that there is no homophily for power origin in both CUG and QAP tests",
         correct = TRUE,
         message = learnr::random_praise()),
  answer("We can reject the null hypothesis that there is no homophily for gender in both CUG and QAP tests", 
         message = "We cannot reject the null hypothesis since the result is in the range of expected scores across randomised and permuted networks, indicated by the red lines between the grey shaded zones in both plots."),
  allow_retry = TRUE
)

Network regression

Next let us examine homophily in another network. The data were collected as part of an early experiment on communication between social network researchers who were using an Electronic Information Exchange System (EIES). You may recognise some of the names. The main network consists of 32 scholars with directed ties weighted by the total number of messages sent from $i$ to $j$ over the period of the study. Nodal attributes collected include the primary discipline and number of citations in the social science citation index at the start of the study.

ison_networkers
graphr(ison_networkers, node_color = "Discipline")

Let's use both the continuous Citations and the categorical Discipline variables and come up with a couple of key hypotheses:

$H_1$: researchers send more emails to those who are cited more (popularity hypothesis)
$H_2$: researchers send more emails to those of the same discipline (disciplinary homophily hypothesis)

Let's start with a pretty maximally specified model (note that it doesn't make sense to include both ego and alter effects because these networks are undirected). We are using times = 200 here because of time-out limitations in the learnr tutorial system, but for publication quality results you would want to base your conclusions on 2000 simulations or more.

net_regression(____, ison_networkers, times = 200)
# If the model runs into a timeout error, please reduce the number of 'times' in the function above.

weight ~ ego(Citations) + alter(Citations) + sim(Citations) + 
                      ego(Discipline) + same(Discipline)

model1 <- net_regression(weight ~ ego(Citations) + alter(Citations) + sim(Citations) + 
                      ego(Discipline) + same(Discipline), 
                      ison_networkers, times = 200)

We can use tidy methods (e.g. tidy(), glance()) to get the salient information from this model, and {autograph} includes also a plot method for these results to facilitate the quick interpretation of these results.

tidy(model1)
glance(model1)
plot(model1)

This violin plot presents the distribution of coefficients from permutations of the network, with the coefficient fitted from the data as a red dot. Lines are used to indicate 95% thresholds, but here the distributions are rendered so wide that they are often not visible.

question("What can we say from the results from model 1?",
  answer("Researchers send more messages to those who are cited more", 
         message = "Looks like alter Citations is not significant."),
  answer("Researchers send more messages to those who similarly cited", 
         message = "Looks like sim Citations is not significant."),
  answer("Researchers send more messages to mathematicians than anthropologists", 
         message = "Looks like alter Discipline Mathematics is not significant."),
  answer("Researchers send more messages to sociologists than anthropologists", 
         message = "Looks like alter Discipline Sociology is not significant."),
  answer("Not much", 
         correct = TRUE, 
         message = "Yes, the fitted coefficients are, if not typical, at least unsurprising from permutations of the network and so there's no evidence for rejecting the null hypothesis on the basis of this data."),
  allow_retry = TRUE
)

Free play

While these are the conclusions from this 'play' data, you may have more and more interesting data at hand. How would you go about specifying such a model? Why is such an approach more appropriate for network data than linear or logistic regression?