library(learnr) library(manynet) library(migraph) library(patchwork) library(ggplot2) knitr::opts_chunk$set(echo = FALSE) marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive") marvel_friends <- to_giant(marvel_friends) marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))
For this session, we'll explore a couple of different datasets.
First, let's examine homogeneity/heterogeneity in the Marvel relationships dataset from {manynet}
,
ison_marvel_relationships
.
The dataset is quite complicated,
so to make this simpler, let's concentrate on:
Fortunately, all these data cleaning moves are easy to do in {manynet}
,
and can be seen in the following chunk in order:
# since the dataset is a 'signed' graph, we want to get just the # positively signed ties to get the friendship graph # (and lose the enmity relations) to_unsigned(____, keep = "positive")
# to_giant() is a quick easy way to get the giant/main component to_giant(____)
to_subgraph(____, Appearances >= mean(Appearances))
# don't forget to assign the results! marvel_friends <- ____
marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive") marvel_friends <- to_giant(marvel_friends) marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances)) marvel_friends
marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive") marvel_friends <- to_giant(marvel_friends) marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances)) marvel_friends
This gives us an undirected network of nearly twenty characters and a little more than 100 edges.
Recall that this data has several nodal attributes.
Let's explore a couple of these attributes, "Gender" and "PowerOrigin", visually
using graphr()
.
# Since both Gender and PowerOrigin are categorical variables # you will need to use two different aesthetic dimensions to # represent them together. # Which will you present as shapes and which as colors? graphr(____, node_shape = ____, node_color = ____)
graphr(marvel_friends, node_shape = "Gender", node_color = "PowerOrigin")
These variables seem to be distributed unevenly across the network. There seems to be some homophily -- or like choosing like -- operating here, but it is difficult to say conclusively because there are clearly more male than female superheros, as well as clearly more superheros of mutant origin than others. So what might seem like homophily could just be a result of there being many more opportunities for ties between nodes of some categories. We therefore need to establish how diverse this network really is.
We can begin by measuring the number of different categories there are. Here we might assume that the more different categories there are, the more diverse the network is. The measure of 'richness' is inherited from the study of biodiversity, and calculates the number of different categories are presented in a dataset for a given variable.
net_richness(____, ____)
net_richness(marvel_friends, "Gender") net_richness(marvel_friends, "PowerOrigin") net_richness(marvel_friends, "Attractive") net_richness(marvel_friends, "Rich") net_richness(marvel_friends, "Intellect")
question("Which variable is the most 'diverse' according to this richness measure?", answer("Gender"), answer("PowerOrigin", correct = TRUE, message = "There are four categories available in this data for power origin, while the other variables include only two categories each."), answer("Attractive"), answer("Rich"), answer("Intellect"), random_answer_order = TRUE, allow_retry = TRUE )
Note though that 'richness' as a network measure does not include any sense of how distributed these categories are around the network. There is a measure of nodal richness available, which counts the number of different categories to which each node is tied, but this does not offer a summary of how evenly distributed the appearance of categories are. For that we would need to move on to something like the Blau index.
Another measure that reflects the diversity in the network for each attribute is the Blau Index. Recall that the Blau index for any given variable is:
$$1 - \sum p_i^2$$
where $p$ represents the proportion belonging to any given category, and $i$ indexes each of the given categories. A perfectly homogeneous group would receive a score of 0, while a perfectly heterogeneous group (with members spread evenly over the maximum categories) would receive a score of 1. Obtain the network diversity scores for our five attributes.
net_diversity(____, ____)
net_diversity(marvel_friends, "Gender") net_diversity(marvel_friends, "PowerOrigin") net_diversity(marvel_friends, "Attractive") net_diversity(marvel_friends, "Rich") net_diversity(marvel_friends, "Intellect")
Looks like there is more diversity in terms of where these characters got their powers, whether they have significant intellectual powers, and their gender, than their attractiveness or their wealth.
We can also cross-reference this diversity. For example, we might be interested in whether our comic book heroes are equally gender diverse across their (power) origin stories, or equally intellectually diverse across gender.^[Note that this works for calculated categorical variables too, such as cluster/group assignment from community detection or equivalence classes.]
net_diversity(____, ____, ____)
net_diversity(marvel_friends, "Gender", "PowerOrigin") as.factor(node_attribute(marvel_friends, "PowerOrigin")) # view categories in PowerOrigin net_diversity(marvel_friends, "Intellect", "Gender")
Note that the length of the vector returned as a result is the number of categories in the second category listed. It looks like some origin stories are more gender diverse than others. Gods (just Thor here) and humans are all men, whereas those with mutant or radiation origin stories are more gender diverse. There doesn't appear to be much difference in intellect across gender categories however.
Ok, this tells us about how (un)even the distribution of these variables is in this network, but it doesn't necessarily tell us whether ties are appearing more frequently between nodes of similar (or different) categories. For that we need to look at homophily/heterophily.
The EI (or E-I) index offers a way to measure the degree to which ties appear between rather than within groups of nodes of the same category. Recall that the EI index is calculated as:
$$\frac{E-I}{E+I}$$
where $E$ is the number of ties present between a variable's categories (i.e. external),
and $I$ is the number of ties present within a variable's categories (i.e. internal).
As such, an EI index of -1 suggests perfect homophily, whereas an EI index of +1 suggests perfect heterophily.
(This is why the function is called net_heterophily()
).
Check how homophilic three variables in the network are, "Gender", "PowerOrigin", and "Attractive".
net_heterophily(____, ____)
(obs.gender <- net_heterophily(marvel_friends, "Gender")) (obs.powers <- net_heterophily(marvel_friends, "PowerOrigin")) (obs.attract <- net_heterophily(marvel_friends, "Attractive"))
question("For which variables is there a signal of homophily according to the EI index? (Choose all that apply)", answer("Gender", correct = TRUE, message = "Yes, looks like there might be some gender homophily present."), answer("PowerOrigin", message = "The score for power origin homophily is so close to 0 that it does not seem to signal much."), answer("Attractive", correct = TRUE, message = "And looks like a fairly large effect for homophily on the basis of looks..."), allow_retry = TRUE )
Ultimately though, these are just scores, and doesn't tell us whether this is any more or less than what we might expect the score to be by chance for a network of this size and density and distribution of that attribute.
To see whether we should be surprised by scores this high/low, we compare these scores with those from a series of random graphs (Erdös-Renyi/Bernoulli) of the same dimensions and distribution of the attribute. This can help us establish whether there is more homophily or heterophily than expected by chance.
This is often called a conditional uniform graph or CUG test,
but {migraph}
uses more descriptive function names,
such as test_random()
.
Plot the results of running this function with respect to the EI index
on each of the three variables.
You can specify that one thousand simulations should be used using times = 1000
.
rand.____ <- test_random(____, FUN = ____, attribute = ____, times = ___)
plot(rand.____)
rand.gender <- test_random(marvel_friends, net_heterophily, attribute = "Gender", times = 1000) rand.power <- test_random(marvel_friends, net_heterophily, attribute = "PowerOrigin", times = 1000) rand.attract <- test_random(marvel_friends, net_heterophily, attribute = "Attractive", times = 1000) plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute") plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute") plot(rand.attract) + ggtitle("CUG test results for 'Attractive' attribute")
rand.gender <- test_random(marvel_friends, net_heterophily, attribute = "Gender", times = 1000) rand.power <- test_random(marvel_friends, net_heterophily, attribute = "PowerOrigin", times = 1000) rand.attract <- test_random(marvel_friends, net_heterophily, attribute = "Attractive", times = 1000) plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute") plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute") plot(rand.attract) + ggtitle("CUG test results for 'Attractive' attribute")
The plots of these results use a dotted vertical line for 0 where this is in bounds, a red vertical line for the observed score, and a density plot of the scores from the randomly generated networks. The grey tails of the distribution are a visual aid indicating the most extreme 5% of scores from the distribution.
The results are really interesting. Despite being the larger coefficients (in absolute terms), it looks like we cannot reject the null hypothesis that there is no homophily for gender nor for attractiveness. Both observed scores fall within the range of scores we would expect from randomly generated networks with the same distribution of that variable.
However, we can reject the null hypothesis with respect to their power origin story. While the coefficient itself is close to 0 (neither strong homophily nor heterophily), all the random networks generated returned larger EI scores, between .1 and .4. That is, there is significantly less heterophily here than expected.
Ah, but perhaps random graphs are not the best reference group for establishing whether there is a significant homophily effect here. After all, social networks are not completely random; they are structured in particular ways, such as some nodes having higher degrees than others, or there being a core and periphery or community topology.
Another approach to establishing a baseline for whether we should be surprised by a given score or not is to use permutations of the underlying network instead of random graphs. Permuting the network retains the structure of the network because the ties are kept and only the labels (variables) are reassigned randomly. Let's first plot the observed data and some permuted data next to each other.
graphr(generate_permutation(____, with_attr = TRUE), ____)
old <- graphr(marvel_friends, labels = FALSE, node_size = 6, node_color = "PowerOrigin", node_shape = "Gender") + ggtitle("Original network") new <- graphr(generate_permutation(marvel_friends, with_attr = TRUE), labels = FALSE, node_size = 6, node_color = "PowerOrigin", node_shape = "Gender") + ggtitle("Permuted network") old + new
question("Which of the following is true?", answer("Random networks retain the structure of the original network.", message = learnr::random_encouragement()), answer("Permuted networks retain the structure of the original network.", correct = TRUE, message = learnr::random_praise()), answer("Both random and permuted networks retain the proportion of attributes from the original network.", correct = TRUE, message = learnr::random_praise()), answer("Permuted networks retain the ties among the same nodes from the original network.", message = "Permuted networks randomly reassign any variables and change positions of nodes by swapping the rows and columns of an adjacency matrix."), random_answer_order = TRUE, allow_retry = TRUE )
This single permutation suggests a more even mixing of these attributes is possible, but it is just a single permutation. Let's try a test that runs this over a succession of permutations, just as we did with random graphs. Plot the results for gender and power according to the random and permutation baselines.
test_permutation(____, FUN = ____, attribute = ____, times = ____)
(perm.gender <- test_permutation(marvel_friends, net_heterophily, attribute = "Gender", times = 1000)) (perm.power <- test_permutation(marvel_friends, net_heterophily, attribute = "PowerOrigin", times = 1000)) (plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute") + theme(plot.title = element_text(size=8)) | plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute") + theme(plot.title = element_text(size=8))) / (plot(perm.gender) + ggtitle("QAP test results for 'Gender' attribute") + theme(plot.title = element_text(size=8)) | plot(perm.power) + ggtitle("QAP test results for 'PowerOrigin' attribute") + theme(plot.title = element_text(size=8)))
Again, we see that there is perhaps nothing so surprising that we got the homophily score for gender that we did, but the lack of power origin heterophily is surprising. Note how the distributions are generally wider when permuting the observed network than creating a random distribution (be mindful of the scale of the x-axis). That is, taking into account the structure of the network leads us to expect a larger spread in the EI index than when the variable is distributed across a random network.
question("What can we say from these results?", answer("We can reject the null hypothesis that there is no homophily for gender in the comparison with random networks", message = "We cannot reject the null hypothesis since the result is in the range of expected scores across randomised networks, indicated by the red line between the grey shaded zones."), answer("We can reject the null hypothesis that there is no homophily for power origin in the comparison with random networks", correct = TRUE, message = learnr::random_praise()), answer("We can reject the null hypothesis that there is no homophily for power origin in both CUG and QAP tests", correct = TRUE, message = learnr::random_praise()), answer("We can reject the null hypothesis that there is no homophily for gender in both CUG and QAP tests", message = "We cannot reject the null hypothesis since the result is in the range of expected scores across randomised and permuted networks, indicated by the red lines between the grey shaded zones in both plots."), allow_retry = TRUE )
Next let us examine homophily in another network. The data were collected as part of an early experiment on communication between social network researchers who were using an Electronic Information Exchange System (EIES). You may recognise some of the names. The main network consists of 32 scholars with directed ties weighted by the total number of messages sent from $i$ to $j$ over the period of the study. Nodal attributes collected include the primary discipline and number of citations in the social science citation index at the start of the study.
ison_networkers graphr(ison_networkers, node_color = "Discipline")
Let's use both the continuous Citations
and the categorical Discipline
variables
and come up with a couple of key hypotheses:
Let's start with a pretty maximally specified model
(note that it doesn't make sense to include both ego and alter effects because these networks are undirected).
We are using times = 200
here because of time-out limitations in the learnr
tutorial system,
but for publication quality results you would want to base your conclusions on 2000
simulations or more.
network_reg(____, ison_networkers, times = 200) # If the model runs into a timeout error, please reduce the number of 'times' in the function above.
weight ~ alter(Citations) + sim(Citations) + alter(Discipline) + same(Discipline)
model1 <- network_reg(weight ~ ego(Citations) + alter(Citations) + sim(Citations) + ego(Discipline) + alter(Discipline) + same(Discipline), ison_networkers, times = 200)
We can use tidy methods (e.g. tidy()
, glance()
) to get the salient information from this model,
and {migraph}
includes also a plot method for these results to
facilitate the quick interpretation of these results.
tidy(model1) glance(model1) plot(model1)
This violin plot presents the distribution of coefficients from permutations of the network, with the coefficient fitted from the data as a red dot. Lines are used to indicate 95% thresholds, but here the distributions are rendered so wide that they are often not visible.
question("What can we say from the results from model 1?", answer("Researchers send more messages to those who are cited more", message = "Looks like alter Citations is not significant."), answer("Researchers send more messages to those who similarly cited", message = "Looks like sim Citations is not significant."), answer("Researchers send more messages to mathematicians than anthropologists", message = "Looks like alter Discipline Mathematics is not significant."), answer("Researchers send more messages to sociologists than anthropologists", message = "Looks like alter Discipline Sociology is not significant."), answer("Not much", correct = TRUE, message = "Yes, the fitted coefficients are, if not typical, at least unsurprising from permutations of the network and so there's no evidence for rejecting the null hypothesis on the basis of this data."), allow_retry = TRUE )
While these are the conclusions from this 'play' data, you may have more and more interesting data at hand. How would you go about specifying such a model? Why is such an approach more appropriate for network data than linear or logistic regression?
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.