Initial visualisation
For this session, we’ll explore a couple of different datasets.
First, let’s examine homogeneity/heterogeneity in the Marvel
relationships dataset from {manynet},
ison_marvel_relationships. The dataset is quite
complicated, so to make this simpler, let’s concentrate on:
- just the positive (friendship) ties and not the negative (enmity) ties
- the main (giant) component without any isolates
- just those characters that appear in the comics more than average
Fortunately, all these data cleaning moves are easy to do in
{manynet}, and can be seen in the following chunk in
order:
# since the dataset is a 'signed' graph, we want to get just the
# positively signed ties to get the friendship graph 
# (and lose the enmity relations)
to_unsigned(____, keep = "positive")# to_giant() is a quick easy way to get the giant/main component
to_giant(____)to_subgraph(____, Appearances >= mean(Appearances))# don't forget to assign the results!
marvel_friends <- ____marvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive")
marvel_friends <- to_giant(marvel_friends)
marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))
marvel_friendsmarvel_friends <- to_unsigned(ison_marvel_relationships, keep = "positive")
marvel_friends <- to_giant(marvel_friends)
marvel_friends <- marvel_friends %>% to_subgraph(Appearances >= mean(Appearances))
marvel_friendsThis gives us an undirected network of nearly twenty characters and a
little more than 100 edges. Recall that this data has several nodal
attributes. Let’s explore a couple of these attributes, “Gender” and
“PowerOrigin”, visually using graphr().
# Since both Gender and PowerOrigin are categorical variables
# you will need to use two different aesthetic dimensions to
# represent them together.
# Which will you present as shapes and which as colors?
graphr(____, 
       node_shape = ____,
       node_color = ____)graphr(marvel_friends, 
           node_shape = "Gender",
           node_color = "PowerOrigin")These variables seem to be distributed unevenly across the network. There seems to be some homophily – or like choosing like – operating here, but it is difficult to say conclusively because there are clearly more male than female superheros, as well as clearly more superheros of mutant origin than others. So what might seem like homophily could just be a result of there being many more opportunities for ties between nodes of some categories. We therefore need to establish how diverse this network really is.
Measuring richness
We can begin by measuring the number of different categories there are. Here we might assume that the more different categories there are, the more diverse the network is. The measure of ‘richness’ is inherited from the study of biodiversity, and calculates the number of different categories are presented in a dataset for a given variable.
net_richness(____, ____)net_richness(marvel_friends, "Gender")
net_richness(marvel_friends, "PowerOrigin")
net_richness(marvel_friends, "Attractive")
net_richness(marvel_friends, "Rich")
net_richness(marvel_friends, "Intellect")Note though that ‘richness’ as a network measure does not include any sense of how distributed these categories are around the network. There is a measure of nodal richness available, which counts the number of different categories to which each node is tied, but this does not offer a summary of how evenly distributed the appearance of categories are. For that we would need to move on to something like the Blau index.
Measuring diversity
Another measure that reflects the diversity in the network for each attribute is the Blau Index. Recall that the Blau index for any given variable is:
\[1 - \sum p_i^2\]
where \(p\) represents the proportion belonging to any given category, and \(i\) indexes each of the given categories. A perfectly homogeneous group would receive a score of 0, while a perfectly heterogeneous group (with members spread evenly over the maximum categories) would receive a score of 1. Obtain the network diversity scores for our five attributes.
net_diversity(____, ____)net_diversity(marvel_friends, "Gender")
net_diversity(marvel_friends, "PowerOrigin")
net_diversity(marvel_friends, "Attractive")
net_diversity(marvel_friends, "Rich")
net_diversity(marvel_friends, "Intellect")Looks like there is more diversity in terms of where these characters got their powers, whether they have significant intellectual powers, and their gender, than their attractiveness or their wealth.
We can also cross-reference this diversity. For example, we might be interested in whether our comic book heroes are equally gender diverse across their (power) origin stories, or equally intellectually diverse across gender.1
net_diversity(____, ____, ____)net_diversity(marvel_friends, "Gender", "PowerOrigin")
as.factor(node_attribute(marvel_friends, "PowerOrigin")) # view categories in PowerOrigin
net_diversity(marvel_friends, "Intellect", "Gender")Note that the length of the vector returned as a result is the number of categories in the second category listed. It looks like some origin stories are more gender diverse than others. Gods (just Thor here) and humans are all men, whereas those with mutant or radiation origin stories are more gender diverse. There doesn’t appear to be much difference in intellect across gender categories however.
Ok, this tells us about how (un)even the distribution of these variables is in this network, but it doesn’t necessarily tell us whether ties are appearing more frequently between nodes of similar (or different) categories. For that we need to look at homophily/heterophily.
- Note that this works for calculated categorical variables too, such as cluster/group assignment from community detection or equivalence classes.↩︎ 
Measuring heterophily
The EI (or E-I) index offers a way to measure the degree to which ties appear between rather than within groups of nodes of the same category. Recall that the EI index is calculated as:
\[\frac{E-I}{E+I}\]
where \(E\) is the number of ties
present between a variable’s categories (i.e. external), and \(I\) is the number of ties present within a
variable’s categories (i.e. internal). As such, an EI index of -1
suggests perfect homophily, whereas an EI index of +1 suggests perfect
heterophily. (This is why the function is called
net_heterophily()).
Check how homophilic three variables in the network are, “Gender”, “PowerOrigin”, and “Attractive”.
net_heterophily(____, ____)(obs.gender <- net_heterophily(marvel_friends, "Gender"))
(obs.powers <- net_heterophily(marvel_friends, "PowerOrigin")) 
(obs.attract <- net_heterophily(marvel_friends, "Attractive")) Ultimately though, these are just scores, and doesn’t tell us whether this is any more or less than what we might expect the score to be by chance for a network of this size and density and distribution of that attribute.
CUG tests
To see whether we should be surprised by scores this high/low, we compare these scores with those from a series of random graphs (Erdös-Renyi/Bernoulli) of the same dimensions and distribution of the attribute. This can help us establish whether there is more homophily or heterophily than expected by chance.
This is often called a conditional uniform graph or CUG test, but
{migraph} uses more descriptive function names, such as
test_random(). Plot the results of running this function
with respect to the EI index on each of the three variables. You can
specify that one thousand simulations should be used using
times = 1000.
rand.____ <- test_random(____, FUN = ____, attribute = ____, times = ___)plot(rand.____)rand.gender <- test_random(marvel_friends, 
                           net_heterophily, attribute = "Gender", 
                           times = 1000)
rand.power <- test_random(marvel_friends, 
                          net_heterophily, attribute = "PowerOrigin", 
                          times = 1000)
rand.attract <- test_random(marvel_friends, 
                            net_heterophily, attribute = "Attractive", 
                            times = 1000)
plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute")
plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute")
plot(rand.attract) + ggtitle("CUG test results for 'Attractive' attribute")The plots of these results use a dotted vertical line for 0 where this is in bounds, a red vertical line for the observed score, and a density plot of the scores from the randomly generated networks. The grey tails of the distribution are a visual aid indicating the most extreme 5% of scores from the distribution.
The results are really interesting. Despite being the larger coefficients (in absolute terms), it looks like we cannot reject the null hypothesis that there is no homophily for gender nor for attractiveness. Both observed scores fall within the range of scores we would expect from randomly generated networks with the same distribution of that variable.
However, we can reject the null hypothesis with respect to their power origin story. While the coefficient itself is close to 0 (neither strong homophily nor heterophily), all the random networks generated returned larger EI scores, between .1 and .4. That is, there is significantly less heterophily here than expected.
QAP tests
Ah, but perhaps random graphs are not the best reference group for establishing whether there is a significant homophily effect here. After all, social networks are not completely random; they are structured in particular ways, such as some nodes having higher degrees than others, or there being a core and periphery or community topology.
Another approach to establishing a baseline for whether we should be surprised by a given score or not is to use permutations of the underlying network instead of random graphs. Permuting the network retains the structure of the network because the ties are kept and only the labels (variables) are reassigned randomly. Let’s first plot the observed data and some permuted data next to each other.
graphr(generate_permutation(____, with_attr = TRUE), ____)old <- graphr(marvel_friends,
              labels = FALSE, node_size = 6,
              node_color = "PowerOrigin",
              node_shape = "Gender") + ggtitle("Original network")
new <- graphr(generate_permutation(marvel_friends, with_attr = TRUE),
              labels = FALSE, node_size = 6,
              node_color = "PowerOrigin",
              node_shape = "Gender") + ggtitle("Permuted network")
old + newThis single permutation suggests a more even mixing of these attributes is possible, but it is just a single permutation. Let’s try a test that runs this over a succession of permutations, just as we did with random graphs. Plot the results for gender and power according to the random and permutation baselines.
test_permutation(____, FUN = ____, attribute = ____,
                                times = ____)(perm.gender <- test_permutation(marvel_friends, 
                                net_heterophily, attribute = "Gender",
                                times = 1000))
(perm.power <- test_permutation(marvel_friends, 
                                net_heterophily, attribute = "PowerOrigin",
                                times = 1000))
(plot(rand.gender) + ggtitle("CUG test results for 'Gender' attribute") + 
    theme(plot.title = element_text(size=8)) | 
    plot(rand.power) + ggtitle("CUG test results for 'PowerOrigin' attribute") + 
    theme(plot.title = element_text(size=8))) /
(plot(perm.gender) + ggtitle("QAP test results for 'Gender' attribute") + 
   theme(plot.title = element_text(size=8)) | 
   plot(perm.power) + ggtitle("QAP test results for 'PowerOrigin' attribute") + 
   theme(plot.title = element_text(size=8)))Again, we see that there is perhaps nothing so surprising that we got the homophily score for gender that we did, but the lack of power origin heterophily is surprising. Note how the distributions are generally wider when permuting the observed network than creating a random distribution (be mindful of the scale of the x-axis). That is, taking into account the structure of the network leads us to expect a larger spread in the EI index than when the variable is distributed across a random network.
Network regression
Next let us examine homophily in another network. The data were collected as part of an early experiment on communication between social network researchers who were using an Electronic Information Exchange System (EIES). You may recognise some of the names. The main network consists of 32 scholars with directed ties weighted by the total number of messages sent from \(i\) to \(j\) over the period of the study. Nodal attributes collected include the primary discipline and number of citations in the social science citation index at the start of the study.
ison_networkers
graphr(ison_networkers, node_color = "Discipline")Let’s use both the continuous Citations and the
categorical Discipline variables and come up with a couple
of key hypotheses:
- \(H_1\): researchers send more emails to those who are cited more (popularity hypothesis)
- \(H_2\): researchers send more emails to those of the same discipline (disciplinary homophily hypothesis)
Let’s start with a pretty maximally specified model (note that it
doesn’t make sense to include both ego and alter effects because these
networks are undirected). We are using times = 200 here
because of time-out limitations in the learnr tutorial
system, but for publication quality results you would want to base your
conclusions on 2000 simulations or more.
network_reg(____, ison_networkers, times = 200)
# If the model runs into a timeout error, please reduce the number of 'times' in the function above.weight ~ alter(Citations) + sim(Citations) + 
                      alter(Discipline) + same(Discipline)We can use tidy methods (e.g. tidy(),
glance()) to get the salient information from this model,
and {migraph} includes also a plot method for these results
to facilitate the quick interpretation of these results.
tidy(model1)
glance(model1)
plot(model1)This violin plot presents the distribution of coefficients from permutations of the network, with the coefficient fitted from the data as a red dot. Lines are used to indicate 95% thresholds, but here the distributions are rendered so wide that they are often not visible.
While these are the conclusions from this ‘play’ data, you may have more and more interesting data at hand. How would you go about specifying such a model? Why is such an approach more appropriate for network data than linear or logistic regression?