In agoutsmedt/networkflow: Functions For A Workflow To Manipulate Networks

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In the following article, we will go other the different steps to create a bibliometric network, manipulate it, prepare the plotting and eventually plot it.

First step: creating the network and keeping the main component

As a point of departure, you need your bibliometric data to be prepared in a certain format:^[See how to extract and clean data from scopus here and from Dimensions here.]

you need a nodes table. For instance, it may be a list of articles with metadata (author(s), title, journal, etc.). Nodes must have a unique identifier and all the information about a node are gathered on only one row (in case of articles, you need one row per article).
you need a directed_edges table, that is a table that links your nodes with another variable that will be used to build the edges between your nodes. For instance, the table could links articles (your nodes) with the references cited by these articles. It can also be a journal or a list of authors (if you are interested in collaboration). In your directed_edges table, you need the identifier of the nodes (also present in the nodes table), and the unique identifier of the categories (references cited, journals, authors...) the nodes are linked to.

As soon as you have a nodes and a edges file (see the biblionetwork package for creating such files), you can create a graph, using tidygraph and the tbl_graph() function. The next step, as it is recurrent in many network analyses, notably in bibliometric netwoks like bibliographic coupling networks would be to keep only the main component of your network. This could be done in one step using the tbl_main_component() function of networkflow.

library(networkflow)

## basic example code

graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
print(graph)

The parameter nb_components allows you to choose the number of components you want to keep. For obvious reasons, it is settled to 1 by default.

However, it could happen in some networks (for instance co-authorship networks) that the second biggest component of your network is quite large. To avoid removing too big components without knowing it, the tbl_main_component() function integrates a warning that happens when a secondary component gathering more than x% of the total number of nodes is removed. The threshold_alert parameter is set to 0.05 by default, but you can reduce it if you really want to avoid removing relatively big components.

## basic example code

graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", threshold_alert = 0.001)
print(graph)

Second step: finding communities

Once you have you tidygraph graph, an important step is to run community detection algorithms to group the nodes depending on their links. This package uses the leidenAlg package, and its find_partition() function, to implement the Leiden algorithm [@traag2019]. The leiden_workflow() function of our package runs the Leiden algorithm and attributes a community number to each node in the Com_ID column, but also to each edge (depending if the from and to nodes are within the same community).

# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph)
print(graph)

You can observe that the function also gives the size of the community, by calculating the share of total nodes that are in each community.

The function also allows to play with the resolution parameter of leidenAlg find_partition() function. Varying the resolution of the algorithm results in a different partition and different number of communities. A lower resolution means less communities, and conversely. The basic resolution of the leiden_workflow() is set by res_1 and equals 1 by default. You can vary this parameter, but also try a second resolution with res_2 and a third one with res_3:

# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph, res_1 = 0.5, res_2 = 2, res_3 = 3)
print(graph)

Once you have detected different communities in your network, you are well on the way of the projection of your graph, but two important steps should be implemented before. First, you have to attribute some colors to each community. These colors will be used for your nodes and edges when you will project your graph with ggraph. The function community_colors of the networkflow package allow to do that. You just have to give it a palette (with as many colors as the number of communities for a better visualisation).^[If two connected nodes are in the same community, their edge will take the same color. If they are in different communities, their edge will have a mix of the two communities colors.]

# loading a palette with many colors
palette <- c("#1969B3","#01A5D8","#DA3E61","#3CB95F","#E0AF0C","#E25920","#6C7FC9","#DE9493","#CD242E","#6F4288","#B2EEF8","#7FF6FD","#FDB8D6","#8BF9A9","#FEF34A","#FEC57D","#DAEFFB","#FEE3E1","#FBB2A7","#EFD7F2","#5CAADA","#37D4F5","#F5779B","#62E186","#FBDA28","#FB8F4A","#A4B9EA","#FAC2C0","#EB6466","#AD87BC","#0B3074","#00517C","#871B2A","#1A6029","#7C4B05","#8A260E","#2E3679","#793F3F","#840F14","#401C56","#003C65","#741A09","#602A2A","#34134A","#114A1B","#27DDD1","#27DD8D","#4ADD27","#D3DD27","#DDA427","#DF2935","#DD27BC","#BA27DD","#3227DD","#2761DD","#27DDD1")

# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph)

# attributing colors
graph <- community_colors(graph, palette, community_column = "Com_ID")
print(graph)

What you want to do next is to give a name automatically to your community. The community_names() function allows you to do that: it gives to the community the label of the node, within the community, which has the highest score in the statistics you choose. In the next exemple, we will calculate the degree of each node, and each community will take as a name the label of its highest degree node.

library(magrittr)
library(dplyr)
library(tidygraph)

# calculating the degree of nodes
 graph <- graph %>%
   activate(nodes) %>%
   mutate(degree = centrality_degree())

# giving names to communities
 graph <- community_names(graph, ordering_column = "degree", naming = "Author_date", community_column = "Com_ID")
 print(graph)