In judgelord/netlit: Augment a Literature Review with Network Analysis Statistics

knitr::opts_chunk$set(echo = TRUE, 
                      cache = TRUE,
                      fig.width=10, 
                      fig.height=7,
                      out.width = "100%",
                      split = TRUE,
                      fig.align = 'center', 
                      #fig.path='../man/figures/',
                      fig.retina = 1, 
                      warning=FALSE, 
                      message=FALSE)

library(knitr)
library(kableExtra) # Table formatting and pipe (%>%)

# format kable for document type 
kable <- function(...){
  if (knitr::is_latex_output()){
    head(..., 25) %>% 
      knitr::kable(booktabs = TRUE, format = 'latex') %>% 
      kable_styling(latex_options = c("striped", "scale_down", "HOLD_position"))
  } else {
    knitr::kable(...) %>% 
      kable_styling() %>% 
      scroll_box(height = "200px")
  }
}

Understanding the gaps and connections across existing theories and findings is a perennial challenge in scientific research. Systematically reviewing scholarship is especially challenging for researchers who may lack domain expertise, including junior scholars or those exploring new substantive territory. Conversely, senior scholars may rely on longstanding assumptions and social networks that exclude new research. In both cases, ad hoc literature reviews hinder accumulation of knowledge. Scholars are rarely systematic in selecting relevant prior work or then identifying patterns across their sample. To encourage systematic, replicable, and transparent methods for assessing literature, we propose an accessible network-based framework for reviewing scholarship. In our method, we consider a literature as a network of recurring concepts (nodes) and theorized relationships among them (edges). Network statistics and visualization allow researchers to see patterns and offer reproducible characterizations of assertions about the major themes in existing literature.

netlit provides functions to generate network statistics from a literature review. Specifically, it processes a dataset where each row is a proposed relationship ("edge") between two concepts or variables ("nodes"). The aim is to offer easy tools to begin using the power of network analysis in R for literature reviews. Using netlit simply requires researchers to enter relationships they observe in prior studies into a simple spreadsheet.

Using the `netlit` R Package

The netlit package provides functions to generate network statistics from a literature review. Specifically, netlit provides a wrapper for igraph functions to facilitate using network analysis in literature reviews.

Install this package with

devtools::install_github("judgelord/netlit")

To install netlit from CRAN, run the following:

install.packages("netlit")

Basic Usage

The review() function takes in a dataframe, data, that includes from and to columns (a directed graph structure).

In the example below, we use example data from this project on redistricting. These data are a set of related concepts (from and to) in the redistricting literature and citations for these relationships (cites and cites_empirical). See vignette("netlit") for more details on this example.

library(netlit)

# Load replication version of main data and metadata on citations
load(here::here("replication_data", "literature.rda"))

literature %>% kable()

netlit offers four main functions: make_edgelist(), make_nodelist(), augment_nodelist(), and review().

review() is the primary function. The others are helper functions that perform the individual steps that review() does all at once. review() takes in a dataframe with at least two columns representing linked concepts (e.g., a cause and an effect) and returns data augmented with network statistics. Users must either specify "from" nodes and "to" nodes with the from and to arguments or include columns named from and to in the supplied data object.

review() returns a list of three objects:

an augmented edgelist (a list of relationships with edge_betweenness calculated),
an augmented nodelist (a list of concepts with degree and betweenness calculated), and
a graph object suitable for use in other igraph functions or other network visualization packages.

Users may wish to include edge attributes (e.g., information about the relationship between the two concepts) or node attributes (information about each concept). We show how to do so below. But first, consider the basic use of review():

lit <- review(literature, from = "from", to = "to")

lit

edges <- lit$edgelist

edges %>%  kable()

nodes <- lit$nodelist

nodes %>%  kable()

Edge and node attributes can be added using the edge_attributes and node_attributes arguments. edge_attributes is a vector that identifies columns in the supplied data frame that the user would like to retain. node_attributes is a separate dataframe that contains attributes for each node in the primary data set. The example node_attributes data include one column type indicating a type for each each node/variable/concept.

# Load replication version of node type coding
load(here::here("replication_data/node_attributes.rda"))

node_attributes %>% kable()

lit <- review(literature,
              edge_attributes = c("cites", "cites_empirical"),
              node_attributes = node_attributes)

lit

Tip: to retain all variables from literature, use edge_attributes = names(literature).

More Advanced Uses: larger networks, visualizing your network, network descriptives

library(tidyverse)
library(magrittr)
library(ggraph)

clean <- . %>% 
  str_replace_all("([a-z| |-]{8}) ","\\1\n") %>%
  str_replace_all(" ([a-z| |-]{9})",  "\n\\1") %>% str_to_title() %>% 
  str_replace("\nOf\n", "\nOf ") %>% 
  str_replace("\nFellow ", " Fellow\n") %>% 
  str_replace("\nState\n", " State\n") %>% 
  str_replace("\nDistrict\n", " District\n") %>% 
  str_replace("\nWith\n", " With\n")

literature$from %<>% clean() 
literature$to %<>% clean()
node_attributes$node %<>% clean()

We separated multiple cites to a theorized relationship with semicolons. Let's count the total number of citations and the number of citations to empirical work by splitting out each cite and measuring the length of that vector.

# count cites 
literature %<>% 
  group_by(to, from) %>% 
  mutate(cite_weight = str_split(cites, ";")[[1]]  %>% length(),
         cite_weight_empirical = str_split(cites_empirical, ";",)[[1]] %>% length(),
         cite_weight_empirical = ifelse(is.na(cites_empirical), 0, cite_weight_empirical)) %>% 
  ungroup() 

# subsets 
literature %<>% mutate(communities_node = str_c(to, from) %>% str_detect("Commun"),
                       confound = case_when(
      from == "Preserve\nCommunities\nOf Interest" & to == "Rolloff" ~ T,
      from == "Voter\nInformation\nAbout Their\nDistrict" & to == "Rolloff" ~ T,
      from == "Preserve\nCommunities\nOf Interest" 
            & to == "Voter\nInformation\nAbout Their\nDistrict" ~ T,
      T ~ F),
              empirical = ifelse(!is.na(cites_empirical),
                                 "Empirical work", 
                                 "No empirical work"))

Now we use review() on this expanded edgelist, including all variables in the literature data with edge_attributes = names(literature).

# now with all node and edge attributes 
lit <- review(literature,
              edge_attributes = names(literature),
              node_attributes = node_attributes
              )

edges <- lit$edgelist

edges %>% kable()

nodes <- lit$nodelist

nodes %>%  kable()

The `igraph` object

# define igraph object as g
g <- lit$graph

g

What does it mean?

D means directed
N means named graph
W means weighted graph
name (v/c) means name is a node attribute and it's a character
cite_weight (e/n) means cite_weight is an edge attribute and it's numeric

With `ggraph`

We can also plot using the package ggraph package to plot the igraph object.

This package allows us to plot self-ties, but it is slightly more difficult to use ggplot features (e.g. colors and legend labels) compared to ggnetwork.

set.seed(5)
library(ggraph)

p <- ggraph(g, layout = 'fr') + 
  geom_node_point(
    aes(color = degree_total %>% as.factor() ),
    size = 6, 
    alpha = .7
    ) + 
  geom_edge_arc2(
    start_cap = circle(3,'mm'),
    end_cap = circle(6, 'mm'),
    aes(
      color = cite_weight ,
      linetype = empirical
      ),
    curvature = 0,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "open")
    ) +
  geom_edge_loop(
      start_cap = circle(5, 'mm'),
      end_cap = circle(2, 'mm'),
      aes( color = cite_weight ,
           linetype = empirical
      ),
      n = 300,
      strength = .6,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "open")
    ) +
  geom_node_text( aes(label = name), size = 2.3) + 
  ggplot2::theme_void() + 
  theme(legend.position="bottom") + 
  labs(edge_color = "Number of\nPublications",
       color = "Total Degree\nCentrality",
       edge_linetype = "") + 
  scale_edge_colour_viridis(
                        discrete = FALSE,
                        option = "plasma",
                        begin = 0,
                        end = .9,
                        direction = -1,
                        guide = "legend",
                        aesthetics = "edge_colour") +
  scale_color_viridis_d(option = "mako", 
                        begin = 1, 
                        end = .5)

p 

ggsave(filename = "docs/netlit-replication_files/figure-html/figure2.pdf", width=10, height=7)

Subgraphs

p + facet_wrap("communities_node")

ggsave(filename = "docs/netlit-replication_files/figure-html/figure3a.pdf", width=20, height=7)

p + facet_wrap("confound")


ggsave(filename = "docs/netlit-replication_files/figure-html/figure3b.pdf", width=20, height=7)

Betweenness

Edge Betweenness

ggraph(g, layout = 'fr') + 
  geom_node_point(size = 10, 
                  alpha = .1) + 
  theme_void() + 
  theme(legend.position="bottom"
        ) + 
  scale_color_viridis_c(begin = .5, 
                        end = 1, 
                        direction = -1, 
                        option = "cividis") + 
scale_edge_color_viridis(begin = 0.2, 
                         end = .9, 
                         direction = -1, 
                         guide = "legend",
                         option = "cividis")  +    
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
      color = edge_betweenness,
      linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "closed")) + 
    geom_edge_loop(aes(color = edge_betweenness))  +
  geom_node_text(aes(label = name), 
                 size = 2.3) + 
  labs(edge_color = "Edge Betweenness",
       color = "Node Betweenness",
       edge_linetype = "")

Node Betweenness

p <- ggraph(g, layout = 'fr') + 
  geom_node_point(
    aes(color = betweenness),
    size = 6, 
    alpha = .7
    ) + 
  geom_edge_arc2(
      start_cap = circle(3, 'mm'),
      end_cap = circle(6, 'mm'),
    aes(
      color = cite_weight,
      linetype = empirical
      ),
    curvature = 0,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "open")
    ) +
  geom_edge_loop(
      start_cap = circle(5, 'mm'),
      end_cap = circle(2, 'mm'),
      aes( color = cite_weight,
      linetype = empirical
      ),
      n = 300,
      strength = .6,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "open")
    ) +
  geom_node_text(aes(label = name), 
                  size = 2.3) + 
  theme_void() + 
  theme(legend.position="bottom") + 
  labs(edge_color = "Number of\nPublications",
       color = "Betweeneness",
       edge_linetype = "") + 
scale_edge_color_viridis(option = "plasma", 
                         begin = 0, 
                         end = .9, 
                         direction = -1,
                         guide = "legend") +
  scale_color_gradient2()

p 


ggraph(g, layout = 'fr') + 
  geom_node_point(aes(color = betweenness),
                  size = 10, 
                  alpha = 1) + 
  theme_void() + 
  theme(legend.position="bottom") + 
 scale_color_viridis_c(begin = .5, 
                       end = 1, 
                       direction = -1, 
                       option = "cividis") + 
scale_edge_color_viridis(begin = 0.2, 
                         end = .9, 
                         direction = -1, 
                         option = "cividis",
    guide = "legend")  +    
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
    color = edge_betweenness,
    linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "closed")) + 
    geom_edge_loop(aes(color = edge_betweenness)) +
  labs(edge_color = "Edge Betweenness",
       color = "Node Betweenness",
       edge_linetype = "") + 
  geom_node_text(aes(label = name), 
                 size = 2.3)

Degree

ggraph(g, layout = 'fr') + 
  geom_node_point(aes(color = degree_total),
                  size = 10, 
                  alpha = 1) + 
  theme_void() + 
  theme(legend.position="bottom"        ) + 
  scale_color_gradient2() + 
scale_edge_color_viridis(begin = 0.2, 
                         end = .9, 
                         direction = -1, 
                         option = "cividis",
                         guide = "legend")  +    
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
    color = edge_betweenness,
    linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'), 
                  type = "closed")) + 
    geom_edge_loop(aes(color = edge_betweenness)) +
  labs(edge_color = "Edge Betweenness",
       color = "Total Degree",
       edge_linetype = "") + 
  geom_node_text(aes(label = name), 
                 size = 2.3)

About the example data

Articles were chosen according to specific selection criteria. We first identified articles published since 2010 that either 1) were published in one of eight high-ranking journals or 2) gained at least 50 citations according to Google Scholar. We then chose articles that contained four possible key terms in the title or abstract.

# Journal articles in example data
data("literature_metadata")

literature_metadata %>% kable()

# count publications per journal
pub_table <- literature_metadata %>% 
  filter(str_detect(paste(literature$cites, collapse = "|"), Author)) %>% 
  count(Publication, name = "Articles") %>%
  mutate(Publication = case_when(
    Publication == "AJPS" ~ "American Journal of Political Science",
    Publication == "APSR" ~ "American Political Science Review",
    Publication == "BJPS" ~ "British Journal of Political Science",
    Publication == "JOP" ~ "The Journal of Politics",
    Publication == "NCL Review" ~ "North Carolina Law Review",
    Publication == "QJPS" ~ "Quarterly Journal of Political Science",
    TRUE ~ Publication
  )) 

pub_table %>% kable()