Empirical study of complex networks requires real-world data to validate theoretical results.
A large, diverse corpus of networks often proves useful given the many shapes and sizes that complex networks assume.[@corpus1;@corpus2]
To our knowledge, the Colorado Index of Complex Networks (ICON) hosts the largest curated index of real-world complex networks, with metadata and links to over 5,000 networks as of this writing.[@icon]
However, heterogeneity in data format, access, and availability limits how easily users can take advantage of this incredible resource.
A central repository containing a large corpus of ICON-indexed networks in standard format would thus provide a useful service for network science researchers, who would avoid the tedious task of data format conversion prior to analysis.
Here, we introduce the `ICON` R package as such a solution, providing a large and diverse corpus of real-world networks and tools to work with existing network analysis and visualization R packages.
\texttt{ICON} v$0.4.0$ is a package for the R programming language hosted on the Comprehensive R Archive Network (CRAN).[@rlang]
It strongly depends on the `utils` ($\geq 3.6$) R package and weakly depends on the `covr` ($\geq 3.5$), `ggnetwork` ($\geq 0.5$), `ggplot2` ($\geq 3.3$), `igraph` ($\geq 1.2$), `knitr` ($\geq 1.30$), `network` ($\geq 1.16$), `testthat` ($\geq 2.3$), and `rmarkdown` ($\geq 2.4$) R packages.[@rlang;@covr-pkg;@ggnetwork-pkg;@ggplot2-pkg;@igraph-pkg;@knitr-pkg;@network-pkg;@testthat-pkg;@rmarkdown-pkg]
Throughout the development process, the `devtools` R package was heavily used.[@devtools-pkg]
This report is fully reproducible with code (GPL-3 license) available at \url{https://github.com/rrrlw/ICON}.
Full reproduction will require installation of the strong and weak dependencies listed above.
The stable version of `ICON` (currently v$0.4.0$) can be downloaded from CRAN, while the development version can be downloaded from the package's GitHub repository using the `remotes` package.[@remotes-pkg]
Both options are demonstrated in the following code chunk.
```r
# install stable version from CRAN
install.packages("ICON")

# install development version from GitHub
remotes::install_github("rrrlw/ICON", build_vignettes = TRUE)
```
```r
library("ICON")
data("ICON_data")
largest_row <- which(ICON_data$Edges == max(ICON_data$Edges))
```
Currently, `ICON` provides `r nrow(ICON::ICON_data)` complex network edge lists.
The largest network, named `r ICON_data$Var_name[largest_row]`, consists of `r ICON_data$Edges[largest_row]` edges.
Due to the large volume of data and CRAN package size limits, `ICON`'s networks cannot all be bundled with the package at installation and loaded with `utils::data`.
Instead, the package's GitHub repository contains a directory named `data-host`, which `ICON::get_data` accesses to download the networks named by the user.
After a successful download, `ICON::get_data` loads these networks into the user's environment of choice (default: `.GlobalEnv`) and cleans up any intermediate artifacts.
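The load-and-clean-up steps described above can be illustrated without a network connection. The sketch below is not the implementation of `ICON::get_data`; it only mimics the last two steps (loading an `.rda` file into a chosen environment, then deleting the intermediate file), with a temporary file standing in for a download. The helper `load_into()` and the object `toy_net` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: load every object stored in an .rda file into `envir`,
# then delete the file (the "intermediate artifact").
load_into <- function(path, envir = .GlobalEnv) {
  on.exit(unlink(path), add = TRUE)  # clean up even if load() fails
  load(path, envir = envir)          # returns the names of the loaded objects
}

# Stand-in for a downloaded file: a toy edge list saved to a temporary .rda
toy_net <- data.frame(from = c("a", "b"), to = c("b", "c"), weight = c(1, 2))
tmp <- tempfile(fileext = ".rda")
save(toy_net, file = tmp)
rm(toy_net)

# Load into a dedicated environment instead of the global one
net_env <- new.env()
loaded <- load_into(tmp, envir = net_env)

loaded                                                # names of loaded objects
exists("toy_net", envir = net_env, inherits = FALSE)  # object is in net_env
file.exists(tmp)                                      # temporary file is gone
```

Loading into a dedicated environment, as done here with `new.env()`, keeps downloaded objects from overwriting variables of the same name in the global environment.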
To avoid dependence on an internet connection, users can save and access individual networks in RDS format (binary; `.rds` extension; via `base::saveRDS` and `base::readRDS`) or do the same for a set of networks in RData/RDA format (binary; `.RData` or `.rda` extension; via `base::save` and `base::load`).
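As a minimal sketch of both caching options, the chunk below saves and restores toy edge lists rather than real `ICON` networks. The object names `net_a` and `net_b` and the cache directory are assumptions made for illustration; only base R functions are used.

```r
# Toy edge lists standing in for networks obtained with ICON::get_data
net_a <- data.frame(from = c(1, 2), to = c(2, 3))
net_b <- data.frame(from = c("x", "y"), to = c("y", "z"), weight = c(0.5, 1.5))

cache_dir <- tempfile("icon_cache_")  # hypothetical cache location
dir.create(cache_dir)

# One network per file: RDS
rds_path <- file.path(cache_dir, "net_a.rds")
saveRDS(net_a, rds_path)
net_a_restored <- readRDS(rds_path)

# Several networks in one file: RData
rda_path <- file.path(cache_dir, "nets.RData")
save(net_a, net_b, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)

identical(net_a, net_a_restored)  # round trip preserves the object
sort(ls(restored))                # objects come back under their saved names
```

One practical difference: `readRDS` returns the object, so it can be assigned to any name, whereas `load` restores objects under the names they had when saved.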
An obvious deficiency of this system is the inability to take advantage of automated data documentation and checking tools, such as `roxygen2`.[@roxygen2-pkg]
However, the `ICON::ICON_data` dataset provides the necessary documentation for `ICON` users and implements a sufficient, if tedious, checking system for the package authors.
Although providing a standardized data format avoids redundant work, having a single party (the package authors) complete an important processing step opens the door to inaccuracies.
We counter this limitation with `ICON`'s status as free, open-source software (FOSS), which offers every user the opportunity to inspect, question, and correct all aspects of `ICON`.
The `data-raw` directory in `ICON`'s GitHub repository follows Wickham's (2015) advice and contains: (1) the original raw data acquired directly from the source indexed by the ICON website; (2) the R code that converts each raw dataset into a data frame comprising an edge list and any edge attributes; and (3) the R code that saves the resulting data frame as an RDA file in the aforementioned `data-host` directory.
We hope that this not only allows but encourages `ICON` users to verify dataset accuracy.
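To make the three `data-raw` steps concrete, here is a minimal sketch under stated assumptions: the raw data is a small weighted adjacency matrix invented for illustration, and a temporary directory stands in for `data-host`. The actual conversion scripts in the repository depend on each source's format and may look quite different.

```r
# (1) Toy "raw" data: a weighted adjacency matrix (rows = source, cols = target)
raw <- matrix(c(0, 2, 0,
                0, 0, 5,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# (2) Convert to an edge list: one row per nonzero entry
idx <- which(raw != 0, arr.ind = TRUE)
toy_network <- data.frame(
  from   = rownames(raw)[idx[, "row"]],
  to     = colnames(raw)[idx[, "col"]],
  weight = raw[idx],
  stringsAsFactors = FALSE
)

# (3) Save the result as an RDA file (tempdir() stands in for data-host)
out_path <- file.path(tempdir(), "toy_network.rda")
save(toy_network, file = out_path)

toy_network
```

Keeping the raw data, the conversion code, and the saved result side by side is what lets a reader rerun step (2) and compare the output against the hosted file.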
Note that to minimize unnecessary package elements, `ICON`'s `.Rbuildignore` lists `data-host` and `data-raw`, so neither directory is included in the built package.
However, for reproducibility and documentation, `ICON`'s GitHub repository provides public access to both directories.
We will now look at sample code to acquire complex network datasets using `ICON`.
To do so, we must load the library in the R session and load the `ICON_data` dataset, which contains relevant complex network metadata.
The metadata can be explored in the package documentation with `?ICON_data`; in this report, we will focus only on the essentials, starting with the following code chunk.
```r
# load library
library("ICON")

# load metadata
# explore this data frame to figure out which networks suit your needs
data("ICON_data")

# peek at the first few and last few networks available to download
head(ICON_data$Var_name, n = 3)
tail(ICON_data$Var_name, n = 3)
```
We first try downloading a single dataset with `ICON::get_data` and peeking at its contents.
Once this succeeds, we confidently download multiple datasets.
```r
# download single dataset named in previous code chunk output
# could also use `get_data(ICON_data$Var_name[1])` to same effect
get_data("aishihik_intensity")

# look at the structure of the complex network
str(aishihik_intensity)

# confirm that metadata reflects the correct number of edges
(ICON_data$Edges[1] == nrow(aishihik_intensity))

# look at the first few rows; for all ICON datasets:
#   columns 1 and 2 = nodes that define the edge
#   columns 3 and beyond = edge attributes (e.g. weight)
head(aishihik_intensity)

# download multiple datasets
get_data(c("wordadj_japanese", "wordadj_french"))

# confirm downloads by looking at internal structure
str(wordadj_japanese)
str(wordadj_french)
```
A keen reader might observe that all of `ICON`'s datasets could be downloaded with `get_data(ICON_data$Var_name)`; due to the potential runtime and memory commitment, we strongly recommend that users exercise caution if attempting this.
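One cautious alternative to downloading everything is to filter the metadata by size first. The sketch below uses a mock metadata frame so that it runs without the package; only the `Var_name` and `Edges` column names come from this report, while the values and the `max_edges` threshold are invented for illustration.

```r
# Mock metadata with the two columns used in this report; the real ICON_data
# has more columns and different values.
mock_meta <- data.frame(
  Var_name = c("net_small", "net_medium", "net_large"),
  Edges    = c(120, 4500, 2000000),
  stringsAsFactors = FALSE
)

max_edges <- 10000  # arbitrary threshold chosen for illustration
wanted <- mock_meta$Var_name[mock_meta$Edges <= max_edges]
wanted

# With the package loaded, the same selection could then be passed on, e.g.:
# get_data(ICON_data$Var_name[ICON_data$Edges <= max_edges])
```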
## `ICON` S3 class

Looking at the structure of the complex networks with `utils::str` shows that `ICON` complex networks all have two classes: `ICON` and `data.frame`.
The latter provides a suitable container for edge list objects with potential edge attributes in rectangular format.
The former, an S3 class, benefits users by providing guarantees about object format: an unmodified complex network object acquired via the `ICON` package will have the `ICON` S3 class and is guaranteed to be a data frame containing an edge list in which each row represents a single edge, the first two columns specify the nodes that define that edge, and any additional columns define edge attributes.
This standard format guarantee allows users, among other things, to write code for one `ICON` dataset with assurance that it will work for other `ICON` datasets.
The S3 class will also allow users to take advantage of relevant S3 generic methods.
In future `ICON` versions, we aim to implement methods for common generics, e.g. `base::plot`.
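To illustrate what the format guarantee buys, the sketch below writes a small helper that relies only on the documented layout (two node columns followed by optional attribute columns). The helper `edge_summary()` is hypothetical and not part of the package, and the object `toy` is constructed by hand to mimic the class vector rather than obtained via `ICON::get_data`.

```r
# Toy object mimicking the documented format: an edge list data frame whose
# class vector is c("ICON", "data.frame"). Not obtained from the package.
toy <- structure(
  data.frame(from = c("a", "a", "b"),
             to = c("b", "c", "c"),
             Frequency = c(3, 1, 7)),
  class = c("ICON", "data.frame")
)

# Hypothetical helper that relies only on the format guarantee
edge_summary <- function(x) {
  stopifnot(inherits(x, "data.frame"), ncol(x) >= 2)
  nodes <- unique(c(as.character(x[[1]]), as.character(x[[2]])))
  list(
    n_nodes    = length(nodes),
    n_edges    = nrow(x),
    attributes = names(x)[-(1:2)]  # everything after the two node columns
  )
}

edge_summary(toy)
inherits(toy, "ICON")
```

Because the helper never refers to column names, it should apply unchanged to any data frame that follows the same layout.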
Before starting with the use cases, the following code chunk will load the appropriate libraries and download the sample dataset.
```r
# load necessary libraries
library("ICON")
library("network")
library("ggnetwork")
library("ggplot2")
library("igraph")

# for reproducibility
set.seed(42)

# download sample dataset
get_data("seed_disperse_beehler")
```
A quick exploration of `seed_disperse_beehler` will grant a deeper understanding of the use cases.
Primarily, we would like to explore the third column, named `Frequency`.
Due to the heavy skew, we will use two consecutive logarithmic transformations to more easily see the effects of coloring edges by the `Frequency` edge attribute.
The following code chunk produces histograms of `seed_disperse_beehler$Frequency` before and after this transformation for comparison.
```r
# plot a histogram w/o transformation (skewed, tough to see differences)
ggplot(seed_disperse_beehler, aes(x = Frequency)) +
  geom_histogram(bins = 10, fill = "white", color = "black") +
  theme_bw()

# plot a histogram w/ transformation (more spread out, differences easily seen)
ggplot(seed_disperse_beehler, aes(x = log(log(Frequency)))) +
  geom_histogram(bins = 10, fill = "white", color = "black") +
  theme_bw()
```
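The effect of the double logarithm can also be seen numerically. The sketch below uses a made-up right-skewed vector, not values from `seed_disperse_beehler`, and also shows two edge cases worth checking before applying the transformation to other data: a value of exactly 1 maps to `-Inf`, and values below 1 map to `NaN`, neither of which can be placed on a histogram axis.

```r
# Toy right-skewed values (not taken from seed_disperse_beehler)
freq <- c(2, 3, 5, 10, 100, 10000)

once  <- log(freq)
twice <- log(log(freq))

# The spread between smallest and largest value shrinks with each transformation
diff(range(freq))
diff(range(once))
diff(range(twice))

# Edge cases worth checking before applying the transformation
log(log(1))                      # -Inf, because log(1) is 0
suppressWarnings(log(log(0.5)))  # NaN, because log(0.5) is negative
```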
## `network` R package {#networkusecase}

Using the `seed_disperse_beehler` sample dataset, we first convert it to a `network` object with `as_network`.
This allows us to take advantage of the large set of tools already built in the Statnet suite of R packages, specifically the network package.
Although we first use ggnetwork to rapidly visualize the nodes and edges, we also show how to visualize edge attributes toward the end of the code chunk.
```r
# convert using ICON function
converted <- as_network(seed_disperse_beehler)

# plot with ggnetwork
ggplot(converted, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(alpha = 0.25) +
  geom_nodes() +
  theme_blank()

# are there any edge attributes in `seed_disperse_beehler`?
# YES, we have the "Frequency" edge attribute (see third column name)
str(seed_disperse_beehler)

# is this edge attribute also present in the converted network?
# YES, let's plot it in the next network visualization (see end of output)
print(converted)

# plot with log(log(Frequency)) as an edge attribute (edge color)
ggplot(converted, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(aes(color = log(log(Frequency)))) +
  geom_nodes() +
  theme_blank()
```
Of course, even with the edge attribute, we are only scratching the surface of the visualization capability provided. More details can be found in the documentation of the appropriate packages.
## `igraph` R package {#igraphusecase}

Using the `seed_disperse_beehler` sample dataset, we first convert it to an `igraph` object with `as_igraph`.
This allows us to take advantage of the large set of tools already built in the igraph library.
The igraph R package allows us to analyze complex networks with built-in functions and visualize them with the `igraph::plot.igraph` method for the `base::plot` S3 generic.
```r
# convert using ICON function
converted <- as_igraph(seed_disperse_beehler)

# look at edges in converted network
E(converted)

# peek at edge weights
head(E(converted)$weight)

# visualize with the igraph::plot.igraph method
plot(converted, vertex.label = NA, vertex.size = 5)
```
As with the previous use case, we have only scratched the surface of visualization possibilities.
More details can be found in the `igraph` package documentation.
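How `ICON::as_igraph` performs the conversion internally is not described in this report. As a rough illustration of why the edge list format maps naturally onto igraph, the sketch below uses `igraph::graph_from_data_frame`, which treats the first two columns of a data frame as edge endpoints and any remaining columns as edge attributes. The toy data and the choice of an undirected graph are assumptions made for illustration.

```r
# Toy edge list in the documented format (not an ICON dataset)
edges <- data.frame(
  from      = c("a", "a", "b"),
  to        = c("b", "c", "c"),
  Frequency = c(3, 1, 7)
)

# Guarded so the chunk is skipped cleanly when igraph is not installed
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = FALSE)
  n_vertices <- igraph::vcount(g)
  n_edges    <- igraph::ecount(g)
  attr_names <- igraph::edge_attr_names(g)
  print(c(vertices = n_vertices, edges = n_edges))
  print(attr_names)  # attribute columns are carried over by name
}
```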
We have introduced the `ICON` R package, explained its potential use as a network corpus, and demonstrated its compatibility with existing complex network software.
With time, we hope that `ICON`'s corpus will grow and encourage users to contribute complex network datasets by following steps in the package's contributing guidelines and adhering to the code of conduct,[@covenant] both of which can be found on the package's GitHub repository.
More details about the `ICON` R package can be found at the package website (https://rrrlw.github.io/ICON/), GitHub repository (https://github.com/rrrlw/ICON), and CRAN page (https://CRAN.R-project.org/package=ICON).
The authors thank Aaron Clauset, PhD and the members of his research group at the University of Colorado Boulder for advice and for their tireless efforts in creating the Colorado Index of Complex Networks.