
https://www.sciencedirect.com/science/article/pii/S1751157717304108

We created the co-mention network of all identified R packages based on our data. A basic illustration of how this network is constructed is shown in Fig. 1: if any two packages are mentioned in the same paper, even if they are not present in the same sentence, they are treated as co-mentioned and are connected by an undirected link. The links between each pair of packages were summed across papers to give the edge weight between the two packages.
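The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `build_comention_edges` and the toy data are hypothetical, and counting each package at most once per paper is an assumption.

```python
from collections import Counter
from itertools import combinations

def build_comention_edges(papers):
    """Return a Counter mapping sorted package pairs to co-mention weight."""
    edges = Counter()
    for packages in papers:
        # Assumption: a package counts once per paper, however often it appears.
        unique = sorted(set(packages))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical toy data: each inner list holds the packages found in one paper.
papers = [
    ["ggplot2", "lme4", "vegan"],
    ["ggplot2", "lme4"],
    ["limma"],
    ["vegan", "ggplot2", "ggplot2"],
]
edges = build_comention_edges(papers)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

A paper that mentions only one package, such as the third paper here, contributes no link, which is why only papers with at least two packages enter the network.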

Based on the established network, a two-step network analysis was performed. In the first step, we used a few centrality measures, including degree centrality, betweenness centrality, and PageRank, to understand individual R packages’ role in connecting with one another. We used the R package “igraph” (Csardi & Nepusz, 2006) to calculate the centrality of the network.

According to the popular interpretations offered by Hanneman & Riddle (2005), degree centrality is the total number of connections an actor has. It is the simplest and most direct indicator of the importance of a node in the network. Betweenness centrality, on the other hand, is the extent to which an actor lies between other pairs of actors in the same network. In his classic interpretation of this concept, Freeman (1978) defined betweenness as an “index of the potential of a point for control of communication” (p. 224). Originally proposed by Brin and Page (1998), PageRank is an indicator of the importance of a node obtained by calculating the quality of its inlinks. It was designed for directed networks but has since been applied to undirected networks in various studies (Grolmusz, 2015; Iván and Grolmusz, 2010; Perra and Fortunato, 2008).
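To make the first two definitions concrete, the sketch below computes degree and betweenness centrality for a small unweighted, undirected graph stored as a dictionary of neighbour sets. It is illustrative only: the study used the R package igraph, whereas this is a plain-Python version of Brandes' shortest-path algorithm, and the function names and toy graph are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """Number of distinct neighbours of each node."""
    return {node: len(neigh) for node, neigh in adj.items()}

def betweenness_centrality(adj):
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse breadth-first order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was visited from both of its endpoints.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical toy graph: "b" is linked to "a", "c", and "d".
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
print(degree["b"], betweenness["b"])
```

In this toy graph, "b" lies on the only path between each of the three pairs of outer nodes, so it has both the highest degree and the highest betweenness.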

In the second step, we employed a modularity-based clustering technique as implemented in version 1.6.5 of VOSviewer (Van Eck & Waltman, 2010) to understand the stratification and grouping characteristics of the co-mention network. VOSviewer is based on the technique of visualization of similarities, which depicts the distance between two objects as the representation of their similarities (Van Eck & Waltman, 2007); as a result, objects more similar to each other are clustered closer in the graph.
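For intuition about what a modularity-based method optimizes, the sketch below evaluates the standard Newman modularity Q of a given partition of a weighted, undirected graph. This is an assumption-laden illustration: VOSviewer uses its own variant of the quality function and its own optimization procedure, neither of which is reproduced here, and the function name and toy data are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of a weighted undirected graph.

    edges: dict mapping (u, v) pairs to weights, each pair listed once.
    community: dict mapping each node to its community label.
    """
    total = sum(edges.values())
    internal = {}   # edge weight falling inside each community
    strength = {}   # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - (strength[c] / (2.0 * total)) ** 2
        for c in strength
    )

# Hypothetical toy graph: two triangles joined by a single link (3-4).
edges = {
    (1, 2): 1, (1, 3): 1, (2, 3): 1,
    (4, 5): 1, (4, 6): 1, (5, 6): 1,
    (3, 4): 1,
}
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
merged = {node: "A" for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(round(q_split, 4), round(q_merged, 4))
```

Separating the two triangles scores higher than placing all nodes in one cluster, which is the kind of preference a modularity-based clustering expresses.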

4. Results

4.1. Overview of the papers and packages

Of the 13,684 papers we analyzed, 7463 papers (54.5%) mention at least one R package. This share is similar to that found in our prior research (Li et al., 2017), wherein 223 out of 391 papers (57%) mentioned at least one R package. Within these 7463 papers, there are 14,310 package cases of 1838 unique packages mentioned or cited. The average number of package cases per paper (1.92) is also very similar to the result reported in our previous study (1.85).
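The two summary figures in this paragraph follow directly from the reported counts, as the short calculation below shows. The variable names are ours; the numbers are those given in the text.

```python
papers_total = 13_684         # papers analyzed
papers_with_package = 7_463   # papers mentioning at least one R package
package_cases = 14_310        # package cases found in those papers

share_with_package = papers_with_package / papers_total
cases_per_paper = package_cases / papers_with_package

print(f"{share_with_package:.1%}")
print(f"{cases_per_paper:.2f}")
```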

We followed the general approach of our previous study in classifying all PLoS papers based on publication year. Given the relatively small number of papers published before 2012, all the papers up to 2011 are categorized into a single group. Every year from 2012 onward has its own category. Fig. 2 summarizes the total number of papers, the percentage of papers in which any package is identified, and the mean number of packages in papers with any package identified in each group. As shown in this graph, the latter two numbers have increased during the past few years: in line with our previous analysis of a smaller sample of PLoS papers (Li et al., 2017), these results suggest the growing impact of R packages in PLoS journals.
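The grouping rule and the three quantities summarized in Fig. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy records are hypothetical, and each paper is reduced to a pair of publication year and number of packages identified.

```python
from collections import defaultdict

def year_group(year):
    """Pool every year up to 2011; later years keep their own group."""
    return "up to 2011" if year <= 2011 else str(year)

def summarize_by_group(papers):
    """papers: iterable of (publication_year, number_of_packages) tuples."""
    counts = defaultdict(list)
    for year, n_packages in papers:
        counts[year_group(year)].append(n_packages)
    summary = {}
    for group, values in counts.items():
        with_package = [n for n in values if n > 0]
        summary[group] = {
            "papers": len(values),
            "share_with_package": len(with_package) / len(values),
            # Mean is taken only over papers with at least one package.
            "mean_packages": (
                sum(with_package) / len(with_package) if with_package else 0.0
            ),
        }
    return summary

# Hypothetical toy records, not data from the study.
papers = [(2008, 0), (2010, 2), (2011, 1), (2012, 0), (2012, 3), (2013, 1)]
summary = summarize_by_group(papers)
for group, stats in summary.items():
    print(group, stats)
```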

It is worth mentioning that, as in our previous study (Li et al., 2017), we identified a strong imbalance among knowledge domains in our dataset. “Biology and life sciences” dominates the papers we collected, with 96.7% of papers falling into this category. “Medicine and health sciences” (60.3%) and “Research and analysis methods” (51.3%) are the only other categories that cover more than 50% of papers. Because a paper can be assigned to more than one category, these percentages sum to more than 100%.

Moreover, we examined the share of package cases that are accompanied by a reference. In total, 10,084 package cases were found to have a reference, accounting for 70.5% of all 14,310 package cases. This number is very similar to that in our previous study (72.1%). We also tested whether this ratio changes over time. As shown in Fig. 3, based on the larger sample of papers collected in this study, the percentage of papers with a reference has increased only slightly during the past few years, from 63.5% for papers published up to 2011 to 74.6% in 2017. This gradual upward trend is also broadly similar to the result of our previous study (Li et al., 2017).

Table 1 presents the frequencies of the top 10 packages in terms of total frequency and their relative sizes compared to the total number of package mentions in all analyzed papers.

Fig. 4 shows the change in the relative sizes of the top 10 packages shown in Table 1. Most of these packages have been mentioned at a relatively stable rate throughout the history of PLoS. One clear exception is the package “limma”: even though it is still an important R package used in scientific studies, its relative size has decreased dramatically since the early years of PLoS journals. Note that the pattern of limma is not unique among packages from Bioconductor. The second most frequently used Bioconductor package, “affy,” has followed a similar trend: its relative size has dropped from 2.9% before 2012 to 0.4% in 2017. On the other hand, both “ggplot2” and “MuMIn” have become significantly more popular in PLoS papers during the same period. The reasons for such radical changes, however, require further investigation.

4.2. The network of package co-mentions and centrality measures

Fig. 5 illustrates the distribution of papers with a given number of packages. In total, we found 3410 papers (24.9% of all papers examined) mentioning at least two packages. These papers are the basis for the network analysis reported in this section.

Overall, we identified 14,615 co-mention pairs among 1612 unique R packages. Of these, 1576 packages (97.8% of those co-mentioned with any other package) are interconnected with each other. Plotting these as nodes in a network graph using VOSviewer (Fig. 6) allows us to see and interpret their clustering behavior. In VOSviewer, we used the clustering function with 50 iterations and a minimum cluster size of 10. Moreover, for the sake of readability, we elected to show only the top 100 links between all nodes. In total, 21 clusters are identified by VOSviewer, each of which is marked in a distinct color. Fig. 6 also highlights the five largest clusters identified by VOSviewer in terms of total number of nodes. These are discussed in greater detail below.
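One way to read the statement that most packages are interconnected is that they belong to the largest connected component of the network; that reading is our assumption, not a detail given in the text. Under it, the component and its share of nodes can be found as in the sketch below, where the function name and toy graph are hypothetical.

```python
def connected_components(adj):
    """Return the components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy graph with three components.
adj = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
components = connected_components(adj)
largest_share = len(components[0]) / len(adj)
print(sorted(components[0]), f"{largest_share:.1%}")
```

Applied to the figures reported above, the same ratio is 1576 / 1612, or about 97.8%.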

The five largest clusters shown above are analyzed further in Table 2, which lists the titles and descriptions (taken from their repositories) of the top five packages in each cluster in terms of the total number of links. These clusters can be partly explained by the functions and disciplines of the packages. For example, the top four packages of the first cluster (“dna”, “sequences”, “edgeR”, and “diveRsity”) are related to genetic analysis. However, this cluster might also be subject to misidentification as discussed in the Methods section: “dna” and “sequences” are the two terms found to be most frequently misidentified in our posttest. The third cluster is also related to DNA sequencing analysis; here, however, most of the top packages are from Bioconductor rather than from CRAN. Both clusters can be explained by the fact that most of the papers we analyzed belong to biology and life sciences. The second, fourth, and fifth clusters largely correspond to the functions of linear mixed-effects modeling, data analysis, and ecological data analysis, respectively.

Fig. 7 shows the distributions of degree centrality, betweenness centrality, and PageRank for all packages in the network. Consistent with findings on author co-citation analysis (e.g., Yan & Ding, 2009), all three measures are bound by power laws. Based on the context of this co-mention network and the definitions of the three measures discussed in the Methods section, we operationalized degree centrality as the frequency with which an R package is co-mentioned with any other R package, betweenness centrality as the level to which an R package is co-mentioned with any other distinct R package, and PageRank as the level of total importance (in terms of the total number of links, in the case of an undirected network) of all R packages that are co-mentioned with a specific package.
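To illustrate how PageRank behaves on an undirected, weighted network of this kind, the sketch below runs a plain power iteration in which each link is treated as pointing in both directions. It is not the igraph routine used in the study: the function name and toy weights are hypothetical, and the damping factor of 0.85 is an assumed, commonly used default.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a symmetric weighted graph.

    weights: dict mapping node -> dict of neighbour -> link weight.
    """
    nodes = list(weights)
    n = len(nodes)
    strength = {v: sum(weights[v].values()) for v in nodes}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if strength[v] == 0)
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if strength[v] == 0:
                continue
            # A node passes its rank to neighbours in proportion to weight.
            for w, weight in weights[v].items():
                new[w] += damping * rank[v] * weight / strength[v]
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:
            break
    return rank

# Hypothetical toy network: "b" is co-mentioned twice with "a",
# and once each with "c" and "d".
weights = {
    "a": {"b": 2},
    "b": {"a": 2, "c": 1, "d": 1},
    "c": {"b": 1},
    "d": {"b": 1},
}
ranks = pagerank(weights)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

The hub "b" receives the highest score, and "a" outranks "c" and "d" because its link to the hub carries more weight.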

A correlation test was conducted between the total count of package mentions and the three centrality measures adopted in this analysis. All four measures are strongly positively correlated with each other, with the value of r ranging from 0.83 (count-degree) to 0.99 (degree-PageRank). This result suggests a strong consistency among all these indicators of the importance of an R package.
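The text does not state which correlation coefficient was used; assuming the reported r is Pearson's, the coefficient for one pair of measures can be computed as in this sketch. The function name and the toy vectors are hypothetical and are not values from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values for four packages.
mention_count = [10, 8, 5, 2]
degree = [9, 9, 4, 1]
r_count_degree = pearson_r(mention_count, degree)
print(round(r_count_degree, 2))
```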

Table 3 shows the three centrality measures as well as the rankings in all these measures of the top 20 packages in terms of total count (the top 10 of which were shown in Table 1). This table lends further support to the conclusion that our measures for the top R packages are relatively consistent with each other.

The measures are not perfectly consistent, however; discernibly different patterns exist among the packages. For example, “ggplot2”, “dna”, and “igraph” are the three packages whose centrality rankings are higher than their rankings by total count, suggesting that these packages are more likely to be co-mentioned with other R packages. Among these three, “ggplot2” and “igraph” are the only two packages in the top 20 list that are dedicated to visualization tasks. On the other hand, a few packages have significantly lower rankings in centrality measures than in total frequency. Many of these packages, including “nlme”, “mgcv”, and “multcomp”, are used for modeling-related tasks. The connection between a package’s function or discipline and the measures of its importance requires further study.
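The comparison of rankings described here can be sketched by ranking packages on each measure and taking the difference. The function name and all values below are hypothetical and are not taken from Table 3; ties, if any, are broken by input order.

```python
def rank_positions(values):
    """Map each name to its rank (1 = largest value)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

# Hypothetical counts and degree centralities for four packages.
counts = {"ggplot2": 50, "nlme": 60, "igraph": 20, "lme4": 90}
degree = {"ggplot2": 300, "nlme": 80, "igraph": 120, "lme4": 250}

count_rank = rank_positions(counts)
degree_rank = rank_positions(degree)

# Positive shift: the package ranks higher on centrality than on count.
shift = {name: count_rank[name] - degree_rank[name] for name in counts}
for name, value in sorted(shift.items(), key=lambda item: -item[1]):
    print(name, value)
```

A positive shift corresponds to the pattern noted for the visualization packages, and a negative shift to the pattern noted for the modeling packages.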


