knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, tidy = TRUE)
NAILS performs statistical and Social Network Analysis (SNA) on citation data. SNA offers researchers a way to map large datasets and gain insights from new angles by analyzing the connections between articles. As the number of publications in any given field grows, automated tools for this sort of analysis become increasingly important before starting research in a new field. NAILS also provides useful data when performing Systematic Mapping Studies (SMS) of scientific literature. According to Kitchenham et al., performing an SMS is especially suitable when few literature reviews have been done on the topic and there is a need to get a general overview of the field of interest.
The nails package provides functionality for parsing Web of Science data for quantitative Systematic Mapping Study analysis, along with a series of custom statistical and network analysis functions that give the user an overview of a literature dataset. The features can be divided into two primary groups. First, statistical analysis, which for example gives an overview of publication frequencies and the most published authors and journals. Second, the more novel network analysis, which gives further insight into the relationships between interlinked citations and the cooperation between authors; for example, citation network analysis can be used to identify the most cited authors and publication forums. Lastly, the package provides a few convenience functions that use the topicmodels and stm packages to create Latent Dirichlet Allocation (LDA) based topic models.
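As a quick orientation, a minimal end-to-end workflow using these functions might look like the sketch below; each step is demonstrated in detail in the following sections, and the data path is a placeholder.

# A minimal workflow sketch; each step is demonstrated in detail below.
# Assumes the nails package is installed (within the package source tree,
# devtools::load_all() can be used instead of library(nails)).
library(nails)
literature <- read_wos_data("path/to/wos_export_folder")   # placeholder path to exported records
literature <- clean_wos_data(literature)                    # clean and normalize the records
author_network <- get_author_network(literature)            # author collaboration network
citation_network <- get_citation_network(literature)        # citation network
topicmodel <- build_topicmodel_from_literature(literature)  # LDA topic model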
For further details see the following article: Knutas, A., Hajikhani, A., Salminen, J., Ikonen, J., Porras, J., 2015. Cloud-Based Bibliometric Analysis Service for Systematic Mapping Studies. CompSysTech 2015.
In this section we present how to load Web of Science data using the nails package functions and then how to create an example report using ggplot2-based visualizations.
Below is an example of how data exported from Web of Science can be loaded and parsed using the nails package functions.
# Setup
# Load packages
devtools::load_all()
require(ggplot2)

# Set ggplot theme
theme_set(theme_minimal(12))

# Load data
literature <- read_wos_data("../tests/testthat/test_data")

# Clean data
literature <- clean_wos_data(literature)
Below we present how to generate an example report using nails calls, and then how to use ggplot2 and knitr to generate visual reports.
This report provides an analysis of the records downloaded from Web of Science. The analysis identifies the important authors, journals, and keywords in the dataset based on the number of occurrences and citation counts. A citation network of the provided records is created and used to identify the important papers according to their in-degree, total citation count and PageRank scores. The analysis also finds often-cited references that were not included in the original dataset downloaded from the Web of Science.
Reports can also be generated using the online analysis service, whose source code is available on GitHub. Instructions and links to tutorial videos can be found on the project page. Please consider citing our research paper on bibliometrics if you publish the analysis results.
# Setup
# Load packages
devtools::load_all()
require(ggplot2)

# Set ggplot theme
theme_set(theme_minimal(12))
The analysed dataset, loaded in section "loading data", consists of `r nrow(literature)` records with `r ncol(literature)` variables. More information about the variables can be found on the Web of Science website.
ggplot(literature, aes(YearPublished)) +
    geom_histogram(binwidth = 1, fill = "darkgreen") +
    ggtitle("Year published") +
    xlab("Year") +
    ylab("Article count")

# Calculate relative publication counts
# yearDF <- as.data.frame(table(literature$YearPublished))
# names(yearDF) <- c("Year", "Freq")  # Fix column names
# Merge to dataframe of total publication numbers (years)
# yearDF <- merge(yearDF, years, by.x = "Year", by.y = "Year", all.x = TRUE)
# yearDF$Year <- as.numeric(as.character(yearDF$Year))  # factor to numeric
# Calculate published articles per total articles by year
# yearDF$Fraction <- yearDF$Freq / yearDF$Records
# ADD PLOT HERE!
print("Placeholder")
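The commented-out calculation above requires total yearly publication counts from Web of Science, which are not shipped with the package. A minimal sketch of how the relative plot could be produced, assuming a hypothetical data frame years with columns Year and Records holding those totals, is shown below:

# Hypothetical input: 'years' with columns Year and Records, holding the total
# number of Web of Science records per year (not provided by the nails package).
if (exists("years")) {
    yearDF <- as.data.frame(table(literature$YearPublished))
    names(yearDF) <- c("Year", "Freq")
    yearDF$Year <- as.numeric(as.character(yearDF$Year))  # factor to numeric before merging
    yearDF <- merge(yearDF, years, by = "Year", all.x = TRUE)
    yearDF$Fraction <- yearDF$Freq / yearDF$Records
    ggplot(yearDF, aes(Year, Fraction)) +
        geom_line(color = "darkgreen") +
        ggtitle("Share of all records published per year") +
        xlab("Year") +
        ylab("Fraction of records")
}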
Sorted by the number of articles published and by the total number of citations.
# Get author network nodes, which contain the required information
author_network <- get_author_network(literature)
author_nodes <- author_network$author_nodes

# Change Id to AuthorFullName
names(author_nodes)[names(author_nodes) == "Id"] <- "AuthorFullName"

# Sort by number of articles by author
author_nodes <- author_nodes[with(author_nodes, order(-Freq)), ]

# Re-order factor levels
author_nodes <- transform(author_nodes,
                          AuthorFullName = reorder(AuthorFullName, Freq))

ggplot(head(author_nodes, 25), aes(AuthorFullName, Freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    coord_flip() +
    ggtitle("Productive authors") +
    xlab("Author") +
    ylab("Number of articles")
# Reorder AuthorFullName factor according to TotalTimesCited (decreasing order)
author_nodes <- transform(author_nodes,
                          AuthorFullName = reorder(AuthorFullName, TotalTimesCited))

# Sort by total citations by author
author_nodes <- author_nodes[with(author_nodes, order(-TotalTimesCited)), ]

ggplot(head(author_nodes, 25), aes(AuthorFullName, TotalTimesCited)) +
    geom_bar(stat = "identity", fill = "blue") +
    coord_flip() +
    ggtitle("Most cited authors") +
    xlab("Author") +
    ylab("Total times cited")
Sorted by number of published articles in the dataset and by the total number of citations.
# Calculate publication occurrences
publications <- as.data.frame(table(literature$PublicationName))

# Fix names
names(publications) <- c("Publication", "Count")

# Trim publication name to maximum of 50 characters for displaying in plot
publications$Publication <- strtrim(publications$Publication, 50)

# Sort descending
publications <- publications[with(publications, order(-Count)), ]

# Reorder factor levels
publications <- transform(publications, Publication = reorder(Publication, Count))

ggplot(head(publications, 25), aes(Publication, Count)) +
    geom_bar(stat = "identity", fill = "orange") +
    coord_flip() +
    theme(legend.position = "none") +
    ggtitle("Most popular publications") +
    xlab("Publication") +
    ylab("Article count")
# Calculate total citations for each publication
citation_sums <- aggregate(literature$TimesCited,
                           by = list(PublicationName = literature$PublicationName),
                           FUN = sum, na.rm = TRUE)

# Fix column names
names(citation_sums) <- c("PublicationName", "PublicationTotalCitations")

# Trim publication name to maximum of 50 characters for displaying in plot
citation_sums$PublicationName <- strtrim(citation_sums$PublicationName, 50)

# Sort descending and reorder factor levels accordingly
citation_sums <- citation_sums[with(citation_sums, order(-PublicationTotalCitations)), ]
citation_sums <- transform(citation_sums,
                           PublicationName = reorder(PublicationName, PublicationTotalCitations))

ggplot(head(citation_sums, 25), aes(PublicationName, PublicationTotalCitations)) +
    geom_bar(stat = "identity", fill = "orange") +
    coord_flip() +
    theme(legend.position = "none") +
    ggtitle("Most cited publications") +
    xlab("Publication") +
    ylab("Total times cited")
Sorted by the number of articles where the keyword is mentioned and by the total number of citations for the keyword.
# Calculate total citations for each keyword
literature_by_keywords <- arrange_by(literature, "AuthorKeywords")

# Sometimes the AuthorKeywords column is empty.
# The following if-else prevents crashing in those situations,
# either by using the KeywordsPlus column or skipping the keyword analysis.
if (nrow(literature_by_keywords) == 0) {
    cat("No keywords.")
} else {
    keyword_citation_sum <- aggregate(literature_by_keywords$TimesCited,
                                      by = list(AuthorKeywords = literature_by_keywords$AuthorKeywords),
                                      FUN = sum, na.rm = TRUE)
    names(keyword_citation_sum) <- c("AuthorKeywords", "TotalTimesCited")

    # Split and count the individual keywords
    keywords <- unlist(strsplit(literature$AuthorKeywords, ";"))
    keywords <- trim(keywords)
    keywords <- as.data.frame(table(keywords))
    names(keywords) <- c("AuthorKeywords", "Freq")

    # Merge with citation sums, sort descending and reorder factor levels
    keywords <- merge(keywords, keyword_citation_sum, by = "AuthorKeywords")
    keywords <- keywords[with(keywords, order(-Freq)), ]
    keywords <- transform(keywords, AuthorKeywords = reorder(AuthorKeywords, Freq))

    ggplot(head(keywords, 25), aes(AuthorKeywords, Freq)) +
        geom_bar(stat = "identity", fill = "purple") +
        coord_flip() +
        ggtitle("Popular keywords") +
        xlab("Keyword") +
        ylab("Number of occurrences")
}
if (nrow(literature_by_keywords) > 0) {
    keywords <- keywords[with(keywords, order(-TotalTimesCited)), ]
    keywords <- transform(keywords,
                          AuthorKeywords = reorder(AuthorKeywords, TotalTimesCited))

    ggplot(head(keywords, 25), aes(AuthorKeywords, TotalTimesCited)) +
        geom_bar(stat = "identity", fill = "purple") +
        coord_flip() +
        ggtitle("Most cited keywords") +
        xlab("Keyword") +
        ylab("Total times cited")
}
The most important papers and other sources are identified below using three importance measures: 1) in-degree in the citation network, 2) citation count provided by Web of Science (only available for papers included in the dataset), and 3) PageRank score in the citation network. The 25 highest-scoring papers are identified using each of these measures separately. The results are then combined and duplicates are removed. Results are sorted by in-degree, with ties broken first by citation count and then by PageRank.
When a Digital Object Identifier (DOI) is available, the full paper can be found using the Resolve DOI website.
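As a small illustration (this is not part of the nails report itself), resolver links can be constructed directly from the DOI field of the cleaned records. The sketch below assumes the column is named DOI; the exact name may differ depending on how the data were parsed.

# Assumption: the cleaned literature data contains a column named "DOI".
if ("DOI" %in% names(literature)) {
    doi_links <- ifelse(!is.na(literature$DOI) & literature$DOI != "",
                        paste0("https://doi.org/", literature$DOI),
                        NA)
    head(doi_links)
}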
# Extract citation nodes
citation_network <- get_citation_network(literature)
citation_nodes <- citation_network$citation_nodes

# Extract the articles included in the dataset and the articles not included
# in the dataset
citations_lit <- citation_nodes[citation_nodes$Origin == "literature", ]
citations_ref <- citation_nodes[citation_nodes$Origin == "reference", ]

# Create article strings (document title, reference information and abstract
# separated by "|")
citations_lit$Article <- paste(toupper(citations_lit$DocumentTitle), " | ",
                               citations_lit$FullReference, " | ",
                               citations_lit$Abstract)
These papers were included in the `r nrow(literature)` records downloaded from the Web of Science.
# Sort citations_lit by TimesCited, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-TimesCited)), ]

# Extract top 25
top_lit <- head(citations_lit, 25)

# Sort by InDegree, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-InDegree)), ]

# Add to list of top 25 most cited papers
top_lit <- rbind(top_lit, head(citations_lit, 25))

# Sort by PageRank, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-PageRank)), ]

# Add to list of most cited and highest InDegree papers
top_lit <- rbind(top_lit, head(citations_lit, 25))

# Remove duplicates
top_lit <- top_lit[!duplicated(top_lit[, "FullReference"]), ]

# Sort top_lit by InDegree, break ties by TimesCited, then PageRank
top_lit <- top_lit[with(top_lit, order(-InDegree, -TimesCited, -PageRank)), ]

# Print list
knitr::kable(top_lit[, c("Article", "InDegree", "TimesCited", "PageRank")])
These papers and other references were not among the `r nrow(literature)` records downloaded from the Web of Science.
# Sort citations_ref by InDegree, decreasing
citations_ref <- citations_ref[with(citations_ref, order(-InDegree)), ]

# Extract top 25
top_ref <- head(citations_ref, 25)

# Sort by PageRank, decreasing
citations_ref <- citations_ref[with(citations_ref, order(-PageRank)), ]

# Add to list of highest in-degree papers (references)
top_ref <- rbind(top_ref, head(citations_ref, 25))

# Remove duplicates
top_ref <- top_ref[!duplicated(top_ref[, "FullReference"]), ]

# Sort by InDegree, break ties by PageRank
top_ref <- top_ref[with(top_ref, order(-InDegree, -PageRank)), ]

# Print results
knitr::kable(top_ref[, c("FullReference", "InDegree", "PageRank")])
# Split the cited references field into individual references
references <- unlist(strsplit(literature$CitedReferences, ";"))

# Extract the publication name (the third comma-separated field) from a reference
get_publication <- function(x) {
    publication <- "Not found"
    try(publication <- unlist(strsplit(x, ","))[[3]], silent = TRUE)
    return(publication)
}

refPublications <- sapply(references, get_publication)
refPublications <- sapply(refPublications, trim)
refPublications <- refPublications[refPublications != "Not found"]

# Count occurrences per publication
refPublications <- as.data.frame(table(refPublications))
names(refPublications) <- c("Publication", "Count")

# Sort descending and reorder factor levels
refPublications <- refPublications[with(refPublications, order(-Count)), ]
refPublications <- transform(refPublications, Publication = reorder(Publication, Count))

ggplot(head(refPublications, 25), aes(Publication, Count)) +
    geom_bar(stat = "identity", fill = "orange") +
    coord_flip() +
    theme(legend.position = "none") +
    ggtitle("Most referenced publications") +
    xlab("Publication") +
    ylab("Count")
Topic modeling is a type of statistical text mining method for discovering common "topics" that occur in a collection of documents. A topic modeling algorithm essentially looks through the abstracts included in the dataset for clusters of co-occurring words and groups them together based on their similarity.
The table below describes each topic detected using LDA topic modeling: each column lists the ten most characteristic words of one topic.
You can specify K, the number of topics, when calling build_topicmodel_from_literature(literature, K). If K is left unspecified, the stm::searchK function is used to estimate the number of topics; for performance reasons the search range is limited to between 4 and 12 topics. The number of topics is estimated using the structural topic model library's semantic coherence diagnostic values. The raw values are written to the output file kqualityvalues.csv and can be interpreted with the help of the stm documentation if necessary (see section 3.4).
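For example, a model with a fixed number of topics could be requested as in the sketch below; six topics is an arbitrary choice for illustration.

# Fit the topic model with an explicitly chosen number of topics (K = 6, arbitrary)
topicmodel_fixed <- build_topicmodel_from_literature(literature, 6)
topicmodels::terms(topicmodel_fixed$fit, 10)  # ten most characteristic words per topic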
The analysis below creates the topic model using the convenience functions and then prints the ten most descriptive words for each discovered topic. See the topicmodels documentation on the TopicModel class for further information, and the documentation of build_topicmodel_from_literature for instructions on how to use the rest of the data the convenience function provides.
topicmodel <- build_topicmodel_from_literature(literature)
topickeywords <- topicmodels::terms(topicmodel$fit, 10)
tw <- data.frame(topickeywords)
colnames(tw) <- gsub('X', 'Topic ', colnames(tw))
knitr::kable(tw, col.names = colnames(tw))
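Beyond the keyword table, the fitted model in topicmodel$fit can be explored further. As one example (a sketch, not part of the standard report), the most likely topic for each abstract can be extracted with topicmodels::topics:

# Assign each document its single most likely topic and tabulate the result
doc_topics <- topicmodels::topics(topicmodel$fit, 1)
table(doc_topics)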