In CrumpLab/RsemanticLibrarian: Helper Functions for the Semantic Librarian

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

library(RsemanticLibrarian)

This tutorial describes how to use the Semantic Librarian functions. Presently, this package uses a corpus of abstracts from the APA database. A small sample of the required data files is automatically loaded with the package. These include:

article_df: a data frame containing abstract information (100 abstracts)
article_vectors: a matrix containing the semantic vector for each article. Each row is the vector for one abstract, and the rows correspond to the row entries in article_df
author_list: a vector containing the first 100 author names
dictionary_words: a vector containing 100 words, far fewer than the full dictionary
WordVectors: a matrix containing the semantic vector for each word in dictionary_words
AuthorVectors: a matrix of semantic vectors for each author

The entire data set can be downloaded from the GitHub repository for the R Shiny Semantic Librarian app: https://github.com/CrumpLab/SemanticLibrarian

Searching the Semantic Librarian

Create some search terms:

search <- "president of psychology"

Check the dictionary

The get_search_terms() function converts the words in the search string to lower case, removes special characters, and keeps only the words that are in the dictionary. Search terms must be in the dictionary because the search relies on the existing semantic vectors for those words.

search <- get_search_terms(search,dictionary_words)
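The cleaning steps described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: normalize_query() and toy_dictionary are hypothetical names, and the real function may handle punctuation or multi-word input differently.

```r
# Hypothetical helper illustrating the cleaning steps; this is NOT the
# package's get_search_terms() implementation.
normalize_query <- function(query, dictionary) {
  cleaned <- tolower(query)                 # lower-case the whole string
  cleaned <- gsub("[^a-z ]", " ", cleaned)  # replace non-letters with spaces
  words <- strsplit(cleaned, " +")[[1]]     # split on runs of spaces
  words <- words[nzchar(words)]             # drop empty strings
  words[words %in% dictionary]              # keep only dictionary words
}

# Made-up dictionary for the example
toy_dictionary <- c("president", "psychology", "memory")
normalize_query("President of Psychology!", toy_dictionary)
```

With this toy dictionary, "of" is dropped because it is not a dictionary word, so only "president" and "psychology" remain.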

Compute similarity of search terms to articles

The get_search_article_similarities() function returns the cosine similarity between the search terms and all of the abstracts in the data set. The query_type argument can be set to 1 for a compound search, 2 for an AND search, or 3 for an OR search.

article_sims <- get_search_article_similarities(search_terms = search,
                                                d_words = dictionary_words,
                                                w_vectors = WordVectors,
                                                a_vectors = article_vectors,
                                                a_df = article_df,
                                                query_type = 1)

knitr::kable(head(article_sims), format="html", escape=FALSE)
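To make the cosine similarity step concrete, here is a minimal base R sketch on made-up data. It assumes that a compound query is represented as the sum of its word vectors; that is an assumption for illustration, not something confirmed by the package documentation. The names cosine_to_rows(), toy_word_vectors and toy_article_vectors are hypothetical.

```r
# Toy data: 3 words and 4 articles in a 5-dimensional space (values made up).
set.seed(1)
toy_word_vectors <- matrix(
  rnorm(15), nrow = 3,
  dimnames = list(c("president", "psychology", "memory"), NULL)
)
toy_article_vectors <- matrix(rnorm(20), nrow = 4)

# Cosine similarity between one vector v and every row of a matrix m.
cosine_to_rows <- function(v, m) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Assumption: a compound query is the sum of its word vectors.
query_vector <- colSums(
  toy_word_vectors[c("president", "psychology"), , drop = FALSE]
)
toy_sims <- cosine_to_rows(query_vector, toy_article_vectors)

# Article row indices, most similar first
order(toy_sims, decreasing = TRUE)
```

Each value in toy_sims lies between -1 and 1, and sorting by it gives the ranking that a results table would show.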

Plot similar articles using multi-dimensional scaling

The get_mds_article_fits() function returns the most similar articles, up to the number specified by num_articles. It also conducts multi-dimensional scaling to produce a renderable 2-D semantic space, with article positions given as x, y coordinates. Finally, the num_clusters argument sets the number of clusters for a k-means cluster analysis. The resulting data frame can be used for plotting, or to produce a table of the top articles.

MDS <- get_mds_article_fits(num_articles = 25, 
                            num_clusters = 3,
                            year_range = c(1900,2000),
                            input_df = article_sims,
                            a_vectors=article_vectors
                            )
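The scaling and clustering steps can be sketched with functions from base R's stats package. This is a rough illustration under stated assumptions, not the package's code: it assumes classical MDS (cmdscale()) on cosine distance and k-means on the resulting 2-D coordinates, and toy_vectors and toy_mds are made-up names.

```r
set.seed(42)  # kmeans() picks random starting centres
toy_vectors <- matrix(rnorm(10 * 6), nrow = 10)  # 10 made-up article vectors

# Pairwise cosine similarity: scale each row to unit length,
# then take the cross-product of the rows.
unit_rows <- toy_vectors / sqrt(rowSums(toy_vectors^2))
cosine_sim <- tcrossprod(unit_rows)

# Classical MDS on cosine distance (1 - similarity), reduced to 2 dimensions.
coords <- cmdscale(as.dist(1 - cosine_sim), k = 2)

# Assumption: k-means is run on the 2-D coordinates; clustering the
# original vectors would be an equally plausible choice.
clusters <- kmeans(coords, centers = 3)$cluster

toy_mds <- data.frame(X = coords[, 1], Y = coords[, 2], cluster = clusters)
head(toy_mds)
```

The result has the same shape as the plotting data used in this tutorial: one row per article, X and Y coordinates, and a cluster label.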

Here is a simple plot of the abstract space using ggplot.

library(ggplot2)
ggplot(MDS, aes(x = X, y = Y, color = as.factor(cluster)))+
  geom_point()

Alternatively, plotly can be used to generate an interactive plot. Here, hovering over a point displays the title of the abstract.

library(plotly)
ggp <- ggplot(MDS, aes(x= X, y= Y,
                       color=as.factor(cluster),
                       text=wrap_title))+
  geom_hline(yintercept=0, color="grey")+
  geom_vline(xintercept=0, color="grey")+
  geom_point(aes(size=Similarity), alpha=.75)+
  theme_void()+
  theme(legend.position = "none")

ax <- list(
  title = "",
  zeroline = TRUE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE
)

p <- ggplotly(ggp, tooltip="text",
              source = "article_SS_plot",
              hoverinfo="text") %>%
  layout(xaxis = ax, yaxis = ax,showlegend = FALSE) %>%
  #style(hoverinfo = 'title') %>%
  config(displayModeBar = FALSE) %>%
  layout(xaxis=list(fixedrange=TRUE)) %>%
  layout(yaxis=list(fixedrange=TRUE))

p$elementId <- NULL
p

Abstract similarity

The get_article_article_similarities() function compares an existing abstract, identified by its title, to all other abstracts in the data set. It returns a new data frame containing all of the abstracts, with a Similarity column giving the similarity between the target article and each other article.

an_article <- article_df[1,]$title
print(as.character(an_article))

sims <- get_article_article_similarities(an_article, article_vectors, article_df)

# look at top 10
knitr::kable(head(sims, 10), format="html", escape=FALSE)

Author Similarity

All of the authors in the database are also coded as semantic vectors. Each author vector is the sum of the abstract vectors for that author. Use get_author_similarities() to compute the similarity between one author and all other authors.

sims <- get_author_similarities(author_list[1])

# top 5 authors
knitr::kable(head(sims, 5), format="html", escape=FALSE)
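The statement that an author vector is the sum of that author's abstract vectors can be illustrated with rowsum() from base R. The data below are made up, and the authorship table is a hypothetical structure used only for this sketch; the package may store authorship differently.

```r
# Made-up data: 4 abstracts in a 3-dimensional space.
toy_abstract_vectors <- rbind(c(1, 0, 2),
                              c(0, 1, 1),
                              c(3, 1, 0),
                              c(1, 1, 1))

# Hypothetical authorship table: one row per (author, abstract) pair,
# so a co-authored abstract appears once for each of its authors.
authorship <- data.frame(author   = c("Ada", "Ada", "Ben", "Ben", "Cy"),
                         abstract = c(1, 2, 2, 3, 4))

# rowsum() adds up the rows of a matrix within each group (here, each author).
toy_author_vectors <- rowsum(toy_abstract_vectors[authorship$abstract, ],
                             group = authorship$author)
toy_author_vectors
```

Here Ada's vector is the sum of abstracts 1 and 2, Ben's is the sum of abstracts 2 and 3, and Cy's is simply abstract 4. Author similarity can then be computed between rows of this matrix in the same way as between article vectors.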

Plot author similarity

Use the get_mds_author_fits() function to create a data frame with x, y coordinates for each author, based on multi-dimensional scaling.

author_sims <- get_author_similarities(author_list[1])
author_MDS <- get_mds_author_fits(15,3,author_sims,AuthorVectors)

Use ggplot2 with ggrepel to view the author similarities in 2-D.

library(ggrepel)
ggplot(author_MDS, aes(x=X, y=Y, color=as.factor(cluster),
                       label=author, shape=selected_author))+
  geom_hline(yintercept=0, color="grey")+
  geom_vline(xintercept=0, color="grey")+
  geom_point(aes(size=Similarity), alpha=.75)+
  geom_text_repel(aes(size=Similarity),color="black", force=1.5)+
  scale_size(range = c(4, 7))+
  theme_void()+
  theme(legend.position = "none")