Reading data from WoS

library(dplyr)
library(scimetrix)
library(tm)
library(topicmodels)
library(ggplot2)
# Path to the example results file shipped with scimetrix
path <- system.file("results.txt", package = "scimetrix")

Use the readWoS function to read a text file downloaded from Web of Science, then apply the mergeOECD function to add OECD subject categories.

papers <- readWoS(path) %>%
  mergeOECD() 

head(papers)
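
For orientation, a Web of Science plain-text export stores each record as lines starting with a two-letter field tag (for example AU, TI, PY, UT), indents continuation lines, and closes each record with ER. The sketch below parses a few invented records in that layout using base R only. It is an illustration of the file format, not the implementation of readWoS; the function name parse_wos and the sample records are hypothetical.

```r
# Invented records in the Web of Science tagged layout (not real papers).
wos_lines <- c(
  "PT J",
  "AU Smith, J",
  "   Doe, A",
  "TI A study of topic models",
  "PY 2015",
  "UT WOS:000000000000001",
  "ER",
  "",
  "PT J",
  "AU Lee, K",
  "TI Another paper",
  "PY 2016",
  "UT WOS:000000000000002",
  "ER"
)

# Hypothetical helper: one row per record, one column per field tag.
parse_wos <- function(lines) {
  records <- list()
  current <- list()
  tag <- NULL
  for (line in lines) {
    if (!nzchar(trimws(line))) next          # skip blank lines
    key <- substr(line, 1, 2)
    if (key == "ER") {                       # end of record
      records[[length(records) + 1]] <- current
      current <- list()
      tag <- NULL
    } else if (key == "  ") {                # continuation of the previous tag
      if (!is.null(tag)) {
        current[[tag]] <- paste(current[[tag]], trimws(line), sep = "; ")
      }
    } else {                                 # new field tag
      tag <- key
      current[[tag]] <- trimws(substring(line, 4))
    }
  }
  tags <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(r) {
    vals <- vapply(tags, function(tg) {
      if (is.null(r[[tg]])) NA_character_ else r[[tg]]
    }, character(1))
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

toy_papers <- parse_wos(wos_lines)
print(toy_papers)
```

Multi-valued fields such as AU are joined with "; " here purely for display; a real reader may keep them as lists instead.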

The paperNumbers function plots the number of papers per year, broken down by a second variable (here the OECD category).

paperNumbers(papers,"OECD",bSize=6)

paperShares works the same way, but plots shares instead of absolute numbers.

paperShares(papers,"OECD",bSize=6)
paperShares(papers,"OECD",bSize=6,pType="line")
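
The difference between absolute numbers and shares can be sketched on a toy table with base R. The data below are invented, the column names PY and OECD are assumed to match the fields used above, and the share is assumed to be taken within each year; the package functions may compute and plot this differently.

```r
# Invented data: publication year and OECD category for seven papers.
toy <- data.frame(
  PY   = c(2014, 2014, 2014, 2015, 2015, 2015, 2015),
  OECD = c("Natural Sciences", "Natural Sciences", "Social Sciences",
           "Natural Sciences", "Social Sciences", "Social Sciences",
           "Social Sciences")
)

# Absolute numbers: papers per year and category.
counts <- table(toy$PY, toy$OECD)

# Shares: each row (year) is rescaled to sum to 1.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Shares make years with different publication volumes comparable, at the cost of hiding overall growth.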

Preparing data for topic modelling

Turn a field of your data frame (AB, the abstract, by default) into a corpus of documents.

corpus <- corporate(papers)

Turn this into a document-term matrix with a sparsity of 0.5 (a very low value, chosen for illustration).

dtm <- makeDTM(corpus,0.5,papers$UT,0.05,0)
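
To see why 0.5 is a low sparsity threshold, the sketch below applies a sparsity filter to a toy document-term matrix in base R. It assumes the rule used by tm::removeSparseTerms (keep only terms that are absent from less than the given share of documents) and then drops documents left without any terms. Whether makeDTM applies exactly this rule is an assumption; the documents and words are invented.

```r
# Invented documents, already tokenised.
docs <- list(
  d1 = c("climate", "policy", "carbon"),
  d2 = c("climate", "model", "carbon"),
  d3 = c("climate", "policy"),
  d4 = c("genome")
)

# Document-term matrix: rows are documents, columns are terms (1 = present).
vocab <- sort(unique(unlist(docs)))
dtm_toy <- t(vapply(docs, function(words) as.integer(vocab %in% words),
                    integer(length(vocab))))
colnames(dtm_toy) <- vocab

# Sparsity of a term = share of documents in which it does not occur.
term_sparsity <- colMeans(dtm_toy == 0)

# Keep only terms whose sparsity is below the threshold.
kept <- dtm_toy[, term_sparsity < 0.5, drop = FALSE]

# Documents with no remaining terms are dropped and recorded.
empty   <- rowSums(kept) == 0
removed <- rownames(kept)[empty]
kept    <- kept[!empty, , drop = FALSE]

print(round(term_sparsity, 2))
print(kept)
print(removed)
```

With this threshold only "climate" survives and document d4 is removed, which mirrors how an aggressive setting shrinks both the vocabulary and the set of documents.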

The step above removes some documents; the UTs of the removed papers are returned as $removed. Subsequent operations should use only the documents that were not removed.

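# Papers that were dropped when building the document-term matrix
rem <- filter(papers, UT %in% dtm$removed)
# Papers that remain and are used from here on
papers_used <- filter(papers, !(UT %in% dtm$removed))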

Re-create a corpus based on the words and documents that remain after the filtering steps above.

corpus_used <- refresh_corp(dtm$dtm)

Topic modelling

What is the optimal number of topics, up to a maximum of 10?

optimal_k(dtm$dtm, 10)
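
One common way to compare candidate values of k is to fit a model for each value and compare perplexity on held-out documents, where lower is better. The sketch below does this with LDA from topicmodels on the AssociatedPress data that ships with that package. It is not necessarily the criterion that optimal_k uses; the split sizes, the range of k, and the use of LDA rather than CTM are assumptions chosen to keep the example fast.

```r
library(topicmodels)

# Example document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")

# Small training and held-out sets, for speed only.
train <- AssociatedPress[1:40, ]
test  <- AssociatedPress[41:50, ]

# Fit one model per candidate k and score it on the held-out documents.
ks <- 2:4
perp <- sapply(ks, function(k) {
  fit <- LDA(train, k = k, method = "VEM", control = list(seed = 2016))
  perplexity(fit, newdata = test)
})

# Candidate with the lowest held-out perplexity.
best_k <- ks[which.min(perp)]
print(data.frame(k = ks, perplexity = perp))
print(best_k)
```

On a sample this small the ranking is noisy, so the result should be read as a demonstration of the procedure rather than a recommendation for k.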

Fit a correlated topic model (CTM) to the document-term matrix with k topics, here k = 3. A smaller k needs less computation time.

SEED <- 2016

system.time({
  CTM_3 <- CTM(dtm$dtm, k = 3, method = "VEM",
               control = list(seed = SEED))
})
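
Once a model is fitted, topicmodels provides accessors for inspecting it. The sketch below fits a small CTM on the AssociatedPress data bundled with topicmodels, so that it runs on its own, and then lists the top terms per topic, the most likely topic per document, and the per-document topic proportions. The same calls should apply to CTM_3; the object name toy_ctm and the choice of 20 documents and 2 topics are illustrative assumptions.

```r
library(topicmodels)

# Example document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")

# Small model so that the example fits quickly.
toy_ctm <- CTM(AssociatedPress[1:20, ], k = 2, method = "VEM",
               control = list(seed = 2016))

# Five most probable terms for each topic (one column per topic).
top_terms <- terms(toy_ctm, 5)

# Most likely topic for each document.
main_topic <- topics(toy_ctm)

# Topic proportions per document (rows sum to 1).
doc_topics <- posterior(toy_ctm)$topics

print(top_terms)
print(table(main_topic))
print(round(head(doc_topics), 2))
```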

Create a folder in which to save a visualisation of the model, along with the model data.

visualise(CTM_3,corpus_used,dtm$dtm)


mcallaghan/scimetrix documentation built on May 22, 2019, 12:58 p.m.