library(dplyr)
library(scimetrix)
library(tm)
library(topicmodels)
library(ggplot2)
path <- system.file("results.txt", package = "scimetrix")
Use the readWoS function to read in a text file downloaded from Web of Science, and apply the mergeOECD function to add OECD subject categories
papers <- readWoS(path) %>% mergeOECD()
head(papers)
The paperNumbers function plots numbers of papers by year and another variable
paperNumbers(papers,"OECD",bSize=6)
paperShares works the same way but with shares instead of absolute numbers
paperShares(papers,"OECD",bSize=6)
paperShares(papers,"OECD",bSize=6,pType="line")
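To make the difference between absolute numbers and shares concrete, here is a minimal base R sketch on made-up data. It assumes that shares are computed within each year, which is a guess about what paperShares plots rather than something the package documentation confirms; the data frame `toy` and both table names are hypothetical.

```r
# Hypothetical toy data: one row per paper, with publication year and OECD category.
toy <- data.frame(
  PY = c(2014, 2014, 2014, 2015, 2015, 2015, 2015),
  OECD = c("Natural sciences", "Natural sciences", "Engineering",
           "Engineering", "Engineering", "Natural sciences", "Medical sciences"),
  stringsAsFactors = FALSE
)

# Absolute numbers: papers per year and category.
paper_counts <- table(toy$PY, toy$OECD)

# Shares: each row (year) is divided by its own total, so every year sums to 1.
# Assumption: shares are taken within a year, not across the whole data set.
paper_shares <- prop.table(paper_counts, margin = 1)

print(paper_counts)
print(round(paper_shares, 2))
```

Shares make years with very different output comparable, at the cost of hiding whether the total number of papers grew or shrank.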
Turn a field of your dataframe (defaults to AB, abstract) into a corpus of documents
corpus <- corporate(papers)
Turn this into a document term matrix with a sparsity of 0.5 (this is a very low number, for illustration)
dtm <- makeDTM(corpus,0.5,papers$UT,0.05,0)
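The following base R sketch illustrates what a sparsity threshold of 0.5 means and why some documents can end up removed. It assumes makeDTM prunes terms in the same way as tm's removeSparseTerms (keeping only terms whose share of empty documents is below the threshold); that is an assumption, and the toy matrix and all names in it are made up.

```r
# Hypothetical document-term counts: rows are documents, columns are terms.
term_counts <- matrix(
  c(1, 0, 2, 0,
    1, 0, 0, 0,
    2, 1, 1, 0,
    0, 0, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("model", "topic", "paper", "oecd"))
)

# Sparsity of a term: the share of documents in which it does not occur.
sparsity <- colMeans(term_counts == 0)

# Keep only terms whose sparsity is strictly below the threshold.
threshold <- 0.5
kept <- term_counts[, sparsity < threshold, drop = FALSE]

# Documents with no remaining terms cannot be used in a topic model.
empty_docs <- rownames(kept)[rowSums(kept) == 0]

print(sparsity)
print(colnames(kept))
print(empty_docs)
```

With a threshold this low, only terms present in more than half of the documents survive, so documents that use none of those common terms are left empty.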
The above process removes some documents (a list of paper UTs is returned as $removed). In future operations, we will only want to use documents that were not removed:
rem <- filter(papers,UT %in% dtm$removed)
papers_used <- subset(papers, !(UT %in% dtm$removed))
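As a quick check of the filtering logic, this base R sketch splits a made-up data frame by a vector of removed identifiers. The UT values and the names `papers_toy`, `removed_toy`, `rem_toy` and `used_toy` are hypothetical stand-ins for `papers` and `dtm$removed`.

```r
# Hypothetical papers with made-up UT identifiers.
papers_toy <- data.frame(
  UT = c("WOS:A", "WOS:B", "WOS:C", "WOS:D"),
  PY = c(2014, 2015, 2015, 2016),
  stringsAsFactors = FALSE
)

# Stand-in for dtm$removed: the UTs of documents dropped while building the DTM.
removed_toy <- c("WOS:B", "WOS:D")

# %in% returns one TRUE/FALSE per row; negating it selects the documents that were kept.
rem_toy  <- papers_toy[papers_toy$UT %in% removed_toy, ]
used_toy <- papers_toy[!(papers_toy$UT %in% removed_toy), ]

# The two subsets should partition the original data frame.
stopifnot(nrow(rem_toy) + nrow(used_toy) == nrow(papers_toy))
print(used_toy$UT)
```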
Re-create a corpus based on the words and documents used after the filtering steps above
corpus_used <- refresh_corp(dtm$dtm)
What's the optimal number (up to a maximum of 10) of topics?
optimal_k(dtm$dtm, 10)
Run a topic model on the dtm, with k topics (smaller k = less computation time).
SEED <- 2016
system.time({
  CTM_3 <- CTM(dtm$dtm,k=3,method="VEM", control=list(seed=SEED))
})
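Once a model is fitted, it can be inspected with the accessor functions from topicmodels. The sketch below is self-contained, so it fits a separate small model on the first 20 documents of the AssociatedPress data set that ships with topicmodels rather than on the papers above; `ctm_toy` is a hypothetical name, and the same calls should apply to CTM_3.

```r
library(topicmodels)

# Example document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")

# Fit a small correlated topic model; a fixed seed makes the run repeatable.
ctm_toy <- CTM(AssociatedPress[1:20, ], k = 3, method = "VEM",
               control = list(seed = 2016))

# The five most probable terms for each topic (a 5 x 3 character matrix).
top_terms <- terms(ctm_toy, 5)

# The single most likely topic for each document.
doc_topics <- topics(ctm_toy)

# Per-document topic probabilities; each row sums to 1.
doc_probs <- posterior(ctm_toy)$topics

print(top_terms)
print(table(doc_topics))
```

Looking at the top terms per topic is usually the fastest way to judge whether the chosen k gives interpretable topics.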
Create a folder in which to save a visualisation of the model, along with the model data
visualise(CTM_3,corpus_used,dtm$dtm)