runAnalysis: Perform scicloud analysis
In LisaGotzian/scicloud: Cluster and Network Word Analysis of Scientific Papers


View source: R/2_runAnalysis.R

The second function in the scicloud workflow, called after createScicloudList. With the default method it returns a list of four components (IndVal, metaMatrix, RepresentativePapers and wordList) for further use with inspectScicloud.

The analysis performed depends on the method argument. The default, 'hclust', identifies clusters using hclust. The clusters are publication communities based on the words used in the papers. To identify the words relevant to each community, the function then runs an indicator species analysis: indval assigns every word an indicator species value for each cluster, showing how representative the word is of that cluster. The most representative words are then visualized with the following plots:

a dendrogram of the clusters
a wordcloud of the publication communities
four visualizations of the communities by year and number of citations (which have been fetched from the Scopus API)
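
The two steps described above (clustering papers, then scoring words per cluster) can be sketched in base R. This is only an illustration under stated assumptions: the toy matrix, the Euclidean distance, the Ward linkage and the helper `indicatorValue` are all hypothetical and not taken from scicloud, and the permutation test behind the `p` argument is omitted. The score follows the usual indicator value definition, specificity times fidelity.

```r
# Toy document-term matrix: rows are papers, columns are word counts.
# All values are invented for illustration only.
dtm <- matrix(
  c(5, 4, 0, 0,
    6, 3, 1, 0,
    0, 1, 5, 4,
    0, 0, 6, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("paper", 1:4),
                  c("forest", "soil", "policy", "market"))
)

# Hierarchical clustering of papers on their word profiles.
# Distance and linkage are assumptions, not necessarily scicloud's choices.
tree <- hclust(dist(dtm), method = "ward.D2")
clusters <- cutree(tree, k = 2)

# Hypothetical helper: indicator value = specificity * fidelity, where
#   specificity = mean abundance in the cluster / sum of the cluster means
#   fidelity    = share of the cluster's papers that contain the word
indicatorValue <- function(x, groups) {
  ids <- sort(unique(groups))
  meanAbund <- sapply(ids, function(g) colMeans(x[groups == g, , drop = FALSE]))
  fidelity <- sapply(ids, function(g) colMeans(x[groups == g, , drop = FALSE] > 0))
  specificity <- meanAbund / rowSums(meanAbund)
  out <- specificity * fidelity
  colnames(out) <- paste0("cluster", ids)
  out
}

iv <- indicatorValue(dtm, clusters)
print(round(iv, 2))
```

Words with a value close to 1 occur almost exclusively, and in nearly every paper, within one cluster. Calling `plot(tree)` on the clustering result draws the corresponding dendrogram.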

The 'network' method also takes a clustering approach, but bases it on a network analysis. It returns a list of global and local measures and generates a clustered matrix, which can be processed further in network programs such as Gephi.
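
To make "local" and "global" measures concrete, here is a minimal base R sketch on a toy paper-by-word incidence matrix. It assumes a simple word co-occurrence projection; how scicloud builds its network internally is not stated here, and the data and variable names are invented.

```r
# Toy paper-by-word incidence matrix (1 = word occurs in paper); invented data.
incidence <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("paper", 1:3), c("forest", "soil", "policy"))
)

# Project onto words: entry [i, j] counts the papers in which
# words i and j occur together.
cooccurrence <- crossprod(incidence)
diag(cooccurrence) <- 0  # drop self-links

# Local measure: weighted degree of each word.
wordDegree <- rowSums(cooccurrence)

# Global measure: density, the share of possible word pairs that are linked.
nWords <- ncol(incidence)
density <- sum(cooccurrence > 0) / (nWords * (nWords - 1))

print(wordDegree)
print(density)
```

A matrix like `cooccurrence` can be written out with `write.csv` and imported into a network program for further processing.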

runAnalysis(
  scicloudList,
  numberOfClusters = NA,
  dendrogram = TRUE,
  dendroLabels = c("truncated", "break"),
  minWordsPerCluster = 5,
  maxWordsPerCluster = 10,
  p = 0.05,
  exactPosition = FALSE,
  sortby = c("Eigenvector", "Degree", "Closeness", "Betweenness"),
  keep = 0.33,
  saveToWd = FALSE,
  method = c("hclust", "network", "both")
)

`scicloudList`	output of `createScicloudList`
`numberOfClusters`	integer or NA. An integer sets the number of clusters manually; it must not exceed 14, as more than 14 clusters are not recommended. For NA, the function automatically calculates the optimum number of clusters, considering 1 to 12 possible clusters.
`dendrogram`	logical, whether or not to show a dendrogram of the calculated clusters.
`dendroLabels`	allows "truncated" or "break". This either truncates the labels of the dendrogram leaves or puts a line break. Line breaks are not recommended for a large number of PDFs.
`minWordsPerCluster`	minimum number of words per cluster to be plotted in the wordcloud.
`maxWordsPerCluster`	maximum number of words per cluster to be plotted in the wordcloud.
`p`	the p-value that sets the significance level of individual words for the indicator species analysis. Only significant words will be plotted.
`exactPosition`	logical. By default the wordcloud avoids overlapping labels, favoring visual simplicity over exact positions. When set to `TRUE`, each word's position is marked by a dot that is connected to its label by a line.
`sortby`	for the network method: the centrality measure to sort the words by, default is Eigenvector. Allows the following possible inputs: "Eigenvector", "Degree", "Closeness", "Betweenness".
`keep`	for the network method: numeric, the share of all words to keep (0.33 by default) after sorting by the measure given in `sortby`. Keeping fewer words makes later computations faster.
`saveToWd`	logical, whether or not to save the return value of the function to the working directory. This is especially useful for later analysis steps. The file can be read back in with `readRDS`.
`method`	the analysis method: "hclust" (the default), "network" or "both".
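
The save and restore cycle that `saveToWd` refers to can be sketched with base R's `saveRDS` and `readRDS`. The list below is only a stand-in for the return value of runAnalysis, and the file name is hypothetical; the name scicloud itself writes is not stated here.

```r
# Stand-in for the list returned by runAnalysis(); contents are invented.
analysisResult <- list(
  IndVal = data.frame(word = c("forest", "policy"), indval = c(0.9, 0.8)),
  wordList = c("forest", "policy", "soil")
)

# Hypothetical file name in a temporary directory.
rdsPath <- file.path(tempdir(), "scicloudAnalysis.RDS")
saveRDS(analysisResult, rdsPath)

# In a later session: restore the object without re-running the analysis.
restored <- readRDS(rdsPath)
print(names(restored))
```

Unlike `save` and `load`, `readRDS` returns the object, so it can be assigned to any variable name.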

'hclust' returns a list with the following components:

IndVal: the results of the indicator species analysis.
metaMatrix: the pre-processed metaMatrix.
RepresentativePapers: a dataframe of the most representative papers of each publication community. Papers are representative if they contain the highest number of significant words.
wordList: a list of all words that have been used in the analysis.

'network' returns a list with the following components:

LocalMeasures: local measures for both papers and words
ReducedLocalMeasures: only the kept share of the words (1/3 by default, see keep) with their centrality measures and their clustering according to three different clustering methods, sorted by the measure given in sortby (eigenvector centrality by default)
ReducedIncidenceMatrix: 1/3 of the words arranged by eigenvector centrality, to be further processed e.g. in Gephi or with other clustering functions
GlobalMeasures: global measures of the network
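
How `sortby` and `keep` interact to produce the reduced components can be illustrated as follows. The helper `reduceWords`, the rounding rule and the centrality scores are assumptions made for this sketch, not scicloud's actual implementation.

```r
# Invented centrality scores for six words.
localMeasures <- data.frame(
  word = c("forest", "soil", "policy", "market", "carbon", "water"),
  Eigenvector = c(0.90, 0.20, 0.75, 0.10, 0.60, 0.40),
  Degree = c(12, 3, 9, 2, 10, 5)
)

# Hypothetical helper: sort by one centrality measure, keep the top share.
reduceWords <- function(measures, sortby = "Eigenvector", keep = 0.33) {
  nKeep <- max(1, round(keep * nrow(measures)))  # rounding is an assumption
  ranked <- measures[order(measures[[sortby]], decreasing = TRUE), ]
  head(ranked, nKeep)
}

reduced <- reduceWords(localMeasures)
reducedByDegree <- reduceWords(localMeasures, sortby = "Degree", keep = 0.5)

print(reduced$word)
print(reducedByDegree$word)
```

Changing `sortby` can change which words survive the cut, so the reduced matrix passed on to other tools depends on the chosen centrality measure.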

Creator of the scicloud workflow: Henrik von Wehrden, henrik.von_wehrden@leuphana.de

Code by: Matthias Nachtmann, matthias.nachtmann@stud.leuphana.de, Lisa Gotzian, lisa.gotzian@stud.leuphana.de, Jia Yan Ng, Jia.Y.Ng@stud.leuphana.de, Johann Julius Beeck, johann.j.beeck@stud.leuphana.de

First version of scicloud: Matthias Nachtmann, matthias.nachtmann@stud.leuphana.de

Other scicloud functions: createScicloudList(), deleteRDS(), inspectScicloud(), searchScopus()

## Not run: 

### Workflow of performing analysis using scicloud
myAPIKey <- "YOUR_API_KEY"
# Retrieve data from the PDFs and from Scopus via its API
scicloudList <- createScicloudList(myAPIKey = myAPIKey)

# Run the analysis with a specified number of clusters
scicloudAnalysis <- runAnalysis(scicloudList = scicloudList, numberOfClusters = 4)

# Generate a summary of the analysis
scicloudSpecs <- inspectScicloud(scicloudAnalysis)

## End(Not run)