createScicloudList: Create a scicloud list
In LisaGotzian/scicloud: Cluster and Network Word Analysis of Scientific Papers


View source: R/1_createScicloudList.R

This is the first function to call when performing an analysis with scicloud. It outputs a list of three components (metaMatrix, Tf_Idf and wordList) for further use with runAnalysis.
The function reads all scientific papers stored as PDF files in the "PDFs" folder of your working directory, or in any other specified directory, to create a metaMatrix. It then pre-processes the text (e.g. by stemming words when stemWords is TRUE) and outputs a tf-idf matrix. As a last step, it fetches the papers' metadata from Scopus, for which you need an Elsevier API key (https://dev.elsevier.com/index.jsp).
You can limit the words used in the analysis with the argument 'keepWordsFile'.
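
To illustrate what a tf-idf matrix contains, here is a minimal base R sketch on a made-up toy corpus. It uses the textbook weighting (term frequency times the natural log of inverse document frequency); the exact variant scicloud applies internally is not stated here and may differ. The names `docs`, `vocab`, `tf`, `idf` and `tfidf` are illustrative only.

```r
# Toy corpus: three "documents", already lower-cased and tokenised.
docs <- list(
  paper1 = c("forest", "carbon", "forest"),
  paper2 = c("carbon", "policy"),
  paper3 = c("forest", "policy", "policy", "soil")
)

# Vocabulary shared by all documents (sorted for stable column order).
vocab <- sort(unique(unlist(docs)))

# Term frequency: count of each term divided by the document length.
tf <- t(vapply(docs, function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab))) / length(tokens)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(number of documents / documents containing the term).
idf <- log(length(docs) / colSums(tf > 0))

# tf-idf: scale every column of tf by the idf of its term.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

Terms that occur in few documents (here "soil") receive a higher weight than terms spread across most documents, which is why tf-idf is a common input for clustering papers by topic.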

createScicloudList(
  directory = file.path(".", "PDFs"),
  scopusList = NA,
  myAPIKey = NA,
  language = "SMART",
  stemWords = TRUE,
  saveToWd = FALSE,
  ignoreWords = c(),
  keepWordsFile = NA,
  generateWordlist = FALSE
)

`directory`	path to the folder containing the PDFs. By default, the PDFs are expected to be in a folder named "PDFs" in the working directory.
`scopusList`	a finished metaMatrix from `searchScopus`
`myAPIKey`	your private API key for communicating with the Scopus API. You can request one at https://dev.elsevier.com/.
`language`	this defines the language of the stopwords to be filtered. The default is "SMART". Look at `stopwords` for more information.
`stemWords`	logical, whether to stem words during pre-processing. It is passed to processMetaDataMatrix.
`saveToWd`	logical, whether to save the output of the function to the working directory. This is especially useful for later analysis steps. The saved file can be read back in with `readRDS`.
`ignoreWords`	a vector of words to be ignored which is passed to processMetaDataMatrix.
`keepWordsFile`	path to a .csv-file that specifies which words to keep for the analysis. Accepts 0/1 behind each word or takes the words as they are and disregards all other words for the analysis. If no word list is provided, all words are used. You can generate a list with all words used in the current analysis by setting `generateWordlist` to `TRUE`. If you intend to use this option, delete all words you don't need and re-run the function with the updated word list by specifying `keepWordsFile`.
`generateWordlist`	logical, if set to `TRUE`, it generates a word list in your working directory. You can then add a 0/1 behind each word or delete the rows you don't consider important to the analysis.
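
The 0/1 convention described for `keepWordsFile` can be sketched as follows. This is only an illustration of the idea, not the package's own parsing code: the two-column layout without a header, the temporary file, and the names `word_file`, `word_flags` and `keep_words` are assumptions made for this example.

```r
# Hypothetical word list: one word per row, followed by 1 (keep) or 0 (drop).
word_file <- tempfile(fileext = ".csv")
writeLines(c("forest,1", "carbon,1", "figure,0", "table,0"), word_file)

# Read the file and keep only the words flagged with 1.
word_flags <- read.csv(
  word_file,
  header = FALSE,
  col.names = c("word", "keep"),
  stringsAsFactors = FALSE
)
keep_words <- word_flags$word[word_flags$keep == 1]
print(keep_words)

# The two-pass workflow described in the arguments would then look roughly like
# this (not run here; the CSV path is a placeholder):
# createScicloudList(myAPIKey = myAPIKey, generateWordlist = TRUE)
# ... edit the generated word list ...
# createScicloudList(myAPIKey = myAPIKey, keepWordsFile = "path/to/wordlist.csv")
```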

Returns a list with the following components:

Tf_Idf: the tf-idf document term matrix.
wordList: a list of all words that have been used in the analysis.
metaMatrix: a matrix with 21 columns that contains metadata (DOI, Year, Authors, etc.) and each PDF's full text, pre-processed and filtered. Further information (Title, Abstract, Journal, etc.) is retrieved through the Scopus API. Please note that without a valid API key and a connection to Scopus from within a recognized network, this information will not be retrieved successfully.
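
The following sketch shows how a list with these three components could be inspected and round-tripped through an RDS file, as happens when `saveToWd = TRUE`. The object `scicloud_like` is a hand-built stand-in with made-up values and only two metaMatrix columns, `wordList` is assumed to be a character vector, and the temporary file path is a placeholder; the file name scicloud actually writes is not given here.

```r
# Stand-in for the returned list (values are invented for illustration).
scicloud_like <- list(
  metaMatrix = matrix(
    c("10.1000/example1", "2019",
      "10.1000/example2", "2021"),
    nrow = 2, byrow = TRUE,
    dimnames = list(NULL, c("DOI", "Year"))
  ),
  Tf_Idf = matrix(
    c(0.27, 0, 0, 0.37),
    nrow = 2,
    dimnames = list(c("paper1", "paper2"), c("forest", "soil"))
  ),
  wordList = c("forest", "soil")
)

# Save to disk and read back in, mirroring the saveToWd / readRDS pairing.
rds_path <- tempfile(fileext = ".rds")
saveRDS(scicloud_like, rds_path)
restored <- readRDS(rds_path)

# Components are accessed by name.
print(names(restored))
print(dim(restored$Tf_Idf))
print(restored$metaMatrix[, "DOI"])
```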

Creator of the scicloud workflow: Henrik von Wehrden, henrik.von_wehrden@leuphana.de

Code by: Jia Yan Ng, Jia.Y.Ng@stud.leuphana.de, Johann Julius Beeck, johann.j.beeck@stud.leuphana.de, Lisa Gotzian, lisa.gotzian@stud.leuphana.de, Prabesh Dhakal, prabesh.dhakal@stud.leuphana.de

First version of scicloud: Matthias Nachtmann, matthias.nachtmann@stud.leuphana.de

Other scicloud functions: deleteRDS(), inspectScicloud(), runAnalysis(), searchScopus()

## Not run: 

### Workflow of performing analysis using scicloud
myAPIKey <- "YOUR_API_KEY"
# Retrieve data from the PDFs and from Scopus using the API key
scicloudList <- createScicloudList(myAPIKey = myAPIKey)

# Run the analysis with a specified number of clusters
scicloudAnalysis <- runAnalysis(scicloudList = scicloudList, numberOfClusters = 4)

# Generate a summary of the analysis
scicloudSpecs <- inspectScicloud(scicloudAnalysis)

## End(Not run)