BibToCorpus: Convert a bibliographic database into a text corpus


View source: R/BibToCorpus.R

Description

Builds a text corpus from a bibliographic database. A control list and helper options let you select the processing steps, which makes composing the corpus faster.

Usage

BibToCorpus(bibData, bibUnits = "Keywords", controlList, stopWords = TRUE,
  wordsToRemove, replaceWords)

Arguments

bibData

a dataframe containing information about a bibliographic database.

bibUnits

a string, the bibliographic unit to be analyzed e.g. "Title", "Keywords", "Abstract". This string must match the column name from the "bibData" dataframe.
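Because bibUnits has to match a column name, a quick check before calling BibToCorpus can catch a typo early. The following is a minimal sketch in base R; the data frame toyBib and the variable unitExists are hypothetical and not part of KDViz.

```r
# Hypothetical toy data frame standing in for a real bibliographic database
toyBib <- data.frame(
  Title    = c("Text mining in practice", "Clustering of documents"),
  Keywords = c("text mining; corpus", "clustering; documents"),
  stringsAsFactors = FALSE
)

bibUnits <- "Keywords"

# Confirm the requested unit is an actual column before building the corpus
unitExists <- bibUnits %in% names(toyBib)
if (!unitExists) {
  stop("Column '", bibUnits, "' not found. Available: ",
       paste(names(toyBib), collapse = ", "))
}
```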

controlList

a character vector naming the transformations to apply while the corpus is composed. Available options: "stripWhitespace", which collapses runs of white space, and "removeNumbers", which removes numbers from the texts in the corpus.
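To give a feel for what these two options do to a text, here is a rough base R approximation. It is a sketch only: the helper names collapseSpaces and dropNumbers are hypothetical, the input text is made up, and the order in which BibToCorpus applies its steps may differ.

```r
# Hypothetical input text, not taken from the package data
rawText <- c("cluster   analysis  2019", "text\tmining 101   methods")

# Approximation of "stripWhitespace": collapse runs of whitespace to one space
collapseSpaces <- function(x) gsub("[[:space:]]+", " ", x)

# Approximation of "removeNumbers": drop digit characters
dropNumbers <- function(x) gsub("[[:digit:]]+", "", x)

# Remove digits first, then tidy the whitespace they leave behind
cleaned <- trimws(collapseSpaces(dropNumbers(rawText)))
cleaned
```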

stopWords

logical. If TRUE, a list of stop words will be removed from the composed corpus.

wordsToRemove

a vector of words to remove from the composed corpus.
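The effect of a custom removal list can be pictured with a small base R sketch. This illustrates whole-word removal in general, under the assumption that listed words are dropped wherever they appear as separate words; it is not the package's internal code, and the sample keywords are invented.

```r
# Invented sample texts and a short removal list
keywords <- c("cluster analysis of text data", "protein data review")
wordsToRemove <- c("analysis", "data", "text", "review")

# Build one pattern that matches any listed word as a whole word
pattern <- paste0("\\b(", paste(wordsToRemove, collapse = "|"), ")\\b")

# Delete the matches, then collapse the leftover whitespace
stripped <- gsub(pattern, "", keywords, perl = TRUE)
stripped <- trimws(gsub("[[:space:]]+", " ", stripped))
stripped
```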

replaceWords

the path to a TXT file with two tab-separated columns: one column containing the final word to keep in the corpus and a second containing the word to be replaced. Example (two rows):

clustering	cluster_analysis
clustering	cluster
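A replacement file of this shape can be written from R itself. The sketch below assumes the file has no header row and that the final word comes first; neither detail is confirmed by this page, so compare against the KDReplaceWords.txt file shipped with the package before relying on it. The names replacements and replaceFile are hypothetical.

```r
# Hypothetical replacement pairs: final word first, word to replace second
replacements <- data.frame(
  final   = c("clustering", "clustering"),
  replace = c("cluster_analysis", "cluster"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without quotes, row names, or header (assumed layout)
replaceFile <- tempfile(fileext = ".txt")
write.table(replacements, file = replaceFile, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read it back to confirm the layout
fileLines <- readLines(replaceFile)
fileLines
```

The resulting path in replaceFile could then be passed as the replaceWords argument.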

Details

A list of English stop words is included in the package. For stop word lists in many other languages, see https://sites.google.com/site/kevinbouge/stopwords-lists, made available by Kevin Bouge (kevin.bouge@gmail.com).
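Since stopWords is a logical flag, one possible way to use a list for another language is to read it into a character vector and pass it through wordsToRemove. This is a sketch under assumptions: the downloaded file is taken to contain one word per line, and a temporary file stands in for it here. The names stopFile and extraStopWords are hypothetical.

```r
# Stand-in for a downloaded stop word file (assumed: one word per line)
stopFile <- tempfile(fileext = ".txt")
writeLines(c("el", "la", "de", "", "que"), stopFile)

# Read the words, normalise case and spacing, and drop blanks and duplicates
extraStopWords <- readLines(stopFile, encoding = "UTF-8")
extraStopWords <- unique(tolower(trimws(extraStopWords)))
extraStopWords <- extraStopWords[nzchar(extraStopWords)]
extraStopWords
```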

Value

An object inheriting from VCorpus and Corpus.

Author(s)

Andres Palacios anfpalacioscl@unal.edu.co

See Also

ArticleSearch can be useful for creating a bibliographic information dataframe if starting from scratch.

Examples

data("KDVizData")
wordsToReplace <- system.file("extdata", "KDReplaceWords.txt", package = "KDViz")
wordsToRemove <- c("analysis", "data", "text", "review", "topic", "theory", "system", "protein")

myCorpus <- BibToCorpus(bibData = KDVizData, bibUnits = "Keywords",
  controlList = c("stripWhitespace", "removeNumbers"), stopWords = TRUE,
  wordsToRemove = wordsToRemove, replaceWords = wordsToReplace)

Example output

Processing Corpus from bibliometric data...

Collapsing multiple whitespace characters to a single one...
Removing stop words...
Removing words from custom list...
Removing numbers...
24 words to replace:
 4.2% of words replaced
 8.3% of words replaced
 12.5% of words replaced
 16.7% of words replaced
 20.8% of words replaced
 25% of words replaced
 29.2% of words replaced
 33.3% of words replaced
 37.5% of words replaced
 41.7% of words replaced
 45.8% of words replaced
 50% of words replaced
 54.2% of words replaced
 58.3% of words replaced
 62.5% of words replaced
 66.7% of words replaced
 70.8% of words replaced
 75% of words replaced
 79.2% of words replaced
 83.3% of words replaced
 87.5% of words replaced
 91.7% of words replaced
 95.8% of words replaced
 100% of words replaced
Corpus process finished

KDViz documentation built on May 1, 2019, 6:34 p.m.
