sentenceTokenParse: Parse text into sentences and tokens


Description

Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes document and sentence IDs for use in other lexRankr functions.

Usage

sentenceTokenParse(text, docId = "create", removePunc = TRUE,
  removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
  rmStopWords = TRUE)

Arguments

text

A character vector of documents to be parsed into sentences and tokenized.

docId

A character vector of document IDs of the same length as text. If docId == "create", document IDs will be created automatically.

removePunc

TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.

removeNum

TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.

toLower

TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.

stemWords

TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords

TRUE, FALSE, or character vector of stopwords to remove from tokens. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.
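The cleaning arguments above can be combined freely. As a sketch, the call below keeps numbers, skips stemming, and supplies a custom stopword list in place of the default lexRankr::smart_stopwords (the text and document IDs are illustrative):

```r
library(lexRankr)

# Keep numbers, skip stemming, and remove only a custom set of stopwords
res <- sentenceTokenParse(
  text        = c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
  docId       = c("d1", "d2"),
  removeNum   = FALSE,
  stemWords   = FALSE,
  rmStopWords = c("is", "to", "a", "you", "have")
)
```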

Value

A list of dataframes. The first element of the list returned is the sentences dataframe; this dataframe has columns docId, sentenceId, & sentence (the actual text of the sentence). The second element of the list returned is the tokens dataframe; this dataframe has columns docId, sentenceId, & token (the actual text of the token).
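Because the return value is a named list, the two dataframes can be accessed directly by name, for example:

```r
library(lexRankr)

res <- sentenceTokenParse("Bill is trying to earn a Ph.D.")

res$sentences  # columns: docId, sentenceId, sentence
res$tokens     # columns: docId, sentenceId, token
```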

Examples

sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId=c("d1","d2"))

Example output

$sentences
  docId sentenceId                       sentence
1    d1       d1_1 Bill is trying to earn a Ph.D.
2    d2       d2_1    You have to have a 5.0 GPA.

$tokens
  docId sentenceId token
1    d1       d1_1  bill
2    d1       d1_1  earn
3    d1       d1_1   phd
4    d2       d2_1   gpa

lexRankr documentation built on May 2, 2019, 1:29 p.m.