sentenceTokenParse: Parse text into sentences and tokens


Description

Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes document and sentence IDs for use in other lexRankr functions.

Usage

sentenceTokenParse(text, docId = "create", removePunc = TRUE,
  removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
  rmStopWords = TRUE)

Arguments

text

A character vector of documents to be parsed into sentences and tokenized.

docId

A character vector of document IDs of the same length as text. If docId == "create", document IDs will be created automatically.

removePunc

TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.

removeNum

TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.

toLower

TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.

stemWords

TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords

TRUE, FALSE, or character vector of stopwords to remove from tokens. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.
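The cleaning arguments above can be combined freely. As a sketch, the call below keeps numbers, skips stemming, and supplies a custom stopword list in place of the default lexRankr::smart_stopwords (the text and document IDs are illustrative):

```r
library(lexRankr)

# Keep numbers, skip stemming, and remove only a custom set of stopwords
res <- sentenceTokenParse(
  text        = c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
  docId       = c("d1", "d2"),
  removeNum   = FALSE,
  stemWords   = FALSE,
  rmStopWords = c("is", "to", "a", "you", "have")
)
```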

Value

A list of dataframes. The first element of the list returned is the sentences dataframe; this dataframe has columns docId, sentenceId, & sentence (the actual text of the sentence). The second element of the list returned is the tokens dataframe; this dataframe has columns docId, sentenceId, & token (the actual text of the token).
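Because the return value is a named list, the two dataframes can be accessed directly by name, for example:

```r
library(lexRankr)

res <- sentenceTokenParse("Bill is trying to earn a Ph.D.")

res$sentences  # columns: docId, sentenceId, sentence
res$tokens     # columns: docId, sentenceId, token
```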

Examples

sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId=c("d1","d2"))

Example output

$sentences
  docId sentenceId                       sentence
1    d1       d1_1 Bill is trying to earn a Ph.D.
2    d2       d2_1    You have to have a 5.0 GPA.

$tokens
  docId sentenceId token
1    d1       d1_1  bill
2    d1       d1_1  earn
3    d1       d1_1   phd
4    d2       d2_1   gpa

lexRankr documentation built on May 2, 2019, 1:29 p.m.