sentenceTokenParse: Parse text into sentences and tokens
In AdamSpannbauer/lexRankr: Extractive Summarization of Text with the LexRank Algorithm

View source: R/sentenceTokenParse.R

sentenceTokenParse

R Documentation

Parse text into sentences and tokens

Description

Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes document and sentence IDs for use in other lexRankr functions.

Usage

sentenceTokenParse(text, docId = "create", removePunc = TRUE,
  removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
  rmStopWords = TRUE)

Arguments

`text`	A character vector of documents to be parsed into sentences and tokenized.
`docId`	A character vector of document IDs the same length as `text`. If `docId == "create"` (the default), document IDs will be created.
`removePunc`	`TRUE` or `FALSE` indicating whether or not to remove punctuation from `text` while tokenizing. If `TRUE`, punctuation will be removed. Defaults to `TRUE`.
`removeNum`	`TRUE` or `FALSE` indicating whether or not to remove numbers from `text` while tokenizing. If `TRUE`, numbers will be removed. Defaults to `TRUE`.
`toLower`	`TRUE` or `FALSE` indicating whether or not to coerce all of `text` to lowercase while tokenizing. If `TRUE`, `text` will be coerced to lowercase. Defaults to `TRUE`.
`stemWords`	`TRUE` or `FALSE` indicating whether or not to stem resulting tokens. If `TRUE`, the output tokens will be stemmed using `SnowballC::wordStem()`. Defaults to `TRUE`.
`rmStopWords`	`TRUE`, `FALSE`, or character vector of stopwords to remove from tokens. If `TRUE`, words in `lexRankr::smart_stopwords` will be removed prior to stemming. If `FALSE`, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to `TRUE`.
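
The arguments above can be combined to control how aggressively tokens are cleaned. The following is a minimal sketch, assuming the lexRankr package is installed; the names `docs`, `custom_stops`, and `parsed` and the particular stopword list are illustrative choices, not part of the package.

```r
library(lexRankr)

# Illustrative input: the same two documents used in the Examples section.
docs <- c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA.")

# Hypothetical custom stopword list, passed instead of TRUE/FALSE.
custom_stops <- c("is", "to", "a", "you", "have")

# Keep tokens unstemmed and remove only the custom stopwords.
# removePunc, removeNum, and toLower are left at their documented defaults (TRUE).
parsed <- sentenceTokenParse(
  docs,
  docId       = c("d1", "d2"),
  stemWords   = FALSE,
  rmStopWords = custom_stops
)

# The second list element is the tokens data frame (see Value).
tokens <- parsed[[2]]
tokens
```

Passing a character vector to `rmStopWords` replaces the default `lexRankr::smart_stopwords` list rather than extending it, so any default stopwords that should still be dropped need to be included in the vector.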

Value

A list of two data frames. The first element is the sentences data frame, with columns docId, sentenceId, and sentence (the actual text of the sentence). The second element is the tokens data frame, with columns docId, sentenceId, and token (the actual text of the token).
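
As a rough illustration of working with the returned list, the sketch below pulls out each data frame by position, following the order described above, and counts tokens per sentence with base R. It assumes the lexRankr package is installed; the variable names `parsed`, `sentences_df`, `tokens_df`, and `tokens_per_sentence` are illustrative.

```r
library(lexRankr)

parsed <- sentenceTokenParse(
  c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
  docId = c("d1", "d2")
)

# Positional access follows the order given in the Value description.
sentences_df <- parsed[[1]]  # columns: docId, sentenceId, sentence
tokens_df    <- parsed[[2]]  # columns: docId, sentenceId, token

# Count tokens per sentence using the sentenceId column shared by both
# data frames (base R only).
tokens_per_sentence <- table(tokens_df$sentenceId)
tokens_per_sentence
```

Because both data frames carry docId and sentenceId, tokens can be matched back to the sentence text they came from with an ordinary join such as `merge()`.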

Examples

sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                   docId=c("d1","d2"))

AdamSpannbauer/lexRankr documentation built on Dec. 9, 2022, 3:44 a.m.
