sentSplit: Sentence Splitting
In qdap: Bridging the Gap Between Qualitative Data and Quantitative Analysis

sentSplit

R Documentation

Sentence Splitting

Description

sentSplit - Splits turns of talk into individual sentences (provided proper punctuation is used). This procedure is usually done as part of the data read-in and cleaning process.

sentCombine - Combines sentences by the same grouping variable together.

TOT - Convert the tot column from sentSplit to a turn of talk index (no sub-sentence index). Generally, for internal use.

sent_detect - Detect and split sentences on endmark boundaries.

sent_detect_nlp - Detect and split sentences on endmark boundaries using openNLP & NLP utilities, which match the old version of the openNLP package's now-removed sentDetect function.

Usage

sentSplit(
  dataframe,
  text.var,
  rm.var = NULL,
  endmarks = c("?", ".", "!", "|"),
  incomplete.sub = TRUE,
  rm.bracket = TRUE,
  stem.col = FALSE,
  text.place = "right",
  verbose = is.global(2),
  ...
)

sentCombine(text.var, grouping.var = NULL, as.list = FALSE)

TOT(tot)

sent_detect(
  text.var,
  endmarks = c("?", ".", "!", "|"),
  incomplete.sub = TRUE,
  rm.bracket = TRUE,
  ...
)

sent_detect_nlp(text.var, ...)

Arguments

`dataframe`	A dataframe that contains the person and text variable.
`text.var`	The text variable.
`rm.var`	An optional character vector of length 1 or 2 naming the variables that are repeated measures (this will restart the "tot" column).
`endmarks`	A character vector of endmarks to split turns of talk into sentences.
`incomplete.sub`	logical. If `TRUE`, detects incomplete sentences and replaces their endmarks with `"|"`.
`rm.bracket`	logical. If `TRUE` removes brackets from the text.
`stem.col`	logical. If `TRUE` stems the text as a new column.
`text.place`	A character string giving placement location of the text column. This must be one of the strings `"original"`, `"right"` or `"left"`.
`verbose`	logical. If `TRUE` select diagnostics from `check_text` are reported.
`grouping.var`	The grouping variables. Default `NULL` generates one word list for all text. Also takes a single grouping variable or a list of 1 or more grouping variables.
`as.list`	logical. If `TRUE` returns the output as a list. If `FALSE` the output is returned as a dataframe.
`tot`	A tot column from a `sentSplit` output.
`...`	Additional options passed to `stem2df`.

Value

sentSplit - returns a dataframe with turn of talk broken apart into sentences. Optionally a stemmed version of the text variable may be returned as well.

sentCombine - returns a list of vectors with the continuous sentences by grouping.var pasted together.
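The idea of pasting continuous sentences together can be sketched in base R. This is only an illustration of the behaviour described above, not qdap's implementation; `combine_runs` is a hypothetical helper, and it assumes "continuous" means consecutive rows that share the same grouping value.

```r
# Hypothetical helper (not part of qdap): paste together consecutive
# sentences that share the same grouping value.
combine_runs <- function(text, group) {
  runs <- rle(as.character(group))                       # runs of equal group values
  run_id <- rep(seq_along(runs$lengths), runs$lengths)   # run number for each row
  combined <- vapply(split(text, run_id), paste, character(1), collapse = " ")
  data.frame(
    group = runs$values,
    text = unname(combined),
    stringsAsFactors = FALSE
  )
}

sentences <- c("Hello.", "How are you?", "Fine.", "Good.")
speakers  <- c("sam", "sam", "greg", "sam")
combine_runs(sentences, speakers)
```

Note that the second run by "sam" stays separate from the first because a turn by "greg" comes between them.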

TOT - returns a numeric vector of the turns of talk without sentence sub indexing (e.g. 3.2 becomes 3).
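As a rough base R illustration of this conversion (a sketch, not the source of `TOT`), the sub-sentence index can be dropped by removing everything from the first period onward. The `tot` values below are made up, and the sketch assumes they are stored as character strings of the form "turn.sentence".

```r
# Made-up tot values in "turn.sentence" form
tot <- c("1.1", "1.2", "2.1", "3.1", "3.2")

# Drop the sub-sentence part and keep the turn of talk as a number
turn <- as.numeric(sub("\\..*$", "", tot))
turn
```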

sent_detect - returns a character vector of sentences split on endmark.
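Splitting on endmark boundaries can be approximated in base R with a lookbehind regular expression. This is a minimal sketch under stated assumptions, not how `sent_detect` is implemented: `split_on_endmarks` is a hypothetical helper, it assumes sentences are separated by whitespace after the endmark, and it does none of the bracket removal or incomplete-sentence handling that the qdap function offers.

```r
# Hypothetical helper (not part of qdap): split text wherever whitespace
# follows one of the endmarks.
split_on_endmarks <- function(x, endmarks = c("?", ".", "!", "|")) {
  # Build a character class such as [?.!|]; these four marks are literal
  # inside [] (marks such as "]" or "^" would need escaping).
  cls <- paste0("[", paste(endmarks, collapse = ""), "]")
  pattern <- paste0("(?<=", cls, ")\\s+")
  trimws(unlist(strsplit(x, pattern, perl = TRUE)))
}

split_on_endmarks("I am here. Are you? Yes!")
```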

Warning

sentSplit requires the dialogue (text) column to be cleaned in a particular way. The data should contain qdap punctuation marks (c("?", ".", "!", "|")) at the end of each sentence. Additionally, extraneous punctuation, such as the periods in abbreviations, should be removed (see replace_abbreviation). Trailing sentences such as "I thought I..." will be treated as incomplete and marked with "|" to denote an incomplete/trailing sentence.
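The treatment of trailing sentences can be pictured with a small base R sketch. `mark_incomplete` is a hypothetical helper and only covers runs of two or more periods; it is an assumption about the general idea, not qdap's own replacement rules.

```r
# Hypothetical helper (not part of qdap): replace a run of two or more
# periods with the incomplete-sentence endmark "|".
mark_incomplete <- function(x) {
  gsub("\\.{2,}", "|", x)
}

mark_incomplete(c("I thought I...", "That is fine."))
```

A single period is left alone, so complete sentences keep their original endmark.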

Suggestion

It is recommended that the user runs check_text on the output of sentSplit's text column.

Author(s)

Dason Kurkiewicz and Tyler Rinker <tyler.rinker@gmail.com>.

Examples

## Not run: 
## `sentSplit` EXAMPLE:
(out <- sentSplit(DATA, "state"))
out %&% check_text()  ## check output text
sentSplit(DATA, "state", stem.col = TRUE)
sentSplit(DATA, "state", text.place = "left")
sentSplit(DATA, "state", text.place = "original")
sentSplit(raj, "dialogue")[1:20, ]

## plotting
plot(out)
plot(out, grouping.var = "person")

out2 <- sentSplit(DATA2, "state", rm.var = c("class", "day"))
plot(out2)
plot(out2, grouping.var = "person")
plot(out2, grouping.var = "person", rm.var = "day")
plot(out2, grouping.var = "person", rm.var = c("day", "class"))

## `sentCombine` EXAMPLE:
dat <- sentSplit(DATA, "state") 
sentCombine(dat$state, dat$person)
truncdf(sentCombine(dat$state, dat$sex), 50)

## `TOT` EXAMPLE:
dat <- sentSplit(DATA, "state") 
TOT(dat$tot)

## `sent_detect`
sent_detect(DATA$state)

## NLP based sentence splitting 
sent_detect_nlp(DATA$state)

## End(Not run)

qdap documentation built on May 31, 2023, 5:20 p.m.