word_stats: Descriptive Word Statistics

Description

Applies descriptive word statistics to a transcript.

Usage

word_stats(
  text.var,
  grouping.var = NULL,
  tot = NULL,
  parallel = FALSE,
  rm.incomplete = FALSE,
  digit.remove = FALSE,
  apostrophe.remove = FALSE,
  digits = 3,
  ...
)

Arguments

text.var

The text variable or a "word_stats" object (i.e., the output of the word_stats function).

grouping.var

The grouping variables. The default NULL generates one set of statistics for all text. Also takes a single grouping variable or a list of one or more grouping variables.

tot

Optional turns of talk variable that yields turn of talk measures.

parallel

logical. If TRUE, attempts to run the function on multiple cores. This may not yield a speed boost if you have only one core or if the data set is small, because the cluster takes time to create (parallel is slower until approximately 10,000 rows). To reduce run time, pass a "word_stats" object to the word_stats function.

rm.incomplete

logical. If TRUE incomplete statements are removed from calculations in the output.

digit.remove

logical. If TRUE, digits are removed before the output is calculated.

apostrophe.remove

logical. If TRUE, apostrophes are removed before the output is calculated.

digits

Integer; number of decimal places to round to when printing.

...

Any other arguments passed to end_inc.

Details

Note that a sentence is classified by only one end mark. An imperative sentence is classified only as imperative (not also as state, quest, or exclm). If a sentence is both imperative and incomplete, it is counted as incomplete rather than imperative.
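The one-label-per-sentence rule can be sketched in base R. This is only an illustration, not the code word_stats uses: classify_sentence is a hypothetical helper, and the end marks are assumptions ("|" for an incomplete sentence, "*" placed before the end mark for an imperative).

```r
# Hypothetical helper (not part of qdap): assign exactly one label per sentence.
# Assumed end marks: "." statement, "?" question, "!" exclamation,
# "|" incomplete, and "*" before the end mark for an imperative.
classify_sentence <- function(x) {
  x <- trimws(x)
  if (grepl("\\|$", x)) return("incom")          # incomplete takes precedence
  if (grepl("\\*[.?!]$", x)) return("imper")     # imperative, and nothing else
  if (grepl("\\.$", x)) return("state")
  if (grepl("\\?$", x)) return("quest")
  if (grepl("!$", x)) return("exclm")
  NA_character_                                  # missing or improper end mark
}

sents <- c("I like it.", "Do you?", "Stop!", "Go now*.", "Wait*|", "no mark")
labels <- vapply(sents, classify_sentence, character(1), USE.NAMES = FALSE)
print(labels)
```

Checking for the incomplete mark first is what makes a sentence that is both imperative and incomplete count as incomplete.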

Value

Returns a list of descriptive word statistics:

ts

A data frame of descriptive word statistics by row

gts

A data frame of word/sentence statistics per grouping variable:

  • n.tot - number of turns of talk

  • n.sent - number of sentences

  • n.words - number of words

  • n.char - number of characters

  • n.syl - number of syllables

  • n.poly - number of polysyllables

  • sptot - syllables per turn of talk

  • wptot - words per turn of talk

  • wps - words per sentence

  • cps - characters per sentence

  • sps - syllables per sentence

  • psps - poly-syllables per sentence

  • cpw - characters per word

  • spw - syllables per word

  • n.state - number of statements

  • n.quest - number of questions

  • n.exclm - number of exclamations

  • n.incom - number of incomplete statements

  • p.state - proportion of statements

  • p.quest - proportion of questions

  • p.exclm - proportion of exclamations

  • p.incom - proportion of incomplete statements

  • n.hapax - number of hapax legomena (words used exactly once)

  • n.dis - number of dis legomena (words used exactly twice)

  • grow.rate - proportion of hapax legomena to words

  • prop.dis - proportion of dis legomena to words
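
To make a few of the columns above concrete, here is a rough base R sketch that computes them for a toy, ungrouped text. It is not how word_stats is implemented: the tokenization is a simplification, the sample sentences are invented, and the "turn.sentence" format of tot is an assumption.

```r
# Toy data (invented); one sentence per row.
sents <- c("the cat sat.", "the dog ran away.", "a cat ran!")
tot   <- c("1.1", "1.2", "2.1")   # assumed "turn.sentence" identifiers

# Simplified tokenization: drop punctuation, lower-case, split on whitespace.
clean <- tolower(gsub("[^[:alpha:][:space:]']", "", sents))
words <- unlist(strsplit(clean, "\\s+"))
freq  <- table(words)

n.tot <- length(unique(sub("\\..*$", "", tot)))   # distinct turns of talk

stats <- list(
  n.tot     = n.tot,
  n.sent    = length(sents),
  n.words   = length(words),
  n.char    = sum(nchar(words)),
  wptot     = length(words) / n.tot,              # words per turn of talk
  wps       = length(words) / length(sents),      # words per sentence
  cpw       = sum(nchar(words)) / length(words),  # characters per word
  n.hapax   = sum(freq == 1),                     # words used exactly once
  n.dis     = sum(freq == 2),                     # words used exactly twice
  grow.rate = sum(freq == 1) / length(words),
  prop.dis  = sum(freq == 2) / length(words)
)
str(stats)
```

Syllable-based columns (n.syl, n.poly, sps, spw) are left out because they depend on a syllable counter that this sketch does not attempt to reproduce.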

mpun

An account of sentences with an improper/missing end mark

word.elem

A data frame with word element columns from gts

sent.elem

A data frame with sentence element columns from gts

omit

Counter of omitted sentences for internal use (only included if some rows contained missing values)

percent

The value of percent used for plotting purposes.

zero.replace

The value of zero.replace used for plotting purposes.

digits

Integer value of the number of digits to display; mostly for internal use

Warning

It is assumed the user has run sentSplit on their data, otherwise some counts may not be accurate.
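
The following base R sketch illustrates why unsplit text can distort the counts. It assumes that each row is treated as a single sentence; the sample rows are invented, and the end mark set (".", "?", "!", "|") is an assumption.

```r
# Invented example: the first row holds two sentences that were never split.
rows <- c("I came. I saw.", "I conquered!")

# If each row counts as one sentence, the unsplit data yields 2 sentences.
n_rows <- length(rows)

# Counting end marks instead shows that 3 sentences are actually present.
n_endmarks <- sum(lengths(regmatches(rows, gregexpr("[.?!|]", rows))))

c(rows = n_rows, endmarks = n_endmarks)
```

When the two numbers differ, per-sentence measures such as wps would be computed against the wrong sentence count, which is the situation sentSplit is meant to prevent.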

See Also

end_inc

Examples

## Not run: 
word_stats(mraja1spl$dialogue, mraja1spl$person)

(desc_wrds <- with(mraja1spl, word_stats(dialogue, person, tot = tot)))

## Recycle for speed boost
with(mraja1spl, word_stats(desc_wrds, person, tot = tot)) 

scores(desc_wrds)
counts(desc_wrds)
htruncdf(counts(desc_wrds), 15, 6)
plot(scores(desc_wrds))
plot(counts(desc_wrds))

names(desc_wrds)
htruncdf(desc_wrds$ts, 15, 5)
htruncdf(desc_wrds$gts, 15, 6)
desc_wrds$mpun 
desc_wrds$word.elem
desc_wrds$sent.elem 
plot(desc_wrds)
plot(desc_wrds, label=TRUE, lab.digits = 1)

## Correlation Visualization
qheat(cor(scores(desc_wrds)[, -1]), diag.na = TRUE, by.column = NULL,
    low = "yellow", high = "red", grid = FALSE)

## Parallel (possible speed boost)
with(mraja1spl, word_stats(dialogue, list(sex, died, fam.aff))) 
with(mraja1spl, word_stats(dialogue, list(sex, died, fam.aff), 
    parallel = TRUE)) 
    
## Recycle for speed boost
word_stats(desc_wrds, mraja1spl$sex)

## End(Not run)

qdap documentation built on May 31, 2023, 5:20 p.m.