create_corpus: Prepare corpus data for dispersion calculations
In alex-raw/occurR: Tools for Working with Word Frequency Lists

Description

Creates a list object containing descriptive statistics and indexed tokens from either a raw corpus or a per-part frequency list.
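The two input forms can be pictured with a small base R sketch. It does not call `create_corpus`, and it assumes that a per-part frequency list holds one count per (token, part) pair; the names `raw_tokens`, `raw_parts` and `per_part` are hypothetical.

```r
# Base R sketch only; the layout of the per-part form is an assumption.

# Raw form: one element per running token, plus the part it occurs in.
raw_tokens <- c("a", "b", "a", "c", "a", "b")
raw_parts  <- c("p1", "p1", "p1", "p2", "p2", "p2")

# Per-part form: one row per (token, part) pair with its count.
per_part <- aggregate(
  data.frame(freq = rep(1L, length(raw_tokens))),
  by = list(tokens = raw_tokens, parts = raw_parts),
  FUN = sum
)

print(per_part)
```

Under this reading, the columns `tokens`, `parts` and `freq` of the aggregated table line up with the arguments of the same names.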

Usage

create_corpus(
  tokens,
  parts,
  freq = NULL,
  vocab = NULL,
  doc_ids = NULL,
  type = c("per_part", "raw"),
  cutoff = 0L,
  with_distance = TRUE,
  no_match = c("fail", "remove", "keep")
)

Arguments

`tokens`	character vector with tokens
`parts`	character vector
`freq`	optional integer vector of counts
`vocab`	optional character vector or factor of unique tokens
`doc_ids`	optional character vector or factor of part ids
`type`	input type, either "per_part" or "raw"
`cutoff`	integer, minimum frequency for each type
`with_distance`	logical, whether to calculate the distances required for distance-based measures
`no_match`	character, how to handle tokens that are NA after the index is created. Typically, this happens when `vocab` is given and doesn't contain all types in the corpus. "fail" (default): throw an error; "remove": remove the NAs; "keep": treat NAs as a separate type of token
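The three `no_match` options can be illustrated with a minimal base R sketch. This is not the package's implementation: `handle_no_match` is a hypothetical helper, and giving unmatched tokens one extra index under "keep" is an assumption about what "separate type" means.

```r
# Base R sketch only; not taken from the occurR source.
tokens <- c("a", "b", "c", "a", "d")
vocab  <- c("a", "b", "c")          # "d" is missing from the vocabulary

idx <- match(tokens, vocab)         # NA wherever a token is not in vocab

# Hypothetical helper mirroring the three documented options.
handle_no_match <- function(idx, n_types,
                            no_match = c("fail", "remove", "keep")) {
  no_match <- match.arg(no_match)
  if (!anyNA(idx)) return(idx)      # nothing to handle
  switch(no_match,
    fail   = stop("tokens contain types that are not in `vocab`"),
    remove = idx[!is.na(idx)],                      # drop unmatched tokens
    keep   = replace(idx, is.na(idx), n_types + 1L) # assumed: one extra type
  )
}

removed <- handle_no_match(idx, length(vocab), "remove")
kept    <- handle_no_match(idx, length(vocab), "keep")
failed  <- inherits(
  try(handle_no_match(idx, length(vocab), "fail"), silent = TRUE),
  "try-error"
)
```

Here `removed` is one element shorter than `tokens`, `kept` has the same length with the unmatched token indexed separately, and `failed` records that the "fail" option raised an error.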

Details

The returned list contains the following elements:

iparts: integer index of parts
l: number of tokens in the input
f: frequency of each unique token
i: integer index of tokens per part
j: integer index of parts per token
v: frequency of each token per part
vocab: unique tokens
sort_ids: sorting permutation of tokens for use in distance-based measures
sizes: sizes of parts
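To make these quantities concrete, the sketch below derives rough base R analogues of several of them from a toy raw corpus. It is an illustration under assumptions, not the occurR implementation: the exact layout and ordering of `i`, `j`, `v` and `sort_ids` in the package may differ, and `tok_id` and `counts` are hypothetical names.

```r
# Base R sketch only; occurR's internals and ordering may differ.
tokens <- c("a", "b", "a", "c", "a", "b")
parts  <- c("p1", "p1", "p1", "p2", "p2", "p2")

vocab  <- unique(tokens)                  # unique tokens: "a" "b" "c"
tok_id <- match(tokens, vocab)            # integer index of each token
iparts <- match(parts, unique(parts))     # integer index of each part

l     <- length(tokens)                   # number of tokens in the input
f     <- tabulate(tok_id, nbins = length(vocab))  # frequency of each type
sizes <- tabulate(iparts)                 # number of tokens in each part

# Per-part frequencies as (token index, part index, count) triples,
# keeping only the pairs that actually occur.
counts <- as.data.frame(table(i = tok_id, j = iparts),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
v <- counts$Freq                          # frequency of each token per part

# A permutation that groups the positions of identical tokens together,
# which is the kind of ordering distance-based measures can work from.
sort_ids <- order(tok_id)
```

In this toy corpus the per-part counts in `v` sum to `l`, and `sizes` sums to `l` as well, which is a quick consistency check on any such structure.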