create_corpus: Prepare corpus data for dispersion calculations

View source: R/dispersion.R

create_corpus R Documentation

Prepare corpus data for dispersion calculations

Description

Creates a list object containing descriptive statistics and indexed tokens from either a raw corpus or a per-part frequency list.

Usage

create_corpus(
  tokens,
  parts,
  freq = NULL,
  vocab = NULL,
  doc_ids = NULL,
  type = c("per_part", "raw"),
  cutoff = 0L,
  with_distance = TRUE,
  no_match = c("fail", "remove", "keep")
)
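As a rough illustration of the two input shapes named by type, the base R sketch below builds a small raw corpus (one element per token occurrence) and collapses it into a per-part frequency list (one row per token and part, with a count). It does not call create_corpus. The variable names, and the assumption that "per_part" input corresponds to tokens, parts and freq vectors of this kind, are illustrative only.

```r
# Raw input: one entry per token occurrence, with the part it occurs in.
raw_tokens <- c("a", "b", "a", "c", "a", "b")
raw_parts  <- c("p1", "p1", "p1", "p2", "p2", "p2")

# Per-part frequency list: count each token within each part.
counts <- as.data.frame(
  table(tokens = raw_tokens, parts = raw_parts),
  responseName = "freq",
  stringsAsFactors = FALSE
)

# Drop token/part combinations that never occur.
per_part <- counts[counts$freq > 0, ]
rownames(per_part) <- NULL
per_part
```

Under that assumption, the columns of per_part would play the roles of tokens, parts and freq, while raw_tokens and raw_parts would be passed with type = "raw".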

Arguments

tokens

character vector with tokens

parts

character vector giving the corpus part of each element of tokens

freq

optional integer vector with counts

vocab

optional character or factor vector with unique tokens

doc_ids

optional character or factor vector with part ids

type

input type, either "per_part" or "raw"

cutoff

integer, minimum frequency for each type

with_distance

logical, whether to calculate the distances required for distance-based measures

no_match

character, how to handle tokens that are NA after the index has been created. Typically, this happens when vocab is given and does not contain all types in the corpus. "fail" (default): an error is thrown; "remove": NAs are removed; "keep": NAs are treated as a separate type of token.
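To make the no_match situation concrete, the sketch below reproduces it in base R: indexing tokens against a vocabulary that lacks one type yields NA. The helper handle_no_match is hypothetical and only mirrors the three documented options; it is not the package's implementation.

```r
tokens <- c("a", "b", "a", "c")
vocab  <- c("a", "b")              # "c" is missing from the vocabulary

# Index tokens against the vocabulary; unmatched tokens become NA.
idx <- match(tokens, vocab)

# Hypothetical helper mirroring the three documented options.
handle_no_match <- function(idx, no_match = c("fail", "remove", "keep")) {
  no_match <- match.arg(no_match)
  if (!anyNA(idx)) return(idx)
  switch(no_match,
    fail   = stop("tokens contain NAs after indexing"),
    remove = idx[!is.na(idx)],     # a real implementation would also drop the matching parts
    keep   = idx                   # NA is carried along as its own category
  )
}

removed <- handle_no_match(idx, "remove")
kept    <- handle_no_match(idx, "keep")
failed  <- inherits(try(handle_no_match(idx, "fail"), silent = TRUE), "try-error")
```

Extending vocab so that it covers every type in tokens avoids the NA case altogether.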

Details

Elements of the resulting list:

iparts

integer index of parts

l

number of tokens in the input

f

frequency of each unique token

i

integer index of tokens per part

j

integer index of parts per token

v

frequency of tokens per part

vocab

unique tokens

sort_ids

sorting permutation of tokens for use in distance based measures

sizes

sizes of parts
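The elements listed above can be approximated in base R for a toy corpus, which may help when reading them. This is a minimal sketch, not the package's code: it assumes that i, j and v form sparse (token, part, count) triplets and that sort_ids is a stable ordering of the token index. Both readings are assumptions, as is the helper name itok.

```r
tokens <- c("a", "b", "a", "c", "a", "b")
parts  <- c("p1", "p1", "p1", "p2", "p2", "p2")

vocab  <- unique(tokens)                  # unique tokens
itok   <- match(tokens, vocab)            # integer index of tokens (hypothetical name)
iparts <- match(parts, unique(parts))     # integer index of parts

l     <- length(tokens)                          # number of tokens in the input
f     <- tabulate(itok, nbins = length(vocab))   # frequency of each unique token
sizes <- tabulate(iparts)                        # sizes of parts

# Assumed reading of i, j, v: token index, part index, count in that part.
tab <- unclass(table(itok, iparts))
nz  <- which(tab > 0, arr.ind = TRUE)
i <- unname(nz[, 1])
j <- unname(nz[, 2])
v <- as.integer(tab[nz])

# Assumed reading of sort_ids: permutation that groups occurrences by token
# while keeping corpus order within each token.
sort_ids <- order(itok)
```

In this toy corpus the counts in v sum to l, and f can be recovered by summing v over i, which is a quick consistency check on the triplet reading.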

Value

list of type "corpus"


alex-raw/occurR documentation built on March 10, 2023, 5:08 p.m.