corp_text: Tokenized text
In CorporaCoCo: Corpora Co-Occurrence Comparison

Description

Encapsulates the tokenization of a piece of text.

Usage

  corp_text(text, tokens = NULL)

  is.corp_text(obj)

  corp_text_rbindlist(x)

  ## S3 method for class 'corp_text'
  corp_type_lookup(obj)

Arguments

`text`	A `character` string containing the text that is the subject of the co-occurrence counting.
`tokens`	A `data.frame` containing `type`, `start` and `end` variables:

    $ tokens:Classes 'data.frame': 3 variables:
     ..$ type : chr
     ..$ start: int
     ..$ end  : int

`tokens` captures the types within the text along with their character positions. For example, we could represent the types in the text `"Do cats eat bats?"` with the `tokens` `data.frame`:

       type start end
    1:   do     1   2
    2: cats     4   7
    3:  eat     9  11
    4: bats    13  16

If the `tokens` argument is not supplied, the `tokens` will be calculated from the supplied `text` argument. The default behaviour is to tokenize on word boundaries according to the Unicode Standard, with the `types` being the unique set of lowercased extracted words. This is achieved using the stringi CRAN package and will work for any UTF-8 encoded text (in any language).
`obj`	A `corp_text` object, as returned by the `corp_text` function.
`x`	A `list` of `corp_text` objects.
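
The layout of the `tokens` argument described above can be illustrated with a small base R sketch. It builds the `type`, `start` and `end` columns for the text `"Do cats eat bats?"` by matching runs of letters. The regular expression is an assumption made for this sketch only; it is not the Unicode word-boundary tokenization that the package performs with stringi, and it will differ from it on text with digits, apostrophes or hyphens.

```r
# Illustrative text taken from the description of the `tokens` argument.
text <- "Do cats eat bats?"

# Locate runs of letters. This regular expression is a simplification
# (an assumption of this sketch), not the package's stringi-based method.
m <- gregexpr("[[:alpha:]]+", text)[[1]]
start <- as.integer(m)
end <- as.integer(start + attr(m, "match.length") - 1L)

# The type of each token is its lowercased form.
tokens <- data.frame(
  type  = tolower(substring(text, start, end)),
  start = start,
  end   = end,
  stringsAsFactors = FALSE
)
print(tokens)
```

A `data.frame` of this shape is what the Usage section suggests passing as `corp_text(text, tokens = tokens)` when the default tokenization is not wanted.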

Value

corp_text

Returns a corp_text object.

The corp_text object can be interrogated using the corp_get_* accessor functions.

A concordance can be generated from the corp_text object using the corp_concordance function.

The corp_text objects are used as arguments to the corp_cooccurrence function.

corp_type_lookup

Returns a data.table that can be used to look up the tokens associated with each type. See the Examples section.
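
The relationship between types and tokens can be sketched in base R as a grouping step: for each type, collect the distinct surface forms seen in the text. This is only an illustration of the idea, not the package's implementation, and it returns a plain `data.frame` rather than a data.table. The input values are copied from the `corp_get_tokens` output in the Examples section; the name `lookup` is hypothetical.

```r
# Token table copied from the corp_get_tokens() output in the Examples section.
tokens <- data.frame(
  type  = c("a", "man", "a", "plan", "a", "canal", "panama"),
  token = c("A", "man", "a", "plan", "a", "canal", "Panama"),
  stringsAsFactors = FALSE
)

# For each type, join its distinct tokens in order of first appearance.
by_type <- tapply(
  tokens$token, tokens$type,
  function(tok) paste(unique(tok), collapse = ", ")
)

# `lookup` is a hypothetical name; one row per type, sorted by type.
lookup <- data.frame(
  type   = names(by_type),
  tokens = as.vector(by_type),
  stringsAsFactors = FALSE
)
print(lookup)
```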

corp_text_rbindlist

Returns a corp_text object which is an ordered combination of the given list of corp_text objects.

TODO: Currently the texts are concatenated with a single space between them.
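
One consequence of joining texts with a single space is that token positions from later texts no longer line up unless they are shifted. The base R sketch below shows one way that bookkeeping could work. It is built on assumptions: plain lists stand in for `corp_text` objects, `combine_texts` is a hypothetical helper, and the source does not describe how the package adjusts positions internally.

```r
# Two hypothetical texts with pre-computed token positions.
# Plain lists stand in for corp_text objects (an assumption of this sketch).
t1 <- list(
  text   = "Do cats",
  tokens = data.frame(type = c("do", "cats"),
                      start = c(1L, 4L), end = c(2L, 7L),
                      stringsAsFactors = FALSE)
)
t2 <- list(
  text   = "eat bats?",
  tokens = data.frame(type = c("eat", "bats"),
                      start = c(1L, 5L), end = c(3L, 8L),
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: join texts with one space, shifting later positions.
combine_texts <- function(x) {
  text <- x[[1]]$text
  tokens <- x[[1]]$tokens
  for (item in x[-1]) {
    # Characters already consumed, plus one for the separating space.
    offset <- nchar(text) + 1L
    shifted <- item$tokens
    shifted$start <- shifted$start + offset
    shifted$end <- shifted$end + offset
    text <- paste(text, item$text)
    tokens <- rbind(tokens, shifted)
  }
  list(text = text, tokens = tokens)
}

combined <- combine_texts(list(t1, t2))
print(combined$text)
print(combined$tokens)
```

With the package itself, the equivalent call per the Usage section is `corp_text_rbindlist(x)`, where `x` is a list of `corp_text` objects.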

summary

Prints a summary of the token and type counts for the text.

Examples

    x <- "A man, a plan, a canal -- Panama!"

    y <- corp_text(x)

    corp_get_tokens(y)

    ##      type start end  token idx
    ## 1:      a     1   1      A   1
    ## 2:    man     3   5    man   2
    ## 3:      a     8   8      a   3
    ## 4:   plan    10  13   plan   4
    ## 5:      a    16  16      a   5
    ## 6:  canal    18  22  canal   6
    ## 7: panama    27  32 Panama   7

    corp_get_text(y)

    ## [1] "A man, a plan, a canal -- Panama!"

    corp_type_lookup(y)

    ##      type tokens
    ## 1:      a   A, a
    ## 2:  canal  canal
    ## 3:    man    man
    ## 4: panama Panama
    ## 5:   plan   plan

CorporaCoCo documentation built on Aug. 8, 2022, 5:09 p.m.