corp_text: Tokenized text

corp_text    R Documentation

Description

Encapsulates the tokenization of a piece of text.

Usage

  corp_text(text, tokens = NULL)

  is.corp_text(obj)

  corp_text_rbindlist(x)

  ## S3 method for class 'corp_text'
  corp_type_lookup(obj)

Arguments

text

A character string containing the text to be tokenized, that is, the text which is the subject of the co-occurrence counting.

tokens

A data.frame containing the variables type, start and end:

    $ tokens:Classes 'data.frame': 3 variables:
     ..$ type : chr
     ..$ start: int
     ..$ end  : int

tokens captures the types within the text along with their character positions; start and end are 1-based, inclusive character offsets into text. For example, we could represent the types in the text "Do cats eat bats?" with the tokens data.frame:

       type start end
    1:   do     1   2
    2: cats     4   7
    3:  eat     9  11
    4: bats    13  16
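The table above can be built and checked by hand in base R. This is only a sketch: the variable names are illustrative, and the final call is left commented out because it assumes the CorporaCoCo package is installed.

```r
text <- "Do cats eat bats?"

# Hand-built tokens table: one row per token, with 1-based inclusive positions.
tokens <- data.frame(
  type  = c("do", "cats", "eat", "bats"),
  start = c(1L, 4L, 9L, 13L),
  end   = c(2L, 7L, 11L, 16L),
  stringsAsFactors = FALSE
)

# Sanity check: slicing the text at start..end and lowercasing it
# should reproduce the type recorded for each row.
extracted <- substring(text, tokens$start, tokens$end)
stopifnot(identical(tolower(extracted), tokens$type))

# With CorporaCoCo attached, the table would be passed as the tokens argument:
# y <- corp_text(text, tokens = tokens)
```

Running a check like this before supplying custom tokens catches off-by-one errors in the start and end columns.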

If the tokens argument is not supplied, the tokens will be calculated from the supplied text argument. The default behaviour is to tokenize on word boundaries according to the Unicode Standard with the types being the unique set of lowercased extracted words. This is achieved using the stringi CRAN package and will work for any UTF-8 encoded text (in any language).
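The default behaviour described above can be approximated directly with stringi. The sketch below is an assumption about how such a tokenizer might look, not the package's actual implementation; sketch_tokenize is a hypothetical helper name, and the example call is guarded so that nothing runs if stringi is not installed.

```r
# Hypothetical helper: approximate the default tokenization with stringi.
sketch_tokenize <- function(text) {
  # Locate words using Unicode word-boundary rules; the result is a
  # matrix with "start" and "end" columns, one row per word.
  pos <- stringi::stri_locate_all_words(text)[[1]]
  token <- stringi::stri_sub(text, from = pos[, "start"], to = pos[, "end"])
  data.frame(
    type  = stringi::stri_trans_tolower(token),  # types are lowercased words
    start = pos[, "start"],
    end   = pos[, "end"],
    stringsAsFactors = FALSE
  )
}

if (requireNamespace("stringi", quietly = TRUE)) {
  print(sketch_tokenize("Do cats eat bats?"))
}
```

Note that this sketch does not handle text containing no words, for which stri_locate_all_words returns a row of NA values.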

obj

A corp_text object, as returned by the corp_text function.

x

A list of corp_text objects.

Value

corp_text

Returns a corp_text object.

The corp_text object can be interrogated using the corp_get_* accessor functions.

A concordance can be generated from the corp_text object using the corp_concordance function.

The corp_text objects are used as arguments to the corp_cooccurrence function.

corp_type_lookup

Returns a data.table that can be used to look up the tokens associated with each type. See Examples.
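The idea behind the lookup can be illustrated in base R by grouping the surface tokens under their lowercased type. This is a rough sketch of the concept only: it returns a named list rather than the data.table that corp_type_lookup returns, and the token table is typed in by hand from the Examples section.

```r
# Types and surface tokens for "A man, a plan, a canal -- Panama!"
tokens <- data.frame(
  type  = c("a", "man", "a", "plan", "a", "canal", "panama"),
  token = c("A", "man", "a", "plan", "a", "canal", "Panama"),
  stringsAsFactors = FALSE
)

# Group the tokens by type and keep each distinct surface form once.
lookup <- lapply(split(tokens$token, tokens$type), unique)

lookup[["a"]]  # the distinct surface forms recorded for type "a"
```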

corp_text_rbindlist

Returns a corp_text object which is an ordered combination of the given list of corp_text objects.

TODO: Currently the texts of the individual objects are concatenated with a single space as the separator.
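To make the combination concrete, the base R sketch below joins texts with a single space and shifts each token table by the length of the text that precedes it. The offset handling is an assumption about what an ordered combination implies, not a description of the package internals, and combine_sketch is a hypothetical name.

```r
# Hypothetical helper: join texts with a single space and re-base token positions.
combine_sketch <- function(texts, token_list) {
  offset <- 0L
  shifted <- vector("list", length(texts))
  for (i in seq_along(texts)) {
    tk <- token_list[[i]]
    tk$start <- tk$start + offset
    tk$end   <- tk$end + offset
    shifted[[i]] <- tk
    # Skip past this text plus the single separating space.
    offset <- offset + nchar(texts[[i]]) + 1L
  }
  list(text = paste(texts, collapse = " "), tokens = do.call(rbind, shifted))
}

texts <- c("Do cats", "eat bats?")
token_list <- list(
  data.frame(type = c("do", "cats"), start = c(1L, 4L), end = c(2L, 7L),
             stringsAsFactors = FALSE),
  data.frame(type = c("eat", "bats"), start = c(1L, 5L), end = c(3L, 8L),
             stringsAsFactors = FALSE)
)

combined <- combine_sketch(texts, token_list)
```

Under these assumptions, the combined token positions index into the joined text in the same way as positions computed from the full text in one pass.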

summary

Prints a summary of the token and type counts for the text.

See Also

corp_cooccurrence and corp_concordance.

Examples

    x <- "A man, a plan, a canal -- Panama!"

    y <- corp_text(x)

    corp_get_tokens(y)

    ##      type start end  token idx
    ## 1:      a     1   1      A   1
    ## 2:    man     3   5    man   2
    ## 3:      a     8   8      a   3
    ## 4:   plan    10  13   plan   4
    ## 5:      a    16  16      a   5
    ## 6:  canal    18  22  canal   6
    ## 7: panama    27  32 Panama   7

    corp_get_text(y)

    ## [1] "A man, a plan, a canal -- Panama!"

    corp_type_lookup(y)

    ##      type tokens
    ## 1:      a   A, a
    ## 2:  canal  canal
    ## 3:    man    man
    ## 4: panama Panama
    ## 5:   plan   plan

CorporaCoCo documentation built on Aug. 8, 2022, 5:09 p.m.