corp_text | R Documentation |
Encapsulates the tokenization of a piece of text.
corp_text(text, tokens = NULL)
is.corp_text(obj)
corp_text_rbindlist(x)
## S3 method for class 'corp_text'
corp_type_lookup(obj)
text |
A |
tokens |
This is a data.frame with the structure:

  $ tokens:Classes 'data.frame': 3 variables:
   ..$ type : chr
   ..$ start: int
   ..$ end  : int

tokens captures the types within the text along with their character positions. For example, the types in a text could be represented as:

     type start end
  1: do       1   2
  2: cats     4   7
  3: eat      9  11
  4: bats    13  16

If the tokens argument is not supplied, the tokens are calculated from the supplied text argument. The default behaviour is to tokenize on word boundaries according to the Unicode Standard, with the types being the unique set of lowercased extracted words. This is achieved using the stringi CRAN package and works for any UTF-8 encoded text (in any language). |
obj |
A |
x |
A |
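The layout of the tokens argument described above can be illustrated with a minimal base R sketch. The helper `simple_tokens` is hypothetical and not part of the package; it splits on runs of letters and digits, which is only a rough stand-in for the Unicode word-boundary tokenization that the package performs with stringi.

```r
# Hypothetical helper, not part of the package: builds a tokens data.frame
# with the columns type, start and end, using base R only.
simple_tokens <- function(text) {
  # Locate runs of letters/digits. This is an assumption made for the sketch;
  # it approximates, but does not implement, Unicode word boundaries.
  m <- gregexpr("[[:alnum:]]+", text)[[1]]
  if (m[1] == -1L) {
    # No matches: return an empty table with the same columns.
    return(data.frame(type = character(0), start = integer(0),
                      end = integer(0), stringsAsFactors = FALSE))
  }
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  data.frame(
    type  = tolower(substring(text, start, end)),  # types are lowercased
    start = start,                                 # 1-based character positions
    end   = end,
    stringsAsFactors = FALSE
  )
}

toks <- simple_tokens("A man, a plan, a canal -- Panama!")
print(toks)
```

The result has the three columns listed for the tokens argument. Whether corp_text accepts such a table unchanged should be checked against the package itself.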
Returns a corp_text object.
The corp_text object can be interrogated using the corp_get_* accessor functions.
A concordance can be generated from the corp_text object using the corp_concordance function.
corp_text objects are used as arguments to the corp_cooccurrence function.
Returns a data.table that can be used to look up the tokens associated with each type. See the example.
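The idea of a type lookup can be sketched in base R as a grouping of surface tokens by their lowercased type. This is an illustration of the concept only, not the package's implementation; the input table is hand-built here and copies the type and token columns from the example output on this page.

```r
# Hand-built token table (an assumption for this sketch): one row per token,
# with its lowercased type and its original surface form.
tokens <- data.frame(
  type  = c("a", "man", "a", "plan", "a", "canal", "panama"),
  token = c("A", "man", "a", "plan", "a", "canal", "Panama"),
  stringsAsFactors = FALSE
)

# For each type, collect the distinct surface forms it covers.
# tapply returns the groups ordered by the sorted type values.
grouped <- tapply(tokens$token, tokens$type,
                  function(tk) paste(unique(tk), collapse = ", "))

lookup <- data.frame(type = names(grouped), tokens = as.vector(grouped),
                     stringsAsFactors = FALSE)
print(lookup)
```

Here the type "a" maps to both "A" and "a", which is the kind of relationship the lookup table is meant to expose.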
Returns a corp_text object which is an ordered combination of the given list of corp_text objects.
TODO: Currently the text is concatenated with a single space.
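Joining texts with a single space has a consequence for character positions, which the following base R sketch tries to make concrete. The helper `combine_texts` is hypothetical, and the offset shifting is an assumption about what keeping start and end consistent would require; the source states only that the text is concatenated with a single space.

```r
# Hypothetical helper, not part of the package: joins texts with a single
# space and shifts each token table so positions refer to the combined text.
combine_texts <- function(texts, token_list) {
  # Offset of each text in the combined string: the lengths of all preceding
  # texts plus one separator space for each of them.
  offsets <- c(0L, cumsum(nchar(texts) + 1L))[seq_along(texts)]
  shifted <- Map(function(tk, off) {
    tk$start <- tk$start + off
    tk$end   <- tk$end + off
    tk
  }, token_list, offsets)
  list(text = paste(texts, collapse = " "),
       tokens = do.call(rbind, shifted))
}

texts <- c("do cats", "eat bats")
token_list <- list(
  data.frame(type = c("do", "cats"), start = c(1L, 4L), end = c(2L, 7L),
             stringsAsFactors = FALSE),
  data.frame(type = c("eat", "bats"), start = c(1L, 5L), end = c(3L, 8L),
             stringsAsFactors = FALSE)
)

combined <- combine_texts(texts, token_list)
print(combined$text)
print(combined$tokens)
```

The tokens of the second text are moved by the length of the first text plus the separator, so their positions still index into the combined string.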
Prints a summary of the token and type counts for the text.
corp_cooccurrence and corp_concordance.
x <- "A man, a plan, a canal -- Panama!"
y <- corp_text(x)

corp_get_tokens(y)
##      type start end  token idx
## 1:      a     1   1      A   1
## 2:    man     3   5    man   2
## 3:      a     8   8      a   3
## 4:   plan    10  13   plan   4
## 5:      a    16  16      a   5
## 6:  canal    18  22  canal   6
## 7: panama    27  32 Panama   7

corp_get_text(y)
## [1] "A man, a plan, a canal -- Panama!"

corp_type_lookup(y)
##      type tokens
## 1:      a   A, a
## 2:  canal  canal
## 3:    man    man
## 4: panama Panama
## 5:   plan   plan