Transforms text in koRpus objects token by token.
textTransform(txt, ...)
## S4 method for signature 'kRp.text'
textTransform(
txt,
scheme,
p = 0.5,
paste = FALSE,
var = "wclass",
query = "fullstop",
method = "replace",
replacement = ".",
f = NA,
...
)
txt |
An object of class |
... |
Parameters passed to |
scheme |
One of the following character strings:
|
p |
Numeric value between 0 and 1. Defines the probability for upper case letters (relevant only
if |
paste |
Logical, see value section. |
var |
A character string naming a variable in the object (i.e.,
colname). See |
query |
A character vector (for words), regular expression,
or single number naming values to be matched in the variable.
See |
method |
One of the following character strings:
In case of |
replacement |
Character string defining the exact token to replace all query matches with.
Relevant only if |
f |
A function to calculate the replacement for all query matches.
Relevant only if |
This method is mainly intended to produce text material for experiments.
By default an object of class kRp.text with the added feature diff is returned.
The feature provides a list of mostly atomic vectors that describe how much the two text variants differ, as percentages:
all.tokens
:Percentage of all tokens, including punctuation, that were altered.
words
:Percentage of altered words only.
all.chars
:Percentage of all characters, including punctuation, that were altered.
letters
:Percentage of altered letters in words only.
transfmt
:Character vector documenting the transformation(s) done to the tokens.
transfmt.equal
:Data frame documenting which token was changed in which transformational step. Only available if more than one transformation was done.
transfmt.normalize
:A list documenting steps of normalization that were done to the object, one element per transformation. Each entry holds the name of the method, the query parameters, and the effective replacement value.
If paste=TRUE, returns an atomic character vector (via pasteText).
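To make the percentages listed above more concrete, here is a minimal sketch in plain R of how such values could be computed by hand. It is not the code koRpus uses internally; the token vectors and the variable names (original, transformed, pct_tokens, pct_chars) are made up for illustration, and the character comparison assumes that every token keeps its length.

```r
# made-up token vectors before and after a transformation (illustration only)
original    <- c("This", "is", "a", "test", ".")
transformed <- c("THIS", "is", "A", "test", ".")

# share of tokens that differ, as a percentage
pct_tokens <- mean(original != transformed) * 100

# share of characters that differ, compared position by position;
# this assumes each token keeps its original length
orig_chars  <- unlist(strsplit(original, ""))
trans_chars <- unlist(strsplit(transformed, ""))
pct_chars   <- mean(orig_chars != trans_chars) * 100
```

For a real object, the stored values are available via diffText, as shown in the examples.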
You can dynamically calculate the replacement value for the "normalize" scheme by setting method="function" and providing a function object as f. The function you provide must support the following arguments:
tokens
:The original tokens slot of the txt object (see taggedText).
match
:A logical vector, indicating for each row of tokens whether it's a query match or not.
You can then use these arguments in your function body to calculate the replacement, e.g. tokens[match,"token"] to get all relevant tokens.
The return value of the function will be used as the replacement for all matched tokens. You probably want to make sure it's a character vector of length one or of the same length as all matches.
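A minimal sketch of such a function might look like the following. The helper name upper_case_matches is hypothetical, and the small data frame is a made-up stand-in for the tokens slot so the function can be tried without a koRpus object. It assumes only that tokens has a character column named token, as the tokens[match,"token"] example above implies.

```r
# hypothetical replacement function; name and behavior are illustration only
upper_case_matches <- function(tokens, match){
  # return one upper-cased replacement per matched row
  toupper(tokens[match, "token"])
}

# made-up stand-in for the tokens data frame of a tagged text object
mock_tokens <- data.frame(
  token=c("This", "is", "a", "test", "."),
  wclass=c("word", "word", "word", "word", "fullstop"),
  stringsAsFactors=FALSE
)

# logical vector marking the query matches, here the words "is" and "a"
mock_match <- mock_tokens[["token"]] %in% c("is", "a")

# one replacement per match, in the order of the matched rows
replacement <- upper_case_matches(tokens=mock_tokens, match=mock_match)
```

With a real object, the function would presumably be passed along the lines of textTransform(obj, scheme="normalize", var="token", query=c("is", "a"), method="function", f=upper_case_matches), where obj stands for your own kRp.text object and the var and query values are assumptions chosen to mirror the mock data.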
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
sample_file <- file.path(
path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
)
tokenized.obj <- tokenize(
txt=sample_file,
lang="en"
)
tokenized.obj <- textTransform(
tokenized.obj,
scheme="random"
)
pasteText(tokenized.obj)
# diff stats are now part of the object
hasFeature(tokenized.obj)
diffText(tokenized.obj)
} else {}