This method calls a local installation of TreeTagger[1] to tokenize and POS tag the given text.
treetag(
file,
treetagger = "kRp.env",
rm.sgml = TRUE,
lang = "kRp.env",
apply.sentc.end = TRUE,
sentc.end = c(".", "!", "?", ";", ":"),
encoding = NULL,
TT.options = NULL,
debug = FALSE,
TT.tknz = TRUE,
format = "file",
stopwords = NULL,
stemmer = NULL,
doc_id = NA,
add.desc = "kRp.env",
...
)
## S4 method for signature 'character'
treetag(
file,
treetagger = "kRp.env",
rm.sgml = TRUE,
lang = "kRp.env",
apply.sentc.end = TRUE,
sentc.end = c(".", "!", "?", ";", ":"),
encoding = NULL,
TT.options = NULL,
debug = FALSE,
TT.tknz = TRUE,
format = "file",
stopwords = NULL,
stemmer = NULL,
doc_id = NA,
add.desc = "kRp.env"
)
## S4 method for signature 'kRp.connection'
treetag(
file,
treetagger = "kRp.env",
rm.sgml = TRUE,
lang = "kRp.env",
apply.sentc.end = TRUE,
sentc.end = c(".", "!", "?", ";", ":"),
encoding = NULL,
TT.options = NULL,
debug = FALSE,
TT.tknz = TRUE,
format = NA,
stopwords = NULL,
stemmer = NULL,
doc_id = NA,
add.desc = "kRp.env"
)
file |
Either a connection or a character vector, valid path to a file,
containing the text to be analyzed.
If |
treetagger |
A character vector giving the TreeTagger script to be called. If set to |
rm.sgml |
Logical, whether SGML tags should be ignored and removed from output |
lang |
A character string naming the language of the analyzed corpus. See |
apply.sentc.end |
Logical,
whether the tokens defined in |
sentc.end |
A character vector with tokens indicating a sentence ending. These are applied in addition to TreeTagger's own results; they do not replace them. |
encoding |
A character string defining the character encoding of the input file,
like |
TT.options |
A list of options to configure how TreeTagger is called. You have two basic choices: Either you choose one of the pre-defined presets or you give a full set of valid options:
You can also set these options globally using |
debug |
Logical. Especially in cases where the presets wouldn't work as expected,
this switch can be used to examine the values |
TT.tknz |
Logical,
if |
format |
Either "file" or "obj",
depending on whether you want to scan files or analyze the text in a given object, like
a character vector. If the latter,
it will be written to a temporary file (see |
stopwords |
A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set
|
stemmer |
A function or method to perform stemming. For instance,
you can set |
doc_id |
Character string,
optional identifier of the particular document. Will be added to the |
add.desc |
Logical. If |
... |
Only used for the method generic. |
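The effect of sentc.end and stopwords can be illustrated without TreeTagger. The following is a minimal base R sketch and not the package's internal code: the token vector, the short stopword list and the column names are made up for illustration, and lower-casing both sides of the comparison is an assumption about how "comparison is done in lower case" plays out.

```r
# Hypothetical token vector, roughly what a tokenizer might return
tokens <- c("The", "cat", "sat", ".", "Was", "it", "tired", "?", "Yes", ";", "very")

# Default sentence-ending tokens, taken from the treetag() signature
sentc.end <- c(".", "!", "?", ";", ":")

# Hypothetical stopword list; any character vector would do
stopwords <- c("the", "it", "was")

# Flag tokens that are listed as sentence endings
is_sentc_end <- tokens %in% sentc.end

# Compare in lower case so that "The" matches the stopword "the"
is_stopword <- tolower(tokens) %in% tolower(stopwords)

result <- data.frame(
  token = tokens,
  sentc.end = is_sentc_end,
  stopword = is_stopword,
  stringsAsFactors = FALSE
)
print(result)
```

In a real call, a longer list such as tm::stopwords("en") would normally be passed as stopwords instead of a hand-written vector.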
Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also be stored in the resulting object and might be used by other functions at a later point. For example, treetag is called by freq.analysis, which will by default query this language definition unless explicitly told otherwise. The rationale is to let you keep tokenized and POS tagged objects of various languages in your workspace without having to track their languages yourself.
An object of class kRp.text. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.
m.eik michalke meik.michalke@hhu.de; support for various languages was contributed by Earl Brown (Spanish), Alberto Mirisola (Italian) and Alexandre Brulet (French).
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44–49.
[1] https://www.cis.lmu.de/~schmid/tools/TreeTagger/
freq.analysis, get.kRp.env, kRp.text
sample_file <- file.path(
path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
)
## Not run:
# first way to invoke POS tagging, using a built-in preset:
tagged.results <- treetag(
sample_file,
treetagger="manual",
lang="en",
TT.options=list(
path=file.path("~","bin","treetagger"),
preset="en"
)
)
# second way, use one of the batch scripts that come with TreeTagger:
tagged.results <- treetag(
sample_file,
treetagger=file.path("~","bin","treetagger","cmd","tree-tagger-english"),
lang="en"
)
# third option, set the above batch script in an environment object first:
set.kRp.env(
TT.cmd=file.path("~","bin","treetagger","cmd","tree-tagger-english"),
lang="en"
)
tagged.results <- treetag(
sample_file
)
# after tagging, use the resulting object with other functions in this package:
readability(tagged.results)
lex.div(tagged.results)
## enabling stopword detection and stemming
# if you also installed the packages tm and SnowballC,
# you can use some of their features with koRpus:
set.kRp.env(
TT.cmd="manual",
lang="en",
TT.options=list(
path=file.path("~","bin","treetagger"),
preset="en"
)
)
tagged.results <- treetag(
sample_file,
stopwords=tm::stopwords("en"),
stemmer=SnowballC::wordStem
)
# removing all stopwords now is simple:
tagged.noStopWords <- filterByClass(
tagged.results,
"stopword"
)
## End(Not run)
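The examples above all read from a file. As described for the format argument, text held in an R object can be tagged as well. The following is a hedged sketch of that variant. It assumes that TreeTagger lives under ~/bin/treetagger (as in the examples above) and that English language support comes from the separate koRpus.lang.en package. The sample text and the doc_id value are made up. The call is skipped if any of these prerequisites is missing.

```r
# Assumed TreeTagger location; adjust to your installation
tt_path <- file.path("~", "bin", "treetagger")

# Text held in an R object rather than in a file
txt <- c("TreeTagger tokenizes text.", "It also assigns POS tags.")

# Only attempt tagging if koRpus, English language support and the
# TreeTagger directory are all available
can_tag <- requireNamespace("koRpus", quietly = TRUE) &&
  requireNamespace("koRpus.lang.en", quietly = TRUE) &&
  dir.exists(path.expand(tt_path))

if (can_tag) {
  library(koRpus.lang.en)
  tagged.obj <- koRpus::treetag(
    txt,
    format = "obj",            # analyze the object, not a file path
    treetagger = "manual",
    lang = "en",
    doc_id = "sample_obj",     # hypothetical document identifier
    TT.options = list(path = tt_path, preset = "en")
  )
  print(tagged.obj)
} else {
  message("koRpus, koRpus.lang.en or TreeTagger not found; skipping.")
}
```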