View source: R/treetag_parallel.R
treetag_parallel | R Documentation |
NOTE: To increase speed, the input documents are pasted together by a marker phrase mimicking a sentence.
This may or may not have an effect on the usability of results for your downstream analysis.
NOTE: The packages doParallel and foreach have to be loaded. The package koRpus has to be installed.
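A minimal sketch of checking these prerequisites before calling the function. The variable names (needed, available, missing_pkgs) are illustrative and not part of the package; only base R functions are used for the check itself.

```r
# Packages named in the NOTE above.
needed <- c("doParallel", "foreach", "koRpus")

# requireNamespace() returns TRUE/FALSE instead of raising an error.
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!available]

if (length(missing_pkgs) > 0) {
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
} else {
  # doParallel and foreach have to be attached; koRpus only has to be installed.
  library(doParallel)
  library(foreach)
}
```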
treetag_parallel(docs, ids = NULL, ncores = 2, chunk_size = NULL, language = "en", treetagger_path = "C:/TreeTagger")
docs | The strings to tag. |
ids | Ids used to mark each document. By default set to NULL which results in |
ncores | Number of cores to use. By default 2. |
chunk_size | Maximum number of documents to pass to a single thread for tagging. By default set to length(docs)/ncores. Depending on document size and number of documents, different chunk sizes might be reasonable. |
language | Language to be assumed for the documents. By default "en". Parameter is passed to |
treetagger_path | Path to the TreeTagger program, which has to be installed separately. By default "C:/TreeTagger". |
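The effect of chunk_size can be illustrated with a small base R sketch. This is not the package's internal code: the rounding with ceiling() and the use of split() are assumptions made here to show how a document vector could be divided into subsets of at most chunk_size documents.

```r
# Ten toy documents to distribute over four cores.
docs <- sprintf("Document number %d.", 1:10)
ncores <- 4

# Assumption: round up so that every chunk size is a whole number.
chunk_size <- ceiling(length(docs) / ncores)

# Documents 1..chunk_size go to chunk 1, the next chunk_size to chunk 2, etc.
chunk_index <- ceiling(seq_along(docs) / chunk_size)
chunks <- split(docs, chunk_index)

# Number of documents per chunk.
lengths(chunks)
```

Smaller chunks spread the work more evenly across threads, while larger chunks reduce the number of separate tagger calls.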
A data.table including tagged tokens. Documents appear consecutively in the data.table, marked with the provided ids.
Furthermore, the beginning of each document is marked with the token "STARTOFDOCMARKER".
This marker is introduced before tagging by collapsing documents with paste(..., collapse = ". STARTOFDOCMARKER .").
Hence, additional dots also appear in the output.
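The collapsing step and its consequences can be reproduced in base R without TreeTagger. In the sketch below, the collapse string is taken from the description above; the token vector is hypothetical and only imitates what a tagger might return, and the filtering step is one possible way to clean the result.

```r
docs <- c("This is the first sentence.", "This is the second sentence to tag.")

# Collapse all documents into one string, separated by the marker phrase.
collapsed <- paste(docs, collapse = ". STARTOFDOCMARKER .")
collapsed

# Hypothetical token column as it might look after tagging two short documents.
tokens <- c("First", "doc", ".", "STARTOFDOCMARKER", ".", "Second", "doc", ".")

# Drop the marker and all dots; note that this also removes genuine
# sentence-final dots, not only those introduced by the marker phrase.
cleaned <- tokens[!(tokens %in% c("STARTOFDOCMARKER", "."))]
cleaned
```

If sentence-final dots matter for the downstream analysis, removing only the marker token and its two neighbouring dots would be a more conservative choice.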
# Packages used in this example
library(doParallel)
library(foreach)
library(koRpus)
library(data.table)

docs <- c("This is the first sentence.", "This is the second sentence to tag.")

system.time(treetag_parallel(docs = docs, ids = seq_along(docs)))
#   User  System elapsed
#   0.03    0.00    1.81

system.time(treetag(docs, treetagger = "manual", format = "obj",
                    TT.tknz = FALSE, lang = "en",
                    TT.options = list(path = "C:/TreeTagger", preset = "en")))
#   User  System elapsed
#   0.03    0.00    0.91

# Only a small number of documents is used above so that the code runs quickly.
# For a larger number of documents the time advantage will be greater.
many_longer_docs <- rep(paste(rep(docs, 30), collapse = " "), 200)
ncores <- 4
chunk_size <- length(many_longer_docs) / ncores # this is the default when chunk_size = NULL

system.time(res_parallel <- treetag_parallel(docs = many_longer_docs,
                                             ids = seq_along(many_longer_docs),
                                             chunk_size = chunk_size,
                                             ncores = ncores))
#   User  System elapsed
#   0.13    0.03   11.01

system.time(res_standard <- treetag(many_longer_docs, treetagger = "manual",
                                    format = "obj", TT.tknz = FALSE, lang = "en",
                                    TT.options = list(path = "C:/TreeTagger", preset = "en")))
#   User  System elapsed
#  13.61    2.01   16.97

# Make results comparable: drop dots, the document marker and the doc_id column
res_standard <- as.data.table(res_standard@TT.res)
res_standard <- res_standard[!(token %in% "."), ]
res_parallel <- res_parallel[!(token %in% c("STARTOFDOCMARKER", ".")), ]
res_parallel <- res_parallel[, .SD, .SDcols = setdiff(colnames(res_parallel), c("doc_id"))]
all.equal(res_parallel, res_standard)
# [1] TRUE