treetag_parallel: Wrapper around treetag function from koRpus for parallel part...
In manuelbickel/textility: Utility functions for text mining

treetag_parallel

R Documentation

Wrapper around the treetag function from koRpus for parallel part-of-speech tagging with TreeTagger

Description

NOTE: To increase speed, the input documents are joined into a single string, separated by a marker phrase that mimics a sentence. Depending on your downstream analysis, this may affect the usability of the results.

NOTE: The packages doParallel and foreach have to be loaded. The package koRpus has to be installed.

Usage

treetag_parallel(docs, ids = NULL, ncores = 2, chunk_size = NULL,
  language = "en", treetagger_path = "C:/TreeTagger")

Arguments

`docs`	The strings to tag.
`ids`	Ids used to mark each document. By default set to NULL which results in `ids = seq_along(docs)`.
`ncores`	Number of cores to use. By default 2.
`chunk_size`	Maximum number of documents in the subset passed to a single thread for tagging. By default set to length(docs)/ncores. Depending on the size and number of documents, a different chunk size might be more reasonable.
`language`	Language to be assumed for the documents. By default "en". The parameter is passed to `treetag` from the koRpus package.
`treetagger_path`	Path to the Treetagger program that has to be installed separately. By default "C:/TreeTagger".
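
The interplay of `ncores` and `chunk_size` can be illustrated with a minimal base R sketch. The helper `split_into_chunks` is hypothetical and not part of textility, and rounding the default chunk size up with `ceiling()` is an assumption made here so that the chunk size is a whole number; the package itself may handle this differently.

```r
# Hypothetical helper (not part of textility): split documents into chunks
# of at most chunk_size documents, one chunk per unit of parallel work.
split_into_chunks <- function(docs, ncores = 2, chunk_size = NULL) {
  if (is.null(chunk_size)) {
    # Assumption: round length(docs)/ncores up to a whole number
    chunk_size <- ceiling(length(docs) / ncores)
  }
  # Documents 1..chunk_size go to chunk 1, the next chunk_size to chunk 2, ...
  chunk_index <- ceiling(seq_along(docs) / chunk_size)
  split(docs, chunk_index)
}

chunks <- split_into_chunks(paste("doc", 1:10), ncores = 4)
lengths(chunks)  # number of documents per chunk
```

Smaller chunks spread the work more evenly across cores, while larger chunks reduce the number of separate tagging calls, so the best value depends on the corpus.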

Value

A data.table including the tagged tokens. Documents appear consecutively in the data.table and are marked with the provided ids. Furthermore, the beginning of each document is marked with the token "STARTOFDOCMARKER". This marker is introduced before tagging by collapsing the documents with paste(..., collapse = ". STARTOFDOCMARKER ."). As a result, additional "." tokens also appear in the output.
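
As a rough illustration of the marker mechanism described above, the following base R sketch collapses two documents with the marker phrase, tokenizes the result, and assigns document ids from the marker positions. The regex tokenizer is only a stand-in for TreeTagger, and deriving ids with `cumsum()` is an assumption for illustration, not necessarily how the package does it.

```r
docs <- c("This is the first sentence.", "This is the second sentence to tag.")
ids <- seq_along(docs)
marker <- "STARTOFDOCMARKER"

# Join all documents into one string, separated by the marker "sentence"
collapsed <- paste(docs, collapse = paste0(". ", marker, " ."))

# Stand-in tokenizer (assumption): words and single punctuation marks
tokens <- regmatches(collapsed, gregexpr("[[:alnum:]]+|[[:punct:]]", collapsed))[[1]]

# Each marker token starts a new document; the first document has no marker
# in front of it because paste() only inserts the separator between elements
doc_index <- cumsum(tokens == marker) + 1
tagged <- data.frame(doc_id = ids[doc_index], token = tokens)

# Drop the marker and the "." tokens, as in the comparison in the Examples
cleaned <- tagged[!(tagged$token %in% c(marker, ".")), ]
table(cleaned$doc_id)  # tokens per document after cleaning
```

Note that removing every "." token also removes the sentence-final dots that belong to the original documents, which is one way the marker approach can affect downstream analysis.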

Examples


library(koRpus)      # provides treetag()
library(data.table)  # as.data.table() and the data.table syntax used below
library(doParallel)  # has to be loaded for treetag_parallel()
library(foreach)     # has to be loaded for treetag_parallel()

docs <- c("This is the first sentence.", "This is the second sentence to tag.")

# For two short documents the parallel wrapper is slower than a plain treetag() call
system.time(treetag_parallel(docs = docs, ids = seq_along(docs)))
# User      System elapsed
# 0.03        0.00    1.81
system.time(treetag(docs, treetagger = "manual", format = "obj",
                    TT.tknz = FALSE, lang = "en",
                    TT.options = list(path = "C:/TreeTagger", preset = "en")))
# User      System     elapsed
# 0.03        0.00        0.91

# Only a small number of documents is used here so that the code runs quickly.
# For a larger number of documents the time advantage will be higher.
many_longer_docs <- rep(paste(rep(docs, 30), collapse = " "), 200)
ncores <- 4
chunk_size <- length(many_longer_docs)/ncores # this is the default when chunk_size = NULL
system.time(res_parallel <- treetag_parallel(docs = many_longer_docs,
                                             ids = seq_along(many_longer_docs),
                                             chunk_size = chunk_size, ncores = ncores))
# User      System     elapsed
# 0.13        0.03       11.01

system.time(res_standard <- treetag(many_longer_docs, treetagger = "manual", format = "obj",
                                    TT.tknz = FALSE, lang =  "en",
                                    TT.options=list(path = "C:/TreeTagger", preset ="en")))
# User      System   elapsed
# 13.61       2.01     16.97

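# Make the results comparable: drop the "." tokens from both results,
# and the marker tokens and the doc_id column from the parallel result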
res_standard <- as.data.table(res_standard@TT.res)
res_standard <- res_standard[!(token %in% "."), ]

res_parallel <- res_parallel[!(token %in% c("STARTOFDOCMARKER", ".")), ]
res_parallel <- res_parallel[, .SD, .SDcols = setdiff(colnames(res_parallel), c("doc_id"))]

all.equal(res_parallel,res_standard)
#[1] TRUE

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.