The logic behind this package is explained in the image below.
troop follows a data-parallel approach rather than a model-parallel one: the data is divided among workers that execute in parallel, each performing the work for its own shard and communicating its results once the task over that shard is complete. This is also called the SIMD (Single Instruction, Multiple Data) approach. troop implements it with the doParallel package of R: SOCK clusters are created and a chunk of data runs on each cluster.
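The data-parallel pattern described above can be sketched with base R's parallel package. This is only an illustration, not troop's implementation: it uses parLapply in place of doParallel so that it runs without extra packages, and the toy data frame, its column names, and the per-chunk sum are assumptions made for the example.

```r
library(parallel)

# Toy data (hypothetical): 12 rows spread over 3 groups
df <- data.frame(group = rep(c("a", "b", "c"), each = 4), value = 1:12)

# Split the data into one chunk per group
chunks <- split(df, df$group)

# Start a socket (SOCK) cluster with two worker processes
cl <- makeCluster(2, type = "PSOCK")

# Every worker runs the same function on the chunks it receives
chunk_sums <- parLapply(cl, chunks, function(chunk) sum(chunk$value))

# Always release the workers when done
stopCluster(cl)

# Combine the per-chunk results into a single vector
result <- unlist(chunk_sums)
```

The same function runs on every chunk and only the small per-chunk results travel back to the main session, which is what makes this pattern attractive when chunks are large and independent.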
troop::troop accepts the following arguments:
data
by
apply_func
preprocess_func
postprocess_func
num_chunks
preprocess_args
postprocess_args
packages (packages that apply_func relies on should be included here)
export (variables that apply_func relies on should be exported here)
combine
files_to_source
Prerequisite: R (>= 3.3.2)
troop can be installed directly from GitHub:
install.packages("devtools")
devtools::install_github("tejaslodaya/troop")
That's it! Now you can use the package on your machine.
Barebone example
library(data.table)
# fread() already returns a data.table, so no extra data.table() wrapper is needed
dt <- fread('sample.csv')
resR <- troop::troop(dt, by = c('column1','column2'), apply_func = nrow)
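For orientation, here is a sequential base-R sketch of the same split-apply idea: group the rows by two columns and apply nrow to each group. It does not call troop, and whether troop groups and names its results exactly this way is an assumption. The toy data frame is hypothetical and only reuses the column names from the example above.

```r
# Toy data (hypothetical) with the two grouping columns from the example
df <- data.frame(
  column1 = c("x", "x", "y", "y", "y"),
  column2 = c(1, 1, 1, 2, 2)
)

# One chunk per observed (column1, column2) combination;
# drop = TRUE discards combinations that never occur
chunks <- split(df, list(df$column1, df$column2), drop = TRUE)

# Apply the function to each chunk, here counting its rows
counts <- vapply(chunks, nrow, integer(1))
```

Here counts is a named integer vector with one entry per group, for instance counts[["x.1"]] for the rows where column1 is "x" and column2 is 1.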
Complex example
library(data.table)
# fread() already returns a data.table
dt <- fread('sample.csv')
var <- 10

foo <- function(data_chunk) {
  # some complex operation
  resR <- summary(data_chunk)
  return(resR)
}

# source files on each core
result <- troop::troop(dt, by = c('column1', 'column2'), apply_func = foo,
                       files_to_source = c('somefile.R', 'anotherfile.R'))

# using packages and exporting variables
result <- troop::troop(dt, by = c('column1', 'column2'), apply_func = foo,
                       num_chunks = 10, packages = c('RODBC', 'xgboost'),
                       export = c('var'), combine = 'c')
For more information, run ?troop::troop in the R console.
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests to us.
This project is licensed under the MIT License