The package has been developed using R6 classes to implement, for example, the extraction of input data and the 18 pipes that the application provides by default. The following sections explain the tools that make up the application and how its features can be extended.
For more specific questions, consult the package help through help(package = "bdpar").
The package implements four types of input data:
Email:
SMS:
Tweet IDs:
YouTube comment IDs:
```{R, echo = TRUE, results = "hide"}
library(R6)

ExtractorTytb <- R6Class(
  classname = "ExtractorTytb",
  inherit = Instance,
  public = list(
    initialize = function(path) {
      if (!"character" %in% class(path)) {
        stop("[ExtractorTytb][initialize][Error] ",
             "Checking the type of the variable: path ",
             class(path))
      }
      super$initialize(path)
    },
    obtainDate = function() {
      super$setDate(file.info(super$getPath())[["ctime"]])
    },
    obtainSource = function() {
      super$setSource(readLines(super$getPath(), warn = FALSE))
      super$setData(super$getSource())
    }
  )
)
```
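To make the two overridden methods of *ExtractorTytb* more concrete, the following sketch runs the same base R calls they rely on (`file.info` and `readLines`) against a temporary file. The `.tytb` file and the ID written into it are placeholders invented for this example, and no bdpar class is involved.

```r
# Create a throwaway file standing in for a .tytb input (hypothetical content).
path <- tempfile(fileext = ".tytb")
writeLines("example-comment-id", path)

# Same call as in obtainDate(): the "ctime" column reported by file.info().
created <- file.info(path)[["ctime"]]

# Same call as in obtainSource(): the raw lines of the file.
src <- readLines(path, warn = FALSE)

print(class(created))
print(src)

# Clean up the temporary file.
unlink(path)
```

In the extractor itself these values are stored through *setDate*, *setSource* and *setData* rather than printed.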
### Enabling a new Instance

<div style = "text-align: justify">
To execute the new *Instance* class (*ExtractorTytb*) automatically, it must be registered through the *registerExtractor* method of the *ExtractorFactory* class. The example below shows how this extractor is registered.
</div>

```{R, echo = TRUE, results = "hide",ExtractorTytb}
library(bdpar)
extractors <- ExtractorFactory$new()
extractors$registerExtractor("tytb", ExtractorTytb)
```
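The idea behind registering an extractor under a key such as "tytb" can be illustrated with a small registry in base R. This is only a sketch of the general factory pattern: it assumes the key is matched against the file extension, and the names `registry`, `register_extractor` and `create_extractor` are hypothetical and not part of bdpar.

```r
# Hypothetical registry keyed by file extension; not bdpar's implementation.
registry <- new.env()

register_extractor <- function(extension, constructor) {
  stopifnot(is.character(extension), is.function(constructor))
  assign(extension, constructor, envir = registry)
  invisible(TRUE)
}

create_extractor <- function(path) {
  # Pick the constructor from the file extension of the given path.
  extension <- tools::file_ext(path)
  if (!exists(extension, envir = registry, inherits = FALSE)) {
    stop("No extractor registered for extension: ", extension)
  }
  get(extension, envir = registry)(path)
}

# A plain function stands in for ExtractorTytb$new in this sketch.
register_extractor("tytb", function(path) list(type = "tytb", path = path))

extractor <- create_extractor("comments/video1.tytb")
print(extractor$type)
```

A path whose extension has not been registered stops with an error in this sketch, which is one reason a new *Instance* class has to be registered before it can be used.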
The framework provides 18 pipes by default, all inheriting from GenericPipe. Each pipe belongs to one of two categories: (i) basic-functionality pipes and (ii) external file access pipes.
```{R, echo = TRUE, results = "hide"}
library(R6)

RemovesWhiteSpaces <- R6Class(
  "RemovesWhiteSpaces",
  inherit = GenericPipe,
  public = list(
    initialize = function(propertyName = "",
                          alwaysBeforeDeps = list(),
                          notAfterDeps = list()) {
      if (!"character" %in% class(propertyName)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the 'propertyName' variable: ",
             class(propertyName))
      }
      if (!"list" %in% class(alwaysBeforeDeps)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the 'alwaysBeforeDeps' variable: ",
             class(alwaysBeforeDeps))
      }
      if (!"list" %in% class(notAfterDeps)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the 'notAfterDeps' variable: ",
             class(notAfterDeps))
      }
      super$initialize(propertyName, alwaysBeforeDeps, notAfterDeps)
    },
    pipe = function(instance) {
      if (!"Instance" %in% class(instance)) {
        stop("[RemovesWhiteSpaces][pipe][Error] ",
             "Checking the type of the 'instance' variable: ",
             class(instance))
      }
      instance$setData(trimws(x = instance$getData()))

      if (length(instance$getData()) == 0) {
        instance$invalidate()
      }
      return(instance)
    }
  )
)
```
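The *pipe* method of *RemovesWhiteSpaces* combines `trimws` with a `length` check, and it is worth seeing how the two interact. The base R sketch below uses made-up inputs to show that `length` counts vector elements rather than characters, so a string consisting only of spaces is trimmed to an empty string but still has length 1. Whether such blank strings should also invalidate an instance is a design decision that the text does not address.

```r
padded <- "   some text   "
blank  <- "     "
none   <- character(0)

trimmed_padded <- trimws(x = padded)
trimmed_blank  <- trimws(x = blank)
trimmed_none   <- trimws(x = none)

# length() counts vector elements, not characters.
print(length(trimmed_padded))  # 1
print(length(trimmed_blank))   # 1, although the string is now empty
print(length(trimmed_none))    # 0, the only case caught by the length check

# nzchar() is one way to detect empty strings if that case matters.
print(nzchar(trimmed_blank))   # FALSE
```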
## Flow of Pipes (pipelining process)

<div style = "text-align: justify">
A flow of pipes is the set of pipes that makes up the whole preprocessing process. bdpar provides a default pipelining process (implemented in *DefaultPipeline*) comprising all 18 available pipes.
</div>

### Flow of Pipes available by default

The code included below shows a pipelining example comprising 18 pipes:

```R
instance %>|%
  TargetAssigningPipe$new() %>|%
  StoreFileExtPipe$new() %>|%
  GuessDatePipe$new() %>|%
  File2Pipe$new() %>|%
  MeasureLengthPipe$new(propertyName = "length_before_cleaning_text") %>|%
  FindUserNamePipe$new() %>|%
  FindHashtagPipe$new() %>|%
  FindUrlPipe$new() %>|%
  FindEmoticonPipe$new() %>|%
  FindEmojiPipe$new() %>|%
  GuessLanguagePipe$new() %>|%
  ContractionPipe$new() %>|%
  AbbreviationPipe$new() %>|%
  SlangPipe$new() %>|%
  ToLowerCasePipe$new() %>|%
  InterjectionPipe$new() %>|%
  StopWordPipe$new() %>|%
  MeasureLengthPipe$new(propertyName = "length_after_cleaning_text") %>|%
  TeeCSVPipe$new()
```
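To give an intuition for how a chain of pipes can be applied from left to right, the sketch below defines a custom infix operator in base R. It is deliberately named `%then%` to make clear that it is not bdpar's `%>|%` operator, whose implementation is not shown in this document. The instance and pipe objects are simplified stand-ins, and skipping the remaining stages once an instance is invalid is an assumption of this sketch.

```r
# Hypothetical stand-in for an instance: a mutable environment.
new_instance <- function(data) {
  inst <- new.env()
  inst$data <- data
  inst$valid <- TRUE
  inst
}

# A "pipe" here is just a list holding a function that takes and returns an instance.
trim_pipe <- list(pipe = function(inst) {
  inst$data <- trimws(inst$data)
  inst
})
drop_empty_pipe <- list(pipe = function(inst) {
  if (!nzchar(inst$data)) inst$valid <- FALSE
  inst
})
lower_pipe <- list(pipe = function(inst) {
  inst$data <- tolower(inst$data)
  inst
})

# Hypothetical chaining operator: apply the pipe unless the instance is invalid.
`%then%` <- function(inst, p) {
  if (!inst$valid) return(inst)
  p$pipe(inst)
}

result <- new_instance("  Hello World  ") %then%
  trim_pipe %then% drop_empty_pipe %then% lower_pipe
skipped <- new_instance("     ") %then%
  trim_pipe %then% drop_empty_pipe %then% lower_pipe

print(result$data)    # "hello world"
print(skipped$valid)  # FALSE
```

Custom `%op%` operators in R are left-associative, which is why the stages run in the order in which they are written.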
```{R, echo = TRUE, results = "hide", RemovesWhiteSpaces}
library(R6)
library(bdpar)

TestPipeline <- R6Class(
  "TestPipeline",
  inherit = GenericPipeline,
  public = list(
    initialize = function() { },
    execute = function(instance) {
      if (!"Instance" %in% class(instance)) {
        stop("[TestPipeline][execute][Error] ",
             "Checking the type of the 'instance' variable: ",
             class(instance))
      }
      message("[TestPipeline][execute][Info] ", instance$getPath())
      tryCatch(
        instance %>|%
          TargetAssigningPipe$new() %>|%
          StoreFileExtPipe$new() %>|%
          File2Pipe$new() %>|%
          RemovesWhiteSpaces$new() %>|%
          TeeCSVPipe$new()
        ,
        error = function(e) {
          message("[TestPipeline][execute][Error]", instance$getPath(), " :", paste(e))
          instance$invalidate()
        }
      )
      return(instance)
    }
  )
)
```
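The error handling in *TestPipeline* follows a common base R pattern: wrap the processing in `tryCatch`, log the failure, and mark the instance as invalid instead of letting the error propagate. The sketch below reproduces that pattern with simplified stand-ins. The `instance` environment, `failing_stage` and `run_safely` are hypothetical names, and `conditionMessage(e)` is used here in place of `paste(e)` to extract only the message text.

```r
# Hypothetical stand-in for an instance: a mutable environment with a validity flag.
instance <- new.env()
instance$path <- "input/example.tytb"
instance$valid <- TRUE

# A stage that always fails, to exercise the error branch.
failing_stage <- function(inst) stop("resource file not found")

run_safely <- function(inst, stage) {
  tryCatch(
    stage(inst),
    error = function(e) {
      # Log the problem and invalidate instead of aborting the whole run.
      message("[sketch][Error] ", inst$path, " : ", conditionMessage(e))
      inst$valid <- FALSE
    }
  )
  inst
}

result <- run_safely(instance, failing_stage)
print(result$valid)  # FALSE
```

Because the failure is absorbed, the caller always receives an instance back and can decide what to do with invalid ones.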
<div style = "text-align: justify">
Alternatively, the preprocessing flow can be constructed dynamically through the *DynamicPipeline* class, as in the example below. For the remaining options offered by this class, see the package help through *help(package = "bdpar")*.
</div>

```{R, echo = TRUE, results = "hide"}
library(bdpar)
pipeline <- DynamicPipeline$new()
pipeline$add(list(TargetAssigningPipe$new(), StoreFileExtPipe$new(), File2Pipe$new()), pos = NULL)
pipeline$add(list(TeeCSVPipe$new()), pos = NULL)
```

    # [eml]
    bdpar.Options$set("extractorEML.mpaPartSelected", <<PartSelectedOnMPAlternative>>)
    # [resources]
    bdpar.Options$set("resources.abbreviations.path", <<abbreviation.path>>)
    bdpar.Options$set("resources.contractions.path", <<contractions.path>>)
    bdpar.Options$set("resources.interjections.path", <<interjections.path>>)
    bdpar.Options$set("resources.slangs.path", <<slangs.path>>)
    bdpar.Options$set("resources.stopwords.path", <<stopwords.path>>)
    # [twitter]
    bdpar.Options$set("twitter.consumer.key", <<consumer_key>>)
    bdpar.Options$set("twitter.consumer.secret", <<consumer_secret>>)
    bdpar.Options$set("twitter.access.token", <<access_token>>)
    bdpar.Options$set("twitter.access.token.secret", <<access_token_secret>>)
    bdpar.Options$set("cache.twitter.path", <<cache.path>>)
    # [teeCSVPipe]
    bdpar.Options$set("teeCSVPipe.output.path", <<outputh.path>>)
    # [youtube]
    bdpar.Options$set("youtube.app.id", <<app_id>>)
    bdpar.Options$set("youtube.app.password", <<app_password>>)
    bdpar.Options$set("cache.youtube.path", <<cache.path>>)
    # [cache]
    bdpar.Options$set("cache", <<status_cache>>)
    bdpar.Options$set("cache.folder", <<cache.path>>)
    # [parallel]
    bdpar.Options$set("numCores", <<num_cores>>)
    # [verbose]
    bdpar.Options$set("verbose", <<status_verbose>>)

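As a rough illustration of replacing the `<<...>>` placeholders with concrete values, the sketch below collects two options in a named list and applies them through `bdpar.Options$set`. The values themselves are assumptions made for the example (a CSV path under `tempdir()` and a logical for "verbose"); the accepted types are not specified in this document. The call is guarded with `requireNamespace` so that the sketch also runs on a machine where bdpar is not installed.

```r
# Illustrative values only; choose paths and settings that match your environment.
settings <- list(
  "teeCSVPipe.output.path" = file.path(tempdir(), "output.csv"),
  "verbose" = FALSE
)

# Apply the settings only if bdpar is installed.
if (requireNamespace("bdpar", quietly = TRUE)) {
  for (key in names(settings)) {
    bdpar::bdpar.Options$set(key, settings[[key]])
  }
}

print(names(settings))
```

Keeping the values in one list makes it easy to review the whole configuration before it is applied.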
bdpar.Options$set("numCores", 2)
Note that when processing is parallelized, the log output of the individual cores is only available in file mode.
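When choosing a value for "numCores", it can help to check how many cores the machine reports. The sketch below uses `detectCores` from the base `parallel` package. Leaving one core free is a common convention rather than a bdpar recommendation, and the variable names are chosen only for this example.

```r
library(parallel)

# detectCores() can return NA when the count cannot be determined.
available <- detectCores()

# Leave one core free for the rest of the system, but never go below 1.
num_cores <- if (is.na(available)) 1L else max(1L, available - 1L)

print(num_cores)

# With bdpar loaded, the value could then be passed on as in the text:
# bdpar.Options$set("numCores", num_cores)
```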
A development version of the bdpar package is also available on GitHub: github.com/miferreiro/bdpar