rvw

Development of the rvw package started as the R Vowpal Wabbit project in Google Summer of Code 2018.

Vowpal Wabbit is an online machine-learning system known for its speed and scalability; it is widely used in research and industry.

This package aims to bring its functionality to R.

Installation

From Source

First, you have to install Vowpal Wabbit itself (see the Vowpal Wabbit project for build instructions).

Next, once the required library is installed, you can install the rvw package using remotes:

install.packages("remotes")  ## or devtools
remotes::install_github("rvw-org/rvw")

or (in case you have the package sources) via a standard R CMD INSTALL . invocation.

This installation from source currently works best on Linux; on macOS you have to compile locally using the R-compatible toolchain (and not the brew-based one that the Vowpal Wabbit documentation suggests).

There is one possible shortcut: you can use the Debian/Ubuntu packages, as our Docker container does:

sudo apt-get install libvw-dev vowpal-wabbit libboost-program-options-dev

Using Docker

We use Docker for the Travis CI tests, and also provide a container for deployment. Do

docker pull rvowpalwabbit/run                 ## one time 
docker run --rm -ti rvowpalwabbit/run bash    ## launch container

to start the container with rvw installed. See the Boettiger and Eddelbuettel RJournal paper for more on Docker for R, and the Rocker Project used here.

Getting Started

See the Introduction vignette for an overview of the package.

Example

In this example we will try to predict age groups (based on the number of abalone shell rings) from physical measurements, using the Abalone Data Set from the UCI Machine Learning Repository.

First we prepare our data:

library(mltools)
library(rvw)

set.seed(1)
aburl <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abnames <- c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
abalone <- read.table(aburl, header = FALSE, sep = ',', col.names = abnames)
data_full <- abalone

# Split the number of rings into three groups with (as nearly as possible) equal numbers of observations
data_full$group <- bin_data(data_full$rings, bins=3, binType = "quantile")
group_lvls <- levels(data_full$group)
levels(data_full$group) <- c(1, 2, 3)

# Prepare indices to split data
ind_train <- sample(1:nrow(data_full), 0.8*nrow(data_full))
# Split data into train and test subsets
df_train <- data_full[ind_train,]
df_test <- data_full[-ind_train,]
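
As a quick sanity check on the binning, we can inspect the original quantile boundaries saved in group_lvls and the class balance of the training split:

# Original quantile boundaries of the three ring bins
group_lvls
# Number of observations per age group in the training data
table(df_train$group)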

Then we set up a Vowpal Wabbit model with the error-correcting tournament (ect) multiclass reduction:

vwmodel <- vwsetup(option = "ect", num_classes = 3)
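
As a sketch of an alternative (assuming vwsetup() also exposes Vowpal Wabbit's one-against-all reduction as option = "oaa"), a one-against-all model would be set up the same way:

# Hypothetical alternative: one-against-all instead of the error-correcting tournament
vwmodel_oaa <- vwsetup(option = "oaa", num_classes = 3)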

Now we start training:

vwtrain(vwmodel, data = df_train,
        namespaces = list(NS1 = list("sex", "rings"),
                          NS2 = list("weight.w","weight.s","weight.v","weight.sh", "diameter", "length", "height")),
        targets = "group"
)

And we get: average loss = 0.278060

And finally we compute predictions using the trained model:

predict.vw(vwmodel, data = df_test)

Here we get: average loss = 0.221292
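
The predictions themselves can be stored and scored directly. A minimal sketch, assuming predict.vw() returns one predicted class label per row of the test data:

preds <- predict.vw(vwmodel, data = df_test)
# Fraction of misclassified test examples; for multiclass reductions
# VW's average loss is the zero-one loss, so this should match the value above
mean(preds != as.numeric(as.character(df_test$group)))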

We can add more learning algorithms to our model. For example, suppose we want to use a boosting algorithm with 100 "weak" learners. We just add this option to our model and train again:

vwmodel <- add_option(vwmodel, option = "boosting", num_learners=100)

vwtrain(vwmodel, data = df_train,
        namespaces = list(NS1 = list("sex", "rings"),
                          NS2 = list("weight.w","weight.s","weight.v","weight.sh", "diameter", "length", "height")),
        targets = "group"
)

We get: average loss = 0.229273

And compute predictions:

predict.vw(vwmodel, data = df_test)

Finally we get: average loss = 0.081340
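
Beyond the average loss, a confusion matrix shows where the boosted model still makes mistakes. Again a sketch, under the same assumption that predict.vw() returns one class label per test row:

preds_boost <- predict.vw(vwmodel, data = df_test)
# Rows are true age groups, columns are predicted age groups
table(actual = df_test$group, predicted = preds_boost)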

In order to inspect the parameters of our model, we can simply print it:

vwmodel
    Vowpal Wabbit model
Learning algorithm:   sgd 
Working directory:   /var/folders/yx/6949djdd3yb4qsw7x_95wfjr0000gn/T//RtmpjO3DD1 
Model file:   /var/folders/yx/6949djdd3yb4qsw7x_95wfjr0000gn/T//RtmpjO3DD1/vw_1534253637_mdl.vw 
General parameters: 
     random_seed :   0 
     ring_size :  Not defined
     holdout_off :   FALSE 
     holdout_period :   10 
     holdout_after :   0 
     early_terminate :   3 
     loss_function :   squared 
     link :   identity 
     quantile_tau :   0.5 
Feature parameters: 
     bit_precision :   18 
     quadratic :  Not defined
     cubic :  Not defined
     interactions :  Not defined
     permutations :   FALSE 
     leave_duplicate_interactions :   FALSE 
     noconstant :   FALSE 
     feature_limit :  Not defined
     ngram :  Not defined
     skips :  Not defined
     hash :  Not defined
     affix :  Not defined
     spelling :  Not defined
Learning algorithms / Reductions: 
     ect :
         num_classes :   3 
     boosting :
         num_learners :   100 
         gamma :   0.1 
         alg :   BBM 
Optimization parameters: 
     adaptive :   TRUE 
     normalized :   TRUE 
     invariant :   TRUE 
     adax :   FALSE 
     sparse_l2 :   0 
     l1_state :   0 
     l2_state :   1 
     learning_rate :   0.5 
     initial_pass_length :  Not defined
     l1 :   0 
     l2 :   0 
     no_bias_regularization :  Not defined
     feature_mask :  Not defined
     decay_learning_rate :   1 
     initial_t :   0 
     power_t :   0.5 
     initial_weight :   0 
     random_weights :  Not defined
     normal_weights :  Not defined
     truncated_normal_weights :  Not defined
     sparse_weights :   FALSE 
     input_feature_regularizer :  Not defined
Model evaluation. Training: 
     num_examples :   3341 
     weighted_example_sum :   3341 
     weighted_label_sum :   0 
     avg_loss :   0.2292727 
     total_feature :   33408 
Model evaluation. Testing: 
     num_examples :   836 
     weighted_example_sum :   836 
     weighted_label_sum :   0 
     avg_loss :   0.08133971 
     total_feature :   8360 

