RProtoBufUtils

As of January 2014, most functionality from this package has been merged into RProtoBuf version 0.4. The RProtoBufUtils package will no longer be maintained.

This package provides tools and utilities to serialize R objects with protocol buffers. It builds on the RProtoBuf package, which interfaces to Google's official protocol buffers C++ library.

The main advantage of serializing an object to a protocol buffer message, as opposed to native R serialization, is that protocol buffers are an interoperable format that can be read and written by other programming languages. The main disadvantage is that some R-specific object types are not supported and are lost in the process (with a warning).
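For comparison, this is what native R serialization looks like using only base R. It is a minimal sketch to make the contrast concrete: the round trip is lossless, but the bytes are in R's own format, so the consumer in practice needs to be R as well.

```r
# Native serialization to an in-memory raw vector (base R only)
raw_bytes <- serialize(iris, connection = NULL)

# Reading the bytes back restores the object exactly
restored <- unserialize(raw_bytes)

is.raw(raw_bytes)
identical(iris, restored)
```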

How it works

Designing an R object to protobuf serialization requires defining 3 parts:

This package contains example .proto files designed for serializing R objects, as well as R code that helps convert R data and objects to this format. Note that in order for a third party to unserialize a message, they need both the serialized data and the specific .proto file.

Example

The serialize_pb function mimics native serialization and writes an R object to a file or connection in protobuf format. By default it uses the rexp.proto schema.

library(RProtoBufUtils)

# Write iris to a temporary file in protobuf format, then read it back
msg <- tempfile()
serialize_pb(iris, msg)
obj <- unserialize_pb(msg)
identical(iris, obj)
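Since most of this functionality has been merged into RProtoBuf 0.4, the same round trip can presumably be run against RProtoBuf directly. The sketch below assumes that RProtoBuf (0.4 or later) is installed and exports serialize_pb and unserialize_pb with the same basic usage as shown above. It is guarded so that it does nothing when the package is not available.

```r
# Assumption: RProtoBuf >= 0.4 provides serialize_pb() and unserialize_pb()
if (requireNamespace("RProtoBuf", quietly = TRUE)) {
  library(RProtoBuf)

  msg <- tempfile()
  serialize_pb(iris, msg)      # uses the default rexp.proto schema
  obj <- unserialize_pb(msg)

  # Record whether the round trip reproduced the original object
  roundtrip_ok <- identical(iris, obj)
  print(roundtrip_ok)

  unlink(msg)                  # remove the temporary file
}
```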

Proto schemas

By default serialize_pb uses rexp.proto, which is also used by the RHIPE project to serialize R objects for use with Hadoop. This proto is designed to be as general as possible, and supports all standard S3 objects, like vectors, factors, lists, data frames and any combination thereof. It also stores attributes and missing values. It does not, however, support some R-specific constructs, like functions, environments, S4 classes, etc.
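Because unsupported constructs are dropped with only a warning, it can help to check an object before serializing it. The helper below is a hypothetical sketch in base R and is not part of RProtoBufUtils. It walks nested lists and reports the position of any function, environment or S4 object. It only covers the three constructs named above and does not inspect attributes.

```r
# Hypothetical helper (not part of the package): list the paths of elements
# that are functions, environments or S4 objects.
find_unsupported <- function(x, path = "x") {
  if (is.function(x) || is.environment(x) || isS4(x)) {
    return(path)
  }
  hits <- character(0)
  if (is.list(x)) {
    # Recurse into list elements (data frame columns count as list elements)
    for (i in seq_along(x)) {
      hits <- c(hits, find_unsupported(x[[i]], sprintf("%s[[%d]]", path, i)))
    }
  }
  hits
}

find_unsupported(iris)
find_unsupported(list(a = 1, f = mean, e = new.env()))
```

An empty result means none of these constructs were found, so the object is a reasonable candidate for rexp.proto.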

The rexp.proto message definition is very general, but also fairly verbose. For an application that only needs to serialize a certain class of objects, it might be wise to define a proto definition and mapper specifically for that class. For example, the RProtoBufUtils package includes a dataframe.proto specifically for data frames. This proto is less general and might be simpler for third parties to use when communicating a dataset.

# Same round trip, using the dataframe.proto schema
msg <- tempfile()
serialize_pb(iris, msg, proto="dataframe")
obj <- unserialize_pb(msg, proto="dataframe")
identical(iris, obj)

Note, again, that the consumer of the message must be told clearly which .proto was used to serialize the object. The serialized data cannot be interpreted without the proper .proto file.

Unit Test

The RProtoBuf package ships with a data frame named testdata which contains all of the common vector types, including some missing values for each.

This dataset is used to test whether an object serializes and unserializes without loss of information or precision. It is also useful for testing unserialization in another language.

# load data
data(testdata)

# test rexp.proto
msg <- tempfile()
serialize_pb(testdata, msg, proto="rexp")
obj <- unserialize_pb(msg, proto="rexp")
identical(testdata, obj)

# test dataframe.proto
msg <- tempfile()
serialize_pb(testdata, msg, proto="dataframe")
obj <- unserialize_pb(msg, proto="dataframe")
identical(testdata, obj)
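When identical returns FALSE, it does not say which column was altered. The helper below is a hypothetical sketch in base R, not part of the package, that compares two data frames column by column. The example data simulates a lossy round trip by hand, so it runs without any protobuf package.

```r
# Hypothetical helper (not part of the package): return a named logical
# vector telling, per column, whether the restored data matches exactly.
compare_columns <- function(original, restored) {
  stopifnot(is.data.frame(original), is.data.frame(restored))
  cols <- names(original)
  vapply(cols, function(col) {
    col %in% names(restored) && identical(original[[col]], restored[[col]])
  }, logical(1))
}

# Simulated lossy round trip: the integer column comes back as double
original <- data.frame(id = 1:3, score = c(0.5, NA, 2.25))
restored <- original
restored$id <- as.numeric(restored$id)

compare_columns(original, restored)
```

With real data, the call would be compare_columns(testdata, obj) after one of the round trips above.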

Limitations



jeroenooms/RProtoBufUtils documentation built on May 19, 2019, 6:12 a.m.