Apache Arrow in Python and R with reticulate

The arrow package provides reticulate methods for passing data between R and Python in the same process. This document provides a brief overview.

Why you might want to use pyarrow

Installing

To use arrow in Python, at a minimum you'll need the pyarrow library. To install it in a virtualenv:

library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides virtualenv_create()
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")

If you want to install a development version of pyarrow, add nightly = TRUE:

install_pyarrow("arrow-env", nightly = TRUE)

A virtualenv, or virtual environment, is an isolated Python installation created for one project or purpose. Using a separate environment for each project is good practice in Python, because updating a package in one environment doesn't affect packages in other projects.

install_pyarrow() also works with conda environments (use conda_create() instead of virtualenv_create()).
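
As a sketch of the conda route, the setup below mirrors the virtualenv steps. It assumes a working conda installation that reticulate can find; the environment name "arrow-env" is reused from above purely for illustration.

```r
library(arrow)
library(reticulate)

# Create a conda environment (assumes conda is already installed)
conda_create("arrow-env")

# Install pyarrow into that environment
install_pyarrow("arrow-env")

# Later, in the session where you want to use pyarrow
use_condaenv("arrow-env")
```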

For more on installing and configuring Python, see the reticulate docs.

Using

To start, load arrow and reticulate, and then import pyarrow.

library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")

The arrow R package includes support for sharing Arrow Array and RecordBatch objects in-process between R and Python. For example, let's create an Array in pyarrow.

a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]

a is now an Array object in your R session, even though you created it in Python. You can apply R methods to it:

a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]

You can send data both ways. One reason you might want to use pyarrow in R is to take advantage of functionality that is better supported in Python than in R. For example, pyarrow has a concat_arrays() function, but as of 0.17, this function is not implemented in the arrow R package. With reticulate, you can call it from R efficiently.

b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7,
##   8,
##   9
## ]

Now you have a single Array in R.
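
The same hand-off works for RecordBatch objects and in the other direction. The following is a minimal sketch, assuming the "arrow-env" environment from the Installing section with pyarrow 0.17 or greater. It relies on the reticulate generics r_to_py() and py_to_r(), for which arrow supplies methods; the column names x and y are made up for illustration.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")  # assumes the environment created earlier

# An R RecordBatch with two illustrative columns
rb <- record_batch(x = c(1, 2, 3), y = c("a", "b", "c"))

# r_to_py() hands the batch to Python; the result is a reference to a
# Python object rather than an R data structure
py_rb <- r_to_py(rb)
class(py_rb)

# py_to_r() brings it back as an arrow RecordBatch in R
rb2 <- py_to_r(py_rb)
rb2$num_rows
```

Calling r_to_py() explicitly is only needed when you want to hold on to the Python-side object; when you pass an Array to a pyarrow function, as in the concat_arrays() call above, reticulate performs the conversion for you.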

How this works

"Send", however, isn't the correct word. Internally, we're passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we're sharing and accessing the same internal Arrow memory buffers.

Arrow object types

For more information about Arrow object types, see the "Internals" section of the "arrow" vignette:

vignette("arrow", package = "arrow")

Troubleshooting

If you get an error like

Error in py_get_attr_impl(x, name, silent) :
  AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'

it means that the version of pyarrow you're using is too old. Support for passing data to and from R is included in versions 0.17 and greater. Check your pyarrow version like this:

pa$`__version__`

## [1] "0.16.0"

Note that your pyarrow and arrow versions don't need to match each other: each just needs to be 0.17 or greater.
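
If you want to check the version programmatically before passing data around, one possible approach is sketched below. The helper pyarrow_is_new_enough() is hypothetical (it is not part of arrow or reticulate) and simply compares a version string against the 0.17 minimum using base R's utils::compareVersion(). It has only been reasoned through for plain release versions such as "0.16.0"; development version strings may need extra handling.

```r
# Hypothetical helper: TRUE when a pyarrow version string meets the
# minimum version that supports passing data to and from R
pyarrow_is_new_enough <- function(version, minimum = "0.17") {
  utils::compareVersion(version, minimum) >= 0
}

pyarrow_is_new_enough("0.16.0")  # FALSE: too old
pyarrow_is_new_enough("0.17.0")  # TRUE

# In a live session, pass the version reported by pyarrow:
# pyarrow_is_new_enough(pa$`__version__`)
```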




arrow documentation built on Dec. 6, 2022, 5:10 p.m.