The Arrow C++ library includes a generic filesystem interface and specific
implementations for some cloud storage systems. This setup allows various
parts of the project to read and write data with different storage
backends. In the arrow R package, support has been enabled for AWS S3 and
Google Cloud Storage (GCS). This vignette provides an overview of working with
S3 and GCS data using Arrow.
In Windows and macOS binary packages, S3 and GCS support is included. On Linux, when installing from source, S3 and GCS support is not always enabled by default, and it has additional system requirements. See
vignette("install", package = "arrow") for details.
One way of working with filesystems is to create ?FileSystem objects.
?S3FileSystem objects can be created with the s3_bucket() function, which
automatically detects the bucket's AWS region. Similarly, ?GcsFileSystem objects
can be created with the gs_bucket() function. The resulting
FileSystem will consider paths relative to the bucket's path (so, for example,
you don't need to prefix the bucket path when listing a directory).
With a FileSystem object, you can point to specific files in it with the
$path() method and pass the result to file readers and writers (read_parquet(),
write_feather(), et al.).
For example, to read a Parquet file from the example NYC taxi data (used in
vignette("dataset", package = "arrow")):
bucket <- s3_bucket("voltrondata-labs-datasets")
# Or in GCS (anonymous = TRUE is required if credentials are not configured):
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet"))
Note that this will be slower to read than if the file were local, though if you're running on a machine in the same AWS region as the file in S3, the cost of reading the data over the network should be much lower.
You can list the files and/or directories in a bucket or subdirectory using the $ls() method:
bucket$ls("nyc-taxi")
# Or recursive:
bucket$ls("nyc-taxi", recursive = TRUE)
NOTE: in GCS, you should always use recursive = TRUE, as directories often don't appear in $ls() results.
See help(FileSystem) for a list of options that s3_bucket()/S3FileSystem$create() and
gs_bucket()/GcsFileSystem$create() can take.
The object that s3_bucket() and gs_bucket() return is technically a SubTreeFileSystem,
which holds a path and a file system to which it corresponds.
SubTreeFileSystems can be
useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere).
One way to get a subtree is to call the $cd() method on a FileSystem:
june2019 <- bucket$cd("nyc-taxi/year=2019/month=6")
df <- read_parquet(june2019$path("part-0.parquet"))
A SubTreeFileSystem can also be created from a URI:
june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
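File readers and writers (read_parquet(), write_feather(), et al.) also
accept a URI as the source or destination file, as do open_dataset() and write_dataset().
An S3 URI looks like:
s3://[access_key:secret_key@]bucket/path[?region=]
A GCS URI looks like:
gs://[anonymous@]bucket/path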
For example, one of the NYC taxi data files used in
vignette("dataset", package = "arrow") is found at
s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet
# Or in GCS (anonymous required on public buckets):
gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet
Given this URI, you can pass it to
read_parquet() just as if it were a local file path:
df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")
# Or in GCS:
df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")
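URIs accept additional options in the query parameters (the part after the ?)
that are passed down to configure the underlying file system. They are separated
by &. For example,
s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
is equivalent to: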
fs <- S3FileSystem$create(
  endpoint_override = "https://storage.googleapis.com",
  allow_bucket_creation = TRUE
)
fs$path("voltrondata-labs-datasets/")
Both tell the S3FileSystem that it should allow the creation of new buckets and
talk to Google Storage instead of S3. The latter works because GCS implements an
S3-compatible API (see File systems that emulate S3 below), but for better
GCS support use the GcsFileSystem with gs://. Also note
that parameters in the URI need to be
percent encoded, which is why
:// is written as %3A%2F%2F.
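The percent-encoding step can be automated. The following is a minimal sketch in base R, assuming a hypothetical helper named build_uri() (it is not part of arrow) that encodes each query parameter value with URLencode(reserved = TRUE) and joins the pairs with &. It only assembles the string corresponding to the S3FileSystem$create() call above; it does not contact any storage service.

```r
# Hypothetical helper (not part of arrow): assemble a filesystem URI whose
# query parameter values are percent-encoded.
build_uri <- function(base, params = list()) {
  if (length(params) == 0) {
    return(base)
  }
  # reserved = TRUE also encodes characters such as ":" and "/"
  values <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(params), values, sep = "=", collapse = "&"))
}

uri <- build_uri(
  "s3://voltrondata-labs-datasets/",
  list(
    endpoint_override = "https://storage.googleapis.com",
    allow_bucket_creation = "true"
  )
)
print(uri)
```

The resulting string can then be passed wherever a URI is accepted, for example to read_parquet() with a file path appended before the query string.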
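For S3, only the following options can be included in the URI as query parameters:
region, scheme, endpoint_override, access_key, secret_key, allow_bucket_creation, and
allow_bucket_deletion. For GCS, the supported parameters are
scheme, endpoint_override, and retry_limit_seconds.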
In GCS, a useful option is
retry_limit_seconds, which sets the number of seconds
a request may spend retrying before returning an error. The current default is
15 minutes, so in many interactive contexts it's nice to set a lower value, such as 10 seconds:
gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
To access private S3 buckets, you typically need two secret parameters:
an access_key, which is like a user id, and
a secret_key, which is like a token
or password. There are a few options for passing these credentials:
Include them in the URI, like s3://access_key:secret_key@bucket-name/path/to/file. Be sure to URL-encode your secrets if they contain special characters like "/" (e.g., URLencode("123/456", reserved = TRUE)).
Pass them as access_key and secret_key to S3FileSystem$create() or s3_bucket().
Set them as environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively.
Define them in a ~/.aws/credentials file, according to the AWS documentation.
Use an AccessRole for temporary access by passing the role_arn identifier to S3FileSystem$create() or s3_bucket().
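As a brief illustration of the URI and environment-variable options, the sketch below uses only base R. The key values are placeholders rather than real credentials, and bucket-name/path/to/file is the same stand-in path used in the list above.

```r
# Placeholder credentials for illustration only; never hard-code real secrets.
access_key <- "EXAMPLEKEYID"
secret_key <- "123/456"

# URI option: embed the credentials in the URI, percent-encoding each part so
# that characters such as "/" cannot be mistaken for URI delimiters.
uri <- paste0(
  "s3://",
  utils::URLencode(access_key, reserved = TRUE), ":",
  utils::URLencode(secret_key, reserved = TRUE),
  "@bucket-name/path/to/file"
)
print(uri)

# Environment-variable option: expose the credentials through the standard
# AWS variables for the rest of this R session.
Sys.setenv(
  AWS_ACCESS_KEY_ID = access_key,
  AWS_SECRET_ACCESS_KEY = secret_key
)
```

Environment variables set this way last only for the current R session, which keeps secrets out of scripts that may be shared.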
The simplest way to authenticate with GCS is to run the gcloud command to set up application default credentials:
gcloud auth application-default login
To manually configure credentials, you can pass either
access_token and expiration, for using
temporary tokens generated elsewhere, or
json_credentials, to reference a downloaded
credentials file.
If you haven't configured credentials, then to access public buckets, you
must pass anonymous = TRUE or use
anonymous as the user in a URI:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
fs <- GcsFileSystem$create(anonymous = TRUE)
df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")
If you need to use a proxy server to connect to an S3 bucket, you can provide
a URI in the form http://user:password@host:port to proxy_options. For
example, a local proxy server running on port 1316 can be used like this:
bucket <- s3_bucket("voltrondata-labs-datasets", proxy_options = "http://localhost:1316")
The S3FileSystem machinery enables you to work with any file system that
provides an S3-compatible interface. For example, MinIO is
an object-storage server that emulates the S3 API. If you were to
run minio server locally with its default settings, you could connect to
it with arrow using S3FileSystem like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
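or, as a URI, it would be
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
(note the URL escaping of the : in endpoint_override).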
Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.
As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:
Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn'
You can unset these environment variables using Sys.unsetenv().
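A minimal sketch of that cleanup, using only base R. The variable names AWS_DEFAULT_REGION and AWS_S3_ENDPOINT are assumptions chosen for illustration; substitute whichever AWS variables are actually set in your environment.

```r
# Assumed variable names for illustration; adjust to your environment.
aws_vars <- c("AWS_DEFAULT_REGION", "AWS_S3_ENDPOINT")

# Show which of them are currently set (unset ones are reported as NA).
print(Sys.getenv(aws_vars, unset = NA))

# Remove them for the remainder of this R session only.
Sys.unsetenv(aws_vars)

# Confirm that none of them is set any more.
still_set <- !is.na(Sys.getenv(aws_vars, unset = NA))
print(still_set)
```

Because Sys.unsetenv() affects only the running R process, the variables remain defined in your shell and in any new sessions.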
By default, the AWS SDK tries to retrieve metadata about user configuration,
which can cause conflicts when passing in connection details via URI (for example
when accessing a MinIO bucket). To disable this metadata lookup,
you can set the environment variable AWS_EC2_METADATA_DISABLED to TRUE:
Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)