Getting Started

The RAthena package aims to make it easier to work with data stored in AWS Athena. RAthena package attempts to provide three levels of interacting with AWS Athena:

Installing RAthena:

As RAthena utilising the python AWS SDK boto3, Python 3+ is required. Please install Python 3+ either by Python or Python Anaconda. To install RAthena:

# cran version
install.packages("RAthena")

# Dev version
remotes::install_github("dyfanjones/RAthena")

Next is to install Python boto3. This can be done either by RAthena's installation method:

RAthena::install_boto()

Or pip method:

pip install boto3

Python Environments:

If RAthena doesn't pick up boto3 after using install_boto(), please consider specifying the python environment.install_boto() creates RAthena environment. This is either a Python virtual environment or a conda environment depending on your system.

library(DBI)

# Specify python conda environment and force reticulate to use it
reticulate::use_condaenv("RAthena", required = TRUE)

# Or specify python virtual environment and force reticulate to use it
reticulate::use_virtualenv("RAthena", required = TRUE)

con <- dbConnect(RAthena::athena())

Note: Python environments are not required if boto3 is either in the root Python or if R and Python are in their own environment (for example conda environment).

Docker Example:

To help with users wishing to run RAthena in a docker, a simple docker file has been created here. To set up the docker please refer to link. For demo purposes we will use the example docker and run it locally:

# build docker image
docker build . -t rathena

# start container with aws credentials passed from local
docker run \
      -e AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id)" \
      -e AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key)" \
      -e AWS_SESSION_TOKEN="$(aws configure get aws_session_token)" \
      -e AWS_DEFAULT_REGION="$(aws configure get region)" \
      -it rathena

When running RAthena in the docker environment you might be required to let reticulate know what python you are using.

reticulate::use_python("/usr/bin/python3")

library(DBI)

con <- dbConnect(RAthena::athena(), s3_staging_dir = "s3://mybucket/")

Usage:

Low - Level API:

library(DBI)
library(RAthena)

con <- dbConnect(athena())

# list all current work groups in AWS Athena
list_work_groups(con)

# Create a new work group
create_work_group(con, "demo_work_group", description = "This is a demo work group",
                  tags = tag_options(key= "demo_work_group", value = "demo_01"))

DBI:

library(DBI)

con <- dbConnect(RAthena::athena())

# Get metadata 
dbGetInfo(con)

# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] ######## NOTE: Please don't share your S3 bucket to the public
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $boto3
# [1] "1.11.5"
# 
# $RAthena
# [1] "1.7.1"

# create table to AWS Athena
dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris limit 10")
# Info: (Data scanned: 860 Bytes)
#  sepal_length sepal_width petal_length petal_width species
# 1:           5.1         3.5          1.4         0.2  setosa
# 2:           4.9         3.0          1.4         0.2  setosa
# 3:           4.7         3.2          1.3         0.2  setosa
# 4:           4.6         3.1          1.5         0.2  setosa
# 5:           5.0         3.6          1.4         0.2  setosa
# 6:           5.4         3.9          1.7         0.4  setosa
# 7:           4.6         3.4          1.4         0.3  setosa
# 8:           5.0         3.4          1.5         0.2  setosa
# 9:           4.4         2.9          1.4         0.2  setosa
# 10:          4.9         3.1          1.5         0.1  setosa

dplyr:

library(dplyr)

athena_iris <- tbl(con, "iris")

athena_iris %>%
  select(species, sepal_length, sepal_width) %>% 
  head(10) %>%
  collect()
# Info: (Data scanned: 860 Bytes)
# # A tibble: 10 x 3
# species  sepal_length sepal_width
# <chr>           <dbl>       <dbl>
# 1 setosa            5.1         3.5
# 2 setosa            4.9         3  
# 3 setosa            4.7         3.2
# 4 setosa            4.6         3.1
# 5 setosa            5           3.6
# 6 setosa            5.4         3.9
# 7 setosa            4.6         3.4
# 8 setosa            5           3.4
# 9 setosa            4.4         2.9
# 10 setosa           4.9         3.1

Useful Links:



Try the RAthena package in your browser

Any scripts or data that you put into this service are public.

RAthena documentation built on Dec. 28, 2022, 1:19 a.m.