knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#"
)
library(sparkts)
library(testthat)

Set up

NOTE: When cloning the repository to an on-network drive, you MUST clone it to the D drive; otherwise you will not be able to build the package.

Software Dependencies

R

If you have cloned sparkts and wish to build it, you will need a number of additional R packages.

Install them with the following commands (note that we need the development version of sparklyr due to bugs in the version available from CRAN):

install.packages(c("dplyr", "devtools", "testthat", "covr", "knitr", "rmarkdown"))
source("https://raw.githubusercontent.com/r-lib/remotes/master/install-github.R")$value("r-lib/remotes")
remotes::install_github("rstudio/sparklyr")
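You can confirm which version of sparklyr was installed with the command below; a development version installed from GitHub will typically carry a .9000-style suffix.

packageVersion("sparklyr")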

Spark

You also need to install Spark itself. On Windows machines you need to ensure your LOCALAPPDATA environment variable is set to the correct folder. First, check whether it is already set:

Sys.getenv()["LOCALAPPDATA"]

If this is NA, set it with:

Sys.setenv(LOCALAPPDATA="C:/Users/username/AppData/Local")

Spark should now install into the correct directory:

sparklyr::spark_install(version = "2.2.0")

You can check where Spark has been installed with:

sparklyr::spark_install_dir()
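You can also list which versions of Spark sparklyr has already installed:

sparklyr::spark_installed_versions()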

Connecting to Spark

Note that you may come across firewall restrictions when trying to connect to Spark. To get around this, we can connect via localhost instead (assuming this isn't also blocked):

sc <- sparklyr::spark_connect(
  master = "local",
  version = "2.2.0",
  config = list(sparklyr.gateway.address = "127.0.0.1")
)
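Once connected, a quick sanity check is to ask Spark for its version, and you can disconnect when you are finished:

# Confirm the connection is live
sparklyr::spark_version(sc)

# Close the connection when you are done
sparklyr::spark_disconnect(sc)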

Building

Once you have all of your dependencies correctly set up, you can then build and install the package using:

devtools::build()    # build the source package
devtools::install()  # install it into your library
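Before building, you may also want to regenerate the documentation and run the test suite. A minimal sketch using devtools, assuming your working directory is the package root:

devtools::document()  # rebuild the Rd files from the roxygen comments
devtools::test()      # run the testthat suite
devtools::check()     # full R CMD check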

Development Learning

Developing this package further will require a working knowledge of several other packages. Links to relevant books, packages and vignettes are detailed below.

You can read about these tools and the wider R package development world in the R Packages book.

Key functions to know about are:

A very good (free) book on general R can be found here.

All of the packages I have mentioned come from the "tidyverse", a collection of packages designed to work well together. A key package which works well with sparklyr is dplyr. dplyr provides a grammar of data manipulation: a consistent set of verbs that solve the most common data manipulation challenges. It is the go-to package for data manipulation in R and is roughly R's equivalent of Python's pandas library.
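As an illustration of these verbs running against a Spark DataFrame, here is a minimal sketch. It assumes an open connection sc and copies the built-in mtcars data to Spark purely for demonstration; it is not part of the sparkts API.

library(dplyr)

# Copy a small built-in dataset to Spark purely for illustration
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_demo", overwrite = TRUE)

mtcars_tbl %>%
  filter(cyl == 4) %>%                               # translated to Spark SQL, not run in R
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                                          # bring the aggregated result back into R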

Generating HTML Documentation

We can generate the HTML from an Rd file using the following code.

tools::Rd2HTML("../man/scala_list.Rd")
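By default the HTML is written to the console; to write it to a file instead, pass an output path (the file name here is just an example):

tools::Rd2HTML("../man/scala_list.Rd", out = "scala_list.html")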

This is linked to JIRA ticket DAPS-433.

Connecting to Spark

Cloudera Data Science Workbench [TO BE TESTED]

In order to connect to Spark on the Cloudera Data Science Workbench, the user must configure their connection using the sparklyr package. An example of this can be seen below.

library(sparklyr)

config <- spark_config()
config[["spark.r.command"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/bin/Rscript"
config$sparklyr.apply.env.R_HOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
config$sparklyr.apply.env.RHOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R"
config$sparklyr.apply.env.R_SHARE_DIR <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/include"

sc <- spark_connect(master = "yarn-client", config = config)

This is linked to JIRA ticket DAPS-450.

API

Naming Conventions

The functions in this package generally begin with sdf_*. This stands for Spark DataFrame and is used for two reasons:

  1. It extends the same API used by the sparklyr package
  2. We are using Spark DataFrames and modifying them in place

Generics vs. R6

A decision was made to use generic functions (and possibly S3 classes where needed) over R6 classes since generics are more typical of the R code the end user will be used to. This is discussed in more detail here.
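To illustrate the pattern, here is a hypothetical sketch (sdf_example is not a real function in the package): a generic dispatches on the class of its first argument, and an S3 method supplies the Spark-specific behaviour.

# Hypothetical generic: dispatches on the class of `data`
sdf_example <- function(data, ...) {
  UseMethod("sdf_example")
}

# Hypothetical method: invoked when `data` is a reference to a Spark DataFrame
sdf_example.spark_jobj <- function(data, ...) {
  # Spark-specific implementation would go here
}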

Testing

Building Expected Data Outputs

The Problem

First we connect to Spark, define the expected output (originally generated with dput()), read in the raw test data and call the function under test.

sc <- sparklyr::spark_connect(master = "local", version = "2.2.0")

# Define the expected data
expected_df <- structure(
  list(
    ref = c(
      "000000000", "111111111", "222222222", 
      "333333333", "444444444", "555555555", "666666666", "777777777"
    ), 
    xColumn = c(
      "200", "300", "400", "500", "600", "700", "800", "900"), 
    yColumn = c(120, 220, 320, 420, 520, 620, 720, 820), 
    zColumn = c(10, 20, 30, 40, 53, 60, 70, 80), 
    stdError = c(
      10.5851224804993, 14.1156934648117, 16.7967753286756, 19.0703353574777, 
      22.351729734926, 22.9194448813965, 24.6125909283282, 26.1943338370632)), 
  .Names = c(
    "ref", "xColumn", "yColumn", "zColumn", "stdError"), 
  row.names = c(NA, -8L), 
  class = c("tbl_df", "tbl", "data.frame")
)

# Read in the data
std_data <- sparklyr::spark_read_json(
  sc,
  "std_data",
  path = system.file(
    "data_raw/StandardErrorDataIn.json",
    package = "sparkts"
  )
) %>%
  sparklyr::spark_dataframe()

# Call the method
output <- sdf_standard_error(
  sc = sc, data = std_data,
  x_col = "xColumn", y_col = "yColumn", z_col = "zColumn",
  new_column_name = "stdError"
) %>%
  dplyr::collect()

When testing, you may then see a failure from the following comparison:

expect_identical(
  output,
  expected_df
)

If you use dput() to generate your expected output, it does not store the full precision of the numeric data. To prove this, rounding the values allows the test to pass:

expect_identical(
  output %>% dplyr::mutate(stdError = round(stdError, 2)),
  expected_df %>% dplyr::mutate(stdError = round(stdError, 2))
)

Solution 1

If you must use dput(), one way to keep the full numeric precision is to deparse the numbers as hexadecimal (binary fraction) literals. See ?deparseOpts for more information.

dput(
  output,
  control = c("keepNA", "keepInteger", "showAttributes", "hexNumeric")
)
# Copy the printed (hexadecimal) representation into your test file as `expected_df`

Then when we compare the two datasets, we see no errors.

expect_identical(
  output,
  expected_df
)

Solution 2

If you don't want to use hexadecimal notation, for whatever reason, you can get away with using expect_equal() instead of expect_identical(), which adds a tolerance to numeric comparisons. See ?all.equal for more information.
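For example, the comparison can be written with an explicit tolerance (the value shown is illustrative; the default is usually sufficient):

expect_equal(
  output,
  expected_df,
  tolerance = 1e-8
)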

Alternatively, you can simply store the expected data in a JSON file and read it in with sparklyr:

expected_df_json <- sparklyr::spark_read_json(
  sc,
  "std_data",
  path = system.file(
    "path/to/jsonfile.json",
    package = "sparkts"
  )
) %>%
  sparklyr::spark_dataframe()
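A sketch of how the comparison might then look (the JSON path above is a placeholder): sdf_register() turns the underlying Spark DataFrame back into a dplyr tbl so that it can be collected into R and compared.

expected_from_json <- expected_df_json %>%
  sparklyr::sdf_register() %>%
  dplyr::collect()

expect_equal(output, expected_from_json)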

