knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#"
)
library(sparkts)
library(testthat)
NOTE: When cloning the repository to an on-network drive, you MUST clone it to the D drive; otherwise you will not be able to build the package.
If you have cloned sparkts and wish to build it, you will need to run the following commands (note that we need the development version of sparklyr due to bugs in the version available from CRAN):
install.packages(c("dplyr", "devtools", "testthat", "covr", "knitr", "rmarkdown"))
source("https://raw.githubusercontent.com/r-lib/remotes/master/install-github.R")$value("r-lib/remotes")
remotes::install_github("rstudio/sparklyr")
You also need to install Spark. On Windows machines you need to ensure your LOCALAPPDATA environment variable is set to the correct folder. First, check whether it is already set:
Sys.getenv()["LOCALAPPDATA"]
If this is NA, set it with:
Sys.setenv(LOCALAPPDATA="C:/Users/username/AppData/Local")
Now Spark should install in the correct directory:
sparklyr::spark_install(version = "2.2.0")
You can check where Spark has been installed with:
sparklyr::spark_install_dir()
Note that you may come across firewall restrictions when trying to connect to Spark. To get around this, we can use localhost instead (assuming this isn't also blocked):
sc <- sparklyr::spark_connect(
  master = "local",
  version = "2.2.0",
  config = list(sparklyr.gateway.address = "127.0.0.1")
)
Once you have all of your dependencies correctly set up, you can build and install the package using:
devtools::build()
devtools::install()
Developing this package further will require a working knowledge of several packages. Detailed below are several links to books, packages and vignettes.
You can read about these tools and the wider R package development world in the R Packages book.
Key functions to know about are devtools::document(), devtools::test(), devtools::check() and devtools::build().
A very good (free) book on general R can be found here.
All of the packages I have mentioned come from the "tidyverse", a collection of packages that work very well together. A key package which works well with sparklyr is called dplyr. dplyr provides a grammar of data manipulation: a consistent set of verbs that solve the most common data manipulation challenges. It is the package to use for data manipulation in R and is R's version of Python's pandas library.
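As a quick illustration of that grammar working against Spark, here is a minimal sketch. It assumes an open connection sc as created above, and uses the built-in mtcars data (which is not part of sparkts) purely for demonstration:

library(dplyr)

# Copy a small local data frame into Spark to experiment with
mtcars_tbl <- dplyr::copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# The verbs are translated to Spark SQL and executed inside Spark;
# collect() pulls the result back into an R tibble
mtcars_tbl %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()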
We can generate the HTML from an Rd file using the following code.
tools::Rd2HTML("../man/scala_list.Rd")
This is linked to JIRA ticket DAPS-433.
In order to connect to Spark on the Cloudera Data Science Workbench, the user must configure their connection using the sparklyr package. An example of this can be seen below.
library(sparklyr)

config <- spark_config()
config[["spark.r.command"]] <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/bin/Rscript"
config$sparklyr.apply.env.R_HOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R"
config$sparklyr.apply.env.RHOME <- "/opt/cloudera/parcels/CONDAR/lib/conda-R"
config$sparklyr.apply.env.R_SHARE_DIR <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "/opt/cloudera/parcels/CONDAR/lib/conda-R/lib/R/include"

sc <- spark_connect(master = "yarn-client", config = config)
This is linked to JIRA ticket DAPS-450.
The functions in this package generally begin with sdf_*. This stands for Spark DataFrame and is used for two reasons: it makes the package's functions easy to discover, and it matches the sdf_* naming convention already used by the sparklyr package.

A decision was made to use generic functions (and possibly S3 classes where needed) over R6 classes, since generics are more typical of the R code the end user will be used to. This is detailed more here.
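As a rough sketch of what that style looks like in practice (sdf_example is a hypothetical name, not part of the package API):

# Hypothetical example of the generic-function style: dispatch happens
# on the class of `data`, so the user-facing call reads like ordinary R
sdf_example <- function(data, ...) {
  UseMethod("sdf_example")
}

# Method for sparklyr table references; the real Spark work would go here
sdf_example.tbl_spark <- function(data, ...) {
  data
}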
sc <- sparklyr::spark_connect(master = "local", version = "2.2.0")

# Define the expected data
expected_df <- structure(
  list(
    ref = c(
      "000000000", "111111111", "222222222", "333333333",
      "444444444", "555555555", "666666666", "777777777"
    ),
    xColumn = c("200", "300", "400", "500", "600", "700", "800", "900"),
    yColumn = c(120, 220, 320, 420, 520, 620, 720, 820),
    zColumn = c(10, 20, 30, 40, 53, 60, 70, 80),
    stdError = c(
      10.5851224804993, 14.1156934648117, 16.7967753286756,
      19.0703353574777, 22.351729734926, 22.9194448813965,
      24.6125909283282, 26.1943338370632
    )
  ),
  .Names = c("ref", "xColumn", "yColumn", "zColumn", "stdError"),
  row.names = c(NA, -8L),
  class = c("tbl_df", "tbl", "data.frame")
)

# Read in the data
std_data <- sparklyr::spark_read_json(
  sc,
  "std_data",
  path = system.file("data_raw/StandardErrorDataIn.json", package = "sparkts")
) %>%
  sparklyr::spark_dataframe()

# Call the method
output <- sdf_standard_error(
  sc = sc,
  data = std_data,
  x_col = "xColumn",
  y_col = "yColumn",
  z_col = "zColumn",
  new_column_name = "stdError"
) %>%
  dplyr::collect()
You may see the following issue when testing: the comparison below fails, even though the two data frames print identically.
expect_identical(
output,
expected_df
)
If you use dput to generate your expected output, it doesn't store the numeric data to full precision. To demonstrate this, rounding the values allows the test to pass:
expect_identical(
  output %>% dplyr::mutate(stdError = round(stdError, 2)),
  expected_df %>% dplyr::mutate(stdError = round(stdError, 2))
)
If you must use dput, one way to keep the full numeric information is to use the hexadecimal (binary fraction) representation. See ?deparseOpts for more information.
# dput() prints the hexadecimal representation; paste the printed output
# into your test file as the definition of expected_df
expected_df <- dput(
  output,
  control = c("keepNA", "keepInteger", "showAttributes", "hexNumeric")
)
Then when we compare the two datasets, we see no errors.
expect_identical(
output,
expected_df
)
If you don't want to use the hexadecimal representation, for whatever reason, you can get away with using expect_equal() instead of expect_identical(), which adds a tolerance to numerical value comparisons. See ?all.equal for more information.
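For example, the earlier comparison then passes unchanged:

# Unlike expect_identical(), expect_equal() compares numeric vectors with
# a tolerance (see ?all.equal), so the tiny precision differences
# introduced by dput() no longer cause a failure
expect_equal(output, expected_df)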
Alternatively, you can always just import the expected data from a JSON file using sparklyr:
expected_df_json <- sparklyr::spark_read_json(
  sc,
  "std_data",
  path = system.file("path/to/jsonfile.json", package = "sparkts")
) %>%
  sparklyr::spark_dataframe()
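If you take this route, one way to compare is to register the Spark DataFrame and collect it back into R before testing. A sketch (expected_from_json is an illustrative name):

# Register the Spark DataFrame as a table, then pull it into an R tibble
expected_from_json <- sparklyr::sdf_register(expected_df_json) %>%
  dplyr::collect()

expect_equal(output, expected_from_json)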