In hrbrmstr/sergeant: Tools to Transform and Query Data with Apache Drill

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "## ",
  fig.path = "README-"
)
options(sergeant.bigint.warnonce = FALSE)
options(width=120)

💂 sergeant

Tools to Transform and Query Data with 'Apache' 'Drill'

IMPORTANT

Version 0.7.0 (a.k.a. the main branch) splits off the JDBC interface into a separate package sergeant.caffeinated (GitLab; GitHub).

I# Description

Drill + sergeant is (IMO) a streamlined alternative to Spark + sparklyr if you don't need the ML components of Spark (i.e. just need to query "big data" sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries.

Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streamline workflows that involve wrangling many tiny JSON files on a daily basis.

Drill can be obtained from https://drill.apache.org/download/ (use "Direct File Download"). Drill can also be installed via Docker. For local installs on Unix-like systems, a common/suggestion location for the Drill directory is /usr/local/drill as the install directory.

Drill embedded (started using the $DRILL_BASE_DIR/bin/drill-embedded script) is a super-easy way to get started playing with Drill on a single workstation and most of many workflows can "get by" using Drill this way.

There are a few convenience wrappers for various informational SQL queries (like drill_version()). Please file an PR if you add more.

Some of the more "controlling vs data ops" REST API functions aren't implemented. Please file a PR if you need those.

The following functions are implemented:

DBI (REST)

A "just enough" feature complete R DBI driver has been implemented using the Drill REST API, mostly to facilitate the dplyr interface. Use the RJDBC driver interface if you need more DBI functionality.
This also means that SQL functions unique to Drill have also been "implemented" (i.e. made accessible to the dplyr interface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub. Many should work without it, but some may require a custom interface.

dplyr: (REST)

src_drill: Connect to Drill (using dplyr) + supporting functions

Note that a number of Drill SQL functions have been mapped to R functions (e.g. grepl) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on drill_custom_functions for more info on these helper Drill custom function mappings.

Drill APIs:

drill_connection: Setup parameters for a Drill server/cluster connection
drill_active: Test whether Drill HTTP REST API server is up
drill_cancel: Cancel the query that has the given queryid
drill_functions: Show all the available Drill built-in functions & UDFs (Apache Drill 1.15.0+ required)
drill_jdbc: Connect to Drill using JDBC
drill_metrics: Get the current memory metrics
drill_options: List the name, default, and data type of the system and session options
drill_popts: Show all the available Drill options (1.15.0+)
drill_rofile: Get the profile of the query that has the given query id
drill_profiles: Get the profiles of running and completed queries
drill_query: Submit a query and return results
drill_set: Set Drill SYSTEM or SESSION options
drill_settings_reset: Changes (optionally, all) session settings back to system defaults
drill_show_files: Show files in a file system schema.
drill_show_schemas: Returns a list of available schemas.
drill_stats: Get Drillbit information, such as ports numbers
drill_status: Get the status of Drill
drill_storage: Get the list of storage plugin names and configurations
drill_system_reset: Changes (optionally, all) system settings back to system defaults
drill_threads: Get information about threads
drill_uplift: Turn a columnar query results into a type-converted tbl
drill_use: Change to a particular schema.
drill_version: Identify the version of Drill running

Helpers

ctas_profile: Generate a Drill CTAS Statement from a Query
drill_up: sart a Dockerized Drill Instance

`sdrill_down`: stop a Dockerized Drill Instance by container id

howall_drill: Show all dead and running Drill Docker containers
stopall_drill: Prune all dead and running Drill Docker containers

Installation

hrbrpkghelpr::install_block()
`````
# Usage

### `dplyr` interface

```r
library(sergeant)
library(tidyverse)

# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill("localhost")  #ds
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:
count(db, gender, marital_status)

count(db, gender, marital_status) %>% collect()

group_by(db, position_title) %>%
  count(gender) -> tmp2

group_by(db, position_title) %>%
  count(gender) %>%
  ungroup() %>%
  mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>%
  collect() %>%
  select(Title = position_title, Gender = full_desc, Count = n)

arrange(db, desc(employee_id)) %>% print(n = 20)

mutate(db, position_title = tolower(position_title)) %>%
  mutate(salary = as.numeric(salary)) %>%
  mutate(gender = ifelse(gender == "F", "Female", "Male")) %>%
  mutate(marital_status = ifelse(marital_status == "S", "Single", "Married")) %>%
  group_by(supervisor_id) %>%
  summarise(underlings_count = n()) %>%
  collect()

REST API

dc <- drill_connection("localhost") 

drill_active(dc)

drill_version(dc)

drill_storage(dc)$name

drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100")

drill_query(dc, "SELECT COUNT(gender) AS gctFROM cp.`employee.json` GROUP BY gender")

drill_options(dc)

drill_options(dc, "json")

Working with parquet files

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5")

Including multiple parquet files in different directories (note the wildcard support):

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5")

Drill has built-in support for spatial ops

Via: https://github.com/k255/drill-gis

A common use case is to select data within boundary of given polygon:

drill_query(dc, "
select columns[2] as city, columns[4] as lon, columns[3] as lat
    from cp.`sample-data/CA-cities.csv`
    where
        ST_Within(
            ST_Point(columns[4], columns[3]),
            ST_GeomFromText(
                'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'
                )
            )
")

sergeant Metrics

cloc::cloc_pkg_md()

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

hrbrmstr/sergeant documentation built on Dec. 27, 2021, 11:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

hrbrmstr/sergeant
Tools to Transform and Query Data with Apache Drill

In hrbrmstr/sergeant: Tools to Transform and Query Data with Apache Drill

💂 sergeant

IMPORTANT

`sdrill_down`: stop a Dockerized Drill Instance by container id

Installation

REST API

Working with parquet files

Drill has built-in support for spatial ops

sergeant Metrics

Code of Conduct

R Package Documentation

Browse R Packages

We want your feedback!

hrbrmstr/sergeant Tools to Transform and Query Data with Apache Drill

In hrbrmstr/sergeant: Tools to Transform and Query Data with Apache Drill

💂 sergeant

** IMPORTANT **

sdrill_down: stop a Dockerized Drill Instance by container id

Installation

REST API

Working with parquet files

Drill has built-in support for spatial ops

sergeant Metrics

Code of Conduct

R Package Documentation

Browse R Packages

We want your feedback!

hrbrmstr/sergeant
Tools to Transform and Query Data with Apache Drill

IMPORTANT

`sdrill_down`: stop a Dockerized Drill Instance by container id