knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "## ",
  fig.path = "README-"
)
options(sergeant.bigint.warnonce = FALSE)
options(width=120)

DOI Travis-CI Build Status Coverage Status CRAN_Status_Badge

💂 sergeant

Tools to Transform and Query Data with 'Apache' 'Drill'

** IMPORTANT **

Version 0.7.0 (a.k.a. the main branch) splits off the JDBC interface into a separate package sergeant.caffeinated (GitLab; GitHub).

I# Description

Drill + sergeant is (IMO) a streamlined alternative to Spark + sparklyr if you don't need the ML components of Spark (i.e. just need to query "big data" sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries.

Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streamline workflows that involve wrangling many tiny JSON files on a daily basis.

Drill can be obtained from https://drill.apache.org/download/ (use "Direct File Download"). Drill can also be installed via Docker. For local installs on Unix-like systems, a common/suggestion location for the Drill directory is /usr/local/drill as the install directory.

Drill embedded (started using the $DRILL_BASE_DIR/bin/drill-embedded script) is a super-easy way to get started playing with Drill on a single workstation and most of many workflows can "get by" using Drill this way.

There are a few convenience wrappers for various informational SQL queries (like drill_version()). Please file an PR if you add more.

Some of the more "controlling vs data ops" REST API functions aren't implemented. Please file a PR if you need those.

The following functions are implemented:

DBI (REST)

dplyr: (REST)

Note that a number of Drill SQL functions have been mapped to R functions (e.g. grepl) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on drill_custom_functions for more info on these helper Drill custom function mappings.

Drill APIs:

Helpers

sdrill_down: stop a Dockerized Drill Instance by container id

Installation

hrbrpkghelpr::install_block()
`````
# Usage

### `dplyr` interface

```r
library(sergeant)
library(tidyverse)

# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill("localhost")  #ds
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:
count(db, gender, marital_status)

count(db, gender, marital_status) %>% collect()

group_by(db, position_title) %>%
  count(gender) -> tmp2

group_by(db, position_title) %>%
  count(gender) %>%
  ungroup() %>%
  mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>%
  collect() %>%
  select(Title = position_title, Gender = full_desc, Count = n)

arrange(db, desc(employee_id)) %>% print(n = 20)

mutate(db, position_title = tolower(position_title)) %>%
  mutate(salary = as.numeric(salary)) %>%
  mutate(gender = ifelse(gender == "F", "Female", "Male")) %>%
  mutate(marital_status = ifelse(marital_status == "S", "Single", "Married")) %>%
  group_by(supervisor_id) %>%
  summarise(underlings_count = n()) %>%
  collect()

REST API

dc <- drill_connection("localhost") 

drill_active(dc)

drill_version(dc)

drill_storage(dc)$name

drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100")

drill_query(dc, "SELECT COUNT(gender) AS gctFROM cp.`employee.json` GROUP BY gender")

drill_options(dc)

drill_options(dc, "json")

Working with parquet files

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5")

Including multiple parquet files in different directories (note the wildcard support):

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5")

Drill has built-in support for spatial ops

Via: https://github.com/k255/drill-gis

A common use case is to select data within boundary of given polygon:

drill_query(dc, "
select columns[2] as city, columns[4] as lon, columns[3] as lat
    from cp.`sample-data/CA-cities.csv`
    where
        ST_Within(
            ST_Point(columns[4], columns[3]),
            ST_GeomFromText(
                'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'
                )
            )
")

sergeant Metrics

cloc::cloc_pkg_md()

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



hrbrmstr/sergeant documentation built on Dec. 27, 2021, 11:17 p.m.