db_map_csv: Use a sample CSV file to create a Hive table or pass the 'columns' argument to spark_read_csv


Description

Use a sample CSV file to create a Hive table, or to build the 'columns' argument for spark_read_csv.

Usage

db_map_csv(sample_file, db = "sparklyr", sample_size = 5,
  dir_location = NULL, table_name = NULL, ...)

Arguments

sample_file

The path to a sample CSV file that will be used to determine the column types.

db

The type of connection or database. Possible values: 'hive', 'sparklyr'.

sample_size

The number of top rows that will be sampled to determine each column's class. Defaults to 5.

dir_location

'hive' only - The location of the directory that contains the data files.

table_name

'hive' only - The name of the table to be created. Defaults to 'default'.

Details

This technique is meant to cut down the time it takes to read CSV files into the Spark context. It does so either by passing the column names and types to spark_read_csv, or by using SQL to create the table.
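As a rough illustration of the sampling idea described above, the following base-R sketch reads only the top rows of a CSV and records each column's class. It is illustrative only, not the package's actual internals, and the file path is assumed from the example below.

```r
# Minimal sketch of the sampling step: read only the top rows of the
# CSV (here 5, matching the sample_size default) and record each
# column's R class. The file path is illustrative.
top_rows <- read.csv("csv/flights.csv", nrows = 5, stringsAsFactors = FALSE)
column_types <- vapply(top_rows, class, character(1))
column_types
# A named character vector of column classes - the kind of mapping
# that could be passed as the 'columns' argument of spark_read_csv()
```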

Examples

## Not run: 
# Libraries needed for this example
library(tidyverse)
library(sparklyr)
library(dbutilities)
library(nycflights13)

# Creating a local Spark context
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "16G"
sc <- spark_connect(master = "local",
                    version = "2.1.0",
                    config = conf)

# Using flights from nycflights13 for the example
data("flights")
flights

# Creating a CSV file out of the flights table
if (!dir.exists("csv")) dir.create("csv")
write_csv(flights, "csv/flights.csv")

# Mapping the CSV file (Hive)
create_sql <- db_map_csv(sample_file = "csv/flights.csv",
                         dir_location = file.path(getwd(), "csv"),
                         db = "hive",
                         table_name = "sql_flights")

# Run the resulting SQL command to create the table
DBI::dbGetQuery(sc, create_sql)

# Mapping the CSV file (sparklyr)
flights_columns <- db_map_csv(sample_file = "csv/flights.csv")

# Use spark_read_csv with the infer_schema argument set to FALSE
flights_noinfer <- spark_read_csv(sc,
                                  name = "noinfer_flights",
                                  path = "csv/",
                                  infer_schema = FALSE,
                                  columns = flights_columns)

spark_disconnect(sc)

## End(Not run)

edgararuiz/dbutilities documentation built on May 15, 2019, 11:02 p.m.