spark_write_table: Write to a Spark table


View source: R/read-write.R

Description

Saves the content of the spark_tbl as the specified table. An R wrapper for Spark's saveAsTable.

Usage

spark_write_table(
  .data,
  table,
  mode = "error",
  partition_by = NULL,
  bucket_by = list(n = NA_integer_, cols = NA_character_),
  sort_by = NULL,
  ...
)

Arguments

.data

a spark_tbl

table

string, the table name

mode

string, the save mode: "error" (default), "overwrite", "append", or "ignore"

partition_by

string, column names to partition by

bucket_by

list, of the form list(n = &lt;integer&gt;, cols = &lt;string&gt;), specifying the number of buckets and the columns to bucket by. Use with caution; not currently working.

sort_by

string, if bucketed, column names to sort by.

...

additional named arguments passed to the Spark writer.

Details

If the table already exists, the behavior of this function depends on the save mode, specified by the mode argument (which defaults to throwing an exception). When mode is "overwrite", the schema of the DataFrame does not need to match that of the existing table.

When mode is "append" and the table already exists, the format and options of the existing table are used. The column order in the schema of the DataFrame does not need to match that of the existing table; unlike insertInto, saveAsTable uses column names to find the correct column positions.
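As a sketch of the save-mode behavior described above. This requires a live Spark session; spark_session() is assumed here and is not documented on this page:

```r
library(tidyspark)
spark <- spark_session()   # assumed session constructor; not shown on this page

iris_tbl <- spark_tbl(iris)

# First write succeeds; repeating it with the default mode = "error"
# throws because the table already exists.
iris_tbl %>% spark_write_table("iris_modes")

# "append" adds rows, matching columns by name rather than by position.
iris_tbl %>% spark_write_table("iris_modes", mode = "append")

# "overwrite" replaces the table; the new schema need not match the old one.
iris_tbl %>% spark_write_table("iris_modes", mode = "overwrite")

# "ignore" silently does nothing when the table already exists.
iris_tbl %>% spark_write_table("iris_modes", mode = "ignore")
```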

Bucketing is supported in tidyspark, but as a general warning: in most cases bucketing is very difficult to do correctly and to manage. Many Spark experts hold that you are better off using Delta Lake's OPTIMIZE/Z-ORDER.

Examples

## Not run: 
iris_tbl <- spark_tbl(iris)

# save as table
iris_tbl %>%
  spark_write_table("iris_tbl")

# try it with partitioning
iris_tbl %>%
  spark_write_table("iris_tbl", mode = "overwrite", partition_by = "Species")

spark_sql("DESCRIBE iris_tbl") %>% collect
# # A tibble: 8 x 3
#   col_name                data_type   comment
#   &lt;chr&gt;                   &lt;chr&gt;       &lt;chr&gt;
# 1 Sepal_Length            "double"    NA
# 2 Sepal_Width             "double"    NA
# 3 Petal_Length            "double"    NA
# 4 Petal_Width             "double"    NA
# 5 Species                 "string"    NA
# 6 # Partition Information ""          ""
# 7 # col_name              "data_type" "comment"
# 8 Species                 "string"    NA


## End(Not run)

danzafar/tidyspark documentation built on Sept. 30, 2020, 12:19 p.m.