All exported methods and functions, with the exception of the compatibility reader and writer (`read.delta` and `write.delta`), use the `dlt_` prefix followed by a snake case name. For example, the `dlt` equivalent of the upstream `convertToDelta` is `dlt_convert_to_delta`.
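For example (a minimal sketch; the paths are hypothetical, and the signatures are assumed to mirror the upstream and SparkR conventions):

```r
# Minimal sketch of the naming convention. The paths are hypothetical, an
# active Spark session is assumed, and dlt_convert_to_delta is assumed to
# mirror the upstream, identifier-based signature.
library(dlt)

# Upstream DeltaTable.convertToDelta becomes dlt_convert_to_delta
tbl <- dlt_convert_to_delta("parquet.`/tmp/events`")

# The compatibility reader and writer keep their SparkR-style names
df <- read.delta("/tmp/events")
write.delta(df, "/tmp/events-copy")
```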
The Python Delta Lake API provides a number of methods with an optional `condition`, followed by a mandatory `set` (a mapping between column names and update expressions). For example, `DeltaMergeBuilder.whenMatchedUpdate` can be called with both `condition` and `set`:
```{python eval = FALSE, python.reticulate = FALSE}
@overload
def whenMatchedUpdate(
    self,
    condition: OptionalExpressionOrColumn,
    set: ColumnMapping
) -> "DeltaMergeBuilder": ...
```
or only `set`, passed as a keyword argument:

```{python eval = FALSE, python.reticulate = FALSE}
@overload
def whenMatchedUpdate(
    self,
    *,
    set: ColumnMapping
) -> "DeltaMergeBuilder": ...
```
This follows the Scala information flow, where a merge operation is defined using interleaved operations on `DeltaMergeBuilder` (`whenMatched`, `whenNotMatched`) and `DeltaMerge*ActionBuilder` (i.e. `insert`, `insertAll`, `update`, `updateAll`, `delete`):
```{scala eval = FALSE}
deltaTable
  .as("target")
  .merge(
    source.as("source"),
    "target.key = source.key")
  .whenMatched("source.value < target.value") // Returns DeltaMergeMatchedActionBuilder
  .update(Map(
    "value" -> expr("source.value")
  )) // Returns DeltaMergeBuilder
  .whenNotMatched("source.value < target.value") // Returns DeltaMergeNotMatchedActionBuilder
  .insert(Map(
    "key" -> expr("source.key"),
    "value" -> lit(0)
  )) // Returns DeltaMergeBuilder
  .execute()
```
so `condition`, if present, is always provided before `set`. In contrast, `dlt` provides composite methods modeled after the Python API, but with mandatory arguments placed before the optional ones. So Python's

```{python eval = FALSE, python.reticulate = FALSE}
def whenMatchedUpdate(
    self,
    condition: OptionalExpressionOrColumn = None,
    set: OptionalColumnMapping = None
) -> "DeltaMergeBuilder": ...
```
is mapped to

```r
setGeneric(
  "dlt_when_matched_update",
  function(dmb, set, condition) {
    standardGeneric("dlt_when_matched_update")
  }
)
```

in `dlt`.
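A hypothetical usage sketch (the merge builder `dmb` and the column names are illustrative; the named list maps column names to update expressions):

```r
# Hypothetical usage sketch: dmb is assumed to be a merge builder created
# with dlt_merge; the column names are illustrative.

# Mandatory `set` only
dmb %>% dlt_when_matched_update(list(value = "source.value"))

# Mandatory `set` first, optional `condition` second
dmb %>% dlt_when_matched_update(
  list(value = "source.value"),
  "source.value < target.value"
)
```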
In the Scala and Python APIs, methods used for their side effects (`DeltaMergeBuilder.execute`, `DeltaTable.update`, `DeltaTable.delete`, `DeltaTable.generate`, etc.), with the exception of `DeltaTableBuilder.execute`, return `Unit` or `None` respectively. Additionally, `DeltaTable.vacuum` returns an empty Spark `DataFrame`.
In contrast, `dlt` invisibly returns an applicable instance of `DeltaTable` (the target `DeltaTable` for merge operations; otherwise, the `DeltaTable` on which the method has been called).
As a result, it is possible to chain calls like these:
```r
dlt_for_path("/tmp/target") %>%
  dlt_delete("id in (1, 3, 5)") %>%
  dlt_update(list(ind = "-ind"), "key = 'a'") %>%
  dlt_alias("target") %>%
  dlt_merge(alias(source, "source"), "source.id = target.id") %>%
  dlt_when_not_matched_insert_all() %>%
  dlt_execute() %>%
  dlt_show()
```
Please keep in mind that, while convenient to write, such code can be harder to reason about and recover from in case of a coding error. Use with caution.
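One way to reduce that risk (a sketch under the same assumptions as the chain above, i.e. hypothetical paths and a `source` data frame) is to name intermediate results, so individual steps can be inspected or re-run in isolation:

```r
# A more defensive variant of the chain above (same hypothetical paths and
# source data frame). Because each dlt_* method invisibly returns the
# applicable DeltaTable, intermediate results can be assigned and inspected.
target <- dlt_for_path("/tmp/target")
target <- dlt_delete(target, "id in (1, 3, 5)")
target <- dlt_update(target, list(ind = "-ind"), "key = 'a'")

merged <- target %>%
  dlt_alias("target") %>%
  dlt_merge(alias(source, "source"), "source.id = target.id") %>%
  dlt_when_not_matched_insert_all() %>%
  dlt_execute()

dlt_show(merged)
```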
The Scala and Python `DeltaTableBuilder` use mutable state to build the table definition. As a result, partial table builder definitions are not suitable for reuse. For example, the following Scala code:
```{scala eval = FALSE}
import org.apache.spark.sql.types._

val parentBuilder = DeltaTable
  .create(spark)
  .addColumn("id", "integer")
  .addColumns(StructType(Seq(
    StructField("key", StringType),
    StructField("value", DoubleType))
  ))

val firstChildBuilder = parentBuilder
  .addColumn("first", "decimal(10, 2)")
  .tableName("first_child_table")

val secondChildBuilder = parentBuilder
  .addColumn("second", "boolean")
  .tableName("second_child_table")

firstChildBuilder.execute()
secondChildBuilder.execute()
```
fails with
```
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table default.second_child_table already exists
```
In contrast, the `dlt` builder is free of side effects and can be easily reused:

```r
parent_builder <- dlt_create() %>%
  dlt_add_column("id", "integer") %>%
  dlt_add_columns(structType("key string, value double"))

first_child_builder <- parent_builder %>%
  dlt_table_name("first_child_table") %>%
  dlt_add_column("first", "decimal(10, 2)")

second_child_builder <- parent_builder %>%
  dlt_table_name("second_child_table") %>%
  dlt_add_column("second", "boolean")

first_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "first", type = "DecimalType(10,2)", nullable = TRUE

second_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "second", type = "BooleanType", nullable = TRUE
```