sdf_duplicate_marker: Flag duplicate records in a Spark DataFrame


Description

This method adds a column of duplicate markers to a Spark DataFrame.

Usage

sdf_duplicate_marker(sc, data, part_col, ord_col,
  new_column_name = "duplicate")

Arguments

sc

A spark_connection.

data

A jobj: the Spark DataFrame on which to perform the function.

part_col

String(s). A vector of the column(s) to check for duplicates within.

ord_col

String(s). A list of the column(s) to order by.

new_column_name

A string. The name given to the duplicate marker column; defaults to "duplicate".

Value

Returns a jobj. The duplicate marker column contains:

* 0 = Duplicate
* 1 = Not a duplicate
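
As a minimal sketch of how the marker column might be consumed once the result has been collected back to R: the data frame below is hypothetical (the column names mimic those used in the Examples section; the actual output depends on your input data), but the 0/1 encoding matches the Value description above.

```r
# Hypothetical collected output, mimicking the shape returned by
# sdf_duplicate_marker() after dplyr::collect()
marked <- data.frame(
  order     = c(1, 1, 2),
  marker    = c("a", "a", "b"),
  duplicate = c(0, 0, 1)  # 0 = duplicate, 1 = not a duplicate
)

# Keep only the rows flagged as not duplicated
unique_rows <- marked[marked$duplicate == 1, ]
```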

Examples

## Not run: 
# Load the required packages; dplyr provides the %>% pipe
library(sparklyr)
library(sparkts)
library(dplyr)

# Set up a spark connection
sc <- spark_connect(master = "local", version = "2.2.0")

# Extract some data
dup_data <- spark_read_json(
  sc,
  "std_data",
  path = system.file(
    "data_raw/DuplicateDataIn.json",
    package = "sparkts"
  )
) %>%
  spark_dataframe()

# Call the method
p <- sdf_duplicate_marker(
  sc, dup_data, part_col = "order", ord_col = "marker"
)

# Return the data to R
p %>% dplyr::collect()

spark_disconnect(sc = sc)

## End(Not run)

nathaneastwood/sparkts documentation built on May 25, 2019, 10:34 p.m.