spark_plot_kmeans: A SparklyR Kmeans Cluster Plotting Function


Description

This function generates 2D or 3D plots for visualizing and understanding k-means clusters.

Usage

spark_plot_kmeans(sparklyr_table, ml_kmean_model, plotMode = "2d",
  optional_pca_model = "None", local_selection = 80000L,
  combination = "experimental")

Arguments

sparklyr_table

The Spark table to pass to the function. A dplyr Spark table (tbl) is accepted.

ml_kmean_model

The ml_kmeans model output to pass to the function.

plotMode

(default = "2d") With "2d", the visualization is generated with ggplot. With "3d", a 3D plot is generated with plotly. With "both", both plots are returned; assign the result to a variable, such as both_plot = ...., and access the plots as both_plot$'2d_plot' and both_plot$'3d_plot'.
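The element names 2d_plot and 3d_plot begin with a digit, so they are not syntactic R names and have to be wrapped in quotes or backticks after $. The following is a minimal sketch that uses a plain list as a stand-in for the returned object; the placeholder strings are hypothetical, and the real elements would presumably be a ggplot object and a plotly object.

```r
# Stand-in for the list returned when plotMode = "both".
# The string values are placeholders, not real plot objects.
both_plot <- list(
  "2d_plot" = "ggplot placeholder",
  "3d_plot" = "plotly placeholder"
)

# Names that start with a digit need backticks or quotes after `$`.
plot_2d <- both_plot$`2d_plot`
plot_3d <- both_plot$'3d_plot'

# Double-bracket indexing also works and sidesteps the quoting rules.
same_2d <- both_plot[["2d_plot"]]
```

Writing both_plot$2d_plot without quoting is a parse error, which is why the quoted form is recommended above.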

optional_pca_model

(default = "None") An existing PCA model that you have already run on the dataframe with ml_pca; supplying it avoids re-running the PCA. By default the PCA uses k=2 for 2 dimensions and k=3 for 3 dimensions, so if your model uses a different k you may be missing out on dimensionality (not always a bad thing).
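To make the role of k concrete, here is a small local analogue using base R's prcomp on the built-in iris data. It is only an illustration of how the number of retained components relates to the plot dimensions; it does not use Spark's ml_pca, and the variable names are assumptions made for this sketch.

```r
# Local analogue only: base R prcomp, not sparklyr's ml_pca.
features <- iris[, 1:4]
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates that a 2D plot (k = 2) and a 3D plot (k = 3) would use.
scores_2d <- pca$x[, 1:2]
scores_3d <- pca$x[, 1:3]

# Variance retained by each choice of k; the third component can only add to it.
retained_2d <- sum(var_share[1:2])
retained_3d <- sum(var_share[1:3])
```

Comparing retained_2d with retained_3d gives a rough sense of how much structure a 2D projection leaves out.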

local_selection

(default = 80000L) The number of randomly selected points that will ultimately be collected and plotted. The 3D plot can handle up to 250,000 points (sometimes), and the 2D plot can handle roughly 350,000-400,000. The default of 80,000 is chosen for browser performance (especially with the 3D plot).
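The effect of capping the plotted points can be sketched locally in base R. This is an analogue under stated assumptions, not the function's own code: how spark_plot_kmeans performs its selection is not shown here, and the data frame below is synthetic.

```r
# Local analogue only; the synthetic data stands in for a collected table.
set.seed(42)                       # make the random draw reproducible
local_selection <- 80000L          # same value as the function's default
points <- data.frame(x = rnorm(200000), y = rnorm(200000))

# Keep every row if the table is already small enough; otherwise draw
# local_selection rows uniformly at random without replacement.
n_keep <- min(nrow(points), local_selection)
plot_points <- points[sample.int(nrow(points), n_keep), ]
```

Lowering local_selection trades plot density for responsiveness, which matters most for the 3D output.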

combination

(default = "experimental") Uses a custom version of sdf_bind_cols, called indexJoin, that is faster and avoids errors I have encountered with sdf_bind_cols. Note that it does not yet support nested columns.

Details

Important package requirements:
You must have ggplot2 installed, and for the 3D output you must also have plotly installed.

Example selection of a Spark table and graph:
spark_table = tbl(sc, sql("select * from db.stock_samples_20m limit 100"))
# kmean_model is assumed to be an ml_kmeans model already fit on this data
outputs = spark_plot_kmeans(spark_table, kmean_model, plotMode = "both")


GabeChurch/sparkedatools documentation built on June 25, 2019, 12:23 p.m.