ml_prepare_dataset: Creates the 'label' and 'features' columns
In pysparklyr: Provides a 'PySpark' Back-End for the 'sparklyr' Package

View source: R/ml-prepare-dataset.R

ml_prepare_dataset

R Documentation

Description

Creates the 'label' and 'features' columns that pipelines require.

Usage

ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)

Arguments

`x`	A `tbl_pyspark` object
`formula`	An R formula. Used when `x` is a `tbl_spark`.
`label`	The name of the label column.
`features`	The name(s) of the feature columns as a character vector.
`label_col`	Label column name, as a length-one character vector.
`features_col`	Features column name, as a length-one character vector.
`keep_original`	Boolean flag that indicates whether the output keeps the original columns from `x`. Defaults to `TRUE`.
`...`	Added for backwards compatibility. Not in use today.
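
As a rough sketch of how these arguments might be combined, the helper below calls `ml_prepare_dataset()` once with a formula and once with `label` and `features`. The helper name `prepare_mtcars`, the use of `mtcars`, and the commented-out connection settings are assumptions made for illustration; they do not come from the package documentation. The calls are wrapped in a function so that nothing touches Spark until a live connection is supplied.

```r
# Hypothetical helper, for illustration only. Requires the 'dplyr',
# 'sparklyr' and 'pysparklyr' packages and a live Spark Connect session.
prepare_mtcars <- function(sc) {
  # Copy a small local data set to Spark
  tbl_mtcars <- dplyr::copy_to(sc, mtcars, overwrite = TRUE)

  # Formula interface: outcome on the left, predictors on the right
  by_formula <- pysparklyr::ml_prepare_dataset(tbl_mtcars, mpg ~ wt + hp)

  # Name-based interface; per the argument description, keep_original = FALSE
  # should leave out the original columns of `x`
  by_name <- pysparklyr::ml_prepare_dataset(
    tbl_mtcars,
    label = "mpg",
    features = c("wt", "hp"),
    keep_original = FALSE
  )

  list(by_formula = by_formula, by_name = by_name)
}

# Assumed connection call; adjust `master` and `version` to your environment.
# sc <- sparklyr::spark_connect(
#   master = "sc://localhost", method = "spark_connect", version = "3.5"
# )
# prepared <- prepare_mtcars(sc)
```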

Details

At this time, 'Spark ML Connect' does not include a Vector Assembler transformer. The main thing this function does is create a 'PySpark' array column. Pipelines require a 'label' and a 'features' column. Even though it is a single column in the dataset, the 'features' column contains all of the predictors inside an array. This function also creates a new 'label' column that copies the outcome variable. This makes it a lot easier to remove the 'label' and 'outcome' columns.
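
The reshaping described above can be pictured with a small local analogue in base R. This is only a conceptual sketch and not the package's implementation: a plain data frame stands in for the Spark table, a list column stands in for the 'PySpark' array column, and the function name `prepare_local` is hypothetical.

```r
# Conceptual, local-only analogue of the transformation (no Spark involved).
prepare_local <- function(df, label, features,
                          label_col = "label",
                          features_col = "features",
                          keep_original = TRUE) {
  # Start from the original columns, or from a zero-column frame
  out <- if (keep_original) df else df[, character(0), drop = FALSE]

  # 'label' is a plain copy of the outcome variable
  out[[label_col]] <- df[[label]]

  # 'features' is a single column in which each row holds all predictors
  feature_matrix <- as.matrix(df[, features, drop = FALSE])
  out[[features_col]] <- lapply(
    seq_len(nrow(df)),
    function(i) unname(feature_matrix[i, ])
  )

  out
}

prepared <- prepare_local(mtcars, label = "mpg", features = c("wt", "hp"))
str(prepared$features[1:2])
```

Each element of the `features` list column holds the `wt` and `hp` values for one row, which mirrors the one-array-per-row layout that the real function produces in Spark.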