sparklyr: R Interface to Apache Spark

An R interface to Apache Spark, a fast and general engine for big data processing. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr'-compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.

Install the latest version of this package by entering the following in R:
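A minimal sketch of that command, assuming the package is installed from CRAN with base R's `install.packages()`:

```r
# Assumption: install the released version from CRAN.
install.packages("sparklyr")
```

Once installed, a first session might look like the following. It uses only functions listed in the man pages (`spark_install`, `spark_connect`, `copy_to`, `spark_disconnect`) plus 'dplyr' verbs. The Spark version and the local master are illustrative choices, not requirements.

```r
library(sparklyr)
library(dplyr)

# Download a local copy of Spark (the version here is only an example).
spark_install(version = "1.6.2")

# Connect to a local Spark instance.
sc <- spark_connect(master = "local")

# Copy an R data frame to Spark and query it through the dplyr back-end.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
mtcars_tbl %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

# Close the connection when finished.
spark_disconnect(sc)
```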
Author: Javier Luraschi [aut, cre], Kevin Ushey [aut], JJ Allaire [aut], RStudio [cph], The Apache Software Foundation [aut, cph]
Date of publication: 2017-03-09 18:09:02
Maintainer: Javier Luraschi
License: Apache License 2.0 | file LICENSE

Man pages

compile_package_jars: Compile Scala sources into a Java Archive (jar)

connection_config: Read configuration values for a connection

connection_is_open: Check whether the connection is open

copy_to.spark_connection: Copy an R Data Frame to Spark

DBISparkResult-class: DBI Spark Result.

ensure: Enforce Specific Structure for R Objects

find_scalac: Discover the Scala Compiler

ft_binarizer: Feature Transformation - Binarizer

ft_bucketizer: Feature Transformation - Bucketizer

ft_discrete_cosine_transform: Feature Transformation - Discrete Cosine Transform (DCT)

ft_elementwise_product: Feature Transformation - ElementwiseProduct

ft_index_to_string: Feature Transformation - IndexToString

ft_one_hot_encoder: Feature Transformation - OneHotEncoder

ft_quantile_discretizer: Feature Transformation - QuantileDiscretizer

ft_regex_tokenizer: Feature Transformation - RegexTokenizer

ft_sql_transformer: Feature Transformation - SQLTransformer

ft_string_indexer: Feature Transformation - StringIndexer

ft_tokenizer: Feature Transformation - Tokenizer

ft_vector_assembler: Feature Transformation - VectorAssembler

invoke: Invoke a Method on a JVM Object

invoke_method: Generic call interface for spark shell

livy_config: Create a Spark Configuration for Livy

livy_install: Install Livy

livy_service: Start Livy

ml_als_factorization: Spark ML - Alternating Least Squares (ALS) matrix...

ml_binary_classification_eval: Spark ML - Binary Classification Evaluator

ml_classification_eval: Spark ML - Classification Evaluator

ml_create_dummy_variables: Create Dummy Variables

ml_decision_tree: Spark ML - Decision Trees

ml_generalized_linear_regression: Spark ML - Generalized Linear Regression

ml_gradient_boosted_trees: Spark ML - Gradient-Boosted Tree

ml_kmeans: Spark ML - K-Means Clustering

ml_lda: Spark ML - Latent Dirichlet Allocation

ml_linear_regression: Spark ML - Linear Regression

ml_logistic_regression: Spark ML - Logistic Regression

ml_model: Create an ML Model Object

ml_multilayer_perceptron: Spark ML - Multilayer Perceptron

ml_naive_bayes: Spark ML - Naive-Bayes

ml_one_vs_rest: Spark ML - One vs Rest

ml_options: Options for Spark ML Routines

ml_pca: Spark ML - Principal Components Analysis

ml_prepare_dataframe: Prepare a Spark DataFrame for Spark ML Routines

ml_prepare_inputs: Pre-process the Inputs to a Spark ML Routine

ml_random_forest: Spark ML - Random Forests

ml_saveload: Save / Load a Spark ML Model Fit

ml_survival_regression: Spark ML - Survival Regression

ml_tree_feature_importance: Spark ML - Feature Importance for Tree Models

na.replace: Replace Missing Values in Objects

pipe: Pipe operator

print_jobj: Generic method for print jobj for a connection type

reexports: Objects exported from other packages

register_extension: Register a Package that Implements a Spark Extension

sdf_copy_to: Copy an Object into Spark

sdf_mutate: Mutate a Spark DataFrame

sdf_partition: Partition a Spark Dataframe

sdf_persist: Persist a Spark DataFrame

sdf_predict: Model Predictions with Spark DataFrames

sdf_quantile: Compute (Approximate) Quantiles with a Spark DataFrame

sdf_read_column: Read a Column from a Spark DataFrame

sdf_register: Register a Spark DataFrame

sdf_sample: Randomly Sample Rows from a Spark DataFrame

sdf-saveload: Save / Load a Spark DataFrame

sdf_schema: Read the Schema of a Spark DataFrame

sdf_sort: Sort a Spark DataFrame

sdf_with_unique_id: Add a Unique ID Column to a Spark DataFrame

spark-api: Access the Spark API

spark_compilation_spec: Define a Spark Compilation Specification

spark_compile: Compile Scala sources into a Java Archive

spark_config: Read Spark Configuration

spark_connection: Retrieve the Spark Connection Associated with an R Object

spark-connections: Manage Spark Connections

spark_dataframe: Retrieve a Spark DataFrame

spark_default_compilation_spec: Default Compilation Specification for Spark Extensions

spark_dependency: Define a Spark dependency

spark_home_dir: Find the SPARK_HOME directory for a version of Spark

spark_install: Download and install various versions of Spark

spark_jobj: Retrieve a Spark JVM Object Reference

spark_load_table: Load a Spark Table into a Spark DataFrame.

spark_log: View Entries in the Spark Log

spark_read_csv: Read a CSV file into a Spark DataFrame

spark_read_json: Read a JSON file into a Spark DataFrame

spark_read_parquet: Read a Parquet file into a Spark DataFrame

spark_save_table: Saves a Spark DataFrame as a Spark table

spark_version: Get the Spark Version Associated with a Spark Connection

spark_version_from_home: Get the Spark Version Associated with a Spark Installation

spark_web: Open the Spark web interface

spark_write_csv: Write a Spark DataFrame to a CSV

spark_write_json: Write a Spark DataFrame to a JSON file

spark_write_parquet: Write a Spark DataFrame to a Parquet file

tbl_cache: Cache a Spark Table

tbl_uncache: Uncache a Spark Table


%>% Man page
compile_package_jars Man page
connection_config Man page
connection_is_open Man page
copy_to Man page
copy_to.spark_connection Man page
DBISparkResult-class Man page
ensure Man page
ensure_scalar_boolean Man page
ensure_scalar_character Man page
ensure_scalar_double Man page
ensure_scalar_integer Man page
find_scalac Man page
ft_binarizer Man page
ft_bucketizer Man page
ft_discrete_cosine_transform Man page
ft_elementwise_product Man page
ft_index_to_string Man page
ft_one_hot_encoder Man page
ft_quantile_discretizer Man page
ft_regex_tokenizer Man page
ft_sql_transformer Man page
ft_string_indexer Man page
ft_tokenizer Man page
ft_vector_assembler Man page
hive_context Man page
invoke Man page
invoke_method Man page
invoke_new Man page
invoke_static Man page
java_context Man page
livy_available_versions Man page
livy_config Man page
livy_home_dir Man page
livy_install Man page
livy_install_dir Man page
livy_installed_versions Man page
livy_service_start Man page
livy_service_stop Man page
ml_als_factorization Man page
ml_binary_classification_eval Man page
ml_classification_eval Man page
ml_create_dummy_variables Man page
ml_decision_tree Man page
ml_generalized_linear_regression Man page
ml_gradient_boosted_trees Man page
ml_kmeans Man page
ml_lda Man page
ml_linear_regression Man page
ml_load Man page
ml_logistic_regression Man page
ml_model Man page
ml_multilayer_perceptron Man page
ml_naive_bayes Man page
ml_one_vs_rest Man page
ml_options Man page
ml_pca Man page
ml_prepare_dataframe Man page
ml_prepare_features Man page
ml_prepare_inputs Man page
ml_prepare_response_features_intercept Man page
ml_random_forest Man page
ml_save Man page
ml_saveload Man page
ml_survival_regression Man page
ml_tree_feature_importance Man page
na.replace Man page
print_jobj Man page
reexports Man page
registered_extensions Man page
register_extension Man page
sdf_copy_to Man page
sdf_import Man page
sdf_load_parquet Man page
sdf_load_table Man page
sdf_mutate Man page
sdf_mutate_ Man page
sdf_partition Man page
sdf_persist Man page
sdf_predict Man page
sdf_quantile Man page
sdf_read_column Man page
sdf_register Man page
sdf_sample Man page
sdf-saveload Man page
sdf_save_parquet Man page
sdf_save_table Man page
sdf_schema Man page
sdf_sort Man page
sdf_with_unique_id Man page
spark-api Man page
spark_available_versions Man page
spark_compilation_spec Man page
spark_compile Man page
spark_config Man page
spark_connect Man page
spark_connection Man page
spark_connection_is_open Man page
spark-connections Man page
spark_context Man page
spark_dataframe Man page
spark_default_compilation_spec Man page
spark_dependency Man page
spark_disconnect Man page
spark_disconnect_all Man page
spark_home_dir Man page
spark_install Man page
spark_install_dir Man page
spark_installed_versions Man page
spark_install_tar Man page
spark_jobj Man page
spark_load_table Man page
spark_log Man page
spark_read_csv Man page
spark_read_json Man page
spark_read_parquet Man page
spark_save_table Man page
spark_session Man page
spark_uninstall Man page
spark_version Man page
spark_version_from_home Man page
spark_web Man page
spark_write_csv Man page
spark_write_json Man page
spark_write_parquet Man page
tbl_cache Man page
tbl_uncache Man page


tests/testthat/test-install-spark.R tests/testthat/test-read-write.R tests/testthat/test-dplyr-do.R tests/testthat/test-ml-linear-regression.R tests/testthat/test-feature-transformers.R tests/testthat/test-ml-kmeans.R tests/testthat/test-ml-generalized-linear-regression.R tests/testthat/test-ml-saveload.R tests/testthat/test-naive-bayes.R
tests/testthat/test-serialization.R tests/testthat/helper-initialize.R
R/test_connection.R R/dplyr_spark_connection.R R/sdf_saveload.R R/ml_classification_evaluators.R R/spark_serialize.R R/data_csv.R R/spark_globals.R R/utils.R R/connection_spark.R R/ml_kmeans.R R/connection_instances.R R/install_spark.R R/ml_interface.R R/ml_backwards_compatibility.R R/spark_version.R R/config_spark.R R/livy_install.R R/ml_logistic_regression.R R/spark_dataframe.R R/dplyr_spark_table.R R/na_actions.R R/spark_shell.R R/dplyr_spark.R R/mutation.R R/ml_feature_transformation.R R/dplyr_sql.R R/ml_lda.R R/ml_survival_regression.R R/data_interface.R R/livy_invoke.R R/dbi_spark_transactions.R R/ml_gradient_boosted_tree.R R/ml_utils.R R/spark_jobj.R R/dplyr_do.R R/tbl_spark.R R/spark_compile.R R/ml_alternating_least_squares.R R/livy_connection.R R/precondition.R R/install_spark_versions.R R/spark_hive.R R/sdf_sql.R R/spark_connection.R R/data_copy.R R/reexports.R R/ml_decision_tree.R R/ml_pca.R R/sdf_wrapper.R R/connection_viewer.R R/ml_generalized_linear_regression.R R/ml_saveload.R R/ml_options.R R/tables_spark.R R/ml_random_forest.R R/ml_multilayer_perceptron.R R/ml_one_vs_rest.R R/livy_service.R R/spark_invoke.R R/dbi_spark_result.R R/dbi_spark_table.R R/livy_sources.R R/spark_deserialize.R R/sdf_interface.R R/spark_magrittr.R R/dbi_spark_query.R R/dbi_spark_connection.R R/dplyr_spark_data.R R/ml_feature_transformation_utils.R R/ml_linear_regression.R R/ml_model_print_methods.R R/zzz.R R/imports.R R/spark_gateway.R R/formulas.R R/ml_naive_bayes.R R/connection_windows.R R/spark_extensions.R
man/spark_read_json.Rd man/DBISparkResult-class.Rd man/sdf_quantile.Rd man/spark_version.Rd man/sdf-saveload.Rd man/ml_generalized_linear_regression.Rd man/connection_is_open.Rd man/ml_lda.Rd man/ft_vector_assembler.Rd man/ml_linear_regression.Rd man/register_extension.Rd man/ml_binary_classification_eval.Rd man/pipe.Rd man/ml_random_forest.Rd man/spark_default_compilation_spec.Rd man/ml_decision_tree.Rd man/ml_model.Rd man/livy_config.Rd man/spark_jobj.Rd man/ml_multilayer_perceptron.Rd man/sdf_sort.Rd man/livy_service.Rd man/spark_version_from_home.Rd man/ft_sql_transformer.Rd man/ml_kmeans.Rd man/spark_config.Rd man/invoke_method.Rd man/ft_elementwise_product.Rd man/sdf_sample.Rd man/compile_package_jars.Rd man/ft_tokenizer.Rd man/spark-connections.Rd man/invoke.Rd man/spark_log.Rd man/ml_prepare_inputs.Rd man/ft_quantile_discretizer.Rd man/sdf_schema.Rd man/spark-api.Rd man/ml_one_vs_rest.Rd man/spark_write_parquet.Rd man/ensure.Rd man/spark_install.Rd man/connection_config.Rd man/find_scalac.Rd man/ft_bucketizer.Rd man/print_jobj.Rd man/spark_dependency.Rd man/sdf_read_column.Rd man/ml_prepare_dataframe.Rd man/livy_install.Rd man/tbl_cache.Rd man/spark_web.Rd man/spark_dataframe.Rd man/ft_one_hot_encoder.Rd man/ft_regex_tokenizer.Rd man/spark_read_parquet.Rd man/sdf_partition.Rd man/sdf_predict.Rd man/reexports.Rd man/copy_to.spark_connection.Rd man/spark_load_table.Rd man/spark_write_json.Rd man/spark_home_dir.Rd man/ml_survival_regression.Rd man/sdf_with_unique_id.Rd man/spark_read_csv.Rd man/spark_connection.Rd man/ml_options.Rd man/spark_write_csv.Rd man/ft_string_indexer.Rd man/sdf_register.Rd man/ml_pca.Rd man/ml_naive_bayes.Rd man/ml_create_dummy_variables.Rd man/ml_logistic_regression.Rd man/ml_als_factorization.Rd man/ml_gradient_boosted_trees.Rd man/ft_discrete_cosine_transform.Rd man/na.replace.Rd man/ml_classification_eval.Rd man/spark_compile.Rd man/sdf_copy_to.Rd man/ft_index_to_string.Rd man/spark_save_table.Rd man/tbl_uncache.Rd 
man/sdf_mutate.Rd man/ml_tree_feature_importance.Rd man/ft_binarizer.Rd man/ml_saveload.Rd man/spark_compilation_spec.Rd man/sdf_persist.Rd