knitr::opts_chunk$set(
  collapse = TRUE,
  echo = TRUE,
  eval = FALSE,
  comment = "#>"
)
library(magrittr)
library(readr)

HLSGUtils contains functions that can run the objects of the GenomicsUtils Scala project. This function opens a Spark session and runs a Scala class with input arguments that are given from the R function. First, we need to have Apache Spark on our machine. GenomicsUtils is based on Spark version of 3.1.2 and Hadoop 2.7. This page helps Choose your version and download the source.

On Linux, downloading and extracting Spark source files are done by these commands.:

wget https://www.apache.org/dyn/closer.lua/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz
tar xvf spark-3.1.3-bin-hadoop2.7.tgz
mv spark-3.1.3-bin-hadoop2.7 /opt/spark 

Insert the following line into the /.bashrc file to add the spark software file location to the PATH variable.

export PATH=$PATH:/opt/spark/bin

To active new setting, use the following command for sourcing the ~/.bashrc file.

source ~/.bashrc

It is also necessary to define SPARK_HOME in system.

export SPARK_HOME="/opt/spark"
echo $SPARK_HOME #! check result

To see environments variables we can also use R command:

Sys.setenv(SPARK_HOME = "/opt/spark")
Sys.getenv("SPARK_HOME")

Run the following command to see if everything is working properly:

spark-shell

If Spark is active on your system, then we can run a Spark-based HLSGUtils functions. VCF2Delta reads VCF files and converts them to delta format .

library(HLSGUtils)
vcf_file <- system.file("data","1KG_SNP_chr19.vcf.gz", package = "HLSGUtils")
VCF2Delta(vcf_file, savePath = "~/Desktop/1KG_SNP_chr19.delta")

The sparklyr package can read delta files, as shown in the code below.

library(sparklyr)
sc <- spark_connect(master = "local", packages = "io.delta:delta-core_2.12:1.0.1")
snp <- spark_read_delta(sc, "~/Desktop/1KG_SNP_chr19.delta")
head(snp)

CSVJoinDelta is Another useful function in GenomicsUtils library. It is used to join CSV files with datasets that are stored as delta format.

csv_path <- system.file("data","1KG_p_value.csv", package = "HLSGUtils")
CSVJoinDelta(csvPath = csv_path, deltaPath = "~/Desktop/1KG_SNP_chr19.delta",
             byX = "ID", byY = "names", 
             savePath = "~/Desktop/1KG_SNP_chr19_pvalue.csv"
             )

The final result is:

read_csv("~/Desktop/1KG_SNP_chr19_pvalue.csv/part-00000-2c471c7c-e42e-4943-8afc-637c18b6254c-c000.csv") %>% 
  head()


Ehyaei/IRISUtils documentation built on March 29, 2022, 9:46 a.m.