kevbobli224/EpiGPlot: Visualization of epigenetic factors

library(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  comment = "#>"
)
library(EpiGPlot)
set.seed(1)

EpiGPlot is a simple R package for producing an interpretative plot of epigenetic factors and their expression values against all other test sample genes.

This package relies on minimal package dependencies and can easily be expanded to support a variety of data sets.

For plotting purposes and graphing linear regression models: plotEpigeneticEV

For customizing plotting layouts and specifications: layoutEpigeneticEV

For parsing .csv or loading .rda data for a certain gene and its expression values amongst other genes: parseEpigeneticData, loadEpigeneticData

Refer to help(package = "EpiGPlot") for further details.

To download and install the package use the following commands:

require("devtools")
install_github("kevbobli224/EpiGPlot", build_vignettes = TRUE)
library("EpiGPlot")

To list all available exported functions for users, use:

lsf.str("package:EpiGPlot")

Details

Given an epigenetic factor (a protein with DNA/histone/chromatin remodeling activity), a list of expression values is obtained along with their quantiles against all genes. The referenced sample category can then be plotted and visualized using linear regression methods.

Provided data set

The package provides the same data set in two formats, .csv and .rda, located at inst/extdata/expressions.csv and data/NO66_HUMAN.rda. To parse or load these files, refer to the detailed explanations below.

The data set contains 4 columns of data: sample class, sample name, expression values, and quantile over all samples.

Sample class is the classification of genes into a biological class, such as cell lines or tissues. Sample name refers to the approved HGNC symbol name. Expression value is the value derived from HGNC mapping based on CAGE tags; essentially, this is how often the gene expresses itself. Quantile over all samples refers to the gene expression in a given sample (NO66_HUMAN) relative to all other samples.

Data specification and functions

Data sets can be obtained through the official EpiFactors database. Some specifications apply when importing .csv data from the database, or .rda data generated by the usethis package, depending on which function is used to load the epigenetic data.

`parseEpigeneticData`

parseEpigeneticData takes as input a tab-delimited .csv data set generated by the EpiFactors database that contains 4 columns of data in the following order:

+------------------------------------------------------------------------+-----------------------------+-----------------------+--------------------------------------+
| character                                                              | character                   | numeric               | numeric                              |
+========================================================================+=============================+=======================+======================================+
| Unit/class type                                                        | Name                        | Expression values     | Quantile over all samples values     |
|                                                                        |                             |                       |                                      |
| (genes/histones/protamines...) or (cell line/fractionation/tissues...) | (HGNC approved symbol/name) |                       |                                      |
+------------------------------------------------------------------------+-----------------------------+-----------------------+--------------------------------------+

The data is parsed and returned as an R data frame with the column names given in the .csv file.
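As a rough illustration of the input format described above, the sketch below writes a small tab-delimited file to a temporary location and reads it back with base R's `read.delim`. It is only a stand-in for what `parseEpigeneticData` is described as doing, not the package's implementation; the column names and values are invented for the example.

```r
# Hypothetical four-column, tab-delimited input (names and values are made up).
toy <- data.frame(
  sample_class = c("cell_line", "tissue", "tissue"),
  sample_name  = c("sampleA", "sampleB", "sampleC"),
  expression   = c(12.5, 3.2, 48.0),
  quantile     = c(0.61, 0.20, 0.95),
  stringsAsFactors = FALSE
)

# Write the toy data to a temporary tab-delimited file.
path <- tempfile(fileext = ".csv")
write.table(toy, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back: two character columns followed by two numeric columns.
parsed <- read.delim(path, stringsAsFactors = FALSE)
str(parsed)
```

The column order (character, character, numeric, numeric) mirrors the table above; a file with the columns in a different order would presumably need reordering before use.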

`loadEpigeneticData`

loadEpigeneticData takes as input a .rda file generated by use_data() from the usethis package, and returns the data frame exactly as it was saved.

Saving/Loading

When working with a data set that you wish to revisit in the future, it is recommended to save it with use_data() from the usethis package. Calling loadEpigeneticData then loads the data back.
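The save and load round trip can be sketched with base R alone. This is an illustration under assumptions: it uses `save()` and `load()` on a temporary file rather than `usethis::use_data()` and `loadEpigeneticData`, whose internals are not shown here, and the toy data frame is made up.

```r
# Toy data frame standing in for a parsed data set (values are made up).
toy <- data.frame(
  sample_class = c("cell_line", "tissue"),
  expression   = c(12.5, 3.2),
  stringsAsFactors = FALSE
)

# Save to a temporary .rda file; usethis::use_data() would write to data/ instead.
path <- file.path(tempdir(), "toy.rda")
save(toy, file = path)

# Load into a fresh environment so the restored object can be retrieved by name.
env <- new.env()
loaded_names <- load(path, envir = env)
restored <- env[[loaded_names[1]]]

# Check that the restored data frame matches the one that was saved.
identical(restored, toy)
```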

`plotEpigeneticEV` and `layoutEpigeneticEV`

plotEpigeneticEV visualizes the layout data returned by layoutEpigeneticEV. Data parsed from .csv/.rda files contains unorganized columns, so it needs a second round of processing by layoutEpigeneticEV; the value returned by the layout function then serves as the layout data for the various plotting functions from the ggplot2 package.

The user can specify various parameters in both plotEpigeneticEV and layoutEpigeneticEV to produce an accurate plot that brings out the features important for data interpretation.

`layoutEpigeneticEV`

A plottable data set must first be obtained from this function, where the user can specify the operations and parameters used to manipulate the data so as to limit visual clutter in later plots.

Here, a sample class can be specified to prune unnecessary comparisons against other sample classes; the function then returns a subset of the original data frame that keeps only the specified sample class. Furthermore, a simple centering and rescaling step can be applied through the built-in base R function scale if the user wishes. This normalizes data sets with extreme values by mapping the expression values onto the range 0 to 1.
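The subsetting step can be sketched in base R as follows. This is a minimal illustration, not the code inside the layout function; the column name `sample_class` and the toy values are assumptions made for the example.

```r
# Toy input; the column names here are hypothetical.
toy <- data.frame(
  sample_class = c("cell_line", "tissue", "fractionation", "tissue"),
  expression   = c(12.5, 3.2, 7.7, 48.0),
  stringsAsFactors = FALSE
)

# Keep only the requested sample classes, dropping every other row.
keep <- c("cell_line", "tissue")
subsetted <- toy[toy$sample_class %in% keep, ]
subsetted
```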

Given a list of expression values of an epigenetic factor, the values can be rescaled to the range 0 to 1 for ease of plotting by

```{=tex}
\begin{equation}
f(x') = \frac{x' - \min(x)}{\max(x) - \min(x)}
\end{equation}
```

where $x$ is a vector of expression values and $x'$ is an individual element of $x$.
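The rescaling formula above can be sketched in base R. The helper name `rescale01` and the input values are made up for illustration, and the call to `scale()` shows one way the same result can be obtained; whether the package calls `scale()` with these arguments is an assumption.

```r
# Min-max rescaling of a numeric vector onto [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

expr <- c(2, 4, 6, 10)   # made-up expression values
scaled <- rescale01(expr)
scaled

# The same result via base R's scale(), with the minimum as the centre
# and the range (max - min) as the scale.
via_scale <- as.numeric(scale(expr, center = min(expr), scale = max(expr) - min(expr)))
all.equal(scaled, via_scale)
```

Note that a vector whose values are all equal has a range of zero, so this sketch would divide by zero; real data with a constant column would need a guard.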

### `plotEpigeneticEV`

Given layout data from `layoutEpigeneticEV`, the user can specify various parameters for visualization. Most importantly, the function allows the user to fit and display linear regression models on specified sample classes of the layout data.
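To make the regression step concrete, here is a small base R sketch that fits a linear model on a single sample class. Which variables the package regresses against each other is not spelled out here, so the sketch assumes expression value modelled as a linear function of quantile; the column names and values are hypothetical.

```r
# Toy layout data; column names and values are hypothetical.
toy <- data.frame(
  sample_class = rep(c("cell_line", "tissue"), each = 4),
  quantile     = c(0.1, 0.4, 0.6, 0.9, 0.2, 0.5, 0.7, 0.8),
  expression   = c(0.05, 0.30, 0.55, 0.95, 0.10, 0.42, 0.58, 0.70),
  stringsAsFactors = FALSE
)

# Fit a simple linear model on one sample class only.
tissue <- toy[toy$sample_class == "tissue", ]
fit <- lm(expression ~ quantile, data = tissue)
coef(fit)

# Predicted expression along the fitted line, e.g. for drawing it on a plot.
predicted <- predict(fit, newdata = data.frame(quantile = c(0.25, 0.75)))
predicted
```

In ggplot2, a line of this kind is commonly drawn with `geom_smooth(method = "lm")`; whether the package does so internally is not confirmed here.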

## Example on visualizing data

The `NO66_HUMAN` gene data set provided in the package is used as example data for visualization. The user can specify label names to be displayed in the plot, as well as other plot features, by providing valid arguments:

```r
# Ensure that the working directory is the EpiGPlot package root
setwd("..")
# Load the bundled data set
rawData <- EpiGPlot::loadEpigeneticData("data/NO66_HUMAN.rda")
# Create a normalized layout data set for the cell line and tissue sample classes
gLayout <- EpiGPlot::layoutEpigeneticEV(rawData,
                                        normalized = TRUE,
                                        sample.class = c("cell_line", "tissue"))
# Plot the normalized layout data with a linear regression fitted on the
# tissue sample class, labelling the genes with the top 5 expression values
gPlot <- EpiGPlot::plotEpigeneticEV(gLayout,
                                    normalized = TRUE,
                                    sample.pred = c("tissue"),
                                    labels.top = 5)
gPlot
```

To summarize the plot: the 5 highest expression values of the sample are labelled. These values indicate how significantly our sample gene (NO66_HUMAN) is expressed in the respective biological units. The linear regression line serves as a model of how our sample gene would be expressed in a given sample class.
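Selecting the highest expression values for labelling can be sketched in base R as below. This is an illustration only, with made-up names and values; it does not reproduce how the `labels.top` argument is handled inside the package.

```r
# Toy data; names and values are made up for illustration.
toy <- data.frame(
  sample_name = paste0("sample", 1:8),
  expression  = c(0.12, 0.95, 0.40, 0.77, 0.05, 0.61, 0.88, 0.33),
  stringsAsFactors = FALSE
)

# Order rows by decreasing expression and keep the first five.
top5 <- head(toy[order(toy$expression, decreasing = TRUE), ], 5)
top5$sample_name
```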

Shiny app

Running the shiny app provided in the package is a way to familiarize yourself with the available parameters for plot output features, as well as certain data subset functions. The shiny app can be run as follows: