knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
This is a set of tests for GSOC 2019. I'm trying to produce parallel coordinate plots in differnt ways.
The package can be downloaded from GitHub. You can use the following code to install.
devtools::install_github("yaweige/yaweitest")
You can then load the package in the normal way.
library(yaweitest) library(tidyverse)
myparallel
The myparallel
function was used to produce parallel coordinate plot based on ggplot2 functionalities.
The function has six input values currently and two input values to be extended in the future.
data
: the data.frame to plotcolumns
: a vector of variables (either names or indices) to be axes in the plotgroupColumn
: a single variable to group (color) byjitter
: a single logical value, jitter the points of the selected varaible(s) or notjittervariables
: varibles that will be jittered, should be a subset of columns, both indices or names; if unspecified, all of the variables will be jitteredjitterratio
: the ratio of the maximum random ajustment devided by the range of selected variables (after scaling)mapping
: NOT AVAILABLE NOW, aes string to pass to ggplot2 object...
: NOT AVAILABLE NOW, passed on directly to all of the ggplot2 commandsThe function returns a ggplot2 object that can be plotted directly or used as base layers for additional modification
data(diamonds, package="ggplot2") diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ] myparallel(data = diamonds.samp, columns = c(1, 5:7))
myparallel(data = diamonds.samp, columns = c(1, 5:7), groupColumn = "cut")
myparallel(data = diamonds.samp, columns = c(1, 5:7), groupColumn = "cut", jitter = TRUE, jittervariables = "table") + ggplot2::ggtitle("Hello Parallel")
myparallel(data = diamonds.samp, columns = c(1, 5:7), groupColumn = "cut", jitter = TRUE, jittervariables = "table") + ggplot2::facet_wrap(~cut) + ggplot2::ggtitle("Hello Parallel")
ggparcoord
and ggparallel
To conclude, there are some features of these two functions I think are superfluous, but it doesn't affect anything if you choose not to use them. It's obvious that we can't combine categorical and numerical varibles in one plot using these two funcions. And there are definitely limitations for the full use of ggplot2
functionalities. Especially, for ggparallel
, there are more limitations as shown below. But ggparcoord
works pretty good to my point of view. It is also limited in jittering (but we can do that).
In many aspects, I think they work quite well.
I found that ggparcoord
does work with facet, but it just simply divided the data instead of calculate for each of facet (or this is not a bad thing?). And ggparcoord
does work with layers, but it is not very clear what kind of data and aes it passes to the following layers. But it seems not too bad. We can even find a way to jitter using ggparcoord
.
I didn't see overlapping problem when I deal with categorical variables using ggparallel
. When it comes to ggparcoord
, I don't understand the conceptual problem described in the wiki, why treating those "variables of axes" as ordinal variables will lead to overplotting. To me, overlapping just exists because of the data.
set.seed(20190404) # it doesn't really matter parallelsample <- data.frame(# category_1 is the standard one category_1 = c(rep(letters[1:5], each = 20)), # category_2 is highly correlated with category_1 category_2 = c(sample(c(rep("A", 15), sample(LETTERS[2:5], 5, replace = T)), 20), sample(c(rep("B", 15), sample(LETTERS[c(1, 3:5)], 5, replace = T)), 20), sample(c(rep("C", 15), sample(LETTERS[c(1:2, 4:5)], 5, replace = T)), 20), sample(c(rep("D", 15), sample(LETTERS[c(1:3, 5)], 5, replace = T)), 20), sample(c(rep("E", 15), sample(LETTERS[1:4], 5, replace = T)), 20)), # category_3 has large difference in the sample sizes of each type category_3 = c(rep("cat", 10), rep("dog", 50), rep("pig", 40)), # category_4 is randomized category_4 = c(sample(rep(c("red", "green", "yellow", "blue"), each = 25), 100)), # numeric_1 is uniformly distributed with different means, correlated with category_1 numeric_1 = c(unlist(lapply(list(0, 5, 10, 15, 20), FUN = function(x) runif(20, min = x, max = x +10)))), # numeric_2 is normally distributed with different means, correlated with numeric_1 numeric_2 = c(unlist(lapply(list(5, 10, 15, 20, 25), FUN = function(x) rnorm(20, mean = x, sd = 5)))), # numeric_3 is normally distributed with different means and variances, negatively correlated with numeric_1 numeric_3 = c(unlist(lapply(list(-5, -10, -15, -20, -25), FUN = function(x) rnorm(20, mean = x, sd = 1 + sqrt(-x))))), # numeric_4 has a lot of overlap points, correlated with numeric_1 numeric_4 = rep(runif(10, min = 0, max = 1), each = 10), # numeric_5 has a lot of overlap points too (but with different pattern, randomized) numeric_5 = sample(rep(runif(10, min = 0, max = 1), each = 10), 100)) str(parallelsample)
ggparcoord
ggparcoord
std
scale as default, subtract mean and divide by standard devitionlibrary(GGally) ggparcoord(parallelsample, columns= 5:9)
ggparcoord(parallelsample, columns= 5:9, groupColumn = 1)
uniminmax
scaleggparcoord(parallelsample, columns= 5:9, groupColumn = 1, scale = "uniminmax")
uniminmax
scale, centered by "mean"ggparcoord(parallelsample, columns= 5:9, groupColumn = 1, scale = "center")
globalminmax
, unscaled; with some transparencyggparcoord(parallelsample, columns= 5:9, groupColumn = 1, scale = "globalminmax", alphaLines = 0.5)
ggparcoord(parallelsample, columns= 5:9, groupColumn = 1, splineFactor = T)
ggparcoord(parallelsample, columns= 5:9, groupColumn = 1, order = "anyClass")
ggparcoord
facet_wrap
, add titleggparcoord(parallelsample, columns= 5:9, groupColumn = 3, scale = "uniminmax") + facet_wrap(~category_3) + ggtitle("Happy Parallel")
ggparcoord(parallelsample, columns= 5:9, groupColumn = 1) + geom_point() ggparcoord(parallelsample, columns= 5:9, groupColumn = 1) + geom_jitter()
ggparcoord(parallelsample, columns= 5:9, groupColumn = 1, alphaLines = 0) + geom_line(position = position_jitter(width = 0, height = 0.4))
Noticed that our category_1 is already a factor varible, we still need to use factor(category_1) to make it a categorical variable here.
ggparcoord(parallelsample, columns= 5:9, alphaLines = 0.5, mapping = aes(color = category_1, size = category_2)) + scale_size_identity() + ggtitle("Happy Parallel") + theme_light() + theme(plot.title = element_text(hjust = 0.5))
ggparallel
ggparallel
library(ggparallel) ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_3", "category_4"))
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_3", "category_4"), method = "hammock")
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_3", "category_4"), method = "hammock", ratio = 0.2)
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_3", "category_4"), method = "parset")
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_3", "category_4"), method = "parset", text.angle = 0)
ggparallel
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_4")) + facet_wrap(~category_1)
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_4")) + facet_wrap(~category_3)
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_4"), mapping = aes(alpha = as.numeric(category_3)/5))
ggparallel(data = parallelsample, vars = list("category_1", "category_2", "category_4"), mapping = aes(color = category_3))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.