tune_biclustermd: Bicluster data over a grid of tuning parameters
In biclustermd: Biclustering with Missing Data


Bicluster data over a grid of tuning parameters

tune_biclustermd(
  data,
  nrep = 10,
  parallel = FALSE,
  ncores = 2,
  tune_grid = NULL
)

`data`	Dataset to bicluster. Must be a numeric matrix containing only numbers and missing values. It should have row names and column names.
`nrep`	The number of times to repeat the biclustering for each set of parameters. Default 10.
`parallel`	Logical indicating whether to use the `foreach` parallel backend. Default is FALSE.
`ncores`	The number of cores to use when `parallel = TRUE`. Default 2.
`tune_grid`	A data frame of parameters to tune over, one row per parameter combination. Its column names must match the names of arguments to `biclustermd()`.
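As a rough sketch of how a `tune_grid` can be assembled, base R's `expand.grid()` builds one row per combination of candidate values. The column names below are the ones used in the Examples on this page; the candidate values themselves are illustrative only, and the snippet stops short of calling `tune_biclustermd()`.

```r
# Candidate values per tuning parameter (values chosen only for illustration).
tg <- expand.grid(
  row_clusters = 2,
  col_clusters = 3:5,
  miss_val = c(0, 1),
  similarity = c("Rand", "Jaccard")
)

# One row per combination: 1 * 3 * 2 * 2 = 12 rows.
n_combos <- nrow(tg)

# Each combination is biclustered nrep times, so the total number of
# biclustering runs grows as nrow(tune_grid) * nrep.
nrep <- 10
total_fits <- n_combos * nrep
total_fits
```

Because the run count is the product of the grid size and `nrep`, it can be worth checking `nrow(tg) * nrep` before launching a large tuning job.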

A list of:

`best_combn`	The best combination of parameters,
`best_bc`	The minimum SSE biclustering using the parameters in `best_combn`,
`grid`	`tune_grid` with columns giving the minimum, mean, and standard deviation of the final SSE for each parameter combination, and
`runtime`	CPU runtime & elapsed time.
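To make the shape of `grid` concrete, the following mock-up repeats a stand-in "fit" several times per parameter combination and summarises the resulting SSE values. It does not call the package: `mock_sse()` is a hypothetical placeholder, the column name `min_sse` is an assumption (only `mean_sse` and `sd_sse` appear in the Examples), and choosing the best combination by smallest minimum SSE is likewise an assumption rather than documented behaviour.

```r
set.seed(1)

# A small grid of parameter combinations.
grid <- expand.grid(col_clusters = 3:5, row_clusters = 2)
nrep <- 4

# Hypothetical stand-in for the final SSE of one biclustering run.
# NOT the package's algorithm: just a decreasing trend plus noise.
mock_sse <- function(col_clusters, row_clusters) {
  100 / (col_clusters * row_clusters) + runif(1)
}

# nrep x nrow(grid) matrix: one column of repeated SSEs per combination.
sse <- sapply(seq_len(nrow(grid)), function(i) {
  replicate(nrep, mock_sse(grid$col_clusters[i], grid$row_clusters[i]))
})

# Per-combination summaries, analogous to the columns added to `grid`.
grid$min_sse  <- apply(sse, 2, min)   # column name assumed
grid$mean_sse <- colMeans(sse)
grid$sd_sse   <- apply(sse, 2, sd)

# Assumed selection rule: smallest minimum SSE.
best_combn <- grid[which.min(grid$min_sse), ]
best_combn
```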

Li, J., Reisner, J., Pham, H., Olafsson, S., and Vardeman, S. (2019). Biclustering for Missing Data. Information Sciences, submitted.

biclustermd, rep_biclustermd

library(dplyr)
library(ggplot2)
data("synthetic")
tg <- expand.grid(
  miss_val = fivenum(synthetic),
  similarity = c("Rand", "HA", "Jaccard"),
  col_min_num = 2,
  row_min_num = 2,
  col_clusters = 3:5,
  row_clusters = 2
)
tg

# In parallel, on two cores:
tbc <- tune_biclustermd(synthetic, nrep = 2, parallel = TRUE, ncores = 2, tune_grid = tg)
tbc

tbc$grid %>%
  group_by(miss_val, col_clusters) %>%
  summarise(avg_sd = mean(sd_sse)) %>%
  ggplot(aes(miss_val, avg_sd, color = col_clusters, group = col_clusters)) +
  geom_line() +
  geom_point()

# Sequentially (parallel = FALSE is the default):
tbc <- tune_biclustermd(synthetic, nrep = 2, tune_grid = tg)
tbc

boxplot(tbc$grid$mean_sse ~ tbc$grid$similarity)
boxplot(tbc$grid$sd_sse ~ tbc$grid$similarity)

# nycflights13::flights dataset

library(nycflights13)
data("flights")

library(dplyr)
library(tidyr)  # provides spread()
flights_bcd <- flights %>%
  select(month, dest, arr_delay)

flights_bcd <- flights_bcd %>%
  group_by(month, dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  spread(dest, mean_arr_delay) %>%
  as.data.frame()

# months as rows
rownames(flights_bcd) <- flights_bcd$month
flights_bcd <- as.matrix(flights_bcd[, -1])

flights_grid <- expand.grid(
  row_clusters = 4,
  col_clusters = c(6, 9, 12),
  miss_val = fivenum(flights_bcd),
  similarity = c("Rand", "Jaccard")
)

# RUN TIME: approximately 40 seconds across two cores.
flights_tune <- tune_biclustermd(
  flights_bcd,
  nrep = 10,
  parallel = TRUE,
  ncores = 2,
  tune_grid = flights_grid
)
flights_tune