tune_biclustermd: Bicluster data over a grid of tuning parameters



Description

Bicluster data over a grid of tuning parameters

Usage

tune_biclustermd(
  data,
  nrep = 10,
  parallel = FALSE,
  ncores = 2,
  tune_grid = NULL
)

Arguments

data

Dataset to bicluster. Must be a numeric matrix containing only numbers and missing values. It should have row names and column names.
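
As a minimal sketch of the input shape described above, the snippet below builds a small numeric matrix with missing values and with row and column names, using base R only. The object name toy, its values, and its dimensions are illustrative assumptions and are not part of the package.

```r
# Hypothetical toy data: 4 rows x 3 columns, numeric, with some NAs.
toy <- matrix(
  c(1.2,  NA, 3.4,
    0.5, 2.1,  NA,
     NA, 1.9, 2.2,
    4.0, 0.3, 1.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("r", 1:4), paste0("c", 1:3))
)

# Checks mirroring the requirements stated for `data`:
# a numeric matrix that carries both row names and column names.
is_ok <- is.matrix(toy) && is.numeric(toy) &&
  !is.null(rownames(toy)) && !is.null(colnames(toy))
is_ok
```

A data frame can usually be brought into this shape with as.matrix() after moving any identifier column into the row names, as the flights example in the Examples section does.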

nrep

The number of times to repeat the biclustering for each set of parameters. Default 10.

parallel

Logical indicating if the user would like to utilize the foreach parallel backend. Default is FALSE.

ncores

The number of cores to use when parallel = TRUE. Default 2.
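
One possible way to pick a value for ncores is to query the machine with parallel::detectCores() from base R's parallel package. This is only a sketch: leaving one core free is a common convention rather than a package requirement, and the variable names available and use_parallel are hypothetical.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free, but never go below 1
# (a convention assumed here, not a rule of the package).
ncores <- max(1L, available - 1L)

# Only request the parallel backend when more than one core would be used.
use_parallel <- ncores > 1
```

These values could then be supplied as parallel = use_parallel and ncores = ncores.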

tune_grid

A data frame of parameters to tune over, with one row per parameter combination. Its column names must match the argument names of biclustermd().
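
A tuning grid of this kind can be sketched with base R's expand.grid(), which produces one row per combination of the supplied values. The argument names below are taken from the Examples on this page. The name check at the end is an optional convenience assumed for illustration; it is not something the package is documented to provide, and it only runs if biclustermd is installed.

```r
# Candidate values for three biclustermd() arguments.
tg <- expand.grid(
  row_clusters = 2,
  col_clusters = 3:5,
  similarity = c("Rand", "Jaccard")
)
nrow(tg)  # 1 * 3 * 2 combinations

# Optional sanity check: flag columns that are not biclustermd() arguments.
if (requireNamespace("biclustermd", quietly = TRUE)) {
  unknown <- setdiff(names(tg), names(formals(biclustermd::biclustermd)))
  if (length(unknown) > 0) {
    warning("Unknown arguments: ", paste(unknown, collapse = ", "))
  }
}
```

Because each row is fitted nrep times, the total number of biclusterings is nrow(tg) * nrep, which is worth checking before starting a large grid.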

Value

A list of:

best_combn

The best combination of parameters,

best_bc

The minimum SSE biclustering using the parameters in best_combn,

grid

tune_grid with columns giving the minimum, mean, and standard deviation of the final SSE for each parameter combination, and

runtime

CPU runtime & elapsed time.
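
To illustrate what the summary columns in grid represent, here is a base R sketch on made-up SSE values; it does not call the package. The column names mean_sse and sd_sse appear in the Examples on this page, while min_sse and the rule of choosing the combination with the smallest minimum SSE are assumptions made for this illustration.

```r
# Hypothetical final SSEs: 2 parameter combinations, 3 repeats each.
runs <- data.frame(
  col_clusters = rep(c(3, 4), each = 3),
  sse = c(10, 12, 11, 8, 15, 9)
)

# Per-combination minimum, mean, and standard deviation of the SSE.
grid <- do.call(rbind, lapply(split(runs, runs$col_clusters), function(d) {
  data.frame(
    col_clusters = d$col_clusters[1],
    min_sse = min(d$sse),
    mean_sse = mean(d$sse),
    sd_sse = sd(d$sse)
  )
}))

# Assumed selection rule: the combination with the smallest minimum SSE.
best_combn <- grid[which.min(grid$min_sse), ]
best_combn
```

In this toy data the second combination has the lowest single SSE but also the larger spread, which is why the mean and standard deviation are worth inspecting alongside the minimum.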

References

Li, J., Reisner, J., Pham, H., Olafsson, S., and Vardeman, S. (2019) Biclustering for Missing Data. Information Sciences, Submitted

See Also

biclustermd, rep_biclustermd

Examples

library(dplyr)
library(ggplot2)
data("synthetic")
tg <- expand.grid(
  miss_val = fivenum(synthetic),
  similarity = c("Rand", "HA", "Jaccard"),
  col_min_num = 2,
  row_min_num = 2,
  col_clusters = 3:5,
  row_clusters = 2
)
tg

# in parallel, using two cores
tbc <- tune_biclustermd(synthetic, nrep = 2, parallel = TRUE, ncores = 2, tune_grid = tg)
tbc

tbc$grid %>%
  group_by(miss_val, col_clusters) %>%
  summarise(avg_sd = mean(sd_sse)) %>%
  ggplot(aes(miss_val, avg_sd, color = col_clusters, group = col_clusters)) +
  geom_line() +
  geom_point()

# sequentially (parallel = FALSE is the default)
tbc <- tune_biclustermd(synthetic, nrep = 2, tune_grid = tg)
tbc

boxplot(tbc$grid$mean_sse ~ tbc$grid$similarity)
boxplot(tbc$grid$sd_sse ~ tbc$grid$similarity)

# nycflights13::flights dataset

library(nycflights13)
data("flights")

library(tidyr)  # provides spread(), used below; dplyr is already attached above
flights_bcd <- flights %>%
  select(month, dest, arr_delay)

flights_bcd <- flights_bcd %>%
  group_by(month, dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  spread(dest, mean_arr_delay) %>%
  as.data.frame()

# months as rows
rownames(flights_bcd) <- flights_bcd$month
flights_bcd <- as.matrix(flights_bcd[, -1])

flights_grid <- expand.grid(
  row_clusters = 4,
  col_clusters = c(6, 9, 12),
  miss_val = fivenum(flights_bcd),
  similarity = c("Rand", "Jaccard")
)

# RUN TIME: approximately 40 seconds across two cores.
flights_tune <- tune_biclustermd(
  flights_bcd,
  nrep = 10,
  parallel = TRUE,
  ncores = 2,
  tune_grid = flights_grid
)
flights_tune

biclustermd documentation built on June 17, 2021, 5:11 p.m.