biclustermd: Bicluster data with non-random missing values


View source: R/biclustermd.R

Description

Bicluster data with non-random missing values

Usage

biclustermd(
  data,
  row_clusters = floor(sqrt(nrow(data))),
  col_clusters = floor(sqrt(ncol(data))),
  miss_val = mean(data, na.rm = TRUE),
  miss_val_sd = 1,
  similarity = "Rand",
  row_min_num = floor(nrow(data)/row_clusters),
  col_min_num = floor(ncol(data)/col_clusters),
  row_num_to_move = 1,
  col_num_to_move = 1,
  row_shuffles = 1,
  col_shuffles = 1,
  max.iter = 100,
  verbose = FALSE
)
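
Several of the defaults above are expressions evaluated against data. As a minimal sketch, the base R code below evaluates the same expressions on a small made-up matrix (toy is hypothetical and not part of the package) to show the values they produce.

```r
# Hypothetical 6 x 9 numeric matrix with a few missing cells.
toy <- matrix(as.numeric(1:54), nrow = 6, ncol = 9,
              dimnames = list(paste0("r", 1:6), paste0("c", 1:9)))
toy[c(1, 20, 54)] <- NA

# Same expressions as the defaults in the usage above.
row_clusters <- floor(sqrt(nrow(toy)))            # floor(sqrt(6)) = 2
col_clusters <- floor(sqrt(ncol(toy)))            # floor(sqrt(9)) = 3
miss_val     <- mean(toy, na.rm = TRUE)           # mean of the observed cells
row_min_num  <- floor(nrow(toy) / row_clusters)   # floor(6 / 2) = 3
col_min_num  <- floor(ncol(toy) / col_clusters)   # floor(9 / 3) = 3
```

Because row_min_num and col_min_num depend on row_clusters and col_clusters, changing the number of clusters also changes these two defaults unless they are set explicitly.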

Arguments

data

Dataset to bicluster. Must be a matrix containing only numbers and missing values. It should have row names and column names.

row_clusters

The number of clusters to partition the rows into. The default is floor(sqrt(nrow(data))).

col_clusters

The number of clusters to partition the columns into. The default is floor(sqrt(ncol(data))).

miss_val

Value or function used to fill empty cells of the prototype matrix. If a value, each iteration draws a random normal variable with mean miss_val and sd = miss_val_sd. By default, this equals the mean of data.

miss_val_sd

Standard deviation of the normal distribution from which fill values are drawn when miss_val is a number. Default is 1.

similarity

The metric used to compare two successive clusterings. Can be "Rand" (default), "HA" for the Hubert and Arabie adjusted Rand index or "Jaccard". See RRand for details.

row_min_num

Minimum size a row prototype must have to be eligible for selection when filling an empty row prototype. Default is floor(nrow(data) / row_clusters).

col_min_num

Minimum size a column prototype must have to be eligible for selection when filling an empty column prototype. Default is floor(ncol(data) / col_clusters).

row_num_to_move

Number of rows to remove from the sampled prototype to put in the empty row prototype. Default is 1.

col_num_to_move

Number of columns to remove from the sampled prototype to put in the empty column prototype. Default is 1.

row_shuffles

Number of times to shuffle rows in each iteration. Default is 1.

col_shuffles

Number of times to shuffle columns in each iteration. Default is 1.

max.iter

Maximum number of iterations the algorithm is allowed to run. Default is 100.

verbose

Logical. If TRUE, will report progress.
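
The miss_val and similarity arguments can be illustrated with a short sketch. This is not the package's code: fill_value() and rand_index() are hypothetical helpers, and the sketch assumes that a numeric miss_val is the mean of the normal draw and that a function-valued miss_val takes no arguments. rand_index() computes the unadjusted Rand index, the fraction of item pairs on which two clusterings agree.

```r
# Hypothetical helper: draw a fill value for an empty prototype cell.
# Assumes miss_val is the mean and miss_val_sd the standard deviation,
# and that a function-valued miss_val is called without arguments.
fill_value <- function(miss_val, miss_val_sd = 1) {
  if (is.function(miss_val)) {
    miss_val()
  } else {
    rnorm(1, mean = miss_val, sd = miss_val_sd)
  }
}

# Hypothetical helper: unadjusted Rand index of two label vectors.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")   # TRUE where two items share a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where two items share a cluster in b
  pairs <- upper.tri(same_a)    # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

set.seed(1)
fill_value(10, miss_val_sd = 2)   # one draw from a normal with mean 10, sd 2
fill_value(function() 0)          # a function is simply called

previous <- c(1, 1, 2, 2, 3)      # made-up clustering from one iteration
current  <- c(1, 1, 2, 3, 3)      # made-up clustering from the next
rand_index(previous, current)     # 0.8: the clusterings agree on 8 of 10 pairs
rand_index(previous, previous)    # 1: identical clusterings
```

A value of 1 means two successive clusterings are identical, which is why a similarity measure of this kind can serve as a stopping condition.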

Value

A list of class biclustermd:

params

a list of all arguments passed to the function, including defaults.

data

the input two-way table of data.

P0

the initial column partition matrix.

Q0

the initial row partition matrix.

InitialSSE

the SSE of the original partitioning.

P

the final column partition matrix.

Q

the final row partition matrix.

SSE

a matrix of class biclustermd_sse detailing the SSE recorded at the end of each iteration.

Similarities

a data frame of class biclustermd_sim detailing the value of row and column similarity measures recorded at the end of each iteration. Contains information for all three similarity measures. This carries an attribute "used" which provides the similarity measure used as the stopping condition for the algorithm.

iteration

the number of iterations the algorithm ran for, whether max.iter was reached or convergence was achieved.

A

the final prototype matrix which gives the average of each bicluster.
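
To make the relationship between the partition matrices, the prototype matrix and the SSE concrete, here is a base R sketch. It assumes that a partition matrix is a 0/1 indicator matrix with one column per cluster, that each entry of A is the mean of the observed cells in a bicluster, and that the SSE sums squared deviations of observed cells from their bicluster mean. The data x and the labels row_labels and col_labels are made up for illustration, and none of this is taken from the package's source.

```r
# Hypothetical 4 x 4 data with one missing cell.
x <- matrix(c(1,  2, 9, 8,
              3, NA, 7, 9,
              5,  6, 1, 2,
              6,  5, 2, 1), nrow = 4, byrow = TRUE)
row_labels <- c(1, 1, 2, 2)   # made-up row cluster assignment
col_labels <- c(1, 1, 2, 2)   # made-up column cluster assignment

# Indicator matrix: entry (i, k) is 1 when item i belongs to cluster k.
indicator <- function(labels) {
  outer(labels, sort(unique(labels)), "==") * 1
}
Q <- indicator(row_labels)    # rows x row clusters
P <- indicator(col_labels)    # columns x column clusters

observed <- !is.na(x)
x0 <- x
x0[!observed] <- 0            # zero out missing cells before summing

# Bicluster sums and counts of observed cells, then their ratio.
sums   <- t(Q) %*% x0 %*% P
counts <- t(Q) %*% (observed * 1) %*% P
A <- sums / counts            # bicluster means (assumed form of the prototype matrix)

# Expand A back to the shape of x and sum squared residuals over observed cells.
fitted <- Q %*% A %*% t(P)
sse <- sum((x - fitted)[observed]^2)
```

In this toy case A is a 2 x 2 matrix with entries 2, 8.25, 5.5 and 1.5 (row by row), and sse is 6.75. A bicluster with no observed cells would give 0 / 0 here, which is the situation the miss_val argument addresses.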

References

Li, J., Reisner, J., Pham, H., Olafsson, S., and Vardeman, S. (2020) Biclustering with Missing Data. Information Sciences, 510, 304–316.

See Also

rep_biclustermd, tune_biclustermd

Examples

library(biclustermd)
library(ggplot2)  # provides the autoplot() generic

data("synthetic")
# default parameters
bc <- biclustermd(synthetic)
bc
autoplot(bc)

# providing the true number of row and column clusters
bc <- biclustermd(synthetic, col_clusters = 3, row_clusters = 2)
bc
autoplot(bc)

# an example with the nycflights13::flights dataset
library(nycflights13)
data("flights")

library(dplyr)
library(tidyr)  # provides spread()
flights_bcd <- flights %>%
  select(month, dest, arr_delay)

flights_bcd <- flights_bcd %>%
  group_by(month, dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  spread(dest, mean_arr_delay) %>%
  as.data.frame()

rownames(flights_bcd) <- flights_bcd$month
flights_bcd <- as.matrix(flights_bcd[, -1])

flights_bc <- biclustermd(data = flights_bcd, col_clusters = 6, row_clusters = 4,
                  row_min_num = 3, col_min_num = 5,
                  max.iter = 20, verbose = TRUE)
flights_bc

biclustermd documentation built on June 17, 2021, 5:11 p.m.