## Group rows in a matrix based on their correlation

### Description

The `groupByCorrelation` function groups rows of a numeric matrix based on their pairwise correlation.

Two types of groupings are available:

• `inclusive = FALSE` (the default): the algorithm creates small groups of highly correlated members, in which every pair of members has a correlation `>= threshold`. Note that with this algorithm, rows in `x` can still have a correlation `>= threshold` with one or more members of a group they are not part of. See the Note section below for more information.

• `inclusive = TRUE`: the algorithm creates large groups containing rows that have a correlation `>= threshold` with at least one member of that group. For example, if rows 1 and 3 have a correlation above the threshold, and rows 3 and 5 do too (but the correlation between rows 1 and 5 is below the threshold), all three rows (1, 3 and 5) are put into the same group.
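The `inclusive = TRUE` behaviour described above can be read as finding connected components in a graph whose edges are row pairs with a correlation `>= threshold`. The following is a minimal base R sketch under that assumption; `inclusive_groups` is a hypothetical helper written for illustration, not the package's implementation, and the correlation matrix is hand-built to mirror the rows 1, 3 and 5 example.

```r
## Hypothetical helper (illustration only): label connected components of the
## graph in which two rows are linked if their correlation is >= threshold.
inclusive_groups <- function(cors, threshold = 0.9) {
    n <- nrow(cors)
    adj <- !is.na(cors) & cors >= threshold
    diag(adj) <- TRUE
    grp <- integer(n)               # 0 means "not yet assigned"
    current <- 0L
    for (i in seq_len(n)) {
        if (grp[i] != 0L) next
        current <- current + 1L
        queue <- i
        while (length(queue)) {
            node <- queue[1L]
            queue <- queue[-1L]
            if (grp[node] != 0L) next
            grp[node] <- current
            ## enqueue all unassigned rows linked to the current row
            queue <- c(queue, which(adj[node, ] & grp == 0L))
        }
    }
    factor(grp)
}

## Hand-built correlation matrix: 1-3 and 3-5 are above the threshold,
## 1-5 is below it, rows 2 and 4 are uncorrelated with everything else.
cors <- diag(5)
cors[1, 3] <- cors[3, 1] <- 0.95
cors[3, 5] <- cors[5, 3] <- 0.92
cors[1, 5] <- cors[5, 1] <- 0.50

res <- inclusive_groups(cors, threshold = 0.9)
res   # rows 1, 3 and 5 share one group; rows 2 and 4 are singletons
```

Rows 1 and 5 end up together only because both are linked to row 3, which is what makes the resulting groups larger and more loosely correlated than with `inclusive = FALSE`.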

With parameter `f` it is also possible to pre-define groups of rows that should be further sub-grouped based on correlation. If `f` is provided, correlations are calculated only between rows with the same value in `f`, so rows from different pre-defined groups are never grouped together. The returned `factor` is then `f` with the sub-group appended (separated by a `"."`). See examples below.
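To illustrate how a pre-defined grouping restricts the comparison and how the combined labels are built, here is a small base R sketch. Both `subgroup_within` and `toy_grouping` are hypothetical helpers introduced for this illustration; `toy_grouping` is a deliberately simple stand-in and not the grouping algorithm used by `groupByCorrelation`.

```r
## Hypothetical helper (illustration only): apply a grouping function `FUN`
## separately to the rows of each pre-defined group in `f` and append the
## resulting sub-group to the value of `f`, separated by ".".
subgroup_within <- function(x, f, FUN) {
    f <- as.character(f)
    res <- character(length(f))
    for (g in unique(f)) {
        idx <- which(f == g)
        sub <- FUN(x[idx, , drop = FALSE])
        res[idx] <- paste(g, as.integer(sub), sep = ".")
    }
    factor(res)
}

## Stand-in grouping: rows correlated with the first row of the block get
## sub-group 1, all other rows get sub-group 2.
toy_grouping <- function(z, threshold = 0.9) {
    cors <- cor(t(z), use = "pairwise.complete.obs")
    factor(ifelse(cors[1, ] >= threshold, 1L, 2L))
}

x <- rbind(
    c(1, 2, 3, 4),
    c(2, 4, 6, 8),
    c(4, 3, 2, 1),
    c(10, 20, 30, 40),
    c(8, 6, 4, 2))
f <- c("a", "a", "a", "b", "b")

res <- subgroup_within(x, f, toy_grouping)
res
```

Rows 1 and 4 are perfectly correlated, yet they receive different labels because they belong to different pre-defined groups and are therefore never compared.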

### Usage

```
groupByCorrelation(
  x,
  method = "pearson",
  use = "pairwise.complete.obs",
  threshold = 0.9,
  f = NULL,
  inclusive = FALSE
)
```

### Arguments

| Argument | Description |
|---|---|
| `x` | `numeric` `matrix` where rows should be grouped based on correlation of their values across columns being larger than `threshold`. |
| `method` | `character(1)` with the method to be used for correlation. See `cor()` for options. |
| `use` | `character(1)` defining which values should be used for the correlation. See `cor()` for details. |
| `threshold` | `numeric(1)` defining the cutoff value above which rows are considered to be correlated and hence grouped. |
| `f` | optional vector of length equal to `nrow(x)` pre-defining groups of rows in `x` that should be further sub-grouped. See description for details. |
| `inclusive` | `logical(1)` whether a version of the grouping algorithm should be used that leads to larger, more loosely correlated groups. The default is `inclusive = FALSE`. See description for more information. |

### Value

`factor` of length `nrow(x)` giving the group each row is assigned to.

### Note

Implementation notes for the grouping algorithm:

• all correlations between rows in `x` which are `>= threshold` are identified and sorted decreasingly.

• starting with the pair with the highest correlation, groups are defined:

  • if neither of the two is in a group, both are put into the same new group.

  • if one of the two is already in a group, the other is put into the same group if all of its correlations to the members of that group are `>= threshold` (and are not `NA`).

  • if both are already in the same group, nothing is done.

  • if both are in different groups: an element is put into the group of the other if a) all of its correlations to members of the other's group are not `NA` and `>= threshold` and b) its average correlation to the other group is larger than the average correlation to its own group.

This ensures that groups are defined in which all elements have a correlation `>= threshold` with each other and the correlation between members of the same group is maximized.
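The first step listed above (identifying all correlations `>= threshold` and sorting them decreasingly) can be sketched in base R as follows. `candidate_pairs` is a hypothetical helper written for illustration, not the package's code; the correlation matrix is hand-built here, whereas in practice it would presumably come from something like `cor(t(x), method = method, use = use)`.

```r
## Hypothetical helper (illustration only): list all row pairs with a
## correlation >= threshold, sorted by decreasing correlation.
candidate_pairs <- function(cors, threshold = 0.9) {
    ## keep each pair once and drop the diagonal
    cors[lower.tri(cors, diag = TRUE)] <- NA
    ## NA entries are ignored by which()
    idx <- which(cors >= threshold, arr.ind = TRUE)
    pairs <- data.frame(row_a = idx[, "row"],
                        row_b = idx[, "col"],
                        cor = cors[idx])
    pairs <- pairs[order(pairs$cor, decreasing = TRUE), , drop = FALSE]
    rownames(pairs) <- NULL
    pairs
}

## Hand-built symmetric correlation matrix for four rows
cors <- diag(4)
cors[1, 2] <- cors[2, 1] <- 0.91
cors[1, 3] <- cors[3, 1] <- 0.99
cors[2, 3] <- cors[3, 2] <- 0.95
cors[3, 4] <- cors[4, 3] <- 0.20

pairs <- candidate_pairs(cors, threshold = 0.9)
pairs   # pair (1, 3) comes first, then (2, 3), then (1, 2)
```

The grouping rules from the list would then be applied to these pairs in order, starting with the most highly correlated one.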

### Author(s)

Johannes Rainer

### Examples

```
x <- rbind(
    c(1, 3, 2, 5),
    c(2, 6, 4, 7),
    c(1, 1, 3, 1),
    c(1, 3, 3, 6),
    c(0, 4, 3, 1),
    c(1, 4, 2, 6),
    c(2, 8, 2, 12))

## define which rows have a high correlation with each other
groupByCorrelation(x)

## assuming we have some prior grouping of rows, further sub-group them
## based on pairwise correlation.
f <- c(1, 2, 2, 1, 1, 2, 2)
groupByCorrelation(x, f = f)
```

