rca: Relevant Component Analysis

View source: R/rca.R

rcaR Documentation

Relevant Component Analysis

Description

Performs relevant component analysis on the given data.

Usage

rca(x, chunks, useD = NULL)

Arguments

x

matrix or data frame of original data. Each row is a feature vector of a data instance.

chunks

list of k numerical vectors. Each vector represents a chunklet, the elements in the vectors indicate where the samples locate in x. See examples for more information.

useD

optional. When not given, RCA is done in the original dimension and B is full rank. When useD is given, RCA is preceded by constraints based LDA which reduces the dimension to useD. B in this case is of rank useD.

Details

The RCA function takes a data set and a set of positive constraints as arguments and returns a linear transformation of the data space into better representation, alternatively, a Mahalanobis metric over the data space.

Relevant component analysis consists of three steps:

  1. locate the test point

  2. compute the distances between the test points

  3. find k shortest distances and the bla

The new representation is known to be optimal in an information theoretic sense under a constraint of keeping equivalent data points close to each other.

Value

list of the RCA results:

B

The RCA suggested Mahalanobis matrix. Distances between data points x1, x2 should be computed by (x2 - x1)' * B * (x2 - x1)

A

The RCA suggested transformation of the data. The data should be transformed by A * data

newX

The data after the RCA transformation (A). newData = A * data

The three returned argument are just different forms of the same output. If one is interested in a Mahalanobis metric over the original data space, the first argument is all she/he needs. If a transformation into another space (where one can use the Euclidean metric) is preferred, the second returned argument is sufficient. Using A and B is equivalent in the following sense:

if y1 = A * x1, y2 = A * y2 then (x2 - x1)' * B * (x2 - x1) = (y2 - y1)' * (y2 - y1)

Note

Note that any different sets of instances (chunklets), e.g. 1, 3, 7 and 4, 6, might belong to the same class and might belong to different classes.

Author(s)

Nan Xiao <https://nanx.me>

References

Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall (2003). Learning Distance Functions using Equivalence Relations. Proceedings of 20th International Conference on Machine Learning (ICML2003).

See Also

See dca for exploiting negative constrains.

Examples

## Not run: 
library("MASS") # generate synthetic multivariate normal data
set.seed(42)
k <- 100L # sample size of each class
n <- 3L # specify how many classes
N <- k * n # total sample size
x1 <- mvrnorm(k, mu = c(-16, 8), matrix(c(15, 1, 2, 10), ncol = 2))
x2 <- mvrnorm(k, mu = c(0, 0), matrix(c(15, 1, 2, 10), ncol = 2))
x3 <- mvrnorm(k, mu = c(16, -8), matrix(c(15, 1, 2, 10), ncol = 2))
x <- as.data.frame(rbind(x1, x2, x3)) # predictors
y <- gl(n, k) # response

# fully labeled data set with 3 classes
# need to use a line in 2D to classify
plot(x[, 1L], x[, 2L],
  bg = c("#E41A1C", "#377EB8", "#4DAF4A")[y],
  pch = rep(c(22, 21, 25), each = k)
)
abline(a = -10, b = 1, lty = 2)
abline(a = 12, b = 1, lty = 2)

# generate synthetic chunklets
chunks <- vector("list", 300)
for (i in 1:100) chunks[[i]] <- sample(1L:100L, 10L)
for (i in 101:200) chunks[[i]] <- sample(101L:200L, 10L)
for (i in 201:300) chunks[[i]] <- sample(201L:300L, 10L)

chks <- x[unlist(chunks), ]

# make "chunklet" vector to feed the chunks argument
chunksvec <- rep(-1L, nrow(x))
for (i in 1L:length(chunks)) {
  for (j in 1L:length(chunks[[i]])) {
    chunksvec[chunks[[i]][j]] <- i
  }
}

# relevant component analysis
rcs <- rca(x, chunksvec)

# learned transformation of the data
rcs$A

# learned Mahalanobis distance metric
rcs$B

# whitening transformation applied to the chunklets
chkTransformed <- as.matrix(chks) %*% rcs$A

# original data after applying RCA transformation
# easier to classify - using only horizontal lines
xnew <- rcs$newX
plot(xnew[, 1L], xnew[, 2L],
  bg = c("#E41A1C", "#377EB8", "#4DAF4A")[gl(n, k)],
  pch = c(rep(22, k), rep(21, k), rep(25, k))
)
abline(a = -15, b = 0, lty = 2)
abline(a = 16, b = 0, lty = 2)

## End(Not run)

terrytangyuan/dml documentation built on July 19, 2023, 2 p.m.