classCleaner: classCleaner: A package for cleaning outliers when data is...


View source: R/clean_classes.R

Description

Test whether each instance in a class actually belongs.

Usage

classCleaner(
  D,
  assignment,
  classes = "all",
  alpha0 = 0.05,
  q = 0.5,
  labels = NULL,
  exclude_classes = NULL
)

Arguments

D

A distance matrix containing the pairwise dissimilarity scores between instances.

assignment

The assigned group (class) of each instance.

classes

The subset of classes on which filtering is performed, or "all" if all classes should be analyzed.

alpha0

Desired global type I (v1) or type II (v2) error rate.

q

(v2 only) The proportion of distances expected to be "close enough" to keep an instance. Defaults to 0.5.

labels

A vector of labels, one per instance. Its length must equal the number of rows in D. If NULL, the algorithm will check for row names and column names in D. If none are found, the instances will be labeled with the numbers 1:nrow(D).

exclude_classes

Names of "mega" classes which should not be included in determining whether or not classCleaner2 is appropriate. By default, these classes will not be included in the analysis.
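
As a rough illustration of the inputs described above, the sketch below builds a small dissimilarity matrix and an assignment vector in base R, and mimics the documented fallback for labels. The helper resolve_labels() is hypothetical and not part of the package; in particular, checking row names before column names is an assumption, since the documentation does not state which takes precedence.

```r
# Hypothetical helper (not part of classCleaner) that mirrors the documented
# fallback for `labels`: explicit labels, then dimnames of D, then 1:nrow(D).
resolve_labels <- function(D, labels = NULL) {
  if (!is.null(labels)) {
    stopifnot(length(labels) == nrow(D))  # one label per instance
    return(labels)
  }
  # Assumed order: row names first, then column names.
  if (!is.null(rownames(D))) return(rownames(D))
  if (!is.null(colnames(D))) return(colnames(D))
  seq_len(nrow(D))
}

set.seed(1)
X <- matrix(rnorm(6 * 4), nrow = 6)   # 6 observations of 4 instances (columns)
D <- 1 - cor(X)                       # 4 x 4 dissimilarity matrix
assignment <- c("a", "a", "b", "b")   # one class label per instance

labels_default <- resolve_labels(D)   # D has no dimnames, so numbers are used

colnames(X) <- paste0("inst", 1:4)
labels_named <- resolve_labels(1 - cor(X))  # cor() carries the column names over
```

Here D and assignment have the shapes that the Usage section expects: one row and column of D, and one entry of assignment, per instance.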

Details

For each instance in an analyzed class, this function will estimate the probability that it was correctly placed in that class.
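
The idea of scoring each instance against its own class can be pictured with a toy example. The sketch below is not the classCleaner algorithm and does not estimate a probability; it only computes, for each instance, the share of its within-class distances that fall under a fixed cutoff, and flags instances whose share is below q. The helper close_fraction(), the cutoff value, and the simulated points are all assumptions made for illustration.

```r
# Toy illustration only; this is NOT how classCleaner computes its result.
set.seed(1)

# Two well-separated groups of 10 points each in two dimensions.
pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
assignment <- rep(c("A", "B"), each = 10)
assignment[1] <- "B"  # deliberately mislabel the first point

D <- as.matrix(dist(pts))  # pairwise Euclidean distances

# Hypothetical helper: share of instance i's distances to its classmates
# that are smaller than `cutoff`.
close_fraction <- function(i, D, assignment, cutoff) {
  mates <- setdiff(which(assignment == assignment[i]), i)
  mean(D[i, mates] < cutoff)
}

cutoff <- 2.5  # assumed threshold, well between the two group centers
q <- 0.5       # keep an instance if at least half of its classmates are close

frac <- vapply(seq_len(nrow(D)), close_fraction, numeric(1),
               D = D, assignment = assignment, cutoff = cutoff)
flagged <- which(frac < q)  # instances that look misplaced under this toy rule
```

Only the mislabeled point ends up flagged, because none of its assigned classmates are nearby, while every other instance is close to at least nine of its ten classmates.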

Examples

set.seed(23)

X <- simulate_clustered_data(
  n = 200,
  Nk = rep(50, 100),
  s = rep(1, 100),
  rho = .2,
  tau = 1,
  method = "by-class"
)
# true assignment
a <- rep(1:100, each = 50)

# random replacement labels, one per instance
b <- sample(100, 50 * 100, replace = TRUE)

# corrupt roughly 10% of the assignments (5000 instances = 100 classes x 50)
a.corrupt <- ifelse(runif(50 * 100) < 0.1, b, a)

D <- 1 - cor(X)
result <- classCleaner(D, a.corrupt, labels = colnames(D))

melissakey/classCleaner documentation built on Feb. 11, 2022, 3:33 a.m.