selectscat: Selecting a scatterplot matrix based on scagnostics


Description

Selects a scatterplot matrix from a data frame, containing the k variables with approximately the highest "relevance". If no user-defined measure of relevance is supplied, the function uses the maximum of the measures "Outlying", "Clumpy", "Sparse", "Striated", "1-Convex" and "Stringy" from the scagnostics package. See Details and Note.

Usage

selectscat(data,relmat=NULL,k=5,r=k,plot=TRUE,criteria="maxm")

Arguments

data

A data frame or a list of class "sdfdata". If data is a data frame and contains categorical variables, they are excluded.

relmat

NULL or a matrix that can be interpreted as a similarity matrix
(m_ii = 1, m_ij = m_ji, 0 <= m_ij <= 1).

k

A positive integer. The number of variables to include in the scatterplot matrix.

r

A positive integer greater than or equal to k. Controls the quality of the approximation (see Details).

plot

Logical. Should the plot be drawn? Default is TRUE.

criteria

"maxm" or "cor". Use the maximum of the measures ("maxm") or the correlation as the measure of relevance. Ignored if relmat is not equal to NULL.
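
The conditions listed for relmat can be checked before calling the function. The following is a minimal sketch; `is_similarity_matrix` is a hypothetical helper and not part of mbgraphic, and the built-in `mtcars` data is used only for illustration. It also shows that a raw correlation matrix may contain negative entries, so taking absolute values is one possible way to obtain a matrix that meets the stated conditions.

```r
# Hypothetical helper (not part of mbgraphic): checks the properties
# listed for relmat.
is_similarity_matrix <- function(m, tol = 1e-8) {
  is.matrix(m) && is.numeric(m) &&
    nrow(m) == ncol(m) &&
    all(abs(diag(m) - 1) < tol) &&         # m_ii = 1
    isSymmetric(unname(m), tol = tol) &&   # m_ij = m_ji
    all(m >= -tol & m <= 1 + tol)          # 0 <= m_ij <= 1
}

x <- mtcars[, c("mpg", "disp", "hp", "wt")]
is_similarity_matrix(cor(x))       # FALSE: mpg is negatively correlated with disp
is_similarity_matrix(abs(cor(x)))  # TRUE
```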

Details

To keep the selection fast for data sets with a very large number of variables, the function avoids considering all possible combinations. The implemented algorithm reorders the variables using optimal leaf ordering: an average linkage clustering is performed on the relevance criterion, which is interpreted as a similarity measure, and the new order of the variables is chosen so that pairs of variables with high values of the criterion are placed next to each other. This makes it possible to search for the optimal matrix of size k around the diagonal of the reordered matrix of all variables. The size of the area around the diagonal that is searched is controlled by r. If r = p (the number of numeric variables in the data set), then every possible combination is considered. Otherwise, the optimal matrix is not guaranteed to be found.
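
The idea can be illustrated with a short sketch in base R. This is not the mbgraphic implementation: `select_k_sketch` is a hypothetical name, the default leaf order returned by `hclust` stands in for optimal leaf ordering, and the sum of pairwise relevances is an assumed scoring rule for a candidate matrix.

```r
# Illustrative sketch only; not the code used by selectscat().
select_k_sketch <- function(relmat, k, r = k) {
  p <- ncol(relmat)
  stopifnot(k >= 2, k <= r, r <= p)
  # Average linkage clustering, treating 1 - relevance as a dissimilarity.
  # Assumption: hclust's default leaf order replaces optimal leaf ordering.
  ord <- hclust(as.dist(1 - relmat), method = "average")$order
  best <- list(score = -Inf, vars = NULL)
  # Slide a window of r neighbouring variables along the reordered diagonal.
  for (start in seq_len(p - r + 1)) {
    window <- ord[start:(start + r - 1)]
    # Within the window, try every subset of size k.
    for (idx in combn(window, k, simplify = FALSE)) {
      sub <- relmat[idx, idx]
      score <- sum(sub[upper.tri(sub)])  # assumed score: total pairwise relevance
      if (score > best$score) best <- list(score = score, vars = idx)
    }
  }
  colnames(relmat)[best$vars]
}

rel <- abs(cor(mtcars))
select_k_sketch(rel, k = 3, r = 5)          # restricted search near the diagonal
select_k_sketch(rel, k = 3, r = ncol(rel))  # r = p: every combination is tried
```

With r = p there is a single window covering all variables, so the search is exhaustive; smaller values of r reduce the number of subsets examined at the cost of possibly missing the best one.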

Value

A ggpair object (if plot=TRUE) or a character vector containing the names of the variables selected by the function (if plot=FALSE).

Note

When using more than one measure, results can be strongly influenced by differences in the scales of the measures. Make sure that all measures have similar scales.

When using the function defaults, results can strongly depend on the measure "1-Convex".
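
The effect of differing scales can be seen in a small sketch with made-up values. The data and the helper `rescale01` are hypothetical, and min-max rescaling is only one possible way to bring measures onto a similar scale.

```r
# Hypothetical values: one row per variable pair, one column per measure.
measures <- data.frame(
  Outlying = c(0.10, 0.25, 0.40),
  Convex1  = c(0.70, 0.95, 0.80)  # stands in for "1-Convex"
)

# Min-max rescaling of a single measure to the range [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(measures, rescale01))

do.call(pmax, measures)  # raw maximum: the second measure wins in every row
do.call(pmax, scaled)    # after rescaling, both measures can contribute
```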

Author(s)

Katrin Grimm

References

B. Schloerke et al. (2016) GGally: Extension to ggplot2. https://cran.r-project.org/package=GGally

See Also

sdf, scag2sdf

Examples

data(Election2005)

## Not run: 
# Use whole data set with default settings
selectscat(Election2005)
# Select 7 variables with a higher chance of finding the optimal matrix
selectscat(Election2005,k=7,r=15)

# Use correlation as the measure of relevance
# (not very informative for the election data)
selectscat(Election2005,criteria="cor")
# This gives the same selection as passing the correlation matrix directly
election_num <- Election2005[,sapply(Election2005,is.numeric)]
selectscat(election_num,relmat=cor(election_num),plot=FALSE)

# If a list of class "sdfdata" is already calculated
sdfdf <- sdf(Election2005)
# Use only measure "Outlying"
sdfdf_O <- sdfdf
sdfdf_O$sdf <- sdfdf_O$sdf[,c(1,10,11)]
selectscat(sdfdf_O,k=7,r=15) 

## End(Not run)

mbgraphic documentation built on May 2, 2019, 2:45 a.m.