pcSelect                R Documentation
Description:

The goal is feature selection: given a response variable y and a data
matrix dm, we want to know which variables are “strongly influential” on
y. The type of influence is the same as in the PC algorithm, i.e., y and
x (a column of dm) are associated if they are correlated even when
conditioning on any subset of the remaining columns of dm. Therefore,
only very strong relations are found, and the selected set is typically
a subset of what other feature selection techniques would choose. Note
that robust correlation methods are also available, which render this
method robust.
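To make this association notion concrete, here is a minimal brute-force
sketch in base R (not pcalg code; the helper names parcor.test and
stronglyAssociated are made up for illustration): column j of dm counts
as strongly influential only if its partial correlation with y stays
significant for every conditioning subset of the other columns.
pcSelect() checks such subsets incrementally rather than exhaustively.

parcor.test <- function(y, x, Z) {
  ## partial correlation test via residuals of regressions on Z
  ry <- if (NCOL(Z) > 0) resid(lm(y ~ Z)) else y
  rx <- if (NCOL(Z) > 0) resid(lm(x ~ Z)) else x
  cor.test(ry, rx)$p.value
}

stronglyAssociated <- function(y, dm, j, alpha = 0.05) {
  others  <- setdiff(seq_len(ncol(dm)), j)
  subsets <- unlist(lapply(0:length(others),
                           function(k) combn(others, k, simplify = FALSE)),
                    recursive = FALSE)
  ## keep column j only if *all* conditional independence tests reject
  all(vapply(subsets,
             function(S) parcor.test(y, dm[, j], dm[, S, drop = FALSE]),
             numeric(1)) < alpha)
}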
Usage:

pcSelect(y, dm, alpha, corMethod = "standard",
         verbose = FALSE, directed = FALSE)
Arguments:

y: response vector.

dm: data matrix (rows: samples/observations, columns: variables).

alpha: significance level of the individual partial correlation tests.

corMethod: a string determining the method used for correlation
estimation.

verbose: logical; if TRUE, print diagnostic output about the individual
tests. Note that such diagnostic output may make the function
considerably slower.

directed: logical; should the output graph be directed?
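Two hedged example calls illustrating these arguments (assuming y and dm
as constructed in the Examples below; whether the robust "Qn" correlation
method is available depends on your pcalg installation):

pcS.std <- pcSelect(y, dm, alpha = 0.05)          # defaults: standard correlation
pcS.rob <- pcSelect(y, dm, alpha = 0.05,
                    corMethod = "Qn",             # robust correlation estimate (assumption)
                    verbose   = TRUE)             # per-test diagnostics (slower)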
Details:

This function basically applies pc on the data matrix obtained by
joining y and dm. Since the output is not concerned with the edges found
within the columns of dm, the algorithm is adapted accordingly. Compared
to running pc on the full joint matrix, this typically reduces the
runtime substantially and makes much larger datasets feasible.
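As a conceptual sketch of this relation to pc (not the internal
implementation; assumes y, dm and the setup from the Examples below):

library(pcalg)
Xall     <- cbind(y = y, dm)                      # join response as first column
suffStat <- list(C = cor(Xall), n = nrow(Xall))
pc.fit   <- pc(suffStat, indepTest = gaussCItest,
               alpha = 0.05, p = ncol(Xall))
## the variables adjacent to node 1 (= y) in pc.fit roughly correspond
## to the columns flagged in the G component of pcSelect(y, dm, alpha = 0.05)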
Value:

G: A logical vector indicating which columns of dm are associated with
y.

zMin: The minimal z-values when testing partial correlations between y
and each column of dm.
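A small usage sketch of the returned components (using the call from the
Examples below):

pcS <- pcSelect(y, dm, alpha = 0.05)
str(pcS)                                  # list with components G and zMin
which(pcS$G)                              # indices of the selected columns of dm
sort(pcS$zMin[pcS$G], decreasing = TRUE)  # their z-values, strongest evidence first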
Author(s):

Markus Kalisch (kalisch@stat.math.ethz.ch) and Martin Maechler.
References:

Buehlmann, P., Kalisch, M. and Maathuis, M.H. (2010). Variable selection
for high-dimensional linear models: partially faithful distributions and
the PC-simple algorithm. Biometrika 97, 261–278.
See Also:

pc, which is the more general version of this function;
pcSelect.presel, which applies pcSelect() twice.
Examples:

p <- 10
## generate and draw random DAG :
suppressWarnings(RNGversion("3.5.0"))
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
if (require(Rgraphviz)) {
plot(myDAG, main = "randomDAG(10, prob = 0.2)")
}
## generate 1000 samples of DAG using standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")
## Let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[,10]
dm <- d.mat[,-10]
(pcS <- pcSelect(y, dm, alpha = 0.05))
## You see that variables 4, 5 and 6 are considered important.
## By inspecting zMin,
with(pcS, zMin[G])
## you can also see that the influence of variable 6
## is most evident from the data (its zMin is 18.64, so quite large - as
## a rule of thumb for judging what is large, you could use quantiles
## of the Standard Normal Distribution)
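## A hedged version of that rule of thumb: compare zMin with the standard
## normal critical value used by the individual tests at level alpha.
qnorm(1 - 0.05/2)               # about 1.96
pcS$zMin > qnorm(1 - 0.05/2)    # compare with pcS$G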