Description
This function performs the s4pm
machine learning method in R as
described in the references below. It can fit a graph-only estimate
(y=f(G)) or a graph-based semi-parametric estimate (y=xb+f(G)). Refer
to the examples below.
This is documented in the user manual and references.
Usage

## S4 method for signature 'formula'
s4pm(x,data,metric= c("cosine","euclidean"),...,
est.only=FALSE,control=SemiSupervised.control())
## S4 method for signature 'matrix'
s4pm(x,y,graph,...)
## S4 method for signature 'data.frame'
s4pm(x,y,graph,...)
## S4 method for signature 'vector'
s4pm(x,y,graph,...)
## S4 method for signature 'NULL'
s4pm(x,y,graph,...)
## Default method (not meant to be run directly).
s4pm.default(x,y,graph,weights,hs,lams,gams,type=c("r","c"), est.only=FALSE,
control=SemiSupervised.control())
Arguments

x
    a symbolic description of the model to be fit. A formula can be used
    directly, or the feature data/graph can be supplied directly. Refer to
    the details below for more on this.

data
    the response and/or feature data as a ‘data.frame’. The missing or
    unlabeled responses must be NA.

graph
    the graph matrix. Typically an n by n dissimilarity matrix, but it can
    also be a similarity matrix. Refer to the details section below.

y
    the response. It can either be of length m=|L|<=n with the first m
    observations labeled, or of length n with the unlabeled cases flagged
    as NA. Refer to the details below.

weights
    an optional vector of weights to be used in the fitting process. If
    missing, the unweighted algorithm is fit.

metric
    the metric used to build the graph: either the cosine dissimilarity
    (default) or the Euclidean distance.

hs
    a sequence of tuning parameters for the kernel function if the graph
    is a dissimilarity matrix. If missing, the procedure provides the grid
    by default.

lams
    a ‘vector’ or ‘matrix’ of Lagrangian parameters for the graph and
    ridge penalties, depending on the call. If missing, the procedure
    provides the grid by default.

gams
    a vector of Lagrangian parameters corresponding to the latent
    unlabeled response penalty. If missing, the procedure provides the
    grid by default.

type
    use “r” for regression, which is the default unless y is a factor.
    The “c” option performs classification with logistic loss.

est.only
    returns only the fitted vector (in classification, the sign gives the
    class) with no ‘s4pm’ object. Designed for fast fitting.

control
    control parameters (refer to SemiSupervised.control).

...
    absorbs additional inputs, which are checked against the arguments of
    s4pm.default.
Details

Details on Response:

The response is coded so that NAs denote the missing responses (the unlabeled set) and values (either ‘factor’ or ‘numeric’) denote the known cases (the labeled set). This form of input must be used for the ‘formula’ interface. In some circumstances, the response can instead be supplied with only the first m labeled cases defined, in which case the dimensions of argument y and graph/x differ.
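As a minimal sketch of this NA coding, using the built-in iris data purely for illustration:

```r
## Sketch: coding the response for the 'formula' interface. Unlabeled
## cases are flagged as NA; labeled cases keep their observed values.
set.seed(1)
y <- iris$Species                # full response (a factor), length n
U <- sample(seq_along(y), 100)   # indices of the unlabeled set
y[U] <- NA                       # flag unlabeled cases as NA
## length(y) == n here; alternatively, supply only the first m labeled
## values and let graph/x carry all n observations.
```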
Details on Formula:
1) y~.: The most common case. One starts with a typical data set x and
wishes to fit the model xb+f(G[x]). All fitting is done internally using a
k=6 NN graph and cosine distance by default; these can be modified through
SemiSupervised.control and the metric argument, respectively.
The predict function and all other aspects work internally just like a
typical R function, i.e., the algorithm inputs are the same as if a linear
model or random forest were fit.
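Assuming the SemiSupervised package is installed, case (1) can be sketched as follows (the NA-coded ‘data.frame’ dat is a hypothetical placeholder):

```r
library(SemiSupervised)

## dat: a data.frame whose response column contains NAs for unlabeled rows
fit  <- s4pm(y ~ ., data = dat)   # defaults: k=6 NN graph, cosine metric
fit2 <- s4pm(y ~ ., data = dat, metric = "euclidean",
             control = SemiSupervised.control())  # adjust defaults here
```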
2) y~.+dG(G[x]): The graph corresponds to a dissimilarity matrix
(‘0’ is similar, ‘Inf’ is dissimilar). The graph is
constructed outside the function using the knnGraph command,
and the data argument must be set to the feature data with one column
for the response. This is convenient for benchmarking.
Note that dG(G[x],k=5L) allows one to modify k, and
dG(G[x],nok=TRUE) bypasses knnGraph, treating
the matrix G[x] as a proper adjacency matrix. Refer to dG
for some examples.
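A sketch of case (2), building the graph externally (x, A, and dat are hypothetical placeholders; dat holds the features plus the NA-coded response column):

```r
library(SemiSupervised)

G   <- knnGraph(x)                       # n x n k-NN dissimilarity graph
fit <- s4pm(y ~ . + dG(G), data = dat)   # semi-parametric fit, fixed graph

## bypass knnGraph entirely; A is treated as a proper adjacency matrix
fit.adj <- s4pm(y ~ . + dG(A, nok = TRUE), data = dat)
```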
3) y~.+sG(G): This case corresponds to a similarity graph (‘0’ is
dissimilar and ‘1’ [typically] is similar) together with an additional
feature data set whose rows correspond to the nodes of G. Refer to the dG
help page for some examples.
4) y~dG(G[x]) or y~sG(G): This bypasses the safe semi-parametric component of the model. A more traditional semi-supervised graph-only method is fit. This tends to perform worse but is faster to fit.
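Case (4), the graph-only fit y=f(G), can be sketched as follows (W denotes a hypothetical similarity matrix over the n nodes, and y is NA-coded):

```r
library(SemiSupervised)

## y of length n with NAs for unlabeled nodes; W an n x n similarity matrix
fit <- s4pm(y ~ sG(W))   # bypasses the safe semi-parametric component
```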
5) Non-formula call: The ‘formula’ interface is much simpler, but the
inputs of the function can also be supplied directly through the
‘data.frame’, ‘vector’, ‘matrix’, and ‘NULL’ interfaces of s4pm.
The NULL interface fits version (4) of the ‘formula’ call above. The
s4pm.default method could also be called directly, but this is not
recommended.
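A sketch of the non-formula interfaces (X, y, and G are hypothetical placeholders for the feature matrix, NA-coded response, and graph):

```r
library(SemiSupervised)

fit  <- s4pm(x = X,    y = y, graph = G)  # matrix/data.frame interface
fit0 <- s4pm(x = NULL, y = y, graph = G)  # NULL interface: graph-only fit (4)
```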
Other Details:
The approach fits only the Laplace kernel for the graph. Practically, this
is all that is necessary, since the tuning parameters are optimized for it.
The interest is also in a computationally fast algorithm, so the internally
determined grid is optimized for performance and time.

The predict generic for s4pm is optimized for the ‘formula’ input. This is
the simplest way to call the function and build the graph.
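With a formula-fitted object, prediction follows the usual R pattern (dat and newdat are hypothetical training and new-observation data.frames; the standard newdata argument is assumed):

```r
library(SemiSupervised)

fit <- s4pm(y ~ ., data = dat)   # dat: NA-coded training data.frame
head(fitted(fit))                # fitted values for all n training cases
pred <- predict(fit, newdata = newdat)
```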
Value

An object of class s4pm-class. The code provides semi-supervised
graph-based support for R.
Note

K-fold cross-validation was implemented in C++ to perform parameter estimation for h, the graph penalty, the safe ridge penalty, and the latent unlabeled response penalty. Several LAPACK routines are used to fit the underlying functions for both CV and the regular fit.
Author(s)

Mark Vere Culp
References

MV Culp, KJ Ryan, and P Banerjee (2015). On Safe Semi-supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Submitted.
MV Culp and KJ Ryan (2016). SemiSupervised: Scalable Semi-Supervised Routines for Real Data Problems.
Examples

## Set up Sonar data with 20% labeled (comparing randomForest and glmnet)
library(SemiSupervised)
library(mlbench)
data(Sonar)
n=dim(Sonar)[1]
p=dim(Sonar)[2]
nu=0.2
set.seed(100)
L=sort(sample(1:n,ceiling(nu*n)))
U=setdiff(1:n,L)
y.true<-Sonar$Class
Sonar$Class[U]=NA
## Fit s4pm to Sonar
g.s4pm<-s4pm(Class~.,data=Sonar)
g.s4pm
tab=table(fitted(g.s4pm)[U],y.true[U])
1-sum(diag(tab))/sum(tab)
## For comparison
library(caret)
library(randomForest)
g.glmnet=train(Class~.,data=Sonar[L,],method="glmnet",preProc = c("center", "scale"))
tab=table(predict(g.glmnet,newdata=Sonar[U,-p]),y.true[U])
1-sum(diag(tab))/sum(tab)
g.rf<-randomForest(Class~.,data=Sonar[L,])
tab=table(predict(g.rf,newdata=Sonar[U,-p]),y.true[U])
1-sum(diag(tab))/sum(tab)