s4pm: Safe Semi-Supervised Semi-Parametric Model ('s4pm')

Description Usage Arguments Details Value Note Author(s) References Examples

Description

This performs the s4pm machine learning method in R as described in the references below. It can fit a graph only estimate (y=f(G)) and graph-based semi-parametric estimate (y=xb+f(G)). Refer to the example below.

This is documented in the user manual and references.

Usage

## S4 method for signature 'formula'
s4pm(x,data,metric= c("cosine","euclidean"),...,
  est.only=FALSE,control=SemiSupervised.control())

## S4 method for signature 'matrix'
s4pm(x,y,graph,...)

## S4 method for signature 'data.frame'
s4pm(x,y,graph,...)

## S4 method for signature 'vector'
s4pm(x,y,graph,...)

## S4 method for signature 'NULL'
s4pm(x,y,graph,...)

## Default Method (not meant to be run directly).
s4pm.default(x,y,graph,weights,hs,lams,gams,type=c("r","c"), est.only=FALSE,
	control=SemiSupervised.control())

Arguments

x

a symbolic description of the model to be fit. A formula can be used directly, or the feature data/graph can be supplied directly. Refer to the details below for more on this.

data

a ‘data.frame’ containing the response and/or the feature data. The missing or unlabeled responses must be NA.

graph

the graph matrix. Typically an n by n dissimilarity matrix, but a similarity matrix can also be supplied. Refer to the details section below.

y

the response. It may be of length m=|L|<=n with the first m observations labeled, or of length n with the unlabeled cases flagged as NA. Refer to the details below.

weights

an optional vector of weights to be used in the fitting process. If missing, the unweighted algorithm is fit.

metric

the metric used to build the graph: either cosine dissimilarity or Euclidean distance (computed with the daisy function in the cluster package). Only used with the ‘formula’ call. Default is “cosine”.

hs

a sequence of tuning parameters for the kernel function if the graph is a dissimilarity matrix. If missing then the procedure provides the grid by default.

lams

a ‘vector’ or ‘matrix’ of Lagrangian parameters for the graph and ridge penalties depending on the call. If missing then the procedure provides the grid by default.

gams

a vector of Lagrangian parameters corresponding to the latent unlabeled response penalty. If missing then the procedure provides the grid by default.

type

use “r” for regression, which is the default unless y is a factor. The “c” option performs classification with logistic loss.

est.only

if TRUE, returns only the fitted vector (in classification, the sign gives the class) rather than an ‘s4pm’ object. Designed for fast fitting.

control

control parameters (refer to SemiSupervised.control for more information).

...

mops up additional inputs and checks them against the s4pm.default arguments.

Details

Details on Response:

The response is set so that NA's denote the missing responses (unlabeled set) and values (either ‘factor’ or ‘numeric’) mark the known cases (labeled set). This type of input must be used for the ‘formula’ interface. In some circumstances, the response can instead be supplied with only the first m (labeled) cases defined, in which case the dimensions of argument y and graph/x differ.
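A minimal base-R illustration of the NA coding just described (toy values, not taken from the package):

```r
## Toy response for n = 5 graph nodes: the first m = 3 cases are labeled,
## the last two are unlabeled and coded NA.
y <- c(1.2, 0.7, 2.1, NA, NA)
which(is.na(y))   # unlabeled indices: 4 and 5

## Equivalent shorter form: only the m labeled values, with the
## understanding that the first m rows of graph/x are the labeled cases.
y.short <- c(1.2, 0.7, 2.1)
```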

Details on Formula:

1) y~.: The most common case. One starts with a typical data set x and wishes to fit the model xb+f(G[x]). All fitting is done internally using a k=6 NN graph and cosine distance by default; these can be modified through SemiSupervised.control and the metric argument, respectively. The predict function and all other aspects work internally just like a typical R function, i.e., the algorithm's inputs are the same as if a linear model or random forest were fit.

2) y~.+dG(G[x]): The graph corresponds to a dissimilarity matrix (‘0’ is similar, ‘Inf’ is dissimilar). The graph is constructed outside the function using the knnGraph command, and the data argument must be set to the feature data, with one column for the response. This is convenient for benchmarking.

Note that dG(G[x],k=5L) allows one to modify k, and dG(G[x],nok=TRUE) bypasses the knnGraph step, treating the matrix G[x] as a proper adjacency matrix. Refer to dG for some examples.
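The dG() options above can be sketched as follows. This is a hedged sketch, not package-verified code: it assumes the SemiSupervised package is installed, and x.feat, A, and dat are placeholder names (a numeric feature matrix, a precomputed adjacency matrix, and a data.frame holding the features plus a partially NA response y).

```r
library(SemiSupervised)

## D: a dissimilarity matrix computed outside the call, as described above
## (here via base R's dist(); knnGraph could be used instead).
D <- as.matrix(dist(scale(x.feat)))

g.k5  <- s4pm(y ~ . + dG(D, k = 5L), data = dat)      # modify k for the k-NN graph
g.adj <- s4pm(y ~ . + dG(A, nok = TRUE), data = dat)  # A is already an adjacency matrix
```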

3) y~.+sG(G): This case corresponds to a similarity (‘0’ is dissimilar and ‘1’ [typically] is similar) graph and an additional feature data set whose rows correspond to the nodes of G. Refer to dG help page for some examples.

4) y~dG(G[x]) or y~sG(G): This bypasses the safe semi-parametric component of the model, so a more traditional semi-supervised graph-only method is fit. This tends to perform worse but is faster to fit.
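A graph-only fit of this kind can be sketched as below (again a hedged sketch: assumes the SemiSupervised package is installed; x.feat and dat are placeholder names for a feature matrix and a data.frame with a partially NA response y):

```r
library(SemiSupervised)

## D: placeholder n x n dissimilarity matrix over the features.
D <- as.matrix(dist(scale(x.feat)))

## Graph-only fit y = f(G): the safe semi-parametric term is dropped.
g.graph <- s4pm(y ~ dG(D), data = dat)
```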

5) Non-formula call: The ‘formula’ interface is much simpler, but the inputs of the function can also be manipulated through the ‘data.frame’, ‘vector’, ‘matrix’, and ‘NULL’ interfaces of s4pm. The NULL interface fits version (4) of the ‘formula’ call above. The s4pm.default function could also be called directly, but this is not recommended.
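The non-formula interfaces follow the signatures shown in the Usage section. A hedged sketch (x.feat, y.vec, and D are placeholder names for a numeric feature matrix, an NA-coded response, and a dissimilarity matrix; assumes the package is installed):

```r
library(SemiSupervised)

## Matrix interface: features, NA-coded response, and graph passed directly.
g.mat  <- s4pm(x = x.feat, y = y.vec, graph = D)

## NULL interface: graph-only fit, matching the y ~ dG(D) formula call above.
g.null <- s4pm(x = NULL, y = y.vec, graph = D)
```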

Other Details:

The approach fits only the Laplace kernel for the graph. Practically, this is all that is necessary since the tuning parameters are optimized for it. Also, the interest is in a computationally fast algorithm, so the internal grid determination is optimized for performance and time.

The predict generic for s4pm is optimized for the ‘formula’ input. This is the simplest way to input the function and build the graph.
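Prediction then follows the usual R pattern (sketch only; g.fit is a placeholder for a model fit with the ‘formula’ interface, and newdat a data.frame with the same feature columns as the training data):

```r
## Predict on new observations; the graph extension to the new rows is
## handled internally by the package's predict method.
pred <- predict(g.fit, newdata = newdat)
```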

The code provides semi-supervised graph-based support for R.

Value

An object of class s4pm-class.

Note

K-fold cross-validation was implemented in C++ to perform parameter estimation for h, the graph penalty, the safe ridge penalty, and the latent unlabeled response penalty. Several LAPACK routines are used to fit the underlying functions for both the CV and the regular fit.

Author(s)

Mark Vere Culp

References

MV Culp, KJ Ryan, and P Banerjee (2015). On Safe Semi-supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Submitted.

MV Culp and KJ Ryan (2016). SemiSupervised: Scalable Semi-Supervised Routines for Real Data Problems.

Examples

## Set up Sonar data with 20% labeled (comparing randomForest and glmnet)
library(mlbench)
data(Sonar)

n=dim(Sonar)[1]
p=dim(Sonar)[2]

nu=0.2
set.seed(100)
L=sort(sample(1:n,ceiling(nu*n)))
U=setdiff(1:n,L)

y.true<-Sonar$Class
Sonar$Class[U]=NA

## Fit s4pm to Sonar
g.s4pm<-s4pm(Class~.,data=Sonar) 
g.s4pm
tab=table(fitted(g.s4pm)[U],y.true[U])
1-sum(diag(tab))/sum(tab)

## For comparison

library(caret)
library(randomForest)
     
g.glmnet=train(Class~.,data=Sonar[L,],method="glmnet",preProc = c("center", "scale"))
tab=table(predict(g.glmnet,newdata=Sonar[U,-p]),y.true[U])
1-sum(diag(tab))/sum(tab)
     
g.rf<-randomForest(Class~.,data=Sonar[L,])
tab=table(predict(g.rf,newdata=Sonar[U,-p]),y.true[U])
1-sum(diag(tab))/sum(tab)

SemiSupervised documentation built on May 11, 2018, 5:03 p.m.