FeaLect: Computes the scores of the features.


View source: R/FeaLect.R

Description

Several random subsets are sampled from the input data and, for each random subset, various linear models are fitted using the lars method. A score is assigned to each feature based on the tendency of LASSO to include that feature in the models. Finally, the average score and the models are returned as the output.

Usage

FeaLect(F, L, maximum.features.num = dim(F)[2], total.num.of.models, gamma = 3/4, 
	   persistence = 1000, talk = FALSE, minimum.class.size = 2, 
	   report.fitting.failure = FALSE, return_linear.models = TRUE, balance = TRUE,
	   replace = TRUE, plot.scores = TRUE)

Arguments

F

The feature matrix; each column is a feature.

L

The vector of labels named according to the rows of F.

maximum.features.num

Up to this number of features are allowed to contribute to each linear model.

total.num.of.models

The total number of models that are fitted.

gamma

A value between 0 and 1 that determines the relative size of the sampled subsets.

persistence

Maximum number of attempts at randomly choosing samples. If the labels obtained are still all the same after this many attempts, the function gives up (perhaps all the labels are identical) with the error message: " Not enough variation in the labels...".

talk

If TRUE, some messages are printed during the computations.

minimum.class.size

The size of both positive and negative classes should be greater than this threshold after sampling.

report.fitting.failure

If TRUE, any failure in fitting the linear or logistic models will be printed.

return_linear.models

The models are memory intensive, so if there are more than 1000 of them, we may decide not to return them in order to avoid running out of memory.

balance

If TRUE, the cases are balanced by oversampling, so that the numbers of positives and negatives are equal before the linear model is fitted.

replace

If TRUE, the subsets are sampled with replacement.

plot.scores

If TRUE, the scores are plotted in logarithmic scale after each iteration.
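The interplay of gamma, persistence, minimum.class.size and replace can be pictured with a minimal base R sketch. The helper name sample.subset.sketch is hypothetical and is not part of the package, and the subset size of floor(gamma * length(L)) is an assumption made for illustration; the package's own sampling may differ.

```r
# Hypothetical helper, not part of FeaLect: draw a random subset of sample
# names whose labels contain enough cases of each class.
sample.subset.sketch <- function(L, gamma = 3/4, persistence = 1000,
                                 minimum.class.size = 2, replace = TRUE) {
  # Assumption: the subset size is gamma times the number of samples.
  subset.size <- floor(gamma * length(L))
  for (try.num in seq_len(persistence)) {
    chosen <- sample(names(L), size = subset.size, replace = replace)
    # Count the cases of every class, including classes that were not drawn.
    class.sizes <- table(factor(L[chosen], levels = unique(L)))
    # Accept only if there are at least two classes and each one is
    # larger than the threshold.
    if (length(class.sizes) >= 2 && all(class.sizes > minimum.class.size)) {
      return(chosen)
    }
  }
  # Give up after 'persistence' unsuccessful attempts.
  stop("Not enough variation in the labels...")
}

# Toy labels: 10 negatives and 10 positives, named like the rows of F.
set.seed(1)
L.toy <- c(rep(0, 10), rep(1, 10))
names(L.toy) <- paste0("sample", seq_along(L.toy))
chosen <- sample.subset.sketch(L.toy)
table(L.toy[chosen])
```

If all the labels are identical, no attempt can succeed, so the sketch stops with an error once persistence attempts have been used up.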

Details

See the reference for more details.
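As an illustration of what the balance argument describes, the following base R sketch oversamples the smaller class until both classes have the same number of cases. The helper balance.by.oversampling.sketch is hypothetical and the toy data are made up; this is not the package's compute.balanced function, whose details may differ.

```r
# Hypothetical helper, not part of FeaLect: oversample the smaller class
# until both classes have the same number of cases.
balance.by.oversampling.sketch <- function(F, L) {
  classes <- unique(L)
  stopifnot(length(classes) == 2)
  idx.a <- which(L == classes[1])
  idx.b <- which(L == classes[2])
  if (length(idx.a) < length(idx.b)) {
    minority <- idx.a
    majority <- idx.b
  } else {
    minority <- idx.b
    majority <- idx.a
  }
  deficit <- length(majority) - length(minority)
  # Draw the missing cases from the minority class with replacement.
  extra <- minority[sample.int(length(minority), size = deficit, replace = TRUE)]
  keep <- c(seq_along(L), extra)
  list(F = F[keep, , drop = FALSE], L = L[keep])
}

# Toy data: 6 negatives and 2 positives.
set.seed(1)
F.toy <- matrix(rnorm(8 * 3), nrow = 8,
                dimnames = list(paste0("sample", 1:8), paste0("feature", 1:3)))
L.toy <- c(0, 0, 0, 0, 0, 0, 1, 1)
names(L.toy) <- rownames(F.toy)
balanced <- balance.by.oversampling.sketch(F.toy, L.toy)
table(balanced$L)
```

All original cases are kept; only duplicates of minority cases are appended, so the balanced data set has twice as many rows as the larger class.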

Value

Returns a list of:

log.scores

A vector containing the logarithm of final scores.

feature.matrix

The input feature matrix.

labels

The input labels.

total.num.of.models

The total number of models that are fitted.

maximum.features.num

Up to this number of features are allowed to contribute to each linear model.

feature.scores.history

The matrix of history of feature scores where column i contains the scores after i runs.

num.of.features.score

A vector whose entry i contains the number of times that i has been the best number of features.

best.feature.num

The i'th value of this vector is the best number of features for the i'th model.

mislabeling.record

A vector that keeps track of the frequency of mislabelling for each case.

doctors

List of all models created by the train.doctor() function.

best.features.intersection

The best features are computed for each sampling, and their intersection is reported as this vector of feature names.

features.with.best.global.error

A list containing the sets of features. Set i was the best for the i'th sampling.

time.taken

Total time used for executing this function.

Note

Logistic regression is also done on top of fitting the linear models.

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

See Also

FeaLect, train.doctor, doctor.validate, random.subset, compute.balanced, compute.logistic.score, ignore.redundant, input.check.FeaLect

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[, -1])   # The feature matrix
L <- as.numeric(mcl_sll[, 1])   # The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ", dim(F)[2], " features.")

## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result <- FeaLect(F = F, L = L, maximum.features.num = 10,
                          total.num.of.models = 20, talk = TRUE)
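
Once scores are available, one natural next step is to rank the features by them. The sketch below does this for a named vector of log scores; top.features.sketch is a hypothetical helper that is not part of the package, and the toy values are made up for illustration.

```r
# Hypothetical helper, not part of FeaLect: names of the k features with
# the highest log scores.
top.features.sketch <- function(log.scores, k = 5) {
  ranked <- sort(log.scores, decreasing = TRUE)
  names(head(ranked, k))
}

# Made-up log scores, used only to show the call.
toy.log.scores <- c(featureA = -1.2, featureB = -0.3, featureC = -4.0)
top.features.sketch(toy.log.scores, k = 2)
```

Assuming the log.scores element of the returned list is named by feature, the same helper could be applied as top.features.sketch(FeaLect.result$log.scores, k = 10).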

FeaLect documentation built on Feb. 26, 2020, 1:06 a.m.