Fits various models based on a combination of penalized linear models and logistic regression.

Description

Various linear models are fitted to the training samples using the lars method. The models differ in the number of features they include, and each is validated on the validating samples. A score is also assigned to each feature based on the tendency of LASSO to include that feature in the models.

Usage

train.doctor(F_, L_, training.samples, validating.samples, considered.features, 
		 maximum.features.num, balance = TRUE, return_linear.models = TRUE, 
		 report.fitting.failure = FALSE)

Arguments

F_

The feature matrix, each column is a feature.

L_

The vector of labels, named according to the rows of F_.

training.samples

The names of rows of F_ that should be considered as training samples.

validating.samples

The names of rows of F_ that should be considered as validating samples.

considered.features

The names of columns of F_ that determine the features of interest.

maximum.features.num

Up to this number of features are allowed to contribute to each linear model.

balance

If TRUE, the cases will be balanced by oversampling, so that positives and negatives are equal in number, before fitting the linear model.

return_linear.models

If TRUE, the fitted linear models are returned. The models are memory intensive, so if there are more than 1000 of them, it may be preferable not to return them to avoid running out of memory.

report.fitting.failure

If TRUE, any failure in fitting the linear or logistic models will be printed.

Details

See the reference for more details.
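
The balancing step controlled by the balance argument can be illustrated with a minimal base R sketch. It assumes 0/1 labels and oversamples the smaller class with replacement. The helper balance.by.oversampling and the toy labels are hypothetical and are not part of FeaLect; the package's own balancing code may differ.

```r
# Hypothetical helper, not part of FeaLect: oversample the smaller class
# (with replacement) until both classes have the same number of cases.
# Assumes `labels` is a named vector of 0/1 values.
balance.by.oversampling <- function(labels) {
  stopifnot(!is.null(names(labels)), all(labels %in% c(0, 1)))
  pos <- names(labels)[labels == 1]
  neg <- names(labels)[labels == 0]
  stopifnot(length(pos) > 0, length(neg) > 0)
  if (length(pos) < length(neg)) {
    extra <- sample(pos, length(neg) - length(pos), replace = TRUE)
  } else {
    extra <- sample(neg, length(pos) - length(neg), replace = TRUE)
  }
  # Original sample names followed by the resampled names
  c(names(labels), extra)
}

set.seed(1)
toy.labels <- c(s1 = 1, s2 = 0, s3 = 0, s4 = 0, s5 = 1, s6 = 0)
balanced <- balance.by.oversampling(toy.labels)
print(table(toy.labels[balanced]))
```

The returned names can be used to index the rows of a feature matrix, so duplicated names simply repeat the corresponding rows.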

Value

Returns a list of:

linear.models

The result of model fitting computed by lars().

best.number.of.features

According to best accuracy.

probabilities

The best computed logistic score.

accuracy

The best F-measure.

best.logistic.cof

According to best accuracy.

contribution.to.feature.scores

This vector should be added to the total feature scores.

contribution.to.feature.scores.frequency

This vector should be added to the total frequency of features.

training.samples

Input, the names of rows of F_ that should be considered as training samples.

validating.samples

Input, the names of rows of F_ that should be considered as validating samples.

precision

Ratio of number of true positives to predicted positives.

recall

Ratio of number of true positives to real positives.

selected.features.sequence

A list of sets of features which are selected in different models.

global.errors

A vector of global errors of the linear fits.

features.with.best.global.error

A vector of names of good features in terms of global error of linear fits.
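
The precision, recall and accuracy values above can be related with a short base R sketch. It assumes 0/1 labels and the usual F-measure, the harmonic mean of precision and recall; the exact variant used by the package is not stated here. The helper classification.summary and the toy vectors are hypothetical.

```r
# Hypothetical helper, not part of FeaLect: precision, recall and
# F-measure computed from 0/1 vectors of true and predicted labels.
classification.summary <- function(truth, predicted) {
  tp <- sum(predicted == 1 & truth == 1)  # true positives
  fp <- sum(predicted == 1 & truth == 0)  # false positives
  fn <- sum(predicted == 0 & truth == 1)  # false negatives
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f.measure <- if (is.na(precision) || is.na(recall) ||
                   precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  list(precision = precision, recall = recall, f.measure = f.measure)
}

truth <- c(1, 1, 1, 0, 0, 0)
predicted <- c(1, 1, 0, 1, 0, 0)
res <- classification.summary(truth, predicted)
print(unlist(res))
```

In this toy case there are two true positives, one false positive and one false negative, so precision, recall and F-measure all equal 2/3.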

Note

Logistic regression is also done on top of fitting the linear models.
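
The idea of placing a logistic regression on top of a linear fit can be sketched in base R as follows. This is only an illustration under stated assumptions: lm() stands in for the lars fit, the data are simulated, and all variable names are hypothetical. It is not the package's internal procedure.

```r
# Simulated toy data (assumption): two features and a noisy 0/1 label.
set.seed(42)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
label <- rbinom(n, 1, plogis(1.5 * x1 - x2))

# Step 1: a linear fit. lm() is used here as a stand-in for lars().
linear.fit <- lm(label ~ x1 + x2)
linear.score <- fitted(linear.fit)

# Step 2: logistic regression on the linear score maps it to a
# probability between 0 and 1.
logistic.fit <- glm(label ~ linear.score, family = binomial())
probabilities <- predict(logistic.fit, type = "response")

print(round(head(probabilities), 3))
```

The linear score itself is not restricted to the interval from 0 to 1, which is one reason to pass it through a logistic model before interpreting it as a probability.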

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

See Also

FeaLect, train.doctor, doctor.validate, random.subset, compute.balanced, compute.logistic.score, ignore.redundant, input.check.FeaLect

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
L <- as.numeric(mcl_sll[ ,1])	# The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ",dim(F)[2], " features.")

all.samples <- rownames(F)
ts <- all.samples[5:10]		# Training samples
vs <- all.samples[c(1,22)]	# Validating samples

doctor <- train.doctor(F_=F, L_=L, training.samples=ts, validating.samples=vs,
       considered.features=colnames(F), maximum.features.num=10)