Splits the data in vector Y into two sets in a predefined ratio while preserving the relative ratios of the different labels in Y. Used to split data for classification into train and test subsets.
sample.split( Y, SplitRatio = 2/3, group = NULL )
Y: Vector of data labels. If there are only a few labels (as is expected), then the relative ratio of labels in both subsets will be the same.
SplitRatio: Splitting ratio (default 2/3).
group: Optional vector/list used when multiple copies of each sample are present. In such a case, the function tries to keep samples with the same group label together in one subset.
Function msc.sample.split is the old name of the sample.split function and is to be retired soon. Note that the function differs from base::sample by first restricting the input data set to its unique values before generating the subset(s).
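The idea of sampling over unique values first can be sketched in base R. This is only an illustration under stated assumptions, not the caTools implementation: the helper group_mask is hypothetical, it ignores the labels in Y, and it simply draws whole groups and then expands the choice back to every sample.

```r
# Hypothetical helper (not part of caTools): choose whole groups at random,
# then mark every sample belonging to a chosen group as TRUE.
group_mask <- function(group, ratio = 1/2) {
  ug <- unique(group)                        # restrict to unique group labels
  n_true <- round(ratio * length(ug))        # number of groups to select
  chosen <- ug[sample.int(length(ug), n_true)]
  group %in% chosen                          # expand back to one flag per sample
}

set.seed(42)
g <- rep(1:6, each = 4)      # 6 groups with 4 copies each
msk <- group_mask(g, ratio = 1/2)
split(msk, g)                # every group is entirely TRUE or entirely FALSE
```

Because the draw is made on group labels rather than on individual samples, copies of the same sample cannot end up on both sides of the split.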
Returns a logical vector of the same length as Y with random SplitRatio*length(Y) elements set to TRUE.
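To make the shape of the return value concrete, here is a minimal base R sketch of a label-preserving split. The helper stratified_mask is hypothetical and only approximates the behaviour described here; the rounding rules of sample.split itself may differ.

```r
# Hypothetical helper (not part of caTools): draw the requested fraction
# separately within each label, so label ratios carry over to both subsets.
stratified_mask <- function(Y, ratio = 2/3) {
  msk <- logical(length(Y))                  # start with all FALSE
  for (lab in unique(Y)) {
    idx <- which(Y == lab)                   # positions holding this label
    n_true <- round(ratio * length(idx))     # how many of them to set TRUE
    # sample.int avoids the sample(x) pitfall when idx has length 1
    msk[idx[sample.int(length(idx), n_true)]] <- TRUE
  }
  msk
}

set.seed(1)
Y <- rep(c("F", "M"), times = c(40, 80))
msk <- stratified_mask(Y, ratio = 3/4)
table(Y, msk)    # 30 of 40 "F" and 60 of 80 "M" are TRUE
```

The mask can then be used directly for indexing, as in x[msk] for one subset and x[!msk] for the other.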
Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com
Similar to the sample function. Variable group is used in the same way as the f argument in split and the INDEX argument in tapply.
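The comparison with split and tapply can be illustrated with a small hand-made mask. The vectors below are made-up illustration data, not output of sample.split.

```r
# Made-up mask and group labels, for illustration only.
msk <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
g   <- c(1, 1, 2, 2, 3, 3)

by_group  <- split(msk, g)          # g plays the role of the f argument
frac_true <- tapply(msk, g, mean)   # g plays the role of the INDEX argument

# A group that was kept together has a TRUE fraction of exactly 0 or 1.
all(frac_true %in% c(0, 1))
```

Checking the per-group fraction this way is a quick test of whether a mask respects the grouping.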
library(MASS)
data(cats) # load cats data
Y = cats[,1] # extract labels from the data
msk = sample.split(Y, SplitRatio=3/4)
table(Y,msk)
t=sum( msk) # number of elements in one class
f=sum(!msk) # number of elements in the other class
stopifnot( round((t+f)*3/4) == t ) # test ratios
# example of using group variable
g = rep(seq(length(Y)/4), each=4); g[48]=12;
msk = sample.split(Y, SplitRatio=1/2, group=g)
table(Y,msk) # try to get correct split ratios ...
split(msk,g) # ... while keeping samples with the same group label together
# test results
print(paste( "All Labels numbers: total=",t+f,", train=",t,", test=",f,
             ", ratio=", t/(t+f) ) )
U = unique(Y) # extract all unique labels
for( i in seq_along(U)) {  # check for all labels
  lab = (Y==U[i])          # mask elements that have label U[i]
  t=sum( msk[lab])         # number of elements with label U[i] in one class
  f=sum(!msk[lab])         # number of elements with label U[i] in the other class
  print(paste( "Label",U[i],"numbers: total=",t+f,", train=",t,", test=",f,
               ", ratio=", t/(t+f) ) )
}
# use results
train = cats[ msk,2:3] # use output of sample.split to ...
test = cats[!msk,2:3] # create train and test subsets
z = lda(train, Y[msk]) # perform classification
table(predict(z, test)$class, Y[!msk]) # predicted & true labels
# see also LogitBoost example
   msk
Y   FALSE TRUE
  F    12   35
  M    24   73
   msk
Y   FALSE TRUE
  F    23   24
  M    49   48
$`1`
[1] TRUE TRUE TRUE TRUE
$`2`
[1] FALSE FALSE FALSE FALSE
$`3`
[1] TRUE TRUE TRUE TRUE
$`4`
[1] FALSE FALSE FALSE FALSE
$`5`
[1] FALSE FALSE FALSE FALSE
$`6`
[1] FALSE FALSE FALSE FALSE
$`7`
[1] FALSE FALSE FALSE FALSE
$`8`
[1] TRUE TRUE TRUE TRUE
$`9`
[1] TRUE TRUE TRUE TRUE
$`10`
[1] TRUE TRUE TRUE TRUE
$`11`
[1] TRUE TRUE TRUE TRUE
$`12`
[1] FALSE FALSE FALSE TRUE
$`13`
[1] FALSE FALSE FALSE FALSE
$`14`
[1] TRUE TRUE TRUE TRUE
$`15`
[1] FALSE FALSE FALSE FALSE
$`16`
[1] FALSE FALSE FALSE FALSE
$`17`
[1] FALSE FALSE FALSE FALSE
$`18`
[1] FALSE FALSE FALSE FALSE
$`19`
[1] TRUE TRUE TRUE TRUE
$`20`
[1] FALSE FALSE FALSE FALSE
$`21`
[1] TRUE TRUE TRUE TRUE
$`22`
[1] TRUE TRUE TRUE TRUE
$`23`
[1] TRUE TRUE TRUE TRUE
$`24`
[1] FALSE FALSE FALSE FALSE
$`25`
[1] FALSE FALSE FALSE FALSE
$`26`
[1] TRUE TRUE TRUE TRUE
$`27`
[1] FALSE FALSE FALSE FALSE
$`28`
[1] FALSE FALSE FALSE FALSE
$`29`
[1] TRUE TRUE TRUE TRUE
$`30`
[1] FALSE FALSE FALSE FALSE
$`31`
[1] TRUE TRUE TRUE TRUE
$`32`
[1] TRUE TRUE TRUE FALSE
$`33`
[1] TRUE TRUE TRUE TRUE
$`34`
[1] TRUE TRUE TRUE TRUE
$`35`
[1] FALSE FALSE FALSE FALSE
$`36`
[1] TRUE TRUE TRUE TRUE
[1] "All Labels numbers: total= 144 , train= 108 , test= 36 , ratio= 0.75"
[1] "Label F numbers: total= 47 , train= 24 , test= 23 , ratio= 0.51063829787234"
[1] "Label M numbers: total= 97 , train= 48 , test= 49 , ratio= 0.494845360824742"
     F  M
  F 20 18
  M  3 31
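The final table cross-tabulates predicted labels (rows) against true labels (columns). As a small worked example, the overall accuracy can be read off it as follows; the matrix below is typed in by hand from the output above, and the dimension names are added here only for readability.

```r
# Confusion table copied from the output above: rows = predicted, columns = true.
conf <- matrix(c(20, 3, 18, 31), nrow = 2,
               dimnames = list(predicted = c("F", "M"), true = c("F", "M")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)   # (20 + 31) / 72
round(accuracy, 3)
```

The column sums (23 and 49) match the FALSE column of the grouped split, which is the test subset used for prediction.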