runTests: Reproducibly Run Various Kinds of Cross-Validation


Description

Enables cross-validation schemes such as ordinary 10-fold cross-validation, 100 permutations of 5-fold cross-validation, and leave-one-out cross-validation. Processing can be done in parallel by leveraging the BiocParallel package.
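For orientation, a minimal sketch of requesting each validation scheme, assuming the measurements matrix and classes factor from the package's asthma example data set and that the default feature selection and classifier in params are suitable:

  data(asthma)

  # 100 permutations of 5-fold cross-validation.
  permuteCV <- runTests(measurements, classes, datasetName = "Asthma",
                        classificationName = "Permuted Folds",
                        validation = "permute", permutations = 100, folds = 5)

  # Ordinary 10-fold cross-validation, without resampling.
  foldCV <- runTests(measurements, classes, datasetName = "Asthma",
                     classificationName = "10-fold",
                     validation = "fold", folds = 10)

  # Leave-one-out cross-validation.
  looCV <- runTests(measurements, classes, datasetName = "Asthma",
                    classificationName = "Leave-one-out",
                    validation = "leaveOut", leave = 1)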

Usage

## S4 method for signature 'matrix'
runTests(measurements, classes, ...)

## S4 method for signature 'DataFrame'
runTests(measurements, classes, featureSets = NULL, metaFeatures = NULL,
         minimumOverlapPercent = 80, datasetName, classificationName,
         validation = c("permute", "leaveOut", "fold"),
         permutePartition = c("fold", "split"),
         permutations = 100, percent = 25, folds = 5, leave = 2,
         seed, parallelParams = bpparam(),
         params = list(SelectParams(), TrainParams(), PredictParams()),
         verbose = 1)

## S4 method for signature 'MultiAssayExperiment'
runTests(measurements, targets = names(measurements), ...)

## S4 method for signature 'MultiAssayExperiment'
runTestsEasyHard(measurements, easyDatasetID = "clinical",
                 hardDatasetID = names(measurements)[1],
                 featureSets = NULL, metaFeatures = NULL,
                 minimumOverlapPercent = 80,
                 datasetName = NULL, classificationName = "Easy-Hard Classifier",
                 validation = c("permute", "leaveOut", "fold"),
                 permutePartition = c("fold", "split"),
                 permutations = 100, percent = 25, folds = 5, leave = 2,
                 seed, parallelParams = bpparam(), ..., verbose = 1)

Arguments

measurements

Either a matrix, DataFrame or MultiAssayExperiment containing the training data. For a matrix, the rows are features, and the columns are samples. The sample identifiers must be present as column names of the matrix or the row names of the DataFrame.

classes

Either a factor of class labels with as many elements as there are samples in measurements or, if measurements is a DataFrame, a character vector of length 1 naming the column of measurements that contains the class labels. Not used if measurements is a MultiAssayExperiment object.

featureSets

An object of type FeatureSetCollection which defines sets of features or sets of edges.

metaFeatures

Either NULL or a DataFrame which has meta-features of the numeric data of interest.

minimumOverlapPercent

If featureSets stores sets of features, the minimum percentage of feature IDs in a set that must be present in measurements for that set to be retained in the analysis. If featureSets stores sets of network edges, the minimum percentage of edges with both vertex IDs present in measurements that a set must have to be retained in the analysis.

targets

If measurements is a MultiAssayExperiment, the names of the data tables to be used. "clinical" is also a valid value and specifies that numeric variables from the clinical data table will be used.

...

For runTests, arguments not used by the matrix or MultiAssayExperiment methods that are passed into and used by the DataFrame method. For runTestsEasyHard, the easyClassifierParams and hardClassifierParams lists to be passed to easyHardClassifierTrain.

datasetName

A name associated with the data set used.

classificationName

A name associated with the classification.

validation

Default: "permute". "permute" for repeated permuting. "leaveOut" for leaving all possible combinations of k samples as test samples. "fold" for folding of the data set (no resampling).

permutePartition

Default: "fold". Either "fold" or "split". Only applicable if validation is "permute". If "fold", then the samples are split into folds and in each iteration one is used as the test set. If "split", the samples are split into two groups, the sizes being based on the percent value. One group is used as the training set, the other is the test set.

permutations

Default: 100. Relevant when permuting is used. The number of times the samples are randomly reordered before being split or folded.

percent

Default: 25. Used when permutation with the split method is chosen. The percentage of samples to be in the test set.

folds

Default: 5. Relevant when repeated permutations are done and permutePartition is set to "fold" or when validation is set to "fold". The number of folds to break the data set into. Each fold is used once as the test set.

leave

Default: 2. Relevant when leave-k-out cross-validation is used. The number of samples to leave for testing.

seed

The random number generator used for repeated resampling will use this seed, if it is provided. Allows reproducibility of repeated usage on the same input data.

parallelParams

An object of class MulticoreParam or SnowParam.

params

A list of objects of class TransformParams, SelectParams, TrainParams or PredictParams. Their order in the list determines the order in which the stages of classification are carried out (a sketch appears after this argument list).

easyDatasetID

The name of a data set in measurements, or "clinical" to indicate that the patient information in the column data is to be used.

hardDatasetID

The name of a data set in measurements, different from the value of easyDatasetID, to be used for classifying the samples not classified by the easy classifier.

verbose

Default: 1. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages.
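As a sketch of how the resampling arguments combine, reusing the asthma example data and the default params: permuted splitting uses percent, and supplying seed along with a BiocParallel back-end gives a reproducible parallel run. The worker count here is purely illustrative, and on Windows SnowParam would be used in place of MulticoreParam.

  library(BiocParallel)

  # 20 permutations, each splitting the samples 75% training / 25% test,
  # run on two cores, with a fixed seed so that repeated runs on the same
  # data produce the same sample partitions.
  splitCV <- runTests(measurements, classes, datasetName = "Asthma",
                      classificationName = "Permuted Splits",
                      validation = "permute", permutePartition = "split",
                      permutations = 20, percent = 25,
                      seed = 2020, parallelParams = MulticoreParam(workers = 2))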
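And a sketch of an explicit params list, borrowing the selection and DLDA interfaces used in the Examples section; the stages run in the order the objects appear in the list (selection, then training, then prediction).

  stageParams <- list(SelectParams(differentMeansSelection, "t Statistic",
                                   resubstituteParams = ResubstituteParams(
                                     nFeatures = seq(5, 25, 5),
                                     performanceType = "balanced error",
                                     better = "lower")),
                      TrainParams(DLDAtrainInterface),
                      PredictParams(DLDApredictInterface))
  dldaCV <- runTests(measurements, classes, datasetName = "Asthma",
                     classificationName = "Different Means DLDA",
                     permutations = 5, params = stageParams)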

Value

If the predictor function made a single prediction, then an object of class ClassifyResult. If the predictor function made a set of predictions, then a list of such objects.
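A brief sketch of working with the returned object, assuming the asthma example data as above; see the ClassifyResult class documentation for the full set of accessors.

  asthmaCV <- runTests(measurements, classes, datasetName = "Asthma",
                       classificationName = "Different Means", permutations = 5)
  asthmaCV <- calcCVperformance(asthmaCV, "balanced error")
  performance(asthmaCV)   # Performance measures calculated for the result.
  predictions(asthmaCV)   # Predictions made during cross-validation.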

Author(s)

Dario Strbenac

Examples

  #if(require(sparsediscrim))
  #{
    data(asthma)

    resubstituteParams <- ResubstituteParams(nFeatures = seq(5, 25, 5),
                                             performanceType = "balanced error",
                                             better = "lower")
    runTests(measurements, classes, datasetName = "Asthma",
             classificationName = "Different Means", permutations = 5,
             params = list(SelectParams(differentMeansSelection, "t Statistic",
                                        resubstituteParams = resubstituteParams),
                           TrainParams(DLDAtrainInterface),
                           PredictParams(DLDApredictInterface)
                           )
             )
  #}
  
  genesMatrix <- matrix(c(rnorm(90, 9, 1),
                          9.5, 9.4, 5.2, 5.3, 5.4, 9.4, 9.6, 9.9, 9.1, 9.8),
                        ncol = 10, byrow = TRUE)

  colnames(genesMatrix) <- paste("Sample", 1:10)
  rownames(genesMatrix) <- paste("Gene", 1:10)
  genders <- factor(c("Male", "Male", "Female", "Female", "Female",
                      "Female", "Female", "Female", "Female", "Female"))

  # Scenario: Male gender can predict the hard-to-classify Sample 1 and Sample 2.
  clinical <- DataFrame(age = c(31, 34, 32, 39, 33, 38, 34, 37, 35, 36),
                        gender = genders,
                        class = factor(rep(c("Poor", "Good"), each = 5)),
                        row.names = colnames(genesMatrix))
  dataset <- MultiAssayExperiment(ExperimentList(RNA = genesMatrix), clinical)
  selParams <- SelectParams(featureSelection = differentMeansSelection,
                            selectionName = "Difference in Means",
                            resubstituteParams = ResubstituteParams(1:10, "balanced error", "lower"))
  easyHardCV <- runTestsEasyHard(dataset, datasetName = "Test Data",
                                 classificationName = "Easy-Hard",
                                 easyClassifierParams = list(minCardinality = 2, minPurity = 0.9),
                                 hardClassifierParams = list(selParams, TrainParams(), PredictParams()),
                                 validation = "leaveOut", leave = 1)
