library(stringi)
library(stringr)
library(RcppRoll)
library(devtools)
library(SparseMatrixRecommender)
#library(ERTMon)
devtools::load_all()
Here we specify the directory with data and transformations specifications:
directoryName <-
  if( grepl("testthat", getwd()) ) {
    file.path( getwd(), "..", "..", "data", "FakeData")
  } else {
    file.path( getwd(), "..", "data", "FakeData")
  }
compSpecObj <- new( "ComputationSpecification" )
compSpecObj <- readSpec( compSpecObj, file.path( directoryName, "computationSpecification.csv" ) )
compSpecObj <- ingestSpec( compSpecObj )
The ingestion process below is done with this data transformation specification:
compSpecObj@parameters
diObj <- new( "DataIngester")
diObj <- readData( diObj,
                   file.path( directoryName, "eventRecords.csv" ),
                   file.path( directoryName, "entityAttributes.csv" ) )
diObj <- ingestData( diObj, "Label" )
dwObj <- diObj@dataObj
dwObj@labels
ERTMon does not require the fields "diedLabel" and "survivedLabel" of the data ingester object to hold the correct values. Setting them is only a convenience, or serves "as a memo". (The label values themselves are set through the parameters CSV table of the class ComputationSpecification.)
#dwObj@survivedLabel <- compSpecObj@parameters[ compSpecObj@parameters$Variable == "Label", "Critical.label"]
## Note that the computation here has two different approaches.
#dwObj@diedLabel <- paste0("Non.", dwObj@survivedLabel)
#dwObj@diedLabel <- setdiff( dwObj@labels, dwObj@survivedLabel)
## If this is really needed it can be in the validation function for DataWrapper.
#assertthat::assert_that( mean( c(dwObj@diedLabel, dwObj@survivedLabel) %in% dwObj@labels ) == 1 )
Obtaining splitting indices:
set.seed(1456)
entityIDs <- unique(dwObj@eventRecords$EntityID)
trainingEntityIDs <- sample( entityIDs, floor( params$trainingDataFraction * length(entityIDs) ) )
testEntityIDs <- setdiff( entityIDs, trainingEntityIDs )
Splitting of data into training and test parts:
trainingData <- dwObj@eventRecords[ dwObj@eventRecords$EntityID %in% trainingEntityIDs, ]
testData <- dwObj@eventRecords[ dwObj@eventRecords$EntityID %in% testEntityIDs, ]
Remark: PCCPF has a data splitter object, but here we use a more direct approach in order to simulate real-life scenarios.
Make a new data transformer object:
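The entity-level split can be illustrated on a toy table. The sketch below uses a hypothetical data frame `toyRecords` and a hypothetical training fraction of 0.75; neither comes from the FakeData files. It only shows that sampling entity IDs, rather than rows, keeps every record of an entity on one side of the split.

```r
# Toy event records: 8 hypothetical entities with 3 records each.
toyRecords <- data.frame(
  EntityID = rep(1:8, each = 3),
  Value = seq_len(24)
)

set.seed(1456)
toyEntityIDs <- unique(toyRecords$EntityID)
# Sample entities, not rows (0.75 is an assumed fraction).
toyTrainingIDs <- sample(toyEntityIDs, floor(0.75 * length(toyEntityIDs)))
toyTestIDs <- setdiff(toyEntityIDs, toyTrainingIDs)

toyTraining <- toyRecords[toyRecords$EntityID %in% toyTrainingIDs, ]
toyTest <- toyRecords[toyRecords$EntityID %in% toyTestIDs, ]

# No entity appears in both parts, and no record is lost.
length(intersect(toyTraining$EntityID, toyTest$EntityID)) == 0
nrow(toyTraining) + nrow(toyTest) == nrow(toyRecords)
```

A row-level split would instead scatter the records of one entity across both parts, which leaks information about test entities into training.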
if( params$categoricalMatricesQ ) {
  dtObj <- new( "DataTransformerCatMatrices" )
} else {
  dtObj <- new( "DataTransformer" )
}
Note that the data has not been "seen" by the data transformation object:
compSpecObj@parameters
dtObj <- transformData( object = dtObj,
                        compSpec = compSpecObj,
                        eventRecordsData = trainingData,
                        entityAttributes = dwObj@entityAttributes[ (dwObj@entityAttributes$EntityID %in% trainingEntityIDs), ],
                        outlierIdentifierParameteres = SPLUSQuartileIdentifierParameters) # also HampelIdentifierParameters or QuartileIdentifierParameters
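The call above passes an outlier identifier and names two alternatives. As a rough illustration of what a quartile-based identifier can look like, here is a minimal sketch using Tukey's fences. The function `tukeyFences`, its multiplier `k`, and the sample vector are hypothetical; the sketch assumes an identifier maps a numeric vector to a lower and an upper threshold, and the functions actually used by the package may compute their thresholds differently.

```r
# Hypothetical quartile-based outlier rule (Tukey's fences);
# not the definition of any function referenced in the call above.
tukeyFences <- function(x, k = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(Lower = q[1] - k * iqr, Upper = q[2] + k * iqr)
}

values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- tukeyFences(values)
fences

# Values outside the fences are flagged as outliers.
outliers <- values[values < fences["Lower"] | values > fences["Upper"]]
outliers
```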
transformedTrainingDataDF <- dtObj@transformedData
summary( as.data.frame(unclass(dtObj@transformedData), stringsAsFactors = TRUE), maxsum=20 )
summary( as.data.frame(unclass(dtObj@transformedData %>% dplyr::filter( MatrixName == "HR.OutFrc" )), stringsAsFactors = TRUE) )
Matrix version of the transformed data:
transformedTrainingDataMat <- dtObj@dataMat
dtObj@groupAggregatedValues
The ingestion function call above encapsulates a lot of steps. Here we show summaries of the entity data and medical data that are going to be used in the classification.
Remark: These data objects contain transformed versions of the data that is placed in the specified directory.
summary(as.data.frame(unclass(dwObj@entityAttributes), stringsAsFactors = TRUE))
summary(as.data.frame(unclass(dwObj@eventRecords), stringsAsFactors = TRUE))
dim(transformedTrainingDataDF)
summary(as.data.frame(unclass(transformedTrainingDataDF), stringsAsFactors = TRUE), maxsum=12)
The corresponding data matrix:
dim(transformedTrainingDataMat)
The test data should not be "known" at this point.
rm("transformedTestDataDF")
exists("transformedTestDataDF")
Here we repeat the transformations over the test data using the aggregation values from the training data transformation.
(This is specified with the parameter testDataRun.)
dtObj <- transformData( dtObj,
                        compSpecObj,
                        testData,
                        dwObj@entityAttributes[ dwObj@entityAttributes$EntityID %in% testEntityIDs, ],
                        testDataRun = TRUE )
transformedTestDataDF <- dtObj@transformedData
transformedTestDataMat <- dtObj@dataMat
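The idea of reusing training-time aggregates on test data can be sketched with base R alone. The example below is not ERTMon's implementation: the data frames, the `Variable` column, and the helper `normalizeWithMeans` are hypothetical, and a per-variable mean stands in for whatever aggregation the specification requests.

```r
# Hypothetical training and test records.
trainDF <- data.frame(
  Variable = c("HR", "HR", "RR", "RR"),
  Value = c(60, 80, 10, 30)
)
testDF <- data.frame(
  Variable = c("HR", "RR"),
  Value = c(140, 10)
)

# Aggregates are computed once, from the training data only.
trainingMeans <- tapply(trainDF$Value, trainDF$Variable, mean)

# Hypothetical helper: divide each value by the stored mean of its variable.
normalizeWithMeans <- function(df, means) {
  df$Normalized <- df$Value / as.numeric(means[as.character(df$Variable)])
  df
}

trainNormalized <- normalizeWithMeans(trainDF, trainingMeans)
# The test data is scaled with the training means, not with its own.
testNormalized <- normalizeWithMeans(testDF, trainingMeans)
testNormalized
```

Computing the means from the test data itself would make the test features depend on the test sample, which is what a flag such as testDataRun is meant to avoid.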
Make a Sparse Matrix Recommender (SMR) object.
smats <- dtObj@sparseMatrices
smats <- setNames(
  purrr::map( names(smats), function(x) {
    m <- smats[[x]]
    colnames(m) <- paste(x, colnames(m))
    m
  }),
  names(smats) )
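The renaming step prefixes every column name with the name of its matrix, presumably so that columns from different matrices stay distinguishable once the matrices are combined. A minimal base R sketch of the same idea follows; the list `toyMats` and its contents are hypothetical, and dense matrices are used only to keep the example free of dependencies.

```r
# Two hypothetical matrices that share the same column names.
toyMats <- list(
  HR = matrix(1:4, nrow = 2, dimnames = list(c("e1", "e2"), c("Mean", "Max"))),
  RR = matrix(5:8, nrow = 2, dimnames = list(c("e1", "e2"), c("Mean", "Max")))
)

# Prefix each column name with the name of the matrix it belongs to.
prefixed <- Map(
  function(name, m) {
    colnames(m) <- paste(name, colnames(m))
    m
  },
  names(toyMats), toyMats
)

colnames(prefixed$HR)
# After column binding, every column name is unique.
combined <- do.call(cbind, prefixed)
colnames(combined)
```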
purrr::map_df( smats, function(x) data.frame( NRow = nrow(x), NCol = ncol(x) ), .id = "MatrixName" )
names(smats)
clSMRFreq <- SMRCreateFromMatrices( matrices = smats[1:8], tagTypes = NULL, itemColumnName = "EntityID" )