library(stringi)
library(stringr)
library(RcppRoll)
library(devtools)
library(SparseMatrixRecommender)
#library(ERTMon)
devtools::load_all()

Data directory

Here we specify the directory with data and transformations specifications:

directoryName <- if( grepl("testthat", getwd()) ) { 
  file.path( getwd(), "..", "..", "data", "FakeData") 
} else { 
  file.path( getwd(), "..", "data", "FakeData")
}

Ingest computations specification data

compSpecObj <- new( "ComputationSpecification" )
compSpecObj <- readSpec( compSpecObj,  file.path( directoryName, "computationSpecification.csv" ) )  
compSpecObj <- ingestSpec( compSpecObj )

The ingestion process below is done with this data transformation specification:

compSpecObj@parameters

Ingest data

diObj <- new( "DataIngester")

diObj <- readData( diObj, 
                   file.path( directoryName, "eventRecords.csv" ),
                   file.path( directoryName, "entityAttributes.csv" ) )

diObj <- ingestData( diObj, "Label" )

dwObj <- diObj@dataObj
dwObj@labels

ERTMon does not require the fields "diedLabel" and "survivedLabel" of the data ingester object to hold the correct values. Setting them is only a convenience, a kind of memo. (The label values are set through the parameters CSV table in the class ComputationSpecification.)

#dwObj@survivedLabel <- compSpecObj@parameters[ compSpecObj@parameters$Variable == "Label", "Critical.label"]
## Note that the computation of diedLabel below has two alternative approaches.
#dwObj@diedLabel <- paste0("Non.", dwObj@survivedLabel)
#dwObj@diedLabel <- setdiff( dwObj@labels, dwObj@survivedLabel)
## If this is really needed it can be in the validation function for DataWrapper.
#assertthat::assert_that( mean( c(dwObj@diedLabel, dwObj@survivedLabel) %in% dwObj@labels ) == 1 )

Split data

Obtaining splitting indices:

set.seed(1456)
entityIDs <- unique(dwObj@eventRecords$EntityID)
trainingEntityIDs <- sample( entityIDs, floor( params$trainingDataFraction * length(entityIDs) ) )
testEntityIDs <- setdiff( entityIDs, trainingEntityIDs )

Splitting of data into training and test parts:

trainingData <- dwObj@eventRecords[ dwObj@eventRecords$EntityID %in% trainingEntityIDs, ]
testData <- dwObj@eventRecords[ dwObj@eventRecords$EntityID %in% testEntityIDs, ]
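
The entity-level split above can be tried out on made-up records. This is a minimal sketch: the toy data frame, the 0.75 fraction, and every name starting with "toy" are assumptions for illustration; only the EntityID column name comes from the code above.

```r
# Toy event records: 8 entities with 3 records each (made-up data).
toyRecords <- data.frame(
  EntityID = rep(paste0("e", 1:8), each = 3),
  Value    = seq_len(24),
  stringsAsFactors = FALSE
)

set.seed(1456)
toyFraction <- 0.75  # assumed stand-in for params$trainingDataFraction

# Sample whole entities, not individual rows, so that all records of an
# entity end up on the same side of the split.
toyEntityIDs <- unique(toyRecords$EntityID)
toyTrainIDs  <- sample(toyEntityIDs, floor(toyFraction * length(toyEntityIDs)))
toyTestIDs   <- setdiff(toyEntityIDs, toyTrainIDs)

toyTrain <- toyRecords[toyRecords$EntityID %in% toyTrainIDs, ]
toyTest  <- toyRecords[toyRecords$EntityID %in% toyTestIDs, ]

# No entity appears in both parts, and no record is lost.
length(intersect(toyTrain$EntityID, toyTest$EntityID)) == 0
nrow(toyTrain) + nrow(toyTest) == nrow(toyRecords)
```

Splitting by entity rather than by row keeps records of one entity from leaking between the training and test parts.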

Remark: PCCPF has a data splitter object, but here we use a more direct approach in order to simulate real-life scenarios.

Transform training data

Make a new data transformer object:

if( params$categoricalMatricesQ ) {
  dtObj <- new( "DataTransformerCatMatrices" )
} else {
  dtObj <- new( "DataTransformer" )
}

Note that at this point the data transformer object has not yet "seen" any data:

compSpecObj@parameters
dtObj <- transformData( object = dtObj, 
                        compSpec = compSpecObj, 
                        eventRecordsData = trainingData, 
                        entityAttributes = dwObj@entityAttributes[ (dwObj@entityAttributes$EntityID %in% trainingEntityIDs), ], 
                        outlierIdentifierParameteres = SPLUSQuartileIdentifierParameters) # also HampelIdentifierParameters or QuartileIdentifierParameters
transformedTrainingDataDF <- dtObj@transformedData
summary( as.data.frame(unclass(dtObj@transformedData), stringsAsFactors = TRUE),  maxsum=20 )
summary( as.data.frame(unclass(dtObj@transformedData %>% dplyr::filter( MatrixName == "HR.OutFrc" )), stringsAsFactors = TRUE) )
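
The comment in the transformData call above names three outlier identifiers. As a rough illustration of what such identifiers compute, here is a sketch of the textbook Hampel and box-plot (quartile) rules. The helper names hampelBounds and boxplotBounds are hypothetical, and the constants (k = 3 and k = 1.5) are the common textbook choices, not necessarily the ones used by the functions passed to ERTMon.

```r
# Textbook Hampel rule: median +/- k * MAD.
# Hypothetical helper; k = 3 is an assumption.
hampelBounds <- function(x, k = 3) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)   # stats::mad scales by 1.4826 by default
  c(lower = center - k * spread, upper = center + k * spread)
}

# Textbook box-plot rule: first/third quartile -/+ k * IQR.
# Hypothetical helper; k = 1.5 is an assumption.
boxplotBounds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  c(lower = q[1] - k * (q[2] - q[1]), upper = q[2] + k * (q[2] - q[1]))
}

toyValues <- c(1, 2, 3, 4, 100)   # made-up data with one obvious outlier

hb <- hampelBounds(toyValues)
bb <- boxplotBounds(toyValues)

# Values outside the bounds are flagged as outliers.
hampelOutliers  <- toyValues[toyValues < hb["lower"] | toyValues > hb["upper"]]
boxplotOutliers <- toyValues[toyValues < bb["lower"] | toyValues > bb["upper"]]
```

Both rules rely on medians and quartiles rather than the mean and standard deviation, so a single extreme value does not inflate the bounds that are meant to detect it.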

Matrix version of the transformed data:

transformedTrainingDataMat <- dtObj@dataMat
dtObj@groupAggregatedValues

What data do we have?

The ingestion function call above encapsulates a lot of steps. Here we show summaries of the entity data and medical data that are going to be used in the classification.

Remark: These data objects contain transformed versions of the data that is placed in the specified directory.

Patient data

summary(as.data.frame(unclass(dwObj@entityAttributes), stringsAsFactors = TRUE))

Event records

summary(as.data.frame(unclass(dwObj@eventRecords), stringsAsFactors = TRUE))

Transformed event records data (training)

dim(transformedTrainingDataDF)
summary(as.data.frame(unclass(transformedTrainingDataDF), stringsAsFactors = TRUE), maxsum=12)

The corresponding data matrix:

dim(transformedTrainingDataMat)
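
To illustrate how a long-form data frame relates to a matrix with one row per entity, here is a minimal sketch using xtabs with sparse = TRUE. The toy data and the column names Feature and Value are assumptions; the actual columns of the transformed data frame and the way ERTMon builds its matrix may differ.

```r
library(Matrix)   # xtabs(..., sparse = TRUE) returns a Matrix sparse matrix

# Made-up long-form data: one row per (entity, feature) pair.
toyLong <- data.frame(
  EntityID = c("e1", "e1", "e2", "e3"),
  Feature  = c("HR.Mean", "RR.Mean", "HR.Mean", "RR.Mean"),
  Value    = c(72, 16, 80, 18),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are entities, columns are features, cells are values.
# Missing (entity, feature) pairs become zeros.
toyMat <- xtabs(Value ~ EntityID + Feature, data = toyLong, sparse = TRUE)

dim(toyMat)
toyMat
```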

Transformed event records data (test)

The test data should not be "known" at this point.

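# Remove the object only if it exists; rm() warns about missing objects.
if( exists("transformedTestDataDF") ) { rm("transformedTestDataDF") }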
exists("transformedTestDataDF")

Transform test data

Here we repeat the transformations over the test data using the aggregation values from the training data transformation. (This is specified with the parameter testDataRun.)

dtObj <- transformData( dtObj, 
                        compSpecObj, 
                        testData, 
                        dwObj@entityAttributes[ dwObj@entityAttributes$EntityID %in% testEntityIDs, ], 
                        testDataRun = TRUE )
transformedTestDataDF <- dtObj@transformedData 
transformedTestDataMat <- dtObj@dataMat
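
The idea of reusing training-time aggregation values on test data can be shown with a small stand-alone sketch. Here the "aggregation values" are per-variable training means used for normalization; the toy data, the helper normalizeWith, and the choice of the mean are assumptions for illustration, not ERTMon's actual computation.

```r
# Made-up training and test records.
toyTrainRecs <- data.frame(Variable = c("HR", "HR", "RR", "RR"),
                           Value    = c(60, 80, 10, 20))
toyTestRecs  <- data.frame(Variable = c("HR", "RR"),
                           Value    = c(77, 12))

# Learn the per-variable aggregates from the training part only.
toyAggr <- aggregate(Value ~ Variable, data = toyTrainRecs, FUN = mean)
names(toyAggr)[2] <- "TrainMean"

# Hypothetical helper: normalize values with previously computed aggregates.
normalizeWith <- function(data, aggr) {
  merged <- merge(data, aggr, by = "Variable")
  merged$NormValue <- merged$Value / merged$TrainMean
  merged
}

toyTrainNorm <- normalizeWith(toyTrainRecs, toyAggr)
toyTestNorm  <- normalizeWith(toyTestRecs, toyAggr)  # reuses the training means
toyTestNorm
```

Computing the aggregates on the training part only keeps information from the test data out of the transformation, which is what a deployed model would face.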

Sparse matrix object (for explanations and proofs)

Make a Sparse Matrix Recommender (SMR) object.

smats <- dtObj@sparseMatrices
smats <- setNames( purrr::map( names(smats), function(x) {
  m <- smats[[x]]
  colnames(m) <- paste( x, colnames(m) )  # prefix column names with the matrix name
  m
}), names(smats) )
purrr::map_df( smats, function(x) data.frame( NRow = nrow(x), NCol = ncol(x) ), .id = "MatrixName" )
names(smats)
clSMRFreq <- SMRCreateFromMatrices( matrices = smats[1:8], tagTypes = NULL, itemColumnName = "EntityID" )
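
The column renaming step above can be tried on toy matrices. This is a minimal sketch with made-up sparse matrices; the presumed motivation (keeping columns distinguishable once the matrices are combined) is an assumption, and plain cbind is used here only as a stand-in for whatever SMRCreateFromMatrices does internally.

```r
library(Matrix)

# Two made-up sparse matrices that share row names and column names.
toyMats <- list(
  HR = sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 1), dims = c(2, 2),
                    dimnames = list(c("e1", "e2"), c("Low", "High"))),
  RR = sparseMatrix(i = c(1, 2), j = c(2, 1), x = c(1, 1), dims = c(2, 2),
                    dimnames = list(c("e1", "e2"), c("Low", "High")))
)

# Prefix each column name with the name of its matrix.
toyMats <- setNames(
  lapply(names(toyMats), function(nm) {
    m <- toyMats[[nm]]
    colnames(m) <- paste(nm, colnames(m))
    m
  }),
  names(toyMats)
)

# After prefixing, the combined matrix has unambiguous column names.
toyJoined <- cbind(toyMats$HR, toyMats$RR)
colnames(toyJoined)
```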


antononcube/ERTMon-R documentation built on Oct. 14, 2021, 2:27 p.m.