knitr::opts_chunk$set(echo = TRUE)

This example illustrates the steps of running GENN on a subset of the NCI-60 cancer data set.

Read Data

To read in the data, you need to install the IntLIM package. Note that IntLIM will be automatically installed when you install GENN. IntLIM is available both through CRAN and through GitHub. The CRAN version is the version used by GENN.

IntLIM requires a specific format for data: data corresponding to two entity types, optional metadata for each of these types, and sample metadata. A CSV meta-file lists the locations of the other files, and all of these files need to be in the same folder. The formats are described below. In addition, we provide a sample set of files.

Please be sure that all files noted in the CSV file, including the CSV file, are in the same folder. Do not include path names in the filenames.

Users need to input a CSV file with two required columns: 'type' and 'filenames'.

The CSV file is expected to have the following 2 columns and 6 rows:

  1. type,filenames
  2. analyteType1,myfilename (optional if analyteType2 is provided)
  3. analyteType2,myfilename (optional if analyteType1 is provided)
  4. analyteType1MetaData,myfilename (optional)
  5. analyteType2MetaData,myfilename (optional)
  6. sampleMetaData,myfilename

The data and metadata are stored in a series of comma-separated values (CSV) files. The 5 files consist of data for two entity types, metadata for two entity types, and sample metadata. A meta-file lists the locations of the other 5 files. This meta-file is input into IntLIM.
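As a minimal sketch of the layout described above, the meta-file can be written from R with base functions. The data file names and the meta-file name below are hypothetical placeholders, and `tempdir()` stands in for the folder that holds your data files.

```r
# Hypothetical file names; replace them with your own data files.
metaFile <- data.frame(
  type = c("analyteType1", "analyteType2", "analyteType1MetaData",
           "analyteType2MetaData", "sampleMetaData"),
  filenames = c("metabolites.csv", "genes.csv", "metabolite_metadata.csv",
                "gene_metadata.csv", "sample_metadata.csv")
)

# Write the meta-file into the folder that holds the data files.
# tempdir() is only a stand-in for that folder.
dataDir <- tempdir()
metaPath <- file.path(dataDir, "input_refs.csv")
write.csv(metaFile, metaPath, row.names = FALSE, quote = FALSE)

# Read it back to confirm the layout: a header row plus five entries.
print(read.csv(metaPath))
```

The file names carry no path component, in line with the requirement that all files sit in the same folder as the meta-file.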

Please be sure to normalize your data appropriately before inputting it into IntLIM.

Input data files should be in a specific format:

File type | Description
----------------------| -----------------------------------------
analyteType1 | rows are entities from type 1 (e.g. small molecules), columns are samples
analyteType2 | rows are entities from type 2 (e.g. genes), columns are samples
analyteType1MetaData | rows are entities, features are columns
analyteType2MetaData | rows are entities, features are columns
sampleMetaData | rows are samples, features are columns

For the entity data files, the first row contains the sample IDs and the first column contains the feature (entity) IDs, so that rows are entities and columns are samples.

For the sampleMetaData file, the first column is assumed to be the sample ID, and those sample IDs should match the first row of the entity data (i.e. all sample IDs in the entity data must also be in the sampleMetaData file).

Additionally, the entity data files and sampleMetaData need to contain an 'id' column that contains the names of the features (entities) or samples (sample ID, name, etc.).
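Before calling IntLIM, it can help to check the ID requirements by hand. The sketch below uses small toy data frames as stand-ins for an entity data file and a sampleMetaData file; the column names other than 'id' and all of the values are made up for illustration.

```r
# Toy stand-ins for an entity data file and a sampleMetaData file.
entityData <- data.frame(
  id = c("gene1", "gene2"),
  sampleA = c(1.2, 3.4),
  sampleB = c(2.1, 0.7)
)
sampleMetaData <- data.frame(
  id = c("sampleA", "sampleB"),
  cancertype = c("Colon", "Breast")
)

# Sample IDs are the entity-data column names other than 'id'.
entitySampleIds <- setdiff(colnames(entityData), "id")

# Any sample present in the entity data but absent from the metadata.
missingSamples <- setdiff(entitySampleIds, sampleMetaData$id)
if (length(missingSamples) > 0) {
  stop("Samples missing from sampleMetaData: ",
       paste(missingSamples, collapse = ", "))
}
```

With real data, the two data frames would come from `read.csv()` on your own files. Pass `check.names = FALSE` if the sample IDs are not syntactically valid R names, since `read.csv()` would otherwise alter them.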

library("IntLIM")
library("GENN")
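# Locate the example files shipped with GENN.
dir <- system.file("extdata", package = "GENN", mustWork = TRUE)
# Read the training data through the meta-file.
inputData <- IntLIM::ReadData(inputFile = file.path(dir, "NCI60_input_refs_train.csv"),
                              suppressWarnings = TRUE)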

Set Up Model Input

Setting up the model input requires several input parameters:

- inputData: The input data in the IntLIM format, generated using IntLIM::ReadData() above.
- stype: The outcome variable, in this case the outcome "score" (Y).
- outcomeType: The entity type to use as the outcome in the regression models (i.e. X^{S_2}). Listed as "1" (type 1 from the input file) or "2" (type 2 from the input file).
- independentVarType: The entity type to use as the independent variable in the regression models (i.e. X^{S_1}).
- learningRate: The learning rate for model optimization (eta).
- covar: Covariate features (i.e. X^{S_3}).
- continuous: Whether the data is continuous (regression problem) or discrete (classification problem).

Optional parameters may also be included, such as:

- rsquaredCutoff: R^2 cutoff
- optimizationType: type of optimization algorithm
- corrCutoff: For all component predictors with correlation greater than this cutoff, only the best predictor [computed using t-score] will be included in the graph
- convergenceCutoff: convergence cutoff
- k to use in KNN for computing local error metafeature
- eigStep: number of eigenvectors to use for Grassmannian manifold projection when computing local error metafeature
- maxIterations: maximum number of iterations
- initialMetaFeatureWeights: initial metafeature weights phi_0
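To give some intuition for what a correlation cutoff such as corrCutoff does, here is a rough sketch of a greedy correlation filter on toy data. This is not GENN's implementation: the helper `filterCorrelated`, its arguments, and the toy values are all hypothetical, and GENN may group or rank predictors differently.

```r
# Hypothetical helper: greedy correlation filter (NOT GENN's own code).
filterCorrelated <- function(predictors, tScores, corrCutoff = 0.4) {
  # Visit predictors from strongest to weakest absolute t-score.
  visitOrder <- order(abs(tScores), decreasing = TRUE)
  corrMatrix <- abs(cor(predictors))
  kept <- integer(0)
  for (j in visitOrder) {
    # Keep predictor j only if it is not too correlated with any kept one.
    if (length(kept) == 0 || all(corrMatrix[j, kept] <= corrCutoff)) {
      kept <- c(kept, j)
    }
  }
  colnames(predictors)[sort(kept)]
}

# Toy predictors: 'a' and 'b' are nearly identical, 'c' is unrelated.
predictors <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0),
  c = c(1, -1, -1, 1, 1, -1)
)
tScores <- c(a = 2.0, b = 3.5, c = 1.0)

print(filterCorrelated(predictors, tScores, corrCutoff = 0.4))
```

In this toy case, 'a' and 'b' exceed the cutoff with each other, so only the one with the larger t-score survives alongside the uncorrelated 'c'.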

You will be able to view the IntLIM analysis and metafeature computation running in real time.

modelResults <- GENN::DoModelSetup(inputData = inputData,
                                   stype = "drug5FU",
                                   outcomeType = 2,
                                   independentVarType = 1,
                                   learningRate = 0.9,
                                   covar = "cancertype",
                                   optimizationType = "adam",
                                   corrCutoff = 0.4,
                                   rsquaredCutoff = 0.3,
                                   maxIterations = 100,
                                   continuous = TRUE)

Learn Optimal Composite Model

The next step is to actually train the model. First, the graph will be segmented into neighborhoods and connected components (H) that are used throughout the training process. Then, the initial error will be computed after the initial pruning. Finally, the weight deltas, error, and pruned subgraph will be printed for each iteration.

In this case, we are training on a small subset of the entities and are only able to slightly improve the training error after learning the optimal metafeatures.

optimalModel <- GENN::OptimizeMetaFeatureCombo(modelResults = modelResults,
                                               verbose = FALSE,
                                               pruningTechnique = "both",
                                               useCutoff = TRUE)

Apply the Model

Finally, we apply the model to a new data set for prediction. The test data will be read in, the metafeatures computed, and the scores predicted for the test data.

inputDataTest <- IntLIM::ReadData(inputFile = file.path(dir, "NCI60_input_refs_test.csv"),
                                  suppressWarnings = TRUE)
predictions <- GENN::DoTestSetupAndPrediction(inputDataTest = inputDataTest,
                                              model = optimalModel)

Evaluate

We can then evaluate the model's predictions on the new data. Here, we plot the predictions against the true values for the test data, and then we compute the SCov: the covariance between the true and predicted values, divided by the larger of their two variances. This subset of entities is clearly not ideal for prediction, as the SCov is very low.

We also print out the list of pairs and metafeature weights. We can see that the metafeature weights have not changed much during training.

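# Plot the true drug response against the predictions for the test samples.
trueScores <- inputDataTest@sampleMetaData$drug5FU
plot(trueScores, predictions)
# SCov: covariance scaled by the larger of the two variances.
print(cov(trueScores, predictions) / (max(sd(trueScores), sd(predictions))^2))
# Pairs retained in the model and the learned metafeature weights.
print(optimalModel@pairs)
print(optimalModel@current.metaFeature.weights)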
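To see how the SCov expression above behaves, it can be wrapped in a small function and tried on toy vectors. The helper name `scov` and the toy values are ours, chosen only for illustration; the formula is the one used in the evaluation code above.

```r
# Hypothetical helper mirroring the expression used in the evaluation step.
scov <- function(trueValues, predictedValues) {
  cov(trueValues, predictedValues) /
    max(sd(trueValues), sd(predictedValues))^2
}

trueValues <- c(1, 2, 3, 4, 5)

# Predictions identical to the truth.
print(scov(trueValues, trueValues))
# Predictions in exactly the reverse order.
print(scov(trueValues, rev(trueValues)))
# Nearly constant predictions that carry little signal.
print(scov(trueValues, c(3, 3.1, 2.9, 3, 3.05)))
```

Up to floating-point error, identical vectors give 1 and reversed vectors give -1, while predictions that barely vary give a value near 0. Dividing by the larger of the two variances means that predictions with the right trend but the wrong spread are also penalized.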


ncats/MultiOmicsGraphPrediction documentation built on Aug. 23, 2023, 9:19 a.m.