buildPredictor_sparseGenetic: Performs feature selection using multiple resamplings of the...


View source: R/buildPredictor_sparseGenetic.R

Description

Performs feature selection using multiple resamplings of the data

Usage

buildPredictor_sparseGenetic(
  phenoDF,
  cnv_GR,
  predClass,
  group_GRList,
  outDir = tempdir(),
  numSplits = 3L,
  featScoreMax = 10L,
  filter_WtSum = 100L,
  enrichLabels = TRUE,
  enrichPthresh = 0.07,
  numPermsEnrich = 2500L,
  minEnr = -1,
  numCores = 1L,
  FS_numCores = NULL,
  ...
)

Arguments

phenoDF

(data.frame) sample metadata, with columns for patient ID and STATUS

cnv_GR

(GRanges) genetic events. Must contain "ID" column mapping the event to a patient. ID must correspond to the ID column in phenoDF

predClass

(char) patient class to predict

group_GRList

(list) List of GRangesList indicating grouping rules for CNVs. For example, in a pathway-based design, each key would be a pathway name, and the value would be a RangesList containing coordinates of the member genes.

outDir

(char) path to dir where results should be stored. Results for resampling i are under <outDir>/part<i>, while predictor evaluation results are directly in outDir.

numSplits

(integer) number of data resamplings to use

featScoreMax

(integer) max score for features in feature selection

filter_WtSum

(numeric between 5 and 100) Limit to top-ranked networks such that cumulative weight does not exceed this parameter. For example, if filter_WtSum=20, networks are first ordered by decreasing weight; then only those whose cumulative weight is <= 20 are kept.

enrichLabels

(logical) if TRUE, applies label enrichment to train networks

enrichPthresh

(numeric between 0 and 1) networks with label enrichment p-value below this threshold pass enrichment

numPermsEnrich

(integer) number of permutations for label enrichment

minEnr

(integer -1 to 1) minEnr param in enrichLabelsNets()

numCores

(integer) num cores for parallel processing

FS_numCores

(integer) num cores for running GM. If NULL, it is set to max(1,numCores-1). Set to a lower value if the default setting gives an out-of-memory error, which may happen if networks are denser than expected.

...

params for runFeatureSelection()
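Two of the argument rules above can be illustrated with a minimal base R sketch. The network names, weights, and the helper functions filterByCumWeight() and resolveFSCores() are hypothetical and are not part of netDx; the sketch only mirrors the behaviour described for filter_WtSum and for the FS_numCores default.

```r
# Hypothetical network weights; names and values are illustrative only.
weights <- c(netA = 40, netB = 25, netC = 15, netD = 12, netE = 8)

# Assumed reading of filter_WtSum: sort by decreasing weight, then keep
# networks whose running total stays at or below the threshold.
filterByCumWeight <- function(wt, filter_WtSum) {
  wt <- sort(wt, decreasing = TRUE)
  wt[cumsum(wt) <= filter_WtSum]
}

kept <- filterByCumWeight(weights, filter_WtSum = 80)
print(names(kept))

# Documented default for FS_numCores when it is left as NULL.
resolveFSCores <- function(numCores, FS_numCores = NULL) {
  if (is.null(FS_numCores)) max(1L, numCores - 1L) else FS_numCores
}
print(resolveFSCores(4L))
```

With these toy weights the running totals are 40, 65, 80, 92 and 100, so a threshold of 80 keeps the first three networks.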

Details

This function is used for feature selection of patient networks, using multiple resamplings of input data. It is intended for use in the scenario where patient networks are sparse and binary. This function should be called after defining all patient networks. It performs the following steps:

For i = 1..numSplits
    randomly split patients into training and test
    (optional) filter training networks to exclude random-like networks
    compile features into database for cross-validation
    score networks out of 10
end
Using test samples from all resamplings, measure predictor performance.

In short, this function performs all steps involved in building and evaluating the predictor.
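The resampling loop described above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the netDx implementation: the toy pheno and netmat objects, the one-third hold-out fraction, and the stand-in score (a simple count of predClass training patients with an event) are all hypothetical, and the optional label-enrichment filter is omitted.

```r
set.seed(42)  # reproducible toy data

# Toy inputs (hypothetical): 20 patients, 5 binary pathway-level features.
pheno <- data.frame(ID = paste0("P", 1:20),
                    STATUS = rep(c("case", "control"), each = 10),
                    stringsAsFactors = FALSE)
netmat <- matrix(rbinom(20 * 5, 1, 0.3), nrow = 20,
                 dimnames = list(pheno$ID, paste0("pathway", 1:5)))

numSplits <- 3L
predClass <- "case"
pathwayScores <- vector("list", numSplits)

for (i in seq_len(numSplits)) {
  # Randomly hold out about one third of each class as test samples.
  isTest <- logical(nrow(pheno))
  for (cl in unique(pheno$STATUS)) {
    idx <- which(pheno$STATUS == cl)
    isTest[sample(idx, size = floor(length(idx) / 3))] <- TRUE
  }
  train <- netmat[!isTest, , drop = FALSE]
  trainIsPred <- pheno$STATUS[!isTest] == predClass

  # Stand-in score (NOT the netDx scoring): number of training patients
  # of predClass carrying an event in each pathway.
  score <- colSums(train[trainIsPred, , drop = FALSE])
  pathwayScores[[i]] <- data.frame(PATHWAY = names(score),
                                   SCORE = as.integer(score))
}

# Cumulative feature score over all resamplings.
cumulative <- Reduce(`+`, lapply(pathwayScores, function(x) x$SCORE))
names(cumulative) <- colnames(netmat)
print(cumulative)
```

Keeping one score table per split, then summing across splits, mirrors the shape of the pathwayScores and cumulativeFeatScores entries in the return value.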

Value

(list) Predictor results:

1) phenoDF (data.frame): subset of phenoDF provided as input, but limited to patients that have at least one event in the universe of possibilities, e.g. if using pathway-level features, then this table excludes patients with zero CNVs in pathways.
2) netmat (data.frame): Count of genetic events by patients (rows) in pathways (columns). Used as input to the feature selection algorithm.
3) pathwayScores (list): pathway scores for each of the data splits. Each value in the list is a data.frame containing pathway names and scores.
4) enrichedNets (list): This entry is only found if enrichLabels is set to TRUE. It contains the vector of features that passed label enrichment in each split of the data.

5 - 9) Output of RR_featureTally:

5) cumulativeFeatScores: pathway name, cumulative score over N-way data resampling.
6) performance_denAllNets: positive, negative calls at each cutoff:
   network score cutoff (score);
   num networks at cutoff (numPathways);
   total +, ground truth (pred_tot);
   + calls (pred_ol);
   + calls as pct of total (pred_pct);
   total -, ground truth (other_tot);
   - calls (other_ol);
   - calls as pct of total (other_pct);
   ratio of pred_pct and other_pct (rr);
   min. pred_pct in all resamplings (pred_pct_min);
   max pred_pct in all resamplings (pred_pct_max);
   min other_pct in all resamplings (other_pct_min);
   max other_pct in all resamplings (other_pct_max)
7) performance_denEnrichedNets: positive, negative calls at each cutoff, for the label enrichment option; format same as performance_denAllNets. However, the denominator here is limited to patients present in networks that pass label enrichment.
8) resamplingPerformance: breakdown of performance for each of the resamplings, at each of the cutoffs. This is a list of length 2, one for allNets and one for enrichedNets. The value is a matrix with (resamp * 7) columns and S rows, one row per score. The columns contain the following information per resampling:
   1) pred_total: total num patients of predClass
   2) pred_OL: num of pred_total with a CNV in the selected net
   3) pred_OL_pct: 2) divided by 1) (percent)
   4) other_total: total num patients of other class (non-predClass)
   5) other_OL: num of other_total with CNV in selected net
   6) other_OL_pct: 5) divided by 4) (percent)
   7) relEnr: 6) divided by 3).
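As a worked illustration of the per-resampling columns, the following base R sketch computes the counts and percentages for one hypothetical resampling. The status and hasEvent vectors are made-up data, and the ratio is taken as pred over other, following the description of rr in performance_denAllNets; that direction is an assumption here, not something confirmed from the package code.

```r
# Hypothetical test-set data for one resampling.
status <- c("case", "case", "case", "case",
            "control", "control", "control", "control", "control")
# TRUE if the patient has a CNV in any of the selected networks.
hasEvent <- c(TRUE, TRUE, TRUE, FALSE,
              TRUE, FALSE, FALSE, FALSE, FALSE)
predClass <- "case"

isPred <- status == predClass

pred_total   <- sum(isPred)                   # patients of predClass
pred_OL      <- sum(hasEvent & isPred)        # ... with an overlapping CNV
pred_OL_pct  <- 100 * pred_OL / pred_total    # as a percentage
other_total  <- sum(!isPred)                  # patients of the other class
other_OL     <- sum(hasEvent & !isPred)       # ... with an overlapping CNV
other_OL_pct <- 100 * other_OL / other_total  # as a percentage

# Ratio of pred_pct to other_pct (assumed direction, see lead-in).
rr <- pred_OL_pct / other_OL_pct

print(c(pred_total = pred_total, pred_OL = pred_OL,
        pred_OL_pct = pred_OL_pct, other_total = other_total,
        other_OL = other_OL, other_OL_pct = other_OL_pct, rr = rr))
```

In this toy case 3 of 4 cases (75%) and 1 of 5 controls (20%) carry an event in the selected networks, giving a ratio of 3.75.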

Examples

suppressMessages(require(GenomicRanges))
suppressMessages(require(BiocFileCache))

# read CNV data
phenoFile <- system.file("extdata","AGP1_CNV.txt",package="netDx")
pheno   <- read.delim(phenoFile,sep="\t",header=TRUE,as.is=TRUE)
colnames(pheno)[1] <- "ID"
pheno <- pheno[!duplicated(pheno$ID),]

# create GRanges with patient CNVs
cnv_GR    <- GRanges(pheno$seqnames,IRanges(pheno$start,pheno$end),
                        ID=pheno$ID,LOCUS_NAMES=pheno$Gene_symbols)

# get gene coordinates
geneURL <- paste("https://download.baderlab.org/netDx/",
	"supporting_data/refGene.hg18.bed",sep="")
cache <- rappdirs::user_cache_dir(appname = "netDx")
bfc <- BiocFileCache::BiocFileCache(cache,ask=FALSE)
geneFile <- bfcrpath(bfc,geneURL)
genes <- read.delim(geneFile,sep="\t",header=FALSE,as.is=TRUE)
genes <- genes[which(genes[,4]!=""),]
gene_GR     <- GRanges(genes[,1],IRanges(genes[,2],genes[,3]),
   name=genes[,4])

# create GRangesList of pathways
pathFile <- fetchPathwayDefinitions("February",2018,verbose=TRUE)
pathwayList <- readPathways(pathFile)
path_GRList <- mapNamedRangesToSets(gene_GR,pathwayList)

#### uncomment to run - takes 5 min
#outDir <- tempdir()  # same as the function default
#out <- buildPredictor_sparseGenetic(pheno, cnv_GR, "case",
#                             path_GRList,outDir,
#                             numSplits=3L, featScoreMax=3L,
#                             enrichLabels=TRUE,numPermsEnrich=20L,
#                             numCores=1L)
#summary(out)
#head(out$cumulativeFeatScores)

BaderLab/netDx documentation built on Sept. 26, 2021, 9:13 a.m.