create_gene_cna: Load and map CNA gain/loss onto human gene location by genome...
In iGC: An integrated analysis package of Gene expression and Copy number alteration

Description Usage Arguments Details Value Custom reader function See Also Examples

View source: R/faster_gene_cna.R

The function reads through in all sample CNA data given by the sample description sample_desc and returns a joint CNA gain/loss table based on gene regions across samples.

1
2
3

create_gene_cna(sample_desc, gain_threshold = log2(2.5) - 1,
  loss_threshold = log2(1.5) - 1, read_fun = NULL, progress = TRUE,
  progress_width = 48, parallel = FALSE, ...)

`sample_desc`	data.table object created by create_sample_desc.
`gain_threshold`	CNA expression above this will be considered as gain region. By default \log_2{2.5} - 1
`loss_threshold`	CNA expression below this will be considered as loss region. By default \log_2{1.5} - 1
`read_fun`	Custom reader function, see its own section for more detail.
`progress`	Whether to display a progress bar. By default `TRUE`.
`progress_width`	The text width of the shown progress bar. By default is 48 chars wide.
`parallel`	Enable parallelism by plyr. One has to specify a parallel engine beforehand. See example for more information.
`...`	Arguments passed to the custom reader function specified in `read_fun`.

A gene is considered to have CNA gain if the overlapped CNA record expression is higher than the given threshold. Similarly, a gene is considered CNA loss if the overlapped CNA record is lower than the given threshold. If multiple CNA records map onto the same gene region with both gain and loss, the majority wins. If none of the records map to the gene, NA is given.

By default it assumes the data to be of TCGA level 3 file format. For other data formats (e.g. raw data or other experiments from GEO), one should implement a custom reader function that accepts the filepath as the first argument. See section Custom reader function for full specification.

Currently the package ships a custom genome reference hg19, hg19DBNM, for gene region look up. Each gene's region is defined by the widest splicing form it has in NCBI curated records. The defined region includes intron regions. This limitation may change in the future.

data.table of CNA gain/loss on each gene region for all samples, whose rows represent regions of genes and columns represent sample names. First column GENE contains the corresponding gene names.

Custom reader function is given by read_fun = your_reader_fun. It takes the filepath to CNA data as the first argument and returns a data.table with at least the following four columns: Chromosome, Start, End, and Segment_Mean of type character, integer, integer and numeric respectively.

Rest arguments of create_gene_cna(...) will be passed to this reader function.

Note: all string-like columns should NOT be of type factor. Remember to set stringsAsFactors = FALSE.

read.table and fread for custom reader function implementation; create_sample_desc for creating sample description. If the gene information already exists in the data, try direct_gene_cna to skip the genome reference lookup.

## Use first three samples of the builtin dataset

sample_root <- system.file("extdata", package = "iGC")
sample_desc_pth <- file.path(sample_root, "sample_desc.csv")
sample_desc <- create_sample_desc(
    sample_desc_pth, sample_root=sample_root
)[1:3]


## Define custom reader function for TCGA level 3 gene exp. data

my_cna_reader <- function(cna_filepath) {
    cna <- data.table::fread(cna_filepath, sep = '\t', header = TRUE)
    data.table::setnames(
        cna,
        c("Sample", "Chromosome", "Start", "End", "Num_Probes", "Segment_Mean")
    )
    # pick only the needed columns
    cna[, .(Chromosome, Start, End, Segment_Mean)]
}


## Read all samples' CNA data and combined as a single table

gene_cna <- create_gene_cna(
    sample_desc,
    gain_threshold = log2(2.3) - 1, loss_threshold = log2(1.7) - 1,
    read_fun = my_cna_reader,
)
gene_cna[GENE %in% c("BRCA2", "TP53", "SEMA5A"), ]


## Not run: 
## To boost the speed, utilize parallelization

doMC::registerDoMC(4)  # number of CPU cores
gene_cna <- create_gene_cna(
    sample_desc,
    gain_threshold = log2(2.3) - 1, loss_threshold = log2(1.7) - 1,
    read_fun = my_cna_reader,
    parallel = TRUE
)

## End(Not run)

|                    |  0%                      
|======              |33.33333% ~1 s remaining  
|=============       |66.66667% ~0 s remaining  
|====================|100% ~0 s remaining       
|====================|100%                      Completed after 1 s 
     GENE TCGA-AO-A0JL-01A TCGA-BH-A0HF-01A TCGA-BH-A0HK-01A
1:  BRCA2               -1               -1                0
2: SEMA5A                1                0                0
3:   TP53                0                0               -1
Progress disabled when using parallel plyr
Warning messages:
1: In readChild(ch) : unable to terminate child process: Permission denied
2: In readChild(ch) : unable to terminate child process: Permission denied
3: In readChild(ch) : unable to terminate child process: Permission denied
4: In cleanup(mc.cleanup) : unable to terminate child: Permission denied
5: In cleanup(mc.cleanup) : unable to terminate child: Permission denied
6: In cleanup(mc.cleanup) : unable to terminate child: Permission denied
Error while shutting down parallel: unable to terminate some child processes
sh: 1: rm: Permission denied
Warning messages:
1: In cleanup(kill = tools::SIGKILL, detach = TRUE, shutdown = TRUE) :
  unable to terminate child: Permission denied
2: In cleanup(kill = tools::SIGKILL, detach = TRUE, shutdown = TRUE) :
  unable to terminate child: Permission denied
3: In cleanup(kill = tools::SIGKILL, detach = TRUE, shutdown = TRUE) :
  unable to terminate child: Permission denied