Load and map CNA gain/loss onto human gene location by genome reference

Share:

Description

The function reads through in all sample CNA data given by the sample description sample_desc and returns a joint CNA gain/loss table based on gene regions across samples.

Usage

1
2
3
create_gene_cna(sample_desc, gain_threshold = log2(2.5) - 1,
  loss_threshold = log2(1.5) - 1, read_fun = NULL, progress = TRUE,
  progress_width = 48, parallel = FALSE, ...)

Arguments

sample_desc

data.table object created by create_sample_desc.

gain_threshold

CNA expression above this will be considered as gain region. By default \log_2{2.5} - 1

loss_threshold

CNA expression below this will be considered as loss region. By default \log_2{1.5} - 1

read_fun

Custom reader function, see its own section for more detail.

progress

Whether to display a progress bar. By default TRUE.

progress_width

The text width of the shown progress bar. By default is 48 chars wide.

parallel

Enable parallelism by plyr. One has to specify a parallel engine beforehand. See example for more information.

...

Arguments passed to the custom reader function specified in read_fun.

Details

A gene is considered to have CNA gain if the overlapped CNA record expression is higher than the given threshold. Similarly, a gene is considered CNA loss if the overlapped CNA record is lower than the given threshold. If multiple CNA records map onto the same gene region with both gain and loss, the majority wins. If none of the records map to the gene, NA is given.

By default it assumes the data to be of TCGA level 3 file format. For other data formats (e.g. raw data or other experiments from GEO), one should implement a custom reader function that accepts the filepath as the first argument. See section Custom reader function for full specification.

Currently the package ships a custom genome reference hg19, hg19DBNM, for gene region look up. Each gene's region is defined by the widest splicing form it has in NCBI curated records. The defined region includes intron regions. This limitation may change in the future.

Value

data.table of CNA gain/loss on each gene region for all samples, whose rows represent regions of genes and columns represent sample names. First column GENE contains the corresponding gene names.

Custom reader function

Custom reader function is given by read_fun = your_reader_fun. It takes the filepath to CNA data as the first argument and returns a data.table with at least the following four columns: Chromosome, Start, End, and Segment_Mean of type character, integer, integer and numeric respectively.

Rest arguments of create_gene_cna(...) will be passed to this reader function.

Note: all string-like columns should NOT be of type factor. Remember to set stringsAsFactors = FALSE.

See Also

read.table and fread for custom reader function implementation; create_sample_desc for creating sample description. If the gene information already exists in the data, try direct_gene_cna to skip the genome reference lookup.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
## Use first three samples of the builtin dataset

sample_root <- system.file("extdata", package = "iGC")
sample_desc_pth <- file.path(sample_root, "sample_desc.csv")
sample_desc <- create_sample_desc(
    sample_desc_pth, sample_root=sample_root
)[1:3]


## Define custom reader function for TCGA level 3 gene exp. data

my_cna_reader <- function(cna_filepath) {
    cna <- data.table::fread(cna_filepath, sep = '\t', header = TRUE)
    data.table::setnames(
        cna,
        c("Sample", "Chromosome", "Start", "End", "Num_Probes", "Segment_Mean")
    )
    # pick only the needed columns
    cna[, .(Chromosome, Start, End, Segment_Mean)]
}


## Read all samples' CNA data and combined as a single table

gene_cna <- create_gene_cna(
    sample_desc,
    gain_threshold = log2(2.3) - 1, loss_threshold = log2(1.7) - 1,
    read_fun = my_cna_reader,
)
gene_cna[GENE %in% c("BRCA2", "TP53", "SEMA5A"), ]


## Not run: 
## To boost the speed, utilize parallelization

doMC::registerDoMC(4)  # number of CPU cores
gene_cna <- create_gene_cna(
    sample_desc,
    gain_threshold = log2(2.3) - 1, loss_threshold = log2(1.7) - 1,
    read_fun = my_cna_reader,
    parallel = TRUE
)

## End(Not run)