concatenate_exon: Concatenate GDC files into a single matrix and prepare the...


View source: R/concatenate_exon.R

Description

concatenate_exon concatenates GDC files into a single matrix, where columns correspond to patient codes and rows correspond to data names.

Usage

concatenate_exon(
  data_type,
  normalization = TRUE,
  name,
  data_base,
  htseq = NULL,
  work_dir,
  tumor,
  workflow_type,
  tumor_data = TRUE,
  only_filter = FALSE,
  tumor_type = 1,
  normal_type = 11,
  platform = "",
  env,
  save_data = FALSE
)

Arguments

data_type

Type of data. It could be "methylation", "mutation", "clinical_supplement", "biospecimen", "gene", or "clinical" (biotab).

  • Only present in "Legacy" database: "protein", "Exon quantification", "miRNA gene quantification", "miRNA isoform quantification", "isoform", and "image".

  • Only present in "GDC" database: "miRNA Expression Quantification", and "Isoform Expression Quantification" (miRNA).

normalization

Logical value where TRUE specifies the desire to work with normalized files only. When FALSE, do not forget to set the env argument in the second run. This argument is only applicable to gene and isoform expression data from the GDC Legacy Archive. The default is TRUE.

name

A character string indicating the desired values to be used in the next analysis. For instance, "HIF3A" in the legacy gene expression matrix, "mir-1307" in the miRNA quantification matrix, or "HER2" in the protein quantification matrix.

data_base

A character string specifying "GDC" for GDC Data Portal or "legacy" for GDC Legacy Archive.

htseq

A character string indicating which htseq workflow data should be downloaded (only applied to "GDC" gene expression): "Counts", "FPKM" or "FPKM-UQ".

work_dir

A character string specifying the path to the working directory.

tumor

A character string containing one of the 33 tumors available in the TCGA project. For instance, "BRCA" stands for breast cancer.

workflow_type

A character string specifying the workflow type for mutation data in "gdc", where:

  • "varscan" stands for VarScan2 Variant Aggregation and Masking

  • "mutect" stands for MuTect2 Variant Aggregation and Masking

  • "muse" stands for MuSE Variant Aggregation and Masking

  • "somaticsniper" stands for SomaticSniper Variant Aggregation and Masking

  • "all" means to concatenate all workflows into a single matrix.

tumor_data

Logical value where TRUE specifies the desire to work with tumor tissue files only. When set to FALSE, it creates two matrices, one containing tumor data and the other containing data from non-tumor tissue. The default is TRUE.

only_filter

Logical value where TRUE indicates that the matrix is already concatenated and the function should only select a different name, without concatenating all the files again. The default is FALSE.

tumor_type

Numerical value(s) corresponding to barcode sample types:

Tumor codes:

  • 1: Primary Solid Tumor

  • 2: Recurrent Solid Tumor

  • 3: Primary Blood Derived Cancer - Peripheral Blood

  • 4: Recurrent Blood Derived Cancer - Bone Marrow

  • 5: Additional - New Primary

  • 6: Metastatic

  • 7: Additional Metastatic

  • 8: Human Tumor Original Cells

  • 9: Primary Blood Derived Cancer - Bone Marrow

The default is 1.

normal_type

Numerical value(s) corresponding to barcode sample types:

Normal codes:

  • 10: Blood Derived Normal

  • 11: Solid Tissue Normal

  • 12: Buccal Cell Normal

  • 13: EBV Immortalized Normal

  • 14: Bone Marrow Normal

  • 15: sample type 15

  • 16-19: sample type 16

or

Control codes:

  • use '20:29' without quotes

The default is 11.
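The tumor_type and normal_type codes above match the two-digit sample type found in the fourth field of a TCGA barcode. The following is a minimal sketch of how such codes can be read and grouped. The helpers sample_type_code and classify_sample are hypothetical and not part of DOAGDC, and the barcodes are illustrative values only.

```r
# Hypothetical helper (not part of DOAGDC): read the two-digit sample type
# code from the fourth dash-separated field of each TCGA barcode.
sample_type_code <- function(barcode) {
  fields <- strsplit(barcode, "-", fixed = TRUE)
  vapply(fields, function(f) as.integer(substr(f[4], 1, 2)), integer(1))
}

# Hypothetical helper: group codes into the ranges listed above
# (1 to 9 tumor, 10 to 19 normal, 20 to 29 control).
classify_sample <- function(code) {
  ifelse(code %in% 1:9, "tumor",
    ifelse(code %in% 10:19, "normal",
      ifelse(code %in% 20:29, "control", NA_character_)
    )
  )
}

# Illustrative barcodes, not taken from a real download
barcodes <- c(
  "TCGA-XX-0001-01A-11R-0000-07",
  "TCGA-XX-0001-11A-11R-0000-07",
  "TCGA-XX-0002-20A-11R-0000-07"
)

codes <- sample_type_code(barcodes)
groups <- classify_sample(codes)
```

With these inputs, codes holds 1, 11, and 20, so the first sample would be selected by tumor_type = 1 and the second by normal_type = 11.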

platform

A character string indicating the platform name for methylation, exon quantification, miRNA, and mutation data.

  • For mutation and exon quantification data: "Illumina GA", "Illumina HiSeq" or "all".

  • For methylation data: "Illumina Human Methylation 450", "Illumina Human Methylation 27" or "all".

  • For miRNA data: "Illumina GA", "Illumina HiSeq", "H-miRNA_8x15K" (for GBM tumor), "H-miRNA_8x15Kv2" (for OV tumor), or "all".

The default for all data_type cited is "all" (when downloading data).

env

A character string containing the name of the environment that should be used. If none has been set yet, the function will create one in the global environment following the standard criteria:

  • 'tumor_data_base_data_type_tumor_data' or

  • 'tumor_data_base_data_type_both_data' (for tumor and non-tumor data in separate matrices).
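As a rough illustration of the naming pattern above, the sketch below builds the name by joining the tumor, data_base, and data_type values with underscores. The helper default_env_name is hypothetical, and the assumption that the argument values are inserted unchanged (no case conversion or replacement of spaces) is not confirmed by this page.

```r
# Hypothetical helper (not part of DOAGDC): compose an environment name from
# the pattern '<tumor>_<data_base>_<data_type>_tumor_data' or '..._both_data'.
# Assumption: argument values are pasted in unchanged.
default_env_name <- function(tumor, data_base, data_type, tumor_data = TRUE) {
  suffix <- if (tumor_data) "tumor_data" else "both_data"
  paste(tumor, data_base, data_type, suffix, sep = "_")
}

tumor_only_name <- default_env_name("CHOL", "legacy", "gene")
both_name <- default_env_name("CHOL", "legacy", "gene", tumor_data = FALSE)
```

Under these assumptions, tumor_only_name is "CHOL_legacy_gene_tumor_data" and both_name is "CHOL_legacy_gene_both_data".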

save_data

Logical value where TRUE indicates that the concatenated and filtered matrix should be saved in local storage. The default is FALSE.

cutoff_beta_na

Numerical value indicating the maximum proportion (in decimal form) of NA beta values tolerated; rows containing more NA values than this threshold are removed (methylation data). The default is 0.25.

cutoff_betasd

Numerical value indicating the standard deviation threshold of beta values (methylation data). It keeps only rows that have standard deviation of beta values higher than the threshold. The default is 0.005.
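The two methylation cutoffs can be pictured as successive row filters on a beta-value matrix. The sketch below is an assumption about how such filtering might work, not the package's implementation: filter_beta is a hypothetical helper, the NA threshold is assumed to apply to the fraction of NA values per row, and the toy matrix is made up for illustration.

```r
# Hypothetical helper (not part of DOAGDC): filter rows of a beta-value matrix.
filter_beta <- function(beta, cutoff_beta_na = 0.25, cutoff_betasd = 0.005) {
  # Drop rows whose fraction of NA values exceeds the tolerated maximum
  na_fraction <- rowMeans(is.na(beta))
  beta <- beta[na_fraction <= cutoff_beta_na, , drop = FALSE]

  # Keep only rows whose standard deviation is above the threshold
  row_sd <- apply(beta, 1, sd, na.rm = TRUE)
  beta[!is.na(row_sd) & row_sd > cutoff_betasd, , drop = FALSE]
}

# Toy beta values: rows are probes, columns are patients
beta <- rbind(
  cg_keep    = c(0.10, 0.50, 0.90, 0.30),
  cg_many_na = c(NA, NA, 0.40, 0.60),
  cg_flat    = c(0.500, 0.501, 0.500, 0.501)
)
colnames(beta) <- paste0("patient_", 1:4)

filtered <- filter_beta(beta)
```

Here cg_many_na is dropped because half of its values are NA, cg_flat is dropped because its standard deviation is below 0.005, and only cg_keep remains.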

use_hg19_mirbase20

Logical value where TRUE indicates that only hg19.mirbase20 should be used. This parameter is needed when using data_base = "legacy" and one of the available miRNA data_type in "legacy" ("miRNA gene quantification" and "miRNA isoform quantification"). The default is FALSE.

Value

A matrix with data names in rows and patient codes in columns.
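To make the layout of the returned matrix concrete, the sketch below column-binds per-patient named vectors with base R. It only illustrates the row and column orientation; the patient codes and values are invented, and this is not how concatenate_exon reads GDC files.

```r
# Invented per-patient values, one named vector per patient code
per_patient <- list(
  "TCGA-XX-0001" = c(HIF3A = 12.5, TP53 = 88.1),
  "TCGA-XX-0002" = c(HIF3A = 3.2, TP53 = 95.4)
)

# Bind the vectors as columns: data names become rows, patients become columns
combined <- do.call(cbind, per_patient)

# Selecting one data name (as the 'name' argument does conceptually)
hif3a_values <- combined["HIF3A", ]
```

In this toy matrix, combined has one row per data name and one column per patient, and hif3a_values is a vector named by patient code.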

Examples

library(DOAGDC)

# Concatenating gene expression data into a single matrix
# data already downloaded using the 'download_gdc' function
concatenate_exon("gene",
    name = "HIF3A",
    data_base = "legacy",
    tumor = "CHOL",
    work_dir = "~/Desktop"
)

Facottons/DOAGDC documentation built on April 7, 2020, 3:17 a.m.