cohorts: Functions to access/manage cohort level data

View source: R/cohort.R

cohorts    R Documentation

Functions to access/manage cohort level data

Description

A cohort is a set of datasets from the same study or for the same group of samples; it is the individual data unit in Omics database construction. TODO: ENSEMBL IDs and Hugo Symbols will be supported and can be exchanged automatically.

Usage

cohorts(verbose = FALSE)

cohort_new(
  path,
  name,
  cancer_type,
  data_provider,
  maintainer,
  dataset_list,
  doi = NA,
  year = NULL,
  dataset_rowidx = SPEC_ROWINDEX,
  dataset_options = SPEC_DATASET_OPTIONS,
  dataset_ncount = SPEC_SAMPINDEX,
  verbose = FALSE
)

cohort_ls(id, verbose = FALSE)

Arguments

verbose

Whether to print extra information.

path

A directory path pointing to the new cohort.

name

Description of the cohort.

cancer_type

Cancer type the cohort belongs to.

data_provider

Typically a university or an institute.

maintainer

Maintainer name and email in format xxx <xxx@xxx.com>.
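
The maintainer format above can be checked before calling cohort_new(). The following is a minimal sketch in base R; the helper name is_valid_maintainer and the regular expression are assumptions made for illustration, and coco itself may validate this argument differently or not at all.

```r
# Hypothetical helper (not part of coco): checks the "Name <email>" shape.
is_valid_maintainer <- function(x) {
  # A name, one space, then an address wrapped in angle brackets.
  grepl("^[^<>]+ <[^<>@[:space:]]+@[^<>@[:space:]]+>$", x)
}

ok  <- is_valid_maintainer("Shixiang Wang <wangsx1@sysucc.org.cn>")
bad <- is_valid_maintainer("Shixiang Wang wangsx1@sysucc.org.cn")
```

Here ok is TRUE and bad is FALSE, because the second string lacks the angle brackets around the address.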

dataset_list

A data.frame containing information about the datasets in this cohort. It should have the following fields:

  • id: The file name, including the cohort prefix but without the .txt/.maf(.gz) extension, which is used as the identifier of the dataset. For example, if you have a gene expression dataset in cohort abc, and your dataset file is named abc_expression.txt, then the id should be set to abc_expression. In this way the file is unique across the whole database hosted by {coco}.

  • name: Description of the dataset.

  • genome_build: Reference genome version, e.g., "hg38".

  • data_platform: See SPEC_DATASET_OPTIONS.

  • data_type: See SPEC_DATASET_OPTIONS.

  • data_format: See SPEC_DATASET_OPTIONS; used to filter and search. For the "Segment" format, refer to https://github.com/ShixiangWang/DoAbsolute/blob/master/inst/extdata/SNP6_solid_tumor.seg.txt as an example.

  • tags: Other labels for the dataset to help search and filter. Separate multiple tags with commas.
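
The fields above can be assembled and sanity-checked in base R before the table is passed to cohort_new(). This is only a sketch: the file names, the helper file_to_id and the checks are assumptions for illustration, not coco's own validation. The platform, type and format values are copied from the example on this page.

```r
# Hypothetical dataset files in cohort "abc" (illustrative names only).
files <- c("abc_expression.txt", "abc_mutation.maf.gz", "abc_clinical.txt.gz")

# Hypothetical helper: drop a trailing .txt or .maf, optionally followed by .gz.
file_to_id <- function(x) sub("\\.(txt|maf)(\\.gz)?$", "", basename(x))

dataset_list <- data.frame(
  id            = file_to_id(files),
  name          = c("Gene expression", "Mutation list", "Patient information"),
  genome_build  = c("hg19", "hg19", NA),
  data_platform = c("RNA-Seq", "WES", "Clinical"),
  data_type     = c("Gene expression", "Mutation", "Phenotype"),
  data_format   = c("Matrix", "MAF", "PatientInfo"),
  tags          = c("HTSeq", NA, "survival,FAB"),
  stringsAsFactors = FALSE
)

# Check that every documented field is present and that ids are unique.
required <- c("id", "name", "genome_build", "data_platform",
              "data_type", "data_format", "tags")
missing_fields <- setdiff(required, colnames(dataset_list))
stopifnot(length(missing_fields) == 0, anyDuplicated(dataset_list$id) == 0)

# Comma-separated tags split into one character vector per dataset.
tag_sets <- strsplit(dataset_list$tags, ",", fixed = TRUE)
```

Deriving the id from the file name keeps the two in sync, which is what the id field requires.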

doi

DOI linking to the reference for the cohort.

year

The year the cohort was generated; if not set, the current year is used.

dataset_rowidx

Specifies which columns are used to generate row index files from a dataset file. Check SPEC_ROWINDEX for the implemented specifications. For example, ".c1" for the "Matrix" data format means that if a dataset is in "Matrix" format (typically gene expression), the first column is the index. You can also specify a column name directly, e.g., "Sample" for the data format "Segment".
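
To make the two kinds of specification concrete, the sketch below resolves either a positional code or a column name to a column of a data.frame. The helper resolve_index_column and the toy tables are hypothetical, and treating ".cN" as "the N-th column" generalizes from the single documented case ".c1"; how coco actually builds its row index files is not shown on this page.

```r
# Hypothetical helper (not part of coco): pick the index column from a spec.
resolve_index_column <- function(df, spec) {
  if (grepl("^\\.c[0-9]+$", spec)) {
    # Positional spec such as ".c1": take the column at that position.
    pos <- as.integer(sub("^\\.c", "", spec))
    stopifnot(pos >= 1, pos <= ncol(df))
    return(df[[pos]])
  }
  # Otherwise treat the spec as a column name such as "Sample".
  stopifnot(spec %in% colnames(df))
  df[[spec]]
}

# Toy "Matrix"-like table: the first column holds the gene identifiers.
expr <- data.frame(gene = c("TP53", "KRAS"), s1 = c(1.2, 3.4), s2 = c(0.5, 2.1),
                   stringsAsFactors = FALSE)
# Toy "Segment"-like table: rows are keyed by the "Sample" column.
seg <- data.frame(Sample = c("s1", "s1", "s2"), Chromosome = c("1", "2", "1"),
                  stringsAsFactors = FALSE)

row_index    <- resolve_index_column(expr, ".c1")
sample_index <- resolve_index_column(seg, "Sample")
```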

dataset_options

Valid dataset options, see SPEC_DATASET_OPTIONS. NOTE: you can extend SPEC_DATASET_OPTIONS and SPEC_ROWINDEX yourself.

dataset_ncount

Similar to dataset_rowidx, but used to specify the column for sample counting. See SPEC_SAMPINDEX.

id

Cohort ID, which can be obtained from cohorts().

Functions

  • cohort_new: Create a new cohort with data stored in a path

  • cohort_ls: List datasets (or metadata) in a cohort

Examples

cohorts(verbose = TRUE)

cohort_new(
  path = system.file("cohorts/example_TCGA_LAML", package = "coco"),
  name = "TCGA Acute Myeloid Leukemia (LAML) for examples utils",
  cancer_type = "LAML",
  data_provider = "TCGA",
  maintainer = "Shixiang Wang <wangsx1@sysucc.org.cn>",
  doi = NA,
  year = 2016,
  dataset_list = data.frame(
    id = c(
      "example_TCGA_LAML_patient_info",
      "example_TCGA_LAML_gene_expr_HTSeq_count",
      "example_TCGA_LAML_MAF"
    ),
    name = c(
      "Patient information", "Gene expression in log2(count)",
      "Mutation list"
    ),
    genome_build = c(NA, "hg19", "hg19"),
    data_platform = c("Clinical", "RNA-Seq", "WES"),
    data_type = c("Phenotype", "Gene expression", "Mutation"),
    data_format = c("PatientInfo", "Matrix", "MAF"),
    tags = c("survival,FAB", "HTSeq", NA)
  )
)

cohort_ls("example_TCGA_LAML")

ShixiangWang/coco documentation built on July 9, 2022, 4:43 a.m.