coco: Interactive Cancer Explorer

View source: R/cohort.R

cohorts

R Documentation

Functions to access/manage cohort level data

Description

A cohort is a set of datasets from the same study or for the same group of samples; it is the basic data unit in Omics database construction. TODO: ENSEMBL IDs and Hugo Symbols will be supported and will be converted to each other automatically.

Usage

cohorts(verbose = FALSE)

cohort_new(
  path,
  name,
  cancer_type,
  data_provider,
  maintainer,
  dataset_list,
  doi = NA,
  year = NULL,
  dataset_rowidx = SPEC_ROWINDEX,
  dataset_options = SPEC_DATASET_OPTIONS,
  dataset_ncount = SPEC_SAMPINDEX,
  verbose = FALSE
)

cohort_ls(id, verbose = FALSE)

Arguments

`verbose`	Whether to print extra information.
`path`	A directory path pointing to the new cohort.
`name`	Description of the cohort.
`cancer_type`	Cancer type the cohort belongs to.
`data_provider`	Typically a university or an institute.
`maintainer`	Maintainer name and email in the format `xxx <xxx@xxx.com>`.
`dataset_list`	A `data.frame` containing information about the datasets in this cohort. It should have the following fields:
  `id`: The file name without the `.txt`/`.maf`(`.gz`) extension, used as the identifier of the dataset. For example, if you have a gene expression dataset in cohort `abc` and the file is named `abc_expression.txt`, set the id to `abc_expression`. Keeping the cohort name in the file name, as here, keeps the id unique across the whole database hosted by `{coco}`.
  `name`: Description of the dataset.
  `genome_build`: Reference genome version, e.g., "hg38".
  `data_platform`: See `SPEC_DATASET_OPTIONS`.
  `data_type`: See `SPEC_DATASET_OPTIONS`.
  `data_format`: See `SPEC_DATASET_OPTIONS`; used to filter and search. For the "Segment" format, refer to https://github.com/ShixiangWang/DoAbsolute/blob/master/inst/extdata/SNP6_solid_tumor.seg.txt as an example.
  `tags`: Other labels for the dataset to help search and filter. Separate multiple tags with commas.
`doi`	DOI to link the reference.
`year`	The year the cohort was generated. If not set, the current year is used.
`dataset_rowidx`	Specifies which columns are used to generate row index files from a dataset file. See `SPEC_ROWINDEX` for the implemented settings. For example, ".c1" for the "Matrix" data format means that if a dataset is in "Matrix" format (typically gene expression), the first column is the index. You can also specify a column name directly, e.g., "Sample" for the "Segment" data format.
`dataset_options`	Valid dataset options, see `SPEC_DATASET_OPTIONS`. NOTE: you can extend `SPEC_DATASET_OPTIONS` and `SPEC_ROWINDEX` yourself.
`dataset_ncount`	Similar to `dataset_rowidx`, but specifies the column used for sample counting. See `SPEC_SAMPINDEX`.
`id`	Cohort ID, which can be obtained from `cohorts()`.
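
The `id` rule for `dataset_list` (file name minus the `.txt`/`.maf`, optionally `.gz`, extension) can be sketched in base R as follows. This is only an illustration under stated assumptions: the file names, the `required_fields` vector and the two checks are hypothetical and are not part of `{coco}`; the field values are borrowed from the example on this page.

```r
# Hypothetical file names inside a cohort directory named "abc".
files <- c("abc_expression.txt", "abc_mutation.maf.gz", "abc_clinical.txt.gz")

# Strip a trailing .txt or .maf extension, optionally followed by .gz.
ids <- sub("\\.(txt|maf)(\\.gz)?$", "", files)

# Fields that the `dataset_list` argument is documented to contain.
required_fields <- c(
  "id", "name", "genome_build", "data_platform",
  "data_type", "data_format", "tags"
)

dataset_list <- data.frame(
  id = ids,
  name = c("Gene expression", "Mutation list", "Patient information"),
  genome_build = c("hg38", "hg38", NA),
  data_platform = c("RNA-Seq", "WES", "Clinical"),
  data_type = c("Gene expression", "Mutation", "Phenotype"),
  data_format = c("Matrix", "MAF", "PatientInfo"),
  tags = c("HTSeq", NA, "survival"),
  stringsAsFactors = FALSE
)

# Collect missing fields and duplicated ids before calling cohort_new().
missing_fields <- setdiff(required_fields, colnames(dataset_list))
duplicated_ids <- dataset_list$id[duplicated(dataset_list$id)]
```

Both `missing_fields` and `duplicated_ids` should be empty for a well-formed table; running such a check first is a cheap way to catch a misspelled column or a repeated file name.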

Functions

cohort_new: Create a new cohort with data stored in a path
cohort_ls: List datasets (or metadata) in a cohort

Examples

cohorts(verbose = TRUE)

cohort_new(
  path = system.file("cohorts/example_TCGA_LAML", package = "coco"),
  name = "TCGA Acute Myeloid Leukemia (LAML) for examples utils",
  cancer_type = "LAML",
  data_provider = "TCGA",
  maintainer = "Shixiang Wang <wangsx1@sysucc.org.cn>",
  doi = NA,
  year = 2016,
  dataset_list = data.frame(
    id = c(
      "example_TCGA_LAML_patient_info",
      "example_TCGA_LAML_gene_expr_HTSeq_count",
      "example_TCGA_LAML_MAF"
    ),
    name = c(
      "Patient information", "Gene expression in log2(count)",
      "Mutation list"
    ),
    genome_build = c(NA, "hg19", "hg19"),
    data_platform = c("Clinical", "RNA-Seq", "WES"),
    data_type = c("Phenotype", "Gene expression", "Mutation"),
    data_format = c("PatientInfo", "Matrix", "MAF"),
    tags = c("survival,FAB", "HTSeq", NA)
  )
)

cohort_ls("example_TCGA_LAML")
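
Because `tags` stores multiple labels as one comma-separated string, filtering by tag requires splitting it first. The sketch below shows one way to do that in base R on a reduced copy of the `dataset_list` above; `ids_with_tag()` is a hypothetical helper written for this illustration, not a `{coco}` function.

```r
# A reduced copy of the `dataset_list` from the example above.
datasets <- data.frame(
  id = c(
    "example_TCGA_LAML_patient_info",
    "example_TCGA_LAML_gene_expr_HTSeq_count",
    "example_TCGA_LAML_MAF"
  ),
  tags = c("survival,FAB", "HTSeq", NA),
  stringsAsFactors = FALSE
)

# Split each comma-separated tag string into a character vector;
# datasets without tags (NA) get an empty vector.
tag_list <- lapply(datasets$tags, function(x) {
  if (is.na(x)) character(0) else trimws(strsplit(x, ",", fixed = TRUE)[[1]])
})

# Hypothetical helper: ids of the datasets carrying a given tag.
ids_with_tag <- function(tag) {
  datasets$id[vapply(tag_list, function(tags) tag %in% tags, logical(1))]
}

ids_with_tag("FAB")
```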

ShixiangWang/coco documentation built on July 9, 2022, 4:43 a.m.