Importing publicly available CAGE data from various resources

Description

Imports CAGE data from different sources into a CAGEset object. Once the CAGEset object has been created, the data can be further manipulated and visualized using other functions available in the CAGEr package and integrated with other analyses in R. Available resources include:
- FANTOM5 datasets (Forrest et al. Nature 2014) for numerous human and mouse samples (primary cells, cell lines and tissues), which are fetched directly from the FANTOM5 online resource.
- FANTOM3 and 4 datasets (Carninci et al. Science 2005, Faulkner et al. Nature Genetics 2009, Suzuki et al. Nature Genetics 2009) from the FANTOM3and4CAGE data package, which is available from Bioconductor.
- ENCODE datasets (Djebali et al. Nature 2012) for numerous human cell lines from the ENCODEprojectCAGE data package, which is available for download from http://promshift.genereg.net/CAGEr/.
- Zebrafish developmental timecourse datasets (Nepal et al. Genome Research 2013) from the ZebrafishDevelopmentalCAGE data package, which is available for download from http://promshift.genereg.net/CAGEr/.

Usage

importPublicData(source, dataset, group, sample)

Arguments

source

Character string specifying one of the available resources for CAGE data. Can be one of the following:
"FANTOM5": for fetching and importing CAGE data for various human or mouse primary cells, cell lines and tissues from the online FANTOM5 resource (http://fantom.gsc.riken.jp/5/data/). All data published in the main FANTOM5 publication by Forrest et al. are available.
"FANTOM3and4": for importing CAGE data for various human or mouse tissues produced within the FANTOM3 and FANTOM4 projects. Requires the data package FANTOM3and4CAGE to be installed. This data package is available from Bioconductor.
"ENCODE": for importing CAGE data for human cell lines from the ENCODE project published by Djebali et al. Requires the data package ENCODEprojectCAGE to be installed. This data package is available for download from http://promshift.genereg.net/CAGEr/.
"ZebrafishDevelopment": for importing CAGE data from the developmental timecourse of zebrafish (Danio rerio) published by Nepal et al. Requires the data package ZebrafishDevelopmentalCAGE to be installed. This data package is available for download from http://promshift.genereg.net/CAGEr/.
See Details for further explanation of individual resources.

dataset

Character vector specifying one or more of the datasets available in the selected resource. For FANTOM5 it can be either "human" or "mouse", and only one of them can be specified at a time. For other resources please refer to the vignette of the corresponding data package for the list of available datasets. Multiple datasets mapped to the same genome can be specified to combine selected samples from each.

group

Character vector specifying one or more groups within the specified dataset(s) from which the samples should be selected. The group argument is used only when importing TSSs from data packages and is ignored when source="FANTOM5". For the groups available in each dataset, please refer to the vignette of the corresponding data package. Specify either a single group (if all selected samples belong to the same group) or one group per sample (if samples belong to different groups). In the latter case, the number of elements in group must match the number of elements in sample.

sample

Character vector specifying one or more CAGE samples. Check the corresponding data package for the samples available within each group and their labels. For the FANTOM5 resource, the list of all human (~1000) and mouse (~) samples can be obtained in CAGEr by loading data(FANTOM5humanSamples) and data(FANTOM5mouseSamples), respectively. Use the names from the sample column to specify which samples should be imported.
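
As an illustration of picking names from the sample column, the sketch below filters a small mock table by sample type. The table mockSamples is a hypothetical stand-in for FANTOM5humanSamples: it reproduces only the sample and type columns used in the Examples on this page, and its cell line row is an invented placeholder, so the code runs without CAGEr or a network connection.

```r
# Mock table standing in for FANTOM5humanSamples (assumption: only the
# 'sample' and 'type' columns are reproduced; the last row is invented).
mockSamples <- data.frame(
  sample = c("adipose_tissue__adult__pool1",
             "adrenal_gland__adult__pool1",
             "aorta__adult__pool1",
             "hypothetical_cell_line_1"),
  type = c("tissue", "tissue", "tissue", "cell line"),
  stringsAsFactors = FALSE
)

# Keep the tissue rows and take their names from the 'sample' column.
tissueSamples <- subset(mockSamples, type == "tissue")$sample
tissueSamples

# These names would then be passed as the 'sample' argument, e.g.
# importPublicData(source = "FANTOM5", dataset = "human",
#                  sample = tissueSamples)
```

With the real table, the same subsetting step would be applied to FANTOM5humanSamples or FANTOM5mouseSamples after loading them with data().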

Details

CAGE data from different sources is available for importing directly into CAGEset object for further manipulation with CAGEr.
The FANTOM consortium provides single base-pair resolution TSS data for numerous human and mouse primary cells, cell lines and tissues produced within the FANTOM5 project (Forrest et al. Nature 2014). These are fetched directly from the online resource at http://fantom.gsc.riken.jp/5/data and imported into a CAGEset object. To use this resource, specify source="FANTOM5". The dataset argument can be either "human" or "mouse", but not both at the same time. The list of all human and mouse samples can be obtained by loading data(FANTOM5humanSamples) and data(FANTOM5mouseSamples). The sample column gives the names of individual samples, which should be provided as the sample argument. See the Examples section.
TSS data from the previous FANTOM3 and FANTOM4 projects (Carninci et al., Faulkner et al., Suzuki et al.) are also available through the FANTOM3and4CAGE data package. This data package can be installed directly from Bioconductor. To use this resource, install and load the FANTOM3and4CAGE package and specify source="FANTOM3and4". The dataset argument can be the name of any of the datasets available in this package. Load data(FANTOMhumanSamples) or data(FANTOMmouseSamples) for the list of available datasets with group and sample labels for specific human or mouse samples. These have to be provided as the dataset, group and sample arguments to import selected samples. If all samples belong to the same group, that group needs to be specified only once; otherwise, a corresponding group has to be specified for each sample, i.e. the number of elements in group must match the number of elements in sample.
The ENCODE consortium produced CAGE data for numerous human cell lines (Djebali et al. Nature 2012). We have used these data to derive single base-pair resolution TSSs and collected them into an R data package, ENCODEprojectCAGE. This data package is available for download from http://promshift.genereg.net/CAGEr/. To use this resource, install and load the ENCODEprojectCAGE data package and specify source="ENCODE". The dataset argument can be the name of any of the datasets available in this package. Load data(ENCODEhumanCellLinesSamples) for the list of available datasets with group and sample labels for specific samples. These have to be provided as the dataset, group and sample arguments to import selected samples. Multiple datasets can be combined by specifying them together in the dataset argument. If all samples belong to the same dataset and the same group, the dataset and group need to be specified only once; otherwise, a corresponding dataset and group have to be specified for each sample, i.e. the number of elements in dataset and group must match the number of elements in sample.
Precise TSSs are also available for zebrafish (Danio rerio) from CAGE data published by Nepal et al. for 12 developmental stages. These have been collected into the data package ZebrafishDevelopmentalCAGE, which is available for download from http://promshift.genereg.net/CAGEr/. To use this resource, install and load the ZebrafishDevelopmentalCAGE data package and specify source="ZebrafishDevelopment". Load data(ZebrafishSamples) for the list of available datasets and the group and sample labels, which have to be specified to import these data.
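
The length rule described above for dataset and group (either a single value shared by all samples, or one value per sample) can be sketched in base R as follows. The helper expand_to_samples and the labels used here are hypothetical and are not part of CAGEr; the checks that importPublicData performs internally may differ.

```r
# Hypothetical helper, not part of CAGEr: expand a per-sample argument
# (dataset or group) to exactly one value per sample.
expand_to_samples <- function(x, samples, what = "group") {
  if (length(x) == 1L) {
    # a single value is shared by all samples
    return(rep(x, length(samples)))
  }
  if (length(x) != length(samples)) {
    stop("number of elements in '", what, "' (", length(x),
         ") must be 1 or match the number of elements in 'sample' (",
         length(samples), ")")
  }
  x  # already one value per sample, paired by position
}

samples <- c("sampleA", "sampleB", "sampleC")  # placeholder sample labels

# one group applied to all three samples
expand_to_samples("groupX", samples)

# one group per sample
expand_to_samples(c("groupX", "groupY", "groupX"), samples)

# a length mismatch is rejected; the error message is captured here
result <- tryCatch(
  expand_to_samples(c("groupX", "groupY"), samples),
  error = function(e) conditionMessage(e)
)
```

Pairing values by position is what makes a call with one dataset and one group per sample unambiguous when samples come from different datasets or groups.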

Value

A CAGEset object is returned. The slots librarySizes, CTSScoordinates and tagCountMatrix are populated with the single base-pair resolution TSS data imported from the selected resource.

Author(s)

Vanja Haberle

References

Carninci et al. (2005) The Transcriptional Landscape of the Mammalian Genome, Science 309(5740):1559-1563.
Djebali et al. (2012) Landscape of transcription in human cells, Nature 488(7414):101-108.
Faulkner et al. (2009) The regulated retrotransposon transcriptome of mammalian cells, Nature Genetics 41:563-571.
Forrest et al. (2014) A promoter-level mammalian expression atlas, Nature 507(7493):462-470.
Nepal et al. (2013) Dynamic regulation of the transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis, Genome Research 23(11):1938-1950.
Suzuki et al. (2009) The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line, Nature Genetics 41:553-562.

See Also

getCTSS

Examples

### importing FANTOM5 data

# list of FANTOM5 human tissue samples
data(FANTOM5humanSamples)
head(subset(FANTOM5humanSamples, type == "tissue"))

# import selected samples
exampleCAGEset <- importPublicData(
    source = "FANTOM5", dataset = "human",
    sample = c("adipose_tissue__adult__pool1",
               "adrenal_gland__adult__pool1",
               "aorta__adult__pool1"))

exampleCAGEset


### importing FANTOM3/4 data from a data package
library(FANTOM3and4CAGE)

# list of mouse datasets available in this package
data(FANTOMmouseSamples)
unique(FANTOMmouseSamples$dataset)
head(subset(FANTOMmouseSamples, dataset == "FANTOMtissueCAGEmouse"))
head(subset(FANTOMmouseSamples, dataset == "FANTOMtimecourseCAGEmouse"))

# import selected samples from two different mouse datasets
exampleCAGEset <- importPublicData(
    source = "FANTOM3and4",
    dataset = c("FANTOMtissueCAGEmouse", "FANTOMtimecourseCAGEmouse"),
    group = c("brain", "adipogenic_induction"),
    sample = c("CCL-131_Neuro-2a_treatment_for_6hr_with_MPP+",
               "DFAT-D1_preadipocytes_2days"))

exampleCAGEset
