This package describes the workflow of how to download file data from Encode, how to map IDs to HGNC IDs and statistics of data frame.
in this project :
--BCB420.2019.encode/
|__.Rbuildignore
|__.gitignore
|__BCB420.2019.encode.Rproj
|__DESCRIPTION
|__Ggmaplot-encode.png
|__Ggmaplot-hgnc.png
|__LICENSE
|__NAMESPACE
|__R/
|__lseq.R
|__zzz.R
|__README 2.md
|__README.md
|__data/
|__ENCFF768GAH.tsv
|__Encode_hgyc.RData
|__dev/
|__functionTemplate.R
|__rptTwee.R
|__encode model.jpg
|__encode.jpg
|__inst/
|__extdata/
|__ensp2sym.RData
|__test_lseq.dat
|__scripts/
|__recoverIDs.R
|__scriptTemplate.R
|__man/
|__lseq.Rd
|__tests/
|__testthat.R
|__testthat/
|__helper-functions.R
|__test_lseq.R
The Encyclopedia of DNA Elements (ENCODE) is a public research project whose purpose is to identify functional elements in the human genome. Right now it has extended to other organisms like mouse and fly. It analyzes data in an integrative fashion. Registry of candidate regulatory elements is one of the most important annotations in the integrative level. One achievement from this project shows that some "useless" elements in genomes are not as useless as we thought before. we can look at the enhancer region part and use Yufei's GTRD data to annotate the specific TFs that bind there, and identify co-regulated genes.
The Metedata organization contains six accessions to be reused in experimental protocol and computational analysis - An assay: the replicates will be performed using the same method, on the same kind of biosample, and investigating the same target. Assays may contain one or more biological replicates. - A biosample: An accessioned biosample refers to a tube or sample of that biological material that is being used. - A strain or donor: Every strain (for model organisms) and donor (for humans) is given a donor accession. This accession allows multiple samples obtained from a single donor to be grouped together. - An antibody lot:Each antibody lot(unique) is also associated with characterizations for its target in a species. - A library: A unique library is accessioned to ensure the correct files are associated with the nucleic acid material that has been created from the biosample. - A file: Each data file is accessioned. This accession is used as the file name, along with its file format as an extension.
There are lots of ways to download source data from Encode and you can even access those data on other database like USUC or GO,etc.Here I introduce one of them.
file.txt
which contains a list of URLs to a file containing all experimental metedata and links to download the data.This assay uses ensembl ID as gene ID. To associate with HGNC symbol, we can map ENSG ID to HGNC symbols by biomaRt. However,there are some different ENSG IDs that can be mapped to the same HGNC symbol, because same region may have several versions of genome. We need to be careful in the mapping and notice that not all ENSG IDs can be mapped uniquely.
Preparations: - make sure the required packages are installed and functions are sourced:
readr provides functions to read data which are particularly suitable for large datasets.
if (! requireNamespace("readr")) {
install.packages("readr")
}
biomaRt is a Bioconductor package that implements the RESTful API of biomart, the annotation framwork for model organism genomes at the EBI. It is a Bioconductor package, and as such it needs to be loaded via the BiocManager,
if (! requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (! requireNamespace("biomaRt", quietly = TRUE)) {
BiocManager::install("biomaRt")
}
ggpubr package provides function ggmaplot for us to plot of log2 fold change vs mean change for each gene.
if (! requireNamespace("ggpubr")) {
install.packages("ggpubr")
library(ggpubr)
}
Next we source a utility function that we will use later, for mapping ENSG IDs to gene symbols if the mapping can not be achieved directly. I overwrote this file but basic structure is same as recoverIDs.R in STRINGcoverIDs.
source("inst/scripts/recoverIDs.R")
Finally we fetch the HGNC reference data from GitHub.
myURL <- paste0("https://github.com/hyginn/",
"BCB420-2019-resources/blob/master/HGNC.RData?raw=true")
load(url(myURL)) # loads HGNC data frame
# Read the differential data, there are several columns: id , fold change, log fold change and pvalues,etc.
tmp <- read.delim("/Users/yuhanzhang/Downloads/BCB420.2019.encode/data/ENCFF768GAH.tsv", header=TRUE, sep="\t")
head(tmp)
rowID id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 1 ENSG00000000003.10 2693.0461 2627.1502 2758.9420 1.0501653 0.07061643 3.285189e-01 6.047569e-01
2 2 ENSG00000000005.5 0.0000 0.0000 0.0000 NA NA NA NA
3 3 ENSG00000000419.8 1814.2725 1488.1768 2140.3682 1.4382486 0.52431311 3.483299e-10 2.346446e-09
4 4 ENSG00000000457.9 239.4773 305.9247 173.0298 0.5655962 -0.82215578 9.737404e-07 4.775592e-06
5 5 ENSG00000000460.12 354.1951 234.6976 473.6925 2.0183103 1.01314798 1.714617e-12 1.408435e-11
6 6 ENSG00000000938.8 0.0000 0.0000 0.0000 NA NA NA NA
# id are Ensembl Gene id, each of them would be mapped to hgnc symbol.
# fold change, log2foldchange and adjusted pvalue are the results of difference gene expression.
# some genes does not have values(NA), which need to be deleted
tmp <- na.omit(tmp)
# delete suffix in ids
tmp$id <-gsub(".[0-9]+$","", tmp$id)
uENSG <- unique(tmp$id) ## 25645 unique ids
# statistics of data frame
nrow(tmp) 25645
# Map ENSG to HGNC symbols:
myMart <- biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl")
tmp2 <- biomaRt::getBM(filters = "ensembl_gene_id",
+ attributes = c("ensembl_gene_id",
+ "hgnc_symbol"),
+ values = uENSG,
+ mart = myMart)
> head(tmp2)
ensembl_gene_id hgnc_symbol
1 ENSG00000000003 TSPAN6
2 ENSG00000000419 DPM1
3 ENSG00000000457 SCYL3
4 ENSG00000000460 C1orf112
5 ENSG00000000971 CFH
6 ENSG00000001036 FUCA2
# defining mapping tool
ensp2sym <- tmp2$hgnc_symbol
names(ensp2sym) <- tmp2$ensembl_gene_id
> head(ensp2sym)
ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000971 ENSG00000001036
"TSPAN6" "DPM1" "SCYL3" "C1orf112" "CFH" "FUCA2"
There are three possible problems we may encounter
- More than one value could be returned. The ID appears more than once in tmp2$ensembl_gene_id
, with different mapped symbols.
sum(duplicated(tmp2$ensembl_gene_id)) ## 0 Luckily we don't have this problem.
uENSG
, but it does not appear in tmp2$ensembl_gene_id
:sum(! (uENSG) %in% tmp2$ensembl_gene_id)
[1] 1125
tmp2$ensembl_gene_id
, but there is no symbol in tmp2$hgnc_symbol
.sum(is.na(ensp2sym)) # 0, we don't need to consider this part.
Now we fix these problems.
First, we add the symbols that were not returned by biomaRt to the map. They are present in uENSG
, but not in ensp2sym
:
sel <- ! (uENSG %in% names(ensp2sym))
x <- rep(NA, sum( sel))
names(x) <- uENSG[ sel ]
# confirm uniqueness
any(duplicated(c(names(x), names(ensp2sym)))) # FALSE
# concatenate the two vectors
ensp2sym <- c(ensp2sym, x)
# confirm
all(uENSG %in% names(ensp2sym)) # TRUE
Next we set the symbols for which only an empty string was returned to NA:
sel <- which(ensp2sym == "") # 4754 elements
ensp2sym[head(sel)] # before ...
ensp2sym[sel] <- NA
ensp2sym[head(sel)] # ... after
# Do we still have all ENSG IDs accounted for?
all( uENSG %in% names(ensp2sym)) # TRUE
A function for using biomaRt for more detailed mapping is in the file inst/scripts/recoverIDs.R
. We have loaded it previously, and use it on all elements of ensp2sym that are NA.
# How many NAs are there in "ensp2sym" column?
sum(is.na(ensp2sym)) # 5879
# subset the ENSG IDs
unmappedENSG <- names(ensp2sym)[is.na(ensp2sym)]
# use our function recoverIDs() to try and map the unmapped ensp IDs
# to symboils via other cross-references
recoveredENSG <- recoverIDs(unmappedENSG)
# how many did we find
nrow(recoveredENSG) # 5. Not much, but it's honest work.
# add the recovered symbols to ensp2sym
ensp2sym[recoveredENSG$ensp] <- recoveredENSG$sym
# validate:
sum(is.na(ensp2sym)) # 5874 - 5 less than 5879
We now have each unique ENSG IDs represented once in our mapping table. But We need to compare the symbols to our reference data and try to fix any problem like incorrect symbols. Symbols that do not appear in the reference table will also be set to NA.
# are all symbols present in the reference table?
sel <- ( ! (ensp2sym %in% HGNC$sym)) & ( ! (is.na(ensp2sym)))
length( ensp2sym[ sel ] ) # 2341 unknown
# put these symbols in a new dataframe
unkSym <- data.frame(unk = ensp2sym[ sel ],
new = NA,
stringsAsFactors = FALSE)
# Inspect:
# several of these are formatted like "TNFSF12-TNFSF13" or "TMED7-TICAM2".
# This looks like biomaRt concatenated symbol names.
grep("TNFSF12", HGNC$sym) # 23984: TNFSF12
grep("TNFSF13", HGNC$sym) # 23985 23986: TNFSF13 and TNFSF13B
grep("TMED7", HGNC$sym) # 23630: TMED7
grep("TICAM2", HGNC$sym) # 23494: TICAM2
# It's not clear why this happened. We will take a conservative approach
# and not make assumptions which of the two symbols is the correct one,
# i.e. we will leave these symbols as NA
# grep() for the presence of the symbols in either HGNC$prev or
# HGNC$synonym. If either is found, that symbol replaces NA in unkSym$new
for (i in seq_len(nrow(unkSym))) {
iPrev <- grep(unkSym$unk[i], HGNC$prev)[1] # take No. 1 if there are several
if (length(iPrev) == 1) {
unkSym$new[i] <- HGNC$sym[iPrev]
} else {
iSynonym <- which(grep(unkSym$unk[i], HGNC$synonym))[1]
if (length(iSynonym) == 1) {
unkSym$new[i] <- HGNC$sym[iSynonym]
}
}
}
# How many did we find?
sum(! is.na(unkSym$new)) # 36
# We add the contents of unkSym$new back into ensp2sym. This way, the
# newly mapped symbols are updated, and the old symbols that did not
# map are set to NA.
ensp2sym[rownames(unkSym)] <- unkSym$new
Validation and statistics of our mapping tool:
# do we now have all ENSG IDs mapped?
all(uENSG %in% names(ensp2sym)) # TRUE
# how many symbols did we find?
sum(! is.na(ensp2sym)) # 17466
# (in %)
sum(! is.na(ensp2sym)) * 100 / length(ensp2sym) # 68.0 %
# are all symbols current in our reference table?
all(ensp2sym[! is.na(ensp2sym)] %in% HGNC$sym) # TRUE
# Done.
# This concludes construction of our mapping tool.
# Save the map:
save(ensp2sym, file = file.path("inst", "extdata", "ensp2sym.RData"))
# From an RStudio project, the file can be loaded with
load(file = file.path("inst", "extdata", "ensp2sym.RData"))
Finally, we map ENSG ID to HGNC symbols, using the mapping tool ensp2sym
tmp$id <-ensp2sym[tmp$id]
hgyc <- tmp
# Done.
# Save result
save(hgyc, file = file.path(".", "data", "Encode_hgyc.RData"))
xSet <- c("AMBRA1", "ATG14", "ATP2A1", "ATP2A2", "ATP2A3", "BECN1", "BECN2",
"BIRC6", "BLOC1S1", "BLOC1S2", "BORCS5", "BORCS6", "BORCS7",
"BORCS8", "CACNA1A", "CALCOCO2", "CTTN", "DCTN1", "EPG5", "GABARAP",
"GABARAPL1", "GABARAPL2", "HDAC6", "HSPB8", "INPP5E", "IRGM",
"KXD1", "LAMP1", "LAMP2", "LAMP3", "LAMP5", "MAP1LC3A", "MAP1LC3B",
"MAP1LC3C", "MGRN1", "MYO1C", "MYO6", "NAPA", "NSF", "OPTN",
"OSBPL1A", "PI4K2A", "PIK3C3", "PLEKHM1", "PSEN1", "RAB20", "RAB21",
"RAB29", "RAB34", "RAB39A", "RAB7A", "RAB7B", "RPTOR", "RUBCN",
"RUBCNL", "SNAP29", "SNAP47", "SNAPIN", "SPG11", "STX17", "STX6",
"SYT7", "TARDBP", "TFEB", "TGM2", "TIFA", "TMEM175", "TOM1",
"TPCN1", "TPCN2", "TPPP", "TXNIP", "UVRAG", "VAMP3", "VAMP7",
"VAMP8", "VAPA", "VPS11", "VPS16", "VPS18", "VPS33A", "VPS39",
"VPS41", "VTI1B", "YKT6")
myMart <- biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl")
exam_transcript_data <- biomaRt::getBM(filters = "ensembl_gene_id",
+ attributes = c("ensembl_gene_id",
+ "hgnc_symbol"),
+ values = xSet,
+ mart = myMart)
A resulting MA-plot for means and log fold change would be
ggmaplot(tmp, main = expression("Group 1" %->% "Group 2"),
fdr = 0.05, fc = 2, size = 0.4,
palette = c("#B31B21", "#1465AC", "darkgray"),
genenames = as.vector(tmp$id),
legend = "top", top = 20,
font.label = c("bold", 11),
font.legend = "bold",
font.main = "bold",
ggtheme = ggplot2::theme_minimal())
Example codes for biomaRt and mapping tool from ENSG ID to HGNC symbol was taken from STRING
Encode pictures and information are captured from official website Encode
Further reading for Annotations of Encode
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.