BCB420.2019.CORUM
If any of this information is ambiguous, inaccurate, outdated, or incomplete, please check the most recent version of the package on GitHub and if the problem has not already been addressed, please file an issue.
This pakage is designed using (hyginn/BCB420.2019.STRING) and (https://github.com/hyginn/rpt) as an template. The 2019.STRING package is written by Dr. Boris Steipe.
This package is designed to download network data from the CORUM database, map annotated human protein complex data to HGNC, and provide examples for statistic computation of the databases.
The package serves dual duty, as an RStudio project, as well as an R package that can be installed. Package checks pass without errors, warnings, or notes.
--BCB420.2019.CORUM/
|__.gitignore
|__.Rbuildignore
|__BCB420.2019.CORUM.Rproj
|__DESCRIPTION
|__dev/
|__toBrowser.R # display .md files in your browser
|__inst/
|__extdata/
|__symbolToCom.RData # Tool for finding compleses encoded by a gene
|__LICENSE
|__NAMESPACE
|__R/
|__zzz.R # Welcoming file
|__script.R # A raw script for this work flow
|__README.md # this file
CORUM is a database of annotated mammalian protein complexes. The main mammalian organisms are human(67%), mouse(15%), and rat(10%). There are data for 4274 protein complexes, which are encoded by 4473 genes (Giurgiu et al. 2019). All information is obtained from individual experiment. Each complex also has information crossed referenced from other datasets. All CORUM data is available under a CC-BY 4.0 license.
Available annotations and information are:
Annotation of splice variant complexes is included for functional properties since some complex isoform is tissue specific. This can be a great implication of the protein functions.
In addition, a CORUM tool using IntAct database to predict protein- protein interactions within protein complexes and a visualization tool (Cytoscape written in javascript) of the interactions can be fouond at https://www.ebi.ac.uk/intact/. This tool is proved for reliable predictions(Giurgiu et al. 2019).
CORUM integrates and cross-references data of mammalian protein complexes from experiment, Uniport, GO and other literature resouces. By demonstrating the importance of protein complex, it provide information for bioinformatics and biomedical research (especially cancer research). To build a larger datasets, input of other researchers are greatly welcoming. Please contact CORUM at andreas.ruepp@helmholtz-muenchen.de (Giurgiu et al. 2019).
To download the source data from CORUM:
This workflow uses complete complexes data. Download the following data file:
allComplexes.txt.zip
(752kb) complete complexes data;
Uncompress the file and place it into a sister directory of your working directory which is called data
. (It should be reachable with file.path("..", "data")
).
CORUM database do have HGNC symbols for genes that encodes for each complex. However, these genes might be outdated, so the following workflow will update the outdated gene symbols in the datasets
To begin, we need to load the HGNC reference data:
myURL <- paste0("https://github.com/hyginn/",
"BCB420-2019-resources/blob/master/HGNC.RData?raw=true")
load(url(myURL)) # loads HGNC data frame
# Read data and check the information we have
getwd()
tmp <- readr::read_delim(file.path("../data", "allComplexes.txt"),
delim = "\t")
nrow(tmp) # 4274
# check datasets
colnames(tmp)
# [1] "ComplexID"
# [2] "ComplexName"
# [3] "Organism"
# [4] "Synonyms"
# [5] "Cell line"
# [6] "subunits(UniProt IDs)"
# [7] "subunits(Entrez IDs)"
# [8] "Protein complex purification method"
# [9] "GO ID"
# [10] "GO description"
# [11] "FunCat ID"
# [12] "FunCat description"
# [13] "subunits(Gene name)"
# [14] "Subunits comment"
# [15] "PubMed ID"
# [16] "Complex comment"
# [17] "Disease comment"
# [18] "SWISSPROT organism"
# [19] "subunits(Gene name syn)"
# [20] "subunits(Protein name)"
# filter human data
human_data <- tmp[tmp$Organism == 'Human', ]
human_data$Organism == 'Human'
# Get basic information of the dataset
nrow(human_data) # 2916
# Check for missing variables
sum(is.na(genes))
sum(genes == "") #0
sum(genes == "N/A") #0
The gene symbols encodes for one protein complex is quoted in one string and seperated by ; or ;;. To get a more accurate check of the missing data, I obtained a list of genes:
genes <- c()
for (i in seq_len(nrow(human_data))){
geneName <- unlist(strsplit(human_data$`subunits(Gene name)`[i], ";"))
genes <- c(geneName, genes)
}
sel <- ( ! (genes %in% HGNC$sym))
# we found that there are 81 unique genes that are outdated
length(genes[ sel ] ) # 230
length( unique(genes[ sel ])) # 81
# Check which genes are outdated
for (gene in unique(genes[ sel ]) ){
iPrev <- grep(gene, HGNC$prev)[1]
if (length(iPrev) == 1){
print(HGNC$sym[iPrev])
}
}
# Replace the oudated genes with the new version
# These code uses Dr. Boris's BCB420.2019.STRING as a reference
for (i in seq_len(nrow(human_data))){
geneName <- unlist(strsplit(human_data$`subunits(Gene name)`[i], ";"))
for (gene in geneName){
iPrev <- grep(gene, HGNC$prev)[1]
if ((gene != "") & (length(iPrev) == 1) & (! is.na(iPrev))) {
newgene <- gsub(gene, HGNC$sym[iPrev], human_data$`subunits(Gene name)`[i])
human_data$`subunits(Gene name)`[i] <- newgene
count <- count + 1
}
}
}
```
#### 4.5 Final validation
Validation: The validation shows no change which means that something is wrong. I need to check this and modify the code.
```R
finalgenes <- c()
for (i in seq_len(nrow(human_data))){
geneName <- unlist(strsplit(human_data$`subunits(Gene name)`[i], ";"))
finalgenes <- c(geneName, finalgenes)
}
loc <- which(finalgenes == "")
finalgenes <- finalgenes[-loc]
sum(finalgene == "N/A")
sum(! is.na(finalgene)) * 100 / length(finalgene)
sel <- ( ! (finalgenes %in% HGNC$sym))
# The performance is really bad, I will have to check it again
length(finalgenes[ sel ] ) # 230
length( unique(finalgenes[ sel ])) # 81
Gereation of RData file which comtains a list with symbols and the protein complex the symbol encodes for.
elements <- 1
symbolList <- list()
for (i in seq_len(nrow(human_data))){
geneName <- unlist(strsplit(human_data$`subunits(Gene name)`[i], ";"))
for (gene in geneName){
if (gene != ""){
if (gene %in% names(symbolList)){
sel <- which(names(symbolList) == gene)
symbolList[[sel]] <- c(symbolList[[sel]], human_data$ComplexName[i])
}else{
symbolList[elements] <- c(human_data$ComplexName[i])
names(symbolList)[elements] <- gene
elements <- elements + 1
}
}
}
}
# Save data
save(symbolList, file = file.path("inst", "extdata", "symbolToCom.RData"))
# Load data
load(file = file.path("inst", "extdata", "symbolToCom.RData"))
# To check what complex a gene encode for, use symbolList[gene symbol]
symbolList["CDKN1A"]
# [1] "CDKN1A"
# $`CDKN1A`
# [1] "Cell cycle kinase complex CDC2"
# [2] "Cell cycle kinase complex CDK2"
# [3] "Cell cycle kinase complex CDK4"
Thanks to Simon Kågedal's very useful PubMed to APA reference tool.
Thank you professor Boris Steipe for providing the R package templates, project templates and useful resources and tools for us.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.