'BCB420.2019.GEO'

(GEO Data annotation of human genes)

R Markdown

1.1 About this package:

This package is used for mapping GEO database genes into HGNC symbol. Since GEO database is huge, this package focus on the term of "stress" and organism is "homo sapiens".

2 GEO Data semantics

GEO database stores expression data and experiment metadata.This database covers expression profile and can be queried through specific interest such as "cancer" key words and we can choose various organisms.

3 Data download

Log into https://www.ncbi.nlm.nih.gov/gds/ website. Key in "stress" in the search box. and click organism homo sapiens. Choose first one "Hypoxia effect on von Hippel Lindau-overexpressing renal cell carcinoma cells" by clicking it. On the right side downlaod section. Download DataSet SOFT File instead of DataSet Full SOFT file. save the file under sister directory of data folder. Change the file name into .txt. and save them under data folder. Here we show two dataset as example. First dataset is GDS5810.txt and second dataset is GDS5633.txt.

3.1 Data cleanup.

if (! requireNamespace("readr")) { install.packages("readr") }

setwd("/Users/Kevin/Desktop/Bioinformatics/BCB420/GEOSystemBio/data") list.files() filename_1 <- "GDS5810.txt" openfile_1<- file(filename_1, open = "r")

filename_2 <-"GDS5810_clean.txt" openfile_2<- file(filename_2,open = "w") line <-readLines(openfile_1) tx2 <-gsub(pattern="^(!.|#.)",replace="", line) #remove the comment starting #with "!","#","" writeLines(tx2,con=openfile_2) close(openfile_1) close(openfile_2)

read the file into tmp; create a tibble datatype

tmp <- readr::read_tsv(file.path("../data", "GDS5810_clean.txt"),comment="^",col_names =TRUE,skip_empty_rows=TRUE)

head(tmp)

A tibble: 6 x 10

ID_REF IDENTIFIER GSM1588481 GSM1588485 GSM1588482 GSM1588486 GSM1588483 GSM1588487 GSM1588484 1 7.90e6 chr1:5304… 7.21 7.03 7.05 7.12 7.31 7.33 7.37 2 7.90e6 chr1:6301… 5.81 5.67 5.54 5.73 5.64 5.89 5.90 3 7.90e6 OR4F17 5.75 5.82 5.58 5.69 5.62 5.85 5.94 4 7.90e6 LOC100134… 7.06 7.06 7.26 7.22 7.23 7.76 7.48 5 7.90e6 OR4F29 6.91 6.31 6.44 6.49 6.47 6.62 6.62 6 7.90e6 chr1:5649… 9.68 9.59 9.11 9.22 9.58 9.62 9.67

… with 1 more variable: GSM1588488

Let's see how many of "IDENTIFIER" showing up in this experiment, In GEO databaes, most of IDENTIFIER symbol is corresponding to HGNC symbmol;

nrow(tmp) #33297 entries in this dataset;

Let's see number of unique "IDENTIFIER" in this experiment;

GEO_Unique_IDENTIFIER <-unique(tmp$IDENTIFIER) #26804 IDENTIFIERS which are also GEO_Unique_IDENTIFIER #potential HGNC symbol have been #retrived

33297 - 26804 = 6493 of dulications; it can be confirmed with following function

sum(duplicated(tmp$IDENTIFIER)) #Indeed there are 6493 of duplication;

Let's find out where the duplicate is and remove them;

duplicated_IDENTIFIER <- tmp$IDENTIFIER[duplicated(tmp$IDENTIFIER)] duplicated_IDENTIFIER tmp[tmp$IDENTIFIER %in% duplicated_IDENTIFIER, ]

# A tibble: 7,826 x 10 ID_REF IDENTIFIER GSM1588481 GSM1588485 GSM1588482 GSM1588486 GSM1588483 1 7.90e6 OR4F17 5.75 5.82 5.58 5.69 5.62 2 7.90e6 LOC100134… 7.06 7.06 7.26 7.22 7.23 3 7.90e6 OR4F29 6.91 6.31 6.44 6.49 6.47 4 7.90e6 M37726 9.30 9.17 8.06 8.77 9.64 5 7.90e6 LOC100287… 7.45 7.16 7.10 7.08 7.05 6 7.90e6 ATAD3C 7.39 7.52 7.00 7.26 7.41 7 7.90e6 ATAD3C 8.67 9.02 8.34 8.62 8.67 8 7.90e6 SSU72 6.73 6.76 6.55 6.73 6.66 9 7.90e6 FAAP20 6.64 6.60 6.52 6.51 6.58 10 7.90e6 MORN1 6.63 6.66 6.38 6.67 6.71

… with 7,816 more rows, and 3 more variables: GSM1588487 ,

GSM1588484 , GSM1588488

Let's pick up some HGNC symbols such as "OR4F17","OR4F29","M37726" and validate target rows

tmp[tmp$IDENTIFIER %in% c("OR4F17","OR4F29","M37726"), ]

A tibble: 10 x 10

ID_REF IDENTIFIER GSM1588481 GSM1588485 GSM1588482 GSM1588486 GSM1588483 1 7.90e6 OR4F17 5.75 5.82 5.58 5.69 5.62 2 7.90e6 OR4F29 6.91 6.31 6.44 6.49 6.47 3 7.90e6 M37726 9.30 9.17 8.06 8.77 9.64 4 7.91e6 OR4F29 6.91 6.31 6.44 6.49 6.47 5 7.99e6 OR4F17 5.75 5.82 5.58 5.69 5.62 6 8.02e6 OR4F17 5.75 5.82 5.58 5.69 5.62 7 8.04e6 M37726 9.48 9.40 8.33 9.00 9.76 8 8.11e6 OR4F29 6.91 6.31 6.44 6.49 6.47 9 8.15e6 OR4F29 6.89 6.25 6.44 6.47 6.48 10 8.17e6 M37726 9.30 9.17 8.06 8.77 9.64

… with 3 more variables: GSM1588487 , GSM1588484 , GSM1588488

"OR4F17" symbols appears three times at # 1 and # 5 and #6; "OR4F29" appears ## at #2 and #4

remove target row

tmp <- tmp[ ! (tmp$IDENTIFIER %in% duplicated_IDENTIFIER), ]

check result

any(duplicated(tmp$IDENTIFIER)) now result is FALSE;

nrow(tmp) # total of 25471 unique IDENTIFERs has been cleaned 33297 - 25471 = # 7826 of duplicated entries has been removed;

Second scenario, there is some Identifier that does not appear in the HGNC symbol

sum(is.na(tmp$IDENTIFIER)) # there are nothing identifier shows as na; sum(tmp$IDENTIFIER =="") # there are nothing space value in IDENTIFIER;

Map IDENTIFIERS TO HGNC symbol; open a "Mart" object. This code is prepared from Dr.Boris Steipe BCH420 String Package.

myMart <- biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl")

biomart_HGNC_Table <- biomaRt::getBM(filters = "hgnc_symbol", attributes = "hgnc_symbol",
values =tmp$IDENTIFIER, mart = myMart)

head(biomart_HGNC_Table)

hgnc_symbol

1 AADACL3 2 AADACL4 3 ACADM 4 ACOT11 5 ACTL8 6 ACTRT2

nrow(biomart_HGNC_Table) #there are 17798 of Identifiers have been retrived nrow(tmp)

there are 25471 unique IDENTIFIERs, that means 25471 - 17798 = 7673 IDENTIFIERs are not HGNC symbol.Let's find out what they are

convert type "list" to character

biomart_HGNC_Table_IDENTIFIER <- biomart_HGNC_Table$hgnc_symbol

unmapped_tmp <- tmp[ ! (tmp$IDENTIFIER %in% biomart_HGNC_Table_IDENTIFIER),]

head(unmapped_tmp)

A tibble: 6 x 10

ID_REF IDENTIFIER GSM1588481 GSM1588485 GSM1588482 GSM1588486 GSM1588483 1 7.90e6 chr1:5304… 7.21 7.03 7.05 7.12 7.31 2 7.90e6 chr1:6301… 5.81 5.67 5.54 5.73 5.64 3 7.90e6 chr1:5649… 9.68 9.59 9.11 9.22 9.58 4 7.90e6 chr1:5660… 9.31 8.77 8.43 9.68 9.76 5 7.90e6 chr1:5680… 6.46 6.34 5.82 6.26 6.36 6 7.90e6 cc 7.39 7.59 7.29 7.61 7.48

… with 3 more variables: GSM1588487 , GSM1588484 , GSM1588488

nrow(unmapped_tmp) #total of 7673 unmapped symbol;

view(unmapped_tmp)

these IDENTIFIERs has three main patterns.

First pattern with starts with "chr" such as "chr1:53049-54936"; second pattern is like gene symbol but not in the CURRENT HGNC table such as "FAM213B"(could be previous HGNC symbol);Third patten starts with "LOC"" such as "LOC644083".

let's place them into three catgories;

unmapped_tmp_chr_location <- unmapped_tmp$IDENTIFIER[grepl("chr",unmapped_tmp$IDENTIFIER)] length(unmapped_tmp_chr_location) #there are total of 6381 entries

unmapped_tmp_LOC <- unmapped_tmp$IDENTIFIER[grepl("LOC",unmapped_tmp$IDENTIFIER)] length(unmapped_tmp_LOC) #there are total of 357 entries

Accoridng to https://www.ncbi.nlm.nih.gov/books/NBK3840/ secion of Conventions Symbol begins with LOC. It means published symbol s not available. For instance, LOC644083 IDENTIFIER, 644083 is GeneID which we can be used for query purpose. Since HGNC symbol is not avaialble, we are not able to map these genes.

unmapped_tmp_previous_HGNC <-unmapped_tmp$IDENTIFIER[!(grepl("chr",unmapped_tmp$IDENTIFIER))] length(unmapped_tmp_previous_HGNC)

unmapped_tmp_previous_HGNC <-unmapped_tmp$IDENTIFIER[!(grepl("chr|LOC",unmapped_tmp$IDENTIFIER))] length(unmapped_tmp_previous_HGNC) #total of 935 symbols;

6381 + 357 + 935 = 7673 which is equal to number of unmapped_tmp;

Let's map IDENTIFIERs to HGNC symbol

GEO_IDENTIFIER <- tmp$IDENTIFIER

names(GEO_IDENTIFIER) <- biomart_HGNC_Table head(GEO_IDENTIFIER)

GEO_MAP_TO_HGNC <- tmp$IDENTIFIER GEO_MAP_TO_HGNC names(GEO_MAP_TO_HGNC) <- biomart_HGNC_Table_IDENTIFIER

GEO_MAP_TO_HGNC



KevinJianLin/GEOSystemBio documentation built on May 16, 2019, 9:13 a.m.