Single cell sequencing technologies have advanced the research to examine the genomic alterations from individual cells. Currently a huge amount of single cell genetic data have been deposited in GEO and the number is keeping on exponentially increasing. These single cell high-throughput sequencing data come from different tissues, health status, disease stage and drug treatments. Various machine learning algorithms, such as canonical correlation analysis and reciprocal principal component analysis, have been adapted in single cell data analysis. The implementation of such algorithms correct the potential batch effects and make the data integration between different datasets feasible. SingleGEO is developed to efficiently query, download high-throughput single cell sequencing data from GEO for further integrative analysis. The program will initially download GEO meta database into local drive and implement a fast and light-weight searching engine to query the database. By feeding the program with the keywords relevant to the user's research interests, the query results with specific meta information will be returned for further investigation. Once the targeted GSE IDs have been selected, the single cell sequencing data deposited in GEO supplement can be downloaded into local derive.
knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, message = FALSE, comment = "#>", fig.width=6, fig.height=6, fig.align = "center", echo=TRUE )
Installing singleGEO from GitHub
library(devtools) devtools::install_github("yuanqingyan/singleGEO")
SingleGEO uses sql to query the database. To perform the database searching, GEO meta database should be first downloaded and save in the local drive. The name of the geo meta dataset usually is GEOmetadb.sqlite. If the meta database has been already downloaded and not the in the current working folder, specify the path of this file. In this vignett, a demo file is used for demonstration purpose.
library(singleGEO) ##Download the GEO meta database. Remove "Demo=TRUE" or set "Demo=FALSE" when you start your own project. MyDataBase<-GetGeoMetaDatabase(Sqlfile=NULL,Demo=TRUE)
To query the database, the organisms and a list of keywords should be provided. For the organism, one or multiple organisms can be provided. By providing a list of keywords, the query will return the results with the keywords described in GEO meta data. One or multiple keywords can be provided. For each keyword, to overcome the ambiguity, multiple options can be provided. For example, "pulmonary" could be an alternative word used for lung study. The program will searching the database by recognizing the different options for each key point. The following code is to search the potential available GEO datasets, assuming we are interested in scRNAseq data in lung study from both human and mouse.
# Will study mouse and human data MyOrganism<-c("Mus musculus","Homo sapiens") # Will study lung. Some study may use "pulmonary" instead of "lung", so two options are specified search_kw1<-c("lung","pulmonary") KeyList<-list(kw1=search_kw1) MySearch<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=MyOrganism, InputSearchList=KeyList) colnames(MySearch) unique(MySearch$gse)
head(MySearch[,c("title","gse")])
head(MySearch[,c("title","gse")])
MySearch[MySearch$gse=="GSE158127","summary"][1]
MySearch[MySearch$gse=="GSE158127","summary"][1]
SingleGEO can query scRNAseq and scATACseq datasets simultaneously. The following code searches the datasets which are from either scRANseq or scATACseq and relevant to human adenocarcinoma disease.
Organism_hs<-c("Homo sapiens") Data_platform<-c("scRNAseq","scATACseq") search_kw_ade<-c("adenocarcinoma") KeyList_ade<-list(kw_ad=search_kw_ade) MySearch_ade<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=Organism_hs, InputSearchList=KeyList_ade, Platform=Data_platform) unique(MySearch_ade$gse)
Title of one scATACseq study
MySearch_ade[MySearch_ade$gse=="GSE142285","title"][1]
MySearch_ade[MySearch_ade$gse=="GSE142285","title"][1]
For some studies, a combination of both scRNAseq and scATACseq are performed. SingleGEO can identify such datasets by intersecting the query results. The following code is to search for mouse brain study with both scRNAseq and scATACseq.
MyOrganism_ms<-c("Mus musculus") search_brain<-c("brain") KeyList_brain<-list(kw_brain=search_brain) MySearch_brain_scRNA<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=MyOrganism_ms, InputSearchList=KeyList_brain, Platform="scRNAseq") MySearch_brain_scATAC<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=MyOrganism_ms, InputSearchList=KeyList_brain, Platform="scATACseq") (Both_RNA_ATAC<-intersect(unique(MySearch_brain_scRNA$gse), unique(MySearch_brain_scATAC$gse)))
MySearch_brain_scRNA[MySearch_brain_scRNA$gse==Both_RNA_ATAC[1],"overall_design"][1]
MySearch_brain_scRNA[MySearch_brain_scRNA$gse==Both_RNA_ATAC[1],"overall_design"][1]
To obtain more specific studies, multiple keywords should be provided. The following codes can be used to query scRNAseq datasets in mouse, and such datasets should focus on lung fibrosis and new or novel subtype of cell population should be reported.
search_kw2<-c("fibrosis") search_kw3<-c("new","novel") search_kw4<-c("subtype","subpopulation") KeyList_new<-list(kw1=search_kw1, kw2=search_kw2, kw3=search_kw3, kw4=search_kw4) MySearch_new<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=MyOrganism_ms, InputSearchList=KeyList_new) unique(MySearch_new$gse)
Title of this gse dataset
MySearch_new[1,"title"]
MySearch_new[1,"title"]
Summary of this study
MySearch_new$summary[1]
MySearch_new$summary[1]
Design of this study ```{r,eval=FALSE MySearch_new$overall_design[1]
```r MySearch_new$overall_design[1]
We can also obtain the gsm information of each study.
gsm_info<-Get_GSMFromGSE(GeoDataBase=MyDataBase, GSE.ID=MySearch_new$gse[1]) colnames(gsm_info)
Gsm title and id
gsm_info[,c("title","gsm")]
gsm_info[,c("title","gsm")]
SingleGEO can be used to query different datasets for integrative analysis. Suppose we are also doing a fas-signaling project in lung fibrosis (fas-signaling has been reported to be up-regulated in lung epithelial cells from patients with idiopathic pulmonary fibrosis) and wondering whether there are some scRNAseq datasets available in mouse. If we could find some, we would like to integrate them with the previous searching result (novel cell type in mouse lung fibrosis) and evaluate whether fas-signaling pathway affect the novel cell type (this example will be described in data integration part).
search_kw5<-c("fas-signaling","fas signaling","fas pathway","fas-pathway") KeyList_fas<-list(kw1=search_kw1, kw2=search_kw2, kw5=search_kw5) MySearch_fas<-Get_Keyword_Meta(GeoDataBase=MyDataBase, Organism=MyOrganism_ms, InputSearchList=KeyList_fas) unique(MySearch_fas$gse)
Title of this study
MySearch_fas[1,"title"]
MySearch_fas[1,"title"]
Design of this study ```{r,eval=FALSE MySearch_fas$overall_design[1]
```r MySearch_fas$overall_design[1]
Summary of this study
MySearch_fas$summary[1]
MySearch_fas$summary[1]
Available supplement data
MySearch_fas$supplementary_file[1]
MySearch_fas$supplementary_file[1]
By providing the GSE IDs, the single cell high-throughput sequencing data deposited in the supplement will be downloaded for further analysis.
myGeoID<-c("GSE140032") ##Set DecompressFile=TRUE if you want to download and decompress the data in the same time DownloadFileInfo<-Get_Geo_Data(DownGSEID=myGeoID,DecompressFile=FALSE) DownloadFileInfo ExtractFile(ExtractID=myGeoID,DecompressType=c("tar","gz")) list.files("GSE140032")
Different groups have different preference in storing and uploading the data to GEO. If a Seurat object is provided by the submitter, the data can be read into Seurat directly for further analysis. For hdf5 format file, R packages of "rhdf5" and/or "SeuratDisk" could be helpful for data access. For the study with only raw count data and/or normalized data provided, SingleGEO provides "MakeSeuObj_FromRawRNAData" to quickly make the Seurat object.
SingleGEO can make the Seurat object from the raw data it downloaded by running "MakeSeuObj_FromRawRNAData". Before running "MakeSeuObj_FromRawRNAData", a list of raw data should be constructed. Each element of the list is for each individual sample and the name of the element corresponds to the name of the sample. "MakeSeuObj_FromRawRNAData" performs the scRNAseq data quality control by filtering out cells with potential doublets, dead cells or empty droplets. If all cells want to be kept or the cells have been filtered, set a big number of maxFeature,maxCount,maxMT and small number of minFeature,minCount. SingleGEO also conducts dimension reduction and makes the Seurat object for the downstream analysis.
###The following code generate an example dataset from GSE134174 (The same data has been built and can be loaded by data(testData_GSE134174)). Note that only 4 samples with 800 cells were selected for illustration purpose. # Get_Geo_Data(DownGSEID="GSE134174",DecompressFile="TRUE",DecompressType="All") # selPatient<-c("T101","T85","T153","T164") # raw_GSE134174<-read.delim("./GSE134174/GSE134174_Processed_invivo_raw.txt",header=T,sep="\t") # meta_GSE134174<-read.delim("./GSE134174/GSE134174_Processed_invivo_metadata.txt",header=T,sep="\t") # meta_sel0<-meta_GSE134174[meta_GSE134174$Donor %in% selPatient,] # raw_sel0<-raw_GSE134174[,colnames(raw_GSE134174) %in% meta_sel0$Cell] # set.seed(12345);sampleIndex<-sample(1:nrow(meta_sel0),size=800) # meta_sel<-meta_sel0[sampleIndex,];raw_sel<-raw_sel0[,paste(meta_sel$Cell)] # raw_sel<-raw_sel[rowSums(raw_sel)>0,] # testData_GSE134174<-list(TwoRawData=raw_sel,TwoMetaData=meta_sel) data(testData_GSE134174) test_meta<-testData_GSE134174$TwoMetaData test_data<-testData_GSE134174$TwoRawData list_GSE134174<-Splitdata_MakeDataList(InputData=test_data,Group=test_meta$Donor) ##The following code setting a big number of maxFeature,maxCount,maxMT and small number ##of minFeature,minCount will keep all the cells seu_GSE134174<-MakeSeuObj_FromRawRNAData(RawList=list_GSE134174, GSE.ID="GSE134174", MinFeature=1, MaxFeature=750000, MinCount=1, MaxCount=4000000, MaxMT=100) names(seu_GSE134174) Seurat::DimPlot(seu_GSE134174[[1]],repel = TRUE) ####Use sctransform normalization and make Seurat object seu_GSE134174_sct<-MakeSeuObj_FromRawRNAData(RawList=list_GSE134174, GSE.ID="GSE134174", MinFeature=1, MaxFeature=750000, MinCount=1, MaxCount=4000000, MaxMT=100, Norm.method = "sct") names(seu_GSE134174_sct)
Multiple datasets can be integrated for the analysis. These datasets could be obtained from one single GSE ID or multiple GSE IDs. The following examples illustrate the integrative analysis of example datasets from one single GSE ID (GSE134174). For datasets from different studies, please double-check the potential technical discrepancy. For example, different studies can use different genomic build which will lead to gene symbol difference.
Int_GSE134174<-SeuObj_integration(Object.list=seu_GSE134174) ###Add additional meta data row.names(test_meta)<-test_meta$Cell test_meta<-test_meta[row.names(Int_GSE134174@meta.data),] Int_GSE134174<-Seurat::AddMetaData(object=Int_GSE134174,metadata =as.data.frame(test_meta)) Seurat::DimPlot(Int_GSE134174,group.by="cluster_ident",label=TRUE,repel = TRUE)+ NoLegend() #####Data integration with one single dataset normalized with sctransform method Int_GSE134174_sct<-SeuObj_integration(Object.list=seu_GSE134174_sct,Frow.which.Norm="sct") test_meta_sct<-test_meta[row.names(Int_GSE134174_sct@meta.data),] Int_GSE134174_sct<-Seurat::AddMetaData(object=Int_GSE134174_sct,metadata =as.data.frame(test_meta_sct)) Seurat::DimPlot(Int_GSE134174_sct,group.by="cluster_ident",label=TRUE,repel = TRUE)+ NoLegend()
Integrative analysis of datasets from multiple GSE IDs could be powerful to gain additional information. As previous query result, GSE104154 dataset studies the mouse pulmonary fibrotic mesenchymal cells and the authors identify a new subtype of cells (Pdgfrb_hi) in fibrotic stage(Xie,T. (2018)). Another dataset, GSE161648, studies the impact of homeostatic fibrosis when the fas signaling was lost in fibroblasts(Redente, EF. (2020)). Integrative analysis of these two scRNAseq datasets could be useful to help us understand how the new subtype of cells found from GSE104154 will be affected by fas signaling. For demonstration purpose, we assume that these two datasets using the same genomic build and having the same gene symbol for the same gene.
# ##The codes were used to generate the example data list of GSE104154 (only a small subset of the data were used) # myGeoList<-c("GSE104154", "GSE161648") # myAllF<-Get_Geo_Data(DownGSEID=myGeoList,DecompressFile=TRUE,DecompressType="All") # dat_104_raw<-read.csv("GSE104154/GSE104154_d0_d21_sma_tm_Expr_raw.csv") # meta_104_2<-as.data.frame(readxl::read_xlsx("GSE104154/GSE104154_cell_type_annotation_d0_d21.xlsx",col_names=T,sheet=2)) # meta_104_3<-as.data.frame(readxl::read_xlsx("GSE104154/GSE104154_cell_type_annotation_d0_d21.xlsx",col_names=T,sheet=3)) # meta_104_2_3<-rbind(meta_104_2,meta_104_3) # meta_104_2_3$ID<-sapply(strsplit(meta_104_2_3$Barcode,split="\\."),function(x) x[2]) # sleSize=300 # sampleInd<-unlist(lapply(unique(meta_104_2_3$ID),function(x) {temp<-meta_104_2_3[meta_104_2_3$ID %in% x,] # set.seed(1234);sleSize<-min(nrow(temp),sleSize);SelectCellInd<-sample(1:nrow(temp),size=sleSize,replace=F) # returnBarcode<-temp$Barcode[SelectCellInd]})) # meta_23<-meta_104_2_3[meta_104_2_3$Barcode %in% sampleInd,] # dat_104Raw_withAno<-dat_104_raw[,paste(meta_23$Barcode)] # rowSum_dat104Raw<-rowSums(dat_104Raw_withAno) # dat_104Raw_f<-dat_104Raw_withAno[rowSum_dat104Raw>0,] # removeDupGene<-(!duplicated(dat_104_raw[rowSum_dat104Raw>0,2])) # dat_104Raw_f2<-dat_104Raw_f[removeDupGene,] # row.names(dat_104Raw_f2)<-dat_104_raw[rowSum_dat104Raw>0,2][removeDupGene] # t_datRaw<-data.frame(barcode=sapply(strsplit(colnames(dat_104Raw_f2),split="\\."),function(x) x[2]),t(dat_104Raw_f2)) # t_datRaw$barcode<-dplyr::recode(t_datRaw$barcode,"1"="d0_sma.pos_tm.pos","2"="d0_sma.neg_tm.pos", # "3"="d0_sma.neg_tm.neg", "4"="d21_sma.pos_tm.pos","5"="d21_sma.neg_tm.pos","6"="d21_sma.neg_tm.neg") # RawDataList_GSE104154<-Splitdata_MakeDataList(InputData=t(t_datRaw[,2:ncol(t_datRaw)]),Group=t_datRaw$barcode) # meta_GSE104154<-meta_23 # # ##The codes were used to generate the example data list of GSE161648 # GEO_ID<-"GSE161648" # list_barcodefile<-list.files(GEO_ID,pattern="*_barcodes.tsv.gz") # (sampleName<-gsub("_barcodes.tsv.gz","",list_barcodefile)) # sapply(sampleName,function(x){ # NewSampleFolder<-sprintf("%s/%s",GEO_ID,x) # if (!file.exists(NewSampleFolder)){dir.create(NewSampleFolder)} # file.copy(sprintf("%s/%s_barcodes.tsv.gz",GEO_ID,x), sprintf("%s/barcodes.tsv.gz",NewSampleFolder)) # file.copy(sprintf("%s/%s_matrix.mtx.gz",GEO_ID,x), sprintf("%s/matrix.mtx.gz",NewSampleFolder)) # file.copy(sprintf("%s/%s_features.tsv.gz",GEO_ID,x), sprintf("%s/features.tsv.gz",NewSampleFolder)) # }) # GSE161648_10xRawData<-sapply(sampleName,function(x){rawdata<- Seurat::Read10X(data.dir =sprintf("%s/%s",GEO_ID,x))}) # sleSize1=500 # RawDataList_GSE161648<-lapply(GSE161648_10xRawData,function(x){sleSize1<-min(ncol(as.matrix(x)),sleSize1) # set.seed(1234);SelectCellInd<-sample(1:ncol(as.matrix(x)),size=sleSize1,replace=F) # returnCell<-(as.matrix(x))[,SelectCellInd]}) ##load the example data data(Dat_GSE54_GSE48) RawDataList_GSE104154<-Dat_GSE54_GSE48$dat_GSE104154$RawDataList meta_GSE104154<-Dat_GSE54_GSE48$dat_GSE104154$MetaData row.names(meta_GSE104154)<-meta_GSE104154$Barcode RawDataList_GSE161648<-Dat_GSE54_GSE48$dat_GSE161648$RawDataList ###make Seurat object fro the raw data seu_GSE104154_raw<-MakeSeuObj_FromRawRNAData(RawList=RawDataList_GSE104154, GSE.ID="GSE104154", MtPattern='^mt-', MinFeature=1, MaxFeature=750000, MinCount=1, MaxCount=4000000, MaxMT=100) seu_GSE161648_raw<-MakeSeuObj_FromRawRNAData(RawList=RawDataList_GSE161648,GSE.ID="GSE161648",MtPattern='^mt-') ###Data integration Int_GSE54_GSE48<-SeuObj_integration(Object.list=seu_GSE104154_raw,Object.list2=seu_GSE161648_raw) ###Evaluate the potential batch effect Seurat::DimPlot(Int_GSE54_GSE48,group.by="orig.ident",split.by="GSE",repel = TRUE) Seurat::DimPlot(Int_GSE54_GSE48,group.by="seurat_clusters",split.by="GSE",label=TRUE,repel = TRUE)+ NoLegend() ###Add annotation to GSE104154 as provided by the data submitter Int_GSE54_GSE48$rowname<-row.names(Int_GSE54_GSE48@meta.data) meta_subsetOfInt_GSE54<-subset(Int_GSE54_GSE48,GSE=="GSE104154")@meta.data meta_subsetOfInt_GSE54$Barcode<-sapply(strsplit(meta_subsetOfInt_GSE54$rowname,split="\\_"),function(x) x[1]) ano_meta_subsetOfInt_GSE54<-dplyr::left_join(meta_subsetOfInt_GSE54,meta_GSE104154,by="Barcode") row.names(ano_meta_subsetOfInt_GSE54)<-ano_meta_subsetOfInt_GSE54$rowname ano_meta_subsetOfInt_GSE54<-ano_meta_subsetOfInt_GSE54[,c("rowname","defined")] Int_GSE54_GSE48@meta.data<-dplyr::left_join(Int_GSE54_GSE48@meta.data,ano_meta_subsetOfInt_GSE54,by="rowname") row.names(Int_GSE54_GSE48@meta.data)<-Int_GSE54_GSE48@meta.data$rowname Int_GSE54_GSE48$defined<-ifelse(is.na(Int_GSE54_GSE48$defined), paste(Int_GSE54_GSE48$seurat_clusters),Int_GSE54_GSE48$defined) Seurat::DimPlot(Int_GSE54_GSE48,group.by="defined",split.by="GSE",label=TRUE,repel = TRUE)+ NoLegend() ###Test whether loss Fas signaling will affect PDGFrb cell fraction in GSE161648 dataset Subset_GSE161648FromInt<-subset(Int_GSE54_GSE48,GSE=="GSE161648") Subset_GSE161648FromInt$FasStatus<-ifelse(Subset_GSE161648FromInt$ID %in% c("GSM4911963_sixwkfasdel","GSM4911965_threewkfasdel"),"Loss","NoLoss") tab_dat<-as.matrix(as.data.frame(unclass(table(Subset_GSE161648FromInt@meta.data[,c("FasStatus","defined")])))) ###PDGFrb is located in the cluster of 6. No obvious different is observed and loss of Fas signaling probably won't affect PDGFrb cell fraction. (tab_pop<-prop.table(tab_dat, margin = 1)) dat_barplot<-data.frame(FasStatus=rep(c("Loss","NoLoss"),ncol(tab_pop)), defined=rep(colnames(tab_pop),each=2), Percentage=as.numeric(tab_pop))
###Barplot. Note that PDGFrb is located in the cluster of 6. pl<-ggplot2::ggplot(dat_barplot, aes(fill=defined, y=Percentage, x=FasStatus))+ geom_bar(position="stack", stat="identity") + theme_bw() print(pl)
Instead of the data integration by modifying the query expression data, transferring learning could be an alternative approach to study the multiple datasets from different GSE ids. The following example takes GSE104154 as the reference while GSE161648 as the query.
data(Data_Ref_transfer) data(Data_Query_transfer) Transfer_Result<-SeuObj_transfer(RefObj=Data_Ref_transfer, QueryObj=Data_Query_transfer, RefLabel="defined") Seurat::DimPlot(Transfer_Result, group.by="predicted.RefLabel", reduction = "umap", label=TRUE, repel = TRUE)+ NoLegend()
Xie T, Wang Y, Deng N, Huang G et al. Single-Cell Deconvolution of Fibroblast Heterogeneity in Mouse Pulmonary Fibrosis. Cell Rep 2018 Mar 27;22(13):3625-3640. PMID: 29590628
Redente EF, Chakraborty S, Sajuthi S, Black BP et al. Loss of Fas signaling in fibroblasts impairs homeostatic fibrosis resolution and promotes persistent pulmonary fibrosis. JCI Insight 2020 Dec 8;6(1). PMID: 33290280
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.