This package aims to visualize the word and text information associated with genes or other omics identifiers, such as microbiome data, to identify important words among clusters, and to integrate and compare clusters based on that information. It contributes to understanding the functional implications of omics identifier lists and aids in their interpretation and visualization. This vignette introduces the basic usage for generating networks and combining them, as well as customized usage. Detailed options and usage, such as integration with other packages, are available in the package's bookdown. A web server is also available for convenient querying.

Basic usage

Install and load the package and the annotation database used for converting identifiers. This example mostly uses human-derived data, so org.Hs.eg.db is used.

# devtools::install_github("noriakis/biotextgraph")
library(biotextgraph)
library(org.Hs.eg.db)
library(ggplot2)
library(ggraph)

Producing word networks

The main functions accept a set of omics identifiers and generate a network. Several functions are available for different identifier types; to create a network from a gene list, refseq is used. Specifying plotType="network" generates a network. The function returns a biotext class object whose slots contain various information. The default ID type is SYMBOL, but another type can be specified in keyType. Because many words occur commonly, you should limit words by frequency using excludeFreq, which defaults to 2000. TF-IDF over all the summaries is precomputed, so exclude="tfidf" can also be specified. The net slot stores the visualization generated by ggraph.

## Configure input genes (ERCC genes)
inpSymbol <- c("ERCC1","ERCC2","ERCC3","ERCC4","ERCC5","ERCC6","ERCC8")
net <- refseq(inpSymbol, plotType="network")
net
plotNet(net, asis=TRUE)
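
To make the filtering idea concrete, the following is a minimal base-R sketch of how frequency-based and TF-IDF-based word exclusion can work on a toy corpus. It is not the package's implementation; the toy texts, the threshold maxFreq, and all variable names are assumptions made for illustration.

```r
## Toy corpus: one short "summary" per gene (hypothetical text)
docs <- c(
  geneA = "protein involved in dna repair",
  geneB = "protein involved in chemokine signaling",
  geneC = "protein required for nucleotide excision repair"
)

## Tokenize on whitespace and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

## Term-document matrix of raw counts (rows: words, columns: documents)
tdm <- sapply(tokens, function(tk) table(factor(tk, levels = vocab)))

## Corpus-wide frequency; keep only words occurring at most `maxFreq` times
## (an assumed threshold, analogous in spirit to excludeFreq)
maxFreq <- 2
wordFreq <- rowSums(tdm)
keptByFreq <- names(wordFreq)[wordFreq <= maxFreq]

## TF-IDF: raw term frequency weighted by log inverse document frequency
docFreq <- rowSums(tdm > 0)
idf <- log(ncol(tdm) / docFreq)
tfidf <- tdm * idf

keptByFreq
round(tfidf["repair", ], 3)
```

In this toy corpus, "protein" appears in every document, so it is dropped by the frequency filter and also receives a TF-IDF weight of zero.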

The network visualization can be customized using options such as colorText, which colors the text according to the node colors, or edgeLink, which selects the geom used to draw edges.

## Build the CXCL gene symbols in one vectorized call
cxcls <- paste0("CXCL", c(1,2,3,5,6,8,9,10,11,12,13,14,16))
cxclNet <- refseq(cxcls, plotType="network",
    colorText=TRUE, edgeLink=FALSE, autoThresh=FALSE)
plotNet(cxclNet, asis=TRUE)

IDs associated with important words in the text can be drawn in the network alongside the words. This makes it possible to pick out important IDs within the list based on word frequency. For genes, those associated with frequently occurring words within the cluster can be shown by setting genePlot=TRUE, and their number can be controlled by genePlotNum.

net <- refseq(cxcls, plotType="network", autoThresh=FALSE,
    colorText=TRUE, edgeLink=FALSE, genePlot=TRUE, genePlotNum=5)
net
plotNet(net, asis=TRUE)

Sets of words can be tagged by setting tag="cor" (based on the adjacency matrix of the inferred network) or tag="tdm" (based on the term-document matrix). This allows us to see which word sets are significant and to reflect this information in the plots.

tag <- refseq(c(inpSymbol, cxcls), plotType="network", tag="cor",
    colorText=TRUE, edgeLink=FALSE, genePlot=TRUE, genePlotNum=5)
getSlot(tag, "pvpick")
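
As a rough illustration of what grouping words into tagged sets means, the sketch below clusters words by the correlation of their rows in a toy term-document matrix, using hclust and cutree from base R. This is an assumption-laden stand-in: the package's own tagging procedure, whose result is retrieved from the pvpick slot above, may differ, and the counts and the choice of two clusters are hypothetical.

```r
## Toy term-document matrix (rows: words, columns: documents); values are
## hypothetical counts chosen so that two word groups co-occur
tdm <- rbind(
  repair    = c(2, 1, 0, 0),
  excision  = c(1, 1, 0, 0),
  chemokine = c(0, 0, 2, 1),
  receptor  = c(0, 0, 1, 1)
)
colnames(tdm) <- paste0("doc", 1:4)

## Word-by-word correlation across documents
wordCor <- cor(t(tdm))

## Turn correlation into a distance and cluster hierarchically
wordDist <- as.dist(1 - wordCor)
hc <- hclust(wordDist, method = "average")

## Cut into two groups; each group acts as a "tag" for its words
tags <- cutree(hc, k = 2)
split(names(tags), tags)
```

Words that co-occur in the same documents end up in the same group, which can then be mapped to a node color in a plot.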

It is possible to perform searches against databases such as PubMed using the obtained important genes as queries, and visualize the results. It is recommended to use a PubMed API key for this purpose. Specify one in apiKey.

getSlot(net, "geneCount") |> head()
pmquery <- getSlot(net, "geneCount") |> head() |> names()
## Not run in vignette
# pubmed(pmquery, plotType="network")
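
One way to supply the key without hard-coding it is to read it from an environment variable, as sketched below. The variable name NCBI_API_KEY is an assumption chosen for this example and is not a convention of the package; the pubmed call is left commented out because it requires network access.

```r
## Read the key from an environment variable rather than writing it in the script.
## NCBI_API_KEY is a hypothetical variable name used only for this sketch.
apiKey <- Sys.getenv("NCBI_API_KEY")
hasKey <- nzchar(apiKey)

if (!hasKey) {
    message("No API key found; set NCBI_API_KEY before querying.")
}
## Not run in vignette
# if (hasKey) pubmed(pmquery, plotType="network", apiKey=apiKey)
```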

Each step can also be run separately, either chained with the pipe operator or with intermediate results stored for later analysis.

btg <- obtain_refseq(inpSymbol) |> ## obtain RefSeq description
  set_filter_words() |> ## Set filtering words
  make_corpus() |> ## Make corpus
  make_TDM() |> ## Make term-document matrix
  make_graph() |> ## Make graph
  process_network_gene(gene_plot=TRUE) |> ## Process graph for showing associated genes to words
  plot_biotextgraph(edge_link=FALSE) |> ## Make plot for the network (stored in `net` slot)
  plot_wordcloud() ## Make wordcloud plot (stored in `wc` slot)

An example of comparing networks of words in the biological pathway

As an example of comparing text information on a network, this section demonstrates the comparison of gene lists within KEGG pathways. For applications using actual public data, please refer to the documentation.

keggPathways <- org.Hs.egPATH2EG
mappedKeys <- mappedkeys(keggPathways)
keggList <- as.list(keggPathways[mappedKeys])
## Hepatitis C
hCNet <- refseq(keggList$`05160`, plotType="network",
                        layout="nicely", keyType="ENTREZID",
                        autoThresh=FALSE, excludeFreq = 5000, colorText=TRUE,
                        edgeLink=FALSE, showLegend=FALSE)
plotNet(hCNet, asis=TRUE)

We create another biotext object to compare with.

ecoli <- refseq(keggList$`05130`, keyType="ENTREZID", autoThresh=FALSE)

Networks can be compared with compareWordNet. By providing multiple biotext class objects, it is possible to create a new network that integrates the networks and tag information contained in each object. This makes it possible to compare multiple different lists of IDs.

compareWordNet(list(hCNet, ecoli),
               titles=c("RefSeq_05160","RefSeq_05130"),
               colPal = "Dark2") |> plotNet()
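
The idea of integrating two word networks can be sketched in base R as the union of their edge lists, with a record of which network each edge came from. This is only a conceptual illustration under assumed toy data; it does not reproduce what compareWordNet does internally, and the edge lists and the helper canon are hypothetical.

```r
## Edge lists of two hypothetical word networks
edgesA <- data.frame(from = c("virus", "virus"),
                     to = c("interferon", "kinase"),
                     stringsAsFactors = FALSE)
edgesB <- data.frame(from = c("actin", "kinase"),
                     to = c("adhesion", "virus"),
                     stringsAsFactors = FALSE)

## Put each undirected edge in a canonical (sorted) order so A-B equals B-A
canon <- function(e) {
    data.frame(from = pmin(e$from, e$to), to = pmax(e$from, e$to),
               stringsAsFactors = FALSE)
}
edgesA <- canon(edgesA)
edgesB <- canon(edgesB)

## Union of edges, recording which network(s) each edge came from
keyA <- paste(edgesA$from, edgesA$to, sep = "--")
keyB <- paste(edgesB$from, edgesB$to, sep = "--")
allKeys <- union(keyA, keyB)
merged <- data.frame(
    edge = allKeys,
    inA = allKeys %in% keyA,
    inB = allKeys %in% keyB
)

## Words present in both networks are candidates for highlighting
wordsA <- unique(c(edgesA$from, edgesA$to))
wordsB <- unique(c(edgesB$from, edgesB$to))
shared <- intersect(wordsA, wordsB)

merged
shared
```

The inA and inB columns play the role of a grouping variable that a plot could map to color, so that shared and network-specific words are distinguishable.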

Text from enrichment analysis results can optionally be summarized with the enrich option. The example below uses enrichment analysis against the KEGG database.

if (requireNamespace("clusterProfiler")) {
    hCNetK <- refseq(keggList$`05160`, enrich="kegg", keyType="ENTREZID",
                     cooccurrence=TRUE, topPath=50, numWords=50,
                     autoThresh=FALSE, plotType="network", corThresh=0.1)
    plotNet(hCNetK, asis=TRUE)
}

Summarizing other identifiers' data

In addition to genes, microbial information can be summarized in a similar manner. For obtaining and summarizing information on disease relationships, enzymes, metabolites, and biological pathways, please refer to the documentation. Furthermore, a manual function (manual) is available that performs similar operations on customized user input.

Other visualization options

Producing word clouds

The package provides other visualization options, such as producing a word cloud of biomedical textual information by querying gene IDs or other identifiers.

gwc <- refseq(inpSymbol, plotType="wc")
gwc
plotWC(gwc, asis=TRUE)

Options for wordcloud or ggwordcloud, including color and rotation, can be specified in argList.

gwc <- refseq(inpSymbol, numWords=200,
                     argList=list(max.words=200, random.order=FALSE,
                     colors=RColorBrewer::brewer.pal(5, "Dark2"),
                     rot.per=0.4), plotType="wc", scaleFreq=2)
gwc
plotWC(gwc, asis=TRUE)
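
Passing options through a list like argList is commonly done with do.call, which splices the list into the arguments of the underlying function. The sketch below shows that pattern with a stand-in function; drawCloud and makeCloud are hypothetical names invented for this example, and the package's actual forwarding logic is not shown here.

```r
## Hypothetical plotting backend standing in for a word cloud function
drawCloud <- function(words, freq, max.words = 100, rot.per = 0.1) {
    ## Keep the `max.words` most frequent words
    keep <- head(order(freq, decreasing = TRUE), max.words)
    list(words = words[keep], rot.per = rot.per)
}

## Wrapper that forwards user-supplied options, in the spirit of argList
makeCloud <- function(words, freq, argList = list()) {
    do.call(drawCloud, c(list(words = words, freq = freq), argList))
}

res <- makeCloud(c("repair", "dna", "excision"), c(5, 3, 1),
                 argList = list(max.words = 2, rot.per = 0.4))
res$words
```

With this design the wrapper does not need to know every option of the backend; any named element of the list is handed through unchanged.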

Text summaries such as word clouds can be combined with other plots. For example, they can be displayed on reduced dimension plots in single-cell analysis. For other ways of combining them, please refer to the documentation.

Annotating the cluster relationship

Customized functions are available for annotating relationships between gene clusters. If you have clustered gene expression data or other identifiers and examine the relationship between clusters with a dendrogram, the plotEigengeneNetworksWithWords function can be used to annotate the resulting dendrogram with words.

mod <- returnExample()
plotEigengeneNetworksWithWords(mod$MEs, mod$colors) +
  scale_y_continuous(expand=c(0,1)) ## Expand the y axis so that labels are not truncated

Other examples, such as interactive visualization of cluster networks using actual data and populating reduced dimension plots in single-cell transcriptomics, are described in the documentation.

sessionInfo()


noriakis/wcGeneSummary documentation built on May 31, 2024, 4:42 p.m.