PPIfinder_getPPIs: Find Protein-Protein Interactions


Description

The function "getPPIs" identifies proteins in PubMed titles and abstracts and, if a match is found, returns the matching information as a data frame. It searches each sentence that mentions two proteins for matches against a user-provided list of keywords. When a keyword matches, the sentences containing the keywords and the matched keywords are appended to the data frame.
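The matching idea described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name find_ppi_sentences, the naive sentence splitter, and the toy abstract are all assumptions made for illustration, and base R is used only so the sketch has no dependencies.

```r
# Hypothetical helper, NOT part of nlpUtilityBelt: illustrates the idea only.
# Assumes gene_a and gene_b contain no regex metacharacters.
find_ppi_sentences <- function(abstract, gene_a, gene_b, keywords) {
  # Naive sentence split: break after ., ! or ? followed by whitespace
  marked <- gsub("([.!?])\\s+", "\\1\n", abstract)
  sentences <- unlist(strsplit(marked, "\n", fixed = TRUE))

  # Keep sentences that mention both proteins as whole words
  has_a <- grepl(paste0("\\b", gene_a, "\\b"), sentences, perl = TRUE)
  has_b <- grepl(paste0("\\b", gene_b, "\\b"), sentences, perl = TRUE)
  both <- sentences[has_a & has_b]

  # For each co-mention sentence, collect the keywords that occur in it
  hits <- lapply(both, function(s) {
    found <- vapply(keywords, function(k) grepl(k, s, ignore.case = TRUE),
                    logical(1))
    keywords[found]
  })
  keep <- lengths(hits) > 0

  list(
    sentences = both,                         # both proteins mentioned
    int_sentences = both[keep],               # both proteins plus a keyword
    int_keywords = unique(unlist(hits[keep])) # keywords that matched
  )
}

# Toy abstract written for this example
abstract <- paste("TP53 binds MDM2 in the nucleus.",
                  "MDM2 is an E3 ligase.",
                  "TP53 and MDM2 were studied.")
res <- find_ppi_sentences(abstract, "TP53", "MDM2", c("bind", "interact"))
```

Because the keywords are matched as substrings, a stem such as "bind" also matches "binds" or "binding". Whether getInteractionMatches behaves the same way should be checked in its own documentation.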

Usage

getPPIs(data, patterns, getInteractionMatches, keywords)

Arguments

data

a data frame containing PubMed IDs, gene A and B symbols, synonym lists (each symbol separated by '|'), gene A and B names, and the article title and abstract

patterns

a large character vector of patterns built from the gene symbols and names (see getFilteredRegex in the Examples)

getInteractionMatches

the matching function to apply; see the documentation for getInteractionMatches

keywords

a character vector of keywords used to identify PPIs

Details

This function requires the packages "svMisc" and "stringr". It also uses the getInteractionMatches function.

Value

The function "getPPIs" returns a data frame with seven columns: pmid, proteins, title, sentences, int_sentences, int_keywords, and match. If the matched sentences contain none of the provided keywords, the cell contains "No keywords found in sentence" and/or "No keywords". The columns are:

pmid

PubMed IDs

proteins

a list of protein symbols, synonyms, and gene names separated by '|'

title

PubMed article title

sentences

Sentences from abstract containing both gene A and gene B

int_sentences

Sentences from abstract containing gene A, gene B, and one or more of the keywords, separated by '|'

int_keywords

Matched keywords from int_sentences, separated by '|'

match

'0' or '1', indicating whether gene A and gene B were both identified in the article
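As a rough illustration of how the returned data frame might be post-processed, the sketch below builds a toy result with the seven documented columns and filters it. All values are invented, and the contents of the unmatched row and the placement of the placeholder strings are assumptions, since the documentation does not specify them.

```r
# Toy result with the seven documented columns; every value is made up.
found_PPIs <- data.frame(
  pmid = c("0000001", "0000002", "0000003"),
  proteins = c("GENEA|GENEB", "GENEC|GENED", "GENEE|GENEF"),
  title = c("Title one", "Title two", "Title three"),
  sentences = c("GENEA binds and interacts with GENEB.",
                "GENEC and GENED were measured.",
                NA),
  int_sentences = c("GENEA binds and interacts with GENEB.",
                    "No keywords found in sentence",
                    NA),
  int_keywords = c("bind|interact", "No keywords", NA),
  match = c("1", "1", "0"),
  stringsAsFactors = FALSE
)

# Keep articles where both genes were identified; as.character() makes the
# comparison work whether match is stored as text or as a number
matched <- found_PPIs[as.character(found_PPIs$match) == "1", ]

# Drop rows that carry a placeholder string instead of real keywords
placeholders <- c("No keywords", "No keywords found in sentence")
with_keywords <- matched[!(matched$int_keywords %in% placeholders), ]

# '|' is a regex metacharacter, so split with fixed = TRUE
keyword_list <- strsplit(with_keywords$int_keywords, "|", fixed = TRUE)
```

Splitting with fixed = TRUE matters here: without it, strsplit treats "|" as a regular-expression alternation and splits the string into single characters.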

Examples

## Not run: 

## getting protein-protein interaction matches from titles and abstracts extracted from PubMed
found_PPIs <- getPPIs(biogrid_data, patterns, getInteractionMatches, c('interaction', 'association', 'binding'))

## full example of how this function can be used
# load biogrid data
biogrid = getLatestBiogridData()

# filter data to include only certain types of interactions and organisms - human + colocalization
biogrid_human_data <- biogrid[ which(biogrid$Organism.Interactor.A == 9606 & biogrid$Organism.Interactor.B == 9606),]
biogrid_human_coloc <- biogrid_human_data[ which(biogrid_human_data$Experimental.System == "Co-localization"),]

# remove unneeded columns
biogrid_human_coloc <- biogrid_human_coloc[,c("Entrez.Gene.Interactor.A",
                                             "Entrez.Gene.Interactor.B",
                                             "Official.Symbol.Interactor.A",
                                             "Official.Symbol.Interactor.B",
                                             "Synonyms.Interactor.A",
                                             "Synonyms.Interactor.B",
                                             "Pubmed.ID")]

# add gene name to filtered biogrid data (requires the org.Hs.eg.db annotation package)
library(org.Hs.eg.db)

# collect the unique Entrez gene IDs of both interactors; select() expects character keys
entrez_ids <- unique(as.character(c(biogrid_human_coloc$Entrez.Gene.Interactor.A,
                                    biogrid_human_coloc$Entrez.Gene.Interactor.B)))

gene_name <- aggregate(GENENAME ~ ENTREZID, data = select(org.Hs.eg.db, entrez_ids, "GENENAME", "ENTREZID"), FUN = unique)

# combine information for rows with same PubMed ID
# gene A
final_merged_biogrid_data <- merge(biogrid_human_coloc, 
                                  gene_name, 
                                  by.x = "Entrez.Gene.Interactor.A", by.y = "ENTREZID", 
                                  all.x = TRUE)
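# rename column (select it by position, since colnames() cannot be indexed by name)
colnames(final_merged_biogrid_data)[colnames(final_merged_biogrid_data) == 'GENENAME'] <- 'GENENAME_A'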

# gene B
final_merged_biogrid_data = merge(final_merged_biogrid_data, 
                                  gene_name, 
                                  by.x = "Entrez.Gene.Interactor.B", by.y = "ENTREZID", 
                                  all.x = TRUE)
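# rename column (select it by position, since colnames() cannot be indexed by name)
colnames(final_merged_biogrid_data)[colnames(final_merged_biogrid_data) == 'GENENAME'] <- 'GENENAME_B'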

# Query PubMed - given the set of PubMed IDs in the data, retrieve the titles and abstracts
pmids <- as.character(unique(final_merged_biogrid_data$Pubmed.ID))

# retrieve pubmed abstracts and titles for pmids in data set
pubmed_results <- getPubmedAbstracts(pmids)

# combine results with final data
merged_biogrid_pubmed_results <- merge(final_merged_biogrid_data, 
                                      pubmed_results, 
                                      by.x = "Pubmed.ID", by.y = "PMID", all = TRUE)

# only keep complete rows
merged_biogrid_pubmed_results$Abstract[merged_biogrid_pubmed_results$Abstract==''] <- NA
merged_biogrid_pubmed_results <- merged_biogrid_pubmed_results[complete.cases(merged_biogrid_pubmed_results$Abstract),]

# get a list of all symbols and gene names
symbols = getCompleteSymbols(merged_biogrid_pubmed_results)

# get regex patterns
patterns = getFilteredRegex(symbols)

# interaction keywords
keywords <- c("bind", 
              "interact",
              "associate",
              "regulation",
              "bound",
              "localize",
              "stimulation",
              "regulate",
              "effect",
              "target",
              "component",
              "member",
              "mediate")

# loop over abstracts and collect the protein-protein interaction matches
tokens <- getPPIs(merged_biogrid_pubmed_results, patterns, getInteractionMatches, keywords)

# write out sentences
write.table(tokens, "Mapped_proteins.txt", quote  = FALSE, sep = '\t', col.names = TRUE, row.names = FALSE)


## End(Not run)
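The table written in the last step of the example can be read back for later analysis. The sketch below round-trips a toy data frame through a temporary file using the same write.table arguments as above. The toy columns and values are invented, and the reader settings are a suggestion rather than something the package prescribes.

```r
# Toy stand-in for the getPPIs() result; the values are invented.
tokens <- data.frame(
  pmid = c("0000001", "0000002"),
  sentences = c("The \"guardian\" GENEA binds GENEB.",
                "GENEA interacts with GENEB."),
  match = c("1", "1"),
  stringsAsFactors = FALSE
)

# Write with the same settings as the example (unquoted, tab-separated)
out_file <- tempfile(fileext = ".txt")
write.table(tokens, out_file, quote = FALSE, sep = "\t",
            col.names = TRUE, row.names = FALSE)

# quote = "" stops read.delim() from treating quotation marks inside a
# sentence as field quoting; colClasses keeps PubMed IDs as text
tokens_back <- read.delim(out_file, quote = "", colClasses = "character",
                          stringsAsFactors = FALSE)
```

Since the file is written with quote = FALSE, any sentence containing a tab character would still break the column layout, so it may be worth checking the column count after reading.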

andreysoares/nlpUtilityBelt documentation built on May 6, 2019, 8:57 p.m.