options(width = 750) knitr::opts_chunk$set( comment = "#>", error = FALSE, tidy = FALSE)
The orthologr
package provides several interface functions to perform
BLAST searches.
First, users need to make sure that they have BLAST installed on their machine. Please follow these instructions to install BLAST on your manine.
The orthologr
package stores 20 example genes (orthologs) between Arabidopsis thaliana
and Arabidopsis lyrata. The following example BLAST search shall illustrate a simple
search with standard parameters provided by the blast()
function.
When running the subsequent functions please make sure you can call BLAST+ from your
console either in the standard PATH
or in case you have BLAST+ installed in a separate
folder, please specify the path
argument that can be passed to blast()
.
To check whether BLAST+ can be executed from the default PATH
(usr/bin/local
on UNIX systems),
you can run:
# check if blast is installed and if yes, what version of blast system("blastp -version")
This should return something like this:
blastp: 2.8.1+ Package: blast 2.8.1, build Nov 26 2018 12:45:20
If everything works properly, you can get started with your first BLAST+ search.
The orthologr
packages allows users to perform fast and easy-to-use BLAST searches
between fasta
files. These searches can even be scaled towards genome and proteome comparisons.
The blast()
function provides the easiest way to perform a BLAST search.
library(dplyr) # performing a BLAST search using blastp (default) hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr')) # look at results glimpse(hit_tbl)
Variables: $ query_id (chr) "AT1G01010.1", "AT1G01020.1", "AT1G01030.1",... $ subject_id (chr) "333554|PACid:16033839", "470181|PACid:16064... $ perc_identity (dbl) 73.99, 91.06, 95.54, 91.98, 100.00, 89.51, 9... $ alig_length (dbl) 469, 246, 359, 1970, 213, 648, 366, 300, 434... $ mismatches (dbl) 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, ... $ gap_openings (dbl) 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1... $ q_start (dbl) 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2,... $ q_end (dbl) 430, 246, 359, 1910, 213, 646, 366, 294, 429... $ s_start (dbl) 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4... $ s_end (dbl) 466, 246, 355, 1963, 213, 640, 362, 299, 433... $ evalue (dbl) 0e+00, 7e-166, 0e+00, 0e+00, 2e-160, 0e+00, ... $ bit_score (dbl) 627, 454, 698, 3704, 437, 1037, 696, 491, 85...
As you can see, the hit table shows the output of the BLAST+ search. The blast()
function
runs blastp
as default BLAST+ algorithm. Different BLAST+ algorithms can be selected by
specifying the blast_algorithm
argument, e.g. blast_algorithm = "tblastn"
. See ?blast
for further details.
The blast()
function returns the BLAST arguments: query_id
, subject_id
, perc_identity
,
alig_length
, mismatches
, gap_openings
, q_start
, q_end
, s_start
, s_end
, evalue
, and bit_score
.
Since blast()
stores the hit table returned by BLAST in a data.table object, you can access each column,
using the data.table notation.
In case you need to specify the PATH
to BLAST+ please use the path
argument:
# performing a BLAST search using blastp (default) hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), path = "/path/to/blastp") hit_tbl
# access columns: query_id, subject_id, evalue, and bit_score dplyr::select(hit_tbl, query_id, subject_id, evalue, bit_score)
query_id subject_id evalue bit_score 1: AT1G01010.1 333554|PACid:16033839 0e+00 627 2: AT1G01020.1 470181|PACid:16064328 7e-166 454 3: AT1G01030.1 470180|PACid:16054974 0e+00 698 4: AT1G01040.1 333551|PACid:16057793 0e+00 3704 5: AT1G01050.1 909874|PACid:16064489 2e-160 437 6: AT1G01060.3 470177|PACid:16043374 0e+00 1037 7: AT1G01070.1 918864|PACid:16052578 0e+00 696 8: AT1G01080.1 909871|PACid:16053217 1e-178 491 9: AT1G01090.1 470171|PACid:16052860 0e+00 859 10: AT1G01110.2 333544|PACid:16034284 0e+00 972 11: AT1G01120.1 918858|PACid:16049140 0e+00 1092 12: AT1G01140.3 470161|PACid:16036015 0e+00 918 13: AT1G01150.1 918855|PACid:16037307 3e-150 421 14: AT1G01160.1 918854|PACid:16044153 1e-93 268 15: AT1G01170.2 311317|PACid:16052302 3e-54 158 16: AT1G01180.1 909860|PACid:16056125 0e+00 576 17: AT1G01190.1 311315|PACid:16059488 0e+00 1036 18: AT1G01200.1 470156|PACid:16041002 3e-172 470 19: AT1G01210.1 311313|PACid:16057125 7e-76 215 20: AT1G01220.1 470155|PACid:16047984 0e+00 2106
The blast()
function also allows you to pass additional parameters to the BLAST+ search
using the blast_params
argument. In the following example, a remote BLAST+ search is
performed.
hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), blast_params = "-qcov_hsp_perc 0.9")
glimpse(hit_tbl)
A tibble: 78 x 12 query_id subject_id perc_identity alig_length mismatches gap_openings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 AT1G010... 311334|PA... 43.6 165 81 4 2 AT1G010... 333554|PA... 74.0 469 80 8 3 AT1G010... 909883|PA... 34.5 354 162 12 4 AT1G010... 918785|PA... 33.8 160 80 6 5 AT1G010... 470181|PA... 91.1 246 22 0 6 AT1G010... 470180|PA... 95.5 359 12 2 7 AT1G010... 333551|PA... 92.0 1970 85 10 8 AT1G010... 909874|PA... 100 213 0 0 9 AT1G010... 311286|PA... 64.5 62 22 0 10 AT1G010... 470177|PA... 89.5 648 58 5 with 68 more rows, and 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
In all cases the default e-value
BLAST+ searches is 1E-5
and the default blast_algorithm
is blastp
.
Since BLAST+ searches can be computationally expensive, it is possible to specify the
comp_cores
argument when working with an multicore machine.
# BLAST computations using the comp_cores parameter: here with 2 cores blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), comp_cores = 2)
The query_file
and subject_file
arguments specify the path to the corresponding
fasta files storing the CDS
files, amino acid
files, or genome
files of the query organism and subject organism
of interest. Make sure that when using CDS
files, amino acid
files, or genome
files the
corresponding argument seq_type
must be adapted according to the input data format.
Use :
CDS
files -> seq_type = "cds"
amino acid
files -> seq_type = "protein"
genome
files -> seq_type = "dna"
The format
argument specifies the input file format, e.g. "fasta" or "gbk".
The blast_algorithm
argument specifies the BLAST program (algorithm) that shall be
used to perform BLAST searches, e.g. "blastp","blastn","tblastn",etc. Again, the eval
argument defines the default e-value that shall be chosen as best hit threshold.
All blast
functions implemented in orthologr
can easily be processed using the
split-apply-combine strategy to detect
for example one-to-one
, one-to-many
, and many-to-many
gene homology relationships.
Here a simple example:
library(dplyr) # perform a blastp search of 20 A. thaliana genes against 1000 A. lyrata genes hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr')) # determine 'one-to-many' and 'one-to-one' gene relationships rel_hit_tbl <- summarize(group_by(hit_tbl, query_id), n_genes = n()) # look at results rel_hit_tbl
A tibble: 20 x 2 query_id n_genes <chr> <int> 1 AT1G01010.1 4 2 AT1G01020.1 1 3 AT1G01030.1 1 4 AT1G01040.1 1 5 AT1G01050.1 1 6 AT1G01060.3 2 7 AT1G01070.1 3 8 AT1G01080.1 5 9 AT1G01090.1 1 10 AT1G01110.2 1 11 AT1G01120.1 3 12 AT1G01140.3 36 13 AT1G01150.1 1 14 AT1G01160.1 1 15 AT1G01170.2 1 16 AT1G01180.1 1 17 AT1G01190.1 6 18 AT1G01200.1 7 19 AT1G01210.1 1 20 AT1G01220.1 1
Now you can sort genes into classes: one-to-one
and one-to-many
.
# classify into 'one-to-one' relationships one_to_one <- filter(rel_hit_tbl, n_genes == 1) # classify into 'one-to-many' relationships one_to_many <- filter(rel_hit_tbl, n_genes > 1)
# look at one_to_one
one_to_one
A tibble: 12 x 2 query_id n_genes <chr> <int> 1 AT1G01020.1 1 2 AT1G01030.1 1 3 AT1G01040.1 1 4 AT1G01050.1 1 5 AT1G01090.1 1 6 AT1G01110.2 1 7 AT1G01150.1 1 8 AT1G01160.1 1 9 AT1G01170.2 1 10 AT1G01180.1 1 11 AT1G01210.1 1 12 AT1G01220.1 1
# look at one_to_many
one_to_many
A tibble: 8 x 2 query_id n_genes <chr> <int> 1 AT1G01010.1 4 2 AT1G01060.3 2 3 AT1G01070.1 3 4 AT1G01080.1 5 5 AT1G01120.1 3 6 AT1G01140.3 36 7 AT1G01190.1 6 8 AT1G01200.1 7
Now we can treat classes: one_to_one
and one_to_many
differently:
We can now retrieve the blast hit entries for the one-to-one hits
by joining the tables one_to_one
and hit_tbl
.
# look at the evalue, perc_identity, and alig_length of one_to_one genes oo_genes <- dplyr::inner_join(one_to_one, hit_tbl , by = "query_id") oo_genes
A tibble: 12 x 13 query_id n_genes subject_id perc_identity alig_length mismatches <chr> <int> <chr> <dbl> <dbl> <dbl> 1 AT1G010... 1 470181|PA... 91.1 246 22 2 AT1G010... 1 470180|PA... 95.5 359 12 3 AT1G010... 1 333551|PA... 92.0 1970 85 4 AT1G010... 1 909874|PA... 100 213 0 5 AT1G010... 1 470171|PA... 96.8 434 8 6 AT1G011... 1 333544|PA... 93.6 528 34 7 AT1G011... 1 918855|PA... 72.6 285 68 8 AT1G011... 1 918854|PA... 84.9 179 19 9 AT1G011... 1 311317|PA... 85.6 97 0 10 AT1G011... 1 909860|PA... 92.6 310 20 11 AT1G012... 1 311313|PA... 95.3 107 5 12 AT1G012... 1 470155|PA... 96.7 1056 35 with 7 more variables: gap_openings <dbl>, q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
This way, users will have all blast information of one_to_one
hits available.
Using the oo_genes
dataset above, users can use this one_to_one
hits strategy to
detect homologous genes between species.
For example, we could restrict one_to_one
hits to fulfill certain (stringent)
criteria to identify as a homologous hit. Here we choose the following
parameter constellation to achieve this goal: one_to_one
genes must have a minimum alignment length of 300
, a perc_identity of > 80 percent
and an e-value < 1E-5
.
# filter for 'homologous' hits true_orthologs <- dplyr::filter(oo_genes, evalue < 1e-5, perc_identity > 80, alig_length > 300) # look at results dplyr::select(true_orthologs, query_id, subject_id, evalue, perc_identity, alig_length)
query_id subject_id evalue perc_identity alig_length 1: AT1G01030.1 470180|PACid:16054974 0 95.54 359 2: AT1G01040.1 333551|PACid:16057793 0 91.98 1970 3: AT1G01090.1 470171|PACid:16052860 0 96.77 434 4: AT1G01110.2 333544|PACid:16034284 0 93.56 528 5: AT1G01180.1 909860|PACid:16056125 0 92.58 310 6: AT1G01220.1 470155|PACid:16047984 0 96.69 1056
This way we could filter out a high confidence set of homologous genes from the
one_to_one
class of genes.
In reality most homology inference programs and methods perform way more complicated and sophisticated analyses to distinguish true orthologs from true paralogs (in-paralogs, out-paralogs, etc.). These subsequent analyses can also be performed using the above introduced split-apply-combine strategy.
Note, that you can perform self-BLAST searches blast(query,query)
and blast(subject,subject)
to distinguish between orthologous and paralogous genes.
Now we continue with the one_to_many
class of genes.
Here we want to address the question how to deal with multiple hits returned by BLAST+
.
Again we investigate all one_to_many
genes:
# look at results
one_to_many
A tibble: 8 x 2 query_id n_genes <chr> <int> 1 AT1G01010.1 4 2 AT1G01060.3 2 3 AT1G01070.1 3 4 AT1G01080.1 5 5 AT1G01120.1 3 6 AT1G01140.3 36 7 AT1G01190.1 6 8 AT1G01200.1 7
When looking at gene_id AT1G01200.1
we see that it was found 7 times in the
corresponding subject set of A. lyrata.
# look at all 7 hits found dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01200.1"), query_id, subject_id, evalue, perc_identity, alig_length)
A tibble: 7 x 5 query_id subject_id evalue perc_identity alig_length <chr> <chr> <dbl> <dbl> <dbl> 1 AT1G01200.1 470156|PACid:16041002 1.87e-172 95.8 238 2 AT1G01200.1 910431|PACid:16035207 2.63e- 75 53.0 219 3 AT1G01200.1 918732|PACid:16054958 1.07e- 51 44.6 193 4 AT1G01200.1 919287|PACid:16060536 4.42e- 70 58.1 179 5 AT1G01200.1 919355|PACid:16050170 2.40e- 73 53.3 212 6 AT1G01200.1 919721|PACid:16036935 2.63e- 81 59.3 204 7 AT1G01200.1 919852|PACid:16055066 4.02e- 7 24.0 154
Now we have to decide which hit shall be considered as potential homolog.
In this example subject_id
470156|PACid:16041002
has the highest perc_identity
as well
as the lowest e-value 1.87e-172
. So a straightforward approach would be to choose subject gene 470156|PACid:16041002
as potential ortholog of query gene AT1G01200.1
.
We can validate this approach by running a reciprocal best hit search with blast_rec()
and compare the output of gene AT1G01200.1
with our choice 470156|PACid:16041002
.
A reciprocal best hit blast approach denotes the search strategy in which query sequences are blasted in one direction to detect matches in subject sequences and then inversely subject sequences are blasted in the other direction to detect matches in the query sequences. Only when both blast seaches in both directions result in the same hit pair: BLAST(A,B) = BLAST(B,A) the hit will be considered as true ortholog relationship.
# run blast best reciprocal hit function rbh_hit_tbl <- blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr')) # look at results dplyr::select(rbh_hit_tbl, query_id, subject_id, evalue, perc_identity, alig_length)
A tibble: 20 x 5 Groups: query_id [20] query_id subject_id evalue perc_identity alig_length <chr> <chr> <dbl> <dbl> <dbl> 1 AT1G01010.1 333554|PACid:16033839 0. 74.0 469 2 AT1G01020.1 470181|PACid:16064328 4.33e-166 91.1 246 3 AT1G01030.1 470180|PACid:16054974 0. 95.5 359 4 AT1G01040.1 333551|PACid:16057793 0. 92.0 1970 5 AT1G01050.1 909874|PACid:16064489 1.59e-160 100 213 6 AT1G01060.3 470177|PACid:16043374 0. 89.5 648 7 AT1G01070.1 918864|PACid:16052578 0. 95.1 366 8 AT1G01080.1 909871|PACid:16053217 5.90e-179 90.3 300 9 AT1G01090.1 470171|PACid:16052860 0. 96.8 434 10 AT1G01110.2 333544|PACid:16034284 0. 93.6 528 11 AT1G01120.1 918858|PACid:16049140 0. 99.2 529 12 AT1G01140.3 470161|PACid:16036015 0. 98.5 453 13 AT1G01150.1 918855|PACid:16037307 2.01e-150 72.6 285 14 AT1G01160.1 918854|PACid:16044153 1.16e- 93 84.9 179 15 AT1G01170.2 311317|PACid:16052302 5.75e- 54 85.6 97 16 AT1G01180.1 909860|PACid:16056125 0. 92.6 310 17 AT1G01190.1 311315|PACid:16059488 0. 94.2 533 18 AT1G01200.1 470156|PACid:16041002 1.87e-172 95.8 238 19 AT1G01210.1 311313|PACid:16057125 8.80e- 76 95.3 107 20 AT1G01220.1 470155|PACid:16047984 0. 96.7 1056
When we now look at gene AT1G01200.1
we find that after an reciprocal blast approach subject gene 470181|PACid:16064328
rather than the subject gene 470156|PACid:16041002
(= result of unidirectional blast) has been detected as potential ortholog. The example illustrates the importance of using bidirectional blast to determine orthology relationships.
An alternative analysis that can be performed with these three candidate subject genes is the following:
# read CDS sequences of the 20 example query genes of A. thaliana Ath.cds <- orthologr::read.cds(file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), format = "fasta") # read CDS sequences of the 1000 example subject genes of A. lyrata Aly.cds <- orthologr::read.cds(file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'), format = "fasta") # show the sequence of gene AT1G01070.1 Ath.cds["AT1G01070.1" , seqs]
[1] "atggctggagatatgcaaggagtgagagtagtagaaaaatattcaccggtcatagtgatggtgatgtcaaatgta gcgatgggttcggtgaatgcacttgtgaagaaagctcttgatgttggtgtgaaccatatggtcattggtgcttatcgaat ggctatttccgctttaattttggttccctttgcctatgtcttggaaaggaaaacaagaccacaaataacgtttaggctaa tggtcgatcatttcgtcagtggccttctcggggcgagtttgatgcagtttttctttttgcttggtctgtcgtacacgtca gcaactgtttcgtgtgctttggtaagcatgttgcctgcaatcaccttcgctttggcccttattttcaggactgaaaatgt gaagattctaaagaccaaagcaggaatgttgaaggtgattggaactttgatctgtataagtggagctttgttcttaacat tttacaaaggcccacaaatatcaaactctcactctcactctcacggtggggcttcccacaacaacaacgatcaagacaag gccaataattggcttcttggatgtctttatttaaccataggaacagtgttgctatctctatggatgttgtttcaagggac tttaagtattaagtacccttgcaaatactcgagcacttgtcttatgtcaattttcgcggcatttcaatgtgctctcttga gcctttacaagagcagagacgttaatgattggatcatagatgatagattcgttatcaccgtcatcatatacgctggagtg gtaggacaagcaatgacgacggttgcaacaacatgggggattaaaaaattaggagctgtgttcgcatcggcgtttttccc acttactctcatttcggctactctatttgatttcctcattttacacactcctttataccttggaagtgtgattggatcac tagtgaccataacgggtctctacatgttcttgtgggggaagaacaaagaaacggaatcatcaactgcattgtcttcagga atggataacgaagctcaatatactactcctaataaggataacgactctaagtcgcccgtttaa"
Now users can perform a global alignment between the CDS sequences of AT1G01070.1
and the three subject genes as follows:
library(Biostrings) # perform 3 global alignments between: AT1G01070.1 and 918864|PACid:16052578, # 919693|PACid:16048878, 919961|PACid:16062329 sapply(Aly.cds[ unlist(dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01070.1"), subject_id)), seqs ], pairwiseAlignment, pattern = Ath.cds["AT1G01070.1" , seqs], type = "global" )
$... Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa subject: [1] atgggtgaaggtatgattggagtgagagta...aaggataacgactctaagtcgcccgtttaa score: 1768.965 $... Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] atggctgga---gatatgcaaggagtgaga...----cgac----tctaagtcgcccgtttaa subject: [1] atggctaaatcagatatgc------tg---...ggttccacaaggtctatatcgcc---ttaa score: -2318.726 $... Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa subject: [1] atgagtgaggatatgggaggagtgaaagta...----------------------------aa score: 486.462
Note: To obtain the score value, you need to specify the scoreOnly = TRUE
in the pairwiseAlignment
function.
As you can see, subject gene 918864|PACid:16052578
also has the highest global alignment score 1768.965
based on the Needleman-Wunsch algorithm. This strategy might help you to differentiate between border line cases.
The examples shown above shall demonstrate the use cases that can be performed using the
blast
functions implemented in orthologr
.
Another useful analysis can be to take the length of the initial query genes
into account using the nchar()
function:
# show the length distribution of all genes # stored in "Ath.cds" Ath.cds[ , nchar(seqs)]
[1] 1290 738 1077 5730 639 1938 1098 882 1287 1584 1587 1356 1038 588 252 1437 1608 714 321 3168
Or the length of a specific gene:
Ath.cds["AT1G01070.1" , nchar(seqs)]
[1] 1098
This way you can easily visualize the length distribution of genes stored in your query organism file.
Ath.cds <- read.cds(system.file('seqs/ortho_thal_cds_1000.fasta', package = 'orthologr'), format = "fasta") # look at sequence length distributions hist(Ath.cds[ , nchar(seqs)], breaks = 100)
For some analyses it is sufficient to perform BLAST+ best hit searches.
The blast_best()
function is optimized to perform BLAST+ best hit searches
(only based on the minimum e-value) and returns the best hit when performing a BLAST+ search of a query organisms
(or set of query genes) against a subject organism (or set of subject genes).
# performing gene orthology inference using the best hit (BH) method blast_best(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), clean_folders = TRUE)
A tibble: 20 x 12 Groups: query_id [20] query_id subject_id perc_identity alig_length mismatches gap_openings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 AT1G010... 333554|PA... 74.0 469 80 8 2 AT1G010... 470181|PA... 91.1 246 22 0 3 AT1G010... 470180|PA... 95.5 359 12 2 4 AT1G010... 333551|PA... 92.0 1970 85 10 5 AT1G010... 909874|PA... 100 213 0 0 6 AT1G010... 470177|PA... 89.5 648 58 5 7 AT1G010... 918864|PA... 95.1 366 14 2 8 AT1G010... 909871|PA... 90.3 300 22 2 9 AT1G010... 470171|PA... 96.8 434 8 3 10 AT1G011... 333544|PA... 93.6 528 34 0 11 AT1G011... 918858|PA... 99.2 529 4 0 12 AT1G011... 470161|PA... 98.5 453 6 1 13 AT1G011... 918855|PA... 72.6 285 68 3 14 AT1G011... 918854|PA... 84.9 179 19 2 15 AT1G011... 311317|PA... 85.6 97 0 1 16 AT1G011... 909860|PA... 92.6 310 20 1 17 AT1G011... 311315|PA... 94.2 533 30 1 18 AT1G012... 470156|PA... 95.8 238 10 0 19 AT1G012... 311313|PA... 95.3 107 5 0 20 AT1G012... 470155|PA... 96.7 1056 35 0 with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
The blast_best()
function returns: query_id
, subject_id
, and eval
.
In case you need more parameters returned by a BLAST+ best hit search, you
can specify the detailed_output
argument (detailed_output = TRUE
).
# BLAST+ best hit search best_hit_tbl <- blast_best(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr')) # look at results dplyr::glimpse(best_hit_tbl)
Observations: 20 Variables: 12 Groups: query_id [20] $ query_id <chr> "AT1G01010.1", "AT1G01020.1", "AT1G01030.1", "AT1G010... $ subject_id <chr> "333554|PACid:16033839", "470181|PACid:16064328", "47... $ perc_identity <dbl> 73.987, 91.057, 95.543, 91.980, 100.000, 89.506, 95.0... $ alig_length <dbl> 469, 246, 359, 1970, 213, 648, 366, 300, 434, 528, 52... $ mismatches <dbl> 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, 68, 19, 0... $ gap_openings <dbl> 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1, 1, 1, 0... $ q_start <dbl> 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2, 1, 1, 1,... $ q_end <dbl> 430, 246, 359, 1910, 213, 646, 366, 294, 429, 528, 52... $ s_start <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4, 1, 6, 1... $ s_end <dbl> 466, 246, 355, 1963, 213, 640, 362, 299, 433, 528, 52... $ evalue <dbl> 0.00e+00, 8.91e-168, 0.00e+00, 0.00e+00, 3.27e-162, 0... $ bit_score <dbl> 627, 454, 698, 3704, 437, 1037, 696, 491, 859, 972, 1...
The blast_rec()
function was implemented to optimize BLAST+ reciprocal best hit searches
(only based on the minimum e-value).
BLAST+ reciprocal best hit searches are used to perform orthology inference.
Running blast_rec()
using default parameter settings:
# performing gene orthology inference using the reciprocal best hit (RBH) method blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
A tibble: 20 x 12 Groups: query_id query_id subject_id perc_identity alig_length mismatches gap_openings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 AT1G010... 333554|PA... 74.0 469 80 8 2 AT1G010... 470181|PA... 91.1 246 22 0 3 AT1G010... 470180|PA... 95.5 359 12 2 4 AT1G010... 333551|PA... 92.0 1970 85 10 5 AT1G010... 909874|PA... 100 213 0 0 6 AT1G010... 470177|PA... 89.5 648 58 5 7 AT1G010... 918864|PA... 95.1 366 14 2 8 AT1G010... 909871|PA... 90.3 300 22 2 9 AT1G010... 470171|PA... 96.8 434 8 3 10 AT1G011... 333544|PA... 93.6 528 34 0 11 AT1G011... 918858|PA... 99.2 529 4 0 12 AT1G011... 470161|PA... 98.5 453 6 1 13 AT1G011... 918855|PA... 72.6 285 68 3 14 AT1G011... 918854|PA... 84.9 179 19 2 15 AT1G011... 311317|PA... 85.6 97 0 1 16 AT1G011... 909860|PA... 92.6 310 20 1 17 AT1G011... 311315|PA... 94.2 533 30 1 18 AT1G012... 470156|PA... 95.8 238 10 0 19 AT1G012... 311313|PA... 95.3 107 5 0 20 AT1G012... 470155|PA... 96.7 1056 35 0 with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
The set_blast()
function reads a file storing a specific sequence type, such as "cds", "protein", or "dna" in a standard sequence file format such as "fasta", etc.
and depending of the makedb parameter either creates a blast-able database, or returns the corresponding protein sequences as data.table object for further BLAST searches.
# using set_blast() to generate a blastable sequence database head(set_blast(file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'))[[1]] , 2)
geneids 1: AT1G01010.1 2: AT1G01020.1 seqs 1: atggaggatcaagttgggtttgggttccgtccgaacgacgaggagctcgttggtcactatctccgtaacaaaatcgaaggaaacact agccgcgacgttgaagtagccatcagcgaggtcaacatctgtagctacgatccttggaacttgcgcttccagtcaaagtacaaatcgaga gatgctatgtggtacttcttctctcgtagagaaaacaacaaagggaatcgacagagcaggacaacggtttctggtaaatggaagcttacc ggagaatctgttgaggtcaaggaccagtggggattttgtagtgagggctttcgtggtaagattggtcataaaagggttttggtgttcctc gatggaagataccctgacaaaaccaaatctgattgggttatccacgagttccactacgacctcttaccagaacatcagaggacatatgtc atctgcagacttgagtacaagggtgatgatgcggacattctatctgcttatgcaatagatcccactcccgcttttgtccccaatatgact agtagtgcaggttctgtggtcaaccaatcacgtcaacgaaattcaggatcttacaacacttactctgagtatgattcagcaaatcatggc cagcagtttaatgaaaactctaacattatgcagcagcaaccacttcaaggatcattcaaccctctccttgagtatgattttgcaaatcac ggcggtcagtggctgagtgactatatcgacctgcaacagcaagttccttacttggcaccttatgaaaatgagtcggagatgatttggaag catgtgattgaagaaaattttgagtttttggtagatgaaaggacatctatgcaacagcattacagtgatcaccggcccaaaaaacctgtg tctggggttttgcctgatgatagcagtgatactgaaactggatcaatgattttcgaagacacttcgagctccactgatagtgttggtagt tcagatgaaccgggccatactcgtatagatgatattccatcattgaacattattgagcctttgcacaattataaggcacaagagcaacca aagcagcagagcaaagaaaaggtgataagttcgcagaaaagcgaatgcgagtggaaaatggctgaagactcgatcaagatacctccatcc accaacacggtgaagcagagctggattgttttggagaatgcacagtggaactatctcaagaacatgatcattggtgtcttgttgttcatc tccgtcattagttggatcattcttgttggttaa 2: atggcggcgagtgaacacagatgcgtgggatgtggttttagggtaaagtcattgttcattcaatactctccgggtaacattcgtctcatg aaatgcggaaattgcaaggaagtagcagatgagtacatcgagtgtgaacgcatgattattttcatcgatttaatccttcacagaccaaag gtatatagacacgtcctctacaatgcaattaatccagcaactgtcaatattcagcatctgttgtggaagttggtcttcgcctatcttctt ctagactgttatagaagcttgctactgagaaaaagtgatgaagaatcgagcttttctgatagccctgttcttctatctataaaggttctg attggtgtcttatctgcaaacgctgcatttatcatctcttttgccattgcgactaagggtttgctaaatgaagtttccagaagaagagag attatgttggggatattcatctctagttacttcaagatatttctgcttgcgatgttggtatgggaattcccaatgtcagtgatttttttt gtcgatatacttctcttaacatcaaactccatggctcttaaagtgatgactgaatcaacaatgaccagatgcatagccgtatgcttaatc gcgcacttgattagattcttggtgggtcagatttttgagccgacaatatttttgatacaaattggatctctgttgcaatatatgtcttat tttttcagaatcgtatga aa 1: MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWK LTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPN MTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMI WKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQE QPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG* 2: MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRPKVYRHVLYNAINPATVNIQHLLWKLVFAYLL LDCYRSLLLRKSDEESSFSDSPVLLSIKVLIGVLSANAAFIISFAIATKGLLNEVSRRREIMLGIFISSYFKIFLLAMLVWEFPMSVIFF VDILLLTSNSMALKVMTESTMTRCIAVCLIAHLIRFLVGQIFEPTIFLIQIGSLLQYMSYFFRIV*
For large-scale applications of orthologr
, users may wish to perform BLAST+
searches between entire genomes. For very large-scale BLAST+ searches
between genomes we recommend the user to use the metablastr package
which aims to provide an easy-to-use BLAST search framework
for massive genome comparisons.
For pairwise genome comparisons however, users can use the following functions to BLAST one genome against another.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.