Description Usage Arguments Details Value Author(s) Examples
Filter homolog sequences by sequence similarity.
1 2 3 4 5 6 7 8 |
seq |
a list with one element for each protein/gene sequence. The elements are in two parts, one the description ("desc") and the second is a character string of the biological sequence ("seq"). |
identity |
a numeric value ranged from 0 to 1. It is used as a maximum identity cutoff among input sequences. |
method |
a string for the method of homolog redunction. This must be one of the strings "cdhit" or "aligndis". |
cdhit.path |
a string for the path of cdhit program directory. eg: "/people/hongli/cd-hit". It is necessary when method="cdhit". |
seq1 |
a string for the protein or gene sequence. |
seq2 |
a string for the protein or gene sequence. seq1 and seq2 must have same length. |
seqfile |
a string for the name of FASTA file. |
posfile |
a string for the name of file which contains the positive site dataset. It has two columns: 1st column is the protein name; 2st column is the positive site. Protein name should be consistent with the name used in seqfile. |
aa |
a character for the interested amino acid. eg: "C". |
w |
an integer for the window size of flanking peptide sequence. Window size is 2*w+1, and the central residues are the positive sites in posfile. |
balance |
a logical value indicating whether negative sites will be random selected to have the same number with positive sites. |
posSite |
a string vector for the positive sites. It is consisted of protein description and positive site, eg: "P278168:952". |
hr
employs cdhitHR
and aligndisHR
to
filter homolog sequences. It supported following methods:
"cdhit": Use cd-hit program to quickly filter sequences by given identity. It is designed to filter full-length protein or gene sequences. "formatdb" and "blastall" are required for running cd-hit program. (http://www.bioinformatics.org/download.php/cd-hit/cd-hit-2007-0131.tar.gz or http://www.bioinformatics.org/download.php/cd-hit/cd-hit-2007-0131-win32.tar.gz)
"aligndis": Use the number of different residues to meature the identity between two sequences. It is designed to filter aligned seuqnces with equal length.
getTrain
extract 2*w+1 flanking peptides of positive sites and
filter homolog sequences. Negative sites are non-positive sites in the same
proteins.
distance
calculate the number of positions with different residues
between two sequences.
hr
return a list of reduced sequences.
Hong Li
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | distance("AABD","ACBD")
distance("AABD","ECBD")
if(interactive()){
file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.fasta")
library(Biostrings)
seq = as.character(readAAStringSet(file))
## Homolog reduction of whole-length sequence by cd-hit
# need cd-hit program;
reducSeq50 = hr(seq, method="cdhit", identity=0.5, cdhit.path="/people/hongli/cd-hit")
file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.site")
tmp = as.matrix(read.csv(file, sep="\t",header=F))
logical = apply(tmp,1,function(x){ l=nchar(seq[x[1]]); (l>=as.numeric(x[2])+7 & as.numeric(x[2])-7>0) })
fragment = sub.seq(seq[tmp[logical,1]], as.numeric(tmp[logical,2])-7, as.numeric(tmp[logical,2])+7)
## Homolog reduction of short sequence fragment
# It may be slow.
reducSeq = hr(fragment, method="aligndis", identity=0.4)
## produce train set based on given positive sites and fasta sequences.
file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.fasta")
posfile = file.path(path.package("BioSeqClass"), "example", "acetylation_K.site")
## "getTrain" integrate negative set construction and homolog reduction. It is designed for site level training data.
# It may be very slow.
data = getTrain(file, posfile, aa="K", w=7, identity=0.4)
}
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.