Find linear motifs (QSLIMFinder or SLIMFinder) in the protein interaction network
PPInetwork2SLIMFinder(dataset_name = "SLIMFinder",
  interaction_main_set = all_human_interaction,
  interaction_query_set = all_viral_interaction,
  analysis_type = "qslimfinder",
  options = "dismask=T consmask=F cloudfix=T probcut=0.3 minwild=0 maxwild=2 slimlen=5 alphahelix=F maxseq=1500 savespace=1 iuchdir=T",
  domain_res_file = "./processed_data_files/what_we_find_VS_ELM_clust20171019.RData",
  domain_results_obj = "res_count", center_domains = F,
  filter_by_domain = F,
  fasta_path = "./data_files/all_human_viral_proteins.fasta",
  main_set_only = F, domain_pvalue_cutoff = 1,
  SLIMFinder_dir = paste0("./", dataset_name, "/"),
  LSF_project_path = "/hps/nobackup/research/petsalaki/users/vitalii/vitalii/viral_project/",
  software_path = "../software/cluster/", length_set1_min = 2,
  length_set2_min = 1, write_log = T, N_seq = 200,
  seed_list = NULL, query_list = NULL, memory_start = 350,
  memory_step = 100, compare_motifs = T, Njobs_limit = 490,
  CompariMotif3_dburl = "http://elm.eu.org/elms/elms_index.tsv",
  CompariMotif3_dbpath = "./data_files/",
  non_query_domain_res_file = "../viral_project/processed_data_files/predict_domain_human_clust20180819.RData",
  non_query_domain_results_obj = NULL, non_query_domains_N = 0,
  non_query_set_only = c(main_set_only), query_domains_only = T)
dataset_name |
refer to |
interaction_main_set |
clean_MItab class, use this set of protein interactions to construct QSLIMFinder datasets |
interaction_query_set |
clean_MItab class, use this set of protein interactions as a query (+ add to the QSLIMFinder datasets). Both interaction sets have shared seed proteins. SLIMFinder |
analysis_type |
"qslimfinder" or "slimfinder" |
options |
any options from QSLIMFinder or SLIMFinder. For details, see the Commandline section of http://rest.slimsuite.unsw.edu.au/docs&page=module:qslimfinder or http://rest.slimsuite.unsw.edu.au/docs&page=module:slimfinder |
domain_res_file |
relative path to domain enrichment results RData |
domain_results_obj |
which object contains domain enrichment results in |
center_domains |
logical, center QSLIMFinder datasets at domains? |
filter_by_domain |
logical, filter by domain? If FALSE this function does not use |
fasta_path |
relative path (from the project folder) to the FASTA file containing sequences for all proteins in |
main_set_only |
logical, If TRUE sequence sets for motif search contain only proteins from |
domain_pvalue_cutoff |
construct SLIMFinder datasets using interactions of proteins that contain domain associated to protein in the query set with p-value |
SLIMFinder_dir |
directory to store SLIMFinder datasets and results within the project directory |
LSF_project_path |
full path to the project directory |
software_path |
relative path (from the project folder) to the directory containing slimsuite, blast, iupred, e.g. "../software/cluster/" or "../software/" |
length_set1_min |
minimal number of proteins in a QSLIMFinder dataset from |
length_set2_min |
minimal number of proteins in a QSLIMFinder dataset from |
write_log |
FALSE will not allow runQSLIMFinder to detect crashed jobs |
N_seq |
number of sequences per batch |
seed_list |
character vector of UniprotKB accessions that should serve as a seed for QSLIMFinder datasets. These proteins are supposed to recognise SLIMs. Overrides selection of seed protein by |
query_list |
character vector of UniprotKB accessions that should serve as a query for QSLIMFinder |
memory_start |
integer, how much memory each job should be given initially |
memory_step |
integer, increment by which to increase how much memory each job should be given if |
compare_motifs |
logical, compare motifs using CompariMotif3? The procedure is relatively fast but memory-intensive. |
Njobs_limit |
integer, the number of LSF jobs allowed to run simultaneously |
CompariMotif3_dburl |
URL from which to download the database for CompariMotif V3. Argument for |
CompariMotif3_dbpath |
path to directory where to save and keep ELM database (http://elm.eu.org/) or other database of linear motifs in a format required by comparimotif_V3: http://rest.slimsuite.unsw.edu.au/docs&page=module:comparimotif_V3 |
non_query_domain_res_file |
path to RData file containing the result of domain enrichment analysis for non-query proteins |
non_query_domain_results_obj |
character, name of the object containing domain enrichment results for non-query proteins (class == XYZinteration_XZEmpiricalPval), when provided will be used for filtering datasets. |
non_query_domains_N |
the number of non-query proteins with predicted domains for each dataset. Used only when non_query_domain_results_obj is not NULL |
non_query_set_only |
If TRUE, sequence sets for motif search contain only proteins (interacting partners of a seed) from non_query_domain_results_obj; if FALSE, from both non_query_domain_results_obj and domain_res_obj. Used only when non_query_domain_results_obj is not NULL; defaults to main_set_only |
query_domains_only |
If TRUE proteins whose sequences will be used for motif search must be predicted to bind the same domains in a seed protein as domains predicted for query protein. Used only when non_query_domain_results_obj is not NULL |
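The options argument is a single string of space-separated key=value pairs. As a minimal sketch, assuming you would rather keep the settings in a named list, the string can be assembled as follows. The helper build_slim_options is hypothetical and not part of this package.

```r
# Hypothetical helper (not part of the package): collapse a named list
# of SLiMSuite settings into the "key=value key=value" string format.
build_slim_options <- function(opts) {
  stopifnot(!is.null(names(opts)), all(names(opts) != ""))
  # Render logicals as T/F, the form used in the usage string above
  values <- vapply(opts, function(x) {
    if (is.logical(x)) (if (x) "T" else "F") else as.character(x)
  }, character(1))
  paste(names(opts), values, sep = "=", collapse = " ")
}

# Settings copied from the default value of `options` in the usage
opts <- list(dismask = TRUE, consmask = FALSE, cloudfix = TRUE,
             probcut = 0.3, minwild = 0, maxwild = 2, slimlen = 5,
             alphahelix = FALSE, maxseq = 1500, savespace = 1,
             iuchdir = TRUE)
options_string <- build_slim_options(opts)
print(options_string)
```

The resulting options_string can then be passed as options = options_string. Keeping the settings in a list makes it easier to change one value without retyping the whole string.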
QSLIMFinder command line options (http://rest.slimsuite.unsw.edu.au/docs&page=module:qslimfinder)
### Basic Input/Output Options ###
seqin=FILE : Sequence file to search [None]
batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum/Seq Number) [1]
addquery=FILE : Adds query sequence(s) to batch jobs from FILE [None]
maxseq=X : Maximum number of sequences to process [500]
maxupc=X : Maximum UPC size of dataset to process [0]
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
walltime=X : Time in hours before program will abort search and exit [1.0]
resfile=FILE : Main QSLiMFinder results table [qslimfinder.csv]
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [QSLiMFinder/]
buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile using RunID [False]
dna=T/F : Whether the sequences files are DNA rather than protein [False]
alphabet=LIST : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
megaslim=FILE : Make/use precomputed results for a proteome (FILE) in fasta format [None]
megablam=T/F : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
ptmlist=LIST : List of PTM letters to add to alphabet for analysis and restrict PTM data []
ptmdata=DSVFILE : File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
SLiMBuild Options I
efilter=T/F : Whether to use evolutionary filter [True]
blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
blaste=X : BLAST e-value threshold for determining relationships [1e-4]
altdis=FILE : Alternative all by all distance matrix for relationships [None]
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [0]
SLiMBuild Options II
masking=T/F : Master control switch to turn off all masking if False [True]
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
consmask=T/F : Whether to use relative conservation masking [False]
ftmask=LIST : UniProt features to mask out (True=EM,DOMAIN,TRANSMEM) []
imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
casemask=X : Mask Upper or Lower case [None]
motifmask=X : List (or file) of motifs to mask from input sequences []
metmask=T/F : Masks the N-terminal M (can be useful if termini=T) [True]
posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
SLiMBuild Options III
termini=T/F : Whether to add termini characters (^ & $) to search sequences [True]
minwild=X : Minimum number of consecutive wildcard positions to allow [0]
maxwild=X : Maximum number of consecutive wildcard positions to allow [2]
slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
absmin=X : Used if minocc<1 to define absolute min. UP occ [3]
alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [False]
SLiMBuild Options IV
ambiguity=T/F : (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
absminamb=X : Used if ambocc<1 to define absolute min. UP occ [2]
equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
wildvar=T/F : Whether to allow variable length wildcards [True]
combamb=T/F : Whether to search for combined amino acid degeneracy and variable wildcards [False]
SLiMBuild Options V
musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
focus=FILE : FILE containing focal groups for SLiM return (see Manual for details) [None]
focusocc=X : Motif must appear in X+ focus groups (0 = all) [0]
* See also rje_slimcalc options for occurrence-based calculations and filtering *
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
cloudfix=T/F : Restrict output to clouds with 1+ fixed motif (recommended) [False]
slimchance=T/F : Execute main QSLiMFinder probability method and outputs [True]
sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [False]
sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
qexact=T/F : Calculate exact Query motif space (True) or over-estimate from dimers (False) (quicker) [True]
probcut=X : Probability cut-off for returned motifs [0.1]
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [False]
aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
negatives=FILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig) [Sig]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Advanced Output Options I
clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
runid=X : Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
logmask=T/F : Whether to log the masking of individual sequences [True]
slimcheck=FILE : Motif file/list to add to resfile output []
Advanced Output Options II
teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
extras=X : Whether to generate additional output files (alignments etc.) [1]
- -1 = No output beyond main results file
- 0 = Generate occurrence file and cloud file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional QSLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
savespace=0 : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
Advanced Output Options III
topranks=X : Will only output top X motifs meeting probcut [1000]
minic=X : Minimum information content for returned motifs [2.1]
allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
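To see which settings in the default options string depart from the QSLiMFinder defaults listed above, the string can be parsed back into key/value pairs and compared. This is a rough sketch: parse_slim_options is a hypothetical helper, and the defaults vector is transcribed by hand from the bracketed values in this listing (iuchdir is left out because it does not appear in the listing).

```r
# Hypothetical helper: split "key=value key=value" into a named vector
parse_slim_options <- function(x) {
  pairs <- strsplit(strsplit(x, " +")[[1]], "=", fixed = TRUE)
  setNames(vapply(pairs, `[`, character(1), 2),
           vapply(pairs, `[`, character(1), 1))
}

# Default `options` string from the usage of PPInetwork2SLIMFinder
used <- parse_slim_options(paste(
  "dismask=T consmask=F cloudfix=T probcut=0.3 minwild=0 maxwild=2",
  "slimlen=5 alphahelix=F maxseq=1500 savespace=1 iuchdir=T"))

# Bracketed defaults transcribed from the option listing above
defaults <- c(dismask = "F", consmask = "F", cloudfix = "F",
              probcut = "0.1", minwild = "0", maxwild = "2",
              slimlen = "5", alphahelix = "F", maxseq = "500",
              savespace = "0")

# Options present in both that differ from the listed default
shared <- intersect(names(used), names(defaults))
changed <- shared[used[shared] != defaults[shared]]
print(changed)
```

The comparison is purely textual, so a value written as "0.30" would be reported as different from "0.3"; that is acceptable for a quick check but not for validation.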
Memory requirements for jobs. How much memory is enough for most jobs?
files = list.files("./qslimfinder.Full_IntAct.FALSE/log_dir/log/")
times = sapply(files, function(file) system(paste0("cat ./qslimfinder.Full_IntAct.FALSE/log_dir/log/",file," | grep Requested"), intern = T))
> table(times)
times
 100.00 MB  200.00 MB  300.00 MB  400.00 MB  500.00 MB
        14         61         53         25          3
> table(times) / sum(table(times))
times
 100.00 MB  200.00 MB  300.00 MB  400.00 MB  500.00 MB
0.08974359 0.39102564 0.33974359 0.16025641 0.01923077
> cumsum(table(times) / sum(table(times)))
 100.00 MB  200.00 MB  300.00 MB  400.00 MB  500.00 MB
0.08974359 0.48076923 0.82051282 0.98076923 1.00000000
> 1 - cumsum(table(times) / sum(table(times)))
 100.00 MB  200.00 MB  300.00 MB  400.00 MB  500.00 MB
0.91025641 0.51923077 0.17948718 0.01923077 0.00000000
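The counts above can be turned into a small lookup for choosing memory_start. The sketch below re-enters the five counts by hand rather than reading the LSF logs, and min_memory_for is a hypothetical helper, so treat it as an illustration of the arithmetic only.

```r
# Jobs per requested-memory level, copied from the table above
requested_mb <- c(100, 200, 300, 400, 500)
n_jobs <- c(14, 61, 53, 25, 3)

# Fraction of jobs whose request was at or below each level
covered <- cumsum(n_jobs) / sum(n_jobs)

# Hypothetical helper: smallest level covering at least `target` of jobs
min_memory_for <- function(target) {
  requested_mb[which(covered >= target)[1]]
}

print(round(covered, 3))
print(min_memory_for(0.80))
print(min_memory_for(0.95))
```

With these counts, 300 MB covers about 82% of jobs and 400 MB about 98%, so the default memory_start = 350 sits between the two levels.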
path to RData containing all objects used by this pipeline
Vitalii Kleshchevnikov
listInteractionSubsetFASTA, runQSLIMFinder, groupQSLIMFinderCommand, mQSLIMFinderCommand, runCompariMotif3, readQSLIMFinderMain, readQSLIMFinderOccurence, writeInteractionSubsetFASTA_list, domainProteinPairMatch, filterInteractionSubsetFASTA_list, removeInteractionNoFASTA, centerDomains