extract_tidy_df_from_hmmer: Extract a tidy data frame with the HMMER results.

extract_tidy_df_from_hmmer    R Documentation

Extract a tidy data frame with the HMMER results.

Description

Extract a tidy data frame with the HMMER results.

Usage

extract_tidy_df_from_hmmer(
  xml.document,
  by_column = c(alisqacc = "acc", alisqname = "name")
)

Arguments

xml.document

An xml_document (for example, the object returned by read_xml()) holding results downloaded from HMMER.

by_column

A named character vector used to join the domains hash with the sequence hits hash. The default, c("alisqacc" = "acc", "alisqname" = "name"), matches the results on both the accession and the name of the sequences. This default should be used in most cases.
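
To illustrate how a named vector such as by_column maps columns between two tables, here is a minimal sketch in base R. The two toy data frames and their values are hypothetical stand-ins for the domains and hits hashes, and merge() is used only for illustration; the function itself may perform the join differently (for instance with a dplyr join).

```r
# Hypothetical toy tables standing in for the two hashes in the HMMER XML.
domains <- data.frame(
  alisqacc  = c("P12345", "P12345", "Q67890"),
  alisqname = c("SEQ_A", "SEQ_A", "SEQ_B"),
  ievalue   = c(1e-30, 2e-5, 3e-12),
  stringsAsFactors = FALSE
)
hits <- data.frame(
  acc   = c("P12345", "Q67890"),
  name  = c("SEQ_A", "SEQ_B"),
  score = c(120.5, 64.2),
  stringsAsFactors = FALSE
)

by_column <- c(alisqacc = "acc", alisqname = "name")

# The names of by_column are columns of the domains table;
# the values are the matching columns of the hits table.
joined <- merge(
  domains, hits,
  by.x = names(by_column),
  by.y = unname(by_column)
)
```

Each domain row keeps its own columns and gains the columns of the matching hit, so a sequence with several domains appears on several rows.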

Details

Below, we list the meaning of the different columns following the HMMER documentation.

  • ienv: Envelope start position

  • jenv: Envelope end position

  • iali: Alignment start position

  • jali: Alignment end position

  • bias: null2 score contribution

  • oasc: Optimal alignment accuracy score

  • bitscore: Overall score in bits, null corrected, if this were the only domain in seq

  • cevalue: Conditional E-value based on the domain correction

  • ievalue: Independent E-value based on the domain correction

  • is_reported: 1 if domain meets reporting thresholds

  • is_included: 1 if domain meets inclusion thresholds

  • alimodel: Aligned query consensus sequence for phmmer and hmmsearch, target hmm for hmmscan

  • alimline: Match line indicating identities, conservation (+) and gaps

  • aliaseq: Aligned target sequence for phmmer and hmmsearch, query for hmmscan

  • alippline: Posterior probability annotation

  • alihmmname: Name of HMM (query sequence for phmmer, alignment for hmmsearch and target hmm for hmmscan)

  • alihmmacc: Accession of HMM

  • alihmmdesc: Description of HMM

  • alihmmfrom: Start position on HMM

  • alihmmto: End position on HMM

  • aliM: Length of model

  • alisqname: Name of target sequence (phmmer, hmmsearch) or query sequence (hmmscan)

  • alisqacc: Accession of sequence

  • alisqdesc: Description of sequence

  • alisqfrom: Start position on sequence

  • alisqto: End position on sequence

  • aliL: Length of sequence

  • name: Name of the target (sequence for phmmer/hmmsearch, HMM for hmmscan)

  • acc: Accession of the target

  • acc2: Secondary accession of the target

  • id: Identifier of the target

  • desc: Description of the target

  • score: Bit score of the sequence (all domains, without correction)

  • pvalue: P-value of the score

  • evalue: E-value of the score

  • nregions: Number of regions evaluated

  • nenvelopes: Number of envelopes handed over for domain definition, null2, alignment, and scoring.

  • ndom: Total number of domains identified in this sequence

  • nreported: Number of domains satisfying reporting thresholding

  • nincluded: Number of domains satisfying inclusion thresholding

  • taxid: The NCBI taxonomy identifier of the target (if applicable)

  • species: The species name of the target (if applicable)

  • kg: The kingdom of life that the target belongs to, based on its placement in the NCBI taxonomy tree (if applicable)

  • seqs: An array containing information about the 100% redundant sequences

  • pdbs: Array of PDB identifiers (with chain information)

  • nhits: The number of hits found above reporting thresholds

  • Z: The number of sequences or models in the target database

  • domZ: The number of hits in the target database

  • nmodels: The number of models in this search

  • nincluded: The number of sequences or models scoring above the significance threshold

  • nreported: The number of sequences or models scoring above the reporting threshold
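
As a rough illustration of how some of these columns might be used together, the sketch below filters a result on is_included and derives the aligned length from alisqfrom and alisqto. The data frame is a hypothetical hand-made stand-in, not real HMMER output, and it assumes the columns are numeric; values parsed from XML may arrive as character and need as.numeric() first.

```r
# Hypothetical result; a real one would come from extract_tidy_df_from_hmmer().
hmmer_df <- data.frame(
  name        = c("SEQ_A", "SEQ_A", "SEQ_B"),
  ievalue     = c(1e-30, 0.5, 3e-12),
  is_included = c(1, 0, 1),
  alisqfrom   = c(10, 200, 5),
  alisqto     = c(150, 230, 98),
  stringsAsFactors = FALSE
)

# Keep only the domains that meet the inclusion thresholds.
included <- hmmer_df[hmmer_df$is_included == 1, ]

# Length of the aligned region on the sequence,
# assuming start and end positions are both inclusive.
included$ali_length <- included$alisqto - included$alisqfrom + 1
```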

Value

A tidy data frame with the columns described in Details.

Examples

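## Not run: 
 library(xml2)     # provides read_xml()
 library(magrittr) # provides the %>% pipe

 # xml.path: path to an XML results file downloaded from HMMER
 xml.path %>%
   read_xml() %>%
   extract_tidy_df_from_hmmer()

## End(Not run)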

currocam/toolkit4pySCA documentation built on April 7, 2022, 8:17 p.m.