From the abstract: "We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/) containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was a clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source."
MotifDb object of length 426; to access metadata
Name provided by HOCOMOCO
ID provided by HOCOMOCO including experiment type
Gene symbol for the transcription factor
Entrez gene id for the transcription factor
UNIPROT id for the transcription factor
Number of sequences evaluated for producing the PWM
Consensus sequence for the motif
from http://autosome.ru/HOCOMOCO/Details.php#200 quoted here:
"TFBS model identification modes
To construct TFBS models ChIPMunk was run four times: two times (f1) and (f2) with uniform model positional prior and two times (si) and (do) with informative model positional prior.
The min-to-max (f1) model length estimation mode was used with the min length of 7 bp and increasing it by 1 bp until the default max length of 25 bp was reached following the optimal length selection procedure as in Kulakovskiy and Makeev, Biophysics, 2009. For max-to-min (f2) model length estimation mode we started from 25 bp and searched for the best alignment decreasing the length by 1 bp until the minimal length of 7 bp. We also used the single (si) and double box (do) model positional priors in order to simulate DNA helix turn. For a single box, the positional weights are to be distributed as cos2(pi n / T), where T=10.5 is the DNA helix pitch, n is the coordinate within the alignment, and the center of the alignment of the length L is at n=0. During the internal cycle of PWM optimization the PWM column scores are multiplied by prior values so the columns closer to the center of the alignment (n=0) receive no score penalty while the columns around (n = 5,6,-5,-6) contribute much less to the score of the PWM under optimization. The single box model prior was used along with the min-to-max length estimation mode (si). We also used the double box model prior with a shape prior equal to sin2(pin / T), which was used to search for possibly longer double box models in the max-to-min length estimation mode (do).
Model quality assignment
The resulting models were rated (from A to F) according to their quality. Model quality rates from A-to-D were assigned to proteins known to be TFs, including those listed in Schaefer et al., Nucleic Acids Research, 2011 with addition of a number of proteins having relevant models and sufficient evidence to be TFs. The ratings were assigned by human curation according to the following criteria:
Relevant distribution of position-specific information content over alignment columns, which means a model LOGO representation displaying well formed core positions with a high information content surrounded by flanking letters with lower information content; the information content at flanking positions decreasing with the distance from the model core.
"Stability", which means that in more than one of the ChIPMunk modes we obtained models with a similar length, consensus, and comparable number of aligned binding sites, along with a similar shape of model LOGO representation. "Similarity" of the model to the binding sequence consensus for this TF given in the UniProt or other databases, which means similarity of the shape of the model LOGO and TFBS lengths to those of other TFs from the same TF family. "A total number of binding sites" was also considered as a quality measure, as a large set of binding regions (mostly but not limited to ChIP-Seq and parallel SELEX) implies that there are many observations of each letter in any position of the alignment, particularly many observations of non-consensus letters in core positions. In positions with low information content, where there is no strong consensus, all variants have many observations, and thus the observed letter frequencies are less dependent on statistical fluctuations.
Quality A was assigned to high confidence models complying with all four criteria listed in the section above. Quality B was assigned to models built from large sequence sets that failed no more than one out of the three remaining criteria. Quality C was assigned to models built from small sequence sets but (with a number of specifically marked exceptions) complying with the three remaining criteria. Quality D models missed part of the known consensus sequence or had no clearly significant core positions in the TFBS model. Quality E (error) was assigned to models for proteins not convincingly shown to be TFs or to models exhibiting an irrelevant LOGO shape or a wrong consensus sequence. Quality F (failure) was assigned to TFs for which there was no reliable model identified."
Source for more details
Kulakovskiy,I.V., Medvedeva,Y.A., Schaefer,U., Kasianov,A.S., Vorontsov,I.E., Bajic,V.B. and Makeev,V.J. (2013) HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Research, 41, D195–D202.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.