A dataset of subcellular protein localisations, together with several amino acid sequence based metrics used to classify the localisation. 'yeast' contains the full dataset, while 'yeast_classes' is simply the 'class' column of the 'yeast' dataset as a vector, provided for convenience in the practical exercise.
A data frame with 1484 rows and 10 variables:
Accession number for the SWISS-PROT database
McGeoch's method for signal sequence recognition
von Heijne's method for signal sequence recognition
Score of the ALOM membrane spanning region prediction program
Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins
Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute
Peroxisomal targeting signal in the C-terminus
Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins
Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins
Experimentally observed subcellular localisations.
Class Distribution. The class is the localization site.
CYT (cytosolic or cytoskeletal) 463
NUC (nuclear) 429
MIT (mitochondrial) 244
ME3 (membrane protein, no N-terminal signal) 163
ME2 (membrane protein, uncleaved signal) 51
ME1 (membrane protein, cleaved signal) 44
EXC (extracellular) 37
VAC (vacuolar) 30
POX (peroxisomal) 20
ERL (endoplasmic reticulum lumen) 5
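As a rough check on the class distribution, the counts listed above can be transcribed into a named vector and summarised. This is only a sketch: `class_counts`, `listed_total`, `class_share` and `classes_standin` are illustrative names, and the numbers are copied from the list, not read from the packaged data. Note that the listed counts add up to 1486, not the 1484 rows stated for the data frame, so the list may not match the data exactly. With the package loaded, `table(yeast_classes)` should give the actual counts.

```r
# Class counts transcribed from the list above (not read from the packaged data).
class_counts <- c(CYT = 463, NUC = 429, MIT = 244, ME3 = 163, ME2 = 51,
                  ME1 = 44, EXC = 37, VAC = 30, POX = 20, ERL = 5)

# Total of the listed counts, for comparison with the stated 1484 rows.
listed_total <- sum(class_counts)

# Share of each class among the listed counts, rounded to three decimals.
class_share <- round(class_counts / listed_total, 3)

# Hypothetical stand-in for 'yeast_classes', rebuilt from the counts so that
# table() can be demonstrated without the package being installed.
classes_standin <- rep(names(class_counts), times = class_counts)
tabulated <- table(classes_standin)

print(listed_total)
print(class_share)
print(tabulated)
```

The shares make the imbalance plain: the two largest classes (CYT and NUC) cover well over half of the proteins, while ERL has only five examples, which is worth keeping in mind when judging a classifier by overall accuracy.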
...
Paul Horton & Kenta Nakai, ["A Probablistic Classification System for Predicting the Cellular Localization Sites of Proteins"](https://www.aaai.org/Papers/ISMB/1996/ISMB96-012.pdf), Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA 1996.