This is just a suggestion to try and help get you into "good habits" early. If I was sitting to take this practical now, I would start a new R script file. That way all of the work that you have done associated with today's course is in one place, and the code for each of the practicals is separate from one another. This might feel a bit tedious right now, but as the amount of code you write and number of projects you take part in increases it will pay off to have a structured workflow.
For this set of questions we will use a dataset from a 1996 paper on yeast protein localisations within the cell[^1]. Each row is a protein locus. For each locus the data contains the various features of the amino acid sequence to identify the subcellular locations and then finally the experimentally observed subcellular localisation.
To load the data into your current session you can use.
[^1]: Paul Horton & Kenta Nakai, "A Probablistic Classification System for Predicting the Cellular Localization Sites of Proteins", Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA 1996.
data(yeast, package = "jrIntroBio")
(1) Use the head()
function to inspect the top of the data, this can help give you a feel for what the data looks like and what variables are contained within the data?
r
head(yeast)
(1) How many proteins and how many variables are in this data set?
r
dim(yeast)
(1) Recall that if I want to pull out a single column, or variable, from a data frame we can use $
. To extract the protein id's from this data set we write yeast$seq
.^[If you can't remember what the names of the columns are, you can use colnames(yeast)
to find out. Since this dataset is part of a package, you could also type ?yeast
to get some documentation on the dataset]
(1) Using mean()
and median()
calculate these summary statistics for the measure on nucleus concensus patterns nuc
.
r
mean(yeast$nuc)
median(yeast$nuc)
(1) What is the lowest value for discriminant analysis[^2] on the amino acid content of vacuolar and extracellular proteins vac
in the data.
[^2]: Discriminant analysis in this context is a comparison of the amino acid composition of an unknown protein in question to the amino acid composition of proteins of a known localisation. It doesn't matter exactly how this measure was derived for now, just know that a higher value means a higher degree of similarity in amino acid composition to the known proteins of that class.
r
min(yeast$vac)
(2) What is the maximum value for the result of discriminant analysis on the amino acid content of the 20-residue N-terminal region of mitochondrial and non-mitochondrial proteins mit
?
r
max(yeast$mit)
(3) What is the standard deviation for vacuolar and extracellular measure vac
?
r
sd(yeast$vac)
(4) How many proteins of each class are in the dataset?
r
table(yeast$class)
Solutions to the practical questions are contained within the package
vignette("solutions2", package = "jrIntroBio")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.