knitr::opts_chunk$set(echo = TRUE)
The drosophila shroom genebank page is at https://www.ncbi.nlm.nih.gov/protein/ABA81834 (I think)
Sequence can be obtained by licking on "FASTA. I copy and pastedit from https://www.ncbi.nlm.nih.gov/protein/ABA81834.1?report=fasta . Note the quotation marks around the string.
dmshrm <- c("MKMRNHKENGNGSEMGESTKSLAKMEPENNNKISVVSVSKLLLKDSNGANSRSSNSNASFSSASVAGSVQ DDLPHHNSSSSQLGQQHGSSLDQCGLTQAGLEEYNNRSSSYYDQTAFHHQKQPSYAQSEGYHSYVSSSDS TSATPFLDKLRQESDLLSRQSHHWSENDLSSVCSNSVAPSPIPLLARQSHSHSHSHAHSHSNSHGHSHGH AHSASSSSSSNNNSNGSATNNNNNNSSESTSSTETLKWLGSMSDISEASHATGYSAISESVSSSQRIVHS SRVPTPKRHHSESVLYLHNNEEQGDSSPTASNSSQMMISEEANGEESPPSVQPLRIQHRHSPSYPPVHTS MVLHHFQQQQQQQQDYQHPSRHHTNQSTLSTQSSLLELASPTEKPRSLMGQSHSMGDLQQKNPHQNPMLG RSAGQQHKSSISVTISSSEAVVTIAPQPPAGKPSKLQLSLGKSEALSCSTPNMGEQSPTNSIDSYRSNHR LFPVSTYTEPVHSNTSQYVQHPKPQFSSGLHKSAKLPVITPAGATVQPTWHSVAERINDFERSQLGEPPK FAYLEPTKTHRLSNPALKALQKNAVQSYVERQQQQQKEEQQLLRPHSQSYQACHVERKSLPNNLSPIMVG LPTGSNSASTRDCSSPTPPPPPRRSGSLLPNLLRRSSSASDYAEFRELHQAQGQVKGPSIRNISNAEKIS FNDCGMPPPPPPPRGRLAVPTRRTSSATEYAPMRDKLLLQQAAALAHQQHHPQQHRHAQPPHVPPERPPK HPNLRVPSPELPPPPQSELDISYTFDEPLPPPPPPEVLQPRPPPSPNRRNCFAGASTRRTTYEAPPPTAI VAAKVPPLVPKKPTSLQHKHLANGGGGSRKRPHHATPQPILENVASPVAPPPPLLPRARSTAHDNVIASN LESNQQKRSNSKASYLPRQSLEKLNNTDPDHGIYKLTLTSNEDLVAHTKPSYGVTGKLPNNLPDVLPLGV KLHQQPKLQPGSPNGDANVTLRYGSNNNLTGNSPTVAPPPYYGGGQRYSTPVLGQGYGKSSKPVTPQQYT RSQSYDVKHTSAVTMPTMSQSHVDLKQAAHDLETTLEEVLPTATPTPTPTPTPTPPRLSPASSHSDCSLS TSSLECTINPIATPIPKPEAHIFRAEVISTTLNTNPLTTPPKPAMNRQESLRENIEKITQLQSVLMSAHL CDASLLGGYTTPLITSPTASFANEPLMTPPLPPSPPPPLEPEEEEEQEENDVHDKQPEIEELQLMQRSEL VLMVNPKPSTTDMACQTDELEDRDTDLEAAREEHQTRTTLQPRQRQPIELDYEQMSRELVKLLPPGDKIA DILTPKICKPTSQYVSNLYNPDVPLRLAKRDVGTSTLMRMKSITSSAEIRVVSVELQLAEPSEEPTNLIK QKMDELIKHLNQKIVSLKREQQTISEECSANDRLGQDLFAKLAEKVRPSEASKFRTHVDAVGNITSLLLS LSERLAQTESSLETRQQERGALESKRDLLYEQMEEAQRLKSDIERRGVSIAGLLAKNLSADMCADYDYFI NMKAKLIADARDLAVRIKGSEEQLSSLSDALVQSDC")
The object. Note the "\n" in the sequence.
dmshrm
Check out how long it is. Why is length 1?
length(dmshrm)
How many characters?
nchar(dmshrm)
Use the gsub() regular expression to get rid of the newline characters (regular expressions are tought and I will always write them for you)
dmshrm <- gsub("\n", "", dmshrm)
Now how many character?
nchar(dmshrm)
Another example of regular expression: Let's say I wanted to turn all the I to ¯_(ツ)_/¯ for some weird reason (one example of why regular expressions are tough: I can't figure out how to get one of the arms to be correct)
gsub("I", "¯\\_(ツ)_/¯", dmshrm)
Use just the actin binding part of shroom2
dmshrm.actin <- substr(dmshrm, start = 445, stop =920)
Use nchar to confirm change
nchar(dmshrm.actin)
Check the class of the new object (I did this in class b/c I got muddled about what nchar does to a vector versus another operation; see below)
is(dmshrm.actin)
I didn't do this in class but had notes on it. You can put each aa into a separate element of a vector like this
dmshrm.actin2 <- strsplit(dmshrm.actin,"")
"strsplit" stands for "String split."
for some reason strsplit gives us a list
is(dmshrm.actin2)
if we do length() we get an odd result
length(dmshrm.actin2)
iists take some experience to work with. I'll write code to work with them. To move forward what we need to to access the vector of amino acids which is stored i this list. The easiest way to do this is with bracket notation (can't use a dollar sign b/c there isn't a name assigned to the vector; these are some R quirks). Note the double brackts. [[1]] means "give me the first element of the list.
dmshrm.actin2 <- dmshrm.actin2[[1]]
Now the length makes sense
length(dmshrm.actin2)
is(dmshrm.actin2)
I can summarize the aa composition of this subsequence with table()
table(dmshrm.actin2)
Save table as an object
my.table <- table(dmshrm.actin2)
A table is a table, not a dataframe or a matrix. Its a special data structure for summarizing data
is(my.table)
The plot function can work on tables
plot(my.table)
Normally we give plot() and x and y axis to plot. Due to R's object oriented abilities attributes of table objects communicate with plot() to tell it what to do.
We can sort this using the order command
i<- order(my.table) plot(my.table[i])
I can reverse the order with rev()
#reverse the index i.rev <- rev(i) #reverse the vector ## (this might be hard to understand; no worries) my.table.rev <- my.table[i.rev] #plot it plot(my.table.rev)
order and rev are functions I don't expect you to remember; I'll always give them to you if needed
Ok, now back to our main event.
The bio3d package lets you do blast searchesfrom R. Install it if needed
# install.packages("bio3d") library(bio3d)
We run the blast like this
dmshrm.blast.1 <- blast.pdb(dmshrm.actin)
How many results are resumed? I got 5. Why is that?
(If you already loaded the code for my version of the blast.pdb file then you can replciate this behavior by either of these commands bio3d::blast.pdb(dmshrm.actin) #explicilty call the bio3d version of the function blast.pdb(dmshrm.actin,database = "pdb") #explicitly set dbase to pdb)
The protein data band (pdb) is a special database: "The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids." (wikipedia). We don't have good 3D data on most proteins, so there aren't that many entries in the pdb (https://en.wikipedia.org/wiki/Protein_Data_Bank).
I tweaked the blast.pdb() to allow it to work with other databases; the default is now refseq_protein, the database of reference sequences (only one entry for each protein). Run all the code in the blast_pdb.R file.
dmshrm.blast.2 <- blast.pdb(dmshrm.actin)
Whatkind of object do we get?
is(dmshrm.blast.2)
class(dmshrm.blast.2)
This is a special type of object created by these package authors. Its basically a list though
is.list(dmshrm.blast.2)
Peak at the structure
str(dmshrm.blast.2, max.level = 1)
The data is in the hist.tbl part of the list.
head(dmshrm.blast$hit.tbl)
We can save hit.tbl to its own object if we want
hit.tbl <-dmshrm.blast.2$hit.tbl
I gave a very brief shout out to Object Oriented Programming today. This is a huge topic, though in R it isn't always a big player in how we code. As a typical R user, you'll never need to think about it, though you will benefit from it. Whickham write
"The main reason to use OOP is polymorphism (literally: many shapes). Polymorphism means that a developer can consider a function’s interface separately from its implementation, making it possible to use the same function form [sic] for different types of input. This is closely related to the idea of encapsulation: the user doesn’t need to worry about details of an object because they are encapsulated behind a standard interface." https://adv-r.hadley.nz/oo.html
The way I think about it (which may be overly simplistic) is that functions are often very flexible and can take on many kinds of input. The code below will evoke some of these ideas; follow the link above for more information.
Let's make a new dataframe with a few numeric variables from our blast
my.hit.tbl <- data.frame(alignmentlength = hit.tbl$alignmentlength, evalue = hit.tbl$evalue, bitscore = hit.tbl$bitscore, identity = hit.tbl$identity)
head(my.hit.tbl)
In class I made a plot
plot(my.hit.tbl$evalue, my.hit.tbl$identity)
Looks better logged
plot(log(my.hit.tbl$evalue), my.hit.tbl$identity)
I can also use a ~. Note that the behavior of plot is different with ~ and I've switched the order ofthe objectswithin plot
plot(my.hit.tbl$identity ~ log(my.hit.tbl$evalue))
I can also do this, with a data argument; "my.hit.tbl" only has to be written once, and no $
plot(identity ~ log(evalue), data= my.hit.tbl)
Due to object oriented programming features, I can also just throw the whole dataframe into plot, and plot() knows that when that happen, make a scatter plot of all the numeric variables (This doens't work unless all variables are numeric)
plot(my.hit.tbl)
I played around with for loops. First I put all the identity values into a single object, x
x <- dmshrm.blast$hit.tbl$identity
This is jsut a vector of numbers
x
For a toy loop example, I will print out each value one by one using a loop, along with some other text
First, how many elements are in the x?
length(x)
A simple, very explicit loop is
for(i in 1:108){ cat("\n") print("i") print(x[i]) }
1:108 creates an index. "i" will take on each value, 1 to 108
A more flexible approach is
for(i in 1:length(x)){ cat("\n") print("i") print(x[i]) }
As Rich pointed out, another approach is
for(i in x){ print(i) cat("\n") }
Here, "i" takes on each value of x, and so I is the index, but the acutal value we want. In many ways this is cleaner code, and in more advanced scenarios this will prevent bugs
One issues is that my approach will try to run the loop even if my vector "x" doesn't have anything in it
#set x to null x <-NULL #how long is it length(x) #run loop for(i in 1:length(x)){ cat("\n") print("i") print(x[i]) }
Rich's method, however, does nothing, because there is nothing to do
for(i in x){ print(i) cat("\n") }
Another alternative which I am still exploring is to use the seq_len function. This throws an error when there is nothing in x
for(i in seq_len(x)){ print(x[i]) cat("\n") }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.