Introduction

This first session is about getting a feel for scripting, and its usefulness in almost every branch of science.

R is the world's most popular statistical programming language. It is open source and hugely community-driven, and is constantly expanded by the development of packages that add new functionality and implement new analysis methods and tools. New packages are added to the official curated package database, CRAN, every week.


Thanks to this flexibility, usability, and speed, R's popularity has exploded in the last decade, especially in bioinformatics. Just look at the number of questions being asked about these three popular languages on the online bioinformatics community BioStars.


Those fancy graphics and applets you see on news websites? Many of these are made in R, such as this collection from the BBC.




This session should set you up with the R skills to do analysis on biological data. In a few hours, you can challenge yourself to understand a line of code like this:

phenotype_data_1[ sample != "Wheat" , .( average_yield_pba = mean(yield*14.87) , average_height = mean(height) ) , by=species ]

Make sure you do all the exercises, and avoid copying-and-pasting. If possible, don't move on from a concept or command until you understand it. And---most importantly---play around and experiment.

We are using R via an Integrated Development Environment, or IDE, called RStudio. It's a nice visual tool that puts everything you need to know on one screen and makes programming much easier.

The top left panel of the IDE is where you work on files. You'll fill them with lines of R script that can be run sequentially. If you haven't already, open the exercise sheet for workshop 1 by clicking "workshop_001_exercise_sheet.R" in the list of files in the bottom right window. You can then write R commands into the document:




You can use the top of the exercises sheet as a "playground" for running commands as you follow the workshop. Run all the commands described as you learn them here, and also the commands you'll write for the exercises. As you work, save your exercise sheet with ctrl-s regularly.

You should see these two lines at the top of your sheet:

#run this to get all your data and functions loaded
source("/filer/projekte/ggr-course2020/r-source/genres_2020_bioinfo_setup_fns_data.R")

The first line is called a comment, indicated by the hash (#) symbol at its start. Anything after a hash counts as a comment. Comments are not code: R completely ignores them! They are just useful notes. I will use them a lot in the instructions.
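For example (a quick illustration, separate from the setup lines above), a comment can fill a whole line or sit at the end of a line of code:

#this entire line is a comment, so running it does nothing
5*5 #the part before the hash runs as normal; the part after is ignored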

The second line of the setup above is a command that will automatically load all the packages and data. To run a command, either highlight it or simply place the cursor on the same line, and press ctrl-<enter>. Do this now with the source() command above. You can also type commands directly into the console.

The lower left panel is the console, which is where R commands run. When you press ctrl-<enter>, the highlighted command is sent down to the console, and its output is shown below it in the console. If the command creates a plot, the plot will be shown in the bottom right window.

The simplest commands are basic arithmetic. Type some expressions like these into your script and try running them with ctrl-<enter>.

1+1
1/2
5*5
4**2 #four to the power of two

Variables

Variables are a key concept in programming. A variable is like a box that stores some information. When you want to later access or use that information, you can simply use the variable's name. You can name a variable almost anything you like, but informative names are always helpful.

Let's create a variable called my_var and store the number 3 in it. To do this, we use the assignment operator (<-), which looks like an arrow pointing the information into the box.

my_var <- 3

To check what my_var stores, you can just run the variable name like any other command. If done correctly, it will print "3" in the console.

my_var

Now, you can confirm that you can treat my_var as a substitute for the value it stores. Try a few operations, for instance:

my_var + 10
my_var * my_var

A single number is not a very interesting thing to store in a variable. Let's define a "list" (technically called a vector) of numbers called my_vec. We do this by writing the list inside brackets preceded by c (short for "combine"), with the entries separated by commas.

my_vec <- c(4,10,-100,2.4,1,-5,0)
my_vec #run it to make sure it contains the list

Vectorised operations

Having a vector to play with will let us demonstrate a few more things. First, a true R specialty: vectorised operations.

Try running some simple arithmetic operations on my_vec.

my_vec * 4
my_vec / 100
my_vec + my_var
my_vec + my_vec

Notice how the operations act on all the entries in the vector? Think about the last example there, my_vec + my_vec. The result is a vector, made by adding the first element of my_vec to itself, and the same with the second, etc. This is different to the previous operations such as my_vec * 4, where each element was multiplied by the same value. But these are both examples of vectorised operations; operations that automatically act across lists of things.
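As a small extra demonstration, multiplying a vector by itself squares every element, which gives the same result as applying the power operator from earlier to the whole vector:

my_vec * my_vec #element-by-element squaring
my_vec**2 #the same result, written with the power operator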

As with normal maths, operations can get quite tricky, and it's best to include brackets to clarify what operations should go in what order. For instance, look at the results of these:

 my_vec + my_vec  * 2
(my_vec + my_vec) * 2

See? R defaults to doing the multiplication (*) first, but the brackets tell R to do the + first. So the results are different.

We can save the results of a command into a new variable, e.g.:

my_new_vec <- (my_vec + my_vec) * 2
my_new_vec

Exercise :

(@) Complete the following steps using maximum one line of code per step unless otherwise indicated. Save all the code you create, and write short answers where requested [13]:
+ Save a number inside a variable called "my_number".
+ Save a second number in a variable called "my_other_number". Make sure it is a number smaller than 10.
+ Add the two together, and save the result in a variable called "my_sum".
+ Print the value of the sum to the screen.
+ Store a string (i.e. some text, any text you like) in a variable called "my_string". You can do it just like for numbers, just put the text inside quotes, like this: my_string <- "some text here".
+ Print the string to the screen.
+ Try to add my_string to my_sum. What happens, and why do you think it happens? [Write your answer]
+ Create a variable called "my_vec" that contains a vector of 5 numbers. Give them all values between 20 and 1000.
+ Write a simple command that uses a vectorised operation to produce the same vector as "my_vec", but with all the entries being twice as large as their original amount (e.g. [994,33,54,100,270] would become [1988,66,108,200,540]).
+ Write a simple command that uses a vectorised operation to produce the same vector as "my_vec", but with all the entries being 10% of their original amount (e.g. [994,33,54,100,270] would become [99.4,3.3,5.4,10,27]).
+ Let's pretend that my_vec contains temperatures in degrees Celsius, and we want to convert them to Fahrenheit. The formula to convert between Celsius and Fahrenheit is T_Fahrenheit = (T_Celsius * 9/5) + 32. Write a command using vectorised operations to convert all the entries in my_vec to degrees Fahrenheit.
+ Write and run the command my_vec %% my_other_number. Can you work out what the %% operator does? Explain it. Google is your friend. [Write your answer]
+ Write a pair of commands that demonstrate how the order of operations can change the results of a calculation, and how brackets can control what operations are done first.

Functions

Now, functions. While simple operators like + and * perform straightforward things, functions are tools that perform more exciting and advanced things. Functions are always (ALWAYS!) called by a function name followed by brackets enclosing a list of arguments. Any expression you see with the structure some_function_name( ... some stuff in here ... ) is a function call. For example, try:

seq( from = 2 , to = 20 , by = 3 )

This is a function for making vectors that are evenly-spaced sequences of numbers (e.g. 2, 5, 8, 11, 14 ... ). Notice that each argument is given by name (from, to, by), with an = sign between the name and its value.

Run that command to check what it does, then try adjusting the values for each argument to get an understanding of how each one works.

Once you've done that, try this:

seq(1,10,2)

Notice that it does exactly what the original command does ... even though we never gave it the argument names. This is because R will guess what the values are meant to mean based on their positions. It saves time, but it can also make your code confusing, so be careful.
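You can also mix the two styles, giving some values by position and others by name. For example, this produces the same sequence again:

seq( 1 , 10 , by = 2 ) #1 and 10 are matched to "from" and "to" by their positions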

So, play around with these useful little functions and figure out what they do. You'll use some of them in the next exercise.

sum(my_vec)
mean(my_vec)
sd(my_vec) #"sd" stands for "standard deviation"
max(my_vec)
min(my_vec)
range(my_vec)
rep(my_vec , times = 3 ) #"rep" stands for "repeat"
rep(my_vec , each = 3 )
rnorm(n = 100 , mean = 20 , sd = 2 ) #"rnorm" stands for "random normal"
sort(my_vec)
sort(my_vec,decreasing=TRUE)

Notice that not all functions return the same thing. Some return vectors, and others return single numbers. Some even create graphs, play sounds, write files, and other cool things. And many even have multiple effects, as we will see.
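For instance (a small aside; hist() is a base R function you haven't met yet), some functions both draw something and return a value:

my_histogram <- hist( rnorm(n = 100, mean = 20, sd = 2) ) #draws a histogram in the plot window ...
my_histogram # ... AND returns the bin breakpoints and counts, which we stored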

Exercise :

(@) Complete the following steps using maximum one line of code per step unless otherwise indicated. Save all the code you create. [6]:
+ Use a function to create a sequence of numbers beginning 1,2,3 ... all the way up to 100. Store the result in a variable called "my_sequence".
+ Print my_sequence to the screen.
+ Use a function to create a vector of 100 normally-distributed random numbers, with a mean of 30 and standard deviation of 5. Store the result in a vector called "my_random_numbers".
+ Print my_random_numbers to the screen.
+ Use a function to sort my_random_numbers and store it in a vector called "my_random_numbers_sorted".
+ Run this command: plot(x=my_sequence,y=my_random_numbers_sorted). If you have done all the previous commands correctly, it will plot an approximately S-shaped curve in the lower right window of the screen.

Logical operators

A VERY common theme in bioinformatics is asking yes/no questions about your data. For this reason, most programming languages include ways to formulate such questions so that a computer can understand them. They are constructed from a few simple, logical ideas that we use intuitively when we speak. In R, there are symbols for ideas like "is equal to" (==), "is greater than" (>), "is less than" (<), NOT (!), AND (&), and OR (|).

And a few clever combinations of these, such as "is not equal to" (!=) and "is greater than or equal to" (>=).

Let's try by making two new vectors and asking questions about them. Run these:

vec1 <- c(1,2,3,4,5,6,7)
vec2 <- c(7,6,5,4,3,2,1)

#The "is equal to" logical operator, `==`
vec1 == vec2

We get "FALSE FALSE FALSE TRUE FALSE FALSE FALSE". Why? Well, == is vectorised too: it compares the two vectors element by element (1 against 7, 2 against 6, 3 against 5, and so on), and only the fourth pair, 4 and 4, is equal.

Work through these, experiment, refer to the notes, and make sure you understand the outcome. Commands like these will be very important for the rest of the course.

vec1 == 2
! vec1 == 2
vec1 != 2
vec1 >= 5
vec1 == mean(vec1)
sum(vec1) == sum(vec2)

If we want to add more conditions to a logical expression, we can use the AND operator (&).

vec1 > 4 & vec1 < 7

So, we get TRUE for all the elements which are greater than 4 AND less than 7 ... so "FALSE FALSE FALSE FALSE TRUE TRUE FALSE".

This is actually a tiny bit trickier than it looks. In English, this idea would be expressed as "which elements of vec1 are greater than 4 and less than 7" ... so why can't we just write vec1 > 4 & < 7? Well, the R language just doesn't support syntax like that, which is a deliberate choice to enforce consistency. The command we just wrote can be understood by breaking it up into steps. Which is your job:

Exercise :

(@) Complete the following steps using maximum one line of code per step unless otherwise indicated. Save all the code you create, and write short answers where requested [10]:
+ Define the variable vec1 as demonstrated above.
+ Run a logical operation on vec1 to give TRUE for all elements greater than 4, and store the results in a variable called "vec1_gt_4".
+ Run a logical operation on vec1 to give TRUE for all elements less than 7, and store the results in a variable called "vec1_lt_7".
+ Print both those vectors to the screen (two commands).
+ Run this command: vec1_gt_4 & vec1_lt_7, and think about the results. What did it do? Based on your reasoning, explain briefly how the command vec1 > 4 & vec1 < 7 works. [Write your answer]
+ Run this command: vec1_gt_4 | vec1_lt_7, and think about the results.
+ Based on your observations, do the AND (&) and OR (|) operators perform vectorised operations? [Write your answer]
+ Run this command: !vec1_gt_4 and observe the results. What does the NOT (!) operator actually do? [Write your answer]

Indexing with Vectors

Indexing is a way to extract elements or sets of elements out of some collection (e.g. a list). Very simple indexing can be done by putting numbers, or lists of numbers, within square brackets ([...]) after the variable name. For example, we can select some items from vec1. Try these:

vec1
vec1[1]
vec1[2]
vec1[3]
vec1[c(1,2,3)]

select_me <- c(4,5,6)
vec1[select_me]

As you can see from the last example, you can use variables to do indexing. And you can also use functions, and logical expressions. If the index contains a logical vector the same length as the indexed vector, then the elements will be selected wherever the logical vector contains "TRUE". For example if we had a vector containing 1, 2, 3, 4, 5 indexed with a vector containing TRUE, FALSE, TRUE, TRUE, FALSE the result would be 1, 3, 4. Try it!

v <- c(1,2,3,4,5)
s <- c(TRUE,FALSE,TRUE,TRUE,FALSE)

v[s]

#notice we could also do this without the variable, one less step
v[c(TRUE,FALSE,TRUE,TRUE,FALSE)]

#or, we can go even one step further and not use variables at all
c(1,2,3,4,5)[c(TRUE,FALSE,TRUE,TRUE,FALSE)]

Here're a few little secrets. R has a lot of shortcuts to do common things quickly. We could really simplify the above code using two little tricks:

v <- 1:5 # Trick 1. The `:` operator automatically creates ranges of numbers
s <- c(T,F,T,T,F) # Trick 2. "T"=TRUE and "F"=FALSE
v[s]
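Incidentally, the : operator counts in whichever direction you ask, which is worth knowing:

5:1 #gives 5 4 3 2 1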

By combining ideas, we can do some fancy things now. For example, play around with this command until you understand how it works. It combines a function max(), with a logical operator !=, with indexing.

vec1[vec1!=max(vec1)]

#Bit lost? Break it up into pieces: Run
vec1
#and then
max(vec1)
#and then
vec1!=max(vec1)
# ... make sense now?

Over to you.

Exercise :

(@) Complete the following steps using one line of code per instruction. Save all the code you create, and write short answers where requested [5]:
+ Describe in words what the command above (vec1[vec1!=max(vec1)]) does. [Write your answer]
+ Write a command that uses indexing and logical statements to choose all the elements of vec1 that are greater than 4 and not equal to 6.

A programmer's best friend!

Here are some tricks to make your life a lot easier.

Building commands

The structure of a command can get complicated. Some commands have a lot of arguments or parts, and a single mistake will make them fail. To help, programmers usually do NOT write the command from left to right like a sentence. Instead, we write the structure of the command first, then fill in the blanks.

For instance, writing the seq() command from earlier, I would do it in a few steps:

#step 1 write:
seq() # this ensures we have both the open and close brackets

#step 2 fill in argument names:
seq( from = , to = , by = ) #this ensures we have all the argument names and commas in place

#step 3 fill in arguments:
seq( from = 1 , to = 10 , by = 2 ) #fill in the blanks to complete the command

<tab> autocomplete

The IDE is clever! When you start typing, it can try to autocomplete or suggest possibilities when you press the <tab> key. It is so useful that the <tab> key on my keyboard is blank because I press it so much.

For instance, type se and press <tab>. The IDE will give you a whole list of commands and variables beginning with 'se', including our friend seq():

Notice how this even gives you some info on each command. Now try my_v and then <tab>. You should get suggestions for variables beginning with those letters:

Useful! If there is only one option available, the command will be automatically completed. The <tab> trick also works on file names.

Help documentation

All standard (or base, in the lingo) R functions, and most functions in general, have help documentation associated with them. You can access it by preceding the function name with a question mark (?). The documentation can remind you of what the function does, describe all the arguments it can take, and give examples of how it is used. Use it often!

Try:

?seq
?c
?lm
?plot

Ok, one last exercise to reiterate everything you know, and also to show off some of R's flexibility. Then we will move to another topic.

Exercise :

(@) Complete the following steps using as much code as you need. You can choose your own variable names this time, unless indicated otherwise. Save all the code you create or run, and write short answers where requested. [10]:
+ Create a large vector (say, at least a few thousand elements) with random numbers in it. Make sure some of the random numbers are positive and some are negative. Store it in a variable.
+ Select only the elements of this vector that are greater than the mean of all the elements, and store these in a new variable.
+ Sort that vector and store the results in a variable called vec_x.
+ Print the first 20 elements of vec_x to the screen.
+ Create a second vector whose elements are the same as the elements in the sorted vector, all raised to the power of two. Store it in a variable called vec_x_squared.
+ Run this command: vec_y <- vec_x_squared + rnorm( n=length(vec_x_squared) , mean=0 , sd=5 ). What does this command do? Experiment, work it out, and explain briefly. [Write your answer]
+ Create a scatterplot with the values of vec_x on the x-axis, and the values of vec_y on the y-axis. HINT: Remember the plot(x=...,y=...) function.
+ Run the command fit_polynomial(x=vec_x,y=vec_y). If you have done everything correctly so far, this will plot a curve with a line fitting the curve, and also print a lot of information about the fit to the console (which you can ignore---it's just to show you how simple R can make it to do difficult statistics).

Well done! Now is a good time to take a break, think over the concepts, and come back after a walk or a coffee.

The Amazing data.table

A data.table is one of the most useful data structures in statistical programming. They are very much like typical Microsoft Excel spreadsheets, where you might have rows representing experimental 'units' (observations, genetic markers, plants, subjects, measurements ... ), and columns that contain information about those units (name, height, weight, colour, flavour ...).

Let's jump in with some proper biological data. I've created a pre-loaded data.table stored in the variable phenotype_data_1. Have a look.

phenotype_data_1

It should look like this:



Basically, each row tells us phenotype information about a particular sample in a genetic diversity panel. The information is in columns. We have columns that tell us: the sample's name ("sample"), its species ("species"), whether it self-pollinates ("selfing"), its height in cm ("height"), and its yield in tonnes per hectare ("yield").

Perhaps you noticed that the columns look a bit like vectors, except that where we previously saw vectors written horizontally across the screen, these appear as vertical column lists. That's no coincidence: they really are vectors, and the glory of a data.table is that we can very easily manipulate and analyse the data in flexible and creative ways by referring to these vectors. That's what we will learn now.

The three indexing fields

data.table's power lies largely in its excellent indexing system. We have already seen a simpler indexing system for vectors. This one has a lot more cool functionality. data.tables have three indexing fields. Spend a minute memorising this basic structure:

data_table_name[ SELECT , OUTPUT , GROUP_BY ]

When you start writing a data.table command, begin by typing data_table_name[ , , ] (where data_table_name is just the name of the variable your data.table is stored in). Then fill in the fields. You can leave fields blank too.

1) The SELECT field.

This is used to select certain rows, and we do that by making logical statements about the columns, for instance try this:

phenotype_data_1[ height > 150 , , ]

This selects all the rows where height is above 150.

Here's another example.

phenotype_data_1[ sample=="sample_5" , , ]

Selects the row where the sample is "sample_5".

And you can use logical tests on all the columns to do your selection, for instance, maybe we'd like to see all the samples that can self-pollinate, but ignore the wheat samples, so we need to combine two logical statements with an AND operator (&). We also use the NOT operator (!) to cut out the wheat samples:

phenotype_data_1[ selfing==TRUE & ! species=="Wheat" ,  ,  ]

A good way to visualise selection is to imagine crossing out the rows that fail the logical tests, like this.

Exercise :

(@) Complete the following steps using maximum one command per instruction. Save all the code you create. [5]:
+ Select all the rows of phenotype_data_1 where the yield is above 3.
+ Select all the rows of phenotype_data_1 where the yield is above 3, or the height is above 100.
+ Select all the rows where the yield is between 3 and 4.
  + HINT: Use the AND operator (&) in conjunction with the less than (<) and greater than (>) operators.
+ Select every barley row where the yield is above the mean yield for the whole table.
+ Select every barley row where the yield is above 80% of the mean yield for the whole table.
  + HINT: 20% of 50 is equal to 0.2*50.

2) The OUTPUT field.

By default, a data.table command just returns the rows of the data.table you selected (or all the rows, if you didn't use the SELECT field).

Instead, you might want to actually do something with the columns of the data.table, for example finding the mean or maximum value of a column. Your results will become the columns of the output, and you get to name those result columns.

At its very simplest, we could just return one of the columns as a vector. A minimal sketch (using the yield column from the table above):
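phenotype_data_1[ , yield , ] #returns the yield column as a plain vector

More usefully, we can compute named summaries of the columns. Try this command: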

phenotype_data_1[ , .( average_yield = mean(yield) , tallest_plant_height = max(height) ) , ]

As you can see, the format of the command is:

data_table_name[ , .( column_name_1 = , column_name_2 = ) , ]

So, the whole OUTPUT field is enclosed in brackets with a dot in front of them: .( ), and inside the brackets is a comma-separated list of output column names (you choose these!), and each has its own command telling R how to generate the output you want in this column.

Remember the "building commands" advice above? That is useful here. I would progress through typing my command like this.

#set up the basic structure
phenotype_data_1[ , , ]

#since I want to use the output column, set up the brackets
phenotype_data_1[ , .() , ]

#put in the names of the output columns I want, separated by commas
phenotype_data_1[ , .( average_yield = , tallest_plant_height = ) , ]

#fill in the commands for each column
phenotype_data_1[ , .( average_yield = mean(yield) , tallest_plant_height = max(height) ) , ]

Remember from earlier that the columns are basically just vectors. Since some operations work with multiple vectors, it's no surprise that the OUTPUT field can contain expressions that use multiple columns. For example, the paste command is used to join strings together. To understand it, try this:

a <- c("This","the","of")
b <- c("is","result","paste()")

paste(a,b)
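If you're curious, paste() also accepts a sep argument that sets the text placed between the joined strings:

paste(a , b , sep="_") #gives "This_is" "the_result" "of_paste()"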

So, let's say we wanted a single column that combined the information from the "species" and "sample" columns. We could do something like this:

phenotype_data_1[ , .( species_and_sample = paste(species,sample) ) , ]

Your turn! ...

Exercise :

(@) Complete the following steps using maximum one command per instruction. Save all the code you create. Give short written answers where requested [10]:
+ Write a command that produces three columns, using the data in phenotype_data_1:
  + A column called "minimum_height" that contains the minimum height of the plants.
  + A column called "total_yield" that contains the sum of the yields.
  + A column called "SD_height" that contains the standard deviation of all the heights.
  + HINT: Remember the min(), sum() and sd() functions.
+ Write a command that produces just one column called "yield_per_height", which contains the yield divided by the height for each sample. You will notice the result has lots of rows, where the last result had only one row. Why? [Write your answer]
+ How does data.table cope when the OUTPUT field asks for columns that have different numbers of rows (e.g. where one column has 43 rows, and the other has just one)? Write any command that tries to do this, and explain what happens. [Write your answer]

As you have probably guessed, you can combine the use of the SELECT and OUTPUT fields. Let's build a command to calculate the average yield of the wheat samples (if you feel confident, try it on your own first!):

#set up the basic structure
phenotype_data_1[ , , ]

#Fill in the select column to choose only wheat (I would then run the command to test the selection works)
phenotype_data_1[ species=="Wheat" ,  ,  ]

#since I want to use the output column, set up the brackets
phenotype_data_1[ species=="Wheat" , .() , ]

#set up my output column name
phenotype_data_1[ species=="Wheat" , .( average_wheat_yield = ) , ]

#fill in the command
phenotype_data_1[ species=="Wheat" , .( average_wheat_yield = mean(yield) ) , ]

Exercise :

(@) Save all the code you create. [2]:
+ Write a command to output the maximum height of all the rye samples.

3) The GROUP_BY field.

Imagine we want to compare the yields of different species. We would need to calculate the average yield not just of one species, but of every species. The third field, GROUP_BY, lets us divide the data.table up into groups and do the OUTPUT operations in each group separately. To use it, put by=<grouping column name> in the GROUP_BY field.

So to calculate the mean yields of each species, our grouping column will be "species". Build up your command like this:

#structure
phenotype_data_1[ , , ]

#THINK: "Do I want to select specific rows?" Since we want all the rows, leave the SELECT field empty

#Since we want to output something, set up the OUTPUT field
phenotype_data_1[ , .() , ]

#name columns
phenotype_data_1[ , .( average_yield = ) , ]

#add command
phenotype_data_1[ , .( average_yield = mean(yield) ) , ]

#THINK: "Do I want to calculate the output separately in each group?" Yes, so add the grouping column in the GROUP_BY field
phenotype_data_1[ , .( average_yield = mean(yield) ) , by=species ]

And we get this:

If you want to visualise this, imagine sorting the table into groups, then doing your calculations separately on the groups:

And, of course, we can incorporate the SELECT field too. For instance, we could run the same command but exclude samples with height below 90:

phenotype_data_1[ ! height < 90 , .( average_yield = mean(yield) ) , by=species ]

Which could be visualised like this:

Exercise :

(@) Save all the code you create. [2]:
+ Write a single command that will output the average heights for the groups of samples which do and don't self-pollinate. Note: Self-pollinating samples have "TRUE" in the "selfing" column.

Adding and changing information in data.tables

You now know enough about data.tables to do some really meaningful analysis, which we will use in the next workshop. For that workshop (and for the assignment), you will sometimes need to add a column to a data.table. Luckily, it's quite simple.

You specify what you want in the new column in the OUTPUT field, just like before, except: instead of wrapping the output in .( ) and naming it with =, you write the new column's name, then the special assignment operator :=, then the expression that generates the column's values.

So, I could add a column called height_m to phenotype_data_1 that gives the height in metres. To get that, we just divide the heights in cm by 100, so the command would be:

phenotype_data_1[  , height_m := height / 100 ,  ]

Try it and then have a look to make sure it added the column.

The assignment symbol := can also be used to modify existing columns, and also in conjunction with the SELECT and GROUP_BY fields. For example, imagine we wanted to cap the height_m column at 2 metres, reducing anything above that to exactly 2. We can do it like this:

phenotype_data_1[ height_m > 2 , height_m := 2 ,  ]

#check that it worked
phenotype_data_1

Exercise :

(@) Save all the code you create. Use as many lines as required for each instruction. [5]:
+ The yield column in phenotype_data_1 is in units of tonnes per hectare. Add a column called "yield_bpa" that gives the yield in bushels per acre. The conversion is Yield_bushels/acre = Yield_tonnes/hectare * 14.87.
+ Add a column called "high_yield" to phenotype_data_1, which contains TRUE when the yield is over 60 bushels/acre and FALSE when it is not.

Great work! Now is a good time for another break before we take this data and produce some pretty plots.

Plotting with ggplot

Last of all for this session, we'll learn the very basics of making scatterplots with the very popular package ggplot2. This package can produce some very beautiful and complex plots with just one command, and it works wonderfully with data in data.table-type formats.

A typical ggplot2 command for a scatterplot looks like this:

ggplot( data= , mapping=aes( x= , y= ) ) + geom_point()

So, data= takes the data.table you want to plot data from, x= takes the column that will provide values for the x-axis, and y= takes the column that will provide values for the y-axis.

Once again: Start building commands by typing the structure, and then fill in the blanks.

Imagine we want to know if there are differences between yields for the different species. Try this command where we put the species on the x-axis and yields on the y-axis.

#start like this, with the structure
ggplot( data=  , mapping=aes( x=  , y= ) ) + geom_point()

#and fill in the blanks
ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_point()

So that's the basic idea. It gets interesting when we start using more columns to add more visual information. For instance, we might wonder whether selfing samples have on average a smaller or larger yield. We could add that information to the graph, too---perhaps as colour or shape? As it happens, the "aes" in the aes( ... ) part of the command stands for "aesthetic", because it maps (see why the argument name is "mapping"?) certain columns to certain aesthetic features. So we can, for instance, map selfing to colour, like this:

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , colour=selfing ) ) + geom_point()

#or map selfing to shape!
ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , shape=selfing ) ) + geom_point()

And it's possible to go crazy with aesthetics, for instance, let's use as many as we can to represent the entire dataset on one plot:

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , shape=selfing , colour=sample , size=height ) ) + geom_point()

Of course, this is a bit stupid. It's not a very intuitive or informative plot ... choosing aesthetics is an art. Have a try:

Exercise :

(@) Save all the code you create. Maximum one command per instruction. [2]:
+ Create a scatterplot that looks for a relationship between yield and height by placing one of them on the x-axis and the other one on the y-axis.
+ Add a colour aesthetic to show which data are from which species.
+ Add another aesthetic of your choice to show which data are from selfing plants.

Interpreting data (and a showcase of ggplot's flexibility)

Just to round up, let's have a look at a few of ggplot's built-in abilities. Since it's the end of the workshop, feel free to just copy-paste these commands. We'll take a mental break from code and instead just think about the results in biological terms.

Here's the original plot we made of species vs. yield.

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_point()

That + geom_point() means "add points to the graph". We can add different--or multiple--"geoms" to the plots. For instance, we can make a classic boxplot by adding + geom_boxplot()

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_boxplot()

Like many statisticians, I personally hate boxplots: they hide more than they reveal. I would prefer something that shows a more honest representation of the spread of data in each group. So naturally, I love geom_violin(). Check it out.

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_violin()

And even better, we can show both the raw data, and the violin plots, by using both geom_point() and geom_violin() (with + geom_point() second, so the points are drawn on top of the violins).

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_violin() + geom_point()

ggplot will even fit statistical models to your data. Here's a plot of yield vs. height once more.

ggplot( data=phenotype_data_1 , mapping = aes( x=height , y=yield , colour=species ) ) + geom_point()

You can see some obvious trends in it, for instance, shorter species have higher yields, but within each species taller plants have higher yields. With geom_smooth() we can ask ggplot to fit a model to the data. In this command, we will specify a simple linear model using the argument method="lm".

ggplot( data=phenotype_data_1 , mapping = aes( x=height , y=yield , colour=species ) ) + geom_point() + geom_smooth(method="lm")

The data we worked with are just made up, but are they realistic? Yes, generally. Relationships between agronomic factors like grain yield, plant biomass, height, genetic makeup, etc. are notoriously complex and interdependent. The relationships are often non-linear, and they can drastically change based on soil type, climatic and weather conditions, nutrient availability and so on. However, the trends we see here do reflect relationships that sometimes occur in real life.

For instance, note these three things:

1) The wheat have the highest overall yields, followed by barley, and then rye. This is the case, for instance, in Germany (from the excellent site FAOSTAT).

This is also the case in most countries that grow these crops such as the USA, Poland, and the Russian Federation. Of course, these countries have varied environments and growers will likely plant whatever crops yield best in their area (for instance, wheat tolerates dry soils and rye is excellent in the cold). And economic factors of supply and demand affect their choices, along with crop rotation and sowing/harvesting schedules ... and many other factors.

2) In our data, the wheat have low variability in yield and height, while the rye is notably variable. Naturally this would depend on our samples. If you compare samples from a single isogenic line with a collection of wild samples, the wild samples will always be more variable. But this variability does somewhat reflect the different breeding systems that are prevalent among these species. Wheat (well, bread wheat) is a domesticated species that normally self-pollinates, limiting its genetic diversity. Rye is semi-wild and often self-incompatible, so it has more genetic diversity, and hence more variation in growth form.

3) What about the positive relationship between height and yield within each of the species? That is also semi-realistic, but reality is usually more complex.

Consider this chart from Addisu et al. (2009):

Each dot represents a crop of wheat plants. They are genetically almost identical, except for variation at a gene that affects the height, which is part of why their heights vary from about 15 to 120 cm tall. Each panel shows the results for one year. The black dots show conventionally-grown plants, and the open dots are organically grown. What do these data tell us about the relationship between plant height and yield? And what about the effects of organic vs conventional methods on yield and height? Actually, it is rather difficult to interpret, which is the point. But let's try.

Exercise :

(@) This is worth 0 marks, so leave the answer blank, but it is so important that it should feel like an exercise. Look at the graph. Work out what all the axes, and dots, and lines, and panels mean. And form an opinion about what the data might be saying about height, yield, and organic/conventional farming methods.

Ok, here're some thoughts from me. Obviously the organic methods don't yield as much as the conventional methods, and there is some height/yield relationship, but this relationship changes every year. For conventionally grown plants, it looks like the best height is normally about 80 cm. I suspect that, when taller than this, the plants tend to lodge (fall over) and the yield decreases. This is in fact part of the reason why so-called dwarfing genes were a key part of the green revolution of the 1950s and 1960s: dwarfing genes help plants to attain this "sweet spot" of size for maximum yield.

But it's not that simple. In panel (e) the yield of plants is highest at around 100 cm ... but there are no data at 80 cm, so it's a bit ambiguous. Also, the same relationship cannot always be seen for organic plants. In years (a), (b), and (c), organically-grown plants kept yielding more no matter how big they got!

It's hard to say what is going on, and that is often the case in the biological sciences---and one of the reasons good experimental design is so important. Good experimental design, but also thinking hard about what information is not contained in the analysis. For example: Is my "lodging" interpretation reasonable? Did tall plants really fall over? Did the organic regime perhaps result in different soil properties that helped plants to stand up? Or maybe their lower yield also meant they were just not as heavy, so they never got to the point of falling over.

It's important to wonder about questions like this---it helps us work out how to move science forward.

The ultimate lesson is this: the more data we have to analyse, the better. Big data demands computational tools, but also interpretation and analysis skills. You can do a lot of good for the world with just a little coding skill and good data literacy.

One final exercise to get you thinking about data interpretation, which will be a much stronger focus in the next unit.

Exercise :

(@) Use Google or YouTube or a textbook to develop a very basic understanding of "Simpson's paradox". Use your understanding to answer the question [5]:
+ Run this command: ggplot( data=phenotype_data_1 , mapping = aes( x=height , y=yield , colour=species ) ) + geom_point(), and examine the output. Is this an example of Simpson's paradox? Explain why, in particular pointing out any false conclusions that could be derived from the data if we had not used the colour aesthetic to show species. [Write your answer]





Rye and barley growing together here at IPK. Based on what you now know--can you tell which is which?


