In mtrw/IBGG: Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

knitr::opts_chunk$set(echo = TRUE)

Introduction and setup

It's time to put our R skills to work, using them to interpret genetic data and turn it into meaningful information. There will be a few new R tricks in here but mostly we will just apply what we already know.

Get set up with the IDE as you did in the previous session, and work in the document "workshop_002.R". And once again, make sure you run this line of code to load all the required packages and datasets:

source("/filer/projekte/ggr-course2020/r-source/genres_2020_bioinfo_setup_fns_data.R")

We worked with some phenotype data (height, yield, etc.) last time. Now, let's begin by looking at how some genotype data might be handled in an R environment.

Let's get some genetic marker data. Recall that a genetic marker tells us some information about the genome sequence, at some particular site, in a particular individual. We use markers rather than whole genome sequences simply for practicality: Markers are cheap and fast to use. Whole genome sequences can still by very slow and expensive to produce. If we have a collection of genetic markers from common sites across a number of individuals, then we can do many of the things we can do with whole genomes anyway, at a fraction of the time and effort.

Let us load a set of 22 SNP markers from 23 barley plants into a data.table called genotype_data. Load it with this function:

genotype_data <- read_KASP_genotypes()

#have a look
genotype_data

This dataset simulates what we might see if we used a marker technology called KASP (which detects SNPs), on some inbred lines. As you can see, the "genotypes" are all scored as either 1 or 0. That might seem curious, since aren't they actually SNP mutations? Shouldn't they be nucleotide symbols, like "A", "G", "C", "T"? The reason is that ultimately the actual nucleotides don't matter for many purposes. There's plenty we can do with just arbitrarily assigned categories. How is it decided which state should be "1"? Sometimes it is totally arbitrary. Sometimes the most common nucleotide at the site is given a standard state. Sometimes a particular individual's genotype is chosen to represent the 0 state in all cases. It is not always important. Let's just do some basic operations on genotypes to get us thinking in R again.

Imagine we have the genotypes for ten markers (run the code as always!):

genotype_1 <- c(1,0,1,1,0,1,1,0,1,0)
genotype_2 <- c(1,1,0,1,0,1,0,1,1,1)

genotype_1
genotype_2

This could represent an arrangement of nucleotide states like this, for example (with genotype 1 on the top and genotype 2 on the bottom) ...

One simple task might be to count the differences between the two genotypes. That gives us some idea of how closely the two individuals are related, at least.

Let's learn, and then use, a new trick to count differences. As we know, logical statements ask questions about data, and give back long lists of "TRUE"s and "FALSE"s. For example, we can create a vector of numbers and ask which are greater than 4:

numbers  <- c(1,2,3,4,5,6,7,8,9)
numbers #run it to see what is in the vector
numbers > 4

Which gives "FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE". Now remember that the sum() function adds up the entries of a vector, for instance:

sum(numbers)

... but what if sum() is not given numbers, but instead a logical vector? Well, in that case it automatically adds up all the times the vector includes TRUE! So if we put any logical statements inside the function sum( .... ) we will get the answer to the questions "how many times is this true?" Make sure you understand these commands, and you will see the point:

sum(c(FALSE,FALSE,TRUE,TRUE,FALSE))
numbers > 4
sum(numbers > 4)

Now back to the question of counting differences between the two genotypes. Here are two different ways to do it:

#Way 1
sum( genotype_1!=genotype_2 )

#Way 2
sum( (genotype_1 - genotype_2)**2 ) #Note, this method depends upon the 0/1 encoding of genotypes. It wouldn't work if they were given other numbers or letters.

If these look like magic, spend some time taking them apart and running the constituent parts to work out what is going on.

Exercise :

(@) Save all the commands you create or run. [2]: + Run the commands genotype_1 <- c(1,0,1,1,0,1,1,0,1,0) and genotype_2 <- c(1,1,0,1,0,1,0,1,1,1) + Create a command that counts the differences between genotype_1 and genotype_2, that is different to the two examples given above. + OPTIONAL CHALLENGE TASK; Program a dotplot, using as many commands as required: + Find a way to create random character vectors of "A", "G", "C", and "T". For an added challenge, find a way to control the approximate percentages of each. + Use your method to create an A/C/G/T-containing vector of length 500 and store it in a variable "seq1"." + Create a copy of seq1 in which entries 100--200 are reversed. Store it in "seq2". + Write commands that will plot a simple dotplot of seq1 and seq2, i.e., if you imagine one sequence running along the vertical axis and the other on the horizontal axis, a point will be drawn wherever two bases match. This will probably require several commands and the creation of some new variables.

Genotype data with data.table

And in actual fact, the genotypes stored in the "genotype" column of the table genotype_data are just like the vectors genotype_1 and genotype_2, as we can see if we select just one sample...

setorder(genotype_data,marker) #this orders the table by the "marker" column. It is essential to make sure the markers are in the same order no matter what sample we look at.

genotype_data[sample=="HOR_14914",genotype]

... meaning we can use data.table to manipulate genotype data.

Ok, now we are thinking in R, let's set some goals for this data. We will:

1) Search the data for missing genotypes. These occur a lot in real datasets, because technology isn't perfect. 2) Remove any markers that have a very high number of missing data points. 3) Display the genotypes graphically, as shown in the lecture, and take the opportunity to ... 4) Visually search for possible linkage disequilibrium among alleles at different loci, and ... 5) Use this linkage if possible to fill in likely states for any remaining missing genotypes ... 6) Assess the relationships among the samples.

Let's begin by counting the number of "missing" data points for each marker. This will give us an idea which markers are unreliable. Missing data are given the special symbol "NA", and we are interested in counting all the "NA"s in the "genotype" column. The logical command to tell you if an entry in a vector is "NA" is is.na(). See how it works by running and understanding these commands:

numbers_with_NAs <- c(1,2,3,4,NA,6,7,NA,9)
numbers_with_NAs
is.na( numbers_with_NAs )
sum( is.na(numbers_with_NAs) )

Take a minute to recall first the structure of a data.table command, and what they do. Here is the command again, and the diagram from the last workshop to help. Be sure to take note again of the .( ) structure that occurs in the OUTPUT field, and the by= that we use in the GROUP_BY field to indicate the grouping variable.

data_table_name[ SELECT , OUTPUT , GROUP_BY ]

Let's build up the command that will count NAs for each marker.

#start with the basic structure
genotype_data[  ,  ,  ]

#THINK: Do we want to select any particular rows? (We don't, so the SELECT field stays blank)

#THINK: Do we want to output something? (Yes, so set up the OUTPUT field with the ".( )" and name the output)
genotype_data[  , .( number_of_NA_genotypes =  ) ,  ]

#Fill in the command to produce what we want
genotype_data[  , .( number_of_NA_genotypes = sum(is.na(genotype)) ) ,  ]

#THINK: Do we want to produce this output for different groups? (Yes, we want to group them by marker, so set up the "by=..." argument in the GROUP_BY field)
genotype_data[  , .( number_of_NA_genotypes = sum(is.na(genotype)) ) , by=marker ]

Ok, no that we have a list of the number of NA genotypes per marker, we can see whether there are any markers that stand out as being very unreliable, and remove them:

Exercise :

(@) Save all the commands you create. Maximum one command per instruction. Give short written answers where requested. [5] + Which two markers in genotype_data have the highest counts of missing data? [Write your answer] + Write a command that filters these two markers out of genotype_data (i.e., removes all the rows that refer to these markers), and overwrites the data.table genotype_data with the filtered version. + Write a command that will let you check whether your filtering was successful. + HINT: I would try to SELECT only rows with these two markers. If they have been successfully removed, the result will be an empty table.

Graphical genotypes with ggplot

To create a graphical genotype display, shows genotypes as coloured dots or squares arranged in a "grid" with marker positions lined up on the x-axis, and samples lined up along the y-axis. Once we refresh our ggplot syntax for a scatterplot, this should be easy.

ggplot( data= , mapping=aes( x= , y= ) ) + geom_point()

So, data= gives the data source (genotype_data in this case), and the arguments inside mapping=aes() (as in x=..., y=..., colour=...) tell ggplot what to display on the plot. Let's build this command up piece by piece as before:

#Start with the structure!
ggplot( data=   , mapping=aes( x=   , y=   ) ) + geom_point()

#Fill in the data source
ggplot( data=genotype_data   , mapping=aes( x=marker , y=sample ) ) + geom_point()

Almost there! We just need to add another aesthetic to show what the actual genotypes are!

Exercise :

(@) Save the command you run. [2] + Add another aesthetic to the plot above that shows genotype as shape.

I personally prefer colour as genotype, so run this command:

ggplot( data=genotype_data , mapping=aes( x=marker , y=sample , colour=genotype ) ) + geom_point()

Note the markers here are in random order. They could fall anywhere on the actual chromosomes.

So, based on what we just did in the lectures, let's think about some of the phenomena we can see just by looking at the data in this form.

Exercise :

(@) Save any commands you create, and write short answers where requested [10] + Based on these SNP data alone, which individual in the panel would you guess is the most closely related to "HOR_12311", in evolutionary terms. [Write your answers] + Based on these SNP data alone, which three markers would you suggest are located most closely in the genome to SNP_19? [Write your answers] + There are two missing data points in the data still. For each, write the marker name and sample name of the missing data point. [Write your answers] + The missing data points come from markers that appear to be linked to other markers. Based on the markers they are linked to, write the most likely genotype for each of the missing markers. [Write your answers] + For each missing data point, write a command that sets the genotype in genotype_data to the most likely state you chose above. HINT: You can read the section "Adding and changing information in data.tables" from workshop 1. + Write a command to help you confirm there are no missing genotypes left in the dataset. + OPTIONAL CHALLENGE TASK; Create an IBS matrix.: + IBS ("Identity By State") matrices are a common way to describe the genetic distances between all the individuals in a population. Write a script that produces a matrix object, where each row and column corresponds to an individual in genotype_data (in the same order, i.e., the same individual will correspond to both row 1 and column 1). Each entry in the matrix should contain the proportion of sites at which the individuals on that row and column have the same marker state (e.g., if individual 1 is A,A,T,A, and individual 2 is A,A,A,A, then they have a 0.75 IBS proportion). + HINT: There are a lot of ways to do this. Nested loops would be one way (or nested apply functions, even). You could also achieve it using data.tables, but that is a little more complex. + Write a command that will print TRUE if the diagonal entries of the matrix are all equal to 1 (they should be). + Create a heatmap of the matrix.

Genotypes as a window into deep history: Thinking about evolution

When we asked about relatedness and grouping of genotypes, we are really asking much more profound questions about the relatedness genetic grouping of individuals, which is a direct consequence of their evolutionary history.

In the lecture we talked about visualising genetic diversity intuitively using principal components analysis (PCA), and using the resultant PCA plots to speculate about possible evolutionary scenarios.

Simulating evolution

We now have a short exercise designed to use bioinformatics to strengthen intuition about how different evolutionary histories lead to different patterns of diversity, and how different ways of visualising that diversity can be helpful in learning about the original evolutionary scenarios that created them. You will actually simulate your own evolutionary scenarios in R. The function sim_pop() runs a simple simulation of a population undergoing evolutionary scenarios that the user describes. The "individuals" in these simulated populations are sexual diploids, so they have genotypes that are inherited from two parents, which are, in turn, produced by recombination between the chromosomes they inherited from their parents. Genotypes occasionally mutate. For simplicity, the hypothetical species we are simulating has just one pair of linear chromosomes. There is a venomous Australian ant called Myrmecia pilosula that actually has only one chromosome, and I am proud to say I have personally been bitten by one.

Here's an example for you to start on. Notice writing a data.table into R is pretty much just like running any function (i.e., it has the structure function_name( argument1 = ... , argument2 = ... )), except the argument names are the column names, and the ... are lists of whatever you want to fill the columns with. Create an events table like this and view it:

events1 <- data.table(  event=c("split","size_change","split","size_change"),  generation=c(4,7,9,20),   population=c(1,2,1,2),   size=c(10,6,2,20)    )

events1

NOTE: Previously all the R commands have been on one long line. You can also write them over multiple lines. This is allowed because when R sees an open bracket (i.e. ( or { or [), it expects a close bracket () or } or ]) ... so it will just keep reading lines until all the brackets are closed and the command is complete. This is really useful for writing clear code, for example, I could re-write the command above as:

events1 <- data.table(
  event=     c( "split",  "size_change",  "split",   "size_change"  ),
  generation=c( 4,        7,              9,         20             ),
  population=c( 1,        2,              1,         2              ),
  size=      c( 10,       6,              2,         20             )
)

The events table events1 describes the evolutionary scenario we want to simulate. In this case, here is a phylogenetic depiction of the events described. Go through the table you created and make sure you can see how the rows correspond to this evolutionary scenario:

Run the simulation by giving the events list to the sim_pop function like the below, and store the results in the variable names "simulation1":

simulation1 <- sim_pop(events=events1)

If done correctly, you'll get a plot that shows the individuals in the population evolving, just as we requested! The dots represent individuals in a particular generation, while the lines represent the two parents of each individual.

As you will see, the population splits and size changes represent the evolutionary scenario as we described. Now, let's look at the patterns of genetic diversity this scenario produced! First, we can just look at the raw genotypes posessed by individuals in this population. The function plot_simulated_genotypes() will plot genotypes with the "0" state in blue and "1" in red. In contrast to the previous genotype plot we made, though, the markers are now displayed in order along the "chromosome".

plot_simulated_genotypes(simulation1)

A side note for the observant ... we're actually looking at haploid, gametic genotypes here, so you won't see any heterozygotes. This is to keep us thinking in terms of the 1/0, two-state genotype encoding that we have been working on so far.

Once again, can you tell which individuals (rows) are closely related? Which are from each of the three populations? Think about how you can tell and have a guess. Then check your guess by adding the argument show_populations=TRUE, to reveal (using shapes), which individuals are in which groups.

plot_simulated_genotypes(simulation1,show_populations=TRUE)

As you can see, the process of evolution has left a mark on these genotypes. They tend to be the same within populations, but sometimes differ between populations. The marker states at locus 1 are a good example, with the top and bottom populations having the blue state, and the middle having (mostly) the red. And this is purely a consequence of the simple processes of mutation, recombination, and fertilisation, acting in populations over time!

Exercise :

(@) Write short answers where requested [6] + Observe the marker states for population 2 only (triangles). Within the first ten loci, there is an excellent example of obvious linkage between two loci. Which loci are they? [Write the numbers] + The linkage pattern is almost perfect but the pattern is broken in two individuals. Which are they? [Write the numbers] + Are the two linked loci close to each other? Is this expected? Why? [Write your answer]

Could you have guessed what the evolutionary scenario might have been based on the graphical genotypes alone? Perhaps PCA analysis could tell us more. The function plot_genetic_diversity() can be used to perform PCA on this genotype matrix and represent the genetic diversity in PCA plot form.

plot_genetic_diversity(simulation1)

Let's use the data to make inference about the evolutionary process. Of course, we know what the process was, in this case, because we programmed it into the events table. The point of this exercise is to learn intuitively how evolutionary history ends up represented in genotype data, so that when we don't know the evolutionary process, we can make good guesses about it. Which is what you will do in the next exercise.

Examine the PCA plots. Try to adjust the window size so they are approximately square, this will mean the PCs are properly scaled. Remember:

"Tighter" (less spread out) clusters come from populations that have spent some time being a small size, and thus losing genetic variation.
"Looser" (more spread out) clusters come from populations that have spent some time being a large size, allowing them to build up a lot of genetic variation.
Clusters that are close to each other represent populations that have split later in evolutionary time.
Clusters that are very separate from each other usually split early.

This is what you should get from the PCA analysis:

By looking at the scree plot (top left) we can see that the first two principal components capture most of the variance. So we are safe to concentrate mainly on the plot of PC1 and PC2, and feel confident it's a good representation of information contained in the genotype data, with not too much "hidden" away in further dimensions.

So we are safe to largely focus on the plot of PC1 vs PC2 as we make our interpretation. And very easily we can see the main clusters:

There are three rather clear clusters in the plot. This means that, since the time they shared a common ancestor, there must have been two distinct splitting events.
The clusters are quite obvious. They do not seem to overlap. Therefore, they must have split for a significant amount of time, to allow them to become quite unique.
The green and yellow clusters are slightly closer to each other than they are to the blue cluster. So we know that the split between blue and the others probably happened first.

So far then, we can assume an evolutionary history with two fairly ancient splits, with blue breaking away first. Something like this:

What about the population sizes?

Assuming this is a random sample of individuals from a species, then it seems one of the populations (green) is significantly bigger than the other two. It has a lot more individuals in it.
Populations that are large accumulate a lot of diversity, so the clusters that represent them in a PCA become "looser"; more spread out. But ...
This is not the case in our data! All three clusters are spread out about the same amount. So why isn't the green population more spread out? How could this be?
The key is in the sentence above: "Looser clusters come from populations that have spent some time being a large size". A population that expands rapidly, but not had time to accumulate diversity, could result in a cluster that is similar in size to those of small populations, but contains many more individuals.

So we can update our interpretation. We will keep the population sizes about the same, but add a very late expansion to the second (green) group.

This is basically correct (it approximately matches what we originally simulated), although it's not perfect, and quite a lot of information was lost or ambiguous. But we worked out the basic picture of the evolutionary history from genetic data alone! And that is what evolutionary inference from genetics is all about. There are a lot of sophisticated computational methods for performing evolutionary inference automatically. Undertanding those methods is important, but not nearly as important as having a good intuition of the basic principles.

Exercise :

(@) Imagine the PCA and scree plots below are the result of your study on a population of plants. These plants are from the species Candyus polychromus, which grow jelly beans. In your study, you took a random sample of the whole Candyus polychromus population and genotyped them at 100 loci. It's your job now to interpret the evolutionary events that occurred in the history of the species. Save any commands you create, and write brief answers where requested. [15]

Describe the major features you observe in the PCA plot, and suggest what evolutionary events could explain them. Give your reasoning. [Write your answer]
Explain whether the information contained in the scree plot gives you much confidence in your interpretation, and why/why not? [Write your answer]
Write an events table that describes the evolutionary events you suggested as an explanation, and store it in a variable called "my_events". Print the events table to the screen. Remember:
- Try to keep the total number of individuals below about 70 at all times, or it will run very slowly.
Run the simulation using the sim_pop() function as before, but add the additional argument n_loci=100 to simulate a larger number of marker loci. Store the results in a variable called "my_sim".
- By default, the starting population size is 20 and generations the simulation will run for 25 generations. If you want to get creative, you can adjust these values using the arguments default_size=... and n_gens=....
Display the results as a PCA plot using the function plot_genetic_diversity(). Does it show generally the patterns you expected? [Write your answer]. Keep experimenting with your events table until it does. Just overall similarity is fine: they will never look exactly alike.

Evolutionary inference on real data

So, we know evolution can predict what we will see in PCA plots. Let's work backwards and try to infer the evolution of our actual KASP-marker-genotyped barley dataset!

First we need to look at our genetic information in PCA form. Do that with:

plot_genetic_diversity(genotype_data)

Exercise :

(@) Write a brief answer. [5] + Write a brief evolutionary interpretation of the information contained in the PCA and scree plots for barley KASP marker data. Give your reasoning. [Write your answer] + OPTIONAL CHALLENGE TASK; Write some evolutionary simulation functions: + As in the first challenge task, create commands to produce random vectors containing "A", "C", "G", and "T", and store two such vectors, each of length 40, in "seq1" and "seq2". + Learn how to write your own functions. Create a function that will accept a vector like "seq1" as an argument, and return a "mutated" version. To apply mutations, each "nucleotide" in the input vector should be given a 10% chance of being replaced with a random nucleotide. + Write a function that will take two vectors, like "seq1" and "seq2" as arguments, and return a "recombined" combination of them. The recombined vector should be one part of one input and one part of the other, joined at a randomly chosen breakpoint (e.g. if seq1 is A,A,A,A and seq2 is G,G,G,G, it could return A,G,G,G or A,A,G,G or G,G,G,A etc ...)

Attaching phenotype data

We have some phenotype data available for these samples, it is pre-loaded as a data.table called phenotype_data. Have a look:

phenotype_data

As you can see, the phenotype information we have for each sample is stored in columns with names like "annuality" and "rachilla_hairs". What do they mean?

Latitude is the latitude at which the sample grows. In general, higher latitudes have colder winters.
LT50, in the cold_tolerance_LT50 column, is a measure of how well a sample adapts to cold. Lower values mean that a sample can better survive lower temperatures.
Annuality refers to the time during which the variety can be sown (planted). Winter type barleys are more tolerant of cold.
Spike length and awn length refer to the lengths of these parts:
Image from Yuo et al., 2009
Awn texture relates to the presence or absence of microscopic barbs on the awns, that make the awns feel rough to touch:
Image from Milner et al., 2019
Finally, rachilla hairs issue from the rachilla. Rachilla are the tiny "stalks" that attach the florets (the individual barley flowers) to the main axis of the spike. These hairs can be long or short:
Image from Russell and Paynter, 2009

Exercise :

(@) Save the commands you create, and write brief answers where indicated. [5] + Write a ggplot command that will help you see if there is a relationship between awn length and spike length in the phenotype_data dataset. Does there appear to be an association? [Write an answer] + HINT: I would use a scatterplot. And, if you want to go an extra step, you could fit a trendline with geom_smooth(), as demonstrated in the last workshop. + Add a new aesthetic to the above plot to help you assess whether there is a relationship between low temperature tolerance and awn/spike length. + There are at least two explanations for why grasses have awns. One explanation suggests that they are very unpleasant for herbivores to eat, because they splinter into pieces and the barbs make them stick in the animal's throat. An alternative explanation suggests that florets that fall to the ground contract and expand with the heat, and this movement of the awns back and forward helps to bury the seeds in the soil. Think about your plot in relation to these two explanations of why awns exist. what would you predict about the awn length / spike length relationship based on each explanation? Do the results suggest one explanation is better (or more important) than the other? [Write a brief answer]

The glorious `merge` feature!

We want to intepret our phenotypic data in the context of genetic data, we will have to combine them into one table. This is where the merge feature of data.tables is helpful. Each row of genotype_data corresponds to a sample, listed in a column called sample. And each of these samples also appears somewhere in phenotype_data, also in a column called sample. What we really want to do, is to take each sample name in the sample column of genotype_data, search the table phenotype_data for a row where the sample name matches, and then add the information from that row to genotype_data. In a way, it is like the sample names in genotype_data are being used to SELECT the rows of genotype_data to add to itself.

And that's why we use the SELECT field for merging. We simply put genotype_data into the SELECT field of phenotype_data! We also explicitly tell it that we want the selection process to use the information in the columns named "sample" using the argument on="sample":

phenotype_data[ genotype_data , on="sample" ] #this just prints out the result of the merge on the screen

combined_data <- phenotype_data[ genotype_data , on="sample" ] #write the result to a variable

Exercise

(@) Demonstrate the merge feature. Think of a band you like (or use one of mine), preferably with 3 to 10 members, and use them in this exercise. Save all the commands you create. [5] + Create a data.table with two columns: one column that contains the names of the band's members, and a second column that contains the instruments they play. Store it in a variable. + Create a second data.table, also with two columns: One containing the names of the band's members, and a second column containing the year they were born. Store it in a variable. + Write a command that merges the two data.tables to create a table with three columns containing the band members' names, the instruments they play, and the year they were born. + OPTIONAL CHALLENGE TASK: + Write some code that produces some appropriate data.tables and then demonstrates the difference between data.table's rolling joins. Make sure it includes "roll='nearest'", "roll=-10", and "roll=TRUE".

Phenotypes and evolution

As we know, evolution by natural selection works because an individual's phenotype affects its ability to reproduce and pass alleles on to the next generation. But selection doesn't always act in exactly the same way for every individual. For example, individuals living in a sub-population living in a windy area might pass on more genes that make them short (so they don't blow over too easily). But individuals in another sub-population which doesn't experience much wind might be fitter if they grow tall, which mean they can suck up as much light energy as possible and use that energy for making seeds.

Let's think about the phenotypic information we have, and how it might relate to evolutionary history. To do that, we can take our PCA (our window into the evolutionary relationships), and superimpose information about phenotypes onto it. We do this using the same function plot_genetic_diversity(), and provide column names for the colour variable with the argument colour_variable=. To remind yourself what phenotype data we have, you can just run the name of the data.table (combined_data) and look at the column names.

For example, we could examine rachilla hairs like this:

plot_genetic_diversity(combined_data,colour_variable="rachilla_hairs") #always put the column name used for colour in quote marks

I got a result like this:

As we can see in the top-right PCA plot, long rachilla hairs do seem a bit more common in the upper cluster than in the lower clusters. Perhaps this could mean that one sub-population grows in areas where long rachilla hairs are more useful? That would mean rachilla hairs are a consequence of the groups' evolutionary histories. However, this could also just be a random result (which is more likely). This is one of the problems with small data sets like this: it can be hard to say with certainty whether results really mean anything or not. Perhaps there are other phenotypic traits that correlate better ...

Your turn:

Exercise

(@) Save all the code you create, and give brief written answers to questions where prompted. [15] + Produce the same PCA/scree plots as above using the same command, but with latitude used to colour the points. + Do the data suggest that closely-related individuals tend to live at similar latitudes? Is this what you would expect? Explain your reasoning. [Write your answers] + Produce yet another set of the PCA/scree plots that colour points according to LT50. By comparing these plots with the previous ones, what can you say about the relationship between latitude and LT50? [Write your answer] + Confirm the relationship using a scatterplot. Remember, the table you will use as the basis for the plot is combined_data. + What do the plots produced in this exercise so far tell us about the selective pressures that have acted upon individuals in these subpopulations? Explain your reasoning. [Write your answer]

Have a coffee! One section to go.

SNP associations with phenotypes

We spent the day so far thinking about what evolutionary history does to genotype data (and how to work backwards from the genotype data to the history). But when a genotype causes a phenotype, then this affects the fitness of the individuals, and this affects the way evolution occurs. One such example appears to be the cold tolerance we just looked at.

If we are breeding (say) barley plants and we want them to be more cold tolerant, then it would be a great advantage to know exactly which genetic changes are having this effect---then we can use the same genetic variants to produce cold-tolerant varieties!

To do this, we find correlations between genotypes at particular sites and phenotypes that suggest the site is linked to the cause of the phenotype. In some cases, the genotype observed at the correlating site might not just be linked to the phenotype---it might actually be a cause of the phenotype---but establishing true causality requires a lot of experimental validation. Finding genotype-phenotype correlations requires the use of some statistics, which is what we will try to understand better in this section ... but we will do it visually and leave the maths out.

First, let's revisit the problem of genotype-phenotype correlations caused by population structure, by looking at our KASP data. We could ask about cold tolerance as discussed in the previous section, and investigate it naively by viewing the graphical genotypes in the context of how cold tolerant the samples are. To do this, I just used the high-cold-tolerance and low-cold tolerance samples, indicated by shape. Have a look at the results.

What we are looking for is sites (columns) where the genotypes (colours) are almost all one state (colour) for the cold tolerant group, and almost all in a different state for the non-cold-tolerant group. This would point to linkage (or causation) between the site and the phenotype. I can see one here:

Only one sample's genotype (black arrow) disagrees with the general pattern: The green allele occurs in cold tolerant samples, and the red allele is in non-cold-tolerant samples.

But here's the problem: I can find many other sites where the pattern (or "correlation" to be precise) is almost as good ...

If you look at the original graphical genotypes again, you'll easily see why: because sub-populations are generally genetically similar. So if we are going to find strong associations, we need to account for population in our statistical method.

Linear models

First, let's get a feel for one of the most popular statistical frameworks that exists for investigating associations when there are many factors involved, linear modelling. Linear models are about checking how well some outcome can be predicted by adding different factors together.

Here is a tool that will help you get a feel for linear models, and how we can use them to control for factors like population structure. Mostly, scientists use software programs to fit models automatically but to really understand the process, it's a good exercise to fit one yourself. We will fit a linear model that tries to predict the height of some hypothetical rye samples, based on three factors:

The amount of fertiliser that is applied to them.
The genotype of a SNP. In modern studies, thousands of SNPs are tried. We will just test one, to demonstrate the procedure.
The subpopulation the sample is in.

We will give each factor a value that reflects how much it affects height. These "effect size" values are called coefficients. In this example, a coefficient of 3 for the fertiliser factor would mean: "An increase of fertiliser by 1 standard deviation results in, on average, a height increase of 3 cm". Of course, we could make the coefficient negative, in which case it would mean "An increase of fertiliser by 1 standard deviation results in, on average, a height decrease of 3 cm".

This means the difference of the coefficient from zero has real meaning! Zero means the factor has no effect. If one factor has a coefficient of 4, and another 8, then the second can be said to have twice as much influence (on average) as the first.

Start with this command using the function check_fit_fertiliser() to find a good coefficient for a very simple model that only includes fertiliser as a factor.

check_fit_fertiliser(
  guess_eff_fertiliser = 20 #this is YOUR guess of the effect fertiliser has on the height
)

The result will look like this:

At the top we see a "field" of rye plants. The black bars represent the height estimated by the model. If the fit is very good, the black bars will be at the exact same height as the plant. The blue arrows show the contribution to the model's prediction made by different factors (in this case, there is only one factor, fertiliser). Notice the model (in this case) tries to estimate the influence of the factors relative to some "baseline" (the dotted line, in this case just the mean height of the plants). Plants that received less than average fertiliser have arrows pointing down, and vice versa. If we double the coefficient the length of the arrows will double, and if we make it negative they will change direction (try it!).

The lower panel shows the differences (or residuals) between the predicted heights and the real heights (and the text tells you the total squared sum of these differences). If the fit is good, the bars will be small, and the total squared residuals will be as low as possible. The model we made has a bad fit. Some models will simply never fit well, but with this one I think we can improve the fit.

How? We need to adjust the coefficient for fertiliser to help reduce the difference between the predicted height and the true heights.

check_fit_fertiliser(
  guess_eff_fertiliser = 10
)

See how that made the influence of fertiliser smaller, and the sum of squared residuals (or "SSRs") went from down from over 7000 to about 3748.

Exercise

(@) Write the final command you end up with [2] + Run check_fit_fertiliser() and adjust the coefficient for fertiliser to find a value that reduces the total SSRs to below 3475. + HINT: A good way is to make changes up and down in smaller and smaller "jumps", always moving towards lower SSRs. You will need to use decimals to make fine adjustments.

Ok, time to get genetics involved. As we know from the lectures, if a marker is very close to a gene that affects the thing being measured (height, here), and which is variable in the population, then genetic linkage will mean that its state can have predictive value: Its state will normally correllate with the version of the gene that an individual has. In this exercise we will assume our markers are SNP markers. The function check_fit_fertiliser_SNP() adds a factor for a SNP marker at some site. You will see these reflected as black arrows that point up if the SNP has one state, and down if it has the other state.

check_fit_fertiliser_SNP(
  guess_eff_fertiliser = 20,
  guess_eff_SNP = 20
)

Now the fitting problem becomes more interesting. Some combination of these coefficients will get an optimual result, but to find it requires careful tinkering.

Exercise

(@) Write the final command you end up with [2] + Run check_fit_fertiliser_SNP() and adjust the coefficients given to the effects of fertiliser the SNP. Find a combination that gets the SSRs down to below 1403. + HINT: A good method is to try to make an improvement by adjusting one coefficient, then try adjusting the other, then back and forwards until they "settle" on good values. + Based on your fitted model, what factor appears to be the most important for controlling plant height?

Looking at the improvement in the residuals should prove to you that forgetting important factors can often make the difference between a good model and a useless one.

Recall from the lecture that population structure creates a lot of apparent genotype-phenotype correlations that are not really based on causality. They just reflect that fact that markers often have the same state within a subpopulation, and if that subpopulation happens to also express a certain phenotype, it is harder to say which of the markers are truly linked to genes that influence the trait.

A simple way to deal with this can be demonstrated in this model. We can simply allow for a "subpopulation" effect, that will in capture the influence of the shared variation in a subpopulation. This means the SNP effect will then be more informative. It will only show itself to be important if its effect cannot be accounted for my the subpopulation factor. In this case, we will reflect the subpopulation effects with red arrows.

check_fit_fertiliser_SNP_subpopulation(
  guess_eff_fertiliser = 10,
  guess_eff_SNP = 10,
  guess_eff_subpopulation = 10
)

Exercise

(@) Write the final command you end up with, and give written answers where requested. [8] + Run check_fit_fertiliser_SNP() and adjust the coefficients again, until you get the SSRs lower than 50. + Based on your fitted model, how important is the influence of the SNP on plant height, and how does this differ from your interpretation under the model from the previous exercise? [Write your answers] + OPTIONAL CHALLENGE TASK; Find a way to cheat on this question: + It is possible to look inside various functions and variables I wrote for the course and which are available in your R environment, then work out what values I used to create the exercise. It may require some careful reading of code and a bit of calculation ... Note, please (!) do not open any "hidden" course files if you find them during this process---print all information you want from within your R environment.

You've finished the session on interpreting biological data! Hopefully this has given you some intuition about how computational tools and bioinformatics allow us to extract information from genebank genomic data, and how creative the process can sometimes be. In the next session we will leave R behind and try some other languages. Well done!

mtrw/IBGG documentation built on April 10, 2022, 12:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

mtrw/IBGG
Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

In mtrw/IBGG: Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

Introduction and setup

Exercise :

Genotype data with data.table

Exercise :

Graphical genotypes with ggplot

Exercise :

Exercise :

Genotypes as a window into deep history: Thinking about evolution

Simulating evolution

Exercise :

Exercise :

Evolutionary inference on real data

Exercise :

Attaching phenotype data

Exercise :

The glorious `merge` feature!

Exercise

Phenotypes and evolution

Exercise

SNP associations with phenotypes

Linear models

Exercise

Exercise

Exercise

R Package Documentation

Browse R Packages

We want your feedback!

mtrw/IBGG Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

In mtrw/IBGG: Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

Introduction and setup

Exercise :

Genotype data with data.table

Exercise :

Graphical genotypes with ggplot

Exercise :

Exercise :

Genotypes as a window into deep history: Thinking about evolution

Simulating evolution

Exercise :

Exercise :

Evolutionary inference on real data

Exercise :

Attaching phenotype data

Exercise :

The glorious merge feature!

Exercise

Phenotypes and evolution

Exercise

SNP associations with phenotypes

Linear models

Exercise

Exercise

Exercise

R Package Documentation

Browse R Packages

We want your feedback!

mtrw/IBGG
Introduction to Bioinformatics for Genebank Genomics (IBGG): Course Material and Assignments

The glorious `merge` feature!