```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Most of you will be familiar with operating systems like Windows or MacOS. Linux is also an operating system: it is used to manipulate files, run programs, and generally serve as an interface between hardware, software, and users. But Linux is used by typing text into a console, just like R. At first this is quite uncomfortable and frustrating, but it becomes natural quickly, and the advantages are enormous. In Windows and MacOS, simple file manipulations, like moving a group of files from one directory to another, are easy. But what about renaming all your 542 holiday photographs to "Tim-holiday-2019-1.jpg", "Tim-holiday-2019-2.jpg", "Tim-holiday-2019-3.jpg", etc.? On Windows this would take a whole day. In Linux I could write:
```{bash,eval=FALSE}
n=1; for i in $(ls *.jpg); do mv $i Tim-holiday-2019-${n}.jpg; n=$((n+1)); done
```
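Here is a variation of the same idea as a sketch spread over several lines (assuming you are in a directory full of .jpg files); looping over the glob pattern directly, rather than over the output of `ls`, also copes with file names that contain spaces:
```{bash,eval=FALSE}
#a sketch: loop over the files matching *.jpg directly
n=1
for f in *.jpg; do
  mv "$f" "Tim-holiday-2019-${n}.jpg"   #the quotes protect file names containing spaces
  n=$((n+1))
done
```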
Either way, all the photographs would be renamed in about half a second. Of course, learning how a command like this works takes a bit of time, so there is a trade-off: it takes time to learn, but once you can do it, many tasks that would occupy people for hours become very fast and easy. In bioinformatics, this kind of task comes up every day. Bioinformatics often involves running programs thousands or millions of times on thousands or millions of files, in all sorts of orders and combinations. So most bioinformaticians love Linux. Linux is a hugely popular open-source relative of the proprietary operating system *Unix*.

#Getting set up
The browser-based Rstudio client actually gives you very easy access to a Linux console, called the "terminal". Just click over to the "Terminal" tab (next to the "Console" tab). It might ask you for your password, which you should type and press `<enter>`.

#First steps
First, let me just prove to you that Linux is an operating system. In Linux we are always working in a directory. To see what your current *working directory* is, type this into the terminal and press `<enter>` to run it:
```{bash,eval=FALSE}
pwd
```
"pwd" stands for "print working directory".
To see all the files and directories in your working directory run this function:
```{bash,eval=FALSE}
ls
```
... which is short for "list". You will see all the directories you know from the course work so far. By default the function `ls` lists the contents of the current working directory, but you can tell it to list another one. For instance, we can list the contents of the "data" directory. Remember: The `<tab>` key also autocompletes in the terminal ... so once you have typed "da", hit `<tab>` to auto-complete.
```{bash,eval=FALSE}
ls data
```
Now try using the `cd` function ("change directory") to change your working directory to the data directory. After that, run `pwd` to confirm you changed directory, and `ls` to look at what's in the new working directory. Like this (type and run them one at a time):
```{bash,eval=FALSE}
cd data
pwd
ls
```
To move back to the old working directory, run the following command. The `..` always refers to the parent of the current working directory, so if you are in a directory "/home/user/documents", then `cd ..` will change your working directory to "/home/user".
```{bash,eval=FALSE}
cd ..
```
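As a small aside, `..` can be chained to climb several levels at once (a sketch; where you end up depends on where you start):
```{bash,eval=FALSE}
cd ../..   #go up two levels, e.g. from /home/user/documents to /home
pwd        #check where you ended up
```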
Notice how Linux functions are a bit similar to R functions: They have a name (like "cd" or "ls"), and you give them arguments (like the name of a directory). The difference is that in Linux, we don't put the arguments in brackets. However, they often do the same things! For instance, the R function `list.files()` does exactly what the Linux function `ls` does. Remember the `seq()` command in R? For instance `seq(from = 0, to = 10, by = 2)` would produce "0 2 4 6 8 10". Well, try this in the Linux terminal:
```{bash,eval=FALSE}
seq 0 2 10
```
Notice it does the same! The arguments (0, 2, and 10) are just in a different order. The point is, once you learn one language, the second one is a lot easier. There are a lot of Linux functions that exist for copying, moving, editing, and creating files/directories, etc. But today is about bioinformatics, so we'll focus less on operating-system functions and more on functions that process the contents of files. Specifically, files that contain Next Generation Sequencing data. But as a starting point, we can use a simple text file to demonstrate. The function `cat` (short for "con*cat*enate") will print the contents of a file to the terminal. The file we will print out is in the "data/" directory and it is called "poem.txt", so we will run `cat`, and give it the path to find that file ("data/poem.txt").
```{bash,eval=FALSE}
cat data/poem.txt #remember use <tab> to auto-complete as you type!
```
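Incidentally, the name "con*cat*enate" comes from the fact that `cat` can print several files one after the other. A minimal sketch, reusing the same file twice just to see the effect:
```{bash,eval=FALSE}
cat data/poem.txt data/poem.txt   #print the poem twice, one copy after the other
```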
So you should be seeing lines of text from a poem in the terminal. Now, the terminal is only where the text goes *by default*. We could also save it into a file, for example. Or---and this is one of the keys to why Linux is so magical---we can send it into another command. If we write a `|` symbol (called a "pipe") after a command, the result will not go to the screen, it will go into the second command.
For example, there is a function `wc`, which stands for "word count": it counts the lines, words, and characters in text. If we pipe the output of `cat data/poem.txt` into `wc`, it will count those things in the poem. Try this:
```{bash,eval=FALSE}
cat data/poem.txt | wc #remember, to get the previous commands you ran back, press the up arrow in the terminal. then you can just keep building on it
```
Ok, the poem has 46 lines, 366 words, and 2063 characters. I can actually modify the output of `wc` to *only* print the information I want, and I do it using a *flag*. A flag is typed after the function name, and modifies what the function does. For instance, the `-w` flag for `wc` tells it only to print the word count, and the `-l` flag tells it only to print the line count. Try these:
```{bash,eval=FALSE}
cat data/poem.txt | wc -l
cat data/poem.txt | wc -w
cat data/poem.txt | wc -c
```
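Flags can usually be combined in a single command, too. A minimal sketch that prints the line and word counts together:
```{bash,eval=FALSE}
cat data/poem.txt | wc -lw   #line count and word count in one command
```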
The function `grep` is used to select lines that match a given pattern. For example, to print the lines of the poem containing the pattern "the", we pipe the poem into `grep` and provide the pattern 'the' (the quote marks (`'`) are important!):
```{bash,eval=FALSE}
cat data/poem.txt | grep 'the'
```
####Exercise
(@) Save all the code you create. Maximum one command per instruction. Write answers where prompted. [10]
+ Change working directory into "data"
+ List the files in the new working directory
+ List the files again, but add the flag `-l` to the command. This flag will tell the function `ls` to provide (l)ong information about the files
+ Do it again, but add *two more* flags, `-h` and `-L`. This will show you the file sizes in human readable format (e.g. 23M for 23 megabytes). Which is the biggest file? [Write your answer]
+ *Without changing the working directory* list the files in the parent directory
+ Change directory back to the parent directory
+ *Without changing the working directory* print the contents of poem.txt using `cat`
+ Use `grep` to select only the lines of the poem containing the pattern "and"
+ Add the flag `-i` to grep. Compare the results with the previous command. What does the `-i` flag do? [Write your answer]
+ Extend the above command to count the number of lines containing the patterns "and" or "And". The command should *only* output the number of lines, and no other information.
+ HINT: Remember, a command can have more than one pipe in it.

#Getting friendly with NGS data
##Counting reads in a fastq file
Even with the few simple commands we have learned, it is possible to do some basic processing on NGS data. We'll first use `cat`, `grep`, and `wc` to count the number of reads in a standard NGS reads file, which is in the "fastq" format. We learned about it in the lecture. Just take a second to remind yourself of the big picture, and where these "reads" fit into it. They are the raw output of a sequencing machine, representing the sequences of bits of DNA that come from (in this case) the genome of a barley plant.

####Exercise
(@) Save all the code you create. Maximum one command per instruction. Write answers where prompted. [2]
+ The function `head` is *exactly* like `cat`: It prints the contents of a file on the screen. BUT---`head` only prints the first 10 lines. This is great to see what is in the file when the file has millions of lines in it. Use the function `head` to look at the first ten lines of the file "HOR_1251.fastq".

As we learned in the lecture, the lines are in groups of four: The name, the sequence, a "+", and the qualities (represented as a string of letters). However, we want to know how many sequences there are, so it would be best to use just one line per sequence. We could use `grep` for that! Notice that the name line of each read begins with the code '@HET-141-007:256:HNLWYBCXX' ... etc. This is information about the run (the run ID, the sequencer type, etc ...), and it's going to be the same for every read in the file, so we could use this as our pattern for `grep` to search for.
```{bash,eval=FALSE}
head data/HOR_1251.fastq | grep '@HET-141-007:256:HNLWYBCXX'
```
Now you have a way to produce one line per read. And we already know a good way to count lines, so:
####Exercise
(@) Save all the code you create. Write answers where prompted. [10]
+ Count the reads in 'HOR_1251.fastq'.
+ HINT: Remember `head` will only print 10 lines. Change it to a command that will print all the lines.
+ Count the reads in 'HOR_1384.fastq', which is also in the 'data' directory.
+ HINT: You will need to first look at the contents and identify a good pattern for `grep` to match once per read. It might not be the same as it was for HOR_1251.fastq.
+ Recall that NGS reads are constructed with adapter sequences on the ends, that allow the reads to be sequenced. Sometimes, the inserts are so small that the read extends into the adapter sequence. In this case, the read is said to be "contaminated" by adapter sequence. Use your skills to estimate what percentage of the reads in HOR_1384.fastq are suffering from contamination. The adapter sequence is 'AGATCGGAAGAGC'. Construct as many commands as you need to allow you to calculate what percentage of the reads contain the adapter sequence, and write down your estimate.
Once reads are aligned (or "mapped" in bioinformatics slang) to a reference genome, the information is stored in a .sam file (or its binary-encoded equivalent, a .bam file). To get a feel for what this information really means, we can view the .sam file's contents using a specialised viewer that displays the mapped reads the way we visualised them in the lecture: stacked up on top of the reference genome, so that the nucleotides of the read match up to their aligned positions in the reference.
The program `samtools` contains a visualiser called `tview` to do this. We can use it to look at some NGS read data from HOR_1251 that was mapped to a barley reference genome. This NGS data was created using a technique called Genotyping By Sequencing (GBS). You don't need to know about the method, only that it results in many reads often mapping to the exact same position. I've created a shortcut function that displays the alignment in the terminal. Run this command:
```{bash,eval=FALSE}
samtools_tview_HOR_1251_mapped
```
The tview display shows the reads stacked up under the reference sequence. Read sequences that match the reference are simply displayed as a dot, but any nucleotides that don't match the reference are displayed as letters. Tap (or hold) space to scroll across the genome, and backspace to scroll in the other direction. Perhaps you can find some mapped reads ... but they are spread out far across the genome, so it may be hard to find any. However, we could try going directly to a site where we expect there to be a genotype call, to see how an alignment of NGS reads can reveal genotypes at variant sites, just like KASP markers can. To do it, we can look at some of the markers that have actually been genotyped with KASP. Let's look at HOR_1251's and HOR_1384's genotypes at the locus on chromosome 1H at position 10965721. You should recognise this way of displaying genotypes by now. One new thing here is that we know exactly where these loci fall in the genome (thanks to the construction of a barley reference genome), so the positions of the loci are real physical positions along the chromosome.

Press question mark (`?`) to get a menu of helpful commands (`<esc>` to exit). Among these is `g`, to jump to a position. Press `g`, and enter the position like this: "chr1H:10965721", then press `<enter>`. The position will be placed at the far left of the screen, so use the arrow keys to move it around. Press the full stop key (`.`) to toggle nucleotide view on and off.

####Exercise
(@) Write answers where prompted. [20]
+ What is the reference genome's state (nucleotide) at position chr1H:10965721? [Write your answer]
+ How many reads from HOR_1251 mapped at this SNP locus? [Write your answer]
+ How many reads from HOR_1384 mapped at this SNP locus? [Write your answer]
+ HINT: To quit tview press `q`. To open tview for HOR_1384, run `samtools_tview_HOR_1384_mapped`.
+ What state does HOR_1251 have at this position? [Write your answer]
+ HINT: Remember you can toggle nucleotide view with `.`, also try `v`.
+ What state does HOR_1384 have at this position? [Write your answer]
+ Are there other sites within 100 bp of chr1H:10965721 at which HOR_1251 differs in sequence from the reference? If so, approximately what positions? [Write your answer]
+ Open tview for HOR_1384 and go to position chr1H:47611. Press `b` to toggle colour by base quality (press `v` and `.` to toggle highlighting/nucleotide view too if you want). Find some bases of lower quality (white is high, yellow lower, green even lower, blue lowest). Are they randomly distributed across the reads? [Describe briefly anything you notice about where low quality bases occur]
+ Remember from the lecture that the information stored in a *variant call file* (.vcf) describes the genotypes detected in a population, with each row representing one locus. Based on your knowledge of HOR_1251 and HOR_1384, copy and paste the text below into your exercise sheet, and fill in the missing entries (___) that would be needed to describe the site chr1H:10965721 in .vcf format. Remember, "DP" means read depth, and "GT" gives the genotype of the individual at the locus, where 0 means the reference allele and 1 means the alternative allele.
+ Column 1 (Chromosome name): \_\_\_\_
+ Column 2 (Position number): \_\_\_\_
+ Column 4 (Reference allele): \_\_\_\_
+ Column 5 (Alternative allele): \_\_\_\_
+ Column 10 (Genotype information for HOR_1251): DP=\_\_\_\_;GT=\_\_\_\_
+ Column 11 (Genotype information for HOR_1384): DP=\_\_\_\_;GT=\_\_\_\_

Of course visualisations like tview are nice, but they aren't ideal for a bioinformatician to use in an analysis involving billions of reads and millions of sites! For that, we need to use the information in .sam/.bam format directly. There is a (slightly simplified) .sam file for HOR_1251 in data/HOR_1251_mapped_to_reference_genome.sam. Have a look at it quickly using `head`. To analyse data in volumes like this (this is actually a very small sam file and it still contains 375,609 lines), we need to describe what we want to a computer, via some appropriate language. R is very good up to a limit, but once the volume of data goes from the millions into the billions of rows, faster tools are needed. It's time to start learning your third programming language, called *awk*. Awk is a beautiful language, and it's so compact you could theoretically learn almost all of its functionality through [Wikipedia](https://en.wikipedia.org/wiki/AWK).

##Awk command structure
Awk is designed for processing files organised into rows and columns. It reads one row at a time. For each row, it will perform a test, and if the test is passed it will run a corresponding command. Awk programs are simply written on the command line in between single quotes (`'`), and the structure looks like this:

> ' *test 1* { *commands 1* } *test 2* { *commands 2* } ... '

The **tests** we already know about: Remember all those commands with logical operators like `my_vec > 4` and `sample=="HOR_1284"` we used in R? Most of those logical operators like `>` and `==` and `!` and `<=` work in awk too! (In fact, they work in most modern programming languages.) An important "test" in awk is just `1`: If you write `1` as a test, it simply *always* counts as TRUE.

The **commands** in awk are mostly made from functions, just like in R. The simplest one is `print`. Awk commands can also use variables just like in R. There are some very special variables, however, that are automatically set to the columns of each row:

+ $1 contains the contents of column 1 (in the .sam file this would be the read name)
+ $2 contains the contents of column 2 (in the .sam file, the sam-flag value)
+ $3 contains the contents of column 3 (in the .sam file, the chromosome name)
+ $4 contains the contents of column 4 (in the .sam file, the mapping position)
+ $5 contains the contents of column 5 (in the .sam file, the mapping quality)
+ $6 contains the contents of column 6
+ ... etc ...
+ And (**important!!!**) $0 contains *the entire line*.

So, time for a simple awk command. Run me:
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' 1 {print $0} ' #press ctrl-C if you want to cancel the command once it is running
```
So, the test is `1`, meaning "do this command for every single line". The command is `print $0`, meaning "print the whole line".
This is a very stupid program: it just prints every line of the file, so it really does nothing useful in this context.
Build your understanding by experimenting with some modifications, e.g.:
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $3=="chr1H" {print $0} '
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $3!="chr1H" {print $0} '
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $3=="chr1H" && $4==10965690 {print $0} '
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $3!="chr1H" && $4<1000000 {print $0} ' | wc -l
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' 1 {print $5} '
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $5>50 {print $1} '
cat data/HOR_1251_mapped_to_reference_genome.sam | awk ' $5>50 {print "This read has high quality"} $5<20 {print "This read has low quality"} '
####Exercise
(@) Write down all the code you create. Maximum one command per instruction. [4]
+ Write a command that reads through all the lines of data/HOR_1251_mapped_to_reference_genome.sam and uses awk to print "This read mapped to chromosome 1H" or "This read did not map to chromosome 1H", as appropriate.

What if we want to count these low and high-quality reads using just one awk command? Well, we can use some variables. We could make one variable to count the low-quality reads, set it to 0 at the start, and each time a low quality read is found, add one. At the end, we could print out the number in the variable. To do that, we will use two new "tests". `BEGIN` is a test that runs its command only at the very beginning of the file. And `END` is a test that runs its command only at the end of the file. Also, we will include multiple commands using the semicolon (`;`) to separate them. Follow along with this process as we build the awk program. Make sure you understand what every part of the program is doing.
```{bash,eval=FALSE}
#begin with a low quality counter set to 0. Add 1 to the counter every time a line has quality < 20. Print the counter's value at the end.
cat data/HOR_1251_mapped_to_reference_genome.sam | awk 'BEGIN { low_quality_counter=0 } $5<20 { print "This read is low quality, adding 1!" ; low_quality_counter=low_quality_counter+1 } END { print "Number of low quality mapped reads:" ; print low_quality_counter } '
```
If you struggled to understand all of that program, read this breakdown (or talk to Tim); if not, you can skip it.

+ `BEGIN { low_quality_counter=0 }`: `BEGIN` means the commands inside the curly brackets (`{}`) are ONLY run at the very start. Here the command stores the number 0 inside a variable called "low_quality_counter".
+ `$5<20 { print "This read is low quality, adding 1!" ; low_quality_counter=low_quality_counter+1 }`: `$5` means the value in the fifth column, which is the quality. The test `$5<20` means the commands inside the curly brackets (`{}`) are run whenever this column contains a value less than 20 (i.e., when it is a low quality read). The two commands are separated by `;`. The first is `print "This read is low quality, adding 1!"`, which just prints a message to the screen. The second is `low_quality_counter=low_quality_counter+1`, meaning it replaces the value inside `low_quality_counter` with "itself plus one". So the value inside the variable will go up by 1 every time these commands are run (that is, every time the quality is below 20).
+ `END { print "Number of low quality mapped reads:" ; print low_quality_counter }`: `END` means the commands inside the curly brackets (`{}`) are ONLY run at the very end, after every line has been processed. Again there are two commands, separated by `;`. The first is `print "Number of low quality mapped reads:"`, which just prints a message to the screen. The second is `print low_quality_counter`, which prints the value inside the counter variable called `low_quality_counter`.
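If you would like a tiny, self-contained `BEGIN`/`END` example to experiment with, here is a minimal sketch (it uses `seq` to generate five lines of input, rather than one of our data files):
```{bash,eval=FALSE}
#seq prints the numbers 1 to 5, one per line; awk adds each line's value to a running total
seq 1 5 | awk 'BEGIN { total=0 ; print "Starting the total at 0" } 1 { total=total+$1 } END { print "The final total is:" ; print total }'
```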
####Exercise
(@) Write down the code you create. [4]
+ Write a command that counts the number of reads mapping to chromosome 5H in data/HOR_1251_mapped_to_reference_genome.sam, and prints the final result.
We can quite simply extend the ideas in our low-quality read counter to count both high and low quality reads. We just add a second counter, but this one will increase only when reads are above 50 in quality (i.e. when `$5>50`). And we add more print statements at the end to print the value of the second counter as well. If you feel confident, try to write a program like this yourself before you read mine.
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk 'BEGIN { low_quality_counter=0; high_quality_counter=0 } $5<20 { print "This read is low quality, adding 1 to low_quality_counter!" ; low_quality_counter=low_quality_counter+1 } $5>50 { print "This read is high quality, adding 1 to high_quality_counter!" ; high_quality_counter=high_quality_counter+1 } END { print "Number of low quality mapped reads:" ; print low_quality_counter ; print "Number of high quality mapped reads:" ; print high_quality_counter } '
```
That's a great program that a real bioinformatician would find very useful! However, it's a bit messy, don't you think? Just to demonstrate: a program like this would usually contain shorter variable names, fewer unneeded print statements, and simplified output. It would also use some of the shortcuts awk has for achieving common tasks without typing too much. For example, adding one to a variable can be done by just writing the variable name followed by `++`. Also, you don't technically need to start variables at zero; awk does that automatically. You *could* actually write that last program like this, below.
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '$5<20 {n_lq++} $5>50 {n_hq++} END {print "n_hq_reads","n_lq_reads"; print n_hq,n_lq}'
```
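To see the `++` shortcut and the automatic start-at-zero behaviour in isolation, here is a minimal sketch (`echo` just supplies a single line of input for awk to process):
```{bash,eval=FALSE}
#n is never set to 0 explicitly; awk starts it at 0, and each ++ adds 1, so this prints 2
echo "one line of input" | awk '{ n++ ; n++ ; print n }'
```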
And as you learn more tricks and shortcuts, it can get even more compact ...
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '{$5<20?n_lq++:n_hq++} END {print "n_hq_reads","n_lq_reads"; print n_hq,n_lq}'
```
You don't need to [fully understand](https://www.tutorialspoint.com/awk/awk_ternary_operators.htm) that last line (note that, unlike the earlier versions, it sorts *every* read into one of the two counters, lumping the medium-quality reads in with the high-quality ones), but the point is that programming can involve a lot of flexibility and stylistic freedom, and you can always find ways to improve your code for readability and functionality. You never stop learning.

####Exercise
(@) Write down the code you create. [4]
+ Write a single awk program that counts the number of reads mapping to chromosomes 1H and 2H, and prints both numbers at the end.

#Fantastic arrays and how to loop over them
##Array structures
When we counted reads in different quality groups, we had just two groups, so it was easy to create two counter variables, and write two print statements at the end. But what if we wanted to count many different categories of thing? For example, what about counting the reads mapping to each of seven chromosomes? We need a kind of storage system. We need an *array*. I think of an array like a train with labelled carriages carrying information. You can access the information if you provide the name of the train, and of the carriage. Yes, I know this is a bit childish, but it's how I first learned arrays and I like it. An array is a kind of variable with labelled slots in it, that can be accessed using that label. If I create an array called "my_array", I could name the slots after, say, the chromosomes. And in each slot we could store the number of reads mapping to that particular chromosome. When accessing the slots of an array in awk, we type the array's name, followed by square brackets, and inside those, the slot name. So using this array as an example, suppose the slot "chr2H" currently contains the value 102: if part of my awk program was `print my_array["chr2H"]`, it would print "102", and if part of my program was `my_array["chr2H"] = my_array["chr2H"] + 5`, it would add five to the value, so the next `print my_array["chr2H"]` would produce "107".

Make sure you run and understand this command before you move on. Focus on how both literal labels (like "frog") *and* items stored in variables (like thing1) can be used in the indices of arrays:
```{bash,eval=FALSE}
#note we aren't actually using the .sam file in any way here ...
cat data/HOR_1251_mapped_to_reference_genome.sam | awk 'BEGIN { thing1="cow"; thing2="frog"; animal_noises["frog"] = "ribbit ribbit!"; animal_noises[thing1] = "mooooo!"; print "The cow goes",animal_noises["cow"],"and the frog goes",animal_noises[thing2] ; }'
```
You may have noticed here some similarity with R's indexing systems; for example, in R `my_list[3] <- "hello!"` would set the third element of `my_list` to "hello!". That's no coincidence. Indexing objects that contain multiple entries is almost universal across programming languages, and using square brackets to denote indices is a common convention that many of them borrow from each other.
In awk, you don't need to set up arrays at the beginning of the program, you just start using them and awk automatically sets them up for you. So, to count reads hitting chromosomes, we can use the chromosome names (stored in $3) as slot labels in an array (I will call it "chr_counts"), like this:
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '1 { chr_counts[$3] = chr_counts[$3] + 1 }'
```
This is great, but it simply creates the array, and never prints anything! But since you now know how to print the contents of an array slot, you can do this yourself:

####Exercise
(@) Write down the code you create. [4]
+ Modify the command above to print the count for chromosome 6H at the end.
+ HINT: Remember, when referring to the array slot labels directly, put the label name in quote marks, e.g. `array_name["label_name"]`

##Looping
If we wanted to print the number of reads on every single chromosome, we'd need to write many print statements, one for each chromosome. But what if we didn't even know what chromosome names are in the .sam file? Or what if there were 100 different chromosomes? This would be almost impossible. What we really need is a way to repeat a command (in this case, printing something), but very slightly change the command each time. In this case we want to run it once for each chromosome name, and use the chromosome name as the label of the array slot to be printed each time. Let's say we found a way to cycle through the labels, and for each one, we would store the label name in a variable called `i`. So `i` would first contain "chr1H", and we could write `print my_array[i]` to print out the contents of `my_array["chr1H"]`. Then we set `i` to "chr2H", and repeat the process. And again and again until every label has been used.

This is called a *loop* (specifically, a *for* loop), and this is how to write one: Start with "for". In brackets, write what we want our *indexing* variable to be called (`i` in this example), then "in", and then the name of the array we are going to loop over. The command to be run each time is then written in curly braces (`{...}`). A simple thing to do with a loop would be to just print the names of all the labels of the slots in the array: they are what gets stored in `i` each time:
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '1 { chr_counts[$3] = chr_counts[$3] + 1 } END { for(i in chr_counts){print i} }'
```
And of course, the variable `i` can now be used to access the slots in the array `chr_counts`, so we can achieve our goal:
```{bash,eval=FALSE}
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '1 { chr_counts[$3] = chr_counts[$3] + 1 } END { for(i in chr_counts){print i,chr_counts[i]} }'
```
####Exercise
(@) Write down the code you create. [8]
+ Create what bioinformaticians call a *read quality histogram*: Write an awk command that counts how many reads of each quality there are, and then, at the end, loops over the quality scores and prints both the quality score and the number of reads with that quality to the screen.
+ HINT: This will be a very similar command to the previous one, just with column numbers and variable names changed.
+ HINT: Remember, the quality of the read is in column 5.

#Nucleotide evolution revealed in a .vcf file: A final challenge
As we know, a .vcf file tells us about the genetic variants that have been recorded in a dataset, with one line per site. So far, we used that information to produce genotype matrices and PCA plots to explore evolution, and to look for possible associations between the characteristics of an individual and the sequence of its genome. But what about questions of biochemistry? Could we use this dataset to tell us something about the molecular properties of DNA molecules?

Mutation happens when there is a copying error in the germ line. It can be expected that, in general, the *purine* bases (A and G) are more likely to be mistakenly copied in place of each other, because they are chemically similar. So, a G is more likely to mutate into an A than into a C or T. Similarly, a *pyrimidine* (C or T) is more likely to be mutated into a different pyrimidine. Is this true for barley? We can test it using bioinformatics. The .vcf file actually records the consequences of mutations, telling us what the nucleotide states were before and after the mutation: These are the *reference* and *alternative* alleles. We don't know which was the original state but, for this exercise, that's not important. We'll start by making a count of the reference allele for each variant site recorded in the .vcf file `data/Panel_variant_calls.vcf`. Take a look at the .vcf and remind yourself of the format. The reference allele is in column 4.

####Exercise
(@) Write down the code you create. [8]
+ The first ten rows of `data/Panel_variant_calls.vcf` contain metadata about the file that we can ignore. Write a command that uses any combination of the tools you know (cat, head, grep, awk) to print `data/Panel_variant_calls.vcf` to the screen, *excluding the first 10 rows*.
+ Add to the above command to print the same rows but only the first five columns.
+ Add to the above command again: Write an awk program to count the occurrences of each reference allele in the .vcf file, and then print the alleles and their counts at the end.
+ Remember: The reference allele is in column 4.

That's a good first step, but what we *really* want to know is which alternative alleles are more or less common for each reference allele. But how can we store that? We have an array already, but each slot only stores one value. It's almost as though we want each slot of the array storing reference allele counts to contain *another array*, that counts each time that reference allele is paired with each of the four possible alternative alleles (stored in column 5). This is called a *two-dimensional* array, and it's simpler to use than it sounds. We just use a second set of square brackets to specify the slot in the second array. To visualise it, just think of a grid: in our case, the first brackets would contain the reference allele, and the second the alternative allele.
If we called the array "nt_counts", then `nt_counts["A"]["G"]` would store the number of variant sites at which the reference allele was "A" and the alternative allele was "G". And what about printing this out? To print the array, we used a loop. And to print *arrays inside an array* ... we use *loops inside a loop*. Each time the outer loop sets one variable (say, `i`) to a new label, an inner loop is run over a second variable, say, `j`.

Here is an example of a nested loop in awk, for demonstration purposes. The "BEGIN" section first inserts values into the array, and then uses the nested loop to print them out. Make sure you run it and understand how it works; then you will be ready to do the final exercise. I will give two versions: the first is just spread over many lines to make it clearer how a loop-inside-a-loop is written. The second one is all on one line, as it would appear in the terminal.
```{bash,eval=FALSE}
#notice we aren't actually using the .sam file in any way here ...
cat data/HOR_1251_mapped_to_reference_genome.sam | awk '
BEGIN {
  my_array[1]["a"]="A";
  my_array[1]["b"]="B";
  my_array[1]["c"]="C";
  my_array[2]["a"]="D";
  my_array[2]["b"]="E";
  my_array[2]["c"]="F";
  for(i in my_array){
    for(j in my_array[i]){
      print "i is",i,", j is",j,", my_array[i][j] is",my_array[i][j]
    }
  }
}'
#same again ...
cat data/HOR_1251_mapped_to_reference_genome.sam | awk 'BEGIN { my_array[1]["a"]="A"; my_array[1]["b"]="B"; my_array[1]["c"]="C"; my_array[2]["a"]="D"; my_array[2]["b"]="E"; my_array[2]["c"]="F"; for(i in my_array){ for(j in my_array[i]){ print "i is",i,", j is",j,", my_array[i][j] is",my_array[i][j] } } }'
```
####Exercise
(@) Write down the code you create at each step. This is the last question of the whole course, so it is designed to be a challenge. You will write an awk program to count and print all the combinations of reference and alternative alleles in the .vcf file. Follow the steps. [15]
+ Start with the simplest program you can write to count just the numbers of each reference allele (column 4) into an array called `counts` (don't worry about printing the contents at the end).
+ Make `counts` into a two-dimensional array by adding a second index, whose label is the alternative allele (column 5).
+ Add a block with an END condition. Inside it, create a single loop over the variable called "i", where `i` takes the label names of `counts`. Each time it loops, make it print "Reference allele=", followed by the name of the reference allele (stored in `i`).
+ Delete the previous print statement and replace it with a second, inner loop over the variable called "j", where `j` will contain the label names of `counts[i]`. Each time this loops, print "Reference allele=", then the reference allele stored in `i`, then "Alternative allele=", then the alternative allele stored in `j`, then "Count=", then the count of sites stored in `counts[i][j]`.
+ Based on these results, do transitions appear to be more common than transversions? [Write your answer]
Congratulations! You've either proven or disproven a prediction from molecular biology, using evolutionary biology and bioinformatics. That's the end of the workshop material. If you got through all those exercises, you likely have more bioinformatic expertise than the average biologist today!