knitr::opts_chunk$set(echo = TRUE)

Preamble

Every good recipe is built from basic ingredients. Likewise, bioinformatic pipelines, while complex, require some fundamental methods and data types that are the bare neccessities to any study implementing population genomic approaches.

In this tutorial, you will:

  1. Become familiar with some basic features of genomalicious.
  2. Develop familiarity with SNP data structures.
  3. Learn to use genomalicious functions to import and manipulate SNP data structures.

Getting to know genomalicious

Let's start by loading the genomalicious library into your R session.

library(genomalicious)

Genomalicious contains a number of demonstrative toy datasets that you can experiment on. These need to be loaded in with the data function and typically follow the naming convention, data_[data name], where [data name] is a unique identifier.

For example, let's take a look at a toy dataset of genotype data from four simulated populations:

data("data_Genos")

# Take a look at the structure of this data, the `dim` function reports the dimensions.
dim(data_Genos)

# The class
class(data_Genos)

# Print the data to screen.
data_Genos

You will notice that data_Genos is a dual classed object: a data.table and a data.frame. This is a long-format data table, sample IDs and locus IDs are in the columns (SAMPLE and LOCUS, respectively), and the genotypes a recorded in a single column (GT). There is also information on the population ID (POP) and the chromosome (CHROM).

You can learn more about a dataset by using ? and the data object name, e.g. ?data_Genos. We will look at other data sets in later tutorials, but you can peruse the various datasets by typing genomalicious::data_ and hitting the TAB key to get a list of options.

Loading VCFs into R

Variant call files (VCFs) are one of the typical end products from read assembly/variant calling pipelines in population genomic analyses. These files contain information about genotypes and read counts attributed to each sample and associated mapping and genotype calling statistics.

Genomalicious offers a very simple way to import VCFs into R as long-format data tables, whereby loci and and populations are both in rows, as described above.

We will now import a demo VCF into R using the vcf2DT function. First, we need to find where genomalicious is installed and make a path to the demo file.

# Create a link to raw external datasets in genomalicious
genomaliciousExtData <- paste0(find.package('genomalicious'), '/extdata')

# This command here shows you the VCF file that comes with genomalicious
list.files(path=genomaliciousExtData, pattern='indseq.vcf')

# Use this to create a path to that file
vcfPath <- paste0(genomaliciousExtData, '/data_indseq.vcf')

# The value of vcfPath will depend on your system
vcfPath

You can naviagte yourself to the path stored in vcfPath and open the VCF using a text editor. However, we can simply read the lines of the file and print them to screen:

# Read in and print first 20 lines of the demo VCF
readLines(vcfPath)[1:20]

This demo VCF was constructed more simply for the purpose of this tutorial, but follows the same basic structure of that produced from variant calling software. You will see that the text is interspersed with \t: these indicate tabs that separate the various columns in the file.

A VCF has the following components:

  1. A comments section marked with double hashes, ##. These typically contain details about the contents of the VCF or how the reads were called.
  2. A heading section marked with a single hash, #. This is effectively the column names for the wide format SNP data.
  3. All lines proceeding contain SNP data.

VCFs can be a bit tricky to interpret when you first see them. All columns before FORMAT contain information about the SNP (i.e. the chromosome, its position, the alleles). FORMAT itself contains a text string detailing the contents of the sample columns. Note, the contents of FORMAT are typically described in the comments section. We therefore know that DP:RO:AO represent the total read depth (DP), and the counts of the reference (RO) and alternate (AO) allele. Also note the difference in DP stored in the INFO column (read depth across all samples) and that in FORMAT (read depth within each sample).

All sample columns occur after FORMAT (e.g. Ind1.115, Ind1.243) and continue to the end of the line. Within each sample column, the values described in FORMAT are stored, with each value separated by a :. This demo VCF contains 256 simulated individuals sampled from four populations, with naming conventions Ind[pop]_[sample], were pop is the population ID and sample is the sample ID.

Note the difference in the VCF file as a wide-format data structure to that of the long-format data structure we saw earlier. In wide-format, samples are in columns, and the row represent some common measurement across those samples; in this case, genotypes at a SNP locus.

Now that we understand the data we are working with, let's import it into R.

# Import VCF into R using the path name
indSnps <- vcf2DT(vcfPath)

# First 8 rows of the imported SNP data
head(indSnps, 8)

# The imported data is stored as a data.table object
class(indSnps)

As you can see, the function vcf2DT converts the wide-format VCF data into a long-format data table. Instead of a single column for each sample (and loci in rows), all possible sample-by-locus combinations are stored in the rows of indSnps, with a single column each for samples and loci. Using the $ notation, you can access these columns as vectors.

# Column vector for sample and loci
head(indSnps$SAMPLE)

head(indSnps$LOCUS)

Typically in these tutorials, and throughout the documentation for genomalicious, I will use $ to indicate nested vectors of R objects (e.g. columns, list items).

You would have noticed a column in indSnps called GT: this column contains information on each sample's genotype at a particular locus. '0' is the reference allele and '1' is the alternate allele; each allele is separated by a '/'. This is a separated format of genotype values as character class. However, another way of representing genotypes is as counts of one of the alleles, e.g. '0', '1', or '2', as an integer class.

When genotypes are represented as allele counts, we assume biallelic data because it is only possible to keep track of a maximum of two unique alleles. In reality, there are other ways to work around this, but many population genomic analyses are constrained to be on biallelic SNPs, so we will focus on those here.

The function genoscore_converter can be used to convert genotypes between the biallelic scoring formats. The input into this function is simply a vector of genotypes. If inputting the separated format, the vector must be a character class object. If inputting allele count format, the vector is ideally integer class object. The function bases its counts of the alternate allele, hence a genotype of '0/0', '0/1', and '1/1', are equivalent to 0, 1, and 2 (respectively).

# Practise run with simple vectors
genoscore_converter(c('0/0', '0/1', '1/1'))

genoscore_converter(c(0L, 1, 2))

# Now convert the first 15 genotypes from separated to count format.
genoscore_converter(indSnps$GT[1:15])
indSnps$GT[1:15]

Data tables for storing SNP data

Genomalicious is largely built around data.table classed objects. Data tables feel and more-or-less function as per standard R data.frame objects, but they differ in two major ways. Firstly, they are much more efficient at storing large volumes of data. Secondly, they have some nifty features that facilitate easy data manipulations. Though the purpose of this tutorial is not to provide a detailed demonstration of data.table object features (you can find that, here), let's just take a quick look at how you can harness the power of data tables.

In the simplest case, objects of class data.table can be manipulated using the semantics D[i, j, by]. If D is the data table, i represents rows, j represents columns, and by represents an operation we would like to perform on the data.

Just like a regular object of class data.frame we can subset a data table using integer indexes and column names.

indSnps[1:5,]
indSnps[, 1:3]
indSnps[1:5, 1:3]
indSnps[1:5, c('LOCUS','SAMPLE','GT')]

But the neat thing about data tables is that we can use expressions to manipulate rows (at position i) and columns (at position j) using some sort of grouping (at position by). Here are some very simple examples:

# Subset rows to keep only those with a depth > 15.
indSnps[DP > 15,]

# You can apply functions to columns, for example,
# take the mean depth across all samples and loci.
indSnps[, mean(DP)]

# We can add a grouping to our calculation of the mean. Here, we
# calculate the mean by locus.
indSnps[, mean(DP), by=LOCUS]

# We can combine manipulations of rows, columns and groups. Here, we 
# filter for read depth > 30, then determine the number of samples
# with that read depth at each locus.
indSnps[DP > 30, length(unique(SAMPLE)), by=LOCUS]

# You can also use these feature of column manipulation to
# apply a function to the data and create a new column using
# the ':=' notation. For example, let's add a column the scores
# genotypes as an integer of counts of the alternate allele,
# as opposed to the VCF standard character format (0/0, 0/1, or 1/1).
# We will use the genomalicious function, genoscore_converter.
indSnps[, GT.INT:=genoscore_converter(GT)]
indSnps[1:5]

Many functions in genomalicious take long-format data.table objects as their direct input. From personal experience, I find data tables from 100s of individuals, and 1,000s to 10,000s of SNP loci are manageable, though they may take a moment to load into R (especially when importing from a VCF). I believe the ease, simplicity, and versatility of working with data tables makes them a great way to deal with high dimensional data (many samples, many loci, many chromosomes/contigs).

As a disclaimer, it is important to note that the utility of the data table oriented methods developed in genomalicious will be limited by the dimensionality of your dataset and the memory and processing power of your system. Increasingly larger sample sizes and/or numbers of SNP loci will affect performance relative to the available computing resources. This is something to keep in mind, and very large genomic data sets will probably benefit from using other more memory efficient R packages specifically designed to handleenormous data sizes.

Data structures: Long-to-wide-format and genotype values

Though many functions in R expect data structured in long-format, there are lots of others that require data to be wide-format. This is especially the case in many population genetics/genomics R packages. Remember, in long-format, loci-by-sample combinations are all in rows, whereas in wide-format, samples are in rows and loci are in columns.

There are two functions in genomalicious for long-to-wide conversions: DT2Mat_genos for individual genotypes and DT2Mat_freqs for allele frequencies. Both return an R matrix object.

Let's first try this out with the individual-level genotype data we imported at the start of this lesson. Earlier, you made a column in indSnps, $GT.INT, that contained the genotypes scored as integer counts of the alternate alleles.

Take a look at the help file for DT2Mat_genos:

?DT2Mat_genos

You will see that DT2Mat_genos requires a data table as input and specification of the sample, locus, and genotype columns in the long format data table as the arguments, popCol, locusCol, and genoCol, respectively. The default values of these arguments are popCol='SAMPLE', locusCol='LOCUS', genoCol='GT', but remember, GT in our data table indSnps records genotypes as characters, but we want to create a matrix of genotypes scored as integers. We therefore need to manually specify genoCol.

# Convert long-format data table of genotypes to a wide-format matrix
genosMat <- DT2Mat_genos(indSnps, genoCol='GT.INT')

# First 10 individuals and first 3 loci
genosMat[1:10,1:3]

# Check the class
class(genosMat)

# Sample names are stored in the rows of the matrix
rownames(genosMat)

# Loci names are stored in the columns of the matrix (the first 20)
colnames(genosMat)[1:20]

There is also a way to go in reverse, that is, to convert the wide-format matrix back into a long-format data table. This is done by specifying the argument flip=TRUE (default value is FALSE). Doing so requires the input to be a wide-format matrix. There are again default values of popCol='SAMPLE', locusCol='LOCUS', genoCol='GT', which specify the long-format columns.

genosDT <- DT2Mat_genos(genosMat, flip=TRUE)

genosDT

Postamble

This concludes the 'Basic Ingredients' tutorial! You should now be comfortable with the basic functionality of genomalicious for importing SNP data into R. You have also familiarised yourself with SNP data structures and learnt how genomalicious can be used to do some basic manipulations that are common in population genetic/genomic analyses.



j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.