knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial guide you on how to analysis SNP data to detect population clusters. For input formats supported by IPCAPS.BIOC, please check the reference manual of ipcaps() or the vignette entitled "Supported input formats".
In this tutorial, we will use the functions from IPCAPS.BIOC and KRIS to demonstrate how to do SNP analysis. The package KRIS is available on CRAN.
library(IPCAPS.BIOC) library(KRIS)
The input files used in this tutorial are the example files that are embedded with the package. These files are:
The files are the Binary PLINK format, which the bed file consists of SNP data, the bim file contains SNP information, and the fam file contains sample information. For ipcaps_example_individuals.txt.gz, this file contains additional information for samples. Note: The vignette entitled "Support input formats" for more explanation of input file format.
Once the package is installed, the location of the files can be referred as:
BED.file <- system.file("extdata", "ipcaps_example.bed", package = "IPCAPS.BIOC") BIM.file <- system.file("extdata", "ipcaps_example.bim", package = "IPCAPS.BIOC") FAM.file <- system.file("extdata", "ipcaps_example.fam", package = "IPCAPS.BIOC") EXTRA.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz", package = "IPCAPS.BIOC") print(BED.file) print(BIM.file) print(FAM.file) print(EXTRA.file)
We can view the data in the bed, bim, fam files by using read.bed() from the package KRIS. The function read.bed returns the object which consists of the SNP matrix, the data frames of SNP information and sample information.
my.data <- read.bed(BED.file, BIM.file, FAM.file) summary(my.data) # check for the size of SNP matrix dim(my.data[['snp']]) # SNP matrix contains the data in the additive encoding format one.part <- my.data[['snp']] one.part[seq(1,10), seq(1,6)] # check for SNP information head(my.data[['snp.info']]) # check for sample information head(my.data[['ind.info']])
For the extra additional individual information or ipcaps_example_individuals.txt.gz, we can use the function read.table to view the file.
extra.info <- read.table(file = EXTRA.file, header = FALSE) head(extra.info)
The SNP analysis can be done using the function ipcaps. We only need to indicate the bed file, however, the bim and fam files must be presented in the same directory as the bed file. We indicate the labels of samples as the second column in the extra file (label.file = EXTRA.file, lab.col = 2).
Note: Due to the iterative process, the run time may take quite long if there are many samples. However, the number of SNPs has a minor role in contributing to analysis time.
my.result <- ipcaps(bed = BED.file, label.file = EXTRA.file, lab.col = 2, out = tempdir(), silence = TRUE, max.thread = 1, seed = 1234) summary(my.result)
The path to all output files is indicated in my.result[['output.dir']].
print(my.result[['output.dir']])
Here are the list of output files:
dir(my.result[['output.dir']])
The intermediate data are saved as the RData format located in the directories RData. For visualization, the function ipcaps generates four HTML files including tree_text.html, tree_scatter_cluster.html, tree_scatter_label.html, and tree_scree.html. See more detail in the section of Visualization.
The cluster memberships of all samples are saved in my.result[['cluster']], which is the same result as in groups.txt.
head(my.result[['cluster']])
We can check the distribution of samples in the output clusters from their labels by:
table(my.result[['cluster']][['group']], my.result[['cluster']][['label']])
By default, the function ipcaps generates four HTML files located as indicated in my.result[['output.dir']]. These files include:
This is the example of tree_text.html.
{width=500px}
This is the example of tree_scatter_cluster.html.
{width=500px}
This is the example of tree_scatter_label.html.
{width=500px}
This is the example of tree_scree.html.
{width=500px}
Here is the example to show how to manually visualize the result. We can visualize the overall PC as follow:
# get the data of node 1 node1 <- get.node.info(cluster.obj = my.result, node = 1) summary(node1) # visualize the first three PCs of node 1 and color by their labels plot3views(node1[['PCs']], labels = node1[['label']]) # visualize the first three PCs of node 1 and exclude some samples for example, # exclude outliors, then color by their labels filtered.index <- !(node1[['label']] %in% c('outlier6')) filtered.PCs <- node1[['PCs']][filtered.index,] filtered.label <- node1[['label']][filtered.index] plot3views(filtered.PCs, labels = filtered.label) # visualize the first three PCs of node 1 and color by cluster membership head(node1[['index']]) plot3views(node1[['PCs']], labels = my.result[['cluster']])
We can check for which SNPs can be used to discriminate between two clusters by using the function top.discriminator. This function simply checks the differences among allele frequencies based on the fixation index (F~ST~). For example, we check for the discriminating SNPs between clusters 1 and 2. Here, to obtain the discriminating SNPs from percentile at 99.9%:
top.snp1 <- top.discriminator(cluster.obj = my.result, group1 = 2, group2 = 3, bim.file = BIM.file, percentile = 0.999, use.percentile = TRUE) print(top.snp1)
Here, to obtain the discriminating SNPs from the top 10:
top.snp2 <- top.discriminator(cluster.obj = my.result, group1 = 2, group2 = 3, bim.file = BIM.file, num.top = 10) print(top.snp2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.