In kridsadakorn/ipcaps.bioc: Iterative Pruning to Capture Population Structure

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

What does this tutorial cover?

This tutorial guide you on how to analysis SNP data to detect population clusters. For input formats supported by IPCAPS.BIOC, please check the reference manual of ipcaps() or the vignette entitled "Supported input formats".

Required packages

In this tutorial, we will use the functions from IPCAPS.BIOC and KRIS to demonstrate how to do SNP analysis. The package KRIS is available on CRAN.

library(IPCAPS.BIOC)
library(KRIS)

Example dataset

The input files used in this tutorial are the example files that are embedded with the package. These files are:

ipcaps_example.bed
ipcaps_example.bim
ipcaps_example.fam
ipcaps_example_individuals.txt.gz

The files are the Binary PLINK format, which the bed file consists of SNP data, the bim file contains SNP information, and the fam file contains sample information. For ipcaps_example_individuals.txt.gz, this file contains additional information for samples. Note: The vignette entitled "Support input formats" for more explanation of input file format.

Once the package is installed, the location of the files can be referred as:

BED.file <- system.file("extdata", 
                        "ipcaps_example.bed", 
                        package = "IPCAPS.BIOC")

BIM.file <- system.file("extdata", 
                        "ipcaps_example.bim", 
                        package = "IPCAPS.BIOC")

FAM.file <- system.file("extdata", 
                        "ipcaps_example.fam", 
                        package = "IPCAPS.BIOC")

EXTRA.file <- system.file("extdata", 
                          "ipcaps_example_individuals.txt.gz",
                          package = "IPCAPS.BIOC")

print(BED.file)
print(BIM.file)
print(FAM.file)
print(EXTRA.file)

We can view the data in the bed, bim, fam files by using read.bed() from the package KRIS. The function read.bed returns the object which consists of the SNP matrix, the data frames of SNP information and sample information.

my.data <- read.bed(BED.file, 
                   BIM.file,
                   FAM.file)

summary(my.data)

# check for the size of SNP matrix
dim(my.data[['snp']])

# SNP matrix contains the data in the additive encoding format
one.part <- my.data[['snp']]
one.part[seq(1,10), seq(1,6)]

# check for SNP information
head(my.data[['snp.info']])

# check for sample information
head(my.data[['ind.info']])

For the extra additional individual information or ipcaps_example_individuals.txt.gz, we can use the function read.table to view the file.

extra.info <- read.table(file = EXTRA.file,
                         header = FALSE)

head(extra.info)

Analysis step

The SNP analysis can be done using the function ipcaps. We only need to indicate the bed file, however, the bim and fam files must be presented in the same directory as the bed file. We indicate the labels of samples as the second column in the extra file (label.file = EXTRA.file, lab.col = 2).

Note: Due to the iterative process, the run time may take quite long if there are many samples. However, the number of SNPs has a minor role in contributing to analysis time.

my.result <- ipcaps(bed = BED.file, 
                    label.file = EXTRA.file, 
                    lab.col = 2,
                    out = tempdir(), 
                    silence = TRUE, 
                    max.thread = 1,
                    seed = 1234)

summary(my.result)

The path to all output files is indicated in my.result[['output.dir']].

print(my.result[['output.dir']])

Here are the list of output files:

dir(my.result[['output.dir']])

The intermediate data are saved as the RData format located in the directories RData. For visualization, the function ipcaps generates four HTML files including tree_text.html, tree_scatter_cluster.html, tree_scatter_label.html, and tree_scree.html. See more detail in the section of Visualization.

The cluster memberships of all samples are saved in my.result[['cluster']], which is the same result as in groups.txt.

head(my.result[['cluster']])

We can check the distribution of samples in the output clusters from their labels by:

table(my.result[['cluster']][['group']], my.result[['cluster']][['label']])

Visualization

By default, the function ipcaps generates four HTML files located as indicated in my.result[['output.dir']]. These files include:

tree_text.html is the HTML output in text mode.
tree_scatter_cluster.html is the HTML output with PC plots highlighted by cluster membership.
tree_scatter_label.html is the HTML output with PC plots highlighted by labels.
tree_scree.html is the HTML output with scree plots.

This is the example of tree_text.html.

tree_txt.html {width=500px}

This is the example of tree_scatter_cluster.html.

tree_scatter_cluster.html {width=500px}

This is the example of tree_scatter_label.html.

tree_scatter_label.html {width=500px}

This is the example of tree_scree.html.

tree_scree.html {width=500px}

Here is the example to show how to manually visualize the result. We can visualize the overall PC as follow:

# get the data of node 1
node1 <- get.node.info(cluster.obj = my.result,
                       node = 1)

summary(node1)

# visualize the first three PCs of node 1 and color by their labels
plot3views(node1[['PCs']],
           labels = node1[['label']])

# visualize the first three PCs of node 1 and exclude some samples for example,
# exclude outliors, then color by their labels

filtered.index <- !(node1[['label']] %in% c('outlier6'))
filtered.PCs <- node1[['PCs']][filtered.index,]
filtered.label <- node1[['label']][filtered.index]
plot3views(filtered.PCs,
           labels = filtered.label)

# visualize the first three PCs of node 1 and color by cluster membership
head(node1[['index']])
plot3views(node1[['PCs']],
           labels = my.result[['cluster']])

Check the top discriminating SNPs among clusters

We can check for which SNPs can be used to discriminate between two clusters by using the function top.discriminator. This function simply checks the differences among allele frequencies based on the fixation index (F~ST~). For example, we check for the discriminating SNPs between clusters 1 and 2. Here, to obtain the discriminating SNPs from percentile at 99.9%:

top.snp1 <- top.discriminator(cluster.obj = my.result,
                             group1 = 2,
                             group2 = 3,
                             bim.file = BIM.file,
                             percentile = 0.999,
                             use.percentile = TRUE)

print(top.snp1)

Here, to obtain the discriminating SNPs from the top 10:

top.snp2 <- top.discriminator(cluster.obj = my.result,
                             group1 = 2,
                             group2 = 3,
                             bim.file = BIM.file,
                             num.top = 10)

print(top.snp2)