knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(SNPFastImputeMac)
library(tidyverse)
library(ggplot2)
library(ggpubr)
#library(doParallel)
#library(foreach)
#library(doRNG)

1. Data preparation.

1.1. From vcf (Variant Call Format) to matrix.

Load the VCF dataset.
Some VCF files contain header lines at the top. If that is the case, just read the file (using the read.table function) starting from the first row below those headers. (For more information about the VCF format, see https://en.wikipedia.org/wiki/Variant_Call_Format.)
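For example, a minimal sketch of skipping those header lines (assuming a hypothetical file name mydata.vcf; VCF meta lines start with "##"):

n_header <- sum(startsWith(readLines("mydata.vcf"), "##"))  # count "##" meta lines
full_vcf <- read.table("mydata.vcf", skip = n_header, header = TRUE,
                       sep = "\t", comment.char = "", check.names = FALSE,
                       stringsAsFactors = FALSE)  # "mydata.vcf" is a placeholder

Here we just load the previously read full_vcf file.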

data("full_vcf")
dim(full_vcf)
full_vcf[1:5, c(1:7, 9:14)]

The original VCF file contains 52 columns (samples start at column 10) and 19999 SNP positions as rows.

The information we need is in column 3 (the SNP labels) and in the columns from column 10 onward (one column per sample).

Only the part of each sample column before the first ":" is used in our analyses. The two integers separated by "/" indicate the genotypes of the two alleles at that SNP position (0: wild type; 1: the most common mutation type; 2: the second most common mutation type; etc.). For now we only care whether an allele is wild type, so we change any number greater than 1 to 1.

So we have only two types of values for each allele (0: wild type; 1: mutation). This can be modified if we need to distinguish between mutation types in the future.

Instead of keeping two numbers for each SNP, we add the values of the two alleles together. So finally, each SNP position takes one of 3 values (0: both alleles are wild type; 1: one allele is mutated; 2: both alleles are mutated). We store this information in a numeric matrix whose rows are the samples and whose columns are the SNP positions.
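To make the recoding concrete, here is a minimal illustration for a few genotype strings (a sketch only, not the actual vcf2df implementation):

geno_strings <- c("0/0:12", "0/1:7", "1/1:3", "2/1:5", "./.:0")  # made-up examples
gt <- sub(":.*", "", geno_strings)       # keep only the part before the first ":"
alleles <- strsplit(gt, "/")             # split into the two allele indicators
codes <- sapply(alleles, function(a) {
  a <- suppressWarnings(as.numeric(a))   # "." (missing genotype) becomes NA
  sum(pmin(a, 1))                        # collapse mutation types > 1 to 1, then add
})
codes                                    # 0 1 2 2 NA

The package function vcf2df performs this conversion for the whole file: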

full_df <- vcf2df(full_vcf)
full_df[1:4, 1:4]
sub <- 200  # number of SNP columns to keep in the subset below

There will be a warning message about NAs; it is caused by converting the missing SNP genotypes to NA, which is exactly what we want, so the NAs do not indicate a problem.
There are r dim(full_df)[1] samples and r dim(full_df)[2] SNPs in the original dataset, and the missing ratio is r sum(is.na(full_df))/length(full_df).

This data matrix is too large, so we will only use a subset of it (the first r sub SNPs).

SNP_orig_sub <- full_df[, 1:sub]

The original NA ratio is r sum(is.na(SNP_orig_sub))/length(SNP_orig_sub).

1.2. Introduction of additional NAs.

There are several factors we vary in our predictions:
- (1) the percentage of NAs in the matrix,
- (2) the window size of the surrounding SNPs used for model building,
- (3) the number of training rounds of the model.

We want to know how these affect the following properties:
- (1) classification error: the fraction of imputed genotypes that are not correctly classified.
- (2) running time: e.g., if enlarging the window size only slightly decreases the classification error but costs an unreasonably long time, then we might not choose that size.

To calculate this error, the best way would be to use an SNP matrix that contains no missing SNPs (which I could not find; all my current SNP files contain many missing values).

The second best way is to simulate an SNP matrix that contains no missing values. This is not an easy task, because we need to make sure that both the correlations between the SNPs and the frequencies of wild-type alleles reflect the true situation. I am planning to do this in a later step, but currently I do not have a good way to simulate such data.

So what I am doing now is introducing new missing values into the original matrix until the total missing rate reaches a target value, and recording the positions where they were introduced. Then I impute all missing values to obtain the filled data, and use only the recorded introduced-NA positions, compared between the original and the filled data, to calculate the classification error. This way, although not as good as the first two, should still capture how the factors above affect the classification error and running time.
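The masking idea can be sketched as follows (a minimal illustration only; NA_Generator and classification_error in the package are the real implementations):

mask_entries <- function(mat, target_ratio) {
  observed <- which(!is.na(mat))                   # entries that can still be masked
  n_new <- round(target_ratio * length(mat)) - sum(is.na(mat))
  pos <- sample(observed, max(n_new, 0))           # randomly chosen entries to hide
  mat[pos] <- NA
  list(masked = mat, positions = pos)
}
# After imputation, the error is computed only at the masked positions, e.g.:
# mean(filled[positions] != original[positions])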

2. Model building and predictions.

2.1. We want to check how different missing ratios affect the prediction accuracy and processing time.

Generate a list of datasets with missing ratios from 0.05 to 0.25:

ratios <- seq(0.05, 0.25, by = 0.05)
SNP_NA_dfs <- list()
ratios_len <- length(ratios)
#set.seed(20980)  # uncomment for reproducible NA positions
for (i in 1:ratios_len){
  SNP_NA_dfs[[i]] <- NA_Generator(SNP_orig_sub, ratios[i])  # mask entries up to ratios[i]
  print(SNP_NA_dfs[[i]]$NA_percent_generate)                # achieved missing ratio
}
names(SNP_NA_dfs) <- paste("missing", as.character(ratios))

Now we calculate the classification errors and processing times for the different missing ratios.

df_fills <- list()
proctimes <- rep(NA, ratios_len)
for(i in 1:ratios_len) {
  timeprocess <- system.time(
      df_fills[[i]] <- Impute_GenoType_XGBoost(SNP_NA_dfs[[i]]$SNP_NA_df)
  )
  proctimes[i] <- timeprocess[1]  # user CPU time in seconds
}
errors <- rep(NA, ratios_len)
for(i in 1:ratios_len){
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]], SNP_NA_dfs[[i]]$NP_generate_positions)
}
errors_df <- data.frame(ratios, errors)
times_df <- data.frame(ratios, proctimes)
p1 <- ggplot(errors_df, aes(x = ratios, y = errors)) + geom_point() + ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = ratios, y = proctimes)) + geom_point() + ylim(c(0, 20))
ggarrange(p1, p2, nrow = 1)

The results show that as the missing ratio increases, the classification error rate does not change much, but the processing time increases monotonically.

2.2. Check how different window sizes affect the prediction accuracy and processing time.

We use the dataset with 20% missing values (the fourth element of SNP_NA_dfs) for these tests.

sizes <- seq(10, 50, by = 10)
sizes_len <- length(sizes)
SNP_NA_df02 <- SNP_NA_dfs[[4]]$SNP_NA_df  # the dataset with 20% missing values
df_fills <- list()
proctimes <- rep(NA, sizes_len)
for(i in 1:sizes_len) {
  timeprocess <- system.time(
      df_fills[[i]] <- Impute_GenoType_XGBoost(SNP_NA_df02, size = sizes[i])
  )
  proctimes[i] <- timeprocess[1]
}
errors <- rep(NA, sizes_len)
for(i in 1:sizes_len){
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]], SNP_NA_dfs[[4]]$NP_generate_positions)  # positions masked in the 20% dataset
}
errors_df <- data.frame(sizes, errors)
times_df <- data.frame(sizes, proctimes)
p1 <- ggplot(errors_df, aes(x = sizes, y = errors)) + geom_point() + ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = sizes, y = proctimes)) + geom_point() + ylim(c(0, 20))
ggarrange(p1, p2, nrow = 1)

