knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(SNPFastImpute)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(nlme)

1. Data preparation.

1.1. From vcf (Variant Call Format) to matrix.

Load the vcf dataset.

Some vcf files might contain header lines before the data. If that is the case, just read the file (using the read.table function) starting from the first row below those headers. (For more information about the vcf format, see https://en.wikipedia.org/wiki/Variant_Call_Format.)
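For reference, a minimal sketch of such a read, assuming a raw file named example.vcf (a placeholder name):

vcf_lines <- readLines("example.vcf")
## Count the "##" meta lines at the top of the file
skip_n <- sum(startsWith(vcf_lines, "##"))
## The "#CHROM" line just below them provides the column names
ex_vcf <- read.table("example.vcf", sep = "\t", skip = skip_n,
                     header = TRUE, comment.char = "", stringsAsFactors = FALSE)

Here we just load the previously read full_vcf file.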

data("full_vcf")
dim(full_vcf)
full_vcf[1:5, c(1:7, 9:14)]

The original vcf file contains 19999 rows (SNP positions) and 52 columns (the samples start from column 10).

Data Explanation

The information we need is in column 3 (the SNP labels) and in the sample columns (column 10 onward).

Each row corresponds to an SNP, and each sample occupies one column in the original data.

Only the first part of each sample field, before the first ":", is used in our analyses. The two integers separated by "/" indicate the genotypes of the two alleles at that SNP position (0: wild type; 1: the most common mutation type; 2: the second most common mutation type; etc.). Since we currently only care whether an allele is wild type, we change any number greater than 1 to 1.

So we only have two types of values for each allele (0 means wild type, 1 means mutation). This can be changed if we need to distinguish different mutation types in the future.

Instead of keeping two numbers for each SNP, we add the numbers for the two alleles together. So finally, each SNP position takes one of 3 values (0: both alleles are wild type; 1: one allele is mutated; 2: both alleles are mutated). We store this information in a numeric matrix whose rows are the samples and whose columns are the SNP positions.
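The following sketch illustrates this recoding on a few genotype strings; the helper name geno_to_dosage is purely illustrative, since vcf2df performs the equivalent conversion internally:

## Illustrative recoding of genotype strings into 0/1/2 dosages
geno_to_dosage <- function(x) {
  gt <- sub(":.*", "", x)                    # keep the part before the first ":"
  sapply(strsplit(gt, "/", fixed = TRUE), function(a) {
    a <- suppressWarnings(as.integer(a))     # "." (missing call) becomes NA
    a[!is.na(a) & a > 1] <- 1                # any mutation type counts as 1
    sum(a)                                   # 0, 1, 2, or NA
  })
}
geno_to_dosage(c("0/0:12,0", "0/1:7,5", "1/2:0,9", "./.:0,0"))
#> [1]  0  1  2 NA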

full_df <- vcf2df(full_vcf)

There will be a warning message about NAs introduced by coercion; it comes from converting the missing SNP positions, which is exactly what we want. So here the NAs do not indicate a problem: they are what we are going to impute with our package.

After processing, the raw data looks like the following matrix (showing only the first 4 rows and 4 columns):

full_df[1:4, 1:4]
sub <- 100

There are `r dim(full_df)[1]` samples and `r dim(full_df)[2]` SNPs in the original dataset, and the missing ratio is `r sum(is.na(full_df))/length(full_df)`.

The full data matrix is too large, so we will only use a subset of it (the first `r sub` SNPs) to give a short example of how our package deals with the missing values in the matrix.

SNP_orig_sub <- full_df[, 1:sub]

The original NA ratio is `r sum(is.na(SNP_orig_sub))/length(SNP_orig_sub)`.

1.2. Introduction of additional NAs.

There are several factors that can vary in our prediction, for example:
* the missing ratio of the data (part 2.1);
* the window size used for imputation (part 2.2);
* whether SNP correlations are used to select the windows (part 2.3).

We want to know how these affect the following properties:
* classification error: the ratio of imputed values that are not correctly classified;
* processing time.

To calculate the true classification error, the best way is to find an SNP matrix that contains no missing SNPs (which I could not find; all my current SNP files contain a lot of missing values).

The second best way is to simulate an SNP matrix that contains no missing values. This is not an easy task because we need to make sure the correlations between the SNPs and the frequencies of wild-type alleles represent the real-life situation. I am planning to do this in the next step, but currently I do not have a good way to simulate such data.

So what I do now is to introduce new missing values into the original matrix until the total missing rate reaches a target value, and record the introduced positions. Then I impute all missing values to obtain the filled data, and calculate the classification error only at the recorded introduced NA positions, comparing the original and the filled data there. This way, although not as good as the first two, can still capture how the varying factors affect the classification error and the running time.
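A minimal sketch of this evaluation scheme, with illustrative helper names (in the package, NA_Generator and classification_error play these roles):

## Illustrative sketch of the evaluation scheme described above
introduce_NAs <- function(mat, target_ratio) {
  observed <- which(!is.na(mat))                  # positions that can be masked
  n_new <- max(0, round(target_ratio * length(mat)) - sum(is.na(mat)))
  pos <- sample(observed, n_new)                  # record the introduced positions
  mat[pos] <- NA
  list(SNP_NA_df = mat, positions = pos)
}
error_on_introduced <- function(orig, filled, pos) {
  mean(orig[pos] != filled[pos])                  # misclassified fraction at introduced NAs
}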

2. Model building and predictions

2.1. Check how different missing ratios affect the prediction accuracy and processing time.

Generate a list of data sets with missing ratios ranging from 0.05 to 0.25. The achieved NA ratios are printed below.

## Create ratio sequence
ratios <- seq(0.05, 0.25, by = 0.05)
## Create list and vector to store the introduced NA data
SNP_NA_dfs <- list()
ratios_len <- length(ratios)
## Introduce NAs to reach the corresponding ratio
for (i in 1:ratios_len){
  SNP_NA_dfs[[i]] <- NA_Generator(SNP_orig_sub, ratios[i])
  print(SNP_NA_dfs[[i]]$NA_percent_generate)
}
names(SNP_NA_dfs) <- paste("missing", as.character(ratios))

Now we calculate the classification errors and processing times for the different missing ratios.

## Create an empty list to store the filled data
df_fills <- list()
## Create an empty vector to store the processing times
proctimes <- rep(NA, ratios_len)
## Fill each data matrix using the parallel version of the function
for(i in 1:ratios_len) {
  timeprocess <- system.time(
      df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df)
  )
  proctimes[i] <- timeprocess[1]  # user CPU time, in seconds
}

Next we calculate the classification error for each ratio.

errors <- rep(NA, ratios_len)
for(i in 1:ratios_len){
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]], SNP_NA_dfs[[i]]$NP_generate_positions)
}

Plot the results.

errors_df <- data.frame(ratios, errors)
times_df <- data.frame(ratios, proctimes)
p1 <- ggplot(errors_df, aes(x = ratios, y = errors)) + geom_point() + ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = ratios, y = proctimes)) + geom_point() + ylim(c(0, 0.5)) + ylab("process time")
ggarrange(p1, p2, nrow = 1)

The result shows that for the current subset of data, increasing the missing ratio from 0.05 to 0.25 did not substantially affect the classification error or the processing time.

2.2. Check how different window sizes affect the prediction accuracy and processing time.

We use the 20% missing data (SNP_NA_dfs[[4]]) to perform the tests. We choose 5 different window sizes: 10, 20, 30, 40, 50.

sizes <- seq(10, 50, by = 10)
sizes_len <- length(sizes)
SNP_NA_df02 <- SNP_NA_dfs[[4]]$SNP_NA_df
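
As a rough, hypothetical illustration of what a window means here (the exact windowing logic inside Impute_GenoType_XGBoost_para may differ), a window of size 10 around SNP column 25 could be the set of neighboring columns used as predictors:

## Hypothetical sketch: neighboring columns around SNP column j = 25
## (assumption; the package's internal windowing may differ)
j <- 25
s <- 10
window_cols <- setdiff(max(1, j - s):min(ncol(SNP_NA_df02), j + s), j)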

Next, we use the imputation function to impute with different window sizes.

df_fills <- list()
proctimes <- rep(NA, sizes_len)
for(i in 1:sizes_len) {
  timeprocess <- system.time(
      df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_df02, size = sizes[i])
  )
  proctimes[i] <- timeprocess[1]  # user CPU time, in seconds
}

Then we calculate the classification errors.

errors <- rep(NA, sizes_len)
for(i in 1:sizes_len){
  ## All fills here are based on the 20% missing data, so the introduced
  ## NA positions always come from SNP_NA_dfs[[4]]
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]], SNP_NA_dfs[[4]]$NP_generate_positions)
}

Plot the results.

errors_df <- data.frame(sizes, errors)
times_df <- data.frame(sizes, proctimes)
p1 <- ggplot(errors_df, aes(x = sizes, y = errors)) + geom_point() + ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = sizes, y = proctimes)) + geom_point() + ylim(c(0, 0.5)) + ylab("process time") 
ggarrange(p1, p2, nrow = 1)

From these plots, it seems that window size = 40 is not a good choice for this imputation.

2.3. Check whether considering the correlation between SNPs can increase or decrease accuracy.

We compare windows selected without considering SNP correlations against windows selected based on SNP correlations. All other parameters are kept at their default values. We again compare the 5 different missing ratios from part 2.1.
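
As a hypothetical sketch of what correlation-based selection might look like (an assumption; the actual behavior of the select.corr option may differ), the predictors for an SNP column could be the columns most correlated with it rather than its physical neighbors:

## Hypothetical sketch of correlation-based predictor selection for column 25
## (assumption; the package's select.corr option may work differently)
j <- 25
snp_mat <- SNP_NA_dfs[[1]]$SNP_NA_df
cors <- abs(cor(snp_mat, snp_mat[, j], use = "pairwise.complete.obs"))
cors[j] <- NA                                  # exclude the target column itself
window_cols <- order(cors, decreasing = TRUE)[1:20]  # 20 most correlated SNPs

The actual comparison with the package function is run below.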

df_fills <- df_fills_corr <- list()
proctimes <- rep(NA, ratios_len)
for(i in 1:ratios_len) {
  ## Both calls are wrapped in braces so system.time() times them together
  timeprocess <- system.time({
      df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df)
      df_fills_corr[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df, select.corr = TRUE)
  })
  proctimes[i] <- timeprocess[1]  # user CPU time, in seconds
}

Calculate the errors.

errors <- error_corr <- rep(NA, ratios_len)
for(i in 1:ratios_len){
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]], SNP_NA_dfs[[i]]$NP_generate_positions)
  error_corr[i] <- classification_error(SNP_orig_sub, df_fills_corr[[i]], SNP_NA_dfs[[i]]$NP_generate_positions)
}

Plot the results.

errors_df <- data.frame(ratios, errors)
error_corr_df <- data.frame(ratios, error_corr)
p1 <- ggplot(errors_df, aes(x = ratios, y = errors)) + geom_point() + ylim(c(0, 1)) + ylab("error (windows without correlation)")
p2 <- ggplot(error_corr_df, aes(x = ratios, y = error_corr)) + geom_point() + ylim(c(0, 1)) + ylab("error (windows with correlation)")
ggarrange(p1, p2, nrow = 1)

The result shows that for our current data, considering the correlations when selecting windows did not reduce the error rate or increase the accuracy.

Conclusion

All these conclusions provide helpful information about our target dataset and help us decide the parameter settings when we impute the real missing values.


