knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(SNPFastImpute)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(nlme)
Some VCF files contain header (meta-information) lines. If that is the case, simply read the file (using the read.table function) starting from the first row below those headers. (For more information about the VCF format, see https://en.wikipedia.org/wiki/Variant_Call_Format.) Here we just load the previously read full_vcf data.
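As a minimal sketch of reading such a file (the file name "example.vcf" below is only a placeholder, not a file shipped with this package), note that read.table's default comment.char = "#" already skips both the "##" meta-information lines and the "#CHROM" column header line:

## Sketch only: "example.vcf" is a placeholder path.
## read.table() drops lines starting with "#" by default, i.e. the whole VCF header block.
raw_vcf <- read.table("example.vcf", sep = "\t", stringsAsFactors = FALSE)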
data("full_vcf")
dim(full_vcf)
full_vcf[1:5, c(1:7, 9:14)]
The original vcf file contains 52 columns (the sample columns start at column 10) and 19999 SNP positions (rows). The information we need is in column 3 (the SNP labels) and in the sample columns from column 10 onward. In the original data, each row is an SNP and each sample occupies one column.
Only the first part of each sample field, before the first ":", is used in our analyses. The two integers separated by "/" indicate the genotypes of the two alleles at that SNP position (0: wild type; 1: the most common mutation type; 2: the second most common mutation type; and so on). For now we only care whether an allele is wild type or not, so we change any number greater than 1 to 1. Each allele then takes only two values (0 for wild type, 1 for a mutation). This can be modified if we need to distinguish different mutation types in the future.
Instead of keeping two numbers for each SNP, we add the values of the two alleles together. So finally each SNP position takes one of 3 values (0: both alleles are wild type; 1: one allele is mutated; 2: both alleles are mutated). We store this information in a numeric matrix whose rows are the samples and whose columns are the SNP positions.
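To illustrate this coding (a minimal sketch only, not the internals of vcf2df(); the helper name geno_to_dosage exists only for this example), a single genotype field can be reduced to 0, 1 or 2 as follows:

## Minimal sketch of the genotype coding described above (not the actual vcf2df() internals).
geno_to_dosage <- function(field) {
  gt <- strsplit(field, ":", fixed = TRUE)[[1]][1]     # keep only the part before the first ":"
  alleles <- suppressWarnings(
    as.integer(strsplit(gt, "/", fixed = TRUE)[[1]])   # "." (missing call) becomes NA
  )
  alleles <- pmin(alleles, 1L)                         # collapse all mutation types to 1
  sum(alleles)                                         # 0, 1, 2, or NA for a missing genotype
}
geno_to_dosage("0/1:12,3:15")   # 1: one allele mutated
geno_to_dosage("2/2:8,0:8")     # 2: both alleles mutated
geno_to_dosage("./.:0,0:0")     # NA: missing genotype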
full_df <- vcf2df(full_vcf)
There will be a warning message about NAs, which comes from the missing SNP positions being converted to NA. This is expected: the NAs do not indicate a problem; they are exactly what we are going to impute with this package.
After processing, the data look like the following matrix (only showing the first 4 rows and 4 columns):
full_df[1:4, 1:4]
sub <- 100
There are `r dim(full_df)[1]` samples and `r dim(full_df)[2]` SNPs in the original dataset, and the missing ratio is `r sum(is.na(full_df))/length(full_df)`.
The full data matrix is too large, so we only use a subset (the first `r sub` SNPs) here to give a short example of how our package deals with the missing values in the matrix.

SNP_orig_sub <- full_df[, 1:sub]

The original NA ratio of this subset is `r sum(is.na(SNP_orig_sub))/length(SNP_orig_sub)`.
Several factors can introduce variation in our prediction, for example:

* The percentage of NAs in the matrix. This is determined by the sequencing results and is not controllable, but we will test how it affects the results.
* The window of surrounding SNPs used for model building. We can directly control the window size, and we can also decide which SNPs to include in the window, e.g., only the most highly correlated SNPs (a conceptual sketch of both options follows this list).
* The number of training rounds of the model. More training rounds give better results, but also increase the training time, so we need to find a balance.
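As a conceptual sketch only (the function name select_window and its behaviour are illustrative assumptions; Impute_GenoType_XGBoost_para() may choose its predictor SNPs differently), the two window-selection options could look like this:

## Conceptual sketch -- not the internals of Impute_GenoType_XGBoost_para().
## Pick the predictor SNPs (columns) used to impute a target SNP in column j.
select_window <- function(SNP_df, j, size = 10, use.corr = FALSE) {
  candidates <- setdiff(seq_len(ncol(SNP_df)), j)
  if (!use.corr) {
    ## positional window: the 'size' SNPs closest to the target by column position
    candidates[order(abs(candidates - j))][seq_len(min(size, length(candidates)))]
  } else {
    ## correlation-based window: the 'size' SNPs most correlated with the target
    r <- abs(cor(SNP_df[, j], SNP_df[, candidates], use = "pairwise.complete.obs"))
    candidates[order(r, decreasing = TRUE)][seq_len(min(size, length(candidates)))]
  }
}
## e.g. the 10 predictor columns for SNP 5, with and without using correlation
select_window(SNP_orig_sub, 5, size = 10)
select_window(SNP_orig_sub, 5, size = 10, use.corr = TRUE)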
We want to know how these factors (missing ratio, window, training rounds) affect the following properties:

* classification error: the proportion of imputed genotypes that are not correctly classified.
To calculate the true classification error, the best way would be to use an SNP matrix that contains no missing values, which I could not find; all of my current SNP files contain many missing values.

The second best way is to simulate an SNP matrix with no missing values. This is not an easy task, because we need to make sure that the correlations between SNPs and the frequencies of wild-type alleles reflect the real-life situation. I plan to do this in a next step, but currently I do not have a good way to simulate such data.

So what I do now is to introduce additional missing values into the original matrix until the total missing rate reaches a given value, and to record the introduced positions. Then I impute all missing values and obtain the filled data. The classification error is calculated only on the recorded introduced positions, comparing the filled data against the original data. Although not as good as the first two approaches, this should still capture how the factors above affect the classification error and running time.
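For reference, here is a minimal sketch of this error measure. It assumes the recorded positions are given as a two-column (row, column) index matrix, which is only an assumption; the package's classification_error() and the output of NA_Generator() may use a different representation.

## Minimal sketch of the classification error on the introduced NA positions.
## 'positions' is assumed to be a two-column (row, column) index matrix -- an assumption,
## not necessarily the format returned by NA_Generator().
classification_error_sketch <- function(orig_df, filled_df, positions) {
  ## proportion of introduced positions whose imputed value differs from the original
  mean(filled_df[positions] != orig_df[positions])
}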
We generate a list of datasets with missing ratios from 0.05 to 0.25. The achieved ratios of introduced NAs are printed below.
## Create the ratio sequence
ratios <- seq(0.05, 0.25, by = 0.05)
## Create a list to store the data with introduced NAs
SNP_NA_dfs <- list()
ratios_len <- length(ratios)
## Introduce NAs to reach the corresponding ratio
for (i in 1:ratios_len) {
  SNP_NA_dfs[[i]] <- NA_Generator(SNP_orig_sub, ratios[i])
  print(SNP_NA_dfs[[i]]$NA_percent_generate)
}
names(SNP_NA_dfs) <- paste("missing", as.character(ratios))
Now we calculate the classification errors and running times for the different missing ratios.
## Create an empty list to store the filled data
df_fills <- list()
## Create an empty vector to store the processing times
proctimes <- rep(NA, ratios_len)
## Fill each data matrix using the parallel version of the imputation function
for (i in 1:ratios_len) {
  timeprocess <- system.time(
    df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df)
  )
  proctimes[i] <- timeprocess[1]
}
Next we calculate the classification error for each ratio.
errors <- rep(NA, ratios_len)
for (i in 1:ratios_len) {
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]],
                                    SNP_NA_dfs[[i]]$NP_generate_positions)
}
Plot the results.
errors_df <- data.frame(ratios, errors)
times_df <- data.frame(ratios, proctimes)
p1 <- ggplot(errors_df, aes(x = ratios, y = errors)) +
  geom_point() +
  ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = ratios, y = proctimes)) +
  geom_point() +
  ylim(c(0, 0.5)) +
  ylab("process time")
ggarrange(p1, p2, nrow = 1)
The result shows that, for the current subset of data, increasing the missing ratio from 0.05 to 0.25 did not substantially affect the classification error or the processing time.
We use the 20% missing data to perform the following tests and choose 5 different window sizes: 10, 20, 30, 40 and 50.
sizes <- seq(10, 50, by = 10)
sizes_len <- length(sizes)
SNP_NA_df02 <- SNP_NA_dfs[[4]]$SNP_NA_df
Next, we use the imputation function with the different window sizes.
df_fills <- list()
proctimes <- rep(NA, sizes_len)
for (i in 1:sizes_len) {
  timeprocess <- system.time(
    df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_df02, size = sizes[i])
  )
  proctimes[i] <- timeprocess[1]
}
Then we calculate the classification errors.
errors <- rep(NA, sizes_len)
for (i in 1:sizes_len) {
  ## all imputations above used the 20% missing data, i.e. SNP_NA_dfs[[4]]
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]],
                                    SNP_NA_dfs[[4]]$NP_generate_positions)
}
Plot the results
errors_df <- data.frame(sizes, errors)
times_df <- data.frame(sizes, proctimes)
p1 <- ggplot(errors_df, aes(x = sizes, y = errors)) +
  geom_point() +
  ylim(c(0, 0.2))
p2 <- ggplot(times_df, aes(x = sizes, y = proctimes)) +
  geom_point() +
  ylim(c(0, 0.5)) +
  ylab("process time")
ggarrange(p1, p2, nrow = 1)
From these plots, it seems that a window size of 40 is not a good choice for this imputation.
We compare windows selected without considering SNP correlations against windows selected based on SNP correlations. All other parameters are kept at their default values, and we again compare the same 5 missing ratios as before.
df_fills <- df_fills_corr <- list()
proctimes <- rep(NA, ratios_len)
for (i in 1:ratios_len) {
  timeprocess <- system.time({
    df_fills[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df)
    df_fills_corr[[i]] <- Impute_GenoType_XGBoost_para(SNP_NA_dfs[[i]]$SNP_NA_df,
                                                       select.corr = TRUE)
  })
  proctimes[i] <- timeprocess[1]
}
Calculate errors
errors <- error_corr <- rep(NA, ratios_len)
for (i in 1:ratios_len) {
  errors[i] <- classification_error(SNP_orig_sub, df_fills[[i]],
                                    SNP_NA_dfs[[i]]$NP_generate_positions)
  error_corr[i] <- classification_error(SNP_orig_sub, df_fills_corr[[i]],
                                        SNP_NA_dfs[[i]]$NP_generate_positions)
}
Plot the results
errors_df <- data.frame(ratios, errors)
error_corr_df <- data.frame(ratios, error_corr)
p1 <- ggplot(errors_df, aes(x = ratios, y = errors)) +
  geom_point() +
  ylim(c(0, 1)) +
  ylab("error, window without correlation")
p2 <- ggplot(error_corr_df, aes(x = ratios, y = error_corr)) +
  geom_point() +
  ylim(c(0, 1)) +
  ylab("error, window with correlation")
ggarrange(p1, p2, nrow = 1)
The result shows that, for our current data, selecting the window based on SNP correlations did not reduce the error rate or improve the accuracy.
All these conclusions provide helpful information about our target dataset and help us decide on the parameter settings when we impute the real missing values.
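Finally, as a usage sketch (the parameter values below are only illustrative examples, not recommendations derived from the results above), imputing the remaining real missing values in the subset could look like this:

## Illustrative final call; the size and select.corr values are example choices only.
SNP_filled_sub <- Impute_GenoType_XGBoost_para(SNP_orig_sub, size = 10, select.corr = FALSE)
sum(is.na(SNP_filled_sub))   # ideally 0 after imputation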