Read a VCF file and output the corresponding data frame.
vcf2df(vcf_df)
filename: The VCF file to read. Its contents are converted into the data frame needed to build the xgboost data structure.
The input VCF file comes directly from raw sequencing data. It contains (n + 9) columns (9 columns of information about each SNP, followed by the SNP values for the n samples) and p rows (SNP positions).
The data for the first sample start at the 10th column, so we first remove the first 9 columns.
For our imputation, we need the p SNPs as features (columns) and the n samples as rows, so we then transpose the data frame.
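The column removal and transpose described above can be sketched in base R as follows. This is a minimal illustration on a toy table, not the package's own implementation; the column names, sample names, and SNP IDs are made up for the example.

```r
# Toy VCF-like table: 9 fixed columns followed by one column per sample.
# All values here are illustrative assumptions, not real data.
vcf <- data.frame(
  CHROM = c("1", "1", "1"),
  POS = c(101L, 202L, 303L),
  ID = c("rs1", "rs2", "rs3"),
  REF = c("A", "C", "G"),
  ALT = c("G", "T", "A"),
  QUAL = ".", FILTER = "PASS", INFO = ".", FORMAT = "GT",
  sample1 = c("0/0", "0/1", "1/1"),
  sample2 = c("1/0", "./.", "0/0"),
  stringsAsFactors = FALSE
)

# Drop the 9 fixed columns so only the sample genotype columns remain.
genotypes <- vcf[, -(1:9), drop = FALSE]

# Transpose: the n samples become rows and the p SNPs become columns.
geno_t <- as.data.frame(t(genotypes), stringsAsFactors = FALSE)
colnames(geno_t) <- vcf$ID

dim(geno_t)  # 2 samples (rows) by 3 SNPs (columns)
```

Using `drop = FALSE` keeps the result a data frame even when the file contains a single sample.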
In the input dataset, each SNP position has two values, one per allele: 0 (wild type) or 1 (mutant type). The possible values are therefore "0/0", "0/1", "1/0", and "1/1". Some of the values might be missing.
We sum the two allele values at each position into a single value that represents the corresponding SNP type.
In the output data, each cell holds the corresponding SNP type, obtained by summing the two allele values (0 for wild type, 1 for mutant type): (1) 0: both alleles are wild type; (2) 1: one of the alleles is a mutation, the other is wild type; (3) 2: both alleles are mutations; (4) NA: the SNP type of at least one of the two alleles is missing, so the value for this position needs to be predicted.
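The summing rule can be sketched with a small helper. The name `encode_genotype` is hypothetical and not part of the package. The sketch also makes two assumptions that go beyond the text above: it accepts the phased separator "|" in addition to "/", and it treats any allele code other than 0 or 1 as missing.

```r
# Hypothetical helper (not the package's API): sum the two allele codes
# of each genotype string, returning NA when either allele is missing.
encode_genotype <- function(gt) {
  # Split on "/" (unphased) or "|" (phased; an assumption beyond the text).
  alleles <- strsplit(gt, "[/|]")
  vapply(alleles, function(a) {
    # Anything other than exactly two 0/1 alleles is treated as missing.
    if (length(a) != 2 || !all(a %in% c("0", "1"))) return(NA_integer_)
    sum(as.integer(a))
  }, integer(1))
}

encode_genotype(c("0/0", "0/1", "1/0", "1/1", "./.", "0/."))
# 0 1 1 2 NA NA
```

Applied column by column to a transposed genotype table, for example with `lapply`, this would yield an integer data frame in which the NA cells mark the positions to impute.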
A data frame from which the xgboost data structure is generated.