vcf2df: Read a vcf file, output the corresponding dataframe.


View source: R/vcf2df.R

Description

Read a vcf file, output the corresponding dataframe.

Usage

vcf2df(vcf_df)

Arguments

vcf_df

The VCF data to read. It is converted into the dataframe that is later used to generate the xgboost dataframe.

Details

The input VCF file comes directly from raw sequencing data. It contains (n + 9) columns (9 columns of information about each SNP, followed by one column of SNP values for each of the n samples) and p rows (SNP positions).

The information for the first sample starts at the 10th column, so we first remove the first 9 columns.

For our imputation, we need the p SNPs as features (columns), n samples as rows, so we need to transpose the dataframe.
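The two reshaping steps above can be sketched on a toy table. This is only an illustration under assumptions, not the package's actual code: the column names, sample names (S1 to S3), SNP IDs, and genotype values are all made up.

```r
# Toy VCF-like table: 9 descriptive columns followed by 3 samples, 2 SNP rows.
# Every name and value here is invented for illustration.
toy_vcf <- data.frame(
  CHROM = c("1", "1"), POS = c(101L, 202L), ID = c("rs1", "rs2"),
  REF = c("A", "G"), ALT = c("T", "C"), QUAL = c(".", "."),
  FILTER = c("PASS", "PASS"), INFO = c(".", "."), FORMAT = c("GT", "GT"),
  S1 = c("0/0", "0/1"), S2 = c("1/1", "./."), S3 = c("1/0", "0/0"),
  stringsAsFactors = FALSE
)

# Step 1: drop the first 9 columns, keeping only the sample genotypes.
genotypes <- toy_vcf[, -(1:9), drop = FALSE]

# Step 2: transpose so that samples become rows and SNPs become columns.
# t() returns a character matrix, so convert it back to a data frame.
by_sample <- as.data.frame(t(genotypes), stringsAsFactors = FALSE)
colnames(by_sample) <- toy_vcf$ID

dim(by_sample)  # n = 3 samples (rows), p = 2 SNPs (columns)
```

Here the (n + 9) by p layout of the input becomes an n by p table, which is the orientation the imputation step expects.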

In the input dataset, each SNP position has 2 values, one per allele, indicating the SNP type: 0 (wild type) or 1 (mutant type). The possible values are therefore "0/0", "0/1", "1/0", and "1/1". Some of the values might be missing.

We sum up the two values at each position to one value to represent the corresponding SNP type.

In the output data, each cell is the corresponding SNP type: (1) 0: both alleles are wild type; (2) 1: one of the alleles is a mutation, the other is wild type; (3) 2: both alleles are mutations; (4) NA: the SNP type of at least one of the two alleles is missing. We need to predict the value for this position.
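As a rough sketch of the summing rule described above, the helper below converts one genotype string to a single number. The function name encode_genotype is hypothetical and is not part of the package. It also assumes that a missing allele appears as "." and that alleles may be separated by "/" or "|"; neither detail is stated in this page.

```r
# Hypothetical helper (not from the package): sum the two allele codes
# of one genotype string such as "0/1".
encode_genotype <- function(gt) {
  # Assumption: alleles are separated by "/" (or "|" for phased calls).
  alleles <- strsplit(gt, "[/|]")[[1]]
  # Anything other than two alleles coded "0" or "1" is treated as missing.
  if (length(alleles) != 2 || !all(alleles %in% c("0", "1"))) {
    return(NA_integer_)
  }
  sum(as.integer(alleles))
}

examples <- c("0/0", "0/1", "1/0", "1/1", "./.", "0/.")
encoded <- vapply(examples, encode_genotype, integer(1), USE.NAMES = FALSE)
encoded  # 0 1 1 2 NA NA
```

Because the two allele codes are simply added, "0/1" and "1/0" map to the same value, and any genotype with a missing allele becomes NA, marking a position to be imputed.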

Value

A dataframe with n rows (samples) and p columns (SNP positions), which is used to generate the xgboost data structure.

Examples

data(vcf_df)
output_df <- vcf2df(vcf_df)
## This dataset has 112 samples and 338 SNP positions.
## The original file has 121 columns and 338 rows.

## Output should be a dataset with 112 rows and 338 columns. 

GaoGN517/689_SNP_FastImpute documentation built on Jan. 2, 2020, 11:44 a.m.