README.md

Introduction

rLCLAE is a R package for implementing LCLAE (Low Coverage Local Ancestry Estimation), a composite likelihood method for estimating local ancestry from low coverage genomic data developed by Wall et al. 2016 Molecular Ecology (https://doi.org/10.1111/mec.13684). This program incorporates genotype uncertainty into local ancestry assignment (e.g., the PL values in a vcf generated by GATK), which is a ubiquitous feature of low coverage genotype calls.

rLCLAE requires in a single vcf (or multiple vcfs split by chromsome) data from two ancestral populations and the admixed individuals for whom we want to estimate local ancestry. The program uses allele frequency differences between the two ancestral populations to estimate the likelihood that an admixed individual’s ancestry in a specific window of their genome is homozygous for either ancestral population or heterozygous. Importantly, rLCLAE takes into account that genotypes from low coverage sequencing are uncertain by considering the relative probability that a given genotype call is correct. Importantly, while genotype data for any single variant from low coverage data are relatively uninformative about local ancestry, genotype data aggregated across many contiguous sites can provide accurate ancestry estimates. Ancestry calls can be assembled into ancestry tracts by joining neighboring variants with the same inferred ancestry, with tract boundaries determined when contiguous variants change ancestry state.

Installation

Currently, rLCLAE cannot be installed from CRAN using:

install.packages("rLCLAE")

Therfore, you must install the development version of rLCLAE from GitHub using:

# install.packages("devtools")
devtools::install_github("lycium0605/rLCLAE")

Example

Below is the typical pipeline for transforming biallelic genomic data in vcf format to local ancestry calls. We assume that any quality control filtering of your variant calls (e.g., GATK’s recommended hard filters, minor allele frequency threshold, etc.) has already been done. Note that your vcf can contain multiple chromosomes or a single chromosome at this stage.

Before importing your genomic data into R, you must first edit your vcf (here, original.vcf) to remove all fields except for XXX (???it makes more sense to me say we removed fields because that’s what the annotate -x command does and to also tell the reader/user the specific field we want to keep???). We can do this using bcftools (http://www.htslib.org/doc/bcftools.html) with the following command: bcftools annotate -x ^INFO/DP,^FORMAT/GT,^FORMAT/AD,^FORMAT/DP,^FORMAT/GQ,^FORMAT/PL original.vcf>tmp.vcf

0. Getting started

## After installing rLCLAE, load the rLCLAE package
library(rLCLAE)

## Load and clean up the genotype data
## This function loads the tmp.vcf file into R. It also removes redundant information (???what does this mean? what exactly is it removing???) and replaces some types of characters (e.g. | or :) in the vcf for easier processing.
preprocess(inputdir = './tmp.vcf',outputdir = './clean.vcf')

## Checking the clean.vcf
## This function checks whether the clean.vcf is ready for rLCLAE analysis. It checks for unexpected characters in the clean.vcf which should only contain numbers, commas, spaces, or '/'. If this function outputs "Unexpected character, please double check your input", the clean.vcf file contains characters that must be removed. This function also outputs the number of individuals and a list of the chromosomes detected in the vcf file. Check to make sure that this information matches what you expect.
inputcheck(inputdir = './clean.vcf')

After the data is cleaned and checked, we will: 1) calculate the genotype likelihoods for all individuals in the vcf, 2) calculate the allele frequencies in both ancestral populations, 3) calculate the ancestral likelihood for admixed individuals, 4) estimate local ancestry, and 5) generate ancestry tracts.

1. Calculating genotype likelihood

In this step, the genotype likelihoods for all individuals in the vcf file is calculated based on the PL value. If multiple chromosomes are detected, this function will automatically split the data into multiple genotype likelihood files, one for each chromosome, and add the chromosome name to the end of each file name (e.g., genolk_chr1 for chromosome 1, genolk_chrX for chromosome X, etc.).

## Calculate the genotype likelihood based on the PL value
## The type parameter is set to 'dip' by default, as the function expects the data to be diploid (e.g., autosomal data). Type can also be set to 'hap' which tells the function  that the data is data (e.g. sex chromosome data for human males).
genolik(inputdir = './clean.vcf', outputdir = './genolik', type = 'dip')

## In a generated genolik file, each line corresponds to a snp. The first column corresponds to the chromosome name (e.g., chr1, chrX, etc.) while the second column corresponds to the snp position. In the subsequent columns, every three numbers represent the genotype likelihoods for one individual. (e.g., .81 .12 .07 represents an individual's genotype likelihoods for the homoyzgous reference allele, heterozygous reference-alternate alleles, and homozygous alternate allele, respectively; -1. -1. -1. indicates that the genotype likelihood data for an individual is missing).

2. Calculating ancestral population allele frequencies

In this step, the allele frequencies for each ancestral population are calculated based on the genotype likelihoods for ancestral individuals. Here, we must assign individuals to each ancestral population. To do so, create a text file for each ancestral population with a single line of integers separated by spaces. The first integer in the line is the total number of individuals in the ancestral population followed by the order of each individual in the vcf/genolik file (the order should be the same in both files). For example, in the file “anc1dip.txt”, we’ll specify that we have 5 total individuals in ancestral population 1 and they are the 23rd, 24th, 25th, 26th, and 28th individuals in the vcf/genolik files: 5 23 24 25 26 28 (???I’d also give them the option, and show an example here, how to generate this file in R???)(??? It makes total sense, however I am thinking changing the code a bit to avoid using the text file, and allow the user to create a numeric vector in R and pass that directly into the c code as parameter??? Shuyu, that sounds good!) If you have only one type of data, you can generate one ancestral allele frequency using the following code, it will add a suffix (_hap or _dip based on data type) to the output: If all three files (inputdir_dip, pop1_dip and pop2_dip) are provided, the output will have a suffix ’_dip’ and vice versa.(??? I’m confused by this sentence - if I only have diploid data, it will automatically detect that I have diploid data and add the suffix _dip? Or do I need to define pop1_dip and pop2_dip instead of pop1_hap and pop2_hap???)

## For diploid data 
ancfreq(outputdir = './ancfreq',inputdir_dip = './genolik_dip_chrX',
        pop1_dip = 'anc1dip.txt',pop2_dip = 'anc2dip.txt')

## For haploid data 
ancfreq(outputdir = './ancfreq',inputdir_hap = './genolik_hap_chrX',
        pop1_hap = 'anc1hap.txt',pop2_hap = 'anc2hap.txt')

However, if you have both diploid and haploid data (e.g., for the X chromosome, you have haploid data for males and diploid data for females), you can do either of the steps below:

## The easiest way is to input both the diploid and haploid data into the ancfreq function and three files (ancfreq_dip, ancfreq_hap, ancfreq_merge) will be generated
ancfreq(outputdir = './ancfreq',
        inputdir_hap = './genolik_hap_chrX',
        pop1_hap = 'ref1hap.txt',pop2_hap = 'ref2hap.txt',
        inputdir_dip = './genolik_dip_chrX',
        pop1_dip = 'ref1dip.txt',pop2_dip = 'ref2dip.txt',
        mergetype = "union")

## You can also use the mergefreq function after using ancfreq separately for both data types (lines 90-96) 
mergefreq(hap = './ancfreq_hap', dip = './ancfreq_dip', 
          merge = './ancfreq_merge', type = 'union')

## In a generated ancfreq file, there will be six columns, corresponding to the following: the snp position, the allele frequency of 'A' in ancestral populations 1 and 2, and the number of individuals (not chromosomes) used for the allele frequency calculation in ancestral populations 1 and 2. (???What do you mean by 'A' here? Is this the reference allele???)

mergetype has several options: union: Keep the snps that appear in either the diploid or haploid data. intersect: Keep the snps that appear in both the haploid and diploid data. full_hap: Keep all the snps that appear in the haploid data. full_dip: Keep all the snps that appear in the diploid data.

After the ancfreq file is generated, you will likely want to check the distribution of allele frequency differences between the two ancestral populations.

# Plot a histogram of allele frequency differences between the two ancestral populations and the mean/median/max/min of these data. You can also do additional analysis on diff if necessary. For example, you can check the distribution of the allele frequency differences and identify the 25/50/75 percentile values, which may help you with your decision on the cutoff used in the next step.
diff<-freqsum('./ancfreq_dip')
#(???What does the freqsum function actually do? Does it calculate the allele frequency differences AND plot a histogram and print the mean/median/max/min or does it just calculate the allele frequency differences and it's left up to the user to plot the distribution, get the mean, etc. themselves? If so, we should change the wording and say that this calculates the allele frequency differences which may be useful to plot or evaluate further for themselves if necessary???)

3. Calculating the ancestry likelihood

In this step, the ancestry likelihood is calculated for each snp for each individual. The default option performs this step for all individuals in the genolik file at once. A suffix of individual index will be added to each file (e.g., anclik_1 for individual 1).

## For all individuals
anclik(genodir = './genolik_chrX', ancfreqdir = './ancfreq_merge',
       outputdir = './anclik', type = 'dip', test = 'all')

## For a specific individual: first individual in the vcf/genolik file
anclik(genodir = './genolik_chrX', ancfreqdir = './ancfreq_merge',
       outputdir = './anclik', type = 'dip', test = 1) #???Just want to make sure my comment in line 146 is correct, i.e., this will only evaluate individual 1???
## For a subset of individuals (individuals 1, 2, 5, and 9 in the vcf/genolik file):
## sub<-c(1,2,5,9)
## test = sub
anclik(genodir = './genolik_chrX', ancfreqdir = './ancfreq_merge',
       outputdir = './anclik', type = 'dip', test = sub) #???Is this correct???(They should be correct, I make corresponding changes in the code and test all three types, may I ask what is the worry here more specifically? ASF: I was just confirming that my comment in line 149 was correctly interepreting what this function is doing)

4. Estimating local ancestry

In this step, the ancestry state at each snp for each individual is estimated based on majority rule. The function, anccall, will output a data frame where each row corresponds to a snp. The first column corresponds to the snp position and the second column corresponds to the ancestry state called for that snp where a value of 0 corresponds to homozygous ancestry for ancestral population 1, 1 corresponds to heterozygous ancestry, and 2 corresponds to homozygous ancestry for ancestral population 2.

# delta: the minimum ancestral allele frequency difference for snps which will be evaluated; e.g., delta=0.2 will only evaluate snps with >= 20% allele frequency difference between the ancestral populations
# window_size: the size (in base pairs) for the sliding window used in ancestry calling where the larger this value is, the shorter tracts will be smoothed out; e.g., window_size=50000 will use 50 kb sliding windows
anccall(delta = 0.2, window_size = 50000,
        inputdir = './anclik_1', outputdir = './anccall_1')

5. Generating ancestry tracts

In this step, neighboring snps with the same inferred ancestry states are assembled into ancestry tracts. The gettract command will generate two output files, one with a suffix of ‘.MajorityRule.txt’ and one with ‘.tracts.txt’. The former file contains eight columns. The ‘snp’ column corresponds to the snp position, the ‘call’ column corresponds to the original call at this snp, the ‘chrom’ and ‘indiv’ column describes the chromosome and the individual this call belongs to. The ‘n’ column corresponds to the number of calls within current sliding window, the ‘mode’ column corresponds to the call that is the majority in this window, and the ‘n_mode’ corresponds to the total number of this majority call, the ‘perc’ colmun corresponds to the percentage of this majority call.

gettract('./anccall_1', # input dir
         './tract_1', # output dir
         indiv_name = '1', # individual name, will add a column (???Is there a way to do it for all individuals in the sample and not each individual in turn???)
         chr='chrX', # chromosome name, will add a column
         chrlength='142711496', #chromosome length in bp
         value=35000, # window size
         mode_n=0.5, # the minimum percentage for a call to be made (0.5 = majority rule, <=0.33 = plurality rule)
         min_n=20, # the minimum number of SNPs within the window to make a call
         exclude=50000) # how much of the start/end of each chromosome to ignore if any

6. Wrapping up 3-5

To simplify the workflow, a wrapped up function is provided. To use this function, the user need to define the parameters as below, and eventually the intermediate files and final result including ancestral likelihood, local ancestry and ancestry tract will be output to the given folder.

ancgen(genodir = './geno', # input dir for genolik file
       ancfreqdir = './ancfreq', # input dir for ancfreq file
       outputdir = './ancestry_estimation_result', # output folder
       type = "dip", test = "all", #anclik parameter
       delta = 0.2 , window = 35000, #anccall parameter
       chrlength = '142711496', mode_n = 0.5, min_n = 20, exclude = 50000 # Get tract parameter
)


lycium0605/rLCLAE documentation built on Aug. 28, 2024, 11:35 p.m.