match_alleles: Check and correct alleles in GWAS result files

Description Usage Arguments Details Value Interpreting the output Note See Also Examples

View source: R/match_alleles.R


This function checks the reported alleles and allele frequencies in GWAS results data by comparing them to a reference table. It will also uniformize the dataset by switching all SNPs to the positive strand and flipping the alleles so that the effect allele matches the reference minor allele.


match_alleles(dataset, ref_set, HQ_subset,
              dataname = "dataset", ref_name = "reference",
       unmatched_data = !all(dataset$MARKER %in% ref_set$SNP),
              check_strand = FALSE,
              save_mismatches = TRUE, delete_mismatches = FALSE,
              delete_diffEAF = FALSE, threshold_diffEAF = 0.15,
              check_FRQ = TRUE, check_ambiguous = FALSE,
              plot_FRQ = FALSE, plot_intensity = FALSE, 
              plot_if_threshold = FALSE,
              threshold_r = 0.95,
              return_SNPs = FALSE, return_ref_values = FALSE,
              header_translations, header_reference,
              save_name = dataname, save_dir = getwd(),
              use_log = FALSE, log_SNPall = nrow(dataset))



table containing the allele data. dataset should always contain columns for the SNPID, the effect allele and the other allele. Strand and allele frequency may be required, depending upon the settings, while effect size is optional. match_alleles accepts non-standard column names, provided a translation table is specified in header_translations. The order of columns does not matter; nor does the presence of other columns.


table containing the reference data. ref_set should always contain columns for the SNPID, the minor allele and the major allele. Minor allele frequency is only required if check_FRQ or check_ambiguous are TRUE. The standard column-names are "SNP", "MINOR", "MAJOR" and "MAF". Non-standard column names are accepted if a translation table is specified in header_reference. All SNPs must be aligned to the positive strand.


an optional logical or numeric vector indicating the rows in dataset that contain high quality SNPs.

dataname, ref_name

character strings; the names of the dataset and reference, respectively. Used as identifiers in the output files.


logical; are there SNPs in the dataset that do not appear in the reference? This argument is currently redundant: the function will determine it automatically.


logical; should the function check for negative-strand SNPs? If FALSE, all SNPs are assumed to be on the positive strand.


logical; should mismatching entries be exported to a .txt file before they are corrected?


logical; should mismatching SNPs (that could not be corrected by strand-switching) have their effect allele set to missing?


logical; should SNPs that exceed the threshold_diffEAF have their effect allele set to missing?


numeric; the max. allowed difference between reported and reference allele frequency.


logical; should the function correlate the reported allele-frequency with that of the reference?


logical; should the function do separate frequency correlations and create separate plots for SNPs with a strand-independent allele-pair (i.e. an A/T or C/G configuration)?


logical; should a scatterplot of reported vs. reference allele-frequency be made?


logical; if TRUE, instead of a scatterplot an intensity plot is generated. This option is currently only partially implemented. Leave to FALSE for now.


logical; if TRUE, the scatterplot is only made when frequency correlation is below the threshold specified by threshold_r.


numeric; the correlation threshold value.


logical; should the return value include the relevant columns of dataset?


logical; should the return-value include the matching entries in ref_set?

header_translations, header_reference

translation tables for converting the column names of dataset and ref_set to standard names, respectively. See translate_header for more information.


character string; the filename, without extension, for the various output files.


character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses backslash (\).

use_log, log_SNPall

arguments used by QC_GWAS; redundant when match_alleles is used separately.


match_alleles is one of the more complicated functions of QCGWAS. However, what it does is quite simple:

The complexity stems from the fact that these three tasks have to be carried out together and often overlap. So the actual function schematic looks like this:


An object of class 'list' with the following components:

FRQ_cor, FRQ_cor_ambiguous, FRQ_cor_nonambi

Allele-frequency correlations for all, ambiguous, and non-ambiguous SNPs respectively.


Total number of SNPs in dataset

n_missing, n_missing_data, n_missing_ref

Number of SNPs with missing allele-data in either dataset or reference, dataset only, and reference only, respectively.

n_negative_strand, n_negative_switch, n_negative_mismatch

Number of negative-strand SNPs, the subset of negative-strand SNPs that were strand-switched twice because they did not match the reference, and the subset of double-switched SNPs that were still mismatching after the second strand-switch.

n_strandswitch, n_mismatch

Number of SNPs that was strand-switched because they did not match the reference, and the subset of those that still did not match after the strand-switch.


Number of SNPs whose alleles were flipped to align them with the reference.

n_ambiguous, n_suspect

Number of ambiguous SNPs, and the subset of those that had a large allele-frequency aberration.


Number of SNPs whose allele-frequency differs from the reference by more than threshold_diffEAF.


When return_SNPs and/or return_ref_values is TRUE, this returns the column of dataset containing the SNP IDs. If not, this returns NULL.


If return_SNPs is TRUE, these elements return the corrected columns of data-set. If FALSE, these return NULL. Note: match_alleles only returns those columns that were checked; if check_FRQ is FALSE, EFF_ALL_FREQ return NULL. The same goes for check_strand and STAND. EFFECT is only returned if present in dataset.

ref_MINOR, ref_MAJOR, ref_MAF

If return_ref_values is TRUE, these elements return the reference minor and major alleles and allele frequency column for the SNPs in MARKER.If FALSE, these return NULL. ref_MAF is only returned when check_FRQ is TRUE.

Interpreting the output

The output of match_alleles may seem a bit overwhelming at first, so here is a short explanation of what it means and what you should pay attention to.

The columns included in the return value when return_SNPs is TRUE are the post-matching dataset. This is only relevant if you want to continue working with the corrected dataset. Similarly, the output of return_ref_values is only used for comparing the post-matching dataset to the reference.

n_missing, n_missing_data and n_missing_ref report the prevalence of missing allele data, but are otherwise irrelevant.

The majority of return values serve to check whether strand-switching was performed correctly. n_strandswitch indicates how many SNPs were converted to the other strand because of a mismatch with the reference. In our experience, many cohorts do not include stand data, or simply set all SNPs to "+", so the presence of strand-switched SNPs isn't an indicator of problems by itself. However, if the strand-switching did not fix the mismatch, there may a problem. The subset of strand-switched SNPs that could not be fixed is reported as n_mismatch, and indicates incorrect allele data or, possibly, triallelic SNPs.

Depending on the argument save_mismatches, mismatching entries are exported as a .txt file, together with the reference data. This allows the user to see which SNPs are affected.

Another sign of trouble is when negative-strand SNPs (n_negative_strand) are present (i.e. the cohort included real strand data, rather just setting it to "+"), but strand-switching still occurred. Negative-strand SNPs are converted to the positive strand before their alleles are compared to the reference, so they should not appear here. If they do, it means that either the strand-column data is incorrect, or it is an ordinary mismatch (see above).

Negative-strand SNPs that are "strand-switched" will revert to their original allele configuration (but the strand column now reports them as being on positive strand). The output of QC_GWAS calls them double strand-switches, but here they are reported as n_negative_switch. The subset of those that could not be fixed is n_negative_mismatch.

Just to be clear: the relevant output is still n_strandswitch and n_negative_strand, not n_negative_switch and n_negative_mismatch. Whether it was the negative-strand SNPs that were switched or not is not important: the important thing is that there were negative-strand SNPs (i.e. the cohort included real strand data rather setting everything to "+"); yet strand-switches were still necessary and cannot be attributed to mismatch (i.e. the strand data is incorrect).

n_flipped counts how many SNPs had their alleles reversed to match the effect allele with the reference minor-allele. This is merely recorded for "administrative" purposes, and shouldn't concern the user.

n_ambiguous and n_suspect are another test of the strand information. Ambiguous SNPs are SNPs that have the same allele pair on the positive and negative strands (i.e. A/T or C/G). Matching them with the allele-reference therefor won't detected incorrect strand-information. In a normal-sized GWAS results file, about 15% of SNPs will be ambiguous.

Suspect SNPs are the subset of ambiguous SNPs whose allele frequency is significantly different from that in the reference ( < 0.35. vs > 0.65 or visa versa). In our experience, a GWAS results file with 2.5M SNPs will have only a few dozen suspect SNPs. However, if it's a sizable proportion of all ambiguous SNPs, it indicates that the ambiguous SNPs are listed for the wrong strand. This will also have resulted in the wrong SNPs being flipped in the previous step, so it should be visible in the allele-frequency correlation test as well.

n_diffEAF counts SNPs with significantly different allele-frequencies. A large number here indicates either that the allele-frequencies are incorrect or listed for the wrong allele (see below), or that the population used in the dataset does not match that of the reference.

The FRQ_cor value is the correlation between the reported and reference allele-frequencies. If allele frequency is correct, the correlation should be near 1. If it's close to -1, the listed frequency is that of the other (i.e. non-effect) allele.

the FRQ_cor_ambiguous and FRQ_cor_nonambi values are the same test for the subsets of ambiguous and non-ambiguous SNPs. If ambiguous SNPs are listed on the wrong strand, then they will have been flipped incorrectly, so allele-frequency correlation should also move towards -1.


The function does not delete SNPs, regardless of the delete_mismatches delete_diffEAF arguments. Setting these to TRUE means that any such SNPs are marked by having their effect allele set to NA. The actual deletion takes place inside QC_GWAS.

See Also

create_hapmap_reference for creating an allele reference from publicly-available HapMap data



# In order to keep the QCGWAS package small, no allele reference
# is included. Follow the example of the create_hapmap_reference
# function to make one.

#  data("gwa_sample")
#  hapmap_ref <- read.table("C:/new_hapmap/new_hapmap.txt",
#                  header = TRUE, stringsAsFactors = FALSE)

#  match_alleles(gwa_sample, hapmap_ref,
#    dataname = "sample data", ref_name = "HapMap",
#    save_name = "test_allele1", save_dir = "C:/new_hapmap",
#    check_strand = TRUE, plot_FRQ = TRUE)

# HQ_SNPs <- HQ_filter(data = gwa_sample, filter_NA = TRUE,
#                          filter_FRQ = 0.01, filter_cal = 0.95)
#  match_alleles(gwa_sample, hapmap_ref,
#    HQ_subset = HQ_SNPs,
#    dataname = "sample data", ref_name = "HapMap",
#    save_name = "test_allele2", save_dir = "C:/new_hapmap",
#    check_strand = TRUE, plot_FRQ = TRUE)

#  match_output <-
#    match_alleles(gwa_sample, hapmap_ref,
#      HQ_subset = HQ_SNPs,
#      delete_mismatches = TRUE, return_SNPs = TRUE,
#      delete_diffEAF = TRUE, threshold_diffEAF = 0.15,
#      dataname = "sample data", ref_name = "HapMap",
#      save_name = "test_allele3", save_dir = "C:/new_hapmap",
#      check_strand = TRUE, plot_FRQ = TRUE)
#  if(sum(match_output$n_negative_strand,
#         match_output$n_strandswitch, match_output$n_mismatch,
#         match_output$n_flipped, match_output$n_diffEAF) > 0){
#    gwa_sample$EFFECT_ALL   <- match_output$EFFECT_ALL
#    gwa_sample$OTHER_ALL    <- match_output$OTHER_ALL
#    gwa_sample$STRAND       <- match_output$STRAND
#    gwa_sample$EFFECT       <- match_output$EFFECT
#    gwa_sample$EFF_ALL_FREQ <- match_output$EFF_ALL_FREQ
#  }

Example output

QCGWAS library, version 1.0-8

A quick start guide (and other documentation)
can be found in the 'R/library/QCGWAS/doc' folder

QCGWAS documentation built on May 2, 2019, 3:19 p.m.