create_hapmap_reference: Create an allele-reference file from HapMap data


create_hapmap_reference    R Documentation

Create an allele-reference file from HapMap data


This function creates the standard allele reference file, as used by QC_GWAS and match_alleles, from data publicly available at the website of the international HapMap project (see 'References').


create_hapmap_reference(dir = getwd(),
   download_hapmap = FALSE, download_subset,
   hapmap_files = list.files(path = dir, pattern = "freqs_chr"),
   filename = "allele_reference_HapMap",
   save_txt = TRUE, save_rdata = !save_txt,
   return_reference = FALSE)



dir

character string; the directory of the input and output files. Note that R uses the forward slash (/) where Windows uses the backslash (\).


download_hapmap

logical; if TRUE, the required allele-frequency files are downloaded from the HapMap website into dir and then turned into a reference. If FALSE, the files specified in hapmap_files are used.


download_subset

character string; indicates the population to download for creating the reference. Options are: ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI.


hapmap_files

character vector of the filenames of the HapMap frequency files to be included in the reference. The default includes all files in dir with the string "freqs_chr" in the filename. (This argument is only used when download_hapmap is FALSE.)


filename

character string; the name of the output file, without file extension.

save_txt, save_rdata

logical; should the reference be saved as a tab-delimited text file and/or an RData file? If saved as RData, the object name allele_ref_std is used for the reference table.


return_reference

logical; should the function return the reference as its output value?


The function removes SNPs with invalid alleles or with allele frequencies that do not add up to 1. It also removes all instances of duplicate SNP IDs. If such entries are encountered, a warning is printed in the R console and the entries are saved in a .txt file in the output directory.
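The three removal rules can be illustrated on a toy table. This is a minimal sketch in base R, not the package's actual implementation: the column names, the set of alleles treated as valid (A, C, G, T) and the numeric tolerance are all assumptions made for the illustration.

```r
# Toy frequency table. Column names are hypothetical, not those of QCGWAS.
freqs <- data.frame(
  SNP        = c("rs1", "rs2", "rs3", "rs4", "rs4"),
  ref_allele = c("A", "C", "N", "G", "G"),
  alt_allele = c("G", "T", "A", "T", "T"),
  ref_freq   = c(0.30, 0.60, 0.50, 0.25, 0.25),
  alt_freq   = c(0.70, 0.30, 0.50, 0.75, 0.75),
  stringsAsFactors = FALSE
)

# Assumption: only A, C, G and T count as valid alleles.
valid_alleles <- c("A", "C", "G", "T")
bad_allele <- !(freqs$ref_allele %in% valid_alleles &
                freqs$alt_allele %in% valid_alleles)

# Assumption: a small tolerance is used, because the frequencies
# are floating-point numbers.
bad_freq <- abs(freqs$ref_freq + freqs$alt_freq - 1) > 1e-6

# Flag every instance of a duplicated ID, not only the repeats.
dup_id <- freqs$SNP %in% freqs$SNP[duplicated(freqs$SNP)]

drop      <- bad_allele | bad_freq | dup_id
removed   <- freqs[drop, ]   # entries that would be reported
reference <- freqs[!drop, ]  # entries that would be kept
```

In this toy table only rs1 passes all three checks: rs2 has frequencies that sum to 0.9, rs3 has an allele outside the assumed valid set, and both rs4 rows are dropped because the ID occurs twice.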

Like QC_GWAS, create_hapmap_reference codes the X chromosome as 23, Y as 24, XY (not available on the HapMap website) as 25 and M as 26.
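The coding scheme above can be written as a small lookup. The helper recode_chr below is hypothetical and exists only for this illustration; it is not a function of the QCGWAS package.

```r
# Numeric codes for the non-autosomal chromosomes, as listed above.
chr_codes <- c(X = 23L, Y = 24L, XY = 25L, M = 26L)

# Hypothetical helper: convert chromosome labels to integer codes.
recode_chr <- function(chr) {
  chr <- as.character(chr)
  is_special <- chr %in% names(chr_codes)
  out <- integer(length(chr))
  out[is_special]  <- unname(chr_codes[chr[is_special]])
  out[!is_special] <- as.integer(chr[!is_special])
  out
}

chr_numeric <- recode_chr(c("1", "22", "X", "Y", "XY", "M"))
```

Autosomes keep their own number; only the four labels in chr_codes are translated.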

Both the .RData export and the function return store the alleles as factors rather than character strings.
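If character alleles are needed, the factor columns can be converted after loading. The sketch below uses a toy stand-in for the reference table with hypothetical column names; only the object name allele_ref_std is taken from this page.

```r
# Toy stand-in for the reference table, saved under the documented
# object name. Column names are hypothetical.
allele_ref_std <- data.frame(
  SNP    = c("rs1", "rs2"),
  allele = factor(c("A", "T")),
  stringsAsFactors = FALSE
)
rdata_file <- tempfile(fileext = ".RData")
save(allele_ref_std, file = rdata_file)

# Load into a separate environment so that no existing object
# in the workspace is overwritten.
env <- new.env()
load(rdata_file, envir = env)
ref <- env$allele_ref_std

# Convert every factor column to character.
is_fac <- vapply(ref, is.factor, logical(1))
ref[is_fac] <- lapply(ref[is_fac], as.character)
```

The same conversion applies to a table obtained with return_reference = TRUE.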


If return_reference is TRUE, the function returns the generated reference table. If FALSE, it returns an invisible NULL.


The required data are available at the website of the International HapMap project, under bulk data downloads > bulk data > frequencies.

The HapMap files downloaded by this function are subject to the HapMap terms and policies. See:

See Also



  # This command will download the CEU HapMap dataset and use
  # it to generate an allele-reference. Create a folder
  # "new_hapmap" to store the data and make sure there is
  # sufficient disk space and a reasonably fast internet
  # connection.

  ## Not run:
    new_hapmap <- create_hapmap_reference(dir = "C:/new_hapmap",
                                          download_hapmap = TRUE,
                                          download_subset = "CEU",
                                          filename = "new_hapmap",
                                          save_txt = TRUE,
                                          return_reference = TRUE)
  ## End(Not run)

QCGWAS documentation built on May 30, 2022, 5:05 p.m.