knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(plinkr)
# clear_plinkr_cache()
# These variables are used in the text, also when # not good to go. Here the default values are set n_snps <- "[unknown]" n_individuals <- "[unknown]"
Here we follow the tutorial at http://zzz.bwh.harvard.edu/plink/tutorial.shtml. The text was literally copied initially.
if (!is_plink_installed()) { message("PLINK is not installed") message("Tip: use 'plinkr::install_plinks()'") }
if (!is_plink_tutorial_data_installed()) { message("PLINK tutorial data is not installed") message("Tip: use 'plinkr::install_plink_tutorial_data()'") }
good_to_go <- FALSE if (is_plink_installed() && is_plink_tutorial_data_installed()) { good_to_go <- TRUE }
These are the tutorial data files:
if (good_to_go) { tutorial_data_filenames <- get_plink_tutorial_data_filenames() tutorial_data_filenames }
Here we take a peek at those files, starting with hapmap1.map
:
if (good_to_go) { hapmap1_map_filename <- stringr::str_subset( tutorial_data_filenames, "hapmap1.map" ) map_table <- plinkr::read_plink_map_file(hapmap1_map_filename) n_snps <- nrow(map_table) knitr::kable(utils::head(map_table)) }
From the genetic mapping table, the number of rows equals the number of SNPs, which is r n_snps
SNPs.
Now hapmap1.ped
, of which we'll only show the first 10 columns:
if (good_to_go) { # Cannot do so until stringi::stri_split is fixed if (1 == 2) { hapmap1_ped_filename <- stringr::str_subset( tutorial_data_filenames, "hapmap1.ped" ) ped_table <- plinkr::read_plink_ped_file(hapmap1_ped_filename) n_individuals <- nrow(ped_table) knitr::kable(ped_table[1:5, 1:10]) } }
The number of rows of the pedigree table equals the number of individuals, which equals r n_individuals
. The number of columns equals 6 plus twice the number of SNPs.
if (good_to_go) { # Cannot do so until stringi::stri_split is fixed if (1 == 2) { testthat::expect_equal(ncol(ped_table), 6 + (2 * n_snps)) } }
Now pop.phe
:
if (good_to_go) { pop_phe_filename <- stringr::str_subset( tutorial_data_filenames, "pop.phe" ) knitr::kable( utils::head(plinkr::read_plink_phe_file(pop_phe_filename)) ) }
Now qt.phe
:
if (good_to_go) { qt_phe_filename <- stringr::str_subset( tutorial_data_filenames, "qt.phe" ) knitr::kable( utils::head(plinkr::read_plink_phe_file(qt_phe_filename)) ) }
Two phenotypes were generated: a quantitative triat and a disease trait (affection status, coded 1=unaffected, 2=affected), based on a median split of the quantitative trait. The quantitative trait was generated as a function of three simple components:
Remember, this model is not intended to be realistic. The following contingency table shows the joint distribution of disease and subpopulation:
t_disease <- tibble::tribble( ~disease , ~chinese, ~japanese, # nolint "control", 34 , 11, # nolint "case" , 11 , 33 # nolint ) knitr::kable(t_disease)
which shows a strong relationship between these two variables. The next table shows the association between the variant rs2222162 and disease:
t_variant <- tibble::tribble( ~disease , ~g11, ~g12, ~g22, # nolint "control", 17 , 22 , 6 , # nolint "case" , 3 , 19 , 22 # nolint ) knitr::kable(t_variant)
Again, the strong association is clear. Note that the alleles have been recoded as 1 and 2 (this is not necessary for PLINK to work, however -- it can accept any coding for SNPs).
In summary, we have a single causal variant that is associated with disease. Complicating factors are that this variant is one of 83534 SNPs and also that there might be some degree of confounding of the SNP-disease associations due to the subpopulation-disease association -- i.e. a possibility that population stratification effects will exist. Even though we might expect the two subpopulations to be fairly similar from an overall genetic perspective, and even though the sample size is small, this might still lead to an increase in false positive rates if not controlled for.
We will use the affection status variable as the default variable for analysis (i.e. the sixth column in the PED file). The quantitative trait is in a separate alternate phenotype file, qt.phe. The file pop.phe contains a dummy phenotype that is coded 1 for Chinese individuals and 2 for Japanese individuals. We will use this in investigating between-population differences. You can view these alternate phenotype files in any text editor.
In this tutorial dataset we focus on autosomal SNPs for simplicity, although PLINK does provide support for X and Y chromosome SNPs for a number of analyses. See the main documentation for further information.
This tutorial is intended to introduce some of PLINK's features rather than provide exhaustive coverage of them. Futhermore, it is not intended as an analysis plan for whole genome data, or to represent anything close to 'best practice'.
Just typing plink and specifying a file with no further options is a good way to check that the file is intact, and to get some basic summary statistics about the file.
if (good_to_go) { hapmap1_base_filename <- tools::file_path_sans_ext(hapmap1_map_filename) # This only works up until approx PLINK v1.7 run_plink( args = c("--file", hapmap1_base_filename, "--noweb"), plink_options = create_plink_v1_7_options() ) }
The information contained here can be summarized as follows:
A banner showing copyright information and the version number
Skipping web check... [ --noweb ]
denotes that checking for a
PLINK update is disabled. This test is disabled, as -as of year 2021-
this causes the system to freeze
A message indicating that the log file will be saved in plink.log. The name of the output file can be changed with the --out option -- e.g. specifying --out anal1 will generate a log file called anal1.log instead.
A list of the command options specified is given next: in this case it is only a single option, --file hapmap1. By keeping track of log files, and naming each analysis with its own --out name, it makes it easier to keep track of when and how the different output files were generated.
Next is some information on the number of markers and individuals read from the MAP and PED file. In total, just over 80,000 SNPs were read in from the MAP file. It is written "...83534 (of 83534)..." because some SNPs might be excluded (by making the physical position a negative number in the MAP file), in which case the first number would indicate how many SNPs are included. In this case, all SNPs are read in from the PED file. We also see that 89 individuals were read in from the PED file, and that all these individuals had valid phenotype information.
Next, PLINK tells us that the phenotype is an affection status variable, as opposed to a quantitative trait, and lets us know what the missing values are.
The next stage is the filtering stage -- individuals and/or SNPs are removed on the basis of thresholds. Please see this page for more information on setting thresholds. In this case we see that no individuals were removed, but almost 20,000 SNPs were removed, based on missingness (859) and frequency (16994). This particularly high proportion of removed SNPs is based on the fact that these are random HapMap SNPs in the Chinese and Japanese samples, rather than pre-selected markers on a whole-genome association product: there will be many more rare and monomorphic markers here than one would normally expect.
Finally, a line is given that indicates when this analysis finished. You can see that it took 8 seconds (on my machine at least) to read in the file and apply the filters.
If other analyses had been requested, then the other output files generated would have been indicated in the log file. All output files that PLINK generates have the same format: root.extension where root is, by default, "plink" but can be changed with the --out option, and the extension will depend on the type of output file it is (a complete list of extensions is given here).
The first thing we will do is to make a binary PED file. This more compact representation of the data saves space and speeds up subsequent analysis.
To make a binary PED file using plinkr
, use make_bed
:
# Work in a temporary folder plinkr_folder <- get_plinkr_tempfilename() base_binary_filenames <- file.path(plinkr_folder, "hapmap1") if (good_to_go) { binary_filenames <- make_bed( base_input_filename = hapmap1_base_filename, base_output_filename = base_binary_filenames ) list.files(dirname(base_binary_filenames)) }
Four files are created with this command:
hapmap1.bed
: the binary file that contains the raw genotype datahapmap1.bim
: a revised map file, which contains two extra columns that
give the allele names for each SNPhapmap1.fam
: which is just the first six columns of hapmap1.ped
hapmap1.log
: the log fileif (good_to_go) { bim_filename <- stringr::str_subset( binary_filenames, "hapmap1.bim" ) bim_table <- plinkr::read_plink_bim_file(bim_filename = bim_filename) knitr::kable(utils::head(bim_table)) }
if (good_to_go) { fam_filename <- stringr::str_subset( binary_filenames, "hapmap1.fam" ) fam_table <- plinkr::read_plink_fam_file(fam_filename = fam_filename) knitr::kable(utils::head(fam_table)) }
if (good_to_go) { bed_filename <- stringr::str_subset( binary_filenames, "hapmap1.bed" ) bed_table <- read_plink_bed_file( bed_filename = bed_filename, names_loci = bim_table$id, names_ind = fam_table$fam ) knitr::kable(bed_table[1:5, 1:10]) }
if (good_to_go) { log_filename <- stringr::str_subset( binary_filenames, "hapmap1.log" ) log_text <- read_plink_log_file( log_filename = log_filename ) log_text }
You can view the .bim
, .fam
and .log
files directly, as these
are plain-text. The .bed
file, however, is in a binary and computer-friendly
format, use plink::read_plink_bed_file
to read that file.
To specify that the input data are in binary format, as opposed to the normal text PED/MAP format, just use the --bfile option instead of --file. To repeat the first command we ran (which just loads the data and prints some basic summary statistics):
if (good_to_go) { plinkr::run_plink( args = c("--bfile", base_binary_filenames, "--noweb"), plink_options = create_plink_v1_7_options() ) }
The things to note here:
That three files hapmap1.bim, hapmap1.fam and hapmap1.bed were loaded instead of the usual two files. That is, hapmap1.ped and hapmap1.map are not used in this analysis, and could in fact be deleted now.
The data are loaded in much more quickly -- based on the timestamp at the beginning and end of the log output, this took 2 seconds instead of 10.
Next, we shall generate some simple summary statistics on rates of missing data in the file, using the --missing option:
base_miss_stat_filenames <- file.path(plinkr_folder, "miss_stat") if (good_to_go) { missing_result <- missing_from_bfile( bfile = base_binary_filenames, out = base_miss_stat_filenames ) }
which should generate the following output:
if (good_to_go) { missing_result$log }
Here we see that no individuals were removed for low genotypes (MIND > 0.1 implies that we accept people with less than 10 percent missingness).
The per individual and per SNP (after excluding individuals on the basis of low genotyping) rates are then output to the files miss_stat.imiss
and miss_stat.lmiss
respectively.
If we had not specified an --out
option, the root output filename would have defaulted to "plink".
These output files are standard, plain text files that can be viewed in any text editor, pager, spreadsheet or statistics package.
Taking a look at the file miss_stat.lmiss
:
if (good_to_go) { knitr::kable(utils::head(missing_result$lmiss_table)) }
We see, that is, for each SNP, we see the number of missing individuals (N_MISS) and the proportion of individuals missing (F_MISS).
if (good_to_go) { ggplot2::ggplot( missing_result$lmiss_table, ggplot2::aes(x = "all", y = F_MISS) ) + ggplot2::geom_boxplot() + ggplot2::ggtitle("SNPs' genotyping rates") }
Similarly, miss_stat.imiss
:
if (good_to_go) { knitr::kable(utils::head(missing_result$imiss_table)) }
The final column, F_MISS
is the actual genotyping rate for that individual.
As values are close to zero, this genotyping rate is high here.
if (good_to_go) { ggplot2::ggplot( missing_result$imiss_table, ggplot2::aes(x = "", y = F_MISS) ) + ggplot2::geom_boxplot() + ggplot2::ggtitle("Individuals' genotyping rates") }
Next we perform a similar analysis, except requesting allele frequencies instead of genotyping rates. The following command generates a file called freq_stat.frq which contains the minor allele frequency and allele codes for each SNP.
base_frq_filenames <- file.path(plinkr_folder, "frq") if (good_to_go) { freq_results <- freq_from_bfile( bfile = base_binary_filenames, out = base_frq_filenames ) knitr::kable(utils::head(freq_results$frq_table)) }
It is also possible to perform this frequency analysis (and the missingness analysis) stratified by a categorical, cluster variable. In this case, we shall use the file that indicates whether the individual is from the Chinese or the Japanese sample, pop.phe. This cluster file contains three columns; each row is an individual. The format is described more fully in the main documentation.
To perform a stratified analysis, use the --within
option.
The output will now indicate that a file called freq_stat.frq.strat
has been generated instead of freq_stat.frq
. If we view this file,
we see each row is now the allele frequency for each SNP stratified by subpopulation:
base_freq_within_filenames <- file.path(plinkr_folder, "frq_within") if (good_to_go) { freq_within_results <- freq_from_bfile_within_phe_file( bfile = base_binary_filenames, phe_filename = pop_phe_filename, out = base_freq_within_filenames ) knitr::kable(utils::head(freq_within_results$frq_strat_table)) }
Here we see that each SNP is represented twice - the CLST column indicates whether the frequency is from the Chinese or Japanese populations, coded as per the pop.phe file.
If you were just interested in a specific SNP, and wanted to know what the frequency was in the two populations, you can use the --snp option to select this SNP:
plink --bfile hapmap1 --snp rs1891905 --freq --within pop.phe --out snp1_frq_stat
would generate a file snp1_frq_stat.frq.strat containing only the population-specific frequencies for this single SNP. You can also specify a range of SNPs by adding the --window kb option or using the options --from and --to, following each with a different SNP (they must be in the correct order and be on the same chromosome). Follow this link for more details.
Let's now perform a basic association analysis on the disease trait for all single SNPs:
base_as1_filenames <- file.path(plinkr_folder, "as1") if (good_to_go) { assoc_results <- assoc_from_bfile( bfile = base_binary_filenames, out = base_as1_filenames ) knitr::kable(utils::head(assoc_results$assoc_table)) }
Each row is a single SNP association result. The fields are:
If a test is not defined (for example, if the variant is monomorphic but was not excluded by the filters), then values of NA for 'not applicable' will be given (as these are read by the package R to indicate missing data, which is convenient if using R to analyse the set of results).
To get the most significant results, we sort the association results based on their low p-value:
if (good_to_go) { knitr::kable(utils::head(dplyr::arrange(assoc_results$assoc_table, P))) }
Here we see that the simulated disease variant rs2222162 is actually the second most significant SNP in the list, with a large difference in allele frequencies of 0.28 in cases versus 0.62 in controls. However, we also see that, just by chance, a second SNP on chromosome 13 shows a slightly higher test result, with coincidentally similar allele frequencies in cases and controls. (Whether this result is due to chance alone or perhaps represents some confounding due to the population structure in this sample, we will investigate below). This highlights the important point that when performing so many tests, particularly in a small sample, we often expect the distribution of true positive results to be virtually indistinguishable from the best false positive results. That our variant appears in the top ten list is reassuring however.
To get a sorted list of association results, that also includes a range of significance values that are adjusted for multiple testing, use the --adjust flag. This generates the file as2.assoc.adjust in addition to the basic as2.assoc output file. Using more, one can easily look at one's most significant associations:
base_as2_filenames <- file.path(plinkr_folder, "as2") if (good_to_go) { assoc_adjust_results <- assoc_adjust_from_bfile( bfile = base_binary_filenames, out = base_as2_filenames ) knitr::kable(utils::head(assoc_results$assoc_adjusted_table)) }
Here we see a pre-sorted list of association results.
The fields are as follows:
In this particular case, we see that no single variant is significant at the 0.05 level after genome-wide correction. Different correction measures have different properties which are beyond the scope of this tutorial to discuss: it is up to the investigator to decide which to use and how to interpret them.
When the --adjust command is used, the log file records the inflation factor calculated for the genomic control analysis, and the mean chi-squared statistic (that should be 1 under the null):
Genomic inflation factor (based on median chi-squared) is 1.18739 Mean chi-squared statistic is 1.14813
These values would actually suggest that although no very strong stratification exists, there is perhaps a hint of an increased false positive rate, as both values are greater than 1.00.
HINT The adjusted significance values that control for multiple testing are, by default, based on the unadjusted significance values. If the flag --gc is specified as well as --adjust then these adjusted values will be based on the genomic-control significance value instead.
In this particular instance, where we already know about the Chinese/Japanese subpopulations, it might be of interest to directly look at the inflation factor that results from having population membership as the phenotype in a case/control analysis, just to provide extra information about the sample. That is, running the command using the alternate phenotype option (i.e. replacing the disease phenotype with the one in pop.phe, which is actually subpopulation membership): plink --bfile hapmap1 --pheno pop.phe --assoc --adjust --out as3
we see that testing for frequency differences between Chinese and Japanese individuals, we do see some departure from the null distribution:
Genomic inflation factor (based on median chi-squared) is 1.72519 Mean chi-squared statistic is 1.58537
That is, the inflation factor of 1.7 represents the maximum possible inflation factor if the disease were perfectly correlated with subpopulation that could arise from the Chinese/Japanese split in the sample (this does not account for any possible within-subpopulation structure, of course, that might also increase SNP-disease false positive rates).
We will return to this issue below, when we consider using the whole genome data to detect stratification more directly.
# clear_plinkr_cache()
# check_empty_plinkr_folder()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.