knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(plinkr)
# clear_plinkr_cache()
# These variables are used in the text, also when
# not good to go. Here the default values are set
n_snps <- "[unknown]"
n_individuals <- "[unknown]"

Here we follow the tutorial at http://zzz.bwh.harvard.edu/plink/tutorial.shtml. The text was literally copied initially.

if (!is_plink_installed()) {
  message("PLINK is not installed")
  message("Tip: use 'plinkr::install_plinks()'")
}
if (!is_plink_tutorial_data_installed()) {
  message("PLINK tutorial data is not installed")
  message("Tip: use 'plinkr::install_plink_tutorial_data()'")
}
good_to_go <- FALSE
if (is_plink_installed() && is_plink_tutorial_data_installed()) {
  good_to_go <- TRUE
}

Data

These are the tutorial data files:

if (good_to_go) {
  tutorial_data_filenames <- get_plink_tutorial_data_filenames()
  tutorial_data_filenames
}

Here we take a peek at those files, starting with hapmap1.map:

if (good_to_go) {
  hapmap1_map_filename <- stringr::str_subset(
    tutorial_data_filenames, "hapmap1.map"
  )
  map_table <- plinkr::read_plink_map_file(hapmap1_map_filename)
  n_snps <- nrow(map_table)
  knitr::kable(utils::head(map_table))
}

From the genetic mapping table, the number of rows equals the number of SNPs, which is r n_snps SNPs.

Now hapmap1.ped, of which we'll only show the first 10 columns:

if (good_to_go) {
  # Cannot do so until stringi::stri_split is fixed
  if (1 == 2) {
    hapmap1_ped_filename <- stringr::str_subset(
      tutorial_data_filenames, "hapmap1.ped"
    )
    ped_table <- plinkr::read_plink_ped_file(hapmap1_ped_filename)
    n_individuals <- nrow(ped_table)
    knitr::kable(ped_table[1:5, 1:10])
  }
}

The number of rows of the pedigree table equals the number of individuals, which equals r n_individuals. The number of columns equals 6 plus twice the number of SNPs.

if (good_to_go) {
  # Cannot do so until stringi::stri_split is fixed
  if (1 == 2) {
    testthat::expect_equal(ncol(ped_table), 6 + (2 * n_snps))
  }
}

Now pop.phe:

if (good_to_go) {
  pop_phe_filename <- stringr::str_subset(
    tutorial_data_filenames, "pop.phe"
  )
  knitr::kable(
    utils::head(plinkr::read_plink_phe_file(pop_phe_filename))
  )
}

Now qt.phe:

if (good_to_go) {
  qt_phe_filename <- stringr::str_subset(
    tutorial_data_filenames, "qt.phe"
  )
  knitr::kable(
    utils::head(plinkr::read_plink_phe_file(qt_phe_filename))
  )
}

Phenotypes

Two phenotypes were generated: a quantitative triat and a disease trait (affection status, coded 1=unaffected, 2=affected), based on a median split of the quantitative trait. The quantitative trait was generated as a function of three simple components:

Remember, this model is not intended to be realistic. The following contingency table shows the joint distribution of disease and subpopulation:

t_disease <- tibble::tribble(
  ~disease , ~chinese, ~japanese, # nolint
  "control", 34      , 11,        # nolint
  "case"   , 11      , 33         # nolint
)
knitr::kable(t_disease)

which shows a strong relationship between these two variables. The next table shows the association between the variant rs2222162 and disease:

t_variant <- tibble::tribble(
  ~disease , ~g11, ~g12, ~g22, # nolint
  "control", 17  , 22  , 6   , # nolint
  "case"   , 3   , 19  , 22    # nolint
)
knitr::kable(t_variant)

Again, the strong association is clear. Note that the alleles have been recoded as 1 and 2 (this is not necessary for PLINK to work, however -- it can accept any coding for SNPs).

In summary, we have a single causal variant that is associated with disease. Complicating factors are that this variant is one of 83534 SNPs and also that there might be some degree of confounding of the SNP-disease associations due to the subpopulation-disease association -- i.e. a possibility that population stratification effects will exist. Even though we might expect the two subpopulations to be fairly similar from an overall genetic perspective, and even though the sample size is small, this might still lead to an increase in false positive rates if not controlled for.

We will use the affection status variable as the default variable for analysis (i.e. the sixth column in the PED file). The quantitative trait is in a separate alternate phenotype file, qt.phe. The file pop.phe contains a dummy phenotype that is coded 1 for Chinese individuals and 2 for Japanese individuals. We will use this in investigating between-population differences. You can view these alternate phenotype files in any text editor.

In this tutorial dataset we focus on autosomal SNPs for simplicity, although PLINK does provide support for X and Y chromosome SNPs for a number of analyses. See the main documentation for further information.

Using PLINK to analyse these data

This tutorial is intended to introduce some of PLINK's features rather than provide exhaustive coverage of them. Futhermore, it is not intended as an analysis plan for whole genome data, or to represent anything close to 'best practice'.

Getting started

Just typing plink and specifying a file with no further options is a good way to check that the file is intact, and to get some basic summary statistics about the file.

if (good_to_go) {
  hapmap1_base_filename <- tools::file_path_sans_ext(hapmap1_map_filename)
  # This only works up until approx PLINK v1.7
  run_plink(
    args = c("--file", hapmap1_base_filename, "--noweb"),
    plink_options = create_plink_v1_7_options()
  )
}

The information contained here can be summarized as follows:

If other analyses had been requested, then the other output files generated would have been indicated in the log file. All output files that PLINK generates have the same format: root.extension where root is, by default, "plink" but can be changed with the --out option, and the extension will depend on the type of output file it is (a complete list of extensions is given here).

Making a binary PED file

The first thing we will do is to make a binary PED file. This more compact representation of the data saves space and speeds up subsequent analysis.

To make a binary PED file using plinkr, use make_bed:

# Work in a temporary folder
plinkr_folder <- get_plinkr_tempfilename()
base_binary_filenames <- file.path(plinkr_folder, "hapmap1")

if (good_to_go) {
  binary_filenames <- make_bed(
    base_input_filename = hapmap1_base_filename,
    base_output_filename = base_binary_filenames
  )
  list.files(dirname(base_binary_filenames))
}

Four files are created with this command:

if (good_to_go) {
  bim_filename <- stringr::str_subset(
    binary_filenames, "hapmap1.bim"
  )
  bim_table <- plinkr::read_plink_bim_file(bim_filename = bim_filename)
  knitr::kable(utils::head(bim_table))
}
if (good_to_go) {
  fam_filename <- stringr::str_subset(
    binary_filenames, "hapmap1.fam"
  )
  fam_table <- plinkr::read_plink_fam_file(fam_filename = fam_filename)
  knitr::kable(utils::head(fam_table))
}
if (good_to_go) {
  bed_filename <- stringr::str_subset(
    binary_filenames, "hapmap1.bed"
  )
  bed_table <- read_plink_bed_file(
    bed_filename = bed_filename,
    names_loci = bim_table$id,
    names_ind = fam_table$fam
  )
  knitr::kable(bed_table[1:5, 1:10])
}
if (good_to_go) {
  log_filename <- stringr::str_subset(
    binary_filenames, "hapmap1.log"
  )
  log_text <- read_plink_log_file(
    log_filename = log_filename
  )
  log_text
}

You can view the .bim, .fam and .log files directly, as these are plain-text. The .bed file, however, is in a binary and computer-friendly format, use plink::read_plink_bed_file to read that file.

Working with the binary PED file

To specify that the input data are in binary format, as opposed to the normal text PED/MAP format, just use the --bfile option instead of --file. To repeat the first command we ran (which just loads the data and prints some basic summary statistics):

if (good_to_go) {
  plinkr::run_plink(
    args = c("--bfile", base_binary_filenames, "--noweb"),
    plink_options = create_plink_v1_7_options()
  )
}

The things to note here:

Summary statistics: missing rates

Next, we shall generate some simple summary statistics on rates of missing data in the file, using the --missing option:

base_miss_stat_filenames <- file.path(plinkr_folder, "miss_stat")

if (good_to_go) {
  missing_result <- missing_from_bfile(
    bfile = base_binary_filenames,
    out = base_miss_stat_filenames
  )
}

which should generate the following output:

if (good_to_go) {
  missing_result$log
}

Here we see that no individuals were removed for low genotypes (MIND > 0.1 implies that we accept people with less than 10 percent missingness).

The per individual and per SNP (after excluding individuals on the basis of low genotyping) rates are then output to the files miss_stat.imiss and miss_stat.lmiss respectively. If we had not specified an --out option, the root output filename would have defaulted to "plink".

These output files are standard, plain text files that can be viewed in any text editor, pager, spreadsheet or statistics package.

Taking a look at the file miss_stat.lmiss:

if (good_to_go) {
  knitr::kable(utils::head(missing_result$lmiss_table))
}

We see, that is, for each SNP, we see the number of missing individuals (N_MISS) and the proportion of individuals missing (F_MISS).

if (good_to_go) {
  ggplot2::ggplot(
    missing_result$lmiss_table,
    ggplot2::aes(x = "all", y = F_MISS)
  ) + ggplot2::geom_boxplot() +
    ggplot2::ggtitle("SNPs' genotyping rates")
}

Similarly, miss_stat.imiss:

if (good_to_go) {
  knitr::kable(utils::head(missing_result$imiss_table))
}

The final column, F_MISS is the actual genotyping rate for that individual. As values are close to zero, this genotyping rate is high here.

if (good_to_go) {
  ggplot2::ggplot(
    missing_result$imiss_table,
    ggplot2::aes(x = "", y = F_MISS)
  ) + ggplot2::geom_boxplot() +
    ggplot2::ggtitle("Individuals' genotyping rates")
}

Summary statistics: allele frequencies

Next we perform a similar analysis, except requesting allele frequencies instead of genotyping rates. The following command generates a file called freq_stat.frq which contains the minor allele frequency and allele codes for each SNP.

base_frq_filenames <- file.path(plinkr_folder, "frq")

if (good_to_go) {
  freq_results <- freq_from_bfile(
    bfile = base_binary_filenames,
    out = base_frq_filenames
  )
  knitr::kable(utils::head(freq_results$frq_table))
}

It is also possible to perform this frequency analysis (and the missingness analysis) stratified by a categorical, cluster variable. In this case, we shall use the file that indicates whether the individual is from the Chinese or the Japanese sample, pop.phe. This cluster file contains three columns; each row is an individual. The format is described more fully in the main documentation.

To perform a stratified analysis, use the --within option. The output will now indicate that a file called freq_stat.frq.strat has been generated instead of freq_stat.frq. If we view this file, we see each row is now the allele frequency for each SNP stratified by subpopulation:

base_freq_within_filenames <- file.path(plinkr_folder, "frq_within")

if (good_to_go) {
  freq_within_results <- freq_from_bfile_within_phe_file(
    bfile = base_binary_filenames,
    phe_filename = pop_phe_filename,
    out = base_freq_within_filenames
  )
  knitr::kable(utils::head(freq_within_results$frq_strat_table))
}

Here we see that each SNP is represented twice - the CLST column indicates whether the frequency is from the Chinese or Japanese populations, coded as per the pop.phe file.

If you were just interested in a specific SNP, and wanted to know what the frequency was in the two populations, you can use the --snp option to select this SNP:

plink --bfile hapmap1 --snp rs1891905 --freq --within pop.phe --out snp1_frq_stat

would generate a file snp1_frq_stat.frq.strat containing only the population-specific frequencies for this single SNP. You can also specify a range of SNPs by adding the --window kb option or using the options --from and --to, following each with a different SNP (they must be in the correct order and be on the same chromosome). Follow this link for more details.

Basic association analysis

Let's now perform a basic association analysis on the disease trait for all single SNPs:

base_as1_filenames <- file.path(plinkr_folder, "as1")

if (good_to_go) {
  assoc_results <- assoc_from_bfile(
    bfile = base_binary_filenames,
    out = base_as1_filenames
  )
  knitr::kable(utils::head(assoc_results$assoc_table))
}

Each row is a single SNP association result. The fields are:

If a test is not defined (for example, if the variant is monomorphic but was not excluded by the filters), then values of NA for 'not applicable' will be given (as these are read by the package R to indicate missing data, which is convenient if using R to analyse the set of results).

To get the most significant results, we sort the association results based on their low p-value:

if (good_to_go) {
  knitr::kable(utils::head(dplyr::arrange(assoc_results$assoc_table, P)))
}

Here we see that the simulated disease variant rs2222162 is actually the second most significant SNP in the list, with a large difference in allele frequencies of 0.28 in cases versus 0.62 in controls. However, we also see that, just by chance, a second SNP on chromosome 13 shows a slightly higher test result, with coincidentally similar allele frequencies in cases and controls. (Whether this result is due to chance alone or perhaps represents some confounding due to the population structure in this sample, we will investigate below). This highlights the important point that when performing so many tests, particularly in a small sample, we often expect the distribution of true positive results to be virtually indistinguishable from the best false positive results. That our variant appears in the top ten list is reassuring however.

To get a sorted list of association results, that also includes a range of significance values that are adjusted for multiple testing, use the --adjust flag. This generates the file as2.assoc.adjust in addition to the basic as2.assoc output file. Using more, one can easily look at one's most significant associations:

base_as2_filenames <- file.path(plinkr_folder, "as2")

if (good_to_go) {
  assoc_adjust_results <- assoc_adjust_from_bfile(
    bfile = base_binary_filenames,
    out = base_as2_filenames
  )
  knitr::kable(utils::head(assoc_results$assoc_adjusted_table))
}

Here we see a pre-sorted list of association results.

The fields are as follows:

In this particular case, we see that no single variant is significant at the 0.05 level after genome-wide correction. Different correction measures have different properties which are beyond the scope of this tutorial to discuss: it is up to the investigator to decide which to use and how to interpret them.

When the --adjust command is used, the log file records the inflation factor calculated for the genomic control analysis, and the mean chi-squared statistic (that should be 1 under the null):

 Genomic inflation factor (based on median chi-squared) is 1.18739
 Mean chi-squared statistic is 1.14813

These values would actually suggest that although no very strong stratification exists, there is perhaps a hint of an increased false positive rate, as both values are greater than 1.00.

HINT The adjusted significance values that control for multiple testing are, by default, based on the unadjusted significance values. If the flag --gc is specified as well as --adjust then these adjusted values will be based on the genomic-control significance value instead.

In this particular instance, where we already know about the Chinese/Japanese subpopulations, it might be of interest to directly look at the inflation factor that results from having population membership as the phenotype in a case/control analysis, just to provide extra information about the sample. That is, running the command using the alternate phenotype option (i.e. replacing the disease phenotype with the one in pop.phe, which is actually subpopulation membership): plink --bfile hapmap1 --pheno pop.phe --assoc --adjust --out as3

we see that testing for frequency differences between Chinese and Japanese individuals, we do see some departure from the null distribution:

 Genomic inflation factor (based on median chi-squared) is 1.72519
 Mean chi-squared statistic is 1.58537

That is, the inflation factor of 1.7 represents the maximum possible inflation factor if the disease were perfectly correlated with subpopulation that could arise from the Chinese/Japanese split in the sample (this does not account for any possible within-subpopulation structure, of course, that might also increase SNP-disease false positive rates).

We will return to this issue below, when we consider using the whole genome data to detect stratification more directly.

Genotypic association models

Stratification analysis

Association analysis, accounting for clusters

Quantitative trait association analysis

Extracting a SNP of interest

Cleanup

# clear_plinkr_cache()
# check_empty_plinkr_folder()


richelbilderbeek/plinkr documentation built on March 25, 2024, 3:18 p.m.