In Genentech/midasHLA: R package for immunogenomics data handling and association analysis

knitr::opts_chunk$set(echo = TRUE, fig.align="center")
devtools::load_all()

htmltools::img(
  src = knitr::image_uri("MiDAS_logo.png"),
  alt = 'logo',
  style = 'position:absolute; top:0; right:0; padding:10px; width:250px'
  )

Introduction

Welcome to MiDAS. This tutorial is supposed to help you get started with your analyses of immunogenetic associations. We will work with a simulated data set of 500 patients and 500 controls with a binary disease diagnosis.

We also have high resolution HLA alleles (4 - 6 digit), and presence/absence calls for KIR genes.

Data import and sanity check

First, let's import the phenotype data and HLA calls using MiDAS import functions. MiDAS will check for correct nomenclature of HLA. We can also define 4-digit resolution for HLA alleles as import format, which means that alleles with higher resolution will be reduced.

dat_pheno <-
  read.table(
  file = system.file("extdata", "MiDAS_tut_pheno.txt", package = "midasHLA"),
  header = TRUE
  )
dat_HLA <-
  readHlaCalls(
  file = system.file("extdata", "MiDAS_tut_HLA.txt", package = "midasHLA"),
  resolution = 4
  )

Let's take a look at the imported HLA data tables:

dat_HLA %>%
  head(20) %>%
  kable(caption = "HLA data as imported by MiDAS") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "300px")

Next, we want to check our HLA allele frequencies, and compare them to known frequencies from major populations. Here, we only include alleles with an allele frequency of 5% or higher in our study cohort. By default, MiDAS will output comparisons including the following populations, based on published data from allelefrequencies.net:

MiDAS comes with some pre-defined reference populations, but it is possible to customize these comparisons (see documentation).

freq_HLA <- getHlaFrequencies(hla_calls = dat_HLA, compare = TRUE) %>%
  filter(Freq > 0.01)

freq_HLA %>%
  head() %>%
  kable(caption = "HLA frequencies, compared to published references") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Let's assume our cohort is of predominantly European ancestry. We can plot the following comparison to see if allele frequencies in our data are distributed as expected, for example by visualizing common HLA-A allele frequencies in comparison with European, Chinese, and African American reference populations:

freq_HLA_long <- tidyr::gather(
  data = freq_HLA,
  key = "population",
  value = "freq",
  "Freq",
  "USA NMDP European Caucasian",
  "USA NMDP Chinese",
  "USA NMDP African American pop 2",
  factor_key = TRUE
) %>% 
  filter(grepl("^A", allele))

plot_HLAfreq <-
  ggplot2::ggplot(data = freq_HLA_long, ggplot2::aes(x = allele, y = freq, fill = population)) +
  ggplot2::geom_bar(
    stat = "identity",
    position = ggplot2::position_dodge(0.7),
    width = 0.7,
    colour = "black"
  ) +
  ggplot2::coord_flip() +
  ggplot2::scale_y_continuous(labels = formattable::percent)

plot_HLAfreq

HLA association analysis

Are classical HLA alleles associated with disease status?

The following function prepares our data for analysis, combining HLA and phenotypic data into one object. Here, we are interested in analyzing our data on the level of classical HLA alleles.

HLA <- prepareMiDAS(
  hla_calls = dat_HLA,
  colData = dat_pheno,
  experiment = "hla_alleles"
)

We can now test our HLA data for deviations from Hardy-Weinberg-Equilibrium (HWE) to filter out alleles that strongly deviate from HWE expectations (imputation or genotyping errors, ...). Here, let's remove alleles with a HWE p-value below 0.05 divided by the number of alleles tested / present in our data (N=447). We can create a filtered MiDAS object right away (as.MiDAS = TRUE), as done in this example, or output actual HWE test results.

HLA <- HWETest(
  object = HLA,
  experiment = "hla_alleles",
  HWE_cutoff = 0.05 / 447,
  as.MiDAS = TRUE
)

Now, we define our statistical model and run the analysis. Since we want to test for association with disease status, we use a logistic regression approach. The term is necessary as placeholder for the tested HLA alleles and needs to be included. It will become handy when for example defining more complex interaction models.

In the runMiDAS function, we then refer to this model, choose our analysis type and define a inheritance model. Here we use dominant model, meaning that individuals will be defined as non-carriers (0) vs. carriers (1) for a given allele. Alternatively, it is also possible to choose recessive (0 = non-carrier or heterozygous carrier, 1 = homozygous carrier), overdominant (assuming heterozygous (dis)advantage: 0 = non-carrier or homozygous carrier, 1 = heterozygous carrier), or additive (N of alleles) inheritance models. Moreover, we define allele inclusion criteria, such that we only consider alleles frequencies above 2% and below 98%. We also apply the Bonferroni method to not only get nominal P values, but also such adjusted for multiple testing. For alternative multiple testing correction methods, as well as the option to choose a custom number of tests, please refer to the documentation. exponentiate = TRUE means that the effect estimate will already be shown as odds ratio, since we use a logistic regression model.

HLA_model <- glm(disease ~ term, data = HLA, family = binomial())
HLA_results <- runMiDAS(
  object = HLA_model, 
  experiment = "hla_alleles", 
  inheritance_model = "dominant",
  lower_frequency_cutoff = 0.02, 
  upper_frequency_cutoff = 0.98, 
  correction = "bonferroni", 
  exponentiate = TRUE
)

kableResults(HLA_results)

Three HLA alleles show significant association with the disease after multiple testing adjustment. Due to the complex linkage disequilibrium structure in the MHC, it is likely that associations are not statistically independent. The two alleles HLA-DRB1*15:01 and HLA-DQB1*06:02 are a common class II haplotype. We can therefore test if there are associations that are statistically independent of our top-associated allele, by setting the conditional flag to TRUE. MiDAS will now perform stepwise conditional testing until the top associated allele does not reach a defined significance threshold (here th = 0.05, based on adjusted p value).

HLA_results_cond <- runMiDAS(
  object = HLA_model, 
  experiment = "hla_alleles", 
  inheritance_model = "dominant", 
  conditional = TRUE,
  lower_frequency_cutoff = 0.02, 
  upper_frequency_cutoff = 0.98, 
  correction = "bonferroni", 
  exponentiate = TRUE
)

kableResults(HLA_results_cond, scroll_box_height = "200px")

The results for conditional testing are displayed in a way that for each step the top associated allele is shown, along with a list of alleles conditioned on.

As we can see, HLA-DRB1*15:01 was not independently associated with the disease when correcting for our top-associated allele HLA-DQB1*06:02. However, HLA-B*57:01 can be considered an independent association signal.

HLA association fine-mapping on amino acid level

Next, we want to find out what are the strongest associated amino acid positions, corresponding to our allele-level associations. This can help fine-mapping the associated variants to e.g. the peptide binding region or other functionally distinct parts of the protein. We thus prepare a MiDAS object with experiment type "hla_aa", which includes the inference of amino acid variation from allele calls.

HLA_AA <- prepareMiDAS(
  hla_calls = dat_HLA,
  colData = dat_pheno,
  experiment = "hla_aa"
)

Amino acid data will be stored in a MiDAS object, but we can extract it to a data frame and select a couple of variables to display how this looks like:

dat_HLA_AA <- HLA_AA[["hla_aa"]] %>% 
  assay() %>% 
  t() %>% 
  as.data.frame() %>% 
  select(starts_with("B_97_")) %>% 
  head()

kable(dat_HLA_AA, caption = "HLA amino acid data as inferred by MiDAS") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Now, we run the association test based on amino acid variation. To first identify the most relevant associated amino acid positions, we run a likelihood ratio (omnibus) test, which groups all residues at each amino acid position.

HLA_AA_model <- glm(disease ~ term, data = HLA_AA, family = binomial())
HLA_AA_omnibus_results <- runMiDAS(
  HLA_AA_model,
  experiment = "hla_aa",
  inheritance_model = "dominant",
  conditional = FALSE,
  omnibus = TRUE,
  lower_frequency_cutoff = 0.02,
  upper_frequency_cutoff = 0.98,
  correction = "bonferroni"
)

kableResults(HLA_AA_omnibus_results)

Next, we can investigate how effect estimates are distributed for a given associated amino acid position, e.g. DQB1_9:

HLA_AA_DQB1_9_results <- runMiDAS(
  HLA_AA_model,
  experiment = "hla_aa",
  inheritance_model = "dominant",
  omnibus_groups_filter = "DQB1_9",
  lower_frequency_cutoff = 0.02,
  upper_frequency_cutoff = 0.98,
  correction = "bonferroni",
  exponentiate = TRUE
)

kableResults(HLA_AA_DQB1_9_results, scroll_box_height = "250px")

This shows us that individuals carrying a Phenylalanine (F) at position 9 of DQB1 have a significantly increased risk, whereas individuals carrying a Tyrosine (Y) at the same amino acid position have a decreased risk.

It is logical to hypothesize that the risk residue is found on HLA-DQB1*06:02, the previously associated HLA risk allele. MiDAS thus provides the function getAllelesforAA to map amino acid residues back to the respective HLA alleles.

HLA_AA_DQB1_9_alleles <- getAllelesForAA(HLA_AA,"DQB1_9")

kable(HLA_AA_DQB1_9_alleles, escape = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Finally, it is also interesting to note that there are several amino acid positions coming up that determine the Bw4 binding motif (e.g. B_81), which is a determinant for interactions of HLA class I alleles with KIR on Natural Killer cells.

Can we find evidence for a role of HLA variation related to NK cell interactions?

Let's assume outcome differences in our disease of interest have been shown to be related to NK cell biology. HLA class I alleles can be grouped according to how they interact with KIR, expressed on NK cells. We here now prepare a new MiDAS object, grouping HLA alleles into categories defining their interactions with KIRs (Bw4 vs. Bw6, C1 vs. C2). Then we test these variables for association with disease outcome:

NKlig <- prepareMiDAS(
  hla_calls = dat_HLA,
  colData = dat_pheno,
  experiment = c("hla_alleles", "hla_NK_ligands")
)

NKlig_model <- glm(disease ~ term, data = NKlig, family = binomial())
NKlig_results <- runMiDAS(
  NKlig_model,
  experiment = "hla_NK_ligands",
  inheritance_model = "dominant",
  correction = "bonferroni",
  exponentiate = TRUE
)

kableResults(NKlig_results)

We find that HLA-Bw6 and HLA-Bw4 carrier status are associated with decreased and increased disease risk, respectively. Of course, this is interesting enough for us to invest in some KIR typing efforts.

KIR associations and HLA-KIR interactions

Do we see association on the level of KIR genes, and when considering defined HLA-KIR interactions?

Now that we have performed KIR genotyping, or e.g. inferred KIR types from available whole-genome sequencing data, we can import this information, and check the gene frequencies. In our example, we could successfully infer KIR gene presence status for 935 out of the 1,000 individuals in our data set.

dat_KIR <- readKirCalls(
  file = system.file("extdata", "MiDAS_tut_KIR.txt", package = "midasHLA")
)

kable(dat_KIR, caption = "KIR data as imported by MiDAS") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  scroll_box(height = "400px")

Next, we want to check our KIR genes frequencies.

kir_freq <- getKIRFrequencies(dat_KIR)

kable(kir_freq , caption = "KIR data as imported by MiDAS") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(height = "400px")

Next, we rerun our prepareMiDAS function, this time including the KIR data. We prepare the data to test for both KIR gene associations, as well as HLA-KIR interactions. But first, let's runMiDAS on the level of KIR genes.

HLAKIR <- prepareMiDAS(
  hla_calls = dat_HLA,
  kir_calls = dat_KIR,
  colData = dat_pheno,
  experiment = c("hla_NK_ligands","kir_genes", "hla_kir_interactions")
)
KIR_model <- glm(disease ~ term, data = HLAKIR, family = binomial())
KIR_results <- runMiDAS(
  KIR_model,
  experiment = "kir_genes",
  lower_frequency_cutoff = 0.02,
  upper_frequency_cutoff = 0.98,
  exponentiate = TRUE
)

kableResults(KIR_results)

We found an association of KIR3DL1 gene presence with disease status. Since HLA alleles encoding Bw4 epitopes are ligands for KIR3DL1, we can now ask the question whether patients coding for both receptor and ligand are at increased risk for our disease phenotype.

HLA-KIR interactions

Do known biological interactions between KIR receptors and their HLA ligands show significant assocation?

If both HLA alleles and KIR gene presence / absence information is available, MiDAS can encode experimentally validated receptor-ligand interactions, defined according to Pende et al. Please note that there is extensive allelic variation in KIR genes, as well as evidence that their interaction with HLA does not only depend on the presence of the KIR gene, but also allele status. Currently, MiDAS does not consider allelic variation in KIR, but a custom dictionary can be provided by the user (see documentation).

First let's explore HLA-KIR interactions frequencies

hlakir_freq <- getFrequencies(HLAKIR, experiment = "hla_kir_interactions")

kable(hlakir_freq, caption = "HLA-KIR interaction frequencies") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(height = "400px")

HLAKIR_model <- glm(disease ~ term, data = HLAKIR, family = binomial())
HLA_KIR_results <- runMiDAS(
  HLAKIR_model,
  experiment = "hla_kir_interactions",
  lower_frequency_cutoff = 0.02,
  upper_frequency_cutoff = 0.98,
  exponentiate = TRUE
)

kableResults(HLA_KIR_results)

Patients who are carriers of both HLA-Bw4 and KIR3DL1 are at a significantly increased risk for having the disease, which is consistent with the associations we saw on KIR ligand and KIR gene level.

Finally, let's now take a look at some additional ways to analyze immunogenetics data within MiDAS, which are probably not relevant for most users.

HLA heterozygosity and evolutionary divergence

Instead of focusing on single HLA alleles or their interactions, you might be interested in HLA diversity. For example, it has been shown that MHC heterozygosity confers a selective advantage against infections with multiple Salmonella strains. But given the degree of variability in HLA genes, heterozygosity is a rather crude measure. On the amino acid sequence level, it is possible to perform more refined analyses, by calculating the level of evolutionary divergence between pairs of HLA alleles of different genes. Pierini and Lenz have found Grantham's distance score to be a good proxy for HLA functional divergence. It has been demonstrated that HLA-C allelic divergence was associated with HIV viral load. In addition, there is evidence for a role of HLA class I divergence in the efficacy of cancer immunotherapy.

MiDAS provides experiment types hla_het to analyze heterozygosity of class I and II genes, as well as hla_divergence to analyze classical class I genes using Grantham's distance:

HLA_het <- prepareMiDAS(
  hla_calls = dat_HLA, 
  colData = dat_pheno, 
  experiment = c("hla_het","hla_divergence")
)

HLA_het_model <- glm(outcome ~ term, data=HLA_het, family=binomial())

HLA_het_results <- runMiDAS(HLA_het_model, 
  experiment = "hla_het", 
  exponentiate = TRUE
)

kableResults(HLA_het_results)

HLA_div_results <- runMiDAS(HLA_het_model, 
  experiment = "hla_divergence", 
  exponentiate = TRUE
)

kableResults(HLA_div_results, scroll_box_height = "250px")

In our example dataset, our results show that there is no significant association of HLA heterozygosity or evolutionary divergence with our disease phenotype of interest.

We hope that this tutorial is useful to get you started working with MiDAS. Please don't hesitate to contact us if you have questions or suggestions for improvement.

Genentech/midasHLA documentation built on Feb. 12, 2024, 9:38 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Genentech/midasHLA
R package for immunogenomics data handling and association analysis

In Genentech/midasHLA: R package for immunogenomics data handling and association analysis

Introduction

Data import and sanity check

HLA association analysis

Are classical HLA alleles associated with disease status?

HLA association fine-mapping on amino acid level

Can we find evidence for a role of HLA variation related to NK cell interactions?

KIR associations and HLA-KIR interactions

Do we see association on the level of KIR genes, and when considering defined HLA-KIR interactions?

HLA-KIR interactions

Do known biological interactions between KIR receptors and their HLA ligands show significant assocation?

HLA heterozygosity and evolutionary divergence

R Package Documentation

Browse R Packages

We want your feedback!

Genentech/midasHLA R package for immunogenomics data handling and association analysis

In Genentech/midasHLA: R package for immunogenomics data handling and association analysis

Introduction

Data import and sanity check

HLA association analysis

Are classical HLA alleles associated with disease status?

HLA association fine-mapping on amino acid level

Can we find evidence for a role of HLA variation related to NK cell interactions?

KIR associations and HLA-KIR interactions

Do we see association on the level of KIR genes, and when considering defined HLA-KIR interactions?

HLA-KIR interactions

Do known biological interactions between KIR receptors and their HLA ligands show significant assocation?

HLA heterozygosity and evolutionary divergence

R Package Documentation

Browse R Packages

We want your feedback!

Genentech/midasHLA
R package for immunogenomics data handling and association analysis