library(knitr)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

About the Peptidome Data

Peptides are small amino acid sequences that make up proteins. In the case of the peptides that bind to HLA, most are nine amino acids long, or nonamers. The dataset for this project has a total of 65,731 nonamer sequences split among 63 different HLA alleles, all sourced from the Immune Epitope Database.

Furthermore, not all of the HLA alleles come from the same genetic locus, or gene. In our dataset, we have peptidomes from we have five different HLA genes represented: HLA-A, HLA-B, HLA-C, HLA-E, and HLA-F. Our 63 peptidomes are split among 22 HLA-A, 34 HLA-B, 5 HLA-C, 1 HLA-E, and 1 HLA-F alleles.

Converting Peptide Sequences into Numeric Features

Since there are 20 different amino acids, one can make 20 different binary variables for each of the nine positions. This would translate each nonamer into $20 \times 9 = 180$ features. However, while this intrepretation is good at recording the idiosyncrasies of each specific amino acid, not all amino acids are equally distinct from one another - for example, glycine and alanine are both small hydrophobic molecules, whereas tyrosine is a hydrophilic molecule weighing more than glycine and alanine combined. So if an HLA can accept a glycine at position two of a nonamer, then it is much more likely to accept an alanine than a tyrosine at position two.

To get a better sense of the similarities among amino acids, it may be good to use the physiochemical properties of the amino acids themselves. Here we pick the three most salient properties: size, polarity, and charge. These will be measured by each amino acid's molecular weight, hydrophobicity index, and isoelectric point, respectively. This will create a second set of $3 \times 9 = 27$ features.



ash129/LAPP documentation built on May 10, 2019, 1:52 p.m.