In Chebuu/HPV-Serovars:

library(knitr)
knitr::opts_chunk$set(eval = F, echo = T)

Introduction

The first infectious cause of cancer was identified in 1911 by Peyton Rous who demonstrated the transmissibility of a tumor in fowl through injection of sub 0.2 micron filtrate of the tumor \cite{Rous_1911}.

Installing this package

# devtools::install_github('hebuu/HPV-Serovars')
remotes::install_github('chebuu/HPV-Serovars)

Loading this package in R environment

library(HPVSerovars)

(Not run) This vignette is packaged and accessible via: browseVignettes('HPVSerovars').

Installing OligAarrayAux

Download OligoArrayAux and see the following vignette for installation instructions.

library(devtools)

devtools::install_github('chebuu/Design-Group-Specific-Primers')

library(Design_Group_Specific_Primers)

vignette('Installing-OligoArrayAux', package = 'Design_Group_Specific_Primers')

Install Bioconductor packages

https://www.bioconductor.org/

# Install from Bioconductor
if (!requireNamespace("BiocManager", quietly = T))
  install.packages("BiocManager")

BiocManager::install("Biostrings")
BiocManager::install("DECIPHER")

# Open Biostrings vignettes
library(Biostrings)
browseVignettes("Biostrings")

# Open DECIPHER docs and vignettes
library(DECIPHER)
help("DECIPHER")
browseVignettes("DECIPHER")

Load packages

library(dplyr)
library(tidyr)
library(stringr)

library(Biostrings)
library(DECIPHER)

Make an alignment utility:

doAlignment <- function(xset) {
  useqs <- unique(xset)
  uidxs <- match(xset, useqs)
  aseqs <- AlignSeqs(useqs, verbose=F) 
  aseqs[uidxs]
}

Wiki Pheno Data

I copied the tables from Wiki: Human_papillomavirus_infection and included them in the R package associated with this document github.com/Chebuu/HPV-Serovars.

data(HPV.Sequelae, package='HPVSerovars')
data(HPV.GenitalCancerRisk, package='HPVSerovars')

The Wiki data contains phenotypic categorizations of several HPV strains. It's useful for testing.

HPV <- c(HPV.Sequelae, HPV.GenitalCancerRisk)
p.1 <- unique(unlist(HPV))

knitr::kable(
  sprintf('HPV-%s', sort(p.1)), col.names = '',
  caption = sprintf(
    'Unique HPV serotypes (N=%s) (Total=%s): ', 
    length(p.1), length(HPV)
  )
)

hist(unlist(HPV), max(unlist(HPV)), col='blue'); head(HPV, 4)

Here I've included the Case Statistics from Wikipedia too, although I have no use for it right now, so it's not saved as an .rda file in the package yet.

# N HPV-associated CA 2008–2012 in the USA
HPV.CaseStatistics <- t(
  data.frame(
    Oropharynx.M = c(12638, 9100, 8000),
    Oropharynx.W = c(3100, 2000, 1600),
    Penis = c(1168, 700, 600),
    Vulva = c(3554, 2400, 1700),
    Cervix = c(11771, 10700, 7800),
    Vagina = c(802, 600, 400),
    Anus.M = c(1750, 1600, 1400),
    Anus.W = c(3260, 3000, 2600),
    Rectum.M = c(237, 200, 200 ),
    Rectum.W = c(513, 500, 400 ),
    row.names = c(
      'Avg. N annual CA | ', 
      'HPV attributable CA (est.) | ', 
      'HPV 16/18 attributable CA (est.) |'
    )
  )
)

knitr::kable(
  HPV.CaseStatistics,
  caption = 'Estimated HPV associated cancers (2008-2012)'
)

library(vtable)

# HPV/CA
sum(HPV.CaseStatistics[,2]) / 
  sum(HPV.CaseStatistics[,1]) 

# HiRsk/CA
sum(HPV.CaseStatistics[,3]) / 
  sum(HPV.CaseStatistics[,1])

# HiRsk/HPV
sum(HPV.CaseStatistics[,3]) / 
  sum(HPV.CaseStatistics[,2])

NCBI Data

Below are my GenBank queries for all the HPV DNA in this repo. To run them, just paste the following queries in the search bar at: https://www.ncbi.nlm.nih.gov/nuccore

All HPV

https://www.ncbi.nlm.nih.gov/nuccore ```{sql, eval=F} "Human papillomavirus"[Primary Organism] AND viruses[filter] NOT Polyamides[All Fields] NOT Method[All Fields] NOT Patent[All Fields]

## Complete Genomes
[https://www.ncbi.nlm.nih.gov/nuccore](https://www.ncbi.nlm.nih.gov/nuccore)
```{sql, eval=F}
"Human papillomavirus"[Primary Organism] 
AND "complete genome"[All Fields]
NOT Isolate[Title]

Oral Isolates

https://www.ncbi.nlm.nih.gov/nuccore ```{sql, eval=F} "Human papillomavirus"[Primary Organism] AND viruses[filter] NOT Polyamides[All Fields] NOT Method[All Fields] NOT Patent[All Fields] AND Oral[All Fields]

### L1 gene

A collection of *l1* DNA sequences (avg. ~600bp) from various oral isolates of HPV (extracted from results of GenBank query above) are aligned and loaded in SQLite along with annotations from GenBank that describe each clinical isolate.

#### Design HPV16-specific F/R primers using `DECIPHER` utilities.

```r
data(Oral.L1.seqs)
data(Oral.L1.vars)

dbConn <- dbConnect(RSQLite::SQLite(), ':memory:')

Seqs2DB(Oral.L1.seqs, 'XStringSet', dbConn, '')
Add2DB(Oral.L1.vars %>% mutate(identifier = SVAR), dbConn)

dbGetQuery(dbConn, "select * from Seqs") %>% head(4)

tiles.L1 <- TileSeqs(
  dbFile = dbConn,
  minLength = 18,
  maxLength = 29,
  minCoverage = 0.8
)

print(
  tiles.L1[,c(6,11)] %>% sample_n(6)
)

oligos.L1 <- DesignPrimers(
  tiles = tiles.L1,
  identifier = 'HPV4',
  worstScore = -1E3,
  maxPermutations = 5,
  minGroupCoverage = 0.85,
  minCoverage = 0.85,
  minLength = 20,
  maxLength = 28
)

head(oligos.L1)

Plot hybridization curves for the reverse compliment of each primer pair (5) in each primer set (7).

plotMeltCurves <- function(temps, effs) {
  plot(
    temps, effs[,1], ylim=c(0,1),
    ylab="Hybridization Efficiency",
    xlab=expression(paste("Temperature (", degree, "C)", sep="")),
    type="l", lwd=2, col="Blue", main="Denaturation Plot"
  )
  lines(temps, effs[,2], col="Red", lwd=2)
  abline(h=0.5, lty=2, lwd=2, col="Orange")
  abline(v=64, lty=2, lwd=2, col="Green")
  legend(
    "topright",
    legend = c(
      "Forward Primer",
      "Reverse Primer",
      "50% Efficiency",
      "Annealing Temperature"
    ),
    col = c("Blue", "Red", "Orange", "Green"),
    lwd = c(2, 2, 2, 2), lty = c(1, 1, 2, 2)
  )
}

meltCurves <- 
  function(primers, target=reverseComplement(DNAStringSet(primers)), temps=60:75, P=4e-7, ions=.225, doPlot=TRUE, ...) {
    fxn <- function(temp) CalculateEfficiencyPCR(primers, target, temp, P=P, ions=ions, ...)
    effs <- matrix(unlist(lapply(temps, fxn)), ncol=2, byrow=TRUE)
    if (doPlot) plotMeltCurves(temps, effs)
    list(temps = temps, effs = effs)
  }

nsets <- nrow(oligos.L1)
npair <- ncol(oligos.L1)

for (i in 1:nsets)
  for (j in 1:npair)
    tryCatch(
      {
        c(
          oligos.L1$forward_primer[i,j],
          oligos.L1$reverse_primer[i,j]
        ) %>% meltCurves
      },
      error = function(e) NULL
        # # ¿ --- It's supposed to be 7o x 17p --- ?
        # warning(sprintf('%s [iteration i=%s j=%s]', e, i, j))
    )

More primer designs (RFLP, sequencing, etc.):

TYPE <- 'sequence'
MIN_LENGTH <- 15
MAX_LENGTH <- 25
MIN_SIZE <- 60
MAX_SIZE <- 100
LEVELS <- 2
RESOLUTION <- 3
ENZYMES <- NULL

DesignSignatures(
  dbConn,
  type = TYPE,
  minLength = MIN_LENGTH,
  maxLength = MAX_LENGTH,
  minProductSize = MIN_SIZE,
  maxProductSize = MAX_SIZE,
  resolution = RESOLUTION,
  levels = LEVELS,
  enzymes = ENZYMES
)

data(Oral.L1.seqs)
data(Oral.L1.vars)

dbConn <- dbConnect(SQLite(), ':memory:')

print(Oral.L1.seqs)

Seqs2DB(Oral.L1.seqs, 'XStringSet', dbConn, '')
Add2DB(Oral.L1.vars %>% mutate(identifier = SVAR), dbConn)

dbGetQuery(dbConn, "select * from Seqs") %>% head(4)

data(RESTRICTION_ENZYMES)

FOCUS_ID <- 'HPV11'

TYPE <- 'melt'
LEVELS <- 4
MIN_LENGTH <- 15
MAX_LENGTH <- 25
MIN_SIZE <- 60
MAX_SIZE <- 200
RESOLUTION <- seq(75, 300, 15)

RFLP.Oal.L1 <- lapply(
  rep(sample(1:3, 3), 3),  # Random RE combinations
  function(i)
    DesignSignatures(
      dbConn,
      type = TYPE,
      identifier = '',
      focusID = FOCUS_ID,
      enzymes = RESTRICTION_ENZYMES[i],
      minProductSize = MIN_SIZE,
      maxProductSize = MAX_SIZE,
      resolution = RESOLUTION, 
      levels = LEVELS
    )
)

head(RFLP.Oal.L1)

# TYPE <- 'length'
# LEVELS <- 2
# MIN_SIZE <- 200
# MAX_SIZE <- 1400
# RESOLUTION <- c(
#   seq(200, 700, 3),
#   seq(705, 1000, 5),
#   seq(1010, 1400, 10)
# )
# 
# TYPE <- 'melt'
# MIN_SIZE <- 55
# MAX_SIZE <- 400

RFLP optimization by grid search

FOCUS_ID <- 'HPV18'

params <- expand.grid(
  enzymes = lapply(1:3, function(x) 
    RESTRICTION_ENZYMES[
      sample(1:length(RESTRICTION_ENZYMES), 3)
    ]
  ),
  minProductSize = seq(80, 150, length.out = 3),
  maxProductSize = seq(500, 5000, length.out = 3),
  resolution = seq(1, 2, length.out = 3), 
  levels = seq(5, 20, length.out = 3)
) %>% mutate(type = rep('melt', 243))

gridsearch <- apply(params, 1, function(...) 
    list(
      arg = as.list(...), 
      out = do.call(
        DesignSignatures, 
        append(
          list(
            dbFile = dbConn, 
            tblName = 'Seqs', 
            identifier = '', 
            focusID = FOCUS_ID
          ), 
          as.list(...)
        )
      )
    )
) %>% bind_rows

objective <- function(x) {
  print(x)
}

Anal Isolates

## Cutaneous Isolates
[https://www.ncbi.nlm.nih.gov/nuccore](https://www.ncbi.nlm.nih.gov/nuccore)
```{sql, eval=F}
"Human papillomavirus"[Primary Organism]
NOT Polyamides[All Fields] 
NOT Method[All Fields] 
NOT Patent[All Fields]
AND Anal[All Fields]
AND "Complete Genome"[All Fields]

PaVE Data

Reference genomes for all serovars listed in data(HPV.Sequelae)and data(HPV.GenitalCancerRisk)were selected from among the results of the following PaVEPaVE query:

https://pave.niaid.nih.gov/#search/search_database/kw
?dbNamespace=Genomes
&includeNR=true
&refCloneOnly=false
&sort=Locus_ID
&sortType=true
&page=600&start=1
&text=Human&showTable=1
&

# data(PaVE.Wikiv)
data(HPV.REF.PaVE.seqs)
data(HPV.REF.PaVE.vars)

testVars <- lapply(grep('[1-9]*', HPV.REF.PaVE.vars, value = T), function(x) {
  sapply(HPV.Sequelae, function(y) any(x %in% y))
})

dbConn <- dbConnect(SQLite(), ':memory:')

DECIPHER::Seqs2DB(HPV.REF.PaVE.seqs, 'XStringSet', dbConn, '')
DECIPHER::Add2DB(
  data.frame(
    # identifier = testVars
    identifier = lapply(HPV.REF.PaVE.vars, function(x) grep('[1-9]', x, perl = T))
  ), 
  dbConn, verbose = F
)

Genital cancer risk groups (PaVE REF genomes)

HPV reference genomes from PaVE categorized by risk of genital (but not anal) cancer.

library(HPVSerovars)

data(gA) # High risk
data(gB) # Med risk
data(gC) # Some risk

dbDisconnect(dbConn)

dbConn <- dbConnect(RSQLite::SQLite(), ':memory:')

Seqs2DB(
  c(gA,gB,gC), 
  type = 'XStringSet', 
  dbFile = dbConn, 
  identifier = ''
)

Add2DB(
  data.frame(
    identifier = c(
      rep('A', length(gA)), 
      rep('B', length(gB)), 
      rep('C', length(gC))
    )
  ), 
  dbConn
)

gABC.synt <- FindSynteny(dbConn)

summary(gABC.synt)
pairs(gABC.synt, boxBlocks=TRUE)
unlist(AlignSynteny(gABC.synt, dbConn)) %>% head(4)

Using OpenPrimer

Installing OpenPrimer Dependencies

From the openPimer vignette:

openPrimeR requires external programs for some features, particularly for computing the physicochemical properties of primers. Please make sure you have the following tools installed on your system such that they are in your system’s path:

MELTING (>= 5.1.1): For melting temperature computations. ViennaRNA (>= 2.2.4): For secondary structure prediction. OligoArrayAux (>= 3.8): For primer efficiency computations as performed by DECIPHER. MAFFT (>= 7.305): For computing multiple sequence alignments. Pandoc (>= 1.19.1): For creating PDF reports.

If you would like to be able to access the immunoglobulin repository IMGT from the openPrimeR Shiny app, you should additionally fulfill the following dependencies:

PhantomJS (>= 2.1): For headless website calls. Python (>=2.7.9) and the selenium (>=3.0.1) module: For data extraction scripts. openPrimeR will automatically check for all dependencies and inform you about any missing dependencies when the package is attached.

Note that the tool is still functional if there are missing external programs. However, we recommend that all dependencies are fulfilled to guarantee the best user experience.