knitr::opts_chunk$set(echo = TRUE)
knitr::include_graphics(system.file("help/figures/hyper2.png", package = "hyper2"))

To cite the hyper2 package in publications, please use @hankin2017_rmd. This file creates objects jester and maxjest, which are datasets available in the hyper2 package, and documented at jester.Rd. This file takes quite a long time to run. Goldberg et al present a dataset in which respondents rated a number of jokes. Here, I analyse a small portion of this dataset using the hyper2 package. This document is intended to illustrate an extremely challenging application of the hyper2 package and (without cache) takes a long time to process. Goldberg's dataset has 24938 lines, one per respondent, and 101 columns, one per joke (the first column shows the number of jokes rated by each respondent); here I use 150 lines and 99 jokes (the 100th joke was not funny).

library("hyper2",quietly=TRUE)
a <- read.csv("jester-data-3.csv",head=FALSE) # File is 150 lines only
a <- as.matrix(a[,-c(1,100)])
a[a==99] <- NA
colnames(a) <- paste("joke",sprintf("%02d",seq_len(ncol(a))),sep="")
a[1,]

Row 1 of a is displayed (most entries are NA, signifying that respondent 1 did not rank that particular joke). It shows that the first respondent rated joke 5 at -1.65, joke 7 at -0.78, and so on. We can perform some summary statistics of a:

plot(rowSums(!is.na(a)),xlab="respondent")

The above plot shows how many jokes each of the 100 respondents rated.

plot(colSums(!is.na(a)),xlab="joke index")

The above shows how many respondents rated each joke. It would make sense to remove the jokes that were not rated:

dim(a)
a <- a[,colSums(!is.na(a))>1]
dim(a)

showing that 91 jokes were rated by at least one respondent. We need to transform the dataset:

f <- function(x){
    x <- x[!is.na(x)]
    x[order(x,decreasing=TRUE)] <- seq_along(x)
    return(x)
}
f(a[1,])

Thus we see that this respondent rated joke08 to be the funniest, having rank 1.

jester <- hyper2()
system.time(for(i in seq_len(nrow(a))){jester <- jester + rank_likelihood(f(a[i,]))})

now

system.time(jester_maxp <- maxp(jester,n=1))

and

jester_maxp
plot(jester_maxp)
equalp.test(jester,startp=indep(jester_maxp))

Reference

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

Create a table like formula 1 results table

jester_table <- a
for(i in seq_len(nrow(jester_table))){
  x <- jester_table[i,]
  x[!is.na(x)] <- rank(-x[!is.na(x)],ties.method="first")
  jester_table[i,] <- x
}
jester_table <- t(jester_table)
colnames(jester_table) <-paste("resp",seq_len(ncol(jester_table)),sep="_")
jester_table[1:6,1:10]

Above we see that respondent 1 ranked joke 8 as the funniest and joke 7 as the 15th funniest. Comparing with formula1_2022.txt, for example, we see that respondents correspond to venues and jokes correspond to drivers. We have to be a little bit careful because NA means "not rated", not "did not finish" as in the Formula 1 datasets.

plot(rowMeans(jester_table,na.rm=TRUE))
abline(v=17)

Package dataset {-}

Following lines create jester.rda, residing in the data/ directory of the package.

save(jester,jester_table,jester_maxp,file="jester.rda")

### References



RobinHankin/hyper2 documentation built on April 21, 2024, 11:38 a.m.