knitr::opts_chunk$set(echo = TRUE)
knitr::include_graphics(system.file("help/figures/hyper2.png", package = "hyper2"))
To cite the hyper2
package in publications, please use @hankin2017_rmd.
This file creates objects jester
and maxjest
, which are datasets
available in the hyper2
package, and documented at jester.Rd
.
This file takes quite a long time to run. Goldberg et al present a
dataset in which respondents rated a number of jokes. Here, I analyse
a small portion of this dataset using the hyper2
package. This
document is intended to illustrate an extremely challenging
application of the hyper2
package and (without cache) takes a long
time to process. Goldberg's dataset has 24938 lines, one per
respondent, and 101 columns, one per joke (the first column shows the
number of jokes rated by each respondent); here I use 150 lines and 99
jokes (the 100th joke was not funny).
library("hyper2",quietly=TRUE) a <- read.csv("jester-data-3.csv",head=FALSE) # File is 150 lines only a <- as.matrix(a[,-c(1,100)]) a[a==99] <- NA colnames(a) <- paste("joke",sprintf("%02d",seq_len(ncol(a))),sep="") a[1,]
Row 1 of a
is displayed (most entries are NA
, signifying that
respondent 1 did not rank that particular joke). It shows that the
first respondent rated joke 5 at -1.65, joke 7 at -0.78, and so on.
We can perform some summary statistics of a
:
plot(rowSums(!is.na(a)),xlab="respondent")
The above plot shows how many jokes each of the 100 respondents rated.
plot(colSums(!is.na(a)),xlab="joke index")
The above shows how many respondents rated each joke. It would make sense to remove the jokes that were not rated:
dim(a) a <- a[,colSums(!is.na(a))>1] dim(a)
showing that 91 jokes were rated by at least one respondent. We need to transform the dataset:
f <- function(x){ x <- x[!is.na(x)] x[order(x,decreasing=TRUE)] <- seq_along(x) return(x) } f(a[1,])
Thus we see that this respondent rated joke08
to be the funniest, having rank 1.
jester <- hyper2() system.time(for(i in seq_len(nrow(a))){jester <- jester + rank_likelihood(f(a[i,]))})
now
system.time(jester_maxp <- maxp(jester,n=1))
and
jester_maxp
plot(jester_maxp)
equalp.test(jester,startp=indep(jester_maxp))
Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
jester_table <- a for(i in seq_len(nrow(jester_table))){ x <- jester_table[i,] x[!is.na(x)] <- rank(-x[!is.na(x)],ties.method="first") jester_table[i,] <- x } jester_table <- t(jester_table) colnames(jester_table) <-paste("resp",seq_len(ncol(jester_table)),sep="_") jester_table[1:6,1:10]
Above we see that respondent 1 ranked joke 8 as the funniest and joke 7 as the 15th funniest.
Comparing with formula1_2022.txt
, for example, we see that respondents correspond to venues and jokes correspond to drivers. We have to be a little bit careful because NA
means "not rated", not "did not finish" as in the Formula 1 datasets.
plot(rowMeans(jester_table,na.rm=TRUE)) abline(v=17)
Following lines create jester.rda
, residing in the data/
directory of the package.
save(jester,jester_table,jester_maxp,file="jester.rda")
### References
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.