``` {r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This vignette explains the basic functionality of the package litRiddle, a part of the Riddle of Literary Quality project.
The package contains the data of a reader survey about fiction in Dutch, a description of the novels the readers rated, and the results of stylistic measurements of the novels. The package also contains functions to combine, analyze, and visualize these data.
See: https://literaryquality.huygens.knaw.nl/ for further details. Information in Dutch about the package can be found at https://karinavdo.github.io/RaadselLiteratuur/02_07_data_en_R_package.html.
These data are also available as individual csv files for those who want to work with the data outside R; see https://github.com/karinavdo/RiddleData.
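For example, here is a minimal sketch of reading one of those csv files directly from the repository; the exact file name below is hypothetical, so check the repository listing for the real ones:

``` {r eval = FALSE}
# Read one csv file straight from the GitHub repository;
# the file name "books.csv" is a hypothetical placeholder.
books_csv = read.csv(
  "https://raw.githubusercontent.com/karinavdo/RiddleData/main/books.csv",
  encoding = "UTF-8")
head(books_csv)
```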
If you use litRiddle in your academic publications, please consider citing the following references:
Maciej Eder, Saskia Lensink, Joris van Zundert, and Karina van Dalen-Oskam (2022). Replicating The Riddle of Literary Quality: The litRiddle Package for R. In: Digital Humanities 2022 Conference Abstracts, pp. 636--637. Tokyo: The University of Tokyo / DH2022 Local Organizing Committee. https://dh2022.dhii.asia/abstracts/files/VAN_DALEN_OSKAM_Karina_Replicating_The_Riddle_of_Literary_Qu.html
Karina van Dalen-Oskam (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Install the package from the CRAN repository:
``` {r eval = FALSE}
install.packages("litRiddle")
```
Alternatively, try installing it directly from the GitHub repository:
``` {r eval = FALSE}
library(devtools)
install_github("karinavdo/LitRiddleData", build_vignettes = TRUE)
```
First, one has to activate the package so that its functions become visible to the user:
``` {r warning = FALSE}
library(litRiddle)
```
## The dataset

To activate the dataset, type one of the following lines (or all of them):

``` {r}
data(books)
data(respondents)
data(reviews)
data(motivations)
data(frequencies)
```
From now on the dataset, divided into five data tables, is visible to the user. Please note that the functions discussed below do not need the dataset to be activated (they take care of this themselves), so you can skip this step if you plan to analyze the data using the functions from the package.
Time to explore some of the data tables. Typing the name of a table will list all the data points from the table books:

``` {r eval = FALSE}
books
```
This command will dump quite a lot of material on the screen, offering little insight or overview. It's usually a better idea to select one portion of information at a time, usually one variable or one observation. We assume here that the user has some basic knowledge of R, in particular how to access values in vectors and tables (matrices). To get the titles of the books scored in the survey (say, the first 10 titles), one might type:

``` {r}
books$title[1:10]
```
Well, but how do I know that the name of the particular variable I want is title, rather than anything else? There exists a function that lists all the variables from the three data tables.
The function that lists all the column names from all three datasets is named get.columns() and needs no arguments. This means that you simply type the following code, remembering the parentheses at the end of the function:
``` {r}
get.columns()
```
Not bad indeed. However, how can I know what s.4a2 stands for?
The function that lists a short explanation of what the different column names refer to, and what their levels consist of, is called explain(). To work properly, this function needs one argument, which means that the user has to specify which dataset s/he is interested in. The options are as follows:
explain("books") explain("reviews") explain("respondents") explain("motivations") explain("frequencies")
The package provides a function to combine all information from the survey, the reviews, and the books into one big data frame. The user can specify whether or not to also load the freqTable with the frequency counts of the word n-grams of the books.
Combine and load all data from the books, respondents, and reviews into a new data frame (tibble format):

``` {r}
dat = combine.all(load.freq.table = FALSE)
```
Combine and load all data from the books, respondents, and reviews into a new data frame (tibble format), and additionally load the frequency table of all word 1-grams of the corpus used:

``` {r}
dat = combine.all(load.freq.table = TRUE)
```
Return the name of the dataset in which a column can be found:

``` {r}
find.dataset("book.id")
find.dataset("age.resp")
```
It's useful to combine this with the already-discussed function get.columns().
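For instance, a small sketch combining the two (assuming, as above, that find.dataset() takes a single column name; sapply() then applies it to several names at once):

``` {r eval = FALSE}
# Look up the home dataset of several columns in one go.
sapply(c("book.id", "age.resp", "quality.read"), find.dataset)
```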
The function make.table() makes a table of frequency counts for one variable, and plots a histogram of the results. Not sure which variable you want to plot? Invoke the above-discussed function get.columns() once more to see which variables you can choose from:
``` {r eval = FALSE}
get.columns()
```
Now the fun stuff:

``` {r}
make.table(table.of = 'age.resp')
```
You can also adjust the x label, y label, title, and colors:
``` {r}
make.table(table.of = 'age.resp', xlab = 'age respondent', ylab = 'number of people',
           title = 'Distribution of respondent age', barcolor = 'red', barfill = 'white')
```
Note: please mind that in the above examples we used single quotes to indicate arguments (e.g. xlab = 'age respondent'), whereas at the beginning of the document we used double quotes (explain("books")). We did this on purpose, to emphasize that the functions provided by the package litRiddle are fully compliant with generic R syntax, which allows either single or double quotes to delimit strings.
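A one-line illustration in base R (nothing package-specific):

``` {r eval = FALSE}
# Single- and double-quoted strings are the same object in R:
identical('age.resp', "age.resp")
```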
The function make.table2() works like make.table(), but additionally splits the counts of the first variable by a second variable, passed via the split argument:

``` {r}
make.table2(table.of = 'age.resp', split = 'gender.resp')
make.table2(table.of = 'literariness.read', split = 'gender.author')
```
Note that you can only pass a variable with fewer than 31 unique values to the split argument, to avoid uninterpretable output. E.g., consider the following code, in which the split variable zipcode has far more than 30 unique values:

``` {r}
make.table2(table.of = 'age.resp', split = 'zipcode')
```
You can also adjust the x label, y label, title, and colors:
``` {r}
make.table2(table.of = 'age.resp', split = 'gender.resp', xlab = 'age respondent',
            ylab = 'number of people', barcolor = 'purple', barfill = 'yellow')
make.table2(table.of = 'literariness.read', split = 'gender.author',
            xlab = 'Overall literariness scores', ylab = 'number of people',
            barcolor = 'black', barfill = 'darkred')
```
The original survey about Dutch fiction was designed to rank the responses using descriptive terms, e.g. "very bad", "neutral", "a bit good", etc. In order to conduct the analyses, the responses were then converted to numerical scales ranging from 1 to 7 (the questions about literariness and literary quality) or from 1 to 5 (the questions about the reviewer's reading patterns). However, if you want the responses converted back to their original form, invoke the function order.responses(), which transforms the survey responses into ordered factors. Use either "bookratings" or "readingbehavior" to specify which of the survey questions should be changed into ordered factors. (We assume here that the user knows what ordered factors are, because otherwise this function will not seem very useful.) The levels are as follows:

- quality.read and quality.notread: "very bad", "bad", "a bit bad", "neutral", "a bit good", "good", "very good", "NA".
- literariness.read and literariness.notread: "absolutely not literary", "non-literary", "not very literary", "between literary and non-literary", "a bit literary", "literary", "very literary", "NA".
- statements 4/12: "completely disagree", "disagree", "neutral", "agree", "completely agree", "NA".
To create a data frame with ordered factor levels of the questions on reading behavior:
``` {r}
dat.reviews = order.responses('readingbehavior')
str(dat.reviews)
```
To create a data frame with ordered factor levels of the book ratings:
``` {r}
dat.ratings = order.responses('bookratings')
str(dat.ratings)
```
The data frame frequencies contains numerical values for the word frequencies of the 5000 most frequent words (in descending order of frequency) of 401 literary novels in Dutch. The table contains relative frequencies, meaning that the original word occurrences from a book were divided by the total number of words of the book in question. The measurements were obtained using the R package stylo, and were later rounded to the fifth decimal place. To learn more about the novels themselves, type help(books).
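As a quick illustration of the arithmetic behind the table (the numbers below are made up):

``` {r eval = FALSE}
# Relative frequency = raw count / total number of words in the book,
# rounded to five decimal places as in the frequencies table.
raw_count = 500      # hypothetical occurrences of one word
book_length = 97000  # hypothetical length of the book in words
round(raw_count / book_length, 5)
```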
The row names of the frequencies data frame contain the titles of the novels, corresponding to the short.title column in the data frame books.
``` {r}
rownames(frequencies)[10:20]
```
Listing the relative frequency values for the novel Weerzin by René Appel:
frequencies["Appel_Weerzin",][1:10]
And getting the book information:
books[books["short.title"]=="Appel_Weerzin",]
Version 1.0 of the package introduces the table motivations, containing the more than 200,000 lemmatized and POS-tagged tokens making up the text of all motivations. The Dutch Language Institute (INT, Leiden) took care of POS-tagging the data. The tagging was manually corrected by Karina van Dalen-Oskam. We tried to guarantee the highest possible quality, but mistakes may still occur.
A separate token-based table was chosen so as not to burden the table reviews with lots of text, XML, or JSON in additional columns, which could lead to problems with default memory constraints in R.
To retrieve all tokens:
``` {r}
data(motivations)
head(motivations, 15)
```
Usually one will want to work with the full text of the motivations. A convenience function motivations.text() is provided that creates a view with one motivation per row:
``` {r}
# We import `dplyr` to use `tibble`, so that we can
# show very large tables somewhat more nicely.
suppressMessages(library(dplyr))
mots = motivations.text()
tibble(mots)
```
Note that loading dplyr masks the explain function from the package litRiddle, because dplyr has its own explain function. To use litRiddle's explain function after dplyr has been loaded, call it explicitly, like this: litRiddle::explain("books").
Gathering all motivations for, for instance, one book requires some trivial merging. Let's see what people said about Binet's HhhH. For this we select the book information for the book with ID 46 and left join (merge) it (book.id by book.id) with the table mots holding all the motivations:
``` {r}
mots_hhhh <- merge(x = books[books["book.id"] == 46, ], y = mots,
                   by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
```
Hmm... that is a pretty wide table. Select the text column to get an idea of what is being said, and print with the n parameter to see more rows:
``` {r}
print(tibble(mots_hhhh[, "text"]), n = 40)
```
If we also want to include review information, this requires another merge. Rather than trying to combine all data in one huge statement, it is usually easier to follow a step-by-step method. First let's collect the motivations for HhhH, this time being more selective about columns. If you compare the following query with the merge statement above, you will find that we use only the author and title from books and only the respondent ID and the motivational text from mots, while we use book.id from both to match for merging.
``` {r}
mots_hhhh = merge(x = books[books["book.id"] == 46, c("book.id", "author", "title")],
                  y = mots[, c("book.id", "respondent.id", "text")],
                  by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
```
We now have a new view that we can again merge with the information in the reviews data:
``` {r}
tibble(merge(x = mots_hhhh, y = reviews,
             by = c("book.id", "respondent.id"), all.x = TRUE))
```
Note how we use a vector for by to ensure that we match on both book ID and respondent ID. If we used only book.id, we would get all scores for that book by all respondents, but we want the scores by the particular respondents who motivated their rating.
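To see the difference, consider this sketch: merging on book.id alone pairs every motivation with every review of the book, inflating the row count:

``` {r eval = FALSE}
# Merging on book.id only: every motivation row gets paired with
# every review of book 46, so the result has far more rows.
too_many = merge(x = mots_hhhh, y = reviews, by = "book.id", all.x = TRUE)
nrow(too_many)
```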
And -- being sceptical as we always should be about our strategies -- let us just check that we didn't miss anything, and verify that respondent 1022 indeed had only one rating for HhhH:
``` {r}
reviews[reviews["respondent.id"] == 1022 & reviews["book.id"] == 46, ]
```
Suppose we want to look into the word frequencies of the motivations. We can use base R's table to get an idea of how often each combination of lemma and POS tag appears in the motivations:
``` {r}
# Remember: motivations is a *token* table, one token + lemma + POS tag per row.
toks = motivations
head(table(toks$lemma, toks$upos), n = 30)
```
Wow, respondents are creative in their use of punctuation! In the interest of completeness we chose not to clean all those emoticons out of the data set. However, here we don't need them, so we filter and sort. The code in the next cell is not trivial if you are new to R or to regular expressions; hopefully the inserted comments clarify it a bit. Note, just in case you run into puzzling errors: this uses dplyr's filter, as we imported dplyr above. Base R's filter requires a different approach.
``` {r}
# Filter out tokens that do not contain at least one word character.
# We use the regular expression "\w+", which means "one or more word
# characters"; the added backslash prevents R from interpreting the
# backslash as an escape character.
mots = filter(motivations, grepl('\\w+', lemma))
# Create a data frame out of a table of raw frequencies
# (look up the 'table' function in the R documentation).
mots = data.frame(table(mots$lemma, mots$upos))
# Use interpretable column names.
colnames(mots) = c("lemma", "upos", "freq")
# Keep only useful information, i.e. those lemma+POS combinations
# that appear more than 0 times.
mots = mots[mots['freq'] > 0, ]
# Sort from most used to least used.
mots = mots[order(mots$freq, decreasing = TRUE), ]
# Finally show as a nicer looking table.
tibble(mots)
```
And, rather unsurprisingly, it is the pronouns and other function words that lead the pack.
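If we want to look past the function words, here is a quick sketch that filters on the POS tag (assuming the upos column follows the Universal Dependencies tag set, in which nouns are tagged "NOUN"):

``` {r eval = FALSE}
# Keep only the lemmas tagged as nouns; mots is already sorted by
# frequency, so head() shows the most frequent nouns.
nouns = mots[mots$upos == "NOUN", ]
head(nouns, 10)
```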
For another exercise, let's look up something about the lemma "boek" (English "book"):
mots[mots["lemma"] == "boek", ]
Linguistic parsers are not infallible. Apparently in three cases the parser did not know how to classify the word "boek", in which case the POS tag handed out is "X". Can we find the contexts in which those linguistic unknowns were used? For this, we first find the motivation IDs of the rows where this happened:
``` {r}
# First we find the motivation IDs of the rows where this happens.
boekx = motivations[motivations["lemma"] == "boek" & motivations["upos"] == "X", ]
boekx
```
Now we need the full texts of all motivations, so we can find those three motivations we are looking for.
``` {r}
mots_text = motivations.text()
```
To find the three motivations we merge the boekx table with the table of all the motivations, and we keep only those rows that pertain to the three motivation IDs. That is, we merge with by = "motivation.id" and all.x = TRUE, implying that we keep all rows from x (i.e. the three motivations with "boek" POS-tagged as "X") and drop all non-related rows from y (i.e. all those motivations that do not contain those linguistically unknown "boek" mentions).
``` {r}
boekx_mots_text = merge(x = boekx, y = mots_text,
                        by = "motivation.id", all.x = TRUE)
```
And finally we show what those contexts are:
``` {r}
tibble(boekx_mots_text[, c("book.id.x", "respondent.id.x", "text")])
```
And just for good measure, the full text of the third mention:
``` {r}
boekx_mots_text[3, "text"]
```
Future versions of the litRiddle package will support likert plots. Visit https://github.com/jbryer/likert to learn more about the general idea and the implementation in R.
Future versions of the litRiddle package will also support topic modeling of the motivations given by the reviewers.
Each function provided by the package has its own help page; the same applies to the datasets:
``` {r eval = FALSE}
help(books)
help(respondents)
help(reviews)
help(motivations)
help(frequencies)
help(combine.all)
help(explain)
help(find.dataset)
help(get.columns)
help(make.table)
help(make.table2)
help(order.responses)
help(litRiddle) # for the general description of the package
```
All the datasets use the UTF-8 (Unicode) encoding. This should normally not cause any problems on macOS and Linux machines, but Windows can be trickier in this respect. We haven't experienced any inconveniences in our testing environments, but we cannot guarantee the same for every machine.
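If non-ASCII characters (e.g. accented letters in Dutch titles) appear garbled on your system, inspecting the locale R runs in may help to diagnose the problem:

``` {r eval = FALSE}
# Show the character-type locale; on Windows a non-UTF-8 locale
# can cause display problems with accented characters.
Sys.getlocale("LC_CTYPE")
```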
Karina van Dalen-Oskam (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Karina van Dalen-Oskam (2021). Het raadsel literatuur. Is literaire kwaliteit meetbaar? Amsterdam University Press.
Maciej Eder, Saskia Lensink, Joris van Zundert, and Karina van Dalen-Oskam (2022). Replicating The Riddle of Literary Quality: The litRiddle package for R. In: Digital Humanities 2022 Conference Abstracts. The University of Tokyo, Japan, 25--29 July 2022, pp. 636--637. https://dh2022.dhii.asia/dh2022bookofabsts.pdf
Corina Koolen, Karina van Dalen-Oskam, Andreas van Cranenburgh, Erica Nagelhout (2020). Literary quality in the eye of the Dutch reader: The National Reader Survey. Poetics 79: 101439, https://doi.org/10.1016/j.poetic.2020.101439.
More publications from the project: see https://literaryquality.huygens.knaw.nl/?page_id=588.