In lassehjorthmadsen/salmer: A Data Set with Hymns in Danish

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The R-package, salmer, provides a data set with the text of 791 hymns in Danish, from the official book of hymns used in churches in Denmark. The purpose is to provide an interesting corpus of texts to use for NLP exercises and fun.

For the fun part, the package also makes available a set of functions that enable users to use the cut-up technique popularized by William S. Burroughs, to rearrange the lyrics and use that as a creative tool.

The technique involves rearrange words or text fragments more or less randomly, to form new associations, and new ideas, sometimes interesting ones.

The packages

I addition to salmer, for this demo, we use mostly dplyr, but I normally just load the whole tidyverse.

library(dplyr)
library(DT)
library(salmer)

The data

Two data sets are provided with the package. The first, hymns, contain the raw text of the hymns, along with some meta data, like number, title, author, and more (use ?hymns to get more details). For example, the first verse of the first hymn, by N.F.S. Grundtvig, goes like this:

hymns %>% 
  select(doc_id, verse, text, author) %>% 
  filter(doc_id == 1, verse == 1)

A tokenized and annotated version of the data is also available, in a tibble named annotated_hymms. Here, each row represents a token of the raw text. It is annotated with lemma, part-of-speech tag, number of vowels, a pronounciation code, information of rhyme scheme, and more. See ?annotated_hymns.

Selected columns and first 10 rows of that data set is shown below; it contains the tokens from the first line of the verse above:

annotated_hymns %>% 
  select(doc_id, token_id, token, vowels, upos, sampa, rhyme_scheme) %>% 
  head(10)

The very first word is, appropriately, "Guds" (The Lord's), which contains a single vowel, is a noun, pronunced "guDs using the SAMPA phonetic script (which, confusingly, uses the double quote character, ", to indicate where the main stress lies)." The last token in that line is an exclamation mark, '!', which is tagged 'PUNCT' for punctuation.

The rhyme scheme is also included; that column contains a number whenever a token should rhyme with an earlier one. Only words that are last of a line can have a rhyme scheme. A number of 2, for example, means that the word should rhyme with the last word two lines above.

In the small sample shown above, no words has a rhyme scheme; last word of that line ("løn") cannot be required to rhyme with an earlier word, since this is the first line and there are no earlier last-of-line words to rhyme with.

Cutting it up

The cut-up technique, as implemented here, rearranges the words randomly, but with some constrains: We want to replace words only with words that has the same number of vowels and the same part-of-speech tag. That way, the hymns remain singable and readable -- but not necessarily meaningful.

The information on pronunciation and rhyme scheme allows us to impose additional constrains on words that must rhyme, so we also retain the rhyme structure of the original. More on rhymes in next sections.

Let's try to use the cut-up technique on one popular hymn, Op, al den ting, som Gud har gjort, by Hans Adolph Brorson, 1734. This is hymn number 15 in Den Danske Salmebog.

The first verse reads like this:

my_hymn <- 15

hymns %>% 
  filter(doc_id == my_hymn, verse == 1) %>% 
  select(verse, text)

The cut-up() function takes as input a hymn number (here, r my_hymn, stored in my_hymn) and an dataframe with annotations, here annotated_hymns. We can retain a (little) bit of sanity by excepting punctuation from being cut up, so that an exclamation mark will stay that and not turn into a semicolon, for example. (The except parameter accepts a character vector, so we can have more than one exception; we might want to experiment with keeping, say, pronouns or nouns.)

First, cutting it up:

set.seed(2718) # For reproducibility

cutup <- annotated_hymns %>% 
  cut_up(my_hymn, except = "PUNCT") %>% 
  filter(verse == 1)

Inspecting a few colums from the output of cut_up(), we see that it has added a new column, token_new with a suggestion for a new token, with same number of vowels and same part-of-speach-tag (in column upos).

cutup %>% select(token, upos, vowels, token_new) %>% slice_head(n = 8)

The way cut_up() works is really simple: It takes a bit of annotated_hymns, that is, one specific hymn, then join it with the full data frame, using upos and number of vowels as keys. Since every word usually matches many words with same upos and same number of vowels, that generates a rather long list of possible matches. From that list, we just sample one word for every word, so we get back the original number of words.

A feature of this way of doing it, is that words are sampled with a probability in proportion to how often they appear in the full book of hymns. I suspect this tend to make the result more readable and seem more natural, since rare words stay rare.

Another note is, that words can have different POS-tags depending on the context they appear in. "Love" can be both a verb and a noun, for example. So when "love" appears as a noun ("for the love of God") it will be replaces with another word than can be a noun. Like "heat", for example.

Collapsing back into readable text

In order to better be able to appreciate the beauty of the newly created poetry, we would like to turn the one-token-per-row dataframe back in a one-line-per-row dataframe like in hymns. For that, we use the function collapse_annotation().

The cut-up-version, has suggested words in the token_new column. This version will read as follows:

cutup %>% 
  collapse_annotation(token_new) %>% 
  filter(line_id <= 4)

Restoring rhymes

We see that we have lost the rhymes. To get them back, we can rearrange just the words that need to rhyme, but with additional contrains. For this, we use the new_rhymes() function:

final <- cutup %>% new_rhymes(annotated_hymns)

For better comparison, we can arrange the three versions, original, cup-up, and cut-up with restored rhymes, side-by-side with a bit of extra information.

# Collapsed, i.e. readable, version
cutup_readable <- cutup %>% 
  collapse_annotation(token_new) %>% 
  select(`Cut-up` = text)

final_readable <- final %>% 
  collapse_annotation(token_new) %>% 
  select(`Fixed rhymes` = text)

# Rhyme scheme
rs <- cutup %>%
  filter(upos != "PUNCT") %>%
  group_by(line_id) %>%
  slice_max(token_id) %>%
  ungroup() %>%
  select(`Rhyme scheme` = rhyme_scheme)

# show final cut-up version along with original and other bits
comparison <- hymns %>%
  filter(doc_id == my_hymn, verse == 1) %>%
  select(Verse = verse, Original = text) %>%
  bind_cols(cutup_readable, final_readable, rs)

# put in a nice data table
options(DT.options = list(dom = 't', pageLength = 100, autoWidth = TRUE))

datatable(comparison, rownames = F,
            options = list(columnDefs = list(list(width = '10px', targets = c(0, 4)))))

The new_rhymes() function basically does the same shuffle as the cut_up() function, but only for words that are required to rhyme, and with additional constrain on the pronunciation, that will (try to) find alternative words that rhymes where they are supposed to.

In the example above, we see in the column with the original lyrics, that "stort" in the third line rhymes with "gjort" in the first line. And the finale word, "bevise" rhymes with "prise" two lines above. Therefore, "tryg" in the cut-up version gets replaced with "klar" so it properly rhymes with "var". Likewise, "julefest" gets replaced with "befaler" that rhymes with "taler".

And so, we have a new version of the first verse of a popular hymn, cut-up and rearranged like William S. Burroughs would have done. It rhymes and is singable, doesn't make much sense, but might serve as an inspirational starting point for writing new hymns.

We might want to generate many cut-ups to get something that feels interesting, but that is what we have R for.

Happy writing!

What rhymes?

Since finding words that rhymes, is by far the most tricky bit of this exercise, let's look a bit into that.

A perfect rhyme, according to Wikipedia, is when both of these conditions are meet:

The stressed vowel sound in both words is identical, as well as any subsequent sound. For example, "trouble" and "bubble" are perfect rhymes, because the first, stressed, vowel sound and subsequent vowel sounds are the same.
The onset (the beginning consonant sound) of the stressed syllable must differ. For example, "bean" and "green" is a perfect rhyme, while "leave" and "believe" is not.

So to find rhymes we need information on pronunciation. To annotate the hymns from Den Danske Salmebog, I used a pronunciation dictionary for Danish obtained here:. (The code used to actually do the annotation is in the pronunciation.R script in the data-raw folder.) )

To illustrate why it is necessary to use pronunciations, consider a couple of examples:

Anger doesn't rhyme with danger because the vowel sounds are pronounced diffently, even though they are spelled the same. There are many examples of this: Coward/toward; cross/gross; deaf/leaf.
Crooked doesn't rhyme with looked because the stress is on different vowels; second and first respectively. A few other examples are: Decade/degrade; distress/mistress.
As mentioned, the consonant sound starting the stressed vowel also cannot be identical. So while sorrow, borrow and tomorrow all rhymes, tomorrow wouldn't rhyme with the made-up word of hymorrow.

Here's an example of pronunciation of the Danish word "prise" (praise), from the annotated_hymns data set:

annotated_hymns %>% 
  filter(token == "prise") %>% 
  slice_head() %>% 
  select(token, sampa, vowels, stress_vowel, remainder)

The double quote character, ", indicate the the stress is on the first syllabus. The vowel part of that is pronunced i: (a long i, like the English e in "eve"). And the remainder is pronounced $s@. For a word to rhyme, both the stress vowel and the remainder must match.

To not mess op the rhythmic unit (foot), we also match on number of vowels, when finding rhyming replacements.

The final constraint is, that a word cannot rhyme with itself, of course. When we look for non-identical words that match both on number of vowels, the stress vowel and the remainder, we don't also have to make sure that the stressed consonant is different.

Still, the algoritm is not perfect. First, because we might not find any rhymes that fit all contraints. In that case, the word doesn't get replaced. Second, the pronunciation dictionary, while big, doesn't have every word from the book of hymns. Third, spoken language is not an exact science; several pronunciations of the same word may be acceptable.

All being said, I think get_rhymes() does a reasonable job of finding likely rhymes.

lassehjorthmadsen/salmer documentation built on April 15, 2022, 3:38 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com