Welcome to "Getting to know your humdrum data"! This article explains how humdrumR can help you find essential metainformation about your data: how much data is there, and how is it structured? If you don't understand your data, you won't be able to do intelligent analyses of it.

This article, like all of our articles, closely parallels information in humdrumR's detailed code documentation, which can be found in the "Reference" section of the humdrumR homepage. You can also find this information within R, once humdrumR is loaded, using ?summary.humdrumR.

Know your data

Before any data analysis, you should know your data. What's in the data? How much data is there? How is it formatted and encoded? Are there errors or ambiguities? How was it sampled? If you can't answer these questions, you can't make intelligent or useful scholarly inferences using the data.

After you've done your background research and read up on the details of the data you are working with, hopefully answering the questions posed above, there is one more step to take before you really start analysis: inspect (some of) the data. The fact that the humdrum syntax is human-readable is one of the great strengths of the humdrum ecosystem! Open some of the data files in a text editor, or perhaps drop them into the Verovio Humdrum Viewer. Of course, you can only look at so much data "by eye"---still, it's good practice to inspect as much of the data as you can, selecting files at random and skimming through them to see if they look the way you think they should.
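If you'd like to skim a raw file without leaving R, base functions will do. Here's a minimal sketch (the specific file name, chor001.krn, is an assumption about the built-in data installed with humdrumR):

library(humdrumR)
# Print one raw humdrum file to the console, record by record:
file <- file.path(humdrumRroot, 'HumdrumData', 'BachChorales', 'chor001.krn')
cat(readLines(file), sep = '\n')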

Data summaries

Once your data is read into R, the next step is to use humdrumR to get high-level summaries of the content of all the files in your data.

humdrumR defines a number of tools to quickly summarize the structure and content of a humdrum dataset. One of the most basic functions in R is summary(); calling summary() on a humdrumR object will print a concise version of the output of humdrumR's five summary functions, which are described in detail below. Let's load our built-in Bach-chorale dataset, which we'll use throughout this article, and call summary():

library(humdrumR)
setwd(humdrumRroot)  # the directory of humdrumR's built-in data
readHumdrum('HumdrumData/BachChorales/chor0.*') -> chorales

summary(chorales)

A lot of information, huh? The rest of this article will walk through this output and the specific functions that generate it.

Summarizing Structure

The most basic information you'll want about a humdrum dataset is how "big" it is---how much data is there? Printing a humdrumR object on the command line will always tell you how many files there are in your data; you can also call length() to get this number. The census() function, however, gives us much more detail about the size of the data, telling us how many records, tokens, and characters there are:

chorales |> census()

The bottom row of the output totals the counts over the whole corpus, telling us how many records, tokens, and characters the data contains in all. The (unique) column tells us how many unique tokens there are per file (and overall, at the bottom). The (per token) column indicates the average number of characters per token in each file, and overall.
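Since the result of census() can be indexed like a data.frame, we can also compute these totals directly; a small sketch using the Tokens and Records columns:

# Total tokens and records across the whole corpus:
sum(census(chorales)$Tokens)
sum(census(chorales)$Records)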

Notice that census() defaults to counting all records/tokens. If you want to count only data tokens, specify census(chorales, dataTypes = 'D').
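For example:

chorales |> census(dataTypes = 'D')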

Spines and Interpretations

To work with humdrum data, you really need to know how many spines (and spine paths) are present in the data, and what interpretations are present. The spines() and interpretations() functions give us just this information!

spines(chorales)

interpretations(chorales)

For this toy dataset of ten chorales, the output of spines() is pretty boring: all the chorales have four spines, with no spine paths. The interpretations() output is also boring, as we see that all ten files have four **kern exclusive interpretations. However, interpretations() also tells us about the tandem interpretations it recognizes---in this case, tempo, key, instrument, and time signature information.


The chorales dataset is structurally homogeneous, which is generally a good thing---it's much easier to analyze this sort of data! However, some humdrum datasets are more heterogeneous, which is where spines() and interpretations() really come in handy. Let's switch over to another one of our pre-packaged corpora, the Beethoven/Mozart variations (see the read/write article):

readHumdrum(humdrumRroot, 'HumdrumData/.*Variations/.*.krn') -> variations

spines(variations)

Now we see something more interesting. Again, all the files have four spines, but eleven of the files include spine paths ("9 with 1 path" and "2 with 2 paths").

Let's check out the output of interpretations():

interpretations(variations)

Ah, this time we see that each file has a **function and a **harm spine, as well as two **kern spines. In fact, the "Tallies" section at the bottom tells us that all 20 files have the same exclusive interpretations (in the same order), which humdrumR labels {A}: **function, **harm, **kern, **kern.

Summarizing Metadata

Another question to ask about a dataset is what kind of metadata is encoded in the data's reference records. The reference() function answers this question for us:

reference(chorales)

We see that all ten chorale files have, for example, COM and CDT reference records, but only two have the OTL@@EN record. Not sure what those codes mean? You can also call reference() on a character string containing a reference code:

reference('COM')

reference('CDT')

To see the actual reference records themselves, you can index the result of the call to reference() by column or row. For example, to see all the OTL@@DE records:

reference(chorales)[ , 'OTL@@DE']

Or to see all the reference records for the third file:

reference(chorales)[3, ]

Summarizing Data

The next thing to do, when getting started with a humdrumR data analysis, is to get a sense of the data content itself. What tokens does our data actually contain? The unique(), count(), and sort() functions are perfect for this. We'll need to use some techniques from the Data Fields article, so review that if you don't understand the following!

Let's get the unique values, sorted:

chorales |>
  with(unique(Token)) |>
  sort()

Unlike unique(), count() tallies how many times each unique value occurs, and we can then sort() the result to see the most common tokens:

chorales |>
  with(count(Token)) |>
  sort()

Now we get a sense of the content of our dataset---in this case, there are a lot of different (unique) tokens!
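Just how many unique tokens? Since with() returns a normal R vector, we can count them directly:

chorales |>
  with(length(unique(Token)))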

Digging into Details

At this point we're starting to get a better picture of the content of our dataset. But don't get too hasty---it's a good idea to dig in a little more before we get confident we really know our data.

Our call to interpretations() told us to expect **kern data, representing musical "notes" (pitch and rhythm). So you probably expected to see things like 4. (dotted quarter note) and f# (F sharp above middle C). But what is the X in 4dnX? Or all the Js and Ls and ;s? We can look these up in the **kern definition, but the point is, we probably didn't know they were there until we took a look! You might think you know what's in your data...and get unpleasantly surprised. This is especially true with less mature (newer) datasets, which WILL DEFINITELY CONTAIN ERRORS.
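For instance, a base-R grep() can pull out a few of the unique tokens containing these surprising signifiers (a minimal sketch; the regular expression just matches J, L, or ;):

chorales |>
  with(unique(Token)) |>
  grep(pattern = '[JL;]', value = TRUE) |>
  head(10)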


We see a lot of ; tokens in our output. If you look these up, you'll learn that they are "pause signs", used to represent fermatas. But how many tokens have these fermatas?

Let's use the %~% operator, which allows us to search for matches to a (regular expression) pattern in a vector. In this case, we want to search for ";" in Token. %~% returns a logical vector (TRUE or FALSE for each token), which we can sum() to get a count of all the TRUEs:

chorales |>
  with(Token %~% ';') |>
  sum()

So the sum() tells us exactly how many ; tokens there are in the data. If we use within() (or mutate()) instead of with() (and get rid of the sum()), we can see where these fermatas appear:

chorales |>
  within(Token %~% ';')

Ah, I see that the fermatas all tend to happen at the same time across the four spines. Good to know!
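As one last sanity check, we can pull out just the tokens containing a fermata and tabulate them with base R, building on the with() calls above:

chorales |>
  with(Token[Token %~% ';']) |>
  table() |>
  sort(decreasing = TRUE) |>
  head()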


