source('vignette_header.R')

Welcome to "Filtering humdrum data"! This article explains how r hm can be used find subsets of humdrum data. Most humdrum datasets contain a wealth of information: rhythms, pitches, lyrics, etc. However, the particular analysis you want to conduct today might not need all that information. The first step after loading humdrum data is usually to "filter out" the information you don't need. (Of course, you are not erasing it from the original files, which are safe on your computer!) This is often just one part of larger task of preparing your data. You don't necessarily need to do this, but usually it makes things simpler when you simply get anything you don't need out of the way.

In this article, we'll use our pre-packaged Bach chorale data to illustrate how filtering works, so let's load it up:

chorales <- readHumdrum(humdrumRroot, 'HumdrumData/BachChorales/.*krn')

Indexing

The simplest, common, way to filter your data is through "indexing," a common means in programming for selecting subsets of data. In R, we use the square brackets, [], to index various data objects, including basic atomic vectors, data.tables, and lists. For example, using brackets I can extract subsets of a vector:

myname <- c('N', 'A', 'T', ' ', 'C', 'O', 'N', 'D', 'I', 'T')

myname[1:3]

myname[5:10]

myname[c(1, 5)]

Watch out! R actually has two different ways of indexing: you can use a single pair of matches brackets ([ ]) or a double pair ([[ ]]). If you want a primer on how these two types of indexing are used in normal R objects, check out the R primer---you really don't need to do this before getting into r hm, but you might want to eventually.

We can use the single-bracket [] or double-bracket [[]] commands to index our r hm data objects.

Either type of bracket can accept either numeric or character vectors to index by.


To do indexing in a pipe, using the index() ([]) and index2() commands ([[]]).

Single-bracket indexing

The single-brackets are used to index whole pieces in your data. So if you have a dataset of 19 files, but you only want to look at some of those files, you'd use the single-brackets [].

Numeric indices (for single-brackets)

If we give a numeric value i to a single-bracket index command, that number will select the ith file in the dataset. For example, if we want to look only at the fifth chorale, we can call:

chorales[5]

We can also give the command a vector of i numbers:

chorales[c(1,3,5)]

You might want to use the R sequence command, :, to select a range of numbers:

chorales[6:10]

Character indices (for single-brackets)

If you supply a r hm object single-bracket index a character string, r hm will treat that string as a regular expression and return all the files that contain a match to that expression in any data token---even if there is only one.

For example, we might be only interested in files that use flat notes. Since **kern represents flats with "-", we can simply write:

chorales['-']

Look at that, five of our ten chorales contain at least one flat. The other chorales have zero flats. Notice that we still get the whole files returned to us!

Double-bracket indexing

The double-brackets are used to index within pieces in your data. Specifically, if you only want to work with certain spines or records within your pieces. The double-brackets apply the same filters separately to each file in the dataset.

In double-bracket indexing, you can provide two separate arguments, either individually or together:

If you want to use the j (spine) argument by itself, you should put a comma before it, to indicate that you are skipping over i. It is also good practice to by clear and put a comma after i, but this is not actually needed. Basically, commands should look like this:

Numeric indices (for double-brackets)

Numeric values i or j can be given to double-bracket r hm index commands. For i, the number is simply matched to record numbers in each file. For example, you could extract the first fifty records from each file as so:

chorales[[1:50, ]]

For j, the number is matched to the spines in each file (from left to right). So if you wanted the second spine in each file, you'd write:

chorales[[ , 2]]

If you give indices that are larger than the number of records or spines in a file, that file will be dropped. For example, if we call

chorales[[150:200, ]]

We only get three files back, because the other seven files don't have any records above 150! (Notice that r hm won't remove the Exclusive interpretation or spine spine closing (*-) records...since that would break the humdrum syntax.)

Character indices (for double-brackets)

Character string values i or j can also be given to double-bracket r hm index commands. r Hm will treat string as a regular expression and return all records (i) or spines (j) that any match to that expression in a data token---even if there is only one. For example, let's (again) say we are interesting in studying flats. Since **kern represents flats as "-", we can find records that contain a flat by calling:

chorales[['-', ]]

As with our single-bracket search for flats (previous section) we only get five files back, because only five of the chorales contain any flats at all. However, we now see that all records that don't contain a flat are completely stripped away, leaving a (handful) of records with at least one flat.

If we do the same thing with j (spines)

chorales[[ , '-']]

It looks like we get only one spine in each of our five files. Those single spines are simply which ever spine in each file contained any flat. But a couple things to notice! In the first file, what is now spine 1 was the old spine 2, which we can tell because that instrument interpretation is *Itenor. However, if you look at the last of the five files

chorales[[ , '-']][5]

we see *Ibass---so the new spine 1 was actually the original spine 1 in that file. And wait, there's more! If we dig in and look at the other files returned to us (which aren't shown by default),

chorales[[ , '-']][3]

we see that what is now the third file actually had flats in all four spines, and so all four spines are still there!


In some cases you may find the renumbering of spines on a per file basis confusing, and you want to keep track of the original spine numbers. Fortunately, for you there is an option that might helpful, which you will learn more about below: try specifying drop = FALSE. (The drop argument is less intuitive, but is the standard R argument for this sort of thing.)

chorales[[ , '-', drop = FALSE]]

Negative numeric indices

A nifty feature of R is that if you supply negative numbers to an indexer, R will remove those numbers. This works in r hm too, so if you want all the files except the first file, you could write:

chorales[-1]

Or if you want all the spines except the fourth spine, write

chorales[[ , -4]]

Or if you want to remove the first 20 records from each file:

chorales[[-1:-20, ]]

(Again, r hm won't remove the Exclusive interpretation or spine spine closing (*-) records...since that would break the humdrum syntax.)

General Filtering

The indexing commands (previous sections) only get you so far. If you want to be more precise about filtering, use the tidy-verse filter() method for r hm data. filter() works exactly like mutate.humdrumR. However, the expression(s) you give must evaluate to a logical vector (TRUE or FALSE), or you will get an error otherwise. filter() will take your logical result and filter out data which matches FALSE.

We can reproduce the functionality of the [] and [[]] indexing operators (previous section) using filter():

chorales |>
  filter(Spine == 1)

But, now we can filter based on any arbitrary criteria we want. For example, we could extract tokens from odd numbered spines in odd numbered records and even numbered spines in even numbered records. We'll use Rs modulo command %% to separate even and odd numbers (odd %% 2 == 1, even %% 2 == 0).

chorales |>
  filter((Record %% 2 == 0) == (Spine %% 2 == 0))

Would we ever want to do that? Probably not. However, returning to our flats, study, lets grab all the flat notes:

chorales |>
  filter(Token %~% '-')

Woh, we're only seeing a couple of B flats in these files! But from before, we know that there are a lot more flats in the fourth file (after losing the files with no flats):

chorales |>
  filter(Token %~% '-') |>
  index(4)

Subseting by Group

Like mutate(), filter() will evaluate its logical filtering expression within groups created by group_by(). This can be a useful way to get some context for our searches. For example, let's say we want to find flats again, but we want to see the whole bar of music that contains a flat. We can do this by grouping by the Bar field (and the Piece field, of course). We'll want to say "within each bar, if there are any flats, return TRUE for the whole bar, else return FALSE for the whole bar." By default, filter() will [recycle][recycling] scalar results, so if you return a single TRUE/FALSE, it will be recycled to fill the whole group.

chorales |>
  group_by(Piece, Bar) |>
  filter(any(Token %~% '-'))

Not enough context for you? Maybe we should group the bars into even-odd pairs:

chorales |>
  group_by(Piece, floor(Bar / 2)) |>
  filter(any(Token %~% '-'))

Removing vs Filtering

You probably noticed that, unlike the indexing commands [] and [[]], filter() doesn't seem to actually remove data that you filter out. So when we say

chorales |>
   filter(Spine == 1)

we still have four spines, but spines 2--4 are just emptied. This is correct. What filter() actually does is turn any filtered data points into null (d) data points. r Hm then ignores that data automatically. Why do we do this? There are several reasons:

  1. Sometimes, seeing the full structure remain in place is cleaner/easier to interpret than actually removing them. Basically, all those empty spines remind you what you filtered, so it is easier to see if the filter is doing what you think it is doing.
  2. A clear example of this is when we did [[ , j]] indexing for flats (above): when we simply removed the spines with no flats it was hard to tell which spine was which in the result.
  3. Many filters would break the humdrum syntax if the data was simply removed. If you removed all the tokens that don't contain flats, the result would be humdrum data with holes in it.
  4. Its possible to undo your filters, using the unfilter() command (details below).

Once you've done some filtering with filter(), if you want to get rid of empty parts of the data, you can do so using the commands removeEmptyFiles(), removeEmptySpines(), removeEmptyRecords(),removeEmptyPaths(), or removeEmptyStops(). By using this commands, we make sure that 1) you explicitly want to remove them and 2) the humdrum syntax is not broken, because only whole records/spines/paths/files are removed.

So for our spines example:

chorales |>
  filter(Spine == 1) |>
  removeEmptySpines()

or for records:

chorales |>
  filter(Record %% 2 == 0) |>
  removeEmptyRecords()

Complements

When r hm filters, it does not completely discard/erase the data. The output of filter() is a subset of the original data: the complement of that subset is retained (but hidden). This allows us to restore the filtered data, using unfilter():

chorales |>
  filter(Spine == 1) |>
  unfilter()

We can also expliticely switch to the complement using the complement() command:

chorales |>
  filter(Spine == 1) |>
  complement()

Why would we want to unfilter or complement our data? These functions actually allow a number of useful functionality that is not immediately obvious.

Example 1

Let's say we want to find and look at the notes leading up to the third phrase ending in each chorale? We'd first need to count the phrase endings (;) in each chorale. We could filter our data, and simply seq_along() (enumerate) the fermatas in each bar:

chorales |>
  filter(Token %~% ';') |>
  group_by(Piece, Spine) |>
  mutate(FermataN = seq_along(Token)) -> chorales

chorales[1]

Cool! We now have a FermataN field, showing us the number of each fermata. But wait, using filter removed most of our tokens, so we can only see the notes during the fermatas, not leading up to them.

chorales |>
  select(Token)

Luckily, we can unfilter(), then filter again by bar:

chorales |> 
  unfilter() |>
  ungroup() |>
  group_by(File, Bar) 

Example 2

What if you wanted to transpose the bass voice an octave higher, but keep the other voices the same? Well, we could try:

chorales |>
  filter(Spine == 1) |>
  mutate(Transposed = transpose(Token, by = 'P8')) |> 
  unfilter()

We successfully unfiltered...but the Token field was unfiltered, not our new Transposed field. This is because the Tranposed field has no complement data: it didn't exist until after the filter was applied. Luckily, we can tell unfiltered() to use a specific complement, using the complement argument:

chorales |>
  filter(Spine == 1) |>
  mutate(Transposed = transpose(Token, by = 'P8')) |> 
  unfilter(complement = 'Token')

Thus, unfilter() can be used to combine the content of different fields into one field!



Computational-Cognitive-Musicology-Lab/humdrumR documentation built on Oct. 22, 2024, 9:28 a.m.