source('vignette_header.R')
Welcome to "Filtering humdrum data"!
This article explains how humdrumR can be used to find subsets of humdrum data.
Most humdrum datasets contain a wealth of information: rhythms, pitches, lyrics, etc.
However, the particular analysis you want to conduct today might not need all that information.
The first step after loading humdrum data
is usually to "filter out" the information you don't need.
(Of course, you are not erasing it from the original files, which are safe on your computer!)
This is often just one part of the larger task of preparing your data.
You don't necessarily need to do this, but it usually makes things simpler to get anything you don't need out of the way.
In this article, we'll use our pre-packaged Bach chorale data to illustrate how filtering works, so let's load it up:
chorales <- readHumdrum(humdrumRroot, 'HumdrumData/BachChorales/.*krn')
The simplest and most common way to filter your data is through "indexing," a standard means in programming of selecting subsets of data.
In R, we use square brackets, [], to index various data objects, including basic atomic vectors, data.tables, and lists.
For example, using brackets I can extract subsets of a vector:
myname <- c('N', 'A', 'T', ' ', 'C', 'O', 'N', 'D', 'I', 'T')
myname[1:3]
myname[5:10]
myname[c(1, 5)]
Watch out! R actually has two different ways of indexing:
you can use a single pair of matched brackets ([ ]) or a double pair ([[ ]]).
If you want a primer on how these two types of indexing are used in normal R objects, check out the R primer. You really don't need to do this before getting into humdrumR, but you might want to eventually.
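As a quick taste of the difference in plain R, here is a minimal sketch using a made-up list (the names and values are purely illustrative, and this is ordinary base R rather than humdrumR data):

```r
# A small named list standing in for "normal" R data (values are invented).
scores <- list(soprano = c(72, 74, 76), bass = c(48, 50))

# Single brackets keep the container: the result is still a list.
sub_single <- scores["soprano"]

# Double brackets reach inside and return the element itself.
sub_double <- scores[["soprano"]]

class(sub_single)
class(sub_double)
```

In base R, single brackets return a smaller version of the same kind of object, while double brackets extract what is inside it.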
We can use the single-bracket [] or double-bracket [[]] commands to index our humdrumR data objects.
Either type of bracket can accept either numeric or character vectors to index by.
To do indexing in a pipe, use the index() (equivalent to []) and index2() (equivalent to [[]]) commands.
The single-brackets are used to index whole pieces in your data.
So if you have a dataset of 19 files, but you only want to look at some of those files, you'd use the single-brackets [].
If we give a numeric value i to a single-bracket index command, that number will select the ith file in the dataset.
For example, if we want to look only at the fifth chorale, we can call:
chorales[5]
We can also give the command a vector of i numbers:
chorales[c(1,3,5)]
You might want to use the R sequence operator, :, to select a range of numbers:
chorales[6:10]
If you supply a character string to a humdrumR single-bracket index, humdrumR will treat that string as a regular expression and return all the files that contain a match to that expression in any data token, even if there is only one match.
For example, we might be only interested in files that use flat notes.
Since **kern represents flats with "-", we can simply write:
chorales['-']
Look at that, five of our ten chorales contain at least one flat. The other chorales have zero flats. Notice that we still get the whole files returned to us!
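The "any token matches, so keep the whole file" logic can be sketched in base R. This is only an illustration on invented token vectors, not a description of how humdrumR implements it; the list `files` and its contents are hypothetical.

```r
# Hypothetical stand-in for a dataset: each element holds one file's data tokens.
files <- list(
  chorale_a = c("4c", "4e", "4g"),
  chorale_b = c("4f", "4b-", "4dd"),
  chorale_c = c("2g", "2b", "2dd")
)

# A file is kept if ANY of its tokens matches the regular expression.
has_flat <- vapply(files, function(tokens) any(grepl("-", tokens)), logical(1))

# Whole files are returned, not just the matching tokens.
kept <- files[has_flat]
names(kept)
```

Only the second invented file contains a "-", so it is the only one kept, and all three of its tokens come along with it.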
The double-brackets are used to index within pieces in your data. Use them if you only want to work with certain spines or records within your pieces. The double-brackets apply the same filters separately to each file in the dataset.
In double-bracket indexing, you can provide two separate arguments, either individually or together:
- i is the first argument, which will index records within each file.
- j is the second argument, which will index spines within each file.

If you want to use the j (spine) argument by itself, you should put a comma before it, to indicate that you are skipping over i.
It is also good practice to be clear and put a comma after i, but this is not actually needed.
Basically, commands should look like this:
- chorales[[i, ]] (records)
- chorales[[ , j]] (spines)
- chorales[[i , j]] (records and spines)

Numeric values i or j can be given to double-bracket humdrumR index commands.
For i, the number is simply matched to record numbers in each file.
For example, you could extract the first fifty records from each file like so:
chorales[[1:50, ]]
For j, the number is matched to the spines in each file (from left to right).
So if you wanted the second spine in each file, you'd write:
chorales[[ , 2]]
If you give indices that are larger than the number of records or spines in a file, that file will be dropped. For example, if we call
chorales[[150:200, ]]
we only get three files back, because the other seven files don't have any records above 150!
(Notice that humdrumR won't remove the Exclusive interpretation or spine-closing (*-) records, since that would break the humdrum syntax.)
Character string values i or j can also be given to double-bracket humdrumR index commands.
humdrumR will treat the string as a regular expression and return all records (i) or spines (j) that contain any match to that expression in a data token, even if there is only one.
For example, let's (again) say we are interested in studying flats.
Since **kern represents flats as "-", we can find records that contain a flat by calling:
chorales[['-', ]]
As with our single-bracket search for flats (previous section), we only get five files back, because only five of the chorales contain any flats at all. However, we now see that all records that don't contain a flat are completely stripped away, leaving only a handful of records with at least one flat.
We can do the same thing with j (spines):
chorales[[ , '-']]
It looks like we get only one spine in each of our five files.
Those single spines are simply whichever spine in each file contained a flat.
But there are a couple of things to notice!
In the first file, what is now spine 1 was the old spine 2, which we can tell because the instrument interpretation is *Itenor.
However, if you look at the last of the five files
chorales[[ , '-']][5]
we see *Ibass, so the new spine 1 was actually the original spine 1 in that file.
And wait, there's more! If we dig in and look at the other files returned to us (which aren't shown by default),
chorales[[ , '-']][3]
we see that what is now the third file actually had flats in all four spines, and so all four spines are still there!
In some cases you may find the renumbering of spines on a per-file basis confusing, and you may want to keep track of the original spine numbers.
Fortunately, there is an option that might be helpful, which you will learn more about below: try specifying drop = FALSE.
(The drop argument is less intuitive, but is the standard R argument for this sort of thing.)
chorales[[ , '-', drop = FALSE]]
A nifty feature of R is that if you supply negative numbers to an indexer, R will remove the elements at those positions.
This works in humdrumR too, so if you want all the files except the first file, you could write:
chorales[-1]
Or if you want all the spines except the fourth spine, write:
chorales[[ , -4]]
Or if you want to remove the first 20 records from each file:
chorales[[-1:-20, ]]
(Again, humdrumR won't remove the Exclusive interpretation or spine-closing (*-) records, since that would break the humdrum syntax.)
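The negative-index rule is easy to check on a plain vector. The sketch below uses only base R and an invented vector of voice names; it also shows why a range like -1:-20 works, since the unary minus binds more tightly than the : operator.

```r
# An invented character vector to index into.
voices <- c("bass", "tenor", "alto", "soprano")

# Drop the first element.
without_first <- voices[-1]

# Drop a range: -1:-2 is parsed as (-1):(-2), i.e. c(-1, -2).
without_two <- voices[-1:-2]

without_first
without_two
```

Note that R does not allow positive and negative numbers to be mixed in a single index vector.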
The indexing commands (previous sections) only get you so far.
If you want to be more precise about filtering, use the tidyverse filter() method for humdrumR data.
filter() works exactly like mutate.humdrumR.
However, the expression(s) you give must evaluate to a logical vector (TRUE or FALSE); otherwise you will get an error.
filter() will take your logical result and filter out the data where it is FALSE.
We can reproduce the functionality of the [] and [[]] indexing operators (previous section) using filter():
chorales |> filter(Spine == 1)
But now we can filter based on any arbitrary criteria we want.
For example, we could extract tokens from odd-numbered spines in odd-numbered records and even-numbered spines in even-numbered records.
We'll use R's modulo operator %% to separate even and odd numbers (odd %% 2 == 1, even %% 2 == 0).
chorales |> filter((Record %% 2 == 0) == (Spine %% 2 == 0))
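To see why comparing the two parity tests selects the right cells, here is a small base R sketch on a made-up 4-by-4 grid. The Record and Spine columns are ordinary data frame columns invented for illustration, not humdrumR fields.

```r
# Every combination of four records and four spines (hypothetical grid).
grid <- expand.grid(Record = 1:4, Spine = 1:4)

# TRUE when Record and Spine are both even or both odd.
grid$keep <- (grid$Record %% 2 == 0) == (grid$Spine %% 2 == 0)

# The cells that would survive the filter.
kept <- grid[grid$keep, ]
kept
```

Half of the sixteen cells survive, in a checkerboard pattern.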
Would we ever want to do that? Probably not. However, returning to our study of flats, let's grab all the flat notes:
chorales |> filter(Token %~% '-')
Whoa, we're only seeing a couple of B flats in these files! But from before, we know that there are a lot more flats in the fourth file (after losing the files with no flats):
chorales |> filter(Token %~% '-') |> index(4)
Like mutate(), filter() will evaluate its logical filtering expression within groups created by group_by().
This can be a useful way to get some context for our searches.
For example, let's say we want to find flats again, but we want to see the whole bar of music that contains a flat.
We can do this by grouping by the Bar field (and the Piece field, of course).
We'll want to say "within each bar, if there are any flats, return TRUE for the whole bar, else return FALSE for the whole bar."
By default, filter() will recycle scalar results, so if you return a single TRUE/FALSE, it will be recycled to fill the whole group.
chorales |> group_by(Piece, Bar) |> filter(any(Token %~% '-'))
Not enough context for you? Maybe we should group the bars into even-odd pairs:
chorales |> group_by(Piece, floor(Bar / 2)) |> filter(any(Token %~% '-'))
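The idea of one TRUE/FALSE per group being recycled across the group can be imitated in base R with ave(). This is only a sketch of the concept on invented tokens and bar numbers; it is not how humdrumR's filter() is implemented.

```r
# Invented tokens and the bar each one belongs to.
tokens <- c("4c", "4e-", "4g", "4a", "4b", "4cc")
bar    <- c(1, 1, 1, 2, 2, 2)

# Token-level test: does this token contain a flat?
is_flat <- grepl("-", tokens)

# Group-level test: any() gives one value per bar, which ave() spreads
# back over every token in that bar.
keep <- ave(is_flat, bar, FUN = any)

tokens[keep]
```

Bar 1 contains a flat, so all three of its tokens are kept; bar 2 contains none, so all of its tokens are dropped.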
You probably noticed that, unlike the indexing commands [] and [[]], filter() doesn't seem to actually remove data that you filter out.
So when we say
chorales |> filter(Spine == 1)
we still have four spines, but spines 2 to 4 are just emptied.
This is correct.
What filter() actually does is turn any filtered data points into null (d) data points.
humdrumR then ignores that data automatically.
Why do we do this?
There are several reasons:
- Consider the [[ , j]] indexing for flats (above): when we simply removed the spines with no flats, it was hard to tell which spine was which in the result.
- Keeping the filtered data around makes the unfilter() command possible (details below).

Once you've done some filtering with filter(), if you want to get rid of empty parts of the data, you can do so using the commands removeEmptyFiles(), removeEmptySpines(), removeEmptyRecords(), removeEmptyPaths(), or removeEmptyStops().
By using these commands, we make sure that 1) you explicitly want to remove them and 2) the humdrum syntax is not broken, because only whole records/spines/paths/files are removed.
So for our spines example:
chorales |> filter(Spine == 1) |> removeEmptySpines()
or for records:
chorales |> filter(Record %% 2 == 0) |> removeEmptyRecords()
When humdrumR filters, it does not completely discard/erase the data.
The output of filter() is a subset of the original data: the complement of that subset is retained (but hidden).
This allows us to restore the filtered data, using unfilter():
chorales |> filter(Spine == 1) |> unfilter()
We can also explicitly switch to the complement using the complement() command:
chorales |> filter(Spine == 1) |> complement()
Why would we want to unfilter or complement our data? These functions enable a number of useful features that are not immediately obvious.
Let's say we want to find and look at the notes leading up to the third phrase ending in each chorale.
We'd first need to count the phrase endings (;) in each chorale.
We could filter our data, and simply seq_along() (enumerate) the fermatas in each spine of each piece:
chorales |> filter(Token %~% ';') |> group_by(Piece, Spine) |> mutate(FermataN = seq_along(Token)) -> chorales
chorales[1]
Cool!
We now have a FermataN field, showing us the number of each fermata.
But wait, using filter() removed most of our tokens, so we can only see the notes during the fermatas, not leading up to them.
chorales |> select(Token)
Luckily, we can unfilter(), then filter again by bar:
chorales |> unfilter() |> ungroup() |> group_by(File, Bar)
What if you wanted to transpose the bass voice an octave higher, but keep the other voices the same? Well, we could try:
chorales |> filter(Spine == 1) |> mutate(Transposed = transpose(Token, by = 'P8')) |> unfilter()
We successfully unfiltered...but the Token field was unfiltered, not our new Transposed field.
This is because the Transposed field has no complement data: it didn't exist until after the filter was applied.
Luckily, we can tell unfilter() to use a specific complement, using the complement argument:
chorales |> filter(Spine == 1) |> mutate(Transposed = transpose(Token, by = 'P8')) |> unfilter(complement = 'Token')
Thus, unfilter() can be used to combine the content of different fields into one field!