withinHumdrum
These functions are the primary means of working with humdrumR data. They allow us to perform arbitrary (free-form) manipulation of the data fields held within a humdrumR data object, with convenient functionality for ignoring null data, lagging data, grouping data, windowing, and more. The with() and within() functions, which come from base R, are the core functions. However, the dplyr "verbs" mutate(), summarize(), and reframe() can be used as well; they are equivalent to using with()/within() with particular arguments.
Usage:

## S3 method for class 'humdrumR'
with(
data,
...,
dataTypes = "D",
recycle = "no",
alignLeft = TRUE,
expandPaths = FALSE,
drop = TRUE,
.by = NULL,
variables = list()
)
## S3 method for class 'humdrumR'
within(
data,
...,
dataTypes = "D",
alignLeft = TRUE,
expandPaths = FALSE,
recycle = "pad",
.by = NULL,
variables = list()
)
## S3 method for class 'humdrumR'
mutate(
.data,
...,
dataTypes = "D",
recycle = "ifscalar",
alignLeft = TRUE,
expandPaths = FALSE,
.by = NULL
)
## S3 method for class 'humdrumR'
summarise(
.data,
...,
dataTypes = "D",
expandPaths = FALSE,
drop = FALSE,
.by = NULL
)
## S3 method for class 'humdrumR'
reframe(
.data,
...,
dataTypes = "D",
alignLeft = TRUE,
expandPaths = FALSE,
recycle = "pad",
.by = NULL
)
## S3 method for class 'humdrumR'
ggplot(data = NULL, mapping = aes(), ..., dataTypes = "D")
Arguments:

data, .data: A humdrumR data object.

...: Any number of expressions to evaluate. These expressions can reference the data's fields() by name. If the expressions are named, the names are used to name the new fields (or the column names of the data.table returned by with(..., drop = FALSE) or summarize(..., drop = FALSE)).

dataTypes: Which types of humdrum records to include. Defaults to "D". Must be a single character string containing any combination of the characters "G", "L", "I", "M", "D", or "d" (see below).

recycle: How should results be "recycled" (or padded) relative to the input length? Defaults to "no" for with(), "pad" for within() and reframe(), and "ifscalar" for mutate(). Must be a single character string: one of "no", "yes", "pad", "ifscalar", "ifeven", "never", or "summarize" (see below).

alignLeft: Should output that is shorter than the input be aligned to the left? Defaults to TRUE. Must be a singleton logical value.

expandPaths: Should spine paths be expanded before evaluating the expressions? Defaults to FALSE. Must be a singleton logical value.

drop: Whether to return a simplified data structure. Defaults to TRUE for with() and FALSE for summarize(). Must be a singleton logical value. This argument is conceptually similar to the drop argument of base R matrix indexing.

.by: Optional grouping fields; an alternative to using group_by(). Defaults to NULL. If not NULL, the data is grouped by these fields before the expressions are evaluated.

variables: A named list of values to interpolate into the expressions (see "Explicit variable interpolation" below). Defaults to an empty list(). Must be a named list.
Details:

These functions are the primary means of working with humdrumR data. They all allow you to write code that accesses and manipulates the raw fields() in your data. The main differences between them are what they do with the results of your code: with() and summarize() return results in normal, "raw" R formats, removed from the humdrumR data; in contrast, within(), mutate(), and reframe() always insert the results of your code into new fields() within your humdrum data. The other distinctions between these functions have to do with how they recycle/pad results (see below).
The with(), within(), mutate(), summarize(), and reframe() methods for humdrumR data all perform "non-standard evaluation" of any expressions you provide them as arguments. Basically, when you use a function like with(...) or mutate(...), the expressions you write inside the function call aren't evaluated right then and there; instead, R takes those expressions into the "environment" of your humdrum table, where all your fields are "visible" to the expressions. This means you can write code (expressions) that refers to your fields(), like Token or Spine. For example:

with(humData, ifelse(Spine > 2, kern(Token), recip(Token)))

Since all the fields in a humdrum table are the same length, the expressions you write can be, and generally should be, vectorized.
By default, with(), within(), etc. don't use the whole humdrum table, but instead only evaluate their expressions using rows containing non-null data tokens (Type == "D"). This means that interpretations, comments, barlines, and null data tokens are automatically ignored for you! This behavior is controlled by the dataTypes argument: you can choose to work with the other token types by providing a character string containing any combination of the characters G (global comments), L (local comments), I (interpretations), M (barlines), D (non-null data), or d (null data). For example, dataTypes = 'MDd' will evaluate your expressions on barline tokens (=), non-null data, and null data. See the ditto() manual for an example application of using dataTypes = 'Dd'. Keep in mind that humdrumR dynamically updates which tokens are considered "null" ("d") based on which fields are selected.
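For instance, here is a minimal sketch (assuming a humdrumR dataset like the humData object read in the Examples below) that tabulates all record types by including every token type:

humData |> with(table(Type), dataTypes = 'GLIMDd')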
If multiple expression arguments are provided, each expression is evaluated in order, from left to right. Each expression can refer to variables assigned in previous expressions (examples below).

Note: within any of these expressions, the humdrumR namespace takes priority. This means that, for example, if you use lag() within an expression, the humdrumR version of lag() will be used, even if you have loaded other packages which have their own lag() function. To use another package's function, you'll have to specify package::function(), for example, dplyr::lag(). This is only an issue when a function has the exact same name as a humdrumR function.
These functions all do some pre-processing of their expression arguments before evaluating them. This pre-processing provides some convenient "syntactic sugar" for working with humdrum data. There are currently five pre-processing steps:

- Explicit variable interpolation.
- The . placeholder for selected fields.
- Automatic argument insertion.
- "Lagged"-vector shorthand.
- "Splatted" arguments.

Each of these is explained below.
The variables argument can be provided as an (optional) list of named values. If any of the names in the variables list appear as symbols (variable names) in any expression argument, their value is interpolated in place of that symbol. For example, in

within(humData, kern(Token, simple = x), variables = list(x = TRUE))

the variable x will be changed to TRUE, resulting in:

within(humData, kern(Token, simple = TRUE))

This feature is most useful for programmatic purposes, like if you'd like to run the same expression many times but with slightly different parameters.
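As a hedged sketch of such programmatic use (again assuming a humData object, as in the Examples below), the same expression can be re-evaluated with different values interpolated for x; the loop variable useSimple here is purely illustrative:

for (useSimple in c(TRUE, FALSE)) {
  humData |>
    with(count(kern(Token, simple = x)), variables = list(x = useSimple)) |>
    print()
}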
The . variable can be used as a special placeholder representing the data's first selected field. For example,

humData |> select(Token) |> with(count(.))

will run count() on the Token field. Because new fields created by within()/mutate()/reframe() become the selected fields (details below), the . placeholder makes it easy to refer to the last new field in pipes. For example, in

humData |> mutate(kern(Token, simple = TRUE)) |> with(count(.))

the count() function is run on the output of the mutate(kern(Token, simple = TRUE)) expression.
Many humdrumR functions are designed to work with certain common fields in humdrumR data. For example, many pitch functions have a Key argument which can take the content of the Key field which readHumdrum() creates when there are key interpretations, like *G:, in the data. When an expression argument uses one of these functions but doesn't explicitly set that argument, humdrumR will automatically insert the appropriate field into the call (if the field is present). So, for example, if you run

humData |> mutate(Solfa = solfa(Token))

on a data set that includes a Key field, the expression will be changed to:

humData |> mutate(Solfa = solfa(Token, Key = Key))

If you don't want this to happen, you need to explicitly give a different Key argument, like:

humData |> mutate(Solfa = solfa(Token, Key = 'F:'))

(The Key argument can also be set to NULL.)

Another common/important automatic argument insertion is for functions with a groupby argument. These functions will automatically have appropriate grouping fields inserted into them. For example, the mint() (melodic intervals) command will automatically be applied using groupby = list(Piece, Spine, Path), which makes sure that melodic intervals are only calculated within spine paths, not between pieces/spines/paths (which wouldn't make sense!).

All humdrumR functions which use automatic argument interpolation mention it in their own documentation. For example, the ?solfa documentation mentions the treatment of Key in its "Key" section.
In music analysis, we very often want to work with "lagged" vectors of data. For example, we might want to look at the relationship between a vector and the previous values of the same vector, i.e., the vector offset or "lagged" by one index. The lag() and lead() functions are useful for this, always keeping vectors the same length so vectorization is never hindered. In expression arguments, we can use a convenient shorthand to call lag() (or lead()): any vector can be indexed with an integer argument named lag or lead (case insensitive), causing it to be lagged/led by that integer amount. (A vector indexed with lag = 0 returns the unchanged vector.) For example, the following two calls are the same:

humData |> with(Token[lag = 1])
humData |> with(lag(Token, 1))

This shorthand is most useful if the lag/lead index has multiple values: if the indexed object appears within a higher function call, each lag is inserted as a separate argument to that call. Thus, these two calls are also the same:

humData |> with(count(Token[lag = 1:2]))
humData |> with(count(lag(Token, 1), lag(Token, 2)))

Note that the lagging will also be automatically grouped within the fields list(Piece, Spine, Path), which is the default "melodic" structure in most data. This assures that a vector is never "lagged" across the boundary from one piece to another, or from one spine to the next. If you'd like to turn this off or change the grouping, you need to override it by adding a groupby argument to the lagged index, like Token[lag = 1, groupby = list(...)].

Using lagged vectors, since they are vectorized, is the fastest (computationally) and easiest way of working with n-grams. For example, if you want to create character-string 5-grams of your data, you could call:

humData |> with(paste(Token[lag = 0:4], sep = '-'))

Since the lagging is grouped by list(Piece, Spine, Path), these are true "melodic" n-grams, only created within spine paths within each piece.
"Splatting" refers to feeding a function a list/vector of arguments.
Sometimes we want to divide our data into pieces (a l\'a group_by()), but
rather than applying the same expression to each piece, we want to feed
the separate pieces as separate arguments to the same function.
You can use some
syntactic sugar
to do just this.
We can index any field in our call with a splat
argument, which must be a Field %in% x
.
For example,
humData |> with(list(Token[splat = Spine %in% 1:2]))
In this call, the Token
field will be divided into two groups, one where Spine == 1
and the other where
Spine == 2
; the first group (Spine == 1
) will be used as the first argument to list
, and the second group
(Spine == 2
) as the second argument.
Thus, within
translates the previous expression to this:
humData |> within(list(Token[Spine == 1], Token[Spine == 2]))
Splatting can be little weird, because there is nothing to assure that the splatted arguments
are all the same length, which we usually want (vectorization).
For example, in the previous example, there is no guarantee that Token[Spine == 1]
and Token[Spine == 2]
are the same length.
This just means we should only use splatting if we really understand the groups we are splatting.
For example, if there are no spine paths or stops in our data, then we can know that all spines
have the same number of data records, but only including all data records (null and non-null).
So, if I know there are no stops/paths in our data, we can run something like this:
humData |> within(dataTypes = 'Dd', count(Token[splat = Spine %in% 1:2]))
In some cases you may find that there are certain expressions that you use repeatedly. You can store expressions as variables by "quoting" them: the most common way to quote an expression in R is using the ~ operator, which creates what is called a "formula", essentially a quoted expression. You can also quote expressions using quote(). Once you've quoted an expression, you can pass it to with(), within(), mutate(), summarize(), or reframe().

Imagine that you have three different datasets (humData1, humData2, and humData3), and you'd like to evaluate the expression count(kern(Token, simple = TRUE)) in all three. Use the ~ operator to quote and save that expression to a variable, then use it with with():

countKern <- ~ count(kern(Token, simple = TRUE))

humData1 |> with(countKern)
humData2 |> with(countKern)
humData3 |> with(countKern)
For data that includes spine paths (which you can check for with anyPaths()), some analyses may require that spine paths be treated as contiguous "melodies." The expandPaths() function can be used to "expand" spine paths into new spines. The expandPaths argument to with()/within() will cause expandPaths() to be run on your data before your argument expressions are evaluated. After evaluation, the expanded parts of the data are removed from the output.
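For example, a minimal sketch (assuming humData contains spine paths, which anyPaths() would confirm) that computes melodic intervals with paths expanded into contiguous melodies:

humData |> within(Mint <- mint(Token), expandPaths = TRUE)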
The only differences between the with(), within(), mutate(), summarize(), and reframe() humdrumR methods are what they do with the results of the expressions passed to them. The major difference is that within(), mutate(), and reframe() put results into new fields in the humdrumR data, while with() and summarize() just return their results in "normal" R. The other differences between the functions simply relate to how the recycle and drop arguments are used (details below).
The recycle argument controls how the results of your code are, or aren't, recycled (or padded). When you write code using your humdrumR data's fields() as input, your results are inspected to see how long they are compared to the length of the input field(s). If any of your results are longer than the input, you'll get an error message; humdrumR can't (yet) handle that case. If any of your results are shorter than the input, the recycle argument controls what happens to that result. There are seven options:

- "no": The result is not recycled or padded. This option is not allowed for calls to within(), mutate(), or reframe().
- "yes": The result is recycled, no matter how long it is.
- "pad": The result is padded with NA values.
- "ifscalar": If the result is scalar (length 1), it is recycled; otherwise you see an error.
- "ifeven": If the result length evenly divides the input length, it is recycled; otherwise you see an error.
- "never": The result is not recycled. If the result does not match the input length, you see an error.
- "summarize": The result is not recycled. If the result is not scalar, even if it matches the input length, you see an error.
The result of padding/recycling also depends on the alignLeft argument:

- If alignLeft = TRUE, results are padded on the right, like c(result, NA, NA, ...).
- If alignLeft = FALSE, results are padded on the left, like c(..., NA, NA, result).

Recycling is also affected if the result's length does not evenly divide the input length. For example, consider a result c(1, 2, 3) which needs to be recycled to length 10:

- If alignLeft = TRUE, the result is recycled c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1).
- If alignLeft = FALSE, the result is recycled c(3, 1, 2, 3, 1, 2, 3, 1, 2, 3).
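The following plain base-R sketch (not humdrumR-specific code) reproduces the two alignments described above using rep_len():

rep_len(c(1, 2, 3), 10)       # left-aligned:  1 2 3 1 2 3 1 2 3 1
rev(rep_len(c(3, 2, 1), 10))  # right-aligned: 3 1 2 3 1 2 3 1 2 3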
The humdrumR with() and summarize() methods return "normal" R data objects. The only difference between the with() and summarize() methods is their default drop and recycle arguments:

with(..., drop = TRUE, recycle = 'no')
summarize(..., drop = FALSE, recycle = 'summarize')

If drop = TRUE, these methods return whatever your code's result is, with no parsing. This can be any kind of R data, including vectors, or objects like lm fits or tables. If drop = FALSE, the results will instead be returned in a data.table. If you are working with grouped data, the drop = FALSE output (a data.table) will include all grouping columns as well as the results of your expressions. If drop = TRUE and there is only one result per group, the grouping fields will be used to generate names for the output vector.
The humdrumR within(), mutate(), and reframe() methods always return a new humdrumR data object, with new fields created from your code's results. The only differences between these methods are their default recycle argument and the recycle options they allow:

- within(..., recycle = 'pad'): accepts any recycle option except "no".
- mutate(..., recycle = 'ifscalar'): only accepts "ifscalar" or "never".
- reframe(..., recycle = 'pad'): only accepts "pad" or "yes".
When running within(), mutate(), or reframe(), new fields() are added to the output humdrumR data. These new fields become the selected fields in the output. You can explicitly name newly created fields (recommended), or allow humdrumR to automatically name them (details below). When using with(..., drop = FALSE) or summarize(..., drop = FALSE), the column names of the output data.table are determined in the same way.

Note that within(), mutate(), and reframe() will (attempt to) put any result back into your humdrumR data, even if it doesn't make much sense. Things work best with vectors. Atomic vectors (i.e., numbers, character strings, or logical values) are usually the best to work with, but lists will work well too; just remember that you'll need to treat those fields as lists (e.g., you might need to use lapply() or Map() to work with list fields). Any non-vector result will be put into a list as well, padded as needed. For example, if you use lm() to compute a linear regression in a call to within(), the result will be a new field containing a list, with the first element in the list being a single lm fit object, and the rest of the list empty (padded to the length of the field).
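As a hedged sketch (assuming a Semits field, as created in the Examples below), a regression of pitch height against note position could be stored this way:

humData |> within(Fit <- lm(Semits ~ seq_along(Semits)))

The new Fit field is a list whose first element is the lm object (the rest is NA padding), so list-aware tools like lapply() are needed to work with it.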
If you don't explicitly name the code expressions you provide, the new fields are named by capturing the expression code itself as a character string. However, it is generally a better idea to explicitly name your new fields. This can be done in two ways:

- Base-R within() style: use the <- assignment operator inside your expression. Example: within(humData, Kern <- kern(Token)).
- Tidyverse mutate() style: provide the expression as a named argument with =. Example: mutate(humData, Kern = kern(Token)).

Either style can be used with any of the humdrumR methods.

When using <-, only top-level assignments will create a new field, which means only one field can be assigned per expression. For example,

within(humData, Semits <- semits(Token), Recip <- recip(Token))

will create two fields (Semits and Recip). However,

within(humData, {
  Semits <- semits(Token)
  Recip <- recip(Token)
})

will not. The result of expressions grouped by {} is always the last expression in the brackets. Thus, the last example above will only create one new field, corresponding to the result of recip(Token). However, the resulting field won't be called Recip! This is because only top-level assignments are used to name an expression. To name a multi-expression expression (using {}), you could do something like this:

within(humData, Recip <- {
  Semits <- semits(Token)
  recip(Token)
})

Of course, only the result of recip(Token) would be saved to Recip, so the Semits <- semits(Token) expression is doing nothing useful here.
All argument expressions passed to the with()/within() methods are evaluated in order, from left to right, so any assignments in a previous expression will be visible to the next expression. This means we can, for example, do this:

within(humData, Kern <- kern(Token), Kern2 <- paste0(Kern, nchar(Kern)))

Here, the use of Kern in the second expression refers to the Kern field assigned in the previous expression.
The with(), within(), mutate(), summarize(), and reframe() functions all work with grouped data, or data with contextual windows defined. When groups or windows are defined, all argument expressions are evaluated independently within each and every group/window. Results are then processed (including recycling/padding) within each group/window. Finally, the results are pieced back together in locations corresponding to the original data locations. Since groups are necessarily exhaustive and non-overlapping, the result locations are easy to understand. On the other hand, contextual windows may overlap, which means that non-scalar results could potentially overlap as well; in these cases, which result data lands where may be hard to predict.

These functions are most useful in combination with the subset(), group_by(), and context() commands.
Examples:

# with/within style:
humData <- readHumdrum(humdrumRroot, "HumdrumData/BachChorales/chor00[1-4].krn")
humData |> with(count(kern(Token, simple = TRUE), Spine))
humData |> within(Kern <- kern(Token),
Recip <- recip(Token),
Semits <- semits(Token)) -> humData
humData |>
group_by(Spine) |>
with(mean(Semits))
humData |>
group_by(Piece, Spine) |>
with(mean(Semits), drop = FALSE)
# tidyverse (dplyr) style:
humData <- readHumdrum(humdrumRroot, "HumdrumData/BachChorales/chor00[1-4].krn")
humData |> mutate(Kern = kern(Token),
Recip = recip(Token),
Semits = semits(Token)) -> humData
humData |>
group_by(Spine, Bar) |>
summarize(mean(Semits))
# dataTypes argument
humData |>
group_by(Piece, Spine) |>
within(paste(Token, seq_along(Token)))
humData |>
group_by(Piece, Spine) |>
mutate(Enumerated = paste(Token, seq_along(Token)),
dataTypes = 'Dd')
# recycle argument
humData |>
group_by(Piece, Bar, Spine) |>
mutate(BarMean = mean(Semits), recycle = 'ifscalar')
humData |>
group_by(Piece, Bar, Spine) |>
within(BarMean = mean(Semits), recycle = 'pad')