Acoustic analysis with soundgen


There are numerous programs out there for performing acoustic analysis, including several open-source options and R packages. For in-depth analysis of individual mammalian sounds it's hard to beat PRAAT (batch processing is possible, but a bit tricky, because PRAAT uses its own, rather unusual scripting language). For bird sounds, a sophisticated tool is Sound Analysis Pro. In R, the most general-purpose acoustic toolkit is the seewave package. Soundgen builds upon the functionality of seewave, adding high-level functions for sound synthesis (see the vignette on sound synthesis), manipulation, and analysis.

Reasons to use soundgen for acoustic analysis might be:

  1. User-friendly approach: a single call to the analyzeFolder function will give you a dataframe containing dozens of commonly used acoustic descriptors for each file in an entire folder. So if you'd rather get started with model-building without delving too deeply into acoustics, you are one line of code away from your dataset.
  2. Flexible pitch tracking: soundgen uses several popular methods of pitch detection in parallel, followed by their integration and postprocessing. While the abundance of control parameters may initially seem daunting, for those who do wish to delve deeply this makes soundgen's pitch tracker very versatile and offers a lot of power for high-precision analysis.
  3. Interactive apps for manual correction of pitch contours (pitch_app) and formants (formant_app).
  4. Audio segmentation with in-built optimization: the tools for syllable segmentation and detection of energy bursts are fast and simple (based on smoothed intensity contours) but quite flexible. Control parameters can also be optimized automatically as long as you have a manually segmented training sample.
  5. Additional specialized tools for acoustic analysis such as modulation spectra and self-similarity matrices.

Many of the large variety of existing tools for acoustic analysis were designed with a particular type of sound in mind, usually human speech or bird songs. Soundgen has been developed to work with human nonverbal vocalizations such as screams and laughs. These sounds are much harsher and noisier than ordinary speech, but they closely resemble vocalizations of other mammals. Acoustic analysis with soundgen may be particularly appropriate if you are extracting a large number of acoustic predictors from a large number of audio files, for example:

The most relevant soundgen functions for acoustic analysis are:

TIP Soundgen's functions for acoustic analysis are not meant to be exhaustive. MFCC extraction is readily available in R (e.g., with tuneR::melfcc), so there was no need to duplicate it in soundgen. Linear predictive coding (LPC) is also implemented in R (see phonTools::lpc and phonTools::findformants). As a convenience, soundgen::analyze shows the output of phonTools::findformants, but for serious formant analysis you might want to use formant_app or another interactive program like PRAAT and check everything manually. A good approach may be to start with pitch_app to ensure accurate pitch tracking, then run soundgen::analyze with these manual pitch values to get a table of many common acoustic predictors.

This vignette is designed to show how soundgen can be used effectively to perform acoustic analysis. It assumes that the reader is already familiar with key concepts of phonetics and bioacoustics.

TIP This vignette mostly covers acoustic analysis with soundgen. In many cases, there are related R functions from other packages. For a tour-de-force overview of alternatives together with highly accessible theoretical explanations of sound characteristics, see Sueur (2018) "Sound analysis and synthesis with R"

Acoustic analysis with analyze

To demonstrate acoustic analysis in practice, let's begin by generating a sound with a known pitch contour. To make pitch tracking less trivial and demonstrate some of its challenges, let's add some noise, subharmonics, and jitter:

s1 = soundgen(sylLen = 900, temperature = 0.001,
              pitch = list(time = c(0, .3, .8, 1), 
                           value = c(300, 900, 400, 1300)),
              noise = c(-40, -20), 
              subFreq = 100, subDep = 20, jitterDep = 0.5, 
              plot = TRUE, ylim = c(0, 4))
# playme(s1)  # replay as many times as needed w/o re-synthesizing the sound

The contour of f0 is determined by our pitch anchors, so we can calculate the true median pitch:

true_pitch = getSmoothContour(anchors = list(time = c(0, .3, .8, 1),
                                             value = c(300, 900, 400, 1300)),
                              len = 1000)  # any length will do
median(true_pitch)  # 633 Hz

Basic principles

At the heart of acoustic analysis with soundgen is the short-time Fourier transform (STFT): we look at one short segment of sound at a time (one STFT frame), analyze its spectrum using Fast Fourier Transform (FFT), and then move on to the next - perhaps overlapping - frame. As the analysis window slides along the signal, STFT shows which frequencies it contains at different points of time. The nuts and bolts of STFT are beyond the scope of this vignette, but they can be found in just about any textbook on phonetics, acoustics, digital signal processing, etc. For a quick R-friendly introduction, see seewave vignette on acoustic analysis.

Putting the spectra of all frames together, we get a spectrogram. analyze calls another function from soundgen package, spectrogram, to produce a spectrogram and then plot pitch candidates on top of it. See the examples in ?spectrogram for plot customization like color themes, contrast, brightness, etc. To analyze a sound with default settings and plot its spectrogram, all we need to specify is its sampling rate (the default in soundgen is 16000 Hz):

a1 = analyze(s1, samplingRate = 16000, plot = TRUE, ylim = c(0, 4))
# a1$detailed  # many acoustic predictors measured for each STFT frame
median(true_pitch)  # true value, as synthesized above
median(a1$detailed$pitch, na.rm = TRUE)  # our estimate
# Pitch postprocessing is stochastic (see below), so the contour may vary.
# Many candidates are off target, mainly b/c of misleading subharmonics.

There are several key parameters that control the behavior of STFT and affect all extracted acoustic variables. The same parameters serve as arguments to spectrogram. As a result, you can immediately see what frame-by-frame input you have fed into the algorithm for acoustic analysis by visually inspecting the produced spectrogram. If you can hear f0, but can't see individual harmonics in the spectrogram, the pitch tracker probably will not see them, either, and will therefore fail to detect f0 correctly. The first remedy is thus to adjust STFT settings, using the spectrogram for visual feedback:

Basic spectral descriptives

Apart from pitch tracking, analyze calculates and returns several acoustic characteristics from each non-silent STFT frame:

TIP: many of these descriptors are calculated separately for the entire sound and only for the voiced frames, resulting in extra output variables like "amplVoiced".

Custom spectral descriptives

The function soundgen::analyze returns a few spectral descriptives that make sense for nonverbal vocalizations, but additional predictors may be useful for other applications (bird songs, non-biological sounds, etc.). Many popular spectral descriptors are mathematically trivial to derive - all you need is the spectrum for each STFT frame, or perhaps even the average spectrum of the entire sound. Here is how you can get these spectra.

For the average spectrum of an entire sound, go no further than seewave::spec or seewave::meanspec:

spec = seewave::spec(s1, f = 16000, plot = FALSE)  # FFT of the entire sound
avSpec = seewave::meanspec(s1, f = 16000, plot = FALSE)  # STFT followed by averaging
# either way, you get a dataframe with two columns: frequencies and their strength

If you are interested in how the spectrum changes over time, extract frame-by-frame spectra - for example, with spectrogram(..., output = 'original'):

spgm = spectrogram(s1, samplingRate = 16000, output = 'original', plot = FALSE)
# rownames give you frequencies in KHz, colnames are time stamps in ms

Let's say you are working with frame-by-frame spectra and want to calculate skewness, the 66.6th percentile, and the ratio of energy above/below 500 Hz. Before you go hunting for a piece of software that returns exactly those descriptors, consider this. Once you have normalized the spectrum to add up to 1, it basically becomes a probability density function (pdf), so you can summarize it in the same way as you would any other distribution of a random variable. Look up the formulas you need and just do the raw math:

# Transform spectrum to pdf (all columns should sum to 1):
spgm_norm = apply(spgm, 2, function(x) x / sum(x))
# Set up a dataframe to store the output
out = data.frame(skew = rep(NA, ncol(spgm)),
                 quantile66 = NA,
                 ratio500 = NA)
# Process each STFT frame
for (i in 1:ncol(spgm_norm)) {
  # Absolute spectrum for this frame
  df = data.frame(
    freq = as.numeric(rownames(spgm_norm)),  # frequency (kHz)
    d = spgm_norm[, i]                       # density
  # plot(df, type = 'l')

  # Skewness (see
  m = sum(df$freq * df$d)  # spectral centroid, kHz
  out$skew[i] = sum((df$freq - m)^3 * df$d)

  # 66.6th percentile (2/3 of density below this frequency)
  out$quantile66[i] = df$freq[min(which(cumsum(df$d) >= 2/3))]  # in kHz

  # Energy above/below 500 Hz
  out$ratio500[i] = sum(df$d[df$freq >= .5]) / sum(df$d[df$freq < .5])

If you need to do this analysis repeatedly, just wrap the code into your own function that takes a wav file as input and returns all these spectral descriptives. You can also save the actual spectra of different sound files and add them up to obtain an average spectrum across multiple sound files, work with cochleograms instead of raw spectra (check out tuneR::melfcc), etc. Be your own boss!


The digital representation of a sound is a long vector of numbers on some arbitrary scale, say [-1, 1]. Values further from zero correspond to a higher amplitude - in physical terms, to greater pertubations of sound pressure level caused by the propagating sound wave. A smoothed line following peak amplitude values is known as an amplitude envelope. However, there is no simple correspondence between the absolute height of amplitude peaks and the subjectively experienced loudness of the corresponding sound. A commonly reported measure of sound intensity is its root mean square (RMS) amplitude, which takes into account the average value of sound pressure, and not only the height of peaks. More sophisticated estimates of loudness also take into account the relative sensitivity of human hearing to different frequencies, masking of adjacent tones in the time and frequency domains, etc.

To illustrate the differences between these estimates, let's look at a pure tone sweeping with fixed absolute amplitude from 100 to 4000 Hz over 2 s:

dur = 2  # 2 s duration
samplingRate = 16000
f0 = seq(100, 8000, length.out = samplingRate * dur)
sweep = sin(2 * pi * cumsum(f0) / samplingRate)
# playme(sweep)
# spectrogram(sweep, 16000)
# plot(sweep, type = 'l')

Smoothed absolute amplitude envelope (flat):

seewave::env(sweep, f = samplingRate, envt = 'abs', msmooth=c(50, 0))

RMS amplitude per STFT frame, as returned by analyze(), column "ampl":

a = analyze(sweep, samplingRate = samplingRate, pitchMethods = NULL, plot = FALSE)
plot(seq(0, dur, length.out = length(a$detailed$ampl)), 
     a$detailed$ampl, type = 'b', xlab = 'Time, s')

An estimate of subjectively experienced loudness in sone, column "loudness":

plot(seq(0, dur, length.out = length(a$detailed$loudness)), a$detailed$loudness, 
     type = 'b', xlab= 'Time, s')

Soundgen also has a dedicated function for calculating the loudness and plotting the output, getLoudness(). Loudness values are overlaid on the spectrogram - observe how the loudness peaks as f0 reaches about 2-3 kHz and then drops. The absolute values in sone are only an approximation, since they are dictated by the playback device (e.g. your headphones), but the change of loudness within one sound, or across different sounds analyzed with the same settings, is informative.

l = getLoudness(sweep, samplingRate = samplingRate)

Pitch tracking

If you look at the source code of soundgen::analyze() and embedded functions, you will see that almost all of this code deals with a single acoustic characteristic: fundamental frequency (f0) or its perceptual equivalent, pitch. That's because pitch is both highly salient to listeners and notoriously difficult to measure accurately. The approach followed by soundgen's pitch tracker is to use several different estimates of f0, each of which is better suited to certain types of sounds. You can use any pitch tracker individually, but their output is also automatically integrated and postprocessed so as to generate the best overall estimate of frame-by-frame pitch. Autocorrelation is needed to calculate the harmonics-to-noise ratio (HNR) of an STFT frame, and then this information is used to adjust the expectations of the cepstral and spectral algorithms. In particular, if autocorrelation suggests that the pitch is high, confidence in cepstral estimates is attenuated; and if autocorrelation suggests that HNR is low, thresholds for spectral peak detection are raised, making spectral pitch estimates more conservative.

The plot below shows a spectrogram of the sound with overlaid pitch candidates generated by five different methods (listed in pitchMethods), with a very vague prior - that is, with no specific expectations regarding the true range of pitch values. The size of each point shows the certainty of estimation: smaller points are calculated with lower certainty and have less weight when all candidates are integrated into the final pitch contour (blue line).

a = analyze(s1, samplingRate = 16000, priorSD = 24, 
            pitchMethods = c('autocor', 'cep', 'dom', 'spec', 'hps'),
            plot = TRUE, ylim = c(0, 4))

Different pitch tracking methods have their own pros and cons. Cepstrum is helpful for speech but pretty useless for high-frequency whistles or screams, harmonic product spectrum (hps) is easily misled by subharmonics (as in this example), lowest dominant frequency band (dom) can't handle low-frequency wind noise, etc. The default is to use "dom" and "autocor" as the most generally applicable, but you can experiment with all methods and check which ones perform best with the specific type of audio that you are analyzing. Each method can also be fine-tuned (see below), but first it is worth considering the general pitch-related settings.

General settings

analyze has a few arguments that affect all methods of pitch tracking:

par(mfrow = c(1, 2))
# default prior in soundgen
getPrior(priorMean = 300, priorSD = 6)
# narrow peak at 2 kHz
getPrior(priorMean = 2000, priorSD = 1)
par(mfrow = c(1, 1))

TIP The final pitch contour can still pass through low-certainty candidates, so the prior is a soft alternative (or addition) to the inflexible bounds of pitchFloor and pitchCeiling But the prior has a major impact on pitch tracking, so it is by default shown in every plot

Pitch tracking methods

Having looked at the general settings, it is time to consider the theoretical principles behind each pitch tracking method, together with arguments to analyze that can be used to tweak each one.


Time domain: pitch by autocorrelation, PRAAT, pitchAutocor.

This is an R implementation of the algorithm used in the popular open-source program PRAAT (Boersma, 1993). The basic idea is that a harmonic signal correlates with itself most strongly at a delay equal to the period of its fundamental frequency (f0). Peaks in the autocorrelation function are thus treated as potential pitch candidates. The main trick is to choose an appropriate windowing function and adjust for its own autocorrelation. Compared to other methods implemented in soundgen, pitch estimates based on autocorrelation appear to be particularly accurate for relatively high values of f0. The settings that control pitchAutocor are:

To use only autocorrelation pitch tracking, but with lower-than-default voicing threshold and more candidates, we can do something like this (prior is disabled so as not to influence the certainties of different pitch candidates):

a = analyze(s1, samplingRate = 16000, 
            plot = TRUE, ylim = c(0, 4), priorMean = NA,
            pitchMethods = 'autocor',
            pitchAutocor = list(autocorThres = .45, 
                                # + plot pars if needed
                                col = 'green'),
            nCands = 3)

Dominant frequency

Frequency domain: the lowest dominant frequency band, dom.

If the sound is harmonic and relatively noise-free, the spectrum of a frame typically has little energy below f0. It is therefore likely that the first spectral peak is in fact f0, and all we have to do is choose a reasonable threshold to define what counts as a peak. Naturally, there are cases of missing f0 and misleading low-frequency noises. Nevertheless, this simple estimate is often surprisingly accurate, and it may be our best shot when the vocal cords are vibrating in a chaotic fashion (deterministic chaos). For example, sounds such as roars lack clear harmonics but are perceived as voiced, and the lowest dominant frequency band (which may also be the first or second formant) often corresponds to perceived pitch.

The settings that control dom are:

For the sound we are trying to analyze, we can increase domSmooth and/or raise domThres to ignore the subharmonics and trace the true pitch contour:

a = analyze(s1,  samplingRate = 16000, 
            plot = TRUE, ylim = c(0, 4), priorMean = NA,
            pitchMethods = 'dom',
            pitchDom = list(domThres = .1, domSmooth = 500, cex = 1.5))


Frequency domain: pitch by cepstrum, pitchCep.

Cepstrum is the FFT of log-spectrum. It may be a bit challenging to wrap one's head around, but the main idea is quite simple: just as FFT is a way to find periodicity in a signal, cepstrum is a way to find periodicity in the spectrum. In other words, if the spectrum contains regularly spaced harmonics, its FFT will contain a peak corresponding to this regularity. And since the distance between harmonics equals the fundamental frequency, this cepstral peak gives us f0. Actually, in soundgen the FFT is applied to raw spectrum, not log-spectrum, since it appears to produce better results. Cepstrum is not very useful when f0 is so high that the spectrum contains only a few harmonics, so soundgen automatically discounts the contribution of high-frequency cepstral estimates.

The settings that control pitchCep are:

a = analyze(s1, samplingRate = 16000, 
            plot = TRUE, ylim = c(0, 4), priorMean = NA,
            pitchMethods = 'cep',
            pitchCep = list(cepThres = .3),
            nCands = 2)

Ratio of harmonics

Frequency domain: ratios of harmonics, BaNa, pitchSpec.

All harmonics are multiples of the fundamental frequency. The ratio of two neighboring harmonics is thus predictably related to their rank relative to f0. For example, (3 * f0) / (2 * f0) = 1.5, so if we find two harmonics in the spectrum that have a ratio of exactly 1.5, it is likely that f0 is half the lower one (Ba et al., 2012). This is the principle behind the spectral pitch estimate in soundgen, which seems to be particularly useful for noisy, relatively low-pitched sounds.

The settings that control pitchSpec are:

a = analyze(s1, samplingRate = 16000, 
            plot = TRUE, ylim = c(0, 4), priorMean = NA,
            pitchMethods = 'spec',
            pitchSpec = list(specThres = .2, specPeak = .1, cex = 2),
            nCands = 2)

Harmonic product spectrum

Frequency domain: pitchHps.

This is a simple spectral method based on downsampling the spectrum several times and multiplying the resulting spectra. This emphasizes the lowest harmonic present in the signal, which is hopefully f0. By definition, this method is easily misled by subharmonics (additional harmonics between the main harmonics of f0), but it can be useful in situations when the subharmonic frequency is actually of interest.

The settings that control pitchHps are:

a = analyze(s1, samplingRate = 16000, 
            plot = TRUE, ylim = c(0, 4), priorMean = NA,
            pitchMethods = 'hps',
            pitchHps = list(hpsNum = 2, # try 8 or so to measure subharmonics
                            hpsThres = .2))

TIP Each pitch tracking method can be tweaked to produce reasonable results for a particular sound (read: to agree with human intuition). The real trick is to find settings that are accurate on average, across a wide range of sounds and recording conditions. The default settings in analyze are the result of optimization against manually verified pitch measurements of a corpus of 260 human non-linguistic vocalizations. For other types of sounds, you will need to perform your own manual tweaking and/or formal optimization.

Missing fundamental

The perception of pitch does not depend on the presence of the lowest partial corresponding to the actual fundamental frequency: even if it is removed or masked by low-frequency noise, the pitch remains unchanged. By definition, the "dom" estimate of pitch cannot function when this lowest partial is missing (it works by literally tracking the lowest dominant frequency band). However, the remaining four pitch tracking methods - autocorrelation, cepstrum, BaNa, and HPS - have no problem dealing with a missing fundamental frequency because they take the entire spectrum into account, not only the lowest partial.

A sound with four partials at 300 Hz (f0), 600 Hz, 900 Hz, and 1200 Hz:

s_withf0 = soundgen(sylLen = 600, pitch = 300,
              rolloffExact = c(1, 1, 1, 1), formants = NULL, lipRad = 0)
# playme(s_withf0)
seewave::meanspec(s_withf0, f = 16000, dB = 'max0', flim = c(0, 3))

Now the same sound, but without the first partial (f0):

s_withoutf0 = soundgen(sylLen = 600, pitch = 300,
              rolloffExact = c(0, 1, 1, 1), formants = NULL, lipRad = 0)
# playme(s_withoutf0)  # you can clearly hear the difference
seewave::meanspec(s_withoutf0, f = 16000, dB = 'max0', flim = c(0, 3))

No problem with pitch tracking (except for the dom method), although the pitch contour is following a partial that is no longer there:

a_withoutf0 = analyze(s_withoutf0, 16000, , priorMean = NA,
             pitchMethods = c('autocor', 'dom', 'cep', 'spec', 'hps'),
             plot = TRUE, ylim = c(0, 2), dynamicRange = 60, osc = FALSE)

The implications are as follows: if the lower part of your signal is degraded (wind noise, an engine running, somebody else talking in the background, etc.), you can apply a high-pass filter to remove low frequencies. Even if you filter out the first partial by doing so, pitch tracking will still be possible. BUT: do NOT use the "dom" pitch estimate if f0 is either filtered out or invisible because of noise!

Postprocessing of pitch contour

Pitch postprocessing in soundgen includes a whole battery of distinct operations through which the pitch candidates generated by one or more tracking methods are integrated into the final pitch contour. We will look at them one by one, in the order in which they are performed in analyze. But first of all, here is how to disable them all:

a = analyze(
  samplingRate = 16000, plot = TRUE, ylim = c(0, 4), priorMean = NA,
  shortestSyl = 0,        # any length of voiced fragments
  interpol = NULL,        # don't interpolate missing f0 values
  pathfinding = 'none',   # don't look for optimal path through candidates
  snakeStep = 0,          # don't run the snake
  smooth = 0              # don't run median smoothing

When the sound is not too tricky and enough pitch candidates are available, postprocessing actually makes little difference. In terms of the accuracy of median estimate of f0, you are likely to get a good result even with postprocessing is completely disabled. However, if you are interested in the actual intonation contours, not just the global average, postprocessing can help a lot.

Continuous voiced fragments

It often makes sense to make assumptions about the possible temporal structure of voiced fragments, such as their minimum expected length (shortestSyl) and spacing (shortestPause). If these two parameters are positive numbers, the first stage of postprocessing is to divide the sound into continuous voiced fragments that satisfy these assumptions. The default minimum length of a voiced fragment is a single STFT frame. If shortestSyl is longer than a single frame, then we need at least two adjacent voiced frames to start a new voiced fragment. A single voiced frame surrounded by unvoiced frames then gets discarded (assumed to be unvoiced). If two voiced fragments are separated by less than shortestPause, they are merged. What this means is simply that they are processed as a single syllable by pathfinder() (see below). No interpolation takes place at this stage.

The next few blocks of postprocessing are performed by an internal function, soundgen:::pathfinder(). Its input is a matrix of pitch candidates for each frame of a single voiced syllable, usually with multiple candidates per frame. Each candidate is also associated with a different certainty. We want to find a good path through these candidates - that is, a pitch contour that both passes close to the strongest candidates and minimizes pitch jumps, producing a relatively smooth contour. The simplest first approximation is to take a mean of all pitch candidates per frame weighted by their certainty - the "center of gravity" of pitch candidates - and for each frame to select the candidate that lies closest to this center of gravity. This initial guess at a reasonable path may or may not be processed further, depending on the settings described below.


To make sure we have at least one pitch candidate for every frame in the supposedly continuous voiced fragment, we interpolate to fill in any missing values. The same algorithm also adds new pitch candidates with certainty interpol$cert if a frame has no pitch candidates within interpol$tol of the median of the "center of gravity" estimate over plus-minus interpol$win frames. The frequency of new candidates is equal to this median. For example, if interpol$tol = 0.05, new candidates are calculated if there are none within 0.95 to 1.05 times the median over the interpolation window. You can also enable interpolation to fill in unvoiced frames, but without adding new pitch candidates in voiced frames. To do so, set interpol$tol = Inf.

Here is an example (interpolated segments are shown with a dotted line)

a1 = analyze(s1, samplingRate = 16000, step = 10, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .45),
             interpol = NULL,   # disable interpolation
             pathfinding = 'none',  
             snakeStep = 0, smooth = 0, plot = FALSE)
a2 = analyze(s1, samplingRate = 16000, step = 10, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .45),
             # interpol = list(win = 100, tol = .1),
             pathfinding = 'none', 
             snakeStep = 0, smooth = 0, plot = FALSE)  
plot(a1$detailed$time, a1$detailed$pitch, type = 'l', 
     xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(a2$detailed$time, a2$detailed$pitch, type = 'l', 
       col = 'red', lty = 3)


The next step after interpolation is pathfinding proper - searching for the optimal path through pitch candidates. If pathfinding = "none", this step is skipped, so we just continue working with the path that lies as close as possible to the (possibly interpolated) center of gravity of pitch candidates. If pathfinding = "fast" (the default option), a simple heuristic is employed, in which we walk down the path twice, first left to right and then right to left, trying to minimize the cost measured as a weighted mean of the distance from the center of gravity and the deviation from a smooth contour. The key setting is certWeight, which specifies how much we prioritize the certainty of pitch candidates vs. pitch jumps / the internal tension of the resulting pitch curve. Low certWeight (close to 0): we are mostly concerned with avoiding rapid pitch fluctuations in our contour. High certWeight (close to 1): we mostly pay attention to our certainty in particular pitch candidates. The example below is intended as an illustration of how pathfinding works, so all other types of smoothing are disabled, forcing the final pitch contour to pass strictly through existing candidates.

a1 = analyze(s1, samplingRate = 16000, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .15), nCands = 2,
             snakeStep = 0, smooth = 0, interpol = list(tol = Inf),
             certWeight = 0,  # minimize pitch jumps
             plot = TRUE, main = 'Minimize jumps', 
             showLegend = FALSE, osc = FALSE, ylim = c(0, 3))  
a2 = analyze(s1, samplingRate = 16000, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .15), nCands = 2,
             snakeStep = 0, smooth = 0, interpol = list(tol = Inf),
             certWeight = 1,  # minimize deviation from high-certainty candidates
             plot = TRUE, main = 'Pass through top cand-s', 
             showLegend = FALSE, osc = FALSE, ylim = c(0, 3))

The final option is pathfinding = 'slow', which calls stats::optim(method = 'SANN') to perform simulated annealing. This is a more powerful algorithm than the simple heuristic in pathfinding = 'fast', but it is called "slow" for a good reason. In case you have plenty of time, it does improve the results, but note that this algorithm is stochastic, so each run may produce different results. Use an additional argument, annealPars, to control the algorithm. See ?stats::optim for more details.


What is here esoterically referred to as the "snake" can be seen as an alternative to the pathfinding algorithms above, although both can also be performed sequentially. Whereas pathfinding attempts to find the best path through existing pitch candidates, the snake wiggles the contour under a weighted combination of (a) elastic forces trying to snap the pitch contour to a straight line and (b) the pull of high-certainty pitch candidates. In a sense the snake is thus a combination of interpolation and pathfinding: like interpolation, it can add new values different from existing candidates, and like pathfinding, it balances the certainty in candidates against the smoothness of the resulting contour.

The only new control parameter in the snake module (apart from certWeight) is snakeStep, which controls the speed of adaptation (the default is 0.05). The higher it is, the faster the snake "wiggles". This reduces processing time, but introduces a risk of "overshooting". If snakeStep is too low (close to 0), the snake moves too slowly and may fail to reach its optimal configuration. To disable the snake module, set snakeStep = NULL. You can also produce a separate plot of the snake by setting snakePlot = TRUE, as in the example below (again, all other postprocessing is disabled to show what the snake alone will do). The zigzagging line is the initial contour (the path through pitch candidates that lie as close as possible to the center of gravity of each frame), the smooth blue line is the pitch contour after running the snake, and the green lines trace the progress of iterative snake adaptation. Note that at certWeight = 0.1 the snake is heavily biased towards producing a smooth contour, regardless of its distance from high-certainty pitch candidates.

a1 = analyze(s1, samplingRate = 16000, plot = FALSE, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .2), nCands = 2,
             pathfinding = 'none', smooth = 0, interpolTol = Inf,
             certWeight = 0.1,  # like pathfinding, the snake is affected by certWeight
             snakeStep = 0.05, snakePlot = TRUE)

TIP Should you use pathfinding, the snake, or both? Pathfinding makes more sense if you want the final contour to pass strictly through existing candidates, say if there are relatively few candidates, most of which are right on target and some completely off. In these conditions the snake will not do much (but not much harm, either). The snake becomes attractive if you have a lot of candidates from different pitch tracking methods, many of which are slightly off and should be averaged. In addition, the more garbage you expect among your pitch candidates, the more you might want to interpolate and apply median smoothing

Median smoothing

The final postprocessing stage is median smoothing. It is conceptually similar to interpolation, except that by now there is only a single f0 value left per frame, so we can forget about the multiple candidates and their certainties. It wouldn't make much sense to apply kernel smoothing to this curve: the snake can usually do this in a smarter way, since it does know about the multiple candidates and their certainties. What we want from the smoothing algorithm is to detect and correct only outliers: the values that stick out from the surrounding frames. The parameters that control this module are smooth and smoothVars.

If smooth is a positive number, contours of the variables in smoothVars are smoothed using a customized version of median smoothing. This modifies only the values that deviate considerably from the moving median and preserves all other values (so this is a bit different from applying a moving median or kernel smoothing). smooth controls both the tolerated deviance and the size of the window for calculating a moving median. smooth = 1 (the default) corresponds to a window of ~100 ms and tolerated deviation of ~4 semitones. This smoothing can be applied to any measured value, not only the final pitch contour. The default is smoothVars = c('pitch', 'dom'). To turn off the median smoothing, set smooth = 0 or smoothVars = NULL.

a1 = analyze(s1, samplingRate = 16000, priorMean = NA,
             pitchMethods = 'cep', pitchCep = list(cepThres = .1), nCands = 3,
             pathfinding = 'none', snakeStep = 0, interpolTol = Inf,
             smooth = 0,  # no smoothing
             summaryFun = NULL, plot = FALSE)
a1$detailed$medianSmooth = soundgen:::medianSmoother(
  data.frame(pitch = a1$detailed$pitch), smoothing_ww = 3, smoothingThres = 1
plot(pitch ~ time, data = a1$detailed, type = 'l', xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(medianSmooth ~ time, data = a1$detailed, type = 'l', col = 'red', lty = 3, lwd = 2)
# dotted line = with median smoothing

TIP Pathfinding ("slow", "fast" or "none") is the only postprocessing module that does not deviate from pitch candidates actually returned by pitch tracking algorithms. Interpolation, snake, and median smoothing produce new pitch values per frame, which may be quite different from any actual candidates

Naturally, the extracted pitch contours can be smoothed outside analyze() using any algorithm you like, such as GAM, low-pass filter, etc. For users of PRAAT, there is a function that implements its smoothing algorithm:

a1$detailed$ps = pitchSmoothPraat(
  bandwidth = 5, # cutoff above 5 Hz (5 pitch values per s)
  samplingRate = 1000/25)  # 1/step = number of pitch values per s
plot(pitch ~ time, data = a1$detailed, type = 'l', xlab = 'Time, ms', ylab = 'Pitch, Hz')
points(ps ~ time, data = a1$detailed, type = 'l', col = 'red', lty = 3, lwd = 2)
# dotted line = after low-pass smoothing

Customization of pitch plotting

When analyzing a sound, and even when batch-processing an entire folder, it is often helpful to plot both the final pitch contour - perhaps overlaid on a spectrogram - and individual pitch candidates. You can easily do so from analyze, as in all the examples above. The default plotting parameters can also be customized, for example:

a = analyze(
  s1, samplingRate = 16000, plot = TRUE, priorMean = NA,
  # options for spectrogram(): see ?spectrogram
  xlab = 'Time (ms)',
  main = 'My spectrogram',
  dynamicRange = 90,
  contrast = .5,
  brightness = -0.3,
  colorTheme = 'seewave',
  ylim = c(0, 4),
  # + other pars passed to soundgen:::filled.contour.mod()

  # options for oscillogram
  osc = 'dB', 
  heights = c(3, 1),

  # options for plotting the final pitch contour (line)
  pitchPlot = list(       
    col = 'black',
    lwd = 5,
    lty = 3
    # + other pars passed to base::lines()

  # options for plotting pitch candidates (points)
  pitchAutocor = list(col = rgb(0, 1, 0, .5), pch = 16, cex = 2),
  pitchDom = list(col = 'red', cex = 4)

You can also suppress plotting any of these three components: the spectrogram, the final pitch contour, or individual pitch candidates. To plot pitch candidates but not the spectrogram, set brightness = 1. To suppress plotting the pitch contour, set pitchPlot = list(lwd = 0). To suppress plotting pitch candidates, set their cex to 0, eg pitchAutocor = list(cex = 0).

To save the plots, specify a valid path ('' = the same folder as where input audio files live). This is mostly useful when you are batch-processing an entire folder full of audio files and want to save the plots for manual checking of extracted plot contours. Note that the file '00_clickablePlots_analyze.html' offers an easy way to inspect all the plots together in a web browser, and clicking each plot plays the corresponding audio. This works even if the plots are not saved in the same folder as audio.

a = analyze('~/Downloads/temp2', savePlots = '',
            width = 900, height = 500, units = 'px')

TIP Processing time varies a lot depending on the exact settings and input, but expect up to a few seconds of machine time per second of audio. The surest way to speed things up is to reduce samplingRate (but not too low, otherwise the resolution of pitch estimation begins to suffer) and to use a relatively short windowLength with little overlap. Also avoid pathfinding = 'slow' (other types of postprocessing have very little effect on processing time)

Manual correction of pitch contours with pitch_app()

Just like soundgen_app() is an interactive shiny app for sound synthesis, pitch_app() provides a wrapper around analyze() with an option to step in and manually correct the intonation contour. The app runs in a browser; you can load one or more wav or mp3 files, adjust the settings, verify and correct the pitch contours, and save the output as a .csv file. Please see ?pitch_app for more information.

TIP pitch_app() is meant for extracting pitch contours. Although its output is similar to that of analyze(), it doesn't offer options for measuring formants, loudness, etc. Suggested workflow: extract manually corrected pitch contours first, and then run analyzeFolder(..., pitchManual = output.csv), where output.csv is the dataset produced by pitch_app

Summarizing pitch contours

By default, analyze() and pitch_app() summarize contours of estimated pitch and other variables using simple summary functions such as mean, median, SD, etc. These summaries can be extended with user-defined summaryFun, but then the same descriptives are applied to every single output of analyze(), potentially resulting in very large output matrices with a lot of descriptives no-one is ever going to need. To avoid this clutter, it can be convenient to summarize one variable at a time - normally pitch - and extract a much larger number of summary measures of the average value, its variability, slope, inflections, etc. This task can be accomplished with pitchDescriptives(). For example:

pd = pitchDescriptives(
  a1$detailed$pitch, step = a1$detailed$time[2] - a1$detailed$time[1],
  timeUnit = 'ms',
  smoothBW = c(10, 1),   # original + smoothed at 10 Hz and 1 Hz
  inflThres = .2,        # min amplitude that counts as inflections
  plot = TRUE

Formant analysis with formant_app()

Formants are the natural resonances of the vocal tract - that is, frequency bands that are amplified as the sound passes through the vocal tract. In speech, formants are responsible for differences between vowel sounds such as /a/, /i/, /u/, etc., but formants are also present in many animal calls. The most common way to estimate formant frequencies and bandwidth is by using Linear Predictive Coding (LPC), which is implemented in phonTools::findformants(). Because formant measurement is quite error-prone, however, any serious analysis should include manual verification of LPC measurements. Soundgen offers a web app for this task: formant_app(). Just like pitch_app(), it runs in the system-default web browser; you can add annotations to audio files and manually verify and, if needed, correct the formant measurements for each annotated segment.

TIP: pitch_app and formant_app are meant to work in the Firefox browser. A bug in Chrome interferes with correct audio playback; Safari and IE may or may not work.

Audio segmentation

Finding syllables with segment()

In addition to measuring spectral characteristics and fundamental frequency, it is often important to analyze the temporal structure of a sound. In particular, it is often helpful to divide a sound into separate "syllables" - continuous acoustic fragments separated by what we consider to be "silence". If this "silence" was not full of background noise and breathing, the task would be trivial: we could simply define syllables as continuous segments with ampiltude above some threshold. As it is, with non-studio material it is problematic to find a single threshold that would accurately coincide with the beginning and end of syllables in different sounds (presuming that we are interested in batch processing multiple sounds without manually adjusting the settings for each sound). Soundgen attempts to solve this problem by using an adaptive algorithm sensitive to the nature of background noise, signal-to-noise ratio, and acoustic novelty to achieve flexible and fast syllable segmentation.

Sometimes we are more interested in the rate of syllables per second and in their regularity, rather than the absolute duration of each syllable. In this case it makes more sense to look for bursts of acoustic energy or sudden changes relative to background noise. The spacing between bursts - the interburst interval - is a perceptually salient characteristic of a bout of vocalizing.

Soundgen package contains a function called segment, which uses both these approaches: it looks for continuous syllables and for acoustic bursts. These two algorithms are not independent: syllables are found first, and then the median length of a syllable becomes the expected interburst interval, guiding burst detection. segment operates with ordinary or mel-transformed spectrograms or with amplitude envelopes - smoothed contours of sound intensity.

# for info on using soundgen() function, see the vignette on sound synthesis 
s2 = soundgen(nSyl = 8, sylLen = 50, pauseLen = 70, temperature = 0,
              pitch = c(368, 284), amplGlobal = c(0, -20))
# add noise so SNR decreases from 20 to 0 dB from syl1 to syl8
s2 = s2 + runif(length(s2), -10 ^ (-20 / 20), 10 ^ (-20 / 20))
# playme(s2, samplingRate = 16000)
a = segment(s2, samplingRate = 16000, plot = TRUE)

This synthesized laugh-like sound contains 8 syllables, each 50 ms long and separated by 70 ms of unvoiced fragments with some overlapping aspiration noise (NOT silence!). With default settings, segment finds 7 syllables of median length median(a$syllables$sylLen, na.rm = TRUE) = 48 ms separated by median(a$syllables$pauseLen, na.rm = TRUE) = 72 ms. The syllables are shown by blue line segments in the plot above. The last syllables are missed either because they are below the default detection threshold SNR or because their apparent duration falls below the default shortestSyl of 40 ms. All 8 bursts are correctly detected. The interburst interval is estimated to be median(a$bursts$interburst, na.rm = TRUE) = 120 ms (the correct number is 70 + 50 = 120 ms), with SD = sd(a$bursts$interburst, na.rm = TRUE) = 3 ms (the correct number is 0 since we set temperature to 0). Note that, just as with many real-life sounds, the question of when each syllable starts and ends is pretty subjective, and segmentation can be performed on different time scales, eg looking for bouts of laughter surrounded by noise or for individual syllables within each laugh.

Some other settings worth mentioning are:

a1 = segment(s2, samplingRate = 16000, plot = TRUE, 
             SNR = 0.1, main = 'SNR very low')
a2 = suppressWarnings(segment(s2, samplingRate = 16000, plot = TRUE, 
            SNR = 14, main = 'SNR very high'))
a1 = segment(s2, samplingRate = 16000, plot = TRUE, 
             windowLength = 40, overlap = 0, main = 'overlap too low')
a2 = suppressWarnings(segment(s2, samplingRate = 16000, plot = TRUE, 
             windowLength = 5, overlap = 80, main = 'window too short'))
a3 = segment(s2, samplingRate = 16000, plot = TRUE, 
             windowLength = 150, overlap = 80, main = 'window too long')
# too long, but at least bursts are detected
a1 = segment(s2, samplingRate = 16000, plot = TRUE, 
             shortestSyl = 80, main = 'shortestSyl too long')    
# merges syllables
a2 = segment(s2, samplingRate = 16000, plot = TRUE, 
             shortestPause = 80, main = 'shortestPause too long')  

TIP For batch processing without manual adjustments, and particularly when the focus is on describing the overall rate of acoustic events rather than on actually segmenting a recording into separate vocalizations, bursts may be more robust than syllables

Self-similarity matrices

The basic idea behind self-similarity matrices (SSMs) is to compare different parts of the same sound with each other and present their pairwise similarity indices as a square matrix with time along both dimensions. The diagonal going from bottom-left to top-right represents the similarity of each fragment with itself. For example, at (100, 100) we have the similarity of the fragment at 100 ms with itself, while at (100, 200) we have the similarity of the fragments at 100 ms and 200 ms. Let's begin with an example taken from ?ssm:

s3 = c(soundgen(), soundgen(nSyl = 4, sylLen = 50, pauseLen = 70, 
       formants = NA, pitch = c(500, 330)))
# playme(s3, 16000)
m = ssm(s3, samplingRate = 16000)

Look at the upper panel first. There are three main regions: the large red square in the lower left corner, which corresponds to the first bout; the red square in the center, which corresponds to the long pause between the bouts; and the checkered field in the upper right corner, which corresponds to the second bout. Observe the characteristic diagonal. The SSM clearly shows periodicity within the second bout, which confirms that these syllables are very similar to each other in terms of their spectral characteristics. Now look at the lower panel of the plot. This is a spectrogram overlaid with so-called "novelty": peaks show when something changes in the sound, based on the SSM.

The results of SSM analysis can vary dramatically, depending on how it is performed. We begin by extracting some form of spectrogram of the entire sound, then we somehow compare each frame to each other frame, and finally we calculate a measure of novelty. At each step there are crucial decisions to make, which can dramatically alter the result. Let's look at the settings that control each step. As you play with these settings, don't be surprised if your SSM suddenly looks completely different!

  1. Spectrogram extraction is controlled by some already familiar settings such as windowLength and overlap (or step if you prefer). However, it can often be helpful to base the SSM on some other spectral characteristics than the usual spectrogram. Under the hood, ssm calls tuneR::melfcc to return the power spectrum (as in a spectrogram, input = 'spectrum'), mel-transformed spectrum (auditory spectrum, which is supposed to be closer to how humans perceive sounds, input = 'audiogram'), or mel-frequency cepstral coefficients (MFCCs - a popular choice of input for automatic sound classification, as in speech recognition software, input = 'mfcc'). There are also several arguments that are passed to tuneR::melfcc: maxFreq controls the highest frequency analyzed, MFCC controls the number of mel-frequency cepstral coefficients to extract, and nBands basically controls the frequency resolution.

  2. SSM calculation starts with the spectrogram (or the MFCCs matrix) produced at the previous stage. This input is divided into windows, each of which can be one or several STFT frames in length, and then each window is compared to each other window using either cosine similarity (simil = 'cosine') or Pearson's correlation (simil = 'cor'). Prior to this the input matrix can be normalized column-wise, so as to level out the differences in sound intensity across frames (norm = TRUE). The size of analysis window is controlled by the argument ssmWin, ms. Note that as you increase ssmWin, the resulting SSM literally shrinks in size, reducing its resolution. This is a great feature if your sound is very long, but it is seldom needed for shorter sounds. As you decrease ssmWin, it ultimately reaches the size of a single STFT frame (step) and can't go down any further, unless you improve time resolution of the spectrogram itself by reducing windowLength and/or increasing overlap.

  3. Novelty calculation starts with the SSM computed at the previous step. Here the idea is to move along the diagonal and multiply the SSM by a Gaussian checkerboard matrix, so as to detect changes in the spectral structure (see Foote, 2000). The settings that control this algorithm are kernelLen and kernelSD. The crucial parameter is kernelLen (ms): a large kernel is good for detecting global changes in the sound, such as transitions between different vocalizations, whereas a small kernel is good for detecting rapid changes, such as individual bursts in a laugh. If you are curious about the kernel itself, take a look at ?soundgen:::getCheckerboardKernel().

par(mfrow = c(2, 1))
m1 = ssm(s3, samplingRate = 16000,
         input = 'melspec', simil = 'cor', norm = FALSE, 
         ssmWin = 10, kernelLen = 150)  # detailed, local features
m2 = ssm(s3, samplingRate = 16000,
         input = 'mfcc', simil = 'cosine', norm = TRUE, 
         ssmWin = 50, kernelLen = 600)  # more global
par(mfrow = c(1, 1))

TIP analyze() internally calls ssm() to calculate novelty

Modulation spectrum

Acoustic analysis in the context of phonetic or bioacoustic research tends to begin with a spectrogram. However, there is a complementary perspective on visualizing and analyzing sounds, which looks at the joint distribution of temporal and spectral modulation known as "modulation spectrum". The principal insight is the idea that a sound can be represented as a sum of sinusoidal spectrotemporal ripples. In case this concept is difficult to grasp, here are two intuitive explanations (for a rigourous scientific presentation, please refer to Singh & Theunissen, 2003). The first way to understand the modulation spectrum is to think of it as a two-dimensional Fourier transform of the spectrogram. For those familiar with image processing, this is the standard way of representing any two-dimensional image as a sum of sinusoidal components (ripples) that differ in their frequency and direction. When the image in question is a spectro-temporal representation of a sound, such as a spectrogram or wavelet transform, the result is the modulation spectrum. The second basic insight is to link the modulation spectrum to spectro-temporal receptive fields (STRFs) of auditory neurons. Some neurons respond to rapid upward sweeps, others to broadband noise pulses at a particular frequency, and so on. The modulation spectrum maps nicely onto STRFs, in the sense that it shows which spectro-temporal patterns are prevalent in a particular sound.

The basic function for obtaining a modulation spectrum in soundgen is modulationSpectrum. It accepts one or more numeric vectors or one or more audio files as input, prepares a spectrogram via STFT, takes its two-dimensional FFT, and optionally performs some additional post-processing and plotting. For multiple inputs, a separate modulation spectrum is initially extracted for each one, and then they are averaged into a single output. For processing a number of audio files separately and saving their individual modulation spectra, please use the function modulationSpectrumFolder.

The first and crucial thing to remember when working with modulation spectra is that they are extremely sensitive to the initial step of transforming a sound into a spectrogram, especially the window length and step used in STFT. In order to detect temporal modulation at 20 Hz, for example, the step of a spectrogram has to be under 50 ms (1/20 s). Likewise, spectral modulation of 10 cycles/KHz cannot be resolved unless the window length is long enough (~20 ms). Here is a simple illustration:

s = soundgen(pitch = 70, amFreq = 25, amDep = 80, rolloff = -15)
ms = modulationSpectrum(s, samplingRate = 16000, logWarp = NULL,
                        windowLength = 25, step = 25, amRes = NULL)

In this example we missed both spectral modulation at ~14 cycles/KHz corresponding to the f0 of 70 Hz (1000/70 = 14.3) and temporal modulation at 25 Hz. To increase the resolution in both directions, we can simultaneously increase the window length (to capture spectral modulations) and decrease STFT step (to capture temporal modulations):

ms = modulationSpectrum(s, samplingRate = 16000, logWarp = NULL,
                        windowLength = 40, step = 10, amRes = NULL)

Now we can see both spectral modulations at 14 Hz (f0) and temporal modulations at 25 Hz, as expected. When working with sounds for which you don't know the ground truth, however, the point is not to take the first modulation spectrum you happen to produce at face value - play around with different settings and compare the results. This is really the same indeterminacy principle as the more familiar tradeoff between frequency and time resolution in spectrograms.

Beyond these basic settings, there are several other options:

ms = modulationSpectrum(
  s, samplingRate = 16000, windowLength = 15, step = 5, 
  amRes = NULL,  # analyze the entire sound at once (see TIP below)
  logSpec = FALSE,  # log-transform the spectrogram before 2D FFT?
  power = 2,  # square amplitudes in modulation spectrum ("power" spectrum)
  roughRange = c(30, 150),  # temporal modulations in the "roughness" range
  amRange = c(10, 70),    # AM frequencies of interest
  logWarp = 2,  # log-transform axes for plotting
  kernelSize = 7,  # apply Gaussian blur for smoothing
  quantiles = c(.5, .8, .95, .99),  # customize contour lines
  colorTheme = 'terrain.colors'  # alternative palette
ms[c('roughness', 'amFreq', 'amDep')]

TIP Roughness and amplitude modulation can be calculated for the entire sound (set amRes = NULL) or for as many frames as possible given the duration of audio and the desired resolution for the analysis of temporal modulation (one frame for the analysis of roughness comprises several STFT frames, so we measure about amRes values per second)


The results of acoustic analysis are sensitive to the chosen settings: the size and overlap of Fourier windows, the method(s) of pitch tracking and their parameters, etc. The default settings of the two main functions presented in this vignette, analyze and segment, have been optimized for human non-linguistic vocalizations. If you analyze other types of sounds, such as human speech or whale songs, you can probably improve the accuracy by changing some settings. In many cases the required changes are obvious: for example, to analyze f0 in ultrasonic whistles of dolphins or rodents you will obviously want to record at a much higher sampling rate, use shorter FFT windows, raise pitch_ceiling and / or prior_mean, etc. Other settings are not so obvious or even downright opaque, inviting some form of automatic optimization.

All that you need in order to run optimization is a training sample - that is, a few hundred sounds for which you know the right answer (average pitch, the number of syllables, or whatever it is that you wish to measure). You can start by analyzing this training sample with some reasonable settings, and then you can check the measurements manually, correcting any mistakes. This gives you a "key", and then you can fiddle with the settings trying to reproduce this key as closely as possible, but now without manual interventions. Once you have found the optimal settings, you can use them to analyze future samples of acoustically similar material. More sophisticated methods of optimization involving cross-validation are also possible, but even the simplest check against manual measurements can cause a dramatic improvement in the quality of your measurements.

R has excellent optimization tools, notably stats::optim. It can also be done manually; for example, you can specify a grid with combinations of several control parameters and repeat the analysis for each combination, each time comparing the result with your "key". Since I have gone through this process when optimizing the default settings in analyze and segment, I believe it may be helpful to share some code and tips specific to this particular optimization problem.

Soundgen package contains two vectors containing "keys" for the optimization of segmentation (segmentManual) and pitch tracking (pitchManual) of 260 sounds in the corpus of human vocalizations described in Anikin & Persson (2017). segmentManual contains manually verified syllable counts, and pitchManual the average pitch of these 260 sounds. These numbers are by no means intended to represent the absolute truth: in many cases it is very hard to objectively count the syllables or even determine the "true" pitch (if you doubt it, listen to the sounds, which can be downloaded from Nevertheless, these manual measurements should not be too far off the mark, so they are acceptable targets for training the algorithm.

The easiest, "out-of-the-box" solution is to use the soundgen function optimizePars, which is essentially a wrapper around stats::optim tailored to this particular task. There are extensive examples in the documentation for this function: see ?optimizePars.

TIP The parameters of segment can be optimized within an hour or two with 260 sounds (total audio duration ~9 min), but pitch tracking is really too slow to be optimized head-on, except maybe for one or two parameters at a time using grid optimization instead of optim

If you prefer to optimize manually, without calling opitmizePars, here are a couple of examples.

# checking combinations of pitch tracking methods
myfolder = ''  
key = log(pitchManual)
p = c('autocor', 'cep', 'spec', 'dom')
pp = c(list(p),
       combn(p, 3, simplify = FALSE),
       combn(p, 2, simplify = FALSE),
       combn(p, 1, simplify = FALSE))
out = list()
res = data.frame('pars' = sapply(pp, function(x) paste(x, collapse = ',')),
                 cor1 = rep(NA, length(pp)),
                 cor2 = rep(NA, length(pp)))
# repeating the analysis for each combination of methods in pp
for (i in 1:length(pp)) {
  out[[i]] = analyze(myfolder, plot = FALSE, step = 50,
                           pitchMethods = pp[[i]])$summary$pitch_median
  res$cor1[i] = cor(log(out[[i]]), log(pitchManual), use = 'pairwise.complete.obs')
  res$cor2[i] = cor(log(out[[i]]), log(pitchManual), use = 'pairwise.complete.obs') *
    (1 - mean([[i]]) & !
  print(res[i, ])
res[order(res$cor1, decreasing = TRUE), ]  # max correlation regardless of NA
res[order(res$cor2, decreasing = TRUE), ]  # max correlation penalized for NA

And another example, for a grid of two parameters of analyze:

myfolder = ''
key = log(pitchManual)
out = list()
pars = expand.grid(windowLength = c(17, 35, 50),
                   smooth = c(0, 1, 2))
for (i in 1:nrow(pars)) {
  out[[i]] = suppressWarnings(analyze(myfolder, plot = FALSE,
               step = 25,
               pitchMethods = c('autocor','dom'),
               windowLength = pars$windowLength[i],
               smooth = pars$smooth[i]))$summary$pitch_median
  print(cor(log(out[[i]]), key, use = 'pairwise.complete.obs'))
  print(cor(log(out[[i]]), key, use = 'pairwise.complete.obs') *
          (1 - mean([[i]]) & !
pars$r1 = sapply(out, function(x) {
  cor(log(x), key, use = 'pairwise.complete.obs')
pars$r2 = sapply(out, function(x) {
  cor(log(x), key, use = 'pairwise.complete.obs') *
    (1 - mean( & !

v = 6  # pick some combination of par values to explore
trial = log(out[[v]])  
cor(key, trial, use = 'pairwise.complete.obs')
cor(key, trial, use = 'pairwise.complete.obs') *
  (1 - mean( & !
plot (key, trial)
abline(a=0, b=1, col='red')


Anikin, A. & Persson, T. (2017). Non-linguistic vocalizations from online amateur videos for emotion research: a validated corpus. Behavior Research Methods, 49(2): 758-771. Text and sounds available here

Ba, H., Yang, N., Demirkol, I., & Heinzelman, W. (2012, August). BaNa: A hybrid approach for noise resilient pitch detection. In Statistical Signal Processing Workshop (SSP), 2012 IEEE (pp. 369-372).

Boersma, P. (1993). Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences (Vol. 17, No. 1193, pp. 97-110).

Foote, J. (2000). "Automatic Audio Segmentation using a measure of audio novelty." In Proceedings of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455.

Singh, N. C., & Theunissen, F. E. (2003). Modulation spectra of natural sounds and ethological theories of auditory processing. The Journal of the Acoustical Society of America, 114(6), 3394-3411.

Sueur, J. (2018). Sound analysis and synthesis with R. Heidelberg, Germany: Springer.

Try the soundgen package in your browser

Any scripts or data that you put into this service are public.

soundgen documentation built on June 20, 2021, 1:06 a.m.