extract_features: Extract audio features

View source: R/extract_features.R


Extract audio features

Description

Extracts features from WAV audio files.

Usage

extract_features(
  x,
  features = c("f0", "fmt", "gain"),
  filesRange = NULL,
  sex = "u",
  windowShift = 10,
  numFormants = 8,
  numcep = 12,
  dcttype = c("t2", "t1", "t3", "t4"),
  fbtype = c("mel", "htkmel", "fcmel", "bark"),
  resolution = 40,
  usecmp = FALSE,
  mc.cores = 1,
  full.names = TRUE,
  recursive = FALSE,
  check.mono = FALSE,
  stereo2mono = FALSE,
  overwrite = FALSE,
  freq = 44100,
  round.to = NULL,
  verbose = FALSE,
  pycall = "~/miniconda3/envs/pyvoice/bin/python"
)

Arguments

x

A vector containing either files or directories of audio files in WAV format.

features

Vector of features to be extracted. (Default: 'f0', 'fmt', 'gain'). Available features: 'f0', 'f0_mhs', 'f0_praat', 'fmt', 'fmt_praat', 'zcr', 'rms', 'gain', 'rfc', 'ac', 'cep', 'dft', 'css', 'lps', 'mfcc', 'df', 'pf', 'rf', 'rcf', 'rpf'.

filesRange

The desired range of directory files (Default: NULL, i.e., all files). Should only be used when all the WAV files are in the same folder.

sex

Sex-specific parameter setting: 'f' (female), 'm' (male) or 'u' (unknown) (Default: 'u'). Passed as 'gender' to wrassp::ksvF0, wrassp::forest and wrassp::mhsF0.

windowShift

Analysis window shift, in ms (Default: 10). Used by wrassp::ksvF0, wrassp::forest, wrassp::mhsF0, wrassp::zcrana, wrassp::rfcana, wrassp::acfana, wrassp::cepstrum, wrassp::dftSpectrum, wrassp::cssSpectrum and wrassp::lpsSpectrum.

numFormants

Number of formants to estimate (Default: 8). Used by wrassp::forest.
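
For example, the sketch below (not part of the original examples; it assumes the example audio shipped with the wrassp package) combines the sex, windowShift and numFormants arguments:

path2wav <- list.files(system.file('extdata', package = 'wrassp'),
                       pattern = glob2rx('*.wav'), full.names = TRUE)
# pitch and formant tracks with female-specific settings and a 10 ms shift
M_fmt <- extract_features(path2wav, features = c('f0', 'fmt'),
                          sex = 'f', windowShift = 10, numFormants = 8)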

numcep

Number of Mel-frequency cepstral coefficients (cepstra) to return (Default: 12). Used by tuneR::melfcc.

dcttype

Type of DCT used: 't1' or 't2' ('t3' for HTK, 't4' for feacalc) (Default: 't2'). Used by tuneR::melfcc.

fbtype

Auditory frequency scale to use: 'mel', 'bark', 'htkmel', 'fcmel' (Default: 'mel'). Used by tuneR::melfcc.

resolution

Sets the FFT length to the smallest value that yields a frequency resolution of the given value in Hz or better (Default: 40.0). Used by wrassp::cssSpectrum, wrassp::dftSpectrum and wrassp::lpsSpectrum.

usecmp

Logical. Apply equal-loudness weighting and cube-root compression (PLP instead of LPC) (Default: FALSE). Used by tuneR::melfcc.
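
A sketch of how the melfcc-related arguments (numcep, dcttype, fbtype and usecmp) can be combined, using the path2wav vector built above:

# 13 cepstra on the HTK mel scale with PLP-style compression
M_mfcc <- extract_features(path2wav, features = 'mfcc',
                           numcep = 13, dcttype = 't2',
                           fbtype = 'htkmel', usecmp = TRUE)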

mc.cores

Number of cores to be used in parallel processing. (Default: 1)

full.names

Logical. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE, the file names (rather than paths) are returned. (Default: TRUE) Used by base::list.files.

recursive

Logical. Should the listing recurse into directories? (Default: FALSE) Used by base::list.files.

check.mono

Logical. Check if the WAV file is mono. (Default: FALSE)

stereo2mono

(Experimental) Logical. Should files be converted from stereo to mono? (Default: FALSE)

overwrite

(Experimental) Logical. Should converted files be overwritten? If not, the file gets the suffix _mono. (Default: FALSE)

freq

Sampling frequency, in Hz, used to write the converted files when stereo2mono = TRUE. (Default: 44100)
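
A sketch of the mono-handling arguments (check.mono, stereo2mono, overwrite and freq; stereo conversion is experimental):

# check channels and, if needed, write mono copies at 44100 Hz with a '_mono' suffix
M_mono <- extract_features(path2wav, check.mono = TRUE,
                           stereo2mono = TRUE, overwrite = FALSE,
                           freq = 44100)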

round.to

Number of decimal places to round to. (Default: NULL)

verbose

Logical. Should the running status be shown? (Default: FALSE)

pycall

Path to the Python binary to be called. See https://github.com/filipezabala/voice for details.
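
A sketch of pointing pycall at another Python binary; the path below is illustrative only and must be adapted to your own installation:

# hypothetical system Python instead of the default miniconda environment
M_py <- extract_features(path2wav, pycall = '/usr/bin/python3')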

Details

The feature 'df' corresponds to 'formant dispersion' (df2:df8) by Fitch (1997), 'pf' to 'formant position' (pf1:pf8) by Puts, Apicella & Cárdenas (2012), 'rf' to 'formant removal' (rf1:rf8) by Zabala (2023), 'rcf' to 'formant cumulated removal' (rcf2:rcf8) by Zabala (2023) and 'rpf' to 'formant position removal' (rpf2:rpf8) by Zabala (2023). The 'fmt_praat' feature may take a long time to process. The following features may contain a variable number of columns: 'cep', 'dft', 'css' and 'lps'.
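
A sketch extracting the formant-derived measures described above:

# formants plus dispersion, position and removal measures
M_df <- extract_features(path2wav,
                         features = c('fmt', 'df', 'pf', 'rf', 'rcf', 'rpf'))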

Value

A Media data frame containing the selected features.

References

Levinson N. (1946). "The Wiener (root mean square) error criterion in filter design and prediction." Journal of Mathematics and Physics, 25(1-4), 261–278. (doi:10.1002/SAPM1946251261)

Durbin J. (1960). “The fitting of time-series models.” Revue de l’Institut International de Statistique, pp. 233–244. (https://www.jstor.org/stable/1401322)

Cooley J.W., Tukey J.W. (1965). “An algorithm for the machine calculation of complex Fourier series.” Mathematics of computation, 19(90), 297–301. (https://www.ams.org/journals/mcom/1965-19-090/S0025-5718-1965-0178586-1/)

Wasson D., Donaldson R. (1975). “Speech amplitude and zero crossings for automated identification of human speakers.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(4), 390–392. (https://ieeexplore.ieee.org/document/1162690)

Allen J. (1977). "Short term spectral analysis, synthesis, and modification by discrete Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(3), 235–238. (https://ieeexplore.ieee.org/document/1162950)

Schäfer-Vincent K. (1982). "Significant points: Pitch period detection as a problem of segmentation." Phonetica, 39(4-5), 241–253. (doi:10.1159/000261665)

Schäfer-Vincent K. (1983). "Pitch period detection and chaining: Method and evaluation." Phonetica, 40(3), 177–202. (doi:10.1159/000261691)

Ephraim Y., Malah D. (1984). "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121. (https://ieeexplore.ieee.org/document/1164453)

Delsarte P., Genin Y. (1986). "The split Levinson algorithm." IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(3), 470–478. (https://ieeexplore.ieee.org/document/1164830)

Jackson J.C. (1995). "The Harmonic Sieve: A Novel Application of Fourier Analysis to Machine Learning Theory and Practice." Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.

Fitch W.T. (1997). "Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques." J. Acoust. Soc. Am., 102, 1213–1222. (doi:10.1121/1.421048)

Boersma P., van Heuven V. (2001). Praat, a system for doing phonetics by computer. Glot. Int., 5(9/10), 341–347. (https://www.fon.hum.uva.nl/paul/papers/speakUnspeakPraat_glot2001.pdf)

Ellis D.P.W. (2005). "PLP and RASTA (and MFCC, and inversion) in Matlab." Online web resource.

Puts D.A., Apicella C.L., Cárdenas R.A. (2012). "Masculine voices signal men's threat potential in forager and industrial societies." Proc. R. Soc. B Biol. Sci., 279, 601–609. (doi:10.1098/rspb.2011.0829)

Examples

library(voice)

# get path to audio file
path2wav <- list.files(system.file('extdata', package = 'wrassp'),
                       pattern = glob2rx('*.wav'), full.names = TRUE)

# minimal usage
M1 <- extract_features(path2wav)
M2 <- extract_features(dirname(path2wav))
identical(M1,M2)
table(basename(M1$wav_path))

# limiting filesRange
M3 <- extract_features(path2wav, filesRange = 3:6)
table(basename(M3$wav_path))
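
# selecting features, rounding and parallel processing
# (a sketch; adjust mc.cores to your machine)
M4 <- extract_features(path2wav, features = c('f0', 'gain', 'mfcc'),
                       round.to = 4, mc.cores = 2, verbose = TRUE)
table(basename(M4$wav_path))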
