phonetise: Tokenise IPA strings
In phonetisr: A Naive IPA Tokeniser

Description

phonetise() tokenises strings of IPA symbols (like phonetic transcriptions of words) into individual "phones". The output is a list.

Usage

phonetise(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

phonetize(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

Arguments

`strings`	A character vector of words in IPA.
`multi`	A character vector of one or more multi-character phones as strings.
`regex`	A string with a regular expression to match several multi-character phones.
`split`	If set to `TRUE` (the default), the tokenised strings are split into phones (i.e. each string becomes a vector with one element per phone). If set to `FALSE`, the strings are not split and the phones are separated by the character defined in `sep`.
`sep`	A character to be used as the separator between phones if `split = FALSE` (default is `" "`, a space).
`sanitise`	Whether to remove all non-IPA characters (`TRUE` by default).
`ignore_stress`	If `TRUE` (the default), stress marks are not parsed.
`ignore_tone`	If `TRUE` (the default), tone marks and letters are not parsed.
`diacritics`	If set to `TRUE`, parses all valid diacritics as part of the previous character (`FALSE` by default).
`affricates`	If set to `TRUE`, parses homorganic stop + fricative sequences as affricates (`FALSE` by default).
`v_sequences`	If set to `TRUE`, collapses vowel sequences (`FALSE` by default).
`prenasalised`	If set to `TRUE`, parses prenasalised consonants as such (`FALSE` by default).
`all_multi`	If set to `TRUE`, `diacritics`, `affricates`, `v_sequences` and `prenasalised` are all set to `TRUE` (`FALSE` by default).
`sanitize`	Alias of `sanitise`.
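
The multi-character options can be compared side by side. The following is a minimal sketch, assuming the phonetisr package is installed; it only uses arguments listed above, and the variable names (`plain`, `with_diacritics`, `everything`) are illustrative.

```r
library(phonetisr)

# Unicode escapes, as in the package's own examples.
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")

# Defaults: none of the multi-character options are switched on.
plain <- phonetise(ipa)

# Parse valid diacritics as part of the previous character.
with_diacritics <- phonetise(ipa, diacritics = TRUE)

# Switch on diacritics, affricates, v_sequences and prenasalised together.
everything <- phonetise(ipa, all_multi = TRUE)

# Compare the number of tokens per word under each setting.
lengths(plain)
lengths(with_diacritics)
lengths(everything)
```

Printing `plain` and `with_diacritics` next to each other is a quick way to check whether `multi` or `regex` is still needed for a given data set.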

Value

A list of phonetised strings.
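
Because the result is a list, the usual base R list tools apply. The sketch below assumes phonetisr is installed and that the default `split = TRUE` is used; `phones`, `all_phones` and `phone_counts` are illustrative names, not part of the package.

```r
library(phonetisr)

ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

# One list element per input string (split = TRUE is the default).
phones <- phonetise(ipa, multi = ph)

# Number of phones in each word.
lengths(phones)

# Flatten the list and tabulate how often each phone occurs.
all_phones <- unlist(phones)
phone_counts <- table(all_phones)
phone_counts
```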

Examples

library(phonetisr)

# Unicode escapes are used to comply with CRAN policy.
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

phonetise(ipa, multi = ph)

ph_2 <- ph[4:5]

# Match any character followed by <\u02B0> with ".\u02B0".
phonetise(ipa, multi = ph_2, regex = ".\u02B0")

# Same result, using a single regular expression and no `multi`.
phonetise(ipa, regex = ".(\u0303|\u0325|\u02B0)")

# Don't split strings and use "." as separator
phonetise(ipa, multi = ph, split = FALSE, sep = ".")

phonetisr documentation built on April 3, 2025, 10:49 p.m.