In tjmahr/readtextgrid: Read in a 'Praat' 'TextGrid' File

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

readtextgrid

readtextgrid parses Praat textgrids into R dataframes.

Installation

Install from CRAN:

install.packages("readtextgrid")

Install the development version from Github:

install.packages("remotes")
remotes::install_github("tjmahr/readtextgrid")

Basic example

Here is the example textgrid created by Praat. It was created using New -> Create TextGrid... with default settings in Praat.

This textgrid is bundled with this R package. We can locate the file with example_textgrid(). We read in the textgrid with read_textgrid().

library(readtextgrid)

# Locates path to an example textgrid bundled with this package
tg <- example_textgrid()

read_textgrid(path = tg)

The dataframe contains one row per annotation: one row for each interval on an interval tier and one row for each point on a point tier. If a point tier has no points, it is represented with single row with NA values.

The columns encode the following information:

file filename of the textgrid. By default this column uses the filename in path. A user can override this value by setting the file argument in read_textgrid(path, file), which can be useful if textgrids are stored in speaker-specific folders.
tier_num the number of the tier (as in the left margin of the textgrid editor)
tier_name the name of the tier (as in the right margin of the textgrid editor)
tier_type the type of the tier. "IntervalTier" for interval tiers and "TextTier" for point tiers (this is the terminology used inside of the textgrid file format).
tier_xmin, tier_xmax start and end times of the tier in seconds
xmin, xmax start and end times of the textgrid interval or point tier annotation in seconds
text the text in the annotation
annotation_num the number of the annotation in that tier (1 for the first annotation, etc.)

Reading in directories of textgrids

Suppose you have data on multiple speakers with one folder of textgrids per speaker. As an example, this package has a folder called speaker_data bundled with it representing 5 five textgrids from 2 speakers.

speaker-data
+-- speaker001
|   +-- s2T01.TextGrid
|   +-- s2T02.TextGrid
|   +-- s2T03.TextGrid
|   +-- s2T04.TextGrid
|   \-- s2T05.TextGrid
\-- speaker002
    +-- s2T01.TextGrid
    +-- s2T02.TextGrid
    +-- s2T03.TextGrid
    +-- s2T04.TextGrid
    \-- s2T05.TextGrid

First, we create a vector of file-paths to read into R.

# Get the path of the folder bundled with the package
data_dir <- system.file(package = "readtextgrid", "speaker-data")

# Get the full paths to all the textgrids
paths <- list.files(
  path = data_dir, 
  pattern = "TextGrid$",
  full.names = TRUE, 
  recursive = TRUE
)

We can use purrr::map_dfr()--map the read_textgrid function over the paths and combine the dataframes (_dfr)---to read all these textgrids into R. But note that this way loses the speaker information.

library(purrr)

map_dfr(paths, read_textgrid)

We can use purrr::map2_dfr() and some dataframe manipulation to add the speaker information.

library(dplyr)

# This tells read_textgrid() to set the file column to the full path
data <- map2_dfr(paths, paths, read_textgrid) |> 
  mutate(
    # basename() removes the folder part from a path, 
    # dirname() removes the file part from a path
    speaker = basename(dirname(file)),
    file = basename(file),
  ) |> 
  select(
    speaker, everything()
  )

data

Another strategy would be to read the textgrid dataframes into a list column and unnest() them.

# Read dataframes into a list column
data_nested <- tibble(
  speaker = basename(dirname(paths)),
  data = map(paths, read_textgrid)
)

# We have one row per textgrid dataframe because `data` is a list column
data_nested

# promote the nested dataframes into the main dataframe
tidyr::unnest(data_nested, "data")

Pivoting textgrids [dev version]

In the textgrids above, there is a natural nesting or hierarchy to the tiers. Intervals in words tier contain intervals in the phones tier. It is often necessary to group intervals by their parent intervals (group phones by words). This package provides the pivot_textgrid_tiers() function to convert textgrids into a wide format in a way that respect the nesting/hierarchy of tiers.

data_wide <- pivot_textgrid_tiers(
  data, 
  tiers = c("words", "phones"), 
  join_cols = c("speaker", "file")
)

data_wide

# more clearly,
data_wide |> 
  select(
    speaker, file, words, phones, 
    words_xmin, words_xmax, phones_xmin, phones_xmax
  )

Some remarks:

Each tier in tiers becomes a batch of columns. For the rows for words become words (the original text value), words_xmin, words_xmax, etc.
The columns in join_cols should uniquely identify a textgrid file, so the combination of speaker and file is needed in the case where different speakers have the same file.
The tier names in tiers should be given in the order of their nesting from outside to inside (e.g., words contain phones). Behind the scenes, dplyr::left_join(..., relationship = "one-to-many") is used to constrain how intervals are combined.

This function also works on a single tiers value. In this case, the function returns just the intervals in that tier with the columns renamed and prefixed.

data |> 
  pivot_textgrid_tiers(
    tiers = "words", 
    join_cols = c("speaker", "file")
  )

Other tips

Speeding things up

Do you have thousands of textgrids to read? The following workflow can speed things up. We are going to read the textgrids in parallel. We use the future package to manage the parallel computation. We use the furrr package to get future-friendly versions of the purrr functions. We tell future to use a multisession plan for parallelism: Do the extra computation on separate R sessions in the background. Then everything else is the same. Just replace map() with future_map().

library(future)
library(furrr)
plan(multisession, workers = 4)

data_nested <- tibble(
  speaker = basename(dirname(paths)),
  data = future_map(paths, read_textgrid)
)

By default, readtextgrid uses readr::guess_encoding() to determine the encoding of the textgrid before reading it in. But if you know the encoding beforehand, you can skip this guessing. In my limited testing, I found that setting the encoding could reduce benchmark times by 3--4% compared to guessing the encoding.

Here, we read 100 textgrids using different approaches to benchmark the results.

paths_bench <- sample(paths, 100, replace = TRUE)
bench::mark(
  lapply_guess = lapply(paths_bench, read_textgrid),
  lapply_set = lapply(paths_bench, read_textgrid, encoding = "UTF-8"),
  future_guess = future_map(paths_bench, read_textgrid),
  future_set = future_map(paths_bench, read_textgrid, encoding = "UTF-8"), 
  min_iterations = 5,
  check = TRUE
)

Helpful columns

The following columns are often helpful:

duration of an interval
xmid midpoint of an interval
total_annotations total number of annotations on a tier

Here is how to create them:

data |>
  # grouping needed for counting annotations per tier per file per speaker
  group_by(speaker, file, tier_num) |>
  mutate(
    duration = xmax - xmin,
    xmid = xmin + (xmax - xmin) / 2,
    total_annotations = sum(!is.na(annotation_num))
  ) |> 
  ungroup() |> 
  glimpse()

Launching Praat

This tip is written from the perspective of a Windows user who uses git bash for a terminal.

To open textgrids in Praat, you can tell R to call Praat from the command line. You have to know where the location of the Praat binary is though. I like to keep a copy in my project directories. So, assuming that Praat.exe in my working folder, the following would open the 10 textgrids in paths in Praat.

system2(
  command = "./Praat.exe",
  args = c("--open", paths),
  wait = FALSE
)

Limitations

readtextgrid supports textgrids created by Praat by using Save as text file.... It uses a parsing strategy based on regular expressions targeting indentation patterns and text flags in the file format. The formal specification of the textgrid format, however, is much more flexible. As a result, not every textgrid that Praat can open---especially the minimal "short text" files---is compatible with this package.

Acknowledgments

readtextgrid was created to process data from the WISC Lab project. Thus, development of this package was supported by NIH R01DC009411 and NIH R01DC015653.

Please note that the 'readtextgrid' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

tjmahr/readtextgrid documentation built on Dec. 24, 2024, 12:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com