seqformat: Conversion between sequence representation formats
In TraMineR: Trajectory Miner: a Sequence Analysis Toolkit

Description

Convert a sequence data set from one representation format to another.

Usage

seqformat(data, var = NULL, from, to, compress = FALSE, nrep = NULL, tevent,
  stsep = NULL, covar = NULL, SPS.in = list(xfix = "()", sdsep = ","),
  SPS.out = list(xfix = "()", sdsep = ","), id = 1, begin = 2, end = 3,
  status = 4, process = TRUE, pdata = NULL, pvar = NULL, limit = 100,
  overwrite = TRUE, fillblanks = NULL, tmin = NULL, tmax = NULL, missing = "*",
  with.missing = TRUE, right = "DEL", compressed, nr)

Arguments

`data`	Data frame, matrix, `stslist` state sequence object, or character string vector. The data to convert. (A tibble is internally converted with `as.data.frame`.) When `from = "STS"` or `from = "SPS"`, a data frame or a matrix with sequence data in one or more columns; sequence data in a single column or in a string vector are assumed to be in compressed form (see `stsep`). When `from = "SPELL"`, a data frame with at least four columns; unless specified otherwise with the `var` or `id` / `begin` / `end` / `status` arguments, the first four columns are assumed to be individual ID, spell start time, spell end time, and spell state status. A state sequence object is accepted when `from = "STS"` or `from` is not specified.
`var`	`NULL`, List of Integers or Strings. Default: `NULL`. Indexes or names of the columns containing the sequence information in `data`. If `NULL`, all columns are considered.
`from`	String. Format of the input sequence data. It can be `"STS"` (successive states), `"SPS"` (successive state-duration spells), or `"SPELL"` (vertical id-start-end-state spells). Ignored when `data` is a `stslist` state sequence object.
`to`	String. Format of the output data. It can be `"STS"` (successive states), `"DSS"` (distinct successive states), `"SPS"` (sequences of spells), `"SRS"` (shifted replicated sequences), `"SPELL"` (vertical spells), or `"TSE"` (time stamped events).
`compress`	Logical. Default: `FALSE`. When `to = "STS"`, `to = "DSS"`, or `to = "SPS"`, should the sequences (row vector of states) be concatenated into strings? See `seqconc`.
`nrep`	Integer. Number of shifted replications when `to = "SRS"`.
`tevent`	Matrix. The transition-definition matrix when `to = "TSE"`. It should be of size `d * d` where `d` is the number of distinct states appearing in the sequences. The cell `(i,j)` lists the events associated with a transition from state `i` to state `j`. It can be created with `seqetm`.
`stsep`	`NULL`, Character. Default: `NULL`. When `from = "STS"` or `from = "SPS"`, separator token between states in the compressed form (strings). If `NULL`, `seqfcheck` is called to detect the separator automatically among `"-"` and `":"`. Other separators must be explicitly specified. See `seqdecomp`.
`covar`	List of Integers or Strings. When `to = "SRS"`, indexes or names of `data` columns to include as covariates in the output. Ignored otherwise. Applies only when `data` is a data frame with both sequence and covariate data. Must be used in conjunction with `var`. Covariate values are replicated across the shifted replicated rows.
`SPS.in`	List. Default: `list(xfix = "()", sdsep = ",")`. Specifications for the state-duration couples in the input data when `from = "SPS"`. The first element, `xfix`, specifies the prefix/suffix character. If a single character, it is used as both prefix and suffix. If a two-character string, the first character is used as prefix and the second one as suffix. `xfix = ""` means no prefix/suffix. The second element, `sdsep`, specifies the separator token between state and duration.
`SPS.out`	List. Default: `list(xfix = "()", sdsep = ",")`. The specifications for the state-duration couples in the output data when `to = "SPS"`. See `SPS.in` above.
`id`	`NULL`, Integer, String, Vector of Integers or Strings. Default: `1`. When `from = "SPELL"`, index or name of the column containing the individual IDs in `data` (after `var` filtering). When `to = "TSE"`, index or name of the `data` column containing the individual IDs (after `var` filtering), or vector of unique individual IDs. If `NULL`, indexes of the sequences in the input data are used as IDs. If no `id` is provided when calling the function and `from` is not `"SPELL"`, `id` is set as `NULL`. When `from = "SPELL"` and `to = "TSE"`, `id` cannot be `NULL` and the IDs in the TSE output refer to the IDs in the `id` column of the `SPELL` data.
`begin`	Integer or String. Default: `2`. Index or name of the `data` column containing the spell start times (after `var` filtering) when `from = "SPELL"`. Start times must be positive integers.
`end`	Integer or String. Default: `3`. Index or name of the `data` column containing the spell end times (after `var` filtering) when `from = "SPELL"`. End times must be positive integers.
`status`	Integer or String. Default: `4`. Index or name of the `data` column containing the spell statuses (after `var` filtering) when `from = "SPELL"`.
`process`	Logical. Default: `TRUE`. When `from = "SPELL"`, if `TRUE`, sequences are created on a process time axis; if `FALSE`, on a calendar time axis. The `process` argument and the associated `pdata` and `pvar` arguments are intended for `data` containing spell data with calendar begin and end times. When those times are ages, use `process = FALSE` with `pdata = NULL` to use those ages as process times. Option `process = TRUE` currently does not work for age times.
`pdata`	`NULL`, `"auto"`, or data frame. Default: `NULL`. (Tibbles are internally converted with `as.data.frame`.) To be used only with `from = "SPELL"` or `to = "SPELL"`. If `NULL`, start and end times of each spell in the input data are assumed to be ages if `process = TRUE`, and years if `process = FALSE`. If `"auto"` and `process = TRUE`, ages are computed using the start time of the first spell of each individual as her/his birthdate. For `process = FALSE`, `"auto"` is equivalent to `NULL`. If a data frame, it must contain the ID and the birth time of the individuals. Use `pvar` to specify the column names. The ID is used to match the birth time of each individual with the sequence data. The birth time should be an integer. It is the start time from which the positions on the time axis are computed. It also serves to compute `tmin` and to guess `tmax` when the latter are `NULL`, `from = "SPELL"`, and `process = FALSE`.
`pvar`	List of Integers or Strings. The indexes or names of the columns of the data frame `pdata` that contain the ID and the birth time of the individuals in that order.
`limit`	Integer. Default: `100`. The maximum age of age sequences when `from = "SPELL"` and `process = TRUE`. Age sequences will be considered to start at 1 and to end at `limit`.
`overwrite`	Logical. Default: `TRUE`. When `from = "SPELL"`, if `TRUE`, the most recent episode overwrites the older one when they overlap each other, if `FALSE`, the most recent episode starts after the end of the previous one.
`fillblanks`	Character. Token used to fill gaps between episodes when `from = "SPELL"`.
`tmin`	`NULL` or Integer. Default: `NULL`. When `from = "SPELL"` and `process = FALSE`, start time of the axis. If `NULL`, `tmin` is set as the earliest spell start time (min of `begin`) or, when `pdata` is a data frame, as the earliest birth time of the individuals.
`tmax`	`NULL` or Integer. Default: `NULL`. When `from = "SPELL"` and `process = FALSE`, end time of the axis. If `NULL`, `tmax` is set as the latest spell end time (max of `end`) or, when `pdata` is a data frame, as the sum of the latest spell end time and the latest birth time of the individuals.
`missing`	String. Default: `"*"`. Token used for missing states in `data`. It will be replaced by `NA` in the output data. Ignored when `data` is a state sequence object (see `seqdef`), in which case the attribute `nr` is used as missing value token.
`with.missing`	Logical. Default: `TRUE`. When `to = "SPELL"`, should the spells of missing states be included?
`right`	One of `"DEL"` or `NA`. Default: `"DEL"`. When `to = "SPELL"` and `with.missing=TRUE`, set `right=NA` to include ending spells of missing states.
`compressed`	Deprecated. Use `compress` instead.
`nr`	Deprecated. Use `missing` instead.
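
The prefix/suffix rule described for `SPS.in` and `SPS.out` can be illustrated with a small base R sketch. The helper `sps_couple` below is hypothetical and is not part of TraMineR; it only mimics how `xfix` and `sdsep` shape one state-duration couple, under the rules stated above.

```r
# Hypothetical helper (not a TraMineR function): format one state-duration
# couple following the xfix / sdsep rules described for SPS.in and SPS.out
sps_couple <- function(state, duration, xfix = "()", sdsep = ",") {
  n <- nchar(xfix)
  # First character is the prefix; empty xfix means no prefix/suffix
  prefix <- if (n >= 1) substr(xfix, 1, 1) else ""
  # Second character is the suffix; a single character serves as both
  suffix <- if (n >= 2) substr(xfix, 2, 2) else prefix
  paste0(prefix, state, sdsep, duration, suffix)
}

default_form <- sps_couple("A", 3)                         # "(A,3)"
single_char  <- sps_couple("A", 3, xfix = "|")             # "|A,3|"
bare_form    <- sps_couple("A", 3, xfix = "", sdsep = "/") # "A/3"
print(c(default_form, single_char, bare_form))
```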

Details

The `seqformat` function converts data from one format to another. The input data are first converted into STS format and then into the output format. Depending on the input and output formats, some information can be lost during the conversion. The output is a matrix or a data frame, NOT an `stslist` state sequence object. To process, print, and plot the sequences with TraMineR functions, first transform the returned data into a state sequence object with `seqdef`. See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and conversion between them.
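
To make the formats concrete, here is a minimal base R sketch of what the STS, DSS, and SPS representations of a single sequence look like. It relies on `rle` and is only an illustration of the formats, not TraMineR's implementation; the example sequence is made up.

```r
# One sequence in STS form: one state per time position
sts <- c("A", "A", "A", "B", "B", "C")

# Run-length encoding groups consecutive identical states into spells
runs <- rle(sts)

# DSS: distinct successive states only; durations are lost
dss <- paste(runs$values, collapse = "-")

# SPS: state-duration couples, here with "()" as prefix/suffix and "," as
# state/duration separator
sps <- paste0("(", runs$values, ",", runs$lengths, ")", collapse = "-")

print(dss)  # "A-B-C"
print(sps)  # "(A,3)-(B,2)-(C,1)"
```

The DSS string shows the kind of information loss mentioned above: the original STS sequence cannot be rebuilt from it because the spell durations are gone, whereas the SPS string keeps them.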

When `data` is in SPELL format (`from = "SPELL"`), begin and end times are expected to be positions in the sequences and should therefore be strictly positive integers. With `process = TRUE`, the resulting sequences are aligned on ages (process duration since birth); with `process = FALSE`, they are aligned on dates (position on the calendar time axis). If `process = TRUE`, values in the `begin` and `end` columns of `data` are assumed to be ages when `pdata` is `NULL` and integer dates otherwise. If `process = FALSE`, `begin` and `end` values are assumed to be integer dates when `pdata` is `NULL` and ages otherwise.
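
The idea of begin and end times as positions can be sketched in base R as follows. The spell data, the origin year, and the rule that the origin year maps to position 1 are assumptions made for illustration; this is not the code used by `seqformat`.

```r
# Hypothetical spell data for one individual; begin/end are sequence positions
spells <- data.frame(
  id     = c(1, 1, 1),
  begin  = c(1, 4, 6),
  end    = c(3, 5, 8),
  status = c("S", "U", "W"),
  stringsAsFactors = FALSE
)

# Expand each spell over the positions it covers; a later row overwrites an
# earlier one wherever two spells overlap
sts <- rep(NA_character_, max(spells$end))
for (i in seq_len(nrow(spells))) {
  sts[spells$begin[i]:spells$end[i]] <- spells$status[i]
}
print(paste(sts, collapse = "-"))  # "S-S-S-U-U-W-W-W"

# Calendar dates become process positions once an origin is known.
# Assumption: the origin year itself maps to position 1.
origin    <- 1990                  # hypothetical birth (origin) year
years     <- c(1990, 1993, 1995)   # hypothetical calendar start years
positions <- years - origin + 1
print(positions)                   # 1 4 6
```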

To convert from person-period data, use `from = "SPELL"` and set both `begin` and `end` to the index or name of the time (period) column. Alternatively, use the `reshape` function of the stats package, which is more efficient.
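
As a sketch of the `reshape` route, the following turns a small person-period data frame into one row per individual with one column per period. The data frame `pp` and its column names are made up for the example.

```r
# Hypothetical person-period data: one row per individual and period
pp <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  period = c(1, 2, 3, 1, 2, 3),
  state  = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Long to wide: one row per id, one state column per period (STS layout)
wide <- reshape(pp, idvar = "id", timevar = "period", v.names = "state",
  direction = "wide")
print(wide)
```

The resulting columns are `id`, `state.1`, `state.2`, and `state.3`. With `seqformat`, the same data would be passed with `from = "SPELL"` and both `begin` and `end` set to `"period"`.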

Value

A data frame for SRS, TSE, and SPELL outcomes, otherwise a matrix.

When `from = "SPELL"`, the outcome has an attribute `issues` containing the indexes of sequences with issues (truncated sequences, missing start time, spells before birth year, ...).

Author(s)

Gilbert Ritschard, Alexis Gabadinho, Pierre-Alexandre Fonta, Nicolas S. Müller, Matthias Studer

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.), Advances in Data Management, Springer, 223, 155-175.

Examples

## ========================================
## Examples with raw STS sequences as input
## ========================================

## Attaching the package (needed when running the examples outside the package)
library(TraMineR)

## Loading a data frame with sequence data in the columns 13 to 24
data(actcal)

## Converting to SPS format
actcal.SPS.A <- seqformat(actcal, 13:24, from = "STS", to = "SPS")
head(actcal.SPS.A)

## Converting to compressed SPS format with no
## prefix/suffix and with "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal, 13:24, from = "STS", to = "SPS",
  compress = TRUE, SPS.out = list(xfix = "", sdsep = "/"))
head(actcal.SPS.B)

## Converting to compressed DSS format
actcal.DSS <- seqformat(actcal, 13:24, from = "STS", to = "DSS",
  compress = TRUE)
head(actcal.DSS)


## ==============================================
## Examples with a state sequence object as input
## ==============================================

## Loading a data frame with sequence data in the columns 10 to 25
data(biofam)

## Limiting the number of considered cases to the first 20
biofam <- biofam[1:20, ]

## Creating a state sequence object
biofam.labs <- c("Parent", "Left", "Married", "Left/Married",
  "Child", "Left/Child", "Left/Married/Child", "Divorced")
biofam.short.labs <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam, 10:25, alphabet = 0:7,
  states = biofam.short.labs, labels = biofam.labs)

## Converting to SPELL format
bf.spell <- seqformat(biofam.seq, from = "STS", to = "SPELL",
  pdata = biofam, pvar = c("idhous", "birthyr"))
head(bf.spell)

## Converting to shifted replicated sequences (SRS)
bf.srs <- seqformat(biofam, var = 10:25, from = "STS", to = "SRS",
  covar = c("sex", "plingu02"))
tail(bf.srs)


## ======================================
## Examples with SPELL sequences as input
## ======================================

## Loading two data frames: bfspell20 and bfpdata20
## bfspell20 contains the first 20 biofam sequences in SPELL format
## bfpdata20 contains the IDs and the years at which the
## considered individuals were aged 15
data(bfspell)

## Converting to STS format with alignment on calendar years
bf.sts.y <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)
head(bf.sts.y)

## Converting to STS format with alignment on ages
bf.sts.a <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = TRUE, pdata = bfpdata20, pvar = c("id", "when15"),
  limit = 16)
colnames(bf.sts.a) <- paste0("a", 15:30)
head(bf.sts.a)


## ==================================
## Examples for TSE and SPELL output
## in presence of missing values
## ==================================

data(ex1) ## STS data with missing values
## Creating the state sequence object; by default,
## the trailing missing values are coded as void ('%')
sqex1 <- seqdef(ex1[,1:13])
as.matrix(sqex1)

## Creating state-event transition matrices
ttrans <- seqetm(sqex1, method='transition')
tstate <- seqetm(sqex1, method='state')

## Converting into time stamped events
seqformat(sqex1, from = "STS", to = "TSE", tevent = ttrans)
seqformat(sqex1, from = "STS", to = "TSE", tevent = tstate)

## Converting into vertical spell data
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE, right=NA)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=FALSE)

TraMineR documentation built on Dec. 15, 2025, 3:01 a.m.