seqformat: Conversion between sequence formats
In TraMineR: Trajectory Miner: a Sequence Analysis Toolkit

Description

Convert a sequence data set from one format to another.

Usage

seqformat(data, var = NULL, from, to, compress = FALSE, nrep = NULL, tevent,
  stsep = NULL, covar = NULL, SPS.in = list(xfix = "()", sdsep = ","),
  SPS.out = list(xfix = "()", sdsep = ","), id = 1, begin = 2, end = 3,
  status = 4, process = TRUE, pdata = NULL, pvar = NULL, limit = 100,
  overwrite = TRUE, fillblanks = NULL, tmin = NULL, tmax = NULL, missing = "*",
  with.missing = TRUE, right = "DEL", compressed, nr)

Arguments

`data`	Data frame, matrix, `stslist` state sequence object, or character string vector. The data to use. (A tibble is converted with `as.data.frame`.) When `from = "STS"` or `from = "SPS"`: a data frame or a matrix with sequence data in one or more columns. If the sequence data are in a single column or in a string vector, they are assumed to be in the compressed form (see `stsep`). When `from = "SPELL"`: a data frame with sequence data in one or more columns. If the data do not consist of four columns ordered as individual ID, spell start time, spell end time, and spell state status, use `var` or `id` / `begin` / `end` / `status`. When `from = "STS"` or `from` is not specified: a state sequence object.
`var`	`NULL`, List of Integers or Strings. Default: `NULL`. The indexes or the names of the columns with the sequence data in `data`. If `NULL`, all columns are considered.
`from`	String. The format of the input sequence data. It can be `"STS"`, `"SPS"`, or `"SPELL"`. It is not needed if `data` is a state sequence object.
`to`	String. The format of the output data. It can be `"STS"`, `"DSS"`, `"SPS"`, `"SRS"`, `"SPELL"`, or `"TSE"`.
`compress`	Logical. Default: `FALSE`. When `to = "STS"`, `to = "DSS"`, or `to = "SPS"`, should the sequences (row vector of states) be concatenated into strings? See `seqconc`.
`nrep`	Integer. The number of shifted replications when `to = "SRS"`.
`tevent`	Matrix. The transition-definition matrix when `to = "TSE"`. It should be of size `d * d` where `d` is the number of distinct states appearing in the sequences. The cell `(i,j)` lists the events associated with a transition from state `i` to state `j`. It can be created with `seqetm`.
`stsep`	`NULL`, Character. Default: `NULL`. The separator between states in the compressed form (strings) when `from = "STS"` or `from = "SPS"`. If `NULL`, `seqfcheck` is called to detect the separator automatically among `"-"` and `":"`. Other separators must be specified explicitly. See `seqdecomp`.
`covar`	List of Integers or Strings. The indexes or the names of additional columns in `data` to include as covariates in the output when `to = "SRS"`. The covariates are replicated across the shifted replicated rows.
`SPS.in`	List. Default: `list(xfix = "()", sdsep = ",")`. The specifications for the state-duration couples in the input data when `from = "SPS"`. The first specification, `xfix`, specifies the prefix/suffix character. Use a two-character string if the prefix and the suffix differ. Use `xfix = ""` when no prefix/suffix are present. The second specification, `sdsep`, specifies the state/duration separator.
`SPS.out`	List. Default: `list(xfix = "()", sdsep = ",")`. The specifications for the state-duration couples in the output data when `to = "SPS"`. See `SPS.in` above.
`id`	`NULL`, Integer, String, List of Integers or Strings. Default: `1`. When `from = "SPELL"`, the index or the name of the column containing the individual IDs in `data` (after `var` filtering). When `to = "TSE"`, the index or the name of the column containing the individual IDs in `data` (after `var` filtering), or the unique individual IDs themselves. If `id` is not specified manually, it is set to `NULL` for backward compatibility with the behaviour of TraMineR 1.8-13. If `id` is `NULL`, whether set manually or automatically, the original individual IDs are ignored and replaced by the indexes of the sequences in the input data. When `from = "SPELL"` and `to = "TSE"`, the index or the name of the column containing the individual IDs in `data` (after `var` filtering); the TSE output then uses the original individual IDs.
`begin`	Integer or String. Default: `2`. The index or the name of the column containing the spell start times in `data` (after `var` filtering) when `from = "SPELL"`. Start times should be positive integers.
`end`	Integer or String. Default: `3`. The index or the name of the column containing the spell end times in `data` (after `var` filtering) when `from = "SPELL"`. End times should be positive integers.
`status`	Integer or String. Default: `4`. The index or the name of the column containing the spell statuses in `data` (after `var` filtering) when `from = "SPELL"`.
`process`	Logical. Default: `TRUE`. When `from = "SPELL"`: if `TRUE`, create sequences on a process time axis; if `FALSE`, create sequences on a calendar time axis. The `process` argument and the associated `pdata` and `pvar` arguments are intended for `data` containing spell data with calendar begin and end times. When those times are ages, use `process = FALSE` with `pdata = NULL` to use those ages as process times. The option `process = TRUE` does not currently work for age times.
`pdata`	`NULL`, `"auto"`, or data frame. Default: `NULL`. (A tibble is converted with `as.data.frame`.) If `NULL` and `from = "SPELL"`, the start and end times of each spell are assumed to be ages if `process = TRUE` and years if `process = FALSE`. If `"auto"`, `from = "SPELL"`, and `process = TRUE`, ages are computed using the start time of the first spell of each individual as their birth date. For `from = "SPELL"` and `process = FALSE`, `"auto"` is equivalent to `NULL`. If a data frame, it contains the ID and the birth time of the individuals when `from = "SPELL"` or `to = "SPELL"`. Use `pvar` to specify the column names. The ID is used to match the birth time of each individual with the sequence data. The birth time should be an integer. It is the start time from which the positions on the time axis are computed. It also serves to compute `tmin` and to guess `tmax` when these are `NULL`, `from = "SPELL"`, and `process = FALSE`.
`pvar`	List of Integers or Strings. The indexes or names of the columns of the data frame `pdata` that contain the ID and the birth time of the individuals in that order.
`limit`	Integer. Default: `100`. The maximum age of age sequences when `from = "SPELL"` and `process = TRUE`. Age sequences will be considered to start at 1 and to end at `limit`.
`overwrite`	Logical. Default: `TRUE`. When `from = "SPELL"`: if `TRUE`, the most recent episode overwrites the older one where they overlap; if `FALSE`, the most recent episode starts after the end of the previous one in case of overlap.
`fillblanks`	Character. The value to fill gaps between episodes when `from = "SPELL"`.
`tmin`	`NULL` or Integer. Default: `NULL`. The start time of the axis when `from = "SPELL"` and `process = FALSE`. If `NULL`, the value is the minimum of the spell start times (see `begin`) or the minimum of the birth times of the individuals (see `pdata` when it is a data frame and `process = FALSE`).
`tmax`	`NULL` or Integer. Default: `NULL`. The end time of the axis when `from = "SPELL"` and `process = FALSE`. If `NULL`, the value is the maximum of the spell end times (see `end`) or the sum of the maximum of the spell end times and the maximum of the birth times of the individuals (see `pdata` when it is a data frame and `process = FALSE`).
`missing`	String. Default: `"*"`. The code for missing states in `data`. It will be replaced by `NA` in the output data. Ignored when `data` is a state sequence object (see `seqdef`), in which case the attribute `nr` is used as missing value code.
`with.missing`	Logical. Default: `TRUE`. When `to = "SPELL"`, should the spells of missing states be included?
`right`	One of `"DEL"` or `NA`. Default: `"DEL"`. When `to = "SPELL"` and `with.missing=TRUE`, set `right=NA` to include the end spells of missing states.
`compressed`	Deprecated. Use `compress` instead.
`nr`	Deprecated. Use `missing` instead.

Details

The seqformat function is used to convert data from one format to another. The input data is first converted into the STS format and then converted to the output format. Depending on input and output formats, some information can be lost in the conversion process. The output is a matrix or a data frame, NOT a sequence stslist object. To process, print or plot the sequences with TraMineR functions, you will have to first transform the data frame into a stslist state sequence object with seqdef. See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and converting between them.
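That the result is a plain matrix or data frame, not a state sequence object, can be checked directly. The following is a minimal sketch, assuming the TraMineR package is installed. It reuses the `bfspell20` data and the column names from the Examples section and only adds a call to `seqdef` with its default settings.

```r
## Minimal sketch; assumes the TraMineR package is installed
library(TraMineR)

## bfspell20: first 20 biofam sequences in SPELL format
data(bfspell)

## SPELL to STS, aligned on calendar years (same call as in the Examples)
bf.sts <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)

## The converted data are not a state sequence object
inherits(bf.sts, "stslist")

## Build the state sequence object needed by the other TraMineR functions
bf.seq <- seqdef(bf.sts)
inherits(bf.seq, "stslist")
```
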

When data are in "SPELL" format (from = "SPELL"), the begin and end times are expected to be positions in the sequences. Therefore they should be strictly positive integers. With process=TRUE, the outcome sequences will be aligned on ages (process duration since birth), while with process=FALSE they will be aligned on dates (position on the calendar time). If process=TRUE, values in the begin and end columns of data are assumed to be ages when pdata is NULL and integer dates otherwise. If process=FALSE, begin and end values are assumed to be integer dates when pdata is NULL and ages otherwise.

To convert from person-period data, use from = "SPELL" and set both begin and end to the column index or name of the time variable. Alternatively, use the reshape function of the stats package, which is more efficient.
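As an illustration of the reshape route, here is a minimal sketch using only base R. The data frame `pp` and its columns `id`, `time`, and `state` are hypothetical toy data, not part of TraMineR.

```r
## Hypothetical person-period data: one row per individual and time point
pp <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  time  = c(1, 2, 3, 1, 2, 3),
  state = c("A", "A", "B", "B", "B", "C"),
  stringsAsFactors = FALSE
)

## Reshape from long (person-period) to wide (one state column per time point)
wide <- reshape(pp, idvar = "id", timevar = "time", v.names = "state",
  direction = "wide")
rownames(wide) <- NULL
print(wide)
```

The wide table has one row per individual and one state column per time point, which is the STS layout. It can then be passed to `seqdef`, with `var` selecting the state columns.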

Value

A data frame for SRS, TSE, and SPELL, a matrix otherwise.

When from = "SPELL", the output has an attribute issues holding the indexes of the sequences with issues (truncated sequences, missing start time, spells before birth year, ...).

Author(s)

Alexis Gabadinho, Pierre-Alexandre Fonta, Nicolas S. Müller, Matthias Studer, and Gilbert Ritschard.

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.), Advances in Data Management, Springer, 223, 155-175.

Examples

## ========================================
## Examples with raw STS sequences as input
## ========================================

## Loading a data frame with sequence data in the columns 13 to 24
data(actcal)

## Converting to SPS format
actcal.SPS.A <- seqformat(actcal, 13:24, from = "STS", to = "SPS")
head(actcal.SPS.A)

## Converting to compressed SPS format with no
## prefix/suffix and with "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal, 13:24, from = "STS", to = "SPS",
  compress = TRUE, SPS.out = list(xfix = "", sdsep = "/"))
head(actcal.SPS.B)

## Converting to compressed DSS format
actcal.DSS <- seqformat(actcal, 13:24, from = "STS", to = "DSS",
  compress = TRUE)
head(actcal.DSS)


## ==============================================
## Examples with a state sequence object as input
## ==============================================

## Loading a data frame with sequence data in the columns 10 to 25
data(biofam)

## Limiting the number of considered cases to the first 20
biofam <- biofam[1:20, ]

## Creating a state sequence object
biofam.labs <- c("Parent", "Left", "Married", "Left/Married",
  "Child", "Left/Child", "Left/Married/Child", "Divorced")
biofam.short.labs <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam, 10:25, alphabet = 0:7,
  states = biofam.short.labs, labels = biofam.labs)

## Converting to SPELL format
bf.spell <- seqformat(biofam.seq, from = "STS", to = "SPELL",
  pdata = biofam, pvar = c("idhous", "birthyr"))
head(bf.spell)


## ======================================
## Examples with SPELL sequences as input
## ======================================

## Loading two data frames: bfspell20 and bfpdata20
## bfspell20 contains the first 20 biofam sequences in SPELL format
## bfpdata20 contains the IDs and the years at which the
## considered individuals were aged 15
data(bfspell)

## Converting to STS format with alignment on calendar years
bf.sts.y <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)
head(bf.sts.y)

## Converting to STS format with alignment on ages
bf.sts.a <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = TRUE, pdata = bfpdata20, pvar = c("id", "when15"),
  limit = 16)
names(bf.sts.a) <- paste0("a", 15:30)
head(bf.sts.a)


## ==================================
## Examples for TSE and SPELL output
## in presence of missing values
## ==================================

data(ex1) ## STS data with missing values
## Creating the state sequence object; by default,
## missing values at the end are coded as void ('%')
sqex1 <- seqdef(ex1[,1:13])
as.matrix(sqex1)

## Creating state-event transition matrices
ttrans <- seqetm(sqex1, method='transition')
tstate <- seqetm(sqex1, method='state')

## Converting into time stamped events
seqformat(sqex1, from = "STS", to = "TSE", tevent = ttrans)
seqformat(sqex1, from = "STS", to = "TSE", tevent = tstate)

## Converting into vertical spell data
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE, right=NA)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=FALSE)

TraMineR documentation built on April 12, 2025, 1:53 a.m.