Conversion between sequence formats

Description

Convert a sequence data set from one format to another.

Usage

seqformat(data, var=NULL, id=NULL,
         from, to, compressed=FALSE,
         nrep=NULL, tevent, stsep=NULL, covar=NULL,
         SPS.in=list(xfix="()", sdsep=","),
         SPS.out=list(xfix="()", sdsep=","),
         begin=NULL, end=NULL, status=NULL,
         process=TRUE, pdata=NULL, pvar=NULL,
         limit=100, overwrite=TRUE,
         fillblanks=NULL, tmin=NULL, tmax=NULL, nr="*")
 

Arguments

data

A data frame or matrix containing sequence data.

var

List of columns with the sequence data. Default is NULL, i.e., all columns. Sequences are assumed to be in compressed form (character strings) when there is a single column and in extended form otherwise.
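
The difference between the compressed and extended forms can be illustrated with a minimal base R sketch. It does not call seqformat; the state labels A, B, C and the "-" separator are assumptions chosen for the illustration.

```r
# Extended form: one column per position, one row per sequence.
# The states A, B, C are hypothetical labels.
sts <- data.frame(t1 = c("A", "B"), t2 = c("A", "B"), t3 = c("B", "B"),
                  t4 = c("B", "C"), t5 = c("C", "C"),
                  stringsAsFactors = FALSE)

# Compressed form: one character string per sequence,
# with successive states joined by a separator (here "-").
compressed <- apply(sts, 1, paste, collapse = "-")

# Back to the extended form by splitting each string on the separator.
extended <- do.call(rbind, strsplit(compressed, "-", fixed = TRUE))

print(compressed)
print(extended)
```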

id

Column containing the 'id' of the sequences. Mandatory with from="SPELL" in order to identify the spells that belong to the same sequence.

from

Format of the input data. One of "STS", "SPS", "SPELL". If data is a sequence object, format is automatically set to "STS".

to

Format for output data. One of "STS", "SPS", "SRS", "DSS", "TSE".

compressed

Logical. Should "STS", "SPS" or "DSS" output be compressed into character strings? Ignored for other output formats.
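
As a rough illustration of how the "STS", "DSS" and "SPS" representations relate, the following base R sketch derives the latter two from a single STS sequence with rle. It is not the TraMineR implementation; the example sequence and the "()" and "," conventions (mirroring the SPS.out defaults) are assumptions for the illustration.

```r
# One sequence in STS form: one state per time position (hypothetical data).
sts_seq <- c("A", "A", "B", "B", "B", "C")

# Run-length encoding gives the distinct successive states and their durations.
runs <- rle(sts_seq)

# DSS: the distinct successive states only; durations are dropped.
dss <- runs$values

# SPS: state-duration couples, here written as "(state,duration)".
sps <- paste0("(", runs$values, ",", runs$lengths, ")")

# Compressed versions: a single character string per sequence.
dss_compressed <- paste(dss, collapse = "-")
sps_compressed <- paste(sps, collapse = "-")

print(dss_compressed)
print(sps_compressed)
```

Note that going from STS to DSS loses the durations, which is one case of the information loss mentioned in the Details section.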

nrep

Number of shifted replications for output in "SRS" format.

tevent

Transition definition matrix for converting to time-stamped-event ("TSE") format. Should be a matrix of size d * d where d is the number of distinct states appearing in the sequences. In this matrix, the cell (i,j) lists the events associated with a transition from state i to state j.
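
A transition definition matrix of this kind can be sketched by hand in base R. The state names, the event names and the time stamp convention (index of the last position spent in the origin state) are all hypothetical choices for this illustration, and the lookup below is not the conversion seqformat performs internally.

```r
# Hypothetical states; d = 3, so the matrix is 3 x 3.
states <- c("single", "married", "divorced")
tevent <- matrix(NA_character_, nrow = 3, ncol = 3,
                 dimnames = list(from = states, to = states))

# Cell (i, j) holds the event associated with a transition from i to j.
tevent["single", "married"]   <- "marriage"
tevent["married", "divorced"] <- "divorce"
tevent["divorced", "married"] <- "remarriage"

# Look up the events implied by one STS sequence.
sts_seq <- c("single", "single", "married", "married", "divorced")
change <- which(sts_seq[-1] != sts_seq[-length(sts_seq)])

events <- data.frame(
  time  = change,  # assumed convention: last position in the origin state
  event = tevent[cbind(sts_seq[change], sts_seq[change + 1])],
  stringsAsFactors = FALSE
)
print(events)
```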

stsep

Separator character between successive elements in compressed (character string) input data. If NULL (default), the seqfcheck function is called to detect the separator automatically among "-" and ":". Any other separator must be specified explicitly.

covar

When from="STS" or from="SPS", additional column names to be included as covariates in the output data frame. When to="SRS" the covariates are replicated across the shifted replicated rows. Default is NULL. Ignored when from="SPELL".

SPS.in

List with the xfix= and sdsep= specifications for the state-duration couples in input data in SPS form. The first specification, xfix, specifies the prefix/suffix character (use a two-character string if the prefix and suffix differ and set xfix="" when no prefix/suffix are present). The second one, sdsep, specifies the state/duration separator.

SPS.out

List with the xfix and sdsep specifications for output in SPS format. See argument SPS.in above.

nr

Symbol used for the missing state in input "SPS" format. It is converted to NA in the "STS" representation.
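
To make the roles of xfix, sdsep and nr concrete, here is a minimal base R sketch that expands one compressed SPS string into an STS vector. It assumes the default conventions (xfix="()", sdsep=",", nr="*") and a "-" separator between couples; the input string is invented, and the parsing is only an illustration, not the routine used by seqformat.

```r
# A compressed SPS sequence (hypothetical), with "*" marking a missing state.
sps_string <- "(A,2)-(*,1)-(B,2)"

# Split into state-duration couples and strip the "(" prefix and ")" suffix.
couples <- strsplit(sps_string, "-", fixed = TRUE)[[1]]
couples <- gsub("[()]", "", couples)

# Split each couple on the state/duration separator.
parts <- strsplit(couples, ",", fixed = TRUE)
states <- vapply(parts, `[`, character(1), 1)
durations <- as.integer(vapply(parts, `[`, character(1), 2))

# Expand to STS: repeat each state for its duration.
sts_seq <- rep(states, times = durations)

# Replace the missing-state symbol with NA.
sts_seq[sts_seq == "*"] <- NA
print(sts_seq)
```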

begin

When converting from SPELL, the column with the beginning position of the spell. (Positions must be integer values!)

end

When converting from SPELL, the column with the end position of the spell. (Positions must be integer values!)

status

When converting from SPELL, the column with the status.

process

Logical. When converting from SPELL, should sequences be created on a process time axis? Default is TRUE. Set to FALSE to create sequences on a calendar time axis.

pdata

When converting from SPELL and process=TRUE, either NULL, "auto" or the name of the data frame containing the individual 'birth' time, that is, the initial time from which the process time is computed. If NULL (default), the starting and ending times of each spell are assumed to be ages. If "auto", ages are computed using the starting time of the first spell of each individual as her/his birth date. If external birth dates are provided, pdata must contain two columns: an id to match the birth time with the SPELL data, and a 'birth' time.

pvar

When pdata is a data frame, a vector of two names or numbers, the first one specifying the column with the individual 'id', and the second one the 'birth' time.
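
The idea of moving spells from calendar time to a process time axis can be sketched in base R as follows. The data are invented, and the offset convention (the 'birth' time maps to position 1) is an assumption made for this illustration; it is not taken from the TraMineR source.

```r
# Hypothetical spell data on a calendar time axis (years).
spell <- data.frame(id = c(1, 1, 2),
                    begin = c(1990L, 1995L, 1992L),
                    end = c(1994L, 1999L, 1996L),
                    status = c("A", "B", "A"),
                    stringsAsFactors = FALSE)

# External 'birth' times, one row per individual (the role played by pdata).
pdata <- data.frame(id = c(1, 2), birth = c(1990L, 1991L))

# Match each spell with the birth time of its individual.
birth <- pdata$birth[match(spell$id, pdata$id)]

# Process time positions, assuming the birth time is position 1.
pbegin <- spell$begin - birth + 1L
pend <- spell$end - birth + 1L

# Analogue of pdata = "auto": use the start of each individual's first spell.
first_begin <- ave(spell$begin, spell$id, FUN = min)
pbegin_auto <- spell$begin - first_begin + 1L

print(data.frame(id = spell$id, pbegin, pend, pbegin_auto))
```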

limit

When converting from SPELL, size of the resulting data frame when creating age sequences (by default, sequences range from age 1 to age 100).

overwrite

When converting from SPELL, if overwrite is set to TRUE, the most recent episode overwrites the older one when they overlap. If set to FALSE, in case of overlap the most recent episode starts after the end of the previous one.

fillblanks

When converting from SPELL, if fillblanks is not NULL, gaps between episodes are filled with the fillblanks character value.
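
The effect of overwrite and fillblanks can be illustrated with a small base R sketch that expands the spells of one individual into an STS vector. The helper spell_to_sts is hypothetical (it is not part of TraMineR), it assumes the spells are sorted by their begin position, and it only approximates the behaviour described above.

```r
# Hypothetical spells of one individual; the second overlaps the first,
# and there is a gap between the second and the third.
spell <- data.frame(begin = c(1L, 3L, 8L), end = c(4L, 5L, 9L),
                    status = c("A", "B", "C"), stringsAsFactors = FALSE)

# Hypothetical helper: expand spells (sorted by begin) into `len` positions.
spell_to_sts <- function(spell, len, overwrite = TRUE, fillblanks = NULL) {
  sts <- rep(NA_character_, len)
  prev_end <- 0L
  for (k in seq_len(nrow(spell))) {
    b <- spell$begin[k]
    e <- spell$end[k]
    # Without overwriting, a later spell starts after the previous one ends.
    if (!overwrite) b <- max(b, prev_end + 1L)
    if (b <= e) sts[b:e] <- spell$status[k]
    prev_end <- max(prev_end, e)
  }
  if (!is.null(fillblanks)) {
    # Fill only the gaps lying between the first and last observed positions.
    filled <- which(!is.na(sts))
    if (length(filled) > 0) {
      inner <- seq(min(filled), max(filled))
      gaps <- inner[is.na(sts[inner])]
      sts[gaps] <- fillblanks
    }
  }
  sts
}

print(spell_to_sts(spell, 10))
print(spell_to_sts(spell, 10, overwrite = FALSE))
print(spell_to_sts(spell, 10, fillblanks = "X"))
```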

tmin

Integer. When converting from SPELL with process=FALSE, defines the starting time of the axis. If NULL, the minimum time is taken from the 'begin' column in the data.

tmax

Integer. When converting from SPELL with process=FALSE, defines the ending time of the axis. If NULL, the value is guessed from the data, which may be inaccurate.

Details

The seqformat function converts sequence data from one format to another. The input data is first converted into the STS format and then into the output format. Depending on the input and output formats, some information can be lost in the conversion. The output is a matrix, NOT a sequence object that can be passed to TraMineR functions for plotting and mining sequences (use the seqdef function for that). See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and on converting between them.

Value

A data frame

Author(s)

Alexis Gabadinho, Nicolas S. Müller and Matthias Studer (with Gilbert Ritschard for the help page)

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.), Advances in Data Management, Springer, 223, 155-175.

See Also

seqdef

Examples

## Converting sequences into SPS format
library(TraMineR)
data(actcal)
actcal.SPS.A <- seqformat(actcal, 13:24, from="STS", to="SPS")
head(actcal.SPS.A)

## SPS (compressed) format with no prefix/suffix and "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal, 13:24,
    from="STS", to="SPS", compressed=TRUE,
    SPS.out=list(xfix="", sdsep="/"))
head(actcal.SPS.B)

## Converting sequences into DSS (compressed) format
actcal.DSS <- seqformat(actcal, 13:24,
    from="STS", to="DSS", compressed=TRUE)
head(actcal.DSS)

## Converting from SPELL to STS format
##  bfspell20 contains the first 20 biofam sequences in SPELL format
##  bfpdata20 contains the ids and the year when aged 15 for those cases
data(bfspell) ## includes bfspell20 and bfpdata20
bf.sts <- seqformat(bfspell20, from="SPELL", to="STS", process=TRUE,
    id='id', begin='begin', end='end', status='states', pdata=bfpdata20,
    pvar=c('id','when15'), limit=16)
names(bf.sts) <- paste0('a',15:30)
head(bf.sts)
