dUtility data-frame manipulations

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mets)

Simple data manipulation for data-frames

Here are some key data-manipulation steps on a data-frame which is how we typically organize our data in R. After having read the data into R it will typically be a data-frame, if not we can force it to be a data-frame. The basic idea of the utility functions is to get a simple and easy to type way of making simple data-manipulation on a data-frame much like what is possible in SAS or STATA.

The functions, say, dcut, dfactor and so on are all functions that basically does what the base R cut, factor do, but are easier to use in the context of data-frames and have additional functionality.

library(mets)
data(melanoma)
is.data.frame(melanoma)

Here we work on the melanoma data that is already read into R and is a data-frame.

dUtility functions

The structure for all functions is

to use the function on y in a dataframe grouped by x if condition ifcond is valid. The basic functions are

Data processing * dsort * dreshape * dcut * drm, drename, ddrop, dkeep, dsubset * drelevel * dlag * dfactor, dnumeric

Data aggregation * dby, dby2 * dscalar, deval, daggregate * dmean, dsd, dsum, dquantile, dcor * dtable, dcount

Data summaries * dhead, dtail, * dsummary, * dprint, dlist, dlevels, dunique

A generic function daggregate, daggr, can be called with a function as the argument

without the grouping variable (x)

A useful feature is that y and x as well as the subset condition can be specified using regular-expressions or by wildcards (default). Here to illustrate this, we compute the means of certain variables.

First just oveall

dmean(melanoma,~thick+I(log(thick)))

now only when days>500

dmean(melanoma,~thick+I(log(thick))|I(days>500))

and now after sex but only when days>500

dmean(melanoma,thick+I(log(thick))~sex|I(days>500))

and finally after quartiles of days (via the dcut function)

dmean(melanoma,thick+I(log(thick))~I(dcut(days)))

or summary of all variables starting with "s" and that contains "a"

dmean(melanoma,"s*"+"*a*"~sex|I(days>500))

Renaming, deleting, keeping, dropping variables

melanoma=drename(melanoma,tykkelse~thick)
names(melanoma)

Deleting variables

data(melanoma)
melanoma=drm(melanoma,~thick+sex)
names(melanoma)

or sas style

data(melanoma)
melanoma=ddrop(melanoma,~thick+sex)
names(melanoma)

alternatively we can also keep certain variables

data(melanoma)
melanoma=dkeep(melanoma,~thick+sex+status+days)
names(melanoma)

This can also be done with direct asignment

data(melanoma)
ddrop(melanoma) <- ~thick+sex
names(melanoma)

The dkeep function can also be used to re-ordering the variables in the data-frame

data(melanoma)
names(melanoma)
melanoma=dkeep(melanoma,~days+status+.)
names(melanoma)

Looking at the data

data(melanoma)
dstr(melanoma)

The data can in Rstudio be seen as a data-table but to list certain parts of the data in output window

dlist(melanoma)
dlist(melanoma, ~.|sex==1)
dlist(melanoma, ~ulc+days+thick+sex|sex==1)

Getting summaries

dsummary(melanoma)

or for specfic variables

dsummary(melanoma,~thick+status+sex)

Summaries in different groups (sex)

dsummary(melanoma,thick+days+status~sex)

and only among those with thin-tumours or only females (sex==1)

dsummary(melanoma,thick+days+status~sex|thick<97)
dsummary(melanoma,thick+status~+1|sex==1)

or

dsummary(melanoma,~thick+status|sex==1)

To make more complex conditions need to use the I()

dsummary(melanoma,thick+days+status~sex|I(thick<97 & sex==1))

Tables between variables

dtable(melanoma,~status+sex)

All bivariate tables

dtable(melanoma,~status+sex+ulc,level=2)

All univariate tables

dtable(melanoma,~status+sex+ulc,level=1)

and with new variables

dtable(melanoma,~status+sex+ulc+dcut(days)+I(days>300),level=1)

Sorting the data

To sort the data

data(melanoma)
mel= dsort(melanoma,~days)
dsort(melanoma) <- ~days
head(mel)

and to sort after multiple variables increasing and decreasing

dsort(melanoma) <- ~days-status
head(melanoma)

Making new variales for the analysis

To define a bunch of new covariates within a data-frame

data(melanoma)
melanoma= transform(melanoma, thick2=thick^2, lthick=log(thick) ) 
dhead(melanoma)

When the above definitions are done using a condition this can be achieved using the dtransform function that extends transform with a possible condition

 melanoma=dtransform(melanoma,ll=thick*1.05^ulc,sex==1)  
 melanoma=dtransform(melanoma,ll=thick,sex!=1)  
 dmean(melanoma,ll~sex+ulc)

Making factors (groupings)

On the melanoma data the variable thick gives the thickness of the melanom tumour. For some analyses we would like to make a factor depending on the thickness. This can be done in several different ways

melanoma=dcut(melanoma,~thick,breaks=c(0,200,500,800,2000))

New variable is named thickcat.0 by default.

To see levels of factors in data-frame

dlevels(melanoma)

Checking group sizes

dtable(melanoma,~thickcat.0)

With adding to the data-frame directly

dcut(melanoma,breaks=c(0,200,500,800,2000)) <- gr.thick1~thick
dlevels(melanoma)

new variable is named thickcat.0 (after first cut-point), or to get quartiles with default names thick.cat.4

dcut(melanoma) <- ~ thick  # new variable is thickcat.4
dlevels(melanoma)

or median groups, here starting again with the original data,

data(melanoma)
dcut(melanoma,breaks=2) <- ~ thick  # new variable is thick.2
dlevels(melanoma)

to control new names

data(melanoma)
mela= dcut(melanoma,thickcat4+dayscat4~thick+days,breaks=4)
dlevels(mela)

or

data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4+dayscat4~thick+days
dlevels(melanoma)

This can also be typed out more specifically

melanoma$gthick = cut(melanoma$thick,breaks=c(0,200,500,800,2000))
melanoma$gthick = cut(melanoma$thick,breaks=quantile(melanoma$thick),include.lowest=TRUE)

Working with factors

To see levels of covariates in data-frame

data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4~thick
dlevels(melanoma) 

To relevel the factor

dtable(melanoma,~thickcat4)
melanoma = drelevel(melanoma,~thickcat4,ref="(194,356]")
dlevels(melanoma)

or to take the third level in the list of levels, same as above,

melanoma = drelevel(melanoma,~thickcat4,ref=2)
dlevels(melanoma)

To combine levels of a factor (first combinining first 3 groups into one)

melanoma = drelevel(melanoma,~thickcat4,newlevels=1:3)
dlevels(melanoma)

or to combine groups 1 and 2 into one group and 3 and 4 into another

dkeep(melanoma) <- ~thick+thickcat4
melanoma = drelevel(melanoma,gthick2~thickcat4,newlevels=list(1:2,3:4))
dlevels(melanoma)

Changing order of factor levels

dfactor(melanoma,levels=c(3,1,2,4)) <-  thickcat4.2~thickcat4
dlevel(melanoma,~ "thickcat4*")
dtable(melanoma,~thickcat4+thickcat4.2)

Combine levels but now control factor-level names

melanoma=drelevel(melanoma,gthick3~thickcat4,newlevels=list(group1.2=1:2,group3.4=3:4))
dlevels(melanoma)

Making a factor from existing numeric variable and vice versa

A numeric variable "status" with values 1,2,3 into a factor by

data(melanoma)
melanoma = dfactor(melanoma,~status, labels=c("malignant-melanoma","censoring","dead-other"))
melanoma = dfactor(melanoma,sexl~sex,labels=c("females","males"))
dtable(melanoma,~sexl+status.f)

A gender factor with values "M", "F" can be converted into numerics by

melanoma = dnumeric(melanoma,~sexl)
dstr(melanoma,"sex*")
dtable(melanoma,~'sex*',level=2)

SessionInfo

sessionInfo()


Try the mets package in your browser

Any scripts or data that you put into this service are public.

mets documentation built on May 29, 2024, 3:51 a.m.