process_data: Process the data


process_data R Documentation

Process the data

Description

Takes a dataset and its metadata and creates R data frames in the format required by the subsequent steps.

Usage

process_data(data_file_path, metadata_file_path)

Arguments

data_file_path

Path to the dataset

metadata_file_path

Path to the metadata file

Details

The metadata should contain the following information as a minimum.

variable: the name of the variable. This should match the column names of the dataset.

data_type: 'numerical' for continuous variables, 'count' for count variables, 'binary' for binary categorical variables, 'nominal' for unordered categorical variables with more than 2 levels, 'ordinal' for ordered categorical variables, 'date' for variables stored as dates, and 'time' for variables containing the time of day.

Optional information includes the following.

reference: the reference category for binary and nominal variables. This should be a category that exists in the variable.

ordinal_levels: the levels of ordinal data from lowest to highest, separated by ";". This must include all the levels in the data.

You can use the guess_data_types function to get a starting point for the metadata: its output list includes a metadata table.
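The metadata rules above can be checked before calling process_data. The following is a minimal sketch in base R, not part of the package: check_metadata, toy_data and toy_metadata are hypothetical names introduced here for illustration, and the checks cover only the rules stated in this section (matching variable names, allowed data_type values, and complete ordinal_levels). The checks that process_data itself performs may differ.

```r
# Hypothetical helper (not part of the package): returns a character vector
# describing every rule from the Details section that the metadata breaks.
check_metadata <- function(data, metadata) {
  allowed_types <- c("numerical", "count", "binary", "nominal",
                     "ordinal", "date", "time")
  variable <- as.character(metadata$variable)
  data_type <- as.character(metadata$data_type)
  problems <- character(0)

  # Each metadata variable should match a column name of the dataset
  missing_columns <- setdiff(variable, names(data))
  if (length(missing_columns) > 0) {
    problems <- c(problems, paste0("Not in dataset: ",
                                   paste(missing_columns, collapse = ", ")))
  }

  # data_type should be one of the documented values
  bad_types <- setdiff(data_type, allowed_types)
  if (length(bad_types) > 0) {
    problems <- c(problems, paste0("Unknown data_type: ",
                                   paste(bad_types, collapse = ", ")))
  }

  # ordinal_levels should include all the levels present in the data
  ordinal_rows <- which(data_type == "ordinal" &
                          !is.na(metadata$ordinal_levels) &
                          variable %in% names(data))
  for (i in ordinal_rows) {
    declared <- strsplit(as.character(metadata$ordinal_levels[i]),
                         ";", fixed = TRUE)[[1]]
    observed <- unique(as.character(na.omit(data[[variable[i]]])))
    undeclared <- setdiff(observed, declared)
    if (length(undeclared) > 0) {
      problems <- c(problems, paste0("Levels of ", variable[i],
                                     " missing from ordinal_levels: ",
                                     paste(undeclared, collapse = ", ")))
    }
  }
  problems
}

# Toy inputs for illustration only: 'age' is not a column of toy_data,
# and level 3 of 'grade' is not declared in ordinal_levels.
toy_data <- data.frame(grade = c(1, 2, 3, 2), sex = c(0, 1, 1, 0))
toy_metadata <- data.frame(
  variable = c("grade", "sex", "age"),
  data_type = c("ordinal", "binary", "numerical"),
  ordinal_levels = c("1;2", NA, NA)
)
toy_problems <- check_metadata(toy_data, toy_metadata)
print(toy_problems)
```

An empty result means none of the three rules was broken. It does not guarantee that process_data will accept the files.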

Value

outcome

Whether the operation was successfully performed

message

Any information, particularly when the operation fails.

data_processed

The data, modified according to the metadata, when correct parameters are provided.

any_type

All fields.

quantitative

Fields recognised as quantitative.

numerical

Fields recognised as continuous.

count

Fields recognised as count.

categorical

Fields recognised as categorical data.

nominal

Fields recognised as nominal data.

binary

Fields recognised as binary data.

ordinal

Fields recognised as ordinal data.

date

Fields recognised as date.

time

Fields recognised as time.
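Because the function returns a list with many named elements, it can help to confirm which of them are present before using the result. The sketch below is an illustration only: missing_elements and stand_in are hypothetical names, the stand-in list is hand-made rather than real output, and the types of the individual elements are not stated in this page, so none are assumed.

```r
# Element names taken from the Value section above
documented_elements <- c("outcome", "message", "data_processed", "any_type",
                         "quantitative", "numerical", "count", "categorical",
                         "nominal", "binary", "ordinal", "date", "time")

# Hypothetical helper (not part of the package): lists the documented
# elements that are absent from a result list.
missing_elements <- function(result) {
  setdiff(documented_elements, names(result))
}

# Hand-made stand-in for illustration only; a real list would come from
# process_data(data_file_path, metadata_file_path).
stand_in <- list(outcome = NA, message = NA, data_processed = data.frame())
absent <- missing_elements(stand_in)
print(absent)
```

With a real result, the message element is the first place to look when the operation fails.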

Author(s)

Kurinchi Gurusamy

See Also

guess_data_types

Examples

library(survival)
# Use the colon dataset from the survival package as an example
# Keep only the death records (etype == 2) for these examples
data_file_path <- paste0(tempdir(), "/df.csv")
write.csv(colon[colon$etype == 2, ], data_file_path, row.names = FALSE, na = "")
metadata <- {data.frame(
  variable = c("id","study","rx","sex","age",
               "obstruct","perfor","adhere","nodes","status",
               "differ","extent","surg","node4","time",
               "etype"),
  data_type = c("nominal", "nominal", "nominal", "binary", "numerical",
                "binary", "binary", "binary", "count", "binary",
                "ordinal", "ordinal", "binary", "binary", "numerical",
                "nominal"),
  reference = c(NA, NA, "Obs", 0, NA,
                0, 0, 0, NA, 0,
                NA, NA, 0, 0, NA,
                NA),
  ordinal_levels = c(NA, NA, NA, NA, NA,
                     NA, NA, NA, NA, NA,
                     "1;2;3", "1;2;3;4", NA, NA, NA,
                     NA),
  comments = NA
)}
metadata_file_path <- paste0(tempdir(), "/metadata.csv")
write.csv(metadata, metadata_file_path, row.names = FALSE, na = "")
processed_data <- process_data(data_file_path, metadata_file_path)

EQUALPrognosis documentation built on Feb. 4, 2026, 5:15 p.m.