split_and_combine_files: Create test train files from a number of files

View source: R/data_converters.R

Description

This function combines files into a train set and a test set, both stored on disk. It can be used in combination with genomes_to_kmer_libsvm() to create a dataset that can be loaded into XGBoost (either by first creating an xgboost::xgb.DMatrix, or by using the data argument in xgboost::xgb.train() or xgboost::xgb.cv()). The following three files will be created:

  1. train.txt - the training data

  2. test.txt - the testing data (if split < 1)

  3. names.csv - a CSV file containing the original filenames and their corresponding type (train or test)

The function checks whether the data are already in the appropriate format and does not overwrite existing files unless forced with the overwrite argument.

Setting the split argument to 1.0 combines the files without a train-test split. In this case, all the files are classed as 'train' and no 'test' data are written. This is useful for cross-validation with xgboost::xgb.cv() or MIC::xgb.cv.lowmem(). It is also possible to combine all data into train and then split after loading into an xgboost::xgb.DMatrix, using xgboost::slice().
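The behaviour described above can be illustrated with a minimal base-R sketch. This is not the MIC implementation: combine_split_sketch() is a hypothetical helper, and its rounding rule, the column names of names.csv, and the choice to raise an error when target files already exist are all assumptions made for illustration.

```r
# Hypothetical helper (not part of MIC): a base-R sketch of the
# shuffle / split / combine idea described above.
combine_split_sketch <- function(files, out_dir, split = 0.8,
                                 shuffle = TRUE, overwrite = FALSE) {
  targets <- file.path(out_dir, c("train.txt", "test.txt", "names.csv"))
  # Assumption: refuse to touch existing target files unless forced
  if (!overwrite && any(file.exists(targets))) {
    stop("target files already exist; set overwrite = TRUE")
  }

  if (shuffle) files <- sample(files)              # randomise file order
  n_train <- round(split * length(files))          # assumed rounding rule
  type <- rep(c("train", "test"),
              times = c(n_train, length(files) - n_train))

  # Concatenate the lines of every file of one type into a single file
  write_set <- function(set, target) {
    lines <- as.character(unlist(lapply(files[type == set], readLines)))
    writeLines(lines, target)
    target
  }
  train_path <- write_set("train", targets[1])
  test_path <- if (split < 1) write_set("test", targets[2]) else NULL

  # Record which original file ended up in which set
  # (column names here are assumptions)
  write.csv(data.frame(name = basename(files), type = type),
            targets[3], row.names = FALSE)

  list(train = train_path, test = test_path, names = targets[3])
}

# Demo on ten one-line files in a scratch directory
demo_dir <- file.path(tempdir(), "split_sketch_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- file.path(demo_dir, paste0("f", 1:10, ".txt"))
for (f in demo_files) writeLines(basename(f), f)

out <- combine_split_sketch(demo_files, demo_dir, split = 0.8,
                            overwrite = TRUE)
length(readLines(out$train))  # 8 of the 10 one-line files go to train
```

Calling the sketch with split = 1 writes every file to train.txt and returns NULL for the test path, mirroring the no-split case described above.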

Usage

split_and_combine_files(
  path_to_files,
  file_ext = ".txt",
  split = 0.8,
  train_target_path = NULL,
  test_target_path = NULL,
  names_backup = NULL,
  shuffle = TRUE,
  overwrite = FALSE
)

Arguments

path_to_files

path to a directory containing the files, or a character vector of file paths

file_ext

file extension used to filter the files to combine

split

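proportion of files assigned to the training set (1.0 means all files are classed as train and no test data are created)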

train_target_path

path at which to save the train file (by default, train.txt in the path_to_files directory)

test_target_path

path at which to save the test file (by default, test.txt in the path_to_files directory)

names_backup

path at which to save the backup of the filename metadata (by default, names.csv in the path_to_files directory)

shuffle

whether to randomise the order of the files before splitting

overwrite

whether to overwrite existing target files

Value

A named list containing the paths to the created train and test files and the original filenames.

Examples

library(MIC)

set.seed(123)
# create 10 random libsvm-style files in a temporary directory
tmp_dir <- tempdir()
# remove any existing .txt files (list.files() expects a regex, not a glob)
file.remove(
  list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
)
for (i in 1:10) {
  # each file holds a single line of the form "K: V V V ..."
  writeLines(
    paste0(i, ": ", paste0(sample(1:100, 10, replace = TRUE), collapse = " ")),
    file.path(tmp_dir, paste0(i, ".txt"))
  )
}

# combine the files into a train file and a test file
paths <- split_and_combine_files(
  tmp_dir,
  file_ext = "txt",
  split = 0.8,
  train_target_path = file.path(tmp_dir, "train.txt"),
  test_target_path = file.path(tmp_dir, "test.txt"),
  names_backup = file.path(tmp_dir, "names.csv"),
  overwrite = TRUE
)

# inspect the combined training data
readLines(paths[["train"]])

MIC documentation built on April 12, 2025, 2:26 a.m.