split_and_combine_files: Create test train files from a number of files
In MIC: Analysis of Antimicrobial Minimum Inhibitory Concentration Data

split_and_combine_files

R Documentation

Create test train files from a number of files

Description

This function combines files into a train and test set, stored on disk. It can be used in combination with genomes_to_kmer_libsvm() to create a dataset that can be loaded into XGBoost (either by first creating an xgboost::DMatrix, or by using the data argument in xgboost::xgb.train() or xgboost::xgb.cv()). The following three files will be created:

train.txt - the training data
test.txt - the testing data (if split < 1)
names.csv - a csv file containing the original filenames and their corresponding type (train or test)

The function will check if the data is already in the appropriate format and will not overwrite unless forced using the overwrite argument.

By providing 1.0 to the split argument, the function can be used to combine files without a train-test split. In this case, all the files will be classed as 'train', and there will be no 'test' data. This is useful if one wants to perform cross-validation using xgboost::xgb.cv() or MIC::xgb.cv.lowmem(). It is also possible to combine all data into train and then perform splitting after loading into an xgboost::DMatrix, using xgboost::slice().

Usage

split_and_combine_files(
  path_to_files,
  file_ext = ".txt",
  split = 0.8,
  train_target_path = NULL,
  test_target_path = NULL,
  names_backup = NULL,
  shuffle = TRUE,
  overwrite = FALSE
)

Arguments

`path_to_files`	path containing files or vector of filepaths
`file_ext`	file extension to filter
`split`	train-test split
`train_target_path`	name of train file to save as (by default, will be train.txt in the path_to_files directory)
`test_target_path`	name of test file to save as (by default, will be test.txt in the path_to_files directory)
`names_backup`	name of file to save backup of filename metadata (by default, will be names.csv in the path_to_files directory)
`shuffle`	randomise prior to splitting
`overwrite`	overwrite target files

Value

named list of paths to created train/test files, original filenames

Examples

set.seed(123)
# create 10 random libsvm files
tmp_dir <- tempdir()
# remove any existing .txt files
file.remove(
list.files(tmp_dir, pattern = "*.txt", full.names = TRUE)
)
for (i in 1:10) {
 # each line is K: V
 writeLines(paste0(i, ": ", paste0(sample(1:100, 10, replace = TRUE),
 collapse = " ")), file.path(tmp_dir, paste0(i, ".txt")))
 }

 # split files into train and test directories
 paths <- split_and_combine_files(
  tmp_dir,
  file_ext = "txt",
  split = 0.8,
  train_target_path = file.path(tmp_dir, "train.txt"),
  test_target_path = file.path(tmp_dir, "test.txt"),
  names_backup = file.path(tmp_dir, "names.csv"),
  overwrite = TRUE)

 readLines(paths[["train"]])

MIC documentation built on June 10, 2025, 9:14 a.m.