train_model (R Documentation)
Train a neural network on genomic data. Data can be fasta/fastq files, rds files or a prepared data set. If the data is given as a collection of fasta, fastq or rds files, the function will create a data generator that extracts training and validation batches from the files. The function includes several options to control the generator's sampling strategy and the preprocessing of the data. Training progress can be visualized in tensorboard, and model weights can be stored during training using checkpoints.
train_model(
model = NULL,
dataset = NULL,
dataset_val = NULL,
train_val_ratio = 0.2,
run_name = "run_1",
initial_epoch = 0,
class_weight = NULL,
print_scores = TRUE,
epochs = 10,
max_queue_size = 100,
steps_per_epoch = 1000,
path_checkpoint = NULL,
path_tensorboard = NULL,
path_log = NULL,
save_best_only = NULL,
save_weights_only = FALSE,
tb_images = FALSE,
path_file_log = NULL,
reset_states = FALSE,
early_stopping_time = NULL,
validation_only_after_training = FALSE,
train_val_split_csv = NULL,
reduce_lr_on_plateau = TRUE,
lr_plateau_factor = 0.9,
patience = 20,
cooldown = 1,
model_card = NULL,
callback_list = NULL,
train_type = "label_folder",
path = NULL,
path_val = NULL,
batch_size = 64,
step = NULL,
shuffle_file_order = TRUE,
vocabulary = c("a", "c", "g", "t"),
format = "fasta",
ambiguous_nuc = "zero",
seed = c(1234, 4321),
file_limit = NULL,
use_coverage = NULL,
set_learning = NULL,
proportion_entries = NULL,
sample_by_file_size = FALSE,
n_gram = NULL,
n_gram_stride = 1,
masked_lm = NULL,
random_sampling = FALSE,
add_noise = NULL,
return_int = FALSE,
maxlen = NULL,
reverse_complement = FALSE,
reverse_complement_encoding = FALSE,
output_format = "target_right",
proportion_per_seq = NULL,
read_data = FALSE,
use_quality_score = FALSE,
padding = FALSE,
concat_seq = NULL,
target_len = 1,
skip_amb_nuc = NULL,
max_samples = NULL,
added_label_path = NULL,
add_input_as_seq = NULL,
target_from_csv = NULL,
target_split = NULL,
shuffle_input = TRUE,
vocabulary_label = NULL,
delete_used_files = FALSE,
reshape_xy = NULL,
return_gen = FALSE
)
model
A keras model.

dataset
List of training data holding the training samples in RAM instead of using a generator. Should be a list with two named entries, one holding the input and one holding the target.

dataset_val
List of validation data. Should have the same two-entry structure as dataset.

train_val_ratio
For the generator, defines the fraction of batches that will be used for validation (compared to the size of the training data), i.e. one validation iteration processes batch_size * steps_per_epoch * train_val_ratio samples.
run_name
Name of the run; used to identify output from callbacks.

initial_epoch
Epoch at which to start training. Note that the network will run for (epochs - initial_epoch) epochs.

class_weight
List of weights for the output. Order should correspond to vocabulary_label.
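As a sketch of how such a weight list might be built, the snippet below computes inverse-frequency weights for two classes in base R. The class counts are made up for illustration, and the "0"/"1" index naming follows the keras convention; check how the order maps onto vocabulary_label for your run.

```r
# Hypothetical class counts for an imbalanced two-class problem.
class_counts <- c(label_1 = 900, label_2 = 100)
total <- sum(class_counts)
# Weight each class by total / (n_classes * count) so that rare
# classes contribute more to the loss.
class_weight <- as.list(total / (length(class_counts) * class_counts))
# Keras expects weights keyed by class index (as character): "0", "1", ...
names(class_weight) <- as.character(seq_along(class_counts) - 1)
class_weight
```

With these counts, the rare class label_2 receives a weight nine times larger than label_1.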
print_scores
Whether to print train/validation scores during training.

epochs
Number of training epochs.

max_queue_size
Maximum size of the generator queue.

steps_per_epoch
Number of training batches per epoch.

path_checkpoint
Path to the checkpoint folder, or NULL to store no checkpoints.

path_tensorboard
Path to the tensorboard directory, or NULL for no tensorboard logging.

path_log
Path to a directory to write training scores to. The file name is derived from run_name.

save_best_only
Only save the model when it improved on some score. Not applied if the argument is NULL.

save_weights_only
Whether to save weights only.

tb_images
Whether to show custom images (confusion matrix) in the tensorboard "IMAGES" tab.

path_file_log
Write the names of the files used for training to a csv file if a path is specified.

reset_states
Whether to reset the hidden states of RNN layers at every new input file and before/after validation.

early_stopping_time
Time in seconds after which to stop training.

validation_only_after_training
Whether to skip validation during training and only run one validation iteration after training.

train_val_split_csv
A csv file specifying the train/validation split. The file should contain one column of file names and one column assigning each file to the training or validation set.
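A minimal sketch of such a split file, written with base R. The column names "file" and "type" and the values "train"/"val" are assumptions for illustration; check the package documentation for the exact names expected.

```r
# Hypothetical train/validation assignment table.
split_df <- data.frame(
  file = c("a.fasta", "b.fasta", "c.fasta"),
  type = c("train", "train", "val")
)
csv_path <- tempfile(fileext = ".csv")
# Write without row names so the csv only contains the two columns.
write.csv(split_df, csv_path, row.names = FALSE)
```

The resulting path could then be passed as train_val_split_csv to train_model().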
reduce_lr_on_plateau
Whether to use a learning rate scheduler that reduces the learning rate when the validation loss plateaus.

lr_plateau_factor
Factor by which to decrease the learning rate when a plateau is reached.

patience
Number of epochs to wait for a decrease in validation loss before reducing the learning rate.

cooldown
Number of epochs without changing the learning rate after a reduction.

model_card
List of arguments describing the parameters of the training run. Must contain at least an entry specifying where to store the model card.

callback_list
Additional callbacks to use during training.

train_type
Training task type. The default "label_folder" assigns one class label per input folder; other types derive targets from headers, csv files, rds files or train a (masked) language model.
path
Path to the training data. For train_type = "label_folder", should be a vector or list of paths with one entry per class.

path_val
Path to the validation data. See path.

batch_size
Number of samples used for one network update.

step
Frequency of sampling steps.

shuffle_file_order
Boolean; whether to go through the files sequentially or shuffle them beforehand.

vocabulary
Vector of allowed characters. Characters outside the vocabulary get encoded as specified in ambiguous_nuc.

format
File format, "fasta", "fastq" or "rds".

ambiguous_nuc
How to handle nucleotides outside the vocabulary. The default "zero" encodes them as zero vectors.
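The "zero" option can be illustrated with a small base-R sketch (not the package's internal implementation): characters in the vocabulary become one-hot rows, anything else becomes a row of zeros.

```r
vocabulary <- c("a", "c", "g", "t")

one_hot_zero <- function(seq_chars, vocabulary) {
  # One row per character, one column per vocabulary entry.
  m <- matrix(0, nrow = length(seq_chars), ncol = length(vocabulary),
              dimnames = list(NULL, vocabulary))
  idx <- match(seq_chars, vocabulary)  # NA for ambiguous characters
  hit <- which(!is.na(idx))
  m[cbind(hit, idx[hit])] <- 1         # ambiguous rows stay all zero
  m
}

one_hot_zero(c("a", "c", "n", "t"), vocabulary)
```

Here the "n" in the third position yields a row of zeros, while the other three characters get standard one-hot rows.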
seed
Sets the seed for reproducible results.

file_limit
Integer or NULL. If an integer, limits the number of files used for training.

use_coverage
Integer or NULL. If not NULL, use coverage as the encoding rather than one-hot encoding.

set_learning
Use when you want to assign one label to a set of samples. Only implemented for train_type = "label_folder".

proportion_entries
Proportion of fasta entries to keep. For example, if a fasta file has 50 entries and proportion_entries = 0.1, then 5 randomly chosen entries are used.

sample_by_file_size
Sample a new file weighted by file size (bigger files are more likely).

n_gram
Integer; encode the target not nucleotide-wise but by combining n nucleotides at once. For example, for n_gram = 2 each dinucleotide ("aa", "ac", "ag", ...) gets its own encoding.

n_gram_stride
Step size for n-gram encoding. For AACCGGTT with n_gram = 4 and n_gram_stride = 2, the tokens are AACC, CCGG, GGTT.
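The token positions implied by n_gram and n_gram_stride can be reproduced with base R string functions (an illustrative sketch, not the package's encoder):

```r
ngram_tokens <- function(seq_string, n_gram, n_gram_stride) {
  # Start positions advance by n_gram_stride; the last window must
  # still fit completely inside the sequence.
  starts <- seq(1, nchar(seq_string) - n_gram + 1, by = n_gram_stride)
  vapply(starts, function(i) substr(seq_string, i, i + n_gram - 1),
         character(1))
}

ngram_tokens("AACCGGTT", n_gram = 4, n_gram_stride = 2)
# "AACC" "CCGG" "GGTT"
```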
masked_lm
If not NULL, a list of arguments for masked language model training.

random_sampling
Whether samples should be taken from random positions when using max_samples.

add_noise
NULL or a list of arguments specifying noise to add to the input.

return_int
Whether to return integer encoding or one-hot encoding.

maxlen
Length of the predictor sequence.

reverse_complement
Boolean; for every new file, decide randomly whether to use the original data or its reverse complement.

reverse_complement_encoding
Whether to use both the original sequence and its reverse complement as two input sequences.

output_format
Determines the shape of the output tensor for a language model. The default "target_right" predicts the character(s) to the right of the input sequence.
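How "target_right" pairs predictors and targets can be sketched in base R (illustrative only, using the maxlen and target_len arguments described above):

```r
# Split a sequence into a predictor of length maxlen and the
# target_len characters directly to its right.
target_right_sample <- function(seq_string, start, maxlen, target_len = 1) {
  x <- substr(seq_string, start, start + maxlen - 1)
  y <- substr(seq_string, start + maxlen, start + maxlen + target_len - 1)
  list(x = x, y = y)
}

target_right_sample("AACCGGTT", start = 1, maxlen = 5)
# list(x = "AACCG", y = "G")
```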
proportion_per_seq
Numerical value between 0 and 1. Proportion of the sequence to take samples from (use a random subsequence).

read_data
If TRUE, the input is treated as paired-read data.

use_quality_score
Whether to use fastq quality scores. If TRUE, the quality scores are used as probabilities in the encoding rather than hard one-hot values.

padding
Whether to pad sequences that are too short for one sample with zeros.

concat_seq
Character string or NULL. If not NULL, all entries of a file are concatenated into one sequence, separated by the concat_seq string.

target_len
Number of nucleotides to predict at once for a language model.

skip_amb_nuc
Threshold of ambiguous nucleotides to accept in a fasta entry. Entries above the threshold are discarded entirely.

max_samples
Maximum number of samples to use from one file. If not NULL, a random subset of max_samples samples is drawn from files that yield more.

added_label_path
Path to a file with additional input labels. Should be a csv file with one column named "file"; the other columns should correspond to labels.

add_input_as_seq
Boolean vector specifying for each entry in added_label_path whether the label input is a sequence and should be encoded like the main input.

target_from_csv
Path to a csv file with the target mapping. One column should be called "file"; the other entries in each row are the targets.

target_split
If the target is read from a csv file, a list of column-name groups used to divide the target tensor into a list of tensors, one tensor per group of label columns.
shuffle_input
Whether to shuffle the entries within a file.

vocabulary_label
Character vector of possible targets. Targets outside vocabulary_label get discarded.

delete_used_files
Whether to delete a file once it has been used. Only applies to rds files.

reshape_xy
Can be a list of functions to apply to the input and/or target. List elements (containing the reshape functions) must be called x for the input or y for the target, and each must have arguments called x and y.
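Based on the description above, a reshape_xy list might look like the following sketch. The two functions are arbitrary placeholders chosen only to show the calling convention: each receives both x and y and returns the reshaped value.

```r
reshape_xy <- list(
  x = function(x, y) x + 1,   # placeholder: shift the input
  y = function(x, y) x + y    # placeholder: mix input into the target
)

# Applying the functions to dummy input/target values:
x <- 1; y <- 2
new_x <- reshape_xy$x(x, y)   # 2
new_y <- reshape_xy$y(x, y)   # 3
```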
return_gen
Whether to return the train and validation generators (instead of training).
Value

A list of training metrics.

Examples
# create dummy data
path_train_1 <- tempfile()
path_train_2 <- tempfile()
path_val_1 <- tempfile()
path_val_2 <- tempfile()
for (current_path in c(path_train_1, path_train_2,
path_val_1, path_val_2)) {
dir.create(current_path)
create_dummy_data(file_path = current_path,
num_files = 3,
seq_length = 10,
num_seq = 5,
vocabulary = c("a", "c", "g", "t"))
}
# create model
model <- create_model_lstm_cnn(layer_lstm = 8, layer_dense = 2, maxlen = 5)
# train model
hist <- train_model(train_type = "label_folder",
model = model,
path = c(path_train_1, path_train_2),
path_val = c(path_val_1, path_val_2),
batch_size = 8,
epochs = 3,
steps_per_epoch = 6,
step = 5,
format = "fasta",
vocabulary_label = c("label_1", "label_2"))