train: Train a machine learning model to classify images

View source: R/train.R

Description

train allows users to train their own machine learning model using images that have been manually classified. We recommend having at least 500 images per species, but accuracy will be higher with more than 10,000 images. Training will take a very long time, so we recommend using a GPU if possible.

In the data_info csv, you must have two columns with no headers. Column 1 must be the file name of the image. Column 2 must be a number corresponding to the species: give each species (or group of species) a number identifying it. The first species must be 0, the next species 1, and so on. You can use the make_input function for help making this csv.

If this is your first time using this function, see the additional documentation at https://github.com/mikeyEcology/MLWIC2 . This function uses absolute paths, but if you are unfamiliar with absolute paths, you can put all of your images, the image label csv ("data_info"), and the trained_model folder that you downloaded following the directions at https://github.com/mikeyEcology/MLWIC2 into one directory on your computer. Then set your working directory to this location and the function will find the absolute paths for you.
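The header-less, two-column layout of the data_info csv can be sketched in base R. The file names and species below are invented for illustration, and make_input remains the package's own helper for this step; this is only a minimal sketch of the expected file format.

```r
# Illustrative data only: these file names and species are made up.
labels <- data.frame(
  file    = c("img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"),
  species = c("deer", "elk", "deer", "coyote"),
  stringsAsFactors = FALSE
)

# Map each species to an integer ID starting at 0 (alphabetical order here).
species_levels <- sort(unique(labels$species))
labels$class_id <- match(labels$species, species_levels) - 1L

# Write file name and class ID only, with no header row and no row names.
out_file <- file.path(tempdir(), "image_labels.csv")
write.table(
  labels[, c("file", "class_id")],
  file = out_file, sep = ",",
  row.names = FALSE, col.names = FALSE, quote = FALSE
)

readLines(out_file)
```

Keeping species_levels around is useful later, because it is the only record of which number stands for which species.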

Usage

train(
  path_prefix = paste0(getwd(), "/images"),
  data_info = paste0(getwd(), "/image_labels.csv"),
  model_dir = paste0(getwd(), "/MLWIC2_helper_files"),
  python_loc = "/anaconda2/bin/",
  os = "Mac",
  num_gpus = 2,
  num_classes = 59,
  delimiter = ",",
  architecture = "resnet",
  depth = "18",
  batch_size = 128,
  log_dir = "species_model",
  log_dir_train = "MLWIC2_train_output",
  retrain = TRUE,
  retrain_from = "species_model",
  num_epochs = 55,
  top_n = 5,
  num_cores = 1,
  randomize = TRUE,
  max_to_keep = 5,
  print_cmd = FALSE,
  shiny = FALSE
)
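
Because the defaults for path_prefix, data_info, and model_dir are all built from the working directory, it can help to confirm those locations exist before starting a long training run. The helper below, check_train_paths, is hypothetical and not part of MLWIC2; it is a sketch that only mirrors the default locations shown above.

```r
# Hypothetical helper (not part of MLWIC2): report whether the default
# locations used by train() exist under a given base directory.
check_train_paths <- function(base_dir = getwd()) {
  paths <- c(
    path_prefix = file.path(base_dir, "images"),
    data_info   = file.path(base_dir, "image_labels.csv"),
    model_dir   = file.path(base_dir, "MLWIC2_helper_files")
  )
  data.frame(
    argument = names(paths),
    path     = unname(paths),
    exists   = unname(file.exists(paths)),
    stringsAsFactors = FALSE
  )
}

# Demonstration in a scratch directory that contains only an images folder.
demo_dir <- file.path(tempdir(), "mlwic2_demo")
dir.create(file.path(demo_dir, "images"), recursive = TRUE, showWarnings = FALSE)
status <- check_train_paths(demo_dir)
print(status)
```

In the demonstration only the images folder is reported as present; in real use, call check_train_paths() with no arguments after setting the working directory.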

Arguments

path_prefix

Absolute path to location of the images on your computer

data_info

csv with file names for each photo (relative path to file). This file must have no headers (column names). Column 1 must be the file name of each image including the extension (e.g., .jpg). Column 2 must be a number corresponding to the species. Give each species (or group of species) a number identifying it. The first species must be 0, the next species 1, and so on.
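
The requirements on this file can be checked before training. The function check_data_info below is a hypothetical helper, not part of MLWIC2, and it assumes that "0, 1, and so on" means the class numbers should have no gaps; treat it as a sketch of the rules listed above rather than the package's own validation.

```r
# Hypothetical checker (not part of MLWIC2) for the data_info layout.
check_data_info <- function(csv_path, delimiter = ",") {
  info <- read.table(csv_path, sep = delimiter, header = FALSE,
                     stringsAsFactors = FALSE, comment.char = "")
  problems <- character(0)
  if (ncol(info) != 2) {
    return("file must have exactly two columns")
  }
  ids <- info[[2]]
  if (!is.numeric(ids) || anyNA(ids) || any(ids < 0) ||
      any(ids != floor(ids))) {
    # A header row also lands here, because it makes column 2 non-numeric.
    problems <- c(problems, "column 2 must hold non-negative whole numbers")
  } else if (!setequal(ids, seq(0, max(ids)))) {
    problems <- c(problems, "class numbers must run 0, 1, 2, ... with no gaps")
  }
  if (any(!grepl("\\.[A-Za-z0-9]+$", info[[1]]))) {
    problems <- c(problems, "every file name needs an extension such as .jpg")
  }
  problems
}

# A well-formed file and a malformed one, both written to temporary files.
good <- tempfile(fileext = ".csv")
writeLines(c("a.jpg,0", "b.jpg,1", "c.jpg,0"), good)
bad <- tempfile(fileext = ".csv")
writeLines(c("a.jpg,1", "b,3"), bad)

check_data_info(good)  # no problems reported
check_data_info(bad)   # gap in class numbers and a missing extension
```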

model_dir

Absolute path to the location where you stored the MLWIC2_helper_files folder that you downloaded from github.

python_loc

The location of python on your machine.

os

The operating system you are using. If you are using Windows, set this to "Windows"; otherwise leave as default.

num_gpus

The number of GPUs available. If you are using a CPU, leave this as default.

num_classes

The number of classes (species or groups of species) in your model.

delimiter

this will be a ',' for a csv.

architecture

the architecture of the deep neural network (DNN). Resnet-18 is the default. Other options are c("alexnet", "densenet", "googlenet", "nin", "vgg")

depth

the number of layers in the DNN. If you are using resnet, the options are c(18, 34, 50, 101, 152). If you are using densenet, the options are c(121, 161, 169, 201). If you are using an architecture other than resnet or densenet, the number of layers will be automatically set.
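
The architecture and depth combinations listed above can be written as a small lookup. The helper depth_is_valid is hypothetical and not part of MLWIC2; it encodes only the options stated in this documentation.

```r
# Depth options as listed in this documentation.
valid_depths <- list(
  resnet   = c(18, 34, 50, 101, 152),
  densenet = c(121, 161, 169, 201)
)
# For these architectures the depth is set automatically.
other_architectures <- c("alexnet", "googlenet", "nin", "vgg")

# Hypothetical helper (not part of MLWIC2).
depth_is_valid <- function(architecture, depth) {
  if (architecture %in% other_architectures) return(TRUE)  # depth is ignored
  if (!architecture %in% names(valid_depths)) return(FALSE)
  # depth may be given as a string such as "18", as in the Usage section.
  as.numeric(depth) %in% valid_depths[[architecture]]
}

depth_is_valid("resnet", "18")
depth_is_valid("resnet", 121)
depth_is_valid("densenet", 121)
```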

batch_size

the number of images simultaneously passed to the model for training. It must be a multiple of 16. Smaller batch sizes will train models that are more accurate, but training will take longer. The default is 128.
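
If a batch size is chosen from available memory rather than by hand, it can be snapped to a multiple of 16. The helper valid_batch_size is hypothetical and not part of MLWIC2; rounding down with a minimum of 16 is one possible choice, not a rule from the package.

```r
# Hypothetical helper (not part of MLWIC2): round a requested batch size
# down to the nearest multiple of 16, never going below 16.
valid_batch_size <- function(requested) {
  max(16, (requested %/% 16) * 16)
}

valid_batch_size(100)  # 96
valid_batch_size(128)  # 128
valid_batch_size(10)   # 16
```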

log_dir_train

directory where you will store the model information. This is the name you will later specify in the log_dir option of the classify function. You will want to use unique names if you are training multiple models on your computer; otherwise they will be overwritten.
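
One way to keep names unique across training runs is to build them from a label and a timestamp. The helper make_log_dir_name and its naming scheme are hypothetical, not an MLWIC2 convention.

```r
# Hypothetical helper (not part of MLWIC2): build a directory name from a
# short label and the time of the run, e.g. for use as log_dir_train.
make_log_dir_name <- function(label, when = Sys.time()) {
  paste0("MLWIC2_train_", label, "_", format(when, "%Y%m%d_%H%M%S"))
}

# A fixed time is used here so the result is reproducible.
fixed_time <- as.POSIXct("2021-02-18 11:46:00", tz = "UTC")
run_name <- make_log_dir_name("resnet18", fixed_time)
run_name
```

Record the name you used, since the same value is needed for log_dir when calling classify.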

retrain

If TRUE, the model you train will be a retraining of the model you specify in 'retrain_from'. If FALSE, you are starting training from scratch. Retraining will be faster but training from scratch will be more flexible.

retrain_from

name of the directory from which you want to retrain the model.

num_epochs

the number of epochs you want to use for training. The default is 55 and this is recommended for training a full model. But if you need to start and stop training, you may want to use a smaller number at times.

top_n

The number of guesses that you want the model to save. This needs to be less than or equal to the number of classes.

num_cores

The number of cores you want to use. You can find the number on your computer using parallel::detectCores()
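
A minimal sketch of choosing num_cores from parallel::detectCores(). Leaving one core free for the rest of the system is an assumption made here, not a recommendation from the package.

```r
# detectCores() can return NA on some platforms, so fall back to 1.
available <- parallel::detectCores()
num_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
num_cores
```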

randomize

If TRUE, this will randomize the order in which images are passed to training

max_to_keep

maximum number of snapshot files to keep. These are the snapshots that are taken of the current version of the model at the end of each epoch.

print_cmd

print the system command instead of running the function. This is for development.


mikeyEcology/MLWIC2 documentation built on Feb. 18, 2021, 11:46 a.m.