This is a quick start tutorial for training a model using GCP AutoML Tables.
Create a Google Cloud Storage bucket and save its name for use in an upcoming step.
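If you prefer to create the bucket from R, one option is the googleCloudStorageR package (not one of this tutorial's required packages, so this is a sketch; the bucket and project names are placeholders):

```r
library(googleCloudStorageR)

# "my-gcat-bucket" and "my-project" are placeholders - use your own names.
# AutoML Tables reads training data from buckets in the us-central1 region.
gcs_create_bucket("my-gcat-bucket",
                  projectId = "my-project",
                  location = "us-central1")
```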
Run the following chunk to install the required R packages for this tutorial (it checks whether each package is already installed and installs only the missing ones):

```r
list.of.packages <- c("remotes", "googleAuthR")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

remotes::install_github("justinjm/googleCloudAutoMLTablesR")
```
Create a file called `.Renviron` in your project's working directory and set the following environment variables:

- `GAR_SERVICE_JSON` - path to the service account (JSON) key file you downloaded earlier
- `GCAT_DEFAULT_PROJECT_ID` - the ID of the GCP project you configured earlier
- `GCAT_DEFAULT_REGION` - region of your GCP resources, one of: "us-central1" or "eu"

For example, your `.Renviron` should look like this:

```
# .Renviron
GAR_SERVICE_JSON="/Users/me/auth/auth.json"
GCAT_DEFAULT_PROJECT_ID="my-project"
GCAT_DEFAULT_REGION="us-central1"
```

Save the `.Renviron` file and restart your R session to load the environment variables.
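To confirm the variables were loaded after the restart, you can check them with `Sys.getenv()` (a quick sanity check, not part of the package itself):

```r
# Each value should be non-empty after restarting R
Sys.getenv(c("GAR_SERVICE_JSON", "GCAT_DEFAULT_PROJECT_ID", "GCAT_DEFAULT_REGION"))
```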
Load the two R packages we need and then authenticate using the service account we just created:

```r
library(googleAuthR)
library(googleCloudAutoMLTablesR)

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")
gar_auth_service(json_file = Sys.getenv("GAR_SERVICE_JSON"))
```
To make things easier later on, we set some global arguments for our GCP project and the region of the GCP resources within our project:

```r
projectId <- Sys.getenv("GCAT_DEFAULT_PROJECT_ID")
location <- "us-central1"

gcat_region_set("us-central1")
gcat_project_set(projectId)
```
To verify that authentication works, list the datasets in your project:

```r
datasets_list <- gcat_list_datasets()
datasets_list
```
To simplify code for other API calls, you can set a dataset name to a global environment variable:
```r
gcat_global_dataset("test_01_bq")
```
Then you can retrieve it like so:
```r
gcat_get_global_dataset()
```
The example workflow in the rest of this vignette uses this function to keep the code as concise as possible.
There are two options for loading data into AutoML Tables: importing from Google Cloud Storage (GCS) or importing from BigQuery. At a high level, the workflow for either source is the same: create an AutoML Tables dataset, import your data into it, then sanity check the results.

To load data into AutoML Tables from Google Cloud Storage, first do the following: upload your CSV file to a bucket in the us-central1 region (for example, with the googleCloudStorageR package).
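To upload the CSV to Cloud Storage from R, a minimal sketch using the googleCloudStorageR package (the bucket and file names below are placeholders, substitute your own):

```r
library(googleCloudStorageR)

# Upload a local CSV so AutoML Tables can read it from GCS.
# "gcatr-dev" and "bank_marketing.csv" are example names - use your own.
gcs_upload(file = "bank_marketing.csv",
           bucket = "gcatr-dev",
           name = "bank_marketing.csv")
```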
Now we are ready to create an AutoML Tables dataset.

Create an AutoML Tables dataset (with a unique `displayName`):

```r
gcat_dataset <- gcat_create_dataset(displayName = "test_03_gcs")
gcat_dataset
```
Execute the import from Google Cloud Storage:

```r
# set url as separate parameter to keep line length under 80
gs_url <- "gs://gcatr-dev/bank_marketing.csv"

gcat_import_job <- gcat_import_data(displayName = "test_03_gcs",
                                    input_source = "gcs",
                                    input_url = gs_url)
gcat_import_job
```
To load data into AutoML Tables from BigQuery, first do the following: load your data into a BigQuery table (for example, with the bigQueryR or bigrquery package).
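One way to load a data frame into BigQuery from R is with the bigrquery package (a sketch only; the project, dataset, table names, and the `bank_marketing_df` data frame are all placeholders):

```r
library(bigrquery)

# Placeholder identifiers - substitute your own project/dataset/table,
# and a data frame (bank_marketing_df) holding your training data.
bq_dataset_create(bq_dataset("my-project", "gcatr_dev"), location = "US")

bq_tbl <- bq_table("my-project", "gcatr_dev", "bank_marketing")
bq_table_upload(bq_tbl, values = bank_marketing_df)
```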
Create an AutoML Tables dataset (with a unique `displayName`):

```r
gcat_dataset <- gcat_create_dataset(displayName = "test_01_bq")
gcat_dataset
```
Execute the import from BigQuery:

```r
# set url as separate parameter to keep line length under 80
bq_url <- "bq://gc-automl-tables-r.gcatr_dev.bank_marketing"

gcat_dataset_import <- gcat_import_data(dataset_display_name = "test_01_bq",
                                        input_source = "bq",
                                        input_url = bq_url)
gcat_dataset_import
```
Once the import - from GCS or BigQuery - has completed, you'll receive an email notification. Locate the `datasetId` in the GCP console or via the `gcat_list_datasets()` function (it will be in the form `TBL123456789`). We can then sanity check the results.
Now we're ready to view the results and get the `tableSpecId` we need for later functions:

```r
dataset <- gcat_get_dataset()
dataset
```
AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
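As an alternative to updating the schema after import, you can recode such a column before uploading. For instance, if a numeric 0/1 column actually encodes yes/no categories, a plain base-R sketch (the column name and labels here are illustrative):

```r
# A numeric 0/1 outcome that actually encodes categories
df <- data.frame(V16 = c(0, 1, 1, 0))

# Recode to character so AutoML Tables infers a categorical type
df$V16 <- ifelse(df$V16 == 1, "yes", "no")
df$V16
```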
Let's first view the table schema and the columns:

```r
table_spec <- gcat_get_table_specs()
table_spec
```

```r
column_specs_list <- gcat_list_column_specs()
column_specs_list
```

```r
column_spec <- gcat_get_column_spec(columnDisplayName = "V16")
column_spec
```
Set V16, the outcome column, as the label:

```r
dataset <- gcat_set_target_column(columnDisplayName = "V16")
dataset
```
Training (creating) a model is done with `gcat_create_model()`. This is a long-running operation; you will receive an email when the model training completes.

```r
gcat_model <- gcat_create_model(
  datasetDisplayName = "test_01_bq",
  columnDisplayName = "V16",
  modelDisplayName = "test_01_bq_01",
  optimizationObjective = "MINIMIZE_LOG_LOSS",
  trainBudgetMilliNodeHours = 1000)
gcat_model
```
More details here: Training models | AutoML Tables documentation.
While you wait - or after your latest model training completes - you can list the trained models:

```r
models_list <- gcat_list_models()
models_list
```
After the model has trained, you can retrieve the trained model with the following:

```r
gcat_model <- gcat_get_model(modelDisplayName = "test_01_bq_01")
gcat_model
```
After training has been completed, you can review various performance statistics on the model, such as the accuracy, precision, recall, and so on. The metrics are returned in a nested data structure:

```r
model_evaluation <- gcat_list_model_evaluations(
  projectId = projectId,
  locationId = location,
  modelDisplayName = "test_01_bq_01"
)
model_evaluation
```
There are two different prediction modes: online and batch. The following chunk shows how to make a batch prediction:

```r
batch_predictions <- gcat_batch_predict(
  modelDisplayName = "test_01_bq_01",
  inputSource = "gs://gcatr-dev/bank_marketing_batch_01.csv",
  outputTarget = "gs://gcatr-dev/predictions/"
)
batch_predictions
```