Summary

This is a quick start tutorial for training a model using GCP AutoML Tables.

Setup

1. Configure your own Google Cloud Platform project

  1. Select or create a GCP project.. When you first create an account, you get a $300 free credit towards your compute/storage costs.
  2. Make sure that billing is enabled for your project.
  3. Enable the AI Platform APIs and Compute Engine APIs.
  4. Enable AutoML API.

2. Create a service account and grant permissions

  1. In the GCP Console, go to the Create service account key page.
  2. From the Service account drop-down list, select New service account.
  3. In the Service account name field, enter a name.
  4. From the Role drop-down list, select AutoML > AutoML Admin, Storage > Storage Admin and BigQuery > BigQuery Admin.
  5. Click Create. A JSON file that contains your key downloads to your local environment.
  6. Save the json file and copy the pathname for an upcoming step

3. Create Google Cloud Storage Bucket

Create a Google Cloud Storage Bucket and save the name for use in an upcoming step.

4. Configure R environment

Install R packages

Run the following chunk to install the required R packages to complete this tutorial (checking to see if they are installed first and only install if not already):

list.of.packages <- c("remotes", "googleAuthR")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

remotes::install_github("justinjm/googleCloudAutoMLTablesR")

.Renviron

Create a file called .Renviron in your project's working directory and use the following environemtn arguments:

e.g. your .Renviron should look like:

# .Renviron
GAR_SERVICE_JSON="/Users/me/auth/auth.json"
GCAT_DEFAULT_PROJECT_ID="my-project"
GCAT_DEFAULT_REGION="us-central1"

Save the .Renviron file and restart your R session to load the environment variables

Authentication

Load the 2 R packages we need and then authenticate using the Service account we just created:

library(googleAuthR)
library(googleCloudAutoMLTablesR)

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")

gar_auth_service(json_file = Sys.getenv("GAR_SERVICE_JSON"))

Set global arguments

To make things easier later on, we set some global arguements for our GCP project and the region of the GCP resources within our project:

projectId <- Sys.getenv("GCAT_DEFAULT_PROJECT_ID")
location <- "us-central1"
gcat_region_set("us-central1")
gcat_project_set(projectId)

Viewing AutoML Tables datasets

Get list of datasets

datasets_list <- gcat_list_datasets()
datasets_list

Optional - setting and getting global dataset name

To simplify code for other API calls, you can set a dataset name to a global environment variable:

gcat_global_dataset("test_01_bq")

Then you can retrieve it like so:

gcat_get_global_dataset()

The example workflow in the rest of this vignette uses this function to keep the code as concise as possible.

Importing data into AutoML Tables

Options and Workflow

There are 2 options for loading data into AutoML Tables

  1. Google Cloud Storage
  2. BigQuery

At a high-level, the workflow is:

  1. Locate your data in GCP (csv file in GCS bucket or csv file loaded into a BQ table)
  2. Create AutoML Tables dataset
  3. Import data from GCS or BQ into AutoML Tables dataset

From Google Cloud Storage

To load data into AutoML Tables from Google Cloud storage, first do the following:

  1. Setup a GCS bucket that is regional and in us-central1 region
  2. Upload your data file manually or via googleCloudStorageR

Now we are ready to create an AutoML tables dataset.

Create AutoML Tables dataset

Create a AutoML Tables dataset (with a unique displayName):

gcat_dataset <- gcat_create_dataset(displayName = "test_01_bq")
gcat_dataset

Import data

Execute import from Google Cloud storage:

# set url as seperate parameter to keep line length under 80
gs_url <- "gs://gcatr-dev/bank_marketing.csv"

gcat_import_job <- gcat_import_data(displayName = "test_03_gcs",
                                    input_source = "gcs",
                                    input_url = gs_url)
gcat_import_job

From BigQuery

To load data into AutoML Tables from BigQuery, first do the following:

  1. Setup a BQ dataset
  2. Upload your data file manually or via bigQueryR or bigrquery

Create AutoML Tables dataset

Create a AutoML Tables dataset (with a unique displayName):

gcat_dataset <- gcat_create_dataset(displayName = "test_01_bq")
gcat_dataset

Import data

Execute import from BigQuery:

# set url as seperate parameter to keep line length under 80
bq_url <- "bq://gc-automl-tables-r.gcatr_dev.bank_marketing"

gcat_dataset_import <- gcat_import_data(dataset_display_name = "test_01_bq",
                                        input_source = "bq",
                                        input_url = bq_url)
gcat_dataset_import

View dataset

Once the import - from GCS or BQ - is completed, you'll recieve an email notification. Locate the datasetId in the GCP console or via gcat_list_datasets function. (e.g. - will be in the form TBL123456789) Once the data import is completed, we can then sanity check the results.

Now we're ready to view the results and get the tableSpecId we need for later functions:

dataset <- gcat_get_dataset()
dataset

Updating dataset in AutoML Tables

AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.

Lets first view the table schema and the columns

Get table specifications (tableSpec)

table_spec <- gcat_get_table_specs()
table_spec

List column specs

column_specs_list <- gcat_list_column_specs()
column_specs_list

Get info about our target column

column_spec <- gcat_get_column_spec(columnDisplayName = "V16")
column_spec 

Set label or target column

Set V16 or outcome source as label

dataset <- gcat_set_target_column(columnDisplayName = "V16")
dataset

Modeling

Create model

Training or creating a model is done with gcat_create_model.

This will be a long-lasting operation. You will recieve an email when the model training completes

gcat_model <- gcat_create_model(
  datasetDisplayName = "test_01_bq",
  columnDisplayName = "V16",
  modelDisplayName = "test_01_bq_02",
  optimizationObjective = "MINIMIZE_LOG_LOSS",
  trainBudgetMilliNodeHours = 1000)

gcat_model

More details here: Training models|AutoML Tables Documentation

Viewing Models

While you wait - or after your latest model training completes - you can view the list trained models:

models_list <- gcat_list_models()
models_list

After the model has trained, you can get the trainied model with the following:

gcat_model <- gcat_get_model(modelDisplayName = "test_01_bq_01")
gcat_model

Evalute the model

After training has been completed, you can review various performance statistics on the model, such as the accuracy, precision, recall, and so on. The metrics are returned in a nested data structure:

model_evaluation <- gcat_list_model_evaluations(
  projectId = projectId,
  locationId = location,
  modelDisplayName = "test_01_bq_01"
)
model_evaluation

Make batch prediction

There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction.

batch_predictions <- gcat_batch_predict(
  modelDisplayName = "test_01_bq_01",
  inputSource = "gs://gcatr-dev/bank_marketing_batch_01.csv",
  outputTarget = "gs://gcatr-dev/predictions/"
)
batch_predictions


justinjm/googleCloudAutoMLTablesR documentation built on Jan. 11, 2023, 7:38 p.m.