transform_config_from_estimator: Export Airflow transform config from a SageMaker estimator

View source: R/workflow_airflow.R

transform_config_from_estimator R Documentation

Export Airflow transform config from a SageMaker estimator

Description

Export Airflow transform config from a SageMaker estimator

Usage

transform_config_from_estimator(
  estimator,
  task_id,
  task_type,
  instance_count,
  instance_type,
  data,
  data_type = "S3Prefix",
  content_type = NULL,
  compression_type = NULL,
  split_type = NULL,
  job_name = NULL,
  model_name = NULL,
  strategy = NULL,
  assemble_with = NULL,
  output_path = NULL,
  output_kms_key = NULL,
  accept = NULL,
  env = NULL,
  max_concurrent_transforms = NULL,
  max_payload = NULL,
  tags = NULL,
  role = NULL,
  volume_kms_key = NULL,
  model_server_workers = NULL,
  image_uri = NULL,
  vpc_config_override = NULL,
  input_filter = NULL,
  output_filter = NULL,
  join_source = NULL
)

Arguments

estimator

(sagemaker.model.EstimatorBase): The SageMaker estimator to export Airflow config from. It must be an estimator associated with a training job.

task_id

(str): The task id of any airflow.contrib.operators.SageMakerTrainingOperator or airflow.contrib.operators.SageMakerTuningOperator that generates training jobs in the DAG. The transform config is built based on the training job generated in this operator.

task_type

(str): Whether the task is from SageMakerTrainingOperator or SageMakerTuningOperator. Values can be 'training', 'tuning', or None (which means the training job is not generated by any task).

instance_count

(int): Number of EC2 instances to use.

instance_type

(str): Type of EC2 instance to use, for example, 'ml.c4.xlarge'.

data

(str): Input data location in S3.

data_type

(str): What the S3 location defines (default: 'S3Prefix'). Valid values: * 'S3Prefix' - the S3 URI defines a key name prefix. All objects with this prefix will be used as inputs for the transform job. * 'ManifestFile' - the S3 URI points to a single manifest file listing each S3 object to use as an input for the transform job.

content_type

(str): MIME type of the input data (default: None).

compression_type

(str): Compression type of the input data, if compressed (default: None). Valid values: 'Gzip', None.

split_type

(str): The record delimiter for the input object (default: 'None'). Valid values: 'None', 'Line', 'RecordIO', and 'TFRecord'.

job_name

(str): Transform job name (default: None). If not specified, one will be generated.

model_name

(str): Model name (default: None). If not specified, one will be generated.

strategy

(str): The strategy used to decide how to batch records in a single request (default: None). Valid values: 'MultiRecord' and 'SingleRecord'.

assemble_with

(str): How the output is assembled (default: None). Valid values: 'Line' or 'None'.

output_path

(str): S3 location for saving the transform result. If not specified, results are stored in a default bucket.

output_kms_key

(str): Optional. KMS key ID for encrypting the transform output (default: None).

accept

(str): The accept header passed by the client to the inference endpoint. If it is supported by the endpoint, it will be the format of the batch transform output.

env

(dict): Environment variables to be set for use during the transform job (default: None).

max_concurrent_transforms

(int): The maximum number of HTTP requests to be made to each individual transform container at one time.

max_payload

(int): Maximum size, in MB, of the payload in a single HTTP request to the container.

tags

(list[dict]): List of tags for labeling a transform job. If none are specified, the tags used for the training job are applied to the transform job.

role

(str): The 'ExecutionRoleArn' IAM Role ARN for the 'Model', which is also used during transform jobs. If not specified, the role from the Estimator will be used.

volume_kms_key

(str): Optional. KMS key ID for encrypting the volume attached to the ML compute instance (default: None).

model_server_workers

(int): Optional. The number of worker processes used by the inference server. If None, the server will use one worker per vCPU.

image_uri

(str): A Docker image URI to use for deploying the model (default: None).

vpc_config_override

(dict[str, list[str]]): Override for VpcConfig set on the model. Default: use subnets and security groups from this Estimator. * 'Subnets' (list[str]): List of subnet ids. * 'SecurityGroupIds' (list[str]): List of security group ids.

input_filter

(str): A JSONPath to select a portion of the input to pass to the algorithm container for inference. If you omit the field, it gets the value '$', representing the entire input. For CSV data, each row is taken as a JSON array, so only index-based JSONPaths can be applied, e.g. $[0], $[1:]. CSV data should follow the RFC 4180 format (https://tools.ietf.org/html/rfc4180). See Supported JSONPath Operators (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html#data-processing-operators) for a table of supported JSONPath operators. For more information, see the SageMaker API documentation for CreateTransformJob (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html). Some examples: "$[1:]", "$.features" (default: None).

output_filter

(str): A JSONPath to select a portion of the joined/original output to return as the output. For more information, see the SageMaker API documentation for CreateTransformJob (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html). Some examples: "$[1:]", "$.prediction" (default: None).

join_source

(str): The source of data to be joined to the transform output. It can be set to 'Input', meaning the entire input record will be joined to the inference result. You can use output_filter to select the useful portion before uploading to S3 (default: None). Valid values: 'Input', None.

Value

dict: Transform config that can be directly used by SageMakerTransformOperator in Airflow.
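
Examples

A minimal, illustrative sketch of building a transform config from an existing estimator. The estimator object, the Airflow task id, the S3 locations, and the CSV column layout below are placeholder assumptions for illustration; they are not provided by this package.

## Not run:
# `estimator` is assumed to be a SageMaker estimator already associated with
# a training job, e.g. the estimator used by an Airflow
# SageMakerTrainingOperator with task id "train_task".
transform_config <- transform_config_from_estimator(
  estimator = estimator,
  task_id = "train_task",
  task_type = "training",
  instance_count = 1,
  instance_type = "ml.c4.xlarge",
  data = "s3://my-bucket/transform-input",          # assumed input prefix
  data_type = "S3Prefix",
  content_type = "text/csv",
  split_type = "Line",
  output_path = "s3://my-bucket/transform-output",  # assumed output location
  # Optional data processing (assumes the first CSV column is a record id):
  input_filter = "$[1:]",     # drop the id column before inference
  join_source = "Input",      # join the inference result to the input record
  output_filter = "$[0,-1]"   # keep only the id column and the prediction
)

# The returned list can then be supplied as the config of an Airflow
# SageMakerTransformOperator in the DAG definition.
## End(Not run)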
