LowGPUUtilization: Debugger LowGPUUtilization class


Description

This rule detects whether GPU utilization is low or fluctuates. The check is performed for every GPU on each worker node. The rule returns True if the 95th quantile is below threshold_p95, which indicates underutilization. It also returns True if the 95th quantile is above threshold_p95 and the 5th quantile is below threshold_p5, which indicates fluctuations.
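The two conditions can be sketched in base R as follows. This is a minimal illustration, not the rule's actual implementation: the function name `check_gpu_quantiles` and the sample vectors are hypothetical, and the quantile method used by the real rule may differ from R's default.

```r
# Hypothetical helper illustrating the two trigger conditions described above.
# `utilization` is a numeric vector of GPU utilization samples in percent.
check_gpu_quantiles <- function(utilization,
                                threshold_p95 = 70,
                                threshold_p5 = 10) {
  q <- stats::quantile(utilization, probs = c(0.05, 0.95), names = FALSE)
  p5 <- q[1]
  p95 <- q[2]

  # Even the busiest 5 percent of samples stay below the threshold
  if (p95 < threshold_p95) {
    return("underutilized")
  }
  # High peaks combined with very low troughs
  if (p95 > threshold_p95 && p5 < threshold_p5) {
    return("fluctuating")
  }
  "ok"
}

# Made-up sample data for illustration only
low    <- rep(30, 100)                  # constantly low utilization
spiky  <- rep(c(0, 100), times = 50)    # alternates between idle and full load
steady <- rep(90, 100)                  # constantly high utilization

check_gpu_quantiles(low)
check_gpu_quantiles(spiky)
check_gpu_quantiles(steady)
```

In this sketch the first two cases correspond to the rule returning True, and the last case to it not triggering.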

Super class

sagemaker.debugger::ProfilerRuleBase -> LowGPUUtilization

Methods

Public methods

Inherited methods

Method new()

Initialize LowGPUUtilization class

Usage
LowGPUUtilization$new(
  threshold_p95 = 70,
  threshold_p5 = 10,
  window = 500,
  patience = 1000,
  scan_interval_us = 60 * 1000 * 1000
)
Arguments
threshold_p95

: threshold for 95th quantile below which GPU is considered to be underutilized. Default is 70 percent.

threshold_p5

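: threshold for 5th quantile. If the 5th quantile is below this value while the 95th quantile is above threshold_p95, GPU utilization is considered to fluctuate. Default is 10 percent.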

window

: number of past datapoints used to compute the quantiles. Default is 500.

patience

: number of values to record before checking for underutilization or fluctuations. During training initialization, GPU utilization is likely at 0 percent, so the rule should not check for underutilization immediately. Default is 1000.

scan_interval_us

: interval at which timeline files are scanned, in microseconds. Default is 60000000 us (60 seconds).
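A rough base R sketch of how `patience` and `window` might interact is given here. The helper `windowed_quantiles` and the sample data are hypothetical, and the exact warm-up comparison (fewer than `patience` values) is an assumption rather than the rule's documented behavior.

```r
# Hypothetical sketch of how `patience` and `window` could gate the check.
# Returns NULL while fewer than `patience` values have been recorded;
# otherwise returns the 5th and 95th quantiles of the last `window` values.
windowed_quantiles <- function(utilization, window = 500, patience = 1000) {
  n <- length(utilization)
  if (n < patience) {
    return(NULL)  # still in the warm-up phase, nothing is checked
  }
  recent <- utils::tail(utilization, window)
  stats::quantile(recent, probs = c(0.05, 0.95), names = FALSE)
}

# Made-up data: 1000 idle samples at start-up followed by 500 busy samples
samples <- c(rep(0, 1000), rep(95, 500))

windowed_quantiles(samples[1:200])  # warm-up phase, no quantiles yet
windowed_quantiles(samples)         # quantiles of the last 500 values only

# The default scan interval converted from microseconds to seconds
scan_interval_s <- 60 * 1000 * 1000 / 1e6
```

Because only the trailing `window` values are used, the idle start-up samples in this example no longer influence the quantiles once enough busy samples have been recorded.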


Method clone()

The objects of this class are cloneable with this method.

Usage
LowGPUUtilization$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.
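A short usage sketch for both methods follows. It assumes that the sagemaker.debugger package is installed and exports `LowGPUUtilization`; the threshold value 60 is an arbitrary illustration, and the code is skipped when the package is not available.

```r
# Assumption: sagemaker.debugger is installed and exports LowGPUUtilization.
pkg_available <- requireNamespace("sagemaker.debugger", quietly = TRUE)

if (pkg_available) {
  # Construct the rule with a stricter 95th quantile threshold than the default
  rule <- sagemaker.debugger::LowGPUUtilization$new(
    threshold_p95 = 60,
    threshold_p5 = 10,
    window = 500,
    patience = 1000,
    scan_interval_us = 60 * 1000 * 1000
  )

  # clone() is called on an instance; deep = FALSE makes a shallow copy,
  # so any fields holding reference objects are shared with the original
  rule_copy <- rule$clone(deep = FALSE)
}
```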


DyfanJones/sagemaker-r-debugger documentation built on Jan. 20, 2022, 5:49 p.m.