monitor_cluster_resources: Monitor resource use on a cluster

View source: R/monitoring.R

monitor_cluster_resources    R Documentation

Monitor resource use on a cluster

Description

To-do: add better docs

The data comes from the Linux command ps (specifically, "ps -u <username> -o pcpu,rss,size,state,time,cmd"). If you want to know EXACTLY what each column means, RTFM: type man ps in a UNIX terminal. See Details for more information about the data frame being saved.
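As a rough illustration of where the rows come from, the sketch below builds the ps command quoted above for a given username and parses a captured output string into a data frame. The helper names build_ps_command and parse_ps_output and the sample output are made up for illustration. This is not the package's actual implementation.

```r
# Hypothetical helper: build the ps command quoted in the Description.
build_ps_command <- function(username) {
  sprintf("ps -u %s -o pcpu,rss,size,state,time,cmd", username)
}

# Hypothetical helper: parse ps output lines (header first) into a data frame.
# CMD can contain spaces, so only the first five fields are split on
# whitespace and the rest of the line is kept as CMD.
parse_ps_output <- function(lines) {
  pattern <- "^\\s*(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"
  fields <- regmatches(lines, regexec(pattern, lines))
  header <- fields[[1]][-1]
  rows <- do.call(rbind, lapply(fields[-1], function(x) x[-1]))
  df <- as.data.frame(rows, stringsAsFactors = FALSE)
  names(df) <- header
  df
}

# Made-up sample output, for illustration only.
sample_output <- c(
  "%CPU   RSS  SIZE S     TIME CMD",
  " 0.0  5120  1024 S 00:00:00 sshd: zach@pts/0",
  "99.5 204800 150000 R 00:12:34 R"
)

build_ps_command("zach")
parse_ps_output(sample_output)
```

All parsed columns come back as character strings in this sketch, so numeric columns need as.numeric() before any arithmetic.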

Usage

monitor_cluster_resources(username_or_command, login_node, node_list,
  save_path, sleeping_time, total_checks, ..., stop_file = NULL)

Arguments

username_or_command

The username you're using to log in to the remote server or, if you supply command_maker=NULL to the ..., the command you want to call and check the results of. Just stick with your username. It's easier for everyone.

login_node

the name of the gateway node (e.g. 'zach@remote_back_up_server.server.com'). Should NOT be the same as the node you're using to run the other tasks.

node_list

a list of the nodes you want to monitor

save_path

the filename you want to save all this information to (on the remote server). If NULL, it returns the future of the data frame it would normally save. Choosing this option will overwrite the current future plan.

sleeping_time

time between checks in seconds

total_checks

total number of checks

...

additional arguments supplied to monitor_resources_on_node

stop_file

The path of a file that, if present on the node, causes the monitoring to end and return prematurely. A totally hacky way of communicating with the monitoring functions. Wholesome people should not bother with this parameter.
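To make the interplay of sleeping_time, total_checks and stop_file concrete, here is a minimal sketch of a polling loop that honors all three. The function poll_with_stop_file and its check_fn argument are hypothetical names, and details such as whether the real function sleeps after the final check are assumptions, not something this page confirms.

```r
# Hypothetical sketch of a polling loop; `check_fn` stands in for whatever
# collects one snapshot (e.g. running ps on a node).
poll_with_stop_file <- function(check_fn, sleeping_time, total_checks,
                                stop_file = NULL) {
  results <- list()
  for (i in seq_len(total_checks)) {
    # End early if the stop file has appeared.
    if (!is.null(stop_file) && file.exists(stop_file)) break
    results[[length(results) + 1]] <- check_fn()
    # Sleep between checks, but not after the last one (an assumption).
    if (i < total_checks) Sys.sleep(sleeping_time)
  }
  results
}

# Usage: three quick checks with no stop file.
snapshots <- poll_with_stop_file(function() Sys.time(),
                                 sleeping_time = 0.1, total_checks = 3)
length(snapshots)
```

With the values from the Examples section (sleeping_time = 10, total_checks = 6), a loop like this would run for roughly a minute unless the stop file shows up first.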

Details

Each row is a process at a given time on a given node.

Columns:

  • %CPU is the percent CPU being used by the process (can go > 100)

  • RSS is the resident set size, i.e. the non-swapped physical memory the process is using, in kilobytes

  • SIZE is a very rough estimate of the swap space the process would need if it dirtied all of its writable pages and was then swapped out (see man ps)

  • S is the state of the process. Basically "S" means sleeping and "R" means running

  • TIME is the CPU time of the process. Basically how long it's been "active." (Processes, unlike grad students, sleep a lot)

  • PID is the ID of the process.

  • CMD is an extended form of the command/name of the process. All R processes have been renamed "R"

  • SampleTime is when the process was pinged

  • Nodename is the name of the node

  • PIDofMonitor is the process ID of the monitoring process itself. You can use this to filter out the resources being used by this process.
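As a small worked example of the columns above, the sketch below drops the monitoring process by comparing PID with PIDofMonitor and then totals RSS per node. The rows are made up for illustration, and storing the numeric columns as character strings is an assumption based on the as.numeric() calls in the Examples section.

```r
# Made-up rows with some of the columns listed above, for illustration only.
usage <- data.frame(
  `%CPU` = c("150.0", "0.3", "98.7"),
  RSS = c("2500000", "12000", "1800000"),
  S = c("R", "S", "R"),
  PID = c(101L, 202L, 303L),
  CMD = c("R", "R", "R"),
  Nodename = c("node1", "node1", "node2"),
  PIDofMonitor = c(202L, 202L, 404L),
  check.names = FALSE,       # keep the "%CPU" column name as-is
  stringsAsFactors = FALSE
)

# Drop the rows that belong to the monitoring process itself.
not_monitor <- usage[usage$PID != usage$PIDofMonitor, ]

# Total RSS (converted to numeric) per node.
rss_by_node <- tapply(as.numeric(not_monitor$RSS), not_monitor$Nodename, sum)
rss_by_node
```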

Examples

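## Not run: 
library(cs)       # the package documented here (burchill/cs)
library(future)   # plan(), %<-%, resolved(), futureOf()
library(dplyr)    # %>%, group_by(), filter(), summarise()
library(ggplot2)  # ggplot(), aes(), geom_line()

# nodes_to_monitor is assumed to be a list of node names defined elsewhere
monitor_cluster_resources("zach",
                          "zach@remote_backup_server.com",
                          nodes_to_monitor,
                          save_path = "/u/zach/bb_maker_resources.RDS",
                          sleeping_time = 10,
                          total_checks = 6)
# Wait for it to complete before using another connection to 'remote_backup_server.com'
plan(remote, workers = "zach@remote_backup_server.com")
# Read the saved data frame back from the remote server as a future
df %<-% readRDS("/u/zach/bb_maker_resources.RDS")
resolved(futureOf(df))

# Total memory and CPU of the R processes per node and sample time,
# plotted for samples using more than 2e+06 of RSS
df %>%
  group_by(Nodename, SampleTime) %>%
  filter(CMD == "R") %>%
  summarise(RSS = sum(as.numeric(RSS)),
            CPU = sum(as.numeric(`%CPU`))) %>%
  filter(RSS > 2e+06) %>%
  ggplot(aes(x = SampleTime, y = RSS, color = Nodename)) +
  geom_line()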

## End(Not run)

burchill/cs documentation built on May 28, 2023, 1:29 p.m.