monitor_cluster_resources: Monitor resources use on cluster
In burchill/cs: Functions for working with Rochester's CS cluster

monitor_cluster_resources

R Documentation

Monitor resources use on cluster

Description

To-do: add better docs

The data comes from the Linux command ps (specifically, "ps -u <username> -o pcpu,rss,size,state,time,cmd"). If you want to know EXACTLY what each column means, RTFM: type man ps in a UNIX terminal. See Details for more information about the data frame being saved.
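
As a rough illustration of how output from that ps call could end up as a data frame, here is a minimal sketch in base R. It is not the package's own parser: the sample lines are made up, and `parse_ps_lines` is a hypothetical helper named only for this example.

```r
# Made-up sample of what `ps -u <username> -o pcpu,rss,size,state,time,cmd`
# might print; the values are for illustration only.
sample_lines <- c(
  "%CPU   RSS  SIZE S     TIME CMD",
  " 0.0  3520   880 S 00:00:00 -bash",
  "99.5 81234 90112 R 00:12:34 /usr/lib/R/bin/exec/R --no-save"
)

# Hypothetical helper: split each row into five whitespace-separated fields
# plus "everything else" (CMD), because CMD itself can contain spaces.
parse_ps_lines <- function(lines) {
  body <- trimws(lines[-1])  # drop the header row
  pattern <- "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"
  fields <- regmatches(body, regexec(pattern, body))
  # The first element of each match is the full string; keep only the groups.
  mat <- do.call(rbind, lapply(fields, function(x) x[-1]))
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- c("%CPU", "RSS", "SIZE", "S", "TIME", "CMD")
  out
}

ps_df <- parse_ps_lines(sample_lines)
print(ps_df)
```

Every column comes back as character here, which is why numeric columns need `as.numeric()` before you do arithmetic on them.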

Usage

monitor_cluster_resources(username_or_command, login_node, node_list,
  save_path, sleeping_time, total_checks, ..., stop_file = NULL)

Arguments

`username_or_command`	The username you use to log in to the remote server or, if you supply `command_maker=NULL` via `...`, the command you want to run and check the results of. Just stick with your username. It's easier for everyone.
`login_node`	the name of the gateway node (e.g. 'zach@remote_back_up_server.server.com'). Should NOT be the same as the node you're using to run the other tasks.
`node_list`	a list of the nodes you want to monitor
`save_path`	the filename you want to save all this information to (on the remote server). If NULL, it returns the future of the data frame it would normally save. Choosing this option will overwrite the current `future` `plan`.
`sleeping_time`	time between checks in seconds
`total_checks`	total number of checks
`...`	additional arguments supplied to `monitor_resources_on_node`
`stop_file`	The path of a file that, if present on the node, causes the monitoring to end and return prematurely. A totally hacky way of communicating with the monitoring functions. Wholesome people should not bother with this parameter.
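
To make the interplay of `sleeping_time`, `total_checks` and `stop_file` concrete, here is a minimal local sketch of a polling loop with a stop-file escape hatch. It is an assumption about the general pattern, not the package's actual code: `poll_with_stop_file` and `check_fn` are hypothetical names, and everything runs on the local machine rather than on a remote node.

```r
# Hypothetical helper: call check_fn() up to total_checks times, sleeping
# sleeping_time seconds between calls, and bail out early if stop_file exists.
poll_with_stop_file <- function(check_fn, sleeping_time, total_checks,
                                stop_file = NULL) {
  results <- list()
  for (i in seq_len(total_checks)) {
    # The "hacky" signal: the mere presence of the file ends the loop.
    if (!is.null(stop_file) && file.exists(stop_file)) break
    results[[length(results) + 1]] <- check_fn()
    if (i < total_checks) Sys.sleep(sleeping_time)
  }
  results
}

# Without a stop file, all three checks run.
full_run <- poll_with_stop_file(Sys.time, sleeping_time = 0.01, total_checks = 3)

# With a stop file already present, the loop ends before the first check.
flag <- tempfile()
file.create(flag)
stopped_run <- poll_with_stop_file(Sys.time, sleeping_time = 0.01,
                                   total_checks = 3, stop_file = flag)
unlink(flag)
```

With this pattern a full run takes roughly `sleeping_time * (total_checks - 1)` seconds plus the time spent in the checks themselves.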

Details

Each row is a process at a given time on a given node.

Columns:

%CPU is the percent CPU being used by the process (can go > 100%)
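RSS is the resident set size, i.e. the physical (non-swapped) memory the process is using, in kilobytes according to man ps
SIZE is also related to memory usage: per man ps, a very rough estimate of the swap space the process would need if all of its writable pages were swapped out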
S is the state of the process. Basically "S" means sleeping and "R" means running
TIME is the CPU time of the process. Basically how long it's been "active." (Processes, unlike grad students, sleep a lot)
PID is the ID of the process.
CMD is an extended form of the command/name of the process. All R processes have been renamed "R"
SampleTime is when the process was pinged
Nodename is the name of the node
PIDofMonitor is the process ID of the monitoring process itself. You can use this to filter out the resources being used by this process.
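
A minimal sketch of the PIDofMonitor filtering mentioned above, using base R on a made-up data frame. The values are invented, and the sketch assumes that columns are stored as character and that RSS is in kilobytes.

```r
# Toy data frame using the column names from the list above (values invented).
toy <- data.frame(
  PID          = c("101", "102", "103"),
  RSS          = c("2048", "4096", "1024"),
  CMD          = c("R", "R", "bash"),
  Nodename     = c("node1", "node1", "node1"),
  PIDofMonitor = c("102", "102", "102"),
  stringsAsFactors = FALSE
)

# Drop the row that belongs to the monitoring process itself.
not_monitor <- toy[toy$PID != toy$PIDofMonitor, ]

# Convert RSS from kilobytes to megabytes (assumes RSS is in kB).
not_monitor$RSS_MB <- as.numeric(not_monitor$RSS) / 1024
print(not_monitor)
```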

Examples

## Not run: 
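library(cs)
library(future)   # plan(), %<-%, futureOf(), resolved()
library(dplyr)
library(ggplot2)

# nodes_to_monitor is assumed to already hold the nodes you want to monitor
monitor_cluster_resources("zach",
                          "zach@remote_backup_server.com",
                          nodes_to_monitor,
                          save_path = "/u/zach/bb_maker_resources.RDS",
                          sleeping_time = 10,
                          total_checks = 6)
# Wait for it to complete before using another connection to 'remote_backup_server.com'
plan(remote, workers = "zach@remote_backup_server.com")
# Read the saved data frame back from the remote server as a future
df %<-% readRDS("/u/zach/bb_maker_resources.RDS")
resolved(futureOf(df))

# Total memory and CPU used by R processes per node and sample time,
# keeping only points above roughly 2 GB (assuming RSS is in kB)
df %>%
  group_by(Nodename, SampleTime) %>%
  filter(CMD == "R") %>%
  summarise(RSS = sum(as.numeric(RSS)),
            CPU = sum(as.numeric(`%CPU`))) %>%
  filter(RSS > 2e+06) %>%
  ggplot(aes(x = SampleTime, y = RSS, color = Nodename)) +
  geom_line()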

## End(Not run)

burchill/cs documentation built on May 28, 2023, 1:29 p.m.