Tutorial: Collecting System Information from Clusters

This tutorial creates a workflow using actor-IaC to collect system information from multiple compute nodes. In cluster management, understanding CPU, memory, disk, GPU, OS, and network information for each node is a basic and important task.

Prerequisites

Users must satisfy the following conditions:

  • actor-IaC is installed on the user's machine (see installation tutorial for installation instructions)
  • Users can log into target nodes using SSH public key authentication (see SSH Setup tutorial for configuration instructions)
  • Users can execute the sudo command on target nodes (required for GPU information retrieval, see SSH Setup tutorial for configuration instructions)
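The SSH and sudo prerequisites can be checked from the command line before going further. The following is a minimal sketch, not part of actor-IaC: the function name check_node and the SSH_BIN override are hypothetical, the user name and address are the ones used in the inventory later in this tutorial, and the sudo check assumes that passwordless sudo is what the SSH Setup tutorial configures.

```shell
# Sketch: check key-based SSH login and non-interactive sudo on one node.
# SSH_BIN is a hypothetical override so the function can be tried without a cluster.
SSH_BIN="${SSH_BIN:-ssh}"

check_node() {
  user="$1"
  host="$2"
  # BatchMode=yes makes ssh fail instead of prompting for a password,
  # so success here implies that public key authentication works.
  if ! "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "${user}@${host}" true; then
    echo "${host}: SSH key login failed"
    return 1
  fi
  # sudo -n never prompts; it fails if a password would be required.
  if ! "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "${user}@${host}" sudo -n true; then
    echo "${host}: SSH ok, but sudo needs a password"
    return 2
  fi
  echo "${host}: ok"
}

# Example call (commented out because it contacts a real node):
# check_node devteam 192.168.5.13
```

If sudo on the target nodes is configured to ask for a password, the second check reports a failure even though sudo itself is usable interactively.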

1. Create a Working Directory

Users create a directory to place workflow files.

mkdir -p ~/works/testcluster-iac/sysinfo
cd ~/works/testcluster-iac

2. Create the Inventory File

Users define the target nodes for system information collection in the inventory.ini file. This tutorial targets 6 compute nodes: 192.168.5.13 to 192.168.5.15 and 192.168.5.21 to 192.168.5.23.

cat > inventory.ini << 'EOF'
[compute]
node13 actoriac_host=192.168.5.13
node14 actoriac_host=192.168.5.14
node15 actoriac_host=192.168.5.15
node21 actoriac_host=192.168.5.21
node22 actoriac_host=192.168.5.22
node23 actoriac_host=192.168.5.23

[compute:vars]
actoriac_user=devteam
EOF

The entries in the inventory.ini file have the following meanings:

Item                         Description
[compute]                    Defines a group name
node13 to node23             Defines identification names for each node
actoriac_host=192.168.5.XX   Specifies the actual IP address for each node
[compute:vars]               Defines variables that apply to the entire compute group
actoriac_user=devteam        Specifies the username for SSH connections
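To confirm that the inventory file contains the intended hosts, the group can be listed with a few lines of awk. This is only a sketch of how the format described above can be read; list_hosts is a hypothetical helper, not an actor-IaC command, and it handles only the simple layout used in this tutorial.

```shell
# list_hosts GROUP FILE: print "name address" for every host in [GROUP].
list_hosts() {
  awk -v group="$1" '
    # A section header switches the group we are currently in.
    /^\[/ { in_group = ($0 == ("[" group "]")); next }
    in_group && NF {
      addr = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^actoriac_host=/) {
          addr = $i
          sub(/^actoriac_host=/, "", addr)
        }
      }
      if (addr != "") print $1, addr
    }
  ' "$2"
}

# Example call, run from ~/works/testcluster-iac:
# list_hosts compute inventory.ini
```

The [compute:vars] section is skipped because its header does not equal [compute] exactly.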

3. Understand the Actor Tree Generated by actor-IaC

When actor-IaC reads the inventory.ini file and executes a workflow, actor-IaC generates the following actor tree.

ROOT actor
├── logStore actor
└── nodeGroup actor
    ├── accumulator actor
    ├── node-node13 actor
    ├── node-node14 actor
    ├── node-node15 actor
    ├── node-node21 actor
    ├── node-node22 actor
    └── node-node23 actor

The actor tree generated by actor-IaC has the following three important design points.

3.1 Parent-Child Relationship Between nodeGroup Actor and node Actors

The purpose of actor-IaC is to execute the same configuration tasks on multiple servers in parallel. To achieve parallel execution, actor-IaC places multiple node actors as child actors under the nodeGroup actor.

  • nodeGroup actor: Executes the main workflow, which controls which sub-workflows run on which nodes.
  • node actor: Executes sub-workflows, which define the specific commands to run on each node.

In this tutorial, 6 node actors (node-node13 to node-node23) corresponding to 6 nodes are generated. actor-IaC executes the same sub-workflow in parallel on these 6 node actors.

3.2 Log Output Aggregation by the accumulator Actor

When multiple node actors execute sub-workflows in parallel, multiple node actors generate log output simultaneously. The accumulator actor aggregates log output from multiple node actors in one place.

node-node13 actor ──┐
node-node14 actor ──┤
node-node15 actor ──┼─→ accumulator actor ─→ logStore actor ─→ actor-iac-logs.mv.db
node-node21 actor ──┤           │
node-node22 actor ──┤           └─→ standard output (console)
node-node23 actor ──┘

3.3 Log Persistence by the logStore Actor

The logStore actor is dedicated to log persistence. Logs from multiple node actors are interleaved in order of arrival, but because each log entry includes a nodeId, users can filter by nodeId later to extract the logs for a specific node.
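The idea of tagging each entry with a node ID and filtering afterwards can be illustrated with plain shell. This sketch does not read actor-iac-logs.mv.db and does not reflect actor-IaC's real log schema: the "node|message" line format, the host names, and both function names are invented for the illustration.

```shell
# Each background job stands in for one node actor and tags its lines with a node ID.
emit_logs() {
  for node in node13 node14 node15; do
    (
      echo "${node}|===== HOSTNAME ====="
      echo "${node}|${node}.example"
    ) &
  done
  wait   # the order in which jobs write can differ between runs
}

# filter_node NODE: keep only the lines of one node and drop the tag.
filter_node() {
  awk -F'|' -v id="$1" '$1 == id { print $2 }'
}

# All jobs write into one stream; the filter recovers a single node's view.
emit_logs | filter_node node14
```

Lines from different nodes may interleave in the combined stream, but the lines of any one node keep their relative order, which is why filtering by ID is enough to reconstruct a per-node log.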

4. Create the System Information Collection Workflow

4.1 About Workflow Attributes

In actor-IaC workflows, users can document the workflow and its steps using the following attributes:

Level      Attribute      Description
Workflow   description:   Description of the entire workflow
Step       note:          Description of the step

Setting these attributes allows users to check workflow contents using the describe command.
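Because the describe command can only show notes that exist, it can be useful to check that no step was left undocumented. The following rough sketch counts steps and notes with awk; undocumented_steps is a hypothetical helper, and it assumes that every step begins with a "- states:" line and carries at most one note: line, as in the workflows in this tutorial.

```shell
# undocumented_steps FILE: print the number of steps that have no note: attribute.
undocumented_steps() {
  awk '
    /^[ \t]*-[ \t]*states:/ { steps++ }
    /^[ \t]*note:/          { notes++ }
    END { print steps - notes }
  ' "$1"
}

# Example call, run from ~/works/testcluster-iac once the file exists:
# undocumented_steps sysinfo/collect-sysinfo.yaml
```

A result of 0 means every step has a note.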

4.2 Create the Sub-Workflow

Users create the sub-workflow to be executed on each node in the ~/works/testcluster-iac/sysinfo/collect-sysinfo.yaml file.

cat > sysinfo/collect-sysinfo.yaml << 'EOF'
name: collect-sysinfo

description: |
  Sub-workflow to collect system information from each compute node.
  Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.

steps:
  - states: ["0", "1"]
    note: Retrieve hostname and OS information
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== HOSTNAME ====="
            hostname -f
            echo ""
            echo "===== OS INFO ====="
            grep -E "^(NAME|VERSION|ID)=" /etc/os-release
            uname -a

  - states: ["1", "2"]
    note: Retrieve CPU architecture, core count, and model name
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== CPU INFO ====="
            lscpu | grep -E "^(Architecture|CPU\(s\)|Model name|Thread|Core|Socket)"

  - states: ["2", "3"]
    note: Retrieve memory capacity (total, used, free)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== MEMORY INFO ====="
            free -h

  - states: ["3", "4"]
    note: Retrieve disk device list and mount status
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== DISK INFO ====="
            lsblk -d -o NAME,SIZE,TYPE,MODEL 2>/dev/null || lsblk -d -o NAME,SIZE,TYPE
            echo ""
            df -h | grep -E "^(/dev|Filesystem)"

  - states: ["4", "5"]
    note: Retrieve GPU presence and model name (use nvidia-smi for NVIDIA GPUs)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== GPU INFO ====="
            if command -v nvidia-smi > /dev/null 2>&1; then
              nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader 2>/dev/null || echo "nvidia-smi failed"
            else
              lspci 2>/dev/null | grep -i -E "(vga|3d|display)" || echo "No GPU detected via lspci"
            fi

  - states: ["5", "end"]
    note: Retrieve network interfaces and IP addresses
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== NETWORK INFO ====="
            ip -4 addr show | grep -E "(^[0-9]+:|inet )" | head -20
EOF
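In this sub-workflow each step starts in the state where the previous step ended ("0" to "1", "1" to "2", and so on until "end"). A mistyped state breaks that chain, so a quick consistency check can help when editing. The sketch below is a hypothetical helper, not an actor-IaC feature, and it only makes sense for linear workflows like this one, where the steps are listed in execution order.

```shell
# check_state_chain FILE: succeed only if every step starts where the previous one ended.
check_state_chain() {
  awk -F'"' '
    /states:/ {
      # With a double quote as separator, field 2 is the from-state and field 4 the to-state.
      from = $2
      to = $4
      if (seen && from != prev_to) {
        printf "gap: step starts at %s but previous step ended at %s\n", from, prev_to
        bad = 1
      }
      prev_to = to
      seen = 1
    }
    END {
      if (prev_to != "end") { print "last step does not reach end"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}

# Example call, run from ~/works/testcluster-iac:
# check_state_chain sysinfo/collect-sysinfo.yaml && echo "state chain ok"
```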

4.3 Create the Main Workflow

Users create the main workflow that calls the sub-workflow in the ~/works/testcluster-iac/sysinfo/main-collect-sysinfo.yaml file.

cat > sysinfo/main-collect-sysinfo.yaml << 'EOF'
name: main-collect-sysinfo

description: |
  Main workflow to collect system information from all compute nodes in the cluster in parallel.
  Executes collect-sysinfo.yaml on each node.

steps:
  - states: ["0", "end"]
    note: Execute collect-sysinfo.yaml in parallel on all nodes
    actions:
      - actor: nodeGroup
        method: apply
        arguments:
          actor: "node-*"
          method: runWorkflow
          arguments: ["collect-sysinfo.yaml"]
EOF
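The "node-*" value selects which child actors of nodeGroup receive the runWorkflow call. Assuming the pattern behaves like a shell-style wildcard (an assumption; this tutorial does not define the matching rules), the selection over the actor tree from section 3 can be sketched as follows. matches_pattern is a hypothetical helper.

```shell
# matches_pattern NAME PATTERN: succeed if NAME matches the shell glob PATTERN.
matches_pattern() {
  case "$1" in
    $2) return 0 ;;   # unquoted so that it is treated as a pattern
    *)  return 1 ;;
  esac
}

# Children of the nodeGroup actor in this tutorial (abbreviated).
for actor in accumulator node-node13 node-node14 node-node23; do
  if matches_pattern "$actor" 'node-*'; then
    echo "selected: $actor"
  fi
done
```

Under that assumption the accumulator actor is not selected, so the sub-workflow runs only on the node actors.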

5. Verify the Workflow Contents

Users can check workflow contents using the describe command.

5.1 Check the Workflow List

./actor_iac.java list -d ./sysinfo

Example output:

Available workflows (directory: ./sysinfo)
------------------------------------------------------------------------------------------
#    File (-w)              Path                                Workflow Name (in logs)
------------------------------------------------------------------------------------------
1    collect-sysinfo        ./sysinfo/collect-sysinfo.yaml      collect-sysinfo
2    main-collect-sysinfo   ./sysinfo/main-collect-sysinfo...   main-collect-sysinfo

5.2 Check the Workflow Description

./actor_iac.java describe -d ./sysinfo -w collect-sysinfo

Example output:

Workflow: collect-sysinfo
File: /home/user/works/testcluster-iac/sysinfo/collect-sysinfo.yaml

Description:
Sub-workflow to collect system information from each compute node.
Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.

5.3 Check the Step Descriptions

Adding the --steps option displays the note for each step.

./actor_iac.java describe -d ./sysinfo -w collect-sysinfo --steps

Example output:

Workflow: collect-sysinfo
File: /home/user/works/testcluster-iac/sysinfo/collect-sysinfo.yaml

Description:
Sub-workflow to collect system information from each compute node.
Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.

Steps:

[0 -> 1]
Retrieve hostname and OS information

[1 -> 2]
Retrieve CPU architecture, core count, and model name

[2 -> 3]
Retrieve memory capacity (total, used, free)

[3 -> 4]
Retrieve disk device list and mount status

[4 -> 5]
Retrieve GPU presence and model name (use nvidia-smi for NVIDIA GPUs)

[5 -> end]
Retrieve network interfaces and IP addresses

Tip: With the describe command, users can get an overview of a workflow without opening the YAML files directly. In team development, the describe command is a convenient way to share the purpose of a workflow and the intent of each step.

6. Verify the Directory Structure

When users complete the work up to this point, the ~/works/testcluster-iac directory has the following structure.

~/works/testcluster-iac/
├── actor_iac.java
├── inventory.ini
└── sysinfo
    ├── collect-sysinfo.yaml
    └── main-collect-sysinfo.yaml
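The layout above can be checked mechanically before running anything. This is a small sketch using only standard shell tests; check_layout is a hypothetical helper, and the file list is copied from the tree above.

```shell
# check_layout DIR: report every file from the tutorial layout that is missing under DIR.
check_layout() {
  missing=0
  for f in actor_iac.java inventory.ini \
           sysinfo/collect-sysinfo.yaml sysinfo/main-collect-sysinfo.yaml; do
    if [ ! -f "$1/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call:
# check_layout ~/works/testcluster-iac && echo "layout ok"
```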

7. Execute the Workflow

Users run the system information collection workflow with the run command.

./actor_iac.java run -d ./sysinfo -w main-collect-sysinfo -i inventory.ini -g compute

Each option has the following meaning:

Option                    Description
-d ./sysinfo              Specifies the workflow directory
-w main-collect-sysinfo   Specifies the workflow name to execute
-i inventory.ini          Specifies the inventory file
-g compute                Specifies the target group

actor-IaC executes the sub-workflow in parallel on 6 nodes. Output from each node is aggregated by the accumulator actor and displayed on the console.

Summary

This tutorial created a workflow to collect system information from multiple compute nodes using actor-IaC.

This tutorial covered how to:

  • Define multiple nodes in an inventory file
  • Understand the structure of actor trees generated by actor-IaC
  • Write descriptions using description: and note: in workflows
  • Check workflow contents using the describe command

For information about checking execution results and utilizing the log database, see the next tutorial.