Tutorial: Collecting System Information from Clusters
This tutorial creates a workflow using actor-IaC to collect system information from multiple compute nodes. In cluster management, understanding CPU, memory, disk, GPU, OS, and network information for each node is a basic and important task.
Prerequisites
Users must satisfy the following conditions:
- actor-IaC is installed on the user's machine (see installation tutorial for installation instructions)
- Users can log into target nodes using SSH public key authentication (see SSH Setup tutorial for configuration instructions)
- Users can execute the sudo command on target nodes (required for GPU information retrieval, see SSH Setup tutorial for configuration instructions)
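The SSH and sudo prerequisites can be checked from the user's machine before any workflow is written. The following is a minimal sketch, not part of actor-IaC: check_node is a hypothetical helper name, and the sudo check assumes that sudo is configured to run without a password prompt (if the SSH Setup tutorial configures sudo differently, adjust that line).

```shell
# check_node USER HOST
# Hypothetical helper (not part of actor-IaC): reports whether key-based SSH
# login and non-interactive sudo both work on one target node.
check_node() {
  # $1 = SSH user, $2 = host name or IP address
  # BatchMode=yes makes ssh fail instead of prompting for a password.
  if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" true; then
    echo "$2: ssh=FAILED sudo=skipped"
    return 1
  fi
  # sudo -n fails immediately if a password would be required (assumption:
  # passwordless sudo is the intended setup).
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" sudo -n true; then
    echo "$2: ssh=ok sudo=ok"
  else
    echo "$2: ssh=ok sudo=FAILED"
    return 1
  fi
}

# Example call, using the user and one address from this tutorial:
#   check_node devteam 192.168.5.13
```

Running the function once per node gives a quick pass/fail line for each prerequisite, which is easier to read than a failed workflow run.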
1. Create a Working Directory
Users create a directory to place workflow files.
mkdir -p ~/works/testcluster-iac/sysinfo
cd ~/works/testcluster-iac
2. Create the Inventory File
Users define the target nodes for system information collection in the inventory.ini file. This tutorial targets 6 compute nodes: 192.168.5.13 to 192.168.5.15 and 192.168.5.21 to 192.168.5.23.
cat > inventory.ini << 'EOF'
[compute]
node13 actoriac_host=192.168.5.13
node14 actoriac_host=192.168.5.14
node15 actoriac_host=192.168.5.15
node21 actoriac_host=192.168.5.21
node22 actoriac_host=192.168.5.22
node23 actoriac_host=192.168.5.23
[compute:vars]
actoriac_user=devteam
EOF
The entries in the inventory.ini file have the following meanings:
| Item | Description |
|---|---|
| [compute] | Defines a group name |
| node13 to node23 | Defines identification names for each node |
| actoriac_host=192.168.5.XX | Specifies the actual IP address for each node |
| [compute:vars] | Defines variables that apply to the entire compute group |
| actoriac_user=devteam | Specifies the username for SSH connections |
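To confirm that the inventory says what was intended, the node names and addresses of one group can be printed with a short awk filter. This is a sketch under the assumption that the file follows exactly the layout shown above (one node per line, actoriac_host= as a space-separated field); list_group_hosts is a hypothetical helper name and does not reflect how actor-IaC itself parses the file.

```shell
# list_group_hosts GROUP FILE
# Hypothetical helper: prints "name address" for every node line in [GROUP].
list_group_hosts() {
  awk -v section="[$1]" '
    # A line starting with "[" opens a new section; remember whether it is ours.
    /^\[/ { in_group = ($0 == section); next }
    in_group && NF > 0 {
      addr = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^actoriac_host=/) {
          addr = substr($i, length("actoriac_host=") + 1)
        }
      }
      print $1, addr
    }
  ' "$2"
}

# Example call:
#   list_group_hosts compute inventory.ini
```

Because the section name must match exactly, the [compute:vars] block is skipped and only the six node lines are listed.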
3. Understand the Actor Tree Generated by actor-IaC
When actor-IaC reads the inventory.ini file and executes a workflow, actor-IaC generates the following actor tree.
ROOT actor
├── logStore actor
└── nodeGroup actor
    ├── accumulator actor
    ├── node-node13 actor
    ├── node-node14 actor
    ├── node-node15 actor
    ├── node-node21 actor
    ├── node-node22 actor
    └── node-node23 actor
The actor tree generated by actor-IaC has the following three important design points.
3.1 Parent-Child Relationship Between nodeGroup Actor and node Actors
The purpose of actor-IaC is to execute the same configuration tasks on multiple servers in parallel. To achieve parallel execution, actor-IaC places multiple node actors as child actors under the nodeGroup actor.
- nodeGroup actor: Executes the main workflow. Controls which sub-workflows to execute on which nodes.
- node actor: Executes sub-workflows. Defines the specific commands to execute on each node.
In this tutorial, 6 node actors (node-node13 to node-node23) corresponding to 6 nodes are generated. actor-IaC executes the same sub-workflow in parallel on these 6 node actors.
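The fan-out pattern described here (one parent starts the same task for every child, then waits for all of them) can be illustrated with plain shell background jobs. This is only an analogy for the idea of parallel execution, not a description of how actor-IaC implements its actors; run_on_all and the task function are hypothetical names.

```shell
# run_on_all TASK NODE...
# Hypothetical illustration: starts TASK once per node as a background job,
# waits for every job, and reports how many of them failed.
run_on_all() (
  task="$1"; shift
  pids=""
  for node in "$@"; do
    "$task" "$node" &        # one background job per node
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
)

# Example call with a stand-in task:
#   show_node() { echo "collecting on $1"; }
#   run_on_all show_node node13 node14 node15 node21 node22 node23
```

The jobs finish in no fixed order, so their output lines can appear interleaved. That is the same situation the accumulator actor in the next section is designed to handle.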
3.2 Log Output Aggregation by the accumulator Actor
When multiple node actors execute sub-workflows in parallel, multiple node actors generate log output simultaneously. The accumulator actor aggregates log output from multiple node actors in one place.
node-node13 actor ──┐
node-node14 actor ──┤
node-node15 actor ──┼─→ accumulator actor ─→ logStore actor ─→ actor-iac-logs.mv.db
node-node21 actor ──┤         │
node-node22 actor ──┤         └─→ standard output (console)
node-node23 actor ──┘
3.3 Log Persistence by the logStore Actor
The logStore actor is dedicated to log persistence. Log entries from different node actors are interleaved in the order in which they arrive, but because each entry includes a nodeId, users can filter by nodeId later to extract the logs for a specific node.
4. Create the System Information Collection Workflow
4.1 About Workflow Attributes
In actor-IaC workflows, users can document a workflow with the following attributes:
| Level | Attribute | Description |
|---|---|---|
| Workflow | description: | Description of the entire workflow |
| Step | note: | Description of the step |
Setting these attributes allows users to check workflow contents using the describe command.
4.2 Create the Sub-Workflow
Users create the sub-workflow to be executed on each node in the ~/works/testcluster-iac/sysinfo/collect-sysinfo.yaml file.
cat > sysinfo/collect-sysinfo.yaml << 'EOF'
name: collect-sysinfo
description: |
  Sub-workflow to collect system information from each compute node.
  Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.
steps:
  - states: ["0", "1"]
    note: Retrieve hostname and OS information
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== HOSTNAME ====="
            hostname -f
            echo ""
            echo "===== OS INFO ====="
            grep -E "^(NAME|VERSION|ID)=" /etc/os-release
            uname -a
  - states: ["1", "2"]
    note: Retrieve CPU architecture, core count, and model name
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== CPU INFO ====="
            lscpu | grep -E "^(Architecture|CPU\(s\)|Model name|Thread|Core|Socket)"
  - states: ["2", "3"]
    note: Retrieve memory capacity (total, used, free)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== MEMORY INFO ====="
            free -h
  - states: ["3", "4"]
    note: Retrieve disk device list and mount status
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== DISK INFO ====="
            lsblk -d -o NAME,SIZE,TYPE,MODEL 2>/dev/null || lsblk -d -o NAME,SIZE,TYPE
            echo ""
            df -h | grep -E "^(/dev|Filesystem)"
  - states: ["4", "5"]
    note: Retrieve GPU presence and model name (use nvidia-smi for NVIDIA GPUs)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== GPU INFO ====="
            # Portable redirect: "&>" is bash-only and misbehaves under plain sh.
            if command -v nvidia-smi > /dev/null 2>&1; then
              nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader 2>/dev/null || echo "nvidia-smi failed"
            else
              lspci 2>/dev/null | grep -i -E "(vga|3d|display)" || echo "No GPU detected via lspci"
            fi
  - states: ["5", "end"]
    note: Retrieve network interfaces and IP addresses
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== NETWORK INFO ====="
            ip -4 addr show | grep -E "(^[0-9]+:|inet )" | head -20
EOF
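Each step in the sub-workflow above carries a states pair, and the pairs form a chain from "0" to "end". A typo in one pair would break that chain, so a quick check before running can be useful. The sketch below assumes that every states entry is a [from, to] pair written on one line and that the workflow is a single linear chain, as in this tutorial; check_state_chain is a hypothetical helper, not an actor-IaC command.

```shell
# check_state_chain FILE
# Hypothetical helper: verifies that each step starts in the state where the
# previous step ended. Prints the final state, or one line per gap found.
check_state_chain() {
  awk '
    /states:/ {
      # Pull the two quoted state names out of: states: ["from", "to"]
      split($0, q, "\"")
      from = q[2]; to = q[4]
      if (prev != "" && from != prev) {
        printf "gap: step starts at %s but previous step ended at %s\n", from, prev
        bad = 1
      }
      prev = to
    }
    END {
      if (!bad) print "chain ok, final state: " prev
      exit bad
    }
  ' "$1"
}

# Example call:
#   check_state_chain sysinfo/collect-sysinfo.yaml
```

The function exits with a non-zero status when a gap is found, so it can also be used as a guard in a script. Workflows that branch or loop would need a different check.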
4.3 Create the Main Workflow
Users create the main workflow that calls the sub-workflow in the ~/works/testcluster-iac/sysinfo/main-collect-sysinfo.yaml file.
cat > sysinfo/main-collect-sysinfo.yaml << 'EOF'
name: main-collect-sysinfo
description: |
  Main workflow to collect system information from all compute nodes in the cluster in parallel.
  Executes collect-sysinfo.yaml on each node.
steps:
  - states: ["0", "end"]
    note: Execute collect-sysinfo.yaml in parallel on all nodes
    actions:
      - actor: nodeGroup
        method: apply
        arguments:
          actor: "node-*"
          method: runWorkflow
          arguments: ["collect-sysinfo.yaml"]
EOF
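The main workflow refers to the sub-workflow only by file name, so a misspelled name would not show up until run time. The following sketch checks that every quoted .yaml name in a main workflow exists in the same directory. It assumes that sub-workflow files are looked up in the workflow directory, as the layout in this tutorial suggests; check_subworkflows is a hypothetical helper name.

```shell
# check_subworkflows DIR MAIN_FILE
# Hypothetical helper: confirms that every quoted "*.yaml" name in MAIN_FILE
# exists as a file in DIR.
check_subworkflows() (
  dir="$1"; main="$2"
  status=0
  # grep -o prints each quoted "something.yaml" token on its own line.
  for name in $(grep -o '"[^"]*\.yaml"' "$dir/$main" | tr -d '"'); do
    if [ -f "$dir/$name" ]; then
      echo "found: $name"
    else
      echo "missing: $name"
      status=1
    fi
  done
  exit "$status"
)

# Example call:
#   check_subworkflows ./sysinfo main-collect-sysinfo.yaml
```

The function body runs in a subshell, so its variables do not leak into the calling shell and the exit status reports whether any file was missing.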
5. Verify the Workflow Contents
Users can check workflow contents using the list and describe commands.
5.1 Check the Workflow List
./actor_iac.java list -d ./sysinfo
Example output:
Available workflows (directory: ./sysinfo)
------------------------------------------------------------------------------------------
# File (-w) Path Workflow Name (in logs)
------------------------------------------------------------------------------------------
1 collect-sysinfo ./sysinfo/collect-sysinfo.yaml collect-sysinfo
2 main-collect-sysinfo ./sysinfo/main-collect-sysinfo... main-collect-sysinfo
5.2 Check the Workflow Description
./actor_iac.java describe -d ./sysinfo -w collect-sysinfo
Example output:
Workflow: collect-sysinfo
File: /home/user/works/testcluster-iac/sysinfo/collect-sysinfo.yaml
Description:
Sub-workflow to collect system information from each compute node.
Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.
5.3 Check the Step Descriptions
Adding the --steps option displays the note for each step.
./actor_iac.java describe -d ./sysinfo -w collect-sysinfo --steps
Example output:
Workflow: collect-sysinfo
File: /home/user/works/testcluster-iac/sysinfo/collect-sysinfo.yaml
Description:
Sub-workflow to collect system information from each compute node.
Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.
Steps:
[0 -> 1]
Retrieve hostname and OS information
[1 -> 2]
Retrieve CPU architecture, core count, and model name
[2 -> 3]
Retrieve memory capacity (total, used, free)
[3 -> 4]
Retrieve disk device list and mount status
[4 -> 5]
Retrieve GPU presence and model name (use nvidia-smi for NVIDIA GPUs)
[5 -> end]
Retrieve network interfaces and IP addresses
With the describe command, users can get an overview of a workflow without opening the YAML file. In team development, the command is a convenient way to share the purpose of a workflow and the intent of each step.
6. Verify the Directory Structure
When users complete the work up to this point, the ~/works/testcluster-iac directory has the following structure.
~/works/testcluster-iac/
├── actor_iac.java
├── inventory.ini
└── sysinfo
    ├── collect-sysinfo.yaml
    └── main-collect-sysinfo.yaml
7. Execute the Workflow
Users run the system information collection workflow with the run command.
./actor_iac.java run -d ./sysinfo -w main-collect-sysinfo -i inventory.ini -g compute
The options have the following meanings:
| Option | Description |
|---|---|
| -d ./sysinfo | Specifies the workflow directory |
| -w main-collect-sysinfo | Specifies the workflow name to execute |
| -i inventory.ini | Specifies the inventory file |
| -g compute | Specifies the target group |
actor-IaC executes the sub-workflow in parallel on 6 nodes. Output from each node is aggregated by the accumulator actor and displayed on the console.
Summary
This tutorial created a workflow to collect system information from multiple compute nodes using actor-IaC.
What was learned:
- Define multiple nodes in an inventory file
- Understand the structure of actor trees generated by actor-IaC
- Write descriptions using description: and note: in workflows
- Check workflow contents using the describe command
For information about checking execution results and utilizing the log database, see the next tutorial.