# Collecting System Information from a Cluster
## Problem Definition

In cluster management, knowing each node's configuration (CPU, memory, disk, GPU, OS, network) is a fundamental task. With Turing-workflow, you define the target nodes in an inventory file and run the same information-collection process in parallel across all of them.
## Assumed Network Configuration

This recipe assumes you connect over SSH from an operator terminal to the compute nodes in the cluster. The operator terminal is the machine running Turing-workflow; if it sits on a network outside the cluster, it reaches each compute node through a gateway.
| Item | Value |
|---|---|
| Gateway IP address | 192.168.5.1 |
| Compute node IP addresses | 192.168.5.13, .14, .15, .21, .22, .23 |
| SSH username | youruser |
If the operator terminal is inside the cluster, you can SSH directly to each node. If the operator terminal is outside the cluster, use SSH's ProxyJump feature through the gateway.
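For the outside-the-cluster case, ProxyJump can be configured once in `~/.ssh/config` so that plain `ssh` (and therefore Turing-workflow) transparently hops through the gateway. This is a sketch using the addresses from the table above; the `gateway` host alias is an arbitrary name, and the negated pattern keeps the gateway itself from jumping through itself:

```
# ~/.ssh/config
Host gateway
    HostName 192.168.5.1
    User youruser

# All compute nodes except the gateway go through the gateway.
Host 192.168.5.* !192.168.5.1
    User youruser
    ProxyJump gateway
```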
## How to do it

### 1. Create the Inventory File

Define the target nodes for system information collection in `inventory.ini`.
```ini
[compute]
node13 actoriac_host=192.168.5.13
node14 actoriac_host=192.168.5.14
node15 actoriac_host=192.168.5.15
node21 actoriac_host=192.168.5.21
node22 actoriac_host=192.168.5.22
node23 actoriac_host=192.168.5.23

[compute:vars]
actoriac_user=youruser
```
| Item | Description |
|---|---|
| `[compute]` | Group name |
| `node13` -- `node23` | Identifier for each node |
| `actoriac_host=...` | IP address of each node |
| `[compute:vars]` | Variables applied to the entire group |
| `actoriac_user=youruser` | SSH connection username |
### 2. Create the Sub-workflow

Create the sub-workflow that will be executed on each node in `sysinfo/collect-sysinfo.yaml`.
```yaml
name: collect-sysinfo
description: |
  Sub-workflow to collect system information from each compute node.
  Retrieves hostname, OS, CPU, memory, disk, GPU, and network information.
steps:
  - states: ["0", "1"]
    note: Retrieve hostname and OS information
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== HOSTNAME ====="
            hostname -f
            echo ""
            echo "===== OS INFO ====="
            head -5 /etc/os-release
            uname -a
  - states: ["1", "2"]
    note: Retrieve CPU architecture, core count, and model name
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== CPU INFO ====="
            lscpu | head -15
  - states: ["2", "3"]
    note: Retrieve memory capacity (total, used, free)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== MEMORY INFO ====="
            free -h
  - states: ["3", "4"]
    note: Retrieve disk device list and mount status
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== DISK INFO ====="
            lsblk -d -o NAME,SIZE,TYPE,MODEL 2>/dev/null || lsblk -d -o NAME,SIZE,TYPE
            echo ""
            df -h | head -10
  - states: ["4", "5"]
    note: Retrieve GPU presence and model name (uses nvidia-smi for NVIDIA GPUs)
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== GPU INFO ====="
            if command -v nvidia-smi >/dev/null 2>&1; then
              nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader 2>/dev/null || echo "nvidia-smi failed"
            else
              lspci 2>/dev/null | head -5 || echo "No GPU detected via lspci"
            fi
  - states: ["5", "end"]
    note: Retrieve network interfaces and IP addresses
    actions:
      - actor: this
        method: executeCommand
        arguments:
          - |
            echo "===== NETWORK INFO ====="
            ip -4 addr show | head -20
```
The sub-workflow consists of six steps that collect, in order, hostname/OS, CPU, memory, disk, GPU, and network information.
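Each `arguments` entry is a multi-line shell script that `executeCommand` runs as a single shell invocation on the node. As a rough local stand-in (plain `subprocess`, not Turing-workflow's actual SSH transport), a step can be exercised like this; the example script is trimmed to portable commands:

```python
import subprocess

def run_step(script: str) -> str:
    """Run a step's shell script the way executeCommand conceptually does:
    one shell invocation per step, with stdout captured."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout

# The hostname/OS step from collect-sysinfo.yaml, exercised locally
# (uname used here instead of hostname -f / os-release for portability).
step_script = """
echo "===== HOSTNAME ====="
uname -n
echo ""
echo "===== OS INFO ====="
uname -a
"""
print(run_step(step_script))
```

Running a step locally like this is a quick way to debug the shell fragments before putting them in the workflow.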
### 3. Create the Main Workflow

Create the main workflow that calls the sub-workflow in `sysinfo/main-collect-sysinfo.yaml`.
```yaml
name: main-collect-sysinfo
description: |
  Main workflow to collect system information from all compute nodes in parallel.
  Executes collect-sysinfo.yaml on each node.
steps:
  - states: ["0", "end"]
    note: Execute collect-sysinfo.yaml in parallel on all nodes
    actions:
      - actor: nodeGroup
        method: apply
        arguments:
          actor: "node-*"
          method: runWorkflow
          arguments: ["collect-sysinfo.yaml"]
```
The main workflow uses `nodeGroup.apply()` to execute the sub-workflow in parallel on all node actors matching the `node-*` pattern.
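Conceptually, `apply` fans one method call out to every child actor whose name matches the glob pattern, runs the calls concurrently, and collects the results per actor. The following Python sketch illustrates that dispatch logic under stated assumptions (the `FakeNode` class and this `apply` function are hypothetical stand-ins, not Turing-workflow's real NodeGroupInterpreter):

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatch

def apply(children, pattern, method, *args):
    """Fan method(*args) out in parallel to every child actor whose
    name matches the glob pattern; collect results by actor name."""
    targets = {name: actor for name, actor in children.items()
               if fnmatch(name, pattern)}
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(getattr(actor, method), *args)
                   for name, actor in targets.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Stand-in node actor: records which workflow it was asked to run.
class FakeNode:
    def __init__(self, name):
        self.name = name
    def runWorkflow(self, workflow):
        return f"[node-{self.name}] ran {workflow}"

children = {f"node-node{n}": FakeNode(f"node{n}")
            for n in (13, 14, 15, 21, 22, 23)}
results = apply(children, "node-*", "runWorkflow", "collect-sysinfo.yaml")
```

The pattern `node-*` matches all six child actors, so all six `runWorkflow` calls run concurrently, mirroring the parallel execution described above.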
### 4. Run the Workflow

```shell
./turing_workflow.java run -w sysinfo/main-collect-sysinfo.yaml -i inventory.ini -g compute
```
| Option | Description |
|---|---|
| `-w sysinfo/main-collect-sysinfo.yaml` | Workflow file to execute |
| `-i inventory.ini` | Inventory file |
| `-g compute` | Target group |
Turing-workflow executes the sub-workflow in parallel on the six nodes. Output from each node is prefixed with its actor name, e.g. `[node-node13]`. All logs are automatically saved to `turing-workflow-logs.mv.db`.
### 5. Directory Structure

After completing the steps above, the directory layout is as follows.
```
~/works/testcluster/
├── turing_workflow.java
├── Turing-workflow-3.0.0.jar
├── turing-workflow-logs.mv.db   ← automatically created after execution
├── inventory.ini
└── sysinfo/
    ├── collect-sysinfo.yaml
    └── main-collect-sysinfo.yaml
```
## Under the hood

### Actor Tree Generation

When a workflow is executed, Turing-workflow reads the inventory file and generates an actor tree: one node actor per node definition in the inventory, placed as a child of the nodeGroup actor. In the example above, six node actors are generated for the six nodes (node13 -- node23).
```
ROOT Actor
└── nodeGroup Actor ("nodeGroup")
    ├── node-node13 Actor
    ├── node-node14 Actor
    ├── node-node15 Actor
    ├── node-node21 Actor
    ├── node-node22 Actor
    └── node-node23 Actor
```
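The tree above is derived mechanically from the inventory: each host line in the group becomes one `node-<name>` child of the nodeGroup actor. A minimal parsing sketch under the assumption that host lines follow the simple `name key=value` shape used in `inventory.ini` above (the `actor_names` helper is illustrative, not part of Turing-workflow):

```python
def actor_names(inventory_text: str, group: str) -> list[str]:
    """Return the node actor names generated for one inventory group."""
    names, in_group = [], False
    for line in inventory_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A [group] header opens a host section; [group:vars] does not.
            in_group = (line == f"[{group}]")
        elif in_group and line:
            host = line.split()[0]          # e.g. "node13"
            names.append(f"node-{host}")
    return names

inventory = """\
[compute]
node13 actoriac_host=192.168.5.13
node14 actoriac_host=192.168.5.14
[compute:vars]
actoriac_user=youruser
"""
print(actor_names(inventory, "compute"))
```

With the full six-node inventory this yields `node-node13` through `node-node23`, matching the children shown in the tree.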
The purpose of Turing-workflow is to execute identical configuration tasks in parallel across multiple servers. To achieve parallel execution, multiple node actors are placed as child actors under the nodeGroup actor.
| Actor | Role |
|---|---|
| nodeGroup Actor | Executes the main workflow; controls parallel execution of sub-workflows on the target nodes |
| node Actor | Executes the sub-workflow; runs commands on its remote node over SSH |
### Parallel Execution of Sub-workflows

When the main workflow `main-collect-sysinfo.yaml` calls `nodeGroup.apply()`, the NodeGroupInterpreter executes the sub-workflow `collect-sysinfo.yaml` in parallel on all child actors matching the specified pattern (`node-*`). Each node actor runs the commands on its remote node over SSH and returns the results.