This is documentation for version 2.13.0. See the latest version.

Investigating Failure Causes with Logs

Problem Definition

Goal: Identify the cause and take action when some nodes fail during workflow execution.

When a workflow runs across multiple nodes, only some of them may fail. Because console output scrolls past, it is hard to determine afterwards which node failed and why.

Using the log database makes it easy to identify failed nodes and confirm error details.

How to do it

Check Failure in Session List

./actor_iac.java log-info --db ./actor-iac-logs --since 1h
Sessions:
================================================================================
#5 main-collect-sysinfo PARTIAL
Inventory: inventory.ini
Started: 2026-01-15T10:00:00+09:00
--------------------------------------------------------------------------------

Session #5 has the status PARTIAL, which means some of its nodes failed.
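When the session list is long, picking out the sessions that need attention can be scripted. The following is a minimal sketch, assuming each session header has the shape `#<id> <workflow> <STATUS>` as in the output above. The function name `sessions_with_failures` and the regular expression are illustrative and not part of actor-IaC.

```python
import re

# Sample text copied from the log-info output shown above.
SAMPLE = """\
Sessions:
================================================================================
#5 main-collect-sysinfo PARTIAL
Inventory: inventory.ini
Started: 2026-01-15T10:00:00+09:00
--------------------------------------------------------------------------------
"""

# Assumed header shape: "#<id> <workflow name> <STATUS>".
SESSION_LINE = re.compile(
    r"^#(\d+)\s+(\S+)\s+(RUNNING|COMPLETED|PARTIAL|FAILED)\s*$"
)


def sessions_with_failures(listing):
    """Return (session_id, workflow, status) for PARTIAL or FAILED sessions."""
    found = []
    for line in listing.splitlines():
        match = SESSION_LINE.match(line.strip())
        if match and match.group(3) in ("PARTIAL", "FAILED"):
            found.append((int(match.group(1)), match.group(2), match.group(3)))
    return found


for session_id, workflow, status in sessions_with_failures(SAMPLE):
    print(f"session {session_id} ({workflow}) is {status}")
```

The returned session id is the value to pass to `--session` in the next step.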

Check Which Nodes Failed

./actor_iac.java log-info --db ./actor-iac-logs --session 5 --list-actors
Nodes in session #5 (main-collect-sysinfo):
======================================================================
NODE_ID STATUS LOG_LINES
----------------------------------------------------------------------
node-node13 COMPLETED 57
node-node14 COMPLETED 57
node-node15 COMPLETED 57
node-node21 FAILED 3
node-node22 FAILED 2
node-node23 COMPLETED 57
======================================================================
Total: 6 nodes

node-node21 and node-node22 are FAILED, and their LOG_LINES counts (3 and 2) are far lower than the 57 lines logged by each completed node.
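With many nodes, the FAILED rows can be filtered out of the `--list-actors` report programmatically. This is a rough sketch that assumes the three whitespace-separated columns shown above (NODE_ID, STATUS, LOG_LINES). The helper `failed_nodes` is a hypothetical name used only for illustration.

```python
# Sample text copied from the --list-actors output shown above.
SAMPLE = """\
Nodes in session #5 (main-collect-sysinfo):
======================================================================
NODE_ID STATUS LOG_LINES
----------------------------------------------------------------------
node-node13 COMPLETED 57
node-node14 COMPLETED 57
node-node15 COMPLETED 57
node-node21 FAILED 3
node-node22 FAILED 2
node-node23 COMPLETED 57
======================================================================
Total: 6 nodes
"""


def failed_nodes(report):
    """Return the NODE_ID of every row whose STATUS column is FAILED."""
    failed = []
    for line in report.splitlines():
        parts = line.split()
        # A data row has exactly three fields and a numeric LOG_LINES column.
        if len(parts) == 3 and parts[1] == "FAILED" and parts[2].isdigit():
            failed.append(parts[0])
    return failed


print(failed_nodes(SAMPLE))
```

Each returned id can then be passed to `--node` to read that node's log.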

Check Error Details for Failed Nodes

./actor_iac.java log-info --db ./actor-iac-logs --session 5 --node node-node21
Logs for node: node-node21
================================================================================
[2026-01-15T10:00:01+09:00] INFO [node-node21] Connecting to 192.168.5.21...
[2026-01-15T10:00:01+09:00] ERROR [node-node21] SSH connection failed: Connection refused
[2026-01-15T10:00:01+09:00] ERROR [node-node21] Workflow execution aborted
================================================================================
Total: 3 lines

./actor_iac.java log-info --db ./actor-iac-logs --session 5 --node node-node22
Logs for node: node-node22
================================================================================
[2026-01-15T10:00:01+09:00] INFO [node-node22] Connecting to 192.168.5.22...
[2026-01-15T10:00:31+09:00] ERROR [node-node22] SSH connection failed: Connection timed out
================================================================================
Total: 2 lines
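The timestamps in these logs carry useful information: node-node21 fails in the same second it starts connecting (refused), while node-node22 fails 30 seconds later (timed out). The sketch below extracts the first ERROR entry and the elapsed time, assuming the line layout `[timestamp] LEVEL [node] message` seen above. The names `parse_log` and `first_error` are illustrative only.

```python
import re
from datetime import datetime

# Sample log lines copied from the output shown above.
NODE21_LOG = """\
[2026-01-15T10:00:01+09:00] INFO [node-node21] Connecting to 192.168.5.21...
[2026-01-15T10:00:01+09:00] ERROR [node-node21] SSH connection failed: Connection refused
[2026-01-15T10:00:01+09:00] ERROR [node-node21] Workflow execution aborted
"""
NODE22_LOG = """\
[2026-01-15T10:00:01+09:00] INFO [node-node22] Connecting to 192.168.5.22...
[2026-01-15T10:00:31+09:00] ERROR [node-node22] SSH connection failed: Connection timed out
"""

# Assumed layout: "[<ISO 8601 timestamp>] <LEVEL> [<node id>] <message>".
LOG_LINE = re.compile(r"^\[([^\]]+)\]\s+([A-Z]+)\s+\[([^\]]+)\]\s+(.*)$")


def parse_log(text):
    """Return (timestamp, level, node, message) tuples; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m:
            timestamp = datetime.fromisoformat(m.group(1))
            entries.append((timestamp, m.group(2), m.group(3), m.group(4)))
    return entries


def first_error(text):
    """Return (seconds since the first entry, message) for the first ERROR, or None."""
    entries = parse_log(text)
    for timestamp, level, _node, message in entries:
        if level == "ERROR":
            return ((timestamp - entries[0][0]).total_seconds(), message)
    return None


print(first_error(NODE21_LOG))
print(first_error(NODE22_LOG))
```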

Remediation and Re-execution

Now that the problems have been identified, address them:

  • node21: Start the SSH server (sudo systemctl start sshd)
  • node22: Check network connection or firewall settings

After remediation, re-execute the workflow.
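Before re-executing, it can help to confirm that the SSH port on each repaired node now accepts TCP connections. The following is a minimal sketch using only the Python standard library. It is independent of actor-IaC, and it assumes SSH listens on port 22, which the logs above do not state. `probe_tcp_port` is a hypothetical helper name.

```python
import socket


def probe_tcp_port(host, port=22, timeout=5.0):
    """Try a TCP connect and return 'open', 'refused', 'timed out', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout (unreachable host or dropped packets).
        return "timed out"
    except OSError as exc:
        return f"error: {exc}"


# Example usage against the two failed nodes from this page:
#   for host in ("192.168.5.21", "192.168.5.22"):
#       print(host, probe_tcp_port(host))
```

A result of `open` only shows that something accepts connections on the port. It does not verify authentication or host keys.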

Under the hood

Session Status

The status column in the sessions table contains the following values:

Status     Description
RUNNING    Currently executing
COMPLETED  All nodes succeeded
PARTIAL    Some nodes failed
FAILED     All nodes failed
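The relationship between node results and the session status can be expressed as a small function. This is one reading of the table above, not the tool's actual implementation. The `finished` flag and the function name `session_status` are assumptions made for the sketch.

```python
def session_status(node_statuses, finished=True):
    """Derive a session status from per-node statuses (COMPLETED or FAILED)."""
    if not finished:
        return "RUNNING"
    if not node_statuses:
        raise ValueError("a finished session needs at least one node result")
    failed = sum(1 for status in node_statuses if status == "FAILED")
    if failed == 0:
        return "COMPLETED"   # all nodes succeeded
    if failed == len(node_statuses):
        return "FAILED"      # all nodes failed
    return "PARTIAL"         # some, but not all, nodes failed


# Session #5 from this page: four nodes completed, two failed.
print(session_status(["COMPLETED"] * 4 + ["FAILED"] * 2))
```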

Node Status

The status column in the node_results table contains the following values:

Status     Description
COMPLETED  Completed successfully
FAILED     Failed (reason recorded in reason column)

Common Error Messages

Error Message                 Cause                            Remedy
Connection refused            SSH server not running           sudo systemctl start sshd
Connection timed out          Network unreachable or firewall  Check network settings
Authentication failed         Invalid credentials              Check SSH key or password
Host key verification failed  Host key changed                 Remove entry from ~/.ssh/known_hosts
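The table above lends itself to a simple lookup when triaging many failed nodes at once. The sketch below matches a log line against the four messages by case-insensitive substring. The matching rule and the name `diagnose` are assumptions for illustration. The causes and remedies are copied from the table.

```python
# (error message, cause, remedy), copied from the table above.
KNOWN_ERRORS = [
    ("Connection refused", "SSH server not running", "sudo systemctl start sshd"),
    ("Connection timed out", "Network unreachable or firewall", "Check network settings"),
    ("Authentication failed", "Invalid credentials", "Check SSH key or password"),
    ("Host key verification failed", "Host key changed",
     "Remove entry from ~/.ssh/known_hosts"),
]


def diagnose(log_line):
    """Return (cause, remedy) for the first known message found in log_line, else None."""
    lowered = log_line.lower()
    for message, cause, remedy in KNOWN_ERRORS:
        if message.lower() in lowered:
            return cause, remedy
    return None


line = "[2026-01-15T10:00:01+09:00] ERROR [node-node21] SSH connection failed: Connection refused"
print(diagnose(line))
```

Lines that match none of the four messages return `None` and still need to be read by hand.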

Problem Identification by LOG_LINES

The LOG_LINES count shown by --list-actors gives a rough indication of when the failure occurred:

LOG_LINES                Estimated Failure Timing
0-5 lines                Failed at SSH connection stage
10-30 lines              Failed during workflow execution
Similar to normal nodes  Failed near final step
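This heuristic can be written down as a function. It is a sketch under stated assumptions: "similar to normal nodes" is interpreted here as within 10% of a completed node's line count, a threshold the table does not define, and counts that fall between the documented ranges are reported as unclassified. `estimate_failure_stage` is a hypothetical name.

```python
def estimate_failure_stage(log_lines, normal_lines, tolerance=0.1):
    """Map a failed node's LOG_LINES to the estimated timing from the table above.

    normal_lines is the LOG_LINES of a node that completed (57 in the example).
    tolerance is an assumed cut-off for "similar to normal nodes".
    """
    if log_lines <= 5:
        return "Failed at SSH connection stage"
    if log_lines >= normal_lines * (1 - tolerance):
        return "Failed near final step"
    if 10 <= log_lines <= 30:
        return "Failed during workflow execution"
    return "Unclassified (outside the documented ranges)"


# The two failed nodes from session #5 logged 3 and 2 lines; completed nodes logged 57.
for count in (3, 2):
    print(count, estimate_failure_stage(count, normal_lines=57))
```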