
Debugging Sub-workflow Execution

Problem Definition

Goal: Identify the cause when unexpected behavior occurs in a structure where the main workflow calls sub-workflows.

In a structure where the main workflow calls sub-workflows with runWorkflow, problems can occur at multiple layers:

  • Main workflow step transitions
  • Specific steps in the sub-workflow
  • Command execution within the sub-workflow
main-workflow.yaml
├─ step 0→1: Initialization
├─ step 1→2: runWorkflow("sub-workflow.yaml")   ← Sub-workflow executed here
│   └─ sub-workflow.yaml
│       ├─ step 0→1: Command A
│       ├─ step 1→2: Command B           ← Problem occurs here?
│       └─ step 2→end: Output results    ← Not reaching here?
└─ step 2→end: Report generation

Important Principle: Look at Logs, Not Print Debugging

Infrastructure as Code programs may make changes to the system with each execution. The debugging technique of "adding print statements and re-running" is dangerous. The log database already contains the necessary information, so investigate the logs first.

How to do it

1. Check the Session and Actor List

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --list-actors
Nodes in session #17 (main-cluster-status):
======================================================================
NODE_ID                 STATUS        LOG_LINES
----------------------------------------------------------------------
cli                     COMPLETED            12
nodeGroup               COMPLETED            45
node-stonefly514        COMPLETED           128
workflowReporter        COMPLETED             8
======================================================================
Total: 4 actors

Check the LOG_LINES for each actor. If the node actor that executed the sub-workflow (node-stonefly514) has fewer log lines than expected, it may have failed midway.
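Before drilling into any one node's log, it can help to flag actors whose log-line counts look suspiciously low. A minimal sketch, assuming the `--list-actors` output format shown above; `list_low_actors` and the threshold are hypothetical helpers, not part of the tool:

```shell
# Hypothetical helper: print actors whose LOG_LINES column falls below a
# threshold. Skips headers/separators by requiring a numeric third field.
list_low_actors() {
  awk -v t="$1" 'NF == 3 && $3 ~ /^[0-9]+$/ && $3 < t { print $1, $3 }'
}

# Example against the sample listing (in practice, pipe the command output in):
list_low_actors 20 <<'EOF'
cli                     COMPLETED            12
nodeGroup               COMPLETED            45
node-stonefly514        COMPLETED           128
workflowReporter        COMPLETED             8
EOF
# → cli 12
# → workflowReporter 8
```

A low count is only a hint, not proof of failure — short workflows legitimately log little — but it narrows which node to inspect first.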

2. Check the Logs of the Node that Executed the Sub-workflow

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514

Each step in the sub-workflow is recorded in the logs with the following pattern:

[2026-01-29T04:33:40+09:00] INFO  [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:40+09:00] INFO [node-stonefly514] Executing: Collect cluster info and nodes
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] {"cluster":"https://...","hostname":"stonefly514",...}
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] Executing: Collect pods and PVCs per namespace
...

3. Extract Logs for a Specific Step

Use grep to extract step transitions and see how far execution progressed:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep "Transition"
[2026-01-29T04:33:40+09:00] INFO  [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---

If the 2 → end transition is missing, the final step (printYaml) was never reached: either the step started by the 1 → 2 transition failed partway through, or the final step was never dispatched.
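To see at a glance how far execution provably got, you can reduce the grep output to just the last recorded transition. A sketch assuming the log format in the excerpts above; `last_transition` is a hypothetical helper, not a tool subcommand:

```shell
# Hypothetical helper: report the last step transition recorded in a log
# stream, i.e. the furthest point execution is known to have reached.
last_transition() {
  grep -o 'Transition: [0-9]* → [a-z0-9]*' | tail -n 1
}

# Example against the transitions extracted above:
last_transition <<'EOF'
[2026-01-29T04:33:40+09:00] INFO [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---
EOF
# → Transition: 1 → 2
```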

4. Check for Errors

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 --level ERROR

Or, look at the logs around a specific step in detail:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep -A10 "Transition: 1 → 2"
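`grep -A10` uses a fixed window, so a long-running step's output gets cut off. When you need the whole step, an awk range that runs from one transition to the next is more reliable. A sketch; `step_window` is hypothetical, and the transition strings are assumed to match the log format above:

```shell
# Hypothetical helper: print everything logged between two transitions
# (inclusive), isolating a single step's output.
step_window() {
  awk -v a="Transition: $1" -v b="Transition: $2" \
      'index($0, a) { on = 1 } on; on && index($0, b) { exit }'
}

# Example: isolate the step that starts at the 1 → 2 transition.
step_window "1 → 2" "2 → end" <<'EOF'
[t1] INFO [node] --- Transition: 0 → 1 ---
[t1] INFO [node] Executing: Collect cluster info and nodes
[t2] INFO [node] --- Transition: 1 → 2 ---
[t2] INFO [node] Executing: Collect pods and PVCs per namespace
[t3] INFO [node] --- Transition: 2 → end ---
EOF
```

This prints the 1 → 2 line, everything logged during that step, and the closing 2 → end line, however long the step's output is.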

5. Check Command Output

Standard output and standard error from commands executed in sub-workflows are also recorded in the logs:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep -i "error\|failed\|exception"

You may find JSON parsing errors, jq errors, kubectl errors, etc.

Under the hood

Information Recorded in Logs

When executing sub-workflows, the following information is recorded in logs:

Information      | Recording Timing        | Purpose
-----------------|-------------------------|-----------------------------------
Step transitions | At start of each step   | Check how far execution progressed
Action name      | Before action execution | Check what was executed
Command output   | During executeCommand   | Check command results
ActionResult     | At action completion    | Check success/failure

Why Print Debugging is Inappropriate

Infrastructure as Code workflows may affect the system with each execution:

Workflow Type                               | Re-execution Risk
--------------------------------------------|-------------------------------------------------------------------
Information gathering (kubectl get, etc.)   | Low (read-only)
Configuration changes (kubectl apply, etc.) | Medium-High (modifies state)
Resource creation/deletion                  | High (duplicate creation or accidental deletion if not idempotent)

The log database already contains detailed information, so there's no need to re-execute for print debugging.

Basic Debugging Strategy

1. Check the log database

2. Trace step transitions

3. Identify the failed step

4. Look at that step's output in detail

5. Identify the cause and fix the workflow

6. Re-execute after fixing (only once)
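Steps 2-4 of this strategy can be folded into one quick summary pass over a node's log dump. A sketch assuming the log format shown earlier; `summarize_log` is a hypothetical helper, not part of the tool:

```shell
# Hypothetical helper: one-shot summary of a node's log dump -- the last
# transition reached, plus a count of suspicious lines (as in step 5).
summarize_log() {
  echo "last transition: $(grep -o 'Transition: [0-9]* → [a-z0-9]*' "$1" | tail -n 1)"
  echo "suspicious lines: $(grep -ci 'error\|failed\|exception' "$1")"
}

# Usage: dump the node's log once, then summarize it offline, e.g.
#   ./actor_iac.java log-info --db ./actor-iac-logs --session 17 \
#       --node node-stonefly514 > node.log
#   summarize_log node.log
```

Dumping once and analyzing offline fits the principle above: every question is answered from the recorded logs, with no re-execution.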

Instead of repeatedly "adding print statements and re-executing," proceed with "look at logs to identify the cause, fix it, then re-execute."