
Debugging Sub-workflow Execution

Problem Definition

Goal: Identify the cause when unexpected behavior occurs in a structure where the main workflow calls sub-workflows.

In a structure where the main workflow calls sub-workflows with runWorkflow, problems can occur at multiple layers:

  • Main workflow step transitions
  • Specific steps in the sub-workflow
  • Command execution within the sub-workflow
main-workflow.yaml
├─ step 0→1: Initialization
├─ step 1→2: runWorkflow("sub-workflow.yaml")   ← Sub-workflow executed here
│   └─ sub-workflow.yaml
│       ├─ step 0→1: Command A
│       ├─ step 1→2: Command B           ← Problem occurs here?
│       └─ step 2→end: Output results    ← Not reaching here?
└─ step 2→end: Report generation

Important Principle: Look at Logs, Not Print Debugging

Infrastructure as Code programs may make changes to the system with each execution. The debugging technique of "adding print statements and re-running" is dangerous. The log database already contains the necessary information, so investigate the logs first.

How to do it

1. Check the Session and Actor List

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --list-actors
Nodes in session #17 (main-cluster-status):
======================================================================
NODE_ID                 STATUS        LOG_LINES
----------------------------------------------------------------------
cli                     COMPLETED            12
nodeGroup               COMPLETED            45
node-stonefly514        COMPLETED           128
workflowReporter        COMPLETED             8
======================================================================
Total: 4 actors

Check the LOG_LINES for each actor. If the node actor that executed the sub-workflow (node-stonefly514) has fewer log lines than expected, it may have failed midway.
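Before drilling into any one node's log, it can help to flag actors whose log-line counts look suspiciously low. A minimal sketch, assuming the `--list-actors` output format shown above; `list_low_actors` and the threshold are hypothetical helpers, not part of the tool:

```shell
# Hypothetical helper: print actors whose LOG_LINES column falls below a
# threshold. Skips headers/separators by requiring a numeric third field.
list_low_actors() {
  awk -v t="$1" 'NF == 3 && $3 ~ /^[0-9]+$/ && $3 < t { print $1, $3 }'
}

# Example against the sample listing (in practice, pipe the command output in):
list_low_actors 20 <<'EOF'
cli                     COMPLETED            12
nodeGroup               COMPLETED            45
node-stonefly514        COMPLETED           128
workflowReporter        COMPLETED             8
EOF
# → cli 12
# → workflowReporter 8
```

A low count is only a hint, not proof of failure — short workflows legitimately log little — but it narrows which node to inspect first.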

2. Check the Logs of the Node that Executed the Sub-workflow

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514

Each step in the sub-workflow is recorded in the logs with the following pattern:

[2026-01-29T04:33:40+09:00] INFO  [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:40+09:00] INFO [node-stonefly514] Executing: Collect cluster info and nodes
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] {"cluster":"https://...","hostname":"stonefly514",...}
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] Executing: Collect pods and PVCs per namespace
...

3. Extract Logs for a Specific Step

Use grep to extract step transitions and see how far execution progressed:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep "Transition"
[2026-01-29T04:33:40+09:00] INFO  [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---

If the 2 → end transition is missing, the final step (printYaml) was never reached: either the step started by the 1 → 2 transition failed partway through, or the final step was never dispatched.
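To see at a glance how far execution provably got, you can reduce the grep output to just the last recorded transition. A sketch assuming the log format in the excerpts above; `last_transition` is a hypothetical helper, not a tool subcommand:

```shell
# Hypothetical helper: report the last step transition recorded in a log
# stream, i.e. the furthest point execution is known to have reached.
last_transition() {
  grep -o 'Transition: [0-9]* → [a-z0-9]*' | tail -n 1
}

# Example against the transitions extracted above:
last_transition <<'EOF'
[2026-01-29T04:33:40+09:00] INFO [node-stonefly514] --- Transition: 0 → 1 ---
[2026-01-29T04:33:41+09:00] INFO [node-stonefly514] --- Transition: 1 → 2 ---
EOF
# → Transition: 1 → 2
```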

4. Check for Errors

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 --level ERROR

Or, look at the logs around a specific step in detail:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep -A10 "Transition: 1 → 2"
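`grep -A10` uses a fixed window, so a long-running step's output gets cut off. When you need the whole step, an awk range that runs from one transition to the next is more reliable. A sketch; `step_window` is hypothetical, and the transition strings are assumed to match the log format above:

```shell
# Hypothetical helper: print everything logged between two transitions
# (inclusive), isolating a single step's output.
step_window() {
  awk -v a="Transition: $1" -v b="Transition: $2" \
      'index($0, a) { on = 1 } on; on && index($0, b) { exit }'
}

# Example: isolate the step that starts at the 1 → 2 transition.
step_window "1 → 2" "2 → end" <<'EOF'
[t1] INFO [node] --- Transition: 0 → 1 ---
[t1] INFO [node] Executing: Collect cluster info and nodes
[t2] INFO [node] --- Transition: 1 → 2 ---
[t2] INFO [node] Executing: Collect pods and PVCs per namespace
[t3] INFO [node] --- Transition: 2 → end ---
EOF
```

This prints the 1 → 2 line, everything logged during that step, and the closing 2 → end line, however long the step's output is.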

5. Check Command Output

Standard output and standard error from commands executed in sub-workflows are also recorded in the logs:

./actor_iac.java log-info --db ./actor-iac-logs --session 17 --node node-stonefly514 | grep -i "error\|failed\|exception"

You may find JSON parsing errors, jq errors, kubectl errors, etc.

Under the hood

Information Recorded in Logs

When executing sub-workflows, the following information is recorded in logs:

Information      | Recording Timing        | Purpose
-----------------|-------------------------|-----------------------------------
Step transitions | At start of each step   | Check how far execution progressed
Action name      | Before action execution | Check what was executed
Command output   | During executeCommand   | Check command results
ActionResult     | At action completion    | Check success/failure

Why Print Debugging is Inappropriate

Infrastructure as Code workflows may affect the system with each execution:

Workflow Type                               | Re-execution Risk
--------------------------------------------|-------------------------------------------------------------------
Information gathering (kubectl get, etc.)   | Low (read-only)
Configuration changes (kubectl apply, etc.) | Medium-High (modifies state)
Resource creation/deletion                  | High (duplicate creation or accidental deletion if not idempotent)

The log database already contains detailed information, so there's no need to re-execute for print debugging.

Basic Debugging Strategy

1. Check the log database

2. Trace step transitions

3. Identify the failed step

4. Look at that step's output in detail

5. Identify the cause and fix the workflow

6. Re-execute after fixing (only once)
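Steps 2-4 of this strategy can be folded into one quick summary pass over a node's log dump. A sketch assuming the log format shown earlier; `summarize_log` is a hypothetical helper, not part of the tool:

```shell
# Hypothetical helper: one-shot summary of a node's log dump -- the last
# transition reached, plus a count of suspicious lines (as in step 5).
summarize_log() {
  echo "last transition: $(grep -o 'Transition: [0-9]* → [a-z0-9]*' "$1" | tail -n 1)"
  echo "suspicious lines: $(grep -ci 'error\|failed\|exception' "$1")"
}

# Usage: dump the node's log once, then summarize it offline, e.g.
#   ./actor_iac.java log-info --db ./actor-iac-logs --session 17 \
#       --node node-stonefly514 > node.log
#   summarize_log node.log
```

Dumping once and analyzing offline fits the principle above: every question is answered from the recorded logs, with no re-execution.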

Instead of repeatedly "adding print statements and re-executing," proceed with "look at logs to identify the cause, fix it, then re-execute."