Observability
Three browser tabs cover the full diagnostic surface for the agent stack. One command, make observe, opens all three: Jaeger (:16686), Prometheus (:9090), and Dozzle (:8888) in your default browser. It works on Linux, macOS, Windows, and WSL2; see scripts/observe.py for the platform handling.
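As a rough illustration of what such an opener can look like, here is a minimal sketch built on Python's standard webbrowser module. It is not the contents of scripts/observe.py: the localhost host, the PANES mapping, and the pane_urls / open_panes helpers are assumptions made for the example, and the real script carries platform handling (for example for WSL2) that this sketch omits.

```python
import webbrowser

# Ports come from the text above; the "localhost" default is an assumption.
PANES = {"Jaeger": 16686, "Prometheus": 9090, "Dozzle": 8888}


def pane_urls(host="localhost"):
    """Build one URL per pane (hypothetical helper, not from observe.py)."""
    return {name: f"http://{host}:{port}" for name, port in PANES.items()}


def open_panes(host="localhost", opener=webbrowser.open_new_tab):
    """Open every pane and report whether each open call succeeded.

    `opener` is injectable so the function can be exercised without
    launching a real browser.
    """
    return {name: bool(opener(url)) for name, url in pane_urls(host).items()}
```

Calling open_panes() with no arguments hands the three URLs to the default browser; passing a custom opener makes it easy to dry-run on a headless machine.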
Mental model
| Pane | Category | The question it answers |
|---|---|---|
| Jaeger | traces | Where did this request go? — A2A hops, MCP tool calls, full pipeline timing |
| Prometheus | metrics | How often / how slow? — rates, durations, percentile latencies |
| Dozzle | logs | What is THIS agent saying right now? — per-container live tail |
Trace → flow. Metric → aggregate. Log → narrative. If you read only one column, you should still know which tab to click.
Cross-tool linking
Every log line emitted inside an OpenTelemetry span context carries the active trace ID, so the cross-pane jump is direct:
- Spot a slow span in Jaeger → copy its trace ID from the URL or the trace header.
- Paste it into Dozzle's filter as trace=<id>.
- The exact log lines emitted during that trace come back, across every container that touched it.
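The copy-and-paste step can be sketched as a small helper that pulls the trace ID out of a Jaeger trace URL and turns it into the filter string. This assumes the usual Jaeger UI URL shape of /trace/<id> with a 32-character hex ID; the function name is hypothetical and is not part of the stack.

```python
import re
from urllib.parse import urlparse

# Assumption: OpenTelemetry-style 128-bit trace IDs, i.e. 32 hex characters.
_TRACE_PATH = re.compile(r"^/trace/([0-9a-fA-F]{32})$")


def dozzle_filter_from_jaeger_url(url):
    """Return the `trace=<id>` filter string for a Jaeger trace page URL."""
    path = urlparse(url).path.rstrip("/")
    match = _TRACE_PATH.match(path)
    if match is None:
        raise ValueError(f"not a Jaeger trace URL: {url!r}")
    return f"trace={match.group(1).lower()}"
```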
Lines emitted outside any span (startup chatter, pre-request setup) render with the bare format and skip the trace block entirely, so non-traced lines don't pad the output with 32 zeros.
The wiring lives in shared/src/kourai_common/log.py::_OtelTraceFilter, a small logging.Filter that reads opentelemetry.trace.get_current_span() per record and stamps otelTraceID / otelSpanID onto it. It pairs with the format swapper _TraceAwareFormatter, which picks between the two format strings depending on whether the trace ID is real or all zeros.
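The filter-plus-formatter pairing described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in log.py: the class names drop the leading underscore, both format strings are invented for the example (only the trace=<id> token is taken from the Dozzle filter described earlier), and the ImportError fallback exists only so the sketch still runs where opentelemetry-api is not installed.

```python
import logging

try:
    from opentelemetry import trace
except ImportError:  # keeps the sketch runnable without opentelemetry-api
    trace = None

ZERO_TRACE_ID = "0" * 32
ZERO_SPAN_ID = "0" * 16


class OtelTraceFilter(logging.Filter):
    """Stamp otelTraceID / otelSpanID onto every record."""

    def filter(self, record):
        trace_id, span_id = ZERO_TRACE_ID, ZERO_SPAN_ID
        if trace is not None:
            ctx = trace.get_current_span().get_span_context()
            if ctx.is_valid:
                trace_id = trace.format_trace_id(ctx.trace_id)
                span_id = trace.format_span_id(ctx.span_id)
        record.otelTraceID = trace_id
        record.otelSpanID = span_id
        return True  # never drop the record, only annotate it


class TraceAwareFormatter(logging.Formatter):
    """Use the traced format only when the record carries a real trace ID."""

    # Both format strings are assumptions made for this example.
    BARE = "%(asctime)s %(levelname)s %(name)s %(message)s"
    TRACED = ("%(asctime)s %(levelname)s %(name)s "
              "trace=%(otelTraceID)s span=%(otelSpanID)s %(message)s")

    def __init__(self):
        super().__init__(self.BARE)
        self._bare = logging.Formatter(self.BARE)
        self._traced = logging.Formatter(self.TRACED)

    def format(self, record):
        trace_id = getattr(record, "otelTraceID", ZERO_TRACE_ID)
        chosen = self._traced if trace_id != ZERO_TRACE_ID else self._bare
        return chosen.format(record)
```

Attaching both to a single handler (handler.addFilter(OtelTraceFilter()) followed by handler.setFormatter(TraceAwareFormatter())) is enough for every logger that writes through that handler to pick up the behaviour.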
Triage runbook
When something looks wrong, the answer is almost always one of these patterns:
Agent looks stuck mid-pipeline
1. make observe → click Dozzle.
2. Find the agent's container group in the sidebar (agents group).
3. Live-tail; the most recent line is what it's working on right now. If the same line has been there for >30s, the agent is wedged — check for tool_use lines that didn't get a tool_result reply.
4. Grab the trace ID from the most recent line, paste into Jaeger's search to see the full A2A → MCP flow that led here.
A request finished but felt slow
1. Open Jaeger.
2. Search the service that owns the request entry point (Hephaestus for player-facing requests).
3. Click the most recent trace; the waterfall shows which child span dominated wall time.
4. If the dominant span is an LLM call, that's the LiteLLM request. Check Prometheus for whether this latency is normal for that agent + model combo.
Errors visible somewhere but unclear which agent
1. Jaeger first: failed spans are flagged red. Drill into the trace and find the span with status: error; the service name on it tells you which container to tail.
2. Use that trace ID to grep Dozzle for the same trace context across other containers (the error often originates upstream of where it surfaced).
OOM / container kept restarting
1. Dozzle directly: the live log shows the bootstrap chatter on each restart.
2. Jaeger may show no recent traces from that service, which corroborates that the agent is too unhealthy to even emit telemetry.
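For the "felt slow" pattern, the Prometheus check can be sketched as a p95 query over the spanmetrics histogram. Treat every name here as an assumption: the metric name depends on how the spanmetrics connector's namespace is configured in this stack, the service_name / span_name labels are the connector's usual defaults, and the localhost:9090 base URL is inferred from the port listed at the top of the page. Only the /api/v1/query endpoint is standard Prometheus.

```python
from urllib.parse import urlencode

# Assumed metric name; check what the spanmetrics connector actually exports.
DEFAULT_METRIC = "traces_span_metrics_duration_milliseconds_bucket"


def p95_latency_query(service, span_name, metric=DEFAULT_METRIC, window="5m"):
    """Build a PromQL p95 latency query for one service + span name."""
    selector = f'{metric}{{service_name="{service}", span_name="{span_name}"}}'
    return f"histogram_quantile(0.95, sum by (le) (rate({selector}[{window}])))"


def instant_query_url(promql, base="http://localhost:9090"):
    """URL for Prometheus's instant-query HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"
```

For example, p95_latency_query("kallos", "kallos.generate") targets the LLM-call span of one agent; comparing its result with the span duration seen in Jaeger shows whether the slow request was an outlier.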
Container groups in Dozzle
The compose file labels every service with a dev.dozzle.group label so the Dozzle sidebar separates them into four collapsible sections:
| Group | Members | Purpose |
|---|---|---|
| agents | mneme, kallos, techne, dokimasia, metis, puck, cupid, aidos, aletheia, hephaestus | The 10 A2A agents (6 specialists + 2 companions + 2 validators) |
| mcp | memory-mcp, context7-mcp | MCP sidecars the agents call into |
| observability | jaeger, prometheus, dozzle | The triad itself |
| infra | vn-bridge | Bridges + future infrastructure |
When triaging, "is this an agent issue or an infra issue?" is one click in the sidebar, not a hunt through 16 containers.
What's currently populated, and what's not
Be candid about gaps so you don't go looking for data that isn't there:
- Jaeger: fully populated. Every A2A call creates a span; W3C trace context propagates via A2A metadata headers; MCP tool calls add child spans. Running jaegertracing/jaeger:2.17.0 with the OTel-Collector-shape config in docker/jaeger-config.yaml.
- Dozzle: fully populated. Every container's stdout/stderr; live tail; trace IDs on every span-bound log line.
- Prometheus: running prom/prometheus:v3.11.3-distroless and scraping the spanmetrics connector's RED metrics endpoint on :8889. Service Performance Monitoring (the "Monitor" tab in Jaeger's UI) is populated for agent latency / rate / error percentiles. Note that the 10 agents themselves are not scraped directly today: RED metrics come from the spanmetrics-derived stream, not from per-agent /metrics endpoints. Adding per-agent scraping is sibling work.
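The W3C trace context mentioned in the Jaeger bullet travels as a traceparent value of the form version-traceid-parentid-flags. The sketch below only illustrates that wire format in plain Python. How the stack actually places the value into A2A metadata is not shown in this page, and real code would normally let OpenTelemetry's propagator do the injection and extraction; the helper names are hypothetical.

```python
import re

# Version 00 layout from the W3C Trace Context spec:
# 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format integer IDs as a version-00 traceparent value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def parse_traceparent(value):
    """Return the parsed fields, or None if the value is not valid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

The trace_id field returned here is the same 32-character hex string that appears in Jaeger URLs and in span-bound log lines, which is what makes the cross-pane lookup possible.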
Span naming convention
Every span follows a <source>.<operation> shape so Jaeger search by service is meaningful:
| Span | Source |
|---|---|
| hephaestus.route | Pipeline determination |
| hephaestus.pipeline.step.{agent} | Each specialist call |
| hephaestus.pipeline.fix_loop | Kallos-Techne iterations |
| a2a.connect.{agent} | Agent card fetch |
| a2a.send.{agent} | Message send |
| {agent}.execute | Agent-specific execution |
| {agent}.generate | LLM call |
| mcp.context7.query | Context7 tool call |
| mcp.memory.{op} | Memory MCP operation |
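One way to keep names consistent with the table is to build them through small helpers instead of writing string literals at each call site. The helpers below are hypothetical and exist only to illustrate the <source>.<operation> shape; they are not taken from the codebase.

```python
def split_span_name(name):
    """Split a span name into (source, operation) at the first dot."""
    source, sep, operation = name.partition(".")
    if not sep or not source or not operation:
        raise ValueError(f"span name must look like <source>.<operation>: {name!r}")
    return source, operation


def pipeline_step_span(agent):
    """Name for one specialist call inside the Hephaestus pipeline."""
    return f"hephaestus.pipeline.step.{agent}"


def a2a_send_span(agent):
    """Name for an A2A message send to the given agent."""
    return f"a2a.send.{agent}"


def generate_span(agent):
    """Name for an agent's LLM call."""
    return f"{agent}.generate"
```

Because the source is always the first dot-separated segment, split_span_name(pipeline_step_span("kallos")) yields "hephaestus" as the source and "pipeline.step.kallos" as the operation.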
References
- OpenTelemetry Python — the SDK every agent uses
- Jaeger v2 deployment guide — the OTel-Collector-shape config the stack runs
- Prometheus 3.0 migration — context for the v2 → v3 jump completed under M16
- Dozzle docs — keyboard shortcuts, filters, swarm mode
- W3C Trace Context: the propagation format A2A metadata carries