feat: OpenTelemetry OTLP trace export from MCP Gateway #3177
Description
Upstream request
gh-aw#24373 requests OTLP trace export from the agent runtime. The MCP Gateway is a natural instrumentation point — it sits at the chokepoint between the agent and every backend MCP server, observing every tool call, guard evaluation, and DIFC decision.
What this enables
Structured, per-tool-call span data exported via OTLP to any compatible backend (Jaeger, Datadog, Honeycomb, Grafana Tempo, Langfuse, etc.). This would give operators:
- Per-call latency profiling — which tool/backend is slowest
- Distributed tracing — correlate gateway spans with downstream service spans
- Guard decision visibility — see DIFC allow/deny/filter decisions as span events
- Error attribution — pinpoint failures to specific backend calls, not just aggregate logs
Proposed trace hierarchy
```
Trace: gateway-session-{sessionID}
└─ Span: gateway.request (HTTP handler)
   ├─ Span: gateway.tool_call (get_file_contents)
   │  ├─ Span: gateway.guard.label_resource
   │  ├─ Span: gateway.guard.evaluate
   │  ├─ Span: gateway.backend.execute   ← MCP JSON-RPC to backend
   │  ├─ Span: gateway.guard.label_response
   │  └─ Span: gateway.guard.filter_collection
   ├─ Span: gateway.tool_call (search_code)
   │  └─ ...
   └─ Span: gateway.tool_call (create_pull_request)
      └─ ...
```
Integration analysis
Why the gateway architecture makes this easy
- Context already flows end-to-end. `context.Context` is threaded from HTTP handler → `callBackendTool()` → `executeBackendToolCall()` → `conn.SendRequestWithServerID()`. OTEL span context attaches to the existing context chain with zero plumbing changes.
- `callBackendTool()` has explicit phases. The function already has labeled phases (Phase 0–6: extract labels → label resource → evaluate → execute → label response → filter → accumulate). Each phase is a natural span boundary.
- JSONL logger already captures RPC flow. `JSONLRPCMessage` has timestamp, direction (IN/OUT), method, serverID, and payload. Adding `trace_id` and `span_id` fields makes JSONL entries correlatable with OTEL traces — no new log format needed.
- Middleware insertion point is clear. `wrapWithMiddleware()` chains SDK logging → shutdown check → auth. A tracing middleware slots in between SDK logging and shutdown check.
- No existing OTEL dependency. `go.mod` has no `opentelemetry` imports — clean slate, no version conflicts.
- Session/request IDs already exist. SessionID (from Authorization header) and X-Request-ID are already extracted and logged. These become trace context attributes.
Performance overhead: negligible
| Path | Overhead | Rationale |
|---|---|---|
| HTTP middleware span | < 0.5ms | One span create/close per request |
| Tool call span | < 0.1% of call time | Backend network I/O dominates (100ms+) |
| Guard phase sub-spans | < 1ms total | CPU-bound WASM/evaluation already ~ms |
| Context value storage | Negligible | Already 6+ context.WithValue calls per request |
The noop tracer (when OTLP is not configured) has zero overhead — the SDK's no-op implementation returns immediately.
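The zero-overhead claim rests on the no-op pattern: instrumentation calls stay in the hot path, but the disabled implementation does nothing. A minimal stdlib sketch of the idea (interface and type names are illustrative, not the OTEL SDK's):

```go
package main

import "fmt"

// Tracer is a cut-down stand-in for an OTEL-style tracer interface.
type Tracer interface {
	StartSpan(name string) func() // returns an End func
}

// noopTracer does nothing; calls compile to near-zero work, so leaving
// instrumentation in place costs effectively nothing when disabled.
type noopTracer struct{}

func (noopTracer) StartSpan(string) func() { return func() {} }

// realTracer stands in for an OTLP-exporting tracer.
type realTracer struct{}

func (realTracer) StartSpan(name string) func() {
	fmt.Println("start", name)
	return func() { fmt.Println("end", name) }
}

// newTracer picks the implementation from configuration, mirroring
// "noop when OTLP is not configured".
func newTracer(endpoint string) Tracer {
	if endpoint == "" {
		return noopTracer{}
	}
	return realTracer{}
}

func main() {
	t := newTracer("") // no endpoint configured
	end := t.StartSpan("gateway.request")
	end() // no output: the noop path is free
}
```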
Costs
- New dependency: `go.opentelemetry.io/otel` + OTLP exporter (~5 packages). Well-maintained, stable API (v1.x).
- Config surface: New `[gateway.tracing]` section (endpoint, sample rate, service name). ~50 lines of config code.
- Code changes: ~200–300 lines across 4–5 files. No refactoring of existing code — purely additive.
- Binary size: OTLP exporter adds ~2–3MB to the binary.
- Maintenance: OTEL Go SDK is stable (v1.39+). Breaking changes are rare.
Benefits
- Debugging: Replace "grep through logs" with visual trace timelines for tool call sequences
- Performance monitoring: Identify slow backends, guard bottlenecks, connection pool issues
- Compliance auditing: Guard decisions (allow/deny/filter) as structured span events with full context
- Ecosystem compatibility: OTLP is the standard — works with every major observability platform
- Zero overhead when off: Noop tracer means no cost for users who don't configure an endpoint
Potential solution
Phase 1: Foundation (~200 lines)
New config (`internal/config/config_tracing.go`):

```go
type TracingConfig struct {
	Endpoint    string  `toml:"endpoint" json:"endpoint,omitempty"`
	ServiceName string  `toml:"service_name" json:"serviceName,omitempty"`
	SampleRate  float64 `toml:"sample_rate" json:"sampleRate,omitempty"` // 0.0-1.0, default 1.0
}
```

Config via TOML or standard OTEL env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`):

```toml
[gateway.tracing]
endpoint = "http://localhost:4318"
sample_rate = 1.0
```

Tracer initialization (in `cmd/root.go`, before server startup):

```go
tp := initTracerProvider(cfg.Gateway.Tracing) // returns noop if no endpoint
defer tp.Shutdown(ctx)
```

Phase 2: HTTP middleware (~50 lines)
New `WithOTELTracing()` in `internal/server/http_helpers.go`:

```go
func WithOTELTracing(next http.Handler, tag string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), "gateway.request",
			trace.WithAttributes(
				attribute.String("session.id", SessionIDFromContext(r.Context())),
				attribute.String("http.path", r.URL.Path),
			))
		defer span.End()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```

Insert in `wrapWithMiddleware()` between SDK logging and shutdown check.
Phase 3: Tool call spans (~100 lines)
In `callBackendTool()`, wrap each phase:

```go
ctx, span := tracer.Start(ctx, "gateway.tool_call",
	trace.WithAttributes(
		attribute.String("tool.name", toolName),
		attribute.String("server.id", serverID),
	))
defer span.End()

// Phase 3: backend execution
ctx, execSpan := tracer.Start(ctx, "gateway.backend.execute")
backendResult, err := executeBackendToolCall(ctx, ...)
execSpan.End()
```

Phase 4: JSONL correlation (~20 lines)
Add optional `trace_id` and `span_id` to `JSONLRPCMessage`:

```go
type JSONLRPCMessage struct {
	// ... existing fields ...
	TraceID string `json:"trace_id,omitempty"`
	SpanID  string `json:"span_id,omitempty"`
}
```

Extract from context in `LogRPCMessageJSONL()` — correlates existing JSONL logs with OTEL traces.
Open questions
- AWF firewall allowlisting: The OTLP endpoint needs to be reachable from within the AWF sandbox. Should the gateway auto-add the endpoint to the firewall allowlist, or require explicit configuration?
- Payload inclusion: Should tool call arguments/results be included as span attributes? Risk: large payloads. Mitigation: truncate to N bytes, controlled by config flag.
- Scope: Should Phase 1 include guard sub-spans, or just HTTP + tool call spans? Guard sub-spans add visibility but more instrumentation code.
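For the payload-inclusion question above, the proposed mitigation could be as small as a byte-capped truncation helper applied before setting the span attribute. A sketch (the function name and cap are made up; the rune-boundary handling is the only subtle part):

```go
package main

import "fmt"

// truncateAttr caps an attribute value at max bytes without splitting a
// UTF-8 rune, appending a marker so truncation is visible in traces.
func truncateAttr(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Back up to a rune boundary (UTF-8 continuation bytes are 10xxxxxx).
	for max > 0 && s[max]&0xC0 == 0x80 {
		max--
	}
	return s[:max] + "...(truncated)"
}

func main() {
	payload := `{"file":"main.go","contents":"package main..."}`
	fmt.Println(truncateAttr(payload, 16))
	fmt.Println(truncateAttr("short", 16)) // prints "short"
}
```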