
feat: OpenTelemetry OTLP trace export from MCP Gateway #3177

@lpcox

Description

Upstream request

gh-aw#24373 requests OTLP trace export from the agent runtime. The MCP Gateway is a natural instrumentation point — it sits at the chokepoint between the agent and every backend MCP server, observing every tool call, guard evaluation, and DIFC decision.

What this enables

Structured, per-tool-call span data exported via OTLP to any compatible backend (Jaeger, Datadog, Honeycomb, Grafana Tempo, Langfuse, etc.). This would give operators:

  • Per-call latency profiling — which tool/backend is slowest
  • Distributed tracing — correlate gateway spans with downstream service spans
  • Guard decision visibility — see DIFC allow/deny/filter decisions as span events
  • Error attribution — pinpoint failures to specific backend calls, not just aggregate logs

Proposed trace hierarchy

Trace: gateway-session-{sessionID}
  └─ Span: gateway.request (HTTP handler)
       ├─ Span: gateway.tool_call (get_file_contents)
       │    ├─ Span: gateway.guard.label_resource
       │    ├─ Span: gateway.guard.evaluate
       │    ├─ Span: gateway.backend.execute ← MCP JSON-RPC to backend
       │    ├─ Span: gateway.guard.label_response
       │    └─ Span: gateway.guard.filter_collection
       ├─ Span: gateway.tool_call (search_code)
       │    └─ ...
       └─ Span: gateway.tool_call (create_pull_request)
            └─ ...

Integration analysis

Why the gateway architecture makes this easy

  1. Context already flows end-to-end. context.Context is threaded from the HTTP handler → callBackendTool() → executeBackendToolCall() → conn.SendRequestWithServerID(). OTEL span context attaches to the existing context chain with zero plumbing changes.

  2. callBackendTool() has explicit phases. The function already has labeled phases (Phase 0–6: extract labels → label resource → evaluate → execute → label response → filter → accumulate). Each phase is a natural span boundary.

  3. JSONL logger already captures RPC flow. JSONLRPCMessage has timestamp, direction (IN/OUT), method, serverID, and payload. Adding trace_id and span_id fields makes JSONL entries correlatable with OTEL traces — no new log format needed.

  4. Middleware insertion point is clear. wrapWithMiddleware() chains SDK logging → shutdown check → auth. A tracing middleware slots in between SDK logging and shutdown check.

  5. No existing OTEL dependency. go.mod has no opentelemetry imports — clean slate, no version conflicts.

  6. Session/request IDs already exist. SessionID (from Authorization header) and X-Request-ID are already extracted and logged. These become trace context attributes.

Performance overhead: negligible

| Path | Overhead | Rationale |
| --- | --- | --- |
| HTTP middleware span | < 0.5ms | One span create/close per request |
| Tool call span | < 0.1% of call time | Backend network I/O dominates (100ms+) |
| Guard phase sub-spans | < 1ms total | CPU-bound WASM/evaluation already takes ~ms |
| Context value storage | Negligible | Already 6+ context.WithValue calls per request |

The noop tracer (when OTLP is not configured) has zero overhead — the SDK's no-op implementation returns immediately.

Costs

  • New dependency: go.opentelemetry.io/otel + OTLP exporter (~5 packages). Well-maintained, stable API (v1.x).
  • Config surface: New [gateway.tracing] section (endpoint, sample rate, service name). ~50 lines of config code.
  • Code changes: ~200-300 lines across 4-5 files. No refactoring of existing code — purely additive.
  • Binary size: OTLP exporter adds ~2-3MB to the binary.
  • Maintenance: OTEL Go SDK is stable (v1.39+). Breaking changes are rare.

Benefits

  • Debugging: Replace "grep through logs" with visual trace timelines for tool call sequences
  • Performance monitoring: Identify slow backends, guard bottlenecks, connection pool issues
  • Compliance auditing: Guard decisions (allow/deny/filter) as structured span events with full context
  • Ecosystem compatibility: OTLP is the standard — works with every major observability platform
  • Zero overhead when off: Noop tracer means no cost for users who don't configure an endpoint

Potential solution

Phase 1: Foundation (~200 lines)

New config (internal/config/config_tracing.go):

type TracingConfig struct {
    Endpoint    string  `toml:"endpoint" json:"endpoint,omitempty"`
    ServiceName string  `toml:"service_name" json:"serviceName,omitempty"`
    SampleRate  float64 `toml:"sample_rate" json:"sampleRate,omitempty"` // 0.0-1.0, default 1.0
}

Config via TOML or standard OTEL env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME):

[gateway.tracing]
endpoint = "http://localhost:4318"
sample_rate = 1.0

Tracer initialization (in cmd/root.go, before server startup):

tp := initTracerProvider(cfg.Gateway.Tracing) // returns noop if no endpoint
defer tp.Shutdown(ctx)

Phase 2: HTTP middleware (~50 lines)

New WithOTELTracing() in internal/server/http_helpers.go:

func WithOTELTracing(next http.Handler, tag string) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "gateway.request",
            trace.WithAttributes(
                attribute.String("gateway.handler", tag),
                attribute.String("session.id", SessionIDFromContext(r.Context())),
                attribute.String("http.path", r.URL.Path),
            ))
        defer span.End()
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Insert in wrapWithMiddleware() between SDK logging and shutdown check.

Phase 3: Tool call spans (~100 lines)

In callBackendTool(), wrap each phase:

ctx, span := tracer.Start(ctx, "gateway.tool_call",
    trace.WithAttributes(
        attribute.String("tool.name", toolName),
        attribute.String("server.id", serverID),
    ))
defer span.End()

// Phase 3: backend execution
ctx, execSpan := tracer.Start(ctx, "gateway.backend.execute")
backendResult, err := executeBackendToolCall(ctx, ...)
execSpan.End()

Phase 4: JSONL correlation (~20 lines)

Add optional trace_id and span_id to JSONLRPCMessage:

type JSONLRPCMessage struct {
    // ... existing fields ...
    TraceID string `json:"trace_id,omitempty"`
    SpanID  string `json:"span_id,omitempty"`
}

Extract from context in LogRPCMessageJSONL() — correlates existing JSONL logs with OTEL traces.

Open questions

  1. AWF firewall allowlisting: The OTLP endpoint needs to be reachable from within the AWF sandbox. Should the gateway auto-add the endpoint to the firewall allowlist, or require explicit configuration?
  2. Payload inclusion: Should tool call arguments/results be included as span attributes? Risk: large payloads. Mitigation: truncate to N bytes, controlled by config flag.
  3. Scope: Should Phase 1 include guard sub-spans, or just HTTP + tool call spans? Guard sub-spans add visibility but more instrumentation code.
