
feat: OpenTelemetry OTLP trace export from MCP Gateway #3177

@lpcox

Description

Upstream request

gh-aw#24373 requests OTLP trace export from the agent runtime. The MCP Gateway is a natural instrumentation point — it sits at the chokepoint between the agent and every backend MCP server, observing every tool call, guard evaluation, and DIFC decision.

What this enables

Structured, per-tool-call span data exported via OTLP to any compatible backend (Jaeger, Datadog, Honeycomb, Grafana Tempo, Langfuse, etc.). This would give operators:

  • Per-call latency profiling — which tool/backend is slowest
  • Distributed tracing — correlate gateway spans with downstream service spans
  • Guard decision visibility — see DIFC allow/deny/filter decisions as span events
  • Error attribution — pinpoint failures to specific backend calls, not just aggregate logs

Proposed trace hierarchy

Trace: gateway-session-{sessionID}
  └─ Span: gateway.request (HTTP handler)
       ├─ Span: gateway.tool_call (get_file_contents)
       │    ├─ Span: gateway.guard.label_resource
       │    ├─ Span: gateway.guard.evaluate
       │    ├─ Span: gateway.backend.execute ← MCP JSON-RPC to backend
       │    ├─ Span: gateway.guard.label_response
       │    └─ Span: gateway.guard.filter_collection
       ├─ Span: gateway.tool_call (search_code)
       │    └─ ...
       └─ Span: gateway.tool_call (create_pull_request)
            └─ ...

Integration analysis

Why the gateway architecture makes this easy

  1. Context already flows end-to-end. context.Context is threaded from the HTTP handler → callBackendTool() → executeBackendToolCall() → conn.SendRequestWithServerID(). OTEL span context attaches to the existing context chain with zero plumbing changes.

  2. callBackendTool() has explicit phases. The function already has labeled phases (Phase 0–6: extract labels → label resource → evaluate → execute → label response → filter → accumulate). Each phase is a natural span boundary.

  3. JSONL logger already captures RPC flow. JSONLRPCMessage has timestamp, direction (IN/OUT), method, serverID, and payload. Adding trace_id and span_id fields makes JSONL entries correlatable with OTEL traces — no new log format needed.

  4. Middleware insertion point is clear. wrapWithMiddleware() chains SDK logging → shutdown check → auth. A tracing middleware slots in between SDK logging and shutdown check.

  5. No existing OTEL dependency. go.mod has no opentelemetry imports — clean slate, no version conflicts.

  6. Session/request IDs already exist. SessionID (from Authorization header) and X-Request-ID are already extracted and logged. These become trace context attributes.

Performance overhead: negligible

| Path | Overhead | Rationale |
| --- | --- | --- |
| HTTP middleware span | < 0.5ms | One span create/close per request |
| Tool call span | < 0.1% of call time | Backend network I/O dominates (100ms+) |
| Guard phase sub-spans | < 1ms total | CPU-bound WASM/evaluation already takes ~ms |
| Context value storage | Negligible | Already 6+ context.WithValue calls per request |

The noop tracer (when OTLP is not configured) has zero overhead — the SDK's no-op implementation returns immediately.

Costs

  • New dependency: go.opentelemetry.io/otel + OTLP exporter (~5 packages). Well-maintained, stable API (v1.x).
  • Config surface: New [gateway.tracing] section (endpoint, sample rate, service name). ~50 lines of config code.
  • Code changes: ~200-300 lines across 4-5 files. No refactoring of existing code — purely additive.
  • Binary size: OTLP exporter adds ~2-3MB to the binary.
  • Maintenance: OTEL Go SDK is stable (v1.39+). Breaking changes are rare.

Benefits

  • Debugging: Replace "grep through logs" with visual trace timelines for tool call sequences
  • Performance monitoring: Identify slow backends, guard bottlenecks, connection pool issues
  • Compliance auditing: Guard decisions (allow/deny/filter) as structured span events with full context
  • Ecosystem compatibility: OTLP is the standard — works with every major observability platform
  • Zero overhead when off: Noop tracer means no cost for users who don't configure an endpoint

Potential solution

Phase 1: Foundation (~200 lines)

New config (internal/config/config_tracing.go):

type TracingConfig struct {
    Endpoint    string  `toml:"endpoint" json:"endpoint,omitempty"`
    ServiceName string  `toml:"service_name" json:"serviceName,omitempty"`
    SampleRate  float64 `toml:"sample_rate" json:"sampleRate,omitempty"` // 0.0-1.0, default 1.0
}

Config via TOML or standard OTEL env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME):

[gateway.tracing]
endpoint = "http://localhost:4318"
sample_rate = 1.0

Tracer initialization (in cmd/root.go, before server startup):

tp := initTracerProvider(cfg.Gateway.Tracing) // returns noop if no endpoint
defer tp.Shutdown(ctx)

Phase 2: HTTP middleware (~50 lines)

New WithOTELTracing() in internal/server/http_helpers.go:

func WithOTELTracing(next http.Handler, tag string) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "gateway.request",
            trace.WithAttributes(
                attribute.String("gateway.handler", tag),
                attribute.String("session.id", SessionIDFromContext(r.Context())),
                attribute.String("http.path", r.URL.Path),
            ))
        defer span.End()
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Insert in wrapWithMiddleware() between SDK logging and shutdown check.

Phase 3: Tool call spans (~100 lines)

In callBackendTool(), wrap each phase:

ctx, span := tracer.Start(ctx, "gateway.tool_call",
    trace.WithAttributes(
        attribute.String("tool.name", toolName),
        attribute.String("server.id", serverID),
    ))
defer span.End()

// Phase 3: backend execution
ctx, execSpan := tracer.Start(ctx, "gateway.backend.execute")
backendResult, err := executeBackendToolCall(ctx, ...)
execSpan.End()

Phase 4: JSONL correlation (~20 lines)

Add optional trace_id and span_id to JSONLRPCMessage:

type JSONLRPCMessage struct {
    // ... existing fields ...
    TraceID string `json:"trace_id,omitempty"`
    SpanID  string `json:"span_id,omitempty"`
}

Extract from context in LogRPCMessageJSONL() — correlates existing JSONL logs with OTEL traces.

Open questions

  1. AWF firewall allowlisting: The OTLP endpoint needs to be reachable from within the AWF sandbox. Should the gateway auto-add the endpoint to the firewall allowlist, or require explicit configuration?
  2. Payload inclusion: Should tool call arguments/results be included as span attributes? Risk: large payloads. Mitigation: truncate to N bytes, controlled by config flag.
  3. Scope: Should Phase 1 include guard sub-spans, or just HTTP + tool call spans? Guard sub-spans add visibility but more instrumentation code.
