feat: OpenTelemetry OTLP trace export from MCP Gateway and proxy #3178
Conversation
Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/b08422da-4c60-434b-a7a3-8d5854b4896a Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…ecSpan.End() Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/b08422da-4c60-434b-a7a3-8d5854b4896a Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
@copilot take a look at this comment and identify any opportunities/requirements for integration outside of the gateway github/gh-aw#24373 (comment)
Pull request overview
Adds OpenTelemetry tracing to MCP Gateway with OTLP/HTTP export so operators can visualize end-to-end request and tool-call latency, guard decisions, and backend failures in standard OTEL backends.
Changes:
- Introduces tracing configuration (`[gateway.tracing]`) and CLI flags to control OTLP endpoint, service name, and sampling.
- Adds an OTLP tracer provider initializer that installs a noop provider when tracing is disabled.
- Instruments HTTP requests and backend tool calls with spans and basic attributes.
Summary per file:
| File | Description |
|---|---|
| internal/tracing/provider.go | Initializes global OTEL tracer provider (noop vs OTLP/HTTP), resolves config/env, sets sampler. |
| internal/tracing/provider_test.go | Adds basic tests around provider initialization and tracer access. |
| internal/server/http_helpers.go | Adds WithOTELTracing middleware and makes it outermost in the middleware chain. |
| internal/server/unified.go | Adds spans for tool-call lifecycle and backend execution latency/errors. |
| internal/config/config_tracing.go | Defines TracingConfig + defaults registration. |
| internal/config/config_core.go | Adds Tracing *TracingConfig to GatewayConfig. |
| internal/cmd/flags_tracing.go | Adds --otlp-* flags and env-based defaults. |
| internal/cmd/root.go | Merges tracing flags into config, initializes provider, and flushes on shutdown. |
| go.mod / go.sum | Adds OpenTelemetry dependencies and bumps some indirect deps. |
Copilot's findings
- Files reviewed: 9/10 changed files
- Comments generated: 7
```go
RegisterDefaults(func(cfg *Config) {
	if cfg.Gateway != nil && cfg.Gateway.Tracing != nil {
		if cfg.Gateway.Tracing.ServiceName == "" {
			cfg.Gateway.Tracing.ServiceName = DefaultTracingServiceName
		}
		if cfg.Gateway.Tracing.SampleRate == 0 {
			cfg.Gateway.Tracing.SampleRate = DefaultTracingSampleRate
		}
	}
```
The defaults setter forces SampleRate to 1.0 when the decoded value is 0, which prevents users from configuring sample_rate = 0.0 in TOML (documented as valid for “no sampling”). To support an explicit 0.0 value, consider representing SampleRate as a pointer (so “unset” can be distinguished from 0.0) or using a separate “set” flag during decode/defaulting.
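A minimal stdlib sketch of the pointer approach; the identifiers here (`resolveSampleRate`, `defaultSampleRate`) are illustrative, not the PR's actual names:

```go
package main

import "fmt"

// TracingConfig uses *float64 so that an explicit 0.0 decoded from TOML is
// distinguishable from an unset field (nil).
type TracingConfig struct {
	SampleRate *float64
}

const defaultSampleRate = 1.0

// resolveSampleRate applies the default only when the field was never set.
func resolveSampleRate(cfg *TracingConfig) float64 {
	if cfg == nil || cfg.SampleRate == nil {
		return defaultSampleRate // unset: sample everything
	}
	return *cfg.SampleRate // explicit value survives, including 0.0
}

func main() {
	zero := 0.0
	fmt.Println(resolveSampleRate(nil))                               // prints 1 (default applies)
	fmt.Println(resolveSampleRate(&TracingConfig{SampleRate: &zero})) // prints 0 (explicit "never sample")
}
```

The same effect can be had with a "set" flag captured during decode, but the pointer keeps the config struct self-describing.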
internal/cmd/flags_tracing.go
Outdated
```go
		"OTLP HTTP endpoint for trace export (e.g. http://localhost:4318). Overrides OTEL_EXPORTER_OTLP_ENDPOINT env var. Tracing is disabled when empty.")
	cmd.Flags().StringVar(&otlpServiceName, "otlp-service-name", getDefaultOTLPServiceName(),
		"Service name reported in traces. Overrides OTEL_SERVICE_NAME env var.")
	cmd.Flags().Float64Var(&otlpSampleRate, "otlp-sample-rate", config.DefaultTracingSampleRate,
		"Fraction of traces to sample and export (0.0–1.0). Default 1.0 samples everything.")
	})
}

// getDefaultOTLPEndpoint returns the OTLP endpoint, checking OTEL_EXPORTER_OTLP_ENDPOINT
// environment variable first, then falling back to empty (disabled).
func getDefaultOTLPEndpoint() string {
	return envutil.GetEnvString("OTEL_EXPORTER_OTLP_ENDPOINT", "")
}

// getDefaultOTLPServiceName returns the OTLP service name, checking OTEL_SERVICE_NAME
// environment variable first, then falling back to the default.
```
The flag help text says --otlp-endpoint/--otlp-service-name override the corresponding OTEL env vars, but tracing.resolveEndpoint/resolveServiceName currently prefer OTEL_EXPORTER_OTLP_ENDPOINT/OTEL_SERVICE_NAME over config/flags. This means operators cannot override (or disable) an env-configured endpoint via CLI, which is surprising and contradicts the help. Consider changing precedence to CLI > env > config (e.g., resolve envs in cmd layer and pass resolved values to InitProvider), or update the help text to match actual precedence.
Suggested change:

```diff
-		"OTLP HTTP endpoint for trace export (e.g. http://localhost:4318). Overrides OTEL_EXPORTER_OTLP_ENDPOINT env var. Tracing is disabled when empty.")
+		"OTLP HTTP endpoint for trace export (e.g. http://localhost:4318). Defaults from OTEL_EXPORTER_OTLP_ENDPOINT when set. Tracing is disabled when empty.")
 	cmd.Flags().StringVar(&otlpServiceName, "otlp-service-name", getDefaultOTLPServiceName(),
-		"Service name reported in traces. Overrides OTEL_SERVICE_NAME env var.")
+		"Service name reported in traces. Defaults from OTEL_SERVICE_NAME when set.")
 	cmd.Flags().Float64Var(&otlpSampleRate, "otlp-sample-rate", config.DefaultTracingSampleRate,
 		"Fraction of traces to sample and export (0.0–1.0). Default 1.0 samples everything.")
 	})
 }
-// getDefaultOTLPEndpoint returns the OTLP endpoint, checking OTEL_EXPORTER_OTLP_ENDPOINT
-// environment variable first, then falling back to empty (disabled).
+// getDefaultOTLPEndpoint returns the default value for the OTLP endpoint flag,
+// using OTEL_EXPORTER_OTLP_ENDPOINT when set and otherwise falling back to empty
+// (disabled).
 func getDefaultOTLPEndpoint() string {
 	return envutil.GetEnvString("OTEL_EXPORTER_OTLP_ENDPOINT", "")
 }
-// getDefaultOTLPServiceName returns the OTLP service name, checking OTEL_SERVICE_NAME
-// environment variable first, then falling back to the default.
+// getDefaultOTLPServiceName returns the default value for the OTLP service name
+// flag, using OTEL_SERVICE_NAME when set and otherwise falling back to the
+// configured default.
```
internal/server/http_helpers.go
Outdated
```go
		// Add session ID to span after context is available
		sessionID := SessionIDFromContext(ctx)
		span.SetAttributes(attribute.String("session.id", auth.TruncateSessionID(sessionID)))

		next.ServeHTTP(w, r.WithContext(ctx))
```
WithOTELTracing sets the session.id attribute before calling next.ServeHTTP. In this server, the session ID is injected into the request context inside the StreamableHTTP callback (setupSessionCallback), which happens during next.ServeHTTP, so this will typically record session.id=default rather than the real session. Consider capturing the request pointer you pass to next (e.g., req := r.WithContext(ctx)), calling next.ServeHTTP(w, req), then reading SessionIDFromContext(req.Context()) and setting the attribute before ending the span.
Suggested change:

```diff
-		// Add session ID to span after context is available
-		sessionID := SessionIDFromContext(ctx)
-		span.SetAttributes(attribute.String("session.id", auth.TruncateSessionID(sessionID)))
-		next.ServeHTTP(w, r.WithContext(ctx))
+		req := r.WithContext(ctx)
+		next.ServeHTTP(w, req)
+		// Add session ID to span after request handling, once the session has been attached.
+		sessionID := SessionIDFromContext(req.Context())
+		span.SetAttributes(attribute.String("session.id", auth.TruncateSessionID(sessionID)))
```
internal/cmd/root.go
Outdated
```go
	if tracingCfg != nil && tracingCfg.Endpoint != "" {
		log.Printf("OpenTelemetry tracing enabled: endpoint=%s, service=%s, sampleRate=%.2f",
			tracingCfg.Endpoint, tracingCfg.ServiceName, tracingCfg.SampleRate)
		logger.LogInfoMd("startup", "OpenTelemetry tracing enabled: endpoint=%s, service=%s",
			tracingCfg.Endpoint, tracingCfg.ServiceName)
```
The tracing enable/disable log message is based on tracingCfg.Endpoint, but InitProvider resolves the endpoint from OTEL_EXPORTER_OTLP_ENDPOINT first. In particular, if OTEL_EXPORTER_OTLP_ENDPOINT is set and a user runs with --otlp-endpoint= (explicitly disabling via CLI), the log will claim tracing is disabled while InitProvider will still enable tracing via the env var. Consider aligning the precedence (CLI > env > config) or logging the resolved endpoint actually used by InitProvider.
Suggested change:

```diff
-	if tracingCfg != nil && tracingCfg.Endpoint != "" {
-		log.Printf("OpenTelemetry tracing enabled: endpoint=%s, service=%s, sampleRate=%.2f",
-			tracingCfg.Endpoint, tracingCfg.ServiceName, tracingCfg.SampleRate)
-		logger.LogInfoMd("startup", "OpenTelemetry tracing enabled: endpoint=%s, service=%s",
-			tracingCfg.Endpoint, tracingCfg.ServiceName)
+	resolvedTracingEndpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
+	if resolvedTracingEndpoint == "" && tracingCfg != nil {
+		resolvedTracingEndpoint = tracingCfg.Endpoint
+	}
+	if resolvedTracingEndpoint != "" {
+		log.Printf("OpenTelemetry tracing enabled: endpoint=%s, service=%s, sampleRate=%.2f",
+			resolvedTracingEndpoint, tracingCfg.ServiceName, tracingCfg.SampleRate)
+		logger.LogInfoMd("startup", "OpenTelemetry tracing enabled: endpoint=%s, service=%s",
+			resolvedTracingEndpoint, tracingCfg.ServiceName)
```
```go
func TestInitProvider_SampleRateZero_UsesNeverSampler(t *testing.T) {
	ctx := context.Background()

	cfg := &config.TracingConfig{
		Endpoint:    "http://localhost:14318",
		ServiceName: "test-service",
		SampleRate:  0.0, // never sample
	}

	provider, err := tracing.InitProvider(ctx, cfg)
	require.NoError(t, err)
	require.NotNil(t, provider)

	shutdownCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	_ = provider.Shutdown(shutdownCtx)
}

func TestInitProvider_SampleRatePartial_UsesRatioSampler(t *testing.T) {
	ctx := context.Background()

	cfg := &config.TracingConfig{
		Endpoint:    "http://localhost:14318",
		ServiceName: "test-service",
		SampleRate:  0.5, // 50% sampling
	}

	provider, err := tracing.InitProvider(ctx, cfg)
	require.NoError(t, err)
	require.NotNil(t, provider)

	shutdownCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	_ = provider.Shutdown(shutdownCtx)
}
```
The sample-rate tests are named as if they validate sampler selection, but they don’t assert anything about whether spans are sampled/recording. As written, they won’t catch regressions (e.g., SampleRate: 0.0 currently gets treated as default due to resolveSampleRate). Consider asserting span.SpanContext().IsSampled() / span.IsRecording() for 0.0 and 1.0 cases, and making the ratio test deterministic (e.g., by validating bounds rather than probabilistic sampling).
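The "validate bounds deterministically" idea can be illustrated with the trace-ID-ratio decision rule, modelled loosely on the OTEL Go SDK's `TraceIDRatioBased` sampler and simplified here: sample when the low 8 bytes of the trace ID, shifted into 63 bits, fall below `rate × 2^63`. A stdlib sketch, not the SDK itself:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample mimics a trace-ID-ratio decision: deterministic per trace ID,
// so 0.0 never samples and 1.0 always samples, with no randomness in tests.
func shouldSample(traceID [16]byte, rate float64) bool {
	if rate <= 0 {
		return false // NeverSample behaviour
	}
	if rate >= 1 {
		return true // AlwaysSample behaviour
	}
	upperBound := uint64(rate * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	id := [16]byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
	fmt.Println(shouldSample(id, 0.0)) // prints false
	fmt.Println(shouldSample(id, 1.0)) // prints true
}
```

Because the decision is a pure function of the trace ID, a test can pick IDs at the extremes (all-zero, all-0xff) and assert exact outcomes for a partial rate instead of sampling probabilistically.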
internal/tracing/provider.go
Outdated
```go
//
// Once initialized, obtain a tracer with:
//
//	tracer := otel.Tracer("mcp-gateway")
```
The package comment’s example uses otel.Tracer("mcp-gateway"), but this package’s instrumentation scope is instrumentationName = "github.com/github/gh-aw-mcpg" and Tracer() returns otel.Tracer(instrumentationName). Consider updating the example to match the actual instrumentation name (or changing instrumentationName if the intended scope name is mcp-gateway) to avoid inconsistent tracer names in emitted telemetry.
Suggested change:

```diff
-//	tracer := otel.Tracer("mcp-gateway")
+//	tracer := otel.Tracer("github.com/github/gh-aw-mcpg")
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…xy handler Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/ed1acdd1-76fa-4f66-b810-c5c896c45c22 Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
The main integration point outside the gateway is the What was added:
Other opportunities identified (not yet implemented):
@copilot does the tracing cover backend calls that the guard makes? For example, when an MCP request enters the gateway and hits a guard, do backend guard calls carry the same marker as the original request?
Yes — guard backend calls carry the same trace context as the original request, in both modes:

- Gateway (
- Proxy (

One gap: guard enrichment calls (
@copilot where is the trace identifier created? is it passed into the gateway by the agent? or does the gateway generate the id for each request it receives?
The gateway generates a fresh trace ID for every request it receives — agents cannot inject one today. Here's what happens: both To enable end-to-end distributed tracing (agent → gateway → backend MCP server), two things would need to be added to
Without that, each gateway request is a self-contained trace. I can add W3C context propagation if you want agents to be able to stitch their own spans into the gateway's traces.
- Use *float64 for SampleRate so 0.0 (NeverSample) is distinguishable from unset; add GetSampleRate() accessor with default fallback
- Remove env var override in resolveEndpoint/resolveServiceName since CLI flags already use env vars as defaults (correct precedence: CLI flag > env var > config file)
- Fix flag help text: "Defaults from" instead of "Overrides" for env var relationship
- Move session.id span attribute after next.ServeHTTP so the session is available from context
- Fix root.go tracing log to use resolved values instead of raw config
- Add IsSampled()/IsRecording() assertions to sampler tests
- Add SampleRateOne and SampleRateNil test cases
- Fix package doc example to use correct instrumentation name

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@copilot add w3c context propagation to the gateway but fall back to a gateway generated trace id if there isn't one present in the incoming mcp request
Register a global W3C TextMapPropagator (TraceContext + Baggage) in InitProvider for both noop and SDK paths. Extract incoming traceparent/tracestate headers in WithOTELTracing (gateway) and WrapHTTPHandler (proxy) before starting each span, so an agent-originated trace is continued rather than fragmented. When no headers are present, OTEL generates a fresh root span and trace ID (existing behavior). Add three tests: propagator registration, header-driven continuation, and no-header fallback to a fresh trace ID. Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/0209d486-c1c2-4927-bfcf-dc7822ec3990 Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
Done in commit. The W3C TraceContext + Baggage propagator is now registered globally in `InitProvider`, and both the gateway middleware and the proxy handler extract incoming headers before starting each span:

```go
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
ctx, span := t.Start(ctx, "gateway.request", ...)
```

If an agent sends a `traceparent` header its trace is continued; otherwise a fresh root span and trace ID are generated.
The MCP Gateway has no structured observability into per-tool-call latency, guard decisions, or backend failures. This adds OpenTelemetry OTLP trace export across both the MCP Gateway server mode and the `awmg proxy` subcommand, giving operators visual trace timelines across any OTEL-compatible backend (Jaeger, Honeycomb, Grafana Tempo, Datadog, etc.).

**Changes**

Config (`internal/config/config_tracing.go`, `config_core.go`)
- `TracingConfig` struct added to `GatewayConfig` under `[gateway.tracing]`
- Fields: `endpoint`, `service_name`, `sample_rate` (0.0–1.0, default 1.0)

Tracer provider (`internal/tracing/provider.go`, `internal/tracing/http.go`)
- `InitProvider()` returns a noop provider (zero overhead) when no endpoint is configured, or a real OTLP/HTTP SDK provider otherwise
- Env defaults: `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`
- Sampler: `AlwaysSample`/`NeverSample`/`TraceIDRatioBased` based on `sample_rate`
- W3C `TraceContext` + `Baggage` propagator registered globally (both noop and SDK paths) so that incoming `traceparent`/`tracestate` headers are always honoured
- `WrapHTTPHandler` helper usable by both the gateway and proxy packages

HTTP middleware (`internal/server/http_helpers.go`)
- `WithOTELTracing()` middleware adds a `gateway.request` span per request with `http.method`, `http.path`, and `session.id` attributes
- Installed as the outermost layer in `wrapWithMiddleware()`
- Extracts incoming `traceparent`/`tracestate` headers to continue an agent-originated trace; falls back to a fresh gateway-generated trace ID when no headers are present

Tool call instrumentation (`internal/server/unified.go`)
- `gateway.tool_call` span covers the full DIFC enforcement lifecycle (phases 0–6) with `tool.name` + `server.id` attributes
- `gateway.backend.execute` child span isolates backend MCP JSON-RPC latency with error recording on failure

CLI flags (`internal/cmd/flags_tracing.go`, `root.go`)
- `--otlp-endpoint`, `--otlp-service-name`, `--otlp-sample-rate` flags; override config file values

GitHub API proxy tracing (`internal/proxy/`, `internal/cmd/proxy.go`)
- `proxy.request` span wraps every proxied request via the shared `WrapHTTPHandler`
- Extracts incoming `traceparent`/`tracestate` headers to continue an agent-originated trace; falls back to a fresh gateway-generated trace ID when no headers are present
- `proxy.difc_pipeline` span covers all 6 DIFC enforcement phases (label_resource → evaluate → forward → label_response → filter → propagate)
- `proxy.backend.forward` child span isolates upstream GitHub API round-trip latency with HTTP status code and error recording
- The `awmg proxy` subcommand gains `--otlp-endpoint`, `--otlp-service-name`, `--otlp-sample-rate` flags and initializes the tracer provider with graceful shutdown

**Resulting trace structures**

Gateway mode:

Proxy mode (`awmg proxy`):

**Distributed tracing (agent → gateway)**

When an agent sends a `traceparent` header the gateway span is parented to the agent's span, producing a single end-to-end trace. Without that header, a self-contained root span with a fresh trace ID is generated automatically.
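From the agent side, joining the trace only requires attaching a well-formed header to the outgoing MCP request. A hedged stdlib sketch (the gateway URL is a placeholder, and a real agent would normally let its own OTEL SDK inject the header rather than build it by hand):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newTraceparent builds a W3C traceparent value for the agent's own root
// span: version 00, random 16-byte trace ID, random 8-byte span ID, sampled.
func newTraceparent() string {
	traceID := make([]byte, 16)
	spanID := make([]byte, 8)
	_, _ = rand.Read(traceID)
	_, _ = rand.Read(spanID)
	return fmt.Sprintf("00-%s-%s-01", hex.EncodeToString(traceID), hex.EncodeToString(spanID))
}

func main() {
	// Placeholder gateway address; any OTEL-instrumented gateway would do.
	req, _ := http.NewRequest("POST", "http://localhost:8080/mcp", nil)
	req.Header.Set("traceparent", newTraceparent())
	fmt.Println(len(req.Header.Get("traceparent"))) // prints 55 — 3+32+1+16+1+2 characters
}
```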