Low-level GPU compute runtime for Apple Silicon, focused on memory, synchronization, and data movement using Metal.
- Minimal runtime: device/queue/pipeline + buffer utilities
- Kernel experiments: communication patterns + memory behavior
- Built-in benchmarks: bandwidth, latency, scaling
- Apple frameworks: Metal (compute), plus Foundation for CLI/utilities
- License: MIT (see LICENSE)
- Workflows: docs/workflows.md
This repo pairs experiments with tests, with the goal of reliable results on real hardware.
What’s solid now:
- Experiments are measurable: bandwidth/transfer/scan/matmul/latency + sweeps + --reps + p50/p95
- Correctness checks exist where it matters (scan, matmul for small sizes, plus gpucomm selftest)
- CI keeps it buildable and the CLI usable (swift build -c release + gpucomm --help)
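
For readers unfamiliar with p50/p95, here is a minimal sketch of how such percentiles can be derived from a handful of repetition timings. It uses the nearest-rank method, which is an assumption: the README does not say which method gpucomm uses. The `percentile` helper and the sample timings are hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of gpucomm.
# Reads one number per line on stdin and prints the nearest-rank
# P-th percentile, where P is an integer from 1 to 100.
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }                       # values arrive sorted ascending
        END {
            if (NR == 0) exit 1              # no samples: fail instead of guessing
            rank = int((p * NR + 99) / 100)  # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Made-up timings (ms) standing in for five repetitions (--reps 5).
printf '%s\n' 1.2 1.0 1.5 1.1 3.0 | percentile 50   # prints 1.2
printf '%s\n' 1.2 1.0 1.5 1.1 3.0 | percentile 95   # prints 3.0
```

With only five repetitions, p95 under nearest-rank is simply the slowest run, so a single outlier dominates it. More repetitions give a more meaningful tail.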
Caveats:
- GitHub Actions macOS runners aren’t Apple Silicon GPUs you control, so CI can’t validate “real” Metal timings—only build + basic CLI behavior
- “Reliability on hardware” still depends on running these benches on target machines and tracking regressions (record chip/macOS version + commit + outputs)
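
One way to make regression tracking routine is to store each run's output next to the metadata that identifies the machine and commit. The sketch below is an assumption about how that could be scripted, not something the repo ships: `record_run`, `meta.txt`, `output.jsonl`, and the `results/` directory are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper, not part of gpucomm.
# Usage: record_run OUT_DIR COMMAND [ARGS...]
# Writes OUT_DIR/meta.txt (environment) and OUT_DIR/output.jsonl (command stdout).
record_run() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1
    {
        printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        # Each probe falls back to "unknown" when the tool or key is unavailable.
        printf 'commit: %s\n' "$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)"
        printf 'macos: %s\n' "$(sw_vers -productVersion 2>/dev/null || echo unknown)"
        printf 'chip: %s\n' "$(sysctl -n machdep.cpu.brand_string 2>/dev/null || echo unknown)"
        printf 'command: %s\n' "$*"
    } > "$out_dir/meta.txt"
    # Run the benchmark itself; its exit status becomes the function's status.
    "$@" > "$out_dir/output.jsonl"
}

# Example (directory name is arbitrary):
# record_run results/run-001 ./.build/release/gpucomm selftest --format json
```

Keeping one directory per run makes it straightforward to diff two runs from the same chip and see whether a change in timings lines up with a change in commit or macOS version.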
When you post results (issues/comments), include:
- Hardware: chip + GPU (and whether on battery/low-power mode)
- OS/tooling: macOS version + Xcode/Swift version
- Repo state: commit SHA + command line + --format json/jsonl output
Quick metadata + sanity check:
git rev-parse HEAD
sw_vers
xcodebuild -version
system_profiler SPHardwareDataType | head -n 30
system_profiler SPDisplaysDataType | head -n 80
swift build -c release
./.build/release/gpucomm selftest --format json

Example benchmark report (JSONL, p50/p95 via --reps):
./.build/release/gpucomm bench transfer-sweep --sizes-kib 1,4,64 --iters 5000 --warmup 200 --reps 5 --direction both --mode both --format jsonl
./.build/release/gpucomm bench bandwidth-sweep --sizes-mib 1,4,16,64 --iters 200 --reps 5 --mode private --format jsonl

Primary tracking issue: #1

| Milestone | Roadmap Comment | Commit |
|---|---|---|
| Transfer benchmark | #1 (comment) | eefc5bb |
| Scan (1024) | #1 (comment) | 93d5627 |
| Scan (multi-block) | #1 (comment) | c7c9f9e |
| Matmul (naive+tiled) | #1 (comment) | d42298c |
| Matmul sweep | #1 (comment) | b70fde7 |
| Matmul tiled variants | #1 (comment) | 731c33f |
| Output formats (--format) | #1 (comment) | 7d30ac8 |
| Scan sweep | #1 (comment) | 9be8a34 |
| Bandwidth sweep | #1 (comment) | 37fdb5b |
| Transfer sweep | #1 (comment) | f000bc7 |
| Percentiles for sweeps | #1 (comment) | 9054a8b |
| --reps for single benches | #1 (comment) | c637e3b |
| macOS CI build | #1 (comment) | 466795e |
| CI help smoke | #1 (comment) | 25c6f84 |
| Latency benchmark | #1 (comment) | 8382a5c |
| Hardware selftest | #1 (comment) | c101034 |

swift build -c release
.build/release/gpucomm bench bandwidth --size-mib 64 --iters 200 --mode shared
.build/release/gpucomm bench bandwidth --size-mib 64 --iters 200 --mode private
.build/release/gpucomm bench bandwidth-sweep --sizes-mib 1,4,16,64 --iters 200 --mode private --format jsonl
.build/release/gpucomm bench scan --n 1024 --iters 200 --warmup 20
.build/release/gpucomm bench scan --n 65536 --iters 50 --warmup 10
.build/release/gpucomm bench scan-sweep --ns 1024,4096,65536,1048576 --iters 50 --warmup 10 --format jsonl
.build/release/gpucomm bench latency --kind kernel --iters 2000 --warmup 200 --reps 5 --format json
.build/release/gpucomm bench matmul --m 256 --n 256 --k 256 --iters 50 --warmup 10 --variant tiled16
.build/release/gpucomm bench matmul-sweep --m 512 --n 512 --k 512 --iters 10 --warmup 3
.build/release/gpucomm bench transfer --size-kib 4 --iters 10000 --warmup 100 --direction h2d --mode private --strategy blit
.build/release/gpucomm bench transfer --size-kib 4 --iters 10000 --warmup 100 --direction d2h --mode private --strategy blit --format json
.build/release/gpucomm bench transfer-sweep --sizes-kib 1,4,64 --iters 5000 --warmup 200 --direction both --mode both --format jsonl
.build/release/gpucomm run reduction --n 1024
.build/release/gpucomm selftest

- Sources/GPUCommCore: runtime + benchmarks
- Sources/GPUCommCore/Resources/Kernels: Metal kernels (compiled at runtime)
- Sources/gpucomm: CLI
