# The GPU Observability Gap: Why We Need eBPF on GPU devices

# GPU 可观测性差距:为何我们需要在 GPU 设备上使用 eBPF

> Yusheng Zheng=, Tong Yu=, Yiwei Yang=
>
> 郑昱笙=, 于童=, 杨逸凡=

As a revolutionary technology that provides programmability in the kernel, eBPF has achieved tremendous success in CPU observability, networking, and security. However, for the increasingly important field of GPU computing, we also need a flexible and efficient means of observation. Currently, most GPU performance analysis tools are limited to observing from the CPU side through drivers/user-space APIs or vendor-specific performance analysis interfaces (like CUPTI), making it difficult to gain deep insights into the internal execution of the GPU. To address this, bpftime provides GPU support through its CUDA/SYCL attachment implementation, enabling eBPF programs to execute **within GPU kernels** on NVIDIA and AMD GPUs. This brings eBPF's programmability, observability, and customization capabilities to GPU computing workloads. By doing so, it enables real-time profiling, debugging, and runtime extension of GPU applications without source code modification, addressing a gap in the current observability landscape.

作为一项在内核中提供可编程能力的革命性技术,eBPF 在 CPU 的可观测性、网络和安全领域取得了巨大成功。然而,对于日益重要的 GPU 计算领域,我们也需要一种灵活高效的观测手段。目前,大多数 GPU 性能分析工具仅限于通过驱动程序/用户空间 API 或供应商特定的性能分析接口(如 CUPTI)从 CPU 侧进行观测,这使得深入了解 GPU 内部执行变得困难。为了解决这个问题,bpftime 通过其 CUDA/SYCL 附加实现提供了 GPU 支持,使得 eBPF 程序能够在 NVIDIA 和 AMD GPU 的**GPU 内核中**执行。这将 eBPF 的可编程性、可观测性和定制能力带到了 GPU 计算工作负载中。通过这种方式,它实现了对 GPU 应用程序的实时分析、调试和运行时扩展,而无需修改源代码,填补了当前可观测性领域的空白。

> **Note:** GPU support is still experimental. For questions or suggestions, [open an issue](https://github.com/eunomia-bpf/bpftime/issues) or [contact us](mailto:team@eunomia.dev).
>
> **注意:** GPU 支持仍处于实验阶段。如有问题或建议,请[提交 issue](https://github.com/eunomia-bpf/bpftime/issues)或[联系我们](mailto:team@eunomia.dev)。

## The Problem: GPU Observability Challenges

## 问题所在:GPU 可观测性的挑战

### The Core Challenge: An Opaque Execution Model

### 核心挑战:不透明的执行模型

GPUs have become the dominant accelerators for machine learning, scientific computing, and high-performance computing workloads, but their SIMT (Single Instruction, Multiple Thread) execution model introduces significant observability and extensibility challenges. Modern GPUs organize thousands of threads into warps (typically 32 threads) that execute in lockstep on streaming multiprocessors (SMs), with kernels launched asynchronously from the host. These threads navigate complex multi-level memory hierarchies ranging from fast but limited per-thread registers, to shared memory/LDS within thread blocks, through L1/L2 caches, to slower but abundant device memory, while contending with limited preemption capabilities that make kernel execution difficult to interrupt, inspect, or extend. This architectural complexity creates rich performance characteristics including warp divergence, memory coalescing patterns, bank conflicts, and occupancy variations that directly impact throughput, yet these behaviors remain largely opaque to traditional observability tools. Understanding and optimizing issues like kernel stalls, memory bottlenecks, inefficient synchronization, or suboptimal SM utilization requires fine-grained visibility into the execution flow, memory access patterns, and inter-warp coordination happening deep inside the GPU, along with the ability to dynamically inject custom logic.
These capabilities are what existing tooling struggles to provide in a flexible, programmable manner. GPU 已成为机器学习、科学计算和高性能计算工作负载的主要加速器,但其 SIMT(单指令多线程)执行模型带来了显著的可观测性和可扩展性挑战。现代 GPU 将数千个线程组织成 warp(通常为 32 个线程),在流式多处理器(SM)上同步执行,内核由主机异步启动。这些线程在复杂的多级内存层次结构中运行,从快速但有限的每线程寄存器,到线程块内的共享内存/LDS,再到 L1/L2 缓存,最后是速度较慢但容量巨大的设备内存。同时,它们还面临有限的抢占能力,使得内核执行难以中断、检查或扩展。这种架构复杂性带来了丰富的性能特征,包括 warp 分歧、内存合并模式、bank 冲突和占用率变化,这些都直接影响吞吐量,但传统的可观测性工具却基本上无法洞察这些行为。要理解和优化内核停顿、内存瓶颈、低效同步或 SM 利用率不佳等问题,需要对 GPU 内部深处的执行流、内存访问模式和 warp 间协调有细粒度的可见性,并能够动态注入自定义逻辑。而这些正是现有工具难以以灵活、可编程的方式提供的能力。 Existing GPU tracing and profiling tools fall into two categories, each with significant limitations. First, many tracing tools operate exclusively at the CPU-GPU boundary by intercepting CUDA/SYCL userspace library calls (e.g., via LD_PRELOAD hooks on libcuda.so) or instrumenting kernel drivers at the system call layer. While this approach captures host-side events like kernel launches, memory transfers, and API timing, it fundamentally treats the GPU as a black box, providing no visibility into what happens during kernel execution, no correlation with specific warp behaviors or memory stalls, and no ability to adaptively modify behavior based on runtime conditions inside the device. Second, GPU vendor-specific profilers (NVIDIA's CUPTI, Nsight Compute, Intel's GTPin, AMD's ROCProfiler, research tools like NVBit or Neutrino) do provide device-side instrumentation and can collect hardware performance counters, warp-level traces, or instruction-level metrics. However, these tools operate in isolated ecosystems disconnected from Linux kernel observability and extension stacks. They cannot correlate GPU events with CPU-side eBPF probes (kprobes, uprobes, tracepoints), require separate data collection pipelines and analysis workflows, some of them may impose substantial overhead (10-100x slowdowns for fine-grained instrumentation), and lack the dynamic programmability and control from the control plane that makes eBPF powerful for customizing what data to collect and how to process it in production environments, without recompilation or service interruption. 现有的 GPU 跟踪和分析工具可分为两类,每一类都有显著的局限性。首先,许多跟踪工具仅在 CPU-GPU 边界操作,通过拦截 CUDA/SYCL 用户空间库调用(例如,通过对 `libcuda.so` 的 `LD_PRELOAD` 钩子)或在系统调用层检测内核驱动程序。虽然这种方法可以捕获主机侧的事件,如内核启动、内存传输和 API 计时,但它从根本上将 GPU 视为一个黑盒,无法提供内核执行期间发生的情况的可见性,无法与特定的 warp 行为或内存停顿相关联,也无法根据设备内部的运行时条件自适应地修改行为。其次,GPU 供应商特定的分析器(NVIDIA 的 CUPTI、Nsight Compute,Intel 的 GTPin,AMD 的 ROCProfiler,以及像 NVBit 或 Neutrino 这样的研究工具)确实提供了设备侧的检测功能,可以收集硬件性能计数器、warp 级别的跟踪或指令级别的指标。然而,这些工具在与 Linux 内核可观测性和扩展堆栈脱节的孤立生态系统中运行。它们无法将 GPU 事件与 CPU 侧的 eBPF 探针(kprobes、uprobes、tracepoints)相关联,需要独立的数据收集管道和分析工作流,其中一些可能会带来巨大的开销(细粒度检测会导致 10-100 倍的性能下降),并且缺乏动态可编程性和来自控制平面的控制,而这正是 eBPF 在生产环境中定制数据收集和处理方式而无需重新编译或服务中断的强大之处。 ### The Timeline Visibility Gap: What Can and Cannot Be Observed ### 时间线可见性差距:能观测到什么和不能观测到什么 To illustrate this challenge, consider a common debugging scenario. "My CUDA application takes 500ms to complete, but I don't know where the time is spent. Is it memory transfers, kernel execution, or API overhead?" The answer depends critically on whether the application uses synchronous or asynchronous CUDA APIs, and reveals fundamental limitations in CPU-side observability. 为了说明这一挑战,我们来看一个常见的调试场景。“我的 CUDA 应用程序需要 500 毫秒才能完成,但我不知道时间花在了哪里。是内存传输、内核执行,还是 API 开销?” 答案关键取决于应用程序是使用同步还是异步 CUDA API,这也揭示了 CPU 侧可观测性的根本局限。 In synchronous mode, CUDA API calls block until the GPU completes each operation, creating a tight coupling between CPU and GPU timelines. 
Consider a typical workflow: allocate device memory, transfer data to the GPU (host-to-device), execute a kernel, and wait for completion. CPU-side profilers can measure the wall-clock time of each blocking API call, providing useful high-level insights. For example, if `cudaMemcpy()` takes 200μs while `cudaDeviceSynchronize()` (which waits for kernel completion) takes only 115μs, a developer can quickly identify that data transfer dominates over computation, suggesting a PCIe bottleneck that might be addressed by using pinned memory, larger batch sizes, or asynchronous transfers. However, when the developer asks "The kernel sync takes 115μs, but why is my kernel slow? Is it launch overhead, memory stalls, warp divergence, or low SM utilization?", CPU-side tools hit a fundamental wall. The 115μs sync time is an opaque aggregate that conflates multiple hidden GPU-side phases including kernel launch overhead (~5μs to schedule work on SMs), actual kernel execution (~100μs of computation on streaming multiprocessors), and cleanup (~10μs to drain pipelines and release resources). Even with perfect timing of synchronous API calls, CPU-side tools cannot distinguish whether poor kernel performance stems from excessive launch overhead, compute inefficiency, memory access patterns causing stalls, or SM underutilization. Answering these questions requires visibility into warp-level execution, memory transaction statistics, and per-thread behavior that is only accessible from inside the GPU during kernel execution. We need a finer-grained, programmable, GPU-internal perspective to understand what is happening during kernel execution, not just when it starts and ends.

在同步模式下,CUDA API 调用会阻塞,直到 GPU 完成每个操作,从而在 CPU 和 GPU 时间线之间建立了紧密的耦合。考虑一个典型的工作流程:分配设备内存、将数据传输到 GPU(主机到设备)、执行内核,并等待完成。CPU 侧的分析器可以测量每个阻塞 API 调用的墙上时钟时间,提供有用的高层见解。例如,如果 `cudaMemcpy()` 耗时 200μs,而 `cudaDeviceSynchronize()`(等待内核完成)仅耗时 115μs,开发者可以迅速确定数据传输主导了计算,这表明可能存在 PCIe 瓶颈,可以通过使用固定内存、更大的批量大小或异步传输来解决。然而,当开发者问“内核同步耗时 115μs,但我的内核为什么慢?是启动开销、内存停顿、warp 分歧,还是 SM 利用率低?”时,CPU 侧工具遇到了根本性的障碍。这 115μs 的同步时间是一个不透明的聚合值,它混合了多个隐藏的 GPU 侧阶段,包括内核启动开销(约 5μs 在 SM 上调度工作)、实际内核执行(约 100μs 在流式多处理器上的计算)和清理(约 10μs 清空流水线和释放资源)。即使能够完美计时同步 API 调用,CPU 侧工具也无法区分内核性能不佳是源于过度的启动开销、计算效率低下、导致停顿的内存访问模式,还是 SM 利用率不足。这些都需要对 warp 级别的执行、内存事务统计和每线程行为有可见性,而这些只能在内核执行期间从 GPU 内部访问。我们需要一个更细粒度、可编程的 GPU 内部视角来理解内核执行期间发生了什么,而不仅仅是它何时开始和结束。

Modern CUDA applications increasingly use asynchronous APIs (`cudaMemcpyAsync()`, `cudaLaunchKernel()` with streams) to maximize hardware utilization by overlapping CPU work with GPU execution. This introduces temporal decoupling where API calls return immediately after enqueuing work to a stream, allowing the CPU to continue executing while the GPU processes operations sequentially in the background. This breaks the observability that CPU-side tools had in synchronous mode.

现代 CUDA 应用程序越来越多地使用异步 API(`cudaMemcpyAsync()`、带流的 `cudaLaunchKernel()`)来通过重叠 CPU 工作与 GPU 执行来最大化硬件利用率。这引入了时间上的解耦,API 调用在将工作排入流后立即返回,允许 CPU 继续执行,而 GPU 在后台顺序处理操作。这打破了 CPU 侧工具在同步模式下所拥有的可观测性。

Consider the same workflow now executed asynchronously. The developer enqueues a host-to-device transfer (200μs), kernel execution (100μs), and device-to-host transfer (150μs), then continues CPU work before eventually calling `cudaStreamSynchronize()` to wait for all GPU operations to complete. From the CPU's perspective, all enqueue operations return in microseconds, and only the final sync point blocks, reporting a total of 456μs (1 + 200 + 5 + 100 + 150 μs of sequential GPU work).
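To see what this looks like from the measurement side, here is a minimal sketch of CPU-side timing around asynchronous CUDA calls (the `vectorAdd` kernel, sizes, and stream setup are illustrative; error handling is omitted). Every enqueue returns in about a microsecond, so all per-phase durations collapse into the final synchronize:

```c
// Minimal sketch: what a CPU-side profiler can measure in async mode.
// Kernel name and sizes are illustrative; error checking omitted.
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a, *h_c, *d_a, *d_b, *d_c;
    cudaMallocHost((void **)&h_a, bytes);   // pinned host buffers
    cudaMallocHost((void **)&h_c, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaStream_t s;
    cudaStreamCreate(&s);

    double t0 = now_us();
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s);
    double t1 = now_us();      // ~1us: the copy was only enqueued
    vectorAdd<<<(n + 255) / 256, 256, 0, s>>>(d_a, d_b, d_c, n);
    double t2 = now_us();      // ~1us: the launch is asynchronous too
    cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, s);
    double t3 = now_us();
    cudaStreamSynchronize(s);  // the only call that actually blocks
    double t4 = now_us();

    // The per-phase GPU times are invisible here: they are all
    // folded into the single sync measurement.
    printf("enqueue H2D %.1f us, launch %.1f us, enqueue D2H %.1f us, sync %.1f us\n",
           t1 - t0, t2 - t1, t3 - t2, t4 - t3);
    return 0;
}
```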
现在考虑以异步方式执行相同的工作流程。开发者将一个主机到设备的传输(200μs)、内核执行(100μs)和设备到主机的传输(150μs)排入队列,然后在最终调用 `cudaStreamSynchronize()` 等待所有 GPU 操作完成之前继续执行 CPU 工作。从 CPU 的角度来看,所有的入队操作都在微秒内返回,只有最后的同步点会阻塞,报告总共耗时 456μs(1 + 200 + 5 + 100 + 150 μs 的顺序 GPU 工作)。 ``` CPU Timeline (what traditional tools see): ───────────────────────────────────────────────────────────────────────────────── cudaMallocAsync() cudaMemcpyAsync() cudaLaunchKernel() cudaMemcpyAsync() cudaStreamSync() ●─────●─────●─────●─────────────────────────────────────────────────────────────────●──── 1μs 1μs 1μs 1μs CPU continues doing other work... 456μs wait (alloc)(H→D)(kernel)(D→H) (all done) GPU Timeline (actual execution - sequential in stream): ───────────────────────────────────────────────────────────────────────────────── ◄─Alloc─►◄───────H→D DMA────────►◄─Launch─►◄────Kernel Exec────►◄────D→H DMA────► │ ~1μs │ 200μs │ 5μs │ 100μs │ 150μs │ ┴────────┴────────────────────────┴─────────┴────────────────────┴────────────────┴───── ↑ ↑ ↑ CPU already moved on GPU still working Sync returns ``` In synchronous execution, measuring individual API call durations allowed developers to identify whether transfers or compute dominated. In asynchronous mode, this capability disappears entirely as all timing information is collapsed into a single 456μs aggregate at the sync point. The question "Is my bottleneck memory transfer or kernel execution?" becomes unanswerable from the CPU side. If the first transfer takes twice as long due to unpinned memory (400μs instead of 200μs), delaying all subsequent operations by 200μs, the developer only sees the total time increase from 456μs to 656μs with zero indication of which operation caused the delay, when it occurred, or whether it propagated to downstream operations. 在同步执行中,测量单个 API 调用的持续时间可以让开发者确定是传输还是计算占主导地位。在异步模式下,这种能力完全消失,因为所有的时间信息都在同步点被压缩成一个 456μs 的总和。问题“我的瓶颈是内存传输还是内核执行?”从 CPU 侧变得无法回答。如果第一次传输由于非固定内存而耗时两倍(400μs 而不是 200μs),从而将所有后续操作延迟 200μs,开发者只会看到总时间从 456μs 增加到 656μs,而没有任何迹象表明是哪个操作导致了延迟、何时发生,或者是否传播到了下游操作。 Asynchronous execution not only hides the GPU-internal details that were already invisible in synchronous mode (warp divergence, memory stalls, SM utilization), but also eliminates the coarse-grained phase timing that CPU tools could previously provide. Developers lose the ability to even perform basic triage. They cannot determine whether to focus optimization efforts on memory transfers, kernel logic, or API usage patterns without either (1) reverting to slow synchronous execution for debugging (defeating the purpose of async), or (2) adding manual instrumentation that requires recompilation and provides only static measurement points. 异步执行不仅隐藏了在同步模式下本已不可见的 GPU 内部细节(warp 分歧、内存停顿、SM 利用率),而且还消除了 CPU 工具先前可以提供的粗粒度阶段计时。开发者甚至失去了进行基本问题分类的能力。他们无法确定是应该将优化精力集中在内存传输、内核逻辑,还是 API 使用模式上,除非(1)为了调试而恢复到缓慢的同步执行(这违背了异步的目的),或者(2)添加需要重新编译且只提供静态测量点的手动检测代码。 Modern GPU applications like LLM serving further complicate this picture with advanced optimization techniques. Batching strategies combine multiple operations to maximize throughput like a pipeline, but make it harder to identify which individual operations are slow. Persistent kernels stay resident on the GPU processing multiple work batches, eliminating launch overhead but obscuring phase boundaries. Multi-stream execution with complex dependencies between streams creates intricate execution graphs where operations from different streams interleave unpredictably. 
Shared memory usage per thread block constrains occupancy and limits concurrent warp execution, creating subtle resource contention that varies based on kernel configuration. These optimizations significantly improve throughput but make the already-opaque async execution model even more difficult to observe and debug from the CPU side. 像 LLM 服务这样的现代 GPU 应用程序通过先进的优化技术使情况进一步复杂化。批处理策略将多个操作组合起来,像流水线一样最大化吞吐量,但这使得识别哪些单个操作缓慢变得更加困难。持久化内核驻留在 GPU 上处理多个工作批次,消除了启动开销,但模糊了阶段边界。具有复杂流间依赖关系的多流执行创建了错综复杂的执行图,其中来自不同流的操作不可预测地交错执行。每个线程块的共享内存使用限制了占用率和并发 warp 执行,从而根据内核配置产生了微妙的资源争用。这些优化显著提高了吞吐量,但使得本已不透明的异步执行模型从 CPU 侧更难观察和调试。 However, even if we had perfect GPU-internal visibility showing high memory stall cycles, we still couldn't determine the root cause: Is it warp divergence causing uncoalesced access? Host thread descheduling delaying async memory copies? PCIe congestion from concurrent RDMA operations? Or on-demand page migration latency from the kernel? Similarly, when observing SM underutilization, device-only metrics cannot distinguish whether the grid is too small, launches are serialized by a userspace mutex, or the driver throttled after an ECC error or power event. 然而,即使我们拥有完美的 GPU 内部可见性,显示出高内存停顿周期,我们仍然无法确定根本原因:是 warp 分歧导致了非合并访问吗?是主机线程的调度延迟了异步内存拷贝吗?是并发 RDMA 操作导致的 PCIe 拥塞吗?还是内核的按需页面迁移延迟?同样,当观察到 SM 利用率不足时,仅凭设备端的指标无法区分是网格太小、启动被用户空间互斥锁串行化,还是驱动在发生 ECC 错误或电源事件后进行了节流。 The challenge becomes acute in production environments where tail latency spikes occur intermittently. Are they caused by GPU cache effects, cgroup throttling on the host producer thread, or interference from another container issuing large DMA transfers? Device-only tooling reports "what happened on the GPU," but not "why it happened now in this heterogeneous system." Without time-aligned correlation to userspace behavior (threads, syscalls, allocations), kernel events (page faults, scheduler context switches, block I/O), and driver decisions (stream dependency resolution, memory registration, power state transitions), engineers must iterate slowly: observe a GPU symptom → guess a host cause → rebuild with ad-hoc instrumentation → repeat. This feedback loop is expensive and often infeasible in latency-sensitive deployments. To address these challenges in production environments, the industry has started to adopt Continuous Profiling, which aims to provide "always-on" monitoring for online services without significant performance impact. However, existing GPU continuous profiling solutions often rely on sampling high-level metrics (like GPU utilization or power consumption), which cannot explain "why" a kernel is slow, or they depend on periodically running heavyweight vendor tools, which introduce unacceptable performance overhead. These solutions fail to strike a balance between low overhead, high fidelity, and fine-grained visibility, nor do they solve the fundamental problem of cross-layer correlation. 
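As a point of reference, the "high-level metric sampling" that many continuous-profiling agents perform today looks roughly like the hedged sketch below, using NVML (the device index, sample count, and polling interval are arbitrary). It can report that the GPU is busy, but says nothing about which kernel, which warp, or why:

```c
// Minimal sketch of high-level GPU metric sampling via NVML.
// This is what many continuous-profiling agents do today: it reports
// *that* the GPU is busy, but never *why* a kernel is slow.
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 10; i++) {      // arbitrary 10-sample loop
        nvmlUtilization_t util;
        unsigned int power_mw;
        nvmlDeviceGetUtilizationRates(dev, &util);
        nvmlDeviceGetPowerUsage(dev, &power_mw);
        printf("gpu=%u%% mem=%u%% power=%.1fW\n",
               util.gpu, util.memory, power_mw / 1000.0);
        usleep(100000);                 // 100ms polling interval
    }
    nvmlShutdown();
    return 0;
}
```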
在生产环境中,当尾延迟尖峰间歇性出现时,挑战变得尤为严峻。它们是由 GPU 缓存效应、主机生产者线程上的 cgroup 节流,还是由另一个容器发出的大量 DMA 传输干扰引起的?仅限设备的工具报告“GPU 上发生了什么”,但不能说明“为什么现在在这个异构系统中发生”。如果没有与用户空间行为(线程、系统调用、分配)、内核事件(缺页、调度器上下文切换、块 I/O)和驱动程序决策(流依赖关系解析、内存注册、电源状态转换)的时间对齐关联,工程师必须缓慢迭代:观察 GPU 症状 → 猜测主机原因 → 使用临时检测工具重建 → 重复。这种反馈循环成本高昂,并且在对延迟敏感的部署中通常是不可行的。为了在生产环境中应对这些挑战,业界已开始采用持续分析(Continuous Profiling),旨在为在线服务提供“永远在线”的监控,而不会产生显著的性能影响。然而,现有的 GPU 持续分析解决方案通常依赖于采样高级指标(如 GPU 利用率或功耗),这无法解释“为什么”一个内核很慢,或者它们依赖于定期运行重量级的供应商工具,这会带来不可接受的性能开销。这些解决方案未能在低开销、高保真度和细粒度可见性之间取得平衡,也未能解决跨层关联的根本问题。 > **The key insight:** Effective GPU observability and extensibility requires a unified solution that spans multiple layers of the heterogeneous computing stack: from userspace applications making CUDA API calls, through OS kernel drivers managing device resources, down to device code executing on GPU hardware. Traditional tools are fragmented across these layers, providing isolated visibility at the CPU-GPU boundary or within GPU kernels alone, but lacking the cross-layer correlation needed to understand how decisions and events at one level impact performance and behavior at another. > > **关键洞见:** 有效的 GPU 可观测性和可扩展性需要一个统一的解决方案,该方案能跨越异构计算堆栈的多个层次:从进行 CUDA API 调用的用户空间应用程序,到管理设备资源的操作系统内核驱动程序,再到在 GPU 硬件上执行的设备代码。传统工具在这些层次上是碎片化的,仅在 CPU-GPU 边界或单独的 GPU 内核内部提供孤立的可见性,但缺乏理解一个层次的决策和事件如何影响另一层次性能和行为所需的跨层关联。 #### Limitations of Existing Tools #### 现有工具的局限性 Existing GPU tracing and profiling tools can be broadly categorized into three types, each with inherent limitations that prevent them from offering a complete, unified solution. 现有的 GPU 跟踪和分析工具大致可分为三类,每一类都有其固有的局限性,使其无法提供一个完整、统一的解决方案。 #### 1. CPU-GPU Boundary-Only Tracing Tools #### 1. 仅限 CPU-GPU 边界的跟踪工具 Many tracing tools operate exclusively at the CPU-GPU boundary by intercepting CUDA/SYCL userspace library calls (e.g., via `LD_PRELOAD` hooks on `libcuda.so`) or instrumenting kernel drivers at the system call layer. 许多跟踪工具仅在 CPU-GPU 边界操作,通过拦截 CUDA/SYCL 用户空间库调用(例如,通过对 `libcuda.so` 的 `LD_PRELOAD` 钩子)或在系统调用层检测内核驱动程序。 - **Advantages**: Can capture host-side events like kernel launches, memory transfers, and API timing. - **Limitations**: This approach fundamentally treats the GPU as a "black box." It provides no visibility into what happens during kernel execution, cannot correlate performance issues with specific warp behaviors or memory stalls, and cannot adaptively modify behavior based on runtime conditions inside the device. - **优势**:可以捕获主机侧的事件,如内核启动、内存传输和 API 计时。 - **局限性**:这种方法从根本上将 GPU 视为一个“黑盒”。它无法提供内核执行期间发生的情况的可见性,不能将性能问题与特定的 warp 行为或内存停顿相关联,也无法根据设备内部的运行时条件自适应地修改行为。 #### 2. Vendor-Specific, Heavyweight Profilers: The Case of Nsight #### 2. 供应商特定的重量级分析器:以 Nsight 为例 NVIDIA's Nsight suite attempts to address some cross-domain visibility challenges. **Nsight Systems** provides system-wide timelines showing CPU threads, CUDA API calls, kernel launches, and memory transfers in a unified view. **Nsight Compute** offers deep kernel-level microarchitectural metrics via CUPTI counters and replay-based analysis, with guided optimization rules. These tools can correlate launches with CPU thread scheduling and provide rich per-kernel stall reasons and memory metrics. NVIDIA 的 Nsight 套件试图解决一些跨域可见性挑战。**Nsight Systems** 提供系统级的时间线,统一展示 CPU 线程、CUDA API 调用、内核启动和内存传输。**Nsight Compute** 通过 CUPTI 计数器和基于回放的分析,提供深入的内核级微架构指标,并附有指导性优化规则。这些工具可以将启动与 CPU 线程调度相关联,并提供丰富的每内核停顿原因和内存指标。 However, Nsight and similar vendor tools suffer from fundamental limitations compared to a unified eBPF approach. 
- **Closed Event Model**: Nsight offers a closed event model with a fixed set of events and no arbitrary programmable logic at attach points—you cannot dynamically add custom instrumentation without recompiling applications or write filtering predicates like "only collect data when kernel execution exceeds 100ms." - **Intrusive Profiling Sessions**: These tools require special profiling sessions that perturb workload behavior through counter multiplexing and replay mechanisms, making them unsuitable for always-on continuous telemetry in production, and causing replay-based collection to miss transient anomalies and rare events. - **Lack of On-Device Filtering and Aggregation**: Nsight lacks in-situ filtering and aggregation, forcing all raw data to be exported then post-processed, which creates multi-GB traces from large applications with massive async pipelines and no programmable adaptive response to change sampling logic on-the-fly based on observed state. - **Limited System Integration**: Nsight cannot attach dynamic probes to persistent kernels without restart, lacks integration with Linux eBPF infrastructure (kprobes/uprobes/tracepoints), and cannot share data structures (maps) across CPU and GPU instrumentation, making it extremely difficult to stitch causality chains like page fault (host) → delayed launch enqueue → warp stall spike. - **Vendor Lock-in**: These are NVIDIA-only tools with no clear path to vendor-neutral deployment across AMD, Intel, or other accelerators in heterogeneous systems. 然而,与统一的 eBPF 方法相比,Nsight 和类似的供应商工具存在根本性的局限。 - **封闭事件模型**:Nsight 提供一个封闭的事件模型,具有固定的事件集,在附加点没有任意的可编程逻辑——你无法在不重新编译应用程序的情况下动态添加自定义检测,也无法编写诸如“仅当内核执行时间超过 100ms 时才收集数据”之类的过滤谓词。 - **侵入式分析会话**:这些工具需要特殊的分析会话,通过计数器复用和回放机制扰乱工作负载行为,使其不适合在生产环境中进行永远在线的连续遥测,并且基于回放的收集会错过瞬时异常和罕见事件。 - **缺乏设备端过滤和聚合**:Nsight 缺乏原位过滤和聚合,迫使所有原始数据先导出再进行后处理,这会从具有大量异步流水线的大型应用程序中产生数 GB 的跟踪文件,并且没有可编程的自适应响应来根据观察到的状态动态改变采样逻辑。 - **有限的系统集成**:Nsight 无法在不重启的情况下将动态探针附加到持久化内核,缺乏与 Linux eBPF 基础设施(kprobes/uprobes/tracepoints)的集成,并且无法在 CPU 和 GPU 检测之间共享数据结构(maps),这使得将诸如缺页(主机)→ 延迟的启动入队 → warp 停顿尖峰之类的因果链拼接起来变得极其困难。 - **供应商锁定**:这些是仅限 NVIDIA 的工具,没有明确的途径可以在异构系统中跨 AMD、Intel 或其他加速器进行与供应商无关的部署。 In practice, developers face iterative root cause analysis slowdowns with large traces, miss production issues that don't reproduce under profiling overhead, and cannot correlate GPU events with existing production observability stacks (perf, bpftrace, custom eBPF agents) without complex mode-switching to special "profiling sessions." 在实践中,开发者面临着因大型跟踪文件而导致的迭代式根本原因分析减慢、错过在分析开销下无法复现的生产问题,并且无法在不进行复杂模式切换到特殊“分析会话”的情况下,将 GPU 事件与现有的生产可观测性堆栈(perf、bpftrace、自定义 eBPF 代理)相关联。 #### 3. Research Tools and Interfaces for Fine-Grained Analysis #### 3. 用于细粒度分析的研究工具和接口 When deeper visibility than Nsight is required, the industry has explored tools based on binary instrumentation or lower-level interfaces, such as NVIDIA CUPTI, NVBit, and NEUTRINO. 当需要比 Nsight 更深入的可见性时,业界已经探索了基于二进制插桩或更低级接口的工具,例如 NVIDIA CUPTI、NVBit 和 NEUTRINO。 - **CUPTI (CUDA Profiling Tools Interface)**: As a mature interface, CUPTI is well-suited for obtaining kernel-level high-level metrics (like start/end times) and hardware performance counters. Projects like [xpu-perf](https://github.com/eunomia-bpf/xpu-perf) have demonstrated its effectiveness in correlating CPU-GPU data. However, when it comes to understanding "why" a kernel is slow, the high-level metrics provided by CUPTI are often insufficient. 
- **Binary Instrumentation Tools (NVBit, NEUTRINO)**: These tools achieve fine-grained, instruction-level observability by instrumenting at the assembly or PTX (NVIDIA GPU's intermediate language) level. For instance, NEUTRINO, a contemporaneous research system, uses assembly-level probes to gather data. However, it typically requires programming directly in assembly, which is not only complex but also lacks the safety and portability offered by eBPF. These tools are also usually independent of CPU profilers, making it very challenging to provide unified, cross-layer, multi-device visibility, correlate events logically, and handle clock drift. The process of correlating events may also involve multiple data copies, adding further performance overhead. NEUTRINO is also not designed for always-on monitoring; it is session-based, generating large volumes of data that must be post-processed offline.

- **CUPTI (CUDA 分析工具接口)**:作为一个成熟的接口,CUPTI 非常适合获取内核级别的高级指标(如开始/结束时间)和硬件性能计数器。像 [xpu-perf](https://github.com/eunomia-bpf/xpu-perf) 这样的项目已经证明了它在关联 CPU-GPU 数据方面的有效性。然而,当涉及到理解“为什么”一个内核很慢时,CUPTI 提供的高级指标通常是不够的。
- **二进制插桩工具 (NVBit, NEUTRINO)**:这些工具通过在汇编或 PTX(NVIDIA GPU 的中间语言)级别进行插桩,实现了细粒度的、指令级别的可观测性。例如,大约在同一时期出现的 NEUTRINO,使用汇编级别的探针来收集数据。然而,它通常需要直接用汇编语言编程,这不仅复杂,而且缺乏 eBPF 提供的安全性和可移植性。它们通常也独立于 CPU 分析器,这使得提供统一的、跨层的、多设备的可见性、关联事件逻辑和处理时钟漂移变得非常困难。关联事件的过程可能还涉及多次数据拷贝,导致额外的性能开销。NEUTRINO 并非为永远在线的监控而设计;它也是基于会话的,会生成大量等待后续处理的信息。

In summary, while existing tools are powerful in certain aspects, they are either too high-level or too cumbersome and isolated. Developers face a difficult trade-off between iteration speed, production safety, and the depth of problem diagnosis, and they cannot easily correlate GPU events with existing CPU observability stacks (like perf, bpftrace).

总而言之,虽然现有工具在某些方面功能强大,但它们要么过于高级,要么过于繁琐和孤立。开发者在迭代速度、生产安全性和问题诊断深度之间面临艰难的权衡,并且他们无法轻易地将 GPU 事件与现有的 CPU 可观测性堆栈(如 perf、bpftrace)相关联。

## The Solution: Extending eBPF to the GPU

## 解决方案:将 eBPF 扩展到 GPU

To overcome the limitations of existing tools, we need a solution that can unify CPU and GPU observability while providing programmable, low-overhead monitoring. The eBPF technology and its implementation in the `bpftime` project make this possible.

为了克服现有工具的局限性,我们需要一个能够统一 CPU 和 GPU 可观测性,同时提供可编程、低开销监控的解决方案。eBPF 技术及其在 `bpftime` 项目中的实现使这成为可能。

### Why eBPF?

### 为什么是 eBPF?

To understand why eBPF is the right tool for this challenge, it's helpful to look at its impact on the CPU world. eBPF (extended Berkeley Packet Filter) is a revolutionary technology in the Linux kernel that allows sandboxed programs to be dynamically loaded to extend kernel capabilities safely. On the CPU side, eBPF has become a cornerstone of modern observability, networking, and security due to its unique combination of programmability, safety, and performance. It enables developers to attach custom logic to thousands of hook points, collecting deep, customized telemetry with minimal overhead. The core idea behind `bpftime` is to bring this same transformative power to the traditionally opaque world of GPU computing.

要理解为什么 eBPF 是应对这一挑战的正确工具,看看它在 CPU 世界的影响会很有帮助。eBPF(扩展的伯克利数据包过滤器)是 Linux 内核中的一项革命性技术,它允许动态加载沙箱程序以安全地扩展内核功能。在 CPU 方面,由于其可编程性、安全性和性能的独特结合,eBPF 已成为现代可观测性、网络和安全的基石。它使开发者能够将自定义逻辑附加到数千个钩子点,以最小的开销收集深入的、定制化的遥测数据。`bpftime` 背后的核心思想是将这种同样的变革性力量带到传统上不透明的 GPU 计算世界。

By running eBPF programs natively inside GPU kernels, `bpftime` provides safe, programmable, unified observability and extensibility across the entire stack.
- **Unified Cross-Layer Observability**: The architecture treats CPU and GPU probes as peers in a unified control plane. Shared BPF maps and ring buffers enable direct data exchange, and dynamic instrumentation works without recompilation or restart. Integration with existing eBPF infrastructure (perf, bpftrace, custom agents) requires no mode-switching. Developers can simultaneously trace CPU-side CUDA API calls via uprobes, kernel driver interactions via kprobes, and GPU-side kernel execution via CUDA probes, all using the same eBPF toolchain and correlating events across the host-device boundary. Example questions now become answerable: "Did the CPU syscall delay at T+50μs cause the GPU kernel to stall at T+150μs?" or "Which CPU threads are launching the kernels that exhibit high warp divergence?" This cross-layer visibility enables root-cause analysis that spans the entire heterogeneous execution stack, from userspace application logic through kernel drivers to GPU hardware behavior, without leaving the production observability workflow. - **Low-Overhead Production Monitoring**: Unlike session-based profilers, it enables always-on production monitoring with dynamic load/unload of probes and device-side predicate filtering to reduce overhead. - **Restoring Asynchronous Visibility**: It recovers async-mode visibility with per-phase timestamps (H→D at T+200μs, kernel at T+206μs, D→H at T+456μs), exposes GPU-internal details with nanosecond-granularity telemetry for warp execution and memory patterns, and correlates CPU and GPU events without the heavyweight overhead of traditional separate profilers. 通过在 GPU 内核内部原生运行 eBPF 程序,`bpftime` 在整个堆栈中提供了安全、可编程、统一的可观测性和可扩展性。 - **统一的跨层可观测性**:该架构将 CPU 和 GPU 探针视为统一控制平面中的对等体。共享的 BPF maps 和环形缓冲区实现了直接的数据交换,动态插桩无需重新编译或重启即可工作。与现有的 eBPF 基础设施(perf、bpftrace、自定义代理)集成无需模式切换。开发者可以使用相同的 eBPF 工具链,同时通过 uprobes 跟踪 CPU 侧的 CUDA API 调用,通过 kprobes 跟踪内核驱动程序交互,通过 CUDA 探针跟踪 GPU 侧的内核执行,并在主机-设备边界上关联事件。现在可以回答诸如“T+50μs 的 CPU 系统调用延迟是否导致了 T+150μs 的 GPU 内核停顿?”或“哪些 CPU 线程正在启动表现出高 warp 分歧的内核?”等问题。这种跨层可见性使得根本原因分析能够跨越整个异构执行堆栈,从用户空间应用程序逻辑到内核驱动程序再到 GPU 硬件行为,而无需离开生产可观测性工作流。 - **低开销的生产监控**:与基于会话的分析器不同,它通过动态加载/卸载探针和设备端谓词过滤来减少开销,从而实现永远在线的生产监控。 - **恢复异步可见性**:它通过各阶段的时间戳(H→D 在 T+200μs,内核在 T+206μs,D→H 在 T+456μs)恢复了异步模式的可见性,通过纳秒级粒度的遥测技术暴露了 warp 执行和内存模式的 GPU 内部细节,并且在没有传统独立分析器的重量级开销的情况下关联了 CPU 和 GPU 事件。 ### `bpftime`: Running eBPF Natively on the GPU ### `bpftime`:在 GPU 上原生运行 eBPF **bpftime's approach** bridges this gap by extending eBPF's programmability and customization model directly into GPU execution contexts, enabling eBPF programs to run natively inside GPU kernels alongside application workloads. It employs a PTX injection technique, dynamically injecting eBPF programs into the intermediate assembly language (PTX) of NVIDIA GPUs, allowing for direct hooking of GPU threads. **bpftime 的方法**通过将 eBPF 的可编程性和定制模型直接扩展到 GPU 执行上下文中,弥合了这一差距,使得 eBPF 程序能够与应用程序工作负载一起在 GPU 内核内部原生运行。它采用 PTX 注入技术,将 eBPF 程序动态注入到 NVIDIA GPU 的中间汇编语言(PTX)中,从而实现对 GPU 线程的直接挂钩。 This approach allows us to obtain extremely fine-grained, in-kernel runtime information that is difficult to access with high-level APIs like CUPTI. For example, with `bpftime`, we can: - **Trace memory access patterns of individual threads or thread blocks**: Understand exactly how memory access instructions are executed and identify issues like uncoalesced accesses. - **Observe the scheduling behavior of Streaming Multiprocessors (SMs)**: See how warps are scheduled and executed on the SMs. 
- **Analyze in-kernel control flow**: Identify the specific branches causing warp divergence and quantify their impact. 这种方法使我们能够获得极其细粒度的、内核内的运行时信息,而这些信息是像 CUPTI 这样的高级 API 难以访问的。例如,使用 `bpftime`,我们可以: - **跟踪单个线程或线程块的内存访问模式**:准确理解内存访问指令是如何执行的,并识别像非合并访问这样的问题。 - **观察流式多处理器(SM)的调度行为**:查看 warps 是如何被调度并在 SM 上执行的。 - **分析内核内控制流**:识别导致 warp 分歧的具体分支,并量化其影响。 `bpftime`'s PTX injection is not intended to replace CUPTI but to complement its capabilities. When developers need to dive into the micro-level of GPU kernel execution to diagnose complex issues arising from thread behavior, memory access, or scheduling policies, `bpftime` fills the gap in fine-grained, programmable observability in the current toolchain. Compared to other binary instrumentation tools, `bpftime`'s eBPF-based solution offers several unique advantages: `bpftime` 的 PTX 注入并非旨在取代 CUPTI,而是对其能力的补充。当开发者需要深入到 GPU 内核执行的微观层面,以诊断由线程行为、内存访问或调度策略引起的复杂问题时,`bpftime` 填补了当前工具链中细粒度、可编程可观测性的空白。与其他二进制插桩工具相比,`bpftime` 基于 eBPF 的解决方案具有几个独特的优势: 1. **Enhanced Security**: The eBPF verifier provides a safe sandbox environment, preventing the execution of unsafe instructions. 2. **Convenient Cross-Layer Correlation**: eBPF on the GPU is essentially an extension of eBPF on the CPU, facilitating unified instrumentation and programming across CPU user-space, kernel-space, and GPU. This makes it easier to correlate events across the CPU, GPU, and network card, as well as between user-space and kernel-space. 3. **Simplified Programming Model**: eBPF uses a restricted C language for programming, abstracting away the complexities of low-level assembly and allowing developers to write powerful observation and analysis programs more efficiently. 1. **增强的安全性**:eBPF 验证器提供了一个安全的沙箱环境,防止执行不安全的指令。 2. **便捷的跨层关联**:GPU 上的 eBPF 本质上是 CPU 上 eBPF 的扩展,促进了跨 CPU 用户空间、内核空间和 GPU 的统一插桩和编程。这使得在 CPU、GPU 和网卡之间,以及用户空间和内核空间之间关联事件变得更加容易。 3. **简化的编程模型**:eBPF 使用受限的 C 语言进行编程,抽象了底层汇编的复杂性,使开发者能够更高效地编写强大的观测和分析程序。 | Tool Type | Representative Tools | Advantages | Limitations | Use Case | | :--- | :--- | :--- | :--- | :--- | | **Vendor Profilers** | NVIDIA Nsight, AMD ROCProfiler | Comprehensive hardware metrics, strong visualization | Heavyweight, requires special environment, cross-platform difficulty | Deep optimization during development | | **High-Level APIs** | CUPTI, ROCTracer | Stable interface, supports tracing and sampling | Coarse granularity, cannot observe inside kernels | Application-level performance monitoring | | **Binary Instrumentation** | NVBit, SASSI, NEUTRINO | Fine-grained observation, instruction-level control | High performance overhead (3-10×), limited programming | Offline deep analysis | | **eBPF Extension** | bpftime | Unified CPU/GPU observation, low overhead, safe & programmable, cross-platform | Requires runtime support, features still evolving | Production real-time monitoring, cross-layer correlation analysis | | 工具类型 | 代表工具 | 优势 | 局限性 | 使用场景 | | :--- | :--- | :--- | :--- | :--- | | **供应商分析器** | NVIDIA Nsight, AMD ROCProfiler | 全面的硬件指标,强大的可视化 | 重量级,需要特殊环境,跨平台困难 | 开发期间的深度优化 | | **高级 API** | CUPTI, ROCTracer | 稳定的接口,支持跟踪和采样 | 粒度粗,无法观察内核内部 | 应用级性能监控 | | **二进制插桩** | NVBit, SASSI, NEUTRINO | 细粒度观察,指令级控制 | 高性能开销 (3-10×),编程受限 | 离线深度分析 | | **eBPF 扩展** | bpftime | 统一的 CPU/GPU 观察,低开销,安全可编程,跨平台 | 需要运行时支持,功能仍在发展中 | 生产环境实时监控,跨层关联分析 | #### `bpftime`'s Architecture and Advantages #### `bpftime` 的架构与优势 The system defines a comprehensive set of GPU-side attach points that mirror the flexibility of CPU-side kprobes/uprobes. 
Developers can instrument CUDA/SYCL device function entry and exit points (analogous to function probes), thread block lifecycle events (block begin/end), synchronization primitives (barriers, atomics), memory operations (loads, stores, transfers), and stream/event operations. eBPF programs written in restricted C are compiled through LLVM into device-native bytecode (PTX (Parallel Thread Execution) assembly for NVIDIA GPUs or SPIR-V for AMD/Intel) and dynamically injected into target kernels at runtime through binary instrumentation, without requiring source code modification or recompilation. 该系统定义了一套全面的 GPU 侧附加点,反映了 CPU 侧 kprobes/uprobes 的灵活性。开发者可以检测 CUDA/SYCL 设备函数的入口和出口点(类似于函数探针)、线程块生命周期事件(块开始/结束)、同步原语(屏障、原子操作)、内存操作(加载、存储、传输)以及流/事件操作。用受限 C 语言编写的 eBPF 程序通过 LLVM 编译成设备原生字节码(NVIDIA GPU 的 PTX(并行线程执行)汇编或 AMD/Intel 的 SPIR-V),并通过二进制插桩在运行时动态注入到目标内核中,无需修改或重新编译源代码。 The runtime provides a full eBPF execution environment on the GPU including (1) a safety verifier to ensure bounded execution and memory safety in the SIMT context, (2) a rich set of GPU-aware helper functions for accessing thread/block/grid context, timing, synchronization, and formatted output, (3) specialized BPF map types that live in GPU memory for high-throughput per-thread data collection (GPU array maps) and event streaming (GPU ringbuf maps), and (4) a host-GPU communication protocol using shared memory and spinlocks for safely calling host-side helpers when needed. 该运行时在 GPU 上提供了一个完整的 eBPF 执行环境,包括(1)一个安全验证器,以确保在 SIMT 上下文中的有界执行和内存安全;(2)一套丰富的 GPU 感知辅助函数,用于访问线程/块/网格上下文、计时、同步和格式化输出;(3)存在于 GPU 内存中的专用 BPF map 类型,用于高吞吐量的每线程数据收集(GPU 数组 maps)和事件流(GPU 环形缓冲区 maps);以及(4)一个使用共享内存和自旋锁的主机-GPU 通信协议,用于在需要时安全地调用主机侧辅助函数。 This architecture enables not only collecting fine-grained telemetry (per-warp timing, memory access patterns, control flow divergence) at nanosecond granularity, but also adaptively modifying kernel behavior based on runtime conditions, building custom extensions and optimizations, and unifying GPU observability with existing CPU-side eBPF programs into a single analysis pipeline, all while maintaining production-ready overhead characteristics. This enables: 这种架构不仅能够在纳秒级粒度上收集细粒度的遥测数据(如每 warp 计时、内存访问模式、控制流分歧),还能根据运行时条件自适应地修改内核行为,构建自定义扩展和优化,并将 GPU 可观测性与现有的 CPU 侧 eBPF 程序统一到单个分析流水线中,同时保持生产就绪的开销特性。这使得: - **3-10x faster performance** than tools like NVBit for instrumentation - **Vendor-neutral design** that works across NVIDIA, AMD and Intel GPUs - **Unified observability and control** with Linux kernel eBPF programs (kprobes, uprobes) - **Fine-grained profiling and runtime customization** at the warp or instruction level - **Adaptive GPU kernel memory optimization** and programmable scheduling across SMs - **Dynamic extensions** for GPU workloads without recompilation - **Accelerated eBPF applications** by leveraging GPU compute power - 比 NVBit 等工具**快 3-10 倍的插桩性能** - **与供应商无关的设计**,可在 NVIDIA、AMD 和 Intel GPU 上工作 - 与 Linux 内核 eBPF 程序(kprobes、uprobes)**统一的可观测性和控制** - 在 warp 或指令级别的**细粒度分析和运行时定制** - **自适应的 GPU 内核内存优化**和跨 SM 的可编程调度 - 无需重新编译即可对 GPU 工作负载进行**动态扩展** - 利用 GPU 计算能力**加速 eBPF 应用** The architecture is designed to achieve four core goals: (1) provide a unified eBPF-based interface that works seamlessly across userspace, kernel, multiple CPU and GPU contexts from different vendors, (2) enable dynamic, runtime instrumentation without requiring source code modification or recompilation, and (3) maintain safe and efficient execution within the constraints of GPU hardware and SIMT execution models. 
(4) Minimize dependencies and ease deployment: the runtime is built on top of existing CUDA/SYCL/OpenGL runtimes without requiring custom kernel drivers, firmware modifications, or heavyweight runtimes like record-and-replay systems.

该架构旨在实现四个核心目标:(1)提供一个统一的基于 eBPF 的接口,能够无缝地在用户空间、内核、来自不同供应商的多个 CPU 和 GPU 上下文中工作;(2)实现动态的运行时插桩,无需修改或重新编译源代码;(3)在 GPU 硬件和 SIMT 执行模型的约束下,保持安全高效的执行;(4)依赖少且易于部署,构建于现有的 CUDA/SYCL/OpenGL 运行时之上,无需自定义内核驱动、固件修改或像录制-回放系统那样的重量级运行时。

But talking about architecture is one thing; seeing it in action is another. Just how fast is it? We'll dive deep into the performance benchmarks in our next blog post. Stay tuned!

但谈论架构是一回事,看它实际运行是另一回事。它到底有多快?我们将在下一篇博客文章中深入探讨性能基准测试。敬请期待!

## Architecture

## 架构

### CUDA Attachment Pipeline

### CUDA 附加流水线

The GPU support is built as an instrumentation pipeline that dynamically injects eBPF programs into GPU kernels. The key components are:

GPU 支持被构建为一个插桩流水线,它将 eBPF 程序动态注入到 GPU 内核中。关键组件包括:

1. **CUDA/OpenCL Runtime Hooking**: Using `LD_PRELOAD`, `bpftime` intercepts calls to the CUDA/SYCL runtime library. This allows it to gain control over kernel launches and other GPU-related operations.
2. **eBPF to PTX/SPIR-V JIT Compilation**: When a kernel is launched, `bpftime` takes the eBPF bytecode intended for a GPU probe and Just-In-Time (JIT) compiles it into the target GPU's instruction set architecture—PTX for NVIDIA or SPIR-V for AMD/Intel.
3. **Binary Instrumentation and Injection**: The compiled eBPF code is injected into the target kernel's binary (e.g., PTX code) before it is loaded onto the GPU. This runtime modification allows eBPF programs to execute natively within the kernel's context.
4. **Helper Function Trampoline**: `bpftime` provides a set of eBPF helper functions accessible from the GPU. These helpers are implemented as a trampoline to perform tasks like accessing maps, getting timestamps, or submitting data through ring buffers.
5. **Shared Data Structures**: BPF maps and ring buffers are implemented over shared memory (pinned host memory) or device memory accessible by both the CPU and GPU, enabling efficient data exchange between the host and the device.

1. **CUDA/OpenCL 运行时挂钩**:使用 `LD_PRELOAD`,`bpftime` 拦截对 CUDA/SYCL 运行时库的调用。这使其能够控制内核启动和其他与 GPU 相关的操作。
2. **eBPF 到 PTX/SPIR-V 的 JIT 编译**:当内核启动时,`bpftime` 获取用于 GPU 探针的 eBPF 字节码,并将其即时(JIT)编译成目标 GPU 的指令集架构——NVIDIA 的 PTX 或 AMD/Intel 的 SPIR-V。
3. **二进制插桩和注入**:编译后的 eBPF 代码在加载到 GPU 之前被注入到目标内核的二进制文件(例如,PTX 代码)中。这种运行时修改允许 eBPF 程序在内核上下文中原生执行。
4. **辅助函数蹦床(Trampoline)**:`bpftime` 提供了一套可从 GPU 访问的 eBPF 辅助函数。这些辅助函数被实现为一个蹦床,用于执行访问 maps、获取时间戳或通过环形缓冲区提交数据等任务。
5. **共享数据结构**:BPF maps 和环形缓冲区通过 CPU 和 GPU 均可访问的共享内存(固定主机内存)或设备内存实现,从而实现了主机和设备之间的高效数据交换。
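To make the first component concrete, the sketch below shows the bare `LD_PRELOAD` interposition mechanism for `cudaLaunchKernel()`. It is illustrative only: bpftime's actual hook is Frida-based and also performs the PTX rewriting described above.

```c
// hook.c -- minimal LD_PRELOAD interposer sketch (not bpftime's real hook).
// Build:  gcc -shared -fPIC hook.c -o libhook.so -I/usr/local/cuda/include -ldl
// Run:    LD_PRELOAD=./libhook.so ./your_cuda_app
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>

typedef cudaError_t (*launch_fn_t)(const void *, dim3, dim3, void **,
                                   size_t, cudaStream_t);

cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem, cudaStream_t stream)
{
    static launch_fn_t real_launch = NULL;
    if (!real_launch)
        real_launch = (launch_fn_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");
    // This is the control point: here bpftime can look up the kernel's
    // PTX, inject JIT-compiled eBPF probes, and reload the modified image.
    fprintf(stderr, "[hook] intercepted kernel launch %p on stream %p\n",
            func, (void *)stream);
    return real_launch(func, gridDim, blockDim, args, sharedMem, stream);
}
```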
```
┌─────────────────────────────────────────────────────────────────┐
│                       Application Process                        │
│   (LD_PRELOAD)           JIT compile                             │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│  │     PTX/     │────▶│   bpftime    │────▶│  GPU Kernel  │      │
│  │    SPIR-V    │     │   runtime    │     │  with eBPF   │      │
│  └──────────────┘     └──────────────┘     └──────────────┘      │
│                              │ manage             │              │
│                              ▼                    ▼              │
│              ┌────────────────────────────────────┐              │
│              │            Shared Maps             │              │
│              │            (Host-GPU)              │              │
│              └────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────────┘
```

## Examples

## 示例

We demonstrate GPU eBPF capabilities through several bcc-style example tools:

我们通过几个 bcc 风格的工具示例来展示 GPU eBPF 的能力:

### [kernelretsnoop](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/kernelretsnoop) - Per-Thread Exit Timestamp Tracer

### [kernelretsnoop](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/kernelretsnoop) - 每线程退出时间戳追踪器

Attaches to CUDA kernel exits and records the exact nanosecond timestamp when each GPU thread completes execution. This reveals thread divergence, memory access patterns, and warp scheduling issues that are invisible to traditional profilers.

附加到 CUDA 内核退出点,并记录每个 GPU 线程完成执行时的精确纳秒级时间戳。这揭示了传统分析器无法看到的线程分歧、内存访问模式和 warp 调度问题。

> **Note**: The `kprobe`/`kretprobe` naming convention used in these examples is a placeholder to maintain conceptual similarity with Linux kernel eBPF. In `bpftime`, these probes attach to device-side GPU kernel functions, not Linux kernel functions. This naming may be revised in the future to better reflect their scope.
>
> **注意**:这些示例中使用的 `kprobe`/`kretprobe` 命名约定是一个占位符,以保持与 Linux 内核 eBPF 的概念相似性。在 `bpftime` 中,这些探针附加到设备侧的 GPU 内核函数,而不是 Linux 内核函数。这个命名将来可能会修改以更好地反映其范围。

**Use case**: You notice your kernel is slower than expected. `kernelretsnoop` reveals that thread 31 in each warp finishes 750ns later than threads 0-30, exposing a boundary condition causing divergence. You refactor to eliminate the branch, and all threads now complete within nanoseconds of each other.

**使用场景**:你注意到你的内核比预期的要慢。`kernelretsnoop` 显示每个 warp 中的第 31 个线程比 0-30 号线程晚 750ns 完成,暴露出一个导致分歧的边界条件。你重构代码消除了分支,现在所有线程都在纳秒级的时间内相互完成。

```c
// eBPF program runs on GPU at kernel exit
// eBPF 程序在内核退出时在 GPU 上运行
SEC("kretprobe/_Z9vectorAddPKfS0_Pf")
int ret__cuda(void *ctx)
{
    u64 tid_x, tid_y, tid_z;
    bpf_get_thread_idx(&tid_x, &tid_y, &tid_z); // Which thread am I?
    u64 ts = bpf_get_globaltimer();             // When did I finish?
    struct { u64 tid; u64 ts; } data = { tid_x, ts };
    // Write to ringbuffer for userspace analysis
    bpf_perf_event_output(ctx, &events, 0, &data, sizeof(data));
    return 0;
}
```

### [threadhist](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/threadhist) - Thread Execution Count Histogram

### [threadhist](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/threadhist) - 线程执行计数直方图

Uses GPU array maps to count how many times each thread executes. Detects workload imbalance where some threads do far more work than others, wasting GPU compute capacity.

使用 GPU 数组 maps 来计算每个线程执行的次数。检测工作负载不平衡的情况,即某些线程比其他线程做更多的工作,从而浪费 GPU 计算能力。

**Use case**: Your grid-stride loop processes 1M elements with 5 threads. You expect balanced work, but `threadhist` shows thread 4 executes only 75% as often as threads 0-3. The boundary elements divide unevenly, leaving thread 4 idle while others work. You adjust the distribution and achieve balanced execution.

**使用场景**:你的网格步长循环用 5 个线程处理 1M 个元素。你期望工作是平衡的,但 `threadhist` 显示第 4 个线程的执行次数仅为 0-3 号线程的 75%。边界元素分配不均,导致第 4 个线程在其他线程工作时处于空闲状态。你调整了分配方式,实现了平衡执行。

```c
// eBPF program runs on GPU at kernel exit
// eBPF 程序在内核退出时在 GPU 上运行
SEC("kretprobe/_Z9vectorAddPKfS0_Pf")
int ret__cuda(void *ctx)
{
    u64 tid_x, tid_y, tid_z;
    bpf_get_thread_idx(&tid_x, &tid_y, &tid_z);
    u32 key = tid_x; // array map keys are 32-bit indices
    // Per-thread counter in GPU array map
    u64 *count = bpf_map_lookup_elem(&thread_counts, &key);
    if (count) {
        __atomic_add_fetch(count, 1, __ATOMIC_SEQ_CST); // Thread N executed once more
    }
    return 0;
}
```
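For reference, the kind of kernel this use case describes is a standard grid-stride loop; the hedged sketch below is illustrative, not taken from the repo. When the element count is not an exact multiple of the total thread count, threads at the low end of the index range execute one extra iteration, which is exactly the per-thread count skew `threadhist` surfaces.

```c
// Illustrative grid-stride kernel (not from the bpftime repo).
// With T total threads and n elements, thread t executes
// ceil((n - t) / T) iterations, so counts differ when T does not divide n.
__global__ void process(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;                  // per-element work
}
```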
### [launchlate](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/launchlate) - Kernel Launch Latency Profiler

### [launchlate](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/launchlate) - 内核启动延迟分析器

Measures the time between `cudaLaunchKernel()` on CPU and actual kernel execution on GPU. Reveals hidden queue delays, stream dependencies, and scheduling overhead that make fast kernels slow in production.

测量 CPU 上的 `cudaLaunchKernel()` 调用与 GPU 上实际内核执行之间的时间。揭示隐藏的队列延迟、流依赖关系和调度开销,这些开销会使快速的内核在生产中变慢。

**Use case**: Your kernels execute in 100μs each, but users report 50ms latency. `launchlate` shows 200-500μs launch latency per kernel because each waits for the previous one and memory transfers to complete. Total time is 5ms, not 1ms. You switch to CUDA graphs, batching all launches, and latency drops to 1.2ms.

**使用场景**:你的每个内核执行时间为 100μs,但用户报告延迟为 50ms。`launchlate` 显示每个内核有 200-500μs 的启动延迟,因为每个内核都在等待前一个内核和内存传输完成。总时间是 5ms,而不是 1ms。你切换到 CUDA graphs,将所有启动批处理,延迟降至 1.2ms。

```c
BPF_MAP_DEF(BPF_MAP_TYPE_ARRAY, launch_time);

// CPU-side uprobe captures launch time
// CPU 侧的 uprobe 捕获启动时间
SEC("uprobe/app:cudaLaunchKernel")
int uprobe_launch(struct pt_regs *ctx)
{
    u64 ts_cpu = bpf_ktime_get_ns(); // When did CPU request launch?
    bpf_map_update_elem(&launch_time, &key, &ts_cpu, BPF_ANY);
    return 0;
}

// GPU-side kprobe captures execution start
// GPU 侧的 kprobe 捕获执行开始时间
SEC("kprobe/_Z9vectorAddPKfS0_Pf")
int kprobe_exec()
{
    u64 ts_gpu = bpf_get_globaltimer(); // When did GPU actually start?
    u64 *ts_cpu = bpf_map_lookup_elem(&launch_time, &key);
    if (!ts_cpu)
        return 0;                   // no recorded launch to pair with
    u64 latency = ts_gpu - *ts_cpu; // How long did kernel wait in queue?
    u32 bin = get_hist_bin(latency);
    // Update histogram...
    return 0;
}
```

### Other Examples

### 其他示例

- **[cuda-counter](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/cuda-counter)**: Basic probe/retprobe with timing measurements
- **[mem_trace](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/mem_trace)**: Memory access pattern tracing and analysis
- **[directly_run_on_gpu](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/directly_run_on_gpu)**: Run eBPF programs directly on GPU without attaching to kernels

- **[cuda-counter](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/cuda-counter)**:带计时测量的基本探针/返回探针
- **[mem_trace](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/mem_trace)**:内存访问模式跟踪与分析
- **[directly_run_on_gpu](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu/directly_run_on_gpu)**:直接在 GPU 上运行 eBPF 程序,无需附加到内核

### Key Components

### 关键组件

1. **CUDA Runtime Hooking**: Intercepts CUDA API calls using Frida-based dynamic instrumentation.
2. **PTX/SPIR-V Modification**: Converts eBPF bytecode to PTX (Parallel Thread Execution) or SPIR-V assembly and injects it into GPU kernels.
3. **Helper Trampoline**: Provides GPU-accessible helper functions for map operations, timing, and context access.
4. **Host-GPU Communication**: Enables synchronous calls from GPU to host via pinned shared memory.

1. **CUDA 运行时挂钩**:使用基于 Frida 的动态插桩拦截 CUDA API 调用。
2. **PTX/SPIR-V 修改**:将 eBPF 字节码转换为 PTX(并行线程执行)或 SPIR-V 汇编,并将其注入到 GPU 内核中。
3. **辅助函数蹦床(Trampoline)**:提供 GPU 可访问的辅助函数,用于 map 操作、计时和上下文访问。
4. **主机-GPU 通信**:通过固定的共享内存实现从 GPU 到主机的同步调用。
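The fourth component can be pictured with a minimal sketch of a host-GPU channel over mapped pinned memory. Names and the single-thread handshake are illustrative, not bpftime's actual implementation, which also handles locking across many concurrent GPU threads.

```c
// Hedged sketch: synchronous GPU-to-host signaling over pinned shared memory.
// Illustrative only; assumes a 64-bit system with unified virtual addressing.
#include <cuda_runtime.h>
#include <stdio.h>

struct channel { int flag; unsigned long long payload; };

__global__ void gpu_side(volatile struct channel *ch)
{
    ch->payload = clock64();   // produce a value on the device
    __threadfence_system();    // make the write visible to the host
    ch->flag = 1;              // signal the host
    while (ch->flag != 0) { }  // spin until the host acknowledges
}

int main(void)
{
    struct channel *ch;
    // Mapped pinned allocation: visible to both host and device
    cudaHostAlloc((void **)&ch, sizeof(*ch), cudaHostAllocMapped);
    volatile struct channel *vch = ch;
    vch->flag = 0;
    gpu_side<<<1, 1>>>(ch);
    while (vch->flag != 1) { } // host polls for the GPU's request
    printf("GPU sent %llu\n", vch->payload);
    vch->flag = 0;             // acknowledge; the kernel can now exit
    cudaDeviceSynchronize();
    cudaFreeHost(ch);
    return 0;
}
```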
## References

## 参考文献

1. [bpftime OSDI '25 Paper](https://www.usenix.org/conference/osdi25/presentation/zheng-yusheng)
2. [CUDA Runtime API](https://docs.nvidia.com/cuda/cuda-runtime-api/)
3. [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/)
4. [eBPF Documentation](https://ebpf.io/)
5. [eGPU: Extending eBPF Programmability and Observability to GPUs](https://dl.acm.org/doi/10.1145/3723851.3726984)
6. [NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs](https://github.com/NVlabs/NVBit)