doughgle / decode-kernel-call-trace.md
A Linux kernel call trace, like any backtrace, call stack, or stack trace, lists the chain of function calls, most recent first, that led to a crash. For the Linux kernel, a crash is a panic or an oops.
Without debug symbols, a kernel stack trace may give only a line-by-line list of function symbols, e.g.
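For example, a single entry, reconstructed from the components explained below, looks like this:

```
 ? uncore_pmu_event_add+0xa25/0x16d0
```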
- Here, ‘?’ means that the information about this stack entry is probably not reliable (ref).
- uncore_pmu_event_add is the name of the function symbol.
- +0xa25 represents the offset within the function.
- /0x16d0 is the size of the function.
This cannot be directly correlated to the lines of kernel code for debugging and further analysis. For that we need debug symbols, and a script to decode the call trace.
Kernel image and debug symbols
Debug symbols map memory addresses in the binary image to named variables and functions in the source code. Kernel images do not typically include debug symbols. Stripping debug symbols ensures a lightweight kernel image that can be booted quickly by the bootloader. However, the drawback is that there is limited debug information.
With debug symbols stripped (the default), the uncompressed kernel is 43MB. With debug symbols included, the same kernel version is 744MB.
The decode_stacktrace.sh script
To decode a kernel stacktrace to the line of code in the kernel, we need 4 things:
- The kernel stacktrace, e.g. stacktrace.log.
- The decode_stacktrace.sh script.
- The kernel image (of the same version that produced the stack trace) with debug symbols.
- The kernel source code tree for the same version that produced the stack trace.
The decode_stacktrace.sh script can be found in the kernel source tree, in the scripts directory. Alternatively, it may be found in the kernel headers directory, e.g. /usr/src/linux-headers-5.4.0-80-generic/scripts/decode_stacktrace.sh.
To decode a stacktrace, execute decode_stacktrace.sh:
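A minimal sketch of an invocation, assuming an Ubuntu 5.4.0-80 kernel; the vmlinux and source paths are illustrative and must match your own setup:

```sh
# usage: decode_stacktrace.sh <vmlinux> <base path>, reading the raw trace on stdin;
# run it from the kernel source tree (or from the copy under the kernel headers)
./scripts/decode_stacktrace.sh /usr/lib/debug/boot/vmlinux-5.4.0-80-generic \
    /usr/src/linux-source-5.4.0 < stacktrace.log > stacktrace_decoded.log
```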
Get a kernel image with debug symbols
To get a kernel image with debug symbols, you may need to add a debug symbols repository for the package manager of your Linux distribution. On Ubuntu, you can follow the guidance in Debug Symbol Packages. To find and install a kernel image with debug symbols for the kernel currently running:
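A sketch of those steps on Ubuntu; the repository and package names below are the commonly documented ones, so verify them against the Debug Symbol Packages page for your release:

```sh
# enable the ddebs (debug symbols) repository and its signing key
sudo apt install ubuntu-dbgsym-keyring
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | \
    sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt update

# install the debug-symbol kernel image matching the running kernel
sudo apt install linux-image-$(uname -r)-dbgsym
```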
Once installed, the kernel image can be found in the /usr/lib/debug/boot/ directory.
How do I trace a system call in Linux?
How would I follow a system call from a trap to the kernel, to how arguments are passed, to how the system call is located in the kernel, to the actual processing of the system call in the kernel, to the return back to the user and how state is restored?
SystemTap
This is the most powerful method I’ve found so far. It can even show the call arguments: Does ftrace allow capture of system call arguments to the Linux kernel, or only function names?
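For example, a one-liner that prints every mkdir system call with its arguments (mkdir is just an illustrative probe point; the same pattern works for other syscalls):

```sh
sudo stap -e 'probe syscall.mkdir {
    printf("%s[%d]: %s(%s)\n", execname(), pid(), name, argstr)
}'
```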
Then on another terminal:
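Any command that issues the probed system call will show up; the directory name here is arbitrary:

```sh
mkdir /tmp/stap-test
```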
Tested on Ubuntu 18.04, Linux kernel 4.15.
ltrace -S shows both system calls and library calls
This awesome tool therefore gives even further visibility into what executables are doing.
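A quick illustration (ls is an arbitrary choice of binary):

```sh
ltrace -S ls > /dev/null
# library calls are shown by name (e.g. malloc), while system calls
# appear with a SYS_ prefix (e.g. SYS_openat, SYS_write)
```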
ftrace minimal runnable example
Mentioned at https://stackoverflow.com/a/29840482/895245, but here is a minimal runnable example.
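A sketch using the function_graph tracer through the tracefs interface (run as root; the sleep merely bounds the capture window):

```sh
cd /sys/kernel/debug/tracing        # or /sys/kernel/tracing on newer kernels
echo 0 > tracing_on                 # make sure tracing is off while configuring
echo function_graph > current_tracer
# optional: restrict tracing to one process
# echo <pid> > set_ftrace_pid
echo 1 > tracing_on
sleep 1
echo 0 > tracing_on
head -n 50 trace                    # inspect the captured call graph
```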
One cool thing about this method is that it shows the function calls for all processes on the system at once, although you can also filter for PIDs of interest with set_ftrace_pid.
Tested on Ubuntu 18.04, Linux kernel 4.15.
GDB step debug the Linux kernel
Depending on the level of internals detail you need, this is an option: How to debug the Linux kernel with GDB and QEMU?
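The linked answer has the full details; the rough shape, assuming a locally built kernel and a small initramfs (rootfs.cpio.gz is a placeholder name), is:

```sh
# terminal 1: boot the kernel under QEMU, frozen at startup, with a gdb stub on :1234
qemu-system-x86_64 -kernel arch/x86/boot/bzImage -initrd rootfs.cpio.gz \
    -append "console=ttyS0 nokaslr" -nographic -s -S

# terminal 2: attach gdb using the unstripped vmlinux from the same build
gdb vmlinux -ex 'target remote :1234' -ex 'break start_kernel' -ex 'continue'
```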
strace minimal runnable example
Here is a minimal runnable example of strace: How should strace be used? with a freestanding hello world, which makes it perfectly clear how everything works.
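The linked answer builds a freestanding hello world; in its simplest form the idea is (echo stands in for that binary):

```sh
strace -e trace=write echo hello
# the output ends with the write(1, "hello\n", 6) call that prints the string
```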
How to Trace Linux System Calls in Production (Without Breaking Performance)
If you need to dynamically trace Linux process system calls, you might first consider strace. strace is simple to use and works well for issues such as "Why can't the software run on this machine?" However, if you're running a trace in a production environment, strace is NOT a good choice. It introduces a substantial amount of overhead. According to a performance test conducted by Arnaldo Carvalho de Melo, a senior software engineer at Red Hat, the process traced using strace ran 173 times slower, which is disastrous for a production environment.
So are there any tools that excel at tracing system calls in a production environment? The answer is YES. This blog post introduces perf and traceloop, two commonly used command-line tools, to help you trace system calls in a production environment.
perf, a performance profiler for Linux
perf is a powerful Linux profiling tool, refined and upgraded by Linux kernel developers. In addition to common features such as analyzing Performance Monitoring Unit (PMU) hardware events and kernel events, perf has the following subcomponents:
- sched: Analyzes scheduler actions and latencies.
- timechart: Visualizes system behaviors based on the workload.
- c2c: Detects the potential for false sharing. Red Hat once tested the c2c prototype on a number of Linux applications and found many cases of false sharing, along with the hot cache lines involved.
- trace: Traces system calls with acceptable overhead. A process traced this way runs only 1.36 times slower in a dd-based workload test.
Let’s look at some common uses of perf.
To see which commands made the most system calls:
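One way to do this is to count raw_syscalls:sys_enter events live and sort by command; the exact options here are illustrative:

```sh
sudo perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso
```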
From the output, you can see that the kube-apiserver command had the most system calls during sampling.
To see system calls with latencies longer than a given duration, 200 milliseconds in the following example:
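A sketch using perf trace's duration filter (the value is in milliseconds):

```sh
sudo perf trace --duration 200
```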
From the output, you can see the process names, process IDs (PIDs), the specific system calls that exceed 200 ms, and the returned values.
To see the processes that had system calls within a period of time and a summary of their overhead:
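For example, a system-wide summary over a ten-second window (the sleep only bounds the collection period):

```sh
sudo perf trace -a -s -- sleep 10
```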
From the output, you can see how many times each system call was made, how many of those calls returned errors, the total latency, the average latency, and so on.
To analyze the stack information of calls that have a high latency:
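One way, assuming your perf build supports DWARF call graphs, is to combine the duration filter with --call-graph; the PID is a placeholder:

```sh
sudo perf trace --duration 200 --call-graph dwarf -p <pid>
```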
To trace a group of tasks: suppose, for example, that two BPF tools are running in the background. To see their system call information, you can add them to a perf_event cgroup and then execute perf trace:
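A sketch, assuming a cgroup v1 perf_event hierarchy mounted at the usual path and a perf recent enough to filter by cgroup; the cgroup name and PIDs are placeholders:

```sh
# put the two background processes into a new perf_event cgroup
sudo mkdir /sys/fs/cgroup/perf_event/bpftools
echo <pid1> | sudo tee    /sys/fs/cgroup/perf_event/bpftools/cgroup.procs
echo <pid2> | sudo tee -a /sys/fs/cgroup/perf_event/bpftools/cgroup.procs

# trace only tasks in that cgroup for ten seconds
sudo perf trace -G bpftools -a -- sleep 10
```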
Those are some of the most common uses of perf. If you'd like to know more (especially about perf-trace), see the Linux manual page. From the manual pages, you will learn that perf-trace can filter tasks based on PIDs or thread IDs (TIDs), but that it has no convenient support for containers or Kubernetes (K8s) environments. Don't worry. Next, we'll discuss a tool that can easily trace system calls in containers and in K8s environments that use cgroup v2.
Traceloop, a performance profiler for cgroup v2 and K8s
Traceloop provides better support for tracing Linux system calls in containers or K8s environments that use cgroup v2. You might be unfamiliar with traceloop but know the BPF Compiler Collection (BCC), whose front end is implemented in Python or C++, pretty well. In the IO Visor Project, BCC's parent project, there is another project named gobpf that provides Golang bindings for the BCC framework. traceloop is built on gobpf and targets container and K8s environments. The following illustration shows the traceloop architecture:
We can further simplify this illustration into the following key procedures. Note that these procedures are implementation details, not operations to perform:
- A BPF helper gets the cgroup ID. Tasks are filtered based on the cgroup ID rather than on the PID or TID.
- Each cgroup ID corresponds to a BPF tail call, which can invoke another eBPF program and replace the execution context. Syscall events are written, via the tail call, to a perf ring buffer associated with the same cgroup ID.
- User space reads the perf ring buffer based on this cgroup ID.
Currently, you can get the cgroup ID only by calling the BPF helper bpf_get_current_cgroup_id, and this ID is available only in cgroup v2. Therefore, before you use traceloop, make sure that cgroup v2 is enabled in your environment.
In the following demo (on the CentOS 8 4.18 kernel), when traceloop exits, the system call information is traced:
As the results show, the traceloop output is similar to that of strace or perf-trace, except for the cgroup-based task filtering. Note that CentOS 8 mounts cgroup v2 directly on the /sys/fs/cgroup path instead of on /sys/fs/cgroup/unified as Ubuntu does. Therefore, before you use traceloop, run mount -t cgroup2 to check where cgroup v2 is mounted.
The team behind traceloop has integrated it with the Inspektor Gadget project, so you can run traceloop on the K8s platform using kubectl. See the demos in Inspektor Gadget – How to use and, if you like, try it on your own.
Benchmark with system calls traced
We conducted a sysbench test in which system calls were either traced using multiple tracers (traceloop, strace, and perf-trace) or not traced. The benchmark results are as follows:
As the benchmark shows, strace caused the biggest decrease in application performance. perf-trace caused a smaller decrease, and traceloop caused the smallest.
Summary of Linux profilers
For issues such as “Why can’t the software run on this machine,” strace is still a powerful system call tracer in Linux. But to trace the latency of system calls, the BPF-based perf-trace is a better option. In containers or K8s environments that use cgroup v2, traceloop is the easiest to use.