bpftrace

Linux Kernel observability made fun, creative and easy.

eBPF is a Linux technology that allows running programs inside the Kernel without changing its code or deploying extra modules, while bpftrace is a tool and language that serve as a frontend for interacting with Linux Kernel and driver interfaces.

Introduction

Berkeley Packet Filter (BPF) is a technology created in 1992 to filter incoming network packets and make them available to user space (without copying them out of the kernel), increasing the performance of network monitoring tools. As time passed BPF evolved, and in 2013 it was replaced in the Linux Kernel by a general-purpose virtual machine capable of filtering not only networking events but everything inside the kernel and its drivers; this technology was given the name eBPF (Extended Berkeley Packet Filter).

eBPF consists of a syscall that allows creating event-based programs which are validated for security and safety before being compiled into an instruction set and submitted to the Kernel for execution. As Kernel events are triggered the program takes its action and stores output data into a map that can later be copied to user space; additionally, a set of extensible helper functions is provided out of the box, increasing the range of problems that can be solved.

bpftrace consists of a tool and a high-level language that serve as a frontend, allowing developers to create BPF scripts that are loaded into the Kernel to monitor pre-defined events and take the necessary action, which makes it a very suitable tool for Ops engineers.

For example, the following script counts every syscall entry, grouping the counts by process name:

giscard@bpf# bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
Attaching 1 probe...
^C

@[gsd-printer]: 1
@[memballoon]: 2
@[vmstats]: 2
@[C2 CompilerThre]: 2
@[C1 CompilerThre]: 2
@[GUsbEventThread]: 2
@[Flushing Daemon]: 3
@[rs:main Q:Reg]: 3
@[in:imuxsock]: 3

That is not all: there are several other ways to interact with eBPF, for example using the BPF Compiler Collection (BCC), which provides wrappers for languages like Python and C, or one of several libraries such as gobpf or libbpf.

Installation

Although instructions on how to install bpftrace are available at the iovisor project, keep in mind that old Kernel versions may not support all of its functionality, so it is recommended (at least for learning purposes) to install a kernel newer than 5.4. Also, I would recommend installing the linux-headers package for your kernel, build-essential and manpages-dev, and be aware that extra packages might be necessary, like bpfcc-tools.

sudo apt-get install linux-headers-$(uname -r)
sudo apt install build-essential
sudo apt-get install manpages-dev
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)

As usual in Linux, different distributions package the BPF tools differently; one slightly annoying detail on Ubuntu is that the BCC tools are all suffixed with bpfcc, which is not the case on Debian.

Command Line

bpftrace can be used with a one-liner script or by passing a file containing the script to be run as an argument. In the first case the option “-e” is used followed by a single-quoted one-liner, while in the second case the script file is passed as the argument:

  • One-liner
# bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
  • Script
# bpftrace syscall_per_process.bt
--- syscall_per_process.bt file ----
tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }

On several occasions it might be necessary to reuse a C struct from the kernel itself or from some other library; for such cases it is possible to use “-I <folder>” or “--include <path>”, as in the following example:

# bpftrace -I /usr/src/linux-headers-5.4.0-53/include tcp_rcv.bt
--- tcp_rcv.bt ---
#include <net/sock.h>
#include <linux/tcp.h>

kprobe:tcp_rcv_established
{
  $sock = (struct sock*)arg0;
  $tcps = (struct tcp_sock*)arg0;
}

Last but not least comes the “-l” option, which allows searching all available events using wildcards; adding the “-v” option lists argument details:

giscard@bpf# bpftrace -l "*syscalls*"
tracepoint:syscalls:sys_enter_socket
tracepoint:syscalls:sys_exit_socket
tracepoint:syscalls:sys_enter_socketpair
tracepoint:syscalls:sys_exit_socketpair
tracepoint:syscalls:sys_enter_bind

giscard@bpf# bpftrace -lv 'tracepoint:syscalls:sys_enter_socket'
tracepoint:syscalls:sys_enter_socket
    int __syscall_nr;
    int family;
    int type;
    int protocol;

For a full description and updated list of parameters supported by bpftrace access its usage reference guide.

Language

Most of the structure, like comments, conditions, array access, casting, assignment and arithmetic operations, follows the C language syntax, with the constraint that loops are still experimental; there is also nice tuple support, whose fields can be read using dot notation like a C struct.

Global variables are declared starting with “@” while local variables start with “$”. The language also supports associative arrays using the syntax “@var_name[keys…]”, where the key values are hashed, creating an efficient map to store and retrieve values. Several built-in variables and functions are also provided; they are described in the built-in section below.
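
As a small sketch of those rules (the probe chosen here, do_nanosleep, is arbitrary and only serves as an illustration), the block below mixes a local scratch variable, a global counter and an associative array keyed by process name:

kprobe:do_nanosleep
{
  $now = nsecs;             // local variable, visible only inside this block
  @last_seen[comm] = $now;  // associative array (map) keyed by process name
  @total = count();         // keyless global map, printed automatically on exit
}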

The biggest change is that the whole program is event based, which requires the programmer to define every event being tracked, together with an optional filter that is evaluated before the code gets executed.

The script below tracks all syscalls whose return value is -13 (EPERM), which basically prints every process trying to invoke a function it is not allowed to:

tracepoint:raw_syscalls:sys_exit
/args->ret == -13/
{
  printf("EPERM: %s -> %s\n", comm, ksym(*(kaddr("sys_call_table") + args->id * 8)));
}

It is worth mentioning that events can be defined using the “*” wildcard; for example kprobe:vfs_read* includes vfs_read, vfs_readv and vfs_readlink.
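
As a quick illustration of wildcards, the one-liner below attaches to every matching function and counts hits per expanded probe name using the probe built-in:

# bpftrace -e 'kprobe:vfs_read* { @[probe] = count(); }'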

For a full description of the bpftrace language access its language reference guide.

Events

Events are probably the most important topic when using bpftrace, since knowing the events in depth is what allows functionality to be properly instrumented and tracked. In fact that is the hardest part when learning eBPF, since the language itself is very simple and the constraints that keep it safe leave no room for more complex programming structures.

Learning how to properly use all the events, mainly the ones related to Linux Kernel functionality, requires deep experience and knowledge of how Linux works, which most engineers do not have. A personal hint for anyone trying to ramp up faster is to buy the book BPF Performance Tools by Brendan Gregg, which goes beyond explaining many of the existing events and also points to several tools that help analyze common problems a company might be facing.

The simplest events that can be used are BEGIN and END, which mark the start and end of a BPF program. They might seem useless at first glance, however they can be used to print usage information and final output respectively. END can also be used to clean up global variables, preventing them from being printed when the program finishes (by default bpftrace prints all global variables).

BEGIN
{
  if($1 == 0){
    printf("Usage my.bt <pid>\n");
  } else {
    printf("Starting tracking. Hit Ctrl-C to end\n");
  }

  @map["k1"] = 5;
  @map["k2"] = 10;
  @value = @map["k1"] + @map["k2"];

}

END
{
  clear(@map);
}

It is also possible to define events that trigger at a specific interval on a single CPU (interval), which is good for printing results, or that profile all CPUs (profile), which should be used to track ongoing execution. Both support sampling in seconds (s), milliseconds (ms), microseconds (us) or hertz (hz).

An important tip when using the profile event is to avoid round numbers (e.g. 100 seconds), as those values can match a specific cycle inside the application and result in a biased analysis, so prefer a slightly different number like 99 instead of 100; the same is true for hz, us and ms, where the minus-one rule plays well enough.

The following script profiles all CPUs, counting the process currently on CPU, and prints the results every 10 seconds:

interval:s:10
{
  time("%H:%M:%S\n");
  print(@app);
}

profile:hz:49
{
  @app[comm] = count();
}

Up until now, not a single event from the Linux Kernel or any 3rd party library or application has been used, which does not allow us to do that much. Such events, also called probes, can be subdivided into Tracepoints (tracepoint), Software Events (software), Hardware Events (hardware), Kernel Probes (kprobe or kretprobe), User Probes (uprobe or uretprobe) and User-level Statically Defined Tracing (usdt).

Linux Tracepoints are hard-coded hooks inside the Kernel that allow code to be executed when the event fires (which is exactly what bpftrace does). One good thing about tracepoints is that they have a contract and are quite stable across kernel versions (backward compatibility), and they exist for almost all system calls, so always prefer them. The not-so-good news is that there are not that many tracepoints, so they are usually not enough to do everything we might have in mind when trying to observe or monitor specific features.

It is possible to use bpftrace -l 'tracepoint:*' to display all available tracepoints and the “-v” option to print their arguments. In my view the hardest thing about tracepoints is understanding how to use their arguments (accessible via the args built-in variable), which can be done by finding documentation or looking inside the Kernel code itself. So here comes another tip: use the kstack bpftrace function to find where the tracepoint was invoked, then download the kernel code and dig into it.
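
As a sketch of that tip, the following one-liner aggregates the kernel stacks observed at the block:block_rq_issue tracepoint, revealing the code paths that end up issuing block I/O requests:

# bpftrace -e 'tracepoint:block:block_rq_issue { @[kstack] = count(); }'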

As shown previously, the following code tracks the return of all syscalls, filtering the ones where permission was denied (EPERM); observe how the args variable is used:

tracepoint:raw_syscalls:sys_exit
/args->ret == -13/
{
  printf("EPERM: %s -> %s\n", comm, ksym(*(kaddr("sys_call_table") + args->id * 8)));
}

Now imagine you want to track something that is going on but there is no tracepoint available – at the time this post was written there were around 2K tracepoints against more than 50K Kernel Probes. One solution would be to send a patch to the Linux Kernel adding a new tracepoint (which could take a while), or just pick one of the thousands of kernel functions and track it straight away. Even so, it is important to keep in mind that Kernel Probes are “just” C functions inside the Kernel and, like any function or method inside a program, they are subject to change without notice.

Personally, I like this “private API” approach: it is much better to assume the risk and get the problem solved than having to wait months, or sometimes forever, for a strict contract.

To see the list of Kernel Probes the command bpftrace -l "kprobe:*" can be used; the “-v” option does not display extra information here, even though most of the functions can easily be found inside the kernel source code. For each kprobe event there is also a kretprobe for its return; arguments and the return value can be accessed using the arg0, arg1, …, argN and retval variables respectively.

The following script tracks all calls to vfs_read and computes how much time was spent inside them, producing a latency histogram grouped by thread ID and file mode:

#include <linux/fs.h>

kprobe:vfs_read
{
  $file = (struct file *)arg0;
  @start[tid] = nsecs;
  @mode[tid] = $file->f_inode->i_mode;
}

kretprobe:vfs_read
/@start[tid] && retval >= 0/
{
  $duration_us = (nsecs - @start[tid]) / 1000;
  $mode = @mode[tid];
  @us[tid, $mode] = hist($duration_us);
  delete(@start[tid]);
  delete(@mode[tid]);
}

Software and hardware events are still intimately related to kernel functionality and shall be used with caution, since they can trigger millions of times, putting any tracing under pressure and affecting overall system performance.

While software events are generated by the kernel on behalf of the currently running processes (page faults, for example), the hardware ones come from the processor performance counters (cache misses or CPU cycles, for instance). Both software and hardware events can be listed using bpftrace -l 'software:*' and bpftrace -l 'hardware:*'.

The following program samples one in every hundred page faults, grouping them by user stack and process name:

software:page-faults:100
{
  @[ustack, comm] = count();
}
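
A hedged hardware counterpart would be sampling, for example, one in every million cache misses and counting them per process (the available counter names depend on the CPU, so check bpftrace -l 'hardware:*' first):

hardware:cache-misses:1000000
{
  @misses[comm] = count();
}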

Finally, we have user events, which from the kernel perspective are everything coming from user space. User events can be split into User Probes (like Kernel Probes, i.e. functions inside user programs or libraries) and User-level Statically Defined Tracing (like tracepoints, i.e. tracepoints inside user programs or libraries).

Similarly to kprobes, uprobes also have their uretprobes and use arg0, …, argN and retval to access arguments and the return value. Additionally, they are not meant to be a public contract and may change as time passes, differently from USDT probes, whose authors are expected to keep backward compatibility or modify them following stricter rules.

One thing to keep in mind when using uprobes is that it is necessary to point explicitly to the library/program path so BPF knows how to find it; it is also important to have a build containing the ELF debug information so stack traces can point to the right symbols. A list of uprobes or USDT probes can be obtained using bpftrace -l 'uprobe:<path>:*' and bpftrace -l 'usdt:<path>:*' respectively.
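
For example, assuming the library path below exists on the system (it is the same one used in the next script), the mutex-related uprobes offered by pthread could be listed with:

# bpftrace -l 'uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:*mutex*'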

The following code calculates the latency of acquiring a mutex lock for a given process PID by tracing the well-known pthread library:

uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_mutex_lock
/pid == $1/
{
  @lock_start[tid] = nsecs;
  @lock_addr[tid] = arg0;
}

uretprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_mutex_lock
/pid == $1 && @lock_start[tid]/
{
  @lock_latency_ns[usym(@lock_addr[tid])] = hist(nsecs - @lock_start[tid]);
  delete(@lock_start[tid]);
  delete(@lock_addr[tid]);
}

Built-in

Built-ins are an intrinsic part of a bpftrace program and can be used straight away, saving programmers time and preventing them from reinventing the wheel for common problems. They have already been used in several of the examples above without getting into details, and are divided into three main categories: variables, functions and map functions.

Variables are accessed by name and come pre-assigned with a value. For example pid contains the process ID, tid the thread ID and cgroup the cgroup ID; in the same way it is possible to get the user ID and group ID using uid and gid. Getting the current time is as easy as typing nsecs, while the time since the BPF program started is available through elapsed.

It is also possible to print the Kernel/User stack using kstack and ustack, or get the current CPU and process name using cpu and comm; another commonly used variable is curtask, which points to the current Kernel task as a task_struct.

A special set of variables $1, $2, …, $N represents the Nth parameter passed to the BPF program (or 0 if none is provided), while $# holds the number of parameters passed.
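
A minimal sketch of parameter handling, validating that a PID was supplied before tracing starts (tool.bt is a hypothetical script name):

BEGIN
{
  if ($# < 1) {
    printf("Usage: tool.bt <pid>\n");
    exit();
  }
  printf("Tracing pid %d (%d parameter(s) received)\n", $1, $#);
}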

Functions serve multiple purposes: the most used ones help print output, like print(), printf() and time(), or handle/format strings, like str(), join(), buf(), strncmp() and strftime(). Other functions translate a kernel or user address to a symbol name or vice-versa, which is the case of ksym(), usym(), kaddr() and uaddr(), or print user/kernel stacks with a limited number of frames, like ustack() and kstack(). Finally there are functions to translate IP addresses like ntop(), read a processor register value using reg(), execute a command line using system() or read a file using cat().
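
For instance, str() can turn the user-space pointer carried by the openat tracepoint into a readable file name, as in this small sketch:

tracepoint:syscalls:sys_enter_openat
{
  printf("%s opened %s\n", comm, str(args->filename));
}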

Map Functions are the ones used on map variables (the ones defined as @var[key…]) and are very handy for any program trying to gather statistics or make sense of code behavior. The function count() accumulates how many times the map entry was updated, avg() automatically calculates the average, sum() the total, while stats() gets them all together in a single shot. There are also min() and max() to store the minimum and maximum values respectively.

Since histograms are very useful to display statistical data, it is possible to use hist() and lhist() to store a log2 and a linear representation respectively; finally there are functions to print(), clear(), zero() or delete() elements from the maps.
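
A short sketch combining those map functions, summarizing the bytes returned by successful vfs_read calls per process:

kretprobe:vfs_read
/retval > 0/
{
  @bytes[comm] = stats(retval);  // count, average and total in a single entry
  @sizes[comm] = hist(retval);   // log2 histogram of read sizes
}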

For a full description of the bpftrace language access its builtin variable reference guide and builtin functions reference guide.

BCC Tools

As mentioned before, bpftrace is useful to create ad hoc tools that monitor or observe a very specific functionality. One of the biggest drawbacks of bpftrace scripts is that variables cannot be used to define events, as in the example below, which would otherwise be a generic tool to calculate latency for any given Kernel Probe (the probe name passed as $1 and a PID filter as $2):

kprobe:$1
/pid == $2/
{
  @start[tid] = nsecs;
}

kretprobe:$1
/pid == $2 && @start[tid]/
{
  @latency_ns[probe, comm] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}

Fortunately, there are BCC Tools already installed that can be very handy, so a programmer does not have to create such tools themselves. There are dozens of BCC Tools, so I will focus on the generic ones that can be used in any scenario; for a comprehensive list of all tools please check BCC Tools.

The tool funccount (or funccount-bpfcc on Ubuntu) allows defining a duration to track a probe pattern and counts how many times each event fired, printing the results at a given interval and filtering by a specific PID, as in the following invocation which tracks all syscall tracepoints (“t”) for a firefox process:

giscard@bpf# funccount-bpfcc -p 45945 -d 10 -i 1 't:syscalls:sys_enter_*'
Tracing 332 functions for "b't:syscalls:sys_enter_*'"... Hit Ctrl-C to end.

FUNC                                    COUNT
syscalls:sys_enter_mmap                    30
syscalls:sys_enter_getpid                  31
syscalls:sys_enter_unlink                  32
syscalls:sys_enter_writev                  32
syscalls:sys_enter_openat                  32
syscalls:sys_enter_munmap                  32
syscalls:sys_enter_close                  128
syscalls:sys_enter_epoll_wait             269
syscalls:sys_enter_sendmsg                304
syscalls:sys_enter_write                  381
syscalls:sys_enter_read                   383
syscalls:sys_enter_poll                   609
syscalls:sys_enter_futex                  877
syscalls:sys_enter_recvmsg                939

NOTE: stackcount (stackcount-bpfcc) performs the same job, however grouping the data per stack.
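
A hedged invocation, reusing the same firefox PID from above and grouping write syscalls per stack, might look like:

# stackcount-bpfcc -p 45945 't:syscalls:sys_enter_write'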

Similarly, the tool argdist (or argdist-bpfcc on Ubuntu) counts the number of times a probe is triggered considering its arguments or return value, like the invocation below that groups sys_enter_read based on the size of the buffer:

giscard@bpf# argdist-bpfcc -p 45945 -d 10 -i 1 -C 't:syscalls:sys_enter_read(int __syscall_nr, unsigned int fd, char* buf, size_t count):int:args->count'
[16:02:14]
t:syscalls:sys_enter_read(int __syscall_nr, unsigned int fd, char* buf, size_t count):int:args->count
    COUNT      EVENT
    1          args->count = 8
    2          args->count = 2048
    3          args->count = 10
    3          args->count = 1

Another example is the tool trace (or trace-bpfcc on Ubuntu), which prints arguments or return values for any given function or pattern. The example below gives more detail on which process and thread is invoking the syscall, along with the count parameter:

giscard@bpf# trace-bpfcc -p 45945 't:syscalls:sys_enter_read "%d",args->count'
PID     TID     COMM            FUNC             -
45945   45945   firefox         sys_enter_read   1
45945   45945   firefox         sys_enter_read   1
45945   45970   Gecko_IOThread  sys_enter_read   1
45945   45970   Gecko_IOThread  sys_enter_read   1
45945   48340   threaded-ml     sys_enter_read   10

Although BCC Tools and bpftrace scripts are similar, they are not the same: BCC tools have their own way of describing probes. For example, you cannot use tracepoint, use t instead; the same is true for kprobe/uprobe, which are replaced by p, and kretprobe/uretprobe, replaced by r; finally USDT is shortened to u.
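
As an illustration of that mapping (bpftrace probe on the left, the equivalent BCC tools spec on the right, assuming the trace/argdist syntax):

kprobe:vfs_read                      ->  p::vfs_read
kretprobe:vfs_read                   ->  r::vfs_read
tracepoint:syscalls:sys_enter_read   ->  t:syscalls:sys_enter_read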

Resources

Since bpftrace allows monitoring any kernel or user functionality, innumerable tools can be created, each of them focusing on a specific scenario or problem. Even though the possibilities are many, most of them will target common hardware resources like CPU, memory, file systems, disks or network.

While looking at CPUs it might be interesting to track scheduler queue time, processes being started, utilization, off-CPU or on-CPU time, cache misses and interrupts; for resources like memory it is possible to track leaks, page faults, swapping, heap expansion, shared memory and so forth.

File systems can be tracked for file details and buffer sizes, and page cache hits or misses; it is also possible to track read-aheads performed by the Kernel and inode caches to see which process is using them. When looking at disk operations it is possible to calculate latency, random versus sequential operations, time spent in queue and I/O errors.

From the networking perspective it is like having tcpdump on steroids: it is possible to monitor socket statistics per protocol type, check queue latency and size, the latency to open connections or to exchange packets; it is even possible to observe segmentation and congestion control, or get inner details like TCP windowing.
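
As one hedged illustration on the networking side, counting TCP retransmissions per process only requires a kprobe on the tcp_retransmit_skb kernel function (note that the process on CPU at retransmit time is not necessarily the connection owner):

kprobe:tcp_retransmit_skb
{
  @retransmits[comm] = count();
}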

In addition to everything mentioned, it is possible to use bpftrace to monitor security violations or DDoS attacks, or even to create tools that mirror all commands typed in a bash session or log every root access request and its subsequent operations. Going a bit further, it can be used to track and monitor the Kernel itself in such a way that it is possible to understand its behavior or profile it for performance improvements.

Other Tools

Despite BPF having been created almost three decades ago and eBPF having been incorporated into the Linux Kernel in 2013, we are talking about a relatively new technology that is still being discovered by the community. Nonetheless, there are already powerful tools based on eBPF that deliver high value for a company with systems deployed in the Cloud across hundreds or even thousands of hosts.

One of those tools is Performance Co-Pilot (PCP), which distributes agents throughout the servers and collects several metrics, including the ones gathered by existing BCC Tools or bpftrace scripts, and can be extended with custom ones created as necessary. When integrated with Grafana or Prometheus, PCP delivers an end-to-end solution for collecting, visualizing and alerting on these inner, customizable metrics.

Kubectl-Trace is an effort to integrate Kubernetes with BPF so that ad hoc scripts, commands or generic tools can be distributed and applied to all Nodes with a single command line typed by the cluster administrator.

A use case different from observability is the Cilium project, which uses BPF events to create network policies that prevent or allow hosts to talk to each other; it also collects metrics detailing inter-component communication and can even load balance incoming packets. Additionally, it is fully integrated with Kubernetes by providing a Container Network Interface (CNI).

One final example is the Falco project, which uses BPF to create security rules that constantly monitor hosts and containers, collecting metrics and generating alarms if any rule is violated, or even taking corrective actions like killing a misbehaving process or blocking an unconventional user. Similarly to the other projects mentioned, it can be integrated with Kubernetes to apply pod policies and restrictions beforehand.

Conclusion

Linux has long been the most used OS platform for servers; some statistics show it serves over 96% of all top domains on the Internet, and it has also taken a big bite of the supercomputer market. Even though it is an amazing and free platform surrounded by great engineers and an empowered community, it is software like any other, therefore subject to bugs, performance issues and design improvements; additionally, we are talking about software that is highly customizable and parameterizable, whose default configuration is unlikely to be the best match for a random application.

Nonetheless, most companies or teams do not have enough information or knowledge of its details, which might be the root cause of several performance problems or hardware under-utilization that end up translating into higher costs.

While huge companies understand that spending millions of dollars or thousands of hours on improving application performance is worth the investment, and small ones push it to a future backlog, few invest in an outstanding Ops team able to dig into the depths of Linux and provide the same gains in a cheaper and faster way.

eBPF is the kind of technology that can shorten the time and effort necessary to find bottlenecks and improve the overall performance of Linux usage (and of 3rd party components/libraries), while at the same time giving good hints on which resources a specific application is overusing and should improve.

Talking about user-space application monitoring, that is probably the biggest drawback of eBPF: since it heavily relies on events based on function calls and a symbol table, most applications today won't be able to enjoy it out of the box. The main reason is that most libraries are stripped of their debug information for performance or are compiled with optimizations that might mislead the tracing, which can be worked around by keeping such a version on disk for tracing purposes only.

Additionally, VM-based or interpreted languages do not have a static symbol table, which makes it harder to apply BPF to them; in my view that does not scratch BPF's beauty and utility, since most of those platforms provide their own set of profiling and debugging tools.

Learning the internals of BPF might be cumbersome for someone not experienced with the Linux Kernel, even though that is for sure no excuse to keep your eyes off it: most of the cumbersome tasks are already provided as tools by open source projects like IOVisor, and the basics of creating your own scripts using bpftrace should not require more than 20 hours, plus maybe an extra 100 for better practice and understanding.

Finally, the most important part, which is an ecosystem that allows distributing all this tracing and observability across a cluster of hundreds or thousands of hosts, has already been developed, which makes the opportunity to use it in your cloud environment or personal datacenter a steal; there is no excuse not to plan on doing so.

If none of the observability and performance gains has convinced you that BPF is worth the time, or at least some extra attention, it is also time to look into it as a security tool capable of working around known vulnerabilities or Zero-Day flaws from day zero.
