Operating systems have historically provided many tools for observing system software and hardware components. To the newcomer, the wide range of available tools and metrics suggested that everything — or at least everything important — could be observed. In reality there were many gaps, and systems performance experts became skilled in the art of inference and interpretation: figuring out activity from indirect tools and statistics. For example, network packets could be examined individually (sniffing), but disk I/O could not (at least, not easily).
Observability has greatly improved in Linux thanks to the rise of dynamic tracing tools, including the BPF-based BCC and bpftrace. Dark corners are now illuminated, including individual disk I/O using biosnoop(8). However, many companies and commercial monitoring products have not yet adopted system tracing, and are missing out on the insight it brings. I have led the way by developing, publishing, and explaining new tracing tools, tools already in use by companies such as Netflix and Facebook.
The learning objectives of this chapter are:
In Chapter 1 I introduced different types of observability: counters, profiling, and tracing, as well as static and dynamic instrumentation. This chapter explains observability tools and their data sources in detail, including a summary of sar(1), the system activity reporter, and an introduction to tracing tools. This gives you the essentials for understanding Linux observability; later chapters (6 to 11) use these tools and sources to solve specific issues. Chapters 13 to 15 cover the tracers in depth.
This chapter uses the Ubuntu Linux distribution as an example; most of these tools are the same across other Linux distributions, and some similar tools exist for other kernels and operating systems where these tools originated.
4.1 Tool Coverage
Figure 4.1 shows an operating system diagram that I have annotated with the Linux workload observability tools1 relevant to each component.
1When teaching performance classes in the mid-2000s, I would draw my own kernel diagram on a whiteboard and annotate it with the different performance tools and what they observed. I found it an effective way for explaining tool coverage as a form of mental map. I’ve since published digital versions of these, which adorn cubicle walls around the world. You can download them on my website [Gregg 20a].
Figure 4.1 Linux workload observability tools
Most of these tools focus on a particular resource, such as CPU, memory, or disks, and are covered in a later chapter dedicated to that resource. There are some multi-tools that can analyze many areas, and they are introduced later in this chapter: perf, Ftrace, BCC, and bpftrace.
4.1.1 Static Performance Tools
There is another type of observability that examines attributes of the system at rest rather than under active workload. This was described as the static performance tuning methodology in Chapter 2, Methodologies, Section 2.5.17, Static Performance Tuning, and these tools are shown in Figure 4.2.
Figure 4.2 Linux static performance tuning tools
Remember to use the tools in Figure 4.2 to check for issues with configuration and components. Sometimes performance issues are simply due to a misconfiguration.
4.1.2 Crisis Tools
When you have a production performance crisis that requires various performance tools to debug it, you might find that none of them are installed. Worse, since the server is suffering a performance issue, installing the tools may take much longer than usual, prolonging the crisis.
For Linux, Table 4.1 lists the recommended installation packages or source repositories that provide these crisis tools. Package names for Ubuntu/Debian are shown in this table (these package names may vary for different Linux distributions).
Table 4.1 Linux crisis tool packages
Large companies, such as Netflix, have OS and performance teams who ensure that production systems have all of these packages installed. A default Linux distribution may only have procps and util-linux installed, so all the others must be added.
In container environments, it may be desirable to create a privileged debugging container that has full access to the system2 and all tools installed. The image for this container can be installed on container hosts and deployed when needed.
2It could also be configured to share namespaces with a target container to analyze.
Adding tool packages is often not enough: kernel and user-space software may also need to be configured to support these tools. Tracing tools typically require certain kernel CONFIG options to be enabled, such as CONFIG_FTRACE and CONFIG_BPF. Profiling tools typically require software to be configured to support stack walking, either by using frame-pointer compiled versions of all software (including system libraries: libc, libpthread, etc.) or debuginfo packages installed to support dwarf stack walking. If your company has yet to do this, you should check that each performance tool works and fix those that do not before they are urgently needed in a crisis.
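One way to prepare is to check the kernel build configuration for the needed options ahead of time. The following one-liner is a sketch of such a check, assuming an Ubuntu-style config file under /boot (some distributions instead expose /proc/config.gz when built with CONFIG_IKCONFIG_PROC):
$ grep -E 'CONFIG_(FTRACE|KPROBES|UPROBES|BPF_SYSCALL)=' /boot/config-$(uname -r)
Options missing from the output may mean the corresponding tracing tools will not work on that kernel.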
The following sections explain performance observability tools in more detail.
4.2 Tool Types
A useful categorization for observability tools is whether they provide system-wide or per-process observability, and whether they are based on counters or events. These attributes are shown in Figure 4.3, along with Linux tool examples.
Figure 4.3 Observability tool types
Some tools fit in more than one quadrant; for example, top(1) also has a system-wide summary, and system-wide event tools can often filter for a particular process (-p PID).
Event-based tools include profilers and tracers. Profilers observe activity by taking a series of snapshots on events, painting a coarse picture of the target. Tracers instrument every event of interest, and may perform processing on them, for example to generate customized counters. Counters, tracing, and profiling were introduced in Chapter 1.
The following sections describe Linux tools that use fixed counters, tracing, and profiling, as well as those that perform monitoring (metrics).
4.2.1 Fixed Counters
Kernels maintain various counters for providing system statistics. They are usually implemented as unsigned integers that are incremented when events occur. For example, there are counters for the number of network packets received, disk I/O issued, and interrupts that occurred. These are exposed by monitoring software as metrics (see Section 4.2.4, Monitoring).
A common kernel approach is to maintain a pair of cumulative counters: one to count events and the other to record the total time in the event. These provide the count of events directly and the average time (or latency) in the event, by dividing the total time by the count. Since they are cumulative, by reading the pair at a time interval (e.g., one second) the delta can be calculated, and from that the per-second count and average latency. This is how many system statistics are calculated.
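To illustrate the calculation, the following bash sketch reads the cumulative read count and read time for one disk from /proc/diskstats twice, one second apart, and computes the per-second rate and average read latency from the deltas (the device name nvme0n1 is an assumption; the field positions follow the documented /proc/diskstats layout):
dev=nvme0n1
read c1 t1 < <(awk -v d=$dev '$3 == d { print $4, $7 }' /proc/diskstats)   # reads completed, ms spent reading
sleep 1
read c2 t2 < <(awk -v d=$dev '$3 == d { print $4, $7 }' /proc/diskstats)
echo "reads/s: $((c2 - c1))"
awk -v c=$((c2 - c1)) -v t=$((t2 - t1)) \
    'BEGIN { if (c) printf "avg read latency: %.2f ms\n", t / c }'
This mirrors how tools such as iostat(1) derive their per-second rates and average wait times.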
Performance-wise, counters are considered “free” to use since they are enabled by default and maintained continually by the kernel. The only additional cost when using them is the act of reading their values from user-space (which should be negligible). The following example tools read these system-wide or per process.
System-Wide
These tools examine system-wide activity in the context of system software or hardware resources, using kernel counters. Linux tools include:
These tools are typically viewable by all users on the system (non-root). Their statistics are also commonly graphed by monitoring software.
Many follow a usage convention where they accept an optional interval and count, for example, vmstat(8) with an interval of one second and an output count of three:
$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 1446428 662012 142100 5644676 1 4 28 152 33 1 29 8 63 0 0
4 0 1446428 665988 142116 5642272 0 0 0 284 4957 4969 51 0 48 0 0
4 0 1446428 685116 142116 5623676 0 0 0 0 4488 5507 52 0 48 0 0
The first line of output is the summary-since-boot, which shows averages for the entire time the system has been up. The subsequent lines are the one-second interval summaries, showing current activity. At least, this is the intent: this Linux version mixes summary-since-boot and current values for the first line (the memory columns are current values; vmstat(8) is explained in Chapter 7).
Per-Process
These tools are process-oriented and use counters that the kernel maintains for each process. Linux tools include:
These tools typically read statistics from the /proc file system.
4.2.2 Profiling
Profiling characterizes the target by collecting a set of samples or snapshots of its behavior. CPU usage is a common target of profiling, where timer-based samples are taken of the instruction pointer or stack trace to characterize CPU-consuming code paths. These samples are usually collected at a fixed rate, such as 100 Hz (cycles per second) across all CPUs, and for a short duration such as one minute. Profiling tools, or profilers, often use 99 Hz instead of 100 Hz to avoid sampling in lockstep with target activity, which could lead to over- or undercounting.
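For example, a CPU profile across all CPUs at 99 Hertz for 30 seconds can be collected and then summarized using perf(1) (a minimal sketch; perf(1) is covered in Chapter 13):
# perf record -F 99 -a -g -- sleep 30
# perf report -n --stdio
The first command samples stack traces to a perf.data file; the second prints a summary of the sampled code paths.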
Profiling can also be based on untimed hardware events, such as CPU hardware cache misses or bus activity. It can show which code paths are responsible, information that can especially help developers optimize their code for memory usage.
Unlike fixed counters, profiling (and tracing) are typically only enabled on an as-needed basis, since they can cost some CPU overhead to collect, and storage overhead to store. The magnitudes of these overheads depend on the tool and the rate of events it instruments. Timer-based profilers are generally safer: the event rate is known, so its overhead can be predicted, and the event rate can be selected to have negligible overhead.
System-wide Linux profilers include:
These can also be used to target a single process.
Per-Process
Process-oriented profilers include:
See Chapter 6, CPUs, and Chapter 13, perf, for more about profiling tools.
4.2.3 Tracing
Tracing instruments every occurrence of an event, and can store event-based details for later analysis or produce a summary. This is similar to profiling, but the intent is to collect or inspect all events, not just a sample. Tracing can incur higher CPU and storage overheads than profiling, which can slow the target of tracing. This should be taken into consideration, as it may negatively affect the production workload, and measured timestamps may also be skewed by the tracer. As with profiling, tracing is typically only used as needed.
Logging, where infrequent events such as errors and warnings are written to a log file for later reading, can be thought of as low-frequency tracing that is enabled by default. Logs include the system log.
The following are examples of system-wide and per-process tracing tools.
System-Wide
These tracing tools examine system-wide activity in the context of system software or hardware resources, using kernel tracing facilities. Linux tools include:
perf(1), Ftrace, BCC, and bpftrace are introduced in Section 4.5, Tracing Tools, and covered in detail in Chapters 13 to 15. There are over one hundred tracing tools built using BCC and bpftrace, including biosnoop(8) and execsnoop(8) from this list. More examples are provided throughout this book.
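As a quick taste, on Ubuntu the BCC tools are packaged with a -bpfcc suffix, so two of those tools can be run as follows (a sketch; tool names and packaging vary by distribution):
# execsnoop-bpfcc        # trace new processes via execve(2)
# biosnoop-bpfcc         # trace block I/O with per-event latency
Both print one line per event.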
Per-Process
These tracing tools are process-oriented, as are the operating system frameworks on which they are based. Linux tools include:
The debuggers can examine per-event data, but they must do so by stopping and starting the execution of the target. This can come with an enormous overhead cost, making them unsuitable for production use.
System-wide tracing tools such as perf(1) and bpftrace support filters for examining a single process and can operate with much lower overhead, making them preferred where available.
4.2.4 Monitoring
Monitoring was introduced in Chapter 2, Methodologies. Unlike the tool types covered previously, monitoring records statistics continuously in case they are later needed.
sar(1)
A traditional tool for monitoring a single operating system host is the System Activity Reporter, sar(1), originating from AT&T Unix. sar(1) is counter-based and has an agent that executes at scheduled times (via cron) to record the state of system-wide counters. The sar(1) tool allows these to be viewed at the command line, for example:
Linux 4.15.0-66-generic (bgregg) 12/21/2019 _x86_64_ (8 CPU)
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:05:01 AM all 3.34 0.00 0.95 0.04 0.00 95.66
12:10:01 AM all 2.93 0.00 0.87 0.04 0.00 96.16
12:15:01 AM all 3.05 0.00 1.38 0.18 0.00 95.40
12:20:01 AM all 3.02 0.00 0.88 0.03 0.00 96.06
[…]
Average: all 0.00 0.00 0.00 0.00 0.00 0.00
By default, sar(1) reads its statistics archive (if enabled) to print recent historical statistics. You can specify an optional interval and count for it to examine current activity at the rate specified.
sar(1) can record dozens of different statistics to provide insight into CPU, memory, disks, networking, interrupts, power usage, and more. It is covered in more detail in Section 4.4, sar.
Third-party monitoring products are often built on sar(1) or the same observability statistics it uses, and expose these metrics over the network.
The traditional technology for network monitoring is the Simple Network Management Protocol (SNMP). Devices and operating systems can support SNMP and in some cases provide it by default, avoiding the need to install third-party agents or exporters. SNMP includes many basic OS metrics, although it has not been extended to cover modern applications. Most environments have been switching to custom agent-based monitoring instead.
Modern monitoring software runs agents (also known as exporters or plugins) on each system to record kernel and application metrics. These can include agents for specific applications and targets, for example, the MySQL database server, the Apache Web Server, and the MemCached caching system. Such agents can provide detailed application request metrics that are not available from system counters alone.
Monitoring software and agents for Linux include:
An example monitoring architecture is pictured in Figure 4.4, involving a monitoring database server for archiving metrics and a monitoring web server for providing a client UI. The metrics are sent (or made available) by agents to the database server, and then made available to client UIs for display as line graphs and in dashboards. For example, Graphite Carbon is a monitoring database server, and Grafana is a monitoring web server/dashboard.
Figure 4.4 Example monitoring architecture
There are dozens of monitoring products, and hundreds of different agents for different target types. Covering them is beyond the scope of this book. There is, however, one common denominator that is covered here: system statistics (based on kernel counters). The system statistics shown by monitoring products are typically the same as those shown by system tools: vmstat(8), iostat(1), etc. Learning these will help you understand monitoring products, even if you never use the command-line tools. These tools are covered in later chapters.
Some monitoring products read their system metrics by running the system tools and parsing the text output, which is inefficient. Better monitoring products use library and kernel interfaces to read the metrics directly — the same interfaces as used by the command-line tools. These sources are covered in the next section, focusing on the most common denominator: the kernel interfaces.
4.3 Observability Sources
The sections that follow describe various interfaces that provide the data for observability tools on Linux. They are summarized in Table 4.2.
Table 4.2 Linux observability sources
The main sources of systems performance statistics are covered next: /proc and /sys. Then other Linux sources are covered: delay accounting, netlink, tracepoints, kprobes, USDT, uprobes, PMCs, and more.
The tracers covered in Chapter 13, perf; Chapter 14, Ftrace; and Chapter 15, BPF, utilize many of these sources, especially for system-wide tracing. The scope of these tracing sources is pictured in Figure 4.5, along with event and group names: for example, block: is for all the block I/O tracepoints, including block:block_rq_issue.
Figure 4.5 Linux tracing sources
Only a few example USDT sources are pictured in Figure 4.5, for the PostgreSQL database (postgres:), the JVM hotspot compiler (hotspot:), and libc (libc:). You may have many more depending on your user-level software.
For more information on how tracepoints, kprobes, and uprobes work, their internals are documented in Chapter 2 of BPF Performance Tools [Gregg 19].
4.3.1 /proc
This is a file system interface for kernel statistics. /proc contains a number of directories, where each directory is named after the process ID for the process it represents. In each of these directories is a number of files containing information and statistics about each process, mapped from kernel data structures. There are additional files in /proc for system-wide statistics.
/proc is dynamically created by the kernel and is not backed by storage devices (it runs in-memory). It is mostly read-only, providing statistics for observability tools. Some files are writeable, for controlling process and kernel behavior.
The file system interface is convenient: it’s an intuitive framework for exposing kernel statistics to user-land via the directory tree, and has a well-known programming interface via the POSIX file system calls: open(), read(), close(). You can also explore it at the command line using cd, cat(1), grep(1), and awk(1). The file system also provides user-level security through use of file access permissions. In rare cases where the typical process observability tools (ps(1), top(1), etc.) cannot be executed, some process debugging can still be performed by shell built-ins from the /proc directory.
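As a sketch of that last point, basic process information can be fetched with little more than shell built-ins (PID 1 is used as an example):
$ cd /proc
$ echo [0-9]*                      # list PID directories without ps(1) or ls(1)
$ read comm < 1/comm; echo $comm   # print the command name of PID 1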
The overhead of reading most /proc files is negligible; exceptions include some memory-map related files that walk page tables.
Various files are provided in /proc for per-process statistics. Here is an example of what may be available (Linux 5.4), here looking at PID 18733:
3You can also examine /proc/self for your current process (shell).
arch_status environ mountinfo personality statm
attr/ exe@ mounts projid_map status
autogroup fd/ mountstats root@ syscall
cgroup gid_map ns/ schedstat timers
clear_refs io numa_maps sessionid timerslack_ns
cmdline limits oom_adj setgroups uid_map
comm loginuid oom_score smaps wchan
coredump_filter map_files/ oom_score_adj smaps_rollup
The exact list of files available depends on the kernel version and CONFIG options.
Those related to per-process performance observability include:
The following shows how per-process statistics are read by top(1), traced using strace(1):
stat("/proc/14704", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/14704/stat", O_RDONLY) = 4
read(4, "14704 (sshd) S 1 14704 14704 0 -"..., 1023) = 232
close(4)
This has opened a file called “stat” in a directory named after the process ID (14704), and then read the file contents.
top(1) repeats this for all active processes on the system. On some systems, especially those with many processes, the overhead from performing these can become noticeable, especially for versions of top(1) that repeat this sequence for every process on every screen update. This can lead to situations where top(1) reports that top itself is the highest CPU consumer!
Linux has also extended /proc to include system-wide statistics, contained in these additional files and directories:
acpi/ dma kallsyms mdstat schedstat thread-self@
buddyinfo driver/ kcore meminfo scsi/ timer_list
bus/ execdomains keys misc self@ tty/
cgroups fb key-users modules slabinfo uptime
cmdline filesystems kmsg mounts@ softirqs version
consoles fs/ kpagecgroup mtrr stat vmallocinfo
cpuinfo interrupts kpagecount net@ swaps vmstat
crypto iomem kpageflags pagetypeinfo sys/ zoneinfo
devices ioports loadavg partitions sysrq-trigger
diskstats irq/ locks sched_debug sysvipc/
System-wide files related to performance observability include:
cpuinfo: Physical processor information, including every virtual CPU, model name, clock speed, and cache sizes
diskstats: Disk I/O statistics for all disk devices
interrupts: Interrupt counters per CPU
loadavg: Load averages
meminfo: System memory usage breakdowns
net/dev: Network interface statistics
net/netstat: System-wide networking statistics
net/tcp: Active TCP socket information
pressure/: Pressure stall information (PSI) files
schedstat: System-wide CPU scheduler statistics
self: A symlink to the current process ID directory, for convenience
slabinfo: Kernel slab allocator cache statistics
stat: A summary of kernel and system resource statistics: CPUs, disks, paging, swap, processes
zoneinfo: Memory zone information
These are read by system-wide tools. For example, here’s vmstat(8) reading /proc, as traced by strace(1):
open("/proc/meminfo", O_RDONLY) = 3
read(3, "MemTotal: 889484 kB\nMemF"..., 2047) = 1170
open("/proc/stat", O_RDONLY) = 4
read(4, "cpu 14901 0 18094 102149804 131"..., 65535) = 804
open("/proc/vmstat", O_RDONLY) = 5
read(5, "nr_free_pages 160568\nnr_inactive"..., 2047) = 1998
This output shows that vmstat(8) was reading meminfo, stat, and vmstat.
The /proc/stat file provides system-wide CPU utilization statistics and is used by many tools (vmstat(8), mpstat(1), sar(1), monitoring agents). The accuracy of these statistics depends on the kernel configuration. The default configuration (CONFIG_TICK_CPU_ACCOUNTING) measures CPU utilization with a granularity of clock ticks [Weisbecker 13], which may be four milliseconds (depending on CONFIG_HZ). This is generally sufficient. There are options to improve accuracy by using higher-resolution counters, though with a small performance cost (VIRT_CPU_ACCOUNTING_NATIVE and VIRT_CPU_ACCOUNTING_GEN), as well as an option for more accurate IRQ time (IRQ_TIME_ACCOUNTING). A different approach to obtaining accurate CPU utilization measurements is to use MSRs or PMCs.
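To illustrate how such tools consume /proc/stat, the following shell sketch reads the system-wide “cpu” line twice, one second apart, and computes utilization from the deltas (a simplification that assumes the usual column order: user, nice, system, idle, iowait, irq, softirq, steal):
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
busy=$(( (u2 + n2 + s2 + q2 + sq2 + st2) - (u1 + n1 + s1 + q1 + sq1 + st1) ))
idle=$(( (i2 + w2) - (i1 + w1) ))
awk -v b=$busy -v i=$idle 'BEGIN { printf "%.1f%% busy\n", b * 100 / (b + i) }'
This is essentially the same arithmetic that mpstat(1) and vmstat(8) perform for each interval.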
/proc files are usually text formatted, allowing them to be read easily from the command line and processed by shell scripting tools. For example:
$ cat /proc/meminfo
MemTotal: 15923672 kB
Buffers: 94536 kB
Cached: 2512040 kB
Active: 1671088 kB
[…]
$ grep MemTotal /proc/meminfo
MemTotal: 15923672 kB
While this is convenient, it does add a small amount of overhead for the kernel to encode the statistics as text, and for any user-land tool that then parses the text. netlink, covered in Section 4.3.4, netlink, is a more efficient binary interface.
The contents of /proc are documented in the proc(5) man page and in the Linux kernel documentation: Documentation/filesystems/proc.txt [Bowden 20]. Some parts have extended documentation, such as diskstats in Documentation/iostats.txt and scheduler stats in Documentation/scheduler/sched-stats.txt. Apart from the documentation, you can also study the kernel source code to understand the exact origin of all items in /proc. It can also be helpful to read the source to the tools that consume them.
Some of the /proc entries depend on CONFIG options: schedstats is enabled with CONFIG_SCHEDSTATS, sched with CONFIG_SCHED_DEBUG, and pressure with CONFIG_PSI.
4.3.2 /sys
Linux provides a sysfs file system, mounted on /sys, which was introduced with the 2.6 kernel to provide a directory-based structure for kernel statistics. This differs from /proc, which has evolved over time and had various system statistics mostly added to the top-level directory. sysfs was originally designed to provide device driver statistics but has been extended to include any statistic type.
For example, the following lists /sys files for CPU 0 (truncated):
$ find /sys/devices/system/cpu/cpu0 -type f
/sys/devices/system/cpu/cpu0/uevent
/sys/devices/system/cpu/cpu0/hotplug/target
/sys/devices/system/cpu/cpu0/hotplug/state
/sys/devices/system/cpu/cpu0/hotplug/fail
/sys/devices/system/cpu/cpu0/crash_notes_size
/sys/devices/system/cpu/cpu0/power/runtime_active_time
/sys/devices/system/cpu/cpu0/power/runtime_active_kids
/sys/devices/system/cpu/cpu0/power/pm_qos_resume_latency_us
/sys/devices/system/cpu/cpu0/power/runtime_usage
[…]
/sys/devices/system/cpu/cpu0/topology/die_id
/sys/devices/system/cpu/cpu0/topology/physical_package_id
/sys/devices/system/cpu/cpu0/topology/core_cpus_list
/sys/devices/system/cpu/cpu0/topology/die_cpus_list
/sys/devices/system/cpu/cpu0/topology/core_siblings
[…]
Many of the listed files provide information about the CPU hardware caches. The following output shows their contents (using grep(1) so that the file name is included with the output):
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/level
/sys/devices/system/cpu/cpu0/cache/index0/level:1
/sys/devices/system/cpu/cpu0/cache/index1/level:1
/sys/devices/system/cpu/cpu0/cache/index2/level:2
/sys/devices/system/cpu/cpu0/cache/index3/level:3
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/size
/sys/devices/system/cpu/cpu0/cache/index0/size:32K
/sys/devices/system/cpu/cpu0/cache/index1/size:32K
/sys/devices/system/cpu/cpu0/cache/index2/size:1024K
/sys/devices/system/cpu/cpu0/cache/index3/size:33792K
This shows that CPU 0 has access to two Level 1 caches, each 32 Kbytes, a Level 2 cache of 1 Mbyte, and a Level 3 cache of 33 Mbytes.
The /sys file system typically has tens of thousands of statistics in read-only files, as well as many writeable files for changing kernel state. For example, CPUs can be set to online or offline by writing “1” or “0” to a file named “online.” As with reading statistics, some state settings can be made by using text strings at the command line (echo 1 > filename), rather than a binary interface.
4.3.3 Delay Accounting
Linux systems with the CONFIG_TASK_DELAY_ACCT option track time per task in the following states:
Scheduler latency: Waiting for a turn on-CPU
Block I/O: Waiting for a block I/O to complete
Swapping: Waiting for paging (memory pressure)
Memory reclaim: Waiting for the memory reclaim routine
Technically, the scheduler latency statistic is sourced from schedstats (mentioned earlier, in /proc) but is exposed with the other delay accounting states. (It is in struct sched_info, not struct task_delay_info.)
These statistics can be read by user-level tools using taskstats, which is a netlink-based interface for fetching per-task and process statistics. In the kernel source there is:
Documentation/accounting/delay-accounting.txt: the documentation
tools/accounting/getdelays.c: an example consumer
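The example consumer can be compiled and run directly from a kernel source tree. This is a sketch assuming matching kernel headers are installed and a target PID of 17451, as in the output that follows (-d prints the delay accounting statistics, -p selects the PID):
$ gcc -o getdelays tools/accounting/getdelays.c
# ./getdelays -d -p 17451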
The following is some output from getdelays.c:
PID 17451
CPU count real total virtual total delay total delay average
386 3452475144 31387115236 1253300657 3.247ms
IO count delay total delay average
302 1535758266 5ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
Times are given in nanoseconds unless otherwise specified. This example was taken from a heavily CPU-loaded system, and the process inspected was suffering scheduler latency.
4.3.4 netlink
netlink is a special socket address family (AF_NETLINK) for fetching kernel information. Use of netlink involves opening a networking socket with the AF_NETLINK address family and then using a series of send(2) and recv(2) calls to pass requests and receiving information in binary structs. While this is a more complicated interface to use than /proc, it is more efficient, and also supports notifications. The libnetlink library helps with usage.
As with earlier tools, strace(1) can be used to show where the kernel information is coming from. Inspecting the socket statistics tool ss(8):
[…]
socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_SOCK_DIAG) = 3
[…]
This is opening an AF_NETLINK socket for the group NETLINK_SOCK_DIAG, which returns information about sockets. It is documented in the sock_diag(7) man page. netlink groups include:
NETLINK_ROUTE: Route information (there is also /proc/net/route)
NETLINK_SOCK_DIAG: Socket information
NETLINK_SELINUX: SELinux event notifications
NETLINK_AUDIT: Auditing (security)
NETLINK_SCSITRANSPORT: SCSI transports
NETLINK_CRYPTO: Kernel crypto information
Commands that use netlink include ip(8), ss(8), routel(8), and the older ifconfig(8) and netstat(8).
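For example, the following invocations fetch their statistics over netlink rather than by parsing /proc text (standard iproute2 options; shown here only as a sketch):
$ ip -s link        # per-interface statistics (NETLINK_ROUTE)
$ ss -tie           # TCP sockets with internal TCP info (NETLINK_SOCK_DIAG)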
4.3.5 Tracepoints
Tracepoints are a Linux kernel event source based on static instrumentation, a term introduced in Chapter 1, Introduction, Section 1.7.3, Tracing. Tracepoints are hard-coded instrumentation points placed at logical locations in kernel code. For example, there are tracepoints at the start and end of system calls, scheduler events, file system operations, and disk I/O.4 The tracepoint infrastructure was developed by Mathieu Desnoyers and first made available in the Linux 2.6.32 release in 2009. Tracepoints are a stable API5 and are limited in number.
4Some are gated by Kconfig options and may not be available if the kernel is compiled without them; e.g., rcu tracepoints and CONFIG_RCU_TRACE.
5I’d call it “best-effort stable.” It is rare, but I have seen tracepoints change.
Tracepoints are an important resource for performance analysis as they power advanced tracing tools that go beyond summary statistics, providing deeper insight into kernel behavior. While function-based tracing can provide similar capabilities (e.g., Section 4.3.6, kprobes), only tracepoints provide a stable interface, allowing robust tools to be developed.
This section explains tracepoints. These can be used by the tracers introduced in Section 4.5, Tracing Tools, and are covered in depth in Chapters 13 to 15.
Available tracepoints can be listed using the perf list command (perf(1) syntax is covered in Chapter 13):
List of pre-defined events (to be used in -e):
[…]
block:block_rq_complete Tracepoint event
block:block_rq_insert Tracepoint event
block:block_rq_issue Tracepoint event
[…]
sched:sched_wakeup Tracepoint event
sched:sched_wakeup_new Tracepoint event
sched:sched_waking Tracepoint event
scsi:scsi_dispatch_cmd_done Tracepoint event
scsi:scsi_dispatch_cmd_error Tracepoint event
scsi:scsi_dispatch_cmd_start Tracepoint event
scsi:scsi_dispatch_cmd_timeout Tracepoint event
[…]
skb:consume_skb Tracepoint event
skb:kfree_skb Tracepoint event
[…]
I have truncated the output to show a dozen example tracepoints from the block device layer, the scheduler, and SCSI. On my system there are 1808 different tracepoints, 634 of which are for instrumenting syscalls.
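The totals for your own system can be counted in a similar way. This is a sketch based on the perf list output format shown above (the counts will differ between kernel versions and CONFIG options):
# perf list tracepoint | grep -c 'Tracepoint event'
# perf list tracepoint | grep -c ' syscalls:'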
Apart from showing when an event happened, tracepoints can also provide contextual data about the event. As an example, the following perf(1) command traces the block:block_rq_issue tracepoint and prints events live:
[…]
0.000 kworker/u4:1-e/20962 block:block_rq_issue:259,0 W 8192 () 875216 + 16
[kworker/u4:1]
255.945 :22696/22696 block:block_rq_issue:259,0 RA 4096 () 4459152 + 8 bash
256.957 :22705/22705 block:block_rq_issue:259,0 RA 16384 () 367936 + 32 bash
[…]
The first three fields are a timestamp (seconds), process details (name/thread ID), and event description (followed by a colon separator instead of a space). The remaining fields are arguments for the tracepoint and are generated by a format string explained next; for the specific block:block_rq_issue format string, see Chapter 9, Disks, Section 9.6.5, perf.
A note about terminology: tracepoints (or trace points) are technically the tracing functions (also called tracing hooks) placed in the kernel source. For example, trace_sched_wakeup() is a tracepoint, and you will find it called from kernel/sched/core.c. This tracepoint may be instrumented via tracers using the name “sched:sched_wakeup”; however, that is technically a trace event, defined by the TRACE_EVENT macro. TRACE_EVENT also defines and formats its arguments, auto-generates the trace_sched_wakeup() code, and places the trace event in the tracefs and perf_event_open(2) interfaces [Ts’o 20]. Tracing tools primarily instrument trace events, although they may refer to them as “tracepoints.” perf(1) calls trace events “Tracepoint event,” which is confusing since kprobe- and uprobe-based trace events are also labeled “Tracepoint event.”
Tracepoint Arguments and Format String
Each tracepoint has a format string that contains event arguments: extra context about the event. The structure of this format string can be seen in a “format” file under /sys/kernel/debug/tracing/events. For example:
# cat /sys/kernel/debug/tracing/events/block/block_rq_issue/format
ID: 1080
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:dev_t dev; offset:8; size:4; signed:0;
field:sector_t sector; offset:16; size:8; signed:0;
field:unsigned int nr_sector; offset:24; size:4; signed:0;
field:unsigned int bytes; offset:28; size:4; signed:0;
field:char rwbs[8]; offset:32; size:8; signed:1;
field:char comm[16]; offset:40; size:16; signed:1;
field:__data_loc char[] cmd; offset:56; size:4; signed:1;
print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm
The final line shows the string format and arguments. The following shows the format string formatting from this output, followed by an example format string from the previous perf script output:
%d,%d %s %u (%s) %llu + %u [%s]
259,0 W 8192 () 875216 + 16 [kworker/u4:1]
Tracers can typically access the arguments from format strings via their names. For example, the following uses perf(1) to trace block I/O issue events only when the size (bytes argument) is larger than 65536:6
6The --filter option for perf trace was added in Linux 5.5. On older kernels, this can be accomplished using: perf record -e block:block_rq_issue --filter 'bytes > 65536' -a; perf script
0.000 jbd2/nvme0n1p1/174 block:block_rq_issue:259,0 WS 77824 () 2192856 + 152
[jbd2/nvme0n1p1-]
5.784 jbd2/nvme0n1p1/174 block:block_rq_issue:259,0 WS 94208 () 2193152 + 184
[jbd2/nvme0n1p1-]
[…]
As an example of a different tracer, the following uses bpftrace to print the bytes argument only for this tracepoint (bpftrace syntax is covered in Chapter 15, BPF; I’ll use bpftrace for subsequent examples as it is concise to use, requiring fewer commands):
# bpftrace -e 't:block:block_rq_issue { printf("size: %d bytes\n", args->bytes); }'
[…]
The output is one line for each I/O issue, showing its size.
Tracepoints are a stable API that consists of the tracepoint name, format string, and arguments.
Tracing tools can use tracepoints via their trace event files in tracefs (typically mounted at /sys/kernel/debug/tracing) or the perf_event_open(2) syscall. As an example, my Ftrace-based iosnoop(8) tool uses the tracefs files:
chdir("/sys/kernel/debug/tracing") = 0
openat(AT_FDCWD, "/var/tmp/.ftrace-lock", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[...]
openat(AT_FDCWD, "events/block/block_rq_issue/enable", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
openat(AT_FDCWD, "events/block/block_rq_complete/enable", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[...]
The output includes a chdir(2) to the tracefs directory and the opening of “enable” files for block tracepoints. It also includes a /var/tmp/.ftrace-lock: this is a precaution I coded that blocks concurrent tool users, which the tracefs interface does not easily support. The perf_event_open(2) interface does support concurrent users and is preferred where possible. It is used by my newer BCC version of the same tool:
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0 /* PERF_ATTR_SIZE_??? */,
config=2323, …}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 8
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0 /* PERF_ATTR_SIZE_??? */,
config=2324, …}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 10
[…]
perf_event_open(2) is the interface to the kernel perf_events subsystem, which provides various profiling and tracing capabilities. See its man page for more details, as well as the perf(1) front end in Chapter 13.
When tracepoints are activated, they add a small amount of CPU overhead to each event. The tracing tool may also add CPU overhead to post-process events, plus file system overheads to record them. Whether the overheads are high enough to perturb production applications depends on the rate of events and the number of CPUs, and is something you will need to consider when using tracepoints.
On typical systems of today (4 to 128 CPUs), I find that event rates of less than 10,000 per second cost negligible overhead, and only over 100,000 does the overhead begin to become measurable. As event examples, you may find that disk events are typically fewer than 10,000 per second, but scheduler events can be well over 100,000 per second and therefore can be expensive to trace.
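One way to check an event rate before tracing it is to count the events for a short interval using perf(1) in counting mode, which has far lower overhead than recording each event (a sketch using tracepoints shown earlier in this chapter):
# perf stat -e block:block_rq_issue -a -- sleep 10
# perf stat -e sched:sched_switch -a -- sleep 10
Dividing the reported counts by ten gives per-second rates that can be compared with the thresholds above.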
I’ve previously analyzed overheads for a particular system and found the minimum tracepoint overhead to be 96 nanoseconds of CPU time [Gregg 19]. There is a newer type of tracepoint called raw tracepoints, added to Linux 4.17 in 2018, which avoids the cost of creating stable tracepoint arguments, reducing this overhead.
Apart from the enabled overhead while tracepoints are in use, there is also the disabled overhead for making them available. A disabled tracepoint becomes a small number of instructions: for x86_64 it is a 5-byte no-operation (nop) instruction. There is also a tracepoint handler added to the end of the function, which increases its text size a little. While these overheads are very small, they are something you should analyze and understand when adding tracepoints to the kernel.
The tracepoints technology is documented in the kernel source under Documentation/trace/tracepoints.rst. The tracepoints themselves are (sometimes) documented in the header files that define them, found in the Linux source under include/trace/events. I summarized advanced tracepoint topics in BPF Performance Tools, Chapter 2 [Gregg 19]: how they are added to kernel code, and how they work at the instruction level.
Sometimes you may wish to trace software execution for which there are no tracepoints: for that you can try the unstable kprobes interface.
4.3.6 kprobes
kprobes (short for kernel probes) is a Linux kernel event source for tracers based on dynamic instrumentation, a term introduced in Chapter 1, Introduction, Section 1.7.3, Tracing. kprobes can trace any kernel function or instruction, and were made available in Linux 2.6.9, released in 2004. They are considered an unstable API because they expose raw kernel functions and arguments that may change between kernel versions.
kprobes can work in different ways internally. The standard method is to modify the instruction text of running kernel code to insert instrumentation where needed. When instrumenting the entry of functions, an optimization may be used where kprobes make use of existing Ftrace function tracing, as it has lower overhead.7
7It can also be enabled/disabled via the debug.kprobes-optimization sysctl(8).
kprobes are important because they are a last-resort8 source of virtually unlimited information about kernel behavior in production, which can be crucial for observing performance issues that are invisible to other tools. They can be used by the tracers introduced in Section 4.5, Tracing Tools, and are covered in depth in Chapters 13 to 15.
8Without kprobes, the last resort option would be to modify the kernel code to add instrumentation where needed, recompile, and redeploy.
kprobes and tracepoints are compared in Table 4.3.
Table 4.3 kprobes to tracepoints comparison
kprobes can trace the entry to functions as well as instruction offsets within functions. The use of kprobes creates kprobe events (a kprobe-based trace event). These kprobe events only exist when a tracer creates them: by default, the kernel code runs unmodified.
As an example of using kprobes, the following bpftrace command instruments the do_nanosleep() kernel function and prints the on-CPU process:
# bpftrace -e 'kprobe:do_nanosleep { printf("sleep by: %s\n", comm); }'
The output shows a couple of sleeps by a process named “mysqld”, and one by “sleep” (likely /bin/sleep). The kprobe event for do_nanosleep() is created when the bpftrace program begins running and is removed when bpftrace terminates (Ctrl-C).
As kprobes can trace kernel function calls, it is often desirable to inspect the arguments to the function for more context. Each tracing tool exposes them in its own way and is covered in later sections. For example, using bpftrace to print the second argument to do_nanosleep(), which is the hrtimer_mode:
# bpftrace -e 'kprobe:do_nanosleep { printf("mode: %d\n", arg1); }'
mode: 1
mode: 1
mode: 1
[…]
Function arguments are available in bpftrace using the arg0..argN built-in variable.
kretprobes
The return from kernel functions and their return value can be traced using kretprobes (short for kernel return probes), which are similar to kprobes. kretprobes are implemented using a kprobe for the function entry, which inserts a trampoline function to instrument the return.
When paired with kprobes and a tracer that records timestamps, the duration of a kernel function can be measured. For example, the duration of do_nanosleep() can be measured using bpftrace with a program similar to the following:
# bpftrace -e 'kprobe:do_nanosleep { @ts[tid] = nsecs; }
    kretprobe:do_nanosleep /@ts[tid]/ { @ms = hist((nsecs - @ts[tid]) / 1000000);
    delete(@ts[tid]); }'
Intel: Chapter 19, “Performance Monitoring Events,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 [Intel 16]
AMD: Section 2.1.1, “Performance Monitor Counters,” of Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh [AMD 18]
ARM: Section D7.10, “PMU Events and Event Numbers,” of Arm® Architecture Reference Manual Armv8, for Armv8-A architecture profile [ARM 19]
There has been work to develop a standard naming scheme for PMCs that could be supported across all processors, called the performance application programming interface (PAPI) [UTK 20]. Operating system support for PAPI has been mixed: it requires frequent updates to map PAPI names to vendor PMC codes.
Chapter 6, CPUs, Section 6.4.1, Hardware, subsection Hardware Counters (PMCs), describes their implementation in more detail and provides additional PMC examples.
4.3.10 Other Observability Sources
Other observability sources include:
MSRs: PMCs are implemented using model-specific registers (MSRs). There are other MSRs for showing the configuration and health of the system, including the CPU clock rate, usage, temperatures, and power consumption. The available MSRs are dependent on the processor type (model-specific), BIOS version and settings, and hypervisor settings. One use is an accurate cycle-based measurement of CPU utilization.
ptrace(2): This syscall controls process tracing, which is used by gdb(1) for process debugging and strace(1) for tracing syscalls. It is breakpoint-based and can slow the target over one hundred-fold. Linux also has tracepoints, introduced in Section 4.3.5, Tracepoints, for more efficient syscall tracing.
Function profiling: Profiling function calls (mcount() or __fentry__()) are added to the start of all non-inlined kernel functions on x86 for efficient Ftrace function tracing. They are converted to nop instructions until needed. See Chapter 14, Ftrace.
Network sniffing (libpcap): These interfaces provide a way to capture packets from network devices for detailed investigations into packet and protocol performance. On Linux, sniffing is provided via the libpcap library and /proc/net/dev and is consumed by the tcpdump(8) tool. There are overheads, both CPU and storage, for capturing and examining all packets. See Chapter 10 for more about network sniffing.
netfilter conntrack: The Linux netfilter technology allows custom handlers to be executed on events, not just for firewall, but also for connection tracking (conntrack). This allows logs to be created of network flows [Ayuso 12].
Process accounting: This dates back to mainframes and the need to bill departments and users for their computer usage, based on the execution and runtime of processes. It exists in some form for Linux and other systems and can sometimes be helpful for performance analysis at the process level. For example, the Linux atop(1) tool uses process accounting to catch and display information from short-lived processes that would otherwise be missed when taking snapshots of /proc [Atoptool 20].
Software events: These are related to hardware events but are instrumented in software. Page faults are an example. Software events are made available via the perf_event_open(2) interface and are used by perf(1) and bpftrace. They are pictured in Figure 4.5.
System calls: Some system or library calls may be available to provide some performance metrics. These include getrusage(2), a system call for processes to get their own resource usage statistics, including user- and system-time, faults, messages, and context switches.
If you are interested in how each of these works, you will find that documentation is usually available, intended for the developer who is building tools upon these interfaces.
And More
Depending on your kernel version and enabled options, even more observability sources may be available. Some are mentioned in later chapters of this book. For Linux these include I/O accounting, blktrace, timer_stats, lockstat, and debugfs.
One way to find such sources is to read the kernel code you are interested in observing and see what statistics or tracepoints have been placed there.
In some cases there may be no kernel statistics for what you are after. Beyond dynamic instrumentation (Linux kprobes and uprobes), you may find that debuggers such as gdb(1) and lldb(1) can fetch kernel and application variables to shed some light on an investigation.
Solaris Kstat
As an example of a different way to provide system statistics, Solaris-based systems use a kernel statistics (Kstat) framework that provides a consistent hierarchical structure of kernel statistics, each named using the following four-tuple:
module:instance:name:statistic
These are:
module: This usually refers to the kernel module that created the statistic, such as sd for the SCSI disk driver, or zfs for the ZFS file system.
instance: Some modules exist as multiple instances, such as an sd module for each SCSI disk. The instance is an enumeration.
name: This is a name for the group of statistics.
statistic: This is the individual statistic name.
Kstats are accessed using a binary kernel interface, and various libraries exist.
As an example Kstat, the following reads the “nproc” statistic using kstat(1M) and specifying the full four-tuple:
$ kstat -p unix:0:system_misc:nproc
unix:0:system_misc:nproc 94
This statistic shows the currently running number of processes.
In comparison, the /proc/stat-style sources on Linux have inconsistent formatting and usually require text parsing to process, costing some CPU cycles.
4.4 sar
sar(1) was introduced in Section 4.2.4, Monitoring, as a key monitoring facility. While there has been much excitement recently with BPF tracing superpowers (and I’m partly responsible), you should not overlook the utility of sar(1) — it’s an essential systems performance tool that can solve many performance issues on its own. The Linux version of sar(1) is also well-designed, having self-descriptive column headings, network metric groups, and detailed documentation (man pages). sar(1) is provided via the sysstat package.
4.4.1 sar(1) Coverage
Figure 4.6 shows the observability coverage from the different sar(1) command line options.
Figure 4.6 Linux sar(1) observability
This figure shows that sar(1) provides broad coverage of the kernel and devices, and even has observability for fans. The -m (power management) option also supports other arguments not shown in this figure, including IN for voltage inputs, TEMP for device temperatures, and USB for USB device power statistics.
4.4.2 sar(1) Monitoring
You may find that sar(1) data collecting (monitoring) is already enabled for your Linux systems. If it isn’t, you need to enable it. To check, simply run sar without options. For example:
$ sar
Cannot open /var/log/sysstat/sa19: No such file or directory
Please check if data collecting is enabled
The output shows that sar(1) data collecting is not yet enabled on this system (the sa19 file refers to the daily archive for the 19th of the month). The steps to enable it may vary based on your distribution.
Configuration (Ubuntu)
On this Ubuntu system, I can enable sar(1) data collecting by editing the /etc/default/sysstat file and setting ENABLED to be true:
ubuntu
4.5 Tracing Tools
perf(1): The official Linux profiler. It is excellent for CPU profiling (sampling of stack traces) and PMC analysis, and can instrument other events, typically recording to an output file for post-processing.
Ftrace: The official Linux tracer, it is a multi-tool composed of different tracing utilities. It is suited for kernel code path analysis and resource-constrained systems, as it can be used without dependencies.
BPF (BCC, bpftrace): Extended BPF was introduced in Chapter 3, Operating Systems, Section 3.4.4, Extended BPF. It powers advanced tracing tools, the main ones being BCC and bpftrace. BCC provides powerful tools, and bpftrace provides a high-level language for custom one-liners and short programs.
SystemTap: A high-level language and tracer with many tapsets (libraries) for tracing different targets [Eigler 05][Sourceware 20]. It has recently been developing a BPF backend, which I recommend (see the stapbpf(8) man page).
LTTng: A tracer optimized for black-box recording: optimally recording many events for later analysis [LTTng 20].
The first three tracers are covered in Chapter 13, perf; Chapter 14, Ftrace; and Chapter 15, BPF. The chapters that now follow (5 to 12) include various uses of these tracers, showing the commands to type and how to interpret the output. This ordering is deliberate, focusing on uses and performance wins first, and then covering the tracers in more detail later if and as needed. At Netflix, I use perf(1) for CPU analysis, Ftrace for kernel code digging, and BCC/bpftrace for everything else (memory, file systems, disks, networking, and application tracing).
4.6 Observing Observability
Observability tools and the statistics upon which they are built are implemented in software, and all software has the potential for bugs. The same is true for the documentation that describes the software. Regard with a healthy skepticism any statistics that are new to you, questioning what they really mean and whether they are really correct.
Metrics may be subject to any of the following problems:
Tools and measurements are sometimes wrong.
Man pages are not always right.
Available metrics may be incomplete.
Available metrics may be poorly designed and confusing.
Metric collectors (e.g., that parse tool output) can have bugs.13
13In this case the tool and measurement are correct, but an automated collector has introduced errors. At Surge 2013 I gave a lightning talk on an astonishing case [Gregg 13c]: a benchmarking company reported poor metrics for a product I was supporting, and I dug in. It turned out the shell script they used to automate the benchmark had two bugs. First, when processing output from fio(1), it would take a result such as “100KB/s” and use a regular expression to elide non-numeric characters, including “KB/s”, to turn this into “100”. Since fio(1) reported results with different units (bytes, Kbytes, Mbytes), this introduced massive (1024x) errors. Second, they also elided decimal places, so a result of “1.6” became “16”.
Metric processing (algorithms/spreadsheets) can also introduce errors.
When multiple observability tools have overlapping coverage, you can use them to cross-check each other. Ideally, they will source different instrumentation frameworks to check for bugs in the frameworks as well. Dynamic instrumentation is especially useful for this purpose, as custom tools can be created to double-check metrics.
Another verification technique is to apply known workloads and then to check that the observability tools agree with the results you expect. This can involve the use of micro-benchmarking tools that report their own statistics for comparison.
Sometimes it isn’t the tool or statistic that is in error, but the documentation that describes it, including man pages. The software may have evolved without the documentation being updated.
Realistically, you may not have time to double-check every performance measurement you use and will do this only if you encounter unusual results or a result that is company critical. Even if you do not double-check, it can be valuable to be aware that you didn’t and that you assumed the tools were correct.
Metrics can also be incomplete. When faced with a large number of tools and metrics, it may be tempting to assume that they provide complete and effective coverage. This is often not the case: metrics may have been added by programmers to debug their own code, and later built into observability tools without much study of real customer needs. Some programmers may not have added any at all to new subsystems.
An absence of metrics can be more difficult to identify than the presence of poor metrics. Chapter 2, Methodologies, can help you find these missing metrics by studying the questions you need answered for performance analysis.
4.7 Exercises
Answer the following questions about observability tools (you may wish to revisit the introduction to some of these terms in Chapter 1):
List some static performance tools.
What is profiling?
Why would profilers use 99 Hertz instead of 100 Hertz?
What is tracing?
What is static instrumentation?
Describe why dynamic instrumentation is important.
What is the difference between tracepoints and kprobes?
Describe the expected CPU overhead (low/medium/high) from the following:
Disk IOPS counters (as seen by iostat(1))
Tracing per-event disk I/O via tracepoints or kprobes
Tracing per-event context switches (tracepoints/kprobes)
Tracing per-event process execution (execve(2)) (tracepoints/kprobes)
Tracing per-event libc malloc() via uprobes
Describe why PMCs are valuable for performance analysis.
Given an observability tool, describe how you could determine what instrumentation sources it uses.
4.8 References
[Eigler 05] Eigler, F. Ch., et al., "Architecture of SystemTap: A Linux Trace/Probe Tool," http://sourceware.org/systemtap/archpaper.pdf, 2005.
[Drongowski 07] Drongowski, P., "Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors," AMD (Whitepaper), 2007.
[Ayuso 12] Ayuso, P., "The Conntrack-Tools User Manual," http://conntrack-tools.netfilter.org/manual.html, 2012.
[Gregg 13c] Gregg, B., "Benchmarking Gone Wrong," Surge 2013: Lightning Talks, https://www.youtube.com/watch?v=vm1GJMp0QN4#t=17m48s, 2013.
[Weisbecker 13] Weisbecker, F., "Status of Linux dynticks," OSPERT, http://www.ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf, 2013.
[Intel 16] Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2, September 2016, https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html, 2016.
[AMD 18] Open-Source Register Reference for AMD Family 17h Processors Models 00h-2Fh, https://developer.amd.com/resources/developer-guides-manuals, 2018.
[ARM 19] Arm® Architecture Reference Manual Armv8, for Armv8-A architecture profile, https://developer.arm.com/architectures/cpu-architecture/a-profile/docs?_ga=2.78191124.1893781712.1575908489-930650904.1559325573, 2019.
[Gregg 19] Gregg, B., BPF Performance Tools: Linux System and Application Observability, Addison-Wesley, 2019.
[Atoptool 20] "Atop," www.atoptool.nl/index.php, accessed 2020.
[Bowden 20] Bowden, T., Bauer, B., et al., "The /proc Filesystem," Linux documentation, https://www.kernel.org/doc/html/latest/filesystems/proc.html, accessed 2020.
[Gregg 20a] Gregg, B., "Linux Performance," http://www.brendangregg.com/linuxperf.html, accessed 2020.
[LTTng 20] "LTTng," https://lttng.org, accessed 2020.
[PCP 20] "Performance Co-Pilot," https://pcp.io, accessed 2020.
[Prometheus 20] "Exporters and Integrations," https://prometheus.io/docs/instrumenting/exporters, accessed 2020.
[Sourceware 20] "SystemTap," https://sourceware.org/systemtap, accessed 2020.
[Ts'o 20] Ts'o, T., Zefan, L., and Zanussi, T., "Event Tracing," Linux documentation, https://www.kernel.org/doc/html/latest/trace/events.html, accessed 2020.
[UTK 20] "Performance Application Programming Interface," http://icl.cs.utk.edu/papi, accessed 2020.
[Xenbits 20] "Xen Hypervisor Command Line Options," https://xenbits.xen.org/docs/4.11-testing/misc/xen-command-line.html, accessed 2020.