Systems Performance, 2nd Edition, by Brendan Gregg Index
(SysPrfBGrg 2021)
A
Accelerators in USE method, 49
accept system calls, 95
ACK detection in TCP, 512
Active benchmarking, 657–660
Active listening in three-way handshakes, 511
Active pages in page caches, 318
Activities overview, 3–4
Ad hoc checklist method, 43–44
Adaptive mutex locks, 198
Adaptive Replacement Cache (ARC), 381
Address space, 304
guests, 603
kernel, 90
memory, 304, 310
processes, 95, 99–102, 319–322
threads, 227–228
virtual memory, 104, 305
Address space layout randomization (ASLR), 723
Advanced Format for magnetic rotational disks, 437
AF_NETLINK address family, 145–146
monitoring software, 137–138
product monitoring, 79
AKS (Azure Kubernetes Service), 586
Alerts, 8
caching, 36
congestion control, 115, 118, 513–514
Allocation groups in XFS, 380
memory, 309
multithreaded applications, 353
process virtual address space, 320–321
Amazon EKS (Elastic Kubernetes Service), 586
Amdahl’s Law of Scalability, 64–65
benchmarking, 644–646, 665–666
latency, 56–57, 384–386, 454–455
off-CPU, 188–192
resource, 38–39
workload, 4–5, 39–40
Analysis step in scientific method, 44–45
Analysis strategy in case study, 784
annotate subcommand for perf, 673
Anti-methods
blame-someone-else, 43
Apdex (application performance index), 174
Application calls, tuning, 415–416
Application I/O, 369, 435
Application instrumentation in off-CPU analysis, 189
Application internals, 213
Application layer, file system latency in, 384
Application performance index (Apdex), 174
Applications, 171
basics, 172–173
bpftrace for, 765
common case optimization, 174
exercises, 216–217
internals, 213
latency documentation, 385
methodology. See Applications methodology
missing stacks, 215–216
missing symbols, 214
objectives, 173–174
observability, 174
observability tools. See Applications observability tools
performance techniques. See Applications performance techniques
programming languages. See Applications programming languages
references, 217–218
distributed tracing, 199
overview, 186–187
static performance tuning, 198–199
thread state analysis, 193–197
USE method, 193
Applications observability tools
bpftrace, 209–213
execsnoop, 207–208
overview, 199–200
perf, 200–203
profile, 203–204
strace, 205–207
Applications performance techniques
buffers, 177
caching, 176
concurrency and parallelism, 177–181
non-blocking I/O, 181
Performance Mantras, 182
polling, 177
Applications programming languages, 182–183
compiled, 183–184
garbage collection, 184–185
Interpreted, 184–185
virtual machines, 185
Appropriateness level in methodologies, 28–29
ARC (Adaptive Replacement Cache), 381
CPUs. See CPUs architecture
disks. See Disks architecture
file systems. See File systems architecture
memory. See Memory architecture
networks. See Networks architecture
scalable, 581–582
archive subcommand for perf, 673
arg variables for bpftrace, 778
kprobes, 152
networks, 507
tracepoints, 148–149
uprobes, 154
Arithmetic mean, 74
Arrival process in queueing systems, 67
cloud computing, 583–584
ASLR (address space layout randomization), 723
Associativity in caches, 234
Asynchronous disk I/O, 434–435
Asynchronous interrupts, 96–97
Asynchronous writes, 366
cloud computing, 583–584
available_filter_functions file, 710
Averages, 74–75
Axes
- flame graphs, 10, 187, 290
- scalability tests, 62
- scatter plots, 81–82, 488
B
Back-ends in instruction pipeline, 224
Background color in flame graphs, 291
Backlogs in network connections, 507, 519–520, 556–557, 569
Balloon drivers, 597
disks, 424
interconnects, 237
networks, 500, 508, 532–533
OS virtualization, 614–615
Bare-metal hypervisors, 587
BATCH scheduling policy, 243
BBR (Bottleneck Bandwidth and RTT) algorithm, 118, 513
bcache technology, 117
BCC (BPF Compiler Collection), 12
disks, 450
documentation, 760–761
installing, 754
networks, 526
one-liners, 757–759
overview, 753–754
system-wide tracing, 136
bench subcommand for perf, 673
Benchmark paradox, 648–649
Benchmarketing, 642
Benchmarking, 641–642
analysis, 644–646
CPUs, 254
effective, 643–644
exercises, 668
failures, 645–651
industry standards, 654–656
memory, 328
micro-benchmarking. See Micro-benchmarking
questions, 667–668
reasons, 642–643
references, 669–670
replay, 654
simulation, 653–654
specials, 650
types, 13, 651–656
Benchmarking methodology
active, 657–660
checklist, 666–667
overview, 656
passive, 656–657
ramping load, 662–664
sanity checks, 664–665
statistical analysis, 665–666
USE method, 661
workload characterization, 662
Berkeley Packet Filter (BPF), 751–752
BCC compiler. See BCC (BPF Compiler Collection)
description, 12–13
extended. See Extended BPF
iterator, 562
kernels, 92
OS virtualization tracing, 620, 624–625, 629
program, 90
Berkeley Software Distribution (BSD), 113
BFQ (Budget Fair Queueing) I/O schedulers, 119, 449
Big kernel lock (BKL) performance bottleneck, 116
Billing in cloud computing, 584
Bimodal performance, 76
Binary executable files, 183
Binary translations in hardware virtualization, 588, 590
CPU, 253, 297–298
NUMA, 353
processor, 181–182
bioerr tool, 487
BCC, 753–755
disks, 450, 468–470
example, 753–754
biosnoop tool
BCC, 755
disks, 470–472
hardware virtualization, 604–605
outliers, 471–472
system-wide tracing, 136
biotop tool
BCC, 755
disks, 450, 473–474
BCC, 755
blame command, 120
Blame-someone-else anti-method, 43
Blanco, Brenden, 753
Blind faith benchmarking, 645
blkio control group, 610, 617
action identifiers, 477
analysis, 478–479
description, 116
disks, 475–479
RWBS description, 477
visualizations, 479
Block-based file systems, 375–376
Block device interface, 109–110, 447
Block I/O state in delay accounting, 145
Block I/O times for disks, 427–428, 472
Block interleaving, 378
FFS, 378
Block stores in cloud computing, 584
Blue-green cloud computing deployments, 3–4
Bonnie and Bonnie++ benchmarking tools
active benchmarking, 657–660
file systems, 412–414
Boolean expressions in bpftrace, 775–776
Boot options, security, 298–299
Borkmann, Daniel, 121
Borrowed virtual time (BVT) schedulers, 595
Bottleneck Bandwidth and RTT (BBR) algorithm, 118, 513
complexity, 6
defined, 22
USE method, 47–50, 245, 324, 450–451
BPF. See Berkeley Packet Filter (BPF)
application internals, 213
block I/O events, 625, 658–659
description, 282
event sources, 558
examples, 284, 761–762
file system internals, 408
hardware virtualization, 602
installing, 762
malloc() bytes flame graph, 346
one-liners for CPUs, 283, 803–804
one-liners for disks, 479–480, 806–807
one-liners for file systems, 402–403, 805–806
one-liners for memory, 343–344, 804–805
one-liners for networks, 550–552, 807–808
page fault flame graphs, 346
programming. See bpftrace tool programming
references, 782
scheduling internals, 284–285
system-wide tracing, 136
tracepoints, 149
user allocation stacks, 345
actions, 769
comments, 767
documentation, 781
example, 766
filters, 769
flow control, 775–777
functions, 770–772, 778–781
Hello, World! program, 770
operators, 776–777
program structure, 767
timing, 772–773
usage, 766–767
variables, 770–771, 777–778
tuning, 571
Branch prediction in instruction pipeline, 224
brk system calls, 95
Broadcast network messages, 503
BSD (Berkeley Software Distribution), 113
btrfs file system, 381–382, 399
btt tool, 478
hash tables, 180
Buddy allocators, 317
Budget Fair Queueing (BFQ) I/O schedulers, 119, 449
Buffer caches, 110, 374
Bufferbloat, 507
applications, 177
block devices, 110, 374
networks, 507
ring, 522
TCP, 520, 569
bufgrow tool, 409
applications, 172
buildid-cache subcommand for perf, 673
Built-in bpftrace variables, 770, 777–778
Bursting in cloud computing, 584, 614–615
Buses, memory, 312–313
BVT (borrowed virtual time) schedulers, 595
tuning, 571
Bytecode, 185
compiled languages, 183
symbols, 214
stacks, 215
c2c subcommand for perf, 673, 702
Cache Allocation Technology (CAT), 118, 596
Cache miss rate, 36
Cache warmth, 222
applications, 176
associativity, 234
block devices, 110, 374
cache line size, 234
coherency, 234–235
CPUs, hardware virtualization, 596
CPUs, OS virtualization, 615–616
defined, 23
dentry, 375
file systems, flushing, 414
file systems, OS virtualization, 613
file systems, overview, 361–363
file systems, tuning, 389, 414–416
file systems, types, 373–375
file systems, usage, 309
inode, 375
methodologies, 35–37
micro-benchmarking test, 390
operating systems, 108–109
page, 315, 374
RAID, 445
tuning, 60
file systems, 399, 658–659
memory, 348
Canary testing, 3
Capacity-based utilization, 34
Capacity of file systems, 371
benchmarking for, 642
cloud computing, 582–584
defined, 4
micro-benchmarking, 70
overview, 69
resource limits, 70–71
CAPI (Coherent Accelerator Processor Interface), 236
Carrier sense multiple access with collision detection (CSMA/CD) algorithm, 516
CAS (column address strobe) latency, 311
bug database systems, 792–793
conclusion, 792
configuration, 786–788
PMCs, 788–789
references, 793
statistics, 784–786
tracing, 790–792
Casual benchmarking, 645
CAT (Cache Allocation Technology), 118, 596
CAT (Intel Cache Allocation Technology), 118, 596
CFQ (completely fair queueing), 115, 449
CFS (completely fair scheduler), 116–117
CPU scheduling, 241
description, 243
description, 116, 118
Linux kernel, 116
memory, 317, 353
OS virtualization, 606, 608–611, 613–620, 630
resource management, 111, 298
statistics, 139, 141, 620–622, 627–628
cgtop tool, 621
Characterizing memory usage, 325–326
Cheating in benchmarking, 650–651
ad hoc checklist method, 43–44
benchmarking, 666
CPUs, 247, 527
disks, 453
file systems, 387
memory, 325
Chip-level multiprocessing (CMP), 220
chrt command, 295
Circular buffers for applications, 177
CISCs (complex instruction set computers), 224
Classes, scheduling
CPUs, 242–243
I/O, 493
kernel, 106, 115
priority, 295
clear function in bpftrace, 780
clear subcommand in trace-cmd, 735
CPUs, 223, 230
operating systems, 99
clone system calls, 94, 100
Cloud computing, 579–580
background, 580–581
comparisons, 634–636
vs. enterprise, 62
exercises, 636–637
hardware virtualization. See Hardware virtualization
lightweight virtualization, 630–633
OS virtualization. See OS virtualization
overview, 14
PMCs, 158
references, 637–639
scalable architecture, 581–582
storage, 584–585
types, 634
Cloud-native databases, 582
Clue-based approach in thread state analysis, 196
Clusters in cloud computing, 586
CMP (chip-level multiprocessing), 220
CNI (container network interface) software, 586
Co-routines in applications, 178
Code changes in cloud computing, 583
Coefficient of variation (CoV), 76
caches, 234–235
models, 63
Coherent Accelerator Processor Interface (CAPI), 236
Collisions
hash, 180
networks, 516
Colors in flame graphs, 291
Column address strobe (CAS) latency, 311
Column quantizations, 82–83
comm variable in bpftrace, 778
Comma-separated values (CSV) format for sar, 165
Common case optimization in applications, 174
Communication in multiprocess vs. multithreading, 228
Community applications, 172–173
Competition, benchmarking, 649
Compiled programming languages
optimizations, 183–184
overview, 183
CPU optimization, 229
options, 295
Completely fair queueing (CFQ), 115, 449
Completely fair scheduler (CFS), 116–117
CPU scheduling, 241
description, 243
Completion target in workload analysis, 39
Complex instruction set computers (CISCs), 224
Complexity, 5
Comprehension in flame graphs, 249
btrfs, 382
disks, 369
ZFS, 381
Compute Unified Device Architecture (CUDA), 240
applications, 177–181
micro-benchmarking, 390, 456
CONFIG_TASK_DELAY_ACCT option, 145
applications, 172
case study, 786–788
Congestion avoidance and control
Linux kernel, 115
networks, 508
TCP, 510, 513
tuning, 570
connect system calls, 95
Connections for networks, 509
backlogs, 507, 519–520, 556–557, 569
characteristics, 527–528
firewalls, 517
latency, 7, 24–25, 505–506, 528
local, 509
monitoring, 529
NICs, 109
QUIC, 515
UDP, 514
Container network interface (CNI) software, 586
lightweight virtualization, 631–632
observability, 617–630
OS virtualization, 605–630
resource controls, 52, 70, 613–617, 626
locks, 198
models, 63
defined, 90
kernels, 93
Contributors to system performance technologies, 811–814
Control groups (cgroups). See cgroups
Control paths in hardware virtualization, 594
Control units in CPUs, 230
caches, 430
disk, 426
micro-benchmarking, 457
network, 501–502, 516
solid-state drives, 440–441
tunable, 494–495
USE method, 49, 451
Controls, resource. See Resource controls
Copy-on-write (COW) file systems, 376
btrfs, 382
ZFS, 380
Copy-on-write (COW) process strategy, 100
CoreLink Interconnects, 236
Cores
defined, 220
Corrupted file system data, 365
count function in bpftrace, 780
Counters, 8–9
fixed, 133–135
hardware, 156–158
CoV (coefficient of variation), 76
COW (copy-on-write) file systems, 376
btrfs, 382
ZFS, 380
COW (copy-on-write) process strategy, 100
CPCs (CPU performance counters), 156
CPI (cycles per instruction), 225
CPU-bound applications, 106
CPU mode for applications, 172
CPU performance counters (CPCs), 156
CPU registers, perf-tools for, 746–747
BCC, 755
case study, 790–791
threads, 278–279
CPUs, 219–220
architecture. See CPUs architecture
binding, 181–182
bpftrace for, 763, 803–804
clock rate, 223
exercises, 299–300
experiments, 293–294
feedback-directed optimization, 122
flame graphs. See Flame graphs
garbage collection, 185
hardware virtualization, 589–592, 596–597
instructions, defined, 220
instructions, IPC, 225
instructions, pipeline, 224
instructions, size, 224
instructions, steps, 223
instructions, width, 224
memory caches, 221–222
methodology. See CPUs methodology
models, 221–222
multiprocess and multithreading, 227–229
observability tools. See CPUs observability tools
OS virtualization, 611, 614, 627, 630
preemption, 227
references, 300–302
saturation, 226–227
schedulers, 105–106
scheduling classes, 115
simultaneous multithreading, 225
subsecond-offset heat maps, 289
terminology, 220
USE method, 49–51, 795–797
utilization, 226
utilization heat maps, 288–289
virtualization support, 588
visualizations, 288–293
CPUs architecture, 221, 229
accelerators, 240–242
associativity, 234
caches, 230–235
GPUs, 240–241
hardware, 230–241
idle threads, 244
interconnects, 235–237
latency, 233–234
memory management units, 235
NUMA grouping, 244
PMCs, 237–239
processors, 230
schedulers, 241–242
scheduling classes, 242–243
software, 241–244
micro-benchmarking, 253–254
overview, 244–245
performance monitoring, 251
profiling, 247–250
sample processing, 247–248
static performance tuning, 252
USE, 245–246
workload characterization, 246–247
CPUs observability tools, 254–255
bpftrace, 282–285
GPUs, 287
hardirqs, 282
miscellaneous, 285–286
mpstat, 259
perf, 267–276
pidstat, 262
profile, 277–278
ps, 260–261
ptime, 263–264
runqlat, 279–280
runqlen, 280–281
sar, 260
softirqs, 281–282
time, 263–264
tlbstat, 266–267
top, 261–262
uptime, 255–258
vmstat, 258
applications, 187–189
benchmarking, 660–661
perf, 200–201
record, 695–696
steps, 247–250
system-wide, 268–270
overview, 294–295
scheduling priority and class, 295
security boot options, 298–299
exclusive, 298
cpusets control group, 610, 614, 627
Crash resilience, multiprocess vs. multithreading, 228
Credit-based schedulers, 595
Critical paths in systemd service manager, 120
CSMA/CD (carrier sense multiple access with collision detection) algorithm, 516
CSV (comma-separated values) format for sar, 165
CUBIC algorithm for TCP congestion control, 513
CUDA (Compute Unified Device Architecture), 240
CUMASK values in MSRs, 238–239
curtask variable for bpftrace, 778
CPUs, 251
memory, 326
Cycles per instruction (CPI), 225
D
Daily patterns, monitoring, 78
Data Center TCP (DCTCP) congestion control, 118, 513
Data deduplication in ZFS, 381
Data integrity in magnetic rotational disks, 438
Data paths in hardware virtualization, 594
Data Plane Development Kit (DPDK), 523
Data rate in throughput, 22
applications, 172
cloud computing, 582
OSI model, 502
UDP, 514
DAX (Direct Access), 118
dbstat tool, 756
dcsnoop tool, 409
dcstat tool, 409
DCTCP (Data Center TCP) congestion control, 118, 513
dd command
disks, 490–491
file systems, 411–412
DDR SDRAM (double data rate synchronous dynamic random-access memory), 313
Deadline I/O schedulers, 243, 448
DEADLINE scheduling policy, 243
Deflated disk I/O, 369
Defragmentation in XFS, 380
Degradation in scalability, 31–32
kernel, 116
overview, 145
ext4, 379
XFS, 380
delete function in bpftrace, 780
memory, 307–308
Dependencies in perf-tools, 748
Development, benchmarking for, 642
Development attribute, multiprocess vs. multithreading, 228
drivers, 109–110, 522
hardware virtualization, 588, 594, 597
Dhrystone benchmark
CPUs, 254
simulations, 653
Differentiated Services Code Points (DSCPs), 509–510
Direct Access (DAX), 118
Direct buses, 313
Direct mapped caches, 234
Direct measurement approach in thread state analysis, 197
Direct-reclaim memory method, 318–319
Directories in file systems, 107
Directory indexes in ext3, 379
Directory name lookup cache (DNLC), 375
caches, 430
magnetic rotational disks, 439
tunable, 494–495
USE method, 451
Disk I/O state in thread state analysis, 194–197
Disks, 423–424
architecture. See Disks architecture
exercises, 495–496
experiments, 490–493
IOPS, 432
methodology. See Disks methodology
non-data-transfer disk commands, 432
observability tools. See Disks observability tools
read/write ratio, 431
references, 496–498
saturation, 434
terminology, 424
tunable, 494
tuning, 493–495
USE method, 451
utilization, 433
visualizations, 487–490
interfaces, 442–443
magnetic rotational disks, 435–439
operating system disk I/O stack, 446–449
persistent memory, 441
solid-state drives, 439–441
vs. application I/O, 435
bpftrace for, 764, 806–807
caching, 430
errors, 483
latency, 428–430, 454–455, 467–472, 482–483
operating system stacks, 446–449
OS virtualization, 613, 616
OS virtualization strategy, 630
random vs. sequential, 430–431
scatter plots, 488
size, 432, 480–481
synchronous vs. asynchronous, 434–435
time measurements, 427–429
wait, 434
micro-benchmarking, 456–457
overview, 449–450
performance monitoring, 452
scaling, 457–458
static performance tuning, 455–456
USE method, 450–451
workload characterization, 452–454
controllers, 426
Disks observability tools, 484–486
biolatency, 468–470
biosnoop, 470–472
biostacks, 474–475
biotop, 473–474
bpftrace, 479–483
iostat, 459–463
iotop, 472–473
MegaCli, 484
miscellaneous, 487
overview, 458–459
perf, 465–468
pidstat, 464–465
PSI, 464
sar, 463–464
Distributed operating systems, 123–124
Distributed tracing, 199
multimodal, 76–77
normal, 75
dmesg tool
CPUs, 245
description, 15
memory, 348
OS virtualization, 619
DNLC (directory name lookup cache), 375
Docker, 607, 620–622
application latency, 385
BCC, 760–761
bpftrace, 781
Ftrace, 748–749
kprobes, 153
perf, 276, 703
PMCs, 158
sar, 165–166
tracepoints, 150–151
uprobes, 155
USDT, 156
scheduling, 244
Xen, 589
Double data rate synchronous dynamic random-access memory (DDR SDRAM), 313
Double-pumped data transfer for CPUs, 237
DPDK (Data Plane Development Kit), 523
DRAM (dynamic random-access memory), 311
overview, 55–56
balloon, 597
device, 109–110, 522
parameterized, 593–595
drsnoop tool
BCC, 756
memory, 342
DSCPs (Differentiated Services Code Points), 509–510
description, 12
Duplicate ACK detection, 512
Duration in RED method, 53
DWARF (debugging with attributed record formats) stack walking, 216, 267, 676, 696
kprobes, 151
overview, 12
Dynamic priority in scheduling classes, 242–243
Dynamic random-access memory (DRAM), 311
Dynamic sizing in cloud computing, 583–584
DTrace, 114
perf, 677–678
tools, 12
DynTicks, 116
E
Early Departure Time (EDT), 119, 524
eBPF. See Extended BPF
EBS (Elastic Block Store), 585
ECC (error-correcting code) for magnetic rotational disks, 438
ECN (Explicit Congestion Notification) field
IP, 508–510
TCP, 513
tuning, 570
EDT (Early Departure Time), 119, 524
EFS (Elastic File System), 585
EKS (Elastic Kubernetes Service), 586
elapsed variable in bpftrace, 777
Elastic Block Store (EBS), 585
Elastic File System (EFS), 585
Elastic Kubernetes Service (EKS), 586
Elevator seeking in magnetic rotational disks, 437–438
ELF (Executable and Linking Format) binaries
description, 183
missing symbols in, 214
eMLC (enterprise multi-level cell) flash memory, 440
Encapsulation for networks, 504
End-to-end network arguments, 507
Enterprise models, 62
Enterprise multi-level cell (eMLC) flash memory, 440
benchmarking, 647
processes, 101–102
Ephemeral drives, 584
Ephemeral ports, 531
epoll system call, 115, 118
EPTs (extended page tables), 593
Erlang virtual machines, 185
Error-correcting code (ECC) for magnetic rotational disks, 438
applications, 193
benchmarking, 647
CPUs, 245–246, 796, 798
disk controllers, 451
I/O, 483, 798
kernels, 798
memory, 324–325, 796, 798
networks, 526–527, 529, 796–797
RED method, 53
storage, 797
USE method overview, 47–48, 51–53
Ethernet congestion avoidance, 508
Event-based concurrency, 178
Event-based tools, 133
Event sources for Wireshark, 559
disks, 454
file systems, 388
Ftrace, 707–708
kprobes, 719–720
methodologies, 57–58
uprobes, 722–723
case study, 789–790
CPUs, 273–274
observability source, 159
selecting, 274–275
synthetic, 731–733
trace, 148
events directory in tracefs, 710
Eviction policies for caching, 36
evlist subcommand for perf, 673
synchronous interrupts, 97
user mode, 93
kernel, 94
processes, 100
BCC, 756
CPUs, 285
static instrumentation, 11–12
tracing, 136
Executable and Linking Format (ELF) binaries
description, 183
missing symbols in, 214
Executable data in process virtual address space, 319
Executable text in process virtual address space, 319
execve system call, 11
exit function in bpftrace, 770, 779
Experimentation-based performance gains, 73–74
CPUs, 293–294
disks, 490–493
file systems, 411–414
networks, 562–567
overview, 13–14
scientific method, 45–46
Experts for applications, 173
Explicit Congestion Notification (ECN) field
IP, 508–510
TCP, 513
tuning, 570
Explicit logical metadata in file systems, 368
Exporters for monitoring, 55, 79, 137
Express Data Path (XDP) technology
description, 118
event sources, 558
ext3 file system, 378–379
features, 379
tuning, 416–418
Extended BPF, 12
BCC, 751–761
bpftrace, 752–753, 761–781, 803–808
description, 118
firewalls, 517
histograms, 744
kernel-mode applications, 92
overview, 121–122
tracing tools, 166
Extended page tables (EPTs), 593
Extent-based file systems, 375–376
Extents, 375–376
btrfs, 382
ext4, 380
F
FaaS (functions as a service), 634
FACK (forward acknowledgments) in TCP, 514
Factor analysis in capacity planning, 71–72
Failures, benchmarking, 645–651
False sharing for hash tables, 181
Families of instance types, 581
Fast File System (FFS)
description, 113
overview, 377–378
Fast retransmits in TCP, 510, 512
Fast user-space mutex (Futex), 115
Fastpath state in Mutex locks, 179
in synchronous interrupts, 97
page faults. See page faults
FC (Fibre Channel) interface, 442–443
Feedback-directed optimization (FDO), 122
FFS (Fast File System)
description, 113
overview, 377–378
Fibre Channel (FC) interface, 442–443
Field-programmable gate arrays (FPGAs), 240–241
FIFO scheduling policy, 243
File descriptor capacity in USE method, 52
File offset pattern, micro-benchmarking for, 390
File stores in cloud computing, 584
File system internals, bpftrace for, 408
architecture. See File systems architecture
bpftrace for, 764, 805–806
caches. See File systems caches
capacity, OS virtualization, 616
capacity, performance issues, 371
exercises, 419–420
experiments, 411–414
hardware virtualization, 597
I/O, logical vs. physical, 368–370
I/O, non-blocking, 366–367
I/O, random vs. sequential, 363–364
interfaces, 361
latency, 362–363
memory-mapped files, 367
metadata, 367–368
methodology. See File systems methodology
micro-benchmark tools, 412–414
models, 361–362
observability tools. See File systems observability tools
OS virtualization, 611–612
paging, 306
prefetch, 364–365
reads, micro-benchmarking for, 61
references, 420–421
special, 371
synchronous writes, 366
tuning, 414–419
types. See File systems types
visualizations, 410–411
caches, 373–375
features, 375–377
VFS, 107, 373
File systems caches, 361–363
flushing, 414
hit ratio, 17
OS virtualization, 616
OS virtualization strategy, 630
tuning, 389
usage, 309
micro-benchmarking, 390–391
overview, 383–384
performance monitoring, 388
static performance tuning, 389
workload characterization, 386–388
workload separation, 389
File systems observability tools
bpftrace, 402–408
fatrace, 395–396
filetop, 398–399
free, 392–393
miscellaneous, 409–410
mount, 392
opensnoop, 397
overview, 391–392
sar, 393–394
slabtop, 394–395
strace, 395
top, 393
vmstat, 393
btrfs, 381–382
ext3, 378–379
ext4, 379
FFS, 377–378
XFS, 379–380
ZFS, 380–381
bpftrace, 769, 776
event, 693–694
kprobes, 721–722
PID, 729–730
tracepoints, 717–718
uprobes, 723
disks, 493
file systems, 413–414
Firewalls, 503
misconfigured, 505
overview, 517
tuning, 574
Five Whys in drill-down analysis, 56
automated, 201
characteristics, 290–291
colors, 291
CPU profiling, 10–11, 187–188, 278, 660–661
generating, 249, 270–272
interactivity, 291
interpretation, 291–292
missing stacks, 215
overview, 289–290
page faults, 340–342, 346
perf, 119
performance wins, 250
profiles, 278
sample processing, 249–250
scripts, 700
FlameScope tool, 292–293, 700
Flash-memory-based SSDs, 439–440
Flash translation layer (FTL) in solid-state drives, 440–441
Flent (FLExible Network Tester) tool, 567
disks, 493
file systems, 413–414
FLExible Network Tester (Flent) tool, 567
Floating point events in perf, 680
Floating-point operations per second (FLOPS) in benchmarking, 655
Flow control in bpftrace, 775–777
fork system calls, 94, 100
Format string for tracepoints, 148–149
Forward acknowledgments (FACK) in TCP, 514
FPGAs (field-programmable gate arrays), 240–241
FFS, 377
file systems, 364
memory, 321
packets, 505
reducing, 380
defined, 500
networks, 515
OSI model, 502
Free memory lists, 315–318
description, 15
file systems, 392–393
memory, 348
OS virtualization, 619
jails, 606
jemalloc, 322
kernel, 113
TSA analysis, 217
performance vs. Linux, 124
TCP LRO, 523
Frequency sampling for hardware events, 682–683
Front-ends in instruction pipeline, 224
fsrwstat tool, 409
FTL (flash translation layer) in solid-state drives, 440–441
ftrace subcommand for perf, 673
Ftrace, 13, 705–706
capabilities overview, 706–708
description, 166
documentation, 748–749
hist triggers, 727–733
hwlat, 726
kprobes, 719–722
options, 716
OS virtualization, 629
perf, 741
references, 749
tracepoints, 717–718
tracing, 136
uprobes, 722–723
Full I/O distributions disk latency, 454
Full stack in systems performance, 1
Fully associative caches, 234
Fully-preemptible kernels, 110, 114
func variable in bpftrace, 778
BCC, 756–758
example, 747
Ftrace, 706–707
BCC, 757
description, 708
options, 725
function_profile_enabled file, 710
Ftrace, 707, 711–712
observability source, 159
Function tracer. See Ftrace tool
profiling, 248
Functional block diagrams in USE method, 49–50
Functional units in CPUs, 223
Functions as a service (FaaS), 634
Functions in bpftrace, 770, 778–781
Futex (fast user-space mutex), 115
futex system calls, 95
G
Garbage collection, 185–186
optimizations, 183–184
PGO kernels, 122
Generic segmentation offload (GSO) in networks, 520–521
Generic system performance methodologies, 40–41
github.com tool package, 132
GKE (Google Kubernetes Engine), 586
syscalls, 92
Good/fast/cheap trade-offs, 26–27
Google Kubernetes Engine (GKE), 586
Goroutines for applications, 178
gprof tool, 135
Grafana, 8–9, 138
Graphics processing units (GPUs)
tools, 287
GRO (Generic Receive Offload), 119
heap, 320
memory, 185, 316, 327
GSO (generic segmentation offload) in networks, 520–521
hardware virtualization, 590–593, 596–605
lightweight virtualization, 632–633
OS virtualization, 617, 627–629
H
Hard disk drives (HDDs), 435–439
memory, 311–315
networks, 515–517
threads, 220
tracing, 276
Hardware-assisted virtualization, 590
Hardware counters. See Performance monitoring counters (PMCs)
CPUs, 273–274
perf, 680–683
selecting, 274–275
Hardware instances in cloud computing, 580
Hardware latency detector (hwlat), 708, 726
Hardware RAID, 444
Hardware resources in capacity planning, 70
comparisons, 634–636
I/O, 593–595
implementation, 588–589
multi-tenant contention, 595
observability, 597–605
overhead, 589–595
overview, 587–588
Hash fields in hist triggers, 728
Hash tables in applications, 180–181
HDDs (hard disk drives), 435–439
hdparm tool, 491–492
Head-based sampling in distributed tracing, 199
Heads in magnetic rotational disks, 436
description, 304
growth, 320
process virtual address space, 319
CPU utilization, 288–289
disk utilization, 490
file systems, 410–411
overview, 82–83
Hello, World! program, 770
hist function in bpftrace, 780
Hist triggers
fields, 728–729
modifiers, 729
stack trace keys, 730–731
usage, 727
Histogram, 76–77
Horizontal pod autoscalers (HPAs), 73
Horizontal scaling and scalability
cloud computing, 581–582
Hosts
applications, 172
cloud computing, 580
hardware virtualization, 597–603
lightweight virtualization, 632
OS virtualization, 617, 619–627
Hot/cold flame graphs, 191
Hourly patterns, monitoring, 78
HPAs (horizontal pod autoscalers), 73
HT (HyperTransport) for CPUs, 236
Hue in flame graphs, 291
Huge pages, 115–116, 314, 352–353
hwlat (hardware latency detector), 708, 726
Hybrid clouds, 580
Hyper-Threading Technology, 225
Hyper-V, 589
Hypercalls in paravirtualization, 588
Hyperthreading-aware scheduling classes, 243
HyperTransport (HT) for CPUs, 236
cloud computing, 580
hardware virtualization, 587–588
kernels, 93
Hypothesis step in scientific method, 44–45
I
I/O. See Input/output (I/O)
IaaS (infrastructure as a service), 580
Icicle graphs, 250
icstat tool, 409
IDDs (isolated driver domains), 596
Identification in drill-down analysis, 55
Idle memory, 315
Idle scheduling class, 243
IDLE scheduling policy, 243
Idle state in thread state analysis, 194, 196–197
Idle threads, 99, 244
If statements, 776
ifpps tool, 561
iftop tool, 562
Implicit logical metadata, 368
Inactive pages in page caches, 318
Incast problem in networks, 524
caches, 375
VFS, 373
Individual synchronous writes, 366
Industry benchmarking, 60–61
Industry standards for benchmarking, 654–655
Infrastructure as a service (IaaS), 580
inject subcommand for perf, 673
caches, 375
VFS, 373
solid-state drive controllers, 440
hardware virtualization, 593–595, 597
I/O-bound applications, 106
latency, 424
merging, 448
non-blocking, 181, 366–367
OS virtualization, 611–612, 616–617
random vs. sequential, 363–364
schedulers, 448
scheduling, 115–116
size, applications, 176
size, micro-benchmarking, 390
stacks, 107–108, 372
USE method, 798
Input/output operations per second. See IOPS (input/output operations per second)
bpftrace, 210–212
perf, 202–203
BCC, 754
bpftrace, 762
instances directory in tracefs, 710
description, 14
types, 580
Instruction pointer for threads, 100
defined, 220
IPC, 225
pipeline, 224
size, 224
steps, 223
text, 304
width, 224
Instructions per cycle (IPC), 225, 251, 326
Integrated caches, 232
Intel Cache Allocation Technology (CAT), 118, 596
Intel processor cache sizes, 230–231
Intel VTune Amplifier XE tool, 135
Intelligent Platform Management Interface (IPMI), 98–99
Intelligent prefetch in ZFS, 381
Inter-processor interrupts (IPIs), 110
Inter-stack latency in networks, 529
Interactivity in flame graphs, 291
buses, 313
CPUs, 235–237
USE method, 49–51
defined, 500
device drivers, 109–110
disks, 442–443
file systems, 361
kprobes, 153
network, 109, 501
network negotiation, 508
PMCs, 157–158
scheduling in NAPI, 522
tracepoints, 149–150
uprobes, 154–155
Interleaving in FFS, 378
congestion avoidance, 508
overview, 509–510
sockets, 509
Interpretation of flame graphs, 291–292
Interpreted programming languages, 184–185
Interrupt coalescing mode for networks, 522
Interrupt service requests (IRQs), 96–97
Interrupt service routines (ISRs), 96
asynchronous, 96–97
defined, 91
hardware, 282
masking, 98–99
network latency, 529
overview, 96
soft, 281–282
synchronous, 97
threads, 97–98
interval probes in bpftrace, 774
Interval statistics, stat for, 693
IO accounting, 116
io_uring_enter command, 181
io_uring interface, 119
ioctl system calls, 95
ionice tool, 493–494
ioping tool, 492
IOPS (input/output operations per second)
defined, 22
description, 7
disks, 429, 431–432
networks, 527–529
iosched tool, 487
iosnoop tool, 743
bonnie++ tool, 658
description, 15
disks, 450, 459–463
memory, 348
options, 460
OS virtualization, 619, 627
iotop tool, 450, 472–473
congestion avoidance, 508
overview, 509–510
sockets, 509
IPC (instructions per cycle), 225, 251, 326
ipecn tool, 561
example, 13–14
network micro-benchmarking, 10
network throughput, 564–565
IPIs (inter-processor interrupts), 110
IPMI (Intelligent Platform Management Interface), 98–99
IRQs (interrupt service requests), 96–97
irqsoff tracer, 708
Isolated driver domains (IDDs), 596
Isolation in OS virtualization, 629
ISRs (interrupt service routines), 96
J
Java
analysis, 29
case study, 783–792
flame graphs, 201, 271
garbage collection, 185–186
Java Flight Recorder, 135
stack traces, 215
symbols, 214
uprobes, 213
virtual machines, 185
Java Flight Recorder (JFR), 135
JavaScript Object Notation (JSON) format, 163–164
JBOD (just a bunch of disks), 443
jemalloc allocator, 322
JFR (Java Flight Recorder), 135
JIT (just-in-time) compilation
Linux kernel, 117
PGO kernels, 122
Jitter in operating systems, 99
jmaps tool, 214
Journaling
btrfs, 382
ext3, 378–379
file systems, 376
XFS, 380
JSON (JavaScript Object Notation) format, 163–164
Jumbo frames
packets, 505
tuning, 574
Just a bunch of disks (JBOD), 443
Just-in-time (JIT) compilation
Linux kernel, 117
PGO kernels, 122
K
KCM (Kernel Connection Multiplexor), 118
Keep-alive strategy in networks, 507
Kendall’s notation for queueing systems, 67–68
Kernel-based Virtual Machine (KVM) technology
CPU quotas, 595
description, 589
Linux kernel, 116
observability, 600–603
Kernel bypass for networks, 523
Kernel Connection Multiplexor (KCM), 118
Kernel mode, 93
Kernel page table isolation (KPTI) patches, 121
Kernel space, 90
Kernel state in thread state analysis, 194–197
Kernel statistics (Kstat) framework, 159–160
CPUs, 226
bpftrace for, 765
BSD, 113
comparisons, 124
defined, 90
developments, 115–120
execution, 92–93
file systems, 107
filtering in OS virtualization, 629
Linux, 114–122, 124
monolithic, 123
overview, 91–92
PGO, 122
PMU events, 680
preemption, 110
schedulers, 105–106
Solaris, 114
stacks, 103
system calls, 94–95
unikernels, 123
Unix, 112
USE method, 798
user modes, 93–94
versions, 111–112
KernelShark software, 83–84, 739–740
kfunc probes, 774
killsnoop tool
BCC, 756
kmem subcommand for perf, 673, 702
Knee points
models, 62–64
scalability, 31
kprobes, 685–686
arguments, 686–687, 720–721
filters, 721–722
overview, 151–153
profiling, 722
return values, 721
triggers, 721–722
KPTI (kernel page table isolation) patches, 121
kretfunc probes, 774
kretprobes, 152–153, 774
kstack function in bpftrace, 779
kstack variable in bpftrace, 778
Kstat (kernel statistics) framework, 159–160
ksym function, 779
node, 608
OS virtualization, 620–621
KVM. See Kernel-based Virtual Machine (KVM) technology
kvm subcommand for perf, 673, 702
Kyber multi-queue schedulers, 449
L
Label selectors in cloud computing, 586
Language virtual machines, 185
Large Receive Offload (LRO), 116
Large segment offload for packet size, 505
Latency
analysis methodologies, 56–57
applications, 173
biolatency, 468–470
CPUs, 233–234
defined, 22
disk I/O, 428–430, 454–455, 467–472, 482–483
distributions, 76–77
file systems, 362–363, 384–386, 388
hardware, 118
hardware virtualization, 604
interrupts, 98
memory, 311, 441
methodologies, 24–25
networks, connections, 7, 24–25, 505–506, 528
outliers, 58, 186, 424, 471–472
overview, 6–7
packets, 532–533
percentiles, 413–414
perf, 467–468
scatter plots, 81–82, 488
scheduler, 226, 272–273
solid-state drives, 441
ticks, 99
transaction costs analysis, 385–386
VFS, 406–408
LatencyTOP tool for file systems, 396
latencytop tool for operating systems, 116
LBR (last branch record), 216, 676, 696
Leak detection for memory, 326–327
Least frequently used (LFU) caching algorithm, 36
Least recently used (LRU) caching algorithm, 36
data, 232
instructions, 232
memory, 314
embedded, 232
memory, 314
LLC, 232
memory, 314
Level of appropriateness in methodologies, 28–29
LFU (least frequently used) caching algorithm, 36
lhist function, 780
libpcap library as observability source, 159
Life cycle for processes, 100–101
network connections, 507
solid-state drives, 441
Lightweight threads, 178
comparisons, 634–636
implementation, 631–632
observability, 632–633
overhead, 632
overview, 630
Limit investigations, benchmarking for, 642
Limitations of averages, 75
Limits for OS virtualization resources, 613
disks, 487–488
working with, 80–81
methodologies, 32
models, 63
Link aggregation tuning, 574
Link-time optimization (LTO), 122
Linux 60-second analysis, 15–16
extended BPF, 121–122
kernel developments, 115–120
KPTI patches, 121
observability sources, 138–146
observability tools, 130
operating system disk I/O stack, 447–448
overview, 114–115
static performance tools, 130–131
systemd service manager, 120
thread state analysis, 195–197
linux-tools-common linux-tools tool package, 132
perf, 673
Listen backlogs in networks, 519
listen subcommand in trace-cmd, 735
perf, 674–675
llcstat tool
BCC, 756
CPUs, 285
Load averages for uptime, 255–257
schedulers, 241
micro-benchmarking, 61
Load vs. architecture in methodologies, 30–31
Local network connections, 509
Localhost network connections, 509
Lock state in thread state analysis, 194–197
lock subcommand for perf, 673, 702
Locks
analysis, 198
applications, 179–181
tracing, 212–213
applications, 172
ZFS, 381
defined, 220
hardware threads, 221
Logical metadata in file systems, 368
Logical operations in file systems, 361
LRO (Large Receive Offload), 116
LRU (least recently used) caching algorithm, 36
lsof tool, 561
LTO (link-time optimization), 122
LTTng tool, 166
M
madvise system call, 367, 415–416
Magnetic rotational disks, 435–439
caching, 37–39
defined, 90, 304
latency, 26
managing, 104–105
overview, 311–312
malloc() bytes flame graphs, 346
Map functions in bpftrace, 771–772, 780–781
Map variables in bpftrace, 771
Mapping memory. See Memory mappings
Marketing, benchmarking for, 642
Markovian arrivals in queueing systems, 68–69
Masking interrupts, 98–99
Maximum controller operation rate, 457
Maximum controller throughput, 457
Maximum disk operation rate, 457
Maximum disk random reads, 457
magnetic rotational disks, 436–437
micro-benchmarking, 457
Maximum transmission unit (MTU) size for packets, 504–505
MCS locks, 117
Mean, 74
“A Measure of Transaction Processing Power,” 655
MegaCli tool, 484
Melo, Arnaldo Carvalho de, 671
Meltdown vulnerability, 121
meminfo tool, 142
BCC, 756
memory, 348
Memory, 303–304
allocators, 309, 353
architecture. See Memory architecture
bpftrace for, 763–764, 804–805
CPU caches, 221–222
exercises, 354–355
file system cache usage, 309
garbage collection, 185
hardware virtualization, 596–597
internals, 346–347
methodology. See Memory methodology
multiprocess vs. multithreading, 228
NUMA binding, 353
observability tools. See Memory observability tools
OS virtualization, 611, 613, 615–616
OS virtualization strategy, 630
overcommit, 308
overprovisioning in solid-state drives, 441
paging, 306–307
persistent, 441
references, 355–357
shared, 310
terminology, 304
tuning, 350–354
USE method, 49–51, 796–798
utilization and saturation, 309
virtual, 90, 104–105, 304–305
working set size, 310
Memory architecture, 311
buses, 312–313
CPU caches, 314
hardware, 311–315
latency, 311
main memory, 311–312
MMU, 314
process virtual address space, 319–322
software, 315–322
TLB, 314
memory control group, 610, 616
Memory management units (MMUs), 235, 314
displaying, 337–338
files, 367
hardware virtualization, 592–593
kernel, 94
micro-benchmarking, 390
OS virtualization, 611
micro-benchmarking, 328
overview, 323
performance monitoring, 326
static performance tuning, 327–328
usage characterization, 325–326
USE method, 324–325
bpftrace, 343–347
drsnoop, 342
miscellaneous, 347–350
numastat, 334–335
overview, 328–329
perf, 338–342
pmap, 337–338
ps, 335–336
PSI, 330–331
sar, 331–333
slabtop, 333–334
swapon, 331
top, 336–337
vmstat, 329–330
wss, 342–343
Memory reclaim state in delay accounting, 145
Metadata
ext3, 378
file systems, 367–368
Method R, 57
Methodologies, 21–22
ad hoc checklist method, 43–44
anti-methods, 42–43
applications. See Applications methodology
benchmarking. See Benchmarking methodology
caching, 35–37
CPUs. See CPUs methodology
disks. See Disks methodology
exercises, 85–86
file systems. See File systems methodology
general, 40–41
level of appropriateness, 28–29
Linux 60-second analysis checklist, 15–16
load vs. architecture, 30–31
memory. See Memory methodology
Method R, 57
metrics, 32–33
micro-benchmarking, 60–61
modeling. See Methodologies modeling
models, 23–24
monitoring, 77–79
networks. See Networks methodology
performance, 41–42
performance mantras, 61
perspectives, 37–40
point-in-time recommendations, 29–30
profiling, 35
RED method, 53
references, 86–87
saturation, 34–35
scalability, 31–32
scientific method, 44–46
static performance tuning, 59–60
statistics, 73–77
terminology, 22–23
trade-offs, 26–27
USE method, 47–53
utilization, 33–34
visualizations. See Methodologies visualizations
Amdahl’s Law of Scalability, 64–65
enterprise vs. cloud, 62
Universal Scalability Law, 65–66
visual identification, 62–64
Methodologies visualizations, 79
scatter plots, 81–82
surface plots, 84–85
tools, 85
Metrics, 8–9
applications, 172
methodologies, 32–33
observability tools, 167–168
USE method, 48–51
MFU (most frequently used) caching algorithm, 36
Micro-benchmarking
CPUs, 253–254
description, 13
disks, 456–457, 491–492
file systems, 390–391, 412–414
memory, 328
methodologies, 60–61
networks, 533
overview, 651–652
cloud computing, 583–584
USE method, 53
Midpath state for Mutex locks, 179
Migration types for free lists, 317
MINIX operating system, 114
MIPS (millions of instructions per second) in benchmarking, 655
Missing stacks, 215–216
Missing symbols, 214
Mixed-mode flame graphs, 187
MLC (multi-level cell) flash memory, 440
description, 95
mmapsnoop tool, 348
MMUs (memory management units), 235, 314
mnt control group, 609
defined, 90
kernels, 93
Model-specific registers (MSRs)
CPUs, 238
observability source, 159
Amdahl’s Law of Scalability, 64–65
CPUs, 221–222
disks, 425–426
enterprise vs. cloud, 62
file systems, 361–362
methodologies, 23–24
networks, 501–502
overview, 62
Universal Scalability Law, 65–66
visual identification, 62–64
Modular I/O scheduling, 116
Monitoring, 77–79
CPUs, 251
disks, 452
file systems, 388
memory, 326
networks, 529, 537
observability tools, 137–138
products, 79
sar, 161–162
Most frequently used (MFU) caching algorithm, 36
Most recently used (MRU) caching algorithm, 36
Mount points in file systems, 106
file systems, 392
options, 416–417
Mounting file systems, 106, 392
mpstat tool
case study, 785–786
CPUs, 245, 259
description, 15
lightweight virtualization, 633
OS virtualization, 619
mq-deadline multi-queue schedulers, 449
MR-IOV (multiroot I/O virtualization), 593–594
MRU (most recently used) caching algorithm, 36
MSRs (model-specific registers)
CPUs, 238
observability source, 159
mtr tool, 567
Multi-level cell (MLC) flash memory, 440
description, 119
operating system disk I/O stack, 449
Multiblock allocators in ext4, 379
Multicalls in paravirtualization, 588
Multicast network transmissions, 503
Multichannel memory buses, 313
Multics (Multiplexed Information and Computer Services) operating system, 112
Multimodal distributions, 76–77
Multiple causes as performance challenge, 6
Multiple performance issues, 6
Multiple prefetch streams in ZFS, 381
Multiple-zone disk recording, 437
Multiplexed Information and Computer Services (Multics) operating system, 112
applications, 177–181
overview, 110
Multiqueue I/O schedulers, 119
Multiroot I/O virtualization (MR-IOV), 593–594
Multitenancy in cloud computing, 580
contention in hardware virtualization, 595
contention in OS virtualization, 612–613
applications, 177–181
CPUs, 227–229
SMT, 225
Mutex (MUTually EXclusive) locks
applications, 179–180
contention, 198
tracing, 212–213
USE method, 52
CPU flame graph, 187–188
CPU profiling, 200, 203, 269–270, 277, 283–284, 697–700
disk I/O tracing, 466–467, 470–471, 488
file tracing, 397–398, 401–402
memory allocation, 345
Off-CPU analysis, 204–205, 275–276
Off-CPU Time flame graphs, 190–192
page fault sampling, 339–341
scheduler latency, 272, 279–280
shards, 582
stack traces, 215
working set size, 342
N
Nagle algorithm for TCP congestion control, 513
Name resolution latency, 505, 528
Namespaces in OS virtualization, 606–609, 620, 623–624
NAS (network-attached storage), 446
Native Command Queueing (NCQ), 437
Native hypervisors, 587
Negative caching in Dcache, 375
Nested page tables (NPTs), 593
Net I/O state in thread state analysis, 194–197
description, 562
socket information, 142
Netfilter conntrack as observability source, 159
Netflix cloud performance team, 2–3
netlink observability tools, 145–146, 536
Network-attached storage (NAS), 446
Network interface cards (NICs)
description, 501–502
network connections, 109
sent and received packets, 522
Networks, 499–500
architecture. See Networks architecture
bpftrace for, 764–765, 807–808
buffers, 27, 507
congestion avoidance, 508
connection backlogs, 507
controllers, 501–502
encapsulation, 504
exercises, 574–575
experiments, 562–567
hardware virtualization, 597
interface negotiation, 508
interfaces, 501
latency, 505–507
local connections, 509
methodology. See Networks methodology
micro-benchmarking for, 61
models, 501–502
observability tools. See Networks observability tools
operating systems, 109
OS virtualization, 611–613, 617, 630
protocol stacks, 502
protocols, 504
references, 575–578
routing, 503
sniffing, 159
stacks, 518–519
terminology, 500
throughput, 527–529
USE method, 49–51, 796–797
utilization, 508–509
hardware, 515–517
protocols, 509–515
software, 517–524
micro-benchmarking, 533
overview, 524–525
performance monitoring, 529
static performance tuning, 531–532
USE method, 526–527
workload characterization, 527–528
bpftrace, 550–558
ethtool, 546–547
ifconfig, 537–538
ip, 536–537
miscellaneous, 560–562
nicstat, 545–546
nstat, 538–539
overview, 533–534
sar, 543–545
ss, 534–536
tcpdump, 558–559
tcpretrans, 549–550
Wireshark, 560
configuration, 574
system-wide, 567–572
New Vegas (NV) congestion control algorithm, 118
BCC, 756
file systems, 399
nfsstat tool, 561
NFU (not frequently used) caching algorithm, 36
nice command
CPU priorities, 252
resource management, 111
scheduling priorities, 295
NICs (network interface cards)
description, 501–502
network connections, 109
sent and received packets, 522
nicstat tool, 132, 525, 545–546
“A Nine Year Study of File System and Storage Benchmarking,” 643
Nitro hardware virtualization
description, 589
NMIs (non-maskable interrupts), 98
Node taints in cloud computing, 586
event-based concurrency, 178
non-blocking I/O, 181
symbols, 214
Nodes
main memory, 312
Noisy neighbors
OS virtualization, 617
applications, 181
file systems, 366–367
Non-data-transfer disk commands, 432
Non-idle time, 34
Non-maskable interrupts (NMIs), 98
benchmarking for, 642
software change case study, 18
Non-uniform memory access (NUMA)
CPUs, 244
main memory, 312
multiprocessors, 110
Non-uniform random distributions, 413
Non-Volatile Memory express (NVMe) interface, 443
nop tracer, 708
Normal distribution, 75
NORMAL scheduling policy, 243
Not frequently used (NFU) caching algorithm, 36
NPTs (nested page tables), 593
nsecs variable in bpftrace, 777
nsenter command, 624
nstat tool, 134, 525, 538–539
ntop function, 779
NUMA. See Non-uniform memory access (NUMA)
numactl command, 298, 353
numastat tool, 334–335
Number of service centers in queueing systems, 67
NV (New Vegas) congestion control algorithm, 118
O
O(1) scheduling class, 243
Object stores in cloud computing, 584
allocators, 321
applications, 174
benchmarks, 643
counters, statistics, and metrics, 8–9
hardware virtualization, 597–605
operating systems, 111
OS virtualization. See OS virtualization observability
overview, 7–8
profiling, 10–11
RAID, 445
tracing, 11–12
Observability tools, 129
applications. See Applications observability tools
coverage, 130
CPUs. See CPUs observability tools
crisis, 131–133
disks. See Disks observability tools
evaluating results, 167–168
exercises, 168
file system. See File systems observability tools
memory. See Memory observability tools
monitoring, 137–138
network. See Networks observability tools
profiling, 135
references, 168–169
sar, 160–166
static performance, 130–131
tracing, 136, 166
types, 133
Observability tools sources, 138–140
delay accounting, 145
kprobes, 151–153
miscellaneous, 159–160
/proc file system, 140–143
/sys file system, 143–144
tracepoints, 146–151
uprobes, 153–155
USDT, 155–156
Observation-based performance gains, 73
Observational tests in scientific method, 44–45
Observer effect in metrics, 33
footprints, 188–189
time flame graphs, 205
BCC, 756
description, 285
networks, 561
stack traces, 204–205
time flame graphs, 205
Offset heat maps, 289, 489–490
On-die caches, 231
On-disk caches, 425–426, 430, 437
Online defragmentation, 380
OOM killer (out-of-memory killer), 316–317, 324
OOM (out of memory), defined, 304
oomkill tool
BCC, 756
description, 348
description, 94
non-blocking I/O, 181
BCC, 756
file systems, 397
Operating systems, 89
additional reading, 127–128
caching, 108–109
clocks and idle, 99
defined, 90
device drivers, 109–110
distributed, 123–124
exercises, 124–125
file systems, 106–108
interrupts, 96–99
jitter, 99
kernels, 91–95, 111–114, 124
Linux. See Linux operating system
multiprocessors, 110
networking, 109
observability, 111
PGO kernels, 122
preemption, 110
processes, 99–102
references, 125–127
resource management, 110–111
schedulers, 105–106
stacks, 102–103
system calls, 94–95
terminology, 90–91
unikernels, 123
virtual memory, 104–105
virtualization. See OS virtualization
defined, 22
file systems, 387–388
applications, 172
file systems, 370–371
micro-benchmarking, 390
Operators for bpftrace, 776–777
Optimistic spinning in Mutex locks, 179
applications, 174
compiler, 183–184, 229
networks, 524
Orchestration in cloud computing, 586
OS instances in cloud computing, 580
comparisons, 634–636
control groups, 609–610
implementation, 607–610
namespaces, 606–609
overhead, 610–613
overview, 605–607
OS virtualization observability
containers, 620–621
guests, 627–629
hosts, 619–627
namespaces, 623–624
overview, 617–618
strategy, 629–630
tracing tools, 629
OSI model, 502
Out-of-memory killer (OOM killer), 316–317, 324
Out of memory (OOM), defined, 304
Outliers
latency, 186, 424, 471–472
normal distributions, 77
Output formats in sar, 163–165
Output with solid-state drive controllers, 440
Overcommit strategy, 115
Overcommitted main memory, 305, 308
PMCs, 157–158
hardware virtualization, 589–595
kprobes, 153
lightweight virtualization, 632
metrics, 33
multiprocess vs. multithreading, 228
OS virtualization, 610–613
strace, 207
ticks, 99
tracepoints, 150
uprobes, 154–155
Overlayfs file system, 118
Overprovisioning cloud computing, 583
Oversize arenas, 322
P
Pacing in networks, 524
defined, 500
latency, 532–533
networks, 504
OSI model, 502
size, 504–505
sniffing, 530–531
throttling, 522
Padding locks for hash tables, 181
file systems, 374
memory, 315
defined, 304
flame graphs, 340–342, 346
sampling, 339–340
daemons, 317
working with, 306
Page scanning, 318–319, 323, 374
Paged virtual memory, 113
Pages
defined, 304
kernel, 115
sizes, 352–353
anonymous, 305–307
demand, 307–308
file system, 306
memory, 104–105
overview, 306
PAPI (performance application programming interface), 158
Parallelism in applications, 177–181
Paravirtualization (PV), 588, 590
Paravirtualized I/O drivers, 593–595
Parity in RAID, 445
Passive benchmarking, 656–657
Passive listening in three-way handshakes, 511
Pathologies in solid-state drives, 441
Patrol reads in RAID, 445
Pause frames in congestion avoidance, 508
pchar tool, 564
PCI pass-through in hardware virtualization, 593
PCP (Performance Co-Pilot), 138
PE (Portable Executable) format, 183
PEBS (precise event-based sampling), 158
Per-interval I/O averages latency values, 454
Per-interval statistics with stat, 693
Per-process observability tools, 133
/proc file system, 140–141
profiling, 135
tracing, 136
description, 75
latency, 413–414
perf_event control group, 610
case study, 789–790
CPU flame graphs, 201
CPU profiling, 200–201, 245, 268–270
description, 116
disk block devices, 465–467
disk I/O, 450, 467–468
documentation, 276
flame graphs, 119, 270–272
hardware virtualization, 601–602, 604
memory, 324
networks, 526, 562
one-liners for counting events, 675
one-liners for dynamic tracing, 677–678
one-liners for listing events, 674–675
one-liners for memory, 338–339
one-liners for profiling, 675–676
one-liners for reporting, 678–679
one-liners for static tracing, 676–677
OS virtualization, 619, 629
overview, 671–672
page fault flame graphs, 340–342
page fault sampling, 339–340
PMCs, 157, 273–274
subcommands. See perf tool subcommands
tools collection. See perf-tools collection
tracepoint events, 684–685
tracepoints, 147, 149
tracing, 136, 166
hardware, 274–275, 680–683
kprobes, 685–687
overview, 679–681
software, 683–684
uprobes, 687–689
documentation, 703
ftrace, 741
miscellaneous, 702–703
overview, 672–674
record, 694–696
report, 696–698
script, 698–701
stat, 691–694
trace, 701–702
coverage, 742
documentation, 748
example, 747
one-liners, 745–747
overview, 741–742
perf-tools-unstable tool package, 132
Performance and performance monitoring
applications, 172
challenges, 5–6
cloud computing, 14, 586
CPUs, 251
disks, 452
file systems, 388
memory, 326
networks, 529
OS virtualization, 620
resource analysis investments, 38
Performance application programming interface (PAPI), 158
Performance Co-Pilot (PCP), 138
Performance engineers, 2–3
Performance instrumentation counters (PICs), 156
applications, 182
list of, 61
Performance monitoring counters (PMCs), 156
case study, 788–789
challenges, 158
CPUs, 237–239, 273–274
documentation, 158
example, 156–157
interface, 157–158
memory, 326
Performance monitoring unit (PMU) events, 156, 680
Periods in OS virtualization, 615
Persistent memory, 441
Personalities in FileBench, 414
Perspectives
overview, 4–5
performance analysis, 37–38
Perturbations
benchmarks, 648
system tests, 23
pfm-events, 681
PGO (profile-guided optimization) kernels, 122
Physical metadata in file systems, 368
Physical operations in file systems, 361
Physical resources in USE method, 795–798
PICs (performance instrumentation counters), 156
pids control group, 610
filters, 729–730
process environment, 101
CPUs, 245, 262
description, 15
disks, 464–465
OS virtualization, 619
pktgen tool, 567
Platters in magnetic rotational disks, 435–436
Plugins for monitoring software, 137
pmap tool, 135, 337–338
CPUs, 265–266
memory, 348
PMCs. See Performance monitoring counters (PMCs)
pmheld tool, 212–213
PMU (performance monitoring unit) events, 156, 680
Pods in cloud computing, 586
Point-in-time recommendations in methodologies, 29–30
Policies for scheduling classes, 106, 242–243
poll system call, 177
Polling applications, 177
btrfs, 382
overview, 382–383
ZFS, 380
Portability of benchmarks, 643
Portable Executable (PE) format, 183
Ports
ephemeral, 531
network, 501
Power states in processors, 297
Precise event-based sampling (PEBS), 158
Prediction step in scientific method, 44–45
CPUs, 227
Linux kernel, 116
operating systems, 110
schedulers, 241
Prefetch caches, 230
Prefetch for file systems
overview, 364–365
ZFS, 381
Presentability of benchmarks, 643
Pressure stall information (PSI)
CPUs, 257–258
description, 119
disks, 464
memory, 323, 330–331
applications, 173
benchmarking for, 643
CPUs, 227, 252–253
OS virtualization resources, 613
schedulers, 105–106
scheduling classes, 242–243, 295
Priority inheritance scheme, 227
Priority pause frames in congestion avoidance, 508
Private clouds, 580
Privilege rings in kernels, 93
probe subcommand for perf, 673
probe variable in bpftrace, 778
bpftrace, 767–768, 774–775
kprobes, 685–687
perf, 685
uprobes, 687–689
USDT, 690–691
wildcards, 768–769
case study, 16, 783–784
determining, 44
/proc file system observability tools, 140–143
Process-context IDs (PCIDs), 119
filters, 729–730
process environment, 101
Processes
accounting, 159
creating, 100
defined, 90
environment, 101–102
life cycle, 100–101
overview, 99–100
profiling, 271–272
schedulers, 105–106
swapping, 104–105, 308–309
tracing, 207–208
USE method, 52
virtual address space, 319–322
binding, 181–182
defined, 90, 220
tuning, 299
Products, monitoring, 79
Profile-guided optimization (PGO) kernels, 122
applications, 203–204
BCC, 756
CPUs, 245, 277–278
profiling, 135
Ftrace, 707
I/O, 203–204, 210–212
interpretation, 249–250
kprobes, 722
methodologies, 35
observability tools, 135
overview, 10–11
perf, 675–676
uprobes, 723
bpftrace. See bpftrace tool programming
compiled, 183–184
garbage collection, 185–186
interpreted, 184–185
overview, 182–183
virtual machines, 185
Prometheus monitoring software, 138
benchmarking for, 642
testing, 3
Proportional set size (PSS) in shared memory, 310
Protection rings in kernels, 93
HTTP/3, 515
IP, 509–510
networks, 502, 504, 509–515
QUIC, 515
TCP, 510–514
UDP, 514
CPUs, 260–261
memory, 335–336
OS virtualization, 619
PSI. See Pressure stall information (PSI)
PSS (proportional set size) in shared memory, 310
Pterodactyl latency heat maps, 488–489
Public clouds, 580
PV (paravirtualization), 588, 590
Q
hardware virtualization, 589
lightweight virtualization, 631
QLC (quad-level cell) flash memory, 440
QoS (quality of service) for networks, 532–533
QPI (Quick Path Interconnect), 236–237
Qspinlocks, 117–118
Quad-level cell (QLC) flash memory, 440
Quality of service (QoS) for networks, 532–533
Quantifying issues, 6
Quantifying performance gains, 73–74
Quarterly patterns, monitoring, 79
Question step in scientific method, 44–45
networks, 521
OS virtualization, 617
tuning, 571
interrupts, 98
overview, 23–24
TCP connections, 519–520
QUIC protocol, 515
hardware virtualization, 589
lightweight virtualization, 631
Quick Path Interconnect (QPI), 236–237
Quotas in OS virtualization, 615
R
RACK (recent acknowledgments) in TCP, 514
RAID (redundant array of independent disks) architecture, 444–445
Ramping load benchmarking, 662–664
Random-access pattern in micro-benchmarking, 390
Random change anti-method, 42–43
disks, 430–431, 436
latency profile, micro-benchmarking, 457
Rate transitions in networks, 517
Raw hardware event descriptors, 680
Raw tracepoints, 150
RCU-walk (read-copy-update-walk) algorithm, 375
Re-exec method in heap growth, 320
Read-ahead in file systems, 365
Read-copy-update-walk (RCU-walk) algorithm, 375
Read latency profile in micro-benchmarking, 457
Read-modify-write operation in RAID, 445
description, 94
tracing, 404–405
Read/write ratio in disks, 431
Real-time scheduling classes, 106, 253
Real-time systems, interrupt masking in, 98
Reaping memory, 316, 318
Rebuilding volumes and pools, 383
Receive Flow Steering (RFS) in networks, 523
Receive Packet Steering (RPS) in networks, 523
Receive Side Scaling (RSS) in networks, 522–523
Recent acknowledgments (RACK) in TCP, 514
example, 672
options, 695
overview, 694–695
stack walking, 696
record subcommand for trace-cmd, 735
RED method, 53
Reduced instruction set computers (RISCs), 224
Redundant array of independent disks (RAID) architecture, 444–445
Reno algorithm for TCP congestion control, 513
Repeatability of benchmarks, 643
example, 672
overview, 696–697
STDIO, 697–698
TUI interface, 697
report subcommand for trace-cmd, 735
perf, 678–679
sar, 163, 165
Request rate in RED method, 53
Requests in workload analysis, 39
Resilvering volumes and pools, 383
Resource analysis perspectives, 4–5, 38–39
CPUs, 253, 298
disks, 456, 494
hardware virtualization, 595–597
lightweight virtualization, 632
memory, 328, 353–354
networks, 532–533
operating systems, 110–111
OS virtualization, 613–617, 626–627
tuning, 571
USE method, 52
Resource isolation in cloud computing, 586
Resource limits in capacity planning, 70–71
Resource lists in USE method, 49
Resource utilization in applications, 173
Resources in USE method, 47
Response time
defined, 22
disks, 452
latency, 24
restart subcommand in trace-cmd, 735
Retention policy for caching, 36
latency, 528
TCP, 510, 512, 529
UDP, 514
Retrospectives, 4
kprobes, 721
kretprobes, 152
ukretprobes, 154
uprobes, 723
retval variable in bpftrace, 778
RFS (Receive Flow Steering) in networks, 523
applications, 177
networks, 522
RISCs (reduced instruction set computers), 224
Roles, 2–3
Root level in file systems, 106
Rostedt, Steven, 705, 711, 734, 739–740
Rotation time in magnetic rotational disks, 436
Round-trip time (RTT) in networks, 507, 528
Route tables, 537
Routers, 516–517
RPS (Receive Packet Steering) in networks, 523
RR scheduling policy, 243
RSS (Receive Side Scaling) in networks, 522–523
RT scheduling class, 242–243
RTT (round-trip time) in networks, 507, 528
CPUs, 222
defined, 220
latency, 222
schedulers, 105, 241
Runnability of benchmarks, 643
Runnable state in thread state analysis, 194–197
runqlat tool
CPUs, 279–280
description, 756
runqlen tool
CPUs, 280–281
description, 756
CPUs, 285
description, 756
S
S3 (Simple Storage Service), 585
SaaS (software as a service), 634
SACK (selective acknowledgment) algorithm, 514
SACKs (selective acknowledgments), 510
CPU profiling, 35, 135, 187, 200–201, 247–248
distributed tracing, 199
page faults, 339–340
PMCs, 157–158
Sanity checks in benchmarking, 664–665
sar (system activity reporter)
configuration, 162
coverage, 161
CPUs, 260
description, 15
disks, 463–464
documentation, 165–166
file systems, 393–394
memory, 331–333
monitoring, 137, 161–165
networks, 543–545
options, 801–802
OS virtualization, 619
overview, 160
reporting, 163
SAS (Serial Attached SCSI) disk interface, 442
SATA (Serial ATA) disk interface, 442
applications, 193
CPUs, 226–227, 245–246, 251, 795, 797
defined, 22
disk controllers, 451
flame graphs, 291
I/O, 798
kernels, 798
memory, 309, 324–326, 796–797
methodologies, 34–35
networks, 526–527, 796–797
storage, 797
USE method, 47–48, 51–53
Saturation points in scalability, 31
Scalability and scaling
Amdahl’s Law of Scalability, 64–65
cloud computing, 581–584
CPU, 522–523
disks, 457–458
methodologies, 31–32
models, 63–64
multithreading, 227
Universal Scalability Law, 65–66
Scalability ceiling, 64
Scalable Vector Graphics (SVG) files, 164
Scanning pages, 318–319, 323, 374
Scatter plots
disk I/O, 81–82
sched command, 141
sched subcommand for perf, 272–273, 673, 702
schedstat tool, 141–142
CPUs, 226, 272–273
delay accounting, 145
Scheduler tracing off-CPU analysis, 189–190
CPUs, 241–242
defined, 220
hardware virtualization, 596–597
kernel, 105–106
options, 295–296
scheduling internals, 284–285
CPUs, 115, 242–243
I/O, 115, 493
kernel, 106
priority, 295
Scheduling in Kubernetes, 586
Scientific method, 44–46
Scratch variables in bpftrace, 770–771
flame graphs, 700
overview, 698–700
script subcommand for perf, 673
Scrubbing file systems, 376
SCSI (Small Computer System Interface)
disks, 442
SDT events, 681
Second-level caches in file systems, 362
defined, 424
size, 437
zoning, 437
Security boot options, 298–299
SEDA (staged event-driven architecture), 178
SEDF (simple earliest deadline first) schedulers, 595
Seek time in magnetic rotational disks, 436
Segments
defined, 304
OSI model, 502
process virtual address space, 319
Selective acknowledgment (SACK) algorithm, 514
Selective acknowledgments (SACKs), 510
Self-Monitoring, Analysis and Reporting Technology (SMART) data, 485
Semaphores for applications, 179
disks, 430–431, 436
Serial ATA (SATA) disk interface, 442
Serial Attached SCSI (SAS) disk interface, 442
Server instances in cloud computing, 580
Service consoles in hardware virtualization, 589
Service thread pools for applications, 178
Service time
defined, 22
I/O, 427–429
Set associative caches, 234
cloud computing, 582
Shared memory, 310
Shares in OS virtualization, 614–615, 626
Shingled Magnetic Recording (SMR) drives, 439
shmsnoop tool, 348
Short-lived processes, 12, 207–208
Short-stroking in magnetic rotational disks, 437
Simple earliest deadline first (SEDF) schedulers, 595
Simple Network Management Protocol (SNMP), 55, 137
Simple Storage Service (S3), 585
Simulation benchmarking, 653–654
Simultaneous multithreading (SMT), 220, 225
Single-level cell (SLC) flash memory, 440
Single root I/O virtualization (SR-IOV), 593
Site reliability engineers (SREs), 4
Size
cloud computing, 583–584
disk I/O, 432, 480–481
instruction, 224
packets, 504–505
virtual memory, 308
word, 229, 310
working set. See Working set size (WSS)
Slab
allocator, 114
process virtual address space, 321–322
slabinfo tool, 142
slabtop tool, 333–334, 394–395
SLC (single-level cell) flash memory, 440
Sleeping state in thread state analysis, 194–197
Sloth disks, 438
Slow disks case study, 16–18
Slowpath state in Mutex locks, 179
Small Computer System Interface (SCSI)
disks, 442
SMART (Self-Monitoring, Analysis and Reporting Technology) data, 485
SMP (symmetric multiprocessing), 110
SMR (Shingled Magnetic Recording) drives, 439
SMs (streaming multiprocessors), 240
SMT (simultaneous multithreading), 220, 225
Snapshots
btrfs, 382
ZFS, 381
SNMP (Simple Network Management Protocol), 55, 137
SO_BUSY_POLL socket option, 522
SO_REUSEPORT socket option, 117
SO_TIMESTAMP socket option, 529
SO_TIMESTAMPING socket option, 529
Sockets
BSD, 113
defined, 500
description, 109
local connections, 509
options, 573
statistics, 534–536
tracing, 552–555
tuning, 569
soconnlat tool, 561
Software
memory, 315–322
networks, 517–524
Software as a service (SaaS), 634
Software change case study, 18–19
case study, 789–790
observability source, 159
perf, 680, 683–684
recording and tracing, 275–276
USE method, 52, 798–799
Solaris
kernel, 114
Kstat, 160
Slab allocator, 322, 652
zones, 606, 620
Solid-state disks (SSDs)
overview, 439–441
sormem tool, 561
Source code for applications, 172
SPEC (Standard Performance Evaluation Corporation) benchmarks, 655–656
Special file systems, 371
Spin locks
applications, 179
contention, 198
queued, 118
SPs (streaming processors), 240
SR-IOV (single root I/O virtualization), 593
SREs (site reliability engineers), 4
ss tool, 145–146, 525, 534–536
SSDs (solid-state disks)
overview, 439–441
Stack traces
description, 102
displaying, 204–205
keys, 730–731
Stack walking, 102, 696
Stacks
I/O, 107–108, 372
missing, 215–216
network, 109, 518–519
operating system disk I/O, 446–449
overview, 102
process virtual address space, 319
protocol, 502
reading, 102–103
Staged event-driven architecture (SEDA), 178
Standard Performance Evaluation Corporation (SPEC) benchmarks, 655–656
Starovoitov, Alexei, 121
start subcommand in trace-cmd, 735
Starvation in deadline I/O schedulers, 448
description, 635
options, 692–693
overview, 691–692
stat subcommand in trace-cmd, 735
Stateful workload simulation, 654
Stateless workload simulation, 653
States
TCP, 511–512
thread state analysis, 193–197
overview, 11–12
tracepoints, 146, 717
Static performance tuning
applications methodology, 198–199
CPUs, 252
disks, 455–456
file systems, 389
memory, 327–328
methodologies, 59–60
networks, 531–532
tools, 130–131
Static priority of threads, 242–243
Static tracing in perf, 676–677
Statistical analysis in benchmarking, 665–666
Statistics, 8–9
averages, 74–75
baseline, 59
case study, 784–786
multimodal distributions, 76–77
outliers, 77
quantifying performance gains, 73–74
standard deviation, percentiles, and median, 75
statm tool, 141
statsnoop tool, 409
stop subcommand in trace-cmd, 735
Storage
cloud computing, 584–585
sample processing, 248–249
USE method, 49–51, 796–797
str function, 770, 778
strace tool
bonnie++ tool, 660
file system latency, 395
limitations, 202
networks, 561
overhead, 207
system call tracing, 205–207
tracing, 136
stream subcommand in trace-cmd, 735
Streaming multiprocessors (SMs), 240
Streaming processors (SPs), 240
Streaming workloads in disks, 430–431
Stress testing in software change case study, 18
Stripe width of volumes and pools, 383
Striped allocation in XFS, 380
Stripes in RAID, 444–445
strncmp function, 778
Stub domains in hardware virtualization, 596
Subjectivity, 5
Subsecond-offset heat maps, 289
Summary-since-boot values monitoring, 79
Superscalar architectures for CPUs, 224
Surface plots, 84–85
SUT (system under test) models, 23
SVG (Scalable Vector Graphics) files, 164
Swap capacity in OS virtualization, 613, 616
swapon tool
disks, 487
memory, 331
Swapping
defined, 304
memory, 316, 323
overview, 305–307
processes, 104–105, 308–309
Swapping state
delay accounting, 145
thread state analysis, 194–197
Symbol churn, 214
Symbols, missing, 214
Symmetric multiprocessing (SMP), 110
Synchronization primitives for applications, 179
Synchronous disk I/O, 434–435
Synchronous interrupts, 97
Synchronous writes, 366
syncsnoop tool
BCC, 756
file systems, 409
Synthetic events in hist triggers, 731–733
/sys file system, 143–144
SysBench system benchmark, 294
syscount tool
BCC, 756
CPUs, 285
file systems, 409
system calls count, 208–209
sysctl tool
congestion control, 570
schedulers, 296
System activity reporter. See sar (system activity reporter)
System calls
analysis, 192
counting, 208–209
defined, 90
file system latency, 385
kernel, 92, 94–95
micro-benchmarking for, 61
observability source, 159
System design, benchmarking for, 642
system function in bpftrace, 770, 779
System statistics, monitoring, 138
System under test (SUT) models, 23
System-wide CPU profiling, 268–270
System-wide observability tools, 133
/proc file system, 141–142
profiling, 135
tracing, 136
System-wide tunable parameters
ECN, 570
networks, 567–572
production example, 568
sockets and TCP buffers, 569
TCP congestion control, 570
systemd-analyze command, 120
systemd service manager, 120
Systems performance overview, 1–2
activities, 3–4
cloud computing, 14
complexity, 5
counters, statistics, and metrics, 8–9
experiments, 13–14
latency, 6–7
methodologies, 15–16
multiple performance issues, 6
observability, 7–13
performance challenges, 5–6
perspectives, 4–5
references, 19–20
roles, 2–3
T
Tagged Command Queueing (TCQ), 437
Tahoe algorithm for TCP congestion control, 513
Tail-based sampling in distributed tracing, 199
Tail Loss Probe (TLP), 117, 512
Task capacity in USE method, 799
Tasks
defined, 90
idle, 99
tc tool, 566
TCMalloc allocator, 322
TCP. See Transmission Control Protocol (TCP)
TCP/IP stack
BSD, 113
kernels, 109
protocol, 502
TCP segmentation offload (TSO), 521
TCP Tail Loss Probe (TLP), 117
tcpdump tool
BPF for, 12
description, 526
overview, 558–559
tcplife tool
BCC, 756
description, 525
overview, 548
tcpretrans tool
BCC, 756
overview, 549–550
tcptop tool
BCC, 756
description, 526
TCQ (Tagged Command Queueing), 437
Temperature-aware scheduling classes, 243
Temperature sensors for CPUs, 230
Tenancy in cloud computing, 580
contention in hardware virtualization, 595
contention in OS virtualization, 612–613
Tensor processing units (TPUs), 241
Test errors in benchmarking, 646–647
Test step in scientific method, 44–45
Text user interface (TUI), 697
Theoretical maximum disk throughput, 436–437
Thermal pressure in Linux kernel, 119
THP (transparent huge pages)
Linux kernel, 116
memory, 353
Thread pools in USE method, 52
Thread state analysis, 193–194
Linux, 195–197
software change case study, 19
states, 194–195
Threads
applications, 177–181
CPUs, 227–229
defined, 90
flusher, 374
hardware, 221
idle, 99, 244
interrupts, 97–98
lightweight, 178
micro-benchmarking, 653
processes, 100
schedulers, 105–106
SMT, 225
USE method, 52
3D XPoint persistent memory, 441
Three-way handshakes in TCP, 511
Throttling
benchmarks, 661
hardware virtualization, 597
OS virtualization, 626
packets, 522
Throughput
applications, 173
defined, 22
disks, 424
magnetic rotational disks, 436–437
networks, monitoring, 529
solid-state drives, 441
Time
averages over, 74
disk measurements, 427–429
Time-based patterns in monitoring, 77–78
Time-based utilization, 33–34
time function in bpftrace, 778
Time scales
disks, 429–430
methodologies, 25–26
Time sharing for schedulers, 241
Time slices for schedulers, 242
Time to first byte (TTFB) in networks, 506
timechart subcommand for perf, 673
Timer-based profile sampling, 247–248
Timerless multitasking, 117
Timestamps
file systems, 371
TCP, 511
tiptop tool, 348
TLBs. See Translation lookaside buffers (TLBs)
tlbstat tool
CPUs, 266–267
memory, 348
TLC (tri-level cell) flash memory, 440
TLP (Tail Loss Probe), 117, 512
TLS (transport layer security), 113
Tools method
CPUs, 245
disks, 450
memory, 323–324
networks, 525
overview, 46
Top-level directories, 107
Top of file system layer, file system latency in, 385
top tool
CPUs, 245, 261–262
description, 15
file systems, 393
hardware virtualization, 600
lightweight virtualization, 632–633
memory, 324, 336–337
OS virtualization, 619, 624
TPC (Transaction Processing Performance Council) benchmarks, 655
TPC-A benchmark, 650–651
TPUs (tensor processing units), 241
trace-cmd front end
documentation, 740
one-liners, 736–737
overview, 734
trace subcommand for perf, 673, 701–702
tracefs file system, 149–150
contents, 709–711
overview, 708–709
tracepoint probes, 774
Tracepoints
arguments and format string, 148–149
description, 11
documentation, 150–151
example, 147–148
filters, 717–718
interface, 149–150
Linux kernel, 116
overhead, 150
overview, 146
triggers, 718
tracepoints tracer, 707
traceroute tool, 563–564
Tracing
BPF, 12–13
case study, 790–792
distributed, 199
locks, 212–213
observability tools, 136
OS virtualization, 620, 624–625, 629
perf, 676–678
schedulers, 189–190
sockets, 552–555
software, 275–276
static instrumentation, 11–12
strace, 136, 205–207
tools, 166
trace-cmd. See trace-cmd front end
virtual file system, 405–406
Trade-offs in methodologies, 26–27
Traffic control utility in networks, 566
Transaction costs of latency, 385–386
Transaction groups (TXGs) in ZFS, 381
Transaction Processing Performance Council (TPC) benchmarks, 655
Translation lookaside buffers (TLBs)
CPUs, 232
flushing, 121
memory, 314–315
MMU, 235
Translation storage buffers (TSBs), 235
Transmission Control Protocol (TCP)
analysis, 531
anti-bufferbloat, 117
autocorking, 117
buffers, 520, 569
congestion algorithms, 115
congestion avoidance, 508
congestion control, 118, 513, 570
connection latency, 24, 506, 528
connection queues, 519–520
connection rate, 527–529
duplicate ACK detection, 512
features, 510–511
friends, 509
New Vegas, 118
retransmits, 117, 512, 528–529
SACK, FACK, and RACK, 514
transfer time, 24–25
Transmit Packet Steering (XPS) in networks, 523
Transparent huge pages (THP)
Linux kernel, 116
memory, 353
Transport layer security (TLS), 113
Traps
defined, 90
synchronous interrupts, 97
Tri-level cell (TLC) flash memory, 440
Triggers
hist. See Hist triggers
kprobes, 721–722
tracepoints, 718
uprobes, 723
Troubleshooting, benchmarking for, 642
TSBs (translation storage buffers), 235
TSO (TCP segmentation offload), 521
TTFB (time to first byte) in networks, 506
TUI (text user interface), 697
Tunable parameters
disks, 494
memory, 350–351
micro-benchmarking, 390
networks, 567
operating systems, 493–495
point-in-time recommendations, 29–30
tradeoffs with, 27
Tuning
benchmarking for, 642
caches, 60
disks, 493–495
file system caches, 389
file systems, 414–419
memory, 350–354
methodologies, 27–28
networks, 567–574
static performance. See Static performance tuning
targets, 27–28
TXGs (transaction groups) in ZFS, 381
Type 1 hypervisors, 587
Type 2 hypervisors, 587
U
sar configuration, 162
UDP Generic Receive Offload (GRO), 119
UDP (User Datagram Protocol), 514
UDS (Unix domain sockets), 509
UIDs (user IDs) for processes, 101
UIO (user space I/O) in kernel bypass, 523
Ultra Path Interconnect (UPI), 236–237
UMA (uniform memory access) memory system, 311–312
UMA (universal memory allocator), 322
UMASK values in MSRs, 238–239
Unicast network transmissions, 503
UNICS (UNiplexed Information and Computing Service), 112
Unified buffer caches, 374
Uniform memory access (UMA) memory system, 311–312
Unikernels, 92, 123, 634
UNiplexed Information and Computing Service (UNICS), 112
Universal memory allocator (UMA), 322
Universal Scalability Law (USL), 65–66
Unix domain sockets (UDS), 509
Unix kernels, 112
UPI (Ultra Path Interconnect), 236–237
uprobes, 687–688
arguments, 154, 688–689, 723
bpftrace, 774
documentation, 155
example, 154
filters, 723
Ftrace, 708
interface and overhead, 154–155
Linux kernel, 117
overview, 153
profiling, 723
return values, 723
triggers, 723
uptime tool
case study, 784–785
CPUs, 245
description, 15
OS virtualization, 619
PSI, 257–258
uretprobes, 154
USDT (user-level static instrumentation events)
perf, 681
probes, 690–691
USDT (user-level statically defined tracing), 11, 155–156
USE method. See Utilization, saturation, and errors (USE) method
User address space in processes, 102
User allocation stacks, 345
User Datagram Protocol (UDP), 514
User IDs (UIDs) for processes, 101
User land, 90
User-level static instrumentation events (USDT)
perf, 681
probes, 690–691
User-level statically defined tracing (USDT), 11, 155–156
User mutex in USE method, 799
User space, defined, 90
User space I/O (UIO) in kernel bypass, 523
User state in thread state analysis, 194–197
username variable in bpftrace, 777
USL (Universal Scalability Law), 65–66
ustack function in bpftrace, 779
ustack variable in bpftrace, 778
usym function, 779
Utilization
applications, 173, 193
CPUs, 226, 245–246, 251, 795, 797
defined, 22
disk controllers, 451
disks, 433, 452
I/O, 798
kernels, 798
memory, 309, 324–326, 796–797
methodologies, 33–34
networks, 508–509, 526–527, 796–797
storage, 796–797
USE method, 47–48, 51–53
Utilization, saturation, and errors (USE) method
applications, 193
benchmarking, 661
CPUs, 245–246
disks, 450–451
functional block diagrams, 49–50
memory, 324–325
metrics, 48–51
microservices, 53
networks, 526–527
overview, 47
procedure, 47–48
references, 799
V
V-NAND (vertical NAND) flash memory, 440
valgrind tool
memory, 348
Variable block sizes in file systems, 375
Variables in bpftrace, 770–771, 777–778
Variance
benchmarks, 647
description, 75
vCPUs (virtual CPUs), 595
Verification of observability tool results, 167–168
Versions
applications, 172
kernel, 111–112
Vertical NAND (V-NAND) flash memory, 440
Vertical scaling
cloud computing, 581
VFIO (virtual function I/O) drivers, 523
VFS. See Virtual file system (VFS)
VFS layer, file system latency analysis in, 385
vfs_read function in bpftrace, 772–773
vfs_read tool in Ftrace, 706–707
vfsstat tool, 409
Vibration in magnetic rotational disks, 438
Virtual CPUs (vCPUs), 595
Virtual disks
defined, 424
utilization, 433
Virtual file system (VFS)
description, 107
interface, 373
latency, 406–408
tracing, 405–406
Virtual function I/O (VFIO) drivers, 523
Virtual machine managers (VMMs)
cloud computing, 580
hardware virtualization, 587–605
Virtual machines (VMs)
cloud computing, 580
hardware virtualization, 587–605
programming languages, 185
Virtual memory
defined, 90, 304
managing, 104–105
overview, 305
size, 308
Virtual-to-guest physical translation, 593
Virtualization
hardware. See Hardware virtualization
OS. See OS virtualization
Visual identification of models, 62–64
Visualizations, 79
CPUs, 288–293
disks, 487–490
file systems, 410–411
flame graphs. See Flame graphs
scatter plots, 81–82
surface plots, 84–85
tools, 85
VMMs (virtual machine managers)
cloud computing, 580
hardware virtualization, 587–588
VMs (virtual machines)
cloud computing, 580
hardware virtualization, 587–588
programming languages, 185
vmstat tool
CPUs, 245, 258
description, 15
disks, 487
file systems, 393
hardware virtualization, 604
memory, 323, 329–330
OS virtualization, 619
VMware ESX, 589
Volumes
file systems, 382–383
Voluntary kernel preemption, 110, 116
W
Wait time
disks, 434
I/O, 427
wakeup tracer, 708
wakeup_rt tracer, 708
Warm caches, 37
Warmth of caches, 37
Wear leveling in solid-state drives, 441
Weekly patterns, monitoring, 79
Whetstone benchmark, 254, 653
Whys in drill-down analysis, 56
Width
flame graphs, 290–291
instruction, 224
Windows
DiskMon, 493
fibers, 178
Hyper-V, 589
LTO and PGO, 122
portable executable format, 183
ProcMon, 207
TIME_WAIT, 512
Word size
CPUs, 229
memory, 310
Work queues with interrupts, 98
Working set size (WSS)
benchmarking, 664
memory, 310, 328, 342–343
micro-benchmarking, 390–391, 653
Workload analysis perspectives, 4–5, 39–40
Workload characterization
benchmarking, 662
CPUs, 246–247
disks, 452–454
file systems, 386–388
methodologies, 54
networks, 527–528
Workload separation in file systems, 389
Write amplification in solid-state drives, 440
Write-back caches
file systems, 365
on-disk, 425
write system calls, 94
Write type, micro-benchmarking for, 390
wss tool, 342–343
WSS (working set size)
benchmarking, 664
memory, 310, 328, 342–343
micro-benchmarking, 390–391, 653
X
XDP (Express Data Path) technology
description, 118
event sources, 558
Xen
description, 589
network performance, 597
observability, 599
xentop tool, 599
XFS file system, 379–380
xfsdist tool
BCC, 756
file systems, 399
Y
Yearly patterns, monitoring, 79
Z
ZFS file system
features, 380–381
options, 418–419
zfsdist tool
BCC, 757
file systems, 399
Zones
magnetic rotational disks, 437
OS virtualization, 606, 620