
linux perf - performance testing and optimization tool

Published 2025/5/22 by 生活随笔. This article introduces linux perf, a performance testing and optimization tool.

Introduction to Perf

Perf is the system performance profiling tool that ships with the Linux kernel. Although it is only at version 0.0.2, perf has already shown enough power to rival OProfile, currently the most popular profiler on Linux.

Perf's advantage is its tight integration with the Linux kernel: it is the first to support new kernel features, whereas tools such as OProfile and GProf usually lag a step behind. Its basic principle is similar to OProfile's: it gets/sets performance counters in the CPU's PMU registers to collect metrics such as instructions executed, cache misses suffered, and branches mispredicted. The Linux kernel abstracts these registers, so you can view sample information per process, per CPU, or per counter group.


Using Perf

Perf's workflow is very similar to OProfile's, so if you already know OProfile, perf is easy to pick up. Below is a simple translation of the perf examples from [1]; more will be added as I find them.

$ perf record -f -- git gc
Counting objects: 1283571, done.
Compressing objects: 100% (206724/206724), done.
Writing objects: 100% (1283571/1283571), done.
Total 1283571 (delta 1070675), reused 1281443 (delta 1068566)
[ perf record: Captured and wrote 31.054 MB perf.data (~1356768 samples) ]

$ perf report --sort comm,dso,symbol | head -10
# Samples: 1355726
#
# Overhead    Command                         Shared Object  Symbol
# ........  .........  ...................................   ......
#
    31.53%        git  /usr/bin/git                          [.] 0x0000000009804f
    13.41%  git-prune  /usr/bin/git-prune                    [.] 0x000000000ad06d
    10.05%        git  /lib/tls/i686/cmov/libc-2.8.90.so     [.] _nl_make_l10nflist
     5.36%  git-prune  /usr/lib/libz.so.1.2.3.3              [.] 0x00000000009d51
     4.48%        git  /lib/tls/i686/cmov/libc-2.8.90.so     [.] memcpy

perf record corresponds to opcontrol --start, and perf report corresponds to opreport.

Perf examples

To see all available counters, use 'perf list':

titan:~> perf list
[...]
  kmem:kmalloc                    [Tracepoint event]
  kmem:kmem_cache_alloc           [Tracepoint event]
  kmem:kmalloc_node               [Tracepoint event]
  kmem:kmem_cache_alloc_node      [Tracepoint event]
  kmem:kfree                      [Tracepoint event]
  kmem:kmem_cache_free            [Tracepoint event]
  kmem:mm_page_free_direct        [Tracepoint event]
  kmem:mm_pagevec_free            [Tracepoint event]
  kmem:mm_page_alloc              [Tracepoint event]
  kmem:mm_page_alloc_zone_locked  [Tracepoint event]
  kmem:mm_page_pcpu_drain         [Tracepoint event]
  kmem:mm_page_alloc_extfrag      [Tracepoint event]

You can run your test program against any combination of these counters. For example, the following counts page allocations/frees while running hackbench:

titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.575

 Performance counter stats for './hackbench 10':

          13857  kmem:mm_page_pcpu_drain
          27576  kmem:mm_page_alloc
           6025  kmem:mm_pagevec_free
          20934  kmem:mm_page_free_direct

    0.613972165  seconds time elapsed

Perf can also report how the counts fluctuate across N runs:

titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.627
Time: 0.644
Time: 0.564
Time: 0.559
Time: 0.626

 Performance counter stats for './hackbench 10' (5 runs):

          12920  kmem:mm_page_pcpu_drain    ( +-   3.359% )
          25035  kmem:mm_page_alloc         ( +-   3.783% )
           6104  kmem:mm_pagevec_free       ( +-   0.934% )
          18376  kmem:mm_page_free_direct   ( +-   4.941% )

    0.643954516  seconds time elapsed   ( +-   2.363% )

With these statistics in hand, you can sample a tracepoint you care about (for example, page allocations):

titan:~/git> perf record -f -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ]

Then see which commands caused the page allocations:

titan:~/git> perf report
# Samples: 10646
#
# Overhead          Command             Shared Object
# ........  ...............  ........................
#
    23.57%       git-repack  /lib64/libc-2.5.so
    21.81%              git  /lib64/libc-2.5.so
    14.59%              git  ./git
    11.79%       git-repack  ./git
     7.12%              git  /lib64/ld-2.5.so
     3.16%       git-repack  /lib64/libpthread-2.5.so
     2.09%       git-repack  /bin/bash
     1.97%               rm  /lib64/libc-2.5.so
     1.39%               mv  /lib64/ld-2.5.so
     1.37%               mv  /lib64/libc-2.5.so
     1.12%       git-repack  /lib64/ld-2.5.so
     0.95%               rm  /lib64/ld-2.5.so
     0.90%  git-update-serv  /lib64/libc-2.5.so
     0.73%  git-update-serv  /lib64/ld-2.5.so
     0.68%             perf  /lib64/libpthread-2.5.so
     0.64%       git-repack  /usr/lib64/libz.so.1.2.3
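Flat `perf report` output like the above is plain text and easy to post-process. As a small illustration (not part of perf itself; `parse_report_line` is a hypothetical helper), a Python sketch that splits the overhead/command/DSO columns:

```python
def parse_report_line(line):
    """Parse one data line of flat `perf report` output into
    (overhead_percent, command, shared_object)."""
    overhead, command, dso = line.split(None, 2)
    return float(overhead.rstrip('%')), command, dso.strip()

# Example line taken from the report above:
print(parse_report_line("23.57%  git-repack  /lib64/libc-2.5.so"))
# → (23.57, 'git-repack', '/lib64/libc-2.5.so')
```

Sorting or aggregating such tuples (e.g. summing overhead per DSO) is then a one-liner.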

Drilling down further:

titan:~/git> perf report --sort comm,dso,symbol

# Samples: 10646
#
# Overhead     Command       Shared Object  Symbol
# ........  ..........  ..................  ......
#
     9.35%  git-repack  ./git               [.] insert_obj_hash
     9.12%         git  ./git               [.] insert_obj_hash
     7.31%         git  /lib64/libc-2.5.so  [.] memcpy
     6.34%  git-repack  /lib64/libc-2.5.so  [.] _int_malloc
     6.24%  git-repack  /lib64/libc-2.5.so  [.] memcpy
     5.82%  git-repack  /lib64/libc-2.5.so  [.] __GI___fork
     5.47%         git  /lib64/libc-2.5.so  [.] _int_malloc
     2.99%         git  /lib64/libc-2.5.so  [.] memset

The call graph can also be recorded, showing the percentage each call path accounts for:

titan:~/git> perf record -f -g -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ]

titan:~/git> perf report -g
# Samples: 10686
#
# Overhead     Command       Shared Object
# ........  ..........  ..................
#
    23.25%  git-repack  /lib64/libc-2.5.so
            |
            |--50.00%-- _int_free
            |
            |--37.50%-- __GI___fork
            |           make_child
            |
            |--12.50%-- ptmalloc_unlock_all2
            |           make_child
            |
             --6.25%-- __GI_strcpy

    21.61%         git  /lib64/libc-2.5.so
            |
            |--30.00%-- __GI_read
            |      |
            |       --83.33%-- git_config_from_file
            |                  git_config
            |
    [...]

The following command counts page allocations across the whole system for 10 seconds:

titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 10

 Performance counter stats for 'sleep 10':

         171585  kmem:mm_page_pcpu_drain
         322114  kmem:mm_page_alloc
          73623  kmem:mm_pagevec_free
         254115  kmem:mm_page_free_direct

   10.000591410  seconds time elapsed

To see how system-wide page allocation fluctuates from second to second:

titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 1

 Performance counter stats for 'sleep 1' (10 runs):

          17254  kmem:mm_page_pcpu_drain    ( +-   3.709% )
          34394  kmem:mm_page_alloc         ( +-   4.617% )
           7509  kmem:mm_pagevec_free       ( +-   4.820% )
          25653  kmem:mm_page_free_direct   ( +-   3.672% )

    1.058135029  seconds time elapsed   ( +-   3.089% )
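The "( +- N% )" figures above are perf's noise estimate over the repeated runs. As a rough illustration of the idea (plain relative standard deviation here; perf's exact estimator may differ, and the sample values below are hypothetical per-run counts):

```python
from statistics import mean, stdev

def noise(samples):
    """Return (mean, relative standard deviation in %) for a list of
    per-run counter values, similar in spirit to perf's '+- N%' column."""
    m = mean(samples)
    return m, 100.0 * stdev(samples) / m

# Hypothetical per-run kmem:mm_page_pcpu_drain counts:
m, pct = noise([13857, 12513, 12390, 13211, 12629])
print(round(m), round(pct, 1))  # → 12920 4.7
```

A low percentage means the workload is stable enough for before/after comparisons.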

Disassembly often reveals which line of code generates the instructions causing the problem.

titan:~/git> perf annotate __GI___fork
------------------------------------------------
 Percent |      Source code & Disassembly of libc-2.5.so
------------------------------------------------
         :      Disassembly of section .plt:
         :      Disassembly of section .text:
         :
         :      00000031a2e95560 <__fork>:
[...]
    0.00 :        31a2e95602:  b8 38 00 00 00        mov    $0x38,%eax
    0.00 :        31a2e95607:  0f 05                 syscall
   83.42 :        31a2e95609:  48 3d 00 f0 ff ff     cmp    $0xfffffffffffff000,%rax
    0.00 :        31a2e9560f:  0f 87 4d 01 00 00     ja     31a2e95762 <__fork+0x202>
    0.00 :        31a2e95615:  85 c0                 test   %eax,%eax

The output above shows that 83.42% of __GI___fork's time comes from the 0x38 system call.

Is a particular function worth optimizing?

You may want to know whether a specific function in your program is worth optimizing. A good example is the discussion on the git mailing list about optimizing the SHA1 hash; perf can help predict the payoff of such an optimization in advance. See Linus's reply for details [2].

"perf report --sort comm,dso,symbol" profiling shows the following for 'git fsck --full' on the kernel repo, using the Mozilla SHA1: 47.69% git /home/torvalds/git/git [.] moz_SHA1_Update 22.98% git /lib64/libz.so.1.2.3 [.] inflate_fast 7.32% git /lib64/libc-2.10.1.so [.] __GI_memcpy 4.66% git /lib64/libz.so.1.2.3 [.] inflate 3.76% git /lib64/libz.so.1.2.3 [.] adler32 2.86% git /lib64/libz.so.1.2.3 [.] inflate_table 2.41% git /home/torvalds/git/git [.] lookup_object 1.31% git /lib64/libc-2.10.1.so [.] _int_malloc 0.84% git /home/torvalds/git/git [.] patch_delta 0.78% git [kernel] [k] hpet_next_event

Clearly, the performance of the SHA1 hash is critical here.
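Given the 47.69% of runtime spent in moz_SHA1_Update above, Amdahl's law gives a quick prediction of the overall payoff. A small sketch (the local speedup factors are hypothetical, and `overall_speedup` is an illustrative helper):

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: total speedup when `fraction` of the runtime
    is accelerated by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

sha1 = 0.4769  # share of 'git fsck --full' spent in moz_SHA1_Update

print(round(overall_speedup(sha1, 2), 2))    # SHA1 made 2x faster  → 1.31
print(round(overall_speedup(sha1, 1e9), 2))  # SHA1 made (nearly) free → 1.91
```

So even an infinitely fast SHA1 would make this workload at most about 1.9× faster, which bounds how much effort the optimization deserves.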

How to measure latency

If you enabled the following options when building the kernel:

CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y

then you can use several new performance counters (available in the -tip kernel tree) to measure scheduler latencies.

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

The command above measures how much time was spent waiting for a CPU. You can repeat it 10 times:

aldebaran:/home/mingo> perf stat --repeat 10 -e \
  sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

 Performance counter stats for './hackbench 20' (10 runs):

          59826  sched:sched_stat_wait  #      0.026 M/sec   ( +-   5.540% )
    2280.099643  task-clock-msecs       #      7.525 CPUs    ( +-   1.620% )

    0.303013390  seconds time elapsed   ( +-   3.189% )

To list the scheduling event counters:

# perf list 2>&1 | grep sched:
  sched:sched_kthread_stop      [Tracepoint event]
  sched:sched_kthread_stop_ret  [Tracepoint event]
  sched:sched_wait_task         [Tracepoint event]
  sched:sched_wakeup            [Tracepoint event]
  sched:sched_wakeup_new        [Tracepoint event]
  sched:sched_switch            [Tracepoint event]
  sched:sched_migrate_task      [Tracepoint event]
  sched:sched_process_free      [Tracepoint event]
  sched:sched_process_exit      [Tracepoint event]
  sched:sched_process_wait      [Tracepoint event]
  sched:sched_process_fork      [Tracepoint event]
  sched:sched_signal_send       [Tracepoint event]
  sched:sched_stat_wait         [Tracepoint event]
  sched:sched_stat_sleep        [Tracepoint event]
  sched:sched_stat_iowait       [Tracepoint event]

For latency analysis, the stat_wait/sleep/iowait events are the ones worth watching. If you want to see all the delays and their min/max/avg values, run:

perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace

The following is quoted from "Perf stats for doing nothing": http://blog.csdn.net/bluebeach/article/details/5912062

Perf stats for "doing nothing"

I've recently discovered the perf Linux tool. I heard that oprofile was deprecated and that there is a new tool, and I noted down to try it sometime.

Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.

The problem with perf stats is that I hate bloat, or even perceived bloat. Even when it doesn't affect me in any way, the concept of wasted cycles makes me really sad.

You probably can guess where this is going… I said, well, let's see what perf says about a simple "null" program. Surely doing nothing should be just a small number of instructions, right?

Note: I think that perf also records kernel-side code, because the lowest I could get was about ~50K instructions for starting a null program in assembler that doesn't use libc and just executes the syscall asm instruction. However, these ~50K instructions are noise the moment you start to use more high-level languages. Yes, this is expected, but I was still shocked. And there's lots of delta between languages I'd expected to behave somewhat identically.

Again, this is not important in the real world. At all. They are just numbers, and probably the noise (due to short runtime) has lots of influence on the resulting numbers. And I might have screwed up the measurements somehow.

Test setup

Each program was the equivalent of 'exit 0' in the appropriate form for the language. During the measurements, the machine was as idle as possible (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf's -r 500) and I've checked that the variations were small (±0.80% on the metrics I used).

You can find all the programs I've used at http://git.k1024.org/perf-null.git/; the current tests are at the tag version perf-null-0.1.

The raw data for the tables/graphs below is at log-4.

Results

Compiled languages

Language          Cycles    Instructions
asm               63K       51K
c-dietlibc        74K       57K
c-libc-static     177K      107K
c-libc-shared     506K      300K
c++-static        178K      107K
c++-dynamic       1,750K    1,675K
haskell-single    2,229K    1,338K
haskell-threaded  2,629K    1,522K
ocaml-bytecode    3,271K    2,741K
ocaml-native      1,042K    666K

Going from dietlibc to glibc doubles the number of instructions, and for libc going from static to dynamic linking again roughly doubles it. I didn't manage to compile a program dynamically-linked against dietlibc.

C++ is interesting. Linked statically, it is in the same ballpark as C, but when linked dynamically, it executes an order of magnitude more instructions. I would guess that the initialisation of the standard C++ library is complex?

Haskell, which has a GC and quite a complex runtime, executes slightly fewer instructions than C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program use the single-threaded runtime and the multi-threaded one; not much difference. A fully statically-linked Haskell binary (not usually recommended) goes below 1M instructions, but not by much.

OCaml is a very nice surprise. The bytecode runtime is a bit slow to start up, but the (native) compiled version is quite fast to start: only 2× the instructions and cycles of C, for an advanced language. And twice as fast as Haskell. Nice!

Shells

Language     Cycles    Instructions
dash         766K      469K
bash         1,680K    1,044K
mksh         1,258K    942K
mksh-static  504K      322K

So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I'd guess that dash is implemented in C.

Next, bash is indeed slower on startup than dash, and by slightly more than 2× (both instructions and cycles). So yes, switching /bin/sh from bash to dash makes sense.

I wasn't aware of mksh, so thanks for the comments. It is, in the static variant, more efficient than dash, by about 1.5×. However, the dynamically linked version doesn't look too great (dash is also dynamically linked; I would guess a statically-linked dash "beats" mksh-static).

Text processing

I've added perl here (even though it's a 'full' language) just for comparison; it's also in the next section.

Language  Cycles    Instructions
mawk      849K      514K
gawk      1,363K    980K
perl      2,946K    2,213K

A normal spread. I knew the reason mawk is "Priority: required" is that it's faster than gawk, but I wouldn't have guessed it's almost twice as fast.

Interpreted languages

Here is where the fun starts…

Language       Cycles      Instructions
lua 5.1        1,947K      1,485K
lua 5.2        1,724K      1,335K
lua jit        1,209K      803K
perl           2,946K      2,213K
tcl 8.4        5,011K      4,552K
tcl 8.5        6,888K      6,022K
tcl 8.6        8,196K      7,236K
ruby 1.8       7,013K      6,128K
ruby 1.9.3     35,870K     35,022K
python 2.6 -S  11,752K     10,247K
python 2.7 -S  11,438K     10,198K
python 3.2 -S  29,003K     27,409K
pypy -S        21,106K     10,036K
python 2.6     25,143K     21,989K
python 2.7     47,325K     50,217K
python 2.7 -O  47,341K     50,185K
python 3.2     113,567K    124,133K
python 3.2 -O  113,424K    124,133K
pypy           90,779K     68,455K

The numbers here are not quite what I expected. There's a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).

I wasn't familiar with Lua, so I tested it thanks to the comments. It is, I think, the only language which actually improves from one version to the next (bonus points), and where the JIT version makes it faster still. For context, LuaJIT starts up faster than C++.

Perl is the one that goes above C++'s instruction count, but not by much. From the point of view of the system, a Perl 'hello world' is only about 1.3×-1.6× slower than a C++ one. Not bad, not bad.

Next category is composed of TCL and Ruby, both of which had older versions 2-3× slower than Perl, but whose most recent versions are slower still. TCL has an almost constant slowdown across versions (5M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn't expect it to be slower to start up than Perl; good to know.

Last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, yes, it does import modules at runtime, but still… On closer investigation, the site module and the imports it performs eat a lot of time. Running the simpler python -S brings it back to a more reasonable 10M instructions, which is in line with the other interpreted languages.
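The difference -S makes is exactly the skipped site initialisation, which you can confirm from the interpreter itself via sys.flags. A small sketch (run against whatever Python is current; `no_site_flag` is an illustrative helper):

```python
import subprocess
import sys

def no_site_flag(*extra):
    """Run this interpreter with extra options and report sys.flags.no_site."""
    out = subprocess.run(
        [sys.executable, *extra, "-c", "import sys; print(sys.flags.no_site)"],
        capture_output=True, text=True,
    )
    return out.stdout.strip()

print(no_site_flag("-S"))  # → 1  (site module skipped)
print(no_site_flag())      # → 0  (normal startup, site imported)
```

Wrapping such invocations in `perf stat` is then the obvious next step to reproduce the table above.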

However, even with the -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a 3× slowdown from 2.7 to 3.2. Trying the “optimised” version (-O) doesn't help at all. Pypy, which is based on Python 2.7, is around 2× slower to start (both with and without -S).

So in the interpreted languages, it seems only Lua is trying to improve, the rest of the languages are piling up bloat with every version. Note: I should have tried multiple perl versions too.

Java

Java is in its own category; you can guess why, right?

GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).

Language      Cycles      Instructions
null-gcj      97,156K     74,576K
java -jamvm   85,535K     80,102K
java -server  147,174K    136,803K
java -zero    132,967K    124,977K
java -cacao   229,799K    205,312K

Using gcj to compile to “native code” (not sure whether that's native-native or something else) results in a binary that uses less than 100M cycles to start, but the jamvm VM is faster than that (85M cycles). Not bad for Java! Python 3.2 is slower to start up; yes, I think the world has gone crazy.

However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.

The other thing about Java is that it was the only one that couldn't be put nicely in a file that you just 'exec' (there is binfmt_misc, indeed, but that doesn't allow different Java classes to use different Java VMs, so I don't count it), as opposed to every single other thing I tested here. Someone didn't grow up on Unix?

Comparative analysis

Since there are almost 4 orders of magnitude difference between all the things tested here, a graph of cycles or instructions is not really useful. However, cycles/instruction, branches percentage and branches miss-predicted percentage can be. Hence first the cycles/instructions:

Pypy is jumping out of the graph here, with the top value of over 2 cycles/instruction. Lua JIT is also bigger than Lua non-JIT, so maybe there's something to this (mostly joking, two data points don't make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of ILP, to get below 1?

Java gets, irrespective of VM, consistently near 1.0-1.1. C++ gets very different numbers between static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers. mksh also has a difference between dynamic and static linking. Hmm…
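These cycles/instruction figures can be recomputed directly from the result tables above; a quick check (table values are rounded to K, so the results differ slightly from the raw-data figures quoted in the text, and `cpi` is an illustrative helper):

```python
def cpi(cycles_k, instructions_k):
    """Cycles per instruction from the K-suffixed table values."""
    return cycles_k / instructions_k

print(round(cpi(178, 107), 2))        # c++-static  → 1.66
print(round(cpi(1750, 1675), 2))      # c++-dynamic → 1.04
print(round(cpi(113567, 124133), 2))  # python 3.2  → 0.91
```

The same ratio for any other row of the tables takes one call.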

Ruby, TCL and Python have consistent values across versions.

And that's about what I can see from that graph. Next up, percentage of branches out of total instructions and percentage of branches missed:

Note that the two lines shouldn't really be on the same graph; for the branch %, the 100% is the total instructions count, but for the branch miss %, the 100% is the total branch count. Anyway.

There are two low-value outliers:

  • dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
  • gcj-compiled java has a very low branch miss percentage (only 2.82%!!!), even though it has a "regular" branch percentage (20.85%)

So it seems the gcj libraries are well optimised? I'm not familiar enough with this topic, but on the graph it does indeed stand out.

On the other end, mksh-static has a high branch miss percentage: 11.60%, which jumps clearly ahead of all the others; this might be why it has a high cycles/instruction count, due to all the stalls in misprediction; one has to wonder why it confuses the branch predictor?

I find it interesting that the overall branch count is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily over-weighs the kernel (e.g. Java). The average is 20.85%, minimum is 17.46%, max 22.93%, standard deviation (if I used gnumeric correctly) is just 0.01. This seems a bit suspicious to me. On the other hand, the mispredicted branches percentage varies much more: from a measly 2.82% to 11.60% (about a 4× difference).

Summary

So to recap, counting just instructions:

  • going from dietlibc to glibc: 2× increase
  • going from statically-linked libc to dynamically-linked libc: doubles it again
  • going from C to C++: 5× increase
  • C++ to Perl: 1.3×
  • Perl to Ruby: 3×
  • Ruby to Python (-S): 1.6×
  • Python -S to regular Python: 5×
  • Python to Java: 1×-2×, depending on version/runtime
  • branch percentage (per total instructions) is quite consistent across all of the programs

Overall, you get roughly three orders of magnitude slower startup between a plain C program using dietlibc and Python. And all, to do basically nothing.
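The headline figure follows from dividing the instruction counts in the tables above (c-dietlibc at 57K vs. python 3.2, without -S, at 124,133K):

```python
dietlibc_instr = 57        # K instructions, c-dietlibc (from the table)
python32_instr = 124_133   # K instructions, python 3.2 without -S (from the table)

ratio = python32_instr / dietlibc_instr
print(round(ratio))  # → 2178, i.e. about three orders of magnitude
```

Multiplying the per-step factors from the bullet list lands in the same range.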

On the other hand, I learned some interesting things while doing it, so it wasn't quite for nothing.
