MLC: A Memory Latency and Bandwidth Testing Tool
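Why MLC
Two important factors affect program performance:
① The time an application spends fetching data from the processor caches and from the memory subsystem, each of which adds its own latency;
② Bandwidth, b/w (bandwidth, not Bilibili World).
MLC measures exactly these two things.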
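What it tests
Node access speed
In a NUMA (Non-Uniform Memory Access) architecture, memory devices and CPU cores are grouped into nodes, and each node has its own integrated memory controller (IMC). This removes the bottleneck of every processor going through one shared memory bus and controller, which reduces contention for bus bandwidth and memory access conflicts.
(Background: a core is an independent physical execution unit, i.e. a physical CPU; a thread is a logical CPU. Here a socket corresponds to a node, i.e. one CPU socket on the motherboard. Inside a node, cores communicate over the IMC bus; different nodes communicate over QPI (QuickPath Interconnect).)
Same-city express delivery is always faster than international mail, so QPI (remote) latency is clearly higher than IMC bus (local) latency.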
Test example:
Query memory access latency
Command
    ./mlc --latency_matrix
Result
                    Numa node
    Numa node            0       1
           0          82.2   129.6
           1         131.1    81.6
The matrix shows idle memory access latency within each node and between nodes, in ns.
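To make the local-versus-remote gap concrete, the matrix can be reduced to one remote-to-local ratio per node. The following is a minimal sketch in Python and is not part of MLC: the helper name `remote_local_ratio` is made up for illustration, the numbers are copied from the sample output, and it assumes rows are the requesting node and columns the node that owns the memory.

```python
# Idle latency matrix from the sample output, in ns.
# Assumption: latency[i][j] is the latency seen by node i reading memory on node j.
latency = [
    [82.2, 129.6],
    [131.1, 81.6],
]

def remote_local_ratio(matrix):
    """Return, per node, the worst remote latency divided by the local latency."""
    ratios = []
    for i, row in enumerate(matrix):
        local = row[i]  # diagonal entry: node i reading its own memory
        remote = max(v for j, v in enumerate(row) if j != i)
        ratios.append(remote / local)
    return ratios

ratios = remote_local_ratio(latency)
for node, r in enumerate(ratios):
    print(f"node {node}: remote latency is {r:.2f}x the local latency")
```

On this sample, remote access costs roughly 1.6 times the local latency on both nodes, which matches the express-delivery analogy.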
Bandwidth
Bandwidth is the amount of data transferred per unit of time. The wider the road, the less likely a traffic jam.
    Measuring Peak Injection Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using traffic with the following read-write ratios
    ALL Reads        :  69143.9
    3:1 Reads-Writes :  61908.4
    2:1 Reads-Writes :  60040.5
    1:1 Reads-Writes :  54517.6
    Stream-triad like:  57473.4
Each r:w row is the memory bandwidth at that read-to-write ratio.
In general, memory writes are slower than reads (talk is cheap, show me the code),
so as the share of reads falls, bandwidth falls too (the road narrows and traffic backs up).
Analysis: if bandwidth drops sharply, the workload may have become more write-heavy, or the writing program itself may be misbehaving and running too slowly.
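The effect of the read-write mix is easier to judge as a percentage drop from the all-reads figure. Below is a small Python sketch under that assumption; the function name `drop_vs_all_reads` is hypothetical and the numbers are the ones from the sample output.

```python
# Peak injection bandwidths from the sample output, in MB/sec.
bandwidth = {
    "ALL Reads": 69143.9,
    "3:1 Reads-Writes": 61908.4,
    "2:1 Reads-Writes": 60040.5,
    "1:1 Reads-Writes": 54517.6,
    "Stream-triad like": 57473.4,
}

def drop_vs_all_reads(results):
    """Return the percentage drop of each traffic mix relative to ALL Reads."""
    baseline = results["ALL Reads"]
    return {
        name: (baseline - value) / baseline * 100.0
        for name, value in results.items()
        if name != "ALL Reads"
    }

drops = drop_vs_all_reads(bandwidth)
for name, pct in drops.items():
    print(f"{name:18s} {pct:5.1f}% below ALL Reads")
```

For this sample the drop grows steadily from 3:1 to 1:1, which is the pattern the text describes. A much steeper drop on a real system would be the signal to look at the writers.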
Test example
Query memory access bandwidth (there are also other ways to check on their own whether memory access between NUMA nodes is normal)
Command
    ./mlc --bandwidth_matrix
Result
    Measuring Memory Bandwidths between nodes within system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
                    Numa node
    Numa node            0         1
           0       35216.6   32537.9
           1       31875.1   35048.5
Analysis: if the two off-diagonal values differ greatly, the bandwidth the two nodes get when accessing each other's memory is unbalanced.
Fix: when the values are unbalanced, the usual things to check are how the DIMMs are populated, whether any memory is faulty, and NUMA balancing.
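The off-diagonal comparison can be automated with a few lines of Python. This is only a sketch for a two-node system: the function name and the 10% threshold are assumptions chosen for illustration, not values recommended by MLC or any vendor.

```python
# Bandwidth matrix from the sample output, in MB/sec.
bw = [
    [35216.6, 32537.9],
    [31875.1, 35048.5],
]

def cross_node_imbalance(matrix):
    """Relative gap between the two off-diagonal entries of a 2x2 matrix."""
    a, b = matrix[0][1], matrix[1][0]
    return abs(a - b) / max(a, b)

THRESHOLD = 0.10  # hypothetical cut-off, pick one that suits your platform

imbalance = cross_node_imbalance(bw)
status = "investigate" if imbalance > THRESHOLD else "balanced"
print(f"cross-node imbalance: {imbalance:.1%} ({status})")
```

With the sample numbers the two directions differ by about 2%, so this system would count as balanced under the assumed threshold.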
Relationship between memory bandwidth and memory latency (read operations)
    Measuring Loaded Latencies for the system
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
    Inject  Latency Bandwidth
    Delay   (ns)    MB/sec
    ==========================
     00000  523.74   69057.4
     00002  589.55   68668.7
     00008  686.99   68571.4
     00015  549.87   68873.6
     00050  575.48   68673.0
     00100  524.74   68877.5
     00200  197.61   64225.8
     00300  131.60   47141.0
     00400  110.39   36803.0
     00500  117.32   30135.2
     00700  100.90   22179.1
     01000  100.93   15762.8
     01300   91.74   12351.6
     01700   98.61    9475.2
     02500   86.66    6927.8
     03500   88.13    5132.6
     05000   87.68    3818.6
     09000   85.36    2473.5
     20000   84.83    1538.7
This table shows how memory response time changes under load, and whether latency becomes unacceptable once bandwidth reaches a certain level.
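One way to read this table is to ask how much bandwidth the system can sustain before latency exceeds a budget. The Python sketch below does that for the sample data; the 150 ns budget and the helper name `max_bandwidth_within` are assumptions made for the example.

```python
# (inject delay, latency in ns, bandwidth in MB/sec) from the sample output.
samples = [
    (0, 523.74, 69057.4), (2, 589.55, 68668.7), (8, 686.99, 68571.4),
    (15, 549.87, 68873.6), (50, 575.48, 68673.0), (100, 524.74, 68877.5),
    (200, 197.61, 64225.8), (300, 131.60, 47141.0), (400, 110.39, 36803.0),
    (500, 117.32, 30135.2), (700, 100.90, 22179.1), (1000, 100.93, 15762.8),
    (1300, 91.74, 12351.6), (1700, 98.61, 9475.2), (2500, 86.66, 6927.8),
    (3500, 88.13, 5132.6), (5000, 87.68, 3818.6), (9000, 85.36, 2473.5),
    (20000, 84.83, 1538.7),
]

def max_bandwidth_within(rows, latency_limit_ns):
    """Highest bandwidth among rows whose latency stays within the limit."""
    ok = [bw for _, lat, bw in rows if lat <= latency_limit_ns]
    return max(ok) if ok else None

LIMIT_NS = 150.0  # hypothetical latency budget, chosen for illustration
best = max_bandwidth_within(samples, LIMIT_NS)
print(f"max bandwidth with latency <= {LIMIT_NS:.0f} ns: {best} MB/sec")
```

For this sample the answer is the 00300 row: pushing past about 47 GB/sec buys little extra bandwidth while latency climbs from roughly 130 ns to several hundred.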
Measuring CPU cache to CPU cache access latency
    Measuring cache-to-cache transfer latency (in ns)...
    Local Socket L2->L2 HIT  latency    38.6
    Local Socket L2->L2 HITM latency    43.6
    Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Socket
    Writer Socket            0        1
                0            -    133.4
                1        133.7        -
    Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Socket
    Writer Socket            0        1
                0            -    133.5
                1        133.7        -
Peak bandwidth
Command
    mlc --peak_bandwidth
Result
    Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
    Measuring Peak Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using traffic with the following read-write ratios
    ALL Reads        :  50035.2
    3:1 Reads-Writes :  48119.3
    2:1 Reads-Writes :  47434.3
    1:1 Reads-Writes :  48325.5
    Stream-triad like:  44029.0
Idle memory latency
Command
    mlc --idle_latency
Result
    Using buffer size of 200.000MB
    Each iteration took 260.5 core clocks ( 113.3 ns)
Loaded memory latency
Command
    mlc --loaded_latency
Result
    Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
    Measuring Loaded Latencies for the system
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
    Inject  Latency Bandwidth
    Delay   (ns)    MB/sec
    ==========================
     00000  217.32   49703.4
     00002  258.98   49482.4
     00008  217.48   49908.1
     00015  220.12   49973.7
     00050  206.33   49185.7
     00100  174.02   43811.8
     00200  141.63   27651.1
     00300  130.65   19614.6
     00400  126.05   15217.0
     00500  122.70   12506.0
     00700  121.46    9253.0
     01000  120.55    6690.6
     01300  118.75    5314.9
     01700  120.18    4148.7
     02500  119.53    3055.7
     03500  119.60    2349.4
     05000  116.60    1816.9
     09000  116.17    1257.8
     20000  116.87     867.6
Other operations (to be continued)
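- Measure access latency between specified nodes
- Measure CPU cache access latency
- Measure access bandwidth within a specified subset of cores/sockets
- Measure bandwidth at different read-write ratios
- Use a random access pattern instead of the default sequential pattern
- Specify the stride used during the test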