
RocksDB's write advantage over LevelDB, seen through the code details of JoinBatchGroup

Published: 2023/11/27

Contents

- 1. RocksDB write model
- 2. LevelDB's write optimization
- 3. RocksDB's optimizations
  - 1. Busy Loop
  - 2. Short Wait -- SOMETIMES busy Loop
  - 3. Long-wait
  - 4. Test and verification
- 4. Summary

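1. RocksDB write model

This section discusses one optimization on RocksDB's write path. The detail shows both the coding skill behind this storage engine and its authors' understanding of the operating system.

Note: the RocksDB code discussed here is version 6.6.3.

First, a diagram briefly introduces RocksDB's write model under multiple threads.

By default, multi-threaded writes in RocksDB follow the model in the diagram: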

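Steps 1, 2, 3: one leader is elected among the concurrently writing threads (usually the first thread in the writer queue). The leader merges all the data to be written into a batch group, while the other threads wait.

Steps 4, 5: the leader finishes writing the batch to the WAL and wakes the waiting threads.

Steps 6, 7, 8: all threads can then write the memtable concurrently.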
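The leader's grouping work in steps 1 to 3 can be illustrated with a minimal single-threaded sketch. Writer, BatchGroup, BuildBatchGroup, CompleteGroup and the max_records cap are hypothetical names chosen for illustration, not RocksDB's actual types; the sketch assumes a cap on record count and leaves out all synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical, simplified writer: each writing thread queues one of these.
struct Writer {
  std::vector<std::string> batch;  // records this thread wants to write
  bool wal_written = false;        // set once the leader has logged them
};

struct BatchGroup {
  std::vector<std::string> records;  // merged records, logged in one WAL write
  size_t writer_count = 0;           // how many queued writers were merged
};

// The writer at the front of the queue acts as leader. It always includes its
// own batch, then appends followers' batches until the (assumed) cap is hit.
BatchGroup BuildBatchGroup(const std::deque<Writer*>& writers,
                           size_t max_records) {
  assert(!writers.empty());
  BatchGroup group;
  for (Writer* w : writers) {
    if (group.writer_count > 0 &&
        group.records.size() + w->batch.size() > max_records) {
      break;  // group is full; remaining writers wait for the next leader
    }
    group.records.insert(group.records.end(), w->batch.begin(),
                         w->batch.end());
    ++group.writer_count;
  }
  return group;
}

// After the single WAL write, the leader marks the merged writers and removes
// them from the queue. Returns the number of writers completed.
size_t CompleteGroup(std::deque<Writer*>& writers, const BatchGroup& group) {
  for (size_t i = 0; i < group.writer_count; ++i) {
    writers.front()->wal_written = true;
    writers.pop_front();
  }
  return group.writer_count;
}
```

The point of the grouping is that many small writes share one WAL write; how the followers wait for the leader in the meantime is what the rest of the article is about.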

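2. LevelDB's write optimization

Write throughput is improved mainly by batching data from multiple threads to speed up WAL writes. LevelDB makes the same optimization, as follows: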

    Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
      ...
      MutexLock l(&mutex_);
      writers_.push_back(&w);
      while (!w.done && &w != writers_.front()) {
        w.cv.Wait();
      }
      if (w.done) {
        return w.status;
      }
      ...
    }

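As shown, followers wait through w.cv.Wait(), which is pthread_cond_wait underneath. Waking a thread blocked in pthread_cond_wait involves a transition from FUTEX_WAIT to FUTEX_WAKE, which takes about 10us on average, partly because the function has to unlock and relock the mutex internally. In the leader-batch write model, each wait on the condition variable therefore incurs an expensive context switch. Since the WAL write goes to the page cache first and is fast, this cost hurts overall throughput.

The test in the RocksDB section below shows how much performance this context switch costs. The most direct comparison, of course, is the throughput of RocksDB against LevelDB under the same configuration.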

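3. RocksDB's optimizations

As described above, LevelDB's batch write model makes follower threads wait on a condition variable, which causes expensive context switches.

RocksDB therefore replaces the plain pthread_cond_wait with three stages:

Busy Loop with pause
Short wait -- SOMETIMES busy Loop with yield
Long wait -- Blocking Wait

The following sections look at each in turn.

The entry point of this optimization is WriteThread::JoinBatchGroup -> WriteThread::AwaitState.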

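1. Busy Loop

The first stage lets the thread busy-wait in a loop for a short time. On a Xeon CPU one iteration takes about 7ns, and the loop spins 200 times, a little over 1us in total (200 x 7ns = 1.4us). This is often enough for the leader to finish writing the WriteBatch, and because the follower keeps its CPU while spinning, no context switch occurs. Compared with the roughly 10us cost of a pthread_cond_wait wake-up in LevelDB, this is far cheaper.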

    for (uint32_t tries = 0; tries < 200; ++tries) {
      state = w->state.load(std::memory_order_acquire);
      if ((state & goal_mask) != 0) {
        return state;
      }
      port::AsmVolatilePause();
    }

AsmVolatilePause mainly executes asm volatile("pause");, that is, the pause instruction. Intel's documentation describes the instruction as follows:

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

In short, pause improves the performance of spin-wait loops. When a CPU exits a spin-wait loop it may detect a possible memory order violation and flush its instruction pipeline, which costs performance. The pause instruction tells the CPU that the current code is a spin-wait loop, so in most cases the pipeline flush is avoided and the loss is reduced.
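The busy loop can be tried outside RocksDB with a small sketch. SpinAwait and CpuPause are hypothetical names; the x86-only guard around pause is an assumption of this sketch, and on other architectures the hint is simply skipped.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Emit a spin-wait hint where one is available; otherwise do nothing.
// Assumption: only x86 with a GCC/Clang-style compiler is handled here.
inline void CpuPause() {
#if defined(__i386__) || defined(__x86_64__)
  asm volatile("pause");
#endif
}

// Spin up to max_tries times waiting for any bit of goal_mask to appear in
// state. Returns the observed state on success, or 0 if the budget ran out.
uint8_t SpinAwait(const std::atomic<uint8_t>& state, uint8_t goal_mask,
                  uint32_t max_tries) {
  assert(goal_mask != 0);
  for (uint32_t tries = 0; tries < max_tries; ++tries) {
    uint8_t s = state.load(std::memory_order_acquire);
    if ((s & goal_mask) != 0) {
      return s;  // goal reached without ever leaving the CPU
    }
    CpuPause();
  }
  return 0;  // caller falls through to the next waiting stage
}
```

With max_tries set to 200 and roughly 7ns per iteration, the spin budget is about 1.4us, which is the figure quoted for the first stage.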

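2. Short Wait -- SOMETIMES busy Loop

If RocksDB could predict a thread's waiting time accurately, this stage would not be needed: a short busy loop plus a long blocking wait would be enough. In real workloads, however, the waiting time cannot be known in advance. For example, while the leader writes the WAL and the followers wait, one write may take milliseconds on an HDD, or ten-odd microseconds on NVMe, and the followers have to wait through all of it.

Short-Wait handles the waits that fall between the busy loop and the long wait, and its details are quite interesting.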

    if (max_yield_usec_ > 0) {
      update_ctx = Random::GetTLSInstance()->OneIn(sampling_base);

      if (update_ctx || yield_credit.load(std::memory_order_relaxed) >= 0) {
        // we're updating the adaptation statistics, or spinning has >
        // 50% chance of being shorter than max_yield_usec_ and causing no
        // involuntary context switches
        auto spin_begin = std::chrono::steady_clock::now();

        // this variable doesn't include the final yield (if any) that
        // causes the goal to be met
        size_t slow_yield_count = 0;

        auto iter_begin = spin_begin;
        // This loop is bounded: max_yield_usec_ is set through an external
        // option and defaults to 100
        while ((iter_begin - spin_begin) <=
               std::chrono::microseconds(max_yield_usec_)) {
          // give up the time slice first
          std::this_thread::yield();

          // scheduled again: leave the loop if state meets the goal
          state = w->state.load(std::memory_order_acquire);
          if ((state & goal_mask) != 0) {
            // success
            would_spin_again = true;
            break;
          }

          auto now = std::chrono::steady_clock::now();
          if (now == iter_begin ||
              now - iter_begin >= std::chrono::microseconds(slow_yield_usec_)) {
            // conservatively count it as a slow yield if our clock isn't
            // accurate enough to measure the yield duration
            ++slow_yield_count;
            if (slow_yield_count >= kMaxSlowYieldsWhileSpinning) {
              // Not just one ivcsw, but several.  Immediately update yield_credit
              // and fall back to blocking
              update_ctx = true;
              break;
            }
          }
          iter_begin = now;
        }
      }
    }

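Start with the while loop. Each iteration first calls std::this_thread::yield() to hand the rest of the time slice to other threads; once the thread is scheduled again it loads state, just as the busy loop does, and exits if the goal is met. The loop is bounded: it runs for at most max_yield_usec_ microseconds. This value comes from two external options: if enable_write_thread_adaptive_yield is true, max_yield_usec_ is set to write_thread_max_yield_usec, otherwise to 0. By default the loop therefore runs for at most 100us.

If the loop ran for the full 100us and state still had not changed, all those yields would be wasted work that also burns a lot of CPU. RocksDB avoids this with the remaining logic inside the loop, which measures how long each yield took. A yield that takes at least db_options.write_thread_slow_yield_usec (slow_yield_usec_, 3us by default) is counted as a slow yield. Once kMaxSlowYieldsWhileSpinning (3) slow yields have accumulated in the loop, the wait is considered too long and the thread falls back to Long-wait.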
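The yield loop and its slow-yield cutoff can be sketched as a standalone function. YieldAwait and YieldResult are hypothetical names; the three numeric parameters stand in for max_yield_usec_, slow_yield_usec_ and kMaxSlowYieldsWhileSpinning, and the credit bookkeeping is left out.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

struct YieldResult {
  bool goal_met;       // true if the state reached the goal while yielding
  size_t slow_yields;  // yields that looked like real context switches
};

// Yield repeatedly for at most max_yield_usec microseconds, giving up early
// once max_slow_yields yields each took slow_yield_usec or longer.
YieldResult YieldAwait(const std::atomic<uint8_t>& state, uint8_t goal_mask,
                       uint64_t max_yield_usec, uint64_t slow_yield_usec,
                       size_t max_slow_yields) {
  using Clock = std::chrono::steady_clock;
  assert(goal_mask != 0);
  YieldResult result{false, 0};
  const auto spin_begin = Clock::now();
  auto iter_begin = spin_begin;
  while (iter_begin - spin_begin <=
         std::chrono::microseconds(max_yield_usec)) {
    std::this_thread::yield();
    if ((state.load(std::memory_order_acquire) & goal_mask) != 0) {
      result.goal_met = true;
      break;
    }
    const auto now = Clock::now();
    // A clock too coarse to measure the yield is treated as a slow yield.
    if (now == iter_begin ||
        now - iter_begin >= std::chrono::microseconds(slow_yield_usec)) {
      if (++result.slow_yields >= max_slow_yields) {
        break;  // the wait is evidently long; fall back to blocking
      }
    }
    iter_begin = now;
  }
  return result;
}
```

Calling YieldAwait(state, mask, 100, 3, 3) corresponds to the default settings quoted in the text. A fast yield means no other thread took the CPU, so continuing to poll is cheap; a slow yield suggests the thread was actually switched out, which is the cost this stage is trying to avoid.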

The RocksDB snippet also has conditions guarding entry to the while loop:

    if (max_yield_usec_ > 0) {
      update_ctx = Random::GetTLSInstance()->OneIn(sampling_base);

      if (update_ctx || yield_credit.load(std::memory_order_relaxed) >= 0) {
        ......
      }
    }

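First, max_yield_usec_ must be greater than 0. It is controlled by an external option and defaults to 100. If it is 0, the thread goes straight to long-wait.

Second, either update_ctx is true or yield_credit is non-negative. update_ctx comes from the random OneIn call; with sampling_base at 256 it is true with probability 1/256, so roughly one wait in 256 is sampled to update the statistics. yield_credit records whether recent short-waits succeeded: on a sampled wait it is increased if the short-wait met the goal and decreased otherwise. As long as short-waits keep succeeding, yield_credit stays non-negative and most waits are handled in the short-wait stage. This is the trade-off RocksDB makes between write throughput and CPU time, because frequent short-waits mean frequent context switches and therefore more CPU consumption.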
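The two entry conditions can be written as a small predicate. This is a sketch under assumptions: WriteThreadOptions is a hypothetical struct that only mirrors the three option names and defaults quoted in this article, and OneIn here is a deterministic stand-in for the random draw, not RocksDB's Random class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the options named in the text, with the defaults
// the article quotes.
struct WriteThreadOptions {
  bool enable_write_thread_adaptive_yield = true;
  uint64_t write_thread_max_yield_usec = 100;
  uint64_t write_thread_slow_yield_usec = 3;
};

// max_yield_usec_ as described: the configured value when adaptive yield is
// enabled, otherwise 0.
uint64_t EffectiveMaxYieldUsec(const WriteThreadOptions& o) {
  return o.enable_write_thread_adaptive_yield ? o.write_thread_max_yield_usec
                                              : 0;
}

// Stand-in for a 1-in-n draw: true for exactly one residue out of n.
bool OneIn(uint32_t random_value, uint32_t n) {
  assert(n > 0);
  return random_value % n == 0;
}

// Short-wait runs only if it is enabled and either this wait was sampled
// (update_ctx) or the credit says recent short-waits paid off.
bool ShouldShortWait(const WriteThreadOptions& o, bool sampled,
                     int32_t yield_credit) {
  if (EffectiveMaxYieldUsec(o) == 0) {
    return false;  // disabled: go straight to the blocking wait
  }
  return sampled || yield_credit >= 0;
}
```

The sampled case matters when the credit is negative: an occasional short-wait is still attempted, which gives the credit a chance to recover if conditions improve.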

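The update logic for yield_credit is as follows:

If the short-wait meets the goal, would_spin_again is set to true and yield_credit is increased; otherwise it is decreased. The step is scaled to 131072 (2^17) rather than 1, and the update only happens on sampled waits, when update_ctx is true.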

    if (update_ctx) {
      // Since our update is sample based, it is ok if a thread overwrites the
      // updates by other threads. Thus the update does not have to be atomic.
      auto v = yield_credit.load(std::memory_order_relaxed);
      // fixed point exponential decay with decay constant 1/1024, with +1
      // and -1 scaled to avoid overflow for int32_t
      //
      // On each update the positive credit is decayed by a factor of 1/1024 (i.e.,
      // 0.1%). If the sampled yield was successful, the credit is also increased
      // by X. Setting X=2^17 ensures that the credit never exceeds
      // 2^17*2^10=2^27, which is lower than 2^31 the upperbound of int32_t. Same
      // logic applies to negative credits.
      v = v - (v / 1024) + (would_spin_again ? 1 : -1) * 131072;
      yield_credit.store(v, std::memory_order_relaxed);
    }
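The bound stated in the comment can be checked by extracting the update into a pure function. UpdateYieldCredit and Repeat are hypothetical helper names for this sketch; the arithmetic is the same expression as in the snippet.

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t kCreditStep = 131072;      // 2^17, the scaled +1 / -1
constexpr int32_t kCreditLimit = 134217728;  // 2^27 = 2^17 * 2^10

// One sampled update: decay the credit by 1/1024, then add or subtract a step.
int32_t UpdateYieldCredit(int32_t v, bool would_spin_again) {
  assert(v >= -kCreditLimit && v <= kCreditLimit);
  return v - (v / 1024) + (would_spin_again ? 1 : -1) * kCreditStep;
}

// Apply the same outcome n times in a row.
int32_t Repeat(int32_t v, bool would_spin_again, int n) {
  for (int i = 0; i < n; ++i) {
    v = UpdateYieldCredit(v, would_spin_again);
  }
  return v;
}
```

At v = 2^27 the decay term v / 1024 equals the step 2^17, so a further success leaves the credit unchanged; that is why the value saturates instead of overflowing. The decay also means old outcomes fade, so a long run of failures eventually drives a saturated credit negative and steers waits to the blocking stage.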

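In tests on NVMe devices with concurrent writes to RocksDB, most thread waits end up in short-wait. A simple test program and its results later in this article show how much this optimization affects RocksDB throughput. Judging from the comparison, the impact is quite large.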

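3. Long-wait

If neither of the first two stages sees the state change, the thread has to enter the long-wait stage, which follows the same logic as LevelDB and blocks on a condition variable wait.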

    uint8_t WriteThread::BlockingAwaitState(Writer* w, uint8_t goal_mask) {
      // We're going to block.  Lazily create the mutex.  We guarantee
      // propagation of this construction to the waker via the
      // STATE_LOCKED_WAITING state.  The waker won't try to touch the mutex
      // or the condvar unless they CAS away the STATE_LOCKED_WAITING that
      // we install below.
      w->CreateMutex();

      auto state = w->state.load(std::memory_order_acquire);
      assert(state != STATE_LOCKED_WAITING);
      if ((state & goal_mask) == 0 &&
          w->state.compare_exchange_strong(state, STATE_LOCKED_WAITING)) {
        ...
        w->StateCV().wait(guard, [w] {
          return w->state.load(std::memory_order_relaxed) != STATE_LOCKED_WAITING;
        });
      }
      ...
    }

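There is an optimization here as well. CreateMutex creates the mutex and condvar lazily, only when a thread is actually about to block, and the condvar wait itself runs only if state still does not meet the goal and the CAS to STATE_LOCKED_WAITING succeeds.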
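The handshake between the blocked waiter and its waker can be sketched as follows. This is a simplified illustration, not RocksDB's code: the Waiter struct, the state constants and their values, and the use of std::unique_ptr for lazy creation are assumptions of the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical state bits for this sketch.
constexpr uint8_t kStateInit = 1;
constexpr uint8_t kStateLockedWaiting = 2;
constexpr uint8_t kStateCompleted = 4;

struct Waiter {
  std::atomic<uint8_t> state{kStateInit};
  std::unique_ptr<std::mutex> mu;               // created only when blocking
  std::unique_ptr<std::condition_variable> cv;  // created only when blocking
};

uint8_t BlockingAwaitState(Waiter* w, uint8_t goal_mask) {
  // Lazy creation: writers that never reach this stage never pay for it.
  if (!w->mu) {
    w->mu.reset(new std::mutex);
    w->cv.reset(new std::condition_variable);
  }
  uint8_t state = w->state.load(std::memory_order_acquire);
  assert(state != kStateLockedWaiting);
  if ((state & goal_mask) == 0 &&
      w->state.compare_exchange_strong(state, kStateLockedWaiting)) {
    // The CAS publishes "I am blocked"; only after seeing it may the waker
    // touch mu and cv.
    std::unique_lock<std::mutex> guard(*w->mu);
    w->cv->wait(guard, [w] {
      return w->state.load(std::memory_order_relaxed) != kStateLockedWaiting;
    });
    state = w->state.load(std::memory_order_relaxed);
  }
  return state;  // on CAS failure this already holds the waker's new state
}

void SetState(Waiter* w, uint8_t new_state) {
  uint8_t state = w->state.load(std::memory_order_acquire);
  if (state == kStateLockedWaiting ||
      !w->state.compare_exchange_strong(state, new_state)) {
    // The waiter is blocked (or blocked just now): take the slow path.
    assert(state == kStateLockedWaiting);
    std::lock_guard<std::mutex> guard(*w->mu);
    w->state.store(new_state, std::memory_order_relaxed);
    w->cv->notify_one();
  }
}
```

In use, the follower thread would call BlockingAwaitState(&w, kStateCompleted) while the leader thread calls SetState(&w, kStateCompleted). If the leader gets there first, its CAS succeeds and neither the mutex nor the condvar is touched by the waker, so the expensive path is taken only when a thread really is asleep.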

RocksDB has pushed the optimization of this part of the code very far.

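4. Test and verification

The write function below gives a simple check of how much enabling or disabling short-wait affects write throughput. A single DB is loaded with 10 threads, and apart from the one option being compared, all parameters are defaults: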

    void DBWrite(int num) {
      double ts = now();
      int db_num = num % FLAGS_multidb_nums;
      while (true) {
        std::string key = std::to_string(generator_());
        std::string value(FLAGS_value_len, 'x');
        if (num == 0) {
          rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
          rocksdb::get_perf_context()->Reset();
          // rocksdb::get_iostats_context()->Reset();
        }
        src_db[db_num]->Put(rocksdb::WriteOptions(), "test_graph_" + key, value);
        ++g_op_W;
        if (num == 0 && now() - ts >= 1) {
          // once per second, print the latency data of thread 0
          rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
          std::cout << "\nwrite_thread_wait_nanos "
                    << rocksdb::get_perf_context()->write_thread_wait_nanos
                    << std::endl;
          ts = now();
        }
      }
    }

One run has enable_write_thread_adaptive_yield enabled, which is also the default.

The other run has the option disabled.

## Option disabled
write_thread_wait_nanos 14766
write_speed : 186138
write_thread_wait_nanos 43489
write_speed : 181371
write_thread_wait_nanos 41962
write_speed : 177163
write_thread_wait_nanos 33322
write_speed : 171145

## Option enabled
write_thread_wait_nanos 14880
write_speed : 388869
write_thread_wait_nanos 14267
write_speed : 353567
write_thread_wait_nanos 12905
write_speed : 364561
write_thread_wait_nanos 11570
write_speed : 363793

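With short-wait disabled, throughput under 10 concurrent writers differs by about 2x. With the option off and enough concurrency, threads most likely end up blocked in long-wait, which is why write_thread_wait_nanos is so much longer when short-wait is disabled. Short-wait therefore gives a clear improvement over the LevelDB approach, and RocksDB's advantage is likely to grow under higher concurrency.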
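The 2x figure can be checked from the write_speed samples in the output above. The helper names Mean and Speedup are hypothetical; the numbers are copied from the two runs.

```cpp
#include <cassert>
#include <vector>

double Mean(const std::vector<double>& xs) {
  assert(!xs.empty());
  double sum = 0.0;
  for (double x : xs) {
    sum += x;
  }
  return sum / static_cast<double>(xs.size());
}

// write_speed samples from the run with the option disabled and enabled.
const std::vector<double> kSpeedDisabled = {186138, 181371, 177163, 171145};
const std::vector<double> kSpeedEnabled = {388869, 353567, 364561, 363793};

// Ratio of average throughput with short-wait enabled to disabled.
double Speedup() { return Mean(kSpeedEnabled) / Mean(kSpeedDisabled); }
```

The averages are 178954.25 and 367697.5 writes per second, a ratio of roughly 2.05. Four samples per run is a small sample, so this is an indication rather than a precise measurement.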

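4. Summary

RocksDB has derived all these optimizations from the pthread_cond_wait in LevelDB's batch write model, with the goal of raising write throughput further. They also cost more CPU, the short-wait strategy being one example.
Writing code like this requires a deep understanding of kernel thread scheduling and of how the CPU pipeline behaves, and my own understanding of the kernel needs to improve. For those of us whose coding foundations are not yet strong, diving straight into the kernel is hard; it is better to become fluent with the relevant facilities first, such as memory barriers and the basic functions for yielding a CPU time slice.

Following RocksDB's internals really is a good way to learn the essence of a storage engine: every code detail has been carefully polished.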

Some operating-system latency figures:

Operation	Latency
Mutex lock/unlock	25ns
Main memory access	100ns
Context switch	3us
NVMe random read/write	10-20us
HDD seek	5-10ms
