On the Atomicity of the write System Call in the Linux Kernel

Published: 2023/12/10 linux 25 豆豆

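Is the Linux write call atomic or not? A search turns up piles of articles, most of which either translate some reference or talk nonsense. Here I'll use a concrete case to try to give a slightly better answer.

Conclusions first. There are two: what the write call cannot guarantee, and what it can.

First, write cannot guarantee that the operation you asked for is atomic. Take this call: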

ret = write(fd, buff, 512);

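Linux cannot guarantee that writing the 512-byte buff to the file happens atomically, because:

1. The 512 you pass is only an upper bound; buff does not necessarily hold 512 bytes.

2. The write may be interrupted partway by a signal, so ret ends up smaller than 512.

3. Implementations differ between systems and are almost always layered, so the interface cannot ensure that every layer has reserved the resources it needs. The disk buffers may run out of space, for example, and cause a lower-level operation to fail.

If it weren't for these factors, why wouldn't write be designed to simply return True or False? Either all 512 bytes are written or nothing is; wouldn't that be better? It isn't designed that way precisely because of the unavoidable factors above.

In system call design, distrust is worth more than trust, and planning for the worst takes priority over blind optimism.

Second, what write does guarantee is this: however much data it actually wrote, say n bytes, the writing of those n bytes is atomic and uninterruptible with respect to all threads or processes sharing the file descriptor. For example, if thread 1 writes three 'a' characters and thread 2 writes three 'b' characters, the result must be 'aaabbb' or 'bbbaaa'; an interleaving such as 'abaabb' is impossible.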
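The shared-descriptor case can be tried with a small sketch like the one below. It is only an illustration under stated assumptions: the function name shared_fd_demo and the temporary file name are made up for this example, and it uses a forked child rather than threads because a child inherits the very same open file description as its parent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Parent and child share one open file description (inherited across
 * fork). Each issues a single 3-byte write, then the parent reads the
 * file back into out, which must hold at least 7 bytes.
 * Returns the file length, or -1 on error.
 */
int shared_fd_demo(char *out)
{
    char path[] = "/tmp/write_shared_XXXXXX"; /* hypothetical temp name */
    int fd, status = 0;
    pid_t pid;
    ssize_t w, n;

    assert(out != NULL);
    memset(out, 0, 7);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file lives until fd is closed */

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {             /* child: one write through the shared fd */
        w = write(fd, "bbb", 3);
        _exit(w == 3 ? 0 : 1);
    }

    w = write(fd, "aaa", 3);    /* parent: one write through the same fd */
    waitpid(pid, &status, 0);
    if (w != 3 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    n = pread(fd, out, 6, 0);   /* read back without touching the offset */
    close(fd);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return (int)n;
}
```

On a kernel that honors the guarantee described above, out should end up as either "aaabbb" or "bbbaaa", never a mix and never a shorter file.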

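Naturally you may ask: what if the two processes do not share a file descriptor? Say process A and process B each open the file independently, A writes three 'a' characters and B writes three 'b' characters. What happens then?

The answer is that in this case nothing is guaranteed. The result may be 'aaabbb' or 'bbbaaa', or an interleaving such as 'abaabb'. And if you don't want interleaving? There is an answer for that too: have every writing process open the file with O_APPEND.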
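A minimal sketch of the difference follows. It is sequential rather than concurrent, so it is not a race test; it only shows that two independent opens each carry their own offset, and that O_APPEND makes each write land at the current end of file. The function name independent_open_demo and the temp file name are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Open the same file twice, independently (two separate open file
 * descriptions, each with its own offset), write "aaa" through one and
 * "bbb" through the other, then read the result into out (>= 7 bytes).
 * extra_flags is 0 or O_APPEND. Returns the file length, or -1 on error.
 */
int independent_open_demo(int extra_flags, char *out)
{
    char path[] = "/tmp/write_indep_XXXXXX"; /* hypothetical temp name */
    int tmp, fd_a, fd_b, rd;
    ssize_t n = -1;

    assert(out != NULL);
    memset(out, 0, 7);

    tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    close(tmp);

    fd_a = open(path, O_WRONLY | extra_flags);  /* "process A" */
    fd_b = open(path, O_WRONLY | extra_flags);  /* "process B" */
    rd = open(path, O_RDONLY);
    unlink(path);

    if (fd_a >= 0 && fd_b >= 0 && rd >= 0 &&
        write(fd_a, "aaa", 3) == 3 &&
        write(fd_b, "bbb", 3) == 3) {
        n = read(rd, out, 6);
        if (n >= 0)
            out[n] = '\0';
    }

    if (fd_a >= 0) close(fd_a);
    if (fd_b >= 0) close(fd_b);
    if (rd >= 0) close(rd);
    return (int)n;
}
```

Without O_APPEND the second writer starts at its own offset 0 and overwrites the first, so the file holds just "bbb"; with O_APPEND both records survive as "aaabbb".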

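As a typical system call that interacts with user space, write cannot guarantee that what the user asked for is atomic, but within the scope of a shared file it does guarantee that what it actually completed is atomic. In the non-shared case it cannot even guarantee that, but it provides a mechanism for achieving that guarantee. So the write system call is very well designed, with very clear boundaries!

As for how these guarantees are implemented, here is a brief explanation. I didn't originally intend to explain, but it helps a lot in understanding the guarantees above, so I had to add it. In summary, the pattern is:

APPEND mode locks the inode, which guarantees that each write appends its data after the file size read from the inode; the lock is released once the write completes.

Non-APPEND mode locks the file structure, reads its pos field, writes the data at pos, updates pos, and then releases the lock.

So APPEND mode provides write safety that is global at the file level, while non-APPEND mode provides write safety among the processes/threads that share the same file structure.

It bears repeating: write only guarantees the atomicity of a single write operation at the inode or file level; it cannot guarantee that all the data the user wants written gets written in one call. So when multiple threads or processes share a file, the user-space program has to deal with short writes itself, for example by having a lock protect a loop that runs until the write completes or fails; until then the loop doesn't exit and the lock isn't released.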
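The lock-protected loop could look roughly like the sketch below. The names write_lock and write_all_locked are hypothetical, and the pthread mutex used here only serializes threads inside one process; processes sharing a file would need some cross-process lock instead, which this sketch does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* One lock for all writers in this process (an assumption of the sketch). */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Keep calling write() until all len bytes are out or a real error
 * occurs, holding the lock for the whole loop so that another thread's
 * record cannot slip in between two partial writes.
 * Returns 0 on success, -1 on error (errno is set by write()).
 */
int write_all_locked(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    int rc = 0;

    assert(buf != NULL || len == 0);

    pthread_mutex_lock(&write_lock);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            rc = -1;            /* real error: give up */
            break;
        }
        p += n;                 /* short write: advance and go again */
        len -= (size_t)n;
    }
    pthread_mutex_unlock(&write_lock);
    return rc;
}
```

A caller would use write_all_locked(fd, record, record_len) wherever it previously called write() directly and ignored the return value.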

Also, as we know, Apache, nginx, and some other servers rely on APPEND to get independent atomic writes to their logs, and these logs are extremely important to such servers.
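The general shape of append-mode logging can be sketched as follows. This is not the code of Apache or nginx; log_open and log_line are hypothetical helpers that only illustrate the idea of opening with O_APPEND and handing each complete record to the kernel in a single write() call.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Open (or create) a log file so that every write lands at end of file. */
int log_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/*
 * Format one record into a local buffer and pass it to the kernel with
 * a single write() call. Returns 0 on success, -1 if formatting failed,
 * the record did not fit, or the write was short or failed.
 */
int log_line(int fd, const char *fmt, ...)
{
    char rec[512];
    va_list ap;
    int len;
    ssize_t n;

    assert(fmt != NULL);

    va_start(ap, fmt);
    len = vsnprintf(rec, sizeof rec, fmt, ap);
    va_end(ap);
    if (len < 0 || (size_t)len >= sizeof rec)
        return -1;              /* formatting error or record too long */

    n = write(fd, rec, (size_t)len);
    return n == (ssize_t)len ? 0 : -1;
}
```

Building the whole record first matters: if one record were emitted with several write() calls, another writer's record could land between the pieces even in append mode.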

It might seem the article could end here, but the main event is still to come!

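I wrote a program that analyzes TCP packets and records packet information to a file by logging continuously. It is multithreaded, with ten-odd threads writing at the same time to one file on an in-memory file system, and in the end I found that one log entry was missing! The program itself isn't the point; the following small program stands in for it: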

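    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/prctl.h>

    char a[512];
    char b[16];

    int main()
    {
        int fd;

        memset(a, 'a', 512);
        memset(b, '-', 16);

        fd = open("/usr/src/probe/test.txt", O_RDWR|O_CREAT|O_TRUNC, 0660);
        if (fork() == 0) {
            /* child: rename itself so it can be told apart from "parent" */
            prctl(PR_SET_NAME, (unsigned long)"child");
            write(fd, b, 16);
            exit(0);
        }
        /* parent: shares the same file structure as the child */
        write(fd, a, 512);
        exit(0);
    }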

Compile it as parent and run it. Can you guess what ends up in test.txt?

Since parent and child share the file structure that fd refers to, by the explanation above the final content must be one of the following two:

    ----------------aaaaaaaa...aaaaaaaa    (16 '-' characters followed by 512 'a' characters)

or:

    aaaaaaaa...aaaaaaaa----------------    (512 'a' characters followed by 16 '-' characters)

But that is not what happens! With small probability, the file contains only 512 'a' characters and not a single '-' (other outcomes are possible too). Why?
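Because the bad outcome is rare, it helps to check the file mechanically instead of by eye. The helper below is a small sketch for that; the name count_byte_in_file is made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count how many bytes equal to `byte` the file at `path` contains.
 * Returns the count, or -1 if the file cannot be opened.
 */
long count_byte_in_file(const char *path, int byte)
{
    FILE *fp;
    long count = 0;
    int c;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    while ((c = fgetc(fp)) != EOF) {
        if (c == byte)
            count++;
    }
    fclose(fp);
    return count;
}
```

For the experiment above, a healthy run should give 16 for '-' and 512 for 'a' on /usr/src/probe/test.txt; a count of 0 for '-' is the overwritten case.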

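You can imagine how painful it is when reality disagrees with the theory. The standard clearly says that a single write by processes/threads sharing a file structure must be atomic, yet part of the content was demonstrably overwritten. That clearly isn't right.

Besides, a system call should be designed from the start to give some level of guarantee, such as atomicity of a single operation; that is what makes a system API friendly. I trust the standard is right, so I concluded this must be a bug in the code. Is that how the reasoning went?

No! That paragraph is hindsight. This article is actually told in reverse: I first found that a write had been overwritten, then investigated step by step, and only at the end arrived at the theory at the start of this article, not the other way round. So when I saw this baffling error I had no idea whether it was legitimate; I simply believed, as a matter of faith, that it was yet another kernel bug. But how could I prove it?

First I had to think of a scenario in which a write gets overwritten, then try to reproduce it, and finally fix it. The first step, as always, was to read the code. The kernel in question was the 3.10 community kernel, so I dug out the source: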

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
            size_t, count)
    {
        struct fd f = fdget(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            file_pos_write(f.file, pos);
            fdput(f);
        }

        return ret;
    }

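Honestly, it took me a good 10 minutes of staring at this code to spot a race. Simply put, I split the system call into three parts:

1. get pos

2. vfs_write

3. update pos

The race occurs between steps 1 and 2, or between steps 2 and 3: while one writer sits between two steps, another writer sharing the same file structure can read the same stale pos, and whichever of them writes later lands on top of the other's data.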
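The interleaving can be replayed deterministically in user space. The sketch below is a toy model, not kernel code: struct toy_file and the toy_* helpers are invented stand-ins for the file structure and the three steps, and the "scheduling" is simply the order of the calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the kernel's file structure plus the file contents. */
struct toy_file {
    size_t pos;             /* shared offset, playing the role of pos */
    size_t size;            /* current file size */
    char data[1024];
};

/* Step 1: get pos. */
static size_t toy_get_pos(const struct toy_file *f)
{
    return f->pos;
}

/* Step 2: write at a private copy of pos and advance that copy. */
static void toy_vfs_write(struct toy_file *f, const char *buf, size_t n,
                          size_t *pos)
{
    assert(*pos + n <= sizeof f->data);
    memcpy(f->data + *pos, buf, n);
    *pos += n;
    if (*pos > f->size)
        f->size = *pos;
}

/* Step 3: update pos. */
static void toy_set_pos(struct toy_file *f, size_t pos)
{
    f->pos = pos;
}

/* Parent stalls after step 1; child runs all three steps; parent resumes. */
void replay_race(struct toy_file *f)
{
    char a[512], b[16];
    size_t parent_pos, child_pos;

    memset(a, 'a', sizeof a);
    memset(b, '-', sizeof b);
    memset(f, 0, sizeof *f);

    parent_pos = toy_get_pos(f);                /* parent step 1: sees 0 */

    child_pos = toy_get_pos(f);                 /* child step 1: also 0 */
    toy_vfs_write(f, b, sizeof b, &child_pos);  /* child step 2 */
    toy_set_pos(f, child_pos);                  /* child step 3: pos = 16 */

    toy_vfs_write(f, a, sizeof a, &parent_pos); /* parent step 2: at 0 */
    toy_set_pos(f, parent_pos);                 /* parent step 3: pos = 512 */
}
```

After replay_race the toy file holds 512 'a' characters and no '-', which matches the missing-data symptom seen in test.txt.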

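Now that I had found it, reproducing it was easy. There are two kinds of approach. One is to write like mad and hope to get lucky, which is not my style. The other, which I prefer, is to deliberately widen the race window!

For this scenario I used the jprobe mechanism to deliberately insert a schedule between steps 1 and 2. Try loading a module containing the following code: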

    /* jprobe entry for vfs_write: stall the process named "parent" */
    ssize_t jvfs_write(struct file *file, const char __user *buf,
            size_t count, loff_t *pos)
    {
        if (!strcmp(current->comm, "parent")) {
            msleep(2000);
        }
        jprobe_return();
        return 0;
    }

    static struct jprobe delay_stub = {
        .kp = {
            .symbol_name = "vfs_write",
        },
        .entry = jvfs_write,
    };

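My machine runs with HZ=1000. The code above sleeps for 2 seconds between steps 1 and 2, which reproduces the problem almost 100% of the time.

I ran it, and the problem really did reproduce! The file had 512 'a' characters and not a single '-'!

The problem looks so easy to reproduce on a multi-CPU machine that anyone would think it could not possibly have survived unpatched up to 3.10! But the kernel source is right there, and it really does have the problem! Only then did it occur to me to read some documentation, to find out whether this was a real problem or legitimate behavior that user-space programs are expected to work around somehow. I'll skip the detours: just run man 2 write and read the BUGS section: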

    BUGS
        According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread Interactions with Regular File Operations"):

            All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: ...

        Among the APIs subsequently listed are write() and writev(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, on Linux before version 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a write() (or writev(2)) at the same time, then the I/O operations were not atomic with respect updating the file offset, with the result that the blocks of data output by the two processes might (incorrectly) overlap. This problem was fixed in Linux 3.14.

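So the 3.10 kernel really does have a bug, and kernels from 3.14 on fix it. Very good! I looked at a 4.14 kernel and the problem is gone; it was fixed back in the 3.14 community kernel: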

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
            size_t, count)
    {
        struct fd f = fdget_pos(fd);    /* this takes the file's pos lock */
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            if (ret >= 0)
                file_pos_write(f.file, pos);
            fdput_pos(f);
        }

        return ret;
    }

The description of the patch for this problem:

    From: Linus Torvalds
    Date: Mon, 3 Mar 2014 09:36:58 -0800
    Subject: [PATCH 1/2] vfs: atomic f_pos accesses as per POSIX

    Our write() system call has always been atomic in the sense that you get
    the expected thread-safe contiguous write, but we haven't actually
    guaranteed that concurrent writes are serialized wrt f_pos accesses, so
    threads (or processes) that share a file descriptor and use "write()"
    concurrently would quite likely overwrite each others data.

    This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

     "2.9.7 Thread Interactions with Regular File Operations

      All of the following functions shall be atomic with respect to each
      other in the effects specified in POSIX.1-2008 when they operate on
      regular files or symbolic links: [...]"

    and one of the effects is the file position update.

    This unprotected file position behavior is not new behavior, and nobody
    has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
    Michael Kerrisk that was due to this.

    This resolves the issue with a f_pos-specific lock that is taken by
    read/write/lseek on file descriptors that may be shared across threads
    or processes.

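This roller-coaster seems to be over. The takeaway: when you hit a problem, going straight to the documentation rather than the code will probably get you to the answer faster.

This is definitely the last part of the article. If any more story turns up, I promise I'll give up! I only picked this problem up to play with because I happened to run into it.

When I rebooted the machine into the CentOS 2.6.32 kernel (I figured an older kernel would reproduce the problem more easily and make the point better), loaded the same jprobe kernel module, and ran the same parent program, the problem did not reproduce. On the contrary, while parent was blocked in that msleep, child was blocked too, which looks like the behavior after the bug fix.

My first reaction was that this was unlikely: if even 3.10 has the problem, how could 2.6.32 avoid it?! On reflection, though, that was wrong. The problematic 3.10 was a community kernel, while the 2.6.32 is a CentOS kernel, and the latter pulls in many upstream patches to fix obvious problems. For a stable kernel derived from Red Hat's, this is nothing unusual.

Finally, I found the implementation of write: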

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
            size_t, count)
    {
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light_pos(fd, &fput_needed);    /* this is the key */
        if (file) {
            loff_t pos = file_pos_read(file);
            ret = vfs_write(file, buf, count, &pos);
            file_pos_write(file, pos);
            fput_light_pos(file, fput_needed);
        }

        return ret;
    }

Note that fget_light_pos is a new implementation:

    struct file *fget_light_pos(unsigned int fd, int *fput_needed)
    {
        struct file *file = fget_light(fd, fput_needed);

        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
            if (file_count(file) > 1) {
                *fput_needed |= FDPUT_POS_UNLOCK;
                /* if more than one process/thread is using the same file, lock it first! */
                mutex_lock(&file->f_pos_lock);
            }
        }
        return file;
    }

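This is where things changed! CentOS had long since pulled in the patch that fixes the problem, and once the problem is fixed it can no longer be reproduced.

So community kernels and distribution kernels are quite different, with different priorities. A community kernel probably cares more about the kernel's own subsystems and about performance, while a distribution kernel places more weight on stability and on system calls. After all, a system exists to run applications, and system calls, as the interface, have to be stable and bug-free!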
