Birth of a Bug: Deadlock Caused by Signal Handling
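This bug came from a strange phenomenon in a project: the code had no obvious locking problem, yet at run time it behaved as if it had deadlocked. I have reduced the business logic to this: a parent process keeps one child process alive at all times. (When reposting, please credit breaksoftware's CSDN blog.)
First we define a struct, ProcessGuard, which holds the child process ID and the lock that protects it. This lets us operate on the struct safely from multiple threads.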
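#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

struct ProcessGuard {
    pthread_mutex_t pids_mutex;  /* protects pid */
    pid_t pid;                   /* child process ID, 0 if there is no child */
};

/* Global guard shared by the monitor thread and the signal handler. */
struct ProcessGuard *g_guard;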
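The main thread of the parent process starts a thread that keeps checking whether the pid in ProcessGuard is 0 (that is, no child process exists). If there is none, it creates a child process and records the process ID in pid: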
void child_process() {
    while (1) {
        printf("This is the child process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
        sleep(1);
    }
}

void create_process_routine() {
    printf("This is the child thread of parent process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
    while (1) {
        int child = 0;
        if (child == 0) {
            pthread_mutex_lock(&g_guard->pids_mutex);
        }
        if (g_guard->pid != 0) {
            /* Note: this path skips the unlock at the bottom of the loop. */
            continue;
        }
        pid_t pid = fork();
        sleep(1);
        printf("Create child process %d.\n", pid);
        if (pid < 0) {
            perror("fork failed");
        }
        else if (pid == 0) {
            /* child process: never returns */
            child_process();
            child = 1;
            break;
        }
        else {
            /* parent process */
            g_guard->pid = pid;
            printf("dispatch task to process. pid is %d.\n", pid);
        }
        if (child == 0) {
            pthread_mutex_unlock(&g_guard->pids_mutex);
        }
        else {
            break;
        }
    }
}
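In the main thread of the parent process we register a signal handler. If the child process is killed, the handler sets pid in ProcessGuard to 0, so that the parent's monitor thread starts a new process.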
void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}
Finally, the parent process: after initializing a few structures, it registers the signal handler and starts the thread that creates the child process.
int main(void) {
    pthread_t creat_process_tid;

    g_guard = malloc(sizeof(struct ProcessGuard));
    if (pthread_mutex_init(&g_guard->pids_mutex, NULL) != 0) {
        perror("init pids_mutex error.");
        exit(1);
    }
    g_guard->pid = 0;

    printf("This is the Main thread of parent process.PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());

    /* Handle child termination in sighandler. */
    signal(SIGCHLD, sighandler);
    pthread_create(&creat_process_tid, NULL, (void*)create_process_routine, NULL);

    while (1) {
        printf("Get task from network.\n");
        sleep(1);
    }

    pthread_mutex_destroy(&g_guard->pids_mutex);
    return 0;
}
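In the code above, the lock is used only in the thread function create_process_routine and in the signal handler sighandler. Neither calls the other anywhere in the code, so there should be no deadlock! In practice, that is not what happens.
We run the program and kill the child process, and find that the main process does not start a new child process.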
$ ./test
This is the Main thread of parent process.PID is 17641.My thread_id is 140014057678656.
Get task from network.
This is the child thread of parent process. My PID is 17641.My thread_id is 140014049122048.
Create child process 17643.
dispatch task to process. pid is 17643.
Create child process 0.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the parent process.Catch signal 17.My PID is 17641.My thread_id is 140014049122048.
Get task from network.
Get task from network.
Get task from network.
Get task from network.
Get task from network.
This does not match our design, and it does not make much sense either. So we attach gdb to the main process.
Attaching to process 17641
[New LWP 17642]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f57902be740 (LWP 17641) "test" 0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  2    Thread 0x7f578fa95700 (LWP 17642) "test" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb) t 2
[Switching to thread 2 (Thread 0x7f578fa95700 (LWP 17642))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#2 0x000055c512c29a9d in sighandler ()
#3 <signal handler called>
#4 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#5 0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#6 0x000055c512c29b42 in create_process_routine ()
#7 0x00007f578fe8e6db in start_thread (arg=0x7f578fa95700) at pthread_create.c:463
#8 0x00007f578fbb788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
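Looking at the call stack of thread 2, we find that frame 5 and frame 1 are locking the same mutex (0x55c51383e260). In our thread code, lock and unlock are meant to come in pairs, so where does the second lock come from?
We can see that the lock in frame 1 originates from the function in frame 2, sighandler, that is, the following code: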
void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}
So here is the question: the thread function create_process_routine never calls sighandler, so where does this call come from?
In the Linux documentation at http://man7.org/linux/man-pages/man7/signal.7.html we find this passage about signals:
A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
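This says that a process-directed signal is delivered to any one of the threads that does not currently have that signal blocked. Exactly which one is decided by the kernel. That means our sighandler may run on the main thread, or it may run on the child thread. In the backtrace above it ran on the child thread, interrupting it inside the pthread_mutex_lock call made by create_process_routine. The handler then asks for the same mutex on the same thread, and the interrupted code cannot make progress until the handler returns, so the handler's lock can never succeed. That is the deadlock we observed.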
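So how do we fix it? The official approach is to use sigprocmask so that threads with a potential deadlock relationship do not receive these signals (in a multithreaded program, the per-thread call is pthread_sigmask). But this approach has a flaw in complex systems: our projects often pull in various open-source or third-party libraries, and we cannot control the threads they start. So my advice is: in signal handlers, use lock-free structures wherever possible. Design intermediate data that isolates the complex business code from the signal handler.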