Linux kernel 3.10 source analysis: process context switching

Published: 2025/3/15


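I. Questions
When the scheduler selects a next process that is not the current process, a context switch is required.
Several aspects of the context switch are not easy to understand, for example:
1. Does a process context switch always happen in kernel mode?
2. When the original process (prev) resumes after a context switch, where does it resume from?
3. After the context switch, how does execution move to the new process, and where does the new process start executing?
5. How are the stacks switched during a context switch, and what keeps them from getting mixed up?
6. Process A is interrupted and B is scheduled; B is interrupted while running and C is scheduled; C is interrupted and D is scheduled, and so on. This looks like unbounded nesting. How does A ever run again? Surely not by returning level by level? Could this cause problems?
7. After the context switch, how does the new process get back to its user-mode code?
Once the details of the context switch are understood, all of these questions (and possibly others) can be answered.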

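II. Principles
A process context switch involves a few key points, which are exactly where the questions above come from:
1. Process scheduling always goes through the schedule function, which clearly runs in kernel mode, so a context switch always happens in kernel mode as well. Scheduling is typically triggered at these points:
    1) On return from an interrupt, exception, or system call
    2) Elsewhere, such as after a wake-up or through an explicit call to schedule
In an environment without kernel preemption (the usual case), a process (prev) can be preempted only if it was running in user mode when it was interrupted. A process running in kernel mode is switched out only when it calls schedule itself, for example when it blocks.
This may look like it contradicts "scheduling always happens in kernel mode", but it does not. The point is that prev was in user mode before it was interrupted; once interrupted (most commonly by the timer interrupt) it has entered kernel mode, and the scheduling decision and the context switch are then carried out in kernel mode.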
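The decision described in point 1 can be sketched as a small predicate. This is only a simplified model, not kernel code: the struct and function names are hypothetical, need_resched stands in for the TIF_NEED_RESCHED flag, and kernel_preemption stands in for a kernel built with CONFIG_PREEMPT. Details such as the interrupt-enable state of the interrupted code are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state examined on the return path. */
struct toy_return_state {
    bool need_resched;      /* models TIF_NEED_RESCHED being set */
    bool returning_to_user; /* the interrupted code was running in user mode */
    bool kernel_preemption; /* models a kernel built with CONFIG_PREEMPT */
    int  preempt_count;     /* non-zero means preemption is currently disabled */
};

/* Returns true if the return path would call schedule() in this model. */
bool toy_should_reschedule(const struct toy_return_state *s)
{
    assert(s->preempt_count >= 0);

    if (!s->need_resched)
        return false;           /* nobody asked for a reschedule */
    if (s->returning_to_user)
        return true;            /* always allowed on the way back to user mode */

    /* Returning to kernel code: only a preemptible kernel may switch here. */
    return s->kernel_preemption && s->preempt_count == 0;
}
```

With kernel_preemption set to false, the predicate is true only on the way back to user mode, which matches the "usual case" described above.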
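2. When a process is interrupted (for example by a hardware interrupt), its current context (EIP, CS, and the other registers) is saved on the current kernel stack (or interrupt stack). On return from the interrupt, if no scheduling takes place (the scheduling conditions are not met), the saved context is restored and the process continues from the point where it was interrupted. This is implemented in the assembly code in entry_xx.S.
3. When a process is interrupted and scheduling does take place, control eventually reaches the switch_to macro, which performs the context switch. The resume address of the outgoing process (prev) is set to "label 1" (the line right after the jump to __switch_to) and saved in task_struct.thread.ip. At the same time the ip, the stack pointer, and the rest of the context of the selected process (next) are loaded into the CPU, which is how the new process gets to run.
When prev is later scheduled again, its saved IP is "label 1", so it resumes from "label 1". See the code analysis below for details.
4. The new process (next) starts executing in one of two ways:
    1) It has been scheduled before
In this case it has already been through switch_to. When it was switched out, "label 1" in the switch_to macro was saved in its task_struct.thread.ip, so when it is scheduled again it resumes from "label 1" in switch_to, exactly as described in point 3.
    2) It was just created by fork and has never been scheduled
Such a process has not been through switch_to. During fork its thread.ip is set to ret_from_fork (see the copy_thread function), so it starts executing at ret_from_fork, in the assembly code in entry_xx.S.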
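Points 3 and 4 boil down to one saved value per task. The following toy model, which uses hypothetical names throughout and an enum in place of real code addresses, tracks only the thread.ip field to show where a task resumes in each case.

```c
#include <assert.h>
#include <stdbool.h>

/* Symbolic resume points; real kernels store code addresses here. */
enum toy_resume_point {
    TOY_RET_FROM_FORK,          /* set at fork time for ordinary processes */
    TOY_RET_FROM_KERNEL_THREAD, /* set at fork time for kernel threads */
    TOY_LABEL_1                 /* the "1:" label inside switch_to */
};

struct toy_task {
    enum toy_resume_point ip;   /* stands in for task_struct.thread.ip */
};

/* Models the part of copy_thread that chooses the first resume point. */
void toy_copy_thread(struct toy_task *child, bool is_kernel_thread)
{
    child->ip = is_kernel_thread ? TOY_RET_FROM_KERNEL_THREAD
                                 : TOY_RET_FROM_FORK;
}

/* Models switch_to: records label 1 for prev, returns where next resumes. */
enum toy_resume_point toy_switch_to(struct toy_task *prev, struct toy_task *next)
{
    assert(prev != next);
    prev->ip = TOY_LABEL_1;     /* prev will come back at label 1 */
    return next->ip;            /* next continues wherever it last stopped */
}
```

In this model a freshly forked child resumes at TOY_RET_FROM_FORK exactly once; after its first switch-out every later resume is at TOY_LABEL_1.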
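5. Stack switching is covered in detail in a separate article: kernel 3.10 source analysis: kernel stacks and stack switching.
6. After the context switch, the old context has been completely replaced by the new one, so once the new process is running nothing of the old process remains on the CPU. The new process uses its own address space, stack, and other context, and the process that was switched out no longer has any connection to the running context. There is therefore no nesting, and no problem.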
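The "no nesting" argument in point 6 can be tried out in user space. The sketch below is an analogy, not the kernel mechanism: it assumes a Linux system with glibc and uses the ucontext API (getcontext, makecontext, swapcontext) in place of switch_to, and every toy_ name is hypothetical. Three tasks A, B, and C hand the CPU to each other in a ring, each on its own stack.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define TOY_NTASKS 3
#define TOY_STACK_SIZE (64 * 1024)

static ucontext_t toy_main_ctx;             /* context of the caller of toy_run */
static ucontext_t toy_ctx[TOY_NTASKS];      /* one saved context per task */
static char toy_stack[TOY_NTASKS][TOY_STACK_SIZE];
static char toy_trace[64];                  /* records the order tasks ran in */
static int toy_current;

/* Save the running task's context and resume `next` where it left off. */
static void toy_switch(int next)
{
    int prev = toy_current;

    toy_current = next;
    swapcontext(&toy_ctx[prev], &toy_ctx[next]);
    /* prev continues here when somebody switches back to it */
}

static void toy_task_body(int id)
{
    for (int round = 0; round < 2; round++) {
        size_t n = strlen(toy_trace);

        toy_trace[n] = (char)('A' + id);
        toy_trace[n + 1] = '\0';
        if (round == 1 && id == TOY_NTASKS - 1)
            break;                          /* last task ends the demo */
        toy_switch((id + 1) % TOY_NTASKS);
    }
    /* Returning here resumes uc_link, i.e. toy_main_ctx. */
}

/* Runs the ring A -> B -> C -> A -> B -> C and returns the trace. */
const char *toy_run(void)
{
    toy_trace[0] = '\0';
    for (int i = 0; i < TOY_NTASKS; i++) {
        getcontext(&toy_ctx[i]);
        toy_ctx[i].uc_stack.ss_sp = toy_stack[i];
        toy_ctx[i].uc_stack.ss_size = sizeof(toy_stack[i]);
        toy_ctx[i].uc_link = &toy_main_ctx;
        makecontext(&toy_ctx[i], (void (*)(void))toy_task_body, 1, i);
    }
    toy_current = 0;
    swapcontext(&toy_main_ctx, &toy_ctx[0]);
    return toy_trace;
}
```

When C switches to A, A continues directly after its own swapcontext call; nothing belonging to B or C has to be unwound first, because each task's saved state lives on its own stack.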
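7. As described above, when a process is interrupted in user mode, the CPU automatically saves its EIP and CS (together with EFLAGS and the user-mode SS and ESP) on that process's kernel stack (or interrupt stack). When the new process is scheduled, its kernel stack likewise still holds the values that were pushed when it entered the kernel before being switched out. When the return path executes iret, the hardware pops EIP and CS (along with EFLAGS, ESP, and SS) from the kernel stack, switches back to the user stack, and resumes execution in user mode.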
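The frame mentioned in point 7 can be written down as a struct. This is an illustrative sketch with hypothetical names: it lists the five 32-bit values that a 32-bit x86 CPU pushes when an interrupt arrives in user mode, and uses the fact that the low two bits of the saved CS hold the privilege level of the interrupted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Values pushed by the CPU for an interrupt taken in user mode,
 * listed from the top of the kernel stack (lowest address) upward.
 */
struct toy_iret_frame {
    uint32_t eip;     /* where the interrupted code continues */
    uint32_t cs;      /* code segment selector; low 2 bits = privilege level */
    uint32_t eflags;
    uint32_t esp;     /* user stack pointer, saved only on a privilege change */
    uint32_t ss;      /* user stack segment, saved only on a privilege change */
};

/* True if the interrupted code was running at privilege level 3 (user). */
bool toy_frame_is_user(const struct toy_iret_frame *f)
{
    assert(f != 0);
    return (f->cs & 3u) == 3u;
}

/* Number of 32-bit words iret pops for this frame when run at level 0. */
unsigned toy_iret_words(const struct toy_iret_frame *f)
{
    return toy_frame_is_user(f) ? 5u : 3u;
}
```

For a frame saved from user mode, iret therefore reloads ESP and SS as well, which is the stack switch back to the user stack described above.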

III. Code analysis
The context switch is implemented mainly by the switch_to macro, which is analyzed below:


/*
 * Context switch, invoked from schedule(). The current process is switched out;
 * when it is scheduled again, it resumes right after the jump to __switch_to.
 * prev: the process being switched out
 * next: the process being switched in
 * last: once prev has been switched back in, the process that was switched out to let it run
 */
#define switch_to(prev, next, last) \
do { \
    /* \
     * Context-switching clobbers all registers, so we clobber \
     * them explicitly, via unused output variables. \
     * (EAX and EBP is not listed because EBP is saved/restored \
     * explicitly for wchan access and EAX is the return value of \
     * __switch_to()) \
     */ \
    unsigned long ebx, ecx, edx, esi, edi; \
    \
    /* Push the eflags register, then EBP, onto prev's kernel stack. */ \
    asm volatile("pushfl\n\t"               /* save flags */ \
                 "pushl %%ebp\n\t"          /* save EBP */ \
                 /* Save the current (kernel-mode) stack pointer in prev's thread.sp. */ \
                 "movl %%esp,%[prev_sp]\n\t"    /* save ESP */ \
                 /* Load next's (kernel-mode) stack pointer into ESP. */ \
                 "movl %[next_sp],%%esp\n\t"    /* restore ESP */ \
                 /* \
                  * Save the address of "label 1" in prev's thread.ip so that prev \
                  * resumes at "label 1" when it is scheduled again. \
                  */ \
                 "movl $1f,%[prev_ip]\n\t"  /* save EIP */ \
                 /* \
                  * Push next's IP onto the current stack, which is now next's stack. \
                  * This is usually the address of "label 1": next has usually been \
                  * through this path before, and the movl above saved exactly this \
                  * address when next was switched out. Read this together with the \
                  * jmp below (note: jmp, not call): when __switch_to executes ret, \
                  * this address is popped as the return address and execution \
                  * continues there, which is how the new process starts running. \
                  */ \
                 "pushl %[next_ip]\n\t"     /* restore EIP */ \
                 __switch_canary \
                 /* \
                  * Jump to __switch_to. When it returns, control goes to [next_ip] \
                  * and the new process is running. This is a jmp, not a call: jmp \
                  * does not push a return address, whereas call would. \
                  */ \
                 "jmp __switch_to\n"        /* regparm call */ \
                 /* When prev is scheduled again, it resumes here. */ \
                 "1:\t" \
                 "popl %%ebp\n\t"           /* restore EBP */ \
                 "popfl\n"                  /* restore flags */ \
                 \
                 /* output parameters */ \
                 : [prev_sp] "=m" (prev->thread.sp), \
                   [prev_ip] "=m" (prev->thread.ip), \
                   "=a" (last), \
                   \
                   /* clobbered output registers: */ \
                   "=b" (ebx), "=c" (ecx), "=d" (edx), \
                   "=S" (esi), "=D" (edi) \
                   \
                   __switch_canary_oparam \
                   \
                   /* input parameters: */ \
                 : [next_sp] "m" (next->thread.sp), \
                   [next_ip] "m" (next->thread.ip), \
                   \
                   /* regparm parameters for __switch_to(): */ \
                   /* prev goes into eax and next into edx; both are passed to __switch_to. */ \
                   [prev] "a" (prev), \
                   [next] "d" (next) \
                   \
                   __switch_canary_iparam \
                   \
                 : /* reloaded segment registers */ \
                   "memory"); \
} while (0)
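To make the stack manipulation in switch_to easier to follow, here is a toy simulation of the same sequence in plain C. It is a sketch under stated assumptions: stacks are small arrays indexed by a fake ESP, SIM_LABEL_1 is a made-up address for the "1:" label, the body of __switch_to is reduced to its final ret, and all sim_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_STACK_WORDS 16
#define SIM_LABEL_1 0x1111u       /* made-up address of the "1:" label */

struct sim_task {
    uint32_t stack[SIM_STACK_WORDS];
    uint32_t sp;  /* saved stack index, stands in for thread.sp */
    uint32_t ip;  /* saved resume address, stands in for thread.ip */
};

struct sim_cpu {
    struct sim_task *stack_owner; /* task whose stack ESP points into */
    uint32_t esp;                 /* index into stack[]; grows toward 0 */
    uint32_t ebp, eflags, eip;
};

static void sim_push(struct sim_cpu *c, uint32_t v)
{
    assert(c->esp > 0);
    c->stack_owner->stack[--c->esp] = v;
}

static uint32_t sim_pop(struct sim_cpu *c)
{
    assert(c->esp < SIM_STACK_WORDS);
    return c->stack_owner->stack[c->esp++];
}

/* The instructions before "1:", executed while prev is still current. */
void sim_switch_out(struct sim_cpu *c, struct sim_task *prev, struct sim_task *next)
{
    sim_push(c, c->eflags);       /* pushfl */
    sim_push(c, c->ebp);          /* pushl %ebp */
    prev->sp = c->esp;            /* movl %esp, prev->thread.sp */
    c->stack_owner = next;        /* movl next->thread.sp, %esp */
    c->esp = next->sp;
    prev->ip = SIM_LABEL_1;       /* movl $1f, prev->thread.ip */
    sim_push(c, next->ip);        /* pushl next->thread.ip */
    c->eip = sim_pop(c);          /* jmp __switch_to, then its ret */
}

/* The instructions after "1:", executed by a task that resumes there. */
void sim_resume_at_label_1(struct sim_cpu *c)
{
    c->ebp = sim_pop(c);          /* popl %ebp */
    c->eflags = sim_pop(c);       /* popfl */
}
```

Switching from a task A to a fresh task B and back again leaves A with the EBP and eflags values it pushed before it was switched out, because those values stayed on A's own stack the whole time.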

The __switch_to function is implemented as follows:


/* The arguments are passed in from the switch_to macro in registers eax and edx. */
__notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    /* Get the thread context of prev and next. */
    struct thread_struct *prev = &prev_p->thread,
                         *next = &next_p->thread;
    /* Get the current CPU. */
    int cpu = smp_processor_id();
    /* Get the tss_struct that backs the current CPU's TSS. */
    struct tss_struct *tss = &per_cpu(init_tss, cpu);
    fpu_switch_t fpu;

    /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

    fpu = switch_fpu_prepare(prev_p, next_p, cpu);

    /*
     * Reload esp0.
     * In Linux the TSS is per CPU, not per process: each CPU has one tss_struct.
     * A context switch therefore has to update the contents of the current CPU's
     * TSS, which in practice means the kernel stack pointer. Only after this
     * update can the CPU hardware read the correct kernel stack pointer (sp0)
     * from the TSS when the new process next enters kernel mode, so that the
     * stack switch on a user-to-kernel transition works correctly.
     */
    /* Write next's kernel stack pointer (next->sp0) into the current CPU's TSS. */
    load_sp0(tss, next);

    /*
     * Save away %gs. No need to save %fs, as it was saved on the
     * stack on entry. No need to save %es and %ds, as those are
     * always kernel segments while inside the kernel. Doing this
     * before setting the new TLS descriptors avoids the situation
     * where we temporarily have non-reloadable segments in %fs
     * and %gs. This could be an issue if the NMI handler ever
     * used %fs or %gs (it does not today), or if the kernel is
     * running inside of a hypervisor layer.
     */
    lazy_save_gs(prev->gs);

    /*
     * Load the per-thread Thread-Local Storage descriptor.
     * This loads the thread-local storage (TLS) segments used by next_p
     * into the local CPU's global descriptor table.
     */
    load_TLS(next, cpu);

    /*
     * Restore IOPL if needed. In normal use, the flags restore
     * in the switch assembly will handle this. But if the kernel
     * is running virtualized at a non-zero CPL, the popf will
     * not restore flags, so it must be done in a separate step.
     */
    if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
        set_iopl_mask(next->iopl);

    /*
     * Now maybe handle debug registers and/or IO bitmaps
     */
    if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
                 task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
        __switch_to_xtra(prev_p, next_p, tss);

    /*
     * Leave lazy mode, flushing any hypercalls made here.
     * This must be done before restoring TLS segments so
     * the GDT and LDT are properly updated, and must be
     * done before math_state_restore, so the TS bit is up
     * to date.
     */
    /* Architecture-specific hook, used by paravirtualization. */
    arch_end_context_switch(next_p);

    /*
     * Restore %gs if needed (which is common)
     */
    if (prev->gs | next->gs)
        lazy_load_gs(next->gs);

    switch_fpu_finish(next_p, fpu);

    /* Update the per-CPU variable current_task to point to next. */
    this_cpu_write(current_task, next_p);

    /*
     * Read this carefully: where does this return to?
     * In switch_to, next_ip was pushed onto the current stack before the jmp to
     * __switch_to, so this return pops next_ip and continues there. next_ip is
     * normally the address of "label 1" saved by switch_to, so execution normally
     * continues at label 1, right after the jump to __switch_to in the macro.
     * The exception is a process that has never been switched out and is running
     * for the first time (just created by fork). Its next_ip was not saved by
     * switch_to, so it does not return into switch_to. It jumps to the start of
     * ret_from_fork instead, because fork set its thread.ip to ret_from_fork
     * (see copy_thread).
     */
    return prev_p;
}
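Two bookkeeping steps in __switch_to are easy to overlook: updating sp0 in the per-CPU TSS and updating the per-CPU current_task pointer. The sketch below models only those two steps with plain arrays indexed by CPU number. All demo_ names are hypothetical, and the real per_cpu machinery is replaced by a fixed-size array.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 2

struct demo_task {
    uintptr_t sp0;   /* top of this task's kernel stack (thread.sp0) */
};

struct demo_tss {
    uintptr_t sp0;   /* stack the CPU loads on a user-to-kernel transition */
};

static struct demo_tss demo_init_tss[DEMO_NR_CPUS];        /* one TSS per CPU */
static struct demo_task *demo_current_task[DEMO_NR_CPUS];  /* per-CPU current */

/* Models the two bookkeeping steps; returns prev, as __switch_to does. */
struct demo_task *demo_switch_to(int cpu, struct demo_task *prev,
                                 struct demo_task *next)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    demo_init_tss[cpu].sp0 = next->sp0;  /* like load_sp0(tss, next) */
    demo_current_task[cpu] = next;       /* like this_cpu_write(current_task, next_p) */
    return prev;                         /* ends up in "last" in switch_to */
}

/* Stack that would be used for the next entry from user mode on `cpu`. */
uintptr_t demo_kernel_entry_stack(int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return demo_init_tss[cpu].sp0;
}

/* Task that the model considers current on `cpu`. */
struct demo_task *demo_current(int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return demo_current_task[cpu];
}
```

Because the TSS belongs to the CPU and not to the task, a switch on CPU 0 leaves CPU 1's entry stack untouched, and the next interrupt from user mode on CPU 0 lands on the new task's kernel stack.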

Where thread.ip is set for a process that is about to run for the first time:


do_fork->copy_process->copy_thread

int copy_thread(unsigned long clone_flags, unsigned long sp,
                unsigned long arg, struct task_struct *p)
{
    struct pt_regs *childregs = task_pt_regs(p);
    struct task_struct *tsk;
    int err;

    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);

    /* Kernel threads are handled separately; their context is filled in by hand. */
    if (unlikely(p->flags & PF_KTHREAD)) {
        /* kernel thread */
        memset(childregs, 0, sizeof(struct pt_regs));
        p->thread.ip = (unsigned long) ret_from_kernel_thread;
        task_user_gs(p) = __KERNEL_STACK_CANARY;
        childregs->ds = __USER_DS;
        childregs->es = __USER_DS;
        childregs->fs = __KERNEL_PERCPU;
        childregs->bx = sp;    /* function */
        childregs->bp = arg;
        childregs->orig_ax = -1;
        childregs->cs = __KERNEL_CS | get_kernel_rpl();
        childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_BIT1;
        p->fpu_counter = 0;
        p->thread.io_bitmap_ptr = NULL;
        memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
        return 0;
    }

    /*
     * Copy the register context of the current (parent) process to the child,
     * so the child starts with the same register context as the parent.
     */
    *childregs = *current_pt_regs();
    /* The child sees 0 as the return value of fork. */
    childregs->ax = 0;
    if (sp)
        childregs->sp = sp;

    /*
     * The child's IP points to ret_from_fork. Every process created by fork goes
     * through this: at its first context switch it "returns" into ret_from_fork
     * (assembly code in entry_32.S). This differs from an ordinary context
     * switch, where the IP was saved by switch_to at the previous switch.
     */
    p->thread.ip = (unsigned long) ret_from_fork;
    task_user_gs(p) = get_user_gs(current_pt_regs());

    p->fpu_counter = 0;
    p->thread.io_bitmap_ptr = NULL;
    tsk = current;
    err = -ENOMEM;

    memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));

    if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
        p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
                                          IO_BITMAP_BYTES, GFP_KERNEL);
        if (!p->thread.io_bitmap_ptr) {
            p->thread.io_bitmap_max = 0;
            return -ENOMEM;
        }
        set_tsk_thread_flag(p, TIF_IO_BITMAP);
    }

    err = 0;

    /*
     * Set a new TLS for the child thread?
     */
    if (clone_flags & CLONE_SETTLS)
        err = do_set_thread_area(p, -1,
                                 (struct user_desc __user *)childregs->si, 0);

    if (err && p->thread.io_bitmap_ptr) {
        kfree(p->thread.io_bitmap_ptr);
        p->thread.io_bitmap_max = 0;
    }
    return err;
}
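The relationship between thread.sp, thread.sp0, and the saved registers set up by copy_thread can be illustrated with a user-space layout sketch. It rests on several assumptions: the stack size is an arbitrary 8192 bytes, layout_pt_regs is a cut-down stand-in for struct pt_regs, the small padding the real 32-bit kernel leaves above pt_regs is ignored, and all layout_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LAYOUT_THREAD_SIZE 8192   /* assumed kernel stack size for this sketch */

/* Cut-down stand-in for struct pt_regs; the real one has more fields. */
struct layout_pt_regs {
    unsigned long ax, sp, ip;
};

struct layout_task {
    uintptr_t sp, sp0, ip;        /* stand in for thread.sp, thread.sp0, thread.ip */
    unsigned long stack[LAYOUT_THREAD_SIZE / sizeof(unsigned long)];
};

/* The saved user registers sit at the very top of the kernel stack. */
struct layout_pt_regs *layout_task_pt_regs(struct layout_task *t)
{
    unsigned char *top = (unsigned char *)t->stack + sizeof(t->stack);

    return (struct layout_pt_regs *)top - 1;
}

/* Models the user-process branch of copy_thread. */
void layout_copy_thread(struct layout_task *child,
                        const struct layout_pt_regs *parent_regs,
                        unsigned long new_sp, uintptr_t ret_from_fork_addr)
{
    struct layout_pt_regs *childregs = layout_task_pt_regs(child);

    child->sp = (uintptr_t)childregs;         /* kernel ESP loaded by switch_to */
    child->sp0 = (uintptr_t)(childregs + 1);  /* stack top, later copied to the TSS */
    *childregs = *parent_regs;                /* child inherits the parent's registers */
    childregs->ax = 0;                        /* fork returns 0 in the child */
    if (new_sp)
        childregs->sp = new_sp;               /* caller supplied a new user stack */
    child->ip = ret_from_fork_addr;           /* first resume point */
}
```

In this layout sp0 minus sp is exactly the size of the saved register block, so when the child first runs, its kernel stack contains nothing but the register frame that the return path will restore.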

Original article: http://blog.chinaunix.net/uid-14528823-id-4740294.html
