mm_struct
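Memory management in Linux touches a great many areas. This article starts with how the kernel manages a process's virtual address space. (The discussion is based on the 2.6.32.60 source.)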
To the kernel, a kernel thread and a user process are both just instances of the task_struct data structure. task_struct is called the process descriptor because it records the entire context of the process. One of the things it refers to is the "memory descriptor", mm_struct, which abstracts and describes everything Linux needs in order to manage the process's address space.
mm_struct is defined in include/linux/mm_types.h. Its fields describe the address space of the process, as shown below. Note that the listing reproduced here comes from a newer kernel than 2.6.32: fields such as vmacache_seqnum, nr_pmds and user_ns do not exist in 2.6.32. The fields discussed in the rest of the article are present in both.

    struct mm_struct {
        struct vm_area_struct *mmap;            /* list of VMAs */
        struct rb_root mm_rb;                   /* root of the red-black tree of VMAs */
        u32 vmacache_seqnum;                    /* per-thread vmacache */
    #ifdef CONFIG_MMU
        unsigned long (*get_unmapped_area) (struct file *filp,
                    unsigned long addr, unsigned long len,
                    unsigned long pgoff, unsigned long flags);
    #endif
        unsigned long mmap_base;                /* base of mmap area */
        unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
        unsigned long task_size;                /* size of task vm space */
        unsigned long highest_vm_end;           /* highest vma end address */
        pgd_t * pgd;                            /* points to the process's page directory */
        atomic_t mm_users;                      /* How many users with user space? */
        atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
        atomic_long_t nr_ptes;                  /* PTE page table pages */
    #if CONFIG_PGTABLE_LEVELS > 2
        atomic_long_t nr_pmds;                  /* PMD page table pages */
    #endif
        int map_count;                          /* number of VMAs */
        spinlock_t page_table_lock;             /* Protects page tables and some counters */
        struct rw_semaphore mmap_sem;
        struct list_head mmlist;                /* List of maybe swapped mm's. These are globally strung
                                                 * together off init_mm.mmlist, and are protected
                                                 * by mmlist_lock
                                                 */
        unsigned long hiwater_rss;              /* High-watermark of RSS usage */
        unsigned long hiwater_vm;               /* High-water virtual memory usage */
        unsigned long total_vm;                 /* Total pages mapped */
        unsigned long locked_vm;                /* Pages that have PG_mlocked set */
        unsigned long pinned_vm;                /* Refcount permanently increased */
        unsigned long data_vm;                  /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
        unsigned long exec_vm;                  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
        unsigned long stack_vm;                 /* VM_STACK */
        unsigned long def_flags;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long saved_auxv[AT_VECTOR_SIZE];   /* for /proc/PID/auxv */
        /*
         * Special counters, in some configurations protected by the
         * page_table_lock, in other configurations by being atomic.
         */
        struct mm_rss_stat rss_stat;
        struct linux_binfmt *binfmt;
        cpumask_var_t cpu_vm_mask_var;
        /* Architecture-specific MM context */
        mm_context_t context;
        unsigned long flags;                    /* Must use atomic bitops to access the bits */
        struct core_state *core_state;          /* coredumping support */
    #ifdef CONFIG_AIO
        spinlock_t ioctx_lock;
        struct kioctx_table __rcu *ioctx_table;
    #endif
    #ifdef CONFIG_MEMCG
        /*
         * "owner" points to a task that is regarded as the canonical
         * user/owner of this mm. All of the following must be true in
         * order for it to be changed:
         *
         * current == mm->owner
         * current->mm != mm
         * new_owner->mm == mm
         * new_owner->alloc_lock is held
         */
        struct task_struct __rcu *owner;
    #endif
        struct user_namespace *user_ns;
        /* store ref to file /proc/<pid>/exe symlink points to */
        struct file __rcu *exe_file;
    #ifdef CONFIG_MMU_NOTIFIER
        struct mmu_notifier_mm *mmu_notifier_mm;
    #endif
    #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
        pgtable_t pmd_huge_pte;                 /* protected by page_table_lock */
    #endif
    #ifdef CONFIG_CPUMASK_OFFSTACK
        struct cpumask cpumask_allocation;
    #endif
    #ifdef CONFIG_NUMA_BALANCING
        /*
         * numa_next_scan is the next time that the PTEs will be marked
         * pte_numa. NUMA hinting faults will gather statistics and migrate
         * pages to new nodes if necessary.
         */
        unsigned long numa_next_scan;
        /* Restart point for scanning and setting pte_numa */
        unsigned long numa_scan_offset;
        /* numa_scan_seq prevents two threads setting pte_numa */
        int numa_scan_seq;
    #endif
    #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
        /*
         * An operation with batched TLB flushing is going on. Anything that
         * can move process memory needs to flush the TLB when moving a
         * PROT_NONE or PROT_NUMA mapped page.
         */
        bool tlb_flush_pending;
    #endif
        struct uprobes_state uprobes_state;
    #ifdef CONFIG_HUGETLB_PAGE
        atomic_long_t hugetlb_usage;
    #endif
        struct work_struct async_put_work;
    };
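Every process has its own mm_struct, which gives it an abstract, flat, private 32-bit or 64-bit address space. Different processes can keep different data at the same address in their own address spaces without interfering with one another. Tasks that share one address space are called threads.
[start_code, end_code) is the address range of the code segment.
[start_data, end_data) is the address range of the data segment.
[start_brk, brk) is the heap: start_brk is where it begins and brk is the current top of the heap.
start_stack is where the stack segment begins. mm_struct has no matching end_stack field.
mm->pgd points to the memory that holds the page tables. Each process has its own mm and each mm has its own page tables, so the page tables are switched when the scheduler switches processes. (A CPU register usually holds the address of the page tables, CR3 on x86 for example, and switching page tables means changing the value of that register.) This is why the address spaces of different processes do not affect each other: their page tables differ, so one process cannot reach into another's address space. Shared memory is the exception, where different page tables are deliberately made to map the same physical addresses.
When the scheduler picks a task to run, it converts this address (mm->pgd, a kernel virtual address) into a physical address and writes it into the control register (CR3).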
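The half-open ranges described above can be illustrated with a small user-space sketch. It is only a model: struct mm_layout, classify_addr() and example_layout() are hypothetical names, and the addresses are made up for illustration. It shows how an address is tested against [start, end) pairs, and why the stack cannot be classified the same way from these fields alone.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the layout fields of mm_struct; not the kernel type. */
struct mm_layout {
    unsigned long start_code, end_code;
    unsigned long start_data, end_data;
    unsigned long start_brk, brk;
    unsigned long start_stack;
};

/* Half-open interval test: start <= addr < end. */
static int in_range(unsigned long addr, unsigned long start, unsigned long end)
{
    return addr >= start && addr < end;
}

/* Return a static label for the region that contains addr. */
const char *classify_addr(const struct mm_layout *mm, unsigned long addr)
{
    assert(mm != NULL);
    if (in_range(addr, mm->start_code, mm->end_code))
        return "code";
    if (in_range(addr, mm->start_data, mm->end_data))
        return "data";
    if (in_range(addr, mm->start_brk, mm->brk))
        return "heap";
    /* There is no end_stack field, so the stack's extent is not known here. */
    return "other";
}

/* Made-up example addresses, chosen only for illustration. */
struct mm_layout example_layout(void)
{
    struct mm_layout mm = {
        .start_code  = 0x400000UL, .end_code = 0x401000UL,
        .start_data  = 0x601000UL, .end_data = 0x602000UL,
        .start_brk   = 0x603000UL, .brk      = 0x605000UL,
        .start_stack = 0xbffff000UL,
    };
    return mm;
}
```

With the example layout, 0x400000 is classified as "code" while 0x401000 is not, because the end of each range is exclusive.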
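mmap_base is the start address of the memory mapping segment. Why, then, does the mmap segment have no end address?
What is the bss segment for? It holds all uninitialized global variables. These only need to be mapped anonymously to the "zero page" instead of being mapped explicitly from the file on disk when the program is loaded, which both shrinks the ELF binary and speeds up program loading. Why does mm_struct contain no fields that describe the bss segment?
Beyond these, mm_struct defines several other important fields: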
    atomic_t mm_users;  /* How many users with user space? */
    atomic_t mm_count;  /* How many references to "struct mm_struct" (users count as 1) */

At first glance the two counters look much the same. How does Linux use them differently? The code is the best explanation:

    static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
    {
        struct mm_struct * mm, *oldmm;
        int retval;

        tsk->mm = NULL;
        tsk->active_mm = NULL;

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
            return 0;

        if (clone_flags & CLONE_VM) {
            atomic_inc(&oldmm->mm_users);
            mm = oldmm;
            goto good_mm;
        }
        /* ... */
    }
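Whether we call fork, vfork or clone, the call ends up in do_fork. The difference is that vfork, and clone when the caller asks for it (as thread libraries do), pass the CLONE_VM flag down to copy_mm. The flag means that parent and child run in the same virtual address space (Linux calls such a task a lightweight process, or thread), and therefore also share the same physical memory (page frames).
In copy_mm, if CLONE_VM is set, parent and child share the address space and the same memory descriptor, and all that has to be done is to increment mm_users. In other words, mm_users is the number of threads currently using the address space; it is a thread-level counter.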
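The effect of CLONE_VM can be observed from user space. The following is a minimal sketch, assuming Linux with glibc's clone() wrapper; value_seen_after_clone() and child_fn() are hypothetical helper names. A child writes 42 into a global variable, and the parent then checks whether it can see the write. With CLONE_VM the two tasks share one address space, so the parent sees 42; without it the child writes to its own copy.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the clone() declaration in glibc */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_value;

/* Runs in the child: store the requested value into the global. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;
}

/*
 * Create a child with or without CLONE_VM, let it write 42, wait for it,
 * and return the value the parent observes (-1 on error).
 */
int value_seen_after_clone(int share_vm)
{
    const size_t stack_size = 64 * 1024;
    int new_value = 42;
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);
    char *stack = malloc(stack_size);
    pid_t pid;

    if (stack == NULL)
        return -1;

    shared_value = 0;
    /* The stack grows down on common architectures, so pass its top. */
    pid = clone(child_fn, stack + stack_size, flags, &new_value);
    if (pid < 0 || waitpid(pid, NULL, 0) < 0) {
        free(stack);
        return -1;
    }
    free(stack);
    return shared_value;
}
```

Calling value_seen_after_clone(1) exercises the CLONE_VM branch of copy_mm quoted above, while value_seen_after_clone(0) takes the path on which the child gets its own mm.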
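What about mm_count? It takes a little more effort to understand.
To Linux, user processes and kernel threads are both instances of task_struct. The one difference is that a kernel thread has no process address space and no memory descriptor, so its tsk->mm field is NULL. During a context switch the scheduler uses tsk->mm to tell whether the task about to run is a user process or a kernel thread. Although a kernel thread never touches a user address space, it still needs page tables in order to access the kernel's own space. Fortunately, the kernel part of the address space is identical for every user process, so the kernel can "borrow" the page tables of the mm that belonged to the previously scheduled user process in order to access kernel addresses. The borrowed mm is recorded in active_mm.
In short: for a kernel thread, tsk->mm == NULL marks it as a kernel thread, and tsk->active_mm is the mm borrowed from the previous user process, whose page tables are used to access kernel space. For a user process, tsk->mm == tsk->active_mm.
To support this arrangement, mm_struct has a second counter, mm_count. As described above, mm_users is the number of threads that share the address space. mm_count is the number of kernel-thread references to the mm, plus one for all of the users together.
For example, if process A has three threads, A's mm_struct has mm_users == 3 but mm_count == 1, so mm_count is a process-level counter. What is the point of keeping two counters? Consider this scenario. After running A, the kernel switches to kernel thread B, and B borrows A's memory descriptor to access kernel space, so mm_count becomes 2. Meanwhile another CPU core runs A, and A exits. Now mm_users drops to 0 and mm_count drops to 1. When mm_users reaches 0 the user mappings are torn down (mmput), but the kernel does not destroy the mm_struct itself at that point. It frees the mm_struct only when mm_count reaches 0 (mmdrop), because only then is the address space used by no user process and referenced by no kernel thread.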
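The scenario above can be traced with a toy model of the two counters. This is a sketch under stated assumptions, not kernel code: struct model_mm and the model_* functions are hypothetical, plain ints replace atomic_t, and the two boolean flags only record when the user mappings would be torn down and when the descriptor itself would be freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two reference counters in mm_struct. */
struct model_mm {
    int  mm_users;            /* threads sharing the address space */
    int  mm_count;            /* kernel-thread references + 1 for all users */
    bool mappings_torn_down;  /* set when the last user is gone */
    bool freed;               /* set when the descriptor itself is released */
};

void model_mm_init(struct model_mm *mm)
{
    mm->mm_users = 1;
    mm->mm_count = 1;
    mm->mappings_torn_down = false;
    mm->freed = false;
}

/* A new thread created with CLONE_VM shares the mm. */
void model_add_user(struct model_mm *mm)
{
    assert(mm->mm_users > 0);
    mm->mm_users++;
}

/* A kernel thread borrows the mm as its active_mm. */
void model_borrow(struct model_mm *mm)
{
    assert(mm->mm_count > 0);
    mm->mm_count++;
}

/* Drop one reference to the descriptor; free it on the last one. */
void model_mmdrop(struct model_mm *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->freed = true;
}

/* A user goes away; the last user also gives up the users' shared reference. */
void model_mmput(struct model_mm *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0) {
        mm->mappings_torn_down = true;
        model_mmdrop(mm);
    }
}
```

Replaying the example: after model_mm_init() and two model_add_user() calls the counters are 3 and 1; model_borrow() makes mm_count 2; three model_mmput() calls leave mm_users at 0 and mm_count at 1 with the descriptor still alive; the final model_mmdrop() by the kernel thread frees it.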
    static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
    {
        atomic_set(&mm->mm_users, 1);
        atomic_set(&mm->mm_count, 1);
        /* ... */
    }

When an mm instance is initialized, both mm_users and mm_count are set to 1.

    /*
     * context_switch - switch to the new MM and the new
     * thread's register state.
     */
    static inline void
    context_switch(struct rq *rq, struct task_struct *prev,
                   struct task_struct *next)
    {
        struct mm_struct *mm, *oldmm;

        prepare_task_switch(rq, prev, next);
        trace_sched_switch(rq, prev, next);
        mm = next->mm;
        oldmm = prev->active_mm;

        if (unlikely(!mm)) {
            next->active_mm = oldmm;
            atomic_inc(&oldmm->mm_count);
            enter_lazy_tlb(oldmm, next);
        } else
            switch_mm(oldmm, mm, next);
        /* ... */
    }

The code above is a small part of the context switch performed by the Linux scheduler. Starting at unlikely(!mm), next->active_mm = oldmm means that if the task being switched to is a kernel thread, it borrows the memory descriptor of the previous user process and stores it in active_mm. The key point is that mm_count of the borrowed descriptor is incremented.
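The borrowing step can be modelled in a few lines of user-space C. This is a simplified sketch: struct sketch_mm, struct sketch_task and sketch_context_switch() are hypothetical names, a plain int replaces atomic_t, and releasing the borrowed reference when the kernel thread is switched out again lies outside the quoted excerpt and is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced versions of mm_struct and task_struct. */
struct sketch_mm {
    int mm_count;
};

struct sketch_task {
    struct sketch_mm *mm;         /* NULL for a kernel thread */
    struct sketch_mm *active_mm;  /* mm whose page tables are actually in use */
};

/*
 * Model of the mm handling in context_switch(): returns the mm whose
 * page tables are in use once 'next' is running.
 */
struct sketch_mm *sketch_context_switch(struct sketch_task *prev,
                                        struct sketch_task *next)
{
    struct sketch_mm *oldmm = prev->active_mm;

    assert(oldmm != NULL);
    if (next->mm == NULL) {
        /* Kernel thread: borrow the previous mm and take a reference. */
        next->active_mm = oldmm;
        oldmm->mm_count++;
        return oldmm;
    }
    /* User task: stands in for switch_mm(), its own page tables get loaded. */
    return next->mm;
}
```

Switching from a user task to a kernel thread leaves the kernel thread with mm == NULL, active_mm pointing at the user task's mm, and that mm's mm_count raised by one.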
Next, let us look at how mm_struct is handled when a process is forked.

    /*
     *  Ok, this is the main fork-routine.
     *
     * It copies the process, and if successful kick-starts
     * it and waits for it to finish using the VM if required.
     */
    long do_fork(unsigned long clone_flags,
                 unsigned long stack_start,
                 struct pt_regs *regs,
                 unsigned long stack_size,
                 int __user *parent_tidptr,
                 int __user *child_tidptr)
    {
        /* ... */
        p = copy_process(clone_flags, stack_start, regs, stack_size,
                         child_tidptr, NULL, trace);
        /* ... */
    }

do_fork calls copy_process.

    /*
     * This creates a new process as a copy of the old one,
     * but does not actually start it yet.
     *
     * It copies the registers, and all the appropriate
     * parts of the process environment (as per the clone
     * flags). The actual kick-off is left to the caller.
     */
    static struct task_struct *copy_process(unsigned long clone_flags,
                                            unsigned long stack_start,
                                            struct pt_regs *regs,
                                            unsigned long stack_size,
                                            int __user *child_tidptr,
                                            struct pid *pid,
                                            int trace)
    {
        /* ... */
        if ((retval = copy_mm(clone_flags, p)))
            goto bad_fork_cleanup_signal;
        /* ... */
    }

copy_process calls copy_mm, which we analyze next.

    static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
    {
        struct mm_struct * mm, *oldmm;
        int retval;

        tsk->min_flt = tsk->maj_flt = 0;
        tsk->nvcsw = tsk->nivcsw = 0;
    #ifdef CONFIG_DETECT_HUNG_TASK
        tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
    #endif

        tsk->mm = NULL;
        tsk->active_mm = NULL;

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
            return 0;

        if (clone_flags & CLONE_VM) {
            atomic_inc(&oldmm->mm_users);
            mm = oldmm;
            goto good_mm;
        }

        retval = -ENOMEM;
        mm = dup_mm(tsk);
        if (!mm)
            goto fail_nomem;

    good_mm:
        /* Initializing for Swap token stuff */
        mm->token_priority = 0;
        mm->last_interval = 0;

        tsk->mm = mm;
        tsk->active_mm = mm;
        return 0;

    fail_nomem:
        return retval;
    }
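First, the mm and active_mm of the child process or thread are initialized to NULL.
Then, as described above, if a thread is being created, the new thread shares the creator's mm and none of the copying below is needed. The important part is dup_mm(tsk).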
        /*
    622  * Allocate a new mm structure and copy contents from the
    623  * mm structure of the passed in task structure.
    624  */
    625 struct mm_struct *dup_mm(struct task_struct *tsk)
    626 {
    627     struct mm_struct *mm, *oldmm = current->mm;
    628     int err;
    629
    630     if (!oldmm)
    631         return NULL;
    632
    633     mm = allocate_mm();
    634     if (!mm)
    635         goto fail_nomem;
    636
    637     memcpy(mm, oldmm, sizeof(*mm));
    638
    639     /* Initializing for Swap token stuff */
    640     mm->token_priority = 0;
    641     mm->last_interval = 0;
    642
    643     if (!mm_init(mm, tsk))
    644         goto fail_nomem;
    645
    646     if (init_new_context(tsk, mm))
    647         goto fail_nocontext;
    648
    649     dup_mm_exe_file(oldmm, mm);
    650
    651     err = dup_mmap(mm, oldmm);
    652     if (err)
    653         goto free_pt;
    654
    655     mm->hiwater_rss = get_mm_rss(mm);
    656     mm->hiwater_vm = mm->total_vm;
    657
    658     if (mm->binfmt && !try_module_get(mm->binfmt->module))
    659         goto free_pt;
    660
    661     return mm;

Line 633 allocates the mm_struct object from the slab allocator.
Line 637 copies the parent's mm_struct into the child's, so that every field of the child's mm starts out equal to the corresponding field of the parent's mm.
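The main purpose of the copy_mm implementation is to provide Unix copy-on-write (COW) semantics. In principle, only the start_x and end_x style fields of parent and child (such as start_data and end_data) need to be equal, while the remaining fields (such as mm_users) have to be re-initialized. That re-initialization is done mostly in mm_init.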
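The "copy everything, then re-initialize the bookkeeping" pattern can be sketched in user space as follows. The type struct toy_mm and the functions toy_dup_mm() and toy_copy_mm() are hypothetical, the struct keeps only a handful of fields, and malloc() stands in for the slab allocation; the real functions do far more than this.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of mm_struct: a few layout fields plus the counters. */
struct toy_mm {
    unsigned long start_code, end_code, start_data, end_data;
    int mm_users, mm_count;
};

/* Copy the parent's descriptor wholesale, then reset the per-mm counters. */
struct toy_mm *toy_dup_mm(const struct toy_mm *oldmm)
{
    struct toy_mm *mm;

    if (oldmm == NULL)
        return NULL;
    mm = malloc(sizeof(*mm));          /* stands in for allocate_mm() */
    if (mm == NULL)
        return NULL;
    memcpy(mm, oldmm, sizeof(*mm));    /* layout fields now match the parent */
    mm->mm_users = 1;                  /* stands in for mm_init() */
    mm->mm_count = 1;
    return mm;
}

/* Decide between sharing the parent's mm and duplicating it. */
struct toy_mm *toy_copy_mm(struct toy_mm *oldmm, int share_vm)
{
    if (oldmm == NULL)
        return NULL;                   /* parent is a kernel thread: no mm */
    if (share_vm) {
        oldmm->mm_users++;             /* thread: share and count one more user */
        return oldmm;
    }
    return toy_dup_mm(oldmm);          /* process: private copy */
}
```

A caller that passes share_vm != 0 gets the parent's own descriptor back with mm_users incremented; otherwise it gets a freshly allocated copy whose layout fields equal the parent's and whose counters both start at 1, and which it must free().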
        static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
    450 {
    451     atomic_set(&mm->mm_users, 1);
    452     atomic_set(&mm->mm_count, 1);
    453     init_rwsem(&mm->mmap_sem);
    454     INIT_LIST_HEAD(&mm->mmlist);
    455     mm->flags = (current->mm) ?
    456         (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
    457     mm->core_state = NULL;
    458     mm->nr_ptes = 0;
    459     set_mm_counter(mm, file_rss, 0);
    460     set_mm_counter(mm, anon_rss, 0);
    461     spin_lock_init(&mm->page_table_lock);
    462     mm->free_area_cache = TASK_UNMAPPED_BASE;
    463     mm->cached_hole_size = ~0UL;
    464     mm_init_aio(mm);
    465     mm_init_owner(mm, p);
    466
    467     if (likely(!mm_alloc_pgd(mm))) {
    468         mm->def_flags = 0;
    469         mmu_notifier_mm_init(mm);
    470         return mm;
    471     }
    472
    473     free_mm(mm);
    474     return NULL;
    475 }

Pay particular attention to mm_alloc_pgd at lines 467 - 471. It allocates a new page global directory for the child's mm. The page table entries that describe the parent's user mappings are copied afterwards by dup_mmap(), which dup_mm calls at line 651. The page tables are what translate virtual addresses into physical addresses.
The result of the copy is that parent and child have independent page tables whose entries hold the same values. In other words, the same virtual address in the two separate address spaces maps to the same physical address. Writable private pages are write-protected in both sets of page tables, so the first write by either side faults and receives its own copy of the page. This is how copy-on-write (COW) semantics between parent and child are implemented.
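A toy simulation can make the copy-on-write idea concrete. It is a sketch with invented names (struct toy_pte, struct toy_space, toy_fork_tables(), toy_write()) and a one-level "page table" of four entries. As a simplification, a write to a write-protected page always moves the writer to a new frame; the frame numbers are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGES 4

/* Hypothetical one-level page table entry: a frame number and a write bit. */
struct toy_pte {
    int  frame;
    bool writable;
};

struct toy_space {
    struct toy_pte pte[TOY_PAGES];
};

/* Hypothetical frame allocator: just hands out increasing frame numbers. */
static int next_free_frame = 100;

/* "fork": the child gets identical entries and both sides lose write access. */
void toy_fork_tables(struct toy_space *parent, struct toy_space *child)
{
    for (int i = 0; i < TOY_PAGES; i++) {
        parent->pte[i].writable = false;
        child->pte[i] = parent->pte[i];
    }
}

/*
 * Write to a page. A write-protected page "faults": the writer is moved to
 * a private frame and regains write access. Returns the frame written to.
 */
int toy_write(struct toy_space *as, int page)
{
    assert(page >= 0 && page < TOY_PAGES);
    if (!as->pte[page].writable) {
        as->pte[page].frame = next_free_frame++;
        as->pte[page].writable = true;
    }
    return as->pte[page].frame;
}
```

Right after toy_fork_tables() every page of the child maps the same frame as the parent. After the child writes to one page, only that page of the child points to a different frame, while the parent's entry and all untouched pages stay shared.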
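In fact, the biggest cost that vfork saves compared with fork is precisely this copying of the page tables.
In the 2.6 kernels, copying the page tables makes fork noticeably more expensive, which is why the kernel community has discussed implementing shared page tables (http://lwn.NET/Articles/149888/).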