Did You Know D-State Processes Can Cause Trouble on Linux?

Published: 2023/12/20


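Preface

This article was written by an expert who hit a problem in a real project, analyzed it, and summarized the findings. For newcomers, reading this kind of write-up should be very rewarding. The author currently works at Meizu, intends to stay there, and is a Linux expert. "Meizu has another Linux guru as well; those who know, know." You may have noticed that I recently removed the small ads at the bottom of my articles, because I want reading here to feel like home. If you like this author's article, please show some support in whatever way you prefer. Enough bragging; on to the main text.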

Description

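In one project, dozens to more than a hundred D-state processes were occasionally found stuck in shrink_inactive_list, causing jank, freezes, Android SWT, and similar problems. Three rounds of fixes were submitted over time, and the problem was still not completely resolved.

Seemingly a Dead End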

LOG:

    [149459.897408] [3:2065:watchdog] Binder:1042_16 D 0 9917 635 0x00000008
    [149459.897427] [3:2065:watchdog] Call trace:
    [149459.897435] [3:2065:watchdog] [<ffffff8bf28852d4>] __switch_to+0xb4/0xc0
    [149459.897452] [3:2065:watchdog] [<ffffff8bf3a1f6a0>] __schedule+0x7f0/0xad0
    [149459.897468] [3:2065:watchdog] [<ffffff8bf3a1f9f0>] schedule+0x70/0x90
    [149459.897485] [3:2065:watchdog] [<ffffff8bf3a23b00>] schedule_timeout+0x548/0x668
    [149459.897502] [3:2065:watchdog] [<ffffff8bf2959028>] msleep+0x28/0x38
    [149459.897517] [3:2065:watchdog] [<ffffff8bf2a1ff38>] shrink_inactive_list+0x118/0x998
    [149459.897534] [3:2065:watchdog] [<ffffff8bf2a1cb10>] shrink_node_memcg+0xa18/0x1100
    [149459.897552] [3:2065:watchdog] [<ffffff8bf2a1f0b0>] shrink_node+0x108/0x2f8
    [149459.897568] [3:2065:watchdog] [<ffffff8bf2a1bcb0>] do_try_to_free_pages+0x178/0x380
    [149459.897586] [3:2065:watchdog] [<ffffff8bf2a1b9d0>] try_to_free_pages+0x370/0x4d8
    [149459.897605] [3:2065:watchdog] [<ffffff8bf2a071b8>] __alloc_pages_nodemask+0x868/0x1380
    [149459.897623] [3:2065:watchdog] [<ffffff8bf2a13784>] __do_page_cache_readahead+0xbc/0x358
    [149459.897640] [3:2065:watchdog] [<ffffff8bf29fde4c>] filemap_fault+0x11c/0x600
    [149459.897647] [3:2065:watchdog] [<ffffff8bf2b479f8>] ext4_filemap_fault+0x30/0x50
    [149459.897664] [3:2065:watchdog] [<ffffff8bf2a47f38>] handle_pte_fault+0xb38/0xfa8
    [149459.897681] [3:2065:watchdog] [<ffffff8bf2a485c8>] handle_mm_fault+0x1d0/0x328
    [149459.897699] [3:2065:watchdog] [<ffffff8bf28a3668>] do_page_fault+0x2a0/0x3e0
    [149459.897716] [3:2065:watchdog] [<ffffff8bf28a3364>] do_translation_fault+0x44/0xa8
    [149459.897732] [3:2065:watchdog] [<ffffff8bf2880b74>] do_mem_abort+0x4c/0xd0
    [149459.897750] [3:2065:watchdog] [<ffffff8bf2882c78>] el0_da+0x20/0x24
    [149459.897767] [3:2065:watchdog] Binder:1042_19 D 0 11188 635 0x00000008
    [149459.897786] [3:2065:watchdog] Call trace:
    [149459.897797] [3:2065:watchdog] [<ffffff8bf28852d4>] __switch_to+0xb4/0xc0
    [149459.897804] [3:2065:watchdog] [<ffffff8bf3a1f6a0>] __schedule+0x7f0/0xad0
    [149459.897820] [3:2065:watchdog] [<ffffff8bf3a1f9f0>] schedule+0x70/0x90
    [149459.897835] [3:2065:watchdog] [<ffffff8bf3a23b00>] schedule_timeout+0x548/0x668
    [149459.897853] [3:2065:watchdog] [<ffffff8bf2959028>] msleep+0x28/0x38
    [149459.897868] [3:2065:watchdog] [<ffffff8bf2a1ff38>] shrink_inactive_list+0x118/0x998
    [149459.897887] [3:2065:watchdog] [<ffffff8bf2a1cb10>] shrink_node_memcg+0xa18/0x1100
    [149459.897904] [3:2065:watchdog] [<ffffff8bf2a1f0b0>] shrink_node+0x108/0x2f8
    [149459.897922] [3:2065:watchdog] [<ffffff8bf2a1bcb0>] do_try_to_free_pages+0x178/0x380
    [149459.897940] [3:2065:watchdog] [<ffffff8bf2a1b9d0>] try_to_free_pages+0x370/0x4d8
    [149459.897957] [3:2065:watchdog] [<ffffff8bf2a071b8>] __alloc_pages_nodemask+0x868/0x1380
    [149459.897977] [3:2065:watchdog] [<ffffff8bf2a13784>] __do_page_cache_readahead+0xbc/0x358
    [149459.897996] [3:2065:watchdog] [<ffffff8bf29fde4c>] filemap_fault+0x11c/0x600
    [149459.898013] [3:2065:watchdog] [<ffffff8bf2b479f8>] ext4_filemap_fault+0x30/0x50
    [149459.898031] [3:2065:watchdog] [<ffffff8bf2a47f38>] handle_pte_fault+0xb38/0xfa8
    [149459.898048] [3:2065:watchdog] [<ffffff8bf2a485c8>] handle_mm_fault+0x1d0/0x328
    [149459.898065] [3:2065:watchdog] [<ffffff8bf28a3668>] do_page_fault+0x2a0/0x3e0
    [149459.898083] [3:2065:watchdog] [<ffffff8bf28a3364>] do_translation_fault+0x44/0xa8
    [149459.898100] [3:2065:watchdog] [<ffffff8bf2880d18>] do_el0_ia_bp_hardening+0xc0/0x158
    [149459.898118] [3:2065:watchdog] [<ffffff8bf2882c98>] el0_ia+0x1c/0x20

Symptom: a large number of processes enter through the page-fault path and call into memory reclaim, shrink_inactive_list -> msleep, which puts the process into state D.

    void msleep(unsigned int msecs)
    {
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
            timeout = schedule_timeout_uninterruptible(timeout);
    }

    signed long __sched schedule_timeout_uninterruptible(signed long timeout)
    {
        __set_current_state(TASK_UNINTERRUPTIBLE);
        return schedule_timeout(timeout);
    }

A D process is one whose state has been set to TASK_UNINTERRUPTIBLE, that is, uninterruptible sleep. "Uninterruptible" does not mean that the CPU stops responding to hardware interrupts. It means that the process does not respond to asynchronous signals: a signal is only added to the signal queue and gets no chance to be handled right away. Such a process uses no CPU and cannot be killed. The most visible symptom is that kill -9 on a D process has no effect; signals, SIGKILL included, are handled only after the process obtains the resource it is waiting for and is woken up. "Processes can be very stubborn. You may have met people who fixate on one thing. Take my son: sometimes he wants his toy rocket and keeps fussing until we help him find it. He will not do anything else. Offering him a toy tank does not work, and neither does Peppa Pig. A D process is the same: it insists on waiting for the exact event it is waiting for."
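The D state described above can also be observed from user space, because the state letter is exposed in /proc/<pid>/stat. The following is a minimal sketch rather than a tool used by the original author: `parse_task_state` and `task_is_uninterruptible` are hypothetical helper names. It assumes the documented `pid (comm) state ...` layout and searches for the last `)` because the command name may itself contain parentheses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the state letter from one line of
 * /proc/<pid>/stat, whose layout is "pid (comm) state ppid ...".
 * Returns the state character ('R', 'S', 'D', ...) or '?' if the
 * line is malformed.
 */
char parse_task_state(const char *stat_line)
{
    assert(stat_line != NULL);

    /* comm may contain ')' itself, so look for the last one. */
    const char *close = strrchr(stat_line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/*
 * Read /proc/<pid>/stat and report whether the task is currently in
 * uninterruptible sleep. Returns 1 for D, 0 for any other state and
 * -1 if the file cannot be read.
 */
int task_is_uninterruptible(int pid)
{
    char path[64];
    char line[512];

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    char *ok = fgets(line, sizeof(line), fp);
    fclose(fp);
    if (ok == NULL)
        return -1;
    return parse_task_state(line) == 'D';
}
```

Polling `task_is_uninterruptible()` for a given pid at a fixed interval is one possible way to notice a task that stays in D state for a long time.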
    static noinline_for_stack unsigned long
    shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                         struct scan_control *sc, enum lru_list lru)
    {
        ......
        while (unlikely(too_many_isolated(pgdat, file, sc, stalled))) {
            if (stalled)
                return 0;

            /* wait a bit for the reclaimer. */
            msleep(100);    /* stuck here */
            stalled = true;

            /* We are about to die and free our memory. Return now. */
            if (fatal_signal_pending(current))
                return SWAP_CLUSTER_MAX;
        }
        ......

Initial diagnosis: the function already has an exit path and does not block forever; it leaves after at most 2 passes through the loop. So a large number of processes must be frantically calling shrink_inactive_list, getting blocked for a moment, leaving, and then falling back in. This is not a permanent hang but a performance bottleneck that piles up at this spot: congestion, as in a traffic jam. The upper-layer systrace confirms it, showing very regular D-state segments of roughly 110 ms each, and a single process can stay in this pattern for tens of seconds.
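The "at most 2 passes" claim can be checked with a small user-space model of the loop. This is a simplified sketch and not kernel code: `model_shrink_call` and `model_total_stall_ms` are hypothetical names, `msleep(100)` is replaced by adding 100 to a counter, and the `fatal_signal_pending` branch is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the wait loop at the top of shrink_inactive_list().
 * checks_true: how many consecutive times too_many_isolated() would
 *              return true for this caller.
 * Returns the time in ms that this single call spends sleeping in D state.
 */
unsigned int model_shrink_call(unsigned int checks_true)
{
    bool stalled = false;
    unsigned int slept_ms = 0;

    while (checks_true > 0) {      /* too_many_isolated() is true */
        if (stalled)
            return slept_ms;       /* second pass: give up ("return 0") */
        slept_ms += 100;           /* stands in for msleep(100) */
        stalled = true;
        checks_true--;
    }
    return slept_ms;
}

/*
 * Total D-state time when the same task re-enters reclaim `calls` times
 * while the isolated count stays too high.
 */
unsigned int model_total_stall_ms(unsigned int calls)
{
    unsigned int total = 0;

    for (unsigned int i = 0; i < calls; i++)
        total += model_shrink_call(2);
    return total;
}
```

Under these assumptions a single call never sleeps longer than 100 ms, yet 300 re-entries already add up to 30 seconds, which would be consistent with the regular short D-state segments lasting tens of seconds in total.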

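This indicates that too many pages are isolated. The presumed purposes of the 100 ms sleep are:

1. To give time for the isolated pages to be processed and for file pages to be written back to disk.

2. To limit concurrency, since another CPU may be running the same reclaim path and making the number of isolated pages grow at that moment.

So two directions and suspects were identified at first. The first is a memory reclaim bottleneck: reclaim cannot keep up, memory demand is huge while LMK is not triggered, memory holds many anonymous pages, and everyone is busy reclaiming and writing back file pages. The second is an I/O bottleneck: the I/O rate is slow, the rate drops during some period, the UFS frequency is low, the upper layer reads and writes large amounts of data, I/O utilization is too high, and so on. These suspicions needed to be clarified.

A quick interlude of background knowledge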

page cache

The cause of this situation is as follows. When a process requests memory and finds that the freelist of the zone no longer has enough memory available, it has to reclaim inactive pages from the LRU lists of that zone. This is direct reclaim. Direct reclaim is time-consuming because reclaiming a dirty page triggers disk I/O: the contents of the dirty page are first written back to disk to synchronize them, and only then is the page put on the freelist.

[Figure: relationship between memory, page cache, and disk I/O]

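A simple example: if we open a file without the O_DIRECT flag, we get File I/O (buffered I/O). Every access to the file on disk goes through memory, and the kernel caches that data in the page cache. If we open it with the O_DIRECT flag, we get Direct I/O. It bypasses the page cache and accesses the disk directly, the data is not cached, and performance is naturally much lower for data that would otherwise have been served from the cache.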
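The difference between the two modes comes down to one flag passed to open(). The sketch below assumes Linux with glibc; `read_head` and `BLOCK_ALIGN` are hypothetical names, and the 4096-byte alignment is an assumption, because the real alignment requirement of O_DIRECT depends on the file system and the device.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_ALIGN 4096       /* assumed alignment; device dependent */

/*
 * Read up to `len` bytes from the start of `path` into a freshly
 * allocated, aligned buffer. If use_direct is non-zero the file is
 * opened with O_DIRECT, which bypasses the page cache; in that case
 * `len` should be a multiple of the block size.
 * Returns the number of bytes read, or -1 on error. On success the
 * caller owns *out and must free() it.
 */
ssize_t read_head(const char *path, int use_direct, size_t len, void **out)
{
    assert(path != NULL && out != NULL);

    int flags = O_RDONLY | (use_direct ? O_DIRECT : 0);
    int fd = open(path, flags);
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer; harmless for buffered reads. */
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_ALIGN, len) != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, len);
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}
```

Some file systems reject O_DIRECT altogether, so a caller should be prepared for `read_head(path, 1, ...)` to return -1 where the buffered variant succeeds.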

page reclaim

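Intuitively, we expect the following: we read a file now and it gets cached in memory; if we never access it again during the next month, and the machine is neither shut down nor rebooted during that month, then after a month the file should no longer be in memory. This is the kernel's page cache management policy: LRU (least recently used), that is, the least recently used page cache is reclaimed into free pages. (The page frame reclaiming algorithm, PFRA, is far from this simple.) The kernel has two page reclaim mechanisms: periodic background reclaim and direct reclaim. Background reclaim is done by a kernel thread, kswapd. When the number of free pages drops below a watermark (page_low), this thread is woken up and reclaims page cache from the LRU lists into the free_list. It keeps reclaiming until the number of free pages reaches another watermark, page_high.

[Figure: kswapd background reclaim between the page_low and page_high watermarks]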

Direct reclaim happens when, during a page fault or a memory allocation, there is not enough memory available, so the thread reclaims memory by itself. It reclaims 32 pages at a time.

[Figure: logic of direct reclaim]
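The two reclaim paths described above can be summarized as a small decision model. This is an illustrative sketch, not the kernel's allocator logic: all names are hypothetical, and the separate minimum watermark `wmark_min`, below which the allocating task reclaims by itself, is an assumption added here because the text only names the low and high watermarks.

```c
#include <assert.h>

#define RECLAIM_BATCH 32UL   /* pages reclaimed per direct-reclaim pass */

enum reclaim_action {
    RECLAIM_NONE,     /* enough free pages, nothing to do */
    RECLAIM_KSWAPD,   /* wake the background thread */
    RECLAIM_DIRECT    /* the allocating task reclaims by itself */
};

/* Decide which reclaim path an allocation would trigger. */
enum reclaim_action pick_reclaim(unsigned long free_pages,
                                 unsigned long wmark_min,
                                 unsigned long wmark_low)
{
    assert(wmark_min <= wmark_low);

    if (free_pages < wmark_min)
        return RECLAIM_DIRECT;
    if (free_pages < wmark_low)
        return RECLAIM_KSWAPD;
    return RECLAIM_NONE;
}

/* Pages kswapd still has to free before it may go back to sleep. */
unsigned long kswapd_target(unsigned long free_pages, unsigned long wmark_high)
{
    return free_pages >= wmark_high ? 0 : wmark_high - free_pages;
}

/* Number of 32-page direct-reclaim passes needed to free `need` pages. */
unsigned long direct_reclaim_passes(unsigned long need)
{
    return (need + RECLAIM_BATCH - 1) / RECLAIM_BATCH;
}
```

In this model, raising the watermarks widens the range in which `pick_reclaim` returns `RECLAIM_KSWAPD`, so background reclaim starts earlier and fewer allocations end up in the direct path.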

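So, on the memory optimization side:

1. Raising the watermark indirectly reduces the concurrency of memory reclaim and relieves the congestion in shrink_inactive_list.
2. Improve reclaim efficiency, for example the efficiency of LMK.

However, this still did not fully solve the problem, so suspicion turned to I/O once more. We tried to capture more information to understand the microscopic process that triggers the bottleneck.

1. Run monkey with added monitoring of I/O utilization and I/O read/write speed. Using a 100 ms time slice, watch for consecutive D states and collect the stacks of D processes, memory information, and so on.
2. Enable the vmscan and writeback ftrace events. When the monitoring apk detects a sustained D state, it triggers a dump; ftrace is then parsed from the dump and inspected with kernelshark.

    echo 1 > /sys/kernel/debug/tracing/events/writeback/enable
    echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable
    echo 1 > /sys/kernel/debug/tracing/tracing_on

To prepare for a deeper look at the microscopic process above, some additional code and flowcharts are needed: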

The annotated code is not reproduced here because of the length limit on WeChat official account articles.

Analysis with ftrace + kernelshark

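Page states during page reclaim: ftrace captures the following statistics, so it is worth knowing them beforehand.

    struct reclaim_stat {
        unsigned nr_dirty;          /* dirty pages in page_list */
        unsigned nr_unqueued_dirty; /* dirty pages in page_list not yet queued to the block device */
        unsigned nr_congested;      /* congested pages in page_list */
        unsigned nr_writeback;      /* pages in page_list under writeback but not being reclaimed */
        unsigned nr_immediate;      /* pages in page_list under writeback and about to be reclaimed */
        unsigned nr_activate;       /* recently accessed pages that must go back to the active list */
        unsigned nr_ref_keep;       /* recently accessed pages in page_list */
        unsigned nr_unmap_fail;     /* pages that failed to unmap */
    };

After a period of aging tests, the test team finally captured a log. The captured trace shows that nr_dirty, nr_congested, and nr_writeback are almost always 0, with only a sporadic nr_activate, that is, re-accessed pages that have to be added back to the active list.
This shows that the scene does not involve a large number of dirty pages or a writeback I/O bottleneck, so that hypothesis does not hold. The trace also shows that within 34 seconds every page passing through pageout() was an anon page; there were no file pages.
The writeback trace events tell the same story: the writeback volume is small.
From the test results:

1. The I/O utilization monitored by the apk is not high.
2. ftrace shows that the writeback volume is small.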

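With the latest data, return to the two earlier directions:

First, memory shortage: reclaim cannot keep up and memory demand is high; LMK is not triggered, memory holds many anonymous pages, and everyone is busy reclaiming and writing back file pages. (Raising the watermark and triggering LMK sooner did not stop the problem from reproducing, so this is not a complete fix.)
Second, slow I/O: the rate drops during some period, the UFS frequency is low, the upper layer reads and writes large amounts of data, I/O utilization is too high, and so on. (The data shows that the I/O volume is small and there is no bottleneck.)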

So both earlier hypotheses came to nothing. What unexpected cause could it be? Go back to the code where things get stuck, too_many_isolated.

    static int __too_many_isolated(struct pglist_data *pgdat, int file,
                                   struct scan_control *sc, bool stalled)
    {
        unsigned long inactive, isolated;

        if (file) {
            if (stalled) {
                inactive = node_page_state_snapshot(pgdat, NR_INACTIVE_FILE);
                isolated = node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
            } else {
                inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
                isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
            }
        } else {
            if (stalled) {
                inactive = node_page_state_snapshot(pgdat, NR_INACTIVE_ANON);
                isolated = node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
            } else {
                inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
                isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
            }
        }

        /*
         * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
         * won't get blocked by normal direct-reclaimers, forming a circular
         * deadlock.
         */
        if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
            inactive >>= 3;

        return isolated > inactive;
    }

The logic is not complicated: it only compares the isolated and inactive counters. So the only remaining guess is the most direct one: the isolated file count stays too large, too_many_isolated keeps evaluating to true, and tasks get stuck in shrink_inactive_list. In line with this guess, the mem info printed in the log also shows that isolated file stays large and keeps trending upward instead of falling back, which seems to confirm it.

LOG:

    <6>[95299.607369] isolated(anon):0kB isolated(file):37880kB
    <6>[95318.568833] isolated(anon):0kB isolated(file):37752kB
    <6>[95323.773350] isolated(anon):0kB isolated(file):37752kB
    <6>[97520.184804] isolated(anon):0kB isolated(file):44604kB
    <6>[97525.658037] isolated(anon):0kB isolated(file):44604kB
    <6>[97754.256431] isolated(anon):0kB isolated(file):44604kB
    <6>[97759.418172] isolated(anon):0kB isolated(file):44604kB
    <6>[97764.574908] isolated(anon):0kB isolated(file):44604kB
    <6>[97769.735128] isolated(anon):0kB isolated(file):44604kB
    <6>[98543.638667] isolated(anon):0kB isolated(file):44684kB
    <6>[98548.905397] isolated(anon):0kB isolated(file):44684kB
    <6>[98554.209671] isolated(anon):0kB isolated(file):44684kB
    <6>[99996.798031] isolated(anon):0kB isolated(file):51572kB
    <6>[100002.122853] isolated(anon):0kB isolated(file):51572kB
    <6>[100007.359023] isolated(anon):0kB isolated(file):51572kB
    <6>[100146.079882] isolated(anon):0kB isolated(file):51700kB
    <6>[100151.313065] isolated(anon):0kB isolated(file):51572kB
    <6>[100156.587622] isolated(anon):0kB isolated(file):51572kB
    <6>[100328.483071] isolated(anon):0kB isolated(file):51700kB
    <6>[100520.245217] isolated(anon):0kB isolated(file):51572kB
    <6>[100550.688429] isolated(anon):0kB isolated(file):51572kB
    <6>[100555.913634] isolated(anon):0kB isolated(file):51572kB
    <6>[100669.226582] isolated(anon):0kB isolated(file):51572kB
    <6>[100935.069661] isolated(anon):0kB isolated(file):51688kB
    <6>[100940.240279] isolated(anon):0kB isolated(file):51572kB
    <6>[100945.476071] isolated(anon):0kB isolated(file):51828kB
    <6>[103104.120921] isolated(anon):0kB isolated(file):53344kB
    <6>[103121.900214] isolated(anon):0kB isolated(file):53344kB
    <6>[103481.197823] isolated(anon):0kB isolated(file):53412kB
    <6>[103486.555528] isolated(anon):0kB isolated(file):53412kB
    <6>[103721.346234] isolated(anon):0kB isolated(file):53412kB
    <6>[103726.655700] isolated(anon):0kB isolated(file):53540kB
    <6>[103731.961321] isolated(anon):0kB isolated(file):53540kB
    <6>[103737.236295] isolated(anon):0kB isolated(file):53540kB
    <6>[103742.470632] isolated(anon):0kB isolated(file):53412kB
    <6>[103747.661019] isolated(anon):0kB isolated(file):53284kB
    <6>[103752.973978] isolated(anon):0kB isolated(file):53412kB
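To see what an inflated isolated counter means in practice, the final comparison of `__too_many_isolated()` can be modeled in user space. This is a sketch for illustration only: `model_too_many_isolated` is a hypothetical name, the counters are passed in as plain page counts, and a 4 kB page size is assumed when converting the kB values from the log.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the final comparison in __too_many_isolated().
 * inactive_pages / isolated_pages: sizes of the inactive and isolated
 *                    lists, in pages.
 * io_and_fs_allowed: true when the caller's gfp_mask has both __GFP_IO
 *                    and __GFP_FS set (a normal direct reclaimer).
 */
bool model_too_many_isolated(unsigned long inactive_pages,
                             unsigned long isolated_pages,
                             bool io_and_fs_allowed)
{
    if (io_and_fs_allowed)
        inactive_pages >>= 3;   /* normal callers get only 1/8 of inactive */
    return isolated_pages > inactive_pages;
}
```

Taking the last log value, isolated(file) = 53412 kB is 13353 pages under the 4 kB assumption. In this model a normal direct reclaimer is then throttled whenever the inactive file list holds fewer than 8 × 13353 = 106824 pages (about 417 MB), even if no page is actually isolated at that moment.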

Light at the End of the Tunnel

The increments and decrements of NR_ISOLATED_FILE/NR_ISOLATED_ANON are mainly located in vmscan.c, migrate.c, and the PPR module (Qualcomm's per-process memory reclaim). In theory the kernel's vmscan.c (where they come in matched pairs) and migrate.c should be fine. Qualcomm's PPR module is inserted into vmscan.c and task_mmu.c, and although our IMS does not use Qualcomm PPR directly, it is the prime suspect. Indeed, a related patch was found upstream:

https://source.codeaurora.org/quic/la/kernel/msm-4.14/commit/fs/proc/task_mmu.c?h=msm-4.14&id=c800548eac0350391c6d379a89f2e5d4c31366bf

This patch fixes exactly the isolated count mismatch that kept making isolated file grow. Its commit message reads:

    MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear SwapBacked). PPR increments ISOLATE_FILES count, then isolates page and invokes a reclaim. Inbetween if this lazyfreed page is touched by user then it becomes dirty. PPR in shrink_page_list in try_to_unmap finds the page dirty, marks it back as PageSwapBacked and skips reclaim. As PageSwapBacked set, PPR identifies the page as anon and decrements ISOLATED_ANON, thus creating isolated count mismatch. This results in too_many_isolated() check causing delay in reclaim. Skip reclaiming lazyfreed pages in PPR path.

From the madvise(2) man page:

    MADV_FREE (since Linux 4.5)
    The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page. After a successful MADV_FREE operation, any stale data (i.e., dirty, unwritten pages) will be lost when the kernel frees the pages. However, subsequent writes to pages in the range will succeed and then kernel cannot free those dirtied pages, so that the caller can always see just written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.
    The MADV_FREE operation can be applied only to private anonymous pages (see mmap(2)). In Linux before version 4.12, when freeing pages on a swapless system, the pages in the given range are freed instantly, regardless of memory pressure.

Background on why the flag was added:

    madvise(2) is a system call used by processes to tell the kernel how they are going to use their memory, allowing the kernel to optimize the memory management according to these hints to achieve better overall performance. When an application wants to signal the kernel that it isn't going to use a range of memory in the near future, it can use the MADV_DONTNEED flag, so the kernel can free resources associated with it. Subsequent accesses in the range will succeed, but will result either in reloading of the memory contents from the underlying mapped file or zero-fill-on-demand pages for mappings without an underlying file. But there are some kind of apps (notably, memory allocators) that can reuse that memory range after a short time, and MADV_DONTNEED forces them to incur in page fault, page allocation, page zeroing, etc. For avoiding that overhead, other OS like BSDs have supported MADV_FREE, which just mark pages as available to free if needed, but it doesn't free them immediately, making possible to reuse the memory range without incurring in the costs of faulting the pages again. This release adds Linux support for this flag. Recommended LWN article: Volatile ranges and MADV_FREE

The madvise[1] system call advises the kernel that the user virtual memory in the range starting at addr and spanning len bytes should follow a particular usage pattern, so that the kernel can choose appropriate read-ahead and caching techniques. If a program that calls madvise() knows its memory access pattern well, using this call can improve system performance.
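The MADV_FREE flag was introduced in 4.5. "This is why the problem only showed up on the 4.9 kernel: it needs support from both the upper layer and the kernel before it can appear." Put simply, MADV_FREE lets the upper layer mark a range of memory as freeable, but the kernel does not free it immediately, so that the upper layer can access it again within a short time without paying for page faults and similar overhead. It is also called lazy free, and it applies only to anonymous pages.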
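The behavior that matters for this bug, a write to a lazily freed page cancelling the pending free, can be tried from user space. The following is a minimal sketch assuming Linux with glibc; `lazyfree_roundtrip` is a hypothetical name, and the fallback definition of `MADV_FREE` as 8 is the Linux value, used only when older libc headers lack it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8   /* Linux value, for older libc headers */
#endif

/*
 * Map an anonymous private region, dirty it, mark it lazily freeable
 * with MADV_FREE, then write to it again. The second write cancels the
 * pending free for that page, so the value just written must be visible.
 *
 * *advised is set to 1 if madvise() accepted MADV_FREE (Linux >= 4.5)
 * and to 0 otherwise. Returns the byte read back, or -1 if mmap() failed.
 */
int lazyfree_roundtrip(size_t len, int *advised)
{
    assert(len > 0 && advised != NULL);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 0x11;                                  /* page is now dirty */
    *advised = (madvise(p, len, MADV_FREE) == 0); /* mark as lazyfree */
    p[0] = 0x5A;                                  /* touch it again */

    int value = p[0];
    munmap(p, len);
    return value;
}
```

The window between the `madvise()` call and the second write is where the page is lazyfree; the race described in the patch concerns a reclaimer isolating the page inside that window.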
According to the description, the path that makes the isolated file count grow is as follows (code omitted):

1. The upper layer calls madvise with MADV_FREE, which clears the dirty bit and the SwapBacked bit and puts the lazyfree page on the inactive file list.
2. PPR increments the ISOLATE_FILES count (SwapBacked=0), isolates the page, and triggers reclaim.
3. The upper layer accesses the lazyfreed page, so dirty=1.
4. PPR runs reclaim_pte_range -> reclaim_pages_from_list -> shrink_page_list -> try_to_unmap -> try_to_unmap_one, which sets SwapBacked=1 and skips the reclaim.
5. PPR continues in reclaim_pte_range -> reclaim_pages_from_list. At putback_lru_page time, because SwapBacked=1, it decrements NR_ISOLATED_ANON instead of the NR_ISOLATED_FILE counter that was incremented earlier.
6. As a result, NR_ISOLATED_FILE only ever goes up.

So PPR needs to filter out lazyfree pages to avoid this NR_ISOLATED_FILE accounting error and the resulting stall in too_many_isolated. An anonymous page has SwapBacked=1 from the start, and ClearPageSwapBacked(page) is called only when the upper layer marks the page lazyfree, nowhere else. Therefore PageAnon(page) && !PageSwapBacked(page) identifies a lazyfree page.
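The accounting leak in the steps above can be replayed with a small user-space model. This is an illustrative sketch, not the PPR code: every name below is hypothetical, page flags are plain booleans, and the model lets the anon counter go negative instead of imitating how the kernel reports it.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
    bool anon;         /* stands in for PageAnon() */
    bool swapbacked;   /* stands in for PageSwapBacked() */
    bool dirty;
};

struct model_counters {
    long isolated_anon;   /* stands in for NR_ISOLATED_ANON */
    long isolated_file;   /* stands in for NR_ISOLATED_FILE */
};

/* The counter a page is charged to depends on SwapBacked at that moment. */
static long *counter_for(struct model_counters *c, const struct model_page *p)
{
    return p->swapbacked ? &c->isolated_anon : &c->isolated_file;
}

/* The check given in the text for recognising a lazyfree page. */
bool is_lazyfree(const struct model_page *p)
{
    return p->anon && !p->swapbacked;
}

/*
 * One PPR reclaim attempt on a page that was MADV_FREE'd and is touched
 * by user space while isolated. With skip_lazyfree set, the page is
 * filtered out before any counter is changed.
 */
void model_ppr_reclaim(struct model_counters *c, bool skip_lazyfree)
{
    struct model_page page = { .anon = true, .swapbacked = true, .dirty = true };

    /* Step 1, madvise(MADV_FREE): clear dirty and SwapBacked. */
    page.dirty = false;
    page.swapbacked = false;

    if (skip_lazyfree && is_lazyfree(&page))
        return;

    (*counter_for(c, &page))++;    /* step 2: charged to isolated_file */
    page.dirty = true;             /* step 3: user space writes the page */
    if (page.dirty)
        page.swapbacked = true;    /* step 4: marked SwapBacked, reclaim skipped */
    (*counter_for(c, &page))--;    /* step 5: uncharged from isolated_anon */
}
```

Each unfiltered call leaves `isolated_file` one higher and `isolated_anon` one lower, while the filtered variant leaves both counters untouched, which is the effect the upstream patch aims for.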
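OK, the cause and effect are now clear. Take one more step back and suppose upstream had not fixed this with a patch. Could we have figured it out ourselves? I think it would have been hard, because we lacked knowledge of madvise, and the change in NR_ISOLATED_ANON/FILE only appears after the dirty and SwapBacked flags have gone through their transitions (there seems to be almost no way to trace page flags at such a microscopic level).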


References

[1] madvise: http://www.man7.org/linux/man-pages/man2/madvise.2.html
