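How WatchDog Works
1. Overview
Android has a hardware WatchDog that periodically checks whether critical hardware is working properly. Similarly, the framework layer has a software WatchDog that periodically checks whether critical system services have deadlocked. Its main job is to determine whether core system services and important threads are stuck in the Blocked state. Specifically, it:
- listens for the reboot broadcast;
- monitors the key system services registered in mMonitors for deadlocks.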
2. WatchDog Initialization
2.1 startOtherServices
[-> SystemServer.java]

    private void startOtherServices() {
        ...
        // Create the watchdog (see Section 2.2, getInstance)
        final Watchdog watchdog = Watchdog.getInstance();
        // Register the reboot broadcast receiver (see Section 2.3, init)
        watchdog.init(context, mActivityManagerService);
        ...
        mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY); // 480
        ...
        mActivityManagerService.systemReady(new Runnable() {
            public void run() {
                mSystemServiceManager.startBootPhase(
                        SystemService.PHASE_ACTIVITY_MANAGER_READY);
                ...
                // Start the watchdog thread (see Section 3.1, run)
                Watchdog.getInstance().start();
                mSystemServiceManager.startBootPhase(
                        SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
            }
        });
    }

WatchDog is initialized while the system_server process starts up. The main steps are:
- create the watchdog object, whose class extends Thread;
- register the reboot broadcast receiver;
- call start() to begin working.
2.2 getInstance
[-> Watchdog.java]

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            // Singleton: create the instance (see Section 2.3, Creating the Watchdog)
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }

2.3 Creating the Watchdog
[-> Watchdog.java]

    public class Watchdog extends Thread {
        // List of all HandlerChecker objects (HandlerChecker: see Section 2.3.1)
        final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
        ...

        private Watchdog() {
            super("watchdog");
            // Add the foreground thread
            mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                    "foreground thread", DEFAULT_TIMEOUT);
            mHandlerCheckers.add(mMonitorChecker);
            // Add the main thread
            mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                    "main thread", DEFAULT_TIMEOUT));
            // Add the ui thread
            mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                    "ui thread", DEFAULT_TIMEOUT));
            // Add the i/o thread
            mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                    "i/o thread", DEFAULT_TIMEOUT));
            // Add the display thread
            mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                    "display thread", DEFAULT_TIMEOUT));
            // See Section 2.3.2, addMonitor
            addMonitor(new BinderThreadMonitor());
        }
    }

Watchdog extends Thread, and the thread it creates is named "watchdog". The mHandlerCheckers list holds the HandlerChecker objects for the main, fg, ui, io and display threads.
2.3.1 HandlerChecker
[-> Watchdog.java]

    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;   // Handler of the monitored thread
        private final String mName;       // descriptive thread name
        private final long mWaitMax;      // maximum wait time
        // Services (monitors) watched through this checker
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;       // set to false when a check starts
        private Monitor mCurrentMonitor;
        private long mStartTime;          // time at which the check was scheduled

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }
    }

2.3.2 addMonitor

    public class Watchdog extends Thread {
        public void addMonitor(Monitor monitor) {
            synchronized (this) {
                ...
                // mMonitorChecker is a HandlerChecker
                mMonitorChecker.addMonitor(monitor);
            }
        }

        public final class HandlerChecker implements Runnable {
            private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();

            public void addMonitor(Monitor monitor) {
                // Add the BinderThreadMonitor from above to the mMonitors list
                mMonitors.add(monitor);
            }
            ...
        }
    }

addMonitor appends the monitor to the mMonitors list of a HandlerChecker. Here the monitor is a BinderThreadMonitor and the checker is mMonitorChecker, so this call puts the Binder threads under supervision.

    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

blockUntilThreadAvailable eventually calls into IPCThreadState and waits until a Binder thread is idle:

    void IPCThreadState::blockUntilThreadAvailable() {
        pthread_mutex_lock(&mProcess->mThreadCountLock);
        while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
            // Wait until the number of executing Binder threads drops below
            // the per-process Binder thread limit (16)
            pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
        }
        pthread_mutex_unlock(&mProcess->mThreadCountLock);
    }

So addMonitor(new BinderThreadMonitor()) attaches a Binder thread check to the handler of the android.fg thread (mMonitorChecker), which then verifies that the Binder threads are still working.
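The wait loop in blockUntilThreadAvailable is a classic condition-variable pattern. As a rough illustration only, it can be sketched in plain Java as below. The class ThreadCountGate and its method names are hypothetical and do not exist in Android. Unlike the native code, this sketch also takes a timeout so that a demo cannot hang forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the thread-count bookkeeping described above. */
final class ThreadCountGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executingThreads;

    ThreadCountGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** Called when a pool thread starts handling a transaction. */
    void onTransactionStart() {
        lock.lock();
        try {
            executingThreads++;
        } finally {
            lock.unlock();
        }
    }

    /** Called when a pool thread finishes; wakes up any waiter. */
    void onTransactionEnd() {
        lock.lock();
        try {
            executingThreads--;
            countDecremented.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits until at least one thread is idle. Returns false if the pool is
     * still saturated after timeoutMillis (the native code waits indefinitely).
     */
    boolean awaitThreadAvailable(long timeoutMillis) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (executingThreads >= maxThreads) {
                if (remainingNanos <= 0) {
                    return false;
                }
                try {
                    remainingNanos = countDecremented.awaitNanos(remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

If every pool thread is busy, the caller stays blocked. This is exactly why the monitor makes the android.fg check time out when the Binder thread pool is exhausted.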
2.3 init
[-> Watchdog.java]

    public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;
        // Register the reboot broadcast receiver (see Section 2.3.1, RebootRequestReceiver)
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

2.3.1 RebootRequestReceiver
[-> Watchdog.java]

    final class RebootRequestReceiver extends BroadcastReceiver {
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                // See Section 2.3.2, rebootSystem
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

2.3.2 rebootSystem
[-> Watchdog.java]

    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            // Perform the reboot through PowerManager
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

The reboot itself is carried out by PowerManagerService. The detailed reboot flow will be covered separately.
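3. Watchdog Detection Mechanism
Calling Watchdog.getInstance().start() enters the run() method of the "watchdog" thread. The method has two parts:
- the first half [Section 3.1] detects whether a timeout has occurred;
- the second half [Section 4.1] outputs diagnostic information once a timeout has occurred.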
3.1 run
[-> Watchdog.java]

    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL; // CHECK_INTERVAL = 30s
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // Schedule a check on every checker; each one records
                    // its mStartTime (see Section 3.2)
                    hc.scheduleCheckLocked();
                }
                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }
                long start = SystemClock.uptimeMillis();
                // Loop so that a full 30s elapses before continuing
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout); // if interrupted, catch the exception and keep waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // Evaluate the checker states (see Section 3.3)
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // First time past half of the timeout
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        // Dump the traces of system_server and 3 native
                        // processes (see Section 4.2)
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }
                ... // Reaching this point means the watchdog has timed out (see Section 4.1)
            }
            ...
        }
    }

    public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
        "/system/bin/mediaserver",
        "/system/bin/sdcard",
        "/system/bin/surfaceflinger"
    };

What this method does:
- when a checker has no monitors (true for every checker except the android.fg one) and its looper is idle (polling), scheduleCheckLocked simply sets mCompleted = true;
- when a checker's previous check has not finished yet, scheduleCheckLocked returns without posting another one;
- when the state is COMPLETED or WAITING, nothing further happens;
- when the state is WAITED_HALF (blocked for at least 30s) for the first time, the traces of system_server and 3 native processes are dumped;
- when the state is OVERDUE, more information is output.
As a result, a watchdog trigger normally involves two calls to AMS.dumpStackTraces: the traces of system_server and the 3 native processes are written twice, roughly 30s apart.
3.2 scheduleCheckLocked
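
    public final class HandlerChecker implements Runnable {
        ...
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // The target looper is idle (polling), so there is nothing to check
                mCompleted = true;
                return;
            }
            if (!mCompleted) {
                return; // a check is already in progress, do not post another one
            }
            mCompleted = false;
            mCurrentMonitor = null;
            // Record the current time
            mStartTime = SystemClock.uptimeMillis();
            // Post this checker at the front of the message queue; see run() below
            mHandler.postAtFrontOfQueue(this);
        }

        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call back into the service's monitor() method
                mCurrentMonitor.monitor();
            }
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

What this method does: it posts the HandlerChecker to the front of the message queue of the monitored thread's Looper. When that message is handled, HandlerChecker.run() calls monitor() on each registered monitor and then sets mCompleted = true. If the message the handler is currently processing takes so long that run() never gets a chance to execute, the watchdog is triggered.
postAtFrontOfQueue(this) takes a Runnable. Through the message mechanism it eventually calls back HandlerChecker.run(), which iterates over all Monitor objects. Each monitored service implements the monitor() method of that interface.
Possible problems: if other code keeps calling postAtFrontOfQueue(), the check may never get to run; or each monitor may take a little time and the total may add up to more than one minute. Both cases trigger the watchdog, but they are unusual.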
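The core idea of the check can be tried outside Android. The following is a minimal sketch under stated assumptions, not the framework code: QueueChecker is a hypothetical class, a single-thread ExecutorService stands in for the monitored Handler thread, and the check is appended at the tail of the queue because a plain executor has no equivalent of postAtFrontOfQueue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical, simplified HandlerChecker for a plain single-thread executor. */
final class QueueChecker implements Runnable {
    private final ExecutorService target; // stands in for the monitored Handler thread
    private boolean completed = true;

    QueueChecker(ExecutorService target) {
        this.target = target;
    }

    /** Posts a check unless the previous one is still pending. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        // Appended at the tail here; the real code uses postAtFrontOfQueue.
        target.execute(this);
    }

    /** Runs on the monitored thread: reaching this point proves it is alive. */
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns {completed while the thread was stuck, completed after release}. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch drained = new CountDownLatch(1);
        try {
            // Simulate a message that blocks the monitored thread.
            worker.execute(() -> {
                started.countDown();
                awaitQuietly(release);
            });
            awaitQuietly(started);

            QueueChecker checker = new QueueChecker(worker);
            checker.scheduleCheck();
            boolean whileStuck = checker.isCompleted();

            release.countDown();                 // the long message finishes
            worker.execute(drained::countDown);
            awaitQuietly(drained);               // the check ran before this task
            return new boolean[] {whileStuck, checker.isCompleted()};
        } finally {
            worker.shutdown();
        }
    }
}
```

While the worker is stuck in its first task, the check stays pending. A watchdog that polls isCompleted() on a timer would see this as a blocked thread.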
3.3 evaluateCheckerCompletionLocked

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            // See Section 3.4
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

Returns the largest (that is, the worst) wait state among all checkers in the mHandlerCheckers list.
3.4 getCompletionStateLocked

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            // mWaitMax defaults to 60s
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

- COMPLETED = 0: the check has completed;
- WAITING = 1: the wait is shorter than half of DEFAULT_TIMEOUT, that is, less than 30s;
- WAITED_HALF = 2: the wait is between 30s and 60s;
- OVERDUE = 3: the wait is 60s or longer.
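The state classification and the "worst state wins" evaluation from Sections 3.3 and 3.4 can be restated as two pure functions. This is only an illustrative sketch: CheckerStates, stateFor and worstOf are hypothetical names, while the constants and thresholds follow the listing above.

```java
/** Hypothetical restatement of the wait states as pure functions. */
final class CheckerStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one checker from its completion flag and elapsed wait. */
    static int stateFor(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the largest state of any single checker. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }
}
```

With a 60s limit, a checker that is still pending after exactly 30s is already WAITED_HALF, and after exactly 60s it is OVERDUE. One overdue checker is enough to make the overall state OVERDUE.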
4. Watchdog Handling Flow
4.1 run
[-> Watchdog.java]

    public void run() {
        while (true) {
            synchronized (this) {
                ...
                // Get the blocked checkers (see Section 4.1.1)
                blockedCheckers = getBlockedCheckersLocked();
                // Build the description (see Section 4.1.2)
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Second dump, appended to the first: stacks of system_server
            // and 3 native processes (see Section 4.2)
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // The system has already been blocked for a minute, so waiting 2s
            // more to make sure the stack traces get written does no harm
            SystemClock.sleep(2000);

            if (RECORD_KERNEL_THREADS) {
                // Dump the kernel stacks (see Section 4.3)
                dumpKernelStackTraces();
            }

            // Ask the kernel to dump backtraces to the kernel log (see Section 4.4)
            doSysRq('l');

            // Write the dropbox entry (see Section 4.5)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
            dropboxThread.start();
            try {
                dropboxThread.join(2000); // give the dropbox thread up to 2s
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                // Report the blocked state to the activity controller
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 means keep waiting, -1 means kill the system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        // An ActivityController can make the watchdog keep waiting
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Kill the process only when no debugger is attached
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                // Print the stack of every blocked thread
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, " at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // Kill the system_server process (see Section 4.6)
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
            waitedHalf = false;
        }
    }

When Watchdog detects a problem, it collects the following information:
- AMS.dumpStackTraces: stacks of the Java and native processes;
- WD.dumpKernelStackTraces: kernel stacks;
- doSysRq: backtraces written to the kernel log;
- dropBox: a dropbox entry containing the traces.
Once the information has been collected, the system_server process is killed. allowRestart defaults to true. Running am hang sets it to false (restart not allowed), in which case system_server is not killed.
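The order in which these conditions are tested matters, so it can help to see the tail of run() condensed into one function. The sketch below is a hypothetical restatement: WatchdogVerdict, Outcome and decide are illustrative names, and a null controllerResult is used here to mean that no IActivityController is set.

```java
/** Hypothetical condensation of the kill decision into one pure function. */
final class WatchdogVerdict {
    enum Outcome {
        KEEP_WAITING,            // the activity controller asked to keep waiting
        DEBUGGER_CONNECTED,      // a debugger is attached right now
        DEBUGGER_WAS_CONNECTED,  // a debugger was seen during a recent round
        RESTART_NOT_ALLOWED,     // for example after "am hang"
        KILL_SYSTEM_SERVER
    }

    /**
     * @param controllerResult result of systemNotResponding(), or null when
     *                         no controller is set (an assumption of this sketch)
     */
    static Outcome decide(Integer controllerResult, boolean debuggerConnectedNow,
            int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;
        }
        if (debuggerConnectedNow) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            return Outcome.DEBUGGER_CONNECTED;
        }
        if (debuggerWasConnected > 0) {
            return Outcome.DEBUGGER_WAS_CONNECTED;
        }
        if (!allowRestart) {
            return Outcome.RESTART_NOT_ALLOWED;
        }
        return Outcome.KILL_SYSTEM_SERVER;
    }
}
```

Only the last branch kills system_server. The controller is consulted first, then the debugger state, and allowRestart is checked last.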
4.1.1 getBlockedCheckersLocked

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        // Iterate over all checkers
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            // Collect every checker that has not completed and is overdue
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

4.1.2 describeCheckersLocked
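
    private String describeCheckersLocked(ArrayList<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            // Append the description of each checker
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

    public String describeBlockedStateLocked() {
        // No monitor is running: the handler itself is blocked
        // (the usual case for the non-foreground threads)
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        // A monitor is running (the foreground thread's checker)
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                    + " on " + mName + " (" + getThread().getName() + ")";
        }
    }

Every handler thread or monitor that has been running for more than one minute is recorded.
- "Blocked in handler" means the thread has spent more than one minute processing its current message;
- "Blocked in monitor" means the check running on the thread has taken more than one minute, usually because the named monitor has been unable to acquire its lock.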
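To make the two message shapes concrete, here is a small self-contained sketch that builds the same strings without any Android classes. BlockedSubject and its methods are hypothetical helpers, and the thread and class names in the usage note are example inputs.

```java
import java.util.List;

/** Hypothetical helper that reproduces the two description formats. */
final class BlockedSubject {
    /** monitorClassName is null when no monitor was running. */
    static String describe(String checkerName, String threadName, String monitorClassName) {
        if (monitorClassName == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClassName
                + " on " + checkerName + " (" + threadName + ")";
    }

    /** Joins the per-checker descriptions into the watchdog subject line. */
    static String subject(List<String> descriptions) {
        return String.join(", ", descriptions);
    }
}
```

For example, describe("main thread", "main", null) yields "Blocked in handler on main thread (main)". Passing "com.android.server.am.ActivityManagerService" as the monitor class for the "foreground thread" checker on "android.fg" yields "Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg)".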
4.2 AMS.dumpStackTraces

    public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
            ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
            String[] nativeProcs) {
        // Defaults to data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }
        File tracesFile = new File(tracesPath);
        try {
            // If clearTraces is set, delete the existing traces file
            if (clearTraces && tracesFile.exists()) tracesFile.delete();
            // Create the traces file
            tracesFile.createNewFile();
            // -rw-rw-rw-
            FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
        } catch (IOException e) {
            return null;
        }
        // Write the trace contents
        dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
        return tracesFile;
    }

This outputs the traces of system_server and of the 3 native processes /system/bin/mediaserver, /system/bin/sdcard and /system/bin/surfaceflinger.
4.3 WD.dumpKernelStackTraces

    private File dumpKernelStackTraces() {
        // The path is data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }
        // See Section 4.3.1
        native_dumpKernelStacks(tracesPath);
        return new File(tracesPath);
    }

native_dumpKernelStacks goes through JNI to the dumpKernelStacks() function in android_server_Watchdog.cpp.
4.3.1 dumpKernelStacks
[-> android_server_Watchdog.cpp]

    static void dumpKernelStacks(JNIEnv* env, jobject clazz, jstring pathStr) {
        char buf[128];
        DIR* taskdir;
        const char *path = env->GetStringUTFChars(pathStr, NULL);
        // Open the traces file
        int outFd = open(path, O_WRONLY | O_APPEND | O_CREAT,
                S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH);
        ...
        snprintf(buf, sizeof(buf), "\n----- begin pid %d kernel stacks -----\n", getpid());
        write(outFd, buf, strlen(buf));
        // Enumerate all threads of this process
        snprintf(buf, sizeof(buf), "/proc/%d/task", getpid());
        taskdir = opendir(buf);
        if (taskdir != NULL) {
            struct dirent * ent;
            while ((ent = readdir(taskdir)) != NULL) {
                int tid = atoi(ent->d_name);
                if (tid > 0 && tid <= 65535) {
                    // Dump the stack of each thread (see Section 4.3.2)
                    dumpOneStack(tid, outFd);
                }
            }
            closedir(taskdir);
        }
        snprintf(buf, sizeof(buf), "----- end pid %d kernel stacks -----\n", getpid());
        write(outFd, buf, strlen(buf));
        close(outFd);
    done:
        env->ReleaseStringUTFChars(pathStr, path);
    }

The list of all threads in the current process is obtained by reading the /proc/%d/task node.
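The same thread enumeration can be tried from Java on a Linux machine. The sketch below is an illustration, not part of the framework: ProcTasks is a hypothetical class, it reads /proc/self/task instead of formatting the pid, and it drops the upper bound of 65535 that the native code applies. On systems without /proc it simply returns an empty list.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical Java counterpart of the /proc task enumeration. */
final class ProcTasks {
    /** Keeps only directory entries that parse as positive thread ids. */
    static List<Integer> parseTids(String[] entryNames) {
        List<Integer> tids = new ArrayList<>();
        if (entryNames == null) {
            return tids; // directory missing or unreadable
        }
        for (String name : entryNames) {
            try {
                int tid = Integer.parseInt(name);
                if (tid > 0) {
                    tids.add(tid);
                }
            } catch (NumberFormatException ignored) {
                // not a thread directory, for example "." or ".."
            }
        }
        return tids;
    }

    /** Lists the thread ids of the current process, or nothing if /proc is absent. */
    static List<Integer> currentProcessTids() {
        return parseTids(new File("/proc/self/task").list());
    }
}
```

Each returned id corresponds to a /proc/<tid>/stack file that dumpOneStack would read. Reading those files usually requires elevated privileges, so the sketch stops at listing the ids.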
4.3.2 dumpOneStack
[-> android_server_Watchdog.cpp]

    static void dumpOneStack(int tid, int outFd) {
        char buf[64];
        // Read the /proc/%d/stack node
        snprintf(buf, sizeof(buf), "/proc/%d/stack", tid);
        int stackFd = open(buf, O_RDONLY);
        if (stackFd >= 0) {
            // Header
            strncat(buf, ":\n", sizeof(buf) - strlen(buf) - 1);
            write(outFd, buf, strlen(buf));
            // Copy the stack contents
            int nBytes;
            while ((nBytes = read(stackFd, buf, sizeof(buf))) > 0) {
                write(outFd, buf, nBytes);
            }
            // Footer
            write(outFd, "\n", 1);
            close(stackFd);
        } else {
            ALOGE("Unable to open stack of tid %d : %d (%s)", tid, errno, strerror(errno));
        }
    }

4.4 WD.doSysRq

    private void doSysRq(char c) {
        try {
            FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write(c);
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
        }
    }

Writing a character to the /proc/sysrq-trigger node asks the kernel to dump diagnostic information into the kernel log. The 'l' command used here prints a backtrace for every active CPU.
4.5 dropBox
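Dropbox was covered in detail in the dropBox source code article. Its files are written to /data/system/dropbox. When the watchdog triggers, the generated dropbox file has the tag system_server_watchdog and contains the traces together with the corresponding blocked information.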
4.6 killProcess
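Process.killProcess was explained in detail in the article on how process killing is implemented: it kills the target process by sending it signal 9.
Killing system_server causes the zygote process to kill itself, which in turn makes init restart Zygote. This is what shows up as a framework restart on the phone.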
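5. Summary
Watchdog is a thread named "watchdog" that runs in the system_server process:
- how it operates: when a blockage lasts longer than one minute, the watchdog triggers and kills system_server, which restarts the upper-layer framework;
- mHandlerCheckers is the list of all HandlerChecker objects, including the handlers of the foreground, main, ui, i/o and display threads;
- mMonitorChecker.mMonitors holds all Monitors that Watchdog currently watches; all of these monitors run on the foreground thread.
- There are two ways to put something under Watchdog supervision:
  - addThread(): monitors a Handler object, with a default timeout of 60s. A timeout of this kind usually means that the corresponding handler thread is processing messages slowly;
  - addMonitor(): monitors a service that implements the Watchdog.Monitor interface. A timeout of this kind can mean that the "android.fg" thread is processing messages slowly, or that a monitor has been unable to acquire its lock.
In the following cases system_server is not killed even if Watchdog triggers:
- monkey: an IActivityController has been set and intercepts the systemNotResponding event, as monkey does;
- hang: the am hang command has been run, so no restart happens;
- debugger: a debugger is attached, so no restart happens.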
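Why a monitor detects a stuck lock can be shown with a few lines of plain Java. A typical monitor() implementation does nothing except acquire and release the service's main lock, so the call hangs for as long as another thread holds that lock. The sketch below is a hypothetical, self-contained model: LockMonitorDemo, FakeService and monitorReturnsWithin are illustrative names, and the Monitor interface here only mirrors the shape of Watchdog.Monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a lock-probing monitor. */
final class LockMonitorDemo {
    /** Same shape as Watchdog.Monitor: a single no-argument callback. */
    interface Monitor {
        void monitor();
    }

    /** A stand-in service whose monitor() only probes its own lock. */
    static final class FakeService implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // Nothing to do: acquiring the lock is the whole check.
            }
        }
    }

    /** Runs monitor() on a helper thread and reports whether it returned in time. */
    static boolean monitorReturnsWithin(Monitor monitor, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            monitor.monitor();
            done.countDown();
        }, "fake-fg");
        checker.setDaemon(true);
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

While some thread sits inside synchronized (service.lock), monitorReturnsWithin(service, 200) returns false. That is the situation Watchdog reports as "Blocked in monitor".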
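5.1 Output
When a check stays blocked for one minute, the watchdog outputs the following:
- AMS.dumpStackTraces is called twice: first when the wait reaches 30s, then again at 1min;
- dumpKernelStackTraces reads the /proc/%d/task node to obtain the list of all threads in the process;
- it then reads the /proc/%d/stack node of each thread to obtain its kernel stack;
- doSysRq writes to the /proc/sysrq-trigger node.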
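5.2 Handler Mode
The threads monitored by Watchdog are listed below. By default DEFAULT_TIMEOUT = 60s. It is 10s only when debugging, which makes potential ANR problems easier to find.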
| Thread name | Handler | Description |
| --- | --- | --- |
| system_server | new Handler(Looper.getMainLooper()) | current main thread |
| android.fg | FgThread.getHandler | foreground thread |
| android.ui | UiThread.getHandler | UI thread |
| android.io | IoThread.getHandler | I/O thread |
| android.display | DisplayThread.getHandler | display thread |
| ActivityManager | AMS.MainHandler | used in the AMS constructor |
| PowerManagerService | PMS.PowerManagerHandler | used in PMS.onStart() |
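Watchdog currently monitors the 7 threads above in the system_server process. The Looper of each of these threads must never spend more than one minute processing a message.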
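5.3 Monitor Mode
Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread. The main classes in the system that implement the interface are: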
- ActivityManagerService
- WindowManagerService
- InputManagerService
- PowerManagerService
- NetworkManagementService
- MountService
- NativeDaemonConnector
- BinderThreadMonitor
- MediaProjectionManagerService
- MediaRouterService
- MediaSessionService