和我一起来分析某药品仓储管理系统 卡死现象
一:背景
1. 講故事
這個(gè)月初,有位朋友wx上找到我,說他的api過一段時(shí)間后,就會(huì)出現(xiàn)只有請(qǐng)求,沒有響應(yīng)的情況,截圖如下:
從朋友的描述中看樣子程序是被什么東西卡住了,這種卡死的問題解決起來相對(duì)簡(jiǎn)單,接下來我就用 windbg 給大家分析一下。
二:Windbg 分析
1. Request 請(qǐng)求正在干嘛?
既然朋友說 api 有 request 無 response,那怎么去驗(yàn)證朋友的話對(duì)不對(duì)呢?我們都知道 .NET 用 HttpContext 來表示一個(gè)請(qǐng)求,言外之意就是可以去抓 HttpContext 下的時(shí)長(zhǎng)屬性,Netext 中有一個(gè) !whttp 命令可以幫助我們。
0:000>?!whttp HttpContext????Thread?Time?Out?Running??Status?Verb?????Url 000000563bf803b0???42?00:01:50?00:01:24????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN 000000563bf84660???--?00:01:50?Finished????200?GET??????http://xxx.com:30003/ 000000563c4a0470???51?00:01:50?00:00:12????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C 00000056bbf63590???30?00:01:50?00:02:41????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C 00000056bc82a038???--?00:01:50?Finished????200?GET??????http://localhost:30003/ 00000056bc84a3e8???44?00:01:50?00:00:51????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x 00000056bc8671c8???46?00:01:50?00:00:45????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C 000000573bf44698???35?00:01:50?00:02:39????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x 000000573bf483c0???33?00:01:50?00:02:41????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN 000000573bf97e80???40?00:01:50?00:02:32????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=ZJB2C 000000573c583b08???--?00:01:50?Finished????200?GET??????http://localhost:30003/ 000000573c589ec8???--?00:01:50?Finished????200?GET??????http://xxx.com:30003/Wms/xxx/xxx/xxx 000000573c760e28???--?00:01:50?Finished????200?POST?????http://xxx.com:30003/Wms/xxx/xxx/xxx 000000573c95f990???48?00:01:50?00:00:31????200?POST?????http://xxx.com:30003/Wms/Common/xxx?xxx=xxx&xxx=x-HN 00000057bbf4f8e8???31?00:01:50?00:02:12????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x 00000057bc080340???50?00:01:50?00:00:19????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x 000000583c4aee80???43?00:01:50?00:01:11????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B 000000583c4d0c50???53?00:01:50?00:00:01????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B 00000058bbf8f1a0???34?00:01:50?00:02:22????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B 000000593bfe1758???41?00:01:50?00:01:22????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C 000000593c892160???--?00:01:50?Finished????200?GET??????http://xxx.com:30003/Wms/xxx/xxx/xxxJob 000000593ca813b0???45?00:01:50?00:00:30????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-HN 000000593caa45d8???--?00:01:50?Finished????200?GET??????http://xxx.com:30003/ 00000059bc1ad808???32?00:01:50?00:01:45????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C 00000059bc1c3d70???36?00:01:50?00:01:29????200?POST?????http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x25?HttpContext?object(s)?found?matching?criteria從 Running 列可以看到大多請(qǐng)求都已經(jīng)達(dá)到1分鐘以上,這也驗(yàn)證了朋友所說的卡死問題,按照經(jīng)驗(yàn),可以取 Running 列中最大的 httpContext 所在的線程,也就是上面的 30 和 33 號(hào)線程, 看看它們都在干什么?
2. 探究 Running 最長(zhǎng)線程
接下來切到 30 和 33 號(hào)線程,看看它們的線程棧。
0:000>?~30s ntdll!NtWaitForSingleObject+0xa: 00007ffd`b81f024a?c3??????????????ret 0:030>?!clrstack? OS?Thread?Id:?0x29d0?(30)Child?SP???????????????IP?Call?Site 0000005acc3ac590?00007ffdb81f024a?[PrestubMethodFrame:?0000005acc3ac590]?xxx.xxx.RedisConnectionHelp.get_Instance() 0000005acc3ac850?00007ffd4dd78911?xxx.xxx.RedisCache..ctor(Int32,?System.String) 0000005acc3ac8c0?00007ffd4dd78038?xxx.xxx.CacheByRedis.HashGet[[System.__Canon,?mscorlib]](System.String,?System.String,?Int32) 0000005acc3ac968?00007ffdabef1f7c?[StubHelperFrame:?0000005acc3ac968]? 0000005acc3ac9c0?00007ffd4dd77f18?xxx.xxx.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon,?mscorlib]](System.String,?System.String,?System.String)...0:030>?~33s ntdll!NtWaitForMultipleObjects+0xa: 00007ffd`b81f07ba?c3??????????????ret 0:033>?!clrstack? OS?Thread?Id:?0x3ad4?(33)Child?SP???????????????IP?Call?Site 0000005accabae90?00007ffdb81f07ba?[GCFrame:?0000005accabae90]? 0000005accabafb8?00007ffdb81f07ba?[HelperMethodFrame_1OBJ:?0000005accabafb8]?System.Threading.Monitor.ObjWait(Boolean,?Int32,?System.Object) 0000005accabb0d0?00007ffdaac60d64?System.Threading.ManualResetEventSlim.Wait(Int32,?System.Threading.CancellationToken) 0000005accabb160?00007ffdaac5b4bb?System.Threading.Tasks.Task.SpinThenBlockingWait(Int32,?System.Threading.CancellationToken) 0000005accabb1d0?00007ffdab5a01d1?System.Threading.Tasks.Task.InternalWait(Int32,?System.Threading.CancellationToken) 0000005accabb2a0?00007ffdab59cfa7?System.Threading.Tasks.Task`1[[System.__Canon,?mscorlib]].GetResultxxx(Boolean) 0000005accabb2e0?00007ffd4d8d338f?xxx.Config.xxx.Config`1[[System.__Canon,?mscorlib]].GetConfig(xxx.Config.Model.ConfigListener,?System.Func`2<xxx.Config.Request.GetConfigRequest,System.Threading.Tasks.Task`1<System.String>>) 0000005accabb340?00007ffd4d8d2f40?xxx.Config.xxx.Config`1[[System.__Canon,?mscorlib]].get_Item(System.String,?System.String) 0000005accabb3c0?00007ffd4dd78f7f?xxx.Util.BaseConfig.get_GetRedisConn() 0000005accabb440?00007ffd4dd78e9c?xxx.xxx.RedisConnectionHelp.GetConnectionString() 0000005accabb4a0?00007ffd4dd789cb?xxx.xxx.RedisConnectionHelp..cctor() 0000005accabb940?00007ffdabef6953?[GCFrame:?0000005accabb940]? 0000005accabc5b0?00007ffdabef6953?[PrestubMethodFrame:?0000005accabc5b0]?xxx.xxx.RedisConnectionHelp.get_Instance() 0000005accabc870?00007ffd4dd78911?xxx.xxx.RedisCache..ctor(Int32,?System.String) 0000005accabc8e0?00007ffd4dd78038?xxx.xxx.CacheByRedis.HashGet[[System.__Canon,?mscorlib]](System.String,?System.String,?Int32) 0000005accabc988?00007ffdabef1f7c?[StubHelperFrame:?0000005accabc988]? 0000005accabc9e0?00007ffd4dd77f18?xxx.Core.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon,?mscorlib]](System.String,?System.String,?System.String) ...上面的信息不難發(fā)現(xiàn) 30 號(hào)線程正卡在 RedisConnectionHelp.get_Instance() 處,33 號(hào)線已經(jīng)進(jìn)入了 RedisConnectionHelp.get_Instance() 方法中,最后在 GetConfig() 處等待 Result 的結(jié)果,按經(jīng)驗(yàn)來說,30 號(hào)線程看樣子正在鎖等待, 33 號(hào)正在等待異步結(jié)果,接下來的突破點(diǎn)就是探究下 RedisConnectionHelp.Instance 處代碼。
3. 尋找問題代碼
接下來用反編譯工具 ILSpy 找到問題代碼。
public?static?class?RedisConnectionHelp {public?static?ConnectionMultiplexer?Instance{get{if?(_instance?==?null){lock?(Locker){if?(_instance?==?null?||?!_instance.IsConnected){_instance?=?GetManager();}}}return?_instance;}} }30 號(hào)線程果然是卡在 Locker 處,接下來深挖下 33 號(hào)線程所執(zhí)行的 GetManager() 方法,簡(jiǎn)化后代碼如下:
public?T?this[string?dataId,?string?key?=?""] {get{try{string?config?=?GetConfig(configListener,?new?NacosConfigClient(Base.NacosConfiguration).DoGetConfigAsync);return?JsonConvert.DeserializeObject<T>(config);}catch?(Exception?ex){return?default(T);}} }private?string?GetConfig(ConfigListener?listener,?Func<GetConfigRequest,?Task<string>>?action) {var?text2?=?action(new?GetConfigRequest{DataId?=?listener.DataId,Group?=?listener.Group,Tenant?=?text}).Result;return?text2; }internal?async?Task<string>?DoGetConfigAsync(GetConfigRequest?request) {IRestResponse?restResponse?=?await?HttpUtil.Request(currentServerAddr,?Method.GET,?request.ParamValues(),?xxx);return?restResponse.Content; }可以看到代碼卡在了 Result 上無限期等待,到這里我就想到了 同步上下文 ,我看他這個(gè)程序是 .NET 4.8 下的 ASP.NET MVC,記得不錯(cuò)上下文應(yīng)該是 AspNetSynchronizationContext,具體死鎖原因可參見我的這篇文章:一句 Task.Result 就死鎖, 這代碼還怎么寫?,解決辦法大概有四種。
使用 .ConfigureAwait(false)
改成全異步
用 Task 再包一層。
改成全同步
三:總結(jié)
其實(shí)本次事故主要還是因?yàn)樵?同步代碼 中做了 異步代碼.Result 導(dǎo)致的死鎖問題,有非常多的文章都在抨擊這種現(xiàn)象,在 asp.net core 中已經(jīng)移除了這種同步上下文的大坑,給到朋友的建議是改成全同步,死鎖問題也隨之消失。
哈哈,真替朋友開心!
總結(jié)
以上是生活随笔為你收集整理的和我一起来分析某药品仓储管理系统 卡死现象的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 如何通过 C# 实现对象的变更跟踪 ?
- 下一篇: Windows 10 2022 年更新来