Telltale: Netflix Application Monitoring Simplified
By Andrei Ushakov, Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell
Our Telltale Vision
An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.
Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.
So we built Telltale.
Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.
Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically, especially during an incident.
That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.
An application lives in an ecosystem.
The Application Health Model
A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one; call graphs can be much deeper, with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application, as can an upstream or downstream deployment.
Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health:
- Atlas time series metrics.
- Regional traffic evacuations.
- Mantis real-time streaming data.
- Infrastructure change events.
- Canary launches and deployments.
- The health of upstream and downstream services.
- Client metrics and QoE changes.
- Alerts triggered by our alerting platform.
Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.
Telltale takes all those factors into consideration when constructing its view of application health.
The application health model is the heart of Telltale.
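To make the idea of differently weighted signals concrete, here is a minimal sketch in Python. It is not Telltale's actual model: the `Signal` type, the weights in `SIGNAL_WEIGHTS`, and the distance discount are all hypothetical, chosen only to mirror the examples above (error rates outweigh latency, and a nearby deployment outweighs a distant canary).

```python
from dataclasses import dataclass

# Hypothetical weights, for illustration only: error-rate increases count
# for more than latency increases, deployments for more than canaries.
SIGNAL_WEIGHTS = {"error_rate": 1.0, "deployment": 0.7, "latency": 0.5, "canary": 0.4}
DEFAULT_WEIGHT = 0.3


@dataclass
class Signal:
    kind: str        # e.g. "error_rate", "latency", "canary"
    severity: float  # 0.0 (typical) to 1.0 (far outside typical behavior)
    hops: int = 0    # distance in the call graph; 0 is the application itself


def signal_score(signal: Signal) -> float:
    """Weight a signal by its kind and discount it by call-graph distance."""
    weight = SIGNAL_WEIGHTS.get(signal.kind, DEFAULT_WEIGHT)
    return weight * signal.severity / (1 + signal.hops)


def health_score(signals: list[Signal]) -> float:
    """Return the worst weighted signal; 0.0 means nothing looks unusual."""
    return max((signal_score(s) for s in signals), default=0.0)
```

With these assumed weights, a canary two layers downstream scores lower than a deployment one hop upstream, even when both have the same raw severity. Taking the maximum is one simple design choice; a real model could combine signals in many other ways.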
Intelligent Monitoring
Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.
We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.
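As a rough illustration of how signal packs might be combined into an application profile, consider the sketch below. The pack names, signal names, settings, and the `build_config` helper are all hypothetical; the article does not describe Telltale's configuration format.

```python
# Hypothetical signal packs: each maps a signal name to its settings.
SIGNAL_PACKS = {
    "http_server": {
        "http.error_rate": {"direction": "up"},
        "http.latency_p99": {"direction": "up"},
    },
    "jvm": {
        "jvm.gc_pause": {"direction": "up"},
    },
}

# Hypothetical application profiles: an ordered list of packs per service type.
PROFILES = {"java_api_service": ["http_server", "jvm"]}


def build_config(profile, overrides=None):
    """Merge a profile's packs into one config, then apply manual overrides."""
    merged = {}
    for pack_name in PROFILES[profile]:
        for signal_name, settings in SIGNAL_PACKS[pack_name].items():
            merged[signal_name] = dict(settings)  # copy so packs stay untouched
    for signal_name, settings in (overrides or {}).items():
        merged.setdefault(signal_name, {}).update(settings)
    return merged
```

An owner who accepts the defaults would call `build_config("java_api_service")`, while one who wants the hands-on approach could pass overrides such as `{"http.latency_p99": {"enabled": False}}`.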
No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms, including statistical, rule-based, and machine learning approaches. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.
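The article does not say which algorithms Telltale uses, so the following is only a sketch of how a statistical check and a rule-based check can be mixed. It uses a modified z-score based on the median absolute deviation (a standard robust outlier test, not something the source attributes to Telltale); the hard limit on error rate and the function names are assumptions.

```python
from statistics import median


def robust_zscore(history, value):
    """Distance of value from the median of history, in scaled MAD units."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    if mad == 0:
        # A flat history: any deviation at all is infinitely unusual.
        return 0.0 if value == center else float("inf")
    return 0.6745 * (value - center) / mad


def is_unhealthy(metric, history, value, z_threshold=3.5):
    """Combine a hypothetical hard rule with a learned statistical baseline."""
    # Rule-based: an error rate of 50% or more is bad regardless of history.
    if metric == "error_rate" and value >= 0.5:
        return True
    # Statistical: flag values far above what is typical for this metric.
    return robust_zscore(history, value) > z_threshold
```

Because the baseline comes from the metric's own history, nobody has to pick a per-application threshold, which is the point the section makes about avoiding alert tuning.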
Intelligent Alerting
Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification; alert storms are a thing of the past.
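One way to picture context-aware routing and single notifications is the sketch below. The `Notifier` class, the `OWNERS` mapping, and the idea of keying issues by the suspected-cause service are all hypothetical simplifications, not Telltale's implementation.

```python
# Hypothetical mapping from service name to owning team.
OWNERS = {"api": "team-api", "playback": "team-playback", "db": "team-data"}


class Notifier:
    def __init__(self, owners):
        self.owners = owners
        self.open_issues = {}  # suspected-cause service -> list of (app, signal)
        self.sent = []         # (team, service): one entry per open issue

    def report(self, app, signal, suspected_cause=None):
        """Record an unhealthy signal; notify only when a new issue opens."""
        service = suspected_cause or app
        is_new = service not in self.open_issues
        self.open_issues.setdefault(service, []).append((app, signal))
        if is_new:
            # Route to the team that owns the suspected cause, not the
            # application that happened to observe the symptom.
            self.sent.append((self.owners[service], service))
        return is_new

    def resolve(self, service):
        """Close the issue so a later recurrence notifies again."""
        self.open_issues.pop(service, None)
```

If both `api` and `playback` degrade because of `db`, only the team that owns `db` is notified, and only once; later signals are attached to the existing issue rather than producing more alerts.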
An example of a Telltale notification in Slack.
When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.
Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.
But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.
We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert or explain why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.
An example of the details found in a Telltale notification in Slack.
Why Is My Service Unhealthy?
A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services help Telltale detect the possible causes of an application’s degraded health, such as an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.
Incident Management
An example of a Telltale incident summary.
When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.
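The two summary metrics named above are straightforward to compute from incident timestamps. The sketch below assumes each incident is a `(detected_at, resolved_at)` pair and that incidents do not overlap; the `summarize` function and its input shape are illustrative, not part of Telltale.

```python
from datetime import datetime, timedelta


def summarize(incidents):
    """Compute total downtime and MTTR from (detected_at, resolved_at) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    # Summing durations equals total downtime only if incidents don't overlap.
    total = sum(durations, timedelta())
    mttr = total / len(durations) if durations else timedelta()
    return {"count": len(durations), "total_downtime": total, "mttr": mttr}


# Usage with two made-up incidents lasting 30 and 10 minutes.
example = summarize([
    (datetime(2020, 8, 1, 2, 0), datetime(2020, 8, 1, 2, 30)),
    (datetime(2020, 8, 3, 14, 0), datetime(2020, 8, 3, 14, 10)),
])
```

Here `example["total_downtime"]` is 40 minutes and `example["mttr"]` is 20 minutes.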
The cluster view groups similar incidents.
Deployment Monitoring
Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using them for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have a smaller blast radius and a shorter duration.
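The stop-at-first-problem behavior can be sketched as a simple loop. This does not use any Spinnaker or Telltale API; the `rollout` function, the traffic percentages, and the `is_healthy` callback are placeholders for whatever the real pipeline and health check would be.

```python
def rollout(steps, is_healthy):
    """Advance through rollout steps; stop at the first unhealthy check.

    steps: increasing traffic percentages for the new build (hypothetical).
    is_healthy: callback returning True if the new build looks healthy
                at the given percentage (stands in for a health query).
    """
    completed = []
    for percent in steps:
        if not is_healthy(percent):
            # First sign of a problem: report where it happened so the
            # caller can roll back before more traffic is exposed.
            return {"status": "rolled_back", "failed_at": percent, "completed": completed}
        completed.append(percent)
    return {"status": "succeeded", "failed_at": None, "completed": completed}
```

For example, `rollout([5, 25, 50, 100], lambda p: p < 50)` stops at the 50% step, so the problem never reaches the remaining traffic. That is the smaller blast radius the paragraph describes.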
Continuous Improvement
Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data, and that there are benefits to employing higher resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment that doesn’t scale well, so we can definitely improve our self-service UI. And we know there are better heuristics to help pinpoint what’s affecting your service health.
Telltale is application monitoring simplified.
A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in realtime is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.
Original article: https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba