Agile Data Science: Data Science Can and Should Be Agile
TL;DR
- I have encountered a lot of resistance in the data science community to agile methodology, and specifically to the Scrum framework.
- I don't see it this way, and I claim that most disciplines would improve by adopting an agile mindset.
- We will go through a typical Scrum sprint to highlight the compatibility of the data science process and the agile development process.
- Finally, we discuss when Scrum is not an appropriate process to follow: for example, if you are a consultant working on many projects at a time, or if your work requires deep concentration on a single, narrow issue (narrow enough that you alone can solve it).
I recently found a Medium post which claims that Scrum is awful for data science. I'm afraid I have to disagree, and I would like to make a case for Agile Data Science.
Ideas for this post are significantly influenced by the book Agile Data Science 2.0 (which I highly recommend) and by personal experience. I am eager to hear about other experiences, so please share them in the comments.
First, we need to agree on what data science is and how it solves business problems so we can investigate the process of data science and how agile (and specifically Scrum) can improve it.
What is Data Science?
There are countless definitions online. For example, Wikipedia gives this description:
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.
In my opinion, it is quite an accurate definition of what data science tries to accomplish. But I would simplify this definition further.
Data Science solves business problems by combining business understanding, data and algorithms.
Compared to the definition in Wikipedia, I would like to stress that data scientists should aim to solve business problems rather than “extract knowledge and insights.”
How Does Data Science Solve Business Problems?
So data science is here to solve business problems. We need to accomplish a few things along the way:
As with everything, there are countless ways to go about implementing those steps, but I will try to persuade you that the agile (incremental and iterative) approach brings the most value to the company and the most joy to data scientists.
Agile Data Science Manifesto
I took this from page 6 in the Agile Data Science 2.0 book, so you are encouraged to read the original, but here it is:
- Iterate, iterate, iterate: tables, charts, reports, predictions.
- Ship intermediate output. Even failed experiments have output.
- Prototype experiments over implementing tasks.
- Integrate the tyrannical opinion of data in product management.
- Climb up and down the data-value pyramid as you work.
- Discover and pursue the critical path to a killer product.
- Get meta. Describe the process, not just the end state.
Not all the steps are self-explanatory, and I encourage you to go and read what Russell Jurney had to say, but I hope that the main idea is clear: we share intermediate output, and we iterate to achieve value.
Given the above preliminaries, let us go over a standard week for a Scrum team. We will assume a one-week sprint.
Scrum Team Sprint
Day 1
There are many sprint structure variations, but I will assume that planning is done on Monday morning. The team decides which user stories from the product backlog will move to the sprint backlog. The most pressing issue for our business, as evident from the backlog ranking, is customer fraud: fraudulent transactions are driving our valuable customers off our platform. During the previous backlog refinement session, the team already discussed this task, and the product owner got additional information from the Fraud Investigation team. So during the meeting, the team decides to start with a simple experiment (and is already thinking of interesting iterations further down the road): an initial model based on simple features of the transaction and the participating users. Work is split so that the data scientist can look at the data the team identified for this problem. The data engineer will set up the pipeline for integrating model output into the DWH systems, and the full-stack engineer starts to set up a transaction review page and an alert system for the Fraud Investigation team.
Day 2
At the start of Tuesday, the whole team gathers and shares progress. The data scientist shows a few graphs which indicate that even with limited features, we will have a decent model. At the same time, the data engineer is already halfway through setting up the system to score incoming transactions with the new model. The full-stack engineer is also progressing nicely, and after just a few minutes, everyone is back at their desks working on the agreed tasks.
Day 3
As with Tuesday, the team starts Wednesday with a standup meeting to share their progress. There is already a simple model built, along with some accuracy and error rate numbers. The data engineer shows the infrastructure for the transaction scoring, and the team discusses how the features arrive at the system and what needs to be done for them to be ready for the algorithm. The full-stack engineer shows the admin panel where transaction metadata is displayed, along with the triggering mechanism. Another discussion follows on the threshold value at which the model output triggers a message for a fraud analyst. The team agrees that we need to be able to adjust this value, since different models might have different output distributions, and, depending on other variables, we might want to increase or decrease the number of approved transactions.
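To make the threshold discussion concrete, here is a minimal Python sketch of an adjustable review threshold. The names `ScoredTransaction` and `select_for_review`, the score range of 0 to 1, and the sample values are all assumptions made for illustration; the post does not describe the team's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ScoredTransaction:
    # Hypothetical record: an ID plus the model's fraud score.
    transaction_id: str
    fraud_score: float  # assumed to lie in [0, 1]; higher means more suspicious


def select_for_review(transactions, threshold):
    """Return the transactions whose score meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [t for t in transactions if t.fraud_score >= threshold]


# Made-up scores, purely for illustration.
scored = [
    ScoredTransaction("tx-1", 0.05),
    ScoredTransaction("tx-2", 0.62),
    ScoredTransaction("tx-3", 0.91),
]

# Lowering the threshold sends more transactions to the fraud analysts.
strict = select_for_review(scored, threshold=0.9)
lenient = select_for_review(scored, threshold=0.5)
print([t.transaction_id for t in strict])
print([t.transaction_id for t in lenient])
```

Keeping the threshold as a parameter rather than a constant is what lets the team retune it when a new model with a different score distribution is deployed.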
Day 4
On Thursday, the team already has all the pieces, and during the standup they discuss how to integrate them. The team also outlines how best to monitor models in production, so that model performance can be evaluated and degradation can be detected before it causes any real damage. They agree that a simple dashboard for monitoring accuracy and error rates will suffice for now.
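The numbers behind such a dashboard could be computed along the following lines. This is only a sketch: it assumes that each flagged or approved transaction is eventually labeled by a fraud analyst, and the function name `monitoring_metrics` and the sample labels are hypothetical.

```python
def monitoring_metrics(actual, predicted):
    """Compute accuracy and error rates from paired boolean labels.

    actual[i] is True if transaction i was confirmed as fraud;
    predicted[i] is True if the model flagged it.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # fraud, flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # legitimate, not flagged
    fp = sum(1 for a, p in pairs if not a and p)      # legitimate, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # fraud, missed

    def ratio(numerator, denominator):
        # Avoid dividing by zero when a class is absent from the sample.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Made-up labels for eight reviewed transactions.
actual = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, True, False, False, True, False]
print(monitoring_metrics(actual, predicted))
```

Because fraud is usually rare, accuracy alone can look good while most fraud is missed, which is one reason to track the false positive and false negative rates separately.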
Day 5
Friday is a demo day. During standup, the team discusses the last issues remaining with the first iteration of the transaction fraud detection. Team members prepare for the meeting with the fraud analysts that will be using this solution.
During the demo, the team shows what they have built for the fraud analysts. The team presents performance metrics and their implications for the fraud analysts. All feedback is converted to tasks for future sprints.
Another vital part of the Sprint is the retrospective, a meeting where the team discusses three things:
1. What went well in the Sprint;
2. What could be improved;
3. What we will commit to improving in the next Sprint.
Further down the road
During the next Sprint, the team works on the next most important item from the product backlog. It might be feedback from the fraud analysts, or it might be something else that the product owner thinks will improve the overall business the most. However, the team closely monitors the performance of the initial version of the solution. It will continue to do so because ML solutions are sensitive to changes in the underlying assumptions that the model made about the data distribution.
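One common way to watch for such distribution changes is the Population Stability Index (PSI), which compares the model's score distribution at training time with the distribution seen in production. The post does not name a specific technique, so treat the sketch below as one possible option: the function name, the equal-width binning, and the assumption that scores lie between 0 and 1 are all illustrative choices.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI between two samples of scores assumed to lie in [0, 1].

    Scores are grouped into equal-width bins; eps keeps empty bins
    from producing a division by zero or log(0).
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp the bin index so that a score of exactly 1.0 (or a
            # slightly out-of-range value) lands in the first or last bin.
            idx = max(0, min(int(s * bins), bins - 1))
            counts[idx] += 1
        return [max(c / len(scores), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Made-up score samples: training time versus a later production week.
training_scores = [0.1, 0.2, 0.3, 0.4]
production_scores = [0.6, 0.7, 0.8, 0.9]

print(population_stability_index(training_scores, training_scores))
print(population_stability_index(training_scores, production_scores))
```

Identical distributions give a PSI of zero, and the value grows as the two distributions diverge. The level at which the team should be alerted is a judgment call that would need to be agreed for the specific model.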
Discussion
Above is a relatively "clean" exposition of the Scrum process for data science solutions. The real world is rarely that tidy, but I wanted to convey a few points:
However, Scrum is not a framework for every endeavor. If your job requires you to think deeply for days, then Scrum and agile would probably be very disruptive and counterproductive. Also, if your work requires you to handle many different, small data science tasks, following Scrum would be inappropriate, and maybe Kanban should be considered. However, typical product data science work is not like that. Iteration is king, and getting feedback fast is key to providing the right solutions to business problems.
In summary
Data science is a perfect fit for Scrum with a single modification: we do not expect to ship finished models. Instead, we ship artifacts of our work and solicit feedback from our stakeholders so we can make progress faster. Project managers might not like data science for the unpredictability of its progress, but iteration is not at fault; it is the only way forward.
I would like to know what you think about agile data science. What has worked for you and your team? What didn't work? I hope you will leave a comment!
Translated from: https://towardsdatascience.com/agile-data-science-data-science-can-and-should-be-agile-c719a511b868