Data quality improvement: Partnering for data quality

Published: 2023/11/29

Author Vlad Rișcuția is joined for this article by co-authors Wayne Yim and Ayyappan Balasubramanian.

Why data quality?

Data quality is critical to making sound business decisions. By one estimate, poor data quality costs the United States alone $3.1 trillion per year, equating to approximately 16.5 percent of GDP. [1] For a business such as Microsoft, where data-driven decisions are ingrained within the fabric of the company, ensuring high data quality is paramount. Not only is data used to drive, steer, and grow the Microsoft business from a tactical and strategic perspective, but there are also regulatory obligations to produce accurate data for quarterly financial reporting.

History of DataCop

In the Experiences and Devices (E+D) division at Microsoft, a central data team called IDEAs (Insights Data Engineering and Analytics) generates key business metrics that are used to grow and steer the business. As one of its first undertakings, the team created the Office 365 Commercial Monthly Active User (MAU) measure to track the usage and growth of Office 365. This was a complicated endeavor due to the sheer scale of data, the number of Office products and services involved, and the heterogeneous nature of the data pipelines across different products and services. In addition, many other business metrics, tracking the growth and usage of all Office products and services, also needed to be created.

In the process of creating these critical business metrics, it was clear that generating them at scale and in a reliable way with high data quality was of the utmost importance, as key tactical and strategic business decisions would be based on them. In addition, because of the team’s charge to generate key metrics for release with quarterly earnings, producing high quality data was also a regulatory requirement.

The IDEAs team formed a data quality team consisting of program management, engineering, and data science representatives, and set out to investigate internal and external data quality solutions. The team examined internal data quality systems and researched public whitepapers from other companies that worked with huge amounts of data. Members of the team also spent a considerable amount of time with LinkedIn, learning about its data quality system, called “Data Sentinel” [2], with a view to leveraging what LinkedIn had built: LinkedIn had already invested heavily in developing Data Sentinel and is also part of Microsoft.

The vision for a data quality platform in IDEAs was that it would be extensible, scalable, able to work with the multiple data fabrics involved, and be leveraged by the wider data science community at Microsoft. For example, data scientists and data analysts should be able to write data quality checks in languages familiar to them such as Python, R, and Scala, among others, and have these data quality checks operate reliably at scale.

Another key requirement was to have the data quality platform function as a DaaS, or “Data as a Service,” resulting in the need to apply the same “service rigor” in engineering, operations, and processes that were used to create and operate Office 365, the largest SaaS in the world. This meant having very high engineering standards around change management, monitoring, security controls, and auditability, and tightly integrating with Microsoft incident management systems to ensure that systems operate with high availability, efficiency, and security.

In the end, the team decided to build its own extensible data quality system from scratch in order for it to function with the scale and reliability of a DaaS and for it to interface with other internal Microsoft data systems. The initial functional specification was written in late 2018, and by early 2019 DataCop was born. Today, DataCop is part of the DataHub platform that also consists of Data Build and Data Catalog. Data Build generates the datasets required by the business in a compliant and scalable way, and Data Catalog is a search store for all assets that surfaces metadata such as data quality scores from DataCop, as well as access and privacy information. Future articles will describe how Data Catalog and Data Build are used to generate the metrics and insights that drive, steer, and grow the E+D business and serve as critical components of the data quality journey.

Architecture

DataCop is designed with the mindset that no one team can solve this challenge on its own. The data ecosystem at Microsoft consists of multiple data fabrics, with data arriving at latencies that range from minutes to a month. The system must be flexible and simple enough for other developers across Microsoft to add plugins and workers for the data fabrics or quality checks they want to support. As a result, DataCop was built as a distributed message broker based on Azure Service Bus, with quality check results stored in Cosmos DB.

Messages in the message broker must be self-contained and allow workers to work on them exclusively. This allows a message to come either from Orchestrator, to run a scheduled check, or directly from an Azure Data Factory (ADF) pipeline. Every time a data check or new fabric needs to be added, the developer can simply implement an override and develop their own worker process without affecting the rest of the system. The Azure team leveraged this to build on it quickly, as described below.
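The self-contained message and worker override described above can be sketched as follows. This is only an illustration under stated assumptions: the `CheckMessage` fields, the `Worker` base class, and the in-memory stand-in fabric are hypothetical names invented for this example, not DataCop's actual schema or API, and no real Service Bus is involved.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CheckMessage:
    """Self-contained description of one quality check (hypothetical schema)."""
    check_id: str
    fabric: str        # which data fabric the check targets
    dataset: str
    query: str         # what the worker should run against the data
    expected_min: int  # the check passes if the result is at least this

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "CheckMessage":
        return CheckMessage(**json.loads(payload))


class Worker(ABC):
    """Base worker: a new fabric only overrides the fabric-specific step."""

    @abstractmethod
    def run_query(self, message: CheckMessage) -> int:
        """Run the check on the fabric's compute and return a number."""

    def process(self, payload: str) -> dict:
        # Everything needed is in the message, so any worker instance
        # can pick it up without shared state.
        message = CheckMessage.from_json(payload)
        value = self.run_query(message)
        return {
            "check_id": message.check_id,
            "value": value,
            "passed": value >= message.expected_min,
        }


class InMemoryWorker(Worker):
    """Stand-in fabric for this example: looks up a row count in a dict."""

    def __init__(self, row_counts: dict):
        self.row_counts = row_counts

    def run_query(self, message: CheckMessage) -> int:
        return self.row_counts.get(message.dataset, 0)
```

Because `process` depends only on the payload, the same message could be produced by a scheduler or by a pipeline step, and supporting another fabric means writing one more subclass rather than changing the existing ones.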

High-level architectural diagram of DataCop

Workers are run today as Azure Web Jobs. Workers typically leverage another compute in Azure such as Azure Databricks or Azure SQL to execute quality checks against the actual data. Workers are lightweight and used to determine whether the checks are successful. This makes Azure Web Jobs a perfect fit for running them. For consistency, Orchestrator is hosted as a web job as well. Orchestrator is a time-triggered web job that generates the sets of quality checks that need to be executed and puts them in a respective worker-specific service bus queue.
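As a rough sketch of the Orchestrator's role, the snippet below picks the checks that are due and places each one on the queue for the worker type that handles its fabric. The check dictionaries, the `orchestrate` function, and the in-memory `deque` queues are assumptions made for illustration; the real system uses Azure Service Bus queues and its own check definitions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def due_checks(checks, now):
    """Return the checks whose next scheduled run time has arrived."""
    return [c for c in checks if c["last_run"] + c["interval"] <= now]


def orchestrate(checks, queues, now):
    """Put each due check on the queue of the worker for its fabric."""
    for check in due_checks(checks, now):
        queues[check["fabric"]].append(check["check_id"])
        check["last_run"] = now  # so the same run is not scheduled twice
    return queues


# Example run: one check is due, the other ran two hours ago.
now = datetime(2020, 1, 2, 6, 0)
checks = [
    {"check_id": "mau_row_count", "fabric": "sql",
     "interval": timedelta(days=1), "last_run": now - timedelta(days=1)},
    {"check_id": "usage_freshness", "fabric": "databricks",
     "interval": timedelta(days=1), "last_run": now - timedelta(hours=2)},
]
queues = orchestrate(checks, defaultdict(deque), now)
```

Keeping one queue per worker type means a slow fabric only delays its own checks, which fits the description of workers as small, independent jobs.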

The next important part of any data quality system is alerting. All Microsoft services use IcM, the company-wide incident management system. Data alerts are not like service alerts: Data arrives at a higher latency compared to typical services and can be recovered in some situations. If there is a need to restate bad data, an issue can potentially stay open longer, until the data is restated. So, alert suppression has to handle a different set of cases: data that is unavailable for x days because of an upstream issue should result in one alert, and alerts for downstream data that is unavailable because of a common upstream issue should be suppressed.
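The two suppression cases can be sketched with a small in-memory model. This is an assumption-laden illustration, not how DataCop or IcM implement suppression: the `AlertManager` class, its lineage dictionary, and the returned status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertManager:
    """Decides whether a failed check should open a new incident."""
    upstream_of: dict                              # dataset -> direct upstream datasets
    open_incidents: set = field(default_factory=set)

    def _failing_ancestor(self, dataset: str) -> Optional[str]:
        # Walk the lineage upward looking for a dataset with an open incident.
        stack = list(self.upstream_of.get(dataset, []))
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self.open_incidents:
                return current
            stack.extend(self.upstream_of.get(current, []))
        return None

    def on_failure(self, dataset: str) -> str:
        # Case 1: data missing for several days keeps a single incident.
        if dataset in self.open_incidents:
            return "suppressed: incident already open"
        # Case 2: a downstream failure caused by a known upstream issue
        # is a symptom, not a new incident.
        ancestor = self._failing_ancestor(dataset)
        if ancestor is not None:
            return f"suppressed: upstream {ancestor} has an open incident"
        self.open_incidents.add(dataset)
        return "alert raised"

    def on_success(self, dataset: str) -> None:
        # Data arrived or was restated, so the incident can close.
        self.open_incidents.discard(dataset)
```

With a lineage of `report` depending on `curated`, which depends on `raw`, a failure of `raw` raises one alert, repeated failures of `raw` are suppressed, and a failure of `report` is suppressed until `raw` recovers.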

This is a good place to touch upon another important topic in the data quality landscape: Anomaly detection. Data volume and metrics change often and are prone to seasonality. Having an anomaly detection system that can handle seasonality helps with a move away from monitoring data volumes and daily trends to a more sophisticated system. DataCop leverages Azure anomaly detector APIs to measure completeness stats such as file size and a few key metrics along multiple dimensions. This is a work in progress with further updates to come.
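To make the point about seasonality concrete, here is a deliberately simple stand-in technique: compare today's value with values from the same point in earlier cycles (for example, earlier Mondays) instead of with yesterday. It does not use or imitate the Azure anomaly detector APIs mentioned above; the function name, the weekly period, and the threshold of three standard deviations are assumptions chosen for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, today, period=7, threshold=3.0):
    """Flag today's value if it is far from past values at the same phase.

    history: past daily values, oldest first; `today` is the next value.
    period: length of the seasonal cycle in days (7 means weekly).
    """
    # Values exactly one, two, three, ... cycles before today.
    same_phase = history[-period::-period]
    if len(same_phase) < 3:
        return False  # not enough history to judge
    baseline = mean(same_phase)
    spread = stdev(same_phase)
    if spread == 0:
        return today != baseline
    return abs(today - baseline) / spread > threshold
```

A metric that drops every weekend would trip a naive day-over-day volume check each Saturday, whereas this comparison treats a low Saturday as normal and a low Monday as suspicious.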

Data quality score for data assets in the DataCop User Interface

It was apparent that developers need a way to quickly author data quality checks and also deploy them. As a result, we integrated with Azure DevOps workflow to automatically deploy these data quality monitors. Today, the IDEAs team runs close to 2000 tests on about 750 key datasets that include externally reported financial metrics.

Partnership between M365 and Azure

The Customer Growth and Analytics team (CGA) is a centralized data science team in the Cloud+AI division at Microsoft. The team’s mission is to learn from customers and empower them to make the most of Azure services. [3]

Last year, as CGA’s scope was growing, an effort began to standardize technologies. Having a smaller number of technologies upon which CGA’s data platform is built makes it easier to move engineering resources as needed, share knowledge, and in general increase the reliability of the overall system. The use of Azure PaaS offerings reduced the need for writing custom code. The team standardized on Azure Data Factory for data movement and Azure Monitor for monitoring, among others. Unfortunately, at this writing, Azure doesn’t offer a PaaS data quality testing framework.

CGA realized the need for a reliable and scalable data quality solution, especially as the data platform evolved to support more and more production workloads where data issues can have large impacts, and so evaluated multiple options.

CGA tried out several data quality testing solutions with the code base, but quickly realized they were built for smaller projects, made some rigid assumptions, and would require significant investment to scale out to cover the entire platform.

Discussions with other data science organizations within the company to see how they were handling this led to LinkedIn and an introduction to Data Sentinel. Its main limitation is that it runs exclusively on Spark. CGA must support multiple data fabrics: In some cases, different compute scenarios require the specific best solution for the job, such as Azure Data Explorer for analytics or Azure Data Lake Storage and Azure Machine Learning for ML workloads. In other cases, data ingested from other teams comes from a variety of storage locations: Azure SQL, blob storage, and Azure Data Lake Storage gen1, among others.

Further outreach led to discussions with the M365 data science team and led to an introduction to DataCop, the solution described in this article. Its capabilities were compelling: Test scheduling, integration with the standard Microsoft alerting platform, and a declarative way of describing tests. Its main limitation was that DataCop didn’t support Azure Data Explorer.

Because Azure Data Explorer (ADX) is core to CGA’s platform, this could have been a showstopper, but in true One Microsoft spirit, the DataCop team was more than happy to work with CGA to light up the missing capability. The teams agreed to treat this as an “internal open source” project, with CGA contributing code to the DataCop solution from which both teams could benefit. Due to its flexible design, adding ADX capabilities was significantly easier than the alternative (investing in a home-grown solution).

DataCop extended with Azure Data Explorer support.

CGA deployed an instance of DataCop in its environment and over the following months had a big data quality push, including training the team on how to author tests and increasing test coverage to 100 percent of the datasets in CGA’s platform. At the time of writing, CGA has around 400 tests covering close to 300 key datasets. Over the past 30 days, CGA ran more than 4000 tests, identifying and quickly acting to mitigate multiple data issues that would have caused significant anomalies in CGA’s system. Onboarding DataCop saved significant engineering effort, which was refocused on test authoring.

Closing thoughts/summary

This article described DataCop, the data quality solution developed by the M365 data team in partnership with the Azure data team.

Data quality is a critical aspect of a business, both for informing decisions and for regulatory obligations.
The diverse data fabrics in use and their huge scale led to development of DataCop, a data quality solution for supporting the Microsoft business.
DataCop is a cloud-native Azure solution, consisting of a set of web jobs that communicate via service bus.
The plug-in architecture allowed the CGA team to quickly develop an Azure Data Explorer test runner and expand the scope of DataCop from the M365 team to also cover the Azure business.
Today, DataCop runs hundreds of tests every day to ensure the quality of data throughout multiple systems on both teams.

Vlad Rișcuția is on LinkedIn.

[1] The Four V’s of Big Data, IBM, 2016.

[2] Data Sentinel: Automating Data Validation, LinkedIn, March 2010.

[3] Using Azure to Understand Azure, by Ron Sielinski, January 2020.

Translated from: https://medium.com/data-science-at-microsoft/partnering-for-data-quality-dc9123557f8b