

Using Kusto: Finding Kusto Tables with Data Duplication Issues in Python

Published: 2023/12/15


Azure Data Explorer (Kusto) is one of the most dedicated relational databases on the market. The whole system runs on SSD and in memory to offer fast and responsive data analysis, which makes it a good option for warm-path data storage.

For various reasons, such as a malfunctioning client or an imperfect data pipeline, the same data can be ingested into Kusto multiple times, which leads to a data duplication issue. The problem can be even more severe when the ingested data is summary data, such as the overall revenue for a group of stores.

Data duplication can distort all downstream data analysis, and people may make wrong decisions based on it. Therefore, data cleaning/deduplication is necessary. Before that, we first need to confirm whether the current Kusto table has a duplication issue at all.

That confirmation step is the main focus of this article.

The main idea consists of the following steps:

  • Connect to the Kusto cluster.
  • Query the table schema.
  • Create a unique identification per row.
  • Count rows with the same identification.
  • Flag any identification value with count > 1 as a duplicate.
Connect to Kusto Cluster

Python has packages for connecting to Kusto: the Azure Data Explorer Python SDK. Here, we use the package azure-kusto-data.

The following code snippet lets us create a KustoClient, which is used to query the Kusto cluster. Before connecting, we need to create an AppId and register it with the Kusto cluster.

Query Table Schema

getschema returns the table schema, and KustoClient delivers it as a familiar Pandas DataFrame, which makes further processing easy.
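As a sketch (using a hypothetical table name, "StoreSales", and a mocked result), the schema query and the shape of the DataFrame it yields look like this:

```python
import pandas as pd

table = "StoreSales"  # hypothetical table name
schema_query = f"{table} | getschema"

# With a live client, this DataFrame would come from:
#   from azure.kusto.data.helpers import dataframe_from_result_table
#   schema = dataframe_from_result_table(
#       client.execute(database, schema_query).primary_results[0])
# Here we mock the shape getschema returns: one row per table column.
schema = pd.DataFrame({
    "ColumnName": ["StoreId", "Region", "Revenue", "ReportDate"],
    "ColumnType": ["string", "string", "real", "datetime"],
})

print(schema_query)                # StoreSales | getschema
print(list(schema["ColumnName"]))
```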

Unique Identification Per Row

Assume the table is a summary table, with no subset of columns that could uniquely identify a row. Therefore, we use all the columns in the schema to create the identification, which is the concatenation of all the column values.

To do that, we convert all non-string data to strings using the tostring() operator. That is the purpose of schema.apply(axis=1), where axis=1 goes over the schema row by row.

Finally, strcat() from Kusto concatenates all the columns based on the operations defined with hashOp.

If, for another table, we know that a subset of columns uniquely identifies a row, such as the combination of user_id and order_id, we could use the second hashKusto case instead.
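The per-row transformation described above can be sketched as follows. The schema columns are the hypothetical ones from the earlier example, and `hash_op` plays the role of the hashOp mentioned in the text:

```python
import pandas as pd

# Mocked schema DataFrame, one row per table column.
schema = pd.DataFrame({
    "ColumnName": ["StoreId", "Region", "Revenue", "ReportDate"],
    "ColumnType": ["string", "string", "real", "datetime"],
})

def hash_op(row):
    # Non-string columns are wrapped in tostring() so strcat() can join them.
    if row["ColumnType"] == "string":
        return row["ColumnName"]
    return f"tostring({row['ColumnName']})"

# axis=1 applies hash_op to each schema row (i.e., each table column).
parts = schema.apply(hash_op, axis=1)
hash_kusto = f"strcat({', '.join(parts)})"
print(hash_kusto)
# strcat(StoreId, Region, tostring(Revenue), tostring(ReportDate))

# If a known subset of columns is unique, build the expression directly:
hash_kusto_subset = "strcat(tostring(user_id), tostring(order_id))"
```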

Same Identification Value Count and Find Duplications

Notice that the hashKusto value we created above is used in an extend clause of the Kusto query. That creates an additional column, hash, in the Kusto table. We then use summarize to get the count for each identification hash.

Finally, the duplicated records are the ones with recordsCount > 1.
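Putting the pieces together, the duplicate-detection query can be assembled like this (the table name and hash expression are the hypothetical examples used above):

```python
# Hypothetical table and hash expression from the earlier sketches.
table = "StoreSales"
hash_kusto = "strcat(StoreId, Region, tostring(Revenue), tostring(ReportDate))"

# extend adds the hash column, summarize counts rows per hash value,
# and where keeps only the hash values that occur more than once.
dup_query = (
    f"{table}\n"
    f"| extend hash = {hash_kusto}\n"
    f"| summarize recordsCount = count() by hash\n"
    f"| where recordsCount > 1"
)
print(dup_query)
```

Running this query through `client.execute` returns one row per duplicated identification value; an empty result means the table has no duplication issue.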

Takeaway

Using Python, we established a simple and straightforward way to verify and identify duplicated rows within a Kusto table. That offers a solid foundation for all subsequent data analysis.

Originally published at: https://towardsdatascience.com/use-data-brick-to-verify-azure-explore-kusto-data-duplication-issue-36abd238d582
