Delta Lake with Spark: What and Why?
Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:

- Data loss when a job fails midway through overwriting a dataset.
- The difficulty of dealing with updates to existing data.
Sometimes I solved the above with design changes, sometimes by introducing another layer such as Aerospike, and sometimes by maintaining historical incremental data.
Maintaining historical data is often the most immediate solution, but I don't really like dealing with historical incremental data unless it is truly required, because (at least for me) it introduces the pain of backfills after failures, which are unlikely but inevitable.
The above two problems are "problems" because Apache Spark does not really support ACID. I know working with transactions was never Spark's use case (hello, you can't have everything), but there are scenarios, like my two problems above, where ACID compliance would come in handy.
When I read about Delta Lake and its ACID compliance, I saw it as a possible solution to both problems. Please read on to find out how the two problems relate to Spark's ACID shortcomings, and how Delta Lake can be seen as a savior.
What is Delta Lake?
The Delta Lake documentation introduces Delta Lake as:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake key points:
- Supports ACID transactions
- Enables time travel
- Enables UPSERT
How does Spark fail ACID?
Consider the following piece of code to remove duplicates from a dataset:
考慮以下代碼,從數(shù)據(jù)集中刪除重復(fù)項(xiàng):
# Read from HDFS
df = spark.read.parquet("/path/on/hdfs")             # Line 1

# Remove duplicates
df = df.distinct()                                   # Line 2

# Cache the result
df.cache()                                           # Line 3

# Overwrite the data
df.write.parquet("/path/on/hdfs", mode="overwrite")  # Line 4
For my Spark application running the above piece of code, consider a scenario where it fails at Line 4, that is, while writing the data. This may or may not lead to data loss. [Problem #1, as mentioned above.] You can replicate the scenario by creating a test dataset and killing the job while it is in the write stage.
Let us try to understand the ACID failure in Spark with the above scenario.
A in ACID stands for Atomicity.
What is Atomicity: Either all changes take place or none; the system is never in a halfway state.
How Spark fails: While writing data (at Line 4 above), if a failure occurs at a stage where the old data has been removed but the new data is not yet written, data loss occurs. We have lost the old data and could not write the new data because of the job failure: atomicity fails. [The exact behavior varies with the file output committer used; please do read about file output committers to see how data writing takes place. The scenario I describe applies to v2.]
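As a side note, the committer algorithm is selected through a Hadoop setting passed to Spark. Shown below as a hypothetical `spark-defaults.conf` entry; which version suits you depends on your storage system and failure-tolerance needs:

```
# spark-defaults.conf
# v2 moves task output into place as each task commits (faster, but a mid-job
# failure can leave partially visible output); v1 defers moves to job commit.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```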
C in ACID stands for Consistency.
What is Consistency: Data must be consistent and valid in the system at all times.
How Spark fails: As seen above, in the case of failure and data loss, we are left with invalid data in the system: consistency fails.
I in ACID stands for Isolation.
What is Isolation: Multiple transactions occur in isolation, without interfering with one another.
How Spark fails: Consider two jobs running in parallel, one as described above and another that is using the same dataset. If one job overwrites the dataset while the other is still reading it, the second job might fail: isolation fails.
D in ACID stands for Durability.
What is Durability: Changes once made are never lost, even in the case of system failure.
How Spark might fail: Spark doesn't really affect durability, which is mainly governed by the storage layer; but since we lose data in the case of job failures, in my opinion it is a durability failure.
How does Delta Lake support ACID?
Delta Lake maintains a delta log in the path where data is written. The delta log records details like:
Metadata, such as:

- Paths added in the write operation
- Paths removed in the write operation
- Data size
- Changes in data
- Data schema
Commit information, such as:

- Number of output rows
- Output bytes
- Timestamp
Sample log file in _delta_log_ directory created after some operations:
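Since the original screenshot is not reproduced here, a simplified, illustrative commit file (one JSON object per line; all field values below are made up, and elided fields are marked with ellipses) might look like:

```json
{"commitInfo":{"timestamp":1570649460000,"operation":"WRITE","operationMetrics":{"numFiles":"1","numOutputRows":"100","numOutputBytes":"1024"}}}
{"metaData":{"id":"a1b2c3","format":{"provider":"parquet"},"schemaString":"..."}}
{"add":{"path":"part-00000-....parquet","size":1024,"dataChange":true}}
{"remove":{"path":"part-00000-....parquet","dataChange":true}}
```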
After a successful execution, a log file is created in the _delta_log_ directory. The important thing to note is that when you save your data as Delta, no files, once written, are removed. The concept is similar to versioning.
By keeping track of the paths removed and added, along with other metadata, in _delta_log_, Delta Lake is ACID-compliant.
Versioning enables Delta Lake's time travel property: I can go back to any past state of the data, because all of this information is maintained in _delta_log_.
How does Delta Lake solve my two problems?
- With the support for ACID, if my job fails during the overwrite operation, data is not lost, as the changes won't be committed to the log file in the _delta_log_ directory. Also, since Delta Lake does not remove old files during an overwrite, the old state of my data is preserved and there is no data loss. (Yes, I have tested it.)
- Delta Lake supports the update operation mentioned above, which makes dealing with updates in data much easier.
Until next time, ciao.
翻譯自: https://towardsdatascience.com/delta-lake-with-spark-what-and-why-6d08bef7b963