go 捕获数据库新增数据_更改数据捕获的经验教训
go 捕獲數據庫新增數據
The article summarizes experiences from various projects with a log-based change data capture (CDC). There are many use cases for which CDC is beneficial. Some DBs even have CDC functionality integrated without requiring a separate tool.
本文總結了使用基于日志的變更數據捕獲(CDC)的各種項目的經驗。 CDC有益于很多用例。 一些DB甚至集成了CDC功能,而無需單獨的工具。
變更檢測 (Change detection)
Detecting changes in transactional source systems often turn out as slow and not scalable. Lightweight, high-performance change detection is required. Well-known approaches are: change columns, triggers and log-based CDC.
檢測事務性源系統中的更改通常會很慢且不可擴展。 需要輕巧,高性能的變化檢測。 眾所周知的方法是:更改列,觸發器和基于日志的CDC。
Change columns
變更欄
Tables in the source system get created with an additional column, such as a timestamp. Such a column can be used to identify new or changed data records.
源系統中的表是使用附加列(例如時間戳)創建的。 這樣的列可用于標識新的或更改的數據記錄。
However, this solution does not work for deleted records. Another disadvantage is that the application needs to be changed with update logic for the column.
但是,此解決方案不適用于已刪除的記錄。 另一個缺點是應用程序需要使用列的更新邏輯進行更改。
Triggers
扳機
Database triggers can protocol any data changes to a log table. A process transfers the data from the log table to the target.
數據庫觸發器可以將任何數據更改記錄到日志表中。 進程將數據從日志表傳輸到目標。
This approach recognizes deletes. Additionally, the application does not need to be changed.
這種方法識別刪除。 此外,不需要更改應用程序。
However, triggers put a strain on the source database from a performance point of view. If a transaction commits, it must wait until the trigger finished, too.
但是,從性能的角度來看,觸發器對源數據庫造成了壓力。 如果事務提交,它也必須等到觸發器完成。
Log-based CDC
基于日志的CDC
Databases write changes into their transaction log. Backup and recovery need transaction logs. Additionally, sequential writes into a transaction log are much faster compared to random writes into data files.
數據庫將更改寫入其事務日志。 備份和恢復需要事務日志。 此外,與對數據文件的隨機寫入相比,對事務日志的順序寫入要快得多。
Open source tools like Debezium or commercial tools like Oracle Goldengate, IBM IIDR read changes from the log and replicate the changes to the target system.
開源工具(如Debezium)或商業工具(如Oracle Goldengate,IBM IIDR)從日志中讀取更改,并將更改復制到目標系統。
基于日志的CDC的優缺點 (Pros and Cons of log-based CDC)
Log-based CDC has fundamental advantages over the other change detection methods listed:
與其他列出的變更檢測方法相比,基于日志的CDC具有根本優勢:
The additional performance impact on the source system is low.
對源系統的附加性能影響很小。
CDC enables the implementation of near real-time architectures with the possibility of faster and more accurate decisions based on the target systems.
CDC可以實現近實時架構,并有可能根據目標系統做出更快,更準確的決策。
No significant changes to the application in the source system are necessary. Changes relate to, for example, the setting of database parameters to generate extended transaction log entries, or installing the CDC software.
無需對源系統中的應用程序進行重大更改 。 更改涉及例如設置數據庫參數以生成擴展的事務日志條目或安裝CDC軟件。
CDC reduces the amount of data transmitted over the network compared to the other detection methods described above.
與上述其他檢測方法相比,CDC 減少了通過網絡傳輸的數據量 。
The coupling between source and target is asynchronous. A slow consumer does not affect the source database.
源與目標之間的耦合是異步的 。 緩慢的使用者不會影響源數據庫。
Log-based CDC also has some disadvantages you must be aware of:
基于日志的CDC還具有一些必須意識到的缺點:
Many commercial tools require an additional license.
許多商業工具需要額外的許可證 。
Avoid no-logging transactions as these bypasses the transaction log. However, most databases allow force logging through suitable parameter settings.
避免不記錄事務,因為它們會繞過事務日志。 但是,大多數數據庫都允許通過適當的參數設置進行強制日志記錄。
Separate tools require operations and require additional know-how.
單獨的工具需要操作,并且需要其他專門知識 。
Primary or unique keys are required for many log-based CDC tools — but a good database design will guarantee this requirement anyway.
許多基于日志的CDC工具都需要主鍵或唯一鍵 -但是良好的數據庫設計將始終保證這一要求。
- If the target system is down, transaction logs must be kept until the target absorbed the changes. 如果目標系統關閉,則必須保留事務日志,直到目標吸收了更改為止。
基于日志的CDC如何工作? (How does log-based CDC work?)
The diagram above shows several uses of log-based CDC. Users or applications change data in the source database, e.g. insert, update, or delete data. The database writes all changes into the transaction log (or Write-ahead log) before integrating the changes into the data files.
上圖顯示了基于日志的CDC的幾種用法。 用戶或應用程序更改源數據庫中的數據,例如插入,更新或刪除數據。 在將更改集成到數據文件之前,數據庫將所有更改寫入事務日志(或預寫日志)。
The RDBMS example shows a user inserting data and a second user updating the same data in the next transaction. The transaction log contains the change protocol until some retention-interval clears the entries. The data file contains just the current state.
RDBMS示例顯示了一個用戶插入數據,而第二個用戶在下一個事務中更新了相同的數據。 事務日志包含更改協議,直到某個保留間隔清除條目為止。 數據文件僅包含當前狀態。
Databases use transaction logs primarily for backup and recovery. But the data can also be used to replicate changes into a target system. CDC tools use the database API to read from the transaction logs (or from the memory if the data is still in the DB cache) and transfer the data to a target system.
數據庫將事務日志主要用于備份和恢復。 但是數據還可以用于將更改復制到目標系統中。 CDC工具使用數據庫API從事務日志中讀取(如果數據仍在數據庫高速緩存中,則從內存中讀取)并將數據傳輸到目標系統。
The example shows a logical view of a CDC tool reading data from the transaction log of the source system and submitting the changes to target databases or a target data lake. There is no persistent storing of the data stream within the CDC software. Kafka could be used as a log-centric approach to capture the changes long-time and submitting the changes to further target systems.
該示例顯示了CDC工具的邏輯視圖,該工具從源系統的事務日志中讀取數據,并將更改提交給目標數據庫或目標數據湖。 CDC軟件中沒有持久存儲數據流。 Kafka可以用作以日志為中心的方法,以長期捕獲更改并將更改提交給其他目標系統。
The diagram shows two cases of data propagation:
該圖顯示了兩種數據傳播情況:
Replication-mode (“Target Mirror DB”)
復制模式(“目標鏡像數據庫”)
Replication creates a copy of the source: updates will change the data in the target. Deletes will remove data from the target.
復制將創建源的副本:更新將更改目標中的數據。 刪除將從目標中刪除數據。
The target system does not necessarily have the same structure as the source system. It is possible to only choose a subset of tables or to replicate some selected columns within a table or filter data.
目標系統不一定具有與源系統相同的結構。 可以僅選擇表的子集,也可以復制表中的某些選定列或過濾數據。
The example shows the target DB, which contains the current state of the changes.
該示例顯示了目標數據庫,其中包含更改的當前狀態。
Auditing-mode (“Target DWH DB” and “Target Data Lake”)
審核模式(“ Target DWH DB”和“ Target Data Lake”)
Auditing-mode keeps the whole data history. The CDC tool converts updates and deletes into inserts in the “Target DWH DB” or “Target Data Lake”. A flag indicates the kind of operation (insert, update, delete) and a timestamp field tracks when the event did happen.
審核模式保留整個數據歷史記錄。 CDC工具將更新和刪除轉換為“ Target DWH DB”或“ Target Data Lake”中的插入。 標志指示操作的類型(插入,更新,刪除),時間戳字段跟蹤事件的發生時間。
The example shows the “Target DWH DB” and “Target Data Lake”, which contain the complete protocol of changes, including operation (insert, update, delete) and timestamp.
該示例顯示了“ Target DWH DB”和“ Target Data Lake”,其中包含完整的更改協議,包括操作(插入,更新,刪除)和時間戳。
用例 (Use Cases)
There are a variety of use cases for CDC:
CDC有多種用例:
Replicate changes into a DWH or data lake.
將更改復制到DWH或數據湖中 。
Replicate changes into Kafka in a microservices architecture.
將更改復制到微服務架構中的Kafka中 。
Upgrade a database to a higher or the latest version with minimal downtime, e.g. upgrade PostgreSQL 11.9 to PostgreSQL 12.4.
以最少的停機時間將數據庫升級到更高版本或最新版本,例如將PostgreSQL 11.9升級到PostgreSQL 12.4。
Migrate data from database X to database Y with minimal downtime, e.g. migrate MariaDB to PostgreSQL or vice versa.
將數據從數據庫X 遷移到數據庫Y的停機時間最少,例如,將MariaDB遷移到PostgreSQL,反之亦然。
Transfer data into the cloud over a secure connection.
通過安全連接 將數據傳輸到云中。
工裝 (Tooling)
Commercial tools like Oracle Golden Gate, IBM IIDR, and many others gained a high maturity, and the performance is excellent even during peak loads. These tools can replicate data between heterogeneous database systems.
諸如Oracle Golden Gate,IBM IIDR等許多商業工具獲得了很高的成熟度,即使在高峰負載期間,其性能也非常出色。 這些工具可以在異構數據庫系統之間復制數據。
Additionally, there are some other interesting products:
此外,還有其他一些有趣的產品:
Debezium is an open-source product built on top of Kafka. Debezium has connectors to pull a change stream from DBs like PostgreSQL, MySQL, MongoDB, Cassandra, etc. and send the data to Kafka. There are some exciting activities in the open-source community going on.
Debezium 是建立在Kafka之上的開源產品。 Debezium具有連接器,可從PostgreSQL,MySQL,MongoDB,Cassandra等數據庫中提取更改流,并將數據發送到Kafka。 開源社區中正在進行一些激動人心的活動。
DynamoDB Streams can be enabled for the NoSQL DB DynamoDB. DynamoDB Streams captures table activities and provides JSON streams.
可以為NoSQL DB DynamoDB啟用DynamoDB流 。 DynamoDB Streams捕獲表活動并提供JSON流。
Tables in VoltDB can be declared as export targets. The declaration means that VoltDB will send a stream of changes into Kafka for such tables.
VoltDB中的表可以聲明為導出目標。 該聲明意味著VoltDB將針對此類表的更改流發送到Kafka中。
結論 (Conclusion)
CDC is an essential component in modern architectures for transferring transactional data from systems into a data stream.
CDC是現代體系結構中必不可少的組件,用于將事務數據從系統傳輸到數據流中。
CDC enables the provisioning of transactional data in real-time without causing a significant load to the source system, requires no changes in the source application and reduces the transferred amount of data to a minimum.
CDC可以實時提供事務數據,而不會給源系統造成很大的負擔,不需要更改源應用程序,并將傳輸的數據量減少到最低限度。
Keeping data in only one system is not always possible. Be cautious with personal data. You have to know all appearances of personal data for deletion requirements according to GDPR, for example. A data catalog can help to keep track of replicated sensitive data.
并非始終只能將數據保存在一個系統中。 注意個人數據。 例如,您必須了解所有出現的個人數據,以符合GDPR的刪除要求。 數據目錄可以幫助跟蹤復制的敏感數據。
翻譯自: https://medium.com/daimler-tss-tech/change-data-capture-lessons-learnt-7976391cf78d
go 捕獲數據庫新增數據
總結
以上是生活随笔為你收集整理的go 捕获数据库新增数据_更改数据捕获的经验教训的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 【项目管理】项目工作绩效域管理
- 下一篇: SQL语句求解同一人物不同日期,某一属性