Real-time data exchange platform - BottledWater-pg with Confluent
標(biāo)簽
PostgreSQL , Bottled Water , Kafka , Confluent , IoT
Background
Most of us have borrowed books from a library, and as children we passed good books around among friends.
Lending a book is a bit like a data pump: one copy of the data is shared with subscribers in real time through the pump.
In an IoT scenario, for example, there are requirements for streaming analytics, for storing historical data, and for data mining; a search engine may need the same data, and other business systems may need it as well.
But if all of the data is kept in one place, these many workloads pull in different directions: some need real-time processing, some need batch processing, some need to update data in real time, and some need to analyze very large volumes of data.
Clearly, no single product can satisfy all of these requirements.
This is why databases are split into relational and NoSQL systems, and into OLTP-oriented and OLAP-oriented systems: one product cannot cover every business need.
In the enterprise, the usual answer is to duplicate data so that each scenario can be served by the system best suited to it.
So how can we share data more effectively, keep the copies consistent, and make the sharing as close to real time as possible?
Confluent Platform
http://docs.confluent.io/3.1.0/platform.html
Confluent is a real-time data hub: data from many platforms can flow through Confluent so that it can be shared and exchanged.
Examples include data from IoT sensors, from databases, from HTTP requests and mobile apps, from application logs, and from various event triggers.
Confluent depends on a few basic components, the core one being Kafka.
Users can write their own message producers and consumers and exchange data on the platform that Confluent provides.
BottledWater-pg
bottledwater-pg is a message producer for the Confluent platform, specific to PostgreSQL: it writes PostgreSQL data into Confluent Kafka so that it can be shared with message subscribers in real time.
It supports PostgreSQL 9.4 and above, and supports both a full initial snapshot and continuous streaming of incremental changes into Kafka.
bottledwater-pg uses PostgreSQL's snapshot mechanism to read a consistent snapshot and write it to Kafka, and uses logical decoding to parse row data out of the PostgreSQL WAL and write it to Kafka.
In Kafka, each topic corresponds to one database table.
After rows are decoded from the WAL and before they are written to Kafka, they are serialized; Bottled Water uses Avro for this, although JSON, Protobuf, Thrift, or any number of other formats could in principle play the same role.
Avro supports a rich set of data types and therefore maps well onto PostgreSQL's rich type system.
For the reasoning behind using Avro, see
http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html
BottledWater-pg dependencies
BottledWater-pg is a PostgreSQL extension. Its job is to decode the WAL, serialize the rows with Avro (or as JSON/Protobuf/Thrift/other formats), and write them to Kafka.
It therefore depends on the following libraries and software:
- PostgreSQL 9.4+ development libraries (PGXS and libpq). (Homebrew: brew install postgresql; Ubuntu: sudo apt-get install postgresql-server-dev-9.5 libpq-dev)
- libsnappy, a dependency of Avro. (Homebrew: brew install snappy; Ubuntu: sudo apt-get install libsnappy-dev)
- avro-c (1.8.0 or later), the C implementation of Avro. (Homebrew: brew install avro-c; others: build from source)
- Jansson, a JSON parser. (Homebrew: brew install jansson; Ubuntu: sudo apt-get install libjansson-dev)
- libcurl, a HTTP client. (Homebrew: brew install curl; Ubuntu: sudo apt-get install libcurl4-openssl-dev)
- librdkafka (0.9.1 or later), a Kafka client. (Ubuntu universe: sudo apt-get install librdkafka-dev, but see known gotchas; others: build from source)

Deploying BottledWater-pg
gcc, cmake
It is best to install fairly recent versions; otherwise you may run into compilation problems.
gcc 6.2.0
python 2.7.12
cmake 3.6.3

vi /etc/ld.so.conf
/home/digoal/gcc6.2.0/lib
/home/digoal/gcc6.2.0/lib64
/home/digoal/python2.7.12/lib

ldconfig

export LD_LIBRARY_PATH=/home/digoal/gcc6.2.0/lib:/home/digoal/gcc6.2.0/lib64:/home/digoal/python2.7.12/lib:$LD_LIBRARY_PATH
export PATH=/home/digoal/gcc6.2.0/bin:/home/digoal/python2.7.12/bin:/home/digoal/cmake3.6.3/bin:$PGHOME/bin:$PATH:.

snappy
Optional: an efficient compression and decompression library.
Since Avro also supports xz, snappy can be skipped.
# http://google.github.io/snappy/
wget https://github.com/google/snappy/archive/1.1.3.tar.gz
tar -zxvf 1.1.3.tar.gz
cd snappy-1.1.3
yum install -y libtool gcc-c++
./autogen.sh
./configure --prefix=/home/digoal/snappy_home
make
make install

- add LIBDIR to the `LD_LIBRARY_PATH' environment variable during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'

libjansson (libjansson >= 2.3)
A JSON parser; required. For testing it is recommended to install it into the default prefix, otherwise you may hit compilation problems or have to set an rpath.
http://www.digip.org/jansson/

wget http://www.digip.org/jansson/releases/jansson-2.9.tar.bz2
tar -jxvf jansson-2.9.tar.bz2
cd jansson-2.9
./configure --prefix=/home/digoal/jansson
make
make install

- add LIBDIR to the 'LD_LIBRARY_PATH' environment variable during execution
- add LIBDIR to the 'LD_RUN_PATH' environment variable during linking
- use the '-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to '/etc/ld.so.conf'

export PKG_CONFIG_PATH=/home/digoal/jansson/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags --libs jansson
-I/home/digoal/jansson/include -L/home/digoal/jansson/lib -ljansson

For testing it is recommended to install into the default path instead, as follows.
./configure
make
make install
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

liblzma
yum install -y xz-*

boost
Optional: boost is only needed if you want to build the Avro documentation.
# http://www.boost.org/
# https://sourceforge.net/projects/boost/files/boost/1.62.0/
wget 'http://downloads.sourceforge.net/project/boost/boost/1.62.0/boost_1_62_0.tar.bz2?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Fboost%2Ffiles%2Fboost%2F1.62.0%2F&ts=1480929211&use_mirror=ncu'
tar -jxvf boost_1_62_0.tar.bz2
cd boost_1_62_0/libs/regex/build

To build the static library, run make -fgcc.mak; to build the shared library, run make -fgcc-shared.mak.

ll gcc
drwxr-xr-x 2 digoal users 4.0K Dec  5 17:18 boost_regex-gcc-1_53
drwxr-xr-x 2 digoal users 4.0K Dec  5 17:17 boost_regex-gcc-1_53_shared
drwxr-xr-x 2 digoal users 4.0K Dec  5 17:19 boost_regex-gcc-d-1_53
drwxr-xr-x 2 digoal users 4.0K Dec  5 17:18 boost_regex-gcc-d-1_53_shared
-rw-r--r-- 1 digoal users 2.6M Dec  5 17:18 libboost_regex-gcc-1_53.a
-rwxr-xr-x 1 digoal users 1.3M Dec  5 17:17 libboost_regex-gcc-1_53.so
-rw-r--r-- 1 digoal users  17M Dec  5 17:19 libboost_regex-gcc-d-1_53.a
-rwxr-xr-x 1 digoal users 7.4M Dec  5 17:18 libboost_regex-gcc-d-1_53.so

libboost_regex-gcc-1_53.a    : release static library
libboost_regex-gcc-1_53.so   : release dynamic (shared) library
libboost_regex-gcc-d-1_53.a  : debug static library
libboost_regex-gcc-d-1_53.so : debug dynamic (shared) library

avro (1.8.0 or later)
http://avro.apache.org/
http://www.apache.org/dyn/closer.cgi/avro/
wget http://mirrors.hust.edu.cn/apache/avro/avro-1.8.1/avro-src-1.8.1.tar.gz
tar -zxvf avro-src-1.8.1.tar.gz
cd avro-src-1.8.1/lang/c
mkdir build
cd build

# If snappy was installed into a non-default prefix:
# cmake .. -DCMAKE_INSTALL_PREFIX=/home/digoal/avro -DCMAKE_BUILD_TYPE=Release -DSNAPPY_LIBRARIES=/home/digoal/snappy_home/lib -DSNAPPY_INCLUDE_DIR=/home/digoal/snappy_home/include

yum install -y zlib-devel.x86_64
cmake .. -DCMAKE_INSTALL_PREFIX=/home/digoal/avro -DCMAKE_BUILD_TYPE=Release -DTHREADSAFE=true
make
make test
make install

The "RelWithDebInfo" build type will build an optimized copy of the library, including debugging symbols. Use the "Release" build type if you don't want debugging symbols. Use the "Debug" build type if you want a non-optimized library, with debugging symbols.
On Unix, you can request thread-safe versions of the Avro library's global functions by defining the THREADSAFE cmake variable. Just add the following to your cmake invocation: -DTHREADSAFE=true

libcurl
yum install -y libcurl-devel.x86_64

librdkafka
git clone https://github.com/edenhill/librdkafka

# To install into a non-default prefix:
# ./configure --prefix=/home/digoal/librdkafka_home
# make -j 32
# make install
# export PKG_CONFIG_PATH=/home/digoal/avro/lib/pkgconfig:/home/digoal/librdkafka_home/lib/pkgconfig:/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

For testing it is recommended to install into the default prefix, otherwise you may again run into compilation errors.
./configure
make
make install
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

PostgreSQL
Installation omitted here.
bottledwater-pg
Since bottledwater-pg is a PostgreSQL extension, PostgreSQL must be installed first.
git clone https://github.com/confluentinc/bottledwater-pg
cd bottledwater-pg
make
make install

vi /etc/ld.so.conf
/home/digoal/avro/lib
ldconfig

or

export LD_LIBRARY_PATH=/home/digoal/avro/lib:$LD_LIBRARY_PATH

You may need to restart the database so that bottledwater.so can be loaded.
Create the extension in the database:
psql
postgres=# create extension bottledwater ;
CREATE EXTENSION

Deploying the Confluent platform
This step is the foundation: only with Confluent up and running can we test bottledwater-pg producing messages and then consuming them from the Confluent platform.
http://docs.confluent.io/3.1.0/installation.html
Configuring PostgreSQL for streaming replication
bottledwater-pg needs to decode logical rows from the WAL, so wal_level must be set to logical.
The number of WAL sender processes must be >= 1.
The number of worker processes must be >= 1.
Bottled Water also uses a replication slot, both so that it can resume from where it left off and so that WAL that has not yet been decoded is not removed or recycled by the primary; max_replication_slots must therefore be >= 1.
postgresql.conf

max_worker_processes = 8
wal_level = logical
max_wal_senders = 8
wal_keep_segments = 256
max_replication_slots = 4

Also, so that the WAL subscriber can connect to the database over the streaming replication protocol, pg_hba.conf must be configured:
pg_hba.conf

local   replication   digoal                  trust
host    replication   digoal   127.0.0.1/32   trust
host    replication   digoal   0.0.0.0/0      md5

Create a user with the replication role attribute:
create role digoal login replication encrypted password 'digoal123';

Producing messages with bottledwater-pg
The purpose of the bottledwater client program is to decode the WAL and write the changes to Kafka.
Its command-line options configure how to connect to the database (over a streaming replication connection), the output format, the topic prefix (the database name is a good choice), whether to take an initial snapshot, whether tables without a primary key are allowed, the address and port of the Kafka brokers, and the address and port of the schema registry,
as well as some Kafka-related configuration.
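For orientation, a typical invocation might look like the sketch below. The host names, credentials and database name are placeholders carried over from the setup steps above, and every option used here is documented in the --help output that follows.

# A sketch only; connection parameters are placeholders.
cd bottledwater-pg/kafka

./bottledwater \
  --postgres=postgres://digoal:digoal123@127.0.0.1:5432/postgres \
  --slot=bottledwater \
  --broker=localhost:9092 \
  --schema-registry=http://localhost:8081 \
  --output-format=avro \
  --topic-prefix=postgres \
  --on-error=log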
cd bottledwater-pg/kafka
./bottledwater --help

Exports a snapshot of a PostgreSQL database, followed by a stream of changes,
and sends the data to a Kafka cluster.

Usage:
  ./bottledwater [OPTION]...

Options:
  -d, --postgres=postgres://user:pass@host:port/dbname   (required)
                          Connection string or URI of the PostgreSQL server.
  -s, --slot=slotname     Name of replication slot   (default: bottledwater)
                          The slot is automatically created on first use.
  -b, --broker=host1[:port1],host2[:port2]...   (default: localhost:9092)
                          Comma-separated list of Kafka broker hosts/ports.
  -r, --schema-registry=http://hostname:port   (default: http://localhost:8081)
                          URL of the service where Avro schemas are registered.
                          Used only for --output-format=avro.
                          Omit when --output-format=json.
  -f, --output-format=[avro|json]   (default: avro)
                          How to encode the messages for writing to Kafka.
  -u, --allow-unkeyed     Allow export of tables that don't have a primary key.
                          This is disallowed by default, because updates and
                          deletes need a primary key to identify their row.
  -p, --topic-prefix=prefix
                          String to prepend to all topic names.
                          e.g. with --topic-prefix=postgres, updates from table
                          'users' will be written to topic 'postgres.users'.
  -e, --on-error=[log|exit]   (default: exit)
                          What to do in case of a transient error, such as
                          failure to publish to Kafka.
  -x, --skip-snapshot     Skip taking a consistent snapshot of the existing
                          database contents and just start streaming any new
                          updates. (Ignored if the replication slot already
                          exists.)
  -C, --kafka-config property=value
                          Set global configuration property for Kafka producer
                          (see --config-help for list of properties).
  -T, --topic-config property=value
                          Set topic configuration property for Kafka producer.
  --config-help           Print the list of configuration properties. See also:
                          https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
  -h, --help              Print this help text.

The Kafka configuration properties accepted by bottledwater are listed by --config-help:
./bottledwater --config-help 2>&1 | less

The output is the librdkafka configuration reference: every global and topic configuration property, with the property name, whether it applies to the consumer (C), the producer (P) or both (*), the allowed range, the default value, and a description. The same reference is maintained at https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md.

Consuming messages
Since the data stored in Confluent is Avro-encoded binary, consuming it requires an Avro-aware consumer, such as the kafka-avro-console-consumer shown below.
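As an aside, if bottledwater was started with --output-format=json instead, an ordinary console consumer is enough; a sketch, assuming the Confluent distribution's standard bin/kafka-console-consumer wrapper:

# Plain JSON messages need no schema-registry-aware decoder.
./bin/kafka-console-consumer --topic test --zookeeper localhost:2181 \
  --property print.key=true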
./bin/kafka-avro-console-consumer --topic test --zookeeper localhost:2181 \
  --property print.key=true

Risk assessment
1. On the first connection to the database, the replication slot is created automatically and a snapshot of the existing data is immediately written to Kafka. If the database is large, this process can take a very long time.
2. To obtain consistent data, the snapshot is taken in a transaction at the REPEATABLE READ isolation level. On 9.6 with the "snapshot too old" feature configured, the snapshot copy may fail.
3. Because logical decoding is used, each replicated table must have a primary key, or a designated non-null unique column, to serve as the key during replication.
4. Inserts, deletes and updates are decoded from the WAL as follows:
insert : key + full row
delete : old.key
update : old.key + new.key + full row

If --allow-unkeyed was used to waive the primary-key requirement, deletes on such tables write nothing to Kafka, while inserts and updates send all columns to Kafka.
5. DDL operations are not captured from the WAL. What if you also need DDL delivered to Kafka?
You can use an event trigger: when DDL occurs, capture the statement and write it into a table; the DML on that table is then streamed to Kafka, which propagates the DDL.
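A minimal sketch of that approach, assuming a hypothetical audit table ddl_audit and trigger function log_ddl(); the DDL text lands in an ordinary table whose inserts are streamed to Kafka like any other DML:

-- Hypothetical audit table for captured DDL statements.
create table ddl_audit (
  id        bigserial primary key,
  happened  timestamptz not null default now(),
  username  text not null default current_user,
  tag       text,
  query     text
);

-- Event trigger function: record the command tag and the query text.
create or replace function log_ddl() returns event_trigger
language plpgsql as $$
begin
  insert into ddl_audit (tag, query)
  values (tg_tag, current_query());
end;
$$;

-- Fire after every DDL command completes.
create event trigger capture_ddl
  on ddl_command_end
  execute procedure log_ddl();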
6. If you decommission a producer, be sure to drop the corresponding replication slot in the database; otherwise PostgreSQL will retain all WAL that the slot has not yet consumed, eventually filling up the WAL directory.
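For example, to drop the default slot (named bottledwater unless --slot was set to something else):

select slot_name, active from pg_replication_slots;
select pg_drop_replication_slot('bottledwater');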
7. If the WAL generated by the database is not decoded and written to Kafka promptly, the unconsumed WAL files stay on the database server and can eventually exhaust its storage.
Use replication slots with care, and make sure your monitoring is robust.
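A monitoring sketch along those lines, using the 9.4-9.6 function names (pg_current_xlog_location / pg_xlog_location_diff), reports how much WAL each slot is still holding back:

-- Roughly how far behind each slot's restart point is.
select slot_name, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as retained_wal
from pg_replication_slots;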
8. Kafka topics correspond one-to-one with tables; the naming convention is given below.
Because a topic name has only three parts, [topic_prefix].[postgres_schema_name].table_name, and does not include the database name, it is recommended, when several databases are involved, to set --topic-prefix to the corresponding database name.
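For instance, streaming two databases (hypothetical names db1 and db2) into the same broker could use one bottledwater instance per database, each with its own slot and prefix:

# One instance per database; db1/db2 and the slot names are placeholders.
./bottledwater --postgres=postgres://digoal@127.0.0.1:5432/db1 --slot=bw_db1 --topic-prefix=db1 --broker=localhost:9092
./bottledwater --postgres=postgres://digoal@127.0.0.1:5432/db2 --slot=bw_db2 --topic-prefix=db2 --broker=localhost:9092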
For each table being streamed, Bottled Water publishes messages to a corresponding Kafka topic. The naming convention for topics is [topic_prefix].[postgres_schema_name].table_name:

- table_name is the name of the table in Postgres.
- postgres_schema_name is the name of the Postgres schema the table belongs to; this is omitted if the schema is "public" (the default schema under the default Postgres configuration). N.B. this requires the avro-c library to be at least version 0.8.0.
- topic_prefix is omitted by default, but may be configured via the --topic-prefix command-line option. A prefix is useful:
  - to prevent name collisions with other topics, if the Kafka broker is also being used for other purposes besides Bottled Water.
  - if you want to stream several databases into the same broker, using a separate Bottled Water instance with a different prefix for each database.
  - to make it easier for a Kafka consumer to consume updates from all Postgres tables, by using a topic regex that matches the prefix.

For example:

- with no prefix configured, a table named "users" in the public (default) schema would be streamed to a topic named "users".
- with --topic-prefix=bottledwater, a table named "transactions" in the "point-of-sale" schema would be streamed to a topic named "bottledwater.point-of-sale.transactions".

(Support for namespaces in Kafka has been proposed that would replace this sort of ad-hoc prefixing, but it's still under discussion.)

References
1. https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
2. http://docs.confluent.io/3.1.0/platform.html
3. https://github.com/confluentinc/bottledwater-pg/tree/master#building-from-source
4. http://docs.confluent.io/3.0.1/quickstart.html
5. https://www.postgresql.org/message-id/797DF957-CE33-407F-99DB-7C7125E37ACE@kleppmann.com
Summary