HBase: a missing coprocessor brings down every RegionServer

Published: 2025/3/15, Programming Q&A

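I. Background:

We share a big-data cluster with a sister team, with data access controlled through Dataspace combined with Kerberos. Our production environment uses the OLAP tool Kylin. During a Kylin upgrade the old coprocessor was deleted, so the existing tables kept looking for the coprocessor they referenced; when it could not be found, every RegionServer exited. I could not work out how HBase handles coprocessors until I found the material below.


The following content is reposted from http://blog.itpub.net/12129601/viewspace-1690668/ and is kept mainly for personal reference. Please credit the original author when reposting.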

II. Using coprocessors

1 Loading a coprocessor
1.1 Upload the coprocessor to HDFS:
hadoop fs -mkdir /hbasenew/usercoprocesser
hadoop fs -ls /hbasenew/usercoprocesser
hadoop fs -rm /hbasenew/usercoprocesser/coprocessor.jar
hadoop fs -copyFromLocal /home/hbase/coprocessor.jar /hbasenew/usercoprocesser
1.2 Attach the coprocessor to the table:
1) First unload the existing coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att_unset',NAME =>'coprocessor$1'
enable 'ns_bigdata:tb_test_coprocesser'
2) Then load the coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att','coprocessor' => '/hbasenew/usercoprocesser/coprocessor.jar|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|'
enable 'ns_bigdata:tb_test_coprocesser'
Note: when loading the coprocessor I deliberately dropped the final letter t from the class name, to reproduce both the RegionServer crash and the inconsistent table state.
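
A misspelled class name like the one above can be caught before the table is altered. The following is a minimal sketch, assuming a local copy of the jar is still available (for example /home/hbase/coprocessor.jar from step 1.1). It splits the attribute value into the path|class|priority|args fields used in the alter command and looks up the class entry in the jar. CoprocessorSpecCheck and its methods are hypothetical names chosen for illustration; they are not part of HBase.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical pre-flight check for a coprocessor table attribute; not part of HBase. */
class CoprocessorSpecCheck {

    /** Splits "path|class|priority|args" into exactly four fields, keeping empty ones. */
    static String[] parseSpec(String spec) {
        // The -1 limit keeps the trailing empty "args" field.
        String[] fields = spec.split("\\|", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                "expected path|class|priority|args, got: " + spec);
        }
        return fields;
    }

    /** Maps a fully qualified class name to its entry name inside a jar. */
    static String jarEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the jar at the given local path contains the named class. */
    static boolean jarContainsClass(String localJarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(localJarPath)) {
            return jar.getEntry(jarEntryName(className)) != null;
        }
    }
}
```

Calling jarContainsClass with the local jar and the second field returned by parseSpec would presumably return false for the misspelled HelloWorldEndPoin, which is the signal to stop before running alter.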

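2 Problems encountered
The operations above cause the following two problems:
2.1 The cluster's RegionServers crash

2.2 The table carrying the coprocessor is left in an inconsistent state, stuck in ENABLING

Neither disable nor enable can be run against the table:

At the same time, the RegionServer hosting this table logs the following error: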

3 Root-cause analysis
3.1 Why a coprocessor load failure brings down the RegionServer
In the HBase source, the parameter hbase.coprocessor.abortonerror defaults to true:
public static final String ABORT_ON_ERROR_KEY = "hbase.coprocessor.abortonerror";
public static final boolean DEFAULT_ABORT_ON_ERROR = true;
The parameter is documented as follows:

<property>
  <name>hbase.coprocessor.abortonerror</name>
  <value>true</value>
  <description>Set to true to cause the hosting server (master or regionserver)
      to abort if a coprocessor fails to load, fails to initialize, or throws an
      unexpected Throwable object. Setting this to false will allow the server to
      continue execution but the system wide state of the coprocessor in question
      will become inconsistent as it will be properly executing in only a subset
      of servers, so this is most useful for debugging only.</description>
</property>
So with the default setting, a coprocessor that fails to load causes the RegionServer to abort.

3.2 Why the table carrying the coprocessor ends up in an inconsistent state:
Relevant error log:

Source code for enable:
  public void enableTable(final TableName tableName)
  throws IOException {
    enableTableAsync(tableName);

    // Wait until all regions are enabled
    waitUntilTableIsEnabled(tableName);

    LOG.info("Enabled table " + tableName);
  }

  private void waitUntilTableIsEnabled(final TableName tableName) throws IOException {
    boolean enabled = false;
    long start = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0; tries < (this.numRetries * this.retryLongerMultiplier); tries++) {
      try {
        enabled = isTableEnabled(tableName);
      } catch (TableNotFoundException tnfe) {
        // wait for table to be created
        enabled = false;
      }
      enabled = enabled && isTableAvailable(tableName);
      if (enabled) {
        break;
      }
      long sleep = getPauseTime(tries);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Sleeping= " + sleep + "ms, waiting for all regions to be " +
          "enabled in " + tableName);
      }
      try {
        Thread.sleep(sleep);
      } catch (InterruptedException e) {
        // Do this conversion rather than let it out because do not want to
        // change the method signature.
        throw (InterruptedIOException)new InterruptedIOException("Interrupted").initCause(e);
      }
    }
    if (!enabled) {
      long msec = EnvironmentEdgeManager.currentTimeMillis() - start;
      throw new IOException("Table '" + tableName +
        "' not yet enabled, after " + msec + "ms.");
    }
  }

===========================================================================
  /**
   * Brings a table on-line (enables it).  Method returns immediately though
   * enable of table may take some time to complete, especially if the table
   * is large (All regions are opened as part of enabling process).  Check
   * {@link #isTableEnabled(byte[])} to learn when table is fully online.  If
   * table is taking too long to online, check server logs.
   * @param tableName
   * @throws IOException
   * @since 0.90.0
   */
  public void enableTableAsync(final TableName tableName)
  throws IOException {
    TableName.isLegalFullyQualifiedTableName(tableName.getName());
    executeCallable(new MasterCallable<Void>(getConnection()) {
      @Override
      public Void call() throws ServiceException {
        LOG.info("Started enable of " + tableName);
        EnableTableRequest req = RequestConverter.buildEnableTableRequest(tableName);
        master.enableTable(null,req);
        return null;
      }
    });
  }
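The code shows that enable first sends the enable request and then waits until every region of the table is reported online by the RegionServers. Because the RegionServers have already gone down at this point, the client keeps retrying and waiting, and the table stays in the ENABLING state the whole time.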
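
The shape of that wait can be shown in isolation. The sketch below is a simplified, self-contained imitation of the bounded polling loop in waitUntilTableIsEnabled, not HBase code: BoundedWait and waitUntil are hypothetical names, and a fixed pause stands in for getPauseTime(tries). It illustrates that the client-side wait simply runs out of attempts when the condition never becomes true; giving up on the client does nothing to change the ENABLING state recorded for the table.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of a bounded polling loop; not HBase code. */
class BoundedWait {

    /**
     * Checks the condition up to maxTries times, sleeping pauseMillis after each
     * failed check. Returns the number of checks made before the condition held,
     * or -1 if it never held within maxTries checks.
     */
    static int waitUntil(BooleanSupplier condition, int maxTries, long pauseMillis) {
        for (int tries = 0; tries < maxTries; tries++) {
            if (condition.getAsBoolean()) {
                return tries + 1;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and surface the interruption unchecked.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return -1;
    }
}
```

With a condition that is always false, which is the situation when no RegionServer is alive to open the regions, waitUntil returns -1 after maxTries checks; the HBase method shown above throws an IOException at the corresponding point.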

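4 Fixing the problems
4.1 Fixing the RegionServer crash:
Set the following parameter in hbase-site.xml:
<property>
  <name>hbase.coprocessor.abortonerror</name>
  <value>false</value>
</property>
Then start the RegionServers. Coprocessor errors are now ignored instead of aborting the server, which keeps the cluster available.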
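
Before restarting, it can be worth confirming that the override really is present in the file each RegionServer will read. The following is a minimal sketch using only the JDK XML parser; SiteXmlCheck and propertyValue are hypothetical names for illustration, and the sketch returns the first matching property without modelling how HBase merges hbase-default.xml and hbase-site.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for inspecting a Hadoop-style XML config; not part of HBase. */
class SiteXmlCheck {

    /** Returns the value of the first property with the given name, or null if absent. */
    static String propertyValue(InputStream xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (name.equals(childText(prop, "name"))) {
                    return childText(prop, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration XML", e);
        }
    }

    /** Convenience overload for XML held in a string. */
    static String propertyValue(String xml, String name) {
        return propertyValue(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), name);
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

Passing a FileInputStream opened on hbase-site.xml together with the name hbase.coprocessor.abortonerror should return false once the change above is in place, and null if the property was never added.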
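4.2 Fixing the inconsistent state of the table carrying the coprocessor, where neither disable nor enable works:
This can be resolved by failing over the master. Stop the active master; the backup master takes over its role, and during the failover the inconsistent table state is repaired:

Master information after the failover:

The following method is called during the failover: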
  /**
   * Recover the tables that are not fully moved to ENABLED state. These tables
   * are in ENABLING state when the master restarted/switched
   *
   * @throws KeeperException
   * @throws org.apache.hadoop.hbase.TableNotFoundException
   * @throws IOException
   */
  private void recoverTableInEnablingState()
      throws KeeperException, TableNotFoundException, IOException {
    Set<TableName> enablingTables = ZKTable.getEnablingTables(watcher);
    if (enablingTables.size() != 0) {
      for (TableName tableName : enablingTables) {
        // Recover by calling EnableTableHandler
        LOG.info("The table " + tableName
            + " is in ENABLING state.  Hence recovering by moving the table"
            + " to ENABLED state.");
        // enableTable in sync way during master startup,
        // no need to invoke coprocessor
        EnableTableHandler eth = new EnableTableHandler(this.server, tableName,
          catalogTracker, this, tableLockManager, true);
        try {
          eth.prepare();
        } catch (TableNotFoundException e) {
          LOG.warn("Table " + tableName + " not found in hbase:meta to recover.");
          continue;
        }
        eth.process();
      }
    }
  }
During the failover, the background logs of the master and of the corresponding RegionServer were traced.
Master log:
Part of the log output is shown below:
2015-05-20 10:00:01,398 INFO  [master:nim-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state.  Hence recovering by moving the table to ENABLED state.
2015-05-20 10:00:01,421 DEBUG [master:nim-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000002
2015-05-20 10:00:01,436 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 10:00:01,465 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 10:00:01,466 INFO  [master:nim-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 10:00:01,466 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true
The corresponding RegionServer log is as follows:
2015-05-20 14:39:56,175 INFO  [master:sup02-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state.  Hence recovering by moving the table to ENABLED state.
2015-05-20 14:39:56,211 DEBUG [master:sup02-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000031
2015-05-20 14:39:56,235 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 14:39:56,269 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 14:39:56,270 INFO  [master:sup02-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 14:39:56,270 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true

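Conclusions:
1. To improve cluster availability, set hbase.coprocessor.abortonerror to false. Then even a faulty coprocessor will not bring down the cluster's RegionServers, and will not leave tables unable to enable or disable. The trade-off, as the parameter description quoted in 3.1 warns, is that the coprocessor may then run on only a subset of servers.
2. Even if a table does become impossible to enable or disable, a master failover resolves it, so a cluster should always be built with at least one or two backup masters.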

5 Read/write test with all master nodes down
1. With the whole cluster healthy, insert 2000000 rows from a client; the inserts succeed.
2. Stop all of the cluster's masters:

3. Monitor the client's inserts: they continue to succeed. Have the client insert a further 20000000 rows; these also succeed.
4. Read data in bulk from the client; the reads succeed.
Conclusion: after every master node in an HBase cluster goes down, client reads and writes continue to work for a certain period (half an hour in this test).
