Hadoop Metadata Merge (Checkpoint) Exception and How to Fix It
Over the past few days I have been watching the logs on the Standby NN, and noticed that every time an fsimage merge finishes, the following exception appears while the Standby NN notifies the Active NN to download the merged fsimage:
2014-04-23 14:42:54,964 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:268)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:162)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:174)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)
Out of habit I went to Google first, but after a long search I could not find anything similar, so I had to figure it out myself. Going through the logs we noticed something even stranger: the time of the last checkpoint never changed; it was always the time of the very first checkpoint taken when the Standby NN started, as shown below:
2014-04-23 14:50:54,429 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer:
Triggering checkpoint because it has been 70164 seconds since the last checkpoint,
which exceeds the configured interval 600
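The 600-second interval reported in this message comes from the Standby checkpointer's period setting. As a sketch, the corresponding hdfs-site.xml entry on a cluster configured this way would look roughly as follows; dfs.namenode.checkpoint.period is the standard property behind checkpointConf.getPeriod(), its default is 3600 seconds, and 600 matches the value shown in the log:

<!-- hdfs-site.xml: how often (in seconds) the Standby NN triggers a checkpoint. -->
<!-- The default is 3600; the log above indicates this cluster uses 600.         -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>600</value>
</property>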
Could this be a Hadoop bug? I went digging through the source code based on the error message above, and after careful analysis found that all of the output above comes from the StandbyCheckpointer class:
private void doWork() {
  // Reset checkpoint time so that we don't always checkpoint
  // on startup.
  lastCheckpointTime = now();
  while (shouldRun) {
    try {
      Thread.sleep(1000 * checkpointConf.getCheckPeriod());
    } catch (InterruptedException ie) {
    }
    if (!shouldRun) {
      break;
    }
    try {
      // We may have lost our ticket since last checkpoint, log in again,
      // just in case
      if (UserGroupInformation.isSecurityEnabled()) {
        UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
      }

      long now = now();
      long uncheckpointed = countUncheckpointedTxns();
      long secsSinceLast = (now - lastCheckpointTime) / 1000;

      boolean needCheckpoint = false;
      if (uncheckpointed >= checkpointConf.getTxnCount()) {
        LOG.info("Triggering checkpoint because there have been " +
            uncheckpointed + " txns since the last checkpoint, which " +
            "exceeds the configured threshold " +
            checkpointConf.getTxnCount());
        needCheckpoint = true;
      } else if (secsSinceLast >= checkpointConf.getPeriod()) {
        LOG.info("Triggering checkpoint because it has been " +
            secsSinceLast + " seconds since the last checkpoint, which " +
            "exceeds the configured interval " + checkpointConf.getPeriod());
        needCheckpoint = true;
      }

      synchronized (cancelLock) {
        if (now < preventCheckpointsUntil) {
          LOG.info("But skipping this checkpoint since we are about" +
              " to failover!");
          canceledCount++;
          continue;
        }
        assert canceler == null;
        canceler = new Canceler();
      }

      if (needCheckpoint) {
        doCheckpoint();
        lastCheckpointTime = now;
      }
    } catch (SaveNamespaceCancelledException ce) {
      LOG.info("Checkpoint was cancelled: " + ce.getMessage());
      canceledCount++;
    } catch (InterruptedException ie) {
      // Probably requested shutdown.
      continue;
    } catch (Throwable t) {
      LOG.error("Exception in doCheckpoint", t);
    } finally {
      synchronized (cancelLock) {
        canceler = null;
      }
    }
  }
}
The exception above is thrown while doCheckpoint() is executing, so the statement lastCheckpointTime = now; is never reached, which explains why the reported "seconds since the last checkpoint" keeps growing. But why does doCheckpoint() fail in the first place? Following the stack trace, the failure comes from the following statement in the doGetUrl method of the TransferFsImage class:
if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
The connection times out waiting for the remote side's response code. My first thought was that my cluster's socket timeout settings were wrong, but further analysis showed they were not. Going back to the code, I noticed the following setup just before the statement above:
if (timeout <= 0) {
  Configuration conf = new HdfsConfiguration();
  timeout = conf.getInt(DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
      DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
}

if (timeout > 0) {
  connection.setConnectTimeout(timeout);
  connection.setReadTimeout(timeout);
}

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
  throw new HttpGetFailedException(
      "Image transfer servlet at " + url +
      " failed with status code " + connection.getResponseCode() +
      "\nResponse message:\n" + connection.getResponseMessage(),
      connection);
}
DFS_IMAGE_TRANSFER_TIMEOUT_KEY corresponds to the dfs.image.transfer.timeout parameter, whose default is 10 * 60 * 1000, in milliseconds. Its documentation reads:
Timeout for image transfer in milliseconds. This timeout and the related dfs.image.transfer.bandwidthPerSec parameter should be configured such that normal image transfer can complete within the timeout. This timeout prevents client hangs when the sender fails during image transfer, which is particularly important during checkpointing. Note that this timeout applies to the entirety of image transfer, and is not a socket timeout.
That is where the problem lies: this parameter is tightly coupled with dfs.image.transfer.bandwidthPerSec. The Active NN must be able to finish downloading the merged fsimage from the Standby NN within dfs.image.transfer.timeout, otherwise the exception above is thrown. I then looked at my own configuration:
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>60000</value>
</property>

<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>1048576</value>
</property>
With a 60-second timeout and a copy rate of 1 MB per second, the 800-plus MB of metadata on my cluster obviously cannot be transferred within 60 seconds. After I raised dfs.image.transfer.timeout and watched the cluster for a while, the exception above never appeared again, and some earlier errors were resolved by the same change.
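To put numbers on it: at dfs.image.transfer.bandwidthPerSec = 1048576 (1 MB/s), an 800 MB fsimage needs at least roughly 800 seconds to transfer, far more than the 60000 ms timeout above, so the timeout must be comfortably larger than the fsimage size divided by bandwidthPerSec. The entry below is only an illustrative sketch of such a corrected setting; the article does not say which exact value was finally used, and 1800000 ms (30 minutes) is simply a value that leaves headroom for an 800 MB image at 1 MB/s:

<!-- Illustrative value only: dfs.image.transfer.timeout must exceed          -->
<!-- fsimage size / bandwidthPerSec; 800 MB at 1 MB/s needs more than 800 s,  -->
<!-- so 1,800,000 ms (30 minutes) leaves a comfortable margin.                -->
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>1800000</value>
</property>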
總結