
HBase Source Code Series (9): StoreFile Storage Format

Published: 2025/1/21


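Starting with this chapter I want to cover the Region Server, but before that I have to cover StoreFile, otherwise the later chapters will be hard to follow. This part is the foundation: most operations on a Region Server are built on top of it.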

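HFile overview

HFile is the file in which HBase actually stores its data, and it gives HBase fast, efficient data access. It is based on Hadoop's TFile and modeled on the SSTable format from Google's Bigtable architecture. The file format is as follows:

The file is variable-length; the only fixed blocks are File Info and Trailer. As the figure shows, the Trailer holds pointers to the other blocks, and those pointers are written into the file as well. The Index blocks record the offsets of the data and meta blocks, and the meta blocks are optional.

Below we go through these pieces one by one in the source code, starting from the entry point, the append method of StoreFile.Writer. We look at how the file is written first, because reading simply mirrors it. If you don't know how to use this class, see my article "Generating HFiles without MapReduce and importing them into HBase".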

Appending a KeyValue to an HFile

Enough preamble; let's look at the append method in StoreFile.

public void append(final KeyValue kv) throws IOException {
  // If the value belongs to a new row key, add it to the Bloom filter
  appendGeneralBloomfilter(kv);
  // Handle KeyValues of type DeleteFamily or DeleteFamilyVersion
  appendDeleteFamilyBloomFilter(kv);
  writer.append(kv);
  // Record the timestamp of the latest put and update the timestamp range
  trackTimestamps(kv);
}

Before appending through writer, the kv is first written to generalBloomFilterWriter. It turns out that generalBloomFilterWriter is an InlineBlockWriter registered inside HFile.Writer.

generalBloomFilterWriter = BloomFilterFactory.createGeneralBloomAtWrite(conf, cacheConf, bloomType,
    (int) Math.min(maxKeys, Integer.MAX_VALUE), writer);

// Inside createGeneralBloomAtWrite we find the following code
......
CompoundBloomFilterWriter bloomWriter = new CompoundBloomFilterWriter(getBloomBlockSize(conf),
    err, Hash.getHashType(conf), maxFold, cacheConf.shouldCacheBloomsOnWrite(),
    bloomType == BloomType.ROWCOL ? KeyValue.COMPARATOR : KeyValue.RAW_COMPARATOR);
writer.addInlineBlockWriter(bloomWriter);
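
To make the role of the Bloom filter more concrete, here is a minimal standalone sketch of a row-key Bloom filter. It is not HBase's CompoundBloomFilterWriter: the class name SimpleBloom, the bit count, and the double-hashing scheme are all assumptions picked for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: a tiny Bloom filter, not HBase's CompoundBloomFilterWriter.
class SimpleBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloom(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Called once per appended key, like appendGeneralBloomfilter(kv).
    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloom bloom = new SimpleBloom(1 << 16, 3);
        byte[] row = "row-1".getBytes(StandardCharsets.UTF_8);
        bloom.add(row);
        System.out.println("row-1 possibly present: " + bloom.mightContain(row));
    }
}
```

A reader can skip a whole file when mightContain returns false for the row it is looking for, which is why the filter is filled in on every append.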

Next, let's look at the append method of HFileWriterV2.

public void append(final KeyValue kv) throws IOException {
  append(kv.getMvccVersion(), kv.getBuffer(), kv.getKeyOffset(), kv.getKeyLength(),
      kv.getBuffer(), kv.getValueOffset(), kv.getValueLength());
  this.maxMemstoreTS = Math.max(this.maxMemstoreTS, kv.getMvccVersion());
}

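Why paste this code? Note the field maxMemstoreTS: it is updated by comparing against the kv's mvcc. mvcc is used to make MemStore operations atomic. When the MemStore is flushed, every kv in the same batch carries the same mvcc, and on failure everything with that mvcc is discarded. I only mention it here; it will probably come up again later. Let's keep chasing the append method. It is fairly long, so expand it and take a look.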

private void append(final long memstoreTS, final byte[] key, final int koffset, final int klength,
    final byte[] value, final int voffset, final int vlength) throws IOException {
  boolean dupKey = checkKey(key, koffset, klength);
  checkValue(value, voffset, vlength);
  if (!dupKey) {
    // Before writing each new KeyValue, check whether the block has reached blockSize;
    // if it has, start a new HFileBlock
    checkBlockBoundary();
  }
  // If fsBlockWriter is not in the writing state, start a new block
  if (!fsBlockWriter.isWriting())
    newBlock();
  // Write the lengths, the key and the value to the output stream
  {
    DataOutputStream out = fsBlockWriter.getUserDataStream();
    out.writeInt(klength);
    totalKeyLength += klength;
    out.writeInt(vlength);
    totalValueLength += vlength;
    out.write(key, koffset, klength);
    out.write(value, voffset, vlength);
    if (this.includeMemstoreTS) {
      WritableUtils.writeVLong(out, memstoreTS);
    }
  }
  // Record the first key of each block and the last key written
  if (firstKeyInBlock == null) {
    firstKeyInBlock = new byte[klength];
    System.arraycopy(key, koffset, firstKeyInBlock, 0, klength);
  }
  lastKeyBuffer = key;
  lastKeyOffset = koffset;
  lastKeyLength = klength;
  entryCount++;
}
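
The record layout written by this method (key length, value length, key bytes, value bytes, then the memstore timestamp) can be tried out with plain JDK streams. The sketch below is only an approximation: RecordLayout is a made-up class, and it stores the timestamp as a fixed 8-byte long rather than the variable-length encoding that WritableUtils.writeVLong produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: mirrors the field order of the append code, not the real HFile block format.
class RecordLayout {
    static byte[] encode(byte[] key, byte[] value, long memstoreTS) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(key.length);   // klength
            out.writeInt(value.length); // vlength
            out.write(key);
            out.write(value);
            out.writeLong(memstoreTS);  // assumption: fixed 8 bytes instead of a vlong
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {key, value, memstoreTS} as strings, read back in the order they were written.
    static String[] decode(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte[] key = new byte[in.readInt()];
            byte[] value = new byte[in.readInt()];
            in.readFully(key);
            in.readFully(value);
            long memstoreTS = in.readLong();
            return new String[] {
                new String(key, StandardCharsets.UTF_8),
                new String(value, StandardCharsets.UTF_8),
                Long.toString(memstoreTS)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encode("row1".getBytes(StandardCharsets.UTF_8),
                "v".getBytes(StandardCharsets.UTF_8), 7L);
        System.out.println(String.join(" / ", decode(record)));
    }
}
```

Because both lengths come first, a reader can skip over a record without interpreting the key or the value.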

As the append method shows, an HFile is written block by block, with each block around 64KB. Small blocks favor random access and work against sequential access, so if the workload is dominated by sequential reads, the block size can be set larger. Now let's look at the checkBlockBoundary method.

private void checkBlockBoundary() throws IOException {
  if (fsBlockWriter.blockSizeWritten() < blockSize)
    return;
  finishBlock();
  writeInlineBlocks(false);
  newBlock();
}

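A brief explanation:

1. When a block is finished, all of its data is written to the HDFS stream, and some information is recorded in the DataBlockIndex (a key that falls between the last key of the previous block and the first key of this block, the block's size, and the block's starting offset).

2. writeInlineBlocks(false) is called with false for its "closing" argument, so it does nothing for now and the output is deferred until the end (this point is corrected in the supplement at the end of the article).

3. newBlock resets the output stream and gets ready to write the next block.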
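
The three steps above can be sketched without any HBase classes. BlockRoller below is a hypothetical stand-in that only counts bytes: it rolls to a new block once the current one has reached blockSize and records one index entry (first key, offset, size) per finished block. The duplicate-key check from the real append is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: tracks block boundaries by byte count, writes no real data.
class BlockRoller {
    static final class IndexEntry {
        final String firstKey;
        final long offset;
        final int size;

        IndexEntry(String firstKey, long offset, int size) {
            this.firstKey = firstKey;
            this.offset = offset;
            this.size = size;
        }
    }

    private final int blockSize;
    private final List<IndexEntry> index = new ArrayList<>();
    private long blockStartOffset = 0;
    private int bytesInBlock = 0;
    private String firstKeyInBlock = null;

    BlockRoller(int blockSize) {
        this.blockSize = blockSize;
    }

    void append(String key, int recordBytes) {
        // Same test as checkBlockBoundary: roll only after the block has reached blockSize,
        // so a block can end up slightly larger than blockSize.
        if (bytesInBlock >= blockSize) {
            finishBlock();
        }
        if (firstKeyInBlock == null) {
            firstKeyInBlock = key;
        }
        bytesInBlock += recordBytes;
    }

    private void finishBlock() {
        if (firstKeyInBlock == null) {
            return; // nothing written since the last block
        }
        index.add(new IndexEntry(firstKeyInBlock, blockStartOffset, bytesInBlock));
        blockStartOffset += bytesInBlock;
        bytesInBlock = 0;
        firstKeyInBlock = null;
    }

    List<IndexEntry> close() {
        finishBlock();
        return index;
    }

    public static void main(String[] args) {
        BlockRoller roller = new BlockRoller(100);
        for (String key : new String[] {"a", "b", "c", "d", "e"}) {
            roller.append(key, 40);
        }
        for (IndexEntry e : roller.close()) {
            System.out.println(e.firstKey + " @" + e.offset + " size=" + e.size);
        }
    }
}
```

Since the boundary is checked before a record is written rather than after, blocks are "about" blockSize rather than exactly blockSize, which matches the "around 64KB" wording above.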

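On close

close is where things get busy. Judging from the earlier figure, the end of the file is the busiest part, because that is where a whole pile of index and auxiliary information gets written.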

public void close() throws IOException {
  boolean hasGeneralBloom = this.closeGeneralBloomFilter();
  boolean hasDeleteFamilyBloom = this.closeDeleteFamilyBloomFilter();
  writer.close();
}

Before calling the writer's close method, it closes the two Bloom filters, writes the Bloom filter type into FileInfo, and adds the BloomWriter to the writer. Now for the main part. It is a big one, so I have collapsed it.

public void close() throws IOException {
  if (outputStream == null) {
    return;
  }
  // If encoding/compression is used, save the encoding type in FileInfo
  blockEncoder.saveMetadata(this);
  // Finish the current block
  finishBlock();
  // Write the non-root levels of the DataBlockIndex
  writeInlineBlocks(true);
  FixedFileTrailer trailer = new FixedFileTrailer(2, HFileReaderV2.MAX_MINOR_VERSION);
  // If there are any meta blocks
  if (!metaNames.isEmpty()) {
    for (int i = 0; i < metaNames.size(); ++i) {
      long offset = outputStream.getPos();
      // Write the meta block contents; the meta blocks are kept sorted by name
      DataOutputStream dos = fsBlockWriter.startWriting(BlockType.META);
      metaData.get(i).write(dos);
      fsBlockWriter.writeHeaderAndData(outputStream);
      totalUncompressedBytes += fsBlockWriter.getUncompressedSizeWithHeader();
      // Add this meta block to the meta block index
      metaBlockIndexWriter.addEntry(metaNames.get(i), offset,
          fsBlockWriter.getOnDiskSizeWithHeader());
    }
  }
  // Everything below is loaded when the file is opened; it indexes what came before
  // Root level of the HFileBlockIndex
  long rootIndexOffset = dataBlockIndexWriter.writeIndexBlocks(outputStream);
  trailer.setLoadOnOpenOffset(rootIndexOffset);
  // Meta block index
  metaBlockIndexWriter.writeSingleLevelIndex(
      fsBlockWriter.startWriting(BlockType.ROOT_INDEX), "meta");
  fsBlockWriter.writeHeaderAndData(outputStream);
  totalUncompressedBytes += fsBlockWriter.getUncompressedSizeWithHeader();
  // If required, write the maximum MemStore timestamp into FileInfo
  if (this.includeMemstoreTS) {
    appendFileInfo(MAX_MEMSTORE_TS_KEY, Bytes.toBytes(maxMemstoreTS));
    appendFileInfo(KEY_VALUE_VERSION, Bytes.toBytes(KEY_VALUE_VER_WITH_MEMSTORE));
  }
  // Record the start offset of FileInfo in the trailer, then write FileInfo
  writeFileInfo(trailer, fsBlockWriter.startWriting(BlockType.FILE_INFO));
  fsBlockWriter.writeHeaderAndData(outputStream);
  totalUncompressedBytes += fsBlockWriter.getUncompressedSizeWithHeader();
  // Write the GENERAL_BLOOM_META and DELETE_FAMILY_BLOOM_META Bloom filter metadata
  for (BlockWritable w : additionalLoadOnOpenData) {
    fsBlockWriter.writeBlock(w, outputStream);
    totalUncompressedBytes += fsBlockWriter.getUncompressedSizeWithHeader();
  }
  // Number of levels in the HFileBlockIndex
  trailer.setNumDataIndexLevels(dataBlockIndexWriter.getNumLevels());
  // Uncompressed size of the HFileBlockIndex
  trailer.setUncompressedDataIndexSize(dataBlockIndexWriter.getTotalUncompressedSize());
  // Start offset of the first HFileBlock
  trailer.setFirstDataBlockOffset(firstDataBlockOffset);
  // Start offset of the last HFileBlock
  trailer.setLastDataBlockOffset(lastDataBlockOffset);
  // Comparator type
  trailer.setComparatorClass(comparator.getClass());
  // Number of root entries in the HFileBlockIndex; it should equal the number of HFileBlocks,
  // because the first key of every HFileBlock is added to it
  trailer.setDataIndexCount(dataBlockIndexWriter.getNumRootEntries());
  // Write the trailer to disk and close the output stream
  finishClose(trailer);
  fsBlockWriter.release();
}

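This differs somewhat from what the figure shows. The actual order is:

1. Write the HFileBlocks

2. Write the second-level index of the HFileBlockIndex (I call it the second-level index, though I am not sure that is the right name; I have forgotten some of the HFileBlockIndex details and will check again the next time I debug it)

3. Write the MetaBlocks, if there are any

The parts below are loaded when the file is opened:

4. Write the root index of the HFileBlockIndex

5. Write the root index of the MetaBlockIndex, if there is one (it is small, so it has only one level)

6. Write the file info (FileInfo)

7. Write the file trailer (Trailer)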
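
The key idea in this ordering is that a fixed-size trailer at the very end tells the reader where the load-on-open section starts. Here is a small sketch of that idea using an in-memory byte array. The layout (two longs and an int) and the names TrailerSketch, writeFile and readLoadOnOpen are invented for illustration; the real FixedFileTrailer has more fields and its own serialization.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a made-up trailer layout, not HBase's FixedFileTrailer.
class TrailerSketch {
    // Hypothetical trailer: loadOnOpenOffset (8) + loadOnOpenLength (8) + version (4).
    static final int TRAILER_SIZE = Long.BYTES + Long.BYTES + Integer.BYTES;

    static byte[] writeFile(byte[] dataBlocks, byte[] loadOnOpen, int version) {
        ByteBuffer file =
            ByteBuffer.allocate(dataBlocks.length + loadOnOpen.length + TRAILER_SIZE);
        file.put(dataBlocks);
        long loadOnOpenOffset = file.position(); // where the load-on-open section starts
        file.put(loadOnOpen);
        file.putLong(loadOnOpenOffset);
        file.putLong(loadOnOpen.length);
        file.putInt(version);
        return file.array();
    }

    // A reader starts from the end: the trailer size is fixed, so its position is known.
    static byte[] readLoadOnOpen(byte[] file) {
        ByteBuffer trailer = ByteBuffer.wrap(file, file.length - TRAILER_SIZE, TRAILER_SIZE);
        int offset = (int) trailer.getLong();
        int length = (int) trailer.getLong();
        return Arrays.copyOfRange(file, offset, offset + length);
    }

    static int readVersion(byte[] file) {
        return ByteBuffer.wrap(file).getInt(file.length - Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] file = writeFile(new byte[10], "index".getBytes(StandardCharsets.UTF_8), 2);
        System.out.println(new String(readLoadOnOpen(file), StandardCharsets.UTF_8));
        System.out.println("version " + readVersion(file));
    }
}
```

This is why only File Info and Trailer need a fixed, findable position: everything else is reached by following offsets stored there.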

On open

This part covers how the Reader is instantiated for the different kinds of files. Search for the open method in StoreFile.

this.reader = fileInfo.open(this.fs, this.cacheConf, dataBlockEncoder.getEncodingInCache());
// Load the file info into a map; the rest is not covered here
metadataMap = Collections.unmodifiableMap(this.reader.loadFileInfo());

Press F3 to step into the fileInfo.open method.

FSDataInputStreamWrapper in;
FileStatus status;
if (this.link != null) {
  // HFileLink
  in = new FSDataInputStreamWrapper(fs, this.link);
  status = this.link.getFileStatus(fs);
} else if (this.reference != null) {
  // HFile Reference: resolve the path of the HFile that the reference points to
  Path referencePath = getReferredToFile(this.getPath());
  in = new FSDataInputStreamWrapper(fs, referencePath);
  status = fs.getFileStatus(referencePath);
} else {
  in = new FSDataInputStreamWrapper(fs, this.getPath());
  status = fileStatus;
}
long length = status.getLen();
if (this.reference != null) {
  hdfsBlocksDistribution = computeRefFileHDFSBlockDistribution(fs, reference, status);
  // For a reference, create a reader that covers half of the file
  return new HalfStoreFileReader(fs, this.getPath(), in, length, cacheConf, reference,
      dataBlockEncoding);
} else {
  hdfsBlocksDistribution = FSUtils.computeHDFSBlocksDistribution(fs, status, 0, length);
  return new StoreFile.Reader(fs, this.getPath(), in, length, cacheConf, dataBlockEncoding);
}

Right at the start it checks whether link (the HFileLink) is null. Where does that get set? After some digging, it turns out the decision is made in the StoreFile constructor.

this.fileStatus = fileStatus;
Path p = fileStatus.getPath();
if (HFileLink.isHFileLink(p)) {
  // HFileLink: the path was identified as an HFileLink
  this.reference = null;
  this.link = new HFileLink(conf, p);
} else if (isReference(p)) {
  this.reference = Reference.read(fs, p);
  // The referred-to path may itself be an HFileLink, as described in the snapshot chapter
  Path referencePath = getReferredToFile(p);
  if (HFileLink.isHFileLink(referencePath)) {
    // HFileLink Reference: the reference points to an HFileLink
    this.link = new HFileLink(conf, referencePath);
  } else {
    // Just a reference
    this.link = null;
  }
} else if (isHFile(p)) {
  // HFile
  this.reference = null;
  this.link = null;
} else {
  throw new IOException("path=" + p + " doesn't look like a valid StoreFile");
}

There are four cases:

1. HFileLink

2. Both an HFileLink and a Reference file

3. Only a Reference file

4. HFile
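
The four cases boil down to whether the link and reference fields are set, and the reader choice in fileInfo.open depends only on the reference. The sketch below restates that decision table in isolation; StoreFileKind and its method names are made up, and the two booleans stand in for "link != null" and "reference != null".

```java
// Illustrative only: restates the constructor and open() branches as a decision table.
class StoreFileKind {
    enum Kind { HFILE_LINK, LINK_REFERENCE, REFERENCE, HFILE }

    // hasLink / hasReference stand in for "link != null" and "reference != null".
    static Kind classify(boolean hasLink, boolean hasReference) {
        if (hasLink && hasReference) {
            return Kind.LINK_REFERENCE;
        }
        if (hasLink) {
            return Kind.HFILE_LINK;
        }
        if (hasReference) {
            return Kind.REFERENCE;
        }
        return Kind.HFILE;
    }

    // Which reader open() returns: only the reference decides.
    static String readerFor(Kind kind) {
        switch (kind) {
            case LINK_REFERENCE:
            case REFERENCE:
                return "HalfStoreFileReader";
            default:
                return "StoreFile.Reader";
        }
    }

    public static void main(String[] args) {
        for (boolean link : new boolean[] {true, false}) {
            for (boolean ref : new boolean[] {true, false}) {
                Kind kind = classify(link, ref);
                System.out.println(kind + " -> " + readerFor(kind));
            }
        }
    }
}
```

The link only changes how the input stream is opened; it never changes which reader class is used.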

Let's talk about HFileLink. Here is its constructor.

public HFileLink(final Path rootDir, final Path archiveDir, final Path path) {
  Path hfilePath = getRelativeTablePath(path);
  this.tempPath = new Path(new Path(rootDir, HConstants.HBASE_TEMP_DIRECTORY), hfilePath);
  this.originPath = new Path(rootDir, hfilePath);
  this.archivePath = new Path(archiveDir, hfilePath);
  setLocations(originPath, tempPath, archivePath);
}

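Wow, it computes three paths: the original location, the location in the temp directory, and the location in the archive, and it adds them to a locations array in that order. Next, let's look at FSDataInputStreamWrapper; below are three code snippets.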

this.stream = (link != null) ? link.open(hfs) : hfs.open(path);

// This goes through link.open(hfs)
new FSDataInputStream(new FileLinkInputStream(fs, this));

// Note the tryOpen method
public FileLinkInputStream(final FileSystem fs, final FileLink fileLink, int bufferSize)
    throws IOException {
  this.bufferSize = bufferSize;
  this.fileLink = fileLink;
  this.fs = fs;
  this.in = tryOpen();
}

The tryOpen method tries the entries in the locations list one after another, in order.

for (Path path: fileLink.getLocations()) {
  if (path.equals(currentPath)) continue;
  try {
    in = fs.open(path, bufferSize);
    in.seek(pos);
    assert(in.getPos() == pos) : "Link unable to seek to the right position=" + pos;
    if (LOG.isTraceEnabled()) {
      if (currentPath != null) {
        LOG.debug("link open path=" + path);
      } else {
        LOG.trace("link switch from path=" + currentPath + " to path=" + path);
      }
    }
    currentPath = path;
    return(in);
  } catch (FileNotFoundException e) {
    // Try another file location
  }
}

So now we finally know how the stream is obtained: it tries up to three locations until it finds the one where the file actually is.
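
The same "try each location until one opens" pattern can be reproduced on a local file system with java.nio. This is only a sketch of the idea: FallbackOpen, firstExisting and demo are hypothetical names, and NoSuchFileException plays the role that FileNotFoundException plays in the Hadoop code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Illustrative only: local-file analogue of FileLinkInputStream.tryOpen.
class FallbackOpen {
    // Returns the first location that can be opened; fails if none of them exists.
    static Path firstExisting(List<Path> locations) {
        for (Path p : locations) {
            try (InputStream in = Files.newInputStream(p)) {
                return p; // opened successfully
            } catch (NoSuchFileException e) {
                // Try the next location
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        throw new IllegalStateException("file not found in any location: " + locations);
    }

    // Creates a temp directory where only the "archive" copy exists, then resolves the link.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("link-demo");
            Path origin = dir.resolve("origin");   // never created
            Path temp = dir.resolve("temp");       // never created
            Path archive = dir.resolve("archive"); // the only copy that exists
            Files.write(archive, new byte[] {1});
            return firstExisting(Arrays.asList(origin, temp, archive)).getFileName().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("opened: " + demo());
    }
}
```

Trying the locations in a fixed order is what lets a reader keep working after a file has been moved from its original directory to the archive.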

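That wraps up the StoreFile file format. The one regret is that I did not explain the HFileBlockIndex clearly.

Supplement: thanks to a reminder from the reader "東岸往事", one thing above was wrong. After a block is finished, all of its Bloom filter data is written out. As for the HFileBlockIndex, a second-level index block is written once the current index chunk reaches the default limit of 128*1024 bytes.

The details will be covered later when we get to queries. For now, a short account:

Looking at the classes that implement InlineBlockWriter reveals the following:

1. BlockIndexWriter: unless the file is being closed, nothing is written until the current chunk reaches the default 128*1024 bytes; one second-level index block is written each time that limit is reached.

The HFileBlockIndex has two levels; the index for meta blocks has only one level.

The second-level index, curInlineChunk: after each block is finished, an index key is added to it (a key between the last key of the previous block and the first key of this block).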

byte[] indexKey = comparator.calcIndexKey(lastKeyOfPreviousBlock, firstKeyInBlock);
curInlineChunk.add(firstKey, blockOffset, blockDataSize);

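The first-level index, rootChunk: each time a second-level index chunk is written out, the first key of that chunk is added to the root. So the second-level index actually contains every key that is in the first-level index.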

firstKey = curInlineChunk.getBlockKey(0);
rootChunk.add(firstKey, offset, onDiskSize, totalNumEntries);
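
To see how the two levels fit together, here is a rough sketch of a two-level index built the same way: entries accumulate in an inline chunk, and when the chunk is flushed its first key goes into the root. TwoLevelIndex is a made-up class, it flushes after a fixed number of entries instead of at a byte-size limit, and it uses a linear scan where a real reader would use binary search.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an in-memory two-level index, not HBase's BlockIndexWriter.
class TwoLevelIndex {
    static final class Entry {
        final String firstKey;
        final long blockOffset;

        Entry(String firstKey, long blockOffset) {
            this.firstKey = firstKey;
            this.blockOffset = blockOffset;
        }
    }

    private final int maxChunkEntries; // assumption: entry count stands in for the byte limit
    private final List<String> root = new ArrayList<>();        // first key of each leaf chunk
    private final List<List<Entry>> leaves = new ArrayList<>(); // flushed second-level chunks
    private List<Entry> curInlineChunk = new ArrayList<>();

    TwoLevelIndex(int maxChunkEntries) {
        this.maxChunkEntries = maxChunkEntries;
    }

    // Called once per finished data block.
    void addBlock(String firstKey, long blockOffset) {
        curInlineChunk.add(new Entry(firstKey, blockOffset));
        if (curInlineChunk.size() >= maxChunkEntries) {
            flushChunk();
        }
    }

    // Writing out a chunk adds its first key to the root, as rootChunk.add(firstKey, ...) does.
    private void flushChunk() {
        if (curInlineChunk.isEmpty()) {
            return;
        }
        root.add(curInlineChunk.get(0).firstKey);
        leaves.add(curInlineChunk);
        curInlineChunk = new ArrayList<>();
    }

    void close() {
        flushChunk();
    }

    int rootSize() {
        return root.size();
    }

    // Offset of the block that may hold key, or -1 if key sorts before every block.
    long lookup(String key) {
        int leafIdx = -1;
        for (int i = 0; i < root.size() && root.get(i).compareTo(key) <= 0; i++) {
            leafIdx = i;
        }
        if (leafIdx < 0) {
            return -1;
        }
        long offset = -1;
        for (Entry e : leaves.get(leafIdx)) {
            if (e.firstKey.compareTo(key) <= 0) {
                offset = e.blockOffset;
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        TwoLevelIndex index = new TwoLevelIndex(2);
        String[] keys = {"a", "c", "e", "g", "i"};
        for (int i = 0; i < keys.length; i++) {
            index.addBlock(keys[i], i * 100L);
        }
        index.close();
        System.out.println("root entries: " + index.rootSize());
        System.out.println("block for f at offset " + index.lookup("f"));
    }
}
```

Only the small root has to be in memory when the file is opened; the leaf chunks sit between the data blocks and are read on demand.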

2. CompoundBloomFilterWriter, that is, the Bloom filter, is written out whenever it holds data.

The correct diagram for HFile V2 should be the one below, but the one above looks a bit nicer, so I kept it.

Reposted from: https://www.cnblogs.com/cenyuhai/p/3722644.html
