RGW Bucket Shard Optimization

Published: 2023/12/14

1. Bucket Index Background

The bucket index is a critical data structure in RGW that stores a bucket's object index. By default, the entire index of a single bucket lives in one shard object (shard count 0, with the entries kept as OMAP keys in leveldb). As the number of objects in the bucket grows, that shard object keeps growing too, and once it gets too large it triggers a variety of problems.
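Each shard of a bucket index lives in its own RADOS object in the index pool, named after the bucket instance id (the id shown by `radosgw-admin bucket stats`, as in the reshard example later in this article). As a minimal sketch, the object names to inspect can be generated like this (the bucket id below is just the example id from section 4.2):

```python
def index_object_names(bucket_id, num_shards):
    """Return the RADOS object names holding a bucket's index.

    Unsharded buckets (shard count 0) keep every OMAP key in a single
    ".dir.<bucket_id>" object; sharded buckets append the shard number.
    """
    if num_shards == 0:
        return [".dir.%s" % bucket_id]
    return [".dir.%s.%d" % (bucket_id, shard) for shard in range(num_shards)]

# Each of these names can be fed to
#   rados -p default.rgw.buckets.index listomapkeys <name>
# to count the index entries held by each shard object.
print(index_object_names("0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1", 4))
```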

2. Problems and Failures

2.1 Symptom Description

  • Flapping OSDs when RGW buckets have millions of objects — possible causes:
  • The first issue: when an RGW bucket holds millions of objects, its bucket index shard RADOS objects become very large, with a huge number of OMAP keys stored in leveldb. Operations such as deep-scrub and bucket index listing then take a very long time to complete, which triggers OSD flapping. Without sharding the problem is worse, because a single RADOS index object holds all the OMAP keys. In other words: RGW index data is stored as OMAP in the leveldb on the OSD's node, and once a single bucket reaches millions of objects, deep-scrub and bucket-list operations consume enormous disk resources and the affected OSDs misbehave. Sharding splits a single bucket's index horizontally across multiple OSDs; without it, large buckets easily cause trouble.

  • The second issue: a heavy DELETE workload leaves large amounts of stale data in OMAP, which keeps triggering leveldb compaction. Compaction (data compaction, which is very hard on the disk) is single-threaded in leveldb and poorly suited to this workload; because the OSD is constantly compacting, its op threads easily hit the osd_op_thread_suicide_timeout, the OSD commits suicide, and the OSDs start flapping.

    Common problems include:

  • During scrub or deep-scrub of the index pool, an oversized shard object heavily taxes the underlying storage device, causing I/O request timeouts.
  • When a deep-scrub runs too long, requests get blocked; large numbers of HTTP requests then time out with 50x errors, impacting the availability of the whole RGW service.
  • When a failed disk or OSD needs recovery, recovering a huge shard object can exhaust a storage node's capacity, and OSD response timeouts can even snowball into a cluster-wide avalanche.
    2.2 Root Cause Analysis

    When the OMAP of the OSD hosting a bucket index grows too large, any abnormal crash of that OSD process turns into on-the-spot "firefighting": the OSD service must be restored as fast as possible.
    First determine the size of that OSD's OMAP. If it is too large, the OSD spends a huge amount of time and resources loading the leveldb data at startup and may fail to start at all (timing out and committing suicide).
    Such an OSD also needs a very large amount of memory to start (around 40 GB of physical memory; fall back on swap if that is not available), so be sure to reserve enough memory.



    3. Temporary Workarounds

    3.1 Disable cluster scrub and deep-scrub to improve stability

    $ ceph osd set noscrub
    $ ceph osd set nodeep-scrub

    3.2 Raise timeout parameters to reduce the chance of OSD suicide

    osd_op_thread_timeout = 90               # default is 15
    osd_op_thread_suicide_timeout = 2000     # default is 150

    If filestore op threads are hitting timeouts:

    filestore_op_thread_timeout = 180        # default is 60
    filestore_op_thread_suicide_timeout = 2000   # default is 180

    The same can be done for the recovery threads:

    osd_recovery_thread_timeout = 120        # default is 30
    osd_recovery_thread_suicide_timeout = 2000

    3.3 Manually compact the OMAP

    If the OSD can be stopped, run a compact operation on it. Ceph 0.94.6 or later is recommended; versions below that have a bug. https://github.com/ceph/ceph/pull/7645/files

    A third temporary step can be taken if OSDs have very large OMAP directories (verify with: du -sh /var/lib/ceph/osd/ceph-$id/current/omap): do a manual leveldb compaction for those OSDs, via one of:

  • ceph tell osd.$id compact, or
  • ceph daemon osd.$id compact, or
  • add leveldb_compact_on_mount = true to the [osd.$id] or [osd] section and restart the OSD.

    This makes sure leveldb is compacted before the OSD is brought back up/in, which really helps.

    # Set noout
    $ ceph osd set noout
    # Stop the OSD service
    $ systemctl stop ceph-osd@<osd-id>
    # In ceph.conf, add the following under the matching [osd.id] section
    leveldb_compact_on_mount = true
    # Start the OSD service
    $ systemctl start ceph-osd@<osd-id>
    # Watch progress with `ceph -s`, ideally while also tailing the OSD's log
    # with tail -f. Wait until all PGs are active+clean before continuing.
    $ ceph -s
    # After the compact finishes, confirm the OMAP size:
    $ du -sh /var/lib/ceph/osd/ceph-$id/current/omap
    # Remove the temporary leveldb_compact_on_mount setting from ceph.conf
    # Unset noout (situational; in production it is advisable to keep noout set):
    $ ceph osd unset noout

    4. Permanent Solutions

    4.1 Plan bucket shards in advance

    • Put the index pool on SSDs. This is the precondition for every optimization in this article; without the hardware to back it, the steps below are pointless.

    • Set a sensible shard count per bucket.
      More shards is not always better: too many shards make operations like bucket listing hit a large amount of underlying storage I/O, so some requests take too long.
      The shard count must also account for your OSD failure domains and replica count. For example, suppose the index pool has size 2 and there are 2 racks with 24 OSDs in total. Ideally the 2 replicas of each shard should land in different racks; with 8 shards there are 8*2=16 shard copies to store, and those 16 should spread evenly across the 2 racks. Likewise, going above 24 shards is clearly inappropriate here.

    • Keep the average size of each bucket index shard under control. The current recommendation is 100k–150k object entries per shard; beyond that, the bucket needs a separate reshard operation (note: this is a high-risk operation, use it with caution). For example, if you expect a single bucket to hold at most 1,000,000 objects, then 1,000,000/8 = 125,000, so 8 shards is reasonable. Each OMAP key entry in a shard object takes roughly 200 bytes, so 150000*200/1024/1024 ≈ 28.61 MB; in other words, keep each shard object under about 28 MB.

    • At the application level, cap the number of objects per bucket, aiming for an average of 100k–150k objects per shard.
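    The sizing arithmetic above can be wrapped in a small helper. This is a minimal sketch under the article's assumptions (≈200 bytes per OMAP entry, 100k–150k entries per shard); the function names are ours, not part of any Ceph tooling:

```python
import math

BYTES_PER_OMAP_ENTRY = 200       # rough per-entry size assumed above
MAX_ENTRIES_PER_SHARD = 150000   # recommended upper bound per shard

def recommended_shards(expected_objects, entries_per_shard=125000):
    """Shard count so each shard stays near the recommended entry count."""
    return max(1, int(math.ceil(expected_objects / float(entries_per_shard))))

def shard_size_mb(entries_per_shard):
    """Approximate on-disk size of one shard's OMAP data, in MB."""
    return entries_per_shard * BYTES_PER_OMAP_ENTRY / 1024.0 / 1024.0

# 1,000,000 objects / 125,000 entries per shard -> 8 shards, as in the text
print(recommended_shards(1000000))                     # -> 8
print(round(shard_size_mb(MAX_ENTRIES_PER_SHARD), 2))  # -> 28.61
```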

    4.1.1 Configuring Bucket Index Sharding

    To enable and configure bucket index sharding on all new buckets (reference: redhat-bucket_sharding), use:

    • the rgw_override_bucket_index_max_shards setting for simple configurations,
    • the bucket_index_max_shards setting for federated configurations

    Simple configurations:

    # 1. Set the parameter in the configuration file.
    #    Note that the maximum number of shards is 7877.
    [global]
    rgw_override_bucket_index_max_shards = 10

    # 2. Restart the rgw service for it to take effect
    systemctl restart ceph-radosgw.target

    # 3. Check the number of bucket index shard objects
    rados -p default.rgw.buckets.index ls | wc -l
    1000

    Federated configurations
    In federated configurations, each zone can have a different index_pool setting to manage failover. To configure a consistent shard count for zones in one region, set the bucket_index_max_shards setting in the configuration for that region. To do so:

    # 1. Extract the region configuration to the region.json file:
    $ radosgw-admin region get > region.json
    # 2. In the region.json file, set the bucket_index_max_shards setting for each named zone.
    # 3. Reset the region:
    $ radosgw-admin region set < region.json
    # 4. Update the region map:
    $ radosgw-admin regionmap update --name <name>
    # 5. Replace <name> with the name of the Ceph Object Gateway user, for example:
    $ radosgw-admin regionmap update --name client.rgw.ceph-client
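    Step 2 above edits region.json by hand; the same edit can also be scripted. A minimal sketch using Python's json module (the zone names, region fragment, and shard count below are hypothetical examples, not output from a real cluster):

```python
import json

def set_region_shards(region, num_shards):
    """Set bucket_index_max_shards for every zone in a region config dict."""
    for zone in region.get("zones", []):
        zone["bucket_index_max_shards"] = num_shards
    return region

# Example region fragment; a real one comes from `radosgw-admin region get > region.json`
region = {"name": "default", "zones": [{"name": "zone-a"}, {"name": "zone-b"}]}
region = set_region_shards(region, 10)
print(json.dumps(region, indent=2))
# Write the result to region.json and apply it with `radosgw-admin region set < region.json`
```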

    File upload demo:

    # -*- coding: utf-8 -*-
    # Python 2 script: needs python-boto (yum install python-boto)
    # and filechunkio (pip install filechunkio)
    import boto
    import boto.s3.connection
    from filechunkio import FileChunkIO
    import math
    import threading
    import os
    import Queue


    class Chunk(object):
        num = 0
        offset = 0
        len = 0

        def __init__(self, n, o, l):
            self.num = n
            self.offset = o
            self.length = l


    class CONNECTION(object):
        # chunk size must be at least 8M, otherwise multipart upload fails
        def __init__(self, access_key, secret_key, ip, port, is_secure=False, chrunksize=8 << 20):
            self.conn = boto.connect_s3(
                aws_access_key_id=access_key,
                aws_secret_access_key=secret_key,
                host=ip, port=port,
                is_secure=is_secure,
                calling_format=boto.s3.connection.OrdinaryCallingFormat()
            )
            self.chrunksize = chrunksize
            self.port = port

        # list every bucket and its contents
        def list_all(self):
            all_buckets = self.conn.get_all_buckets()
            for bucket in all_buckets:
                print u'Bucket name: %s' % (bucket.name)
                for key in bucket.list():
                    print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (
                        key.mode, key.owner.id, key.size,
                        key.last_modified.split('.')[0], key.name)

        def list_single(self, bucket_name):
            try:
                single_bucket = self.conn.get_bucket(bucket_name)
            except Exception:
                print 'bucket %s is not exist' % bucket_name
                return
            print u'Bucket name: %s' % (single_bucket.name)
            for key in single_bucket.list():
                print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (
                    key.mode, key.owner.id, key.size,
                    key.last_modified.split('.')[0], key.name)

        # simple download, for files <= 8M
        def dowload_file(self, filepath, key_name, bucket_name):
            all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
            if bucket_name not in all_bucket_name_list:
                print 'Bucket %s is not exist,please try again' % (bucket_name)
                return
            bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name not in all_key_name_list:
                print 'File %s is not exist,please try again' % (key_name)
                return
            key = bucket.get_key(key_name)
            if not os.path.exists(os.path.dirname(filepath)):
                print 'Filepath %s is not exists, sure to create and try again' % (filepath)
                return
            if os.path.exists(filepath):
                while True:
                    d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                    if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                        continue
                    elif d_tag == 'Y':
                        os.remove(filepath)
                        break
                    elif d_tag == 'N':
                        return
            os.mknod(filepath)
            try:
                key.get_contents_to_filename(filepath)
            except Exception:
                pass

        # simple upload, for files <= 8M
        def upload_file(self, filepath, key_name, bucket_name):
            try:
                bucket = self.conn.get_bucket(bucket_name)
            except Exception:
                print 'bucket %s is not exist' % bucket_name
                tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
                while tag not in ['Y', 'N']:
                    tag = raw_input('Please input (Y/N)').strip()
                if tag == 'N':
                    return
                elif tag == 'Y':
                    self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name in all_key_name_list:
                while True:
                    f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                    if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                        continue
                    elif f_tag == 'Y':
                        break
                    elif f_tag == 'N':
                        return
            key = bucket.new_key(key_name)
            if not os.path.exists(filepath):
                print 'File %s does not exist, please make sure you want to upload file path and try again' % (key_name)
                return
            try:
                f = file(filepath, 'rb')
                data = f.read()
                key.set_contents_from_string(data)
            except Exception:
                pass

        def delete_file(self, key_name, bucket_name):
            all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
            if bucket_name not in all_bucket_name_list:
                print 'Bucket %s is not exist,please try again' % (bucket_name)
                return
            bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name not in all_key_name_list:
                print 'File %s is not exist,please try again' % (key_name)
                return
            key = bucket.get_key(key_name)
            try:
                bucket.delete_key(key.name)
            except Exception:
                pass

        def delete_bucket(self, bucket_name):
            all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
            if bucket_name not in all_bucket_name_list:
                print 'Bucket %s is not exist,please try again' % (bucket_name)
                return
            bucket = self.conn.get_bucket(bucket_name)
            try:
                self.conn.delete_bucket(bucket.name)
            except Exception:
                pass

        # build the chunk queue (8 << 20 == 8 * 2**20)
        def init_queue(self, filesize, chunksize):
            chunkcnt = int(math.ceil(filesize * 1.0 / chunksize))
            q = Queue.Queue(maxsize=chunkcnt)
            for i in range(0, chunkcnt):
                offset = chunksize * i
                length = min(chunksize, filesize - offset)
                c = Chunk(i + 1, offset, length)
                q.put(c)
            return q

        # multipart upload worker: upload one chunk at a time from the queue
        def upload_trunk(self, filepath, mp, q, id):
            while not q.empty():
                chunk = q.get()
                fp = FileChunkIO(filepath, 'r', offset=chunk.offset, bytes=chunk.length)
                mp.upload_part_from_file(fp, part_num=chunk.num)
                fp.close()
                q.task_done()

        # stat file size --> create S3 multipart upload --> build the chunk
        # queue --> upload the chunks from worker threads
        def upload_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
            filesize = os.stat(filepath).st_size
            try:
                bucket = self.conn.get_bucket(bucket_name)
            except Exception:
                print 'bucket %s is not exist' % bucket_name
                tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
                while tag not in ['Y', 'N']:
                    tag = raw_input('Please input (Y/N)').strip()
                if tag == 'N':
                    return
                elif tag == 'Y':
                    self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name in all_key_name_list:
                while True:
                    f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                    if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                        continue
                    elif f_tag == 'Y':
                        break
                    elif f_tag == 'N':
                        return
            mp = bucket.initiate_multipart_upload(key_name)
            q = self.init_queue(filesize, self.chrunksize)
            for i in range(0, threadcnt):
                t = threading.Thread(target=self.upload_trunk, args=(filepath, mp, q, i))
                t.setDaemon(True)
                t.start()
            q.join()
            mp.complete_upload()

        # multipart download worker: fetch one byte range per chunk
        def download_chrunk(self, filepath, key_name, bucket_name, q, id):
            while not q.empty():
                chrunk = q.get()
                offset = chrunk.offset
                length = chrunk.length
                bucket = self.conn.get_bucket(bucket_name)
                resp = bucket.connection.make_request(
                    'GET', bucket_name, key_name,
                    headers={'Range': "bytes=%d-%d" % (offset, offset + length)})
                data = resp.read(length)
                fp = FileChunkIO(filepath, 'r+', offset=chrunk.offset, bytes=chrunk.length)
                fp.write(data)
                fp.close()
                q.task_done()

        def download_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
            all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
            if bucket_name not in all_bucket_name_list:
                print 'Bucket %s is not exist,please try again' % (bucket_name)
                return
            bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name not in all_key_name_list:
                print 'File %s is not exist,please try again' % (key_name)
                return
            key = bucket.get_key(key_name)
            if not os.path.exists(os.path.dirname(filepath)):
                print 'Filepath %s is not exists, sure to create and try again' % (filepath)
                return
            if os.path.exists(filepath):
                while True:
                    d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                    if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                        continue
                    elif d_tag == 'Y':
                        os.remove(filepath)
                        break
                    elif d_tag == 'N':
                        return
            os.mknod(filepath)
            filesize = key.size
            q = self.init_queue(filesize, self.chrunksize)
            for i in range(0, threadcnt):
                t = threading.Thread(target=self.download_chrunk, args=(filepath, key_name, bucket_name, q, i))
                t.setDaemon(True)
                t.start()
            q.join()

        def generate_object_download_urls(self, key_name, bucket_name, valid_time=0):
            all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
            if bucket_name not in all_bucket_name_list:
                print 'Bucket %s is not exist,please try again' % (bucket_name)
                return
            bucket = self.conn.get_bucket(bucket_name)
            all_key_name_list = [i.name for i in bucket.get_all_keys()]
            if key_name not in all_key_name_list:
                print 'File %s is not exist,please try again' % (key_name)
                return
            key = bucket.get_key(key_name)
            try:
                key.set_canned_acl('public-read')
                download_url = key.generate_url(valid_time, query_auth=False, force_http=True)
                if self.port != 80:
                    # splice the non-standard port back into the URL
                    x1 = download_url.split('/')[0:3]
                    x2 = download_url.split('/')[3:]
                    s1 = u'/'.join(x1)
                    s2 = u'/'.join(x2)
                    s3 = ':%s/' % (str(self.port))
                    download_url = s1 + s3 + s2
                print download_url
            except Exception:
                pass


    if __name__ == '__main__':
        # Conventions:
        # 1: filepath is the absolute local path (upload source or download target)
        # 2: bucket_name acts as the directory/index name in object storage
        # 3: key_name is the file name/index of the object in object storage
        access_key = "FYT71CYU3UQKVMC8YYVY"
        secret_key = "rVEASbWAytjVLv1G8Ta8060lY3yrcdPTsEL0rfwr"
        ip = '127.0.0.1'
        port = 7480
        conn = CONNECTION(access_key, secret_key, ip, port)

        # List all buckets and the files they contain
        #conn.list_all()

        # Simple upload, for files <= 8M
        #conn.upload_file('/etc/passwd','passwd','test_bucket01')
        conn.upload_file('/tmp/test.log', 'test1', 'test_bucket12')

        # List the files in a single bucket
        conn.list_single('test_bucket12')

        # Simple download, for files <= 8M
        # conn.dowload_file('/lhf_test/test01','passwd','test_bucket01')
        # conn.list_single('test_bucket01')

        # Delete a file
        # conn.delete_file('passwd','test_bucket01')
        # conn.list_single('test_bucket01')

        # Delete a bucket
        # conn.delete_bucket('test_bucket01')
        # conn.list_all()

        # Multipart upload (multithreaded), for files > 8M; the chunk size is
        # adjustable but must not go below 8M, or a chunk-too-small error is raised
        # conn.upload_file_multipart('/etc/passwd','passwd_multi_upload','test_bucket01')
        # conn.list_single('test_bucket01')

        # Multipart download (multithreaded), for files > 8M; same 8M lower bound
        # conn.download_file_multipart('/lhf_test/passwd_multi_dowload','passwd_multi_upload','test_bucket01')

        # Generate a public download URL
        #conn.generate_object_download_urls('passwd_multi_upload','test_bucket01')
        #conn.list_all()

    4.2 Reshard an existing bucket

    To reshard a bucket's index (reference: redhat-bucket_sharding):

    # NOTE: make sure all operations against this bucket have been completely
    # stopped first, then back up the bucket index with:
    $ radosgw-admin bi list --bucket=<bucket_name> > <bucket_name>.list.backup

    # If needed, the index can be restored from the backup with:
    $ radosgw-admin bi put --bucket=<bucket_name> < <bucket_name>.list.backup

    # Check the bucket's index id
    $ radosgw-admin bucket stats --bucket=bucket-maillist
    {
        "bucket": "bucket-maillist",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",    <-- note this id
        "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
        "owner": "user",
        "ver": "0#1,1#1",
        "master_ver": "0#0,1#0",
        "mtime": "2017-08-23 13:42:59.007081",
        "max_marker": "0#,1#",
        "usage": {},
        "bucket_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        }
    }

    # Reshard the bucket's index. The command below adjusts "bucket-maillist"
    # to 4 shards; note that it prints the old and new bucket instance ids:
    $ radosgw-admin bucket reshard --bucket="bucket-maillist" --num-shards=4
    *** NOTICE: operation will not remove old bucket index objects ***
    *** these will need to be removed manually ***
    old bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1
    new bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1
    total entries: 3

    # Then purge the old instance id:
    $ radosgw-admin bi purge --bucket="bucket-maillist" --bucket-id=0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1

    # Check the final result
    $ radosgw-admin bucket stats --bucket=bucket-maillist
    {
        "bucket": "bucket-maillist",
        "pool": "default.rgw.buckets.data",
        "index_pool": "default.rgw.buckets.index",
        "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1",    <-- the id has changed
        "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
        "owner": "user",
        "ver": "0#2,1#1,2#1,3#2",
        "master_ver": "0#0,1#0,2#0,3#0",
        "mtime": "2017-08-23 14:02:19.961205",
        "max_marker": "0#,1#,2#,3#",
        "usage": {
            "rgw.main": {
                "size_kb": 50,
                "size_kb_actual": 60,
                "num_objects": 3
            }
        },
        "bucket_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        }
    }
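    After a reshard, RGW spreads index entries across the shard objects by hashing each object name modulo the shard count. The sketch below only illustrates that kind of distribution: it uses Python's hashlib rather than RGW's actual internal hash function, so the per-shard counts will not match what RGW produces on a real cluster:

```python
import hashlib
from collections import Counter

def shard_of(key_name, num_shards):
    """Illustrative shard placement: hash the key name, mod the shard count.
    (RGW uses its own hash internally; this is only for intuition.)"""
    digest = hashlib.md5(key_name.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Distribute 10,000 synthetic keys over 4 shards, as in the reshard example;
# a reasonable hash spreads the entries roughly evenly.
counts = Counter(shard_of("obj-%d" % i, 4) for i in range(10000))
for shard, n in sorted(counts.items()):
    print("shard %d: %d entries" % (shard, n))
```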
