PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意
標(biāo)簽
PostgreSQL , Linux , huge page , shared buffer , page table , 虛擬地址 , 物理地址 , 內(nèi)存地址轉(zhuǎn)換表
背景
當(dāng)內(nèi)存很大時(shí),除了刷臟頁的調(diào)度可能需要優(yōu)化,還有一方面是虛擬內(nèi)存與物理內(nèi)存映射表相關(guān)的部分需要優(yōu)化。
1 臟頁調(diào)度優(yōu)化
1、主要包括,調(diào)整后臺(tái)進(jìn)程刷臟頁的閾值、喚醒間隔、以及老化閾值。(臟頁大于多少時(shí)開始刷、多久探測一次有多少臟頁、刷時(shí)多老的臟頁刷出。)。
vm.dirty_background_bytes = 4096000000 vm.dirty_background_ratio = 0 vm.dirty_expire_centisecs = 6000 vm.dirty_writeback_centisecs = 1002、用戶進(jìn)程刷臟頁調(diào)度,當(dāng)臟頁大于多少時(shí),用戶如果要申請內(nèi)存,需要協(xié)助刷臟頁。
vm.dirty_bytes = 0 vm.dirty_ratio = 80《DBA不可不知的操作系統(tǒng)內(nèi)核參數(shù)》
2 內(nèi)存表映射優(yōu)化
這部分主要是因?yàn)樘摂M內(nèi)存管理,Linux需要維護(hù)虛擬內(nèi)存地址與物理內(nèi)存的映射關(guān)系,為了提升轉(zhuǎn)換性能,最好這部分能夠cache在cpu的cache里面。頁越大,映射表就越小。使用huge page可以減少頁表大小。
默認(rèn)頁大小可以這樣獲取,
# getconf PAGESIZE 4096https://en.wikipedia.org/wiki/Page_table
另一個(gè)使用HUGE PAGE的原因,HUGE PAGE是常駐內(nèi)存的,不會(huì)被交換出去,這也是重度依賴內(nèi)存的應(yīng)用(包括數(shù)據(jù)庫)非常喜歡的。
In a virtual memory system, the tables store the mappings between virtual addresses and physical addresses. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), which is the cache of page tables on a CPU that speeds up the translation of virtual addresses to physical addresses. Enabling the HugePages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. The hugetbl entries mean that the TLB entries can cover a larger address space, requiring many fewer entries to map the SGA, and releasing entries that can map other portions of the address space.
With HugePages enabled, the system uses fewer page tables, reducing the overhead for maintaining and accessing them. Huges pages remain pinned in memory and are not replaced, so the kernel swap daemon has no work to do in managing them, and the kernel does not need to perform page table lookups for them. The smaller number of pages reduces the overhead involved in performing memory operations, and also reduces the likelihood of a bottleneck when accessing page tables.
PostgreSQL HugePage使用建議
1、查看Linux huage page頁大小
# grep Hugepage /proc/meminfo Hugepagesize: 2048 kB2、準(zhǔn)備設(shè)置多大的shared buffer參數(shù),假設(shè)我們的內(nèi)存有512GB,想設(shè)置128GB的SHARED BUFFER。
vi postgresql.conf shared_buffers='128GB'3、計(jì)算需要多少huge page
128GB/2MB=655354、設(shè)置Linux huge page頁數(shù)
sysctl -w vm.nr_hugepages=675375、設(shè)置使用huge page。
vi $PGDATA/postgresql.conf huge_pages = on # on, off, or try # 設(shè)置為try的話,會(huì)先嘗試huge page,如果啟動(dòng)時(shí)無法鎖定給定數(shù)目的大頁,則不會(huì)使用huge page6、啟動(dòng)數(shù)據(jù)庫
pg_ctl start7、查看當(dāng)前使用了多少huge page
cat /proc/meminfo |grep -i huge AnonHugePages: 6144 kB HugePages_Total: 67537 ## 設(shè)置的HUGE PAGE HugePages_Free: 66117 ## 這個(gè)是當(dāng)前剩余的,但是實(shí)際上真正可用的并沒有這么多,因?yàn)楸籔G鎖定了65708個(gè)大頁 HugePages_Rsvd: 65708 ## 啟動(dòng)PG時(shí)申請的HUGE PAGE HugePages_Surp: 0 Hugepagesize: 2048 kB ## 當(dāng)前大頁2M8、執(zhí)行一些查詢,可以看到Free會(huì)變小。被PG使用掉了。
cat /proc/meminfo |grep -i huge AnonHugePages: 6144 kB HugePages_Total: 67537 HugePages_Free: 57482 HugePages_Rsvd: 57073 HugePages_Surp: 0 Hugepagesize: 2048 kBOracle HugePage使用建議
Oracle也是重度內(nèi)存使用應(yīng)用,當(dāng)SGA配置較大時(shí),同樣建議設(shè)置HUGEPAGE。
Oracle 建議當(dāng)SGA大于或等于8GB時(shí),使用huge page。
10.1 About HugePages
The HugePages feature enables the Linux kernel to manage large pages of memory in addition to the standard 4KB (on x86 and x86_64) or 16KB (on IA64) page size. If you have a system with more than 16GB of memory running Oracle databases with a total System Global Area (SGA) larger than 8GB, you should enable the HugePages feature to improve database performance.
Note
The Automatic Memory Management (AMM) and HugePages features are not compatible in Oracle Database 11g and later. You must disable AMM to be able to use HugePages.
The memory allocated to huge pages is pinned to primary storage, and is never paged nor swapped to secondary storage. You reserve memory for huge pages during system startup, and this memory remains allocated until you change the configuration.
In a virtual memory system, the tables store the mappings between virtual addresses and physical addresses. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), which is the cache of page tables on a CPU that speeds up the translation of virtual addresses to physical addresses. Enabling the HugePages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. The hugetbl entries mean that the TLB entries can cover a larger address space, requiring many fewer entries to map the SGA, and releasing entries that can map other portions of the address space.
With HugePages enabled, the system uses fewer page tables, reducing the overhead for maintaining and accessing them. Huges pages remain pinned in memory and are not replaced, so the kernel swap daemon has no work to do in managing them, and the kernel does not need to perform page table lookups for them. The smaller number of pages reduces the overhead involved in performing memory operations, and also reduces the likelihood of a bottleneck when accessing page tables.
Huge pages are 4MB in size on x86, 2MB on x86_64, and 256MB on IA64.
https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394
https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html
測試對(duì)比是否使用HugePage
設(shè)計(jì)test case
創(chuàng)建10240個(gè)表,使用merge insert寫入200億數(shù)據(jù)。
1、建表
do language plpgsql $$ declare begin execute 'drop table if exists test'; execute 'create table test(id int8 primary key, info text, crt_time timestamp)'; for i in 0..10239 loop execute format('drop table if exists test%s', i); execute format('create table test%s (like test including all)', i); end loop; end; $$;2、創(chuàng)建動(dòng)態(tài)寫入函數(shù),第一種不使用綁定變量
create or replace function dyn_pre(int8) returns void as $$ declare suffix int8 := mod($1,10240); begin execute format('insert into test%s values(%s, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, $1); end; $$ language plpgsql strict;3、使用綁定變量,性能更好。
create or replace function dyn_pre(int8) returns void as $$ declare suffix int8 := mod($1,10240); begin execute format('execute p%s(%s)', suffix, $1); exception when others then execute format('prepare p%s(int8) as insert into test%s values($1, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix); execute format('execute p%s(%s)', suffix, $1); end; $$ language plpgsql strict;4、創(chuàng)建壓測腳本
vi test.sql \set id random(1,20000000000) select dyn_pre(:id);5、寫入性能壓測
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 12000006、多長連接壓測,PAGE TABLE觀察
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000 pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 12000001 使用HugePage
1、小量連接寫入性能
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 56 number of threads: 56 duration: 120 s number of transactions actually processed: 17122345 latency average = 0.392 ms latency stddev = 0.251 ms tps = 142657.055512 (including connections establishing) tps = 142687.784245 (excluding connections establishing) script statistics: - statement latencies in milliseconds: 0.002 \set id random(1,20000000000) 0.390 select dyn_pre(:id);2、1900個(gè)長連接,PAGE TABLE大小(由于是虛擬、物理內(nèi)存映射關(guān)系表。所以耗費(fèi)取決于連接數(shù),以及每個(gè)連接相關(guān)聯(lián)的SHARED BUFFER以及會(huì)話自己的relcache, SYSCACHE)
cat /proc/meminfo |grep -i table Unevictable: 0 kB PageTables: 578612 kB ## shared buffer使用了huge page,這塊省了很多。 NFS_Unstable: 0 kB2 未使用HugePage
sysctl -w vm.nr_hugepages=01、小量連接的寫入性能
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 56 number of threads: 56 duration: 120 s number of transactions actually processed: 18484181 latency average = 0.364 ms latency stddev = 0.212 ms tps = 153887.936028 (including connections establishing) tps = 153905.968799 (excluding connections establishing) script statistics: - statement latencies in milliseconds: 0.002 \set id random(1,20000000000) 0.362 select dyn_pre(:id);小量連接未使用HUGE PAGE性能比使用huge page更好,猜測可能是huge page使用了類似兩級(jí)轉(zhuǎn)換(因?yàn)?MB為單個(gè)目標(biāo)的映射,并不能精準(zhǔn)定位到默認(rèn)8K的數(shù)據(jù)頁的物理內(nèi)存位置。可能和數(shù)據(jù)庫的索引bitmap scan道理類似,bitmap scan告訴你數(shù)據(jù)在哪個(gè)PAGE內(nèi),而不是直接告訴你數(shù)據(jù)在哪個(gè)PAGE的第幾條記錄上。),導(dǎo)致了一定的損耗。
2、1900個(gè)長連接,PAGE TABLE大小(由于是虛擬、物理內(nèi)存映射關(guān)系表。所以耗費(fèi)取決于連接數(shù),以及每個(gè)連接相關(guān)聯(lián)的SHARED BUFFER以及會(huì)話自己的relcache, SYSCACHE)
cat /proc/meminfo |grep -i table Unevictable: 0 kB PageTables: 10956556 kB ## 不一會(huì)就增長到了10GB,因?yàn)槊總€(gè)連接都在TOUCH shared buffer內(nèi)的數(shù)據(jù),可能導(dǎo)致映射表很大。連接越多。TOUCH shared buffer內(nèi)數(shù)據(jù)越多越明顯 # PageTables 還在不斷增長 NFS_Unstable: 0 kBCentOS 7u 配置大頁例子
1、修改/boot/grub2/grub.cfg
定位到第一個(gè)menuentry 'CentOS Linux',在linux16 /vmlinuz最后面添加如下:
說明(關(guān)閉透明大頁,使用默認(rèn)的2MB大頁,你也可以選擇用1G的大頁,但是在此之前應(yīng)該先到系統(tǒng)中判斷支持哪些大頁規(guī)格. 查看/proc/cpuinfo里面的FLAG?Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag).?,設(shè)置啟動(dòng)時(shí)創(chuàng)建1536個(gè)大頁(這部分內(nèi)存會(huì)被保留,所以一定要注意設(shè)置合適的大小,建議在LINUX啟動(dòng)后通過sysctl來設(shè)置)。)
numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536transparent_hugepage=never表示關(guān)閉透明大頁,以免不必要的麻煩。透明大頁這個(gè)特性應(yīng)該還不太成熟。
hugepagesz 表示頁面大小,2M和1G選其一,默認(rèn)為2M。
hugepages 表示大頁面數(shù)
總共大頁面內(nèi)存量為hugepagesz * hugepages,這里為3G
例子:
menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-d8179b22-8b44-4552-bf2a-04bae2a5f5dd' { load_video set gfxpayload=keep insmod gzio insmod part_msdos insmod xfs set root='hd0,msdos1' if [ x$feature_platform_search_hint = xy ]; then search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1' 34f87a8d-8b73-4f80-b0ff-8d49b17975ca else search --no-floppy --fs-uuid --set=root 34f87a8d-8b73-4f80-b0ff-8d49b17975ca fi linux16 /vmlinuz-3.10.0-693.5.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/root rhgb quiet LANG=en_US.UTF-8 numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536 initrd16 /initramfs-3.10.0-693.5.2.el7.x86_64.img }重啟系統(tǒng)(如果你不想重啟系統(tǒng)而使用HUGE PAGE,使用這種方法即可sysctl -w vm.nr_hugepages=1536?)
但是修改默認(rèn)的大頁規(guī)格(2M or 1G)則一定要重啟,例如:
numa=off transparent_hugepage=never default_hugepagesz=1G hugepagesz=2M hugepagesz=1G 重啟后就會(huì)變這樣 cat /proc/meminfo |grep -i huge AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB申請132GB大頁
sysctl -w vm.nr_hugepages=132 vm.nr_hugepages = 132重啟后可以使用grep Huge /proc/meminfo查看配置情況。看到下面的數(shù)據(jù)表示已經(jīng)生效
HugePages_Total: 1536 HugePages_Free: 1499 HugePages_Rsvd: 1024 HugePages_Surp: 0 Hugepagesize: 2048 kB數(shù)據(jù)庫配置(如果你想好了非大頁不可,就設(shè)置huge_pages為on,否則設(shè)置為try。on的情況下如果HUGE PAGE不夠,則啟動(dòng)會(huì)報(bào)錯(cuò)。TRY的話,大頁不夠就嘗試申請普通頁的方式啟動(dòng)。)
postgresql.conf huge_pages = on shared_buffers = 2GB # 使用2G內(nèi)存,這個(gè)值需要小于總共大頁面內(nèi)存量注意
如果postgresql.conf配置huge_pages=on時(shí),且shared_buffers值等于huge_page總內(nèi)存量(hugepagesz*hugepages)時(shí),數(shù)據(jù)庫無法啟動(dòng),報(bào)如下錯(cuò)誤:
This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space.解決辦法shared_buffers值要小于huge_page總內(nèi)存量
libhugetlbfs
安裝libhugetlbfs可以觀察大頁的統(tǒng)計(jì)信息,分配大頁文件系統(tǒng),例如你想把數(shù)據(jù)放到內(nèi)存中持久化測試。
https://lwn.net/Articles/376606/
yum install -y libhugetlbfs*小結(jié)
1、查看、修改Linux目前支持的大頁大小。
https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize
2、如果連接數(shù)較少時(shí),使用HUGE PAGE性能不如不使用(猜測可能是huge page使用了類似兩級(jí)轉(zhuǎn)換,導(dǎo)致了一定的損耗。)。因此我們可以盡量使用連接池,減少連接數(shù),提升性能。
3、能使用連接池的話,盡量使用連接池,減少連接到數(shù)據(jù)庫的連接數(shù)。因?yàn)镻G與Oracle一樣是進(jìn)程模型,連接越多則進(jìn)程越多,大內(nèi)存需要注意一些問題:
3.1 上下文切換,MEM COPY的開銷。
3.2 PAGE TABLE增大,內(nèi)存使用增加。
PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.3.3 每個(gè)會(huì)話要緩存syscache, relcache等信息,如果訪問的對(duì)象很多,會(huì)導(dǎo)致內(nèi)存使用爆增。(這個(gè)屬于邏輯層面內(nèi)存使用放大, 如果訪問對(duì)象不多或者訪問過好多對(duì)象的長連接不多的話,問題不明顯)
《PostgreSQL relcache在長連接應(yīng)用中的內(nèi)存霸占"坑"》
這個(gè)很容易模擬,使用本例的壓測CASE,增加表的數(shù)目,增加表的字段數(shù),每個(gè)連接的relcache就會(huì)增加。
4、如果不能使用連接池,連接數(shù)非常多,并且都是長連接(訪問較多的對(duì)象、shared buffer中的數(shù)據(jù)時(shí))。那么當(dāng)shared buffer非常大時(shí),需要考慮使用huge page。這樣page tables會(huì)比較小。如果無法使用HugePage,那么建議設(shè)置較小的shared_buffer。
5、進(jìn)程自己的內(nèi)存使用,PAGE TABLE不會(huì)有放大效果,因?yàn)橹皇亲约菏褂谩K詗ork_mem, maintenance_work_mem的使用,不大會(huì)引起PAGE TABLE過大的問題。
通過觀察/proc/meminfo來查看PageTable的占用,判斷是否需要啟用大頁或降低shared_buffer或者連接數(shù)。
參考
https://en.wikipedia.org/wiki/Page_table
https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394
https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html
https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize
《PostgreSQL on Linux 最佳部署手冊》
《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(適合新用戶)》
https://blog.dbi-services.com/configuring-huge-pages-for-your-postgresql-instance-redhatcentos-version/
https://momjian.us/main/writings/pgsql/hw_performance/
https://www.kernel.org/doc/gorman/html/understand/understand006.html
https://wiki.osdev.org/Page_Tables
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Memory-Configuring-huge-pages
總結(jié)
以上是生活随笔為你收集整理的PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: PostSQL | Debug记录
- 下一篇: 【Redis】Redis中使用Lua脚本