Linux BPF 1.1. BPF kernel implementation
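As the name Berkeley Packet Filter suggests, BPF originated in packet filtering. If you do not yet have an intuitive feel for BPF, it is worth reading the reference documents first: "3.1 Berkeley Packet Filter (BPF) (Kernel Document)" and "3.2 BPF and XDP Reference Guide".
In essence, it is a technique for injecting code into the kernel:
- The kernel implements a cBPF/eBPF virtual machine;
- User space writes the code to be run in C, and the Clang/LLVM compiler compiles that C code into BPF object code;
- User space injects the BPF object code into the kernel through the bpf() system call;
- The kernel uses a JIT (Just-In-Time) compiler to convert the BPF object code into native instructions; if the current architecture has no JIT support, the kernel falls back to an interpreter, which runs less efficiently;
- The kernel provides a set of hooks in packet filtering, tracing, and similar subsystems to run BPF code. The following types of BPF code are currently supported: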
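What is the benefit of BPF? It offers a way to change the kernel's processing policy flexibly without modifying kernel code.
This is very useful where rules change frequently, as in packet filtering and system tracing. If the policy is changed only in user space, all data must be copied to user space, which is expensive. If the policy is changed in kernel space, the kernel code has to be modified and rebuilt, which easily introduces security problems. BPF, as a kernel code injection technique, exists because it strikes a balance between the two.
SystemTap grew by solving the same problem. It injects kernel code through kernel modules (.ko), which is somewhat clumsy but solves a real need.
How SystemTap works: it translates script statements into C and compiles them into a kernel module. Once the module is loaded, every probed event is attached to the kernel as a Kprobe hook, and when an event fires on any processor, the handler on the corresponding hook runs. When the SystemTap session ends, the hooks are detached from the kernel and the module is removed. The whole process is driven by a single command, stap.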
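Since BPF is a technique for injecting code into the kernel, security is paramount. We normally guard against others injecting code into the kernel through vulnerabilities, and here a door is opened on purpose. The kernel therefore imposes many rules on BPF code so that its errors cannot affect the kernel:
- A BPF program may contain at most BPF_MAXINSNS (4K) instructions, and the verifier may walk at most 32K instructions in total (the 4.9 kernel discussed here changes this value to 96K);
- Loops are forbidden in BPF code, so that a bug cannot produce an infinite loop that hangs the kernel. The total number of branches in one BPF program is also limited to 1K;
- To limit its scope, BPF code cannot access global variables, only local ones. A BPF program has only 512 bytes of stack. A ctx pointer is passed in at entry, and the program's data accesses are confined to ctx and stack-local variables;
- If BPF needs global state, it can only use BPF map objects. A BPF map can be accessed by user space, by BPF programs, and by the kernel; BPF programs access maps through helper functions;
- Older BPF does not support calls from one BPF function to another, so every BPF function must be declared always_inline. BPF to BPF Calls are supported only from Linux kernel 4.16 and LLVM 6.0 onward;
- Even without function calls, a BPF program can use the Tail Call mechanism to jump directly from one BPF program to another. It finds the target program through a map of type BPF_MAP_TYPE_PROG_ARRAY. The number of such jumps is also limited, to 32;
- BPF programs can call certain kernel functions (helper functions) to assist them;
- Some architectures (64 bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32 bit arm) already support a BPF JIT, which translates BPF code into native code efficiently and almost one-to-one (the eBPF instruction set was designed to resemble recent arm/x86 architectures, and the ABI is similar too). If the current architecture does not support JIT, only the kernel interpreter can be used;
- The kernel can further harden BPF by extra means. These mainly include locking the pages of the BPF code image and the JIT code image as read-only, blinding constants during JIT compilation (constant blinding), and restricting permissions on the bpf() system call;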
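The tail-call rule in the list above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: `prog_array`, `prog_fn`, and `run_with_tail_calls` are hypothetical names, a program signals "tail-call into slot N" by returning N, and the limit of 32 is the figure quoted in the text. It is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAIL_CALLS  32  /* limit quoted in the text above */
#define PROG_ARRAY_SIZE 4

/* Hypothetical program type: returns the prog-array slot to tail-call
 * into, or -1 to finish. 'acc' stands in for the shared context. */
typedef int (*prog_fn)(int *acc);

int prog_a(int *acc)    { *acc += 1;  return 1;  }  /* jump to slot 1   */
int prog_b(int *acc)    { *acc += 10; return -1; }  /* finish           */
int prog_loop(int *acc) { *acc += 1;  return 2;  }  /* jumps to itself  */

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY map. */
prog_fn prog_array[PROG_ARRAY_SIZE] = { prog_a, prog_b, prog_loop, NULL };

/* Run the program in slot 'start' and follow its tail calls.
 * Returns the number of tail calls that were actually performed. */
int run_with_tail_calls(int start, int *acc)
{
    int tail_calls = 0;
    int next;

    assert(start >= 0 && start < PROG_ARRAY_SIZE && prog_array[start]);
    next = prog_array[start](acc);

    while (next >= 0 && next < PROG_ARRAY_SIZE && prog_array[next]) {
        if (tail_calls >= MAX_TAIL_CALLS)
            break;                  /* limit reached: the jump is refused */
        tail_calls++;
        next = prog_array[next](acc);
    }
    return tail_calls;
}
```

Starting at slot 0 performs one tail call (prog_a, then prog_b). Starting at slot 2, the self-jumping program is cut off after 32 jumps, which is the point of the limit: a chain of tail calls cannot run forever.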
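These safety rules are checked mainly when the BPF code is loaded, by the BPF verifier. The check has roughly two steps:
- Step 1: traverse the code paths of the BPF program with a depth-first search (DFS) over its control-flow graph, and make sure the graph is a DAG (Directed Acyclic Graph), that is, it contains no loops;
- Step 2: analyze the execution of every BPF instruction and its effect on registers and the stack, checking whether anything can go out of bounds in the worst case (out-of-bounds variable access, or too many instructions executed). There is also an optimization that speeds up the analysis: pruning. If the state of the current branch at the current instruction is equal to, or a subset of, the state of another branch that has already been analyzed at the same instruction, the current branch need not be analyzed further, because it is certain to satisfy the rules.
The overall BPF development flow is roughly as shown in the figure below:
1. The bpf() system call
The core code is in the bpf() system call, so we start the analysis from its entry point.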
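One detail of the entry point is worth isolating before reading the kernel source: user space may pass an attribute struct that is larger than the one the kernel knows, and this is accepted only if all the extra bytes are zero. The sketch below reproduces that check in plain user-space C. `struct known_attr` and `copy_attr_checked` are hypothetical stand-ins for `union bpf_attr` and the copy logic in bpf(); there is no real user/kernel boundary here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's 'union bpf_attr'. */
struct known_attr {
    unsigned int map_type;
    unsigned int key_size;
};

/*
 * Copy at most sizeof(*dst) bytes from a caller buffer of 'size' bytes.
 * If the caller's struct is larger than ours, every extra byte must be
 * zero; otherwise the caller relies on an extension we do not know,
 * and -E2BIG is returned.
 */
int copy_attr_checked(struct known_attr *dst,
                      const unsigned char *src, size_t size)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof(*dst));   /* a shorter input leaves the rest zero */

    if (size > sizeof(*dst)) {
        for (i = sizeof(*dst); i < size; i++)
            if (src[i] != 0)
                return -E2BIG;
        size = sizeof(*dst);
    }
    memcpy(dst, src, size);
    return 0;
}
```

The design choice is forward compatibility in both directions: an old caller with a shorter struct gets zeros for the fields it did not supply, and a new caller with a longer struct is rejected only when it actually uses a field the receiver cannot interpret.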
    SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
    {
        union bpf_attr attr = {};
        int err;

        if (!capable(CAP_SYS_ADMIN) && sysctl_unprivileged_bpf_disabled)
            return -EPERM;

        if (!access_ok(VERIFY_READ, uattr, 1))
            return -EFAULT;

        if (size > PAGE_SIZE)   /* silly large */
            return -E2BIG;

        /* If we're handed a bigger struct than we know of,
         * ensure all the unknown bits are 0 - i.e. new
         * user-space does not rely on any kernel feature
         * extensions we dont know about yet.
         */
        if (size > sizeof(attr)) {
            unsigned char __user *addr;
            unsigned char __user *end;
            unsigned char val;

            addr = (void __user *)uattr + sizeof(attr);
            end  = (void __user *)uattr + size;

            for (; addr < end; addr++) {
                err = get_user(val, addr);
                if (err)
                    return err;
                if (val)
                    return -E2BIG;
            }
            size = sizeof(attr);
        }

        /* copy attributes from user space, may be less than sizeof(bpf_attr) */
        if (copy_from_user(&attr, uattr, size) != 0)
            return -EFAULT;

        switch (cmd) {
        case BPF_MAP_CREATE:
            err = map_create(&attr);
            break;
        case BPF_MAP_LOOKUP_ELEM:
            err = map_lookup_elem(&attr);
            break;
        case BPF_MAP_UPDATE_ELEM:
            err = map_update_elem(&attr);
            break;
        case BPF_MAP_DELETE_ELEM:
            err = map_delete_elem(&attr);
            break;
        case BPF_MAP_GET_NEXT_KEY:
            err = map_get_next_key(&attr);
            break;
        case BPF_PROG_LOAD:
            err = bpf_prog_load(&attr);
            break;
        case BPF_OBJ_PIN:
            err = bpf_obj_pin(&attr);
            break;
        case BPF_OBJ_GET:
            err = bpf_obj_get(&attr);
            break;
    #ifdef CONFIG_CGROUP_BPF
        case BPF_PROG_ATTACH:
            err = bpf_prog_attach(&attr);
            break;
        case BPF_PROG_DETACH:
            err = bpf_prog_detach(&attr);
            break;
    #endif
        default:
            err = -EINVAL;
            break;
        }

        return err;
    }

1.1. bpf loading
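The BPF_PROG_LOAD command loads a BPF program into the kernel:
- copy the program into the kernel;
- verify that it is safe;
- JIT-compile it if possible;
- then allocate a file descriptor (fd) for it.
Once all this is done, the BPF program is later attached to the hook on which it should run.
1.1.1. bpf memory allocation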
    static int bpf_prog_load(union bpf_attr *attr)
    {
        enum bpf_prog_type type = attr->prog_type;
        struct bpf_prog *prog;
        int err;
        char license[128];
        bool is_gpl;

        if (CHECK_ATTR(BPF_PROG_LOAD))
            return -EINVAL;

        /* copy eBPF program license from user space */
        /* (1.1) Copy the license string from user space into the kernel,
                 using the address in attr->license */
        if (strncpy_from_user(license, u64_to_ptr(attr->license),
                              sizeof(license) - 1) < 0)
            return -EFAULT;
        license[sizeof(license) - 1] = 0;

        /* eBPF programs must be GPL compatible to use GPL-ed functions */
        /* (1.2) Check whether the license is GPL compatible */
        is_gpl = license_is_gpl_compatible(license);

        /* (1.3) Check whether the total instruction count exceeds BPF_MAXINSNS (4k) */
        if (attr->insn_cnt >= BPF_MAXINSNS)
            return -EINVAL;

        /* (1.4) For BPF_PROG_TYPE_KPROBE programs, the specified kernel version
                 must match the running kernel. Otherwise, because of kernel
                 changes, the program could be attached at the wrong address. */
        if (type == BPF_PROG_TYPE_KPROBE &&
            attr->kern_version != LINUX_VERSION_CODE)
            return -EINVAL;

        /* (1.5) Loading any program type other than BPF_PROG_TYPE_SOCKET_FILTER
                 and BPF_PROG_TYPE_CGROUP_SKB requires administrator privileges */
        if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
            type != BPF_PROG_TYPE_CGROUP_SKB &&
            !capable(CAP_SYS_ADMIN))
            return -EPERM;

        /* plain bpf_prog allocation */
        /* (2.1) Allocate the bpf_prog space and the bpf_prog->aux space
                 according to the number of BPF instructions */
        prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
        if (!prog)
            return -ENOMEM;

        /* (2.2) Charge the whole bpf_prog space against the memlock limit
                 of the current process */
        err = bpf_prog_charge_memlock(prog);
        if (err)
            goto free_prog_nouncharge;

        prog->len = attr->insn_cnt;

        err = -EFAULT;
        /* (2.3) Copy the BPF code from user address attr->insns
                 to kernel address prog->insns */
        if (copy_from_user(prog->insns, u64_to_ptr(attr->insns),
                           prog->len * sizeof(struct bpf_insn)) != 0)
            goto free_prog;

        prog->orig_prog = NULL;
        prog->jited = 0;

        atomic_set(&prog->aux->refcnt, 1);
        prog->gpl_compatible = is_gpl ? 1 : 0;

        /* find program type: socket_filter vs tracing_filter */
        /* (2.4) Look up the bpf_prog_types entry for the type given in
                 attr->prog_type and assign bpf_prog->aux->ops, which is
                 a set of function operations */
        err = find_prog_type(type, prog);
        if (err < 0)
            goto free_prog;

        /* run eBPF verifier */
        /* (3) Use the verifier to scan the BPF program for legality */
        err = bpf_check(&prog, attr);
        if (err < 0)
            goto free_used_maps;

        /* eBPF program is ready to be JITed */
        /* (4) Try to JIT-compile the BPF program */
        prog = bpf_prog_select_runtime(prog, &err);
        if (err < 0)
            goto free_used_maps;

        /* (5) Allocate a file descriptor (fd) for the BPF program */
        err = bpf_prog_new_fd(prog);
        if (err < 0)
            /* failed to allocate fd */
            goto free_used_maps;

        return err;

    free_used_maps:
        free_used_maps(prog->aux);
    free_prog:
        bpf_prog_uncharge_memlock(prog);
    free_prog_nouncharge:
        bpf_prog_free(prog);
        return err;
    }

One important data structure for BPF here is struct bpf_prog:
    struct bpf_prog {
        u16         pages;              /* Number of allocated pages */
        kmemcheck_bitfield_begin(meta);
        u16         jited:1,            /* Is our filter JIT'ed? */
                    gpl_compatible:1,   /* Is filter GPL compatible? */
                    cb_access:1,        /* Is control block accessed? */
                    dst_needed:1;       /* Do we need dst entry? */
        kmemcheck_bitfield_end(meta);
        u32         len;                /* Number of filter blocks */
        enum bpf_prog_type  type;       /* Type of BPF program */
        struct bpf_prog_aux *aux;       /* Auxiliary fields */
        struct sock_fprog_kern *orig_prog;  /* Original BPF program */
        unsigned int (*bpf_func)(const struct sk_buff *skb,
                                 const struct bpf_insn *filter);
        /* Instructions for interpreter */
        union {
            struct sock_filter  insns[0];
            struct bpf_insn     insnsi[0];
        };
    };

Its important members are:
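- len: the number of BPF instructions in the program;
- type: the type of this BPF program (kprobe/tracepoint/perf_event/sk_filter/sched_cls/sched_act/xdp/cg_skb);
- aux: auxiliary data, used mainly by the verifier for checking and conversion;
- orig_prog: the original BPF program, as stated in the struct's own comment;
- bpf_func: the entry point of the BPF program at run time. If JIT conversion succeeds, this points to the JIT-compiled image of the program; otherwise it points to the generic entry of the kernel interpreter, __bpf_prog_run();
- insnsi[]: storage for the program's original BPF instructions, copied from user space;
1.1.2. bpf verifier
The steps and rules of the verifier are described in detail in the "eBPF verifier" section of "3.1 Berkeley Packet Filter (BPF) (Kernel Document)".
In addition, the top of kernel/bpf/verifier.c carries a detailed comment about the eBPF verifier: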
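bpf_check() is a static code analyzer that walks the eBPF program instruction by instruction and updates the register/stack state. All paths of conditional branches are analyzed until the 'bpf_exit' instruction.
1. The first pass is a depth-first search that checks the program is a DAG (Directed Acyclic Graph). It rejects programs that:
- are larger than BPF_MAXINSNS instructions (BPF_MAXINSNS=4096);
- contain a loop (detected via a back-edge);
- contain unreachable instructions (the graph must not be a forest; a program is one function);
- contain out-of-bounds or malformed jumps.
2. The second pass descends every possible path starting from the first instruction.
- Because it analyzes all paths through the program, the length of the analysis is limited to 32k instructions. This limit can be hit even when the total instruction count is below 4k, if too many branches change the stack/registers.
- The number of branches to be analyzed is limited to 1k.
On entry to each instruction, every register has a type, and the instruction changes the register types according to its semantics:
- rule 1: if the instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), the type of R5 is copied to R1. All registers are 64-bit: R0 is the return register, R1-R5 are argument-passing registers, R6-R9 are callee-saved registers, and R10 is the read-only frame pointer.
- rule 2: at the start of a BPF program, register R1 holds a pointer to bpf_context and has type PTR_TO_CTX.
- rule 3: the verifier tracks arithmetic on pointers, as in `BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),`. The first instruction copies the type of R10 (FRAME_PTR) into R1. The second, arithmetic instruction is pattern-matched to recognize that it constructs a pointer to some element within the stack. After the second instruction, R1 therefore has type PTR_TO_STACK, and the constant -20 is saved for later stack bounds checking. The register is a pointer to the stack plus a known immediate constant.
- rule 4: most of the time registers have type UNKNOWN_VALUE, meaning the register holds some value that is not a valid pointer (for example, pointer plus pointer becomes UNKNOWN_VALUE).
- rule 5: when the verifier sees a load or store instruction, the base register may have type PTR_TO_MAP_VALUE, PTR_TO_CTX, or FRAME_PTR. These are the three pointer types recognized by check_mem_access().
- rule 6: PTR_TO_MAP_VALUE means the register points to a 'map element value', and the range [ptr, ptr + map value_size) is accessible.
- rule 7: registers used to pass arguments to function calls are checked against the function's argument constraints. ARG_PTR_TO_MAP_KEY is one such constraint: the register passed must have type PTR_TO_STACK, and it is used inside the function as a 'pointer to map element key'.
For example, the argument constraints of bpf_map_lookup_elem() are `.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, .arg1_type = ARG_CONST_MAP_PTR, .arg2_type = ARG_PTR_TO_MAP_KEY,`. ret_type says the function returns a 'pointer to map element value or null'. The function expects the first argument to be a const pointer to 'struct bpf_map' and the second to be a pointer to the stack, which the helper uses as the pointer to the map element key. On the kernel side the helper takes `u64 r1 .. u64 r5`, casts r1 to `struct bpf_map *map` and r2 to `void *key`, and can access both safely, knowing that the bytes [key, key + map->key_size) are valid and were initialized on the stack of the eBPF program.
The corresponding eBPF program is:
- `BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),` after this instruction the type of R2 is FRAME_PTR;
- `BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),` after this instruction the type of R2 is PTR_TO_STACK;
- `BPF_LD_MAP_FD(BPF_REG_1, map_fd),` after this instruction the type of R1 is CONST_PTR_TO_MAP;
- `BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),`
Here the verifier looks at the prototype of map_lookup_elem() and sees:
- .arg1_type == ARG_CONST_MAP_PTR and R1->type == CONST_PTR_TO_MAP, which is ok. The verifier now knows the size of the map key: R1->map_ptr->key_size.
- .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, which is also ok. The verifier now checks that [R2, R2 + map's key_size) is within the stack limits and was initialized before the call.
- If so, the verifier allows this BPF_CALL instruction and looks at .ret_type, RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets R0->type = PTR_TO_MAP_VALUE_OR_NULL, meaning bpf_map_lookup_elem() returns either a pointer to a map value or NULL.
When a register of type PTR_TO_MAP_VALUE_OR_NULL passes through an 'if (reg != 0) goto +off' instruction, the register changes state to PTR_TO_MAP_VALUE in the true branch and to CONST_IMM in the false branch. See the implementation of check_cond_jmp_op().
After the call, R0 is set to the return type of the function and registers R1-R5 are set to NOT_INIT to indicate that they are no longer readable.
The original comment reads: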
    /* bpf_check() is a static code analyzer that walks eBPF program
     * instruction by instruction and updates register/stack state.
     * All paths of conditional branches are analyzed until 'bpf_exit' insn.
     *
     * The first pass is depth-first-search to check that the program is a DAG.
     * It rejects the following programs:
     * - larger than BPF_MAXINSNS insns
     * - if loop is present (detected via back-edge)
     * - unreachable insns exist (shouldn't be a forest. program = one function)
     * - out of bounds or malformed jumps
     * The second pass is all possible path descent from the 1st insn.
     * Since it's analyzing all pathes through the program, the length of the
     * analysis is limited to 32k insn, which may be hit even if total number of
     * insn is less then 4K, but there are too many branches that change stack/regs.
     * Number of 'branches to be analyzed' is limited to 1k
     *
     * On entry to each instruction, each register has a type, and the instruction
     * changes the types of the registers depending on instruction semantics.
     * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
     * copied to R1.
     *
     * All registers are 64-bit.
     * R0 - return register
     * R1-R5 argument passing registers
     * R6-R9 callee saved registers
     * R10 - frame pointer read-only
     *
     * At the start of BPF program the register R1 contains a pointer to bpf_context
     * and has type PTR_TO_CTX.
     *
     * Verifier tracks arithmetic operations on pointers in case:
     *    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
     *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
     * 1st insn copies R10 (which has FRAME_PTR) type into R1
     * and 2nd arithmetic instruction is pattern matched to recognize
     * that it wants to construct a pointer to some element within stack.
     * So after 2nd insn, the register R1 has type PTR_TO_STACK
     * (and -20 constant is saved for further stack bounds checking).
     * Meaning that this reg is a pointer to stack plus known immediate constant.
     *
     * Most of the time the registers have UNKNOWN_VALUE type, which
     * means the register has some value, but it's not a valid pointer.
     * (like pointer plus pointer becomes UNKNOWN_VALUE type)
     *
     * When verifier sees load or store instructions the type of base register
     * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, FRAME_PTR. These are three pointer
     * types recognized by check_mem_access() function.
     *
     * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
     * and the range of [ptr, ptr + map's value_size) is accessible.
     *
     * registers used to pass values to function calls are checked against
     * function argument constraints.
     *
     * ARG_PTR_TO_MAP_KEY is one of such argument constraints.
     * It means that the register type passed to this function must be
     * PTR_TO_STACK and it will be used inside the function as
     * 'pointer to map element key'
     *
     * For example the argument constraints for bpf_map_lookup_elem():
     *   .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
     *   .arg1_type = ARG_CONST_MAP_PTR,
     *   .arg2_type = ARG_PTR_TO_MAP_KEY,
     *
     * ret_type says that this function returns 'pointer to map elem value or null'
     * function expects 1st argument to be a const pointer to 'struct bpf_map' and
     * 2nd argument should be a pointer to stack, which will be used inside
     * the helper function as a pointer to map element key.
     *
     * On the kernel side the helper function looks like:
     * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
     * {
     *    struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
     *    void *key = (void *) (unsigned long) r2;
     *    void *value;
     *
     *    here kernel can access 'key' and 'map' pointers safely, knowing that
     *    [key, key + map->key_size) bytes are valid and were initialized on
     *    the stack of eBPF program.
     * }
     *
     * Corresponding eBPF program may look like:
     *    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
     *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
     *    BPF_LD_MAP_FD(BPF_REG_1, map_fd),      // after this insn R1 type is CONST_PTR_TO_MAP
     *    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
     * here verifier looks at prototype of map_lookup_elem() and sees:
     * .arg1_type == ARG_CONST_MAP_PTR and R1->type == CONST_PTR_TO_MAP, which is ok,
     * Now verifier knows that this map has key of R1->map_ptr->key_size bytes
     *
     * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
     * Now verifier checks that [R2, R2 + map's key_size) are within stack limits
     * and were initialized prior to this call.
     * If it's ok, then verifier allows this BPF_CALL insn and looks at
     * .ret_type which is RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets
     * R0->type = PTR_TO_MAP_VALUE_OR_NULL which means bpf_map_lookup_elem() function
     * returns ether pointer to map value or NULL.
     *
     * When type PTR_TO_MAP_VALUE_OR_NULL passes through 'if (reg != 0) goto +off'
     * insn, the register holding that pointer in the true branch changes state to
     * PTR_TO_MAP_VALUE and the same register changes state to CONST_IMM in the false
     * branch. See check_cond_jmp_op().
     *
     * After the call R0 is set to return type of the function and registers R1-R5
     * are set to NOT_INIT to indicate that they are no longer readable.
     */

The overall code flow of the BPF verifier is as follows:
    int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
    {
        char __user *log_ubuf = NULL;
        struct bpf_verifier_env *env;
        int ret = -EINVAL;

        if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
            return -E2BIG;

        /* 'struct bpf_verifier_env' can be global, but since it's not small,
         * allocate/free it every time bpf_check() is called
         */
        /* (3.1) Allocate the data structures needed by the verifier's static scan */
        env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
        if (!env)
            return -ENOMEM;

        env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) *
                                     (*prog)->len);
        ret = -ENOMEM;
        if (!env->insn_aux_data)
            goto err_free_env;
        env->prog = *prog;

        /* grab the mutex to protect few globals used by verifier */
        mutex_lock(&bpf_verifier_lock);

        /* (3.2) If the user set attr->log_buf, the user wants the detailed scan
                 log, which is very useful when loading fails. Allocate the log
                 space in the kernel first and copy it to the user on return. */
        if (attr->log_level || attr->log_buf || attr->log_size) {
            /* user requested verbose verifier output
             * and supplied buffer to store the verification trace
             */
            log_level = attr->log_level;
            log_ubuf = (char __user *) (unsigned long) attr->log_buf;
            log_size = attr->log_size;
            log_len = 0;

            ret = -EINVAL;
            /* log_* values have to be sane */
            if (log_size < 128 || log_size > UINT_MAX >> 8 ||
                log_level == 0 || log_ubuf == NULL)
                goto err_unlock;

            ret = -ENOMEM;
            log_buf = vmalloc(log_size);
            if (!log_buf)
                goto err_unlock;
        } else {
            log_level = 0;
        }

        /* (3.3) In BPF instructions that operate on maps, replace map_fd with
                 the actual map pointer. It follows that the user-space loader
                 must first call bpf() to create the maps defined in
                 __section("maps"), and only then load the other program sections. */
        ret = replace_map_fd_with_map_ptr(env);
        if (ret < 0)
            goto skip_full_check;

        env->explored_states = kcalloc(env->prog->len,
                                       sizeof(struct bpf_verifier_state_list *),
                                       GFP_USER);
        ret = -ENOMEM;
        if (!env->explored_states)
            goto skip_full_check;

        /* (3.4) step 1: check for loops */
        ret = check_cfg(env);
        if (ret < 0)
            goto skip_full_check;

        env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);

        /* (3.5) step 2: scan the execution of the BPF code in detail, tracking
                 registers and stack, and check for any rule violation */
        ret = do_check(env);

    skip_full_check:
        while (pop_stack(env, NULL) >= 0);
        free_states(env);

        /* (3.6) Turn the dead code found by the scan (code that never runs)
                 into nop instructions */
        if (ret == 0)
            sanitize_dead_code(env);

        /* (3.7) Convert accesses to members of the ctx pointer according to
                 the program type */
        if (ret == 0)
            /* program is valid, convert *(u32*)(ctx + off) accesses */
            ret = convert_ctx_accesses(env);

        /* (3.8) Fix up calls to kernel helper functions in the BPF instructions:
                 replace the function number with the actual function pointer */
        if (ret == 0)
            ret = fixup_bpf_calls(env);

        if (log_level && log_len >= log_size - 1) {
            BUG_ON(log_len >= log_size);
            /* verifier log exceeded user supplied buffer */
            ret = -ENOSPC;
            /* fall through to return what was recorded */
        }

        /* (3.9) Copy the verifier log to user space */
        /* copy verifier log back to user space including trailing zero */
        if (log_level && copy_to_user(log_ubuf, log_buf, log_len + 1) != 0) {
            ret = -EFAULT;
            goto free_log_buf;
        }

        /* (3.10) Save the program's map references into prog->aux->used_maps */
        if (ret == 0 && env->used_map_cnt) {
            /* if program passed verifier, update used_maps in bpf_prog_info */
            env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
                                                      sizeof(env->used_maps[0]),
                                                      GFP_KERNEL);

            if (!env->prog->aux->used_maps) {
                ret = -ENOMEM;
                goto free_log_buf;
            }

            memcpy(env->prog->aux->used_maps, env->used_maps,
                   sizeof(env->used_maps[0]) * env->used_map_cnt);
            env->prog->aux->used_map_cnt = env->used_map_cnt;

            /* program is valid. Convert pseudo bpf_ld_imm64 into generic
             * bpf_ld_imm64 instructions
             */
            convert_pseudo_ld_imm64(env);
        }

    free_log_buf:
        if (log_level)
            vfree(log_buf);
        if (!env->prog->aux->used_maps)
            /* if we didn't copy map pointers into bpf_prog_info, release
             * them now. Otherwise free_bpf_prog_info() will release them.
             */
            release_maps(env);
        *prog = env->prog;
    err_unlock:
        mutex_unlock(&bpf_verifier_lock);
        vfree(env->insn_aux_data);
    err_free_env:
        kfree(env);
        return ret;
    }

- 1. In BPF instructions that operate on maps, replace map_fd with the actual map pointer.
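It follows that the user-space loader must first call bpf() to create the maps defined in __section("maps"), and only then load the other program sections.
An instruction that satisfies (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) && (insn[0].src_reg == BPF_PSEUDO_MAP_FD) is a map-pointer load instruction.
Its original immediate is treated as an fd and used to look up the corresponding map pointer.
The 64-bit map pointer is then split into two 32-bit immediates and stored in insn[0].imm and insn[1].imm.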
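The split into two 32-bit immediates can be illustrated outside the kernel. The sketch below assumes a hypothetical `struct insn_pair` that holds only the `imm` fields of the two consecutive instructions; `store_ptr64` and `load_ptr64` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the imm fields of insn[0] and insn[1]. */
struct insn_pair {
    int32_t imm_lo;   /* plays the role of insn[0].imm */
    int32_t imm_hi;   /* plays the role of insn[1].imm */
};

/* Split a 64-bit address into two signed 32-bit immediates. */
void store_ptr64(struct insn_pair *p, uint64_t addr)
{
    assert(p != 0);
    p->imm_lo = (int32_t)(uint32_t)addr;          /* low 32 bits  */
    p->imm_hi = (int32_t)(uint32_t)(addr >> 32);  /* high 32 bits */
}

/* Rebuild the address. Each half goes through uint32_t first so that
 * a negative imm is not sign-extended into the other half. */
uint64_t load_ptr64(const struct insn_pair *p)
{
    return (uint64_t)(uint32_t)p->imm_lo |
           ((uint64_t)(uint32_t)p->imm_hi << 32);
}
```

The cast through `uint32_t` is the detail that matters: `imm` is a signed field, and kernel addresses typically have the top bit set in both halves, so a naive widening of `imm_lo` would corrupt the upper half.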
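- 2. Step 1: traverse the code paths of the BPF program with a depth-first search (DFS) and make sure the control-flow graph is a DAG (Directed Acyclic Graph), that is, it contains no loops;
For the DFS algorithm on a DAG, see the article "Graph". The most important concepts are shown in the figure below:
After a graph is traversed by DFS, each root node yields a tree (a "DFS Tree"), and the trees obtained from several root nodes form a forest (a "DFS Forest"). Based on the structure of the search, the edges of the graph fall into four classes:
- Tree Edges: edges that belong to the DFS tree;
- Back Edges: edges from a node to one of its ancestors (these form a cycle);
- Forward Edges: edges from a node directly to a descendant that is not its child (they skip over intermediate nodes);
- Cross Edges: edges between nodes with no ancestor/descendant relationship, either within one tree or between trees;
The BPF verifier checks whether the execution path graph of the BPF program contains any Back Edges, which ensures the program has no loops.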
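The back-edge test can be sketched on a toy control-flow graph. This is a minimal recursive version under stated assumptions: every node has at most two successors (fall-through and branch target), node 0 is the entry, and `has_loop` / `dfs_has_back_edge` are hypothetical names. The kernel code uses an explicit stack rather than recursion.

```c
#include <assert.h>

#define MAX_NODES 16

enum { UNVISITED = 0, DISCOVERED, EXPLORED };

/* succ[n][k] is the k-th successor of node n, or -1 if absent. */
static int dfs_has_back_edge(int n, int succ[][2], int *state)
{
    int k;

    state[n] = DISCOVERED;              /* n is on the current DFS path */
    for (k = 0; k < 2; k++) {
        int m = succ[n][k];

        if (m < 0)
            continue;
        if (state[m] == DISCOVERED)     /* edge to an ancestor: back edge */
            return 1;
        if (state[m] == UNVISITED && dfs_has_back_edge(m, succ, state))
            return 1;
        /* state[m] == EXPLORED: forward or cross edge, harmless */
    }
    state[n] = EXPLORED;                /* all paths below n are done */
    return 0;
}

/* Returns 1 if the graph reachable from node 0 contains a loop. */
int has_loop(int succ[][2], int node_count)
{
    int state[MAX_NODES] = {0};

    assert(node_count > 0 && node_count <= MAX_NODES);
    return dfs_has_back_edge(0, succ, state);
}
```

A diamond-shaped graph (0 branches to 1 and 2, both of which reach 3) is accepted because the second arrival at node 3 is a cross edge to an already explored node. A graph 0 -> 1 -> 2 -> 0 is rejected because the edge 2 -> 0 reaches a node that is still on the current path.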
The kernel's implementation, check_cfg(), is as follows:
    static int check_cfg(struct bpf_verifier_env *env)
    {
        struct bpf_insn *insns = env->prog->insnsi;
        int insn_cnt = env->prog->len;
        int ret = 0;
        int i, t;

        insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
        if (!insn_state)
            return -ENOMEM;

        insn_stack = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
        if (!insn_stack) {
            kfree(insn_state);
            return -ENOMEM;
        }

        insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */
        insn_stack[0] = 0; /* 0 is the first instruction */
        cur_stack = 1;

    /* (3.4.1) Main loop of the depth-first search */
    peek_stack:
        if (cur_stack == 0)
            goto check_state;
        t = insn_stack[cur_stack - 1];

        /* (3.4.2) Branch instructions */
        if (BPF_CLASS(insns[t].code) == BPF_JMP) {
            u8 opcode = BPF_OP(insns[t].code);

            /* (3.4.2.1) BPF_EXIT: the path ends here, start backtracking */
            if (opcode == BPF_EXIT) {
                goto mark_explored;

            /* (3.4.2.2) BPF_CALL: keep exploring, and set env->explored_states[]
                         to STATE_LIST_MARK to indicate that register and stack
                         state must be tracked again after the call */
            } else if (opcode == BPF_CALL) {
                ret = push_insn(t, t + 1, FALLTHROUGH, env);
                if (ret == 1)
                    goto peek_stack;
                else if (ret < 0)
                    goto err_free;
                if (t + 1 < insn_cnt)
                    env->explored_states[t + 1] = STATE_LIST_MARK;

            /* (3.4.2.3) BPF_JA: keep exploring, and set env->explored_states[]
                         to STATE_LIST_MARK to indicate that register and stack
                         state must be tracked again after the jump */
            } else if (opcode == BPF_JA) {
                if (BPF_SRC(insns[t].code) != BPF_K) {
                    ret = -EINVAL;
                    goto err_free;
                }
                /* unconditional jump with single edge */
                ret = push_insn(t, t + insns[t].off + 1,
                                FALLTHROUGH, env);
                if (ret == 1)
                    goto peek_stack;
                else if (ret < 0)
                    goto err_free;
                /* tell verifier to check for equivalent states
                 * after every call and jump
                 */
                if (t + 1 < insn_cnt)
                    env->explored_states[t + 1] = STATE_LIST_MARK;

            /* (3.4.2.4) The rest are conditional jumps: explore the
                         condition-false path first, then the condition-true
                         path, and set env->explored_states[] to STATE_LIST_MARK
                         so that states are compared at this instruction */
            } else {
                /* conditional jump with two edges */
                env->explored_states[t] = STATE_LIST_MARK;

                /* condition-false path */
                ret = push_insn(t, t + 1, FALLTHROUGH, env);
                if (ret == 1)
                    goto peek_stack;
                else if (ret < 0)
                    goto err_free;

                /* condition-true path */
                ret = push_insn(t, t + insns[t].off + 1, BRANCH, env);
                if (ret == 1)
                    goto peek_stack;
                else if (ret < 0)
                    goto err_free;
            }

        /* (3.4.3) Non-branch instructions */
        } else {
            /* all other non-branch instructions with single
             * fall-through edge
             */
            ret = push_insn(t, t + 1, FALLTHROUGH, env);
            /* (3.4.3.1) Meaning of ret:
                 ret == 1: keep exploring the path
                 ret == 0: already a leaf node, go to mark_explored to confirm
                           and backtrack
                 ret <  0: a "back-edge" loop was detected, or another error */
            if (ret == 1)
                goto peek_stack;
            else if (ret < 0)
                goto err_free;
        }

    /* (3.4.4) Confirm and backtrack; mark the state as EXPLORED */
    mark_explored:
        insn_state[t] = EXPLORED;
        if (cur_stack-- <= 0) {
            verbose("pop stack internal bug\n");
            ret = -EFAULT;
            goto err_free;
        }
        goto peek_stack;

    /* (3.4.5) Make sure there are no unreachable instructions,
               that is, instructions no path can reach */
    check_state:
        for (i = 0; i < insn_cnt; i++) {
            if (insn_state[i] != EXPLORED) {
                verbose("unreachable insn %d\n", i);
                ret = -EINVAL;
                goto err_free;
            }
        }
        ret = 0; /* cfg looks good */

    err_free:
        kfree(insn_state);
        kfree(insn_stack);
        return ret;
    }

- 3. Step 2: scan the execution of the BPF code in detail, tracking registers and the stack, and check for any rule violation.
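The algorithm walks the paths of step 1 again while tracking how registers and the stack change, and decides whether any rule can be violated in the worst case.
When it reaches an instruction whose explored_states[] entry is set to STATE_LIST_MARK, it allocates a separate bpf_verifier_state_list for that instruction to store the instruction's differing states on multiple branches.
There is also an optimization that speeds up the analysis: pruning. If the state cur_state of the current branch at the current instruction is equal to, or a subset of, the state of another already analyzed branch at the same instruction (one bpf_verifier_state_list member of that instruction's explored_states[] list), the current branch need not be analyzed further, because it is certain to satisfy the rules.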
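The "equal or subset" test behind pruning can be sketched with a deliberately simplified state model. Everything here is an assumption made for illustration: the state holds only one type per register, `state_is_covered` is a hypothetical name, and only two acceptance rules are modeled (the types match, or the already verified state had the register uninitialized, so the verified path never read it). The kernel's real comparison also covers the stack and more cases.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 11  /* R0..R10 */

enum reg_type {
    NOT_INIT = 0,
    UNKNOWN_VALUE,
    PTR_TO_CTX,
    PTR_TO_STACK,
    PTR_TO_MAP_VALUE
};

struct vstate {
    enum reg_type regs[NUM_REGS];
};

/* Returns true if 'cur' is covered by the already verified state 'old',
 * in which case the current branch can be pruned at this instruction. */
bool state_is_covered(const struct vstate *old, const struct vstate *cur)
{
    assert(old != 0 && cur != 0);
    for (int i = 0; i < NUM_REGS; i++) {
        enum reg_type o = old->regs[i];
        enum reg_type c = cur->regs[i];

        if (o == c)
            continue;       /* identical knowledge about this register */
        if (o == NOT_INIT)
            continue;       /* verified path never relied on this register */
        return false;       /* cur differs where old depended on the type */
    }
    return true;
}
```

The asymmetry is intentional. A state that knows more (an extra initialized register) is safe wherever a state that knew less was proven safe, but not the other way round.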
- 4. Fix up calls to kernel helper functions in the BPF instructions, replacing the function number with the actual function pointer.
An instruction that satisfies (insn->code == (BPF_JMP | BPF_CALL)) is a call to a helper function.
Handling of ordinary helper functions: find the function pointer that corresponds to the number in insn->imm, then store the offset between that function pointer and __bpf_call_base into insn->imm.
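The offset encoding can be tried in user space. The sketch below is an analogy under stated assumptions: `call_base` plays the role of `__bpf_call_base`, `helper_table`, `fixup_call`, and `resolve_call` are hypothetical names, and it relies on the platform allowing function pointers to be converted to `intptr_t` and on both functions lying within 2 GiB of each other, which holds on common Linux targets but is not guaranteed by ISO C.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*helper_fn)(uint64_t, uint64_t, uint64_t, uint64_t, uint64_t);

/* Anchor function, standing in for __bpf_call_base. */
uint64_t call_base(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
    (void)a; (void)b; (void)c; (void)d; (void)e;
    return 0;
}

/* A toy helper: returns the sum of its first two arguments. */
uint64_t helper_add(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
    (void)c; (void)d; (void)e;
    return a + b;
}

enum { FUNC_unspec = 0, FUNC_add = 1, FUNC_max };

/* Hypothetical table mapping helper numbers to functions. */
static const helper_fn helper_table[FUNC_max] = { call_base, helper_add };

/* Verifier side: turn a helper number into a 32-bit relative offset. */
int32_t fixup_call(int32_t func_id)
{
    intptr_t delta;

    assert(func_id >= 0 && func_id < FUNC_max);
    delta = (intptr_t)helper_table[func_id] - (intptr_t)call_base;
    assert(delta >= INT32_MIN && delta <= INT32_MAX);  /* must fit in imm */
    return (int32_t)delta;
}

/* Runtime side: rebuild the function pointer from base + imm. */
helper_fn resolve_call(int32_t imm)
{
    return (helper_fn)((intptr_t)call_base + imm);
}
```

Storing an offset instead of the pointer is what lets a 64-bit function address travel in the 32-bit imm field; the arm64 JIT code later in this section rebuilds the address with `(u64)__bpf_call_base + imm`.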
1.1.3. bpf JIT/kernel interpreter
Once the verifier has accepted the program, the kernel uses the JIT (Just-In-Time) compiler to convert the BPF object code into native instructions. If the current architecture does not support JIT, the kernel uses an interpreter instead, which runs less efficiently.
Some architectures (64 bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32 bit arm) already support a BPF JIT, which translates BPF code into native code efficiently and almost one-to-one (the eBPF instruction set was designed to resemble recent arm/x86 architectures, and the ABI is similar too).
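The choice between a JIT image and the interpreter comes down to which function the program's entry pointer refers to. The toy model below illustrates that idea only. The instruction set (`OP_MOV_K` and friends), `struct mini_prog`, `mini_interpret`, and `select_runtime` are all hypothetical and far simpler than eBPF; passing `NULL` as the JIT models an architecture without JIT support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum op { OP_MOV_K, OP_ADD_K, OP_ADD_X, OP_JEQ_K, OP_EXIT };

struct mini_insn {
    enum op op;
    uint8_t dst;    /* destination register index */
    uint8_t src;    /* source register index */
    int16_t off;    /* jump offset, relative to the next instruction */
    int32_t imm;    /* immediate operand */
};

struct mini_prog;
typedef uint64_t (*run_fn)(const struct mini_prog *prog, uint64_t ctx);

struct mini_prog {
    run_fn bpf_func;                /* entry: JIT image or interpreter */
    int jited;
    size_t len;
    const struct mini_insn *insns;
};

/* Interpreter: decode and execute one instruction at a time. */
uint64_t mini_interpret(const struct mini_prog *prog, uint64_t ctx)
{
    uint64_t reg[11] = {0};
    size_t pc = 0;

    reg[1] = ctx;                   /* R1 carries the context argument */
    while (pc < prog->len) {
        const struct mini_insn *in = &prog->insns[pc++];

        switch (in->op) {
        case OP_MOV_K: reg[in->dst] = (uint64_t)(int64_t)in->imm; break;
        case OP_ADD_K: reg[in->dst] += (uint64_t)(int64_t)in->imm; break;
        case OP_ADD_X: reg[in->dst] += reg[in->src]; break;
        case OP_JEQ_K:
            if (reg[in->dst] == (uint64_t)(int64_t)in->imm)
                pc += in->off;
            break;
        case OP_EXIT:  return reg[0];
        }
    }
    return 0;   /* not reached for a program that ends in OP_EXIT */
}

/* Pick the entry point: the JIT result if there is one, else the interpreter. */
void select_runtime(struct mini_prog *prog, run_fn jit)
{
    assert(prog != NULL);
    if (jit != NULL) {
        prog->bpf_func = jit;
        prog->jited = 1;
    } else {
        prog->bpf_func = mini_interpret;
        prog->jited = 0;
    }
}
```

Callers only ever invoke `prog->bpf_func`, so they do not need to know which of the two execution paths was selected.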
- 1. JIT
Taking the arm64 JIT conversion as an example:
    struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
    {
        struct bpf_prog *tmp, *orig_prog = prog;
        struct bpf_binary_header *header;
        bool tmp_blinded = false;
        struct jit_ctx ctx;
        int image_size;
        u8 *image_ptr;

        if (!bpf_jit_enable)
            return orig_prog;

        /* (4.2.1) Blind the constants (constant blinding) */
        tmp = bpf_jit_blind_constants(prog);
        /* If blinding was requested and we failed during blinding,
         * we must fall back to the interpreter.
         */
        if (IS_ERR(tmp))
            return orig_prog;
        if (tmp != prog) {
            tmp_blinded = true;
            prog = tmp;
        }

        memset(&ctx, 0, sizeof(ctx));
        ctx.prog = prog;

        ctx.offset = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
        if (ctx.offset == NULL) {
            prog = orig_prog;
            goto out;
        }

        /* 1. Initial fake pass to compute ctx->idx. */
        /* (4.2.2) Run the JIT conversion without storing the result, only
                   recording the total length needed by
                   "prologue + body + epilogue" */

        /* Fake pass to fill in ctx->offset. */
        if (build_body(&ctx)) {
            prog = orig_prog;
            goto out_off;
        }

        if (build_prologue(&ctx)) {
            prog = orig_prog;
            goto out_off;
        }

        ctx.epilogue_offset = ctx.idx;
        build_epilogue(&ctx);

        /* Now we know the actual image size. */
        /* (4.2.3) Allocate the storage for the JIT-converted instructions,
                   ctx.image, according to the computed total length */
        image_size = sizeof(u32) * ctx.idx;
        header = bpf_jit_binary_alloc(image_size, &image_ptr,
                                      sizeof(u32), jit_fill_hole);
        if (header == NULL) {
            prog = orig_prog;
            goto out_off;
        }

        /* 2. Now, the actual pass. */
        /* (4.2.4) Run the JIT conversion again and store the converted
                   instructions in ctx.image */
        ctx.image = (u32 *)image_ptr;
        ctx.idx = 0;

        /* (4.2.4.1) Build the head of the image, which sets up the stack used
                     while the BPF program runs: 8 instructions */
        build_prologue(&ctx);

        /* (4.2.4.2) JIT-convert the BPF program into native instructions */
        if (build_body(&ctx)) {
            bpf_jit_binary_free(header);
            prog = orig_prog;
            goto out_off;
        }

        /* (4.2.4.3) Build the tail of the image, which does the cleanup:
                     7 instructions */
        build_epilogue(&ctx);

        /* 3. Extra pass to validate JITed code. */
        /* (4.2.5) Make sure the converted instructions contain no
                   AARCH64_BREAK_FAULT */
        if (validate_code(&ctx)) {
            bpf_jit_binary_free(header);
            prog = orig_prog;
            goto out_off;
        }

        /* And we're done. */
        if (bpf_jit_enable > 1)
            bpf_jit_dump(prog->len, image_size, 2, ctx.image);

        /* (4.2.6) Flush the icache for the new image */
        bpf_flush_icache(header, ctx.image + ctx.idx);

        /* (4.2.7) Mark the pages of the image read-only */
        set_memory_ro((unsigned long)header, header->pages);

        /* (4.2.8) Assign the converted image to prog->bpf_func */
        prog->bpf_func = (void *)ctx.image;
        prog->jited = 1;

    out_off:
        kfree(ctx.offset);
    out:
        if (tmp_blinded)
            bpf_jit_prog_release_other(prog, prog == orig_prog ?
                                       tmp : orig_prog);
        return prog;
    }

The core JIT conversion has three parts: prologue + body + epilogue.
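prologue: newly added instructions that build the stack for the BPF program and save the execution context;
body: the main part of the BPF program;
epilogue: restores the execution context and cleans up after the BPF program has run;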
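The arm64 JIT generates code twice: a first "fake" pass that only counts instructions so the image can be sized, and a second pass that writes into the allocated image. The sketch below shows this count-then-emit pattern in isolation. The names `emit_ctx`, `build_all`, and `compile_two_pass` are hypothetical, the emitted words are placeholder values rather than real arm64 encodings, and `malloc` stands in for the kernel's image allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct emit_ctx {
    uint32_t *image;  /* NULL during the sizing pass */
    int idx;          /* number of instructions emitted so far */
};

/* Store one instruction if an image exists; always advance the index. */
void emit(uint32_t insn, struct emit_ctx *ctx)
{
    if (ctx->image != NULL)
        ctx->image[ctx->idx] = insn;
    ctx->idx++;
}

/* Placeholder prologue + body + epilogue generator. */
void build_all(struct emit_ctx *ctx, int body_len)
{
    emit(0xAAAA0001u, ctx);                   /* prologue (placeholder) */
    for (int i = 0; i < body_len; i++)
        emit(0xBBBB0000u + (uint32_t)i, ctx); /* body (placeholder)     */
    emit(0xCCCC0001u, ctx);                   /* epilogue (placeholder) */
}

/* Pass 1 sizes the image, pass 2 fills it. The caller frees the result. */
uint32_t *compile_two_pass(int body_len, int *out_count)
{
    struct emit_ctx ctx = { NULL, 0 };

    assert(body_len >= 0 && out_count != NULL);

    build_all(&ctx, body_len);                /* pass 1: count only */
    *out_count = ctx.idx;

    ctx.image = malloc(sizeof(uint32_t) * (size_t)ctx.idx);
    if (ctx.image == NULL)
        return NULL;

    ctx.idx = 0;
    build_all(&ctx, body_len);                /* pass 2: write for real */
    return ctx.image;
}
```

Running the same generator in both passes guarantees that the computed size matches what is written, at the price of doing the translation work twice.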
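- 1.1. prologue
Names starting with A64_ are native registers.
Names starting with BPF_ are registers of the BPF virtual machine.
The whole procedure is rather neat:
First, A64_FP/A64_LR are saved on the stack at A64_SP, and the current A64_SP is then saved into A64_FP;
Next, the callee-saved registers r6, r7, r8, r9, fp, and tcc are saved on the stack at A64_SP, and the current A64_SP is then saved into BPF_FP;
A64_SP is decreased by STACK_SIZE, which reserves 512 bytes of stack space for BPF_FP;
As a result, the BPF program uses the 512 bytes of stack below BPF_FP, while ordinary kernel functions use the stack further down from A64_SP, so the two do not interfere with each other;
- 1.2. body
The body translates BPF instructions into native arm64 instructions: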
    static int build_body(struct jit_ctx *ctx)
    {
        const struct bpf_prog *prog = ctx->prog;
        int i;

        for (i = 0; i < prog->len; i++) {
            const struct bpf_insn *insn = &prog->insnsi[i];
            int ret;

            ret = build_insn(insn, ctx);
            if (ret > 0) {
                i++;
                if (ctx->image == NULL)
                    ctx->offset[i] = ctx->idx;
                continue;
            }
            if (ctx->image == NULL)
                ctx->offset[i] = ctx->idx;
            if (ret)
                return ret;
        }

        return 0;
    }

    ↓

    /* JITs an eBPF instruction.
     * Returns:
     * 0  - successfully JITed an 8-byte eBPF instruction.
     * >0 - successfully JITed a 16-byte eBPF instruction.
     * <0 - failed to JIT.
     */
    static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
    {
        const u8 code = insn->code;
        const u8 dst = bpf2a64[insn->dst_reg];
        const u8 src = bpf2a64[insn->src_reg];
        const u8 tmp = bpf2a64[TMP_REG_1];
        const u8 tmp2 = bpf2a64[TMP_REG_2];
        const s16 off = insn->off;
        const s32 imm = insn->imm;
        const int i = insn - ctx->prog->insnsi;
        const bool is64 = BPF_CLASS(code) == BPF_ALU64;
        u8 jmp_cond;
        s32 jmp_offset;

    #define check_imm(bits, imm) do {                       \
        if ((((imm) > 0) && ((imm) >> (bits))) ||           \
            (((imm) < 0) && (~(imm) >> (bits)))) {          \
            pr_info("[%2d] imm=%d(0x%x) out of range\n",    \
                i, imm, imm);                               \
            return -EINVAL;                                 \
        }                                                   \
    } while (0)
    #define check_imm19(imm) check_imm(19, imm)
    #define check_imm26(imm) check_imm(26, imm)

        switch (code) {
        /* dst = src */
        case BPF_ALU | BPF_MOV | BPF_X:
        case BPF_ALU64 | BPF_MOV | BPF_X:
            emit(A64_MOV(is64, dst, src), ctx);
            break;
        /* dst = dst OP src */
        case BPF_ALU | BPF_ADD | BPF_X:
        case BPF_ALU64 | BPF_ADD | BPF_X:
            emit(A64_ADD(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_SUB | BPF_X:
        case BPF_ALU64 | BPF_SUB | BPF_X:
            emit(A64_SUB(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_AND | BPF_X:
        case BPF_ALU64 | BPF_AND | BPF_X:
            emit(A64_AND(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_OR | BPF_X:
        case BPF_ALU64 | BPF_OR | BPF_X:
            emit(A64_ORR(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_XOR | BPF_X:
        case BPF_ALU64 | BPF_XOR | BPF_X:
            emit(A64_EOR(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_MUL | BPF_X:
        case BPF_ALU64 | BPF_MUL | BPF_X:
            emit(A64_MUL(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_DIV | BPF_X:
        case BPF_ALU64 | BPF_DIV | BPF_X:
        case BPF_ALU | BPF_MOD | BPF_X:
        case BPF_ALU64 | BPF_MOD | BPF_X:
        {
            const u8 r0 = bpf2a64[BPF_REG_0];

            /* if (src == 0) return 0 */
            jmp_offset = 3; /* skip ahead to else path */
            check_imm19(jmp_offset);
            emit(A64_CBNZ(is64, src, jmp_offset), ctx);
            emit(A64_MOVZ(1, r0, 0, 0), ctx);
            jmp_offset = epilogue_offset(ctx);
            check_imm26(jmp_offset);
            emit(A64_B(jmp_offset), ctx);
            /* else */
            switch (BPF_OP(code)) {
            case BPF_DIV:
                emit(A64_UDIV(is64, dst, dst, src), ctx);
                break;
            case BPF_MOD:
                emit(A64_UDIV(is64, tmp, dst, src), ctx);
                emit(A64_MUL(is64, tmp, tmp, src), ctx);
                emit(A64_SUB(is64, dst, dst, tmp), ctx);
                break;
            }
            break;
        }
        case BPF_ALU | BPF_LSH | BPF_X:
        case BPF_ALU64 | BPF_LSH | BPF_X:
            emit(A64_LSLV(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_RSH | BPF_X:
        case BPF_ALU64 | BPF_RSH | BPF_X:
            emit(A64_LSRV(is64, dst, dst, src), ctx);
            break;
        case BPF_ALU | BPF_ARSH | BPF_X:
        case BPF_ALU64 | BPF_ARSH | BPF_X:
            emit(A64_ASRV(is64, dst, dst, src), ctx);
            break;
        /* dst = -dst */
        case BPF_ALU | BPF_NEG:
        case BPF_ALU64 | BPF_NEG:
            emit(A64_NEG(is64, dst, dst), ctx);
            break;
        /* dst = BSWAP##imm(dst) */
        case BPF_ALU | BPF_END | BPF_FROM_LE:
        case BPF_ALU | BPF_END | BPF_FROM_BE:
    #ifdef CONFIG_CPU_BIG_ENDIAN
            if (BPF_SRC(code) == BPF_FROM_BE)
                goto emit_bswap_uxt;
    #else /* !CONFIG_CPU_BIG_ENDIAN */
            if (BPF_SRC(code) == BPF_FROM_LE)
                goto emit_bswap_uxt;
    #endif
            switch (imm) {
            case 16:
                emit(A64_REV16(is64, dst, dst), ctx);
                /* zero-extend 16 bits into 64 bits */
                emit(A64_UXTH(is64, dst, dst), ctx);
                break;
            case 32:
                emit(A64_REV32(is64, dst, dst), ctx);
                /* upper 32 bits already cleared */
                break;
            case 64:
                emit(A64_REV64(dst, dst), ctx);
                break;
            }
            break;
    emit_bswap_uxt:
            switch (imm) {
            case 16:
                /* zero-extend 16 bits into 64 bits */
                emit(A64_UXTH(is64, dst, dst), ctx);
                break;
            case 32:
                /* zero-extend 32 bits into 64 bits */
                emit(A64_UXTW(is64, dst, dst), ctx);
                break;
            case 64:
                /* nop */
                break;
            }
            break;
        /* dst = imm */
        case BPF_ALU | BPF_MOV | BPF_K:
        case BPF_ALU64 | BPF_MOV | BPF_K:
            emit_a64_mov_i(is64, dst, imm, ctx);
            break;
        /* dst = dst OP imm */
        case BPF_ALU | BPF_ADD | BPF_K:
        case BPF_ALU64 | BPF_ADD | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_ADD(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_SUB | BPF_K:
        case BPF_ALU64 | BPF_SUB | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_SUB(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_AND | BPF_K:
        case BPF_ALU64 | BPF_AND | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_AND(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_OR | BPF_K:
        case BPF_ALU64 | BPF_OR | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_ORR(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_XOR | BPF_K:
        case BPF_ALU64 | BPF_XOR | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_EOR(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_MUL | BPF_K:
        case BPF_ALU64 | BPF_MUL | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_MUL(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_DIV | BPF_K:
        case BPF_ALU64 | BPF_DIV | BPF_K:
            emit_a64_mov_i(is64, tmp, imm, ctx);
            emit(A64_UDIV(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_MOD | BPF_K:
        case BPF_ALU64 | BPF_MOD | BPF_K:
            emit_a64_mov_i(is64, tmp2, imm, ctx);
            emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
            emit(A64_MUL(is64, tmp, tmp, tmp2), ctx);
            emit(A64_SUB(is64, dst, dst, tmp), ctx);
            break;
        case BPF_ALU | BPF_LSH | BPF_K:
        case BPF_ALU64 | BPF_LSH | BPF_K:
            emit(A64_LSL(is64, dst, dst, imm), ctx);
            break;
        case BPF_ALU | BPF_RSH | BPF_K:
        case BPF_ALU64 | BPF_RSH | BPF_K:
            emit(A64_LSR(is64, dst, dst, imm), ctx);
            break;
        case BPF_ALU | BPF_ARSH | BPF_K:
        case BPF_ALU64 | BPF_ARSH | BPF_K:
            emit(A64_ASR(is64, dst, dst, imm), ctx);
            break;

        /* JUMP off */
        case BPF_JMP | BPF_JA:
            jmp_offset = bpf2a64_offset(i + off, i, ctx);
            check_imm26(jmp_offset);
            emit(A64_B(jmp_offset), ctx);
            break;
        /* IF (dst COND src) JUMP off */
        case BPF_JMP | BPF_JEQ | BPF_X:
        case BPF_JMP | BPF_JGT | BPF_X:
        case BPF_JMP | BPF_JGE | BPF_X:
        case BPF_JMP | BPF_JNE | BPF_X:
        case BPF_JMP | BPF_JSGT | BPF_X:
        case BPF_JMP | BPF_JSGE | BPF_X:
            emit(A64_CMP(1, dst, src), ctx);
    emit_cond_jmp:
            jmp_offset = bpf2a64_offset(i + off, i, ctx);
            check_imm19(jmp_offset);
            switch (BPF_OP(code)) {
            case BPF_JEQ:
                jmp_cond = A64_COND_EQ;
                break;
            case BPF_JGT:
                jmp_cond = A64_COND_HI;
                break;
            case BPF_JGE:
                jmp_cond = A64_COND_CS;
                break;
            case BPF_JSET:
            case BPF_JNE:
                jmp_cond = A64_COND_NE;
                break;
            case BPF_JSGT:
                jmp_cond = A64_COND_GT;
                break;
            case BPF_JSGE:
                jmp_cond = A64_COND_GE;
                break;
            default:
                return -EFAULT;
            }
            emit(A64_B_(jmp_cond, jmp_offset), ctx);
            break;
        case BPF_JMP | BPF_JSET | BPF_X:
            emit(A64_TST(1, dst, src), ctx);
            goto emit_cond_jmp;
        /* IF (dst COND imm) JUMP off */
        case BPF_JMP | BPF_JEQ | BPF_K:
        case BPF_JMP | BPF_JGT | BPF_K:
        case BPF_JMP | BPF_JGE | BPF_K:
        case BPF_JMP | BPF_JNE | BPF_K:
        case BPF_JMP | BPF_JSGT | BPF_K:
        case BPF_JMP | BPF_JSGE | BPF_K:
            emit_a64_mov_i(1, tmp, imm, ctx);
            emit(A64_CMP(1, dst, tmp), ctx);
            goto emit_cond_jmp;
        case BPF_JMP | BPF_JSET | BPF_K:
            emit_a64_mov_i(1, tmp, imm, ctx);
            emit(A64_TST(1, dst, tmp), ctx);
            goto emit_cond_jmp;
        /* function call */
        case BPF_JMP | BPF_CALL:
        {
            const u8 r0 = bpf2a64[BPF_REG_0];
            const u64 func = (u64)__bpf_call_base + imm;

            emit_a64_mov_i64(tmp, func, ctx);
            emit(A64_BLR(tmp), ctx);
            emit(A64_MOV(1, r0, A64_R(0)), ctx);
            break;
        }
        /* tail call */
        case BPF_JMP | BPF_CALL | BPF_X:
            if (emit_bpf_tail_call(ctx))
                return -EFAULT;
            break;
        /* function return */
        case BPF_JMP | BPF_EXIT:
            /* Optimization: when last instruction is EXIT,
               simply fallthrough to epilogue. 
*/if (i == ctx->prog->len - 1)break;jmp_offset = epilogue_offset(ctx);check_imm26(jmp_offset);emit(A64_B(jmp_offset), ctx);break;/* dst = imm64 */case BPF_LD | BPF_IMM | BPF_DW:{const struct bpf_insn insn1 = insn[1];u64 imm64;if (insn1.code != 0 || insn1.src_reg != 0 ||insn1.dst_reg != 0 || insn1.off != 0) {/* Note: verifier in BPF core must catch invalid* instructions.*/pr_err_once("Invalid BPF_LD_IMM64 instruction\n");return -EINVAL;}imm64 = (u64)insn1.imm << 32 | (u32)imm;emit_a64_mov_i64(dst, imm64, ctx);return 1;}/* LDX: dst = *(size *)(src + off) */case BPF_LDX | BPF_MEM | BPF_W:case BPF_LDX | BPF_MEM | BPF_H:case BPF_LDX | BPF_MEM | BPF_B:case BPF_LDX | BPF_MEM | BPF_DW:emit_a64_mov_i(1, tmp, off, ctx);switch (BPF_SIZE(code)) {case BPF_W:emit(A64_LDR32(dst, src, tmp), ctx);break;case BPF_H:emit(A64_LDRH(dst, src, tmp), ctx);break;case BPF_B:emit(A64_LDRB(dst, src, tmp), ctx);break;case BPF_DW:emit(A64_LDR64(dst, src, tmp), ctx);break;}break;/* ST: *(size *)(dst + off) = imm */case BPF_ST | BPF_MEM | BPF_W:case BPF_ST | BPF_MEM | BPF_H:case BPF_ST | BPF_MEM | BPF_B:case BPF_ST | BPF_MEM | BPF_DW:/* Load imm to a register then store it */emit_a64_mov_i(1, tmp2, off, ctx);emit_a64_mov_i(1, tmp, imm, ctx);switch (BPF_SIZE(code)) {case BPF_W:emit(A64_STR32(tmp, dst, tmp2), ctx);break;case BPF_H:emit(A64_STRH(tmp, dst, tmp2), ctx);break;case BPF_B:emit(A64_STRB(tmp, dst, tmp2), ctx);break;case BPF_DW:emit(A64_STR64(tmp, dst, tmp2), ctx);break;}break;/* STX: *(size *)(dst + off) = src */case BPF_STX | BPF_MEM | BPF_W:case BPF_STX | BPF_MEM | BPF_H:case BPF_STX | BPF_MEM | BPF_B:case BPF_STX | BPF_MEM | BPF_DW:emit_a64_mov_i(1, tmp, off, ctx);switch (BPF_SIZE(code)) {case BPF_W:emit(A64_STR32(src, dst, tmp), ctx);break;case BPF_H:emit(A64_STRH(src, dst, tmp), ctx);break;case BPF_B:emit(A64_STRB(src, dst, tmp), ctx);break;case BPF_DW:emit(A64_STR64(src, dst, tmp), ctx);break;}break;/* STX XADD: lock *(u32 *)(dst + off) += src */case BPF_STX | BPF_XADD | BPF_W:/* STX 
XADD: lock *(u64 *)(dst + off) += src */case BPF_STX | BPF_XADD | BPF_DW:goto notyet;/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */case BPF_LD | BPF_ABS | BPF_W:case BPF_LD | BPF_ABS | BPF_H:case BPF_LD | BPF_ABS | BPF_B:/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */case BPF_LD | BPF_IND | BPF_W:case BPF_LD | BPF_IND | BPF_H:case BPF_LD | BPF_IND | BPF_B:{const u8 r0 = bpf2a64[BPF_REG_0]; /* r0 = return value */const u8 r6 = bpf2a64[BPF_REG_6]; /* r6 = pointer to sk_buff */const u8 fp = bpf2a64[BPF_REG_FP];const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct sk_buff *skb */const u8 r2 = bpf2a64[BPF_REG_2]; /* r2: int k */const u8 r3 = bpf2a64[BPF_REG_3]; /* r3: unsigned int size */const u8 r4 = bpf2a64[BPF_REG_4]; /* r4: void *buffer */const u8 r5 = bpf2a64[BPF_REG_5]; /* r5: void *(*func)(...) */int size;emit(A64_MOV(1, r1, r6), ctx);emit_a64_mov_i(0, r2, imm, ctx);if (BPF_MODE(code) == BPF_IND)emit(A64_ADD(0, r2, r2, src), ctx);switch (BPF_SIZE(code)) {case BPF_W:size = 4;break;case BPF_H:size = 2;break;case BPF_B:size = 1;break;default:return -EINVAL;}emit_a64_mov_i64(r3, size, ctx);emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx);emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx);emit(A64_BLR(r5), ctx);emit(A64_MOV(1, r0, A64_R(0)), ctx);jmp_offset = epilogue_offset(ctx);check_imm19(jmp_offset);emit(A64_CBZ(1, r0, jmp_offset), ctx);emit(A64_MOV(1, r5, r0), ctx);switch (BPF_SIZE(code)) {case BPF_W:emit(A64_LDR32(r0, r5, A64_ZR), ctx); #ifndef CONFIG_CPU_BIG_ENDIANemit(A64_REV32(0, r0, r0), ctx); #endifbreak;case BPF_H:emit(A64_LDRH(r0, r5, A64_ZR), ctx); #ifndef CONFIG_CPU_BIG_ENDIANemit(A64_REV16(0, r0, r0), ctx); #endifbreak;case BPF_B:emit(A64_LDRB(r0, r5, A64_ZR), ctx);break;}break;} notyet:pr_info_once("*** NOT YET: opcode %02x ***\n", code);return -EFAULT;default:pr_err_once("unknown opcode %02x\n", code);return -EINVAL;}return 0; }↓static inline void emit(const u32 insn, struct jit_ctx *ctx) {if (ctx->image != 
NULL)ctx->image[ctx->idx] = cpu_to_le32(insn);ctx->idx++; }- 1.3、epilogue
The epilogue does the reverse of the prologue: it releases the BPF stack and restores the saved registers:
static void build_epilogue(struct jit_ctx *ctx)
{
	const u8 r0 = bpf2a64[BPF_REG_0];
	const u8 r6 = bpf2a64[BPF_REG_6];
	const u8 r7 = bpf2a64[BPF_REG_7];
	const u8 r8 = bpf2a64[BPF_REG_8];
	const u8 r9 = bpf2a64[BPF_REG_9];
	const u8 fp = bpf2a64[BPF_REG_FP];

	/* We're done with BPF stack */
	emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);

	/* Restore fs (x25) and x26 */
	emit(A64_POP(fp, A64_R(26), A64_SP), ctx);

	/* Restore callee-saved register */
	emit(A64_POP(r8, r9, A64_SP), ctx);
	emit(A64_POP(r6, r7, A64_SP), ctx);

	/* Restore FP/LR registers */
	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);

	/* Set return value */
	emit(A64_MOV(1, A64_R(0), r0), ctx);

	emit(A64_RET(A64_LR), ctx);
}
- 2. interpreter
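When JIT is not supported, the kernel falls back to an interpreter that decodes the BPF instructions in prog->insnsi[] and emulates their execution:
It uses the local variable "u64 stack[MAX_BPF_STACK / sizeof(u64)]" to emulate the BPF stack space;
It uses the local variable "u64 regs[MAX_BPF_REG]" to emulate the BPF registers;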
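As a rough illustration of that idea, the sketch below interprets a made-up, heavily reduced instruction set with a local register file and a local stack. Everything prefixed with toy_ or TOY_ is hypothetical and is not the kernel's instruction encoding. The 512-byte stack size is the figure quoted earlier in this article, and the register count of 11 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REG     11   /* assumed register count, R0..R10 */
#define TOY_STACK_SIZE  512  /* stack size in bytes, as quoted in this article */
#define TOY_STACK_SLOTS (TOY_STACK_SIZE / sizeof(uint64_t))

/* Hypothetical opcode set; not the kernel's struct bpf_insn encoding. */
enum toy_op { TOY_MOV_IMM, TOY_ADD_REG, TOY_ST_STACK, TOY_LD_STACK, TOY_EXIT };

struct toy_insn {
	enum toy_op op;
	uint8_t dst;   /* destination register index */
	uint8_t src;   /* source register index */
	int32_t imm;   /* immediate value, or stack slot for ST/LD */
};

/* Runs the program and returns R0; returns UINT64_MAX on a malformed insn. */
static uint64_t toy_run(const struct toy_insn *insns, size_t len)
{
	uint64_t regs[TOY_MAX_REG] = {0};      /* emulated registers */
	uint64_t stack[TOY_STACK_SLOTS] = {0}; /* emulated stack */
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct toy_insn *in = &insns[pc];

		if (in->dst >= TOY_MAX_REG || in->src >= TOY_MAX_REG)
			return UINT64_MAX;

		switch (in->op) {
		case TOY_MOV_IMM:
			regs[in->dst] = (uint64_t)(int64_t)in->imm;
			break;
		case TOY_ADD_REG:
			regs[in->dst] += regs[in->src];
			break;
		case TOY_ST_STACK:
		case TOY_LD_STACK:
			if (in->imm < 0 || (size_t)in->imm >= TOY_STACK_SLOTS)
				return UINT64_MAX;
			if (in->op == TOY_ST_STACK)
				stack[in->imm] = regs[in->src];
			else
				regs[in->dst] = stack[in->imm];
			break;
		case TOY_EXIT:
			return regs[0];
		default:
			return UINT64_MAX;
		}
	}
	return regs[0];
}
```

The bounds checks here stand in for work that, in the kernel, is done ahead of time by the verifier rather than inside the interpreter loop.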
- 3. BPF_PROG_RUN()
Whether the program has been translated into a JIT image or is run by the interpreter, it is ultimately invoked through the BPF_PROG_RUN() macro:
ret = BPF_PROG_RUN(prog, ctx);
↓
#define BPF_PROG_RUN(filter, ctx) (*filter->bpf_func)(ctx, filter->insnsi)
1.1.4 fd allocation
Once a BPF program has been loaded into kernel space, the kernel allocates a file descriptor (fd) for it and stores the prog in the corresponding file->private_data, so that it can be referenced later.
int bpf_prog_new_fd(struct bpf_prog *prog)
{
	return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
				O_RDWR | O_CLOEXEC);
}
↓
int anon_inode_getfd(const char *name, const struct file_operations *fops,
		     void *priv, int flags)
{
	int error, fd;
	struct file *file;

	error = get_unused_fd_flags(flags);
	if (error < 0)
		return error;
	fd = error;

	file = anon_inode_getfile(name, fops, priv, flags);
	if (IS_ERR(file)) {
		error = PTR_ERR(file);
		goto err_put_unused_fd;
	}
	fd_install(fd, file);

	return fd;

err_put_unused_fd:
	put_unused_fd(fd);
	return error;
}
↓
struct file *anon_inode_getfile(const char *name,
				const struct file_operations *fops,
				void *priv, int flags)
{
	struct qstr this;
	struct path path;
	struct file *file;

	if (IS_ERR(anon_inode_inode))
		return ERR_PTR(-ENODEV);

	if (fops->owner && !try_module_get(fops->owner))
		return ERR_PTR(-ENOENT);

	/*
	 * Link the inode to a directory entry by creating a unique name
	 * using the inode sequence number.
	 */
	file = ERR_PTR(-ENOMEM);
	this.name = name;
	this.len = strlen(name);
	this.hash = 0;
	path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
	if (!path.dentry)
		goto err_module;

	path.mnt = mntget(anon_inode_mnt);
	/*
	 * We know the anon_inode inode count is always greater than zero,
	 * so ihold() is safe.
	 */
	ihold(anon_inode_inode);

	d_instantiate(path.dentry, anon_inode_inode);

	file = alloc_file(&path, OPEN_FMODE(flags), fops);
	if (IS_ERR(file))
		goto err_dput;
	file->f_mapping = anon_inode_inode->i_mapping;

	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
	file->private_data = priv;

	return file;

err_dput:
	path_put(&path);
err_module:
	module_put(fops->owner);
	return file;
}
1.2 bpf map operations
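BPF maps are used in several scenarios:
- Interaction between a BPF program and user space: when the BPF program has run, it stores its results in a map, where user space can read them;
- Interaction inside a BPF program: if a BPF program needs global variables to exchange data internally, which it may not access directly for security reasons, a map can serve as the global variable;
- BPF tail call: a tail call is a jump from one BPF program to another. The BPF program first uses a map of type BPF_MAP_TYPE_PROG_ARRAY to find the pointer to the other BPF program, then calls the tail_call() helper function to perform the tail call;
- Interaction between a BPF program and the kernel: a map can also act as the intermediary when a BPF program exchanges data with kernel code outside the BPF program;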
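The tail-call case can be modelled in plain C. The sketch below is a user-space model, not kernel code: an array of function pointers stands in for a BPF_MAP_TYPE_PROG_ARRAY map, and a counter enforces the limit of 32 jumps mentioned earlier in this article. All toy_ names are hypothetical, and simply stopping when the limit is hit is a simplification of what the kernel does.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TAIL_CALLS 32  /* limit quoted earlier in this article */
#define TOY_PROG_SLOTS     4

struct toy_ctx {
	int programs_run;  /* how many programs executed in this chain */
};

/* A toy "program": returns the slot to tail-call into, or -1 to finish. */
typedef int (*toy_prog_fn)(struct toy_ctx *ctx);

/* Stand-in for a BPF_MAP_TYPE_PROG_ARRAY map: slot index -> program. */
struct toy_prog_array {
	toy_prog_fn slots[TOY_PROG_SLOTS];
};

static toy_prog_fn toy_prog_at(const struct toy_prog_array *arr, int slot)
{
	if (slot < 0 || slot >= TOY_PROG_SLOTS)
		return NULL;
	return arr->slots[slot];
}

/* Runs the chain starting at `slot`; returns the number of tail calls taken. */
static int toy_run_chain(const struct toy_prog_array *arr, int slot,
			 struct toy_ctx *ctx)
{
	toy_prog_fn prog = toy_prog_at(arr, slot);
	int tail_calls = 0;

	while (prog) {
		toy_prog_fn target = toy_prog_at(arr, prog(ctx));

		/* Stop on "finished", on an empty slot, or at the limit. */
		if (!target || tail_calls >= TOY_MAX_TAIL_CALLS)
			break;
		tail_calls++;
		prog = target;
	}
	return tail_calls;
}

/* Three sample programs: one jumps to itself forever, two form a chain. */
static int toy_prog_loop(struct toy_ctx *ctx)  { ctx->programs_run++; return 0; }
static int toy_prog_first(struct toy_ctx *ctx) { ctx->programs_run++; return 1; }
static int toy_prog_last(struct toy_ctx *ctx)  { ctx->programs_run++; return -1; }
```

With toy_prog_loop in slot 0, the chain ends after 32 jumps instead of spinning forever, which is the purpose of the limit.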
The map types currently supported are:
static int __init register_array_map(void)
{
	bpf_register_map_type(&array_type);
	bpf_register_map_type(&percpu_array_type);
	return 0;
}
static int __init register_cgroup_array_map(void)
{
	bpf_register_map_type(&cgroup_array_type);
	return 0;
}
static int __init register_htab_map(void)
{
	bpf_register_map_type(&htab_type);
	bpf_register_map_type(&htab_percpu_type);
	return 0;
}
static int __init register_perf_event_array_map(void)
{
	bpf_register_map_type(&perf_event_array_type);
	return 0;
}
static int __init register_prog_array_map(void)
{
	bpf_register_map_type(&prog_array_type);
	return 0;
}
static int __init register_stack_map(void)
{
	bpf_register_map_type(&stack_map_type);
	return 0;
}
Whatever the map type, a map is always accessed as "key-value" pairs.
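The registration calls above suggest a simple registry: each map type contributes a descriptor that pairs a type id with an operations table, and map creation later looks the descriptor up by type. The following user-space sketch models that pattern under the assumption that the registry is a singly linked list. The toy_ names are hypothetical, and a name string stands in for the real table of function pointers.

```c
#include <assert.h>
#include <stddef.h>

enum toy_map_type { TOY_MAP_TYPE_HASH = 1, TOY_MAP_TYPE_ARRAY = 2 };

/* Per-type operations. The htab_ops table shown later in this article holds
 * function pointers such as map_alloc and map_lookup_elem; a name is enough
 * for this sketch. */
struct toy_map_ops {
	const char *name;
};

struct toy_map_type_list {
	struct toy_map_type_list *next;
	const struct toy_map_ops *ops;
	enum toy_map_type type;
};

static struct toy_map_type_list *toy_map_types;  /* head of the registry */

/* Adds one map type descriptor to the front of the registry. */
static void toy_register_map_type(struct toy_map_type_list *tl)
{
	tl->next = toy_map_types;
	toy_map_types = tl;
}

/* Finds the operations table for a map type, or NULL if none is registered. */
static const struct toy_map_ops *toy_find_map_ops(enum toy_map_type type)
{
	const struct toy_map_type_list *tl;

	for (tl = toy_map_types; tl; tl = tl->next)
		if (tl->type == type)
			return tl->ops;
	return NULL;
}
```

A lookup for a type that was never registered returns NULL, which a creation path would turn into an error for the caller.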
1.2.1 Creating a map
If the user-space BPF C program defines maps, they end up compiled into __section("maps").
When the user-space loader loads the BPF program, it first goes through the members of __section("maps") and calls the bpf() system call to create the corresponding map objects.
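For a single map, the loader's call could look roughly like the sketch below. It assumes a Linux system with the kernel UAPI header linux/bpf.h installed. The function name create_array_map is made up for this example, while union bpf_attr, BPF_MAP_CREATE and __NR_bpf come from the system headers. On a machine without sufficient privileges the call is expected to fail with errno set.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Creates an array map; returns a new fd, or -1 with errno set. */
static int create_array_map(uint32_t value_size, uint32_t max_entries)
{
	union bpf_attr attr;

	/* Unused fields of the attr union must be zero. */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = sizeof(uint32_t);   /* array maps are indexed by a u32 */
	attr.value_size = value_size;
	attr.max_entries = max_entries;

	/* glibc provides no bpf() wrapper, so go through syscall(). */
	return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

The returned fd is what the loader would later patch into the BPF instructions that reference the map.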
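- 1. BPF_MAP_TYPE_ARRAY
Taking a map of type BPF_MAP_TYPE_ARRAY as an example, let us look at how a map is allocated:
The members of attr passed in from user space have the following meanings:
attr->map_type: the type of the map;
attr->key_size: the size of a key;
attr->value_size: the size of a value;
attr->max_entries: how many entries ("key-value" pairs) the map must be able to store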
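To make the role of these four fields concrete, here is a user-space model of an array map allocator. It is a sketch, not the kernel's allocation routine: the toy_ structures are hypothetical, and the two rules it applies (a 4-byte key, and values rounded up to a multiple of 8 bytes) are assumptions chosen to match the array lookup code shown later in this article.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors only the four attr fields listed above. */
struct toy_map_attr {
	uint32_t map_type;
	uint32_t key_size;
	uint32_t value_size;
	uint32_t max_entries;
};

struct toy_array {
	uint32_t max_entries;
	size_t elem_size;        /* value_size rounded up to 8 bytes */
	unsigned char *values;   /* max_entries slots of elem_size bytes */
};

static struct toy_array *toy_array_alloc(const struct toy_map_attr *attr)
{
	struct toy_array *arr;

	/* An array is indexed by a 4-byte key and needs non-empty values. */
	if (attr->key_size != sizeof(uint32_t) || attr->value_size == 0 ||
	    attr->max_entries == 0 || attr->value_size > UINT32_MAX - 7)
		return NULL;

	arr = malloc(sizeof(*arr));
	if (!arr)
		return NULL;
	arr->max_entries = attr->max_entries;
	arr->elem_size = ((size_t)attr->value_size + 7) & ~(size_t)7;
	/* calloc() zero-fills and rejects a size that would overflow. */
	arr->values = calloc(attr->max_entries, arr->elem_size);
	if (!arr->values) {
		free(arr);
		return NULL;
	}
	return arr;
}

/* The key is the index; out-of-range keys have no value. */
static void *toy_array_lookup(const struct toy_array *arr, const void *key)
{
	uint32_t index;

	memcpy(&index, key, sizeof(index));
	if (index >= arr->max_entries)
		return NULL;
	return arr->values + arr->elem_size * index;
}

static void toy_array_free(struct toy_array *arr)
{
	if (arr) {
		free(arr->values);
		free(arr);
	}
}
```

With value_size = 12 and max_entries = 3, each slot occupies 16 bytes and key 2 resolves to byte offset 32 in the value area.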
- 2. BPF_MAP_TYPE_HASH
Taking a map of type BPF_MAP_TYPE_HASH as an example, let us look at how a map is allocated:
static const struct bpf_map_ops htab_ops = {
	.map_alloc = htab_map_alloc,
	.map_free = htab_map_free,
	.map_get_next_key = htab_map_get_next_key,
	.map_lookup_elem = htab_map_lookup_elem,
	.map_update_elem = htab_map_update_elem,
	.map_delete_elem = htab_map_delete_elem,
};

static struct bpf_map_type_list htab_type __read_mostly = {
	.ops = &htab_ops,
	.type = BPF_MAP_TYPE_HASH,
};
↓
static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
{
	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_HASH;
	struct bpf_htab *htab;
	int err, i;
	u64 cost;

	if (attr->map_flags & ~BPF_F_NO_PREALLOC)
		/* reserved bits should not be used */
		return ERR_PTR(-EINVAL);

	/* (1.1.1) A hash map keeps its elements in linked lists, so the
	 * bpf_htab structure itself has a fixed size and is allocated first */
	htab = kzalloc(sizeof(*htab), GFP_USER);
	if (!htab)
		return ERR_PTR(-ENOMEM);

	/* mandatory map attributes */
	htab->map.map_type = attr->map_type;
	htab->map.key_size = attr->key_size;
	htab->map.value_size = attr->value_size;
	htab->map.max_entries = attr->max_entries;
	htab->map.map_flags = attr->map_flags;

	/* check sanity of attributes.
	 * value_size == 0 may be allowed in the future to use map as a set
	 */
	err = -EINVAL;
	if (htab->map.max_entries == 0 || htab->map.key_size == 0 ||
	    htab->map.value_size == 0)
		goto free_htab;

	/* hash table size must be power of 2 */
	/* (1.1.2) The number of list-head buckets is max_entries rounded up
	 * to the next power of two */
	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);

	err = -E2BIG;
	if (htab->map.key_size > MAX_BPF_STACK)
		/* eBPF programs initialize keys on stack, so they cannot be
		 * larger than max stack size
		 */
		goto free_htab;

	if (htab->map.value_size >= (1 << (KMALLOC_SHIFT_MAX - 1)) -
	    MAX_BPF_STACK - sizeof(struct htab_elem))
		/* if value_size is bigger, the user space won't be able to
		 * access the elements via bpf syscall. This check also makes
		 * sure that the elem_size doesn't overflow and it's
		 * kmalloc-able later in htab_map_update_elem()
		 */
		goto free_htab;

	if (percpu && round_up(htab->map.value_size, 8) > PCPU_MIN_UNIT_SIZE)
		/* make sure the size for pcpu_alloc() is reasonable */
		goto free_htab;

	/* (1.1.3) Size of one hash element:
	 * elem_size = htab_elem + key_size + value_size
	 * (a percpu map stores a pointer in place of the value) */
	htab->elem_size = sizeof(struct htab_elem) +
			  round_up(htab->map.key_size, 8);
	if (percpu)
		htab->elem_size += sizeof(void *);
	else
		htab->elem_size += round_up(htab->map.value_size, 8);

	/* prevent zero size kmalloc and check for u32 overflow */
	if (htab->n_buckets == 0 ||
	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
		goto free_htab;

	/* (1.1.4) Total memory cost = sizeof(struct bucket) * n_buckets +
	 * elem_size * max_entries + extra_element_size, where for a
	 * non-percpu map extra_element_size = elem_size * num_possible_cpus() */
	cost = (u64) htab->n_buckets * sizeof(struct bucket) +
	       (u64) htab->elem_size * htab->map.max_entries;

	if (percpu)
		cost += (u64) round_up(htab->map.value_size, 8) *
			num_possible_cpus() * htab->map.max_entries;
	else
		cost += (u64) htab->elem_size * num_possible_cpus();

	if (cost >= U32_MAX - PAGE_SIZE)
		/* make sure page count doesn't overflow */
		goto free_htab;

	htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;

	/* if map size is larger than memlock limit, reject it early */
	err = bpf_map_precharge_memlock(htab->map.pages);
	if (err)
		goto free_htab;

	err = -ENOMEM;
	/* (1.1.5) Allocate the bucket array */
	htab->buckets = bpf_map_area_alloc(htab->n_buckets *
					   sizeof(struct bucket));
	if (!htab->buckets)
		goto free_htab;

	for (i = 0; i < htab->n_buckets; i++) {
		INIT_HLIST_HEAD(&htab->buckets[i].head);
		raw_spin_lock_init(&htab->buckets[i].lock);
	}

	/* (1.1.6) Allocate the extra elements */
	if (!percpu) {
		err = alloc_extra_elems(htab);
		if (err)
			goto free_buckets;
	}

	/* (1.1.7) Allocate the elements and spread them evenly over the
	 * per-CPU lists of htab->freelist */
	if (!(attr->map_flags & BPF_F_NO_PREALLOC)) {
		err = prealloc_elems_and_freelist(htab);
		if (err)
			goto free_extra_elems;
	}

	return &htab->map;

free_extra_elems:
	free_percpu(htab->extra_elems);
free_buckets:
	bpf_map_area_free(htab->buckets);
free_htab:
	kfree(htab);
	return ERR_PTR(err);
}
1.2.2 Looking up a map
A lookup finds the value that corresponds to a given key.
static int map_lookup_elem(union bpf_attr *attr)
{
	void __user *ukey = u64_to_ptr(attr->key);
	void __user *uvalue = u64_to_ptr(attr->value);
	int ufd = attr->map_fd;
	struct bpf_map *map;
	void *key, *value, *ptr;
	u32 value_size;
	struct fd f;
	int err;

	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
		return -EINVAL;

	f = fdget(ufd);
	map = __bpf_map_get(f);
	if (IS_ERR(map))
		return PTR_ERR(map);

	err = -ENOMEM;
	key = kmalloc(map->key_size, GFP_USER);
	if (!key)
		goto err_put;

	err = -EFAULT;
	if (copy_from_user(key, ukey, map->key_size) != 0)
		goto free_key;

	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
		value_size = round_up(map->value_size, 8) * num_possible_cpus();
	else
		value_size = map->value_size;

	err = -ENOMEM;
	value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
	if (!value)
		goto free_key;

	/* (1) Special handling for a few map types */
	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH) {
		err = bpf_percpu_hash_copy(map, key, value);
	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
		err = bpf_percpu_array_copy(map, key, value);
	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
		err = bpf_stackmap_copy(map, key, value);
	/* (2) All other map types call map->ops->map_lookup_elem() */
	} else {
		rcu_read_lock();
		ptr = map->ops->map_lookup_elem(map, key);
		/* (3.1) Copy the result into value */
		if (ptr)
			memcpy(value, ptr, value_size);
		rcu_read_unlock();
		err = ptr ? 0 : -ENOENT;
	}

	if (err)
		goto free_value;

	err = -EFAULT;
	/* (3.2) Copy the value back to user space */
	if (copy_to_user(uvalue, value, value_size) != 0)
		goto free_value;

	err = 0;

free_value:
	kfree(value);
free_key:
	kfree(key);
err_put:
	fdput(f);
	return err;
}
- 1. BPF_MAP_TYPE_ARRAY
For a map of type BPF_MAP_TYPE_ARRAY, the lookup ends up in array_map_lookup_elem():
static void *array_map_lookup_elem(struct bpf_map *map, void *key)
{
	struct bpf_array *array = container_of(map, struct bpf_array, map);
	/* (2.1) The key is the index */
	u32 index = *(u32 *)key;

	if (unlikely(index >= array->map.max_entries))
		return NULL;

	/* (2.2) Use the index to locate the value in the array->value[] array */
	return array->value + array->elem_size * (index & array->index_mask);
}
- 2. BPF_MAP_TYPE_HASH
For a map of type BPF_MAP_TYPE_HASH, the lookup ends up in htab_map_lookup_elem():
static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
{
	struct htab_elem *l = __htab_map_lookup_elem(map, key);

	if (l)
		return l->key + round_up(map->key_size, 8);

	return NULL;
}
↓
static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
	struct hlist_head *head;
	struct htab_elem *l;
	u32 hash, key_size;

	/* Must be called with rcu_read_lock. */
	WARN_ON_ONCE(!rcu_read_lock_held());

	key_size = map->key_size;

	/* (2.1) Compute the hash value from the key */
	hash = htab_map_hash(key, key_size);

	/* (2.2) Use the hash value to select the list-head bucket */
	head = select_bucket(htab, hash);

	/* (2.3) Search the bucket's list for an htab_elem with an equal key;
	 * return NULL if none is found */
	l = lookup_elem_raw(head, hash, key, key_size);

	return l;
}
1.2.3 BPF_FUNC_map_lookup_elem
User space looks up the value for a key through the bpf() system call, but a BPF program also needs to find the address of a value from its key and then use it inside the program. A BPF program does this by calling the BPF_FUNC_map_lookup_elem helper function.
Taking perf_event as an example, let us look at how the BPF_FUNC_map_lookup_elem helper function is implemented:
static const struct bpf_verifier_ops perf_event_prog_ops = {
	.get_func_proto = tp_prog_func_proto,
	.is_valid_access = pe_prog_is_valid_access,
	.convert_ctx_access = pe_prog_convert_ctx_access,
};

static struct bpf_prog_type_list perf_event_tl = {
	.ops = &perf_event_prog_ops,
	.type = BPF_PROG_TYPE_PERF_EVENT,
};
↓
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_perf_event_output:
		return &bpf_perf_event_output_proto_tp;
	case BPF_FUNC_get_stackid:
		return &bpf_get_stackid_proto_tp;
	default:
		return tracing_func_proto(func_id);
	}
}
↓
static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
		return &bpf_map_lookup_elem_proto;
	case BPF_FUNC_map_update_elem:
		return &bpf_map_update_elem_proto;
	case BPF_FUNC_map_delete_elem:
		return &bpf_map_delete_elem_proto;
	case BPF_FUNC_probe_read:
		return &bpf_probe_read_proto;
	case BPF_FUNC_ktime_get_ns:
		return &bpf_ktime_get_ns_proto;
	case BPF_FUNC_tail_call:
		return &bpf_tail_call_proto;
	case BPF_FUNC_get_current_pid_tgid:
		return &bpf_get_current_pid_tgid_proto;
	case BPF_FUNC_get_current_task:
		return &bpf_get_current_task_proto;
	case BPF_FUNC_get_current_uid_gid:
		return &bpf_get_current_uid_gid_proto;
	case BPF_FUNC_get_current_comm:
		return &bpf_get_current_comm_proto;
	case BPF_FUNC_trace_printk:
		return bpf_get_trace_printk_proto();
	case BPF_FUNC_get_smp_processor_id:
		return &bpf_get_smp_processor_id_proto;
	case BPF_FUNC_perf_event_read:
		return &bpf_perf_event_read_proto;
	case BPF_FUNC_probe_write_user:
		return bpf_get_probe_write_proto();
	case BPF_FUNC_current_task_under_cgroup:
		return &bpf_current_task_under_cgroup_proto;
	case BPF_FUNC_get_prandom_u32:
		return &bpf_get_prandom_u32_proto;
	default:
		return NULL;
	}
}
↓
const struct bpf_func_proto bpf_map_lookup_elem_proto = {
	.func = bpf_map_lookup_elem,
	.gpl_only = false,
	.pkt_access = true,
	.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
	.arg1_type = ARG_CONST_MAP_PTR,
	.arg2_type = ARG_PTR_TO_MAP_KEY,
};
↓
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
{
	WARN_ON_ONCE(!rcu_read_lock_held());
	return (unsigned long) map->ops->map_lookup_elem(map, key);
}
As with the bpf() system call, the function that finally gets called is map->ops->map_lookup_elem(). The difference is that a BPF program gets back a pointer to the value, whereas the bpf() system call returns a copy of the value.
The other map helper functions, BPF_FUNC_map_update_elem and BPF_FUNC_map_delete_elem, work on the same principle.
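The pointer-versus-copy distinction can be shown with a small user-space model. In this sketch a static array plays the map's storage, toy_map_lookup_elem() plays the shared map->ops->map_lookup_elem() back end, and the two wrappers model the helper path and the system call path. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TOY_ENTRIES 4

static uint64_t toy_values[TOY_ENTRIES];   /* the map's value storage */

/* Shared back end, playing the role of map->ops->map_lookup_elem(). */
static void *toy_map_lookup_elem(uint32_t key)
{
	return key < TOY_ENTRIES ? &toy_values[key] : NULL;
}

/* Helper path: the BPF program receives a pointer into the map itself. */
static uint64_t *toy_helper_lookup(uint32_t key)
{
	return toy_map_lookup_elem(key);
}

/* Syscall path: the caller receives a copy of the value. */
static int toy_syscall_lookup(uint32_t key, uint64_t *out)
{
	const void *ptr = toy_map_lookup_elem(key);

	if (!ptr)
		return -ENOENT;
	memcpy(out, ptr, sizeof(*out));
	return 0;
}
```

A write through the pointer returned by toy_helper_lookup() changes the map in place, while changing the copy filled in by toy_syscall_lookup() leaves the map untouched.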
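1.3 obj pin
The kernel binds both bpf_prog and bpf_map objects to file descriptors. This has advantages: for example, user space can apply the usual generic file operations to them. It also has drawbacks: an fd lives inside one process, so other processes cannot access it, and once that process exits the objects can no longer be reached.
The kernel therefore also supports making bpf objects globally visible, by binding them to a dedicated file system:
# ls /sys/fs/bpf/
#
This is split into a pin operation and a get operation.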
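Before reading the kernel code, it may help to see the two operations as a toy model: pin records a path for an object and takes a reference on it, and get finds the object by path and hands out another reference. The sketch below is a user-space illustration only. The fixed-size table, the toy_ names and the chosen error codes are assumptions, not the kernel's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum toy_bpf_type { TOY_TYPE_UNSPEC, TOY_TYPE_PROG, TOY_TYPE_MAP };

struct toy_obj {
	enum toy_bpf_type type;
	int refcnt;
};

#define TOY_MAX_PINS 8
#define TOY_PATH_MAX 64

/* Stand-in for the dedicated file system: path -> object. */
static struct {
	char path[TOY_PATH_MAX];
	struct toy_obj *obj;
} toy_pins[TOY_MAX_PINS];

/* Pins obj under path; the pin holds its own reference on the object. */
static int toy_obj_pin(struct toy_obj *obj, const char *path)
{
	int i, free_slot = -1;

	if (strlen(path) >= TOY_PATH_MAX)
		return -ENAMETOOLONG;
	for (i = 0; i < TOY_MAX_PINS; i++) {
		if (!toy_pins[i].obj) {
			if (free_slot < 0)
				free_slot = i;
		} else if (strcmp(toy_pins[i].path, path) == 0) {
			return -EEXIST;
		}
	}
	if (free_slot < 0)
		return -ENOSPC;
	strcpy(toy_pins[free_slot].path, path);
	toy_pins[free_slot].obj = obj;
	obj->refcnt++;
	return 0;
}

/* Looks an object up by path and takes a new reference for the caller. */
static struct toy_obj *toy_obj_get(const char *path)
{
	int i;

	for (i = 0; i < TOY_MAX_PINS; i++) {
		if (toy_pins[i].obj && strcmp(toy_pins[i].path, path) == 0) {
			toy_pins[i].obj->refcnt++;
			return toy_pins[i].obj;
		}
	}
	return NULL;
}
```

Because the pin holds its own reference, the object stays reachable by path even after the process that created it has dropped its fd.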
1.3.1 bpf_obj_pin()
static int bpf_obj_pin(const union bpf_attr *attr)
{
	if (CHECK_ATTR(BPF_OBJ))
		return -EINVAL;

	return bpf_obj_pin_user(attr->bpf_fd, u64_to_ptr(attr->pathname));
}
↓
int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
{
	struct filename *pname;
	enum bpf_type type;
	void *raw;
	int ret;

	/* (1) Get the path from the user-supplied string */
	pname = getname(pathname);
	if (IS_ERR(pname))
		return PTR_ERR(pname);

	/* (2) Get the bpf_map/bpf_prog object from the fd */
	raw = bpf_fd_probe_obj(ufd, &type);
	if (IS_ERR(raw)) {
		ret = PTR_ERR(raw);
		goto out;
	}

	/* (3) Create the file node and link it to the bpf object */
	ret = bpf_obj_do_pin(pname, raw, type);
	if (ret != 0)
		bpf_any_put(raw, type);
out:
	putname(pname);
	return ret;
}
|→
static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
{
	void *raw;

	/* (2.1) Try to get a map object from the fd */
	*type = BPF_TYPE_MAP;
	raw = bpf_map_get_with_uref(ufd);
	if (IS_ERR(raw)) {
		/* (2.2) If that fails, try to get a prog object from the fd */
		*type = BPF_TYPE_PROG;
		raw = bpf_prog_get(ufd);
	}

	return raw;
}
|→
static int bpf_obj_do_pin(const struct filename *pathname, void *raw,
			  enum bpf_type type)
{
	struct dentry *dentry;
	struct inode *dir;
	struct path path;
	umode_t mode;
	dev_t devt;
	int ret;

	/* (3.1) Create the dentry object */
	dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);

	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
	/* (3.2) The type is stored in devt */
	devt = MKDEV(UNNAMED_MAJOR, type);

	ret = security_path_mknod(&path, dentry, mode, devt);
	if (ret)
		goto out;

	dir = d_inode(path.dentry);
	if (dir->i_op != &bpf_dir_iops) {
		ret = -EPERM;
		goto out;
	}

	/* (3.3) Store the object pointer raw in dentry->d_fsdata, then
	 * create the inode */
	dentry->d_fsdata = raw;
	ret = vfs_mknod(dir, dentry, mode, devt);
	dentry->d_fsdata = NULL;
out:
	done_path_create(&path, dentry);
	return ret;
}
1.3.2 bpf_obj_get()
static int bpf_obj_get(const union bpf_attr *attr)
{
	if (CHECK_ATTR(BPF_OBJ) || attr->bpf_fd != 0)
		return -EINVAL;

	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
}
↓
int bpf_obj_get_user(const char __user *pathname)
{
	enum bpf_type type = BPF_TYPE_UNSPEC;
	struct filename *pname;
	int ret = -ENOENT;
	void *raw;

	/* (1) Get the path from the user-supplied string */
	pname = getname(pathname);
	if (IS_ERR(pname))
		return PTR_ERR(pname);

	/* (2) Use the path to find the bpf object's raw pointer and type
	 * in the corresponding inode */
	raw = bpf_obj_do_get(pname, &type);
	if (IS_ERR(raw)) {
		ret = PTR_ERR(raw);
		goto out;
	}

	/* (3) Depending on the object type, allocate an fd for the bpf
	 * object in the current process */
	if (type == BPF_TYPE_PROG)
		ret = bpf_prog_new_fd(raw);
	else if (type == BPF_TYPE_MAP)
		ret = bpf_map_new_fd(raw);
	else
		goto out;

	if (ret < 0)
		bpf_any_put(raw, type);
out:
	putname(pname);
	return ret;
}
↓
static void *bpf_obj_do_get(const struct filename *pathname,
			    enum bpf_type *type)
{
	struct inode *inode;
	struct path path;
	void *raw;
	int ret;

	/* (2.1) Get the dentry from the path */
	ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
	if (ret)
		return ERR_PTR(ret);

	/* (2.2) Get the inode from the dentry */
	inode = d_backing_inode(path.dentry);
	ret = inode_permission(inode, MAY_WRITE);
	if (ret)
		goto out;

	/* (2.3) Get the type from the inode */
	ret = bpf_inode_type(inode, type);
	if (ret)
		goto out;

	/* (2.4) Get the raw pointer from the inode and the type */
	raw = bpf_any_get(inode->i_private, *type);
	if (!IS_ERR(raw))
		touch_atime(&path);

	path_put(&path);
	return raw;
out:
	path_put(&path);
	return ERR_PTR(ret);
}
2. Tracing-type BPF programs
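After the steps in the previous section, the bpf program and its maps have been loaded into the kernel. When does the bpf program actually get to do its work?
For that, the subsystem that uses bpf has to attach the program to a suitable hook. When the code path containing the hook is executed, the hook fires and the BPF program runs.
The subsystems that currently use bpf fall into two broad categories:
- tracing: kprobe, tracepoint, perf_event
- filter: sk_filter, sched_cls, sched_act, xdp, cg_skb
Let us look closely at how the tracing subsystems use bpf. All tracing-type bpf operations are carried out through perf.
2.1 Binding a bpf program
After the perf_event_open() system call has created a perf_event and returned a file descriptor, the ioctl command PERF_EVENT_IOC_SET_BPF can be used to bind an already loaded bpf program to that perf_event.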
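From user space, this sequence could look roughly like the following sketch. It assumes a Linux system with linux/perf_event.h available. The function names open_tracepoint_event and attach_bpf_prog are made up for this example, the choice of attr fields is a guess at a reasonable tracepoint configuration, and the caller must supply the tracepoint id and the fd of an already loaded program.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Opens a tracepoint perf_event on CPU 0 for any task; returns an fd or -1.
 * glibc has no perf_event_open() wrapper, so syscall() is used directly. */
static int open_tracepoint_event(unsigned long long tracepoint_id)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = tracepoint_id;
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;
	attr.wakeup_events = 1;

	return (int)syscall(__NR_perf_event_open, &attr, (pid_t)-1, 0, -1, 0UL);
}

/* Binds an already loaded BPF program to the event, then enables the event.
 * Returns 0 on success, or -1 with errno set by the failing ioctl(). */
static int attach_bpf_prog(int event_fd, int prog_fd)
{
	if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0)
		return -1;
	if (ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0) < 0)
		return -1;
	return 0;
}
```

Passing an invalid event fd makes the first ioctl() fail with EBADF, so a caller can tell a bad descriptor apart from a rejected program type.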
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	struct perf_event *event = file->private_data;
	struct perf_event_context *ctx;
	long ret;

	ctx = perf_event_ctx_lock(event);
	ret = _perf_ioctl(event, cmd, arg);
	perf_event_ctx_unlock(event, ctx);

	return ret;
}
↓
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
	void (*func)(struct perf_event *);
	u32 flags = arg;

	switch (cmd) {
	case PERF_EVENT_IOC_ENABLE:
		func = _perf_event_enable;
		break;
	case PERF_EVENT_IOC_DISABLE:
		func = _perf_event_disable;
		break;
	case PERF_EVENT_IOC_RESET:
		func = _perf_event_reset;
		break;

	case PERF_EVENT_IOC_REFRESH:
		return _perf_event_refresh(event, arg);

	case PERF_EVENT_IOC_PERIOD:
		return perf_event_period(event, (u64 __user *)arg);

	case PERF_EVENT_IOC_ID:
	{
		u64 id = primary_event_id(event);

		if (copy_to_user((void __user *)arg, &id, sizeof(id)))
			return -EFAULT;
		return 0;
	}

	case PERF_EVENT_IOC_SET_OUTPUT:
	{
		int ret;

		if (arg != -1) {
			struct perf_event *output_event;
			struct fd output;

			ret = perf_fget_light(arg, &output);
			if (ret)
				return ret;
			output_event = output.file->private_data;
			ret = perf_event_set_output(event, output_event);
			fdput(output);
		} else {
			ret = perf_event_set_output(event, NULL);
		}
		return ret;
	}

	case PERF_EVENT_IOC_SET_FILTER:
		return perf_event_set_filter(event, (void __user *)arg);

	case PERF_EVENT_IOC_SET_BPF:
		return perf_event_set_bpf_prog(event, arg);

	case PERF_EVENT_IOC_PAUSE_OUTPUT: {
		struct ring_buffer *rb;

		rcu_read_lock();
		rb = rcu_dereference(event->rb);
		if (!rb || !rb->nr_pages) {
			rcu_read_unlock();
			return -EINVAL;
		}
		rb_toggle_paused(rb, !!arg);
		rcu_read_unlock();
		return 0;
	}
	default:
		return -ENOTTY;
	}

	if (flags & PERF_IOC_FLAG_GROUP)
		perf_event_for_each(event, func);
	else
		perf_event_for_each_child(event, func);

	return 0;
}
↓
static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
{
	bool is_kprobe, is_tracepoint;
	struct bpf_prog *prog;

	/* (1) A perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE
	 * must be bound to a BPF prog of type BPF_PROG_TYPE_PERF_EVENT:
	 * event->prog = prog; */
	if (event->attr.type == PERF_TYPE_HARDWARE ||
	    event->attr.type == PERF_TYPE_SOFTWARE)
		return perf_event_set_bpf_handler(event, prog_fd);

	if (event->attr.type != PERF_TYPE_TRACEPOINT)
		return -EINVAL;

	if (event->tp_event->prog)
		return -EEXIST;

	is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
	is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
	if (!is_kprobe && !is_tracepoint)
		/* bpf programs can only be attached to u/kprobe or tracepoint */
		return -EINVAL;

	prog = bpf_prog_get(prog_fd);
	if (IS_ERR(prog))
		return PTR_ERR(prog);

	/* (2) A TRACE_EVENT_FL_TRACEPOINT perf_event must be bound to a BPF
	 * prog of type BPF_PROG_TYPE_TRACEPOINT, and a TRACE_EVENT_FL_UKPROBE
	 * perf_event to one of type BPF_PROG_TYPE_KPROBE:
	 * event->tp_event->prog = prog; */
	if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
	    (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
		/* valid fd, but invalid bpf program type */
		bpf_prog_put(prog);
		return -EINVAL;
	}

	/* (3) For a tracepoint perf_event, the largest context offset that
	 * the bpf_prog accesses must not exceed the size of the tracepoint's
	 * event data; otherwise the access is treated as illegal */
	if (is_tracepoint) {
		int off = trace_event_get_offsets(event->tp_event);

		if (prog->aux->max_ctx_offset > off) {
			bpf_prog_put(prog);
			return -EACCES;
		}
	}
	event->tp_event->prog = prog;
	event->tp_event->bpf_prog_owner = event;

	return 0;
}
As shown above, the rules for binding a bpf_prog to a perf_event are:
- A perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE must be bound to a BPF prog of type BPF_PROG_TYPE_PERF_EVENT. event->prog = prog;
- A perf_event of type PERF_TYPE_TRACEPOINT that is implemented by TRACE_EVENT_FL_TRACEPOINT must be bound to a BPF prog of type BPF_PROG_TYPE_TRACEPOINT. event->tp_event->prog = prog;
- A perf_event of type PERF_TYPE_TRACEPOINT that is implemented by TRACE_EVENT_FL_UKPROBE must be bound to a BPF prog of type BPF_PROG_TYPE_KPROBE. event->tp_event->prog = prog;
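These three rules amount to a small decision function. The sketch below restates them in stand-alone C with hypothetical TOY_ enums and flags in place of the kernel's constants. It checks the kprobe flag before the tracepoint flag, which is an arbitrary choice for this example.

```c
#include <assert.h>

enum toy_perf_type {
	TOY_PERF_HARDWARE,
	TOY_PERF_SOFTWARE,
	TOY_PERF_TRACEPOINT,
	TOY_PERF_OTHER
};

enum toy_prog_type {
	TOY_PROG_INVALID,      /* no BPF program may be bound */
	TOY_PROG_PERF_EVENT,
	TOY_PROG_TRACEPOINT,
	TOY_PROG_KPROBE
};

#define TOY_FL_TRACEPOINT 0x1u  /* event is backed by a static tracepoint */
#define TOY_FL_UKPROBE    0x2u  /* event is backed by a kprobe or uprobe */

/* Returns the BPF program type that a perf_event of this kind accepts. */
static enum toy_prog_type toy_required_prog_type(enum toy_perf_type type,
						 unsigned int tp_flags)
{
	if (type == TOY_PERF_HARDWARE || type == TOY_PERF_SOFTWARE)
		return TOY_PROG_PERF_EVENT;
	if (type != TOY_PERF_TRACEPOINT)
		return TOY_PROG_INVALID;
	if (tp_flags & TOY_FL_UKPROBE)
		return TOY_PROG_KPROBE;
	if (tp_flags & TOY_FL_TRACEPOINT)
		return TOY_PROG_TRACEPOINT;
	return TOY_PROG_INVALID;
}
```

A binding routine would compare this result with the type of the program behind prog_fd and reject any mismatch.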
2.2 Executing a bpf program
Because the different kinds of perf_event take different execution paths, we describe them separately.
- 1. perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE.
- 2. perf_event of type PERF_TYPE_TRACEPOINT implemented by TRACE_EVENT_FL_TRACEPOINT.
- 3. perf_event of type PERF_TYPE_TRACEPOINT implemented by TRACE_EVENT_FL_UKPROBE.
Implementation for the kprobe type:
static void kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
	struct trace_event_call *call = &tk->tp.call;
	struct bpf_prog *prog = call->prog;
	struct kprobe_trace_entry_head *entry;
	struct hlist_head *head;
	int size, __size, dsize;
	int rctx;

	/* (1) Run the bpf_prog; a return value of 0 drops the event */
	if (prog && !trace_call_bpf(prog, regs))
		return;

	head = this_cpu_ptr(call->perf_events);
	if (hlist_empty(head))
		return;

	dsize = __get_data_size(&tk->tp, regs);
	__size = sizeof(*entry) + tk->tp.size + dsize;
	size = ALIGN(__size + sizeof(u32), sizeof(u64));
	size -= sizeof(u32);

	entry = perf_trace_buf_alloc(size, NULL, &rctx);
	if (!entry)
		return;

	entry->ip = (unsigned long)tk->rp.kp.addr;
	memset(&entry[1], 0, dsize);
	store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
	/* (2) Hand the data to perf_event */
	perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
			      head, NULL);
}
Implementation for the kretprobe type:
static void kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
				struct pt_regs *regs)
{
	struct trace_event_call *call = &tk->tp.call;
	struct bpf_prog *prog = call->prog;
	struct kretprobe_trace_entry_head *entry;
	struct hlist_head *head;
	int size, __size, dsize;
	int rctx;

	/* (1) Run the bpf_prog; a return value of 0 drops the event */
	if (prog && !trace_call_bpf(prog, regs))
		return;

	head = this_cpu_ptr(call->perf_events);
	if (hlist_empty(head))
		return;

	dsize = __get_data_size(&tk->tp, regs);
	__size = sizeof(*entry) + tk->tp.size + dsize;
	size = ALIGN(__size + sizeof(u32), sizeof(u64));
	size -= sizeof(u32);

	entry = perf_trace_buf_alloc(size, NULL, &rctx);
	if (!entry)
		return;

	entry->func = (unsigned long)tk->rp.kp.addr;
	entry->ret_ip = (unsigned long)ri->ret_addr;
	store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
	/* (2) Hand the data to perf_event */
	perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
			      head, NULL);
}
3. Filter-type BPF programs
Not analyzed for now.