Disaster Recovery | A Hands-On etcd Data Snapshot Backup and Restore in a Kubernetes Cluster
Author homepage: https://www.weiyigeek.top
Author blog: https://blog.weiyigeek.top
First published at: https://mp.weixin.qq.com/s/eblhCVjEFdvw7B-tlNTi-w
Table of Contents:

- 0x00 Introduction
- 0x01 Environment Preparation
  - 1. Installing the etcdctl binary
  - 2. Pulling a Docker image that ships etcdctl
  - 3. Creating a Pod from the etcdctl image in Kubernetes
- 0x02 Backup Practice
  - 1. Using the binary-installed etcdctl client
  - 2. Using the etcdctl client from a Docker image
  - 3. Quick manual backup from a Pod in the Kubernetes cluster
  - 4. Scheduled backups with a CronJob controller in the Kubernetes cluster
- 0x03 Restore Practice
  - 1. Single-master restore
  - 2. Multi-master restore
  - 3. Resolving etcd data inconsistency between nodes in a K8S cluster
Note: Given the author's limited knowledge, errors and omissions in this chapter are inevitable, and corrections from readers are welcome. If you have questions or suggestions, leave them in the comments at the end of the article, e-mail master@weiyigeek.top, or leave a message on the WeChat public account [全棧工程師修煉指南].
0x00 Introduction
Description: In a Kubernetes cluster, every resource you operate on is stored in the etcd database. To keep the cluster recoverable when nodes stop working, when the cluster is migrated, or after any other incident, we need to take regular disaster-recovery backups of the etcd cluster data.
In a K8S cluster, or on any host with Docker, backing up the etcd data is straightforward: taking a snapshot on a single node is usually enough, because the snapshot file contains the complete Kubernetes state and all critical information. With an etcd backup on hand, the Kubernetes cluster can be rebuilt quickly even in a disaster scenario such as losing all control-plane nodes, and the boss no longer has to worry about the system failing to come back up.
0x01 Environment Preparation
1. Installing the etcdctl binary
Description: The etcdctl binary can be downloaded from github.com/coreos/etcd/releases; pick the release that matches your etcd version. For example, run the install_etcdctl.sh script below, adjusting the version variable as needed.
install_etcdctl.sh
#!/bin/bash
# Author: WeiyiGeek
# Description: download and install etcd and etcdctl
ETCD_VER=v3.5.5
ETCD_DIR=etcd-download
DOWNLOAD_URL=https://github.com/coreos/etcd/releases/download

# Download
mkdir ${ETCD_DIR}
cd ${ETCD_DIR}
wget ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar -xzvf etcd-${ETCD_VER}-linux-amd64.tar.gz

# Install
cd etcd-${ETCD_VER}-linux-amd64
cp etcdctl /usr/local/bin/

Verify the installation:
$ etcdctl version
etcdctl version: 3.5.5
API version: 3.5

2. Pulling a Docker image that ships etcdctl
Procedure:
# Pull the image and run a throwaway container.
docker run --rm \
  -v /data/backup:/backup \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --env ETCDCTL_API=3 \
  registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
  /bin/sh -c "etcdctl version"

Verify the installation:
3.5.1-0: Pulling from google_containers/etcd
e8614d09b7be: Pull complete
45b6afb4a92f: Pull complete
.......
Digest: sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263
Status: Downloaded newer image for registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0
etcdctl version: 3.5.1
API version: 3.5

3. Creating a Pod from the etcdctl image in Kubernetes
# Pull the image
crictl pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0

# Create the Pod and verify the installation
$ kubectl run etcdctl --image=registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0 --command -- /usr/local/bin/etcdctl version
$ kubectl logs -f etcdctl
etcdctl version: 3.5.5
API version: 3.5
$ kubectl delete pod etcdctl

0x02 Backup Practice
1. Using the binary-installed etcdctl client
Tip: For a single-node Kubernetes cluster, you only need to snapshot its one etcd database. For a multi-master cluster, back up the etcd member on each master node in turn, to guard against the etcd data changing while the backup is taken!
The environment used here is a highly available multi-master cluster: three master nodes and four worker nodes. If you are new to K8s or want to build a highly available cluster yourself, see the author's Kubernetes study notes:
https://www.weiyigeek.top/wechat.html?key=Kubernetes學(xué)習(xí)之路匯總
Procedure:
# 1. Create a directory for the etcd snapshot backups
$ mkdir -pv /backup

# 2. Check the etcd certificates
$ ls /etc/kubernetes/pki/etcd/
ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key  peer.crt  peer.key  server.crt  server.key

# 3. Find the etcd endpoints and Pods
$ kubectl get pod -n kube-system -o wide | grep "etcd"
etcd-weiyigeek-107   1/1   Running   1   279d   192.168.12.107   weiyigeek-107   <none>   <none>
etcd-weiyigeek-108   1/1   Running   0   278d   192.168.12.108   weiyigeek-108   <none>   <none>
etcd-weiyigeek-109   1/1   Running   0   278d   192.168.12.109   weiyigeek-109   <none>   <none>

# 4. On the 107, 108 and 109 master nodes, check the listening ports
$ netstat -ano | grep "107:2379"
tcp   0   0 192.168.12.107:2379   0.0.0.0:*   LISTEN   off (0.00/0/0)
$ netstat -ano | grep "108:2379"
tcp   0   0 192.168.12.108:2379   0.0.0.0:*   LISTEN   off (0.00/0/0)
$ netstat -ano | grep "109:2379"
tcp   0   0 192.168.12.109:2379   0.0.0.0:*   LISTEN   off (0.00/0/0)

# 5. Back up each node's data in turn with the etcdctl client
$ etcdctl --endpoints=https://10.20.176.212:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /backup/etcd-snapshot.db
{"level":"info","ts":"2022-10-23T16:32:26.020+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/backup/etcd-snapshot.db.part"}
{"level":"info","ts":"2022-10-23T16:32:26.034+0800","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2022-10-23T16:32:26.034+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.20.176.212:2379"}
{"level":"info","ts":"2022-10-23T16:32:26.871+0800","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2022-10-23T16:32:26.946+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.20.176.212:2379","size":"112 MB","took":"now"}
{"level":"info","ts":"2022-10-23T16:32:26.946+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/backup/etcd-snapshot.db"}
Snapshot saved at /backup/etcd-snapshot.db

To query the Kubernetes data stored in etcd with etcdctl, remember that Kubernetes uses the etcd v3 API and that the etcd cluster uses TLS authentication by default, so first set a few environment variables.
# 1. Environment variables
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key

# 2. List all keys in the cluster
# --prefix         : match every key whose name starts with /registry
# --keys-only=true : print only the keys, not the values
etcdctl --endpoints=https://10.20.176.212:2379 get /registry --prefix --keys-only=true | head -n 1
/registry/apiextensions.k8s.io/customresourcedefinitions/bgpconfigurations.crd.projectcalico.org

# 3. Query the value of a key
# --keys-only=false : also print the value (false is the default)
# -w=json           : output in JSON format
etcdctl --endpoints=https://10.20.176.212:2379 get /registry/namespaces/default --prefix --keys-only=false -w=json | python3 -m json.tool
{
    "header": {
        "cluster_id": 11404821125176160774,
        "member_id": 7099450421952911102,
        "revision": 30240109,
        "raft_term": 3
    },
    "kvs": [
        {
            "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==",
            "create_revision": 192,
            "mod_revision": 192,
            "version": 1,
            "value": "azhzAAoPCg.......XNwYWNlEGgAiAA=="
        }
    ],
    "count": 1
}

# 4. Both keys and values are base64-encoded
echo -e "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==" | base64 -d
echo -e "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEogCCu0BCgdkZWZhdWx0EgAaACIAKiQ5ZDQyYmYxMy03OGM0LTQ4NzQtOThiYy05NjNlMDg1MDYyZjYyADgAQggIo8yllQYQAFomChtrdWJlcm5ldGVzLmlvL21ldGFkYXRhLm5hbWUSB2RlZmF1bHR6AIoBfQoOa3ViZS1hcGlzZXJ2ZXISBlVwZGF0ZRoCdjEiCAijzKWVBhAAMghGaWVsZHNWMTpJCkd7ImY6bWV0YWRhdGEiOnsiZjpsYWJlbHMiOnsiLiI6e30sImY6a3ViZXJuZXRlcy5pby9tZXRhZGF0YS5uYW1lIjp7fX19fUIAEgwKCmt1YmVybmV0ZXMaCAoGQWN0aXZlGgAiAA==" | base64 -d

# 5. The decoded value corresponds to the following resource:
$ kubectl get ns default -o yaml
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2022-06-15T04:54:59Z"
  labels:
    kubernetes.io/metadata.name: default
  name: default
  resourceVersion: "192"
  uid: 9d42bf13-78c4-4874-98bc-963e085062f6
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

# 6. Inspect Pod resources
etcdctl --endpoints=https://10.20.176.212:2379 get /registry/pods/default --prefix --keys-only=true
# /registry/pods/default/nfs-dev-nfs-subdir-external-provisioner-cf7684f8b-fzl9h
# /registry/pods/default/nfs-local-nfs-subdir-external-provisioner-6f97d44bb8-424tk

etcdctl --endpoints=https://10.20.176.212:2379 get /registry/pods/default --prefix --keys-only=true -w=json | python3 -m json.tool
{
    "header": {
        "cluster_id": 11404821125176160774,
        "member_id": 7099450421952911102,
        "revision": 30442415,
        "raft_term": 3
    },
    "kvs": [
        {
            # this key decodes to /registry/pods/default/nfs-dev-nfs-subdir-external-provisioner-cf7684f8b-fzl9h
            "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9uZnMtZGV2LW5mcy1zdWJkaXItZXh0ZXJuYWwtcHJvdmlzaW9uZXItY2Y3Njg0ZjhiLWZ6bDlo",
            "create_revision": 5510865,
            "mod_revision": 5510883,
            "version": 5
        },
        {
            "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9uZnMtbG9jYWwtbmZzLXN1YmRpci1leHRlcm5hbC1wcm92aXNpb25lci02Zjk3ZDQ0YmI4LTQyNHRr",
            "create_revision": 5510967,
            "mod_revision": 5510987,
            "version": 5
        }
    ],
    "count": 2
}
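Since the tip above recommends snapshotting the etcd member on every master, it can be convenient to wrap the snapshot command from step 5 in a small loop. The following is a minimal sketch rather than part of the original procedure: it reuses the 192.168.12.x endpoints and kubeadm certificate paths shown above, while the date-stamped file names and the /backup output directory are assumptions of mine.

#!/bin/bash
# Sketch: snapshot every etcd member of the cluster in turn.
set -euo pipefail

export ETCDCTL_API=3
CACERT=/etc/kubernetes/pki/etcd/ca.crt
CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key
BACKUP_DIR=/backup
STAMP=$(date +%Y%m%d-%H%M%S)

for ep in 192.168.12.107 192.168.12.108 192.168.12.109; do
  out="${BACKUP_DIR}/etcd-${ep##*.}-${STAMP}-snapshot.db"
  etcdctl --endpoints="https://${ep}:2379" \
    --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
    snapshot save "${out}"
  # Quick integrity check of the snapshot file; in etcd 3.5 this subcommand
  # still works, although upstream is moving it to `etcdutl snapshot status`.
  etcdctl --write-out=table snapshot status "${out}"
done

Run it on a master node that holds the client certificates; each run leaves one date-stamped snapshot per member under /backup.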
2. Using the etcdctl client from a Docker image

Description: On any machine with Docker installed, backing up the etcd database of the K8S cluster is just as convenient. Docker is already installed here; readers who are not yet familiar with Docker, or who still need to set up a Docker environment, can refer to the author's earlier Docker articles.
# 1. Docker environment used in this walkthrough
$ docker version
Client: Docker Engine - Community
 Version 20.10.3
.......
Server: Docker Engine - Community
 Engine:
  Version: 20.10.3

# 2. Directory for the etcd backup files
$ mkdir -vp /backup

# 3. Run a throwaway container that takes the snapshot and is removed afterwards.
$ docker run --rm \
  -v /backup:/backup \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --env ETCDCTL_API=3 \
  registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
  /bin/sh -c "etcdctl --endpoints=https://192.168.12.107:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  snapshot save /backup/etcd-snapshot.db"
# {"level":"info","ts":1666515535.63076,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/backup/etcd-snapshot.db.part"}
# {"level":"info","ts":1666515535.6411893,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
# {"level":"info","ts":1666515535.6419039,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://192.168.12.107:2379"}
# {"level":"info","ts":1666515535.9170482,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
# {"level":"info","ts":1666515535.931862,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://192.168.12.107:2379","size":"9.0 MB","took":"now"}
# {"level":"info","ts":1666515535.9322069,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/backup/etcd-snapshot.db"}
Snapshot saved at /backup/etcd-snapshot.db

# 4. Inspect the snapshot file
ls -alh /backup/etcd-snapshot.db
-rw------- 1 root root 8.6M Oct 23 16:58 /backup/etcd-snapshot.db

You can also use a Docker container to browse the data held in the cluster's etcd database.
$ docker run --rm \
  -v /backup:/backup \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  --env ETCDCTL_API=3 \
  --env ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
  --env ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt \
  --env ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key \
  registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
  /bin/sh -c "etcdctl --endpoints=https://192.168.12.107:2379 get /registry/namespaces/default -w=json"

Execution result: the JSON representation of the /registry/namespaces/default key (shown as a screenshot in the original post).
3. Quick manual backup from a Pod in the Kubernetes cluster
Prepare a Pod manifest and deploy it:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: etcd-backup
  labels:
    tool: backup
spec:
  containers:
  - name: etcdctl
    image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0
    imagePullPolicy: IfNotPresent
    command:
    - sh
    - -c
    - "etcd"
    env:
    - name: ETCDCTL_API
      value: "3"
    - name: ETCDCTL_CACERT
      value: "/etc/kubernetes/pki/etcd/ca.crt"
    - name: ETCDCTL_CERT
      value: "/etc/kubernetes/pki/etcd/healthcheck-client.crt"
    - name: ETCDCTL_KEY
      value: "/etc/kubernetes/pki/etcd/healthcheck-client.key"
    volumeMounts:
    - name: "pki"
      mountPath: "/etc/kubernetes"
    - name: "backup"
      mountPath: "/backup"
  volumes:
  - name: pki
    hostPath:
      path: "/etc/kubernetes"          # certificate directory
      type: "DirectoryOrCreate"
  - name: "backup"
    hostPath:
      path: "/storage/dev/backup"      # backup directory on the host
      type: "DirectoryOrCreate"
  restartPolicy: Never
  nodeSelector:
    node-role.kubernetes.io/master: "" # pin the Pod to a master node
EOF
pod/etcd-backup created

Open a shell in the Pod and run the backup commands:
~$ kubectl exec -it etcd-backup sh

# Take the snapshots
sh-5.1# export RAND=$RANDOM
sh-5.1# etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-${RAND}-snapshot.db
sh-5.1# etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-${RAND}-snapshot.db
Snapshot saved at /backup/etcd-108-32616-snapshot.db
sh-5.1# etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-${RAND}-snapshot.db

# etcd cluster members
sh-5.1# etcdctl member list --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
2db31a5d67ec1034, started, weiyigeek-108, https://192.168.12.108:2380, https://192.168.12.108:2379, false
42efe7cca897d765, started, weiyigeek-109, https://192.168.12.109:2380, https://192.168.12.109:2379, false
471323846709334f, started, weiyigeek-107, https://192.168.12.107:2380, https://192.168.12.107:2379, false

# etcd endpoint health
sh-5.1# etcdctl endpoint health --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
https://192.168.12.107:2379 is healthy: successfully committed proposal: took = 11.930331ms
https://192.168.12.109:2379 is healthy: successfully committed proposal: took = 11.930993ms
https://192.168.12.108:2379 is healthy: successfully committed proposal: took = 109.515933ms

# etcd endpoint status and database size
sh-5.1# etcdctl endpoint status --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
https://192.168.12.107:2379, 471323846709334f, 3.5.1, 9.2 MB, false, false, 4, 71464830, 71464830,
https://192.168.12.108:2379, 2db31a5d67ec1034, 3.5.1, 9.2 MB, false, false, 4, 71464830, 71464830,
https://192.168.12.109:2379, 42efe7cca897d765, 3.5.1, 9.2 MB, true, false, 4, 71464830, 71464830,   # this member is the leader

At this point, the manual snapshot backup of the etcd cluster data is complete.
4. Scheduled backups with a CronJob controller in the Kubernetes cluster
First prepare a CronJob manifest:
cat > etcd-database-backup.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-database-backup
  annotations:
    descript: "scheduled etcd database backup"
spec:
  schedule: "*/5 * * * *"   # run every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcdctl
            image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_CACERT
              value: "/etc/kubernetes/pki/etcd/ca.crt"
            - name: ETCDCTL_CERT
              value: "/etc/kubernetes/pki/etcd/healthcheck-client.crt"
            - name: ETCDCTL_KEY
              value: "/etc/kubernetes/pki/etcd/healthcheck-client.key"
            command:
            - /bin/sh
            - -c
            - |
              export RAND=$RANDOM
              etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-${RAND}-snapshot.db
              etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-${RAND}-snapshot.db
              etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-${RAND}-snapshot.db
            volumeMounts:
            - name: "pki"
              mountPath: "/etc/kubernetes"
            - name: "backup"
              mountPath: "/backup"
            imagePullPolicy: IfNotPresent
          volumes:
          - name: "pki"
            hostPath:
              path: "/etc/kubernetes"
              type: "DirectoryOrCreate"
          - name: "backup"
            hostPath:
              path: "/storage/dev/backup"   # backup directory on the host
              type: "DirectoryOrCreate"
          nodeSelector:   # pin the Pod to a master node; otherwise the certificates would have to live on NFS shared storage reachable from every node
            node-role.kubernetes.io/master: ""
          restartPolicy: Never
EOF

Create the CronJob resource:
kubectl apply -f etcd-database-backup.yaml
# cronjob.batch/etcd-database-backup created

Check the created CronJob resource and the etcd backups it produces:
NAME                                   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etcd-database-backup     */5 * * * *   False     0        21s             14m

NAME                                      READY   STATUS      RESTARTS   AGE
pod/etcd-database-backup-27776740-rhzkk   0/1     Completed   0          21s

Check the scheduled Pod's logs and the backup files:
kubectl logs -f pod/etcd-database-backup-27776740-rhzkk
Snapshot saved at /backup/etcd-107-25615-snapshot.db
Snapshot saved at /backup/etcd-108-25615-snapshot.db
Snapshot saved at /backup/etcd-109-25615-snapshot.db

$ ls -lt   # newest backup files first
total 25M
-rw------- 1 root root 8.6M Oct 24 21:12 etcd-107-25615-snapshot.db
-rw------- 1 root root 7.1M Oct 24 21:12 etcd-108-25615-snapshot.db
-rw------- 1 root root 8.8M Oct 24 21:12 etcd-109-25615-snapshot.db

At this point, scheduled snapshot backups of the cluster's etcd data are in place.
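One practical follow-up that the original walkthrough does not cover: a snapshot every five minutes fills /storage/dev/backup fairly quickly, so some retention policy is worth adding. A minimal sketch, assuming the hostPath layout above and a seven-day retention window (both of which are my own assumptions), is to append a cleanup step to the CronJob's shell command:

# Hypothetical cleanup appended after the three `etcdctl snapshot save` lines
# in the CronJob's `command:` block; keeps only snapshots younger than 7 days.
find /backup -name 'etcd-*-snapshot.db' -type f -mtime +7 -delete
# Optionally list what remains, so the retention state shows up in the Job logs.
ls -lh /backup

This assumes the shell inside the etcd image provides find; if it does not, run the same cleanup from a cron job on the host that owns /storage/dev/backup.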
0x03 Restore Practice
1. Single-master restore
Description: When the resource data of a single-master cluster is lost, it can be restored quickly as follows.
Procedure:
Tip: For a single-node K8S cluster, restore with the following commands.
mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests-backup/
mv /var/lib/etcd /var/lib/etcd.bak

ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db \
  --data-dir=/var/lib/etcd/ \
  --endpoints=https://10.20.176.212:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key
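The snippet above ends at the restore itself; to bring the control plane back you still have to put the static-Pod manifests back in place so that kubelet recreates etcd and kube-apiserver. A minimal sketch, assuming the kubeadm layout used throughout this article (the paths are exactly the ones moved aside above):

# Put the static-Pod manifests back; kubelet will recreate etcd and kube-apiserver from them.
mv /etc/kubernetes/manifests-backup /etc/kubernetes/manifests
# Restarting kubelet is normally not required (it watches the manifests directory),
# but it is harmless and speeds up the resync.
systemctl restart kubelet
# Once the control-plane Pods are back, sanity-check the restored data.
kubectl -n kube-system get pod -l component=etcd
kubectl get nodes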
2. Multi-master restore

Tip: The etcd cluster here runs inside the Kubernetes cluster (as static Pods), not as an externally deployed, standalone etcd.
2. As mentioned earlier, check the current etcd cluster members and their health before starting the restore.
# etcd cluster member list
ETCDCTL_API=3 etcdctl member list --endpoints=https://10.20.176.212:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
6286508b550016fe, started, devtest-master-212, https://10.20.176.212:2380, https://10.20.176.212:2379, false
9dd15f852caf8e05, started, devtest-master-214, https://10.20.176.214:2380, https://10.20.176.214:2379, false
e0f23bd90b7c7c0d, started, devtest-master-213, https://10.20.176.213:2380, https://10.20.176.213:2379, false

# etcd endpoint status: identify the leader and the followers
ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://10.20.176.212:2379 --endpoints=https://10.20.176.213:2379 --endpoints=https://10.20.176.214:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out table

# etcd endpoint health: spot any unhealthy member
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.20.176.212:2379 --endpoints=https://10.20.176.213:2379 --endpoints=https://10.20.176.214:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
https://10.20.176.212:2379 is healthy: successfully committed proposal: took = 14.686875ms
https://10.20.176.214:2379 is healthy: successfully committed proposal: took = 16.201187ms
https://10.20.176.213:2379 is healthy: successfully committed proposal: took = 18.962462ms

3. Stop kube-apiserver and etcd on every master machine, then restore each node's etcd data from the backup.
mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests-backup/

# On this node, move /var/lib/etcd out of the way
mv /var/lib/etcd /var/lib/etcd.bak

# Restore from the snapshot; note that the same snapshot can be used to restore every member of the etcd cluster.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db \
  --data-dir=/var/lib/etcd \
  --name=devtest-master-212 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=devtest-master-212=https://10.20.176.212:2380,devtest-master-213=https://10.20.176.213:2380,devtest-master-214=https://10.20.176.214:2380 \
  --initial-advertise-peer-urls=https://10.20.176.212:2380

Tip: When a node joins the control plane, static-Pod manifests are generated for the API Server, Controller Manager and Scheduler. The kubelet on the host watches the /etc/kubernetes/manifests directory for manifest creation, change and deletion, and creates, updates or deletes the corresponding Pods accordingly. The manifests generated in these phases are what start the master-component Pods.
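The restore command above only rebuilds the member on devtest-master-212. For the other two control-plane nodes, copy the same snapshot file over and run the same command with that node's own --name and --initial-advertise-peer-urls. The sketch below is my extrapolation of the command above, using the node names and IPs from the member list earlier, and is not taken verbatim from the original article:

# On devtest-master-213 (assumes the snapshot was copied to the same path):
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db \
  --data-dir=/var/lib/etcd \
  --name=devtest-master-213 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=devtest-master-212=https://10.20.176.212:2380,devtest-master-213=https://10.20.176.213:2380,devtest-master-214=https://10.20.176.214:2380 \
  --initial-advertise-peer-urls=https://10.20.176.213:2380

# On devtest-master-214, run the same command with:
#   --name=devtest-master-214 --initial-advertise-peer-urls=https://10.20.176.214:2380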
4. Then start etcd and the apiserver again and check whether the Pods have recovered.
# Check the etcd Pods
$ kubectl get pod -n kube-system -l component=etcd

# Check component health
$ kubectl get componentstatuses
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true","reason":""}
3. Resolving etcd data inconsistency between nodes in a K8S cluster
Description: During a routine inspection of the business clusters, we found that the etcd data was inconsistent: `kubectl get pod -n xxx` returned different resources depending on which member served the request. Querying the etcd cluster state and data directly with etcdctl showed all three nodes healthy with identical RaftIndex values, and the etcd logs contained no errors; the only suspicious sign was that the db size differed noticeably between the three nodes, which pointed to etcd data inconsistency.
$ ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.109:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.12.108:2379 | 2db31a5d67ec1034 | 3.5.1   | 7.4 MB  | false     |         4 |   72291872 |
| https://192.168.12.107:2379 | 471323846709334f | 3.5.1   | 9.0 MB  | false     |         4 |   72291872 |
| https://192.168.12.109:2379 | 42efe7cca897d765 | 3.5.1   | 9.2 MB  | true      |         4 |   72291872 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+

Since the kube-apiserver logs likewise yielded nothing useful, our first guess was that a cache-update problem in kube-apiserver was to blame. Just as we were about to dig in from that angle, a colleague reported something even stranger: a Pod they had just created had vanished from the kubectl Pod list! After repeated tests we found that listing Pods with kubectl would sometimes show the Pod and sometimes not. That raised a question: the Kubernetes API list operation is not cached — kube-apiserver reads the data straight from etcd and returns it to the client — so could etcd itself be at fault?
As is well known, etcd is a strongly consistent key-value store: once a write has succeeded, two subsequent reads should never return different data. Refusing to believe it, we queried the etcd cluster state and data directly with etcdctl. All three nodes reported a healthy state with identical RaftIndex values, and the etcd logs showed no errors; the only suspicious sign was the large difference in db size between the nodes. We then pointed the client at each node's endpoint in turn and counted the keys on each node (a sketch of this check follows below). The counts did not match — the difference between two nodes could reach several thousand keys — and the freshly created Pod could be found through some endpoints but not through others.
At this point it was essentially certain that the members of the etcd cluster really did hold inconsistent data.
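A minimal sketch of the per-endpoint key-count check described above — my own reconstruction rather than the exact commands used at the time. It assumes the 192.168.12.x endpoints and peer certificates used throughout this article; note that --keys-only prints a blank line after every key, hence the grep.

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key

# Count the /registry keys returned by each individual member.
for ep in 192.168.12.107 192.168.12.108 192.168.12.109; do
  count=$(etcdctl --endpoints="https://${ep}:2379" get /registry --prefix --keys-only | grep -v '^$' | wc -l)
  echo "${ep}: ${count} keys"
done

# A single object (e.g. a freshly created Pod) can be checked the same way,
# replacing <namespace>/<pod-name> with the object in question (placeholder, not a real name):
# etcdctl --endpoints=https://192.168.12.107:2379 get /registry/pods/<namespace>/<pod-name> --keys-only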
Approach to the fix:

- Back up the data and the data directory of a healthy etcd node
- Stop the etcd member that holds the abnormal data
- On a healthy etcd node, remove the abnormal member from the cluster
- Clear the abnormal node's data under the member/ and wal/ directories
- Re-join the abnormal node to the cluster
- Start its etcd service again
Hands-on fix: here the etcd data backed up from weiyigeek-109 is used to bring the weiyigeek-107 and weiyigeek-108 nodes back in line.
# 1. Export the data of the etcd node with the largest DB SIZE (always back up every etcd node before operating).
etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-snapshot.db
etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-snapshot.db
etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-snapshot.db   # the most complete etcd data (this node is the leader)

# Extra: if you need to move leadership, hand the leader role to weiyigeek-109 with:
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=192.168.12.107:2379 move-leader 42efe7cca897d765

# 2. Back up the etcd directory on the abnormal node
mv /var/lib/etcd /var/lib/etcd.bak
# or archive it
tar -czvf etcd-20230308bak.taz.gz /var/lib/etcd

# 3. Remove the abnormal member from the etcd cluster; note that --endpoints points at the healthy leader.
$ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=192.168.12.109:2379 member list
# the abnormal weiyigeek-107 node
$ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=192.168.12.109:2379 member remove 2db31a5d67ec1034

# 4. Modify the etcd static-Pod manifest on the weiyigeek-107 node, then re-join that node to the etcd cluster.
$ mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests-backup/etcd.yaml
$ vim /etc/kubernetes/manifests-backup/etcd.yaml
- --initial-cluster=weiyigeek-109=https://192.168.12.109:2380,weiyigeek-108=https://192.168.12.108:2380,weiyigeek-107=https://192.168.12.107:2380
- --initial-cluster-state=existing

$ kubectl exec -it -n kube-system etcd-weiyigeek-107 sh
$ export ETCDCTL_API=3
$ ETCD_ENDPOINTS=192.168.12.109:2379
$ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=$ETCD_ENDPOINTS member add weiyigeek-107 --peer-urls=https://192.168.12.107:2380
Member 8cecf1f50b91e502 added to cluster 71d4ff56f6fd6e70

# 5. After the member has been added, start the etcd service on weiyigeek-107 again
mv /etc/kubernetes/manifests-backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

Verify the result:
After the fix, the node starts normally and the data is consistent again: every etcd node reports the same number of keys (1976 in this case), and the database sizes converge as well.
At this point, the etcd data inconsistency in the K8S cluster has been resolved and the data recovered. If this article helped you, your support is much appreciated!
Original post: https://blog.weiyigeek.top/2022/10-15-690.html