etcd health check fails when removing a node with rke: remote error: tls: bad certificate

rancher 2.5.8
rke v1.2.8
OS: CentOS 7.8 x86_64

The test machines are virtual machines created with virt-manager on an isolated network.

Steps:

1. Bootstrap the cluster from a standalone deployment node (192.168.60.100): rke up --config tc-cluster.yml
addon_job_timeout: 300
authentication:
  sans:
  - 10.10.10.10
  strategy: x509
cluster_name: tc_cluster
dns:
  provider: coredns
network:
  plugin: calico
nodes:
- address: 192.168.60.101
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke  # an rke user was created on every node, with passwordless SSH configured
- address: 192.168.60.102
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke
- address: 192.168.60.103
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke
- address: 192.168.60.104
  role:
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke
private_registries:
- is_default: true
  password: '123456'
  url: 192.168.60.101:5000
  user: docker
services:
  kubeproxy:
    extra_args:
      proxy-mode: ipvs

Result: the cluster came up normally.
[root@cew1 thinkcarrier_packet]# kubectl get nodes
NAME             STATUS   ROLES                      AGE     VERSION
192.168.60.101   Ready    controlplane,etcd,worker   4m44s   v1.20.6
192.168.60.102   Ready    controlplane,etcd,worker   3m46s   v1.20.6
192.168.60.103   Ready    controlplane,etcd,worker   3m48s   v1.20.6
192.168.60.104   Ready    worker                     3m47s   v1.20.6

2. Tear down the deployment node: delete the 192.168.60.100 virtual machine.

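3. On one of the cluster nodes (192.168.60.101), remove a node from the cluster
3.1 Edit the tc-cluster.yml backed up from step 1 and delete the entry for the worker node (192.168.60.104)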
addon_job_timeout: 300
authentication:
  sans:
  - 10.10.10.10
  strategy: x509
cluster_name: tc_cluster
dns:
  provider: coredns
network:
  plugin: calico
nodes:
- address: 192.168.60.101
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke  # the rke user was deleted and recreated on every node, and passwordless SSH was reconfigured
- address: 192.168.60.102
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke
- address: 192.168.60.103
  role:
  - controlplane
  - etcd
  - worker
  ssh_key_path: /root/.ssh/id_rsa
  user: rke
private_registries:
- is_default: true
  password: '123456'
  url: 192.168.60.101:5000
  user: docker
services:
  kubeproxy:
    extra_args:
      proxy-mode: ipvs

3.2 Run rke up --config tc-cluster.yml again to remove the node

The rke output reports errors:

time="2021-11-18T18:02:23+08:00" level=info msg="Removing container [rke-log-linker] on host [192.168.60.103], try #1"
time="2021-11-18T18:02:23+08:00" level=info msg="[remove/rke-log-linker] Successfully removed container on host [192.168.60.103]"
time="2021-11-18T18:02:23+08:00" level=info msg="Image [192.168.60.101:5000/rancher/rke-tools:v0.1.74] exists on host [192.168.60.103]"
time="2021-11-18T18:02:23+08:00" level=info msg="Starting container [rke-log-linker] on host [192.168.60.103], try #1"
time="2021-11-18T18:02:24+08:00" level=info msg="[etcd] Successfully started [rke-log-linker] container on host [192.168.60.103]"
time="2021-11-18T18:02:24+08:00" level=info msg="Removing container [rke-log-linker] on host [192.168.60.103], try #1"
time="2021-11-18T18:02:24+08:00" level=info msg="[remove/rke-log-linker] Successfully removed container on host [192.168.60.103]"
time="2021-11-18T18:02:24+08:00" level=info msg="[etcd] Successfully started etcd plane.. Checking etcd cluster health"
time="2021-11-18T18:03:56+08:00" level=warning msg="[etcd] host [192.168.60.101] failed to check etcd health: failed to get /health for host [192.168.60.101]: Get \"https://192.168.60.101:2379/health\": remote error: tls: bad certificate"
time="2021-11-18T18:05:29+08:00" level=warning msg="[etcd] host [192.168.60.102] failed to check etcd health: failed to get /health for host [192.168.60.102]: Get \"https://192.168.60.102:2379/health\": remote error: tls: bad certificate"
time="2021-11-18T18:07:01+08:00" level=warning msg="[etcd] host [192.168.60.103] failed to check etcd health: failed to get /health for host [192.168.60.103]: Get \"https://192.168.60.103:2379/health\": remote error: tls: bad certificate"
time="2021-11-18T18:07:01+08:00" level=fatal msg="[etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [192.168.60.101,192.168.60.102,192.168.60.103] failed to report healthy. Check etcd container logs on each host for more information"


The etcd container log shows the following errors:

2021-11-18 10:01:33.515695 I | embed: rejected connection from "192.168.60.102:35257" (error "EOF", ServerName "")
2021-11-18 10:01:33.515998 I | embed: rejected connection from "192.168.60.102:34749" (error "EOF", ServerName "")
2021-11-18 10:01:33.595460 I | embed: rejected connection from "192.168.60.103:34937" (error "EOF", ServerName "")
2021-11-18 10:01:33.595652 I | embed: rejected connection from "192.168.60.103:45224" (error "EOF", ServerName "")
2021-11-18 10:01:33.726215 I | embed: rejected connection from "192.168.60.101:44883" (error "EOF", ServerName "")
2021-11-18 10:01:33.726418 I | embed: rejected connection from "192.168.60.101:41540" (error "EOF", ServerName "")
2021-11-18 10:01:34.260072 I | embed: rejected connection from "192.168.60.103:34275" (error "EOF", ServerName "")
2021-11-18 10:01:34.307586 I | embed: rejected connection from "192.168.60.102:39963" (error "EOF", ServerName "")
2021-11-18 10:01:34.344658 I | embed: rejected connection from "192.168.60.101:45406" (error "EOF", ServerName "")
2021-11-18 10:01:36.316960 W | wal: sync duration of 1.55350616s, expected less than 1s
2021-11-18 10:01:36.317324 W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 556.12122ms, to f4b651f5f7da47c8)
2021-11-18 10:01:36.317363 W | etcdserver: server is likely overloaded
2021-11-18 10:01:36.317390 W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 556.207906ms, to 4cf6dd81bb87116e)
2021-11-18 10:01:36.317400 W | etcdserver: server is likely overloaded
2021-11-18 10:01:36.319932 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:989" took too long (1.509190229s) to execute
2021-11-18 10:01:36.321530 W | etcdserver: read-only range request "key:\"/registry/rancher.cattle.io/projects/\" range_end:\"/registry/rancher.cattle.io/projects0\" count_only:true " with result "range_response_count:0 size:5" took too long (1.401678436s) to execute
2021-11-18 10:01:36.322274 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/cert-manager-cainjector-leader-election-core\" " with result "range_response_count:1 size:656" took too long (124.244385ms) to execute
2021-11-18 10:01:36.323765 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:508" took too long (1.16615405s) to execute
2021-11-18 10:01:36.324138 W | etcdserver: read-only range request "key:\"/registry/crd.projectcalico.org/felixconfigurations/\" range_end:\"/registry/crd.projectcalico.org/felixconfigurations0\" count_only:true " with result "range_response_count:0 size:7" took too long (1.237783791s) to execute
2021-11-18 10:01:36.324468 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/cisbenchmarkversions/\" range_end:\"/registry/management.cattle.io/cisbenchmarkversions0\" count_only:true " with result "range_response_count:0 size:7" took too long (771.144823ms) to execute
2021-11-18 10:01:36.326273 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/cattle-controllers\" " with result "range_response_count:1 size:548" took too long (881.386094ms) to execute
2021-11-18 10:01:36.326728 W | etcdserver: read-only range request "key:\"/registry/configmaps/fleet-system/gitjob\" " with result "range_response_count:1 size:524" took too long (251.452056ms) to execute
2021-11-18 10:01:36.329576 W | etcdserver: read-only range request "key:\"/registry/crd.projectcalico.org/clusterinformations/default\" " with result "range_response_count:1 size:920" took too long (738.988635ms) to execute
2021-11-18 10:01:36.460554 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:5" took too long (114.481907ms) to execute
2021-11-18 10:01:36.464787 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/rkek8ssystemimages/\" range_end:\"/registry/management.cattle.io/rkek8ssystemimages0\" count_only:true " with result "range_response_count:0 size:8" took too long (118.43231ms) to execute
2021-11-18 10:02:24.698321 I | embed: rejected connection from "192.168.60.101:39716" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:29.773611 I | embed: rejected connection from "192.168.60.101:39728" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:34.947625 I | embed: rejected connection from "192.168.60.101:39762" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:40.057448 I | embed: rejected connection from "192.168.60.101:39774" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:45.194666 I | embed: rejected connection from "192.168.60.101:39804" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:50.300855 I | embed: rejected connection from "192.168.60.101:39820" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:02:55.401473 I | embed: rejected connection from "192.168.60.101:39848" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:00.524724 I | embed: rejected connection from "192.168.60.101:39864" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:05.640732 I | embed: rejected connection from "192.168.60.101:39890" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:10.765845 I | embed: rejected connection from "192.168.60.101:39908" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:15.914125 I | embed: rejected connection from "192.168.60.101:39936" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:21.060060 I | embed: rejected connection from "192.168.60.101:39952" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:26.231102 I | embed: rejected connection from "192.168.60.101:39976" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:31.345613 I | embed: rejected connection from "192.168.60.101:40000" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:36.472953 I | embed: rejected connection from "192.168.60.101:40020" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:41.608767 I | embed: rejected connection from "192.168.60.101:40040" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:46.758482 I | embed: rejected connection from "192.168.60.101:40066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-11-18 10:03:51.900045 I | embed: rejected connection from "192.168.60.101:40090" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

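4. If the deployment node is not torn down and step 3 is run on the deployment node instead, the node is removed normally and etcd stays healthy.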
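Step 4 suggests that the second rke up depends on files that exist only on the deployment node. As a rough sketch (not an official RKE procedure), the function below copies the config and the files RKE is assumed to write next to it before the deployment node is deleted. The function name backup_rke_files is hypothetical, and the file names tc-cluster.rkestate and kube_config_tc-cluster.yml are an assumption based on RKE naming these files after the config file.

```shell
#!/bin/sh
# Hypothetical helper: copy an RKE config and its companion files to a backup dir.
# Assumption: for a config named tc-cluster.yml, RKE keeps
#   tc-cluster.rkestate and kube_config_tc-cluster.yml in the same directory.
backup_rke_files() {
    config=$1
    dest=$2
    dir=$(dirname -- "$config")
    base=$(basename -- "$config")
    mkdir -p -- "$dest" || return 1
    status=0
    for f in "$dir/$base" "$dir/${base%.*}.rkestate" "$dir/kube_config_$base"; do
        if [ -f "$f" ]; then
            # -p keeps mode and timestamps of the original file
            cp -p -- "$f" "$dest/" || status=1
        else
            echo "warning: $f not found" >&2
            status=1
        fi
    done
    # Non-zero when any expected file was missing or could not be copied
    return "$status"
}
```

Calling backup_rke_files tc-cluster.yml ./rke-backup on 192.168.60.100 before step 2, and then copying that directory to the node where rke will run later, would keep the three files together.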

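Your problem is that when you ran the rke command again, it did not find cluster.rkestate, so rke assumed you were creating a new cluster rather than updating an existing one. It therefore re-ran the installation on the nodes listed in your cluster.yaml, and because those hosts already hold etcd data and nodes, it fails.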
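A small guard along these lines could catch the situation before rke up is run. This is only a sketch: the function names state_file_for and check_rke_state are made up for illustration, and it assumes rke looks for the state file in the same directory as the config, with the same base name and a .rkestate extension (so tc-cluster.rkestate for tc-cluster.yml).

```shell
#!/bin/sh
# Assumption: the state file sits beside the config and shares its base name,
# e.g. ./tc-cluster.yml -> ./tc-cluster.rkestate
state_file_for() {
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    printf '%s/%s.rkestate\n' "$dir" "${base%.*}"
}

# Hypothetical guard: succeed only when the state file is present.
check_rke_state() {
    state=$(state_file_for "$1")
    if [ -f "$state" ]; then
        echo "found $state"
        return 0
    fi
    echo "missing $state: rke up would treat this as a new cluster" >&2
    return 1
}
```

It could be chained in front of the real command, for example: check_rke_state tc-cluster.yml && rke up --config tc-cluster.yml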